Architecting the MS SQL CDC Processor


Architecting the MS SQL CDC Processor

Peter Wicks (pwicks)
I've been working on a new processor that does Change Data Capture (CDC) with Microsoft SQL Server. I followed Microsoft's documentation on how CDC works, and I have code that gets me the changes and is testing well. Right now I don't actually have a processor, but a number of scripts that generate SQL, which I feed into ExecuteSQL and QueryDatabaseTable processors; QueryDatabaseTable uses my as-yet-incomplete NIFI-1706 <https://github.com/apache/nifi/pull/2162>.

One of the reasons I don't have a processor yet is that I don't want to use the same output format as the MySQL CDC Processor, but I didn't want to put in the time if it was not going to get merged. The MySQL CDC processor uses JSON messages as its output format, but in MS SQL the CDC changes are rows in a table, and it's much more convenient to output them as records. Currently I'm using Avro.

Questions:

  *   My output format doesn't have to be Avro, but given the source is rows in a table being returned by a ResultSet, using the JdbcCommon class makes a lot of sense to me. Can I move JdbcCommon to somewhere useful like nifi-avro-record-utils?
  *   I'll be looping through a list of tables and plan on committing the files immediately to the success relationship as that table's CDC records are pulled. I want to make sure that the max value tracking gets updated immediately too. Does calling setState on the State Manager cause an immediate state save? Is this safe to call repeatedly, assuming single threaded, during the execution of the processor?
  *   Concerns with using a different output format than the MySQL CDC Processor?
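The per-table loop in the second question might be sketched as below. This is only an illustration of the intended control flow, not the actual NiFi API: `SimpleStateStore` stands in for NiFi's StateManager, and the table and LSN values are made up.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: SimpleStateStore stands in for NiFi's StateManager.
public class CdcLoopSketch {

    // Stand-in for the processor's state backend; setState replaces the whole map.
    static class SimpleStateStore {
        private Map<String, String> state = new HashMap<>();
        void setState(Map<String, String> newState) { state = new HashMap<>(newState); }
        Map<String, String> getState() { return new HashMap<>(state); }
    }

    // Pull pending CDC changes for a table and return the max LSN seen (fake data here).
    static String processTable(String table, String fromLsn) {
        // ... generate and run the CDC query, write records, transfer to 'success' ...
        return fromLsn == null ? "0x01" : fromLsn + "+";
    }

    static Map<String, String> runOnce(SimpleStateStore store, List<String> tables) {
        Map<String, String> state = store.getState();
        for (String table : tables) {
            String maxLsn = processTable(table, state.get(table));
            // Commit the table's FlowFiles, then persist the new max value immediately,
            // so a failure on a later table cannot re-emit this table's rows.
            state = new LinkedHashMap<>(state);
            state.put(table, maxLsn);
            store.setState(state);
        }
        return store.getState();
    }

    public static void main(String[] args) {
        SimpleStateStore store = new SimpleStateStore();
        System.out.println(runOnce(store, List.of("dbo.orders", "dbo.items")));
    }
}
```

The key point of the sketch is that state is saved once per table, right after that table's FlowFiles are committed, rather than once at the end of the whole run.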

Thanks,
  Peter

Re: Architecting the MS SQL CDC Processor

Matt Burgess-2
Peter,

This is great to hear; I'm sure the community is looking forward to
such a solution! I worked on the first offering of the
CaptureChangeMySQL processor, so here are some notes, comments, and
(hopefully!) answers to your questions:

* If you support a RecordSetWriter controller service as your output,
then you won't need JdbcCommon per se; instead you would create
Records and pass those to the user-selected RecordSetWriter. In that
sense you can support Avro, CSV, JSON, or anything else for which
there is a RecordSetWriter implementation.
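That pluggable-writer idea could be sketched as follows. This is a simplified stand-in, not NiFi's actual RecordSetWriter interface: the processor builds generic records, and the output format is entirely the configured writer's concern.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal sketch of a pluggable writer: the processor builds generic records
// and hands them to whichever writer the user configured.
public class RecordWriterSketch {

    interface RecordWriter {
        String write(List<Map<String, Object>> records);
    }

    // One possible implementation; Avro, CSV, etc. would be siblings.
    static class JsonishWriter implements RecordWriter {
        public String write(List<Map<String, Object>> records) {
            return records.stream()
                    .map(r -> r.entrySet().stream()
                            .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                            .collect(Collectors.joining(",", "{", "}")))
                    .collect(Collectors.joining(",", "[", "]"));
        }
    }

    // The processor only builds records; it never knows the output format.
    static String emit(RecordWriter writer) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("op", "INSERT");
        return writer.write(List.of(row));
    }

    public static void main(String[] args) {
        System.out.println(emit(new JsonishWriter()));
        // → [{"id":"1","op":"INSERT"}]
    }
}
```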

* Depending on how often you'll be updating state, you may want to
implement something similar to the State Update Interval property in
CaptureChangeMySQL, which came about due to similar concerns about the
overhead of state updates vs the amount of processing beforehand.
This allows the user to tune the tradeoff based on their own
requirements and performance characteristics.
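The core of such an interval check is small; a minimal sketch (names are illustrative, and CaptureChangeMySQL's actual property handling is more involved):

```java
// Sketch of a time-based state update gate: state is persisted only when the
// configured interval has elapsed since the last save.
public class StateIntervalGate {

    private final long intervalMillis;
    private long lastSaveMillis;

    StateIntervalGate(long intervalMillis, long nowMillis) {
        this.intervalMillis = intervalMillis;
        this.lastSaveMillis = nowMillis;
    }

    // Returns true (and resets the clock) when it is time to persist state again.
    boolean shouldSave(long nowMillis) {
        if (nowMillis - lastSaveMillis >= intervalMillis) {
            lastSaveMillis = nowMillis;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        StateIntervalGate gate = new StateIntervalGate(1000, 0);
        System.out.println(gate.shouldSave(500));   // false: too soon
        System.out.println(gate.shouldSave(1200));  // true: interval elapsed
        System.out.println(gate.shouldSave(1300));  // false: clock was reset at 1200
    }
}
```

Setting the interval to zero would recover "save after every batch"; a large interval trades durability of progress for less state-store overhead.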

* I have no concerns with having a different output format from
CaptureChangeMySQL; in fact the only reason it doesn't have a
RecordSetWriter output interface is that those capabilities were being
developed in parallel, so rather than have to wait for the
record-aware API work, I chose to output JSON. I have written
NIFI-4491 [1] to improve/augment the CDC processor(s) with RecordSetWriter
support; this would be very helpful in supporting various output
formats as well as generating the accompanying schema. If your
processor were the first to support this, it could be the exemplar for
past and future CDC processors :)

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-4491

On Mon, Oct 16, 2017 at 10:36 PM, Peter Wicks (pwicks)
<[hidden email]> wrote:


RE: [EXT] Re: Architecting the MS SQL CDC Processor

Peter Wicks (pwicks)
I've submitted the first pass at this processor: https://github.com/apache/nifi/pull/2230

I used a RecordSetWriter for output and ResultSetRecordSet for reading, which has been working well.
I found I didn't need to worry about the rate at which state gets updated. In the MySQL case the processor checks whether it should update state after every single row; in mine, state is updated after all changes for a table since the last run have been processed, making it more like QueryDatabaseTable. This is possible because I'm not reading changes in anything remotely resembling the way MySQL works :)

I created unit tests for this processor; they run against Apache Derby tables that match the MS SQL tables in schema, but not necessarily in type.

The only quirky thing was that, to get my generated SQL to work on both Apache Derby and MS SQL, I had to quote identifiers much more than usual. So please, no comments along the lines of, "Why are there so many quoted identifiers in your SQL statements?" :)
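The cross-database quoting could look something like the sketch below (identifier names are made up for illustration). Both Derby and SQL Server accept ANSI double-quoted identifiers, so quoting everything lets one generated statement run on both, and it also protects names like CDC's `__$start_lsn` that contain unusual characters.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of generating a SELECT that runs on both Derby and SQL Server by
// double-quoting every identifier. Names here are made up for illustration.
public class QuotedSqlSketch {

    // Quote one identifier, escaping embedded double quotes per the SQL standard.
    static String quote(String identifier) {
        return "\"" + identifier.replace("\"", "\"\"") + "\"";
    }

    static String buildSelect(String schema, String table, List<String> columns) {
        String cols = columns.stream().map(QuotedSqlSketch::quote)
                .collect(Collectors.joining(", "));
        return "SELECT " + cols + " FROM " + quote(schema) + "." + quote(table);
    }

    public static void main(String[] args) {
        System.out.println(buildSelect("cdc", "dbo_orders_CT", List.of("__$start_lsn", "id")));
        // → SELECT "__$start_lsn", "id" FROM "cdc"."dbo_orders_CT"
    }
}
```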

Thanks,
  Peter

-----Original Message-----
From: Matt Burgess [mailto:[hidden email]]
Sent: Tuesday, October 17, 2017 10:59 AM
To: [hidden email]
Subject: [EXT] Re: Architecting the MS SQL CDC Processor
