Processor logic

Uwe Geercken
Hello,

I am having a bit of a hard time designing processors correctly. I find it difficult to decide whether a processor should, for example, process a single line from a flow file or also handle flow files with multiple lines of data (e.g. CSV files). Another point is the handling of header rows. One more point is data provenance events: which event is correct to use when modifying attributes, content, or both?

Is there a guide that outlines best practices for such cases? I have the feeling that many of the processors handle these issues quite differently. There should either be a standard of sorts, or the differences should be well documented. And although there is very good documentation available for the project, with some of the processors one has to play around quite a bit because they behave differently or follow a different philosophy, and one has to understand that first to get it right.

I would appreciate some feedback, advice, or pointers to documentation.

Uwe

Re: Processor logic

Andy LoPresto
Hi Uwe,

I believe a lot of this is covered in the Developer Guide [1]. Specifically, there are discussions of various processor patterns, including Split Content [2], and a section on Cohesion and Reusability [3], which states:

In order to avoid these issues, and make Processors more reusable, a Processor should always stick to the principle of "do one thing and do it well." Such a Processor should be broken into two separate Processors: one to convert the data from Format X to Format Y, and another Processor to send data to the remote resource.

I call this the “Unix model”: it is better to join several small, specific tools together to accomplish a task than to re-invent larger tools every time a small modification is required. In general, that leads me to develop processors that operate on the smallest unit of data necessary (a single line, element, or record), unless more context is needed for completeness, or the performance is so grossly different that it is inefficient to operate on such small quantities.
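
As a concrete (and entirely hypothetical) sketch of that style, the conversion half of the example above might look like the following. The class and method names are made up for illustration; the point is that the processor converts line by line and does nothing else, leaving delivery to a separate Put-style processor:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.io.StreamCallback;

// Hypothetical single-purpose processor: converts each line from format X
// to format Y and nothing more. Sending the result to a remote resource
// is deliberately left to a separate processor.
public class ConvertXToY extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Converted FlowFiles")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // Stream the content line by line so a one-line FlowFile and a
        // multi-gigabyte FlowFile are handled identically, without ever
        // holding the whole file in memory.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(final InputStream in, final OutputStream out) throws IOException {
                final BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
                final Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(convertLine(line));
                    writer.write('\n');
                }
                writer.flush();
            }
        });

        session.transfer(flowFile, REL_SUCCESS);
    }

    // Placeholder for the real X-to-Y conversion.
    private String convertLine(final String line) {
        return line;
    }
}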

Finally, with regard to your question about which provenance events to use in various scenarios, I agree the documentation is lacking. Luckily, Drew Lim has done some great work improving it. While this has not yet been released in an official version, both the Developer Guide [4] and User Guide [5] have received substantial enhancements describing the complete list of provenance event types and their usage and meaning. This work is available on master and will be released in 1.2.0.

The project has certainly evolved over a long lifetime, and you are correct that different processors have different philosophies. Sometimes that is the result of different authors, sometimes it is a legitimate result of the wide variety of scenarios that these processors interact with. Improving the user experience and documentation is always important, and getting started with and maximizing the usefulness of these processors is one of our top priorities. 

I would also reference Chesterton’s Fence [6] here. There are definitely improvements to be made; I do not disagree. But I would caution against making changes to improve a system without understanding how it reached its current state, a mistake I have made in the past. Once one has a firm grasp on the history, a reasonable plan can be made to improve things. We always welcome suggestions to improve the experience for the community.

Hope this helps, and I’d love to get your feedback on where else we can be better. Thanks.

[1] https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html
[2] https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#split-content-one-to-many
[3] https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#cohesion-and-reusability
[4] https://github.com/andrewmlim/nifi/blob/bd9eb0ac6009845de9d5a34bd5384ade1945befd/nifi-docs/src/main/asciidoc/developer-guide.adoc#provenance-events
[5] https://github.com/andrewmlim/nifi/blob/bd9eb0ac6009845de9d5a34bd5384ade1945befd/nifi-docs/src/main/asciidoc/user-guide.adoc#data-provenance
[6] https://en.wikipedia.org/wiki/Wikipedia:Chesterton's_fence

Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69


Re: Processor logic

Mark Payne
Hi Uwe,

I think Andy did a great job of explaining the processor "design theory" here. I just have a few other points to help clarify.

Firstly, on Provenance Events. While the documentation does appear to be lacking there at the moment
(thanks for clarifying that for the next release, Drew!), the framework does its best to help here. The general guidance
that I can offer on this subject is as follows; a short code sketch illustrating these calls follows the list.

- An ATTRIBUTES_MODIFIED event should be emitted only if your processor does not emit any other event. This is
because all Provenance Events do include the attributes. So if you emit a CONTENT_MODIFIED event, for instance, there
is already an event that captures which attributes changed. But in the case that you have no other event (as is the case for
UpdateAttribute, for instance), ATTRIBUTES_MODIFIED is very useful.

- Any time that a FlowFile is created from data that is received from a remote system (including another NiFi instance),
this should be recorded as either a FETCH (if the Processor replaced the content of an existing FlowFile) or a RECEIVE
(if the Processor created a new FlowFile, such as a Get* Processor that starts a flow). This event indicates where the data came from.

- Any time that a FlowFile is sent to a remote system (including another NiFi instance) or leaves the comfy confines of
NiFi (as in the case of PutFile), a SEND event should be emitted that indicates where the data went.

- If you do not emit a Provenance Event in your processor, the framework will do so for you, if appropriate. However, the framework
has no way to know whether data was received from or sent to a remote system, so it is important that FETCH/RECEIVE/SEND
events are always emitted by the Processor. If a "new" FlowFile is created (with no parent FlowFiles) and no RECEIVE event is emitted, then
the framework will emit a CREATE event so that it is clear which component introduced the FlowFile into the flow. But in that case it does not
know where the data came from, so it is always best to emit your own RECEIVE event when getting data from elsewhere.

- If you emit a Provenance Event that "conflicts" with what the framework generates (for instance, you emit a CONTENT_MODIFIED event
that includes a value in the Details field), then the framework will discard its own event in favor of the one emitted by the Processor. The framework
always assumes that the Processor developer knows best, because the developer may have more context.
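
To make those concrete, here is the promised sketch of what the calls look like. This is an illustrative fragment that would live inside a processor class; the transit URIs and details strings are made up, and a real processor would emit only the one event that matches what it actually did:

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.ProcessSession;

// Each comment names the processor style that would emit that event,
// and only that event.
void emitProvenanceExamples(final ProcessSession session, final FlowFile flowFile) {
    // A Get*-style processor that created a brand-new FlowFile from a
    // remote system emits a RECEIVE with the source's transit URI:
    session.getProvenanceReporter().receive(flowFile, "sftp://host.example.com/in/data.csv");

    // A Fetch*-style processor that replaced the content of an existing
    // FlowFile emits a FETCH instead:
    session.getProvenanceReporter().fetch(flowFile, "sftp://host.example.com/in/data.csv");

    // A Put*-style processor that delivered the content to a remote system,
    // or out of NiFi entirely, emits a SEND:
    session.getProvenanceReporter().send(flowFile, "sftp://host.example.com/out/data.csv");

    // An UpdateAttribute-style processor that emits no other event records
    // ATTRIBUTES_MODIFIED explicitly:
    session.getProvenanceReporter().modifyAttributes(flowFile);

    // Supplying a Details value causes the framework to discard its own
    // auto-generated CONTENT_MODIFIED event in favor of this one:
    session.getProvenanceReporter().modifyContent(flowFile, "Replaced header row with canonical column names");
}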


Regarding Processors handling all of the data vs. only a single "piece" of the data: Processors should handle all of the content of a FlowFile
whenever it makes sense to do so. For example, if you have a Processor that is designed to operate on the "header" of a CSV file, then it does
not make sense to read and process the rest of the data. However, if you have a Processor that, say, modifies one of the cells in each row of a CSV
file, it should certainly operate on all rows. This gives the user the flexibility to send in single-row CSV files, if they need to split the data for some other reason,
or to send in a many-gigabyte CSV file without paying the cost of splitting it up and then re-merging it.
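
A rough sketch of that pattern, as a fragment from inside onTrigger() (transformRow() is a hypothetical stand-in for the real cell edit). It also answers the header-row question from the original note: treat the header as a one-line special case within the same streaming pass.

// Assumes the usual imports for a StreamCallback-based processor
// (java.io.*, StandardCharsets, org.apache.nifi.processor.io.StreamCallback).
flowFile = session.write(flowFile, new StreamCallback() {
    @Override
    public void process(final InputStream in, final OutputStream out) throws IOException {
        final BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        final Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);

        // Pass the header row through untouched.
        final String header = reader.readLine();
        if (header != null) {
            writer.write(header);
            writer.write('\n');
        }

        // Transform every data row, whether there is one or a billion.
        String row;
        while ((row = reader.readLine()) != null) {
            writer.write(transformRow(row)); // hypothetical per-row edit
            writer.write('\n');
        }
        writer.flush();
    }
});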

Where you may see existing Processors deviate from this is in something like ExtractText, which says that it will only buffer up to the configured
amount of data. This is done because, in order to run a regular expression over the data, the data has to be buffered in the Java heap. If we did not bound
this, a user would inevitably send in a 10 GB CSV file and get OutOfMemoryErrors. In this case, we would encourage users to use a small buffer
size such as 1 MB and then split the content upstream.
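
The bounded-read pattern itself is small. Here is a sketch, as a fragment inside onTrigger(), using the 1 MB figure above as the illustrative cap (StreamUtils comes from nifi-utils, InputStreamCallback from the processor API):

// Needs: java.util.concurrent.atomic.AtomicInteger,
// org.apache.nifi.processor.io.InputStreamCallback,
// org.apache.nifi.stream.io.StreamUtils
final int maxBufferSize = 1024 * 1024; // 1 MB cap on heap usage
final byte[] buffer = new byte[maxBufferSize];
final AtomicInteger bytesRead = new AtomicInteger();

session.read(flowFile, new InputStreamCallback() {
    @Override
    public void process(final InputStream in) throws IOException {
        // Read at most maxBufferSize bytes; anything beyond the cap is
        // ignored, so a 10 GB FlowFile cannot exhaust the heap.
        bytesRead.set(StreamUtils.fillBuffer(in, buffer, false));
    }
});

final String bufferedText = new String(buffer, 0, bytesRead.get(), StandardCharsets.UTF_8);
// The configured regular expressions can now be evaluated against bufferedText.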

I hope this provides some clarity instead of muddying the water :) Of course, if you have other design decisions that you struggle with, feel free
to shoot a note to the dev list, and we are more than happy to assist!

Thanks,
-Mark