SplitRecord behaviour

SplitRecord behaviour

Kumara M S, Hemantha (Nokia - IN/Bangalore)
Hi All,

We have a use case where we receive huge JSON files (size may vary from 1 GB to 50 GB) via HTTP, convert them to XML (the XML format is not fixed; any other format is fine), and send them out using Kafka. The restriction is that the CPU and RAM requirement, once fixed, should handle files of all sizes and should not change based on the incoming file size.

We used ListenHTTP --> SplitRecord --> PublishKafka, but we observed that SplitRecord sends data to PublishKafka only after the whole FlowFile has been processed. Is there a reason it was designed this way? Would it not be better to send splits to the next processor after each configured number of records, instead of sending all splits in one shot?


Regards,
Hemantha


Re: SplitRecord behaviour

Bryan Bende
Hello,

Flow files are not transferred until the session they came from is
committed. So imagine we periodically commit and some of the splits
are transferred, then halfway through a failure is encountered: the
entire original flow file will be reprocessed, producing some of the
same splits that were already sent out. The way it is implemented now,
it is either completely successful or not, but never partially
successful and producing duplicates.
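
The trade-off can be illustrated with a toy Python model. This is not NiFi code: `split_all_or_nothing`, `split_with_periodic_commit`, `run_with_retry`, and the `fail_at` parameter are hypothetical names invented for this sketch, and a plain list stands in for the downstream queue. It only aims to show why releasing splits early can produce duplicates when the original input is reprocessed after a failure.

```python
def split_all_or_nothing(records, split_size, fail_at=None):
    """Buffer every split and release them only if the whole input succeeds."""
    pending = []
    for i in range(0, len(records), split_size):
        if fail_at is not None and i >= fail_at:
            raise RuntimeError("failure while splitting")
        pending.append(records[i:i + split_size])
    return pending  # "commit": all splits become visible together


def split_with_periodic_commit(records, split_size, downstream, fail_at=None):
    """Release each split immediately; a later failure cannot recall them."""
    for i in range(0, len(records), split_size):
        if fail_at is not None and i >= fail_at:
            raise RuntimeError("failure while splitting")
        downstream.append(records[i:i + split_size])


def run_with_retry(process, attempts_fail_at):
    """Call process(fail_at) once per entry, starting over after each failure."""
    for fail_at in attempts_fail_at:
        try:
            return process(fail_at)
        except RuntimeError:
            continue  # the original input is reprocessed from the start


records = list(range(6))

# Periodic commit: the first attempt fails at record 4, the second succeeds.
periodic_out = []
run_with_retry(
    lambda fail_at: split_with_periodic_commit(records, 2, periodic_out, fail_at),
    [4, None],
)

# All-or-nothing: same failure pattern, but nothing leaks from the failed attempt.
atomic_out = run_with_retry(
    lambda fail_at: split_all_or_nothing(records, 2, fail_at),
    [4, None],
)
```

In this model `periodic_out` ends up holding the first two splits twice, while `atomic_out` holds each split exactly once.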

Based on the description of your flow with the three processors you
mentioned, I wouldn't bother using SplitRecord, just have ListenHTTP
-> PublishKafkaRecord. PublishKafkaRecord can be configured with the
same reader and writer you were using in SplitRecord, and it will read
each record and send it to Kafka, without having to produce unnecessary
flow files.
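
As a rough illustration of the record-at-a-time idea (a sketch in plain Python, not how PublishKafkaRecord is implemented), the snippet below reads one record, converts it, and hands it to a sink before reading the next, so memory use does not grow with input size. It assumes newline-delimited JSON with flat objects whose keys are valid XML element names; `record_to_xml`, `stream_records`, and the `publish` callback are hypothetical names, and a list's `append` stands in for a Kafka producer.

```python
import io
import json
import xml.etree.ElementTree as ET


def record_to_xml(record):
    """Render one flat JSON object as a small XML element string."""
    root = ET.Element("record")
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


def stream_records(stream, publish):
    """Read one record per line, convert it, and publish it before reading the next."""
    count = 0
    for line in stream:  # file-like objects yield one line at a time
        line = line.strip()
        if not line:
            continue
        publish(record_to_xml(json.loads(line)))
        count += 1
    return count


# Stand-in input and sink; a real flow would read the HTTP body and send to Kafka.
sample = io.StringIO('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
sent = []
published = stream_records(sample, sent.append)
```

A single large JSON array would need an incremental parser instead of line-by-line reading; that case is outside this sketch.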

Thanks,

Bryan


Re: SplitRecord behaviour

Otto Fowler
Bryan,
So the best practice when segmenting is to

- build your segments as a list while processing the incoming stream
- then afterwards send them all to the relationship

right?



Re: SplitRecord behaviour

Bryan Bende
You can call transfer for each segment while processing the incoming
stream; it's just that the real transfer won't actually happen until
commit is called.

Most processors extend AbstractProcessor so commit is called for you
at the end, but you could choose to manage the session yourself and
call commit for each segment, as long as you are ok with the scenario
I described before where you have now sent some segments downstream
that can't be undone.

I guess an analogy would be writing lines of a flow file to a
relational database... you could either use a DB transaction per line
where if line 50 fails you can't undo the inserts for lines 1-49, or
you could do a transaction for all lines where they all insert or none
insert.

The ProcessSession is like the DB transaction.
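
The database analogy can be tried directly with Python's built-in `sqlite3` module. This is only a sketch of the analogy, not of NiFi itself: the table name `lines`, the helper names, and the simulated failure on the third line are all assumptions made for the example.

```python
import sqlite3


def insert_per_line(conn, lines, fail_on):
    """One transaction per line: earlier inserts survive a later failure."""
    for line in lines:
        if line == fail_on:
            raise ValueError(f"cannot insert {line!r}")
        with conn:  # commits on success, rolls back on exception
            conn.execute("INSERT INTO lines(text) VALUES (?)", (line,))


def insert_all_or_nothing(conn, lines, fail_on):
    """One transaction for all lines: a failure undoes every insert."""
    with conn:
        for line in lines:
            if line == fail_on:
                raise ValueError(f"cannot insert {line!r}")
            conn.execute("INSERT INTO lines(text) VALUES (?)", (line,))


def count_rows(conn):
    """Return the number of rows currently stored in the table."""
    return conn.execute("SELECT COUNT(*) FROM lines").fetchone()[0]


def run(insert_fn):
    """Run one strategy against a fresh in-memory database and count what stuck."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lines(text TEXT)")
    conn.commit()
    try:
        insert_fn(conn, ["line1", "line2", "line3"], fail_on="line3")
    except ValueError:
        pass  # the failure is expected; we only care about what was kept
    rows = count_rows(conn)
    conn.close()
    return rows


per_line_rows = run(insert_per_line)
all_or_nothing_rows = run(insert_all_or_nothing)
```

With per-line transactions the first two rows remain after the failure; with a single transaction the table is left empty, which mirrors a session that is committed once at the end.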


RE: SplitRecord behaviour

Kumara M S, Hemantha (Nokia - IN/Bangalore)
In reply to this post by Bryan Bende
Thanks Bryan, I got your point. Yes, we could try PublishKafkaRecord; in some other cases we have already used PublishKafkaRecord (CSV data to Avro) to send out records.
In this use case we thought of sending out a bunch of records in one shot (as we are not doing anything with the data) instead of sending one record at a time.

Thanks,
Hemantha


Re: SplitRecord behaviour

Bryan Bende
If you increase the concurrent tasks on PublishKafka then you are
right that you could publish multiple records at the same time, but I
suspect that the overhead of doing the split will cancel out any gains
from publishing in parallel.

Assuming the flow file has a decent amount of records (thousands),
then you could do any of the following...

- Keep all the records in one flow file and use PublishKafkaRecord,
this will be most efficient for NiFi in terms of I/O and heap usage,
but only sending one record at a time to Kafka

- Split to one record per flow file, generally discouraged as it puts
significant stress on NiFi's repos and heap, but could publish
individual records in parallel once they reach PublishKafka

- Split to smaller batches, say you start with 10k records in the
original flow file then split to 5 flow files with 2k records each,
then PublishKafka with 5 concurrent tasks, but have to determine
whether this actually works out better than the first option
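
The third option can be sketched in plain Python to make the numbers concrete. This is a toy model, not NiFi configuration: `split_into_batches` and `publish_batch` are hypothetical names, `publish_batch` only counts records instead of talking to Kafka, and a thread pool with five workers plays the role of five concurrent tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(records, batch_size):
    """Slice the record list into consecutive batches of at most batch_size."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def publish_batch(batch):
    """Hypothetical stand-in for publishing one batch; returns how many it sent."""
    return len(batch)


records = list(range(10_000))
batches = split_into_batches(records, 2_000)

# Five workers play the role of five concurrent tasks on the publisher.
with ThreadPoolExecutor(max_workers=5) as pool:
    sent_counts = list(pool.map(publish_batch, batches))

total_sent = sum(sent_counts)
```

Whether the parallel version actually beats a single sequential publisher depends on the cost of the split itself, so it is something to measure rather than assume.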


Re: SplitRecord behaviour

Kumara M S, Hemantha (Nokia - IN/Bangalore)
Yeah, I will try both options and see which one suits us better:
1. Split incoming file using SplitRecord and use PublishKafka
2. Take large file and use PublishKafkaRecord

Thanks,
Hemantha