How to optimise use of MergeContent for a large number of bins.

ashwin.konale@gmail.com
I have a NiFi workflow that reads from multiple MySQL binlogs and writes the data to
HDFS. I am using ChangeDataCapture as the source and PutHdfs as the sink, with a
MergeContent processor in between to chunk the messages together for HDFS:

[CDC (primary node only; about 200 of these processors, one per database)]
-> UpdateAttributes (db-table-hour)
-> MergeContent, 500 msgs (binned on db-table-hour)
-> MergeContent, 200 msgs (binned on db-table-hour)
-> PutHdfs
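
Roughly, the binning attribute and the first MergeContent are configured like this (the
attribute names below are illustrative rather than my exact values):

    UpdateAttribute:
      merge.key = ${db.name}-${table.name}-${now():format('yyyy-MM-dd-HH')}

    MergeContent (first stage):
      Merge Strategy             = Bin-Packing Algorithm
      Correlation Attribute Name = merge.key
      Maximum Number of Entries  = 500
      Maximum number of Bins     = 2500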

I have about 200 databases to read from and roughly 2,500 tables altogether. The binlog
update rate is around 1 Mbps per database. For now I am planning to run this on a
3-node NiFi cluster.

Has anyone used MergeContent with more than 2000 bins before? Does it scale well?
Can anyone suggest improvements to the workflow, or alternatives?

Thanks
Ashwin

Re: How to optimise use of MergeContent for a large number of bins.

Mark Payne
Ashwin,

You should have no problem here in terms of MergeContent scaling. But there are
a few things that you'll want to consider:

1. If you have a cluster of these, you're going to be merging FlowFiles per node, so
you'll need to ensure that each node receives enough data to reach your maximums of
500 / 200. You'll also want a timeout set in case a bin never fills exactly; see the
sketch below.
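
For example, something along these lines on each MergeContent (the 5-minute value is
purely illustrative; pick whatever latency you can tolerate):

    Minimum Number of Entries = 500     (200 on the second MergeContent)
    Maximum Number of Entries = 500     (200 on the second MergeContent)
    Max Bin Age               = 5 min   <- flushes a bin that never fills completely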

2. The first MergeContent will need to hold (in Java heap) 500 FlowFiles per bin *
2000 bins = 1 million FlowFiles. The second MergeContent will hold up to 400,000
more FlowFiles. So you'll need a decently sized heap (probably 8 GB) to avoid an
OutOfMemoryError. If you still have issues with Java heap, you may want to consider
using three MergeContent processors instead: a first that does 100 msgs/bin, then a
second that does 100 msgs/bin, and a third that does 10 msgs/bin. That still gives
100 * 100 * 10 = 100,000 messages per merged file, the same as 500 * 200, but each
processor holds far fewer FlowFiles in heap at any one time. This may not be necessary,
but it's something to consider if you want to avoid an OOME and don't want to, or can't,
allocate a huge Java heap. The good news here is that the amount of data you're actually
merging is not that much; it's just a lot of FlowFiles, so MergeContent should be pretty
efficient.
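
If it helps, a minimal sketch of the heap setting in conf/bootstrap.conf (the 8 GB figure
is just the rough estimate above, not a measured requirement):

    # conf/bootstrap.conf
    java.arg.2=-Xms8g
    java.arg.3=-Xmx8g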

I hope this is helpful!

Thanks
-Mark

