Scaling source processors in nifi horizontally.

ashwin.konale@gmail.com
Hi,

I am experimenting with NiFi for one of our use cases, with plans to extend
it to various other data routing and ingestion use cases. Right now I need to
ingest data from MySQL binlogs into HDFS/GCS. We have around 250 different
schemas and about 3000 tables to read data from. The volume of the data flow
ranges from 500 to 2000 messages per second in different schemas.

The problem is that the mysqlCDC processor can run in only one thread. To
overcome this, I see two options.

1. Use primary node execution, with a separate processor for each schema.
All processors that read from MySQL would then run on a single node, which
will be a bottleneck no matter how big my NiFi cluster is.

2. Use multiple NiFi instances to pull data, and have a master NiFi cluster
for ingestion into the various sinks. With this approach I will have to
manage all these small NiFi instances, and may have to build some kind of
tooling on top of them to monitor them and provision new processors for
newly added schemas.

Is there a better way to achieve my use case with NiFi? Please advise me on
the architecture.

Looking forward to your suggestions.

- Ashwin
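
To make option 2 concrete, here is a minimal sketch of the kind of provisioning helper it would need: a stable mapping from schema name to ingest instance, so that a newly added schema always lands on the same instance. The names `assign_instance`, `build_plan`, the `nifi-ingest-N` labels, and the instance count of 5 are hypothetical and not taken from NiFi or from this thread. Plain modulo hashing is assumed for brevity.

```python
import hashlib
from collections import defaultdict


def assign_instance(schema: str, instance_count: int) -> int:
    """Map a schema name to an instance index using a stable hash.

    hashlib is used instead of the built-in hash() because the latter is
    randomized per process for strings and would not be repeatable.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    digest = hashlib.sha256(schema.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % instance_count


def build_plan(schemas, instance_count):
    """Group schema names by the (hypothetical) ingest instance that owns them."""
    plan = defaultdict(list)
    for schema in sorted(schemas):
        plan[assign_instance(schema, instance_count)].append(schema)
    return dict(plan)


if __name__ == "__main__":
    # Placeholder schema names; a real list would come from the MySQL servers.
    schemas = [f"schema{i}" for i in range(1, 251)]
    plan = build_plan(schemas, instance_count=5)
    for index in sorted(plan):
        print(f"nifi-ingest-{index}: {len(plan[index])} schemas")
```

Note that with modulo hashing, changing the instance count reassigns most schemas. If instances are added often, a consistent-hashing scheme would limit that churn.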

Re: Scaling source processors in nifi horizontally.

Mike Thomsen
On Wed, Oct 17, 2018 at 2:31 PM ashwin konale <[hidden email]> wrote:

> may have to build some kind of tooling on top of it to monitor/provision
> new processor for newly added schemas etc.

Could you elaborate on this part of your use case?

Re: Scaling source processors in nifi horizontally.

Ashwin Konale
Hi,

The flow looks like this:

MysqlCDC -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)

But we have around 250 schemas to pull data from, so with a clustered setup it becomes:

MysqlCDC_schema1 -> RPG
MysqlCDC_schema2 -> RPG
MysqlCDC_schema3 -> RPG and so on

InputPort -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)

But MysqlCDC can run only on the primary node of the cluster, so I will end up
running all of the input processors on a single node. This can easily become a
bottleneck as the number of schemas grows. Could you suggest an alternative
approach to this problem?
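
For a rough sense of the bottleneck, here is a back-of-envelope sketch of the read load on the primary node. It assumes the 500 to 2000 messages per second range applies to each of the 250 schemas individually, which the numbers earlier in this thread do not state explicitly, and that load spreads evenly if the sources are split across several ingest nodes. The node count of 5 is purely illustrative.

```python
# ASSUMPTION: the 500-2000 messages/second range is per schema.
# The thread gives the range but does not say so explicitly.
SCHEMA_COUNT = 250
RATE_LOW = 500
RATE_HIGH = 2000


def aggregate_rate(schema_count: int, per_schema_rate: int) -> int:
    """Total messages/second if every schema produces per_schema_rate."""
    return schema_count * per_schema_rate


def per_node_rate(total_rate: int, ingest_nodes: int) -> float:
    """Messages/second each node must read if sources are spread evenly."""
    if ingest_nodes < 1:
        raise ValueError("ingest_nodes must be at least 1")
    return total_rate / ingest_nodes


if __name__ == "__main__":
    # 1 node models primary-node-only execution; 5 is an illustrative split.
    for nodes in (1, 5):
        low = per_node_rate(aggregate_rate(SCHEMA_COUNT, RATE_LOW), nodes)
        high = per_node_rate(aggregate_rate(SCHEMA_COUNT, RATE_HIGH), nodes)
        print(f"{nodes} ingest node(s): {low:,.0f} to {high:,.0f} msg/s per node")
```

Under these assumptions a single primary node would have to read between 125,000 and 500,000 messages per second, versus 25,000 to 100,000 per node with five ingest nodes.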


On 2018/10/17 21:14:09, Mike Thomsen <[hidden email]> wrote:

> > may have to build some kind of tooling on top of it to monitor/provision
> > new processor for newly added schemas etc.
>
> Could you elaborate on this part of your use case?

Re: Scaling source processors in nifi horizontally.

Mike Thomsen
I initially thought you were saying that you had 250 Avro schemas that you
had to use, as in 250 distinct data models.

Maybe someone else has a suggestion on how to do it, but I think this may
just be a fundamental problem of having that many different databases in
MySQL and trying to do CDC with them.

Is there a hard business requirement to segregate data like that or some
factor like pulling from many remote databases that is at play here?

On Thu, Oct 18, 2018 at 6:19 AM ashwin konale <[hidden email]>
wrote:

> Hi,
>
> The flow is like this,
>
> MysqlCDC -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)
>
> But we have around 250 schemas to pull data from, So with clustering setup,
>
> MysqlCDC_schema1 -> RPG
> MysqlCDC_schema2 -> RPG
> MysqlCDC_schema3 -> RPG and so on
>
> InputPort -> UpdateAttributes -> MergeContent -> (PutHDFS, PutGCS)
>
> But MysqlCDC can run only in primary node in the cluster, I will end up
> running all of input processors in single node. This can easily become
> bottleneck with increasing number of schemas we have. Could you suggest me
> any alternative approach to this problem.