Load distribution in cluster mode

Ricky Saltzer
Hi -

I have a question about load distribution in a clustered NiFi
environment. I have a really simple example: I'm using the GenerateFlowFile
processor to generate some random data, then I MD5-hash the content and
print out the resulting hash.
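
For reference, the GenerateFlowFile -> HashContent portion of that flow amounts to something like this Python sketch (the function names are illustrative stand-ins, not NiFi API):

```python
import hashlib
import os

def generate_flowfile(size: int = 1024) -> bytes:
    """Stand-in for GenerateFlowFile: produce some random content."""
    return os.urandom(size)

def hash_content(content: bytes) -> str:
    """Stand-in for HashContent configured with MD5: return the hex digest."""
    return hashlib.md5(content).hexdigest()

content = generate_flowfile()
print(hash_content(content))  # a 32-character hex digest
```

The per-file work here is trivial; the question below is about spreading many such calls across the nodes of a cluster.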

I want only the primary node to generate the data, but I want both nodes in
the cluster to share the hashing workload. It appears that if I set the
scheduling strategy to "On primary node" for the GenerateFlowFile
processor, then the data is only accepted and processed by a single node at
the next processor (HashContent).

I've put a DistributeLoad processor between GenerateFlowFile and
HashContent, but this requires me to use a remote process group to
distribute the load, which doesn't seem intuitive when I'm already
clustered.

I guess my question is: is it possible for the DistributeLoad processor to
recognize that NiFi is running in a clustered environment and distribute
the work of the next processor (HashContent) among all nodes in the
cluster?

Cheers,
--
Ricky Saltzer
http://www.cloudera.com

Re: Load distribution in cluster mode

Mark Payne
Ricky,


The DistributeLoad processor is simply used to route to one of many relationships. So if you have, for instance, 5 different servers that you can FTP files to, you can use DistributeLoad to round robin the files between them, so that you end up pushing 20% to each of 5 PutFTP processors.
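
As a sketch of that round-robin behavior (the relationship names are illustrative; DistributeLoad routes between relationships on a single node, not between cluster nodes):

```python
from itertools import cycle

# Five relationships, e.g. each feeding its own PutFTP processor.
relationships = ["putftp-1", "putftp-2", "putftp-3", "putftp-4", "putftp-5"]
next_relationship = cycle(relationships)

def distribute(flowfile: str) -> str:
    """Route each incoming FlowFile to the next relationship in turn."""
    return next(next_relationship)

# Over 10 files, each of the 5 relationships receives 2 (20% each).
routed = [distribute(f"file-{i}") for i in range(10)]
```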


What you’re wanting to do, it sounds like, is to distribute the FlowFiles to different nodes in the cluster. The Remote Process Group is how you would need to do that at this time. We have discussed having the ability to mark a Connection as “Auto-Distributed” (or maybe some better name 😊) and have that automatically distribute the data between nodes in the cluster, but that feature hasn’t yet been implemented.


Does that answer your question?


Thanks

-Mark

Re: Load distribution in cluster mode

Ricky Saltzer
Mark -

Thanks for the fast reply, much appreciated. This is what I figured, but
since I was already in clustered mode, I wanted to make sure there wasn't
an easier way than adding each node as a remote process group.

Is there already a JIRA to track the ability to auto distribute in
clustered mode, or would you like me to open it up?

Thanks again,
Ricky

--
Ricky Saltzer
http://www.cloudera.com

Re: Load distribution in cluster mode

Mark Payne
Ricky,




I don’t think there’s a JIRA ticket currently. Feel free to create one.




I think we may need to do a better job of documenting how Remote Process Groups work. If you have a cluster set up, you would add a Remote Process Group that points to the Cluster Manager (i.e., the URL that you connect to in order to see the graph).


Then, anything that you send to the Remote Process Group will automatically get load-balanced across all of the nodes in the cluster. So you could set up a flow that looks something like:


GenerateFlowFile -> RemoteProcessGroup


Input Port -> HashContent


So these two flows are disjoint. The first part generates data and then distributes it to the cluster (when you connect to the Remote Process Group, you choose which Input Port to send to).


But what we’d like to do in the future is something like:


GenerateFlowFile -> HashContent


And then on the connection in the middle choose to auto-distribute the data. Right now you have to put the Remote Process Group in there to distribute to the cluster, and add the Input Port to receive the data. But there should only be a single RemoteProcessGroup that points to the entire cluster, not one per node.


Thanks

-Mark

Re: Load distribution in cluster mode

Joe Witt
Ricky,

The use case you're describing here is a good and common one:

If I have a data source which does not offer scalability (for instance, it
can only send to a single node) but I have a scalable distribution
cluster, what are my options?

Today you can accept the data on a single node, then immediately do as
Mark describes and send it to a "Remote Process Group" addressing the
cluster itself.  That way NiFi will automatically discover all the
nodes in the cluster and spread the data around, factoring in load, etc.
But we do want to establish an even more automatic mechanism on the
connection itself, where the user can indicate that the data should be
auto-balanced.

The reverse is true as well: you can have a consumer which
only wants to accept data from a single host, so there too we need a
mechanism to funnel the data back down to one node.

I realize the flow you're working with now is just a familiarization
exercise, but do you think this is something we should tackle soon
(based on real scenarios you face)?

Thanks
Joe

Re: Load distribution in cluster mode

Ricky Saltzer
Hey Joe -

This makes sense, and I'm in the process of trying it out now. I'm running
into a small issue where the remote process group says that neither of the
nodes is configured for site-to-site communication.

Although not super intuitive, sending to a remote process group pointing
to the cluster should be fine as long as it works (which I'm sure it does).

Ricky

--
Ricky Saltzer
http://www.cloudera.com

Re: Load distribution in cluster mode

Mark Payne
Ricky,

In the nifi.properties file, there's a property named "nifi.remote.input.port". By default it's empty; set it to whatever port you want to use for site-to-site. Additionally, you'll need to either set "nifi.remote.input.secure" to false or configure the keystore and truststore properties. Configure this on the nodes and the NCM. After that (and a restart) you should be good to go!
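
For a minimal unsecured setup, the relevant nifi.properties entries might look like this (the port number is illustrative; use any free port on each host):

```properties
# In nifi.properties on each node (and on the NCM); restart NiFi after editing.
nifi.remote.input.port=10443
# Either leave site-to-site unsecured, or set this to true and configure
# the keystore/truststore properties instead.
nifi.remote.input.secure=false
```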

If you run into any issues let us know.

Thanks
-Mark


Re: Load distribution in cluster mode

Joe Witt
Site-to-site is a powerhouse feature but has caused a good bit of
confusion.  Perhaps we should plan for its inclusion in the set of things
that can be tuned/set at runtime.

It would be good to include with that information about bound interfaces,
what messages will get sent, and so on.  Folks in proxy-type situations
have a hard time reasoning about what is happening; there is little sense
of "is this thing on?".

What do you all think?
> >>>> Hi -
> >>>>
> >>>> I have a question regarding load distribution in a clustered NiFi
> >>>> environment. I have a really simple example, I'm using the
> >> GenerateFlowFile
> >>>> processor to generate some random data, then I MD5 hash the file and
> >> print
> >>>> out the resulting hash.
> >>>>
> >>>> I want only the primary node to generate the data, but I want both
> >> nodes in
> >>>> the cluster to share the hashing workload. It appears if I set the
> >>>> scheduling strategy to "On primary node" for the GenerateFlowFile
> >>>> processor, then the next processor (HashContent) is only being
> accepted
> >> and
> >>>> processed by a single node.
> >>>>
> >>>> I've put DistributeLoad processor in-between the HashContent and
> >>>> GenerateFlowFile, but this requires me to use the remote process group
> >> to
> >>>> distribute the load, which doesn't seem intuitive when I'm already
> >>>> clustered.
> >>>>
> >>>> I guess my question is, is it possible for the DistributeLoad
> processor
> >> to
> >>>> understand that NiFi is in a clustered environment, and have an
> ability
> >> to
> >>>> distribute the next processor (HashContent) amongst all nodes in the
> >>>> cluster?
> >>>>
> >>>> Cheers,
> >>>> --
> >>>> Ricky Saltzer
> >>>> http://www.cloudera.com
> >>>
> >>>
> >>>
> >>> --
> >>> Ricky Saltzer
> >>> http://www.cloudera.com
> >
> >
> >
> > --
> > Ricky Saltzer
> > http://www.cloudera.com
>

Re: Load distribution in cluster mode

Ricky Saltzer
Thanks for the tip, Mark! Allowing the user to enable the site to site
feature during runtime would be a good step in the right direction.
Documentation on how it works and why it's different from having your nodes
in a cluster would also make things easier to understand.


On Sun, Feb 8, 2015 at 9:11 AM, Joe Witt <[hidden email]> wrote:

> Site to site is a powerhouse feature but has caused a good bit of
> confusion.  Perhaps we should plan for its inclusion in the things that can
> be tuned/set at runtime.
>
> It would be good to include with that information about bounded
> interfaces.   Information about what messages will get sent, etc.   Folks
> in proxy type situations have a hard time reasoning over what is
> happening.  There is little sense of "is this thing on".
>
> What do you all think?
> On Feb 8, 2015 6:51 AM, "Mark Payne" <[hidden email]> wrote:
>
> > Ricky,
> >
> > In the nifi.properties file, there's a property named
> > "nifi.remote.input.port". By default it's empty. Set that to whatever
> port
> > you want to use for site-to-site. Additionally, you'll need to either set
> > "nifi.remote.input.secure" to false or configure keystore and truststore
> > properties. Configure this for nodes and NCM.  After that you should be
> > good to go (after restart)!
> >
> > If you run into any issues let us know.
> >
> > Thanks
> > -Mark
> >
> > Sent from my iPhone
> >
> > > On Feb 8, 2015, at 5:54 AM, Ricky Saltzer <[hidden email]> wrote:
> > >
> > > Hey Joe -
> > >
> > > This makes sense, and I'm in the process of trying it out now. I'm
> > running
> > > into a small issue where the remote process group is saying neither of
> > the
> > > nodes are configured for Site-to-Site communication.
> > >
> > > Although not super intuitive, sending to the remote process group
> > pointing
> > > to the cluster should be fine as long as it works (which I'm sure it
> > does).
> > >
> > > Ricky
> > >
> > >> On Fri, Feb 6, 2015 at 3:24 PM, Joe Witt <[hidden email]> wrote:
> > >>
> > >> Ricky,
> > >>
> > >> So the use case you're coming from here is a good and common one which
> > is:
> > >>
> > >> If I have a datasource which does not offer scalability (it can only
> > >> send to a single node for instance) but I have a scalable distribution
> > >> cluster what are my options?
> > >>
> > >> So today you can accept the data on a single node then immediately do as
> > >> Mark describes and fire it to a "Remote Process Group" addressing the
> > >> cluster itself.  That way NiFi will automatically figure out all the
> > >> nodes in the cluster and spread the data around factoring in
> > >> load/etc..  But we do want to establish an even more automatic
> > >> mechanism on a connection itself where the user can indicate the data
> > >> should be auto-balanced.
> > >>
> > >> The reverse is really true as well where you can have a consumer which
> > >> only wants to accept from a single host.  So there too we need a
> > >> mechanism to descale the approach.
> > >>
> > >> I realize the flow you're working with now is just a sort of
> > >> familiarization thing.  But do you think this is something we should
> > >> tackle soon (based on real scenarios you face)?
> > >>
> > >> Thanks
> > >> Joe
> > >>
> > >>> On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[hidden email]>
> > wrote:
> > >>> Ricky,
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> I don’t think there’s a JIRA ticket currently. Feel free to create
> one.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> I think we may need to do a better job documenting how Remote
> > >> Process Groups work. If you have a cluster setup, you would add a Remote
> > Process
> > >> Group that points to the Cluster Manager. (I.e., the URL that you
> > connect
> > >> to in order to see the graph).
> > >>>
> > >>>
> > >>> Then, anything that you send to the Remote Process Group will
> > >> automatically get load-balanced across all of the nodes in the
> cluster.
> > So
> > >> you could setup a flow that looks something like:
> > >>>
> > >>>
> > >>> GenerateFlowFile -> RemoteProcessGroup
> > >>>
> > >>>
> > >>> Input Port -> HashContent
> > >>>
> > >>>
> > >>> So these 2 flows are disjoint. The first part generates data and then
> > >> distributes it to the cluster (when you connect to the Remote Process
> > >> Group, you choose which Input Port to send to).
> > >>>
> > >>>
> > >>> But what we’d like to do in the future is something like:
> > >>>
> > >>>
> > >>> GenerateFlowFile -> HashContent
> > >>>
> > >>>
> > >>> And then on the connection in the middle choose to auto-distribute
> the
> > >> data. Right now you have to put the Remote Process Group in there to
> > >> distribute to the cluster, and add the Input Port to receive the data.
> > But
> > >> there should only be a single RemoteProcessGroup that points to the
> > entire
> > >> cluster, not one per node.
> > >>>
> > >>>
> > >>> Thanks
> > >>>
> > >>> -Mark
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> From: Ricky Saltzer
> > >>> Sent: ‎Friday‎, ‎February‎ ‎6‎, ‎2015 ‎3‎:‎06‎ ‎PM
> > >>> To: [hidden email]
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Mark -
> > >>>
> > >>> Thanks for the fast reply, much appreciated. This is what I figured,
> > but
> > >>> since I was already in clustered mode, I wanted to make sure there
> > wasn't
> > >>> an easier way than adding each node as a remote process group.
> > >>>
> > >>> Is there already a JIRA to track the ability to auto distribute in
> > >>> clustered mode, or would you like me to open it up?
> > >>>
> > >>> Thanks again,
> > >>> Ricky
> > >>>
> > >>>> On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[hidden email]>
> > wrote:
> > >>>>
> > >>>> Ricky,
> > >>>>
> > >>>>
> > >>>> The DistributeLoad processor is simply used to route to one of many
> > >>>> relationships. So if you have, for instance, 5 different servers
> that
> > >> you
> > >>>> can FTP files to, you can use DistributeLoad to round robin the
> files
> > >>>> between them, so that you end up pushing 20% to each of 5 PutFTP
> > >> processors.
> > >>>>
> > >>>>
> > >>>> What you’re wanting to do, it sounds like, is to distribute the
> > >> FlowFiles
> > >>>> to different nodes in the cluster. The Remote Process Group is how
> you
> > >>>> would need to do that at this time. We have discussed having the
> > >> ability to
> > >>>> mark a Connection as “Auto-Distributed” (or maybe some better name
> 😊)
> > >> and
> > >>>> have that automatically distribute the data between nodes in the
> > >> cluster,
> > >>>> but that feature hasn’t yet been implemented.
> > >>>>
> > >>>>
> > >>>> Does that answer your question?
> > >>>>
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>>> -Mark
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> From: Ricky Saltzer
> > >>>> Sent: ‎Friday‎, ‎February‎ ‎6‎, ‎2015 ‎2‎:‎56‎ ‎PM
> > >>>> To: [hidden email]
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> Hi -
> > >>>>
> > >>>> I have a question regarding load distribution in a clustered NiFi
> > >>>> environment. I have a really simple example, I'm using the
> > >> GenerateFlowFile
> > >>>> processor to generate some random data, then I MD5 hash the file and
> > >> print
> > >>>> out the resulting hash.
> > >>>>
> > >>>> I want only the primary node to generate the data, but I want both
> > >> nodes in
> > >>>> the cluster to share the hashing workload. It appears if I set the
> > >>>> scheduling strategy to "On primary node" for the GenerateFlowFile
> > >>>> processor, then the next processor (HashContent) is only being
> > accepted
> > >> and
> > >>>> processed by a single node.
> > >>>>
> > >>>> I've put DistributeLoad processor in-between the HashContent and
> > >>>> GenerateFlowFile, but this requires me to use the remote process
> group
> > >> to
> > >>>> distribute the load, which doesn't seem intuitive when I'm already
> > >>>> clustered.
> > >>>>
> > >>>> I guess my question is, is it possible for the DistributeLoad
> > processor
> > >> to
> > >>>> understand that NiFi is in a clustered environment, and have an
> > ability
> > >> to
> > >>>> distribute the next processor (HashContent) amongst all nodes in the
> > >>>> cluster?
> > >>>>
> > >>>> Cheers,
> > >>>> --
> > >>>> Ricky Saltzer
> > >>>> http://www.cloudera.com
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Ricky Saltzer
> > >>> http://www.cloudera.com
> > >
> > >
> > >
> > > --
> > > Ricky Saltzer
> > > http://www.cloudera.com
> >
>



--
Ricky Saltzer
http://www.cloudera.com

Re: Load distribution in cluster mode

Mark Payne
In reply to this post by Ricky Saltzer
Originally, I set the default in the properties file so that site-to-site is configured to be secure but not enabled. I did this because I didn’t want to enable it as non-secure by default because I was afraid that this would be dangerous… so I required that users explicitly go in and set it up. But thinking back to this, maybe that was a mistake. We set the default UI port to be 8080 and non-secure, so maybe we should just set the default so that site-to-site is enabled non-secure, as well. That would probably just make this a lot easier.
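The properties Mark walked through earlier all live in nifi.properties. A minimal non-secure sketch follows; the port number 8081 and the commented keystore paths are illustrative assumptions, not values taken from this thread:

```properties
# nifi.properties -- set on every node and on the NCM, then restart NiFi.

# Empty by default, which leaves site-to-site disabled; 8081 is an
# arbitrary example port.
nifi.remote.input.port=8081

# Either run site-to-site non-secure...
nifi.remote.input.secure=false

# ...or keep it secure and configure the keystore/truststore
# properties instead (example paths):
# nifi.security.keystore=/path/to/keystore.jks
# nifi.security.truststore=/path/to/truststore.jks
```

With these set on each node and on the NCM, a Remote Process Group pointed at the cluster manager URL should stop reporting that the nodes are not configured for site-to-site.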






From: Ricky Saltzer
Sent: ‎Sunday‎, ‎February‎ ‎8‎, ‎2015 ‎1‎:‎16‎ ‎PM
To: [hidden email]





Thanks for the tip, Mark! Allowing the user to enable the site to site
feature during runtime would be a good step in the right direction.
Documentation on how it works and why it's different from having your nodes
in a cluster would also make things easier to understand.


On Sun, Feb 8, 2015 at 9:11 AM, Joe Witt <[hidden email]> wrote:

> Site to site is a powerhouse feature but has caused a good bit of
> confusion.  Perhaps we should plan for its inclusion in the things that can
> be tuned/set at runtime.
>
> It would be good to include with that information about bounded
> interfaces.   Information about what messages will get sent, etc.   Folks
> in proxy type situations have a hard time reasoning over what is
> happening.  There is little sense of "is this thing on".
>
> What do you all think?
> On Feb 8, 2015 6:51 AM, "Mark Payne" <[hidden email]> wrote:
>
> > Ricky,
> >
> > In the nifi.properties file, there's a property named
> > "nifi.remote.input.port". By default it's empty. Set that to whatever
> port
> > you want to use for site-to-site. Additionally, you'll need to either set
> > "nifi.remote.input.secure" to false or configure keystore and truststore
> > properties. Configure this for nodes and NCM.  After that you should be
> > good to go (after restart)!
> >
> > If you run into any issues let us know.
> >
> > Thanks
> > -Mark
> >
> > Sent from my iPhone
> >
> > > On Feb 8, 2015, at 5:54 AM, Ricky Saltzer <[hidden email]> wrote:
> > >
> > > Hey Joe -
> > >
> > > This makes sense, and I'm in the process of trying it out now. I'm
> > running
> > > into a small issue where the remote process group is saying neither of
> > the
> > > nodes are configured for Site-to-Site communication.
> > >
> > > Although not super intuitive, sending to the remote process group
> > pointing
> > > to the cluster should be fine as long as it works (which I'm sure it
> > does).
> > >
> > > Ricky
> > >
> > >> On Fri, Feb 6, 2015 at 3:24 PM, Joe Witt <[hidden email]> wrote:
> > >>
> > >> Ricky,
> > >>
> > >> So the use case you're coming from here is a good and common one which
> > is:
> > >>
> > >> If I have a datasource which does not offer scalability (it can only
> > >> send to a single node for instance) but I have a scalable distribution
> > >> cluster what are my options?
> > >>
> > >> So today you can accept the data on a single node then immediately do as
> > >> Mark describes and fire it to a "Remote Process Group" addressing the
> > >> cluster itself.  That way NiFi will automatically figure out all the
> > >> nodes in the cluster and spread the data around factoring in
> > >> load/etc..  But we do want to establish an even more automatic
> > >> mechanism on a connection itself where the user can indicate the data
> > >> should be auto-balanced.
> > >>
> > >> The reverse is really true as well where you can have a consumer which
> > >> only wants to accept from a single host.  So there too we need a
> > >> mechanism to descale the approach.
> > >>
> > >> I realize the flow you're working with now is just a sort of
> > >> familiarization thing.  But do you think this is something we should
> > >> tackle soon (based on real scenarios you face)?
> > >>
> > >> Thanks
> > >> Joe
> > >>
> > >>> On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[hidden email]>
> > wrote:
> > >>> Ricky,
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> I don’t think there’s a JIRA ticket currently. Feel free to create
> one.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> I think we may need to do a better job documenting how Remote
> > >> Process Groups work. If you have a cluster setup, you would add a Remote
> > Process
> > >> Group that points to the Cluster Manager. (I.e., the URL that you
> > connect
> > >> to in order to see the graph).
> > >>>
> > >>>
> > >>> Then, anything that you send to the Remote Process Group will
> > >> automatically get load-balanced across all of the nodes in the
> cluster.
> > So
> > >> you could setup a flow that looks something like:
> > >>>
> > >>>
> > >>> GenerateFlowFile -> RemoteProcessGroup
> > >>>
> > >>>
> > >>> Input Port -> HashContent
> > >>>
> > >>>
> > >>> So these 2 flows are disjoint. The first part generates data and then
> > >> distributes it to the cluster (when you connect to the Remote Process
> > >> Group, you choose which Input Port to send to).
> > >>>
> > >>>
> > >>> But what we’d like to do in the future is something like:
> > >>>
> > >>>
> > >>> GenerateFlowFile -> HashContent
> > >>>
> > >>>
> > >>> And then on the connection in the middle choose to auto-distribute
> the
> > >> data. Right now you have to put the Remote Process Group in there to
> > >> distribute to the cluster, and add the Input Port to receive the data.
> > But
> > >> there should only be a single RemoteProcessGroup that points to the
> > entire
> > >> cluster, not one per node.
> > >>>
> > >>>
> > >>> Thanks
> > >>>
> > >>> -Mark
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> From: Ricky Saltzer
> > >>> Sent: ‎Friday‎, ‎February‎ ‎6‎, ‎2015 ‎3‎:‎06‎ ‎PM
> > >>> To: [hidden email]
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Mark -
> > >>>
> > >>> Thanks for the fast reply, much appreciated. This is what I figured,
> > but
> > >>> since I was already in clustered mode, I wanted to make sure there
> > wasn't
> > >>> an easier way than adding each node as a remote process group.
> > >>>
> > >>> Is there already a JIRA to track the ability to auto distribute in
> > >>> clustered mode, or would you like me to open it up?
> > >>>
> > >>> Thanks again,
> > >>> Ricky
> > >>>
> > >>>> On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[hidden email]>
> > wrote:
> > >>>>
> > >>>> Ricky,
> > >>>>
> > >>>>
> > >>>> The DistributeLoad processor is simply used to route to one of many
> > >>>> relationships. So if you have, for instance, 5 different servers
> that
> > >> you
> > >>>> can FTP files to, you can use DistributeLoad to round robin the
> files
> > >>>> between them, so that you end up pushing 20% to each of 5 PutFTP
> > >> processors.
> > >>>>
> > >>>>
> > >>>> What you’re wanting to do, it sounds like, is to distribute the
> > >> FlowFiles
> > >>>> to different nodes in the cluster. The Remote Process Group is how
> you
> > >>>> would need to do that at this time. We have discussed having the
> > >> ability to
> > >>>> mark a Connection as “Auto-Distributed” (or maybe some better name
> 😊)
> > >> and
> > >>>> have that automatically distribute the data between nodes in the
> > >> cluster,
> > >>>> but that feature hasn’t yet been implemented.
> > >>>>
> > >>>>
> > >>>> Does that answer your question?
> > >>>>
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>>> -Mark
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> From: Ricky Saltzer
> > >>>> Sent: ‎Friday‎, ‎February‎ ‎6‎, ‎2015 ‎2‎:‎56‎ ‎PM
> > >>>> To: [hidden email]
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> Hi -
> > >>>>
> > >>>> I have a question regarding load distribution in a clustered NiFi
> > >>>> environment. I have a really simple example, I'm using the
> > >> GenerateFlowFile
> > >>>> processor to generate some random data, then I MD5 hash the file and
> > >> print
> > >>>> out the resulting hash.
> > >>>>
> > >>>> I want only the primary node to generate the data, but I want both
> > >> nodes in
> > >>>> the cluster to share the hashing workload. It appears if I set the
> > >>>> scheduling strategy to "On primary node" for the GenerateFlowFile
> > >>>> processor, then the next processor (HashContent) is only being
> > accepted
> > >> and
> > >>>> processed by a single node.
> > >>>>
> > >>>> I've put DistributeLoad processor in-between the HashContent and
> > >>>> GenerateFlowFile, but this requires me to use the remote process
> group
> > >> to
> > >>>> distribute the load, which doesn't seem intuitive when I'm already
> > >>>> clustered.
> > >>>>
> > >>>> I guess my question is, is it possible for the DistributeLoad
> > processor
> > >> to
> > >>>> understand that NiFi is in a clustered environment, and have an
> > ability
> > >> to
> > >>>> distribute the next processor (HashContent) amongst all nodes in the
> > >>>> cluster?
> > >>>>
> > >>>> Cheers,
> > >>>> --
> > >>>> Ricky Saltzer
> > >>>> http://www.cloudera.com
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Ricky Saltzer
> > >>> http://www.cloudera.com
> > >
> > >
> > >
> > > --
> > > Ricky Saltzer
> > > http://www.cloudera.com
> >
>



--
Ricky Saltzer
http://www.cloudera.com

Re: Load distribution in cluster mode

Joe Witt
I think the change you made more recently was totally appropriate.
The right answer here in my opinion is to provide a way for users to
manage/view/understand this at runtime.

Site to site is a pretty great feature and we just need to give it a
more first-class treatment:
- Great documentation for users in the app
- Great documentation for the protocol itself and examples of clients
(we should likely even help seed the development of a few for popular
languages)
- Good user experience at runtime to modify and understand what is
happening, etc..

Also sorry for my unbelievably unreadable e-mail on this thread
earlier. I really should never send e-mails from my phone.

Thanks
Joe

On Sun, Feb 8, 2015 at 1:17 PM, Mark Payne <[hidden email]> wrote:

> Originally, I set the default in the properties file so that site-to-site is configured to be secure but not enabled. I did this because I didn’t want to enable it as non-secure by default because I was afraid that this would be dangerous… so I required that users explicitly go in and set it up. But thinking back to this, maybe that was a mistake. We set the default UI port to be 8080 and non-secure, so maybe we should just set the default so that site-to-site is enabled non-secure, as well. That would probably just make this a lot easier.
>
>
>
>
>
>
> From: Ricky Saltzer
> Sent: ‎Sunday‎, ‎February‎ ‎8‎, ‎2015 ‎1‎:‎16‎ ‎PM
> To: [hidden email]
>
>
>
>
>
> Thanks for the tip, Mark! Allowing the user to enable the site to site
> feature during runtime would be a good step in the right direction.
> Documentation on how it works and why it's different from having your nodes
> in a cluster would also make things easier to understand.
>
>
> On Sun, Feb 8, 2015 at 9:11 AM, Joe Witt <[hidden email]> wrote:
>
>> Site to site is a powerhouse feature but has caused a good bit of
>> confusion.  Perhaps we should plan for its inclusion in the things that can
>> be tuned/set at runtime.
>>
>> It would be good to include with that information about bounded
>> interfaces.   Information about what messages will get sent, etc.   Folks
>> in proxy type situations have a hard time reasoning over what is
>> happening.  There is little sense of "is this thing on".
>>
>> What do you all think?
>> On Feb 8, 2015 6:51 AM, "Mark Payne" <[hidden email]> wrote:
>>
>> > Ricky,
>> >
>> > In the nifi.properties file, there's a property named
>> > "nifi.remote.input.port". By default it's empty. Set that to whatever
>> port
>> > you want to use for site-to-site. Additionally, you'll need to either set
>> > "nifi.remote.input.secure" to false or configure keystore and truststore
>> > properties. Configure this for nodes and NCM.  After that you should be
>> > good to go (after restart)!
>> >
>> > If you run into any issues let us know.
>> >
>> > Thanks
>> > -Mark
>> >
>> > Sent from my iPhone
>> >
>> > > On Feb 8, 2015, at 5:54 AM, Ricky Saltzer <[hidden email]> wrote:
>> > >
>> > > Hey Joe -
>> > >
>> > > This makes sense, and I'm in the process of trying it out now. I'm
>> > running
>> > > into a small issue where the remote process group is saying neither of
>> > the
>> > > nodes are configured for Site-to-Site communication.
>> > >
>> > > Although not super intuitive, sending to the remote process group
>> > pointing
>> > > to the cluster should be fine as long as it works (which I'm sure it
>> > does).
>> > >
>> > > Ricky
>> > >
>> > >> On Fri, Feb 6, 2015 at 3:24 PM, Joe Witt <[hidden email]> wrote:
>> > >>
>> > >> Ricky,
>> > >>
>> > >> So the use case you're coming from here is a good and common one which
>> > is:
>> > >>
>> > >> If I have a datasource which does not offer scalability (it can only
>> > >> send to a single node for instance) but I have a scalable distribution
>> > >> cluster what are my options?
>> > >>
>> > >> So today you can accept the data on a single node then immediately do as
>> > >> Mark describes and fire it to a "Remote Process Group" addressing the
>> > >> cluster itself.  That way NiFi will automatically figure out all the
>> > >> nodes in the cluster and spread the data around factoring in
>> > >> load/etc..  But we do want to establish an even more automatic
>> > >> mechanism on a connection itself where the user can indicate the data
>> > >> should be auto-balanced.
>> > >>
>> > >> The reverse is really true as well where you can have a consumer which
>> > >> only wants to accept from a single host.  So there too we need a
>> > >> mechanism to descale the approach.
>> > >>
>> > >> I realize the flow you're working with now is just a sort of
>> > >> familiarization thing.  But do you think this is something we should
>> > >> tackle soon (based on real scenarios you face)?
>> > >>
>> > >> Thanks
>> > >> Joe
>> > >>
>> > >>> On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[hidden email]>
>> > wrote:
>> > >>> Ricky,
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> I don’t think there’s a JIRA ticket currently. Feel free to create
>> one.
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> I think we may need to do a better job documenting how Remote
>> > >> Process Groups work. If you have a cluster setup, you would add a Remote
>> > Process
>> > >> Group that points to the Cluster Manager. (I.e., the URL that you
>> > connect
>> > >> to in order to see the graph).
>> > >>>
>> > >>>
>> > >>> Then, anything that you send to the Remote Process Group will
>> > >> automatically get load-balanced across all of the nodes in the
>> cluster.
>> > So
>> > >> you could setup a flow that looks something like:
>> > >>>
>> > >>>
>> > >>> GenerateFlowFile -> RemoteProcessGroup
>> > >>>
>> > >>>
>> > >>> Input Port -> HashContent
>> > >>>
>> > >>>
>> > >>> So these 2 flows are disjoint. The first part generates data and then
>> > >> distributes it to the cluster (when you connect to the Remote Process
>> > >> Group, you choose which Input Port to send to).
>> > >>>
>> > >>>
>> > >>> But what we’d like to do in the future is something like:
>> > >>>
>> > >>>
>> > >>> GenerateFlowFile -> HashContent
>> > >>>
>> > >>>
>> > >>> And then on the connection in the middle choose to auto-distribute
>> the
>> > >> data. Right now you have to put the Remote Process Group in there to
>> > >> distribute to the cluster, and add the Input Port to receive the data.
>> > But
>> > >> there should only be a single RemoteProcessGroup that points to the
>> > entire
>> > >> cluster, not one per node.
>> > >>>
>> > >>>
>> > >>> Thanks
>> > >>>
>> > >>> -Mark
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> From: Ricky Saltzer
>> > >>> Sent: ‎Friday‎, ‎February‎ ‎6‎, ‎2015 ‎3‎:‎06‎ ‎PM
>> > >>> To: [hidden email]
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>> Mark -
>> > >>>
>> > >>> Thanks for the fast reply, much appreciated. This is what I figured,
>> > but
>> > >>> since I was already in clustered mode, I wanted to make sure there
>> > wasn't
>> > >>> an easier way than adding each node as a remote process group.
>> > >>>
>> > >>> Is there already a JIRA to track the ability to auto distribute in
>> > >>> clustered mode, or would you like me to open it up?
>> > >>>
>> > >>> Thanks again,
>> > >>> Ricky
>> > >>>
>> > >>>> On Fri, Feb 6, 2015 at 2:58 PM, Mark Payne <[hidden email]>
>> > wrote:
>> > >>>>
>> > >>>> Ricky,
>> > >>>>
>> > >>>>
>> > >>>> The DistributeLoad processor is simply used to route to one of many
>> > >>>> relationships. So if you have, for instance, 5 different servers
>> that
>> > >> you
>> > >>>> can FTP files to, you can use DistributeLoad to round robin the
>> files
>> > >>>> between them, so that you end up pushing 20% to each of 5 PutFTP
>> > >> processors.
>> > >>>>
>> > >>>>
>> > >>>> What you’re wanting to do, it sounds like, is to distribute the
>> > >> FlowFiles
>> > >>>> to different nodes in the cluster. The Remote Process Group is how
>> you
>> > >>>> would need to do that at this time. We have discussed having the
>> > >> ability to
>> > >>>> mark a Connection as “Auto-Distributed” (or maybe some better name
>> 😊)
>> > >> and
>> > >>>> have that automatically distribute the data between nodes in the
>> > >> cluster,
>> > >>>> but that feature hasn’t yet been implemented.
>> > >>>>
>> > >>>>
>> > >>>> Does that answer your question?
>> > >>>>
>> > >>>>
>> > >>>> Thanks
>> > >>>>
>> > >>>> -Mark
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> From: Ricky Saltzer
>> > >>>> Sent: ‎Friday‎, ‎February‎ ‎6‎, ‎2015 ‎2‎:‎56‎ ‎PM
>> > >>>> To: [hidden email]
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> Hi -
>> > >>>>
>> > >>>> I have a question regarding load distribution in a clustered NiFi
>> > >>>> environment. I have a really simple example, I'm using the
>> > >> GenerateFlowFile
>> > >>>> processor to generate some random data, then I MD5 hash the file and
>> > >> print
>> > >>>> out the resulting hash.
>> > >>>>
>> > >>>> I want only the primary node to generate the data, but I want both
>> > >> nodes in
>> > >>>> the cluster to share the hashing workload. It appears if I set the
>> > >>>> scheduling strategy to "On primary node" for the GenerateFlowFile
>> > >>>> processor, then the next processor (HashContent) is only being
>> > accepted
>> > >> and
>> > >>>> processed by a single node.
>> > >>>>
>> > >>>> I've put DistributeLoad processor in-between the HashContent and
>> > >>>> GenerateFlowFile, but this requires me to use the remote process
>> group
>> > >> to
>> > >>>> distribute the load, which doesn't seem intuitive when I'm already
>> > >>>> clustered.
>> > >>>>
>> > >>>> I guess my question is, is it possible for the DistributeLoad
>> > processor
>> > >> to
>> > >>>> understand that NiFi is in a clustered environment, and have an
>> > ability
>> > >> to
>> > >>>> distribute the next processor (HashContent) amongst all nodes in the
>> > >>>> cluster?
>> > >>>>
>> > >>>> Cheers,
>> > >>>> --
>> > >>>> Ricky Saltzer
>> > >>>> http://www.cloudera.com
>> > >>>
>> > >>>
>> > >>>
>> > >>> --
>> > >>> Ricky Saltzer
>> > >>> http://www.cloudera.com
>> > >
>> > >
>> > >
>> > > --
>> > > Ricky Saltzer
>> > > http://www.cloudera.com
>> >
>>
>
>
>
> --
> Ricky Saltzer
> http://www.cloudera.com

Re: Load distribution in cluster mode

Ricky Saltzer
Agreed, making the site-to-site feature as easy to configure as a regular
processor would eliminate a lot of the confusion. Depending on how long it
would take to get this in as a first-class citizen, it might be worthwhile
writing up a "How to enable site-to-site" page that the remote process
group can link to. As Mark described, this was easy, and things started to
work as soon as I restarted the services.

Ricky

On Sun, Feb 8, 2015 at 1:24 PM, Joe Witt <[hidden email]> wrote:

> I think the change you made more recently was totally appropriate.
> The right answer here in my opinion is to provide a way for users to
> manage/view/understand this at runtime.
>
> Site to site is a pretty great feature and we just need to give it a
> more first-class treatment:
> - Great documentation for users in the app
> - Great documentation for the protocol itself and examples of clients
> (we should likely even help seed the development of a few for popular
> languages)
> - Good user experience at runtime to modify and understand what is
> happening, etc..
>
> Also sorry for my unbelievably unreadable e-mail on this thread
> earlier. I really should never send e-mails from my phone.
>
> Thanks
> Joe
>
> On Sun, Feb 8, 2015 at 1:17 PM, Mark Payne <[hidden email]> wrote:
> > Originally, I set the default in the properties file so that
> site-to-site is configured to be secure but not enabled. I did this because
> I didn’t want to enable it as non-secure by default because I was afraid
> that this would be dangerous… so I required that users explicitly go in and
> set it up. But thinking back to this, maybe that was a mistake. We set the
> default UI port to be 8080 and non-secure, so maybe we should just set the
> default so that site-to-site is enabled non-secure, as well. That would
> probably just make this a lot easier.
> >
> >
> >
> >
> >
> >
> > From: Ricky Saltzer
> > Sent: Sunday, February 8, 2015 1:16 PM
> > To: [hidden email]
> >
> >
> >
> >
> >
> > Thanks for the tip, Mark! Allowing the user to enable the site to site
> > feature during runtime would be a good step in the right direction.
> > Documentation on how it works and why it's different from having your
> nodes
> > in a cluster would also make things easier to understand.
> >
> >
> > On Sun, Feb 8, 2015 at 9:11 AM, Joe Witt <[hidden email]> wrote:
> >
> >> Site to site is a powerhouse feature but has caused a good bit of
> >> confusion. Perhaps we should plan for its inclusion in the things
> >> that can be tuned/set at runtime.
> >>
> >> It would be good to include with that information about bound
> >> interfaces: information about what messages will get sent, etc.
> >> Folks in proxy-type situations have a hard time reasoning over what
> >> is happening. There is little sense of "is this thing on".
> >>
> >> What do you all think?
> >> On Feb 8, 2015 6:51 AM, "Mark Payne" <[hidden email]> wrote:
> >>
> >> > Ricky,
> >> >
> >> > In the nifi.properties file, there's a property named
> >> > "nifi.remote.input.port". By default it's empty. Set that to whatever
> >> port
> >> > you want to use for site-to-site. Additionally, you'll need to either
> set
> >> > "nifi.remote.input.secure" to false or configure keystore and
> truststore
> >> > properties. Configure this for nodes and NCM.  After that you should
> be
> >> > good to go (after restart)!
> >> >
> >> > If you run into any issues let us know.
> >> >
> >> > Thanks
> >> > -Mark
> >> >
> >> > Sent from my iPhone
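
For reference, a minimal sketch of the nifi.properties settings Mark
describes. The port number 8081 is purely illustrative (site-to-site has
no default port; the property ships empty), and this assumes the
non-secure option rather than keystore/truststore configuration:

```properties
# Enable site-to-site by giving it an input port (empty by default).
# 8081 is an example value only; pick any free port.
nifi.remote.input.port=8081

# Either disable security for site-to-site, as below, or leave this
# true and configure the keystore/truststore properties instead.
nifi.remote.input.secure=false
```

Per Mark's note, these need to be set on every node and on the NCM, and
take effect after a restart.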
> >> >
> >> > > On Feb 8, 2015, at 5:54 AM, Ricky Saltzer <[hidden email]>
> wrote:
> >> > >
> >> > > Hey Joe -
> >> > >
> >> > > This makes sense, and I'm in the process of trying it out now. I'm
> >> > running
> >> > > into a small issue where the remote process group is saying neither
> of
> >> > the
> >> > > nodes are configured for Site-to-Site communication.
> >> > >
> >> > > Although not super intuitive, sending to the remote process group
> >> > pointing
> >> > > to the cluster should be fine as long as it works (which I'm sure it
> >> > does).
> >> > >
> >> > > Ricky
> >> > >
> >> > >> On Fri, Feb 6, 2015 at 3:24 PM, Joe Witt <[hidden email]>
> wrote:
> >> > >>
> >> > >> Ricky,
> >> > >>
> >> > >> So the use case you're coming from here is a good and common one
> which
> >> > is:
> >> > >>
> >> > >> If I have a datasource which does not offer scalability (it can only
> >> > >> send to a single node for instance) but I have a scalable
> distribution
> >> > >> cluster what are my options?
> >> > >>
> >> > >> So today you can accept the data on a single node then
> >> > >> immediately do as
> >> > >> Mark describes and fire it to a "Remote Process Group" addressing
> the
> >> > >> cluster itself.  That way NiFi will automatically figure out all
> the
> >> > >> nodes in the cluster and spread the data around factoring in
> >> > >> load/etc..  But we do want to establish an even more automatic
> >> > >> mechanism on a connection itself where the user can indicate the
> data
> >> > >> should be auto-balanced.
> >> > >>
> >> > >> The reverse is really true as well where you can have a consumer
> which
> >> > >> only wants to accept from a single host.  So there too we need a
> >> > >> mechanism to descale the approach.
> >> > >>
> >> > >> I realize the flow you're working with now is just a sort of
> >> > >> familiarization thing.  But do you think this is something we
> should
> >> > >> tackle soon (based on real scenarios you face)?
> >> > >>
> >> > >> Thanks
> >> > >> Joe
> >> > >>
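
The distribution Joe describes above (NiFi spreading flow files across
all nodes of the cluster) can be sketched generically as a round-robin
dispatcher. This is a conceptual illustration of the idea only, not
NiFi's actual implementation, and the node names are invented for the
example:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hands out destinations in a repeating sequence, so over time
    every node receives an equal share of the flow files."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def next_node(self):
        # Each call returns the next node in the rotation.
        return next(self._nodes)

balancer = RoundRobinBalancer(["node-1", "node-2"])
assignments = [balancer.next_node() for _ in range(4)]
print(assignments)  # ['node-1', 'node-2', 'node-1', 'node-2']
```

NiFi's remote process group additionally factors in per-node load when
choosing a destination, which a plain rotation like this does not.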
> >> > >>> On Fri, Feb 6, 2015 at 3:07 PM, Mark Payne <[hidden email]>
> >> > wrote:
> >> > >>> Ricky,
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> I don’t think there’s a JIRA ticket currently. Feel free to create
> >> one.
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> I think we may need to do a better job documenting how the Remote
> >> > >>> Process Groups work. If you have a cluster setup, you would add a Remote
> >> > Process
> >> > >> Group that points to the Cluster Manager. (I.e., the URL that you
> >> > connect
> >> > >> to in order to see the graph).
> >> > >>>
> >> > >>>
> >> > >>> Then, anything that you send to the Remote Process Group will
> >> > >> automatically get load-balanced across all of the nodes in the
> >> cluster.
> >> > So
> >> > >> you could setup a flow that looks something like:
> >> > >>>
> >> > >>>
> >> > >>> GenerateFlowFile -> RemoteProcessGroup
> >> > >>>
> >> > >>>
> >> > >>> Input Port -> HashContent
> >> > >>>
> >> > >>>
> >> > >>> So these 2 flows are disjoint. The first part generates data and
> then
> >> > >> distributes it to the cluster (when you connect to the Remote
> Process
> >> > >> Group, you choose which Input Port to send to).
> >> > >>>
> >> > >>>
> >> > >>> But what we’d like to do in the future is something like:
> >> > >>>
> >> > >>>
> >> > >>> GenerateFlowFile -> HashContent
> >> > >>>
> >> > >>>
> >> > >>> And then on the connection in the middle choose to auto-distribute
> >> the
> >> > >> data. Right now you have to put the Remote Process Group in there
> to
> >> > >> distribute to the cluster, and add the Input Port to receive the
> data.
> >> > But
> >> > >> there should only be a single RemoteProcessGroup that points to the
> >> > entire
> >> > >> cluster, not one per node.
> >> > >>>
> >> > >>>
> >> > >>> Thanks
> >> > >>>
> >> > >>> -Mark
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> From: Ricky Saltzer
> >> > >>> Sent: Friday, February 6, 2015 3:06 PM
> >> > >>> To: [hidden email]
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> Mark -
> >> > >>>
> >> > >>> Thanks for the fast reply, much appreciated. This is what I
> figured,
> >> > but
> >> > >>> since I was already in clustered mode, I wanted to make sure there
> >> > wasn't
> >> > >>> an easier way than adding each node as a remote process group.
> >> > >>>
> >> > >>> Is there already a JIRA to track the ability to auto distribute in
> >> > >>> clustered mode, or would you like me to open it up?
> >> > >>>
> >> > >>> Thanks again,
> >> > >>> Ricky
> >> > >>>



--
Ricky Saltzer
http://www.cloudera.com

Re: Load distribution in cluster mode

Andrew Purtell-2
In reply to this post by Joe Witt
> But do you think this is something we should tackle soon (based on real
scenarios you face)?

+1

I was looking at the user guide as a first-timer. I had somehow assumed
that a NiFi cluster would transparently scale out (probably due to Storm
and Spark Streaming), in the sense that I would define logical flows and a
'hidden' scale-out physical plan would emerge from them, so having to do
this explicitly with remote process groups was a surprise. The resulting
disjoint flows complicate what would otherwise be a clean and
easy-to-follow flow design. When I used to do flow-based programming with
Cascading, I'd define logical assemblies and its topographical planner (
http://docs.cascading.org/cascading/2.6/userguide/html/ch14s02.html) would
automatically identify the parallelism possible in the design and
distribute it.





--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Load distribution in cluster mode

Joe Witt
Andrew

To be clear, you are advocating we knock out an automated load distribution
across nodes in the cluster as data passes through some point on the
flow? This would be so that someone would even need to add a remote
process group for it to happen.

Cool if so just want to be sure that is also what you were advocating.

Thanks
Joe

Re: Load distribution in cluster mode

Andrew Purtell-2
> This would be so that someone would even need to add a remote process
group for it to happen.

... *not* even need ...

Yes.

I don't understand the internals well enough to know how difficult it
would be. Naively, I think of processors exposing their
effective/configured parallelism and a processor-instance placement
planner (in the master?) that is aware of both that and cluster-wide
constraints, and that dynamically adjusts where processors run on the
cluster and how many of them there are.

Speaking of masters, I'm separately curious about the level of effort it
would take to introduce multiple masters. I'm thinking of how HBase does
it: there is only ever one active master, but a user can deploy multiple
standbys to take over service should the active master fail.




Re: Load distribution in cluster mode

Joe Witt
Andrew

*Automated Cluster Load Balancing*

The processors themselves are available and ready to run on all nodes
at all times.  It's really just a question of whether they have data
to run on.  We have always taken the view that 'if you want scalable
dataflow' use scalable interfaces.  And I think that is the way to go
in every case you can pull it off.  That generally meant one should
use datasources which offer queueing semantics where multiple
independent nodes can pull from the queue with 'at-least-once'
guarantees.  In addition, each node has back pressure, so if it falls
behind it slows its rate of pickup, which means other nodes in the
cluster can pick up the slack.  This has worked extremely well.
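
The competing-consumers pattern Joe describes can be sketched outside of NiFi. The following is a hypothetical illustration (plain Python, not NiFi code): several "nodes" pull from one shared queue, and a node that falls behind simply takes fewer items while the faster nodes absorb the slack.

```python
import queue
import threading
import time

def run_consumers(num_items, delays):
    """Run one consumer thread per entry in `delays`; return per-consumer counts."""
    q = queue.Queue()
    for i in range(num_items):
        q.put(i)
    counts = [0] * len(delays)

    def consume(idx):
        while True:
            try:
                q.get_nowait()          # pull the next item from the shared queue
            except queue.Empty:
                return                  # queue drained; this "node" is done
            time.sleep(delays[idx])     # simulated per-node processing time
            counts[idx] += 1            # back pressure: a slow node pulls less often

    threads = [threading.Thread(target=consume, args=(i,))
               for i in range(len(delays))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Running this with one fast and one slow consumer shows the fast one processing the bulk of the items with no coordinator involved, which is the property Joe is pointing at: the queue itself balances the load.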

That said, I recognize that it simply isn't always possible to use
scalable interfaces and given enough non-scalable datasources the
cluster could become out of balance.  So this certainly seems like a
good [valuable, fun, non-trivial] problem to tackle.  If we allow
connections between processors to be auto-balanced then it will make
for a pretty smooth experience as users won't really have to think too
much about it.

So the key will be how to devise and implement an algorithm or
approach to spreading that load intelligently, so that data doesn't
just bounce back and forth.  If anyone knows of good papers, similar
systems, or approaches they can describe for how to think through this,
that would be great.  Things we'll have to think about here that come
to mind:

- When to start spreading the load (at what factor should we start
spreading work across the cluster)

- Whether it should auto-spread by default and the user can tell it
not to in certain cases or whether it should not spread by default and
the user can activate it

- What the criteria are by which we should let a user control how data
is partitioned (some key, round robin, etc.).  How to
rebalance/re-assign partitions if a node dies or comes on-line
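
To make that last criterion concrete, the two partitioning strategies mentioned can be sketched as follows. This is an illustrative sketch only; the node names and function names are made up, not NiFi APIs:

```python
import hashlib
from itertools import cycle

def key_partition(key, nodes):
    """Stable key-based assignment: a given key always lands on the same
    node for a fixed node list; if a node joins or dies, the modulus
    changes and keys re-map, which is the rebalancing question above."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return nodes[digest % len(nodes)]

def round_robin(nodes):
    """Key-less even spreading: simply cycle through the nodes in order."""
    return cycle(nodes)
```

Key-based assignment keeps related data together (useful for the aggregation "counter cases" below), while round robin optimizes for evenness when no key matters.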

There are 'counter cases' too that we must keep in mind, such as
aggregation or bin packing grouped by some key.  In those cases all
data for a key would need to be merged together at some point, and
thus all of it needs to be accessible in one place.  Whether that means
we direct all data to a single node or whether we enable cross-cluster
data addressing is also a topic there.

* Multiple Masters *

So it is true that we have a single master model today.  But that
single master is solely for command/control of changes to the dataflow
configuration and is a very lightweight process that does nothing more
than that.  If the master dies then all nodes continue to do what they
were doing and even site-to-site continues to distribute data.  It
just does so without updates on current loading across the cluster.
Once the master is brought back on-line then the real-time command and
control functions return.  Building support for a back-up master to
offer HA of even the command/control side would probably also be a
considerable effort.  For this one, I'd be curious to hear of cases
where it was critical to make this part HA.

Excited to be talking about this level of technical detail.

Thanks
Joe

Re: Load distribution in cluster mode

Andrew Purtell-2
> *Automated Cluster Load Balancing*

I put your excellent response up as NIFI-337.

> * Multiple Masters *
>
> that single master is solely for command/control of changes to the
> dataflow configuration and is a very lightweight process that does
> nothing more than that.  If the master dies then all nodes continue to
> do what they were doing and even site-to-site continues to distribute
> data.  It just does so without updates on current loading across the
> cluster.

Yes, but imagine a NiFi installation, perhaps a hosted service built on
top of it, where DataFlow Managers expect the command and control aspect
of the system to be as robust and available as flow processing itself.
If one or more standby masters are waiting in the wings to take over
service for a failed active master, then automated and unattended
failover would be possible, and would likely narrow the window during
which administrative changes can fail.






--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Load distribution in cluster mode

Joe Witt
Yeah makes total sense.  https://issues.apache.org/jira/browse/NIFI-338

Thanks!
Joe
