NiFi data HA in cluster mode

NiFi data HA in cluster mode

尹文才
Hi guys, I have a question about data HA when NiFi runs in clustered
mode. If one node goes down, will the FlowFiles owned by that node be
taken over and processed by another node?
Or will the FlowFiles be kept locally on that node and only be
processed when that node is back online? Thanks.

Regards,
Ben

Re: NiFi data HA in cluster mode

Joe Witt
Ben,

Data already mid-flow within a node will be kept on that node and
processed when the node is back online.  All other data coming into
the cluster can fail over to other nodes, provided you're sourcing data
with queuing semantics or with automated load balancing or failover,
as is present in the Apache NiFi Site-to-Site protocol.

Thanks
Joe
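
The behavior described above can be sketched with a small simulation. This is only an illustration under stated assumptions, not NiFi code: the `Node` class and the `deliver` function are hypothetical names, and "shortest queue" is an arbitrary stand-in for whatever fail-over or load-balancing policy the data source provides.

```python
from collections import deque


class Node:
    """A hypothetical cluster node with its own local FlowFile queue."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.queue = deque()   # FlowFiles queued here stay here
        self.processed = []

    def process_all(self):
        # An offline node does no work; its queue is left untouched.
        while self.online and self.queue:
            self.processed.append(self.queue.popleft())


def deliver(nodes, flowfile):
    """Hand a new FlowFile to the live node with the shortest queue.

    Stands in for a source that can fail over between nodes.
    """
    live = [n for n in nodes if n.online]
    if not live:
        raise RuntimeError("no live nodes")
    target = min(live, key=lambda n: len(n.queue))
    target.queue.append(flowfile)
    return target


nodes = [Node("node-1"), Node("node-2")]
nodes[0].queue.extend(["a", "b", "c"])   # data already mid-flow on node-1

nodes[0].online = False                  # node-1 goes down
deliver(nodes, "d")                      # new data fails over to node-2
for n in nodes:
    n.process_all()

stranded = list(nodes[0].queue)          # still waiting on node-1
nodes[0].online = True                   # node-1 comes back
nodes[0].process_all()                   # only now are a, b, c processed
```

In this sketch the three FlowFiles queued on node-1 are never visible to node-2. They are processed only after node-1 returns, while newly arriving data keeps flowing through the surviving node.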


Re: NiFi data HA in cluster mode

尹文才
Thanks Joe. So you mean, for example, that if I set one processor to run
only on the primary node in the cluster, and there are 100 FlowFiles in
the processor's incoming queue waiting to be processed, and the node
running the processor suddenly goes down and another node is elected as
the primary node, then those 100 FlowFiles will be kept locally on the
node that went down and will continue to be processed by that node when
it comes back online. These FlowFiles will not be available to the new
primary node or to the other nodes.
Am I correct?

Regards,
Ben



Re: NiFi data HA in cluster mode

Joe Witt
I'd avoid setting any processor to primary node only unless it is a
source processor (something that brings data into the system).

But, yes, I believe your description is accurate as of now.

Thanks


Re: NiFi data HA in cluster mode

尹文才
Thanks Joe, I will try to avoid setting processors to primary node only.
By the way, I've seen that someone posted a suggestion about data HA on
NiFi's wiki (HDFSContentRepository). Is there a plan for that feature to
be implemented and included in NiFi?

Regards,
Ben


Re: NiFi data HA in cluster mode

Brett Ryan
In reply to this post by Joe Witt
Someone from Hortonworks suggested to me that I should also set any PutSQL processors to execute only on the primary node. The reasoning was to avoid flooding the JDBC connection pool.


Re: NiFi data HA in cluster mode

Joe Witt
That is a fair point, Brett. I wasn't thinking of that when I answered,
but it is a good point.  Then again, we should create those
connections lazily, so if we don't, I'd call that a bug :)
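
To make "create those connections lazily" concrete, here is a minimal sketch of a pool that opens a connection only when one is first requested. It is an illustration under assumptions, not NiFi's connection pool service: `LazyPool` is a hypothetical class, and Python's `sqlite3` stands in for a JDBC driver. The sketch is not thread-safe.

```python
import sqlite3


class LazyPool:
    """Hypothetical pool that opens connections only on demand."""

    def __init__(self, factory, max_size):
        self._factory = factory
        self._max_size = max_size
        self._idle = []
        self.opened = 0        # connections actually created so far

    def acquire(self):
        # Reuse an idle connection before opening a new one.
        if self._idle:
            return self._idle.pop()
        if self.opened >= self._max_size:
            raise RuntimeError("pool exhausted")
        self.opened += 1
        return self._factory()

    def release(self, conn):
        self._idle.append(conn)


pool = LazyPool(lambda: sqlite3.connect(":memory:"), max_size=8)
opened_before_use = pool.opened   # nothing is opened at construction

conn = pool.acquire()             # first real use opens one connection
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
pool.release(conn)

same = pool.acquire()             # reuses the idle connection
rows = same.execute("SELECT x FROM t").fetchall()
pool.release(same)
```

With this kind of pool, a node that never runs a database write holds no database connections, so the number of open connections tracks actual use rather than cluster size times pool size.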

Ben

Yes, there is definitely intent to provide distributed data durability
across nodes.  This is especially important because it is a great
way to support elastic clustering behavior.

I'm not sure HDFS as the backing store is best, and we all have to keep
in mind that we must ensure distributed durability of FlowFile, content,
and provenance data.  That might mean application-level replication
similar to what Apache Kafka does.  That might mean distributed durable
block storage and then deciding which node is responsible for processing
a given set of data at a time.  There are a lot of ways to slice this,
and they all offer different tradeoffs.
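
As a rough illustration of the application-level replication option, the sketch below keeps each FlowFile on more than one node and reassigns ownership when the owner fails. It is a toy model under assumptions, not a NiFi design and not Kafka's replication protocol: `ReplicatedStore`, ring-order replica placement, and "first live holder becomes owner" are all hypothetical choices.

```python
class ReplicatedStore:
    """Hypothetical store that keeps each FlowFile on several nodes."""

    def __init__(self, node_names, replicas=2):
        self.nodes = list(node_names)
        self.replicas = replicas
        self.live = set(node_names)
        self.copies = {name: {} for name in node_names}
        self.owner = {}        # FlowFile id -> node responsible for it

    def put(self, ff_id, content, owner):
        start = self.nodes.index(owner)
        # Write to the owner and to the next nodes in ring order.
        for i in range(self.replicas):
            name = self.nodes[(start + i) % len(self.nodes)]
            self.copies[name][ff_id] = content
        self.owner[ff_id] = owner

    def fail(self, name):
        """Mark a node down and reassign the FlowFiles it owned."""
        self.live.discard(name)
        for ff_id, owner in list(self.owner.items()):
            if owner != name:
                continue
            holders = [n for n in self.nodes
                       if n in self.live and ff_id in self.copies[n]]
            if not holders:
                raise RuntimeError(f"{ff_id} has no live replica")
            self.owner[ff_id] = holders[0]


store = ReplicatedStore(["node-1", "node-2", "node-3"], replicas=2)
store.put("ff-1", b"payload", owner="node-1")
store.fail("node-1")               # owner goes down
new_owner = store.owner["ff-1"]    # a surviving replica holder takes over
```

The tradeoff is visible even in the toy: every write costs `replicas` copies, and in exchange a node failure no longer strands the data that node owned.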
