Unable to modify flow when one of the nodes in a cluster is disconnected

Purushotham Pushpavanthar
Hi,

I have a 3-node cluster (version 1.9.2) running in production. Because our infrastructure is unreliable for various reasons, our nodes go down often. We don't distinguish between dev and prod clusters: we modify, deploy, and test in the same cluster. However, when one of the nodes goes down, NiFi prevents us from modifying the state of the flow and shows the warning window in the attachment.

I read in the Administration Guide (https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#flow-election) that if a node in the cluster is disconnected and comes back again, flow election happens. I would like to understand the motivation for not allowing the change of flow in the above scenario.
Why can't the node that is joining the cluster pull the elected flow.xml.gz from the cluster and apply it to itself?

Regards,
Purushotham Pushpavanth

Re: Unable to modify flow when one of the nodes in a cluster is disconnected

Mark Payne
Purushotham,

If the node is disconnected and then attempts to reconnect, flow election does not occur. Rather, the node obtains a copy of the flow
from the cluster, determines whether or not it matches, and if so rejoins. If the flow does not match, it disconnects and stops trying to
reconnect.

There are a few reasons that the node doesn't just inherit the cluster's flow blindly. Firstly, if a user were to delete a connection, and the
re-joining node had data in that connection, it would lose the data. This is probably the most important reason - we never want to
design for data loss.

Secondly, when a node is disconnected from the cluster, the user is able to make changes. There are times when users will disconnect a
particular node from the cluster and make some changes to the dataflow for diagnostic purposes. For example, they may want to temporarily
send data to a new endpoint for sampling. When this happens, we don't want to just blindly lose those changes, because the user may not
have wanted those changes lost. And if an admin is managing several systems, it's possible that they could accidentally configure the node
to point to the wrong cluster, in which case it could potentially lose the entire dataflow. Perhaps not a problem if the dataflow exists on other
nodes, but if this is a standalone node being converted into a cluster, it could be devastating for the user.

Now, there are some changes that we do allow, and the node will still re-join. For instance, if the positions of elements change, elements are started
or stopped, etc. In these cases, the new node will just inherit the flow from the cluster and take on those changes.
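
The reconnect behaviour described so far can be sketched roughly as follows. This is an illustration only, not NiFi's actual code: the `Component` record, its fields, and `try_rejoin` are hypothetical names, and treating position and run state as the only tolerated differences is an assumption based on the examples given in this message.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Component:
    """Hypothetical stand-in for one element of a flow (not a NiFi class)."""
    id: str
    properties: Tuple[Tuple[str, str], ...]      # configuration: must match
    position: Tuple[float, float] = (0.0, 0.0)   # cosmetic: may differ
    running: bool = False                        # run state: may differ


Flow = Dict[str, Component]


def _normalize(flow: Flow) -> Flow:
    """Blank out the fields that are allowed to differ between node and cluster."""
    return {
        cid: replace(c, position=(0.0, 0.0), running=False)
        for cid, c in flow.items()
    }


def try_rejoin(local: Flow, cluster: Flow) -> Optional[Flow]:
    """Return the flow the node runs after rejoining, or None if it stays disconnected."""
    if _normalize(local) == _normalize(cluster):
        # Only tolerated differences (or none at all): inherit the cluster's copy.
        return dict(cluster)
    # Structural or configuration mismatch: refuse to join and leave the local flow alone.
    return None
```

In this model a moved or started processor still lets the node rejoin, while a changed property or a missing component makes `try_rejoin` return `None`.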

I think it would probably be advantageous to allow the node to back up its own flow before inheriting from the cluster, and then apply any changes from
the cluster that do not result in data loss (i.e., if any connection is removed and the node has data in that connection, then fail, else inherit). The big
downside there, honestly, is that it's just a huge amount of effort that would be required in order to make that work properly.
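
The improvement proposed here (back up, then inherit unless data would be lost) might look something like the sketch below. It describes a possible design, not anything NiFi implements: `inherit_if_safe`, `DataLossError`, and the idea of representing a flow as a set of connection ids with queued-item counts are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


class DataLossError(Exception):
    """Raised when inheriting the cluster flow would drop queued data."""


def inherit_if_safe(
    local_connections: Set[str],
    cluster_connections: Set[str],
    queued_counts: Dict[str, int],
) -> Tuple[Set[str], Set[str]]:
    """Return (backup_of_local, inherited_connections) or raise DataLossError.

    queued_counts maps a local connection id to the number of items
    currently queued in it on this node (hypothetical bookkeeping).
    """
    # Step 1: keep a copy of the node's own flow before touching anything.
    backup = set(local_connections)

    # Step 2: find connections the cluster has removed but the node still has.
    removed = local_connections - cluster_connections

    # Step 3: fail if any removed connection still holds data on this node.
    at_risk = sorted(c for c in removed if queued_counts.get(c, 0) > 0)
    if at_risk:
        raise DataLossError(f"connections still hold data: {at_risk}")

    # Step 4: nothing would be lost, so take on the cluster's flow.
    return backup, set(cluster_connections)
```

Returning the backup alongside the inherited flow keeps the node's earlier state recoverable even when inheritance succeeds.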

So to make a long story short: there are reasons that we don't just inherit the flow, but we could work around those problems. There are definitely
areas where we could improve, but it's just not been taken on yet by anyone in the community.

Thanks
-Mark


On Jun 27, 2019, at 3:37 AM, Purushotham Pushpavanthar <[hidden email]> wrote:

> I would like to understand the motivation for not allowing the change of flow in the above scenario.

Re: Unable to modify flow when one of the nodes in a cluster is disconnected

Purushotham Pushpavanthar
Hi Mark,

I thank you for your time and descriptive insights. However, the concern I
raised was regarding the allowable changes like changing the run status of
the processors. I couldn't stop or start a processor in the cluster when
one of the nodes was disconnected. The warning panel displayed is attached
to the initial mail in this thread.


On Thu, 27 Jun 2019 at 19:34, Mark Payne <[hidden email]> wrote:

> Now, there are some changes that we do allow, and the node will still
> re-join. For instance, if the positions of elements change, elements are
> started or stopped, etc. In these cases, the new node will just inherit the
> flow from the cluster and take on those changes.

Regarding the kinds of changes you mentioned in your previous mail, could you
please shed some light on which release this has been supported from?

Regards,
Purushotham Pushpavanth

Re: Unable to modify flow when one of the nodes in a cluster is disconnected

Mark Payne
My apologies, I wasn't very clear. If a node is in a disconnected state, you cannot make any changes
to the cluster. You would first have to go to the Cluster menu and choose to remove the node from the cluster.
Then you would be free to make changes to the flow. If the now-removed node is then restarted, it will attempt
to re-join the cluster. At this point, if there are components that have been stopped/started/moved around, then
the node will inherit these changes and join the cluster. But if you have changed a processor's properties, for
instance, this will result in the node failing to join the cluster and indicating that the local flow differs from the cluster's flow.
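
The rule described above, that the flow can only be changed once the disconnected node has been removed from the cluster, can be modelled with a small sketch. It is deliberately simplified and is not NiFi code: the two-value `NodeState` enum, `can_modify_flow`, and `remove_node` are hypothetical names, and a real cluster has more node states than the two shown here.

```python
from enum import Enum
from typing import Dict


class NodeState(Enum):
    """Simplified node states (an assumption; real clusters track more)."""
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"


def can_modify_flow(nodes: Dict[str, NodeState]) -> bool:
    """Changes are allowed only if every node still listed in the cluster is connected."""
    return all(state is NodeState.CONNECTED for state in nodes.values())


def remove_node(nodes: Dict[str, NodeState], node_id: str) -> Dict[str, NodeState]:
    """Model removing a node via the Cluster menu: it is no longer listed at all."""
    return {name: state for name, state in nodes.items() if name != node_id}


# Usage: a 3-node cluster where one node has dropped out.
cluster = {
    "node1": NodeState.CONNECTED,
    "node2": NodeState.CONNECTED,
    "node3": NodeState.DISCONNECTED,
}
blocked = not can_modify_flow(cluster)                        # edits are refused
allowed = can_modify_flow(remove_node(cluster, "node3"))      # edits work again
```

When the removed node restarts and tries to rejoin, whether it succeeds depends on what was changed in the meantime, as the message explains.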


On Jun 29, 2019, at 2:53 PM, Purushotham Pushpavanthar <[hidden email]> wrote:

> I couldn't stop or start a processor in the cluster when
> one of the nodes was disconnected.

Re: Unable to modify flow when one of the nodes in a cluster is disconnected

Purushotham Pushpavanthar
Mark, thanks for the clarification.

On Mon, Jul 1, 2019, 9:05 PM Mark Payne <[hidden email]> wrote:

> My apologies, I wasn't very clear. If a node is in a disconnected state,
> you cannot make any changes to the cluster. You would first have to go to
> the Cluster menu and choose to remove the node from the cluster.