Upgrade from 1.0.0 to 1.1.1, cluster config. under heavy load, nodes do not connect

Upgrade from 1.0.0 to 1.1.1, cluster config. under heavy load, nodes do not connect

bmichaud
On Monday, I stood up a three-server NiFi cluster with the same configuration that had worked successfully in 1.0.0. Before I started the cluster, I cleaned out all ZooKeeper state and data from the old cluster but kept the same flow intact, connected to Kafka to pull data from a topic. This was a performance environment, and there was heavy load on that Kafka topic, so the cluster was immediately busy.

My strong belief is that, because of the volume of data the flow had to process during the election, a cluster coordinator was never elected; to this day, each node remains disconnected from the others, although all of them are running independently.

Could this be a defect in NiFi or ZooKeeper? What would you suggest I do to resolve this issue?
All servers in the cluster are configured as follows:

nifi.properties:
nifi.state.management.embedded.zookeeper.start=true
nifi.cluster.is.node=true
nifi.cluster.node.address=server1
nifi.zookeeper.connect.string=server1:2181,server2:2181,server3:2181

zookeeper.properties:
server.1=server1:2888:3888
server.2=server2:2888:3888
server.3=server3:2888:3888

state-management.xml:
    <cluster-provider>
        <id>zk-provider</id>
        <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
        <property name="Connect String">server1:2181,server2:2181,server3:2181</property>
        <property name="Root Node">/nifi</property>
        <property name="Session Timeout">10 seconds</property>
        <property name="Access Control">Open</property>
    </cluster-provider>

Please let me know if you need additional information.

Re: Upgrade from 1.0.0 to 1.1.1, cluster config. under heavy load, nodes do not connect

Mark Payne
Ben,

NiFi provides an embedded ZooKeeper server for convenience, mostly for 'testing and evaluation' types of purposes. For any sort of production or very high-volume flows, I would strongly encourage you to move ZooKeeper to its own servers. You will certainly see a lot of problems when trying to interact with ZooKeeper if the box that ZooKeeper is running on is under heavy load - either CPU-wise or I/O-wise.
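
For example, with a dedicated three-node ZooKeeper ensemble (the zk1/zk2/zk3 hostnames below are just placeholders), the NiFi side of that change would look roughly like this on each node:

nifi.properties:
nifi.state.management.embedded.zookeeper.start=false
nifi.zookeeper.connect.string=zk1:2181,zk2:2181,zk3:2181

The Connect String in state-management.xml would point at the same external ensemble.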

Thanks
-Mark



Re: Upgrade from 1.0.0 to 1.1.1, cluster config. under heavy load, nodes do not connect

Pierre Villard
In addition to Mark's comment, you could start NiFi with
nifi.flowcontroller.autoResumeState=false
in order to have the flow stopped when the cluster starts. Once the cluster is OK, you can manually start the flow (this can also be done via the REST API if needed).
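
For example, something along these lines (the host, port, and root process group id below are placeholders, and the exact REST call may differ between NiFi versions):

nifi.properties:
nifi.flowcontroller.autoResumeState=false

# once all nodes have connected, start everything under the root process group
curl -X PUT -H 'Content-Type: application/json' \
  -d '{"id":"<root-group-id>","state":"RUNNING"}' \
  http://server1:8080/nifi-api/flow/process-groups/<root-group-id>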


Re: Upgrade from 1.0.0 to 1.1.1, cluster config. under heavy load, nodes do not connect

bmichaud
In reply to this post by Mark Payne
Thanks for the heads up. I was not aware of that. This article from Hortonworks does not even mention that. This brings up many questions, though:
  • Is there clear documentation on how to set up ZooKeeper on a separate server?
  • Do the servers need to be dedicated to a single ZooKeeper instance, or can multiple ZooKeeper instances run on one server (perhaps two instances, one for a node from each of two clusters, for example)?
  • What are the sizing and disk requirements for such a server?
  • Does the node ID in the $NIFI_HOME/state/zookeeper/myid file reside on the ZooKeeper server then? (See my rough sketch below.)
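
For what it's worth, my rough understanding is that a dedicated ZooKeeper node just needs a standard zoo.cfg plus a myid file in its dataDir, along these lines (hostnames and paths below are placeholders, so please correct me if I am off base):

zoo.cfg on each dedicated ZooKeeper server:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zkserver1:2888:3888
server.2=zkserver2:2888:3888
server.3=zkserver3:2888:3888

# and on zkserver1, for example:
echo 1 > /var/lib/zookeeper/myid
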
Regards, Ben

Re: Upgrade from 1.0.0 to 1.1.1, cluster config. under heavy load, nodes do not connect

bmichaud
In reply to this post by Pierre Villard
Thanks for the tip! I have set that property accordingly. I will look into the REST API as well.