PutHDFS does not write on all nodes of a cluster, only on the coordinator

PutHDFS does not write on all nodes of a cluster, only on the coordinator

bmichaud
This happened on NiFi 1.0.0 as well as 1.1.1.

I have a flow that uses PutHDFS to write to a remote HDFS. It connects to this remote MapR system using JAAS Kerberos authentication configured in the local MapR client. I built NiFi to incorporate the MapR libraries.

Here is the flow writing data with data accumulated:
<http://apache-nifi-developer-list.39713.n7.nabble.com/file/n14535/QueueToPutHDFS_Flow.png>

Here is the status history (flow files out) for the queue immediately preceding the PutHDFS processor:
<http://apache-nifi-developer-list.39713.n7.nabble.com/file/n14535/QueueToPutHDFS_FlowFilesOut_OneNode.png>

Why does it only write from the coordinator? Any idea? I'm willing to fix this in the code, and am planning to debug into it to figure it out, but I was hoping someone out there could point me to the right place.

I have to work around this by using PutFile to write to the local FS and then manually moving the files to the remote box. I was going for something a little less manual. ;)

Thanks!

Re: PutHDFS does not write on all nodes of a cluster, only on the coordinator

Pierre Villard
Hi Ben,

There shouldn't be any issue. I'm wondering: is the input processor of your
workflow running in "on primary node only" mode? If so, then unless you
specifically distribute the data across your nodes, the data will remain on the
primary node from the beginning to the end of your workflow, and only one
PutHDFS will actually write data. By any chance, could that be your case?

-Pierre
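
A quick way to verify this on a running cluster is the REST API. Below is a minimal sketch, assuming NiFi 1.x, an unsecured cluster, and a placeholder host and processor id (both are assumptions, not values from this thread):

# Check the scheduling configuration of the flow's input processor
# via the NiFi REST API. Host and processor id are placeholders.
import json
import urllib.request

NIFI = "http://localhost:8080/nifi-api"
PROCESSOR_ID = "<input-processor-uuid>"

with urllib.request.urlopen(f"{NIFI}/processors/{PROCESSOR_ID}") as resp:
    config = json.load(resp)["component"]["config"]

print("schedulingStrategy:", config["schedulingStrategy"])  # e.g. TIMER_DRIVEN
print("executionNode:", config.get("executionNode"))        # ALL vs. PRIMARY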


Re: PutHDFS does not write on all nodes of a cluster, only on the coordinator

bmichaud
The data is distributed evenly across all nodes. It just accumulates in the queue immediately prior to the PutHDFS processor on two of the nodes, for days or weeks, or until the node runs out of memory. I first noticed this back in November and thought it was somewhat random, but this week I noticed that the coordinator seems to be the only node in the cluster that writes.

Another thing to add: when I go to the Cluster screen and disconnect the primary node, and another node takes over, that new primary/coordinator then starts writing to HDFS. This is a much better workaround than putting files to the local file system and manually moving them, but I would like to know what is going on here. I have delved somewhat into the code with a debugger; the mechanism by which files are pulled off the input queue to PutHDFS, and/or the logic for deciding whether to do so, is somehow involved. On the nodes that never write to HDFS, the onTrigger() method in PutHDFS is never called.
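
For what it's worth, the per-node numbers can also be pulled programmatically; a minimal sketch (assuming NiFi 1.x, an unsecured cluster, and placeholder host and id), using the nodewise status query to see which node's PutHDFS is actually emitting flow files:

# Compare PutHDFS throughput per node via the cluster-aware status endpoint.
# Host and processor id are placeholders.
import json
import urllib.request

NIFI = "http://node1:8080/nifi-api"
PUTHDFS_ID = "<puthdfs-uuid>"

url = f"{NIFI}/processors/{PUTHDFS_ID}/status?nodewise=true"
with urllib.request.urlopen(url) as resp:
    status = json.load(resp)["processorStatus"]

# One snapshot per connected node; only the writing node should show
# non-zero output.
for node in status.get("nodeSnapshots", []):
    snap = node["statusSnapshot"]
    print(node["address"], snap["flowFilesOut"], snap["bytesOut"])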


Re: PutHDFS does not write on all nodes of a cluster, only on the coordinator

bmichaud
As it turns out, the PutHDFS processor was configured, or defaulted (most likely from an earlier version of my flow, which was created by my predecessor), to run only on the primary node. This, I believe, makes it an isolated processor. The strange thing, and the reason I say it came from an earlier version, is that the Scheduling Strategy field on the processor's Scheduling tab showed an italicized option that said something like "only on primary node," which is not even an option in the 1.x versions of NiFi. I changed that to "Timer driven" and made sure the new Execution select box had "All Nodes" selected as well. This allowed all nodes to write to HDFS at the same time.

[screenshots: configuration before, configuration after, and all nodes writing at once (evidence it worked)]

Cheers!
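
P.S. For anyone hitting the same thing on a larger flow: the same fix can be scripted over the REST API instead of clicking through each processor. A minimal sketch (assuming NiFi 1.x, an unsecured cluster, and placeholder host and id); note that a processor update must echo back the entity's current revision:

# Switch a processor to timer-driven scheduling, executing on all nodes.
import json
import urllib.request

NIFI = "http://localhost:8080/nifi-api"
PUTHDFS_ID = "<puthdfs-uuid>"

# Fetch the current entity first; the update must include its revision.
with urllib.request.urlopen(f"{NIFI}/processors/{PUTHDFS_ID}") as resp:
    entity = json.load(resp)

update = {
    "revision": entity["revision"],
    "component": {
        "id": PUTHDFS_ID,
        "config": {
            "schedulingStrategy": "TIMER_DRIVEN",
            "executionNode": "ALL",
        },
    },
}

req = urllib.request.Request(
    f"{NIFI}/processors/{PUTHDFS_ID}",
    data=json.dumps(update).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["component"]["config"]["executionNode"])  # -> ALL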