How to ingest files into HDFS via Apache NiFi from non-hadoop environment

Mothi86
Apache NiFi is installed in a non-Hadoop environment and needs to ingest processed files into HDFS (Kerberized cluster: 4 management nodes and 1 edge node on the public network, and 4 worker nodes on a private network).

Is this a workable setup? I am hitting multiple errors even after completing the steps below. As a temporary alternative, I have installed NiFi on the edge node and everything works fine there, but please advise whether there is anything additional I have to do to make the above use case work.

* The firewall between NiFi and the management servers is open for ports 22, 88, 749 and 389.
* The firewall between NiFi and the edge node server is open for ports 22, 2181 and 9083.
* The krb5.conf file from the Hadoop cluster and the keytab for the application user are copied to the NiFi server. Running kinit with the application user and keytab succeeds, and the ticket is listed under klist.
* SSH works, and SFTP into the Hadoop server also works.
* hdfs-site.xml and core-site.xml are configured in NiFi.
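
For reference, the Kerberos check from the list can be repeated on the NiFi host with something like the following. This is only a sketch: the keytab path and principal are hypothetical placeholders, and it assumes the MIT Kerberos client tools (kinit, klist) are installed. If the copied krb5.conf is not at /etc/krb5.conf, the KRB5_CONFIG environment variable can point to it.

```shell
#!/bin/bash
# Hypothetical values: substitute the real keytab path and principal.
KEYTAB="${KEYTAB:-/etc/security/keytabs/appuser.keytab}"
PRINCIPAL="${PRINCIPAL:-appuser@EXAMPLE.COM}"

# Obtain a ticket from a keytab and print the resulting ticket cache.
# $1 = keytab path, $2 = principal
verify_keytab_login() {
  if [ ! -r "$1" ]; then
    echo "keytab not readable: $1" >&2
    return 2
  fi
  # -k -t: authenticate with the key stored in the keytab instead of a password
  kinit -kt "$1" "$2" && klist
}

# Example:
#   verify_keytab_login "$KEYTAB" "$PRINCIPAL"
```

A successful run only shows that the keytab is valid and that the KDC (port 88) is reachable from this host. It says nothing about whether the HDFS ports themselves are reachable.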

[Image: NiFi PutHDFS configuration snapshot]
[Image: Authentication error snapshot]

Re: How to ingest files into HDFS via Apache NiFi from non-hadoop environment

Bryan Bende
Hello,

Every node where NiFi is running must be able to connect to the data
node process on every node where HDFS is running. I believe the
default port for the HDFS data node process is usually 50010.

I'm assuming your 4 worker nodes are running HDFS, so NiFi would have
to access those.
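
This connectivity requirement can be checked from the NiFi host before touching any NiFi configuration. A minimal sketch follows, assuming bash and the coreutils timeout command are available. The worker hostnames in the example are hypothetical placeholders, and 50010 is the data node port mentioned above, which may be different on your cluster.

```shell
#!/bin/bash
# Report whether a TCP port on a host accepts connections from this machine.
# $1 = hostname or IP, $2 = port
check_port() {
  host="$1"
  port="$2"
  # bash's /dev/tcp pseudo-device opens a TCP connection; give up after 3 seconds.
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open ${host}:${port}"
  else
    echo "closed ${host}:${port}"
    return 1
  fi
}

# Check the data node port on every host passed as an argument.
# Returns non-zero if at least one host could not be reached.
check_datanodes() {
  failed=0
  for node in "$@"; do
    check_port "$node" 50010 || failed=1
  done
  return "$failed"
}

# Example (hypothetical hostnames):
#   check_datanodes worker1.example.internal worker2.example.internal
```

A "closed" line covers refused connections, filtered ports, timeouts and name resolution failures alike, so it only says that NiFi cannot reach that data node, not why.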

-Bryan


On Fri, Jun 23, 2017 at 3:37 PM, Mothi86 <[hidden email]> wrote:

> Apache NiFi is installed on non-hadoop environment and targets to ingest
> processed files into HDFS (Kerberized cluster - 4 management node and 1 edge
> node on public network and 4 worker nodes on private network).
>
> Is it workable solution to achieve above use case as I face multiple error
> even after performing below activities. Time being alternative, I have
> installed NiFi in edge node and everything works fine but please advise if
> there is anything additional I have to perform to make above use case work.
>
> * Firewall restriction between NiFi and management server is open and ports
> (22,88,749,389) are open.
> * Firewall restriction between NiFi and edge node server is open and ports
> (22, 2181,9083) are open
> * krb5.conf file from hadoop cluster along with keytab for application user
> is copied to NiFi server. Running kinit using application user and keytab -
> successful token is listed under klist.
> * SSH operation is successful and also SFTP into hadoop server works fine.
> * configured hdfs-site.xml and core-site.xml files into NiFi.
>
> <http://apache-nifi-developer-list.39713.n7.nabble.com/file/n16247/NiFi_Configuration.jpg>
> <http://apache-nifi-developer-list.39713.n7.nabble.com/file/n16247/putHDFS_loginError.jpg>

Re: How to ingest files into HDFS via Apache NiFi from non-hadoop environment

Mothi86
Hi Bryan,

Greetings, and thank you for the quick reply. The data nodes are on a private network inside the Hadoop cluster, and NiFi runs outside the cluster on a separate non-Hadoop server. If NiFi needs access to the data nodes, does that mean NiFi has to run within the cluster? Something like an edge node or management node, which has access to the public network (for Twitter access, for example) as well as to the private network of the data nodes.

Re: How to ingest files into HDFS via Apache NiFi from non-hadoop environment

Bryan Bende
Yes, I think running NiFi on edge nodes would make sense. This way
it can access the public network to receive data, and also access
HDFS on the private network.


Re: How to ingest files into HDFS via Apache NiFi from non-hadoop environment

Mothi86
Okay, thanks. So that clarifies that NiFi cannot write directly from a local machine / non-Hadoop environment into the Hadoop environment. It either has to run on the edge node, or on a node built with network restrictions similar to those of an edge or management node.

Is this the HDF-recommended solution?

Will spinning up a VM work? Can you suggest VM requirements for Apache NiFi?

Re: How to ingest files into HDFS via Apache NiFi from non-hadoop environment

Adam Taft
This is a bit outside of the box, but I have actually implemented this
solution previously.

My scenario was very similar.  NIFI was installed outside of the firewalled
HDFS cluster.  The only external access to the HDFS cluster was through SSH.

Therefore, my solution was to use SSH to call a remote command on the HDFS
node.  This was enabled using the ExecuteStreamCommand processor.  I used
the hadoop fs command line tools, piping in the contents of the flowfile.

The basic command (assuming put) would look something like this:

$>  cat file.ext | hadoop fs -put - /hdfs/path/file.ext

This would read from standard input and store the stream into file.ext.
Next you add the SSH execution to call the above.

$>  cat file.ext | ssh user@remote 'hadoop fs -put - /hdfs/path/file.ext'

Now we can try to put the above into the ExecuteStreamCommand processor.
We will extract the filename from the flowfile attribute.  I like using
bash to execute my script:

ExecuteStreamCommand
Command Path:  /bin/bash
Command Arguments: -c; "ssh user@remote 'hadoop fs -put - /hdfs/path/${filename}'"    * unsure of the quotes here

Not sure if the above helps, since it sounds like you're going for
something more than 'get' and 'put'.  But the above is an easy mechanism to
interact with an HDFS cluster if the NIFI node is not running on the
cluster.
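
One way to sidestep the quoting question in the ExecuteStreamCommand arguments is to move the SSH call into a small wrapper script and pass only the filename as an argument. The sketch below is an illustration, not a tested production script: the script name, the REMOTE and HDFS_DIR values and the helper function names are all hypothetical, and it assumes key-based SSH login for the account NiFi runs as.

```shell
#!/bin/bash
# put_to_hdfs.sh (hypothetical name): stream stdin into HDFS over SSH.
# REMOTE and HDFS_DIR are placeholders taken from the example commands above.
REMOTE="${REMOTE:-user@remote}"
HDFS_DIR="${HDFS_DIR:-/hdfs/path}"

# Wrap a string in single quotes for the remote shell, escaping any single
# quotes it contains, so filenames with spaces survive the SSH hop.
shell_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Build the command line that will run on the remote host.
# $1 = target filename inside HDFS_DIR
remote_put_command() {
  printf 'hadoop fs -put - %s' "$(shell_quote "${HDFS_DIR}/$1")"
}

# Send this script's stdin (the flowfile content) to the remote hadoop client.
put_stream() {
  ssh "$REMOTE" "$(remote_put_command "$1")"
}

# Only act when a filename argument was supplied.
if [ "$#" -ge 1 ]; then
  put_stream "$1"
fi
```

With this in place, the processor would be configured with Command Path set to the location of the script and Command Arguments set to ${filename}, so NiFi never has to pass nested quotes to bash.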



RE: How to ingest files into HDFS via Apache NiFi from non-hadoop environment

Takanobu Asanuma
Hello Mothi86,

I think you can achieve this by using HttpFS on the HDFS side. It is part of Hadoop and acts as a proxy server for HDFS.
https://hadoop.apache.org/docs/stable/hadoop-hdfs-httpfs/index.html

In your case, running the HttpFS server on the management nodes or the edge node would be good. Then set 'fs.defaultFS' to 'webhdfs://{HttpFS hostname}:{port}' in the core-site.xml used by NiFi's HDFS processors. Your NiFi cluster then only needs to reach the HttpFS server, and it can access HDFS from the non-Hadoop environment.
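
Before changing NiFi, it may help to confirm from the NiFi host that HttpFS answers WebHDFS REST calls. The following is a rough sketch under a few assumptions: the hostname is a hypothetical placeholder, 14000 is the usual HttpFS default port but may have been changed, and the curl build supports SPNEGO (needed for the --negotiate option on a Kerberized cluster).

```shell
#!/bin/bash
# Hypothetical values: replace with the real HttpFS host and port.
HTTPFS_HOST="${HTTPFS_HOST:-httpfs.example.com}"
HTTPFS_PORT="${HTTPFS_PORT:-14000}"

# Build a WebHDFS REST URL.
# $1 = absolute HDFS path, $2 = WebHDFS operation (for example LISTSTATUS)
webhdfs_url() {
  printf 'http://%s:%s/webhdfs/v1%s?op=%s' "$HTTPFS_HOST" "$HTTPFS_PORT" "$1" "$2"
}

# List an HDFS directory through HttpFS using the current Kerberos ticket.
# "--negotiate -u :" tells curl to authenticate via SPNEGO with that ticket.
list_dir() {
  curl --silent --show-error --negotiate -u : "$(webhdfs_url "$1" LISTSTATUS)"
}

# Example (requires a valid ticket, e.g. from kinit):
#   list_dir /tmp
```

If this returns a JSON directory listing, the only port NiFi needs through the firewall is the HttpFS one.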

Regards,
Takanobu