Proposal: New file processors: GetFileData and PutFileData

Rick Braddy
This thread proposes community review of, and comments on, modified versions of GetFile and PutFile for potential future adoption by the NiFi community.  For those who want to jump straight to the code, the current version is in this review repository:  https://github.com/rickbraddy/nifishare

As background, we needed a way to replicate entire directory trees of files via NiFi, where multiple directory trees can be specified at run-time as part of an overall NiFi graph. As NiFi is rooted in file-based processing, it seems reasonable to continue advancing its abilities to ingest, process, transform and replicate files in the most flexible manner possible.  While this proposal is not the be-all and end-all in that regard, it moves the needle in the right direction by making file processing in NiFi more dynamic: flows can determine how files (and directories) should be processed, which goes well beyond today's basic file ingress/egress capabilities (which certainly have their place and uses).  Whether it's via this proposal and code or another, NiFi can clearly benefit from this type of functionality.

Here's a more detailed explanation of the rationale for developing these NiFi file processor derivatives and their initial implementation:

GetFileData
----------------
The GetFile processor monitors a single directory tree for file changes and creates FlowFiles for every changed file in that configured tree. It does a good job of getting files from a configurable folder that need to be injected into a graph. However, GetFile falls short of other requirements that arise for general-purpose file processing:

- Operates from a single, pre-configured source directory (not dynamically configurable at run-time as part of a flow)
- Is scheduled on a periodic basis only, not event-triggered when there's something to do
- Does not support sending an entire directory tree (only files are sent, not directories)
- Is a "source" processor node only, so it cannot be used within other NiFi flow logic that dynamically determines which files or directories to get and send as FlowFiles
- Assumes each file is smaller than the content repository, so large files (hundreds of MBs, GBs, TBs) overrun or dominate the content repository

A modified version of GetFile (currently) named GetFileData has been developed and is proposed as the basis for a new NiFi processor that will supplement file ingestion with these features:

- Operates based upon inbound FlowFiles that contain the filesystem path to a file or directory
- Is scheduled by incoming FlowFiles containing a file or directory path, so it only runs when there's something to do
- Supports sending a directory tree as a series of directory and file paths; e.g., ExecuteProcess("find /mypath -print") => SplitText(newline) => ModifyAttribute(add "file.roodir=/mypath") => GetFileData ...
- Participates within simple or complex flows to fetch and send files and directories
- (To be developed) Is designed to handle any size file, by breaking files larger than a "chunkingThreshold" into a series of multiple smaller files that can be reassembled on the other end (by PutFileData)
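As a rough illustration of the "directory tree as a series of directory and file paths" idea, here is a minimal Java sketch. It is not taken from the nifishare repository: the class name TreeLister and its Entry type are hypothetical, and it only assumes the standard java.nio.file API. It plays the role of the "find /mypath -print" step and reports each path relative to the root, which is the information a downstream processor would need to rebuild the tree.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: lists a directory tree as root-relative entries. */
public class TreeLister {

    /** One listing entry: a path relative to the root, plus a directory flag. */
    public static final class Entry {
        public final String relativePath;
        public final boolean directory;

        public Entry(String relativePath, boolean directory) {
            this.relativePath = relativePath;
            this.directory = directory;
        }
    }

    /**
     * Walks the tree under root and returns every directory and file below it.
     * Entries are sorted by path, so a directory always precedes its contents.
     */
    public static List<Entry> list(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(p -> !p.equals(root))   // skip the root itself
                    .sorted()
                    .map(p -> new Entry(root.relativize(p).toString(),
                                        Files.isDirectory(p)))
                    .collect(Collectors.toList());
        }
    }
}
```

Because empty directories appear as their own entries, a receiver can recreate them even when no file lives inside them, which a files-only listing cannot do.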

PutFileData
---------------
The PutFile processor accepts incoming FlowFiles and writes those files to a single target directory.  It does a good job of handling and resolving conflicts, but falls short of other requirements that arise for general-purpose file processing:

- Does not support directories, only files
- Only supports a single, preconfigured target directory
- Cannot reconstruct an entire directory tree based upon relative file paths (all files go into a single target directory)
- Assumes each file is small enough to fit into the content repository

A modified version of PutFile (currently) named PutFileData has been developed and is proposed as the basis for a new NiFi processor that will supplement file egress with these features:

- Supports directories and files
- Supports reconstruction of an entire directory tree based upon relative file paths, enabling reconstruction of an entire directory tree originating from GetFileData
- (To be developed) Is designed to handle any size file, by reassembling multi-part files into very large files (TBs) that do not fit within the content repository
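To make the "reconstruction based upon relative file paths" point concrete, here is a minimal Java sketch of the receiving side. It is an illustration under stated assumptions, not the PutFileData code: the class TreeWriter and its method names are hypothetical, and it uses only the standard java.nio.file API. Each incoming item carries a path relative to a target root; the helper resolves it, refuses paths that would land outside the root, and creates missing parent directories.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: rebuilds a directory tree from root-relative paths. */
public class TreeWriter {

    /** Resolves relativePath under targetRoot, rejecting paths that escape it. */
    public static Path resolveUnder(Path targetRoot, String relativePath) {
        Path root = targetRoot.toAbsolutePath().normalize();
        Path resolved = root.resolve(relativePath).normalize();
        if (!resolved.startsWith(root)) {
            throw new IllegalArgumentException(
                    "Path escapes target root: " + relativePath);
        }
        return resolved;
    }

    /** Recreates a directory entry, including any missing parents. */
    public static Path putDirectory(Path targetRoot, String relativePath)
            throws IOException {
        return Files.createDirectories(resolveUnder(targetRoot, relativePath));
    }

    /** Writes a file entry, creating its parent directories first. */
    public static Path putFile(Path targetRoot, String relativePath, byte[] content)
            throws IOException {
        Path target = resolveUnder(targetRoot, relativePath);
        Files.createDirectories(target.getParent());
        return Files.write(target, content);
    }
}
```

The escape check is a design choice worth keeping in any real implementation: relative paths arrive as data from upstream, so an entry such as "../outside.txt" should be rejected rather than written.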

Should the community have an interest in these processors (we can name them something different, if needed), these contributions are now available.  In the meantime, we shall continue developing these processors to meet our specific use cases, adding the chunking functionality and QA certifying them for production use at scale.

Looking forward to comments, feedback and recommendations.

Here's the GitHub repo link again:  https://github.com/rickbraddy/nifishare

Best,
Rick

P.S. If there's a better vehicle for communicating these types of proposals, please advise.



Re: Proposal: New file processors: GetFileData and PutFileData

Joe Witt
Rick

This is a perfectly fine place to start the thread.  If you'd like to
create a wiki feature proposal for it too like we're doing with a lot
of the other things at this level we can give you access to create one
here [1].

Not at all trying to take away from the points you were making but
GetFile and PutFile do support recursive walking/reconstruction based
on relative paths.  By no means is that as comprehensive as you're
going for here though - just an FYI.

These sound like good things.  In particular I find your concept for
handling arbitrarily large data interesting.  Just need to make sure
backpressure works through the flow so that you could literally handle
the delivery of a file which is of itself larger than the repo by
capturing and sending a chunk of it at a time for instance.

So from a brief historical perspective the GetFile / PutFile
processors were literally the first two processors ever built for NiFi
back when it had no GUI, no provenance, no nothin' that was cool.
These are the OGs of NiFi.  They've been improved a bit over the years
but not much.  Why?  Because their utility was largely limited to
trivial archiving cases.  We have recently had discussions about making
them more powerful through the concept of ListFile/FetchFile like Adam
mentions and as we've started doing with things like HDFS.  A much
better model for sure.  Still not as powerful as what you're cooking up
though.  I do think your proposal modified to consider the design
pattern of ListFile/FetchFile would be super powerful.  In your case
ListFile for a single larger file for instance could produce N listings
that point to the same file on disk but for different offset/ranges.
This would be *very* interesting.  I am a bit concerned about how to
have this nicely handle competing consumer problems but...we can cross
that bridge later.
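
The "N listings for different offset/ranges" idea can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not an existing NiFi API: the class RangePlanner and its Range type are hypothetical, and the chunkSize parameter stands in for whatever chunking threshold a real processor would expose. Given a file size, it produces the (offset, length) pairs that a listing step could emit, one per chunk.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a file of known size into offset/length ranges. */
public class RangePlanner {

    /** A contiguous byte range within a file. */
    public static final class Range {
        public final long offset;
        public final long length;

        public Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Returns consecutive ranges covering [0, fileSize), each at most
     * chunkSize bytes long. An empty file yields no ranges.
     */
    public static List<Range> plan(long fileSize, long chunkSize) {
        if (fileSize < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException(
                    "fileSize must be >= 0 and chunkSize must be > 0");
        }
        List<Range> ranges = new ArrayList<>();
        long offset = 0;
        while (offset < fileSize) {
            long length = Math.min(chunkSize, fileSize - offset);
            ranges.add(new Range(offset, length));
            offset += length;   // never exceeds fileSize, so no overflow
        }
        return ranges;
    }
}
```

Each range could then be fetched and delivered independently, so only one chunk at a time needs to sit in the content repository, provided backpressure limits how many are in flight.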

If you're willing to tackle this we can definitely work with you to
bring it in.  It is a non-trivial contribution for sure.  Folks often
do not consider all the nasty gotchas that can occur in something as
seemingly simple as File IO.

Thanks
Joe

[1] https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals

On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <[hidden email]> wrote:


RE: Proposal: New file processors: GetFileData and PutFileData

Rick Braddy
Joe,

Thanks for the quick response.

Yes, I can add to the Wiki once access has been granted. Further responses:

>> GetFile and PutFile do support recursive walking/reconstruction based on relative paths

Based on my recent testing of 0.3.0, GetFile does walk the configured directory tree, picking up the files it finds. However, only files are sent to PutFile, which places them all into a single target folder rather than a directory tree. From what I have seen, no directory information is sent by GetFile or processed by PutFile, so I do not believe it reconstructs the directory tree at all today.

>> I do think your proposal modified to consider the design pattern of ListFile/FetchFile would be super powerful.  

We have another processor, GetFileList, that uses "find" to traverse a target folder tree and feeds the resulting newline-delimited file/directory stream as FlowFiles into GetFileData.  Perhaps that processor could be evolved into a suitable ListFiles processor.

I believe GetFileList/GetFileData correspond roughly to the ListFile/FetchFile concept, based on a cursory review of ListHDFS/FetchHDFS.  If it's a matter of renaming, that's obviously trivial at this point.  I'm assuming there are other facets to that List/Fetch design pattern; is it documented anywhere I can review to learn more?

So when we have a ListFile/FetchFile what is the corresponding "Put" side of the flow to be?  Perhaps simply PutFile enhanced to handle FlowFiles from both basic GetFile and the richer FetchFile (modified GetFileData) types of FlowFiles and behaviors would suffice.

>> Just need to make sure backpressure works through the flow so that you could literally handle the delivery of a file which is of itself larger than the repo by capturing and sending a chunk of it at a time for instance.

Agreed. Are there any best practices documented for configuring backpressure properly?

Thanks.

Rick

-----Original Message-----
From: Joe Witt [mailto:[hidden email]]
Sent: Wednesday, September 23, 2015 6:25 PM
To: [hidden email]
Subject: Re: Proposal: New file processors: GetFileData and PutFileData


Re: Proposal: New file processors: GetFileData and PutFileData

Joe Skora
It may be an oversimplification, but for the purposes of understanding, is
the intent to mirror a directory tree with NiFi, similar to rsync?

On Wed, Sep 23, 2015 at 11:26 PM, Rick Braddy <[hidden email]> wrote:

> Joe,
>
> Thanks for the quick response.
>
> Yes, I can add to the Wiki once access has been granted. Further responses:
>
> >> GetFile and PutFile do support recursive walking/reconstruction based
> on relative paths
>
> Based on my recent testing of 0.3.0, GetFile does walk the configured
> directory tree, picking up the files it finds; however, only files are sent
> to PutFile, which places them all into a single target folder (not a
> directory tree - no directory information is sent by GetFile nor processed
> by PutFile from what I have seen, so I do not believe it reconstructs the
> directory tree at all today).
>
> >> I do think your proposal modified to consider the design pattern of
> ListFile/FetchFile would be super powerful.
>
> We have another processor GetFileList that uses "find" to traverse a
> target folder tree and feeds the resulting newline delimited file/directory
> stream as FlowFiles into GetFileData.  Perhaps that processor could be
> evolved into a suitable ListFiles processor.
>
> I believe GetFileList/GetFileData correspond roughly to the
> ListFile/FetchFile concept, based on a cursory review of
> ListHDFS/FetchHDFS.  If it's a matter of renaming that's obviously trivial
> at this point.  I'm assuming there are other facets to that List/Fetch
> design pattern - is it documented anywhere I can review to learn more?
>
> So when we have a ListFile/FetchFile what is the corresponding "Put" side
> of the flow to be?  Perhaps simply PutFile enhanced to handle FlowFiles
> from both basic GetFile and the richer FetchFile (modified GetFileData)
> types of FlowFiles and behaviors would suffice.
>
> >> Just need to make sure backpressure works through the flow so that you
> could literally handle the delivery of a file which is of itself larger
> than the repo by capturing and sending a chunk of it at a time for instance.
>
> Agreed. Are there any best practices documented for configuring
> backpressure properly?
>
> Thanks.
>
> Rick
>
> -----Original Message-----
> From: Joe Witt [mailto:[hidden email]]
> Sent: Wednesday, September 23, 2015 6:25 PM
> To: [hidden email]
> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>
> Rick
>
> This is a perfectly fine place to start the thread.  If you'd like to
> create a wiki feature proposal for it too like we're doing with a lot of
> the other things at this level we can give you access to create one here
> [1].
>
> Not at all trying to take away from the points you were making but GetFile
> and PutFile do support recursive walking/reconstruction based on relative
> paths.  By no means is that as comprehensive as you're going for here
> though - just an FYI.
>
> These sound like good things.  In particular I find your concept for
> handling arbitrarily large data interesting.  Just need to make sure
> backpressure works through the flow so that you could literally handle the
> delivery of a file which is of itself larger than the repo by capturing and
> sending a chunk of it at a time for instance.  So from a brief historical
> perspective the GetFile / PutFile processors were literally the first two
> processors ever build for NiFi back when it had no GUI, no provenance, no
> nothin' that was cool.  These are the OGs of NiFi.  They been improved a
> bit over the years but not much.
> Why?  Because their utility was largely limited to trivial archiving
> cases.  We have recently had discussions about making them more powerful
> through the concept of ListFile/FetchFile like adam mentions and as we've
> started doing with things like HDFS.  A much better model for sure.  Still
> not as powerful as what you're cooking up though.  I do think your proposal
> modified to consider the design pattern of ListFile/FetchFile would be
> super powerful.  In your case ListFile for a single larger file for
> instance could produce N listings that point to the same file on disk but
> for different offset/ranges.  This would be *very* interesting.  I am a bit
> concerned about how to have this nicely handle competing consumer problems
> but...we can cross that bridge later.
>
> If you're willing to tackle this we can definitely work with you to bring
> it in.  It is a non-trivial contribution for sure.  Folks often do not
> consider all the nasty gotchas that can occur in something as seemingly
> simple as File IO.
>
> Thanks
> Joe
>
> [1]
> https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
>
> On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <[hidden email]> wrote:
> > This thread proposes community review/comments of modified versions of
> GetFile and PutFile for potential future adoption by the Nifi community.
> For those who want to jump straight to the code, here's the review
> repository location for the current version:
> https://github.com/rickbraddy/nifishare.
> >
> > As background, we needed a way to replicate entire directory trees of
> files via Nifi, where multiple directory trees can be specified at run-time
> as part of an overall Nifi graph. As Nifi is rooted in file-based
> processing, it seems reasonable to continue advancing its abilities to
> ingest, process, transform and replicate files in the most flexible manner
> possible.  While this proposal is not a be all end all in that regard, it
> moves the needle in the right direction by making file-processing in Nifi
> more dynamic, enabling flows to determine how files (and directories)
> should be processed, which does well beyond today's basic file
> ingress/egress process capabilities (which certainly have their place and
> uses).  Whether it's via this proposal and code or another, clearly Nifi
> can benefit from this type of functionality.
> >
> > Here's a more detailed explanation of the rationale for developing these
> Nifi file processor derivatives and their initial implementation:
> >
> > GetFileData
> > ----------------
> > The GetFile processor monitors a single directory tree for file changes
> and creates FlowFiles for every changed file in that configured tree. It
> does a good job of getting files from a configurable folder than need to be
> injected into a graph. GetFile falls short of other requirements that arise
> for general-purpose file processing:
> >
> > -          Operates from a single, pre-configured source directory (not
> dynamically configurable at run-time as part of a flow)
> >
> > -          Scheduled on a periodic basis only, not event-triggered when
> there's something to do
> >
> > -          Does not support sending an entire directory tree (only files
> are sent, not directories)
> >
> > -          Is a "source" processor node only, cannot be used within
> other Nifi flow logic that dynamically determines which files or
> directories to get and send as FlowFiles
> >
> > -          Assumes each file is smaller than the content repository,
> which causes large files (hundreds of MB's, GBs, TBs) to overrun or
> dominate the content repository
> >
> > A modified version of GetFile (currently) named GetFileData has been
> developed and is proposed as the basis for a new Nifi processor that will
> supplement file ingestion with these features:
> >
> > -          Operates based upon inbound FlowFiles that contains the
> filesystem path to a file or directory
> >
> > -          Scheduled by incoming FlowFiles containing a file or
> directory path, only runs when there's something to do
> >
> > -          Supports sending directory tree as a series of directory and
> file paths; e.g., ExecuteProcess("find /mypath -print") =>
> SplitText(newline) => ModifyAttribute(add "file.roodir=/mypath") =>
> GetFIleData ...
> >
> > -          Participates within simple or complex flows to fetch and send
> files and directories
> >
> > -          (To be developed) Is designed to handle any size file, by
> breaking files larger than a "chunkingThreshold" into a series of multiple
> smaller files that can be reassembled on the other end (by PutFileData)
> >
> > PutFileData
> > ---------------
> > The PutFile processor accepts incoming FlowFiles and writes those files
> to a single target directory.  It does a good job of handling and resolving
> conflicts, but falls short of other requirements that arise for
> general-purpose file processing:
> >
> > -          Does not support directories, only files
> >
> > -          Only supports a single, preconfigured target directory
> >
> > -          Cannot reconstruct and entire directory tree based upon
> relative file paths (all files go into a single target directory)
> >
> > -          Assumes each file is small enough to fit into the content
> repository
> >
> > A modified version of PutFile (currently) named PutFileData has been
> developed and is proposed as the basis for a new Nifi processor that will
> supplement file egress with these features:
> >
> > -          Supports directories and files
> >
> > -          Supports reconstruction of entire directory tree based upon
> relative file paths, enabling reconstruction of an entire directory free
> originating from GetFileData
> >
> > -          (To be developed) Is designed to handle any size file, by
> reassembling multi-part files into very large files (TB's) that do not fit
> within the content repository
> >
> > Should the community have an interest in these processors (we can name
> them something different, if needed), these contributions are now
> available.  In the meantime, we shall continue developing these processors
> to meet our specific use cases, adding the chunking functionality and QA
> certifying them for production use at scale.
> >
> > Looking forward to comments, feedback and recommendations.
> >
> > Here's the Github repo link again:
> > https://github.com/rickbraddy/nifishare
> >
> > Best,
> > Rick
> >
> > P.S. If there's a better vehicle for communicating these types of
> proposals, please advise.
> >
> >
>
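
As a rough illustration of the directory-tree reconstruction described in the proposal above, the following Java sketch maps a source path onto the same relative location under a target root. It is not taken from the nifishare repository: the class name RelativePathMapper, the method mapToTarget, and the /mypath and /replica directories are hypothetical, and how GetFileData and PutFileData actually carry the root directory between them is an assumption here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of NiFi or the proposed processors.
final class RelativePathMapper {

    /**
     * Maps sourcePath (which must live under sourceRoot) onto the same
     * relative location under targetRoot.
     */
    static Path mapToTarget(Path sourceRoot, Path sourcePath, Path targetRoot) {
        Path root = sourceRoot.toAbsolutePath().normalize();
        Path src = sourcePath.toAbsolutePath().normalize();
        // Reject paths outside the source root so that ".." segments
        // cannot place files outside the target tree.
        if (!src.startsWith(root)) {
            throw new IllegalArgumentException(src + " is not under " + root);
        }
        Path relative = root.relativize(src);
        return targetRoot.toAbsolutePath().normalize().resolve(relative);
    }

    public static void main(String[] args) {
        Path target = mapToTarget(
                Paths.get("/mypath"),
                Paths.get("/mypath/a/b/file.txt"),
                Paths.get("/replica"));
        System.out.println(target);
    }
}
```

A receiving processor could create the parent directories of the returned path (for example with java.nio.file.Files.createDirectories) before writing the file, which is what allows directories as well as files to be replicated.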

RE: Proposal: New file processors: GetFIleData and PutFileData

Rick Braddy
Yes.  Replication of a directory tree via NiFi, similar to rsync.

-----Original Message-----
From: Joe Skora [mailto:[hidden email]]
Sent: Thursday, September 24, 2015 10:16 PM
To: [hidden email]
Subject: Re: Proposal: New file processors: GetFIleData and PutFileData

It may be an oversimplification, but for the purposes of understanding, is the intent to mirror a directory tree with NiFi, similar to rsync?

On Wed, Sep 23, 2015 at 11:26 PM, Rick Braddy <[hidden email]> wrote:

> Joe,
>
> Thanks for the quick response.
>
> Yes, I can add to the Wiki once access has been granted. Further responses:
>
> >> GetFile and PutFile do support recursive walking/reconstruction
> >> based on relative paths
>
> Based on my recent testing of 0.3.0, GetFile does walk the configured
> directory tree, picking up the files it finds; however, only files are
> sent to PutFile, which places them all into a single target folder
> (not a directory tree - no directory information is sent by GetFile
> nor processed by PutFile from what I have seen, so I do not believe it
> reconstructs the directory tree at all today).
>
> >> I do think your proposal modified to consider the design pattern of
> >> ListFile/FetchFile would be super powerful.
>
> We have another processor GetFileList that uses "find" to traverse a
> target folder tree and feeds the resulting newline delimited
> file/directory stream as FlowFiles into GetFileData.  Perhaps that
> processor could be evolved into a suitable ListFiles processor.
>
> I believe GetFileList/GetFileData correspond roughly to the
> ListFile/FetchFile concept, based on a cursory review of
> ListHDFS/FetchHDFS.  If it's a matter of renaming that's obviously
> trivial at this point.  I'm assuming there are other facets to that
> List/Fetch design pattern - is it documented anywhere I can review to learn more?
>
> So when we have a ListFile/FetchFile what is the corresponding "Put"
> side of the flow to be?  Perhaps simply PutFile enhanced to handle
> FlowFiles from both basic GetFile and the richer FetchFile (modified
> GetFileData) types of FlowFiles and behaviors would suffice.
>
> >> Just need to make sure backpressure works through the flow so that
> >> you could literally handle the delivery of a file which is of itself
> >> larger than the repo by capturing and sending a chunk of it at a time
> >> for instance.
>
> Agreed. Are there any best practices documented for configuring
> backpressure properly?
>
> Thanks.
>
> Rick
>
> -----Original Message-----
> From: Joe Witt [mailto:[hidden email]]
> Sent: Wednesday, September 23, 2015 6:25 PM
> To: [hidden email]
> Subject: Re: Proposal: New file processors: GetFIleData and
> PutFileData
>
> Rick
>
> This is a perfectly fine place to start the thread.  If you'd like to
> create a wiki feature proposal for it too like we're doing with a lot
> of the other things at this level we can give you access to create one
> here [1].
>
> Not at all trying to take away from the points you were making but
> GetFile and PutFile do support recursive walking/reconstruction based
> on relative paths.  By no means is that as comprehensive as you're
> going for here though - just an FYI.
>
> These sound like good things.  In particular I find your concept for
> handling arbitrarily large data interesting.  Just need to make sure
> backpressure works through the flow so that you could literally handle
> the delivery of a file which is of itself larger than the repo by
> capturing and sending a chunk of it at a time for instance.  So from a
> brief historical perspective the GetFile / PutFile processors were
> literally the first two processors ever built for NiFi back when it
> had no GUI, no provenance, no nothin' that was cool.  These are the
> OGs of NiFi.  They have been improved a bit over the years but not much.
> Why?  Because their utility was largely limited to trivial archiving
> cases.  We have recently had discussions about making them more
> powerful through the concept of ListFile/FetchFile like Adam mentions
> and as we've started doing with things like HDFS.  A much better model
> for sure.  Still not as powerful as what you're cooking up though.  I
> do think your proposal modified to consider the design pattern of
> ListFile/FetchFile would be super powerful.  In your case ListFile for
> a single larger file for instance could produce N listings that point
> to the same file on disk but for different offset/ranges.  This would
> be *very* interesting.  I am a bit concerned about how to have this
> nicely handle competing consumer problems but...we can cross that bridge later.
>
> If you're willing to tackle this we can definitely work with you to
> bring it in.  It is a non-trivial contribution for sure.  Folks often
> do not consider all the nasty gotchas that can occur in something as
> seemingly simple as File IO.
>
> Thanks
> Joe
>
> [1]
> https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
>
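
Joe Witt's suggestion above, that a listing step could emit N entries pointing at the same file on disk but covering different offset/ranges, might look roughly like the Java sketch below. It is only an illustration: FileRangeLister, Range, and listRanges are hypothetical names rather than NiFi APIs, and the fixed chunk size is an assumed parameter (the proposal only names a "chunkingThreshold" above which files would be split).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of NiFi: splits a file of a given size
// into consecutive (offset, length) ranges of at most chunkSize bytes.
final class FileRangeLister {

    /** One listing entry: a byte range within a single file. */
    static final class Range {
        final long offset;
        final long length;

        Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return "(" + offset + "," + length + ")";
        }
    }

    static List<Range> listRanges(long fileSize, long chunkSize) {
        if (fileSize < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and chunkSize > 0");
        }
        List<Range> ranges = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += chunkSize) {
            // The final range may be shorter than chunkSize.
            ranges.add(new Range(offset, Math.min(chunkSize, fileSize - offset)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 10-byte file with 4-byte chunks yields (0,4), (4,4), (8,2).
        System.out.println(listRanges(10, 4));
    }
}
```

Each range could then travel as its own FlowFile, so that no single FlowFile has to hold the whole file; the receiving side would write each range back at its offset. How competing consumers of those ranges are coordinated is left open here, as it is in the thread.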

Re: Proposal: New file processors: GetFIleData and PutFileData

Joe Witt
Rick,

I am finally taking a moment to clear out some dangling threads.  I
just looked into this one and the link appears to be gone.  Have you
chosen to withdraw this proposal at this time?

Thanks
Joe

On Fri, Sep 25, 2015 at 4:25 AM, Rick Braddy <[hidden email]> wrote:

> Yes.  Replication of a directory tree via NiFi, similar to rsync.
>
> -----Original Message-----
> From: Joe Skora [mailto:[hidden email]]
> Sent: Thursday, September 24, 2015 10:16 PM
> To: [hidden email]
> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>
> It may be an oversimplification, but for the purposes of understanding, is the intent to mirror a directory tree with NiFi, similar to rsync?

Re: Proposal: New file processors: GetFIleData and PutFileData

Rick Braddy
There was no interest shown by the community so we moved on.

> On Nov 3, 2015, at 3:53 AM, Joe Witt <[hidden email]> wrote:
>
> Rick,
>
> I am finally taking a moment to clear out some dangling threads.  I
> just looked into this one and the link appears to be gone.  Have you
> chosen to withdraw this proposal at this time?
>
> Thanks
> Joe
>
>> On Fri, Sep 25, 2015 at 4:25 AM, Rick Braddy <[hidden email]> wrote:
>> Yes.  Replication of directory tree via Nifi similar to rsync.
>>
>> -----Original Message-----
>> From: Joe Skora [mailto:[hidden email]]
>> Sent: Thursday, September 24, 2015 10:16 PM
>> To: [hidden email]
>> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>>
>> It may be an oversimplification, but for the purposes of understanding, is the intent to mirror directory tree with NiFi similar to rsync?
>>
>>> On Wed, Sep 23, 2015 at 11:26 PM, Rick Braddy <[hidden email]> wrote:
>>>
>>> Joe,
>>>
>>> Thanks for the quick response.
>>>
>>> Yes, I can add to the Wiki once access has been granted. Further responses:
>>>
>>>>> GetFile and PutFile do support recursive walking/reconstruction
>>>>> based
>>> on relative paths
>>>
>>> Based on my recent testing of 0.3.0, GetFile does walk the configured
>>> directory tree, picking up the files it finds; however, only files are
>>> sent to PutFile, which places them all into a single target folder
>>> (not a directory tree - no directory information is sent by GetFile
>>> nor processed by PutFile from what I have seen, so I do not believe it
>>> reconstructs the directory tree at all today).
>>>
>>>>> I do think your proposal modified to consider the design pattern of
>>> ListFile/FetchFile would be super powerful.
>>>
>>> We have another processor GetFileList that uses "find" to traverse a
>>> target folder tree and feeds the resulting newline delimited
>>> file/directory stream as FlowFiles into GetFileData.  Perhaps that
>>> processor could be evolved into a suitable ListFiles processor.
>>>
>>> I believe GetFileList/GetFileData correspond roughly to the
>>> ListFile/FetchFile concept, based on a cursory review of
>>> ListHDFS/FetchHDFS.  If it's a matter of renaming that's obviously
>>> trivial at this point.  I'm assuming there are other facets to that
>>> List/Fetch design pattern - is it documented anywhere I can review to learn more?
>>>
>>> So when we have a ListFile/FetchFile what is the corresponding "Put"
>>> side of the flow to be?  Perhaps simply PutFile enhanced to handle
>>> FlowFiles from both basic GetFile and the richer FetchFile (modified
>>> GetFileData) types of FlowFiles and behaviors would suffice.
>>>
>>>>> Just need to make sure backpressure works through the flow so that
>>>>> you
>>> could literally handle the delivery of a file which is of itself
>>> larger than the repo by capturing and sending a chunk of it at a time for instance.
>>>
>>> Agreed. Are there any best practices documented for configuring
>>> backpressure properly?
>>>
>>> Thanks.
>>>
>>> Rick
>>>
>>> -----Original Message-----
>>> From: Joe Witt [mailto:[hidden email]]
>>> Sent: Wednesday, September 23, 2015 6:25 PM
>>> To: [hidden email]
>>> Subject: Re: Proposal: New file processors: GetFIleData and
>>> PutFileData
>>>
>>> Rick
>>>
>>> This is a perfectly fine place to start the thread.  If you'd like to
>>> create a wiki feature proposal for it too like we're doing with a lot
>>> of the other things at this level we can give you access to create one
>>> here [1].
>>>
>>> Not at all trying to take away from the points you were making but
>>> GetFile and PutFile do support recursive walking/reconstruction based
>>> on relative paths.  By no means is that as comprehensive as you're
>>> going for here though - just an FYI.
>>>
>>> These sound like good things.  In particular I find your concept for
>>> handling arbitrarily large data interesting.  Just need to make sure
>>> backpressure works through the flow so that you could literally handle
>>> the delivery of a file which is of itself larger than the repo by
>>> capturing and sending a chunk of it at a time for instance.  So from a
>>> brief historical perspective the GetFile / PutFile processors were
>>> literally the first two processors ever build for NiFi back when it
>>> had no GUI, no provenance, no nothin' that was cool.  These are the
>>> OGs of NiFi.  They been improved a bit over the years but not much.
>>> Why?  Because their utility was largely limited to trivial archiving
>>> cases.  We have recently had discussions about making them more
>>> powerful through the concept of ListFile/FetchFile like adam mentions
>>> and as we've started doing with things like HDFS.  A much better model
>>> for sure.  Still not as powerful as what you're cooking up though.  I
>>> do think your proposal modified to consider the design pattern of
>>> ListFile/FetchFile would be super powerful.  In your case ListFile for
>>> a single larger file for instance could produce N listings that point
>>> to the same file on disk but for different offset/ranges.  This would
>>> be *very* interesting.  I am a bit concerned about how to have this
>>> nicely handle competing consumer problems but...we can cross that bridge later.
>>>
>>> If you're willing to tackle this we can definitely work with you to
>>> bring it in.  It is a non-trivial contribution for sure.  Folks often
>>> do not consider all the nasty gotchas that can occur in something as
>>> seemingly simple as File IO.
>>>
>>> Thanks
>>> Joe
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposal
>>> s
>>>
>>>> On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <[hidden email]> wrote:
>>>> This thread proposes community review/comments of modified versions
>>>> of
>>> GetFile and PutFile for potential future adoption by the Nifi community.
>>> For those who want to jump straight to the code, here's the review
>>> repository location for the current version:
>>> https://github.com/rickbraddy/nifishare.
>>>>
>>>> As background, we needed a way to replicate entire directory trees
>>>> of
>>> files via Nifi, where multiple directory trees can be specified at
>>> run-time as part of an overall Nifi graph. As Nifi is rooted in
>>> file-based processing, it seems reasonable to continue advancing its
>>> abilities to ingest, process, transform and replicate files in the
>>> most flexible manner possible.  While this proposal is not a be all
>>> end all in that regard, it moves the needle in the right direction by
>>> making file-processing in Nifi more dynamic, enabling flows to
>>> determine how files (and directories) should be processed, which does
>>> well beyond today's basic file ingress/egress process capabilities
>>> (which certainly have their place and uses).  Whether it's via this
>>> proposal and code or another, clearly Nifi can benefit from this type of functionality.
>>>>
>>>> Here's a more detailed explanation of the rationale for developing
>>>> these
>>> Nifi file processor derivatives and their initial implementation:
>>>>
>>>> GetFileData
>>>> ----------------
>>>> The GetFile processor monitors a single directory tree for file
>>>> changes
>>> and creates FlowFiles for every changed file in that configured tree.
>>> It does a good job of getting files from a configurable folder than
>>> need to be injected into a graph. GetFile falls short of other
>>> requirements that arise for general-purpose file processing:
>>>>
>>>> -          Operates from a single, pre-configured source directory (not
>>> dynamically configurable at run-time as part of a flow)
>>>>
>>>> -          Scheduled on a periodic basis only, not event-triggered when
>>> there's something to do
>>>>
>>>> -          Does not support sending an entire directory tree (only files
>>> are sent, not directories)
>>>>
>>>> -          Is a "source" processor node only, cannot be used within
>>> other Nifi flow logic that dynamically determines which files or
>>> directories to get and send as FlowFiles
>>>>
>>>> -          Assumes each file is smaller than the content repository,
>>> which causes large files (hundreds of MB's, GBs, TBs) to overrun or
>>> dominate the content repository
>>>>
>>>> A modified version of GetFile (currently) named GetFileData has been
>>> developed and is proposed as the basis for a new Nifi processor that
>>> will supplement file ingestion with these features:
>>>>
>>>> -          Operates based upon inbound FlowFiles that contains the
>>> filesystem path to a file or directory
>>>>
>>>> -          Scheduled by incoming FlowFiles containing a file or
>>> directory path, only runs when there's something to do
>>>>
>>>> -          Supports sending directory tree as a series of directory and
>>> file paths; e.g., ExecuteProcess("find /mypath -print") =>
>>> SplitText(newline) => ModifyAttribute(add "file.roodir=/mypath") =>
>>> GetFIleData ...
>>>>
>>>> -          Participates within simple or complex flows to fetch and send
>>> files and directories
>>>>
>>>> -          (To be developed) Is designed to handle any size file, by
>>> breaking files larger than a "chunkingThreshold" into a series of
>>> multiple smaller files that can be reassembled on the other end (by
>>> PutFileData)
>>>>
>>>> PutFileData
>>>> ---------------
>>>> The PutFile processor accepts incoming FlowFiles and writes those
>>>> files to a single target directory.  It does a good job of handling and
>>>> resolving conflicts, but falls short of other requirements that arise
>>>> for general-purpose file processing:
>>>>
>>>> -          Does not support directories, only files
>>>>
>>>> -          Only supports a single, preconfigured target directory
>>>>
>>>> -          Cannot reconstruct an entire directory tree based upon
>>>> relative file paths (all files go into a single target directory)
>>>>
>>>> -          Assumes each file is small enough to fit into the content
>>>> repository
>>>>
>>>> A modified version of PutFile (currently) named PutFileData has been
>>>> developed and is proposed as the basis for a new Nifi processor that
>>>> will supplement file egress with these features:
>>>>
>>>> -          Supports directories and files
>>>>
>>>> -          Supports reconstruction of an entire directory tree based upon
>>>> relative file paths, enabling reconstruction of an entire directory
>>>> tree originating from GetFileData
>>>>
>>>> -          (To be developed) Is designed to handle any size file, by
>>>> reassembling multi-part files into very large files (TBs) that do not
>>>> fit within the content repository
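
Reassembly is also listed as not yet developed, so this is a hedged Python sketch of one possible approach rather than the PutFileData implementation. It assumes that every part arrives with its index, that all parts except the last share a fixed chunk size, and that a single writer handles a given target file. The names Reassembler and write_part are hypothetical.

```python
import os


def write_part(target_path: str, index: int, chunk_size: int,
               data: bytes) -> None:
    """Write one part at byte offset index * chunk_size.

    Writing at an explicit offset means parts may arrive in any order;
    seeking past the current end of file leaves a gap that later parts
    fill in.
    """
    mode = "r+b" if os.path.exists(target_path) else "w+b"
    with open(target_path, mode) as fh:
        fh.seek(index * chunk_size)
        fh.write(data)


class Reassembler:
    """Track which parts of one target file have been written."""

    def __init__(self, target_path: str, total_parts: int,
                 chunk_size: int) -> None:
        self.target_path = target_path
        self.total_parts = total_parts
        self.chunk_size = chunk_size
        self.received = set()

    def add(self, index: int, data: bytes) -> bool:
        """Write a part and return True once every part has been seen."""
        write_part(self.target_path, index, self.chunk_size, data)
        self.received.add(index)
        return len(self.received) == self.total_parts
```

Because each part is written straight to the target file, the full file never has to exist in one buffer, which is the property the proposal is after for files larger than the content repository.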
>>>>
>>>> Should the community have an interest in these processors (we can
>>>> name them something different, if needed), these contributions are now
>>>> available.  In the meantime, we shall continue developing these
>>>> processors to meet our specific use cases, adding the chunking
>>>> functionality and QA certifying them for production use at scale.
>>>>
>>>> Looking forward to comments, feedback and recommendations.
>>>>
>>>> Here's the Github repo link again:
>>>> https://github.com/rickbraddy/nifishare
>>>>
>>>> Best,
>>>> Rick
>>>>
>>>> P.S. If there's a better vehicle for communicating these types of
>>>> proposals, please advise.
>>>

Re: Proposal: New file processors: GetFIleData and PutFileData

Joe Witt
Sounds good.  For future reference, documentation on the processes that
work best (so far) can be found here:
https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide

Creating a JIRA and attaching a patch in patch reviewed state helps for
sure, as does the Github PR process.  Both create a sort of 'pull' that
allows the community to then work on items as available.  There are
some List/Fetch items being worked on, so perhaps some of your ideas
will be addressed there.

Thanks
Joe

On Tue, Nov 3, 2015 at 10:49 AM, Rick Braddy <[hidden email]> wrote:

> There was no interest shown by the community so we moved on.
>
>> On Nov 3, 2015, at 3:53 AM, Joe Witt <[hidden email]> wrote:
>>
>> Rick,
>>
>> I am finally taking a moment to clear out some dangling threads.  I
>> just looked into this one and the link appears to be gone.  Have you
>> chosen to withdraw this proposal at this time?
>>
>> Thanks
>> Joe
>>
>>> On Fri, Sep 25, 2015 at 4:25 AM, Rick Braddy <[hidden email]> wrote:
>>> Yes.  Replication of directory tree via Nifi similar to rsync.
>>>
>>> -----Original Message-----
>>> From: Joe Skora [mailto:[hidden email]]
>>> Sent: Thursday, September 24, 2015 10:16 PM
>>> To: [hidden email]
>>> Subject: Re: Proposal: New file processors: GetFIleData and PutFileData
>>>
>>> It may be an oversimplification, but for the purposes of understanding, is the intent to mirror a directory tree with NiFi, similar to rsync?
>>>
>>>> On Wed, Sep 23, 2015 at 11:26 PM, Rick Braddy <[hidden email]> wrote:
>>>>
>>>> Joe,
>>>>
>>>> Thanks for the quick response.
>>>>
>>>> Yes, I can add to the Wiki once access has been granted. Further responses:
>>>>
>>>>>> GetFile and PutFile do support recursive walking/reconstruction
>>>>>> based on relative paths
>>>>
>>>> Based on my recent testing of 0.3.0, GetFile does walk the configured
>>>> directory tree, picking up the files it finds; however, only files are
>>>> sent to PutFile, which places them all into a single target folder
>>>> (not a directory tree - no directory information is sent by GetFile
>>>> nor processed by PutFile from what I have seen, so I do not believe it
>>>> reconstructs the directory tree at all today).
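
To make the point about missing directory information concrete, here is a small Python sketch of how a path relative to the source root, carried along with each file or directory, lets the receiving side rebuild the tree instead of flattening it. It is an illustration under assumptions, not how GetFile, PutFile, or the proposed processors work internally; relative_entry and place_entry are hypothetical names.

```python
import os


def relative_entry(root_dir: str, absolute_path: str) -> str:
    """Return the path of an entry relative to the source root."""
    return os.path.relpath(absolute_path, root_dir)


def place_entry(target_root: str, relative_path: str, is_dir: bool,
                data: bytes = b"") -> str:
    """Recreate one directory or file under target_root.

    Rejects relative paths that would escape target_root (for example
    paths containing "..").
    """
    target_root = os.path.abspath(target_root)
    destination = os.path.abspath(os.path.join(target_root, relative_path))
    if os.path.commonpath([target_root, destination]) != target_root:
        raise ValueError("path escapes target root: " + relative_path)
    if is_dir:
        # Directories are sent explicitly, so empty ones survive too.
        os.makedirs(destination, exist_ok=True)
    else:
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        with open(destination, "wb") as fh:
            fh.write(data)
    return destination
```

Sending directories as their own entries, rather than inferring them from file paths, is what allows empty directories to be replicated as well.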
>>>>
>>>>>> I do think your proposal modified to consider the design pattern of
>>>>>> ListFile/FetchFile would be super powerful.
>>>>
>>>> We have another processor, GetFileList, that uses "find" to traverse a
>>>> target folder tree and feeds the resulting newline-delimited
>>>> file/directory stream as FlowFiles into GetFileData.  Perhaps that
>>>> processor could be evolved into a suitable ListFiles processor.
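
As a rough illustration of the listing step described above, the following Python sketch walks a tree and emits one record per directory and file, parents before children. It is a stand-in for the "find"-based traversal, not the GetFileList code; the function name list_tree and the record keys are assumptions.

```python
import os
from typing import Dict, Iterator


def list_tree(root_dir: str) -> Iterator[Dict[str, str]]:
    """Yield one record per directory and file under root_dir.

    Each record carries the entry's path, the root it was found under,
    and its type, so a downstream step can fetch it and compute its
    relative path. Directories are yielded before their contents.
    Symbolic links to directories are not followed and are not listed.
    """
    root_dir = os.path.abspath(root_dir)
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # make the traversal order deterministic
        yield {"path": dirpath, "root": root_dir, "type": "directory"}
        for name in sorted(filenames):
            yield {"path": os.path.join(dirpath, name),
                   "root": root_dir, "type": "file"}
```

Keeping the listing separate from the fetching means the listing output stays small (paths only), while the expensive content transfer happens later and can be distributed.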
>>>>
>>>> I believe GetFileList/GetFileData correspond roughly to the
>>>> ListFile/FetchFile concept, based on a cursory review of
>>>> ListHDFS/FetchHDFS.  If it's a matter of renaming that's obviously
>>>> trivial at this point.  I'm assuming there are other facets to that
>>>> List/Fetch design pattern - is it documented anywhere I can review to learn more?
>>>>
>>>> So when we have a ListFile/FetchFile, what should the corresponding
>>>> "Put" side of the flow be?  Perhaps simply enhancing PutFile to handle
>>>> FlowFiles from both basic GetFile and the richer FetchFile (modified
>>>> GetFileData) types of FlowFiles and behaviors would suffice.
>>>>
>>>>>> Just need to make sure backpressure works through the flow so that
>>>>>> you could literally handle the delivery of a file which is of itself
>>>>>> larger than the repo by capturing and sending a chunk of it at a
>>>>>> time for instance.
>>>>
>>>> Agreed. Are there any best practices documented for configuring
>>>> backpressure properly?
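
For readers unfamiliar with the term, the general idea of backpressure can be sketched in a few lines of Python: a bounded queue between a producer and a consumer makes the producer wait whenever the queue is full, so the amount of data in flight never exceeds the bound. This is a generic illustration only and says nothing about how NiFi connections are configured; transfer and max_queued are hypothetical names.

```python
import queue
import threading
from typing import Iterable, List, Tuple


def transfer(chunks: Iterable[bytes],
             max_queued: int) -> Tuple[List[bytes], int]:
    """Pass chunks to a consumer thread through a bounded queue.

    Returns the chunks in delivery order and the largest queue size
    observed right after a put.
    """
    buffer = queue.Queue(maxsize=max_queued)
    delivered = []
    done = object()  # sentinel that tells the consumer to stop
    peak = 0

    def consumer() -> None:
        while True:
            item = buffer.get()
            if item is done:
                return
            delivered.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for chunk in chunks:
        buffer.put(chunk)  # blocks while the queue is full
        peak = max(peak, buffer.qsize())
    buffer.put(done)
    worker.join()
    return delivered, peak
```

With a bound of two chunks, for example, a file of any size passes through while at most two chunks are queued at once.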
>>>>
>>>> Thanks.
>>>>
>>>> Rick
>>>>
>>>> -----Original Message-----
>>>> From: Joe Witt [mailto:[hidden email]]
>>>> Sent: Wednesday, September 23, 2015 6:25 PM
>>>> To: [hidden email]
>>>> Subject: Re: Proposal: New file processors: GetFIleData and
>>>> PutFileData
>>>>
>>>> Rick
>>>>
>>>> This is a perfectly fine place to start the thread.  If you'd like to
>>>> create a wiki feature proposal for it too like we're doing with a lot
>>>> of the other things at this level we can give you access to create one
>>>> here [1].
>>>>
>>>> Not at all trying to take away from the points you were making but
>>>> GetFile and PutFile do support recursive walking/reconstruction based
>>>> on relative paths.  By no means is that as comprehensive as you're
>>>> going for here though - just an FYI.
>>>>
>>>> These sound like good things.  In particular I find your concept for
>>>> handling arbitrarily large data interesting.  Just need to make sure
>>>> backpressure works through the flow so that you could literally handle
>>>> the delivery of a file which is of itself larger than the repo by
>>>> capturing and sending a chunk of it at a time for instance.  So from a
>>>> brief historical perspective the GetFile / PutFile processors were
>>>> literally the first two processors ever built for NiFi back when it
>>>> had no GUI, no provenance, no nothin' that was cool.  These are the
>>>> OGs of NiFi.  They have been improved a bit over the years but not much.
>>>> Why?  Because their utility was largely limited to trivial archiving
>>>> cases.  We have recently had discussions about making them more
>>>> powerful through the concept of ListFile/FetchFile like Adam mentions
>>>> and as we've started doing with things like HDFS.  A much better model
>>>> for sure.  Still not as powerful as what you're cooking up though.  I
>>>> do think your proposal modified to consider the design pattern of
>>>> ListFile/FetchFile would be super powerful.  In your case ListFile for
>>>> a single larger file for instance could produce N listings that point
>>>> to the same file on disk but for different offset/ranges.  This would
>>>> be *very* interesting.  I am a bit concerned about how to have this
>>>> nicely handle competing consumer problems but...we can cross that bridge later.
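
The idea of N listings that point at the same file with different offsets can be sketched as follows in Python. This is an illustration of the concept only, not an existing ListFile or FetchFile API; list_ranges, fetch_range, and range_size are hypothetical names.

```python
from typing import Iterator, Tuple


def list_ranges(file_size: int, range_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that together cover the whole file."""
    if range_size <= 0:
        raise ValueError("range_size must be positive")
    offset = 0
    while offset < file_size:
        length = min(range_size, file_size - offset)
        yield offset, length
        offset += length


def fetch_range(path: str, offset: int, length: int) -> bytes:
    """Read only the bytes named by one listing entry."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length)
```

Since each listing entry is self-describing, entries can be fetched independently and in any order; deciding which consumer takes which entry is the part left open above.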
>>>>
>>>> If you're willing to tackle this we can definitely work with you to
>>>> bring it in.  It is a non-trivial contribution for sure.  Folks often
>>>> do not consider all the nasty gotchas that can occur in something as
>>>> seemingly simple as File IO.
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> [1] https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
>>>>
>>>>> On Wed, Sep 23, 2015 at 1:42 PM, Rick Braddy <[hidden email]> wrote: