Multiple dataflows with sub-flows and version control


Multiple dataflows with sub-flows and version control

Edenfield, Orrin
Hello everyone - I'm new to the mailing list, and I've tried to search JIRA and the mailing list to see if this has already been addressed. I didn't find anything, so here goes:

When I think about the capabilities of this tool I instantly think of ETL-type tools. So the questions/comments below are likely to be coming from that frame of mind - let me know if I've misunderstood a key concept of NiFi as I think that could be possible.

Is it possible to have the NiFi service set up and running and allow multiple dataflows to be designed and deployed (running) at the same time?  So far in my testing I've found that I can get the NiFi service up and functioning as expected on my cluster edge node, but I'd like to be able to design multiple dataflows for the following reasons.

1. I have many datasets that will need some of the same flow actions, but not all of them. I'd like to componentize the flows and possibly have multiple flows cascade from one to another. For example:  I want all data to flow into an HDFS endpoint, but dataset1 will be coming in as delimited data, so it can go directly into the GetFile processor, while dataset2 needs to go through a CompressContent processor first.

2. Because of the need in #1 above, I'd like to be able to design multiple flows (specific to a data need, or component flows that work together) and have them all deployed (running) concurrently.

Also - it would be nice to be able to version control these designed flows, so I can have one flow running while modifying a version 2.0 of that flow. Once the updates have been made, I'd have a mechanism to safely and effectively shut down flow.v1 and start up flow.v2.

Thank you.

--
Orrin Edenfield

Re: Multiple dataflows with sub-flows and version control

Mark Payne
Orrin,

Within NiFi you can create many different dataflows within the same graph and run them concurrently. We've built flows with several hundred Processors. Data can flow between flows simply by connecting the Processors together.

If you want to separate the flows logically because it makes more sense to you to visualize them that way, you may want to use Process Groups.

I'm on my cell phone right now, so I can't draw up an example for you, but I will this afternoon when I have a chance. The basic idea is that for #1 you would have:

GetFile -> PutHDFS

And alongside that, another GetFile -> CompressContent -> the same PutHDFS.

In fact, you can even handle both cases with a single flow:

GetFile -> IdentifyMimeType (to check if compressed) -> CompressContent (set to decompress, with the compression type coming from the mime type identified by the previous processor) -> PutHDFS
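For readers outside NiFi, the routing logic of that single flow can be sketched in plain Python. This is only an illustration of the idea: the gzip-only magic-byte detection and the local destination directory standing in for PutHDFS are assumptions for brevity, not NiFi behavior.

```python
import gzip
from pathlib import Path

# gzip files begin with these two magic bytes
GZIP_MAGIC = b"\x1f\x8b"

def identify_mime_type(content: bytes) -> str:
    """Stand-in for IdentifyMimeType: detect gzip by magic bytes only."""
    return "application/gzip" if content.startswith(GZIP_MAGIC) else "text/plain"

def process(content: bytes, dest_dir: Path, name: str) -> Path:
    """Decompress if needed (CompressContent in decompress mode),
    then write to a local directory standing in for the PutHDFS endpoint."""
    if identify_mime_type(content) == "application/gzip":
        content = gzip.decompress(content)
    dest_dir.mkdir(parents=True, exist_ok=True)
    out = dest_dir / name
    out.write_bytes(content)
    return out
```

Both the plain delimited dataset and the compressed one go through the same path; only the detection step decides whether decompression happens.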

With regards to #2:
You can build the new flow right alongside the old flow. When you are ready to switch, simply change the connection to send data to the new flow instead of the old one.
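The mechanics of that switch can be modeled in a few lines of Python. The Connection class here is a toy illustration of the idea, not a NiFi API: repointing the destination changes where new data goes without touching data already delivered.

```python
class Connection:
    """Toy model of a flow connection whose destination can be repointed
    while data keeps arriving (class and method names are illustrative)."""
    def __init__(self, destination):
        self.destination = destination  # a callable that consumes flow files

    def transfer(self, flowfile):
        self.destination(flowfile)

flow_v1, flow_v2 = [], []
conn = Connection(flow_v1.append)

conn.transfer("record-1")           # delivered to flow v1
conn.destination = flow_v2.append   # the cutover: repoint the connection
conn.transfer("record-2")           # subsequent data goes to flow v2
```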

Again, I'll put together some examples this afternoon with screenshots that should help. Let me know if this helps or if it creates more questions (or both :))

Thanks
-Mark



Sent from my iPhone


RE: Multiple dataflows with sub-flows and version control

Edenfield, Orrin
Mark,

I follow the logic here; I just think that over time it will be really hard to keep track of things when there are hundreds (or thousands) of processors, rather than hundreds of different flows (organized within a source control tree or similar) that each have 5-50 processors within them.

I'd be interested to learn how the Process Groups component works, so if you do get time to draw an example, I think that would be helpful.

Thank you.

--
Orrin Edenfield


Re: Multiple dataflows with sub-flows and version control

Joe Witt
Orrin,

You definitely bring up a good point.  I believe, though, that the point is about the inherent complexity that exists when you have large-scale dataflows, and a large number of them at that.

What NiFi allows you to do is manage that complexity visually, in real time, and across the desired spectrum of granularity.  One potentially convenient way to think about it is this:

When you're writing code and you identify a new abstraction that would make things cleaner and more logical, you start to refactor.  You do this to make your code more elegant, more efficient, and more maintainable, and to manage complexity.  In NiFi you do exactly that.  As you grow toward hundreds or thousands of processors, you identify patterns that reveal themselves visually.  That is a great way to communicate concepts, not just for the original author but for others as well.  As you build flows, bad ideas tend to become obvious and, more importantly, easy to deal with.

The key thing, though, is that you don't have long, arduous offline improvement cycles, which tend to cause folks to avoid solving the root problem and thus accrue tech debt.  With NiFi you just start making improvements to the flow while everything is running.  You get immediate feedback on whether what you're doing is correct.  You can even experiment in production but outside the production flow, if necessary, by doing a super-efficient tee of the flow.  It really is a very different way of approaching a very old problem.

It's cool that you're seeing ETL cases for it.  If there are details you can share, we'd love to hear them.  I don't know if the sweet spot is there or not; we'll have to see what the community finds and how that evolves over time.  I will say that for new NiFi users it is extremely common to think in terms of a bunch of independent dataflow graphs - basically a lot of independent linear graphs.  Then, over time, as they start to understand more about what NiFi enables, they start thinking in directed graphs, merging flows, establishing reusable components, and so on.  Curious to see how that maps to your experience.

As for checking the flow configuration into a source control system, you can certainly do that.  You could programmatically invoke our endpoint, which causes NiFi to make a backup of the flow, and put that in source control on some time interval.  But keep in mind that is just like taking a picture of what the flow looks like.  NiFi is more than the picture of the flow; it is the picture of the flow plus the state of the data within it, and so on.
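To make the periodic-backup idea concrete, here is a minimal Python sketch of the snapshot side, assuming the flow configuration bytes have already been fetched (for example via the backup endpoint mentioned above, or by copying the flow configuration file). The function name, file layout, and change-detection scheme are invented for illustration; a scheduled job could call this and then commit the directory to source control.

```python
import hashlib
import time
from pathlib import Path

def snapshot_flow(flow_bytes: bytes, repo_dir: Path) -> Path:
    """Write a timestamped, content-addressed copy of the flow config.

    Skips writing when an identical snapshot already exists, so an
    interval-driven loop can call this repeatedly and only new versions
    of the flow produce files to commit.
    """
    repo_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(flow_bytes).hexdigest()[:12]
    existing = sorted(repo_dir.glob(f"flow-*-{digest}.xml"))
    if existing:
        return existing[0]  # unchanged since the last snapshot
    stamp = time.strftime("%Y%m%dT%H%M%S")
    out = repo_dir / f"flow-{stamp}-{digest}.xml"
    out.write_bytes(flow_bytes)
    return out
```

As Joe notes, this captures only the picture of the flow, not the state of the data moving through it.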

Very interested to hear more of your thoughts as you look further and think more about it.  You'll be a great help to us in understanding how to communicate about NiFi to folks coming from an ETL background.  Ultimately it would be great if you could help us do that ;-)  Please don't be shy letting us know your expectations.  We're new here too.

Thanks
Joe




RE: Multiple dataflows with sub-flows and version control

Edenfield, Orrin
Joe,

Thank you for taking the time to detail this out for me.  This is a different way of thinking for me, but I think I'm starting to get it.  I work with a data factory that uses an ETL tool; if we solved it the same way we solve things with traditional ETL tools, it would take about 300 to 600 individual flows (closer to the 300 side if we can parameterize/re-use pieces of flows) and literally thousands of processors.

I'll try to think some more over the weekend, but you're probably right that with full use of these components, that could quickly be compacted into a much smaller footprint in terms of the actual needed data flow.

I know things are still getting started here with incubation, but if there are any documents or further examples I can read on things like Process Groups, that would help me see if I can fully wrap my head around applying this to my world.  :-)

And just let me know if there is anything I can do to help - I'm excited about the possibilities of this tool!

Thank you.

--
Orrin Edenfield
Associate Architect - PRGX USA, Inc.
[hidden email]


Re: Multiple dataflows with sub-flows and version control

Joe Witt
Great!

"more documents ..."

Oh yeah - we're working hard on that too.  The current user guide draft can be found here:

http://nifi.incubator.apache.org/docs/nifi-docs/user-guide.html

And if you build the latest 'develop' branch, that guide is integrated into the app, along with the initial stab at our expression language.  We're still a ways behind the ball on docs though, and it is a major focus area.  We will get examples out as well - that is one really nice thing about our 'Templates' feature: examples can be easily imported.

Thanks and have a great weekend

Joe

On Fri, Jan 2, 2015 at 3:02 PM, Edenfield, Orrin <[hidden email]>
wrote:

> Joe,
>
> Thank you for taking the time to detail this out for me.  This is a
> different way of thinking for me but I think I'm starting to get it.  I
> work with a data factory that uses an ETL tool that would take about 300 to
> 600 individual flows (closer to the 300 side if we can parameterize/re-use
> pieces of flows) and would literally be thousands of processors - if we
> solved it the same way we solve with traditional ETL tools.
>
> I'll try to think some more over the weekend but you're probably right
> that with the full use of these components that could be quickly compacted
> into a much smaller footprint when it comes to actual needed data flow.
>
> I know things are still getting started here with incubation but if there
> are any documents/more examples I can read up on when it comes to things
> like Process Groups - I think that would help me see if I can fully wrap my
> head around this when it comes to applying this to my world.  :-)
>
> And just let me know if there is anything I can do to help - I'm excited
> about the possibilities of this tool!
>
> Thank you.
>
> --
> Orrin Edenfield
> Associate Architect - PRGX USA, Inc.
> [hidden email]
>
> -----Original Message-----
> From: Joe Witt [mailto:[hidden email]]
> Sent: Friday, January 02, 2015 2:26 PM
> To: [hidden email]
> Subject: Re: Multiple dataflows with sub-flows and version control
>
> Orrin,
>
> You definitely bring up a good point.  I believe though the point is about
> the inherent complexity that exists when you have large-scale dataflows and
> large number of them at that.
>
> What NiFi allows you to do is manage the complexity visually, in
> real-time, and all across the desired spectrum of granularity.  One
> potentially convenient way to think about it is this:
>
> When you're writing code and you identify a new abstraction that would
> make things cleaner and more logical you start to refactor.  You do this to
> make your code more elegant, more efficient, more maintainable and to
> manage complexity.  In NiFi you do exactly that.  As you're growing toward
> hundreds or thousands of processors you identify patterns that reveal
> themselves visually.  That is a great way to communicate concepts not just
> for the original author but for others as well.  As you build flows bad
> ideas tend to become obvious and more importantly easy to deal with.  The
> key thing though is that you don't have long arduous off-line improvement
> cycles which tend to cause folks to avoid solving the root problem and thus
> they accrue tech debt.  With NiFi you just start making improvements to the
> flow while everything is running.  You get immediate feedback on whether
> what you're doing is correct or not.  You can experiment in production but
> outside the production flow if necessary by doing a super efficient tee of
> the flow.  It really is a very different way of approaching a very old
> problem.
>
> It's cool that you're seeing ETL cases for it.  If there are details of
> that which you can share we'd love to hear them.  I don't know if the sweet
> spot is there or not.  We'll have to see what the community finds and how
> that evolves over time.  I will say for new NiFi users it is extremely
> common for them to think of a bunch of independent dataflow graphs which
> are basically a lot of independent linear graphs.  Then over time as they
> start to understand more about what it enables they start thinking in
> directed graphs and how to merge flows and establish reusable components
> and so on.  Curious to see how that maps to your experience.
>
> As for the check-in of the flow configuration to a source control system
> you can certainly do that.  You could programmatically invoke our endpoint
> which causes NiFi to make a backup of the flow and then put that in source
> control on some time interval.  But keep in mind that is just like taking a
> picture of what the 'flow looks like'.  NiFi is more than the picture of
> the flow.  It is the picture of the flow and the state of the data within
> it and so on.
>
> Very interested to hear more of your thoughts as you look further and
> think more about it.  You'll be a great help to us to better understand how
> to communicate about it to folks coming from an ETL background.  Ultimately
> it
> would be great if we get you to help us do that with us ;-)   Please don't
> be shy letting us know you're expectations.  We're new here too.
>
> Thanks
> Joe
>
>
>
> On Fri, Jan 2, 2015 at 1:50 PM, Edenfield, Orrin <[hidden email]
> >
> wrote:
>
> > Mark,
> >
> > I follow the logic here I just think over time it will be really hard
> > to keep track of things when there are hundreds (or thousands) of
> > processors - rather than hundreds of different flows (organized within
> > a source control tree or similar) that all have 5-50 different
> processors within them.
> >
> > I'd be interested to learn about how the Process Groups component
> > works so if you do get time to draw an example I think that would be
> helpful.
> >
> > Thank you.
> >
> > --
> > Orrin Edenfield
> >
> > -----Original Message-----
> > From: Mark Payne [mailto:[hidden email]]
> > Sent: Friday, January 02, 2015 12:34 PM
> > To: [hidden email]; Edenfield, Orrin
> > Subject: Re: Multiple dataflows with sub-flows and version control
> >
> > Orrin,
> >
> > Within NiFi you can create many different dataflows within the same
> > graph and run them concurrently. We've built flows with several
> > hundred Processors. They data can flow between flows by simply
> > connecting the Processors together.
> >
> > If you want to separate the flows logically because it makes more
> > sense to you to visualize them that way, you may want to use Process
> Groups.
> >
> > I'm on my cell phone right now so I cannot draw up an example for you
> > but I will this afternoon when I have a chance. But the basic idea is
> > that for
> > #1 you would have:
> >
> > GetFile -> PutHDFS
> >
> > And along side that another GetFile -> CompressContent -> the same
> PutHDFS.
> >
> > In this case you can even do this with the following flow:
> >
> > GetFile -> IdentifyMimeType (to check if compressed) ->
> > CompressContent (set to decompress and the compression type come from
> > mime type, which is identified by the previous processor) -> PutHDFS
> >
> > With regards to #2:
> > You can build the new flow right along side the old flow. When you are
> > ready to switch, simply change the connection to send data to the new
> > flow instead of the old one.
> >
> > Again, I'll put together some examples this afternoon with screen
> > shots that should help. Let me know if this helps or if it creates
> > more questions (or both :))
> >
> > Thanks
> > -Mark
> >
> >
> >
> > Sent from my iPhone
> >
> > > On Jan 2, 2015, at 11:37 AM, Edenfield, Orrin
> > > <[hidden email]>
> > wrote:
> > >
> > > Hello everyone - I'm new to the mailing list and I've tried to
> > > search
> > the JIRA and mailing list to see if this has already been addressed
> > and didn't find anything so here it goes:
> > >
> > > When I think about the capabilities of this tool I instantly think
> > > of
> > ETL-type tools. So the questions/comments below are likely to be
> > coming from that frame of mind - let me know if I've misunderstood a
> > key concept of NiFi as I think that could be possible.
> > >
> > > Is it possible to have the NiFi service set up and running and allow for
> > multiple dataflows to be designed and deployed (running) at the same
> time?
> > So far in my testing I've found that I can get NiFi service up and
> > functioning as expected on my cluster edge node but I'd like to be
> > able to design multiple dataflows for the following reasons.
> > >
> > > 1. I have many datasets that will need some of the same flow actions
> > > but
> > not all of them. I'd like to componentize the flows and possibly have
> > multiple flows cascade from one to another. For example:  I will want
> > all data to flow into an HDFS endpoint but dataset1 will be coming in
> > as delimited data so it can go directly into the GetFile processor
> > while I need dataset2 to go through a CompressContent processor first.
> > >
> > > 2. Because I have a need in #1 above - I'd like to be able to design
> > multiple flows (specific to a data need or component flows that work
> > together) and have them all be able to be deployed (running)
> concurrently.
> > >
> > > Also - it would be nice to be able to version control these designed
> > flows so I can have 1 flow running while modifying a version 2.0 of
> > that flow and then once the updates have been made then I can safely
> > and effectively have a mechanism to shut down flow.v1 and start up
> flow.v2.
> > >
> > > Thank you.
> > >
> > > --
> > > Orrin Edenfield
> >
>

Re: Multiple dataflows with sub-flows and version control

Joe Gresock
Orrin,

I've also found that a great deal of "individual data flows" can often be
handled using some basic RouteOnAttribute and UpdateAttribute patterns in
NiFi.  These two processors alone are extremely powerful in reducing flow
sizes.  Joe W speaks the truth when he talks about being able to visualize
bad patterns, especially in code reuse.  My team has found that often what
appears to be multiple different flows turns out to be slightly different
uses of the same basic flow, and we've been able to reduce the number of
processors by orders of magnitude with careful study.  I wouldn't be
surprised to find that your 300-processor use case can be reduced as well,
but 300 is actually quite manageable in NiFi with Process Groups, coupled
with the search bar feature (upper right).

I think Joe W said it best, but I just wanted to confirm that flow
reduction is really something that happens in practice.
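To make the RouteOnAttribute idea above concrete, here is a plain-Python sketch of the pattern - one generic routing step inspecting flowfile attributes in place of several near-identical flows. This is an illustration, not NiFi code: `mime.type` and `filename` follow NiFi's attribute conventions, but the routing rules themselves are made up for the example.

```python
# Illustration (plain Python, not NiFi code) of the RouteOnAttribute idea:
# one generic flow inspects attributes on each piece of data and routes it,
# instead of building a separate near-identical flow per dataset.

def route_on_attribute(flowfile_attributes):
    """Return the downstream route for a flowfile based on its attributes,
    the way RouteOnAttribute matches user-defined expressions."""
    if flowfile_attributes.get("mime.type") == "application/gzip":
        return "needs_decompression"
    if flowfile_attributes.get("filename", "").endswith(".csv"):
        return "delimited"
    return "unmatched"

# Two datasets that would otherwise need two separate flows:
dataset1 = {"filename": "orders.csv", "mime.type": "text/csv"}
dataset2 = {"filename": "logs.gz", "mime.type": "application/gzip"}

print(route_on_attribute(dataset1))  # delimited
print(route_on_attribute(dataset2))  # needs_decompression
```

Adding a new dataset then means adding one routing rule, not one more copy of the whole flow.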

On Fri, Jan 2, 2015 at 3:05 PM, Joe Witt <[hidden email]> wrote:

> Great!
>
> "more documents ..."
>
> Oh yeah - we're working hard on that too.  The current user guide draft can
> be found here:
>
> http://nifi.incubator.apache.org/docs/nifi-docs/user-guide.html
>
> And if you build the latest 'develop' branch, that guide is integrated into
> the app, as well as the initial stab at our expression language.  We're still
> a ways behind the ball on docs though and it is a major focus area.  We will
> get examples out as well.  That is one really nice thing about our
> 'Templates' feature.  Examples can be easily imported too.
>
> Thanks and have a great weekend
>
> Joe
>
> On Fri, Jan 2, 2015 at 3:02 PM, Edenfield, Orrin <[hidden email]
> >
> wrote:
>
> > Joe,
> >
> > Thank you for taking the time to detail this out for me.  This is a
> > different way of thinking for me but I think I'm starting to get it.  I
> > work with a data factory that uses an ETL tool that would take about 300
> to
> > 600 individual flows (closer to the 300 side if we can
> parameterize/re-use
> > pieces of flows) and would literally be thousands of processors - if we
> > solved it the same way we solve with traditional ETL tools.
> >
> > I'll try to think some more over the weekend but you're probably right
> > that with the full use of these components that could be quickly
> compacted
> > into a much smaller footprint when it comes to actual needed data flow.
> >
> > I know things are still getting started here with incubation but if there
> > are any documents/more examples I can read up on when it comes to things
> > like Process Groups - I think that would help me see if I can fully wrap
> my
> > head around this when it comes to applying this to my world.  :-)
> >
> > And just let me know if there is anything I can do to help - I'm excited
> > about the possibilities of this tool!
> >
> > Thank you.
> >
> > --
> > Orrin Edenfield
> > Associate Architect - PRGX USA, Inc.
> > [hidden email]
> >
> > -----Original Message-----
> > From: Joe Witt [mailto:[hidden email]]
> > Sent: Friday, January 02, 2015 2:26 PM
> > To: [hidden email]
> > Subject: Re: Multiple dataflows with sub-flows and version control
> >
> > Orrin,
> >
> > You definitely bring up a good point.  I believe though the point is
> about
> > the inherent complexity that exists when you have large-scale dataflows,
> and
> > a large number of them at that.
> >
> > What NiFi allows you to do is manage the complexity visually, in
> > real-time, and all across the desired spectrum of granularity.  One
> > potentially convenient way to think about it is this:
> >
> > When you're writing code and you identify a new abstraction that would
> > make things cleaner and more logical you start to refactor.  You do this
> to
> > make your code more elegant, more efficient, more maintainable and to
> > manage complexity.  In NiFi you do exactly that.  As you're growing
> toward
> > hundreds or thousands of processors you identify patterns that reveal
> > themselves visually.  That is a great way to communicate concepts not
> just
> > for the original author but for others as well.  As you build flows bad
> > ideas tend to become obvious and more importantly easy to deal with.  The
> > key thing though is that you don't have long arduous off-line improvement
> > cycles which tend to cause folks to avoid solving the root problem and
> thus
> > they accrue tech debt.  With NiFi you just start making improvements to
> the
> > flow while everything is running.  You get immediate feedback on whether
> > what you're doing is correct or not.  You can experiment in production
> but
> > outside the production flow if necessary by doing a super efficient tee
> of
> > the flow.  It really is a very different way of approaching a very old
> > problem.
> >
> > It's cool that you're seeing ETL cases for it.  If there are details of
> > that which you can share we'd love to hear them.  I don't know if the
> sweet
> > spot is there or not.  We'll have to see what the community finds and how
> > that evolves over time.  I will say for new NiFi users it is extremely
> > common for them to think of a bunch of independent dataflow graphs which
> > are basically a lot of independent linear graphs.  Then over time as they
> > start to understand more about what it enables they start thinking in
> > directed graphs and how to merge flows and establish reusable components
> > and so on.  Curious to see how that maps to your experience.
> >
> > As for the check-in of the flow configuration to a source control system
> > you can certainly do that.  You could programmatically invoke our
> endpoint
> > which causes NiFi to make a backup of the flow and then put that in
> source
> > control on some time interval.  But keep in mind that is just like
> taking a
> > picture of what the 'flow looks like'.  NiFi is more than the picture of
> > the flow.  It is the picture of the flow and the state of the data within
> > it and so on.
> >
> > Very interested to hear more of your thoughts as you look further and
> > think more about it.  You'll be a great help to us to better understand
> how
> > to communicate about it to folks coming from an ETL background.
> Ultimately
> > it
> > would be great if we get you to help us do that with us ;-)   Please
> don't
> > be shy letting us know your expectations.  We're new here too.
> >
> > Thanks
> > Joe
> >
> >
> >



--
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*

Re: Multiple dataflows with sub-flows and version control

Edenfield, Orrin
Mark, Joe, & Joe - Thank you for the info and examples - I will still need to re-read and digest some of this, but it really helps me get a better idea of what is possible here.

I will keep thinking and email again if I have any other thoughts/questions/ideas.

Really cool stuff here!

Orrin Edenfield

> On Jan 2, 2015, at 3:33 PM, Joe Gresock <[hidden email]> wrote:

RE: Multiple dataflows with sub-flows and version control

Ralph.Spangler
In reply to this post by Edenfield, Orrin
NiFi is very capable of running multiple dataflows; in fact, I have used it this way for some time. Whether it is getting data from a source, doing some processing on it, and sending it to multiple destinations; sending data from multiple sources to a single destination; or having multiple flows defined at one time - it is quite flexible. As for version control, I have not played with that, but the version I last used had only a limited capability.

If you want to create a new flow but keep the old one running, just instantiate another flow in parallel; NiFi doesn't really care that they are the same.

To answer 1: yes, there are capabilities (at least in the version I used) to test the data and alter the flow based on its type. 2 is answered above. Also, creating additional processors is not that difficult if you have the documentation. We actually created ours by reverse engineering existing NiFi processors; this would be another way to solve your issues.

Ralph Spangler

RE: Multiple dataflows with sub-flows and version control

Edenfield, Orrin
Thank you everyone for the help with explaining the logical approach to multiple flows that I'll need to take - since it is different from the "multiple ETL job" history I'm accustomed to.  I think I'm starting to understand - and this can work similarly to the ETL tools I'm familiar with (Informatica, Sterling Integrator, Talend, Pentaho, etc.).

I'm still trying to start very simply with compressing the input data before landing it in HDFS - so I've set up multiple flows between the GetFile and the PutHDFS and this seems to be working as I expect (see attached screenshot).

I will need to think some more about how this can be used with our existing ETL pipelines, but I think IdentifyMimeType, EvaluateRegularExpression, HashContent, MonitorActivity, ReplaceTextWithMapping, RouteOnContent, and even SegmentContent may get us a long way.
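The decompress-before-landing step discussed earlier in the thread (GetFile -> IdentifyMimeType -> CompressContent set to decompress -> PutHDFS) can be sketched in plain Python. This is an illustration of the logic only, not NiFi code - the gzip magic-number sniff stands in for IdentifyMimeType:

```python
import gzip

def prepare_for_hdfs(raw: bytes) -> bytes:
    """Mimic IdentifyMimeType -> CompressContent (decompress) in plain
    Python: sniff the gzip magic bytes and decompress only when needed,
    so compressed and uncompressed inputs land in HDFS uniformly."""
    if raw[:2] == b"\x1f\x8b":          # gzip magic number
        return gzip.decompress(raw)
    return raw

plain = b"id,amount\n1,9.99\n"
compressed = gzip.compress(plain)

assert prepare_for_hdfs(plain) == plain        # passed through untouched
assert prepare_for_hdfs(compressed) == plain   # decompressed first
```

Both datasets then share a single landing path, which is the "one flow, many inputs" shape the thread describes.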

Thank you.

--
Orrin Edenfield

-----Original Message-----
From: [hidden email] [mailto:[hidden email]]
Sent: Monday, January 05, 2015 10:07 AM
To: [hidden email]
Subject: RE: Multiple dataflows with sub-flows and version control


RE: Multiple dataflows with sub-flows and version control

Mark Payne
In reply to this post by Joe Witt
All,
I responded to this e-mail thread on the 2nd, but the response included a lot of screenshots, and I got a failure notice back from the dev mailing list because the size exceeded 1 million bytes (though I believe the original author of this thread did receive the e-mail, since I also sent it directly to him). So I have formalized my response into a post on the NiFi blog page: https://blogs.apache.org/nifi/entry/basic_dataflow_design
Just wanted to include this for anyone who is interested in this thread.
Thanks,
-Mark


> Date: Fri, 2 Jan 2015 14:25:49 -0500
> Subject: Re: Multiple dataflows with sub-flows and version control
> From: [hidden email]
> To: [hidden email]
>
> Orrin,
>
> You definitely bring up a good point.  I believe, though, the point is about
> the inherent complexity that exists when you have large-scale dataflows, and
> a large number of them at that.
>
> What NiFi allows you to do is manage the complexity visually, in real-time,
> and all across the desired spectrum of granularity.  One potentially
> convenient way to think about it is this:
>
> When you're writing code and you identify a new abstraction that would make
> things cleaner and more logical you start to refactor.  You do this to make
> your code more elegant, more efficient, more maintainable and to manage
> complexity.  In NiFi you do exactly that.  As you're growing toward
> hundreds or thousands of processors you identify patterns that reveal
> themselves visually.  That is a great way to communicate concepts not just
> for the original author but for others as well.  As you build flows bad
> ideas tend to become obvious and more importantly easy to deal with.  The
> key thing though is that you don't have long arduous off-line improvement
> cycles which tend to cause folks to avoid solving the root problem and thus
> they accrue tech debt.  With NiFi you just start making improvements to the
> flow while everything is running.  You get immediate feedback on whether
> what you're doing is correct or not.  You can experiment in production but
> outside the production flow if necessary by doing a super efficient tee of
> the flow.  It really is a very different way of approaching a very old
> problem.
>
> It's cool that you're seeing ETL cases for it.  If there are details of
> that which you can share we'd love to hear them.  I don't know if the sweet
> spot is there or not.  We'll have to see what the community finds and how
> that evolves over time.  I will say for new NiFi users it is extremely
> common for them to think of a bunch of independent dataflow graphs which
> are basically a lot of independent linear graphs.  Then over time as they
> start to understand more about what it enables they start thinking in
> directed graphs and how to merge flows and establish reusable components
> and so on.  Curious to see how that maps to your experience.
>
> As for the check-in of the flow configuration to a source control system
> you can certainly do that.  You could programmatically invoke our endpoint
> which causes NiFi to make a backup of the flow and then put that in source
> control on some time interval.  But keep in mind that is just like taking a
> picture of what the 'flow looks like'.  NiFi is more than the picture of
> the flow.  It is the picture of the flow and the state of the data within
> it and so on.
>
> Very interested to hear more of your thoughts as you look further and think
> more about it.  You'll be a great help to us to better understand how to
> communicate about it to folks coming from an ETL background.  Ultimately it
> would be great if we get you to help us do that with us ;-)   Please don't
> be shy letting us know your expectations.  We're new here too.
>
> Thanks
> Joe
>
>
>
> On Fri, Jan 2, 2015 at 1:50 PM, Edenfield, Orrin <[hidden email]>
> wrote:
>
> > Mark,
> >
> > I follow the logic here I just think over time it will be really hard to
> > keep track of things when there are hundreds (or thousands) of processors -
> > rather than hundreds of different flows (organized within a source control
> > tree or similar) that all have 5-50 different processors within them.
> >
> > I'd be interested to learn about how the Process Groups component works so
> > if you do get time to draw an example I think that would be helpful.
> >
> > Thank you.
> >
> > --
> > Orrin Edenfield
> >
> > -----Original Message-----
> > From: Mark Payne [mailto:[hidden email]]
> > Sent: Friday, January 02, 2015 12:34 PM
> > To: [hidden email]; Edenfield, Orrin
> > Subject: Re: Multiple dataflows with sub-flows and version control
> >
> > Orrin,
> >
> > Within NiFi you can create many different dataflows within the same graph
> > and run them concurrently. We've built flows with several hundred
> > Processors. The data can flow between flows by simply connecting the
> > Processors together.
> >
> > If you want to separate the flows logically because it makes more sense to
> > you to visualize them that way, you may want to use Process Groups.
> >
> > I'm on my cell phone right now so I cannot draw up an example for you but
> > I will this afternoon when I have a chance. But the basic idea is that for
> > #1 you would have:
> >
> > GetFile -> PutHDFS
> >
> > And alongside that, another GetFile -> CompressContent -> the same PutHDFS.
> >
> > In this case, you can even handle both with a single flow:
> >
> > GetFile -> IdentifyMimeType (to check if compressed) -> CompressContent
> > (set to decompress, with the compression type coming from the mime type
> > identified by the previous processor) -> PutHDFS
> >
> > With regards to #2:
> > You can build the new flow right alongside the old flow. When you are
> > ready to switch, simply change the connection to send data to the new flow
> > instead of the old one.
> >
> > Again, I'll put together some examples this afternoon with screen shots
> > that should help. Let me know if this helps or if it creates more questions
> > (or both :))
> >
> > Thanks
> > -Mark
> >
> >
> >
> > Sent from my iPhone
> >
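Joe's point above about programmatically checking the flow configuration into source control could be sketched as a simple snapshot step run on an interval (the `conf/flow.xml.gz` file name reflects NiFi's on-disk flow store, but the archive layout, function name, and scheduling here are assumptions, not NiFi's own backup endpoint):

```python
import shutil
import time
from pathlib import Path

def snapshot_flow(flow_file: str, archive_dir: str) -> Path:
    """Copy the current flow definition (e.g. conf/flow.xml.gz) into a
    timestamped archive directory. Committing archive_dir to a source
    control system on a schedule would give the flow.v1 / flow.v2
    history Orrin asked about -- keeping in mind Joe's caveat that this
    captures only the picture of the flow, not the state of the data
    within it."""
    src = Path(flow_file)
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{stamp}-{src.name}"
    shutil.copy2(src, dest)  # copy2 preserves the file's timestamps
    return dest
```

A cron job (or NiFi's own scheduling) could call this periodically and then run `git add`/`git commit` in the archive directory.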
         

Re: Multiple dataflows with sub-flows and version control

Billie Rinaldi-2
On Mon, Jan 12, 2015 at 5:57 AM, Mark Payne <[hidden email]> wrote:

> All,
> I responded to this e-mail thread on the 2nd, but the response included a
> lot of screenshots, and I got a failure notice back from the dev mailing
> list because the size exceeded 1 million bytes


FYI, screenshots / attachments are automatically stripped from messages to
ASF lists, even if the size doesn't exceed a threshold.


> (though I believe the original author of this thread did receive the
> e-mail, since I also sent it directly to him). So I have formalized my
> response into a blog on the NiFi blog page:
> https://blogs.apache.org/nifi/entry/basic_dataflow_design
> Just wanted to include this for anyone who is interested in this thread.
> Thanks,
> -Mark