NiFi architecture


NiFi architecture

Jonathan Natkins
Hi there,

I was curious if there exist any resources that would be helpful in
understanding the NiFi architecture. I'm trying to understand how dataflows
are executed, or how I would scale the system. Are there any architectural
docs, or blog posts, or academic papers out there that would be helpful?

Alternatively, some pointers into the code base as to where the execution
layer code lives could be helpful.

Thanks!
Natty

Jonathan "Natty" Natkins
StreamSets | Customer Engagement Engineer
mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>

Re: NiFi architecture

Joe Witt
Natty

There are very few existing resources as of yet, but we fully recognize that
this is a problem.

https://issues.apache.org/jira/browse/NIFI-162

If there are specific examples of architectural descriptions that you think
are well done I'd love to see them.

The very brief version of how execution and scale work:

Execution:
NiFi runs within the JVM.  As data flows through a given NiFi instance,
there are two primary repositories we keep that hold key information
about the data.  One is known as the FlowFile repository, and its
job is to keep metadata about the data in the flow.  The other is the
content repository, and it keeps the actual data.  In NiFi you compose
directed graphs of processors.  Each processor is scheduled to run
according to its configured scheduling strategy and is given time to run
by a flow controller/thread pool.  When a given processor runs, it is
given access to the FlowFile repository and content repository as
necessary to access and modify the data in a safe and efficient manner.

Out of the box, the FlowFile repository can be all in-memory or run off a
write-ahead-log-based implementation with high reliability and throughput.
The content repository likewise supports an all in-memory mode or the use
of one or more disks in parallel, again yielding very high throughput with
excellent durability.
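To make the split between the two repositories concrete, here is a minimal conceptual sketch (not the real NiFi API; all class and method names here are made up for illustration): a FlowFile record is just attributes plus a pointer ("claim") into a separate content store, so attribute updates never touch the bytes.

```python
# Conceptual sketch of the two-repository model. Hypothetical names,
# not NiFi code.

class ContentRepository:
    """Holds the actual bytes, addressed by an opaque claim id."""
    def __init__(self):
        self._store = {}
        self._next_id = 0

    def write(self, data: bytes) -> int:
        claim = self._next_id
        self._next_id += 1
        self._store[claim] = data
        return claim

    def read(self, claim: int) -> bytes:
        return self._store[claim]


class FlowFileRepository:
    """Holds only metadata about each piece of data in the flow."""
    def __init__(self):
        self._flowfiles = {}

    def create(self, flowfile_id: str, attributes: dict, content_claim: int):
        self._flowfiles[flowfile_id] = {
            "attributes": attributes,
            "content_claim": content_claim,
        }

    def get(self, flowfile_id: str) -> dict:
        return self._flowfiles[flowfile_id]


# A processor can update attributes cheaply without touching content,
# or write new content and repoint the claim.
content_repo = ContentRepository()
ff_repo = FlowFileRepository()

claim = content_repo.write(b"hello, nifi")
ff_repo.create("ff-1", {"filename": "greeting.txt"}, claim)

ff = ff_repo.get("ff-1")
print(content_repo.read(ff["content_claim"]))  # b'hello, nifi'
```

The real repositories add durability (the write-ahead log, disk-backed content claims) on top of this separation, but the metadata-vs-content split is the core idea.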

Scale:
Vertical: Supports highly concurrent processing and can utilize multiple
physical disks in parallel.
Horizontal: Supports clustering whereby a cluster manager relays commands
to nodes in the cluster and coordinates all their responses.  Nodes then
operate as they would if they were standalone.

Lots more coming here of course but if you have specific questions now
please feel free to fire away.

Thanks
Joe





Re: NiFi architecture

Jonathan Natkins
Hey Joe,

This is really helpful. In terms of examples of good architectural
descriptions, I think the Kafka overview is pretty great (I think a lot of
it came from the original academic paper). It's very helpful for
understanding the key concepts and design trade-offs. My personal feeling
is that diagrams are very helpful: my guess is that the single-node
processing layer is not all that complex, but where architecture gets
interesting (and where a lot of my curiosity lies) is once you get into the
distributed modes. How is fault-tolerance handled, how do I specify which
processors operate on which nodes, how is cluster membership handled, etc.

Also: what does NAR stand for?

Thanks!
Natty

Jonathan "Natty" Natkins
StreamSets | Customer Engagement Engineer
mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>



Re: NiFi architecture

Sean Busbey
On Tue, Dec 16, 2014 at 10:04 PM, Jonathan Natkins <[hidden email]> wrote:

> Also: what does NAR stand for?
Nifi ARchive.

short version: dependency classloader isolation for processors (or
processor groups)

long version: this thread; Mark has a good start: http://s.apache.org/Ur9


--
Sean

Re: NiFi architecture

Joe Witt
In reply to this post by Jonathan Natkins
Jonathan

NAR stands for "NiFi Archive," and it is simply our approach to classloader
isolation.  In Maven you can build these NARs, and they become bundles of
processors/services/whatever NiFi extension you want, which you can then
include in a build à la carte.  The framework itself is even a NAR.  We did
all this first to reduce the amount of stuff (and thus possible conflicts)
on the system classloader, and second so that different bundles of
capabilities could have different, and sometimes conflicting, dependencies
without any problem.
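The isolation idea can be sketched in a few lines (hypothetical names, not NiFi or JVM classloader code): each bundle resolves a dependency in its own namespace first and only falls back to the shared parent, so two bundles can carry conflicting versions of the same library without clashing.

```python
# Conceptual sketch of NAR-style dependency isolation. Hypothetical
# names; the real mechanism is JVM classloaders, not dicts.

class Bundle:
    def __init__(self, name, local_deps, parent):
        self.name = name
        self._local = local_deps   # bundle-private dependency versions
        self._parent = parent      # shared, framework-level dependencies

    def resolve(self, dep_name):
        # Child-first lookup: prefer the bundle's own copy, then fall
        # back to the shared parent.
        if dep_name in self._local:
            return self._local[dep_name]
        return self._parent[dep_name]


shared = {"commons-lang": "2.6"}
a = Bundle("bundle-a", {"jackson": "1.9"}, shared)
b = Bundle("bundle-b", {"jackson": "2.4"}, shared)

print(a.resolve("jackson"))       # 1.9
print(b.resolve("jackson"))       # 2.4 -- no conflict with bundle-a
print(a.resolve("commons-lang"))  # 2.6 -- shared fallback
```

In the JVM this is done with per-NAR classloaders that keep each bundle's jars off the system classpath, which is what allows the conflicting-versions case to "just work."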

I am with you on the distributed side being where things get more
interesting.  Processors execute on nodes, and all nodes execute all
processors.  They only fire if they have data, though.  So you either
interface with protocols that inherently offer load balancing, or you
ensure that NiFi is pulling the data, in which case it will load-balance
by its nature.  We then include things like back-pressure and
congestion-avoidance functions, which means that even if load balancing
isn't working, nodes can still 'back off' and other nodes naturally pick
up the slack.  This helps address the natural hot-spotting that tends to
occur in dataflow and data processing.
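The back-off behavior can be sketched simply (hypothetical names, not NiFi code): a bounded queue on each node provides back-pressure, and a congested node refusing new data is what lets the other nodes pick up the slack.

```python
# Conceptual sketch of back-pressure-driven load spreading.
# Hypothetical names, not NiFi code.

class Node:
    def __init__(self, name, capacity):
        self.name = name
        self.queue = []
        self.capacity = capacity  # back-pressure threshold

    def offer(self, item) -> bool:
        if len(self.queue) >= self.capacity:
            return False  # congested: signal the sender to back off
        self.queue.append(item)
        return True


def distribute(item, nodes):
    # Naive load balancing: the first node that accepts the item wins.
    for node in nodes:
        if node.offer(item):
            return node.name
    raise RuntimeError("all nodes congested")


nodes = [Node("node-1", capacity=2), Node("node-2", capacity=2)]
placements = [distribute(i, nodes) for i in range(4)]
print(placements)  # ['node-1', 'node-1', 'node-2', 'node-2']
```

Even with this deliberately dumb sender, node-1 filling up pushes the remaining work to node-2; real balancing protocols just make the spread smoother.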

Fault tolerance:  If a node dies, the data on that node at the time is 'as
dead as the node'.  Any new data will be routed around that node in the
case of producers pushing to NiFi, or other nodes will take over the
pulling load in the case of NiFi pulling from those sources.  The cluster
of NiFi nodes is managed by a thing called the 'NiFi Cluster Manager
(NCM)'.  As long as it is up, you can command and control all the nodes in
the cluster.  If the NCM is dead, the nodes all keep rolling based on what
they knew last.  You then also have to be concerned about network
partitioning.  Each node is always heartbeating to the NCM.  If the NCM
doesn't get a heartbeat within a reasonable period of time, it will mark
that node as disconnected.  When any node in the cluster is disconnected,
dataflow changes cannot be made.  This prevents the flow from being put
into a bad state as a result of the partitioning event.  If you know that
a node is not just isolated but is actually 'dead', 'gone for a while',
whatever, then you can delete it from the cluster, and then you'll be able
to make changes to the flow again.

As new nodes join the cluster, the first thing that happens is the typical
'am I authorized to be here' check.  Once through that, the NCM sends a
copy of the latest flow configuration to the node, at which point the node
loads the flow and begins its journey.

I'm glossing over a lot of the fun details, of course, but we will
obviously dig further here; if there are any tracks you want to go deeper
on sooner, please advise.

Thanks
Joe



Re: NiFi architecture

Jonathan Natkins
Awesome, all makes sense, and I think that answers my immediate questions.
Thanks guys!

Jonathan "Natty" Natkins
StreamSets | Customer Engagement Engineer
mobile: 609.577.1600 | linkedin <http://www.linkedin.com/in/nattyice>

