Routing to Failure relationships and Route provenance events

Routing to Failure relationships and Route provenance events

Michal Klempa
Hi,
I maintain several dataflows and keep running into the following issue in practice.
Let's say I have several points of possible failure within the
dataflow (nearly every processor has a failure output). I route all of
these into my general failure handler subgroup, which basically does
some filtering and formatting before issuing a notification by email.

From my email notifications I get the FlowFile UUID, and in case I am
curious about what happened, I go into NiFi and search the provenance events
for this particular FlowFile.
And here comes the point:
Sometimes it is hard to find out which processor was the first one that
sent the file into the 'failure path'.

Shouldn't a processor that does the 'failure' routing send a
ProvenanceEvent with type
ProvenanceEventType.ROUTE to the FlowFile history, so that the dataflow manager
knows when this unfortunate event happened? Is this a guideline
that processors do not obey?

Or maybe I am doing something wrong when searching the events/history of the FlowFile.

To give a concrete example, let me point out that the PostHTTP
processor never issues any provenance event regarding the failure (nor
does it write any execution details into attributes the way
ExecuteStreamCommand does, for example, where you have execution.error
containing the stderr). So locating the error in PostHTTP is
just a heuristic on my side, and I cannot find any verbose HTTP output
(like curl -v, for example) with headers, the response from the server, or
at least a 'connection timeout' if that is the case...

Thanks for suggestions and opinions.
Michal Klempa

Re: Routing to Failure relationships and Route provenance events

Mark Payne
Michal,

Currently, the guidance that we give is for processors not to emit any sort of ROUTE event for
routing a FlowFile to a 'failure' relationship. While this may seem counter-intuitive, we do this because
most of the time when a FlowFile is routed to 'failure', the failure relationship is not pointing to some
sort of 'failure' flow like you describe here but rather the failure relationship is a self-loop so that the
Processor tries again.

In the scenario described above, if PostHTTP were to route a FlowFile to failure and failure looped back
to PostHTTP, we may see that the FlowFile was routed to failure hundreds (or more) of times. As a result,
the Provenance lineage would not really be very easy to follow because it would be filled with a huge number
of 'ROUTE' events.

That being said, there are things that we could do to be smart about this at the framework level. For instance,
we could notice that the ROUTE event indicates that the FlowFile is being routed back to the same queue that
it came from, so we could just discard the ROUTE event.

Unfortunately, this doesn't always solve the problem, because we also often see scenarios where there is perhaps
a DistributeLoad processor that load balances between 5 different PostHTTP processors for instance. If a PostHTTP
fails, it routes back to the DistributeLoad. So we'd need to keep track of the fact that it's been to this connection before,
even though it wasn't the last connection, and so on.
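
A minimal sketch of the bookkeeping described above, written as a standalone Python model rather than NiFi framework code. The `RouteEvent` class, its fields, and the connection ids are all hypothetical; the only idea taken from the thread is that a ROUTE event can be dropped when the FlowFile has already been in the destination connection, which covers both the self-loop and the DistributeLoad case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteEvent:
    component: str      # processor that routed the FlowFile
    relationship: str   # e.g. "failure"
    connection: str     # hypothetical id of the destination connection


def suppress_repeat_routes(events):
    """Keep only the first ROUTE event into each destination connection."""
    seen = set()
    kept = []
    for event in events:
        if event.connection in seen:
            continue  # the FlowFile has been in this connection before
        seen.add(event.connection)
        kept.append(event)
    return kept


# A PostHTTP whose failure relationship loops back to itself 300 times,
# followed by one failure that is routed to a separate handler.
retries = [RouteEvent("PostHTTP", "failure", "retry-loop")] * 300
events = retries + [RouteEvent("PostHTTP", "failure", "failure-handler")]
kept = suppress_repeat_routes(events)
print(len(events), "->", len(kept))  # 301 -> 2
```

The cost of this approach is the per-FlowFile set of visited connections, which is the tracking burden mentioned above.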

So that was a really long-winded way to say: We intentionally do not emit ROUTE events for 'failure' because it can create
some very complicated, hard-to-follow lineages. But we can - and should - do better.

If this is something that you are interested in digging into in the codebase, the community would be more than happy
to help guide you along the way!

Also, if you have other feedback about how you think we can handle these cases better, please feel free to elaborate on
the thread.

Thanks
-Mark



> On Nov 7, 2016, at 5:46 AM, Michal Klempa <[hidden email]> wrote:


Re: Routing to Failure relationships and Route provenance events

Andy LoPresto-2
Michal,

A temporary solution would be to insert an UpdateAttribute processor between the source processor (where the failure occurred) and your general failure handling flow. This processor could add an attribute noting the location of the failure and you could quickly determine that when debugging. 
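
As a rough illustration of that tagging step, here is a small Python model that treats FlowFile attributes as a plain dictionary. It is not NiFi code: the attribute name `failure.source` and the function name are made up for this sketch, and keeping the first recorded value is a choice of the sketch, not a documented behavior of UpdateAttribute.

```python
def tag_failure_source(attributes, processor_name):
    """Return a copy of the attributes with the failure location recorded.

    'failure.source' is a hypothetical attribute name chosen for this sketch.
    """
    tagged = dict(attributes)
    # Keep the first recorded location so later handlers do not overwrite it.
    tagged.setdefault("failure.source", processor_name)
    return tagged


flowfile_attrs = {"uuid": "hypothetical-uuid", "filename": "report.csv"}
tagged = tag_failure_source(flowfile_attrs, "PostHTTP")
print(tagged["failure.source"])  # PostHTTP
```

One such tagging step would sit on each failure connection, with the processor name set by hand for that connection.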

If this seems cumbersome, you could also put a single ExecuteScript processor at the beginning of your failure handling flow and query the provenance events for the incoming flowfile, detect the last event that occurred, and then write out an additional, arbitrary provenance event indicating the failure. 
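
The "detect the last event" step can be sketched in plain Python, assuming the provenance events have already been retrieved and are available as dictionaries. The field names (`flowfile_uuid`, `timestamp_ms`, `type`, `component`) are assumptions for this sketch and not the layout NiFi uses; how the events are fetched from NiFi is deliberately left out.

```python
def last_event_before_failure(events, flowfile_uuid):
    """Return the most recent event recorded for the given FlowFile, or None."""
    own = [e for e in events if e["flowfile_uuid"] == flowfile_uuid]
    if not own:
        return None
    return max(own, key=lambda e: e["timestamp_ms"])


# Hypothetical, hand-written event records for two FlowFiles.
events = [
    {"flowfile_uuid": "abc", "timestamp_ms": 1000,
     "type": "RECEIVE", "component": "GetFile"},
    {"flowfile_uuid": "abc", "timestamp_ms": 2000,
     "type": "ATTRIBUTES_MODIFIED", "component": "UpdateAttribute"},
    {"flowfile_uuid": "xyz", "timestamp_ms": 3000,
     "type": "SEND", "component": "PostHTTP"},
]
last = last_event_before_failure(events, "abc")
print(last["component"])  # UpdateAttribute
```

Because a processor such as PostHTTP emits no event on failure, the last recorded event points at the last processor that did report something, so the failing processor is likely the one directly downstream of it.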

Neither is an excellent solution, and Mark is right that there should be a better option for diagnosing this. Please submit a Jira capturing your thoughts and we’ll see what is possible.


Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

On Nov 7, 2016, at 6:10 AM, Mark Payne <[hidden email]> wrote:


Re: Routing to Failure relationships and Route provenance events

Michal Klempa
Hi,
thank you both for your responses. I understand that routing failures
back to the processor would cause an endless provenance history. It can also
cause an infinite loop when the destination system is offline, therefore
I am not using this approach in this case.

Generally, I have a problem identifying the 'last processor which routed
the flowfile to failure before entering failure handling'. And yes, I
was thinking of attaching an UpdateAttribute right after each failure
connection I need to handle and distinguish. This would be really
messy. Therefore I was thinking I am doing something wrong in general.

My thought was that once I can identify where the FlowFile escaped
standard execution through failure, I can just save the flowfile
somewhere (e.g. HDFS) with its metadata (attributes) and keep it for
future inspection and, especially, for manually re-entering the flow from
the point of failure. Is this a bad approach? Or how do you design
flows then? Is it possible to programmatically inspect a flowfile to
find the processor which was the last in the chain touching it (even
though this processor did not emit any provenance event at all)? If
so, tell me; I can afford to code my own processor to accomplish this task.

Thanks. Michal.

On Thu, Nov 10, 2016 at 7:50 PM, Andy LoPresto <[hidden email]> wrote:


Re: Routing to Failure relationships and Route provenance events

Jeff
Hello Michal,

On a previous project, I worked on a flow with hundreds of processors that
handled thousands of files per day, and had similar questions when I was
trying to figure out the best way to handle failures
gracefully: to automatically retry processing if
possible, or at least collect enough information at the point of failure to
be able to stash the file somewhere and notify my team members of the
failure so someone could manually correct the issue and then resubmit the
flowfile.

What eventually was implemented consisted of several stages of processing
in which the flow would update an external database to log several aspects
of the stage of processing it was in, and several PutEmail processors
representing the ends of stages.  Failure relationships would be
"aggregated" by the appropriate PutEmail processor for each stage of
processing, which in turn had PutHDFS processors to stash the file at a
location representative of the stage of processing that the flowfile could
not complete.  When there was a failure, we logged that in the external
database, sent an email for notification of the failure, and then placed
the file on HDFS so that someone could fix it and put the flowfile back
into the flow so it could retry that stage.
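
A minimal sketch of that per-stage bookkeeping, written in Python with an in-memory SQLite database standing in for the external database. The table layout, the stage name, and the `/failures/<stage>/...` path scheme are assumptions made for illustration; the email notification and the actual write to HDFS are left out.

```python
import sqlite3
from pathlib import PurePosixPath


def record_failure(db, stage, flowfile_uuid, filename, base_dir="/failures"):
    """Log a failed FlowFile and return the path where it should be stashed."""
    # One directory per stage, so the location itself says which stage failed.
    stash_path = PurePosixPath(base_dir) / stage / f"{flowfile_uuid}_{filename}"
    db.execute(
        "INSERT INTO failures (flowfile_uuid, stage, stash_path) VALUES (?, ?, ?)",
        (flowfile_uuid, stage, str(stash_path)),
    )
    db.commit()
    return str(stash_path)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE failures (flowfile_uuid TEXT, stage TEXT, stash_path TEXT)")
path = record_failure(db, "enrich", "abc", "report.csv")
print(path)  # /failures/enrich/abc_report.csv
```

Someone fixing the file can look up the stage in the table and feed the file back in at the start of that stage.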

I had thought it'd be good for NiFi to track the last processor by which a
flowfile had been processed, but regardless of that, the flow still needs
to be developed to deal with errors in the context in which they happen,
and it takes several iterations of development to get it written that way.
After the flow design had evolved in such a way to be able to handle the
failures gracefully, I found that it was very easy to diagnose where
something went wrong with a flowfile, and I could always go back into Data
Provenance if I needed more detail.

- Jeff

On Thu, Nov 17, 2016 at 4:17 AM Michal Klempa <[hidden email]>
wrote:
