Dynamic attributes on repeating capture groups

wildo
I am parsing a text file with semi-formatted data. I use the SplitContent processor to get each "block" of data (basically each object), and then the ExtractText processor to get all the individual properties associated with that object.

So I might have a flowfile with data:

foo1: bar1
foo2: bar2
foo3: bar3

When I push that flowfile into the ExtractText processor (which has "Enable repeating capture group" set to true) and run my capture group regex on it, I'm left with dynamic, enumerated attributes. If I call my dynamic attribute "fields" then my flowfile attributes will be:

fields.1 --> foo1
fields.2 --> bar1
fields.3 --> foo2
fields.4 --> bar2
fields.5 --> foo3
fields.6 --> bar3

Now I'd like to push this flowfile into an UpdateAttribute and turn my dynamic list into real, accessible attributes. The intention would then be to go into an AttributesToJSON processor.

The question is: how do I configure the UpdateAttribute processor to have a new attribute of "foo1" using "fields.n" with a value of "bar1" using "fields.n+1" and then dynamically add the remaining pairs as well?

It seems like the ability to store repeating capture groups as attributes must imply that there's a way to actually USE those attributes dynamically. What am I missing here?

James Wing
I think you will need to use an ExecuteScript processor to reshape your
data into attributes following the foo1=bar1 pattern.  I'm assuming here
that you cannot predict what the key names 'foo1', 'foo2', etc. will
actually be.  If you could predict those field names, an UpdateAttribute
processor with a brute-force list of keys might be OK, provided you can
handle the misses downstream using expressions with isEmpty() or something
similar.

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.3.0/org.apache.nifi.processors.script.ExecuteScript/index.html
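
A minimal sketch of the pairing step such a script would need, in Python. ExecuteScript's Python engine is Jython, so the code sticks to Python 2.7-compatible syntax. The function name `pair_fields` and its `prefix` parameter are made up for illustration. It assumes the attributes follow the fields.1, fields.2, ... layout from the original post, with keys in the odd slots and values in the even ones.

```python
def pair_fields(attributes, prefix="fields"):
    """Pair up prefix.1/prefix.2, prefix.3/prefix.4, ... into a dict.

    `attributes` is any mapping with a .get() method (a plain dict here).
    Stops at the first missing key or value.
    """
    pairs = {}
    n = 1
    while True:
        key = attributes.get("%s.%d" % (prefix, n))
        value = attributes.get("%s.%d" % (prefix, n + 1))
        if key is None or value is None:
            break
        pairs[key] = value
        n += 2
    return pairs


# Attributes as ExtractText left them in the original post.
example = {
    "fields.1": "foo1", "fields.2": "bar1",
    "fields.3": "foo2", "fields.4": "bar2",
    "fields.5": "foo3", "fields.6": "bar3",
}
new_attributes = pair_fields(example)
```

Inside ExecuteScript the mapping would presumably come from flowFile.getAttributes(), and each resulting pair would be written back with session.putAttribute(flowFile, key, value), keeping the flowfile reference that call returns.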

Thanks,

James

On Mon, Jul 31, 2017 at 1:26 PM, wildo wrote:

> The question is: how do I configure the UpdateAttribute processor to have a
> new attribute of "foo1" using "fields.n" with a value of "bar1" using
> "fields.n+1" and then dynamically add the remaining pairs as well?

wildo
James, thanks for the comment. I've been hacking on this all day and think I have the Python written to do what I need. I have just one follow-up question as I'm new to the ExecuteScript processor.


What do I do on an exception? Some of the cookbook example scripts seem to print the traceback:

except:
    traceback.print_exc(file=sys.stdout)
    raise


...but I'm curious what you should do in a production environment. Instead of raising the exception, should you instead transfer the session to REL_FAILURE? Or maybe raising the exception automatically transfers to REL_FAILURE?

Also, I'm guessing that in the real world you'd probably write that exception to some logger rather than stdout, correct?

James Wing
An exception thrown in your script and not caught will cause the process
session to roll back, meaning that the flowfile will go back to the
input queue and be retried.  If you have the processor's run schedule set
to 0 seconds, it might retry a lot :).  You are correct that routing to
REL_FAILURE is usually better than retrying, but I suppose it depends on
the particular flow.

ExecuteScript provides a local script variable, 'log', which you can use to
log messages to the regular NiFi log.  The messages get sent to the same
log files, following the rules outlined in conf/logback.xml, and will show
up in the UI as bulletins if the severity deserves it.  So you can do

log.error("Something bad happened")

And it will both log and display as you would expect a processor error to
display.  It is an instance of ComponentLog, so there are also debug(),
info(), warn(), etc.
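
The pattern described above can be sketched roughly as follows. In a real ExecuteScript script, `session`, `log`, `REL_SUCCESS` and `REL_FAILURE` are provided as script variables; here they are passed in as parameters so the function can be tried outside NiFi. The names `route_flowfile` and `compute_attributes` are hypothetical, and the syntax is kept Python 2.7-compatible because the Python engine is Jython.

```python
def route_flowfile(session, log, rel_success, rel_failure, compute_attributes):
    """Process one flowfile; on error, log and route to failure."""
    flow_file = session.get()
    if flow_file is None:
        return  # nothing queued for this run
    try:
        new_attributes = compute_attributes(flow_file.getAttributes())
        for name, value in new_attributes.items():
            # putAttribute returns an updated flowfile; keep the newest reference.
            flow_file = session.putAttribute(flow_file, name, value)
        session.transfer(flow_file, rel_success)
    except Exception as e:
        # Catches Python-level errors. Assumption: Java exceptions raised by
        # NiFi itself may need their own handling under Jython.
        log.error("Script failed: %s" % e)
        session.transfer(flow_file, rel_failure)
```

In the script body the call would look something like route_flowfile(session, log, REL_SUCCESS, REL_FAILURE, my_function), where my_function is your own code that turns the attribute map into the new attributes. Because the exception is caught, the session is not rolled back and the flowfile leaves through the failure relationship instead of being retried.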

James

On Wed, Aug 2, 2017 at 12:22 PM, wildo wrote:

> ...but I'm curious what you should do in a production environment. Instead
> of raising the exception, should you instead transfer the session to
> REL_FAILURE? Or maybe raising the exception automatically transfers to
> REL_FAILURE?
>
> Also, I'm guessing that in the real world you'd probably write that
> exception to some logger rather than stdout, correct?

wildo
Fantastic! Thanks for the info!

wildo
James, one last question. Does ExecuteScript support clustered NiFi instances? I keep getting this exception ONLY from the second node (2 of 2) in our cluster:

"StandardFlowFileRecord ... already in use for active callback or an InputStream created by ProcessSession.read(FlowFile) has not been closed in <script>"

It sure seems to me that I have TWO flowfiles, one for each node in the cluster, and therefore I don't know why the session object for the ExecuteScript processor would be sharing a flowfile between the two nodes. I think my understanding of this is limited.

James Wing
ExecuteScript will certainly support a clustered NiFi environment.  Is the
processor actually processing files on both nodes, but only failing on
one?  Is it possible that the cluster's flow routes files through the
processor only on one node?

Also, can you share your script?  It sounds like you have an
InputStreamCallback not getting closed, or maybe an out of date flowfile
reference.
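
For reference, a rough sketch of reading flowfile content in a way that avoids the first of those problems: all stream access happens inside the callback's process() method, and nothing holds on to the stream afterwards. The names `ReadAllText`, `stream_to_text` and `read_content` are hypothetical. The fallback branch exists only so the sketch can run under plain Python; inside NiFi the Jython imports would be used instead, and that path is an untested assumption here.

```python
try:
    # Inside NiFi's Jython engine these Java classes are importable.
    from org.apache.nifi.processor.io import InputStreamCallback
    from org.apache.commons.io import IOUtils
    from java.nio.charset import StandardCharsets

    def stream_to_text(stream):
        return IOUtils.toString(stream, StandardCharsets.UTF_8)
except ImportError:
    # Stand-ins for trying the sketch outside NiFi (assumption: the
    # stream is a Python file-like object yielding bytes).
    InputStreamCallback = object

    def stream_to_text(stream):
        return stream.read().decode("utf-8")


class ReadAllText(InputStreamCallback):
    """Reads the whole stream inside process() and keeps only the text."""

    def __init__(self):
        self.text = None

    def process(self, input_stream):
        # Do not store input_stream itself; use it only within this method.
        self.text = stream_to_text(input_stream)


def read_content(session, flow_file):
    callback = ReadAllText()
    session.read(flow_file, callback)  # returns once process() has finished
    return callback.text
```

As for the out-of-date reference: calls such as session.putAttribute() and session.write() return a new flowfile reference, so later calls in the script should use the returned one rather than the original.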

On Thu, Aug 3, 2017 at 6:54 AM, wildo wrote:

> James, one last question. Does ExecuteScript support clustered NiFi
> instances? I keep getting this exception ONLY from the second node (2 of 2)
> in our cluster:
>
> "StandardFlowFileRecord ... already in use for active callback or an
> InputStream created by ProcessSession.read(FlowFile) has not been closed in
> <script>"

wildo
Sorry for the lack of response. The client decided to go a different route. I was able to get a different script running for a different application, so I think I more or less have this worked out. Thanks again!