CsvToAttributes processor


CsvToAttributes processor

François Prunier
Hello NiFi folks,

I've built a processor that parses CSV files with headers and turns each
line into a flowfile. Each resulting flowfile has one attribute per
column: the attribute is named after the column and holds that line's
value for it.

For example, this CSV file:

col1,col2,col3
a,b,c
d,e,f

would generate two flowfiles with the following attributes:

col1 = a
col2 = b
col3 = c

and

col1 = d
col2 = e
col3 = f

As of now, you can configure the charset plus the delimiter, quote, and
escape characters. It's based on the commons-csv parser.

It's very handy if you want to, for example, index a CSV file into
Elasticsearch.
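To illustrate the behavior, here is a standalone Python sketch using the stdlib csv module (not the processor's actual Java/commons-csv code):

```python
import csv
import io

# Illustrative sketch only: mimic CsvToAttributes' per-line behavior by
# turning each data line of a header-bearing CSV into a dict of
# column-name -> value, one dict per would-be flowfile.
def csv_to_attribute_maps(text, delimiter=",", quotechar='"'):
    reader = csv.DictReader(io.StringIO(text),
                            delimiter=delimiter, quotechar=quotechar)
    return [dict(row) for row in reader]

maps = csv_to_attribute_maps("col1,col2,col3\na,b,c\nd,e,f\n")
# maps[0] is {'col1': 'a', 'col2': 'b', 'col3': 'c'}
# maps[1] is {'col1': 'd', 'col2': 'e', 'col3': 'f'}
```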

Would you guys be interested in a pull request to add this processor to
the main code base? It needs a bit more documentation and cleanup, which
I would add, but it's already used successfully in production.

Best regards,
--
*François Prunier*
*Hurence* - /Your Big Data experts/
http://www.hurence.com
*mobile:* +33 6 38 68 60 50


Re: CsvToAttributes processor

Uwe Geercken
Francois,

very nice. Thanks.

I worked on a simple version a while ago, but it had a different scope: I wanted a NiFi processor to merge CSV data with a template from a template engine (e.g. Apache Velocity). I will review my code and have a look at your processor.

Where can we get it? Github?

Rgds,

Uwe

> Sent: Wednesday, 19 October 2016 at 11:10
> From: "François Prunier" <[hidden email]>
> To: [hidden email]
> Subject: CsvToAttributes processor

Re: CsvToAttributes processor

Joe Witt
Francois

Thanks for starting the discussion; this is indeed the type of thing
people would find helpful. One thing I'd want to flag with this
approach is the impact it will have on performance at higher rates.
We're starting to see people want to do this more and more: taking the
content of a flowfile and turning it into attributes. This can put a
lot of pressure on the heap and garbage collection, and is best
avoided if you want to achieve sustained high performance.

Keeping the content in its native form, or converting it to another
form, will yield much higher sustained throughput, because we can
stream those things from their underlying storage in the content
repository to their new form in the repository, or to another system,
while only ever having as much in memory as the technique for
operating on them requires. For example, we can compress a 1GB file
while holding only, say, 1KB in memory. But by taking the content and
turning it into attributes, the flowfile object (not its content) will
be in memory most of the time, and this is where problems can occur.

It would be better to have the push to Elasticsearch driven off the
content, though this admittedly introduces a different challenge: what
format of content does it expect? We have some examples of this
pattern now, for instance in our SQL processors, which are built
around a specific data format. But we need to do better and offer
generic or pluggable ways to read record-oriented data from a variety
of formats, so the processors aren't specific to the underlying format
where possible and appropriate. The key is to do this without forcing
some goofy normalization format that would kill performance and make
things more brittle.
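The streaming pattern described above can be sketched in Python (a hedged illustration, not NiFi's actual repository code; the 64 KiB buffer size is an arbitrary choice):

```python
import gzip
import io

# Compress an arbitrarily large source stream while holding only one
# fixed-size chunk in memory at a time, analogous to how NiFi can
# stream a 1GB content claim through a processor with a tiny buffer.
def compress_stream(src, dst, chunk_size=64 * 1024):
    with gzip.GzipFile(fileobj=dst, mode="wb") as gz:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            gz.write(chunk)  # at most chunk_size bytes of payload resident

src = io.BytesIO(b"x" * 1_000_000)  # stand-in for a large content claim
dst = io.BytesIO()
compress_stream(src, dst)
```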

So, anyway, I said all that to say: it's great you've offered to
contribute it, and I think you certainly should. We should just take
care to document its intended use and the performance limitations to
consider, and enable it to limit how many columns/fields get turned
into attributes, perhaps by setting a max or by having a
whitelist/blacklist model. Even if it won't achieve the highest
sustained performance, I suspect this will be quite helpful for people
as is.
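That safeguard could look roughly like this (a sketch of the idea with hypothetical names, not a proposed NiFi API):

```python
# Cap how many columns become attributes, optionally restricted to a
# whitelist, so a wide CSV cannot flood the flowfile with attributes.
def limit_attributes(row, whitelist=None, max_attributes=50):
    kept = [(k, v) for k, v in row.items()
            if whitelist is None or k in whitelist]
    return dict(kept[:max_attributes])

limit_attributes({"col1": "a", "col2": "b", "col3": "c"},
                 whitelist={"col1", "col3"})
# -> {'col1': 'a', 'col3': 'c'}
```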

Thanks!
Joe

On Wed, Oct 19, 2016 at 6:50 AM, Uwe Geercken <[hidden email]> wrote:


Re: CsvToAttributes processor

Matt Foley
For the specific use case of processing CSV files (and possibly “flat” db tables), would many of the same goals be met if the simple list of “bare” values in each record was turned into easily parsable key/value pairs, perhaps in JSON or YAML format, but still left in the content rather than moved into the attribute list, so as to avoid the problems Joe stated? Granted, each downstream processor will have to re-parse the content, but that's fast and easy: in Python, for instance, one can read such content into a dictionary with just a couple of lines of code. Indexers consume it well too, or can be taught to do so.

Thanks,
--Matt
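Matt's idea can be sketched in Python, assuming (as one possible format) one JSON object per line of content; the "couple of lines" to read it back are at the end:

```python
import csv
import io
import json

# Convert a header-bearing CSV into JSON-lines content: key/value pairs
# stay in the flowfile content, not in attributes.
def csv_to_json_lines(text):
    reader = csv.DictReader(io.StringIO(text))
    return "\n".join(json.dumps(dict(row)) for row in reader)

content = csv_to_json_lines("col1,col2,col3\na,b,c\nd,e,f\n")
# Downstream, each record parses back into a dictionary in one call:
rows = [json.loads(line) for line in content.splitlines()]
# rows[0] is {'col1': 'a', 'col2': 'b', 'col3': 'c'}
```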


On 10/19/16, 6:02 AM, "Joe Witt" <[hidden email]> wrote:


Re: CsvToAttributes processor

Andy LoPresto
I like Matt’s idea. Currently there are ConvertCSVToAvro and ConvertAvroToJSON processors, but no processor that directly converts CSV to JSON. Keeping the content in the content claim, as Joe and Matt pointed out, will greatly improve performance over loading it into attributes. If attribute-based routing is desired, an UpdateAttribute processor can follow on to update a single attribute from the content without polluting it with unnecessary data. 

While I am not a proponent of creating n^2 processors just to do format conversions, I think CSV to JSON is a common-enough and useful-enough task that this would be beneficial. And once we get the extension registry, people can go nuts with n^2 conversion processors. 
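The attribute-based routing pattern Andy mentions might look like this (a hedged sketch assuming JSON content and a hypothetical helper, not an actual NiFi processor API):

```python
import json

# Promote only one routing key from the (JSON) content to an attribute,
# instead of copying every column into the attribute map.
def extract_routing_attribute(content, key):
    return {key: json.loads(content).get(key)}

extract_routing_attribute('{"col1": "a", "col2": "b", "col3": "c"}', "col1")
# -> {'col1': 'a'}
```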


Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

On Oct 19, 2016, at 1:14 PM, Matt Foley <[hidden email]> wrote:


Re: CsvToAttributes processor

Matt Burgess
As an alternative to n^2 processors, there was some discussion a
little while back about having Controller Service instances do data
format conversions [1]. However, that's a complex issue and might not
get integrated in the near term. I agree with Andy that CSV->JSON is a
useful task, and that once we get the extension registry (and/or the
controller services), we can update the processors accordingly.

Regards,
Matt

[1] http://apache-nifi-developer-list.39713.n7.nabble.com/Looking-for-feedback-on-my-WIP-Design-td13097.html

On Wed, Oct 19, 2016 at 1:58 PM, Andy LoPresto <[hidden email]> wrote:

> I like Matt’s idea. Currently there are ConvertCSVToAvro and
> ConvertAvroToJSON processors, but no processor that directly converts CSV to
> JSON. Keeping the content in the content claim, as Joe and Matt pointed out,
> will greatly improve performance over loading it into attributes. If
> attribute-based routing is desired, an UpdateAttribute processor can follow
> on to update a single attribute from the content without polluting it with
> unnecessary data.
>
> While I am not a proponent of creating n^2 processors just to do format
> conversions, I think CSV to JSON is a common-enough and useful-enough task
> that this would be beneficial. And once we get the extension registry,
> people can go nuts with n^2 conversion processors.
>
>
> Andy LoPresto
> [hidden email]
> [hidden email]
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Oct 19, 2016, at 1:14 PM, Matt Foley <[hidden email]> wrote:
>
> For the specific use case of processing CSV files (and possibly “flat” db
> tables), would many of the same goals be met if the simple list of “bare”
> values in each record was turned into easily parsable key/value pairs?
> Perhaps either JSON or YAML format?  But still left in the content rather
> than moved into the attribute list, so as to avoid the problems Joe stated.
> Granted each downstream processor will have to re-parse the content, but
> it’s fast and easy - for instance, in python one can read such content into
> a {dictionary} with just a couple lines of code.  Indexers consume it well,
> too, or can be taught to do so.
>
> Thanks,
> --Matt
>
>
> On 10/19/16, 6:02 AM, "Joe Witt" <[hidden email]> wrote:
>
>    Francois
>
>    Thanks for starting the discussion and this is indeed the type of
>    thing people would find helpful.  One thing I'd want to flag with this
>    approach is the impact it will have on performance at higher rates.
>    We're starting to see people wanting to do this more and more where
>    they'll take the content of a flowfile and turn it into attributes.
>    This can put a lot of pressure on the heap and garbage collection and
>    is best to avoid if you want to achieve sustained high performance.
>    Keeping the content in its native form or converting it to another
>    form will yield much higher sustained throughput as we can stream
>    those things from their underlying storage in the content repository
>    to their new form in the repository or to another system all while
>    only ever having only as much in memory as your technique for
>    operating on them. So for example we can do things like compress a 1GB
>    file and only have say 1KB in memory (as an example).  But by taking
>    the content and turning it into attributes on the flow file the flow
>    file object (not its content) will be in memory most of the time and
>    this is where problems can occur.  It would be better to have pushing
>    to elastic be driven off the content though this admittedly
>    introducing a different challenge which is 'well, what format of that
>    content does it expect'?  We have some examples of this pattern now in
>    our SQL processors for instance which are built around a specific data
>    format but we need to do better and offer generic or pluggable ways to
>    read record oriented data from a variety of formats and not have the
>    processors be specific to the underlying format where possible and
>    appropriate.  The key is to do this without forcing some goofy
>    normalization format that will kill performance as well and which
>    would make it more brittle.
>
>    So, anyway, I said all that to say that it is great you've offered to
>    contribute it and I think you certainly should.  We should just take
>    care to document its intended use and limitations on performance to
>    consider, and enable it to limit how many columns/fields get turned
>    into attributes maybe by setting a max or by having a
>    whitelist/blacklist type model.  Even if it won't achieve highest
>    sustained performance I suspect this will be quite helpful for people
>    as is.
>
>    Thanks!
>    Joe
>
>    On Wed, Oct 19, 2016 at 6:50 AM, Uwe Geercken <[hidden email]>
> wrote:
>
> Francois,
>
> very nice. Thanks.
>
> I have been working on a simple version a while ago. But it had another
> scope: I wnated to have a Nifi processor to merge CSV data with a template
> from a template engine (e.g. Apache Velocity). I will review my code and
> have a look at your processor.
>
> Where can we get it? Github?
>
> Rgds,
>
> Uwe
>
> Gesendet: Mittwoch, 19. Oktober 2016 um 11:10 Uhr
> Von: "François Prunier" <[hidden email]>
> An: [hidden email]
> Betreff: CsvToAttributes processor
>
> Hello Nifi folks,
>
> I've built a processor to parse CSV files with headers and turn each
> line in a flowfile. Each resulting flowfile has as many attributes as
> the number of columns. Each attributes has the name of a column with the
> corresponding value for the line.
>
> For example, this CSV file:
>
> |col1,col2,col3 a,b,c d,e,f |
>
> would generate two flowfiles with the following attributes:
>
> |col1 = a col2 = b col3 = c |
>
> and
>
> |col1 = d col2 = e col3 = f |
>
> As of now, you can configure the charset plus delimiter, quote and
> escape character. It's based on the commons-csv parser.
>
> It's very handy if you want to, for example, index a CSV file into
> elasticsearch.
>
> Would you guys be interested in a pull request to add this processor to
> the main code base ? It needs a bit more documentation and cleanup that
> I would need to add in but it's already successfully used in production.
>
> Best regards,
> --
> *François Prunier
> * *Hurence* - /Vos experts Big Data/
> http://www.hurence.com
> *mobile:* +33 6 38 68 60 50
>
>
>
>
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: CsvToAttributes processor

Uwe Geercken
In reply to this post by Joe Witt
Joe,

thanks for the clarifying words. That was exactly what I asked about a while ago.

Rgds,

Uwe

> Sent: Wednesday, 19 October 2016 at 15:02
> From: "Joe Witt" <[hidden email]>
> To: [hidden email]
> Subject: Re: CsvToAttributes processor

Re: CsvToAttributes processor

François Prunier
In reply to this post by François Prunier
Hello again NiFi folks,

I did not get a direct reply to my email below. However, I've since
noticed in the mailing list archive that some of you kindly replied,
although the emails did not make it to my inbox!

I wasn't part of the mailing list at the time (I am now), which I guess
is why I did not get the responses; it still seems a bit weird though (*).

Anyway, could someone reply to the thread and include my email address,
so I can answer each of your comments while keeping the threading 'clean'?

Thanks !

François

*: Maybe something the admins should look into, as some people might
fire off an email to the list, see no answers, and assume no one
replied to them!

On 19/10/2016 11:10, François Prunier wrote:


--
*François Prunier*
*Hurence* - /Your Big Data experts/
http://www.hurence.com
*mobile:* +33 6 38 68 60 50


Re: CsvToAttributes processor

Andy LoPresto
Hi François, 

I hope this is what you were looking for. If you do not get the entire thread via this email, you can see the thread in a web view here [1]. 


Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

On Oct 27, 2016, at 6:31 AM, François Prunier <[hidden email]> wrote:

--------------7FEEA278B796C52DD32D150C
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

Hello again nifi folks,

I did not get a direct reply to my email below. However, I've since
noticed in the mailing list archive that some of you have kindly
replied, although the emails did not make it to my inbox !

I wasn't part of the mailing list at the time, I am now, I guess that's
why I did not got the responses, it still seems a bit weird though... (*).

Anyway, could someone reply to the thread and include my email so I can
answer each of your comments while keeping the threading 'clean' ?

Thanks !

François

*: Maybe something the admins should look into, as some people might
fire off an email to the list, see no answers, and assume no one replied
to them!

On 19/10/2016 11:10, François Prunier wrote:

Hello Nifi folks,

I've built a processor to parse CSV files with headers and turn each
line into a flowfile. Each resulting flowfile has as many attributes as
the number of columns; each attribute is named after a column and holds
the corresponding value for that line.

For example, this CSV file:

    col1,col2,col3
    a,b,c
    d,e,f

would generate two flowfiles with the following attributes:

    col1 = a
    col2 = b
    col3 = c

and

    col1 = d
    col2 = e
    col3 = f

As of now, you can configure the charset plus delimiter, quote and
escape character. It's based on the commons-csv parser.

It's very handy if you want to, for example, index a CSV file into
Elasticsearch.
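
As a rough illustration of the behaviour described above (this is not
the actual processor code, which uses commons-csv inside NiFi; the
function name here is made up), the per-line mapping can be sketched
with Python's stdlib csv module:

```python
import csv
import io

def csv_to_attribute_maps(text, delimiter=",", quotechar='"'):
    """Illustrative only: turn a CSV document with a header row into
    one dict per data line, keyed by column name -- the same mapping
    the processor applies when it emits one flowfile per line."""
    reader = csv.DictReader(io.StringIO(text),
                            delimiter=delimiter, quotechar=quotechar)
    return [dict(row) for row in reader]

flowfiles = csv_to_attribute_maps("col1,col2,col3\na,b,c\nd,e,f\n")
# Two "flowfiles": [{'col1': 'a', 'col2': 'b', 'col3': 'c'},
#                   {'col1': 'd', 'col2': 'e', 'col3': 'f'}]
```

Each dict stands in for the attribute map of one emitted flowfile.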

Would you guys be interested in a pull request to add this processor
to the main code base? It needs a bit more documentation and cleanup,
which I would need to add, but it's already successfully used in
production.

Best regards,
--
*François Prunier
* *Hurence* - /Vos experts Big Data/
http://www.hurence.com
*mobile:* +33 6 38 68 60 50


--
*François Prunier
* *Hurence* - /Vos experts Big Data/
http://www.hurence.com
*mobile:* +33



Re: CsvToAttributes processor

Andy LoPresto-2
And according to IETF RFC 2822 (Internet Message Format), the Reply-To field can hold multiple mailboxes, so we will investigate whether we can get the dev@ and users@ lists to reply to the list *and* the sender by default. This might really clog people's inboxes, though, so it needs to be evaluated.
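
For the curious, here is a quick sketch of what such a header would look like, using Python's stdlib email package (the addresses are made up):

```python
from email.message import EmailMessage

# RFC 2822 allows Reply-To to carry an address-list, so a list server
# could direct replies to both the list and the original sender.
msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "dev@example.org"
msg["Reply-To"] = "dev@example.org, sender@example.com"

print(msg["Reply-To"])  # prints both mailboxes in one Reply-To header
```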

Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
