ConvertCSVToAvro vs CSVReader - Value Delimiter

ConvertCSVToAvro vs CSVReader - Value Delimiter

Arun Manivannan
Hi,

The ConvertCSVToAvro processor has been having performance issues while
processing files larger than a GB, and it was suggested that I use
ConvertRecord, which leverages the RecordReader and Writer. I ran some
tests and it does perform well.

Strangely, the CSVReader doesn't accept a Unicode escape sequence as the
value delimiter. The delimiter of my CSV is the Control-A (\u0001) character.

I did some analysis and see that a minor change needs to be made in
CSVUtils to unescape the delimiter, as ConvertCSVToAvro does, and that
the SingleCharacterValidator needs to be modified as well.
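
To illustrate the idea of unescaping before validating, here is a minimal
sketch in plain Java. It is not the NiFi code or the eventual patch: the class
name DelimiterUnescape and the methods unescape and parseDelimiter are
hypothetical, and only a few escape forms are handled. The point is that the
"exactly one character" check has to run on the unescaped value, so that the
six-character text \u0001 is accepted as the single Control-A character.

```java
class DelimiterUnescape {

    // Hypothetical helper: converts a property value typed by the user
    // into the single delimiter character it denotes.
    public static char parseDelimiter(String raw) {
        String value = unescape(raw);
        // Validate length AFTER unescaping, not before.
        if (value.length() != 1) {
            throw new IllegalArgumentException(
                "Delimiter must be exactly one character: " + raw);
        }
        return value.charAt(0);
    }

    // Hypothetical helper: handles backslash-u plus four hex digits,
    // backslash-t, and an escaped backslash. Everything else is copied as-is.
    public static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(i + 1);
                if (next == 'u' && i + 6 <= raw.length()) {
                    int code = 0;
                    for (int k = 2; k < 6; k++) {
                        int digit = Character.digit(raw.charAt(i + k), 16);
                        if (digit < 0) {
                            throw new IllegalArgumentException(
                                "Invalid hex digit in escape: " + raw);
                        }
                        code = code * 16 + digit;
                    }
                    out.append((char) code);
                    i += 6;
                    continue;
                }
                if (next == 't') {
                    out.append('\t');
                    i += 2;
                    continue;
                }
                if (next == '\\') {
                    out.append('\\');
                    i += 2;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The string literal below holds the six characters: \ u 0 0 0 1
        char ctrlA = parseDelimiter("\\u0001");
        System.out.println((int) ctrlA); // prints 1
    }
}
```

A validator that checks the raw text would see six characters and reject it,
which matches the behaviour described above. The actual change may well rely
on an existing library routine rather than a hand-written loop like this one.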

Please let me know if you believe this isn't an issue and there's a
workaround for it. Otherwise, I am more than happy to raise an issue and
submit a PR for review.

Best Regards,
Arun

RE: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter

Peter Wicks (pwicks)
Arun,

I'm also using Ctrl+A as a delimiter and had the same problem.  I haven't had time to write up a PR but it looked like a pretty easy fix to me too.

I can't merge the change if you submit it, but I'd be happy to review it.

--Peter

-----Original Message-----
From: Arun Manivannan [mailto:[hidden email]]
Sent: Sunday, September 24, 2017 11:17 PM
To: [hidden email]
Subject: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter


Re: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter

Joe Witt
Thanks Arun and Peter.  Getting that resolved will be nice.  The
performance difference of the record reader/writer approach in all
this is pretty fantastic so the more we can do to iron out these sorts
of edges the better.  Thanks!

On Sun, Sep 24, 2017 at 8:56 PM, Peter Wicks (pwicks) <[hidden email]> wrote:

> Arun,
>
> I'm also using Ctrl+A as a delimiter and had the same problem.  I haven't had time to write up a PR but it looked like a pretty easy fix to me too.
>
> I can't merge the change if you submit it, but I'd be happy to review it.
>
> --Peter

Re: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter

Matt Burgess
Thanks all, if the PR is available tomorrow I can review as well and merge, but I will be on vacation for a week after that. No pressure :)

Regards,
Matt

> On Sep 24, 2017, at 8:57 PM, Joe Witt <[hidden email]> wrote:
>
> Thanks Arun and Peter.  Getting that resolved will be nice.  The
> performance difference of the record reader/writer approach in all
> this is pretty fantastic so the more we can do to iron out these sorts
> of edges the better.  Thanks!

Re: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter

Arun Manivannan
Thanks a lot, gentlemen. JIRA and PR coming through in a few hours.

On Mon, Sep 25, 2017, 09:07 Matt Burgess <[hidden email]> wrote:

> Thanks all, if the PR is available tomorrow I can review as well and
> merge, but I will be on vacation for a week after that. No pressure :)
>
> Regards,
> Matt

Re: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter

Arun Manivannan
Hi All,

Just raised a PR (https://github.com/apache/nifi/pull/2172) for JIRA
NIFI-4416 <https://issues.apache.org/jira/browse/NIFI-4416>

Appreciate your help, Peter and Matt. Could you please have a quick look
and give your comments?

Joe - Could you also check out the JIRA and let me know if I've committed
some crime?

You guys are the best!

Best Regards,
Arun

On Mon, Sep 25, 2017 at 9:44 AM Arun Manivannan <[hidden email]> wrote:

> Thanks a lot, gentlemen. JIRA and PR coming through in a few hours.