Two issues relating to a processor I'm developing


Mike Thomsen
The processor breaks a much larger file down into a huge number of small
data points. We're talking about turning a 1.1M-line file into roughly 2.5B
data points.

My current approach is "read a file with GetFile, save it to /tmp, break it
down into a bunch of large CSV record batches (a few hundred thousand
records per group)," and then commit.

It's slow, though with some good debugging statements I can see the
processor tearing into the data just fine. However, I'm thinking about
adding an "iterative" variant that would follow this pattern:

"read the file, save it to /tmp, load the file, keep the current read
position intact, and on every onTrigger call send out a batch with
session.commit() until the file is fully read. Then grab the next flowfile."

Does anyone have suggestions on good practices to follow here, potential
concerns, etc.? (Note: I have to write the file to /tmp because a library
I'm using, which I don't want to fork, only accepts a java.io.File rather
than a stream.)
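A minimal sketch of the iterative pattern described above, in plain Java. The class and method names (BatchingReader, nextBatch) are illustrative, not NiFi API; in a real processor, the reader would be held as processor state, each onTrigger call would pull one batch, write it to a FlowFile, and call session.commit(), and isDone() would signal when to fetch the next incoming flowfile:

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Keeps the current read position intact across calls by holding the
// reader open; each nextBatch() call corresponds to one onTrigger.
class BatchingReader implements Closeable {
    private final BufferedReader reader;
    private boolean done = false;

    BatchingReader(File file) throws IOException {
        this.reader = new BufferedReader(new FileReader(file));
    }

    /** Returns up to batchSize lines; an empty list means the file is exhausted. */
    List<String> nextBatch(int batchSize) throws IOException {
        List<String> batch = new ArrayList<>();
        String line;
        while (batch.size() < batchSize && (line = reader.readLine()) != null) {
            batch.add(line);
        }
        if (batch.isEmpty()) {
            done = true; // time to grab the next flowfile
        }
        return batch;
    }

    boolean isDone() {
        return done;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

One thing to watch with this shape: committing per batch means a restart mid-file loses the read position, so the processor needs a way to either resume (e.g. tracking an offset in state) or tolerate reprocessing the file from the start.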

Also, are there any issues with accepting a contribution that makes use of
an LGPL-licensed library, in the event that my client wants to open source
it (we think they will)?

Thanks,

Mike

Re: Two issues relating to a processor I'm developing

Bryan Bende
Mike,

Regarding the licensing, I believe LGPL is a no-go for Apache projects.

Take a look here:
https://www.apache.org/legal/resolved.html#category-x

-Bryan


On Sat, Oct 28, 2017 at 4:47 PM, Mike Thomsen <[hidden email]> wrote:

> [original message quoted in full, snipped]