NiFi PutElasticsearch Processor

NiFi PutElasticsearch Processor

shankhamajumdar
Hi,

I am working on a use case where I need to load a PDF document into Elasticsearch. I have written a custom NiFi processor using Apache Tika that extracts the content of the PDF. The NiFi flow is as follows.

1. A GetFile processor picks up the PDF file from the source directory.

2. The custom processor, written with Apache Tika, extracts the content of the PDF.

3. A PutElasticsearch processor loads the data into Elasticsearch. But I am getting the error below.

MapperParsingException[failed to parse]; nested: NotXContentException[Compressor detection can only
be called on some xcontent bytes or compressed xcontent bytes];

Regards,
Shankha

Re: NiFi PutElasticsearch Processor

Joe Witt
Hello

Is the code available to provide feedback on?  When you extract aspects from the PDF, where are you putting the extracted values?  You probably want to convert the PDF content into the extracted results as JSON.  You could also extract aspects of the PDF into flowfile attributes and then build the JSON in another processor, but this won't scale as well for large documents/extracts.

Thanks
Joe
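
For illustration, a minimal sketch of the approach Joe describes, rewriting the flow file content as a JSON document inside the custom processor. The relationship name, the use of org.json for escaping, and the overall structure are assumptions made for the sketch, not the poster's actual code.

import java.nio.charset.StandardCharsets;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.json.JSONObject;

@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    // Replace the flow file content (the raw PDF bytes) with a JSON document holding the extracted text.
    flowFile = session.write(flowFile, (in, out) -> {
        try {
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
            new PDFParser().parse(in, handler, new Metadata(), new ParseContext());
            String json = "{\"content\":" + JSONObject.quote(handler.toString()) + "}";
            out.write(json.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new ProcessException("Failed to extract PDF content", e);
        }
    });
    session.transfer(flowFile, REL_SUCCESS);
}

PutElasticsearch can then index the flow file content directly, since it is now a single JSON document rather than raw PDF bytes.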


Re: NiFi PutElasticsearch Processor

Matt Burgess-2
I agree with Joe, it sounds like the flow file content is the
plain-text extract of the PDF, and the ES processors are expecting
JSON documents. You can use ReplaceText to wrap your content in a JSON
object, with something like:

{ "content" : $1}

(I didn't try that but the idea is just to put the original text into
a field inside a JSON object)

Alternatively as Joe mentioned, you may want to consider offering the
user a choice of output format (Text, JSON, etc.) in your custom
processor.

Regards,
Matt
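
For reference, one possible ReplaceText configuration along these lines (the exact property values here are an assumption, not something confirmed in the thread):

    Replacement Strategy : Regex Replace
    Evaluation Mode      : Entire text
    Search Value         : (?s)(^.*$)
    Replacement Value    : { "content" : "$1" }

Note that for the result to be valid JSON the captured text has to be quoted, and any embedded quotes or newlines would still need escaping, so this only holds up for simple plain-text content; doing the JSON conversion inside the custom processor avoids that problem.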


Re: NiFi PutElasticsearch Processor

shankhamajumdar
Hi,

I have added an ExtractText processor and created a new property there called myAttribute with value (.+). Then I added an AttributesToJSON processor with Attributes List set to myAttribute. As a result I am getting the JSON structure below.

{"myAttribute":"test elasticsearch"}

But it's not working for multiline content, as the JSON attribute captures only a single line. To resolve this I need to keep the entire content on a single line, so I added a ReplaceText processor before the AttributesToJSON processor. In the ReplaceText processor I am trying to replace \n with a space so that the entire content ends up on one line.

Can you please tell me how to put the entire content on a single line using ReplaceText? I have used a search value of \n and a replacement value of ' ', but this is not working properly.

Please provide some input on this.

Regards,
Shankha

Re: NiFi PutElasticsearch Processor

Mark Payne
Shankha,

With Java regexes, by default, the dot character does not match newlines, so (.+) will only match a single line. On ExtractText, you can change the property named "Enable DOTALL Mode" to true, which should allow the .+ to capture all of the text in the FlowFile.

Thanks
-Mark
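
A quick standalone Java illustration of the difference Mark describes (not part of the thread's processor code, just showing the effect of the DOTALL flag that "Enable DOTALL Mode" turns on):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotAllDemo {
    public static void main(String[] args) {
        String text = "line one\nline two\nline three";

        // Default behaviour: '.' does not match newlines, so (.+) stops at the first line break.
        Matcher plain = Pattern.compile("(.+)").matcher(text);
        plain.find();
        System.out.println(plain.group(1));   // prints: line one

        // With DOTALL, '.' matches newlines too, so (.+) captures the whole text.
        Matcher dotall = Pattern.compile("(.+)", Pattern.DOTALL).matcher(text);
        dotall.find();
        System.out.println(dotall.group(1));  // prints all three lines
    }
}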





Re: NiFi PutElasticsearch Processor

shankhamajumdar
Thanks Mark, your solution worked! I am facing one more issue. I am trying to put the entire content into a single JSON attribute using the AttributesToJSON processor. It works, but that particular attribute is not capturing the entire content; it only captures roughly the first 7 lines. Is there a limitation on this, or how can I resolve the issue?

Regards,
Shankha

Re: NiFi PutElasticsearch Processor

shankhamajumdar
Hi Mark,

I have resolved the JSON attribute issue by increasing the value of Maximum Capture Group Length in the AttributesToJSON processor.

I have one more question: for the PutElasticsearch processor I am using Elasticsearch 2.2.0. Is it possible to use Elasticsearch 5 with the PutElasticsearch processor?

Regards,
Shankha

Re: NiFi PutElasticsearch Processor

shankhamajumdar
Just a small correction: the value I increased was Maximum Capture Group Length in the ExtractText processor, not AttributesToJSON.

Re: NiFi PutElasticsearch Processor

Matt Burgess
In reply to this post by shankhamajumdar
Shankha,

The Fetch/PutElasticsearch processors are built to be part of the ES cluster, and IIRC Elasticsearch says that this should be compatible across dot releases for a particular major/minor version; I think ours are built against 2.1.x. They might work with ES 2.2.0, but that isn't guaranteed. Likewise, the ES 5 processors are built against ES 5.0.1, so they should work with 5.0.x and most likely won't work with an ES 2.x cluster.

There is a set of HTTP processors (Fetch/PutElasticsearchHttp for example) that are more robust in terms of which versions of ES clusters they support, as these processors use the more stable REST API versus the more volatile (but more performant) native transport API.

Regards,
Matt



Re: NiFi PutElasticsearch Processor

shankhamajumdar
Hi Matt,

Thanks for your input. I can see the ES 5 processors in NiFi 1.1.0.
I have one more question. In my custom processor I am using Tika to detect the file type first and then get the file content.
The code snippet below works fine:

String fileType = tika.detect(inputStream);
inputStream = new FileInputStream(new File("filepath"));
pdfparser.parse(inputStream, handler, metadata, pcontext);

But the code snippet below does not work properly:

String fileType = tika.detect(inputStream);
pdfparser.parse(inputStream, handler, metadata, pcontext);

The inputStream comes from the GetFile processor. I am wondering why I need to create the InputStream object again in order to parse the content.

Regards,
Shankha
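
One likely explanation, offered as a guess rather than something confirmed in the thread: tika.detect(InputStream) reads the first bytes of the stream to sniff the type, and it only rewinds the stream when the stream supports mark/reset. A plain FileInputStream (and, typically, the flow file content stream) does not, so after detect() the parser starts reading mid-file and the parse fails; reopening the file in the first snippet is what hides the problem. A minimal sketch of one workaround, reusing the variable names from the snippets above:

import java.io.BufferedInputStream;
import java.io.InputStream;

// Wrap the stream so Tika can mark it, read the header bytes for detection,
// and then reset it before handing the same stream to the parser.
InputStream buffered = new BufferedInputStream(inputStream);
String fileType = tika.detect(buffered);
pdfparser.parse(buffered, handler, metadata, pcontext);

Wrapping the stream with org.apache.tika.io.TikaInputStream.get(inputStream) would be another way to get a stream that can be safely reused between detection and parsing.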