Getting Duplicate Flowfiles from InvokeHttp and QueryElasticsearchHttp

martin.cooley
If I configure an InvokeHttp processor to query an Elasticsearch node, I should get one JSON object written to a flowfile.  If I use the QueryElasticsearchHttp processor and the query returns two documents from the index, I should get two JSON objects, each written to its own flowfile.

However, the InvokeHttp processor is writing two flowfiles.  They have separate UUIDs, but their contents are identical.  Yes, the processor is scheduled to run every 900 seconds.

The QueryElasticsearchHttp processor is writing four flowfiles.  It, too, is scheduled to run every 900 seconds.

Elasticsearch is returning:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "etltodoc",
        "_type": "document_record",
        "_id": "2045680246129",
        "_score": 0.2876821,
        "_source": {
          "myguid": "2045680246129",
          "filename": "sample1.pdf",
          "exception": "",
          "original_filename": "\\\\f1\\DocsRepo\\CF\\sample1.pdf",
          "conceptCode": "C2159782",
          "timestamp": "2019-03-12T12:43:21.166531",
          "status": "delivered"
        }
      },
      {
        "_index": "etltodock",
        "_type": "document_record",
        "_id": "2045680246128",
        "_score": 0.2876821,
        "_source": {
          "myguid": "2045680246128",
          "filename": "sample2.pdf",
          "exception": "",
          "original_filename": "\\\\f1\\DocsRepo\\CF\\sample2.pdf",
          "conceptCode": "C2159782",
          "timestamp": "2019-03-12T12:43:21.165467",
          "status": "delivered"
        }
      }
    ]
  }
}

I'm hoping I just have something misconfigured, but I have tried just about every setting.  On the QueryElasticsearchHttp processor, if I set the limit to one, I get two flowfiles (down from four) rather than the one I expect.

Any help will be much appreciated.

Martin

Re: Getting Duplicate Flowfiles from InvokeHttp and QueryElasticsearchHttp

Bryan Bende
Hello,

Are you running a NiFi cluster of 2 nodes, or a standalone instance of NiFi?

-Bryan

(In reply to Martin Cooley's message of Mon, Mar 18, 2019 at 12:21 PM.)

Re: Getting Duplicate Flowfiles from InvokeHttp and QueryElasticsearchHttp

martin.cooley
Hey Bryan,

Indeed, it is a 2-node cluster.  I would like to say I see where this is
going, but I don't.

Thanks,

Martin



--
Sent from: http://apache-nifi-developer-list.39713.n7.nabble.com/

Re: Getting Duplicate Flowfiles from InvokeHttp and QueryElasticsearchHttp

Bryan Bende
Hi Martin,

Since you have a 2-node cluster, the processors are most likely running on
both nodes when you start them, so the same work is done twice. What you
see in the stats and queues is the combined value across the cluster, which
is why you see 2 or 4 flowfiles instead of 1 or 2.
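
The arithmetic behind those numbers can be sketched as follows. This is only
an illustrative model, not NiFi code: the function name and its parameters
are hypothetical, and it assumes every node runs the processor once per
schedule and receives the same Elasticsearch response.

```python
def combined_flowfile_count(node_count: int, flowfiles_per_run: int) -> int:
    """Total flowfiles shown in the cluster-wide view after one scheduled run.

    Assumes (for illustration only) that every node executes the processor
    once and that each execution emits the same number of flowfiles.
    """
    return node_count * flowfiles_per_run


# InvokeHttp writes one flowfile per response, and it runs on both nodes.
invoke_http_total = combined_flowfile_count(node_count=2, flowfiles_per_run=1)

# QueryElasticsearchHttp writes one flowfile per hit; the query returns 2 hits.
query_es_total = combined_flowfile_count(node_count=2, flowfiles_per_run=2)

# With the limit set to one hit, each node still emits one flowfile.
query_es_limited_total = combined_flowfile_count(node_count=2, flowfiles_per_run=1)

print(invoke_http_total, query_es_total, query_es_limited_total)  # 2 4 2
```

With a single node executing the processor, the same model gives 1, 2 and 1,
which matches what was originally expected.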

Each processor has an option on the Scheduling tab of its configuration
that determines where it runs: on all nodes or on the primary node only.
You most likely want primary node only, so that the processor runs on just
one node in the cluster: whichever node is the primary at the time it is
scheduled to run.
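
If you would rather script the change than use the UI, the sketch below
builds the kind of JSON body that could be sent to the NiFi REST API. Treat
it as a starting point only: the `executionNode` field with the values `ALL`
and `PRIMARY`, the `PUT /nifi-api/processors/{id}` endpoint and the revision
handling are assumptions to check against the REST API documentation for
your NiFi version, and the function name and sample processor id are
hypothetical. The code makes no network call.

```python
import json


def build_execution_node_update(processor_id: str, revision_version: int,
                                primary_only: bool = True) -> dict:
    """Build a processor-update body that changes only the execution node.

    Assumption: NiFi expects the current revision version alongside the
    component, and the setting lives under component.config.executionNode.
    """
    return {
        "revision": {"version": revision_version},
        "component": {
            "id": processor_id,
            "config": {
                "executionNode": "PRIMARY" if primary_only else "ALL",
            },
        },
    }


# Hypothetical id and revision; real values come from the processor itself.
payload = build_execution_node_update("hypothetical-processor-id",
                                      revision_version=3)
print(json.dumps(payload, indent=2))
```

The processor would normally need to be stopped before its configuration
can be changed, whether through the UI or the API.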

Hope that helps.

-Bryan

(In reply to martin.cooley's message of Tue, Mar 19, 2019 at 10:58 AM.)

Re: Getting Duplicate Flowfiles from InvokeHttp and QueryElasticsearchHttp

martin.cooley
Bryan,

Thanks so much!  I get it now, and was able to find the setting and change
it to behave the way that makes the most sense for me.  

I appreciate all your help,

Martin


