Restarting NiFi causing SiteToSiteBulletinReportingTask to fail

Restarting NiFi causing SiteToSiteBulletinReportingTask to fail

Woodhead, Chad
I am running HDF 3.0.1.1, which comes with NiFi 1.2.0.3.0.1.1-5. We are using SiteToSiteBulletinReportingTask to monitor bulletins (for things like Disk Usage and Memory Usage). After we restart NiFi via Ambari (either with a Restart or with a Stop followed by a Start), the SiteToSiteBulletinReportingTask no longer works once NiFi comes back up. It throws the following error when it first tries to start:

SiteToSiteBulletinReportingTask[id=ba6b4499-0162-1000-0000-00003ccd7573] org.apache.nifi.remote.client.PeerSelector@34e976af Unable to refresh Remote Group's peers due to response code 409:Conflict with explanation: null

No matter how long we wait, it never works. The ways I have been able to get it to start working again are as follows:

  *   Stop and then Start the Remote Input Port the SiteToSiteBulletinReportingTask is using
  *   Delete the SiteToSiteBulletinReportingTask and create a new one
  *   Wait a while and stop and start the SiteToSiteBulletinReportingTask (however this doesn't work consistently)

I have tested the same flow steps using a process that uses a Remote Process Group and a different Remote Input Port, and that RPG throws the same error when first coming up but then starts working after a period of time. So maybe the SiteToSiteBulletinReportingTask isn't trying enough times to connect to the Remote Input Port?

Sincerely,
Chad Woodhead

Re: Restarting NiFi causing SiteToSiteBulletinReportingTask to fail

Pierre Villard
Hi Chad,

I believe this could have been fixed recently, but I have very limited access
right now (and for the next few days) and can't be sure.
I will check next week if no one has given you feedback before then.

Pierre

In reply to Woodhead, Chad <[hidden email]>, 2018-04-12 19:57 GMT+02:00

Re: Restarting NiFi causing SiteToSiteBulletinReportingTask to fail

Pierre Villard
Hi Chad,

I confirm that I can reproduce the issue on my side with a NiFi 1.5.0
cluster and I don't see anything that would fix it in NiFi 1.6.0.

I had a closer look and it does not seem related to the Site-to-Site
mechanism: the thread in charge of refreshing the peers is correctly
running and you should see logs like "Successfully refreshed Peer Status;
remote instance consists of X peers".

As far as I can see, it seems related to how we cache the ID of the
last bulletin sent and how we retrieve this value to "restart" the task
after the NiFi node has restarted. That's why you have to delete the task and
create it again: doing so deletes the associated cache.

That's just an assumption after a quick look; I'll keep digging tomorrow
and open a JIRA for that.

Thanks for reporting it!

Pierre
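
The caching hypothesis in the message above can be illustrated with a small model. This is only a sketch and not NiFi source code: the class and function names are hypothetical, and it assumes that bulletin IDs start again from 0 after a restart while the cached last-sent ID survives the restart. Under those assumptions, every new bulletin looks as if it had already been sent until the cached value is cleared.

```python
# Hypothetical model of the suspected failure mode; not NiFi code.
# Assumption: bulletin IDs restart from 0 when the node restarts,
# while the cached "last sent" ID is kept across the restart.


class BulletinSource:
    """In-memory bulletin store that numbers bulletins from 0."""

    def __init__(self):
        self._next_id = 0
        self.bulletins = []

    def emit(self, message):
        self.bulletins.append((self._next_id, message))
        self._next_id += 1


def report(source, state):
    """Return bulletins newer than the cached ID and update the cache."""
    last_id = state.get("last_id", -1)
    pending = [b for b in source.bulletins if b[0] > last_id]
    if pending:
        state["last_id"] = pending[-1][0]
    return pending


state = {}                      # stands in for the task's persisted state
before = BulletinSource()
for text in ("disk", "memory", "disk"):
    before.emit(text)
sent_before = report(before, state)      # three bulletins, cache holds ID 2

after = BulletinSource()                 # simulated restart: IDs begin at 0 again
after.emit("disk")
after.emit("memory")
sent_after = report(after, state)        # nothing is sent: IDs 0 and 1 are <= 2

state.clear()                            # clearing the cached ID
sent_cleared = report(after, state)      # both bulletins are sent
```

In this model, deleting and recreating the task has the same effect as `state.clear()`, which matches the observation that a fresh task works.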


In reply to Pierre Villard <[hidden email]>, 2018-04-12 23:41 GMT+02:00

Re: Restarting NiFi causing SiteToSiteBulletinReportingTask to fail

Pierre Villard
I created https://issues.apache.org/jira/browse/NIFI-5092 to track the
issue and will submit a fix soon.
Current workaround: after a NiFi restart, stop the reporting task, clear
its state, and start it again.

Pierre
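
The workaround above (stop, clear state, start) could be scripted roughly as follows. This is a minimal sketch under stated assumptions: `send` is a hypothetical callable that performs an HTTP request against NiFi and returns the parsed JSON response, and the endpoint paths (`/nifi-api/reporting-tasks/{id}` and `/nifi-api/reporting-tasks/{id}/state/clear-requests`) are assumptions based on the NiFi REST API that should be verified against the documentation for the NiFi version in use.

```python
# Sketch of the stop / clear state / start sequence.
# Assumptions: `send(method, path, body)` is a hypothetical helper that
# issues the request and returns the decoded JSON entity; the paths below
# should be checked against the REST API docs for your NiFi version.


def set_task_state(send, task_id, state):
    """Fetch the current revision, then request the given run state."""
    path = f"/nifi-api/reporting-tasks/{task_id}"
    entity = send("GET", path, None)
    body = {
        "revision": entity["revision"],
        "component": {"id": task_id, "state": state},
    }
    return send("PUT", path, body)


def apply_restart_workaround(send, task_id):
    """Stop the reporting task, clear its state, and start it again."""
    set_task_state(send, task_id, "STOPPED")
    send("POST", f"/nifi-api/reporting-tasks/{task_id}/state/clear-requests", None)
    set_task_state(send, task_id, "RUNNING")
```

Passing `send` in as a parameter keeps authentication, TLS, and the base URL out of the sketch, and allows the sequence to be exercised with a stub before it is pointed at a real instance.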

In reply to Pierre Villard <[hidden email]>, 2018-04-18 0:04 GMT+02:00

Re: Restarting NiFi causing SiteToSiteBulletinReportingTask to fail

Woodhead, Chad
Hi Pierre,

Thanks for the updates and HCC comments as well.

-Chad

In reply to Pierre Villard <[hidden email]>, 4/18/18, 5:36 AM