[DISCUSS] NIFI-1069 / PR1093 - Return code for a NiFi not responding to ping


[DISCUSS] NIFI-1069 / PR1093 - Return code for a NiFi not responding to ping

Andre
devs,

I am reviewing PR#1093, which happens to be a great contribution towards an
LSB-compliant NiFi (something the overall community seems to be eager to
have).

The PR basically changes RunNiFi.java so that it returns a numeric exit
code compatible with the LSB specifications.

I am happy with the overall code but there's one sticking point:

Should we return 0 (i.e. "healthy") when "Apache NiFi is running at PID {}
but is not responding to ping requests" ?

The LSB defines:

"
If the status action is requested, the init script will return the
following exit status codes.

0 program is running or service is OK
1 program is dead and /var/run pid file exists
2 program is dead and /var/lock lock file exists
3 program is not running
4 program or service status is unknown
5-99 reserved for future LSB use
100-149 reserved for distribution use
150-199 reserved for application use
200-254 reserved
"

My reading is that we should return 4: the JVM PID is currently running,
but the absence of a ping response could signal that the NiFi program
running within the JVM is not healthy. (The PR as contributed returns 0.)
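
As a minimal sketch of the mapping under discussion, the decision could look
like the Java below. The class and method names are hypothetical and this is
not the code in the PR. Only codes 0, 3 and 4 from the LSB table are modelled;
the pid-file and lock-file cases (1 and 2) are left out.

```java
// Hypothetical sketch, not the code from PR#1093.
final class LsbStatus {
    // Exit codes taken from the LSB "status" table quoted in this message.
    static final int RUNNING_OK = 0;   // program is running or service is OK
    static final int NOT_RUNNING = 3;  // program is not running
    static final int UNKNOWN = 4;      // program or service status is unknown

    private LsbStatus() {
    }

    /**
     * Maps the two observations discussed in this thread to an exit code.
     *
     * @param processAlive   whether the NiFi JVM PID is currently running
     * @param respondsToPing whether NiFi answered the ping request
     */
    static int exitCodeFor(boolean processAlive, boolean respondsToPing) {
        if (!processAlive) {
            return NOT_RUNNING;
        }
        // PID is alive: report OK only if the ping was answered as well.
        return respondsToPing ? RUNNING_OK : UNKNOWN;
    }
}
```

With this mapping, "running at PID {} but is not responding to ping requests"
corresponds to exitCodeFor(true, false), which yields 4 rather than 0.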

Would anyone have a view on what usually would cause a NiFi instance to be
"running" but unable to respond to pings? Whenever that happens should we
return 0 (running/service ok) or 4 (program/service status unknown)?

I thank you in advance

Re: [DISCUSS] NIFI-1069 / PR1093 - Return code for a NiFi not responding to ping

Mark Payne
Andre,

In that case, I agree with you that a 4 would be the proper response. Things that I can
think of that may cause it not to respond:

1) Long Garbage Collection pause
2) Stuck in some sort of infinite loop or just way overtaxed CPU
3) Too many open files prevents it from accepting the connection

Not sure what else may cause this...

Thanks
-Mark

> On Oct 14, 2016, at 9:08 PM, Andre <[hidden email]> wrote:


[DISCUSS] NIFI-1069 / PR1093 - Return code for a NiFi not responding to ping

Edgardo Vega
I would say go with 4.

Ansible will see 1, 2, 3, 4, 69 as not running and do the correct thing.
Puppet sees 0 vs. non-zero. I think if the service is up, running, and
responding to pings, it should return 0; anything else should return another
code. This will allow these tools to restart the application to get it back
into a good state.
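
To illustrate the "0 vs. non-zero" view, here is a small Java sketch of how a
supervising tool might act on a status exit code. The class, the method names
and the restart policy are assumptions made for illustration; they are not
taken from Ansible, Puppet or the PR.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of a caller that only distinguishes 0 from non-zero.
final class StatusCheck {
    private StatusCheck() {
    }

    /** Runs the given status command and returns its exit code. */
    static int exitCodeOf(List<String> command)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    /** Assumed policy: any non-zero status means the service needs a restart. */
    static boolean needsRestart(int statusExitCode) {
        return statusExitCode != 0;
    }
}
```

Under this policy a status of 4 triggers a restart, whereas returning 0 for an
instance that does not answer pings would leave it untouched.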

Not sure what would put NiFi into this state; maybe a full disk.

Cheers,

Edgardo


On Friday, October 14, 2016, Andre <[hidden email]> wrote:


Re: [DISCUSS] NIFI-1069 / PR1093 - Return code for a NiFi not responding to ping

Michal Klempa
I am the original contributor and I am OK with 4. The commit is here:
https://github.com/apache/nifi/pull/1093/commits/5f90cb714dc48264e0f863ce864ac41ddb93556c

And yes, we need at least some LSB compliance to manage NiFi using
Ansible :) Otherwise, I have to check the output of ps a | grep NiFi to see
whether NiFi is running or not. That's bad.

On Sat, Oct 15, 2016 at 3:40 AM, Edgardo Vega <[hidden email]> wrote:
