Provenance query results in Node disconnect from cluster

Mark Bean
We are seeing cases where a user attempts to query provenance on a cluster.
One or more Nodes may not respond to the request in a timely manner, and are
then disconnected from the cluster. The nifi-app.log shows log messages
similar to:

ThreadPoolRequestReplicator Failed to replicate request POST
/nifi-api/provenance to {host:port} due to
com.sun.jersey.api.client.ClientHandlerException:
java.net.SocketTimeoutException: Read timed out
NodeClusterCoordinator The following nodes failed to process URI
/nifi-api/provenance '{list of one or more nodes}'. Requesting each node
disconnect from cluster.

We have implemented a custom authorizer. For certain policies, additional
authorization checking is performed. Provenance is one such policy which
performs additional checking. It is surprising that the process is taking
so long as to time out the request. Currently, timeouts are set as:
nifi.cluster.node.read.timeout=10 sec
nifi.cluster.request.replication.claim.timeout=30 sec

This leads me to believe we are thread-limited, not CPU-limited.

In this scenario, what threads are involved? Would
nifi.cluster.node.protocol.threads (or .max.threads) be limiting the
processing of such api calls?

Are the API provenance requests limited by
nifi.provenance.repository.query.threads?

Are there other thread-related properties we should be looking at?

Are thread properties (such as nifi.provenance.repository.query.threads)
counted against the total threads given by nifi.web.jetty.threads?

Thanks,
Mark

Re: Provenance query results in Node disconnect from cluster

Mark Payne
Mark,

In my experience, timeouts on cluster replication are usually caused by garbage collection.
So you may not be thread-limited, CPU-limited, or resource-limited at all; garbage collection
may simply be kicking in at an inopportune time. In that case, I suggest setting
nifi.cluster.node.read.timeout to, say, 30 seconds instead of 10, and looking into how
garbage collection is performing on your system.
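
To make the suggested change concrete, here is a minimal sketch that reads the two timeout properties quoted in this thread and compares the current read timeout with the suggested 30 seconds. The duration parser and the helper names (parse_duration, read_properties) are hypothetical and written only for this illustration; they are not NiFi's own parsing code and handle only a few simple units.

```python
import re

# Unit multipliers for duration strings such as "10 sec".
# Simplified for illustration; this is not NiFi's own parser.
_UNITS = {"sec": 1, "secs": 1, "min": 60, "mins": 60}

def parse_duration(value):
    """Convert a string such as '10 sec' into whole seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", value.lower())
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognized duration: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]

def read_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

# The two settings quoted in the original question.
SAMPLE = """
nifi.cluster.node.read.timeout=10 sec
nifi.cluster.request.replication.claim.timeout=30 sec
"""

props = read_properties(SAMPLE)
read_timeout = parse_duration(props["nifi.cluster.node.read.timeout"])
SUGGESTED_S = 30  # seconds, the value suggested in this reply

print(f"current read timeout: {read_timeout} s, suggested: {SUGGESTED_S} s")
if read_timeout < SUGGESTED_S:
    print("consider setting nifi.cluster.node.read.timeout=30 sec")
```

In practice the text would come from the node's nifi.properties file rather than an inline string.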

I have answered specific questions below, though, in case they are helpful.

Thanks
-Mark


> On Nov 20, 2017, at 3:25 PM, Mark Bean <[hidden email]> wrote:
>
> In this scenario, what threads are involved? Would
> nifi.cluster.node.protocol.threads (or .max.threads) be limiting the
> processing of such api calls?

>>> The Jetty threads are involved on the 'receiving' side,
and the nifi.cluster.node.protocol.threads on the client side.

>
> Are the API provenance requests limited by
> nifi.provenance.repository.query.threads?

>>> These query threads are background threads that are used to populate the results
of the query. Client requests will not block on those results.

>
> Are there other thread-related properties we should be looking at?
>

>>> I don't think so. I can't think of any off the top of my head, anyway.

> Are thread properties (such as nifi.provenance.repository.query.threads)
> counted against the total threads given by nifi.web.jetty.threads?

>>> No, these are separate thread pools.

>
> Thanks,
> Mark

Re: Provenance query results in Node disconnect from cluster

Mark Bean
Thanks for the clarifying information. Setting
nifi.cluster.node.read.timeout=30 sec seems to have alleviated the problem.

We determined that performing all the authorizations for each Provenance
Event takes a relatively long time after choosing Global Menu -> Data
Provenance. In this case, the Provenance Query Thread authorizes "Data
for ..." for each processor. Each such authorization takes approximately
0.5-0.6 ms. (Timing was taken with the custom authorization logic disabled.)
I have not yet determined whether this authorization proceeds for ALL
Provenance Events, or only for the 1,000 events to which the UI limits the
display. I have noted that all authorizations are handled by a single
Provenance Query Thread despite the property
nifi.provenance.repository.query.threads=2.
I assume this property allows more threads for simultaneous client
requests, but each individual request uses only a single thread.
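
To put the per-authorization timing in perspective, here is a rough arithmetic sketch. It assumes the authorizations run one after another on a single thread at the measured 0.5-0.6 ms each; the event counts other than 1,000 are hypothetical, and whether the replicated request actually waits on this work is not established in this thread.

```python
# Measured per-authorization cost, in microseconds (0.5-0.6 ms).
COST_US_LOW, COST_US_HIGH = 500, 600

def total_seconds(event_count, cost_us):
    """Time for event_count serial authorizations, in seconds."""
    return event_count * cost_us / 1_000_000

def events_to_fill(timeout_s, cost_us):
    """Number of serial authorizations that fit within timeout_s."""
    return timeout_s * 1_000_000 // cost_us

# 1,000 is the UI display limit; the larger counts are hypothetical.
for events in (1_000, 10_000, 100_000):
    low = total_seconds(events, COST_US_LOW)
    high = total_seconds(events, COST_US_HIGH)
    print(f"{events:>7} events: {low:.1f}-{high:.1f} s")

# Compare against the old (10 s) and new (30 s) read timeouts.
for timeout in (10, 30):
    count = events_to_fill(timeout, COST_US_HIGH)
    print(f"{timeout} s timeout is filled by about {count} authorizations")
```

If only the 1,000 displayed events are authorized, the total stays well under a second; the timeout only comes into play if tens of thousands of events are authorized serially.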

We also determined that GC was not a significant factor. The JVM is
spending approximately 10% of its time performing GC, but none of it is
full GC. And the time of any one GC is reasonable (approx. 0.5 sec).
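
The GC figures above can be checked against the read timeout with a few lines of arithmetic. The sketch below uses hypothetical pause samples (twelve 0.5 s pauses in a 60 s window, chosen to match the roughly 10% overhead and 0.5 s pause length reported here); the function name gc_summary is illustrative only.

```python
def gc_summary(pauses_s, window_s):
    """Return (fraction of the window spent in GC, longest single pause)."""
    return sum(pauses_s) / window_s, max(pauses_s)

# Hypothetical sample: twelve 0.5 s pauses observed over 60 s.
pauses = [0.5] * 12
fraction, longest = gc_summary(pauses, 60.0)

READ_TIMEOUT_S = 10  # the original nifi.cluster.node.read.timeout

print(f"GC overhead: {fraction:.0%}, longest pause: {longest} s")
print("single pause exceeds read timeout:", longest > READ_TIMEOUT_S)
```

With pauses of this size, no single collection comes close to the 10 second read timeout, which is consistent with GC not being the main cause here.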

