Long recovery time

Joe Gresock
I have a NiFi 1.1.0 instance whose disk nearly (but not quite) filled
up.  I noticed that some of its processors were hanging, so I restarted
it, but it's taking over an hour to come back up.

My question is: how can I tell if NiFi is doing something productive (and
therefore I should just let it finish) vs. hanging (and therefore I should
try something else)?  Is it possible that NiFi could take hours to stand
back up?  My content_repository is 276GB and my flowfile_repository is
640GB.

I see the following in the logs:

o.a.n.controller.StandardFlowFileQueue Recovered 8 swap files for FlowFileQueue[...] in 51 millis
org.wali.MinimalLockingWriteAheadLog finished recovering records. Performing Checkpoint to ensure proper state of Partitions before updates
org.wali.MinimalLockingWriteAheadLog Successfully recovered 10536141 records in 38509 milliseconds

Thereafter, the only thing I see in the logs is this periodic message:
org.wali.MinimalLockingWriteAheadLog checkpointed with 2 Records and 0 Swap Files in 16 milliseconds, Max Transaction ID 31

I took a thread dump and saw pretty standard stuff, including one stack that
I thought might be relevant:
"main" Id=1 RUNNABLE
  at java.util.HashMap.putVal(HashMap.java:641)
  at java.util.HashMap.put(HashMap.java:611)
  at org.apache.nifi.repository.schema.SchemaRecordReader.readFieldValue(SchemaRecordReader.java:154)

I took a couple of dumps in a row in case it was hung there, but it appears
to be progressing through different points in the stack.
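
For reference, here is roughly how I took and compared the dumps (going
from memory; the 30-second gap is arbitrary, and nifi.sh dump writes a
thread dump to the file you name):

    ./bin/nifi.sh dump dump-1.txt
    sleep 30
    ./bin/nifi.sh dump dump-2.txt
    diff dump-1.txt dump-2.txt    # if the "main" stack differs between dumps, recovery is still progressing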

NiFi is the only thing running on this instance, and nearly all of its 48GB
of RAM is in use.  Using iostat, I can see that it is doing heavy reads but
not many writes.

Thanks,
Joe
--
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*

Re: Long recovery time

Mark Payne
Joe,

When you have a huge FlowFile repository like that, it can indeed take quite a long time to recover (potentially a few hours).
But I say that with the caveat that the FlowFile repository should probably never reach that size.
It generally doesn't grow beyond a couple of GB. The two things that I have seen cause tremendous
growth in the FlowFile repository are OutOfMemoryError and "too many open files" IOExceptions. Either one can prevent
the FlowFile repository from properly checkpointing, which causes it to grow unbounded. The
former case (OOME) is far more likely, though. The reason I say this is that if an OOME (or any uncaught Throwable)
is thrown while checkpointing the FlowFile repository, that background thread will die. I did submit a fix for that,
and I believe it was included in 1.4.0 (ASF JIRA is undergoing maintenance at the moment, so I'm not able to look up the ticket).
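
To illustrate the failure mode, here is a minimal sketch (this is standard
java.util.concurrent behavior, not NiFi's actual code): when a task scheduled
with scheduleWithFixedDelay throws, all subsequent executions are suppressed,
so checkpointing silently stops with no further log output.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CheckpointDeathDemo {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Stand-in for the periodic checkpoint task. The first run throws
        // (simulating an OOME); per the ScheduledExecutorService contract,
        // all subsequent executions are then suppressed.
        scheduler.scheduleWithFixedDelay(() -> {
            System.out.println("checkpointing...");
            throw new RuntimeException("simulated OOME");
        }, 0, 1, TimeUnit.SECONDS);
        Thread.sleep(3000); // "checkpointing..." prints exactly once
        scheduler.shutdownNow();
    }
}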

There are probably some things that can be done to address the super-long recovery time, though. The problem should
be easy to replicate by telling NiFi to checkpoint the FlowFile repository once an hour instead of the default of once
every 2 minutes (via the 'nifi.flowfile.repository.checkpoint.interval' property); letting the repository grow that way
would allow us to see what is taking so long and address it accordingly.
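
For example, in conf/nifi.properties (the value uses NiFi's time-period
syntax; 60 mins is just an illustrative value):

    # Default is 2 mins. A long interval lets the FlowFile repository's
    # write-ahead log grow large, reproducing the slow-recovery condition.
    nifi.flowfile.repository.checkpoint.interval=60 mins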


Thanks
-Mark



Re: Long recovery time

Joe Gresock
Thanks for the input, Mark.  I also enabled DEBUG logging on
MinimalLockingWriteAheadLog, and that showed me that it was making progress
in recovery.  The node did come up overnight and is back down to a
reasonable flowfile_repository size.
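
For anyone else who hits this, roughly the logger I added in
conf/logback.xml (the logger name comes straight from the class in the log
output above):

    <logger name="org.wali.MinimalLockingWriteAheadLog" level="DEBUG"/>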


--
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*