In the ListS3 processor, where does NiFi persist the state of objects?

sam
Hi,

I am currently evaluating NiFi and came across the ListS3 and FetchS3Object processors, which I am using to retrieve S3 objects as they are added, transform them, and post them to an external API. To use this in production, I need to know how the listing state is maintained and whether it is dependable.

1. If I start my dataflow 'now' (2017-01-01 00:00), it retrieves all objects from 'now' (2017-01-01 00:00) onwards. After that, even if I restart the dataflow or the NiFi server, it retrieves all the objects it missed since 'now'.
It seems this information is persisted, probably in the filesystem. Where exactly is that? Is there any control over it? I started on 2017-01-01 00:00, but now I want it to process only objects from 2017-07-01 00:00 onwards.

2. What happens if there is an error halfway through processing a file? How can I reprocess it if, say, the cause was a genuine bug in a custom processor?

3. I would like to monitor which files were successfully processed. What is the recommended way to do that?

Appreciate your help.
Re: In the ListS3 processor, where does NiFi persist the state of objects?

Toivo Adams
Hi,

As far as I know, ListS3 uses NiFi's built-in StateManager, which in turn uses StateProviders.
NiFi can have different StateProvider implementations.
Currently NiFi has two providers: one based on ZooKeeper and one based on a write-ahead log file.
ZooKeeper is used when a NiFi cluster is configured, and the other is used for a local single-node NiFi.
As I understand it, NiFi automatically chooses ZooKeeper for a cluster and the local provider for a single NiFi instance.
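
To make the idea of persisted listing state concrete, here is a minimal Python sketch of the general pattern: remember the newest timestamp already listed, and on each run emit only newer objects. This is not NiFi's code. The file name `listing_state.json`, the function names, and the in-memory bucket are all assumptions made for illustration, and a real StateProvider stores its data differently.

```python
import json
import os
import tempfile

def load_state(path):
    """Return the last listed timestamp, or 0 if no state has been saved yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["last_listed"]

def save_state(path, last_listed):
    """Persist the timestamp so it survives a restart."""
    with open(path, "w") as f:
        json.dump({"last_listed": last_listed}, f)

def list_new_objects(objects, state_path):
    """Return keys of objects newer than the stored timestamp, then advance the state."""
    last_listed = load_state(state_path)
    new = [(key, ts) for key, ts in objects if ts > last_listed]
    if new:
        save_state(state_path, max(ts for _, ts in new))
    return [key for key, _ in new]

# Hypothetical bucket contents: (key, last-modified time in epoch seconds).
state_path = os.path.join(tempfile.mkdtemp(), "listing_state.json")
bucket = [("a.csv", 100), ("b.csv", 200)]
first_run = list_new_objects(bucket, state_path)
bucket.append(("c.csv", 300))
# The state is re-read from disk, as it would be after a restart.
second_run = list_new_objects(bucket, state_path)
print(first_run, second_run)
```

In this sketch, deleting the state file or writing an older timestamp into it would cause older objects to be listed again on the next run. Whether NiFi's own state can safely be edited that way is not something this thread confirms.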

You can replay a FlowFile.
Open Data Provenance, choose a provenance event, open the CONTENT tab and click REPLAY.
Also, many NiFi processors have a failure relationship, which routes failed FlowFiles to some other path, so you can automate how failed FlowFiles are handled.

Data Provenance is the simplest way to see which files were processed successfully.
You can also create a custom Reporting Task to collect provenance events and do whatever you need with them.

Regards
Toivo
Re: In the ListS3 processor, where does NiFi persist the state of objects?

Dave Hirko
We use the ListS3 processor quite a bit, and maintaining state for millions of objects in S3 is critical for us. If we lost track of state, we would have to re-download a lot of S3 objects, which is costly.

We use the "local-provider" and the default directory "./state/local".

When I had to migrate to a new instance, we could not afford to lose state, so I copied the entire original "./state/local" directory from the old instance to the new one. The ListS3 processor in the new instance was able to use the state from the old one successfully. I didn't see any documentation on this, but I was able to get it to work.
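
The copy described above can be sketched in Python as follows. The install paths in the usage note are hypothetical, the sketch assumes both directories are reachable from one machine, and stopping NiFi on both instances before copying is an assumption made for safety, not something confirmed in this thread.

```python
import shutil
from pathlib import Path

def copy_local_state(old_nifi_home, new_nifi_home):
    """Copy the local state directory from one NiFi install to another.

    Assumes the default "./state/local" location under each install and that
    the target has no local state yet (copytree refuses to overwrite).
    """
    src = Path(old_nifi_home) / "state" / "local"
    dst = Path(new_nifi_home) / "state" / "local"
    if not src.is_dir():
        raise FileNotFoundError(f"no local state directory at {src}")
    # Create the parent "state" directory if the new install lacks it.
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dst)
    return dst
```

A call such as `copy_local_state("/opt/nifi-old", "/opt/nifi-new")` would then copy the directory; between separate hosts, a file-transfer tool would take the place of `shutil.copytree`.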

I have not figured out how to manipulate the state intentionally. There are use cases where we need to go back a few days and relist recent objects, so being able to set the state back to a particular date would be helpful; it would let us "re-list" objects based on date parameters. As a workaround, I've added date filters.
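
The date-filter workaround amounts to keeping only the listed objects whose last-modified time falls inside a chosen window. Below is a minimal Python sketch of that logic, with hypothetical keys and dates. In a NiFi flow the equivalent check would be done by a processor downstream of ListS3, and this thread does not say which one was used.

```python
from datetime import datetime, timezone

def in_window(last_modified, start, end=None):
    """True if last_modified is at or after start and, when end is given, before end."""
    return last_modified >= start and (end is None or last_modified < end)

# Hypothetical listing: (key, last-modified time in UTC).
listing = [
    ("old.csv", datetime(2017, 1, 15, tzinfo=timezone.utc)),
    ("recent.csv", datetime(2017, 7, 2, tzinfo=timezone.utc)),
]
start = datetime(2017, 7, 1, tzinfo=timezone.utc)
selected = [key for key, modified in listing if in_window(modified, start)]
print(selected)
```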

--

Dave Hirko

--
View this message in context: http://apache-nifi-developer-list.39713.n7.nabble.com/In-ListS3-processor-where-does-Nifi-persists-the-state-of-objects-tp14489p14490.html
Sent from the Apache NiFi Developer List mailing list archive at Nabble.com.