Using map cache clients to detect already processed files

Using map cache clients to detect already processed files

Mike Thomsen
We are getting a lot of independent data submissions from various teams
that work with our client, and our client may need a processor that roughly
covers this user story:

"As a NiFi user, I would like to be able to detect whether a file has been
seen before and processed, based on feedback from an RDBMS/HBase/Elastic,
and then be able to choose whether to reprocess it or drop it."

Want to make sure that I'm not reinventing the wheel before writing such a
processor.

Thanks,

Mike

Re: Using map cache clients to detect already processed files

Mark Payne
Mike,

There is a DetectDuplicate processor. It lets you provide an attribute to use for identification (for example, a SHA-256 hash, an identifier in the data, or a filename). It uses a DistributedMapCacheClient to track what it has seen, so it could be backed by Redis or whatever other implementations we have available. Would that give you what you need?

Thanks
-Mark
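
To make the idea in the reply above concrete, here is a minimal sketch in plain Java. It does not use the NiFi API: the class name, the method names, and the in-memory map standing in for the external cache are all assumptions made for illustration. It hashes the content with SHA-256, uses the hash as the identifying key, and labels the item "duplicate" or "non-duplicate" depending on whether the key was already recorded.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a hypothetical stand-in, not NiFi's DetectDuplicate.
public class DuplicateCheckSketch {

    // Assumption: an in-memory map plays the role of the external cache
    // (Redis, HBase, ...). Key = identifier, value = free-form description.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Returns the hex-encoded SHA-256 digest of the given content. */
    public static String sha256Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Records the key if it is new and reports which way the item should go.
     * putIfAbsent makes the check and the write a single atomic step.
     */
    public String route(String key, String description) {
        String previous = cache.putIfAbsent(key, description);
        return previous == null ? "non-duplicate" : "duplicate";
    }

    public static void main(String[] args) {
        DuplicateCheckSketch detector = new DuplicateCheckSketch();
        byte[] file = "id,value\n1,42\n".getBytes(StandardCharsets.UTF_8);
        String key = sha256Hex(file);
        System.out.println(detector.route(key, "first submission"));
        System.out.println(detector.route(key, "second submission"));
    }
}
```

Whether a duplicate is then dropped or reprocessed is a downstream decision; the sketch only reports which case applies.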


> On Dec 15, 2018, at 8:52 AM, Mike Thomsen <[hidden email]> wrote:
>
> We are getting a lot of independent submissions of data from various and
> sundry teams that work with our client, and our client may need a processor
> that roughly does this story:
>
> "as a NiFi user, I would like to be able to detect whether a file has been
> seen before and processed based on feedback from a RDBMS/HBase/Elastic and
> then be able to choose whether to reprocess it or drop it."
>
> Want to make sure that I'm not reinventing the wheel before writing such a
> processor.
>
> Thanks,
>
> Mike

Re: Using map cache clients to detect already processed files

Mike Thomsen
Sounds perfect.

On Sat, Dec 15, 2018 at 9:11 AM Mark Payne <[hidden email]> wrote:

> Mike,
>
> There is a DetectDuplicate processor. It gives you the ability to provide
> an attribute to use for identification (for example, using a SHA256 hash or
> looking at an identifier in the data or a filename, etc). It uses a
> DistributedMapCacheClient to track this so it could be backed by Redis or
> whatever other implementations we have available. Would that give you what
> you need?
>
> Thanks
> -Mark

Re: Using map cache clients to detect already processed files

James Srinivasan
Our use case is that we are scraping a web site for new files to download.
We use DetectDuplicate with the URL as the key to avoid downloading the same
file multiple times. We use the HBase DistributedMapCache because the
built-in one doesn't work properly in a cluster, plus we don't really have
the concept of a bounded set of keys.
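
A rough sketch of that URL-keyed pattern, again in plain Java rather than NiFi code. The `SeenStore` interface and its in-memory implementation are hypothetical names introduced here; in a cluster the store would have to be one that every node shares (HBase in the setup described above), which an in-process set is not.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical types, not part of NiFi.
public class UrlDedupSketch {

    /** Minimal view of a key store that remembers which URLs were seen. */
    public interface SeenStore {
        /** Records the key; returns true only the first time it is seen. */
        boolean markIfNew(String key);
    }

    /** In-memory stand-in, unbounded and local to a single JVM. */
    public static class InMemorySeenStore implements SeenStore {
        private final Set<String> keys = new HashSet<>();

        @Override
        public synchronized boolean markIfNew(String key) {
            return keys.add(key);
        }
    }

    /** Keeps only the scraped URLs that have not been recorded before. */
    public static List<String> newUrls(List<String> scraped, SeenStore store) {
        List<String> fresh = new ArrayList<>();
        for (String url : scraped) {
            if (store.markIfNew(url)) {
                fresh.add(url);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        SeenStore store = new InMemorySeenStore();
        // Two scrape passes over a site; the second overlaps the first.
        System.out.println(newUrls(
                List.of("http://example.com/a.csv", "http://example.com/b.csv"), store));
        System.out.println(newUrls(
                List.of("http://example.com/b.csv", "http://example.com/c.csv"), store));
    }
}
```

Keeping the store behind a small interface is the point of the sketch: the filtering logic stays the same whichever backing cache is plugged in.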

On Sat, 15 Dec 2018, 14:21 Mike Thomsen <[hidden email]> wrote:

> Sounds perfect.