Provenance Repository Interface

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Provenance Repository Interface

dam6923 .
Hello!

Thanks for all the support and direction thus far in my efforts to
improve the system provenance record keeping.

I'd like to propose a few changes to the ProvenanceEventRepository
API.  There are some inconsistent and error-prone method calls
available that I think can be enhanced.

I've based some of my ideas off of the Hibernate Session object that
allows for this kind of plug-able behavior of persisting and searching
objects.  In general, I believe there should be a bigger overhaul down
the road, but some items to consider for the shorter term:

1. Do not assume ProvenanceEventRecord objects have an ID of type
long.  Instead, I'd like to see the ID be "Serializable" so one can
use numerical values, byte arrays (i.e., hashes), Strings (i.e.,
UUID).

2. Remove the method "getEvents".  From what I can tell, the only
place it is used is to generate the "Oldest event available" display
on the Provenance UI.  The way in which it is used is a bit wonky and
not actually correct.  The ControllerFacade assumes that the first
item submitted to the repository is going to be the oldest.  This
isn't true. Since the client is responsible for setting the event
date-time, the first item could have any arbitrary value and not
actually the "oldest" event in a way that most users would expect.  I
would recommend removing this label from the UI altogether.  It's
value seems limited and is generated in this incorrect manner.

3. Add a "clear" method to allow the users to manually empty the repository.

4. Remove method "getMaxEventId".  In an implementation that uses
UUID/Hash ids, there is no such thing as a "max" ID.

5. Remove methods "registerEvent"/"registerEvents" in favor of a
single "addEvent" that accepts a single ProvenanceEventRecord and
returns its Serializable ID.

6. Introduce a  ProvenanceEventRepositoryRuntimeException class (or
something like that) instead of using explicit IO exceptions.  As is,
the IOException seems inconsistent across methods.

I have attached a sample of these changes though only at the
Interface, not all changes required.  Changing to a Serialize object
would affect several classes.

Thanks.
Reply | Threaded
Open this post in threaded view
|

Re: Provenance Repository Interface

Sean Busbey
ASF mailing lists strip all attachments. If you want to reference something
you'll need to upload it somewhere and the include a link.

Normally patches would go onto a filed JIRA to make licensing intention
clear. If the patch is trying to illustrate an approach but has known
limitations, a note to that effect is usually sufficient.

--
Sean
On Jun 4, 2015 7:26 PM, "dam6923 ." <[hidden email]> wrote:

> Hello!
>
> Thanks for all the support and direction thus far in my efforts to
> improve the system provenance record keeping.
>
> I'd like to propose a few changes to the ProvenanceEventRepository
> API.  There are some inconsistent and error-prone method calls
> available that I think can be enhanced.
>
> I've based some of my ideas off of the Hibernate Session object that
> allows for this kind of plug-able behavior of persisting and searching
> objects.  In general, I believe there should be a bigger overhaul down
> the road, but some items to consider for the shorter term:
>
> 1. Do not assume ProvenanceEventRecord objects have an ID of type
> long.  Instead, I'd like to see the ID be "Serializable" so one can
> use numerical values, byte arrays (i.e., hashes), Strings (i.e.,
> UUID).
>
> 2. Remove the method "getEvents".  From what I can tell, the only
> place it is used is to generate the "Oldest event available" display
> on the Provenance UI.  The way in which it is used is a bit wonky and
> not actually correct.  The ControllerFacade assumes that the first
> item submitted to the repository is going to be the oldest.  This
> isn't true. Since the client is responsible for setting the event
> date-time, the first item could have any arbitrary value and not
> actually the "oldest" event in a way that most users would expect.  I
> would recommend removing this label from the UI altogether.  It's
> value seems limited and is generated in this incorrect manner.
>
> 3. Add a "clear" method to allow the users to manually empty the
> repository.
>
> 4. Remove method "getMaxEventId".  In an implementation that uses
> UUID/Hash ids, there is no such thing as a "max" ID.
>
> 5. Remove methods "registerEvent"/"registerEvents" in favor of a
> single "addEvent" that accepts a single ProvenanceEventRecord and
> returns its Serializable ID.
>
> 6. Introduce a  ProvenanceEventRepositoryRuntimeException class (or
> something like that) instead of using explicit IO exceptions.  As is,
> the IOException seems inconsistent across methods.
>
> I have attached a sample of these changes though only at the
> Interface, not all changes required.  Changing to a Serialize object
> would affect several classes.
>
> Thanks.
>
Reply | Threaded
Open this post in threaded view
|

Re: Provenance Repository Interface

Joe Witt
...when JIRA is up :-)

On Fri, Jun 5, 2015 at 12:05 AM, Sean Busbey <[hidden email]> wrote:

> ASF mailing lists strip all attachments. If you want to reference something
> you'll need to upload it somewhere and the include a link.
>
> Normally patches would go onto a filed JIRA to make licensing intention
> clear. If the patch is trying to illustrate an approach but has known
> limitations, a note to that effect is usually sufficient.
>
> --
> Sean
> On Jun 4, 2015 7:26 PM, "dam6923 ." <[hidden email]> wrote:
>
>> Hello!
>>
>> Thanks for all the support and direction thus far in my efforts to
>> improve the system provenance record keeping.
>>
>> I'd like to propose a few changes to the ProvenanceEventRepository
>> API.  There are some inconsistent and error-prone method calls
>> available that I think can be enhanced.
>>
>> I've based some of my ideas off of the Hibernate Session object that
>> allows for this kind of plug-able behavior of persisting and searching
>> objects.  In general, I believe there should be a bigger overhaul down
>> the road, but some items to consider for the shorter term:
>>
>> 1. Do not assume ProvenanceEventRecord objects have an ID of type
>> long.  Instead, I'd like to see the ID be "Serializable" so one can
>> use numerical values, byte arrays (i.e., hashes), Strings (i.e.,
>> UUID).
>>
>> 2. Remove the method "getEvents".  From what I can tell, the only
>> place it is used is to generate the "Oldest event available" display
>> on the Provenance UI.  The way in which it is used is a bit wonky and
>> not actually correct.  The ControllerFacade assumes that the first
>> item submitted to the repository is going to be the oldest.  This
>> isn't true. Since the client is responsible for setting the event
>> date-time, the first item could have any arbitrary value and not
>> actually the "oldest" event in a way that most users would expect.  I
>> would recommend removing this label from the UI altogether.  It's
>> value seems limited and is generated in this incorrect manner.
>>
>> 3. Add a "clear" method to allow the users to manually empty the
>> repository.
>>
>> 4. Remove method "getMaxEventId".  In an implementation that uses
>> UUID/Hash ids, there is no such thing as a "max" ID.
>>
>> 5. Remove methods "registerEvent"/"registerEvents" in favor of a
>> single "addEvent" that accepts a single ProvenanceEventRecord and
>> returns its Serializable ID.
>>
>> 6. Introduce a  ProvenanceEventRepositoryRuntimeException class (or
>> something like that) instead of using explicit IO exceptions.  As is,
>> the IOException seems inconsistent across methods.
>>
>> I have attached a sample of these changes though only at the
>> Interface, not all changes required.  Changing to a Serialize object
>> would affect several classes.
>>
>> Thanks.
>>
Reply | Threaded
Open this post in threaded view
|

RE: Provenance Repository Interface

Mark Payne
In reply to this post by dam6923 .
Hello.

Thanks for the suggestions! So I noted the last time that I touched the Prov Repo that we definitely need to
provide some docs about the overall design of repo, and why we designed it the way that we did. There are
definitely some things that are not obvious about the design.

Let me use this e-mail as a starting point for addressing the why's of the design, based on the feedback provided.
Comments are in-line, below.

If, after you read the explanation for why we chose the current implementation/design, you feel there are still some things
that should be changed, I (and I'm sure several others) would be happy to discuss those in more detail.

Thanks!
-Mark

----------------------------------------

> Date: Thu, 4 Jun 2015 20:25:21 -0400
> Subject: Provenance Repository Interface
> From: [hidden email]
> To: [hidden email]
>
> Hello!
>
> Thanks for all the support and direction thus far in my efforts to
> improve the system provenance record keeping.
>
> I'd like to propose a few changes to the ProvenanceEventRepository
> API. There are some inconsistent and error-prone method calls
> available that I think can be enhanced.
>
> I've based some of my ideas off of the Hibernate Session object that
> allows for this kind of plug-able behavior of persisting and searching
> objects. In general, I believe there should be a bigger overhaul down
> the road, but some items to consider for the shorter term:
>
> 1. Do not assume ProvenanceEventRecord objects have an ID of type
> long. Instead, I'd like to see the ID be "Serializable" so one can
> use numerical values, byte arrays (i.e., hashes), Strings (i.e.,
> UUID).

There are a couple of reasons that we use Longs as identifiers, but the main
reason is that it provides a guaranteed ordering of the events. This can be very
important because NiFi can sometimes perform several Provenance Events for the
same FlowFile within the same millisecond. As a result, the timestamp cannot guarantee
ordering.

Additionally, the Provenance Repo provides two very important mechanisms for retrieving
the data: querying the data, as you would see in the UI; but also iterating over the events
in the repository. The latter case means that we need a way to paginate, essentially, over
the entire repository. We do this by using the getEvents method, indicating the first
Event ID that we are interested in, as well as the maximum number of events.

>
> 2. Remove the method "getEvents". From what I can tell, the only
> place it is used is to generate the "Oldest event available" display
> on the Provenance UI. The way in which it is used is a bit wonky and
> not actually correct. The ControllerFacade assumes that the first
> item submitted to the repository is going to be the oldest. This
> isn't true. Since the client is responsible for setting the event
> date-time, the first item could have any arbitrary value and not
> actually the "oldest" event in a way that most users would expect. I
> would recommend removing this label from the UI altogether. It's
> value seems limited and is generated in this incorrect manner.

The getEvents method is used to iterate over all events (or some subset of Events) in the
repository. This is typically done in a Reporting Task such that we can process all events
in the repository (I spoke about this a bit in the first point above).

With respect to obtaining the oldest event: you are correct in that it does not absolutely
guarantee the first event based on timestamp. It provides the first event based on Event ID.
If the repository is updated simultaneously with many events, it's possible that those two
could be different from one another. However, in practice, since the only way for these
two to diverge is if multiple threads are updating repository simultaneously, then if Event A
has ID 1 and Event B has ID 2, we know that the timestamps of Event A and Event B are going to
be EXTREMELY close - i.e., it's very unlikely that they will vary by more than a millisecond or two,
though they could due to garbage collection, etc. Also note, that you are absolutely correct in
that the "client is responsible for setting the event date-time" but remember that the client
in this sense is actually the NiFi Framework. The repo is not expected to be used by arbitrary
clients.

The way this is used, though, is simply to indicate to the user the timestamp of the earliest event
so that they have an understanding of how far back the repository goes. So, for instance, if they are
interested in searching the repo for events that happened to some piece of data that they think was
received 2 weeks ago, they should know immediately if the events will be there or not.


>
> 3. Add a "clear" method to allow the users to manually empty the repository.
>

We could certainly do this, though I don't honestly understand the use case. Can you explain why
it would be beneficial to allow someone to clear the entire repository?

> 4. Remove method "getMaxEventId". In an implementation that uses
> UUID/Hash ids, there is no such thing as a "max" ID.
>

See explanation above for #1 as to why I think we need to keep Event ID's as Longs.


> 5. Remove methods "registerEvent"/"registerEvents" in favor of a
> single "addEvent" that accepts a single ProvenanceEventRecord and
> returns its Serializable ID.
>

We use registerEvents so that if we have multiple events we can add them to the repository in a single
"transaction" and because it is far more efficient to store a batch of events than to store a single event.
The registerEvent method really is just a convenience method so that the code updating the repo doesn't
always have to wrap a single event in a Collection themselves if they want to update the repo with just
one event.


> 6. Introduce a ProvenanceEventRepositoryRuntimeException class (or
> something like that) instead of using explicit IO exceptions. As is,
> the IOException seems inconsistent across methods.
>

I definitely agree that it is inconsistent across methods. This needs to be addressed. I have created
a ticket to address this: https://issues.apache.org/jira/browse/NIFI-660.

I do, however, content that IOException is appropriate here. If the repo fails to update due to a problem writing
to disk or to a network or whatever, it should absolutely throw an IOException. Any time that we are interacting
with some sort of external device (be it network, disk, or whatever), there is a very real possibility that there will
be environmental problems (such as disk full) that will cause us to fail, and any code that calls into such a method
needs to be responsible for handling those conditions properly, so I feel a checked exception is appropriate.

> I have attached a sample of these changes though only at the
> Interface, not all changes required. Changing to a Serialize object
> would affect several classes.
>
> Thanks.
     
Reply | Threaded
Open this post in threaded view
|

Re: Provenance Repository Interface

dam6923 .
Hello,

Thanks again for your time and sincere considerations.

Here is the sample I was trying to email.  Jira was giving me fits, so
I tried to email it, but since it's just a discussion at this point,
and not really an action item, I've dropped it on PasteBin.

http://pastebin.com/ani98rS5

In general, I'm trying to think of ways to allow for a more flexible,
pluggable, framework and not being tied down to how things exist at
the moment.

1) Longs v.s. UUID.  I think you'll find that when you move to a
clustered, distributed environment, UUIDs will be more useful.  So at
this point, I'm not asking to move the current code base away from
numerical IDs.  I'm simply asking that the interface be changed to
allow for it.  Defining Serializable IDs allows for the current
numerical implementation and also for someone else to drop in UUIDs
later (or whatever other IDs they can think of).

Pagination is traditionally performed by sorting on a particular
domain field (date most often) and then taking a subset of the sorted
set.  I've already described to you a way in which event dates can be
completely arbitrary in terms of insertion in a multi-threaded
application.  Another practical situation would be in the case of a
cluster. A node may queue events during a network outage and later
submit events to the cluster manager where the IDs no longer give any
kind of indication of the event time.  With such an outage, it's not
reasonable to assume that event IDs are inserted in the cluster
manager in any kind of ordering.  Perhaps the answer is to sort on
event date, and have each processor stamp some sort of one-up marker
on each event so that even if it happens within the same millisecond,
order can be preserved.  I'd have to think about that more, but the
answer is not to rely on the IDs having any kind of implicit ordering.

2) getEvents - I must apologize. Perhaps I am missing something, but
the only place I can locate its use is in ControllerFacade and it's to
generate the one value that displays the the "oldest" event.  What is
the use case for viewing all the events in the repository?  If it's
important to iterate over all the events in the repository, allow the
implementation to decide how pagination works.  A simple method like:
public Collection<ProvenanceEventRecord>
getProvenanceEventRecords(final int pageNo, final int pageSize) would
suffice.

3) If you feel that the "oldest" event is a required feature, then
allow the repository to implement the solution.  Include a
"getOldestEvent" method or update the query mechanism to allow for
sorted and "top n" type queries.

4)  The "clear" method would be nice to plan for now, but tying it
into the UI could come at some later point.  The feature would be
mostly useful for testing purposes.  One may want to build a flow, run
some data through it, reset everything, make some tweaks, and try it
again.  It's helpful to clear out the data between test runs so that
there is no confusion about what events were generated on THIS test
run an which were generated during the LAST test run.  Secondly, it's
a quick way to clear out the events in the case where resources have
become lacking for some reason and an operator needs a quick fix.
Dumping months of events may be a quick option.

5) Pedantic, but I was suggesting addEvent, getEvent, to map more
closely to what a Java developer would expect to see when working with
some sort of data "collection" or "repository."  This would be in
place of the method "registerEvent."  Second, when I was writing the
unit tests, it felt clunky because I had no easy way to get the ID
that the repository generated for the even after I submitted it.  I
wanted to write a simple:

long id = addEvent(actualEvent);
Event resultEvent = getEvent(id);
assertEquals(actualEvent,resultEvent);

unit test, but it's not currently possible.  The addEvent method
should return the created ID.  In such a scenario, to be consistent, a
given addEvents method should return a list of IDs that were created,
one for each event.  That's one way to go about it, but it's not
really clean.  Depending on the storage mechanism, some of the events
could be persisted, but if an error occurred part way through the
list, and there was no rollback plan, then there's no way for the
repository to communicate the exception and the list of IDs that did
succeed.  I think that should be handled at a higher level.

6) Well, there's always the debate these days about checked v.s.
unchecked exceptions. I would just say that if the repository is
suppose to do any validation of the events, then it could be rolled
up, with IOException into one custom event that all methods throw.  In
the current implementation of the Standard ProvenanceEventBuilder, I
see that it does validation in the build method, though that is not
documented as a requirement of a ProvenanceEventBuilder.  So, I guess
if that is documented as part of the function of the build method, and
not on the repository, then the repository might reasonably be believe
to only choke on an IOException.

Thanks!


On Fri, Jun 5, 2015 at 8:16 AM, Mark Payne <[hidden email]> wrote:

> Hello.
>
> Thanks for the suggestions! So I noted the last time that I touched the Prov Repo that we definitely need to
> provide some docs about the overall design of repo, and why we designed it the way that we did. There are
> definitely some things that are not obvious about the design.
>
> Let me use this e-mail as a starting point for addressing the why's of the design, based on the feedback provided.
> Comments are in-line, below.
>
> If, after you read the explanation for why we chose the current implementation/design, you feel there are still some things
> that should be changed, I (and I'm sure several others) would be happy to discuss those in more detail.
>
> Thanks!
> -Mark
>
> ----------------------------------------
>> Date: Thu, 4 Jun 2015 20:25:21 -0400
>> Subject: Provenance Repository Interface
>> From: [hidden email]
>> To: [hidden email]
>>
>> Hello!
>>
>> Thanks for all the support and direction thus far in my efforts to
>> improve the system provenance record keeping.
>>
>> I'd like to propose a few changes to the ProvenanceEventRepository
>> API. There are some inconsistent and error-prone method calls
>> available that I think can be enhanced.
>>
>> I've based some of my ideas off of the Hibernate Session object that
>> allows for this kind of plug-able behavior of persisting and searching
>> objects. In general, I believe there should be a bigger overhaul down
>> the road, but some items to consider for the shorter term:
>>
>> 1. Do not assume ProvenanceEventRecord objects have an ID of type
>> long. Instead, I'd like to see the ID be "Serializable" so one can
>> use numerical values, byte arrays (i.e., hashes), Strings (i.e.,
>> UUID).
>
> There are a couple of reasons that we use Longs as identifiers, but the main
> reason is that it provides a guaranteed ordering of the events. This can be very
> important because NiFi can sometimes perform several Provenance Events for the
> same FlowFile within the same millisecond. As a result, the timestamp cannot guarantee
> ordering.
>
> Additionally, the Provenance Repo provides two very important mechanisms for retrieving
> the data: querying the data, as you would see in the UI; but also iterating over the events
> in the repository. The latter case means that we need a way to paginate, essentially, over
> the entire repository. We do this by using the getEvents method, indicating the first
> Event ID that we are interested in, as well as the maximum number of events.
>
>>
>> 2. Remove the method "getEvents". From what I can tell, the only
>> place it is used is to generate the "Oldest event available" display
>> on the Provenance UI. The way in which it is used is a bit wonky and
>> not actually correct. The ControllerFacade assumes that the first
>> item submitted to the repository is going to be the oldest. This
>> isn't true. Since the client is responsible for setting the event
>> date-time, the first item could have any arbitrary value and not
>> actually the "oldest" event in a way that most users would expect. I
>> would recommend removing this label from the UI altogether. It's
>> value seems limited and is generated in this incorrect manner.
>
> The getEvents method is used to iterate over all events (or some subset of Events) in the
> repository. This is typically done in a Reporting Task such that we can process all events
> in the repository (I spoke about this a bit in the first point above).
>
> With respect to obtaining the oldest event: you are correct in that it does not absolutely
> guarantee the first event based on timestamp. It provides the first event based on Event ID.
> If the repository is updated simultaneously with many events, it's possible that those two
> could be different from one another. However, in practice, since the only way for these
> two to diverge is if multiple threads are updating repository simultaneously, then if Event A
> has ID 1 and Event B has ID 2, we know that the timestamps of Event A and Event B are going to
> be EXTREMELY close - i.e., it's very unlikely that they will vary by more than a millisecond or two,
> though they could due to garbage collection, etc. Also note, that you are absolutely correct in
> that the "client is responsible for setting the event date-time" but remember that the client
> in this sense is actually the NiFi Framework. The repo is not expected to be used by arbitrary
> clients.
>
> The way this is used, though, is simply to indicate to the user the timestamp of the earliest event
> so that they have an understanding of how far back the repository goes. So, for instance, if they are
> interested in searching the repo for events that happened to some piece of data that they think was
> received 2 weeks ago, they should know immediately if the events will be there or not.
>
>
>>
>> 3. Add a "clear" method to allow the users to manually empty the repository.
>>
>
> We could certainly do this, though I don't honestly understand the use case. Can you explain why
> it would be beneficial to allow someone to clear the entire repository?
>
>> 4. Remove method "getMaxEventId". In an implementation that uses
>> UUID/Hash ids, there is no such thing as a "max" ID.
>>
>
> See explanation above for #1 as to why I think we need to keep Event ID's as Longs.
>
>
>> 5. Remove methods "registerEvent"/"registerEvents" in favor of a
>> single "addEvent" that accepts a single ProvenanceEventRecord and
>> returns its Serializable ID.
>>
>
> We use registerEvents so that if we have multiple events we can add them to the repository in a single
> "transaction" and because it is far more efficient to store a batch of events than to store a single event.
> The registerEvent method really is just a convenience method so that the code updating the repo doesn't
> always have to wrap a single event in a Collection themselves if they want to update the repo with just
> one event.
>
>
>> 6. Introduce a ProvenanceEventRepositoryRuntimeException class (or
>> something like that) instead of using explicit IO exceptions. As is,
>> the IOException seems inconsistent across methods.
>>
>
> I definitely agree that it is inconsistent across methods. This needs to be addressed. I have created
> a ticket to address this: https://issues.apache.org/jira/browse/NIFI-660.
>
> I do, however, content that IOException is appropriate here. If the repo fails to update due to a problem writing
> to disk or to a network or whatever, it should absolutely throw an IOException. Any time that we are interacting
> with some sort of external device (be it network, disk, or whatever), there is a very real possibility that there will
> be environmental problems (such as disk full) that will cause us to fail, and any code that calls into such a method
> needs to be responsible for handling those conditions properly, so I feel a checked exception is appropriate.
>
>> I have attached a sample of these changes though only at the
>> Interface, not all changes required. Changing to a Serialize object
>> would affect several classes.
>>
>> Thanks.
>
Reply | Threaded
Open this post in threaded view
|

RE: Provenance Repository Interface

Mark Payne
I totally understand where you are coming from. And I very much appreciate your consideration of how to improve
such an important part of the system. I am sure that the Provenance Repo is far from perfect and will continue to
evolve over time. And creating a flexible, pluggable framework is indeed a good thing. 

It often does, however, come at a cost. In this case, an important potential cost would be performance.

Consider this use case (which is a very real use case that drove much of the implementation
and design). I have a ReportingTask that needs to process every Provenance Event that occurs.
I need to process the events with little overhead, and each node is expected to keep several billion events.

In this case, I will ask for 1000 events at a time starting at event id 1 (I do this using the getEvents
method. This is not used by the framework except in the case that you noticed, but is intended
for ReportingTasks mostly). Once I've processed them, I will persist the fact that I finished record 1000 
and then ask for 1000 events starting at event id 1001 and repeat. If there are no more events, I wait 
1 second and then continue. Requesting events starting at  a particular event id is quite inexpensive 
due to the specific implementation.

If we were required to instead say "start at page X using this sort order", each request for 1000 events
will have to do quite a bit of work to sort those events and can be pretty expensive. We request only 1000 
events at a time because it allows us to be conservative with heap utilization. However, we do want to pull 
back several events at a time for efficiency reasons (disk IO, etc).

It's also important to note that the existing implementation has been used in massive clustered, distributed 
environments for years with great success. Nodes do not send events to the NCM in an async manner, as I 
believe you are describing below. When a user issues a query, the NCM replicates the request to each node
and then merges their responses.

I don't by any means want to imply that I think using a Serializable as the key is not a reasonable (or perhaps
an even better) solution for some use cases. However, it would come at some costs and I think
that we would need some very convincing, concrete use cases in order to make such a significant change
beneficial.

Thanks
-Mark


----------------------------------------

> Date: Sat, 6 Jun 2015 00:08:57 -0400
> Subject: Re: Provenance Repository Interface
> From: [hidden email]
> To: [hidden email]
>
> Hello,
>
> Thanks again for your time and sincere considerations.
>
> Here is the sample I was trying to email. Jira was giving me fits, so
> I tried to email it, but since it's just a discussion at this point,
> and not really an action item, I've dropped it on PasteBin.
>
> http://pastebin.com/ani98rS5
>
> In general, I'm trying to think of ways to allow for a more flexible,
> pluggable, framework and not being tied down to how things exist at
> the moment.
>
> 1) Longs v.s. UUID. I think you'll find that when you move to a
> clustered, distributed environment, UUIDs will be more useful. So at
> this point, I'm not asking to move the current code base away from
> numerical IDs. I'm simply asking that the interface be changed to
> allow for it. Defining Serializable IDs allows for the current
> numerical implementation and also for someone else to drop in UUIDs
> later (or whatever other IDs they can think of).
>
> Pagination is traditionally performed by sorting on a particular
> domain field (date most often) and then taking a subset of the sorted
> set. I've already described to you a way in which event dates can be
> completely arbitrary in terms of insertion in a multi-threaded
> application. Another practical situation would be in the case of a
> cluster. A node may queue events during a network outage and later
> submit events to the cluster manager where the IDs no longer give any
> kind of indication of the event time. With such an outage, it's not
> reasonable to assume that event IDs are inserted in the cluster
> manager in any kind of ordering. Perhaps the answer is to sort on
> event date, and have each processor stamp some sort of one-up marker
> on each event so that even if it happens within the same millisecond,
> order can be preserved. I'd have to think about that more, but the
> answer is not to rely on the IDs having any kind of implicit ordering.
>
> 2) getEvents - I must apologize. Perhaps I am missing something, but
> the only place I can locate its use is in ControllerFacade and it's to
> generate the one value that displays the the "oldest" event. What is
> the use case for viewing all the events in the repository? If it's
> important to iterate over all the events in the repository, allow the
> implementation to decide how pagination works. A simple method like:
> public Collection<ProvenanceEventRecord>
> getProvenanceEventRecords(final int pageNo, final int pageSize) would
> suffice.
>
> 3) If you feel that the "oldest" event is a required feature, then
> allow the repository to implement the solution. Include a
> "getOldestEvent" method or update the query mechanism to allow for
> sorted and "top n" type queries.
>
> 4) The "clear" method would be nice to plan for now, but tying it
> into the UI could come at some later point. The feature would be
> mostly useful for testing purposes. One may want to build a flow, run
> some data through it, reset everything, make some tweaks, and try it
> again. It's helpful to clear out the data between test runs so that
> there is no confusion about what events were generated on THIS test
> run an which were generated during the LAST test run. Secondly, it's
> a quick way to clear out the events in the case where resources have
> become lacking for some reason and an operator needs a quick fix.
> Dumping months of events may be a quick option.
>
> 5) Pedantic, but I was suggesting addEvent, getEvent, to map more
> closely to what a Java developer would expect to see when working with
> some sort of data "collection" or "repository." This would be in
> place of the method "registerEvent." Second, when I was writing the
> unit tests, it felt clunky because I had no easy way to get the ID
> that the repository generated for the even after I submitted it. I
> wanted to write a simple:
>
> long id = addEvent(actualEvent);
> Event resultEvent = getEvent(id);
> assertEquals(actualEvent,resultEvent);
>
> unit test, but it's not currently possible. The addEvent method
> should return the created ID. In such a scenario, to be consistent, a
> given addEvents method should return a list of IDs that were created,
> one for each event. That's one way to go about it, but it's not
> really clean. Depending on the storage mechanism, some of the events
> could be persisted, but if an error occurred part way through the
> list, and there was no rollback plan, then there's no way for the
> repository to communicate the exception and the list of IDs that did
> succeed. I think that should be handled at a higher level.
>
> 6) Well, there's always the debate these days about checked v.s.
> unchecked exceptions. I would just say that if the repository is
> suppose to do any validation of the events, then it could be rolled
> up, with IOException into one custom event that all methods throw. In
> the current implementation of the Standard ProvenanceEventBuilder, I
> see that it does validation in the build method, though that is not
> documented as a requirement of a ProvenanceEventBuilder. So, I guess
> if that is documented as part of the function of the build method, and
> not on the repository, then the repository might reasonably be believe
> to only choke on an IOException.
>
> Thanks!
>
>
> On Fri, Jun 5, 2015 at 8:16 AM, Mark Payne <[hidden email]> wrote:
>> Hello.
>>
>> Thanks for the suggestions! So I noted the last time that I touched the Prov Repo that we definitely need to
>> provide some docs about the overall design of repo, and why we designed it the way that we did. There are
>> definitely some things that are not obvious about the design.
>>
>> Let me use this e-mail as a starting point for addressing the why's of the design, based on the feedback provided.
>> Comments are in-line, below.
>>
>> If, after you read the explanation for why we chose the current implementation/design, you feel there are still some things
>> that should be changed, I (and I'm sure several others) would be happy to discuss those in more detail.
>>
>> Thanks!
>> -Mark
>>
>> ----------------------------------------
>>> Date: Thu, 4 Jun 2015 20:25:21 -0400
>>> Subject: Provenance Repository Interface
>>> From: [hidden email]
>>> To: [hidden email]
>>>
>>> Hello!
>>>
>>> Thanks for all the support and direction thus far in my efforts to
>>> improve the system provenance record keeping.
>>>
>>> I'd like to propose a few changes to the ProvenanceEventRepository
>>> API. There are some inconsistent and error-prone method calls
>>> available that I think can be enhanced.
>>>
>>> I've based some of my ideas off of the Hibernate Session object that
>>> allows for this kind of plug-able behavior of persisting and searching
>>> objects. In general, I believe there should be a bigger overhaul down
>>> the road, but some items to consider for the shorter term:
>>>
>>> 1. Do not assume ProvenanceEventRecord objects have an ID of type
>>> long. Instead, I'd like to see the ID be "Serializable" so one can
>>> use numerical values, byte arrays (i.e., hashes), Strings (i.e.,
>>> UUID).
>>
>> There are a couple of reasons that we use Longs as identifiers, but the main
>> reason is that it provides a guaranteed ordering of the events. This can be very
>> important because NiFi can sometimes perform several Provenance Events for the
>> same FlowFile within the same millisecond. As a result, the timestamp cannot guarantee
>> ordering.
>>
>> Additionally, the Provenance Repo provides two very important mechanisms for retrieving
>> the data: querying the data, as you would see in the UI; but also iterating over the events
>> in the repository. The latter case means that we need a way to paginate, essentially, over
>> the entire repository. We do this by using the getEvents method, indicating the first
>> Event ID that we are interested in, as well as the maximum number of events.
>>
>>>
>>> 2. Remove the method "getEvents". From what I can tell, the only
>>> place it is used is to generate the "Oldest event available" display
>>> on the Provenance UI. The way in which it is used is a bit wonky and
>>> not actually correct. The ControllerFacade assumes that the first
>>> item submitted to the repository is going to be the oldest. This
>>> isn't true. Since the client is responsible for setting the event
>>> date-time, the first item could have any arbitrary value and not
>>> actually the "oldest" event in a way that most users would expect. I
>>> would recommend removing this label from the UI altogether. It's
>>> value seems limited and is generated in this incorrect manner.
>>
>> The getEvents method is used to iterate over all events (or some subset of Events) in the
>> repository. This is typically done in a Reporting Task such that we can process all events
>> in the repository (I spoke about this a bit in the first point above).
>>
>> With respect to obtaining the oldest event: you are correct in that it does not absolutely
>> guarantee the first event based on timestamp. It provides the first event based on Event ID.
>> If the repository is updated simultaneously with many events, it's possible that those two
>> could be different from one another. However, in practice, since the only way for these
>> two to diverge is if multiple threads are updating repository simultaneously, then if Event A
>> has ID 1 and Event B has ID 2, we know that the timestamps of Event A and Event B are going to
>> be EXTREMELY close - i.e., it's very unlikely that they will vary by more than a millisecond or two,
>> though they could due to garbage collection, etc. Also note, that you are absolutely correct in
>> that the "client is responsible for setting the event date-time" but remember that the client
>> in this sense is actually the NiFi Framework. The repo is not expected to be used by arbitrary
>> clients.
>>
>> The way this is used, though, is simply to indicate to the user the timestamp of the earliest event
>> so that they have an understanding of how far back the repository goes. So, for instance, if they are
>> interested in searching the repo for events that happened to some piece of data that they think was
>> received 2 weeks ago, they should know immediately if the events will be there or not.
>>
>>
>>>
>>> 3. Add a "clear" method to allow the users to manually empty the repository.
>>>
>>
>> We could certainly do this, though I don't honestly understand the use case. Can you explain why
>> it would be beneficial to allow someone to clear the entire repository?
>>
>>> 4. Remove method "getMaxEventId". In an implementation that uses
>>> UUID/Hash ids, there is no such thing as a "max" ID.
>>>
>>
>> See explanation above for #1 as to why I think we need to keep Event ID's as Longs.
>>
>>
>>> 5. Remove methods "registerEvent"/"registerEvents" in favor of a
>>> single "addEvent" that accepts a single ProvenanceEventRecord and
>>> returns its Serializable ID.
>>>
>>
>> We use registerEvents so that if we have multiple events we can add them to the repository in a single
>> "transaction" and because it is far more efficient to store a batch of events than to store a single event.
>> The registerEvent method really is just a convenience method so that the code updating the repo doesn't
>> always have to wrap a single event in a Collection themselves if they want to update the repo with just
>> one event.
>>
>>
>>> 6. Introduce a ProvenanceEventRepositoryRuntimeException class (or
>>> something like that) instead of using explicit IO exceptions. As is,
>>> the IOException seems inconsistent across methods.
>>>
>>
>> I definitely agree that it is inconsistent across methods. This needs to be addressed. I have created
>> a ticket to address this: https://issues.apache.org/jira/browse/NIFI-660.
>>
>> I do, however, content that IOException is appropriate here. If the repo fails to update due to a problem writing
>> to disk or to a network or whatever, it should absolutely throw an IOException. Any time that we are interacting
>> with some sort of external device (be it network, disk, or whatever), there is a very real possibility that there will
>> be environmental problems (such as disk full) that will cause us to fail, and any code that calls into such a method
>> needs to be responsible for handling those conditions properly, so I feel a checked exception is appropriate.
>>
>>> I have attached a sample of these changes though only at the
>>> Interface, not all changes required. Changing to a Serialize object
>>> would affect several classes.
>>>
>>> Thanks.
>>