Adding schema inference


Adding schema inference

Mike Thomsen
Matt left this suggestion:

https://github.com/apache/nifi/pull/3231#discussion_r285693972

What would be a good example of that pattern if I wanted to update that PR
and document the process for others?

Thanks,

Mike

Re: Adding schema inference

Matt Burgess-2
Mike,

Check AbstractRouteRecord; it uses the read-first-get-schema-read-rest
pattern. However, in that snippet it is the RecordReader that may
update the schema (currently the only thing that does this is schema
inference), and the RecordSetWriter is then created using the
(possibly updated) schema. For your PR, you might need to update the
schema manually and then pass that into the
RecordSetWriterFactory.getSchema() call.

Now that I think of it, you might be the first to do this
automatic-schema-update-on-write thing (if you choose to do so). I
thought NIFI-5524 [1] had been implemented but apparently not.

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-5524
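
The read-first-get-schema-read-rest pattern Matt describes can be sketched roughly as below. Note this uses plain-Java stand-ins (a record is just a Map, a "schema" just its field names, and `process` plays the role of the processor's onTrigger body); it is not the actual NiFi RecordReader/RecordSetWriterFactory API, only an illustration of the ordering: read one record, derive the schema from it, create the writer with that schema, then write the first record and the rest of the set.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-ins for NiFi's record reader/writer classes.
public class ReadFirstPattern {

    // Stand-in for schema inference: derive the "schema"
    // (here just the field names) from a single record.
    static List<String> inferSchema(Map<String, Object> record) {
        return new ArrayList<>(record.keySet());
    }

    // Read the first record, get the (possibly updated) schema from it,
    // "create the writer" with that schema, then write the first record
    // and read/write the rest of the set.
    static List<String> process(Iterator<Map<String, Object>> reader,
                                List<Map<String, Object>> written) {
        Map<String, Object> first = reader.next();   // read first
        List<String> schema = inferSchema(first);    // get schema
        written.add(first);                          // write first with that schema
        while (reader.hasNext()) {                   // read/write the rest
            written.add(reader.next());
        }
        return schema;
    }

    public static void main(String[] args) {
        Map<String, Object> r1 = new LinkedHashMap<>();
        r1.put("id", 1);
        r1.put("name", "a");
        Map<String, Object> r2 = new LinkedHashMap<>();
        r2.put("id", 2);
        r2.put("name", "b");

        List<Map<String, Object>> out = new ArrayList<>();
        List<String> schema = process(List.of(r1, r2).iterator(), out);
        System.out.println(schema);     // [id, name]
        System.out.println(out.size()); // 2
    }
}
```

In the real processor the same shape applies, with RecordReader.nextRecord() in place of the iterator and RecordSetWriterFactory.createWriter(...) in place of the output list.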



Re: Adding schema inference

Mike Thomsen
All things considered, it doesn't seem like that much work. My main concern
is how to handle it if a user nests the data deep into multiple schema
levels that don't exist. How should that be handled: throw an exception, or
generate empty record declarations that each have one field?


Re: Adding schema inference

Matt Burgess-2
I think it'd be great to create one-field records for each level that
doesn't exist; for things like LookupRecord and/or UpdateRecord (from
the aforementioned Jira) I had envisioned the same thing. It would
make it easier for other such processors to enrich/enlarge a
newly-existing record level, basically an upsert pattern (or mkdir -p
:)

Regards,
Matt
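
The "mkdir -p" behavior described above could be sketched like this, with plain nested maps standing in for NiFi records (this is an illustration of the idea, not the actual Record API): each missing level is created as an empty record on the way down, and the leaf value is then set.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the upsert / "mkdir -p" pattern on a record tree:
// create each missing level on the way down, then set the leaf field.
public class RecordUpsert {

    @SuppressWarnings("unchecked")
    static void upsert(Map<String, Object> root, String[] path, Object value) {
        Map<String, Object> current = root;
        for (int i = 0; i < path.length - 1; i++) {
            // Generate an empty record level if it doesn't exist yet,
            // rather than throwing an exception.
            current = (Map<String, Object>) current
                    .computeIfAbsent(path[i], k -> new HashMap<String, Object>());
        }
        current.put(path[path.length - 1], value);
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        // None of the levels a/b exist yet; they are created on demand.
        upsert(record, new String[] {"a", "b", "c"}, 42);
        System.out.println(record);   // {a={b={c=42}}}
    }
}
```

In a real implementation each created level would also need a matching entry added to the RecordSchema, which is the schema-update-on-write part discussed earlier in the thread.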


Re: Adding schema inference

Mike Thomsen
I think tabling that feature for this processor is probably the right way
to go for now. It really should be part of a larger upgrade to the Record
API, because just jamming in new fields without warning here could be very
unwelcome to some users, such as my current project team (we have very
strict controls on schemas compared to most of my other engagements).

(FWIW, we're actually using this processor on real data)

Thanks,

Mike
