Help with loading a file into a cache

davidrsmith
Hi Devs
I am running a NiFi 1.8 cluster; each node has 128 GB of RAM. I need to load the contents of a file of around 5 GB into a key/value cache.
The file I want to load is produced by another company, so the format it comes in is not negotiable. The file contains thousands of lines in the following format:
<index value1>:{<property1 name>: <property1 value>, <property2 name>:<property2 value>}
<index value2>:{<property1 name>: <property1 value>, <property2 name>:<property2 value>}
<index value3>:{<property1 name>: <property1 value>, <property2 name>:<property2 value>}

I want the index value to become the key and everything after the first colon to become the value.
What would be the most efficient way of reading the file and parsing it to load into a cache? I thought of reading in the file, splitting the content on CR/LF, and then splitting each line on the first colon. I have noticed that 1.8 has some CSV and JSON readers (controller services). Would these be a better way of doing this? The problem I can see is that the file isn't quite CSV and isn't quite a JSON array.
Many thanks
Dave

Re: Help with loading a file into a cache

Mike Thomsen
Dave,

Can you post a redacted example with dummy data?

Thanks,

Mike


Re: Help with loading a file into a cache

davidrsmith
In reply to this post by davidrsmith
Hi

As requested, here is an example file with some redacted data:

ZA105:{"Aircraft Type":"Sea King", "Lifed Items":{ "port engine ser#":"RR-P1234", "starboard engine ser#":"RR-S1234","gearboxes ser#":[ "WHM1234", "WHI1234", "WHT1234" ] }}
ZA106:{"Aircraft Type":"Sea King", "Lifed Items":{ "port engine ser#":"RR-P2345", "starboard engine ser#":"RR-S2345","gearboxes ser#":[ "WHM2345", "WHI2345", "WHT2345" ] }}
ZA107:{"Aircraft Type":"Merlin", "Lifed Items":{ "port engine ser#":"RR-P3456", "starboard engine ser#":"RR-S3456","centre engine ser#":"RR-C3456","gearboxes ser#":[ "WHM3456", "WHI3456", "WHT3456" ] }}
ZA108:{"Aircraft Type":"Merlin", "Lifed Items":{ "port engine ser#":"RR-P4567", "starboard engine ser#":"RR-S4567","centre engine ser#":"RR-C4567","gearboxes ser#":[ "WHM4567", "WHI4567", "WHT4567" ] }}
ZA109:{"Aircraft Type":"Wessex", "Lifed Items":{ "port engine":"RR-P9876", "starboard engine":"RR-S9876","gearboxes":[ "WHM9876", "WHI9876", "WHT9876" ] }}
ZA104:{"Aircraft Type":"Wessex", "Lifed Items":{ "port engine":"RR-P8765", "starboard engine":"RR-S8765","gearboxes":[ "WHM8765", "WHI8765", "WHT8765" ] }}
ZA103:{"Aircraft Type":"Wessex", "Lifed Items":{ "port engine":"RR-P7654", "starboard engine":"RR-S7654","gearboxes":[ "WHM7654", "WHI7654", "WHT7654" ] }}



What I would like is for the aircraft tail number (e.g. ZA104) to be the key of the cache item, and everything after the first colon (the aircraft type and the serial numbers of the replaceable items) to be the cached item value. The cached item value can stay as a JSON string.


Many thanks

Dave
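
For illustration only, here is a minimal Python sketch of the requirement described above: read the file one line at a time, split each line on the first colon, and keep the remainder as a raw JSON string. The function name `iter_cache_entries` is hypothetical, a plain dict stands in for whichever cache product is eventually used, and an in-memory buffer stands in for the real file. It is not NiFi code.

```python
import io
import json
from typing import Dict, Iterator, TextIO, Tuple


def iter_cache_entries(lines: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (key, raw JSON value) pairs, reading one line at a time."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # partition() splits on the FIRST colon only, so colons inside
        # the JSON value are left untouched.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"no colon found in line: {line!r}")
        json.loads(value)  # optional sanity check: raises if the value is not valid JSON
        yield key.strip(), value


# Two lines from the sample data above, standing in for the real file.
sample = io.StringIO(
    'ZA105:{"Aircraft Type":"Sea King", "Lifed Items":{ "port engine ser#":"RR-P1234", "starboard engine ser#":"RR-S1234","gearboxes ser#":[ "WHM1234", "WHI1234", "WHT1234" ] }}\n'
    'ZA109:{"Aircraft Type":"Wessex", "Lifed Items":{ "port engine":"RR-P9876", "starboard engine":"RR-S9876","gearboxes":[ "WHM9876", "WHI9876", "WHT9876" ] }}\n'
)

# A dict is an assumption standing in for the real key/value cache.
cache: Dict[str, str] = dict(iter_cache_entries(sample))
print(sorted(cache))
```

Because the generator never holds more than one line at a time, the same loop would work on a multi-gigabyte file opened with `open(path, encoding="utf-8")`; only the cache itself grows.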

Re: Help with loading a file into a cache

Mike Thomsen
What level of comfort do you have writing scripts or Java code, and what
product will do the storage?


Re: Help with loading a file into a cache

Matt Burgess-2
In reply to this post by davidrsmith
Dave,

Depending on the processor you're going to use to store these records
into cache (see Mike's reply), if you want to convert each of the
lines to JSON objects, you can use ReplaceText:

Search Value: ^([^:]+):(.*)
Replacement Value: {"$1":$2}
Replacement Strategy: Regex Replace
Evaluation Mode: Line-by-Line

This creates a valid JSON object on each line, with a single key (the
index value) whose value is the embedded JSON object. Then, as of NiFi
1.7.0 [1], you can use record-based processors with a JsonTreeReader,
and it will process one JSON object per line.
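
To see what that replacement does outside NiFi, here is a rough Python sketch. It only mimics the regular expression; NiFi uses Java regex syntax, so `$1`/`$2` become `\1`/`\2` here, and `re.MULTILINE` stands in for line-by-line evaluation. The function name `wrap_lines_as_json` is made up for the example.

```python
import json
import re

# Same pattern as the Search Value above. re.MULTILINE makes ^ match at the
# start of every line, which approximates evaluating the text line by line.
LINE_PATTERN = re.compile(r"^([^:]+):(.*)", re.MULTILINE)


def wrap_lines_as_json(text: str) -> str:
    """Rewrite each `key:{...}` line as `{"key":{...}}`."""
    # Java's $1 and $2 back-references are written \1 and \2 in Python.
    return LINE_PATTERN.sub(r'{"\1":\2}', text)


# Two lines from Dave's sample data.
sample = (
    'ZA105:{"Aircraft Type":"Sea King", "Lifed Items":{ "port engine ser#":"RR-P1234", "starboard engine ser#":"RR-S1234","gearboxes ser#":[ "WHM1234", "WHI1234", "WHT1234" ] }}\n'
    'ZA107:{"Aircraft Type":"Merlin", "Lifed Items":{ "port engine ser#":"RR-P3456", "starboard engine ser#":"RR-S3456","centre engine ser#":"RR-C3456","gearboxes ser#":[ "WHM3456", "WHI3456", "WHT3456" ] }}'
)

converted = wrap_lines_as_json(sample)

# Each converted line now parses on its own as a JSON object.
records = [json.loads(line) for line in converted.splitlines()]
print(list(records[0]))
```

Note that the key is copied into the JSON as-is, so this only yields valid JSON when the key contains no double quotes or backslashes, which holds for tail numbers like ZA105.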

If you'd like to have the key as an attribute and only the JSON object
as the payload, you can use SplitText with Line Count = 1 to split the
file into individual flow files (1 line per file), then ExtractText to
get the key. Add a user-defined property (let's call it cache.key) to
ExtractText:

cache.key = ^([^:]+):.*

This extracts the key into an attribute called cache.key, but the key
and its colon still remain in the flow file content, so you'll need a
ReplaceText to remove them:

Search Value: ^([^:]+):(.*)
Replacement Value: $2
Replacement Strategy: Regex Replace
Evaluation Mode: Line-by-Line
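
As a rough check of the two expressions, the Python sketch below applies them to one line, the way a single-line flow file would be handled after SplitText. It only imitates the regex behaviour; `split_flow_file` and the attributes dict are hypothetical stand-ins, not NiFi APIs.

```python
import json
import re
from typing import Dict, Tuple

KEY_PATTERN = re.compile(r"^([^:]+):.*")      # the cache.key expression
STRIP_PATTERN = re.compile(r"^([^:]+):(.*)")  # the ReplaceText Search Value


def split_flow_file(content: str) -> Tuple[Dict[str, str], str]:
    """Return (attributes, payload) for one `key:{...}` line."""
    match = KEY_PATTERN.match(content)
    if match is None:
        raise ValueError("line has no key prefix")
    # Capture group 1 becomes the attribute value.
    attributes = {"cache.key": match.group(1)}
    # Replacing the whole match with group 2 drops the key and its colon.
    payload = STRIP_PATTERN.sub(r"\2", content, count=1)
    return attributes, payload


# One line from Dave's sample data.
line = 'ZA104:{"Aircraft Type":"Wessex", "Lifed Items":{ "port engine":"RR-P8765", "starboard engine":"RR-S8765","gearboxes":[ "WHM8765", "WHI8765", "WHT8765" ] }}'

attributes, payload = split_flow_file(line)
print(attributes["cache.key"])
print(json.loads(payload)["Aircraft Type"])
```

Since `[^:]+` cannot cross a colon, the first group always ends at the first colon on the line, so the colons inside the JSON payload do not affect the extracted key.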

Regards,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-4456