Using the template cvs to json to convert dsv files into json format: a few questions

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Using the template cvs to json to convert dsv files into json format: a few questions

idioma
This post was updated on .
Hi,
I am presented with the task of converting dsv files into json format. The resulting json format will consist of a number of "joins" between the data contained in these files. The steps I have identified so far are the following:

- Read the data from a location;
- Apply some basic transformation such as converting datetime to date
- Generate the final Json by taking into account a number of mappings

I have looked at existing processors/templates and I ended up looking at the template csv -> json:

https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates

Ideally, I believe my flow should look like this in the end:

GetFile -> ReplaceText (e.g. datetime -> date conversion) ->ExtractText (I guess this is needed to parse the result of the regex, I believe - correct me if I am wrong) -> ReplaceText (This is where we are actually building the mappings into json key-value pairs, is that correct?) -> UpdateAttribute (Here, I am not sure what it does and why I really need it in my case -> EvaluateJsonPath and ultimately PutFile (to check the result of the transformations)

Do you think this is a sensible flow for what I am trying to achieve? Am I missing any step or using an incorrect approach?

I have then moved into testing the template with a very simple dsv file and this is where I have stumbled across issues, which I think lead to the fact that I am not still sure about the flow itself and how to implement extract and replace operations.

So, let's say my dsv looks something like this:

"HEADING1"¦ "HEADING2" ¦ "HEADING3"
 abc           ¦    dfs           ¦    2000-06-11 5.56.43

My ReplaceText has the replacement value : ${time:format("yyyy-MM-dd'")} from https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#format , then I was not sure about the actual search value format for which I made various attempts. This is one of them: (0{0,1}[1-9])|(1/d)|(2/d)|(3[0-1])/(0{0,1}[1-9])|(1[0-2])/([1-9]/d):(0{0,1}/d)|(1/d)|(2[0-4]):(0{0,1}/d)|([1-5]/d) . What should I include in the search value? I have come across a stackoverflow question and followed a similar approach that would fit my case, but no success whatsover: http://stackoverflow.com/questions/36825860/replacetext-processor-configuration

Can you provide a very basic example of how the csv to json template works or point me to an existing one?

Thank you for your help, much appreciated.
Reply | Threaded
Open this post in threaded view
|

Re: Using the template cvs to json to convert dsv files into json format: a few questions

idioma
This post was updated on .
I am commenting on my own question, just to clarify one aspect of the existing template. I am not sure why we need ExtractText between the first ReplaceText and the one that in the flow comes before UpdateAttribute? I am not sure I have fully understood the flow, what is the actual value for that step?

Thank you so much for your help.
Reply | Threaded
Open this post in threaded view
|

Re: Using the template cvs to json to convert dsv files into json format: a few questions

Bryan Bende
Hello,

Regarding the CsvToJson template, here is what is happening...

1) GenerateFlowFile produces a FlowFile to simulate some data

2) ReplaceText sets the content of that flow file to "a,b,c,d" to represent
and example CSV line

Steps 1 and 2 are just simulating data that would really come from
somewhere else.

3) ExtractText applies a regular expression to the FlowFile to extract the
columns of the CSV...

In this case there is an attribute added to the processor called "csv" with
a value of (.+),(.+),(.+),(.+)

After the FlowFile passes through this processor it will have added
attributes:

csv.1 = a
csv.2 = b
csv.3 = c
csv.4 = d

4) ReplaceText uses those new attributes to build a new JSON document by
setting the replacement value to:

{
 "field1" : "${csv.1}",
 "field2" : "${csv.2}",
 "field3" : "${csv.3}",
 "field4" : "${csv.4}"
}

5) UpdateAttribute is not doing anything in this example, it is just a
place holder to send the results of step 4 somewhere and end the flow, you
would replace this with whatever you want to do with your JSON.

-Bryan


On Tue, May 3, 2016 at 4:26 AM, idioma <[hidden email]> wrote:

> I am comment on my own question, just to clarify one aspect of the existing
> template. I am not sure why we need ExtractText between the first
> ReplaceText and the one that in the flow comes before UpdateAttribute? I am
> not sure I have fully understood the flow, what is the actual value for
> that
> step?
>
> Thank you so much for your help.
>
>
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Using-the-template-cvs-to-json-to-convert-dsv-files-into-json-format-a-few-questions-tp9738p9782.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>
Reply | Threaded
Open this post in threaded view
|

Re: Using the template cvs to json to convert dsv files into json format: a few questions

idioma
Hi Bryan, as always you are a star!
Thank you so much for your help. It does start making much more sense now, I cannot thank you enough, this flow is really something challenging considering my current knowledge of NiFi. So, in my case:

1 ) I have replaced GenerateFlowFile with GetFile (which looks into a subset of the content from the .dsv files).

2) I believe I do not need to set any content (correct me if I am wrong) This step is, therefore, skipped.

3) There might be some need for removing special characters, \r \n and \t with " ", so I am thinking to use ReplaceText and then double check with ConvertCharacterSet so that ASCII has been converted into UTF-8. Can you have more than one replacement value? So, in the first case you would need to have a search value \r then your replacement value would be something like replaceAll("\r", " "), then I would need to do a similar thing with \n and \t.

4) Here my ExtractText would be slightly different from csv. Something like: (.+)¦(.+)¦(.+)¦(.+) with attribute dsv. Does it make sense?

5) ReplaceText, then, is a working in progress as I have a number of dsv files, which will then constitute the final json, so I will need to clarify the process in depth. I have a question, though. Are field1 as much as field2 and a new attribute matching to the desider Json field?

Thank you again, I hope very soon to give back to the community! :)
Reply | Threaded
Open this post in threaded view
|

Re: Using the template cvs to json to convert dsv files into json format: a few questions

Bryan Bende
1) Replacing with GetFile sounds good

2) You are correct, there is nothing to set since you are getting it from
GetFile

3) You could probably replace all those characters with one ReplaceText
processor where the Search Value is something like (\n|\r|\t) and the
Replacement Value is empty.
I haven't tried this, but I think it should work.

4) That modified expression looks correct assuming ¦ is your delimiter.

5) Not totally sure I understand the question... In the example template,
field1 and field2 are just example field names to show how you could take
column 1 and column 2 from the CSV and put them into a JSON document.

Are you saying that you need to produce a single JSON document from
multiple dsv files? can you give a more concrete example of what the final
JSON needs to look like?

-Bryan


On Tue, May 3, 2016 at 11:23 AM, idioma <[hidden email]> wrote:

> Hi Bryan, as always you are a star!
> Thank you so much for your help. It does start making much more sense now,
> I
> cannot thank you enough, this flow is really something challenging
> considering my current knowledge of NiFi. So, in my case:
>
> 1 ) I have replaced GenerateFlowFile with GetFile (which looks into a
> subset
> of the content from the .dsv files).
>
> 2) I believe I do not need to set any content (correct me if I am wrong)
> This step is, therefore, skipped.
>
> 3) There might be some need for removing special characters, \r \n and \t
> with " ", so I am thinking to use ReplaceText and then double check with
> ConvertCharacterSet so that ASCII has been converted into UTF-8. Can you
> have more than one replacement value? So, in the first case you would need
> to have a search value \r then your replacement value would be something
> like replaceAll("\r", " "), then I would need to do a similar thing with \n
> and \t.
>
> 4) Here my ExtractText would be slightly different from csv. Something
> like:
> (.+)¦(.+)¦(.+)¦(.+) with attribute dsv. Does it make sense?
>
> 5) ReplaceText, then, is a working in progress as I have a number of dsv
> files, which will then constitute the final json, so I will need to clarify
> the process in depth. I have a question, though. Are field1 as much as
> field2 and a new attribute matching to the desider Json field?
>
> Thank you again, I hope very soon to give back to the community! :)
>
>
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Using-the-template-cvs-to-json-to-convert-dsv-files-into-json-format-a-few-questions-tp9738p9815.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>
Reply | Threaded
Open this post in threaded view
|

Re: Using the template cvs to json to convert dsv files into json format: a few questions

idioma
Hi Bryan,
thank you so much, I am glad I am on the right track. Regarding how the resulting json file is created, I should clarify that there are 5 different .dsv files from which the json will be created. Therefore, there are a number of denormalisation rules employed to produce what will be the output. For example, the following json is an example of a subset of the mappings:

{
a = "valueA",
b = "valueB

}

From .dsv(s): DSV1, DSV2, DSV3, DSV4 and DSV5.

The key/value "a" is from DSV1 where there is a column called "HeadingA" with value "a", however the value "valueA" is the actual description of "a" as per DSV4 (here is its structure):

HeadingCode ¦ HeadingText
a                   "valueA"

There are other cases, like b where the value is taken straight from one .dsv file/column

Does it clarify a bit? What is your recommendation in my scenario?

Thank you so much.



Reply | Threaded
Open this post in threaded view
|

Re: Using the template cvs to json to convert dsv files into json format: a few questions

Bryan Bende
Ok I think I understand now...

Generally NiFi is geared more towards operating on a single piece of data
at a time (one FlowFile), so taking one DSV and converting it to JSON is
straight forward, but trying to use several DSVs as input to produce one
JSON is more tricky.

What you might be able to do is first convert each DSV to JSON so in your
example you would first produce:

{ a = "valueA" }

{ b = "valueB" }

Then use MergeContent to merge all these JSON documents together, and then
possibly do more extracting/replacing after that turn it into your desired
output.

Maybe someone else has another idea of how to handle this.

-Bryan




On Wed, May 4, 2016 at 7:25 AM, idioma <[hidden email]> wrote:

> Hi Bryan,
> thank you so much, I am glad I am on the right track. Regarding how the
> resulting json file is created, I should clarify that there are 5 different
> .dsv files from which the json will be created. Therefore, there are a
> number of denormalisation rules employed to produce what will be the
> output.
> For example, the following json is an example of a subset of the mappings:
>
> {
> a = "valueA",
> b = "valueB
>
> }
>
> From .dsv(s): DSV1, DSV2, DSV3, DSV4 and DSV5.
>
> The key/value "a" is from DSV1 where there is a column called "HeadingA"
> with value "a", however the value "valueA" is the actual description of "a"
> as per DSV4 (here is its structure):
>
> HeadingCode ¦ HeadingText
> a                   "valueA"
>
> There are other cases, like b where the value is taken straight from one
> .dsv file/column
>
> Does it clarify a bit? What is your recommendation in my scenario?
>
> Thank you so much.
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-nifi-developer-list.39713.n7.nabble.com/Using-the-template-cvs-to-json-to-convert-dsv-files-into-json-format-a-few-questions-tp9738p9899.html
> Sent from the Apache NiFi Developer List mailing list archive at
> Nabble.com.
>
Reply | Threaded
Open this post in threaded view
|

Re: Using the template cvs to json to convert dsv files into json format: a few questions

idioma
Bryan, thank you so much. I will then try a combination of ReplaceText and MergeContent and work out whether I need any additional manipulation after that. I am kinda struggling with ReplaceText, once again, I am using in the search value (\r\n) and the replacement value empty. Furthermore, my replacement strategy is Literal Replace with Evaluation Mode Line by line (I have modified these values and tried different combinations, but no joy)

In case of the following value:

"This is adescription",

I would like to be returned with:

"This is a description"

However, the regex seems to replace everything before and after ¦ with ??, without touching the values/sentences that needs the space in. Is that because my file is probably still in ASCII? Should I convert it into UTF-8 before using ReplaceText? Can you help?

Thank you,
Reply | Threaded
Open this post in threaded view
|

Re: Using the template cvs to json to convert dsv files into json format: a few questions

idioma
Just to add on my previous comment. I have just used ConvertCharacterSet from ASCII to UTF-8 and I am getting the same result. The issue seems to be how ¦ are interpreted, I am not sure why.
Reply | Threaded
Open this post in threaded view
|

Re: Using the template cvs to json to convert dsv files into json format: a few questions

idioma
In reply to this post by Bryan Bende
With the view of ultimately sharing a template for converting dsv into json (this is the least I can do with all the help received from this amazing community), I have been working my way trough the steps for the dataflow and still running into some troubles and not sure why. Here is my ExtractText -> ReplaceText strategy:

I have replaced my initial delimiter symbol eventually with a semi-column, so my ExtractText looks like this (I have added the attribute dsv:

dsv = (.+);(.+);(.+);(.+)  

This should parse the content and pipe it in the dataflow. Then, the ReplaceText will do the following:

Search Value = (?s:^.*$)
ReplacementValue =

{
 "testField" : "${dsv.1}"
}


My sample data looks like something like this (full version is around 30 delimited values):

"HeadingA";"HeadingB";"HeadingC";"HeadingD";"HeadingE"
4787; "788769"; "67869"; "SomeText"; 12-07-200 13.45.22

The initial value is not enclosed within "" and some values such as date, datetime and some numerical ones too are randomly not wrapped up with the speech marks.  

My final Json output looks rather weird and I am not sure why this is happening (is this because my search value is the default one? Can I search for ${dsv} ?

{
 "testField" : "4787; "788769"; "67869"; "SomeText"; 12-07-200 13.45.22"
}

I am expecting to have the testField matched to 4787, is that correct. I though the issue was the lack of speech marks and I have tweaked the data sample to look like the other fields, but no joy.

Can you help?
Reply | Threaded
Open this post in threaded view
|

Re: Using the template cvs to json to convert dsv files into json format: a few questions

idioma
Hi, I have made some progress and specifically modified the regex in ExtractText, as follows:

(.*?);(.*?);(\d+.*)

The same value is used as SearchValue in my ReplaceText and the Replacement Value is as before =

{
 "testField" : "${dsv.1}"
}

This time, I am outputting the right value, but I am capturing the headers too, as follows:

"HeadingA";"HeadingB";"HeadingC";"HeadingD";"HeadingE"
 {
 "testField" : "4787"
}

I have tried to use UpdateAttribute just before the ReplaceText, but the headers are still in the output format.

What is the best approach for removing the first line from my flowfile so that only the Json format will be generated? Alternative, how can I modify the SearchValue to skip the first line?

Thanks and looking forward to sharing this with the community!