FuzzyHashContent/CompareFuzzyHash processor


FuzzyHashContent/CompareFuzzyHash processor

shankhamajumdar
Hi,

I want to implement fuzzy logic on some fields in a data file using NiFi. I
am trying to use the FuzzyHashContent/CompareFuzzyHash processors but am not
sure how to implement the flow. Can you please provide an example?

Regards,
Shankha



--
Sent from: http://apache-nifi-developer-list.39713.n7.nabble.com/

Re: FuzzyHashContent/CompareFuzzyHash processor

Andy LoPresto-2
Hi Shankha,

The fuzzy hash processors operate on the content of the flowfile. You would first use a processor to ingest the “data file” content. This could be something like GetFile, GetHDFS, GetSFTP, InvokeHTTP, etc. depending on the source of the file. Once that step is done, the flowfile content will contain the data file bytes. If you want to perform the fuzzy hash calculation on the entire data file content, you can connect the success relationship from the ingest processor directly to FuzzyHashContent, and the resulting flowfile will contain an attribute with the calculated hash value. If you want to perform the calculation over only specific parts of the flowfile, you can use a processor to manipulate the content, for example EvaluateJsonPath, EvaluateXPath, ReplaceText, etc. 
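To make the "content in, hash attribute out" idea concrete, here is a rough Python sketch. It is not NiFi code: the `FlowFile` class, the `example.fuzzyhash` attribute name, and `toy_piecewise_hash` are all hypothetical names made up for illustration, and the signature is a simplified fixed-block toy rather than the ssdeep or TLSH algorithms the processors actually use.

```python
import hashlib
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Dict

# 64 printable symbols; each block of content maps to one of them.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


@dataclass
class FlowFile:
    """Hypothetical stand-in for a flowfile: raw content plus string attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


def toy_piecewise_hash(data: bytes, block_size: int = 16) -> str:
    """Emit one symbol per fixed-size block (a toy, NOT the ssdeep algorithm)."""
    symbols = []
    for start in range(0, len(data), block_size):
        digest = hashlib.sha256(data[start:start + block_size]).digest()
        symbols.append(ALPHABET[digest[0] % len(ALPHABET)])
    return "".join(symbols)


def hash_content(flowfile: FlowFile, attribute: str = "example.fuzzyhash") -> FlowFile:
    """Store the signature in an attribute; the content itself is not modified."""
    flowfile.attributes[attribute] = toy_piecewise_hash(flowfile.content)
    return flowfile


def similarity(sig_a: str, sig_b: str) -> float:
    """Score two signatures between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, sig_a, sig_b).ratio()


if __name__ == "__main__":
    first = hash_content(FlowFile(b"A" * 160))
    second = hash_content(FlowFile(b"A" * 150 + b"B" * 10))
    print(first.attributes, second.attributes)
    print(similarity(first.attributes["example.fuzzyhash"],
                     second.attributes["example.fuzzyhash"]))
```

The point of the sketch is only that a small change in the content changes a small part of the signature, so two signatures can be scored for similarity instead of compared for equality. A fixed-block toy like this falls apart as soon as bytes are inserted or removed, which is one of the problems ssdeep's context-triggered piecewise hashing is designed to handle.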

You can see an example flow which uses these processors in slide 21 of a presentation [1] André Fucs de Miranda and I gave recently, and André has published the flow XML here [2]. 


Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

(In reply to shankhamajumdar's message of Oct 6, 2017, 4:27 AM.)

Re: FuzzyHashContent/CompareFuzzyHash processor

shankhamajumdar
Hi Andy,

Thanks for the reply, but I am still not able to solve my use case. For
example, I have a data file with the structure below.

Col1      Col2      Col3      Col4      Col5

Test1     Test2     Test3     Test4     Test5

I want to do fuzzy matching on Col2 and Col3 and generate an output file.

I am using the GetFile and FuzzyHashContent processors but am not able to
design the flow. I need your help with this.

Regards,
Shankha







Re: FuzzyHashContent/CompareFuzzyHash processor

Andy LoPresto-2
You need to extract the relevant fields and either modify the flowfile content inline (losing the other data) or create a new flowfile (you can still retain the complete content in the “original” flowfile) and pass the flowfile with only the content you want to perform the hash on to the FuzzyHashContent processor. 

For the data you have provided (I’m assuming it is a single line of values, rather than a sample of the structure with many lines behind it), you could use a ReplaceText processor to drop the unrelated columns. If you have multiple rows in the flowfile content, you can use a CSVReader/ScriptedReader and CSVRecordSetWriter/ScriptedRecordSetWriter in conjunction with an UpdateRecord processor to reduce the content to just the relevant fields, then use a SplitRecord processor to generate an individual flowfile from each line, and pass all of them to FuzzyHashContent.
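As a rough illustration of the "reduce to the relevant fields, then split per line" idea, here is a plain Python sketch. It is not NiFi configuration, and it rests on assumptions: the data is delimited by runs of whitespace with no spaces inside a field, and `parse_rows` and `reduce_and_split` are hypothetical helper names. Each original row is kept next to its reduced payload, which mirrors retaining the "original" flowfile while hashing only the extracted fields.

```python
from typing import Dict, List, Tuple

SAMPLE = (
    "Col1      Col2      Col3      Col4      Col5\n"
    "\n"
    "Test1    Test2     Test3     Test4     Test5\n"
    "Data1   Data2    Data3    Data4    Data5\n"
)


def parse_rows(text: str) -> List[Dict[str, str]]:
    """Read a header line and data lines separated by runs of whitespace."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def reduce_and_split(text: str, columns: List[str]) -> List[Tuple[Dict[str, str], bytes]]:
    """Pair each original row with a payload holding only the chosen columns."""
    results = []
    for row in parse_rows(text):
        payload = "\t".join(row[name] for name in columns).encode("utf-8")
        results.append((row, payload))
    return results


if __name__ == "__main__":
    for original, payload in reduce_and_split(SAMPLE, ["Col2", "Col3"]):
        print(original["Col1"], payload)
```

Each payload plays the role of one small flowfile that would be passed to FuzzyHashContent, while the paired dictionary still carries every column of the row for writing the output file later.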


Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

(In reply to shankhamajumdar's message of Oct 9, 2017, 4:19 AM.)

Re: FuzzyHashContent/CompareFuzzyHash processor

shankhamajumdar
Hi Andy,

I have multiple lines in the file. For example:

Col1      Col2      Col3      Col4      Col5

Test1     Test2     Test3     Test4     Test5
Data1     Data2     Data3     Data4     Data5
......................................................
......................................................
......................................................

In the output file, I want to write all the fields of each row where Col2
and Col3 are a fuzzy match.

Regards,
Shankha






Re: FuzzyHashContent/CompareFuzzyHash processor

shankhamajumdar
Hi Andy,

I am looking at the flow in

https://github.com/fluenda/dataworks_summit_iot_botnet/blob/master/dataworks_summit_iot_botnet_full_flow.xml

I am fine up to GenerateSSDEEPHash (a FuzzyHashContent processor). But in the
next step, Compare_SSDEEP_Hashes (a CompareFuzzyHash processor), the Hash List
source file is /opt/nifi/nifi-1.3.0/conf/permanent/ssdeep_hashes.list. Can you
please let me know how this file was generated?

I am using the below flow in my use case.

1. GetFile processor to get the file content.
2. FuzzyHashContent processor to create the hash value of the file content.
3. In this step I want to use the CompareFuzzyHash processor, but I am not sure
what value I should put in the Hash List source file property.

Regards,
Shankha

 


