I want to implement fuzzy logic on some fields in a data file using NiFi. I
am trying to use FuzzyHashContent/CompareFuzzyHash processor but not sure
how to implement the flow. Can you please provide me an example?
The fuzzy hash processors operate on the content of the flowfile. You would first use a processor to ingest the “data file” content. This could be something like GetFile, GetHDFS, GetSFTP, InvokeHTTP, etc., depending on the source of the file. Once that step is done, the flowfile content will contain the data file bytes. If you want to perform the fuzzy hash calculation on the entire data file content, you can connect the success relationship from the ingest processor directly to FuzzyHashContent, and the resulting flowfile will contain an attribute with the calculated hash value. If you want to perform the calculation over only specific parts of the flowfile, you can first use a processor to reduce the content to those parts, for example EvaluateJsonPath, EvaluateXPath, ReplaceText, etc., and then route the result to FuzzyHashContent.
You can see an example flow that uses these processors on slide 21 of a presentation André Fucs de Miranda and I gave recently, and André has published the flow XML here.
You need to extract the relevant fields first. You can either modify the flowfile content inline (losing the other data) or create a new flowfile (the complete content is still retained in the “original” flowfile). Then pass the flowfile that contains only the content you want to hash to the FuzzyHashContent processor.
For the data you have provided (I’m assuming this is a single line of values, rather than a sample of the structure with many such lines), you could use a ReplaceText processor to drop the unrelated columns. If you have multiple rows in the flowfile content, you can use a CSVReader/ScriptedReader and CSVRecordSetWriter/ScriptedRecordSetWriter in conjunction with an UpdateRecord processor to reduce the content down to just the relevant fields, then use a SplitRecord processor to generate an individual flowfile from each line, and pass all of them to FuzzyHashContent.
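To make the multi-row case concrete, here is a rough sketch in plain Python (not NiFi code) of what the reader/UpdateRecord/SplitRecord combination is meant to produce: one small piece of content per row, holding only the fields to be hashed. The column names and sample rows are hypothetical, and the function name is made up for illustration.

```python
import csv
import io


def split_relevant_fields(csv_text, keep):
    """Return one single-line CSV string per input row, holding only the `keep` columns.

    This mimics, outside NiFi, reducing each record to the relevant fields
    and then splitting the content into one piece per record.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    pieces = []
    for row in reader:
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow([row[name] for name in keep])
        pieces.append(out.getvalue())
    return pieces


# Hypothetical data file content; the header and values are invented.
sample = (
    "id,name,address,phone\n"
    "1,Ann Lee,12 Oak St,555-0100\n"
    "2,Bob Ray,9 Elm Rd,555-0199\n"
)

pieces = split_relevant_fields(sample, ["name", "address"])
for piece in pieces:
    print(repr(piece))
```

Each string in `pieces` corresponds to the content of one flowfile that would be routed to FuzzyHashContent.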
I am fine up to GenerateSSDEEPHash, the FuzzyHashContent processor. But in the
next step, Compare_SSDEEP_Hashes (the CompareFuzzyHash processor), the Hash
List source file is /opt/nifi/nifi-1.3.0/conf/permanent/ssdeep_hashes.list.
Can you please let me know how this file was generated?
I am using the below flow in my use case.
1. GetFile processor to get the file content.
2. FuzzyHashContent processor to create the hash value of the file content.
3. In this step I want to use the CompareFuzzyHash processor, but I am not sure
what value I should put in the Hash List source file property.