Data anonymization in Nifi

Vyshali
Hi,

Please suggest possible ways to do data anonymization in Nifi such that PII
data is not exposed.
Suggest suitable processors for the same.
Thanks in advance.

Regards,
Vyshali



--
Sent from: http://apache-nifi-developer-list.39713.n7.nabble.com/

Re: Data anonymization in Nifi

Chris Herssens
You can use the ExecuteScript processor to hash some fields in, for
instance, CSV data.

Regards,

Chris

On Tue, Oct 17, 2017 at 8:41 AM, Vyshali <[hidden email]> wrote:

> Hi,
>
> Please suggest possible ways to do data anonymization in Nifi such that PII
> data is not exposed.
> Suggest suitable processors for the same.
> Thanks in advance.
>
> Regards,
> Vyshali

Re: Data anonymization in Nifi

Matt Burgess-2
Vyshali,

Building on Chris's suggestion of using ExecuteScript, you could also
include the ARX JAR(s) in your Module Directory property, and then
leverage all the ARX goodness [1]. In general this does seem like a
good idea for a processor, so I have written NIFI-4492 [2] to add an
AnonymizeRecord processor. It need not use ARX, but I did mention it in
the Jira case.

Regards,
Matt

[1] http://arx.deidentifier.org/api/
[2] https://issues.apache.org/jira/browse/NIFI-4492


On Tue, Oct 17, 2017 at 8:09 AM, Chris Herssens
<[hidden email]> wrote:

> You can use the ExecuteScript processor to hash some fields in, for
> instance, CSV data.
>
> Regards,
>
> Chris

Re: Data anonymization in Nifi

Vyshali
In reply to this post by Chris Herssens
Hi Chris,

Does hashing with the ExecuteScript processor mean that I have to write
some code to do it? If so, will the format of the field remain the same?

Please explain with examples.

Regards,
Vyshali




Re: Data anonymization in Nifi

Mike Thomsen
Not if you use hashing. You'll get a field value like this (SHA-1
algorithm): c3499c2729730a7f807efb8676a92dcb6f8a3f8f

To get output that stays closer to the original data in the kind of values
present, you'll need to try something like ARX.
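
To illustrate the point about format, here is a minimal plain-Python sketch (run outside NiFi; the sample values are made up for illustration). It shows that a SHA-1 hex digest is always 40 hexadecimal characters, whatever the length or shape of the input:

```python
import hashlib

def sha1_hex(value):
    """Return the SHA-1 digest of a string as lowercase hex."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()

# Hypothetical sample values, chosen only for illustration.
samples = ["555-0100", "jane.doe@example.com", "42"]
digests = [sha1_hex(s) for s in samples]

for s, d in zip(samples, digests):
    # Every digest has the same length, unrelated to the input format.
    print(s, "->", d, "(length %d)" % len(d))
```

So a phone number, an e-mail address, and a short integer all end up as same-sized hex strings, which is why downstream systems that validate field formats may reject hashed values.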

On Tue, Oct 17, 2017 at 11:53 AM, Vyshali <[hidden email]> wrote:

> Hi Chris,
>
> Does hashing with the ExecuteScript processor mean that I have to write
> some code to do it? If so, will the format of the field remain the same?
>
> Please explain with examples.
>
> Regards,
> Vyshali

Re: Data anonymization in Nifi

Andy LoPresto-2
Vyshali,

You may be interested in format preserving encryption (FPE) [1] if you need to maintain format while performing data masking. There are also methods to derive a cryptographically secure hash function from encryption [2] so that you can have “one way” data transformation and maintain a given format. 
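
As a rough sketch of the "one way, format maintained" idea, the plain-Python snippet below masks a value with a keyed HMAC while keeping each character's class (digit stays digit, letter stays letter, separators are untouched). This is an assumption-laden toy, not a standard FPE scheme such as FF1: it is not reversible, different inputs can collide, and the modulo step introduces a small bias. The function names and the key are hypothetical.

```python
import hashlib
import hmac
import string

def _keystream(key, message, length):
    """Derive `length` pseudo-random bytes from HMAC-SHA256(key, message || counter)."""
    stream = b""
    counter = 0
    while len(stream) < length:
        block = hmac.new(key, message + counter.to_bytes(4, "big"),
                         hashlib.sha256).digest()
        stream += block
        counter += 1
    return stream[:length]

def mask_preserving_format(value, key):
    """One-way, deterministic masking that keeps each character's class."""
    stream = _keystream(key, value.encode("utf-8"), len(value))
    out = []
    for ch, byte in zip(value, stream):
        if ch in string.digits:
            out.append(string.digits[byte % 10])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[byte % 26])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[byte % 26])
        else:
            out.append(ch)  # keep separators and punctuation as-is
    return "".join(out)

# Hypothetical key and sample value, for illustration only.
masked = mask_preserving_format("123-45-6789", b"hypothetical-secret-key")
print(masked)
```

For production use, a vetted FPE implementation would be the safer choice; the sketch only shows what "maintaining format" means in practice.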

I would encourage you to be aware of all attack surfaces here, though. First, there are many examples of anonymization being easily undone because it was not correctly implemented [3], used a weak process [4], or could be reconstructed through associated data [5].

Even with a strong anonymization approach, remember that NiFi tracks the data lineage throughout the process, so a user with sufficient permissions will be able to look at the provenance for a flowfile before/after it has undergone the anonymization operation and see the original data. This can be partially mitigated and restricted to a core group of privileged users via strict access control policies.

On top of that, the provenance repository does provide an encrypted implementation, but the content and flowfile repositories currently do not. A malicious user with OS-level access could examine the repository files on disk to extract the original content or flowfile attributes before they were anonymized. There are open Jiras [6][7] for those efforts.

There is also the issue of a user examining the flowfile via queue listing. Open Jiras for encrypting attributes [8] and hashing attributes [9], as well as “sensitive attributes” with per-key-permissions also exist [10].
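
One concrete instance of a "weak process" is hashing a low-entropy field without any secret. The plain-Python sketch below (hypothetical value and key, assumed only for illustration) recovers a 4-digit code from its unsalted SHA-256 digest by simple enumeration, and shows that the same enumeration fails against an HMAC computed with a key the attacker does not know:

```python
import hashlib
import hmac

def plain_hash(value):
    """Unsalted SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def keyed_hash(value, key):
    """HMAC-SHA256 hex digest using a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def brute_force_4_digits(target_digest):
    """Try every 4-digit string against an unsalted SHA-256 digest."""
    for n in range(10000):
        candidate = "%04d" % n
        if plain_hash(candidate) == target_digest:
            return candidate
    return None

secret_value = "4821"                 # hypothetical low-entropy field
key = b"hypothetical-secret-key"      # must be kept out of the attacker's reach

recovered_plain = brute_force_4_digits(plain_hash(secret_value))
recovered_keyed = brute_force_4_digits(keyed_hash(secret_value, key))

print("from plain hash:", recovered_plain)
print("from keyed hash:", recovered_keyed)
```

A keyed hash only helps as long as the key itself is protected, so the repository and provenance concerns above still apply.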

I hope this helps to illustrate the complexities of anonymization and leads you to a successful solution.  


[2] https://crypto.stackexchange.com/questions/24284/is-there-a-format-preserving-cryptographically-secure-hash


Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

On Oct 17, 2017, at 10:36 AM, Mike Thomsen <[hidden email]> wrote:

> Not if you use hashing. You'll get a field value like this (SHA-1
> algorithm): c3499c2729730a7f807efb8676a92dcb6f8a3f8f
>
> To get output that stays closer to the original data in the kind of values
> present, you'll need to try something like ARX.


Re: Data anonymization in Nifi

Vyshali
In reply to this post by Chris Herssens
Hi Chris,

Thanks for the suggestion. Should I write code in Python or some other
language to hash the data using the ExecuteScript processor? If so, will
the format of the data be retained after hashing?
Please provide some clarity on that.

Thanks,
Vyshali




Re: Data anonymization in Nifi

Chris Herssens
Hello Vyshali,

Below is a Python (Jython) code example that hashes the fourth column of a
semicolon-delimited CSV file using the ExecuteScript processor.
If you hash a field using SHA-256, the length of the field changes:
a SHA-256 digest is 256 bits long (64 hexadecimal characters).

import hashlib
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

def hashField(text):
    # Return the SHA-256 hex digest (64 characters) of the field.
    # UTF-8 is used so that non-ASCII characters do not raise an error.
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

class convertStream(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        # Read the whole flow file content as ISO-8859-1 text.
        text = IOUtils.toString(inputStream, StandardCharsets.ISO_8859_1)
        output = []
        for line in text.splitlines():
            l = line.split(';')
            # Hash the fourth column (index 3), lower-cased first.
            l[3] = hashField(l[3].lower())
            # Append an extra column: <hash>_<first column>_<second column>.
            l.append(l[3] + "_" + l[0] + "_" + l[1])
            output.append(';'.join(l))
        out = '\n'.join(output)
        outputStream.write(out.encode('latin-1'))

flowfile = session.get()
if flowfile is not None:
    flowfile = session.write(flowfile, convertStream())
    # Rename the flow file: <name before the first dot>_hashed
    flowfile = session.putAttribute(flowfile, "filename",
        flowfile.getAttribute('filename').split('.')[0] + '_hashed')
    session.transfer(flowfile, REL_SUCCESS)
    session.commit()
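
If it helps to check the per-line logic before wiring it into NiFi, the transformation can be tried in a plain Python interpreter. The sketch below is a standalone assumption of mine (the function name and the sample row are hypothetical) that mirrors what the script above does to each line; it encodes the field as UTF-8, which gives the same digest as ASCII for ASCII-only input:

```python
import hashlib

def hash_fourth_column(line, delimiter=";"):
    """Hash the fourth field of a delimited line and append a composite column."""
    fields = line.split(delimiter)
    fields[3] = hashlib.sha256(fields[3].lower().encode("utf-8")).hexdigest()
    # Extra column: <hash>_<first column>_<second column>
    fields.append(fields[3] + "_" + fields[0] + "_" + fields[1])
    return delimiter.join(fields)

# Hypothetical sample row, for illustration only.
sample = "1;Doe;Jane;Jane.Doe@example.com;Berlin"
result = hash_fourth_column(sample)
print(result)
```

The fourth field comes back as a 64-character hex string, and one extra column is appended at the end of the row.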



Regards,

Chris

On Fri, Oct 20, 2017 at 7:19 PM, Vyshali <[hidden email]> wrote:

> Hi Chris,
>
> Thanks for the suggestion. Should I write code in Python or some other
> language to hash the data using the ExecuteScript processor? If so, will
> the format of the data be retained after hashing?
> Please provide some clarity on that.
>
> Thanks,
> Vyshali

Re: Data anonymization in Nifi

Vyshali
In reply to this post by Matt Burgess-2
Hi Matt,

Thanks for the suggestion.
It would be very helpful if you could give instructions on how to use
the AnonymizeRecord processor.
Please also clarify how to set up the processor after downloading the ARX
JARs.
I downloaded the JAR from http://arx.deidentifier.org/downloads/

Regards,
Vyshali





Re: Data anonymization in Nifi

Matt Burgess-2
Vyshali,

The AnonymizeRecord processor does not yet exist; I just wrote up a
Jira to track adding it, possibly sometime in the future.

For the scripted solution, you can add the location of the ARX JARs to
the Module Directory property of ExecuteScript. If it is a flat
directory of JARs and you are using Groovy, Clojure, or JavaScript,
you can just set the Module Directory to the directory containing the
JARs. Otherwise you'd have to list the JARs separately (for languages
such as Jython). Once the Module Directory property is set, you can
import and use any of the ARX classes according to their
documentation.

For examples on using the NiFi API (to read/write flow files, etc.), I
have an ExecuteScript Cookbook blog series [1] and a few other
examples on my blog [2].

Regards,
Matt

[1] https://community.hortonworks.com/articles/75032/executescript-cookbook-part-1.html
[2] http://funnifi.blogspot.com


On Mon, Oct 23, 2017 at 12:41 PM, Vyshali <[hidden email]> wrote:

> Hi Matt,
>
> Thanks for the suggestion.
> It would be very helpful if you could give instructions on how to use
> the AnonymizeRecord processor.
> Please also clarify how to set up the processor after downloading the ARX
> JARs.
> I downloaded the JAR from http://arx.deidentifier.org/downloads/
>
> Regards,
> Vyshali

Re: Data anonymization in Nifi

Vyshali
Matt,

Thanks for your valuable suggestion.
ARX supports Java, but only languages such as Groovy, Python, and Jython
are available in the ExecuteScript processor. Have you tried using ARX
functionality from any of these languages?
If so, please send some references.

Thanks,
Vyshali




Re: Data anonymization in Nifi

Mike Thomsen
Groovy is very close to being a superset of Java 7 in terms of syntax, so
in most cases you can copy and paste Java code directly into a Groovy
script without modification.

On Tue, Oct 24, 2017 at 8:52 AM, Vyshali <[hidden email]> wrote:

> Matt,
>
> Thanks for your valuable suggestion.
> ARX supports Java, but only languages such as Groovy, Python, and Jython
> are available in the ExecuteScript processor. Have you tried using ARX
> functionality from any of these languages?
> If so, please send some references.
>
> Thanks,
> Vyshali

Re: Data anonymization in Nifi

Vyshali
In reply to this post by Matt Burgess-2
Hi Matt,

Thanks for your valuable comment.

Is it possible to anonymize data without specifying generalization
hierarchies in ARX?
Also, can you please help me with some basic examples using the ARX APIs?

Regards,
Vyshali





Re: Data anonymization in Nifi

Matt Burgess-2
Vyshali,

I would love to help, but I've never used ARX so I'm not at all
familiar with their APIs. They do have an examples page though [1].

Regards,
Matt

[1] http://arx.deidentifier.org/overview/#a3


On Tue, Oct 31, 2017 at 1:11 PM, Vyshali <[hidden email]> wrote:

> Hi Matt,
>
> Thanks for your valuable comment.
>
> Is it possible to anonymize data without specifying generalization
> hierarchies in ARX?
> Also, can you please help me with some basic examples using the ARX APIs?
>
> Regards,
> Vyshali