Adding new data anonymization processor bundle


Sivaprasanna Sethuraman
With data becoming more critical and substantial to business development,
new stringent regulations and laws are being introduced (GDPR being a recent
example). I've been spending some time lately researching data
anonymization, and after some hefty thinking I decided to go ahead and
create a new processor bundle with processors like 'AnonymizeRecord' and
'DeanonymizeRecord' (not quite sure about the names, though). My questions
are:

   - What do you guys think about these proposed processors?
   - If the processors are okay to introduce, are they "standard" enough
   to be added to our 'nifi-standard-bundles' module, or is it better to
   keep them in a separate bundle, as with the AWS and Azure bundles?

Having said this, I'm very much in the beginning phase of my research and
development efforts, so all your input and feedback on this are greatly
appreciated.

Thanks.

-
Sivaprasanna

Re: Adding new data anonymization processor bundle

Mike Thomsen
There's a framework called ARX that could be very useful for this. The only
question is how compliant it would be with the different legal requirements
for privacy handling. In the absence of strong legal guidance, I'd say err
on the side of complying with health care regulations, because that's where
you're likely to find the clearest guidance and established tools.

Ping me on any PR you send.

(In reply to Sivaprasanna's message of Wed, Jun 20, 2018 at 12:49 PM.)

Re: Adding new data anonymization processor bundle

Matt Burgess-2
I think this is a great idea. I filed a Jira [1] a while ago in case
someone wanted to start working on it (or in case I got a chance). It
mentions ARX, but any Apache-friendly implementation is of course
welcome. I think it should be in its own bundle, as it is functionality
separate from all our other bundles (and not ubiquitous enough to put
in the standard NAR).

Glad to hear you're interested in this. Please feel free to reach out
with any questions; I too would be happy to review any contributions.

Thanks,
Matt

[1] https://issues.apache.org/jira/browse/NIFI-4492

(In reply to Mike Thomsen's message of Wed, Jun 20, 2018 at 12:57 PM.)

Re: Adding new data anonymization processor bundle

Sivaprasanna Sethuraman
Wow, I didn't realize there was a Jira already. I'm interested and would be
happy to contribute my time and effort to this.

(In reply to Matt Burgess's message of Wed, Jun 20, 2018 at 10:34 PM.)

Re: Adding new data anonymization processor bundle

Andy LoPresto
Sivaprasanna,

Thanks for joining this effort. I don’t recall what’s on the existing Jira, but please be very aware of the challenges in data anonymization and the various threat models: de-anonymizing data can lead to the leak of PII, ePHI, PCI data, etc. In some cases, it can even lead to physical danger to persons.

There are a number of high-impact examples of avoidable scenarios like this.

https://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/

https://arstechnica.com/tech-policy/2014/06/poorly-anonymized-logs-reveal-nyc-cab-drivers-detailed-whereabouts/

We should use publicly reviewed algorithms, document the risks and known challenges well, take into consideration provenance and other NiFi-specific features, and write a good summary of these features if/when they are introduced.
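
To illustrate the point about publicly reviewed algorithms, here is a minimal Python sketch (not NiFi or ARX code; every function name and the 4-digit ID space are hypothetical). It shows why an unkeyed hash over a small identifier space can be reversed by simple enumeration, which is the weakness reported in the NYC cab logs article linked above, and how a keyed HMAC-SHA256 pseudonym resists that particular attack as long as the key stays secret.

```python
import hashlib
import hmac
import secrets


def naive_hash(value: str) -> str:
    """Unkeyed MD5 digest: anyone can recompute it for any candidate value."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def keyed_pseudonym(value: str, key: bytes) -> str:
    """HMAC-SHA256 digest: recomputing it requires knowledge of the secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def brute_force(target_digest: str, candidates: list) -> object:
    """Recover the input of naive_hash by trying every candidate value."""
    for candidate in candidates:
        if naive_hash(candidate) == target_digest:
            return candidate
    return None


# Hypothetical identifier space: every 4-digit ID (10,000 values).
candidates = [f"{n:04d}" for n in range(10000)]

# An attacker who sees the unkeyed digest recovers the original ID.
leaked_digest = naive_hash("4271")
recovered = brute_force(leaked_digest, candidates)

# With a random secret key, the same enumeration is not possible
# without the key; the mapping is still stable for a given key.
key = secrets.token_bytes(32)
token = keyed_pseudonym("4271", key)
```

Keyed pseudonyms remain linkable across records that share an identifier, so this is pseudonymization rather than full anonymization, and the key itself becomes sensitive material that has to be managed.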

Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

(In reply to Sivaprasanna's message of Jun 20, 2018, at 10:06.)

Re: Adding new data anonymization processor bundle

Mike Thomsen
Andy,

You raise a great point about considering provenance. Unless there's a
way to exclude attributes from provenance tracking, I think we'd need to
force the issue by not allowing attributes to be an input source for
expression language. That's the only way to kinda force people to think
"hey, I shouldn't put this here." In my opinion, attribute input is not
really something we should allow, given the ramifications of people using
the feature without reading up on the relevant documentation.

(In reply to Andy LoPresto's message of Wed, Jun 20, 2018 at 1:35 PM.)