This is how you prevent the processing of personal data (part 1/2): tool-independent blacklists and whitelists

3 March 2020 | 8 minutes of reading time

Published by Marketingfacts

Read part 2 here!

Since 25 May 2018, the General Data Protection Regulation (GDPR, known in the Netherlands as the AVG) has applied. The GDPR uses a broad definition of ‘personal data’, namely ‘any information relating to an identified or identifiable natural person’. This refers to data that directly relate to a person, or data that, in combination with other data, can be traced back to this person.

Personal data may only be processed for the purpose for which it was obtained. This is called ‘purpose limitation’. In order to prevent organizations from collecting personal data without being able to justify this properly, they must implement the principles of privacy by design and privacy by default. Privacy by design means that an organization ensures that personal data is properly protected when designing products and services. Privacy by default means that an organization must take technical and organizational measures to ensure that, as a standard, it only processes personal data that are necessary for the specific purpose it wants to achieve. This means, for example, that it is not permitted to ask for more data than is necessary to subscribe to a newsletter.

There is a chance, however, that you may process personal data unlawfully, without you being aware of this. For example, because you use tools or scripts on your website that measure which URL a visitor is on. This URL can contain an e-mail address, for example, if the customer visits your website via a customer e-mail. 

In my two-part blog I would therefore like to show you what measures you can take to prevent the unintentional collection and processing of personal data on your website as much as possible. In doing so, personal data is overwritten, not deleted – so if certain data is collected unintentionally, you can easily recognize it and trace its source.  

In part 1, I introduce the ‘PII prevention matrix’ and show you how you can use Google Tag Manager to blacklist and whitelist personal data in a tool-independent way: not separately for every script or tool, but once for all scripts and tools. The measures are intended for technical web analysts who do not shy away from using Google Tag Manager and JavaScript. I also use regular expressions (regex) to describe patterns.

The PII prevention matrix

I classify the measures you can take to prevent the processing of personal data along two dimensions. Together, they form a matrix which I call the PII prevention matrix. Here the abbreviation PII stands for ‘Personally Identifiable Information’, which is more or less equivalent to the term ‘personal data’. The classifications are:

1) Tool-independent vs. tool-dependent.

In a tool-dependent solution, the personal data must be replaced again for each new tool. The order in which the different events take place is as follows:

  1. Data such as a URL becomes available
  2. Scripts and/or tools are loaded
  3. Each script and/or tool replaces the personal data
  4. Each script and/or tool processes the data that has been stripped of personal data

Because the tool-dependent solution requires step 3 to be performed once for every script or tool, it is rather inefficient. In addition, some scripts and tools do not offer any way to edit data such as URLs before they process it. A tool-independent solution solves both problems: it works across all tools and therefore only needs to be applied once. The order in which the different events take place is as follows:

  1. Data such as a URL becomes available
  2. Any personal data within the available data will be replaced
  3. Scripts and/or tools are loaded
  4. Each script and/or tool processes the data that has been stripped of personal data 
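
The tool-independent order can be sketched in a few lines of plain JavaScript. This is only an illustration under stated assumptions: `redact`, `toolA` and `toolB` are hypothetical stand-ins for a replacement function and two tracking scripts, not part of any real tool.

```javascript
// Hypothetical replacement function: masks anything that looks like an e-mail address.
function redact(data) {
  return data.replace(/[a-zA-Z0-9_.\-]+(?:@|%40)[a-zA-Z0-9_.\-]+\.[a-zA-Z]{2,5}/g, '[REDACTED_EMAIL]');
}

// Hypothetical stand-ins for two tracking scripts; each just records what it receives.
var received = [];
function toolA(url) { received.push('A:' + url); }
function toolB(url) { received.push('B:' + url); }

// Tool-independent order: redact once (step 2), then hand the clean data to every tool (steps 3 and 4).
function trackToolIndependent(url, tools) {
  var cleanUrl = redact(url);
  for (var i = 0; i < tools.length; i++) {
    tools[i](cleanUrl);
  }
  return cleanUrl;
}

trackToolIndependent('https://www.domein.nl?email=siemon@i-spark.nl', [toolA, toolB]);
console.log(received);
```

Neither tool ever sees the original URL, and adding a third tool does not require a third replacement step.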

2) Blacklisting vs. whitelisting

With blacklisting you define a list of data that may not be processed. Is a certain item of data not on the blacklist? Then it may be processed. Whitelisting works the other way around: you define a list of data that may be processed. A particular item of data may only be processed if it is on the list. This makes whitelists stricter than blacklists.
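
The difference is easiest to see with an item that appears on neither list. The sketch below uses hypothetical lists of URL parameter names; the names are assumptions chosen only for illustration.

```javascript
// Hypothetical lists of URL parameter names, for illustration only.
var blacklist = ['email', 'phone'];
var whitelist = ['foo', 'bar'];

// Blacklisting: everything may be processed unless it is on the list.
function allowedByBlacklist(name) {
  return blacklist.indexOf(name) === -1;
}

// Whitelisting: nothing may be processed unless it is on the list.
function allowedByWhitelist(name) {
  return whitelist.indexOf(name) !== -1;
}

// "postcode" is on neither list.
console.log(allowedByBlacklist('postcode')); // true: not forbidden, so it passes
console.log(allowedByWhitelist('postcode')); // false: not explicitly allowed, so it is blocked
```

A forgotten parameter slips through a blacklist but is stopped by a whitelist, which is why the whitelist is the stricter of the two.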

The classifications above result in a matrix of four solutions to prevent the processing of personal data:

                    Blacklisting    Whitelisting
Tool-independent    Current blog    Current blog
Tool-dependent      Part 2/2        Part 2/2


In the rest of this blog I will go deeper into tool-independent blacklisting and whitelisting of personal data using Google Tag Manager. In the next blog I will show you how to apply blacklisting and whitelisting in Google Analytics.

Replacing personal data tool-independently

Most scripts on a page have access to the metadata of a web page, such as the URL and the page title. These data are therefore particularly suitable for protecting personal data in a tool-independent way, that is, before the data is available to other scripts. It regularly happens that these metadata contain personal data, intentionally or unintentionally. Think, for example, of an e-mail address that is sent as a query parameter to the landing page of a customer e-mail, or a postal code that is sent as a query parameter from a comparison site.
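
To get a feeling for how such data ends up in a URL, the following sketch lists the query parameters whose value looks like an e-mail address. It is a rough check, not a complete detector: the pattern is deliberately loose, and the function name `findEmailParameters` and the example URL are assumptions made for this illustration.

```javascript
// Loose e-mail pattern, for illustration only; it accepts "@" and the encoded "%40".
var emailPattern = /[a-zA-Z0-9_.\-]+(@|%40)[a-zA-Z0-9_.\-]+\.[a-zA-Z]{2,5}/i;

// Returns the names of query parameters whose value looks like an e-mail address.
function findEmailParameters(pageUrl) {
  var suspicious = [];
  new URL(pageUrl).searchParams.forEach(function (value, name) {
    if (emailPattern.test(value)) {
      suspicious.push(name);
    }
  });
  return suspicious;
}

console.log(findEmailParameters('https://www.domein.nl/?utm_source=mail&email=siemon@i-spark.nl'));
```

Running a check like this against a sample of landing-page URLs is one way to find out which parameters belong on a blacklist.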

Tool-independent blacklisting – how do I do that?

With tool-independent blacklisting, you specify the regular expressions that match personal data (the blacklist). Next, you check whether these patterns occur in the data you want to process. If this is the case, you replace the substrings that match these patterns. You do this before the data is processed by other scripts (tool-independent).

As an illustration, I would like to show you how to replace the value of the URL parameters “foo” and “bar” with “[REDACTED]” and e-mail addresses with “[REDACTED_EMAIL]”. The URL “https://www.domein.nl?foo=waarde&bar=waarde&email=siemon@i-spark.nl&foobar=waarde” then becomes “https://www.domein.nl?foo=[REDACTED]&bar=[REDACTED]&email=[REDACTED_EMAIL]&foobar=waarde”. Below I explain step by step how you can achieve this using Google Tag Manager:

  1. Define your blacklist.
  • Create a new variable of the type “Custom JavaScript macro” and call it “PII”. See the example below.
  • Within the new variable, define an array containing an object for each type of personal data. In our example, two types of personal data are involved, namely blacklisted parameters and e-mail addresses.
  • Give the objects you defined for each type of personal data 3 keys: ‘name’, ‘regex’ and ‘replacement’. For the name key, enter a string describing the type of personal data. This is especially useful for yourself. For the regex key, enter the regular expression with which the type of personal data complies. For the replacement key, enter the string with which the personal data must be replaced. 
  • Return the defined array.
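function() {
  // Each entry describes one type of personal data to redact.
  var piiRegex = [{
    name: 'BLACKLISTED PARAMETER',
    // Group 1 keeps the separator and parameter name ("?foo=" or "&bar=");
    // only the value that follows is replaced.
    regex: /([?&](?:foo|bar)=)[^&#]+/gi,
    replacement: '$1[REDACTED]'
  }, {
    name: 'EMAIL',
    // Matches addresses written with "@" or its URL-encoded form "%40".
    regex: /(([a-zA-Z0-9_\-\.]+)(@|%40)([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5}))/gi,
    replacement: '[REDACTED_EMAIL]'
  }];

  return piiRegex;
}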

 

  2. Create a new function that replaces personal data.
  • Create a new variable of the type ‘Custom JavaScript macro’ and call it ‘return redactData function’. Use the JavaScript below.
  • The variable returns a function with two parameters: the first is a data string, the second is the previously defined blacklist we called “PII”. For each type of personal data in the blacklist, the function replaces every match of the regular expression in the data string with the corresponding value of the replacement key. The function then returns the data string with the personal data replaced.
  • By placing the function in a separate variable, you can call it from any tag. Suppose you read the values of form fields and want to send them to Google Analytics. To strip personal data from these values before sending them to Google Analytics, you can call the newly defined function from a tag. To do so, pass the data string and the defined blacklist “PII” as arguments. This will then look like this:

{{return redactData function}}(datastring, {{PII}})

 

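function() {
  // Returns a reusable function so that any tag can call it.
  return function(data, PII) {
    // Apply every pattern in turn; the 'g' flag replaces all matches.
    for (var i = 0; i < PII.length; i++) {
      data = data.replace(PII[i].regex, PII[i].replacement);
    }
    return data;
  };
}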
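
Outside Google Tag Manager, the blacklist and the replacement function can be tried out together in plain JavaScript. The sketch below is a stand-alone approximation: `PII` and `redactData` are ordinary variables standing in for the two Google Tag Manager variables, and the parameter pattern assumes that the separator and parameter name are kept in a capturing group and restored through “$1”.

```javascript
// Stand-in for the "PII" variable. The parameter pattern keeps "?foo=" or "&bar="
// in group 1 and restores it through "$1" in the replacement.
var PII = [{
  name: 'BLACKLISTED PARAMETER',
  regex: /([?&](?:foo|bar)=)[^&#]+/gi,
  replacement: '$1[REDACTED]'
}, {
  name: 'EMAIL',
  regex: /(([a-zA-Z0-9_\-\.]+)(@|%40)([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5}))/gi,
  replacement: '[REDACTED_EMAIL]'
}];

// Stand-in for the function returned by the "return redactData function" variable.
function redactData(data, PII) {
  for (var i = 0; i < PII.length; i++) {
    data = data.replace(PII[i].regex, PII[i].replacement);
  }
  return data;
}

var url = 'https://www.domein.nl?foo=waarde&bar=waarde&email=siemon@i-spark.nl&foobar=waarde';
console.log(redactData(url, PII));
// https://www.domein.nl?foo=[REDACTED]&bar=[REDACTED]&email=[REDACTED_EMAIL]&foobar=waarde

// The same function works on any string, such as a form field value.
console.log(redactData('Contact me at siemon%40i-spark.nl', PII));
// Contact me at [REDACTED_EMAIL]
```

Note that “foobar” is left alone: the pattern requires “=” directly after “foo” or “bar”, so a parameter that merely starts with a blacklisted name does not match.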

  3. Create a tag that replaces personal data.
  • Create a new tag of the type ‘Custom HTML’. Use the JavaScript below.
  • The tag (a) reads the title and URL of the webpage on which the tag is loaded, (b) replaces the previously defined personal data in the title and URL of the webpage, using the function mentioned above, (c) adjusts the URL in the browser and/or replaces the title of the webpage, if personal data are present, and (d) sends a ‘piiRedacted’ event to the dataLayer, together with the new URL.
<script>
(function() {
  var PII = {{PII}};

  // Redact the URL and rewrite it in the address bar if anything changed.
  var URL = {{Page URL}};
  var newURL = {{return redactData function}}(URL, PII);
  if (newURL !== URL) {
    window.history.replaceState({}, document.title, newURL);
  }

  // Redact the page title in the same way.
  var title = document.title;
  var newTitle = {{return redactData function}}(title, PII);
  if (newTitle !== title) {
    document.title = newTitle;
  }

  // Signal that the URL and title are now free of personal data.
  window.dataLayer = window.dataLayer || [];
  window.dataLayer.push({
    'event': 'piiRedacted',
    'Page URL': newURL
  });
})();
</script>

 

  4. Create a trigger based on the “piiRedacted” event.

Create a new trigger of type ‘Custom Event’ based on the ‘piiRedacted’ event. This event indicates the moment that all actions in the Custom HTML tag from above have been executed – at this moment the URL and title of the webpage are completely free of personal data.
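
What the Custom HTML tag decides before it pushes the event can be written down as a small pure function, which is easier to test than code that touches the browser. This is a sketch under assumptions: `planRedaction` and `maskEmails` are hypothetical helpers for this illustration and do not exist in Google Tag Manager.

```javascript
// Hypothetical helper: given the current URL and title plus a redaction function,
// work out what the tag should change and which event it should push.
function planRedaction(page, redact) {
  var newURL = redact(page.url);
  var newTitle = redact(page.title);
  return {
    replaceUrl: newURL !== page.url,       // rewrite the address bar only if true
    replaceTitle: newTitle !== page.title, // assign document.title only if true
    event: { 'event': 'piiRedacted', 'Page URL': newURL }
  };
}

// Simple stand-in redaction function that masks e-mail addresses.
function maskEmails(text) {
  return text.replace(/[a-zA-Z0-9_.\-]+(@|%40)[a-zA-Z0-9_.\-]+\.[a-zA-Z]{2,5}/gi, '[REDACTED_EMAIL]');
}

var plan = planRedaction({
  url: 'https://www.domein.nl?email=siemon@i-spark.nl',
  title: 'Welcome'
}, maskEmails);

console.log(plan.replaceUrl, plan.replaceTitle, plan.event['Page URL']);
```

The event is built whether or not anything was replaced, so tags that wait for ‘piiRedacted’ also fire on pages without personal data.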

 

  5. Replace the existing ‘All Pages’ trigger with the new ‘piiRedacted’ trigger.

To ensure that tags from other scripts are not loaded until after the URL and page title have been stripped of personal data, the ‘All Pages’ trigger on existing tags should be replaced by the new trigger based on the ‘piiRedacted’ event.
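
The effect of swapping the trigger can be imitated with a very small event dispatcher. The sketch below is a simulation only and says nothing about how Google Tag Manager works internally; `createDataLayer` and the logged messages are assumptions made for this illustration.

```javascript
// Minimal stand-in for a tag manager: tags register for a named event
// and run when that event is pushed to the data layer.
function createDataLayer() {
  var listeners = {};
  return {
    on: function (eventName, tag) {
      (listeners[eventName] = listeners[eventName] || []).push(tag);
    },
    push: function (message) {
      var tags = listeners[message.event] || [];
      for (var i = 0; i < tags.length; i++) {
        tags[i](message);
      }
    }
  };
}

var log = [];
var dataLayer = createDataLayer();

// An existing tag, moved from "All Pages" to the "piiRedacted" trigger.
dataLayer.on('piiRedacted', function (message) {
  log.push('analytics tag saw ' + message['Page URL']);
});

// The redaction tag finishes first and only then pushes the event.
log.push('redaction done');
dataLayer.push({ 'event': 'piiRedacted', 'Page URL': 'https://www.domein.nl?email=[REDACTED_EMAIL]' });

console.log(log);
```

Because the analytics tag only runs in response to the event, it cannot see the URL before the replacement has taken place.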

Tool-independent whitelisting – how do I do that?

With tool-independent whitelisting, you define the data that is not personal data (the whitelist). Next, you turn the whitelist into blacklist patterns, because you want to replace every value that is not on the whitelist. An example will make this clear.

As an illustration, I want to replace the value of all URL parameters with “[REDACTED]”, except for the parameters “foo” and “bar” – this is my whitelist. Specifically, this means that the URL “https://www.domein.nl?foo=waarde&bar=waarde&email=siemon@i-spark.nl&foobar=waarde” will be replaced by “https://www.domein.nl?foo=waarde&bar=waarde&email=[REDACTED]&foobar=[REDACTED]”. Below I explain step by step how you can achieve this using Google Tag Manager:

  1. Define your blacklist 
  • Turn your whitelist into a blacklist. The goal is to replace all parameters except those on the whitelist. So specify ‘all parameters except the whitelist’ in a regular expression. In a similar way you can specify a pattern that matches every word except those on a whitelist.
  • Create a new variable of the type “Custom JavaScript macro” and call it “PII”.
  • Within the new variable, define an array containing an object with three keys: ‘name’, ‘regex’ and ‘replacement’. For the name key, enter a string describing the type of data that does not appear in the whitelist. In this case I use ‘NON-WHITELISTED PARAMETER’. For the regex key, specify the regular expression for the type of data that does not appear in the whitelist, in this case all parameters except ‘foo’ and ‘bar’. For the replacement key, specify the string with which the personal data is to be replaced. A “$” followed by a number refers to the capturing group with that number, whose match is kept in the replacement.
  • Return the defined array.

 

function() {
  var piiRegex = [{
    name: 'NON-WHITELISTED PARAMETER',
    // Group 1 keeps the separator and the name of every parameter that is
    // not on the whitelist (foo, bar); group 2 is the value to redact.
    regex: /([?&](?!(?:foo|bar)=)[^=&#]+=)([^&#]+)/gi,
    replacement: '$1[REDACTED]'
  }];

  return piiRegex;
}

This regular expression is more complicated because it uses a negative lookahead. For enthusiasts, I will explain it bit by bit:

  • ([?&](?!(?:foo|bar)=)[^=&#]+=)

First capturing group: a “?” or “&” that is not followed by “foo=” or “bar=”, then the parameter name up to and including “=”.

  • [?&]

Match a “?” or “&”.

  • (?!regex)

Negative lookahead: match the above only if it is not followed by a match of the given regular expression.

  • (?:foo|bar)=

The string “foo” or “bar” followed by “=”. “foo|bar” is your whitelist! The “?:” only makes the group non-capturing, so it does not affect the numbering of the capturing groups.

  • [^=&#]+=

Match any character except “=”, “&” or “#” at least once, followed by “=”. This is the parameter name.

  • ([^&#]+)

Second capturing group: match any character except “&” or “#” at least once. This is the parameter value; it ends at the next “&”, at a “#” or at the end of the string.

  • The ‘g’ means ‘global’. In other words, search (and replace) all matches within the string instead of just the first match.
  • The ‘i’ indicates that the regular expression is case-insensitive.

Applied to the URL “https://www.domein.nl?foo=waarde&bar=waarde&email=siemon@i-spark.nl&foobar=waarde”, the above regular expression gives two matches, namely “&email=siemon@i-spark.nl” and “&foobar=waarde”. After all, there are two parameters that do not match the whitelist within the negative lookahead, namely “email” and “foobar”. I only want to replace the parameter value with “[REDACTED]”, so the first capturing group of each match (“&email=” and “&foobar=”) must be preserved. I do this by replacing each full match with “$1[REDACTED]”.
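
The two matches can be checked outside Google Tag Manager with a few lines of plain JavaScript. The sketch assumes a whitelist pattern of the shape discussed above, written so that neither the parameter name nor the value can run past “&” or “#”.

```javascript
var regex = /([?&](?!(?:foo|bar)=)[^=&#]+=)([^&#]+)/gi;
var url = 'https://www.domein.nl?foo=waarde&bar=waarde&email=siemon@i-spark.nl&foobar=waarde';

// With the "g" flag, match() returns every full match.
var matches = url.match(regex);
console.log(matches);
// [ '&email=siemon@i-spark.nl', '&foobar=waarde' ]

// "$1" restores the separator and parameter name; only the value is replaced.
var redacted = url.replace(regex, '$1[REDACTED]');
console.log(redacted);
// https://www.domein.nl?foo=waarde&bar=waarde&email=[REDACTED]&foobar=[REDACTED]
```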

The remaining steps (creating the replacement function, the tag and the trigger, and replacing the ‘All Pages’ trigger) are the same for tool-independent whitelisting as described above for tool-independent blacklisting.
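
If the whitelist changes often, writing the regular expression by hand becomes error-prone. One possible approach, sketched below, is to generate the blacklist entry from an array of allowed parameter names. The helpers `escapeRegex` and `buildWhitelistEntry` and the parameter “utm_source” are assumptions made for this illustration.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: turn a list of allowed parameter names into one blacklist entry.
function buildWhitelistEntry(allowedNames) {
  var names = allowedNames.map(escapeRegex).join('|');
  return {
    name: 'NON-WHITELISTED PARAMETER',
    regex: new RegExp('([?&](?!(?:' + names + ')=)[^=&#]+=)([^&#]+)', 'gi'),
    replacement: '$1[REDACTED]'
  };
}

var entry = buildWhitelistEntry(['foo', 'bar', 'utm_source']);
var campaignUrl = 'https://www.domein.nl?utm_source=mail&email=siemon@i-spark.nl';
console.log(campaignUrl.replace(entry.regex, entry.replacement));
// https://www.domein.nl?utm_source=mail&email=[REDACTED]
```

The returned object has the same three keys as the hand-written entry, so it could be placed in the array that the “PII” variable returns.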

Takeaways

  • The Dutch Data Protection Authority (Autoriteit Persoonsgegevens) uses a broad definition of ‘personal data’. This requires measures to prevent the processing of personal data as much as possible. A combination of blacklisting and whitelisting is recommended.
  • The ‘replace’ function in combination with regular expressions makes it easy to replace personal data. This method can be applied tool-independently before other scripts are loaded and is therefore very efficient.
  • The tool-independent solution using regular expressions supports both blacklists and whitelists. For whitelisting you can use a negative lookahead.

Would you like to have help implementing the above mentioned solutions? Or do you want advice on what is the best solution for your business? Get in touch!


Siemon