3 March 2020 | 8 minutes of reading time
Read part 2 here!
The General Data Protection Regulation (GDPR, known in Dutch as the AVG) has applied since 25 May 2018. The GDPR uses a broad definition of ‘personal data’, namely ‘all information relating to an identified or identifiable natural person’. This covers data that directly relates to a person, as well as data that can be traced back to this person in combination with other data.
Personal data may only be processed for the purpose for which it was obtained. This is called ‘purpose limitation’. In order to prevent organizations from collecting personal data without being able to justify this properly, they must implement the principles of privacy by design and privacy by default. Privacy by design means that an organization ensures that personal data is properly protected when designing products and services. Privacy by default means that an organization must take technical and organizational measures to ensure that, as a standard, it only processes personal data that are necessary for the specific purpose it wants to achieve. This means, for example, that it is not permitted to ask for more data than is necessary to subscribe to a newsletter.
There is a chance, however, that you may process personal data unlawfully, without you being aware of this. For example, because you use tools or scripts on your website that measure which URL a visitor is on. This URL can contain an e-mail address, for example, if the customer visits your website via a customer e-mail.
In my two-part blog I would therefore like to show you what measures you can take to prevent the unintentional collection and processing of personal data on your website as much as possible. In doing so, personal data is overwritten, not deleted – so if certain data is collected unintentionally, you can easily recognize it and trace its source.
In part 1, I introduce the ‘PII prevention matrix’ and show you how you can use Google Tag Manager to blacklist and whitelist personal data in a tool-independent way, i.e. not separately for every script or tool, but only once for all scripts and tools. The measures are intended for technical web analysts who do not shy away from using Google Tag Manager and JavaScript. I also use regular expressions (regex) to describe patterns.
I classify the measures you can take to prevent the processing of personal data along two dimensions. Together they form a matrix that I call the PII prevention matrix. Here the abbreviation PII stands for ‘Personally Identifiable Information’, which is more or less equivalent to ‘personal data’. The classifications are:
1) Tool-independent vs. tool-dependent.
In a tool-dependent solution, the personal data must be replaced separately for each new tool. The order in which the events take place is as follows:
Because the tool-dependent solution requires step III to be performed once per tool, it is rather inefficient. In addition, some scripts and tools offer no way to edit data such as URLs before they process it. A tool-independent solution solves both problems: the replacement happens once, before any script or tool reads the data, and therefore covers all of them. The sequence in which the events take place is as follows:
2) blacklisting vs. whitelisting
With blacklisting you define a list of data that may not be processed. Is a certain item of data not on the blacklist? Then it may be processed. Whitelisting works the other way around: you define a list of data that may be processed. A particular item of data may only be processed if it is on the list. This makes whitelists stricter than blacklists.
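The difference is easiest to see with a parameter that is on neither list. The following is a minimal sketch; the parameter names and both lists are hypothetical and only serve as an illustration.

```javascript
// Hypothetical lists, for illustration only.
var blacklist = ['email', 'phone']; // data that may NOT be processed
var whitelist = ['foo', 'bar'];     // data that MAY be processed

// Blacklisting: everything is allowed unless it is on the list.
function allowedByBlacklist(name) {
  return blacklist.indexOf(name) === -1;
}

// Whitelisting: nothing is allowed unless it is on the list.
function allowedByWhitelist(name) {
  return whitelist.indexOf(name) !== -1;
}

['foo', 'email', 'postcode'].forEach(function (name) {
  console.log(name, allowedByBlacklist(name), allowedByWhitelist(name));
});
```

The parameter ‘postcode’ was not foreseen in either list: it passes the blacklist, but the whitelist blocks it. This is what makes whitelisting the stricter option.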
The classifications above result in a quadrant of 4 solutions to prevent the processing of personal data:
| | Blacklisting | Whitelisting |
| Tool-independent | Current blog | Current blog |
| Tool-dependent | Part 2/2 | Part 2/2 |
In the rest of this blog I will go deeper into the tool-independent blacklisting and whitelisting of personal data using Google Tag Manager. In the next blog I will show you how to apply blacklisting and whitelisting on Google Analytics.
Most scripts on a page have access to the metadata of a web page, such as the URL and the page title. This makes these metadata the obvious place to protect personal data in a tool-independent way, that is: before the data is available to other scripts. It regularly happens that these metadata, intentionally or unintentionally, contain personal data. Think, for example, of an e-mail address that is sent as a query parameter to the landing page of a customer e-mail, or a postal code that is sent as a query parameter from a comparison site.
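To make these two examples concrete, here is a minimal sketch that checks whether a URL or page title contains something that looks like personal data. The function name `findPii`, the example URLs and the postal code pattern (four digits followed by two letters, Dutch style) are assumptions for illustration, and such patterns can produce false positives.

```javascript
// Patterns for illustration only; the postal code pattern is an assumption.
var patterns = [
  { name: 'EMAIL', regex: /[a-zA-Z0-9_\-.]+(@|%40)[a-zA-Z0-9_\-.]+\.[a-zA-Z]{2,5}/ },
  { name: 'POSTAL_CODE', regex: /\b[1-9][0-9]{3}(\s|%20|\+)?[a-zA-Z]{2}\b/ }
];

// Return the names of all patterns that occur in the given text.
function findPii(text) {
  return patterns
    .filter(function (p) { return p.regex.test(text); })
    .map(function (p) { return p.name; });
}

console.log(findPii('https://www.example.com/welcome?email=jan%40example.com'));
console.log(findPii('https://www.example.com/compare?postcode=1234AB&type=energy'));
console.log(findPii('https://www.example.com/products?page=2'));
```

The first URL is reported as containing an e-mail address, the second as containing a postal code, and the third as clean.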
Tool-independent blacklisting – how do I do that?
With tool-independent blacklisting, you specify regular expressions that match personal data (the blacklist). Next, you check whether these patterns occur in the data you want to process. If they do, you replace the substrings that match. You do this before the data is processed by other scripts (tool-independent).
a) Create a Custom JavaScript variable named ‘PII’ that returns the blacklist:

    function() {
      // Each entry: a name, the pattern to look for and the replacement text.
      var piiRegex = [{
        name: 'BLACKLISTED PARAMETER',
        regex: /[?&](foo|bar)=([^&$#]+)/gi,
        replacement: '[REDACTED]'
      }, {
        name: 'EMAIL',
        regex: /(([a-zA-Z0-9_\-\.]+)(@|%40)([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5}))/gi,
        replacement: '[REDACTED_EMAIL]'
      }];
      return piiRegex;
    }

b) Create a Custom JavaScript variable named ‘return redactData function’. It returns a function that you call as {{return redactData function}}(datastring, {{PII}}):

    function() {
      return function(data, PII) {
        // Apply every pattern in turn to the data string.
        for (var i = 0; i < PII.length; i++) {
          data = data.replace(PII[i].regex, PII[i].replacement);
        }
        return data;
      };
    }

c) Create a Custom HTML tag that overwrites the URL and the page title:

    <script>
    (function() {
      var PII = {{PII}};
      var URL = {{Page URL}};

      // Overwrite the URL in the address bar if it contains personal data.
      var newURL = {{return redactData function}}(URL, PII);
      if (newURL !== URL) {
        window.history.replaceState({}, document.title, newURL);
      }

      // Do the same for the page title.
      var title = document.title;
      var newTitle = {{return redactData function}}(title, PII);
      if (newTitle !== title) {
        document.title = newTitle;
      }

      // Signal that the URL and title have been processed.
      window.dataLayer = window.dataLayer || [];
      window.dataLayer.push({
        'event': 'piiRedacted',
        'Page URL': newURL
      });
    })();
    </script>

d) Create a new trigger of type ‘Custom Event’ based on the ‘piiRedacted’ event. This event marks the moment at which all actions in the Custom HTML tag above have been executed. From this moment on, the URL and title of the web page are free of the personal data described by your patterns.
e) To ensure that the tags of other scripts are not fired until the URL and page title have been stripped of personal data, replace the ‘All Pages’ trigger on existing tags with the new trigger based on the ‘piiRedacted’ event.
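Before publishing the container, it can help to try the patterns outside Google Tag Manager. The following is a minimal sketch that runs in a browser console or in Node.js, assuming the blacklist and the replace loop are copied from the two variables above. The function name `redactData` and the example URLs are hypothetical.

```javascript
// Blacklist copied from the PII variable above.
var PII = [{
  name: 'BLACKLISTED PARAMETER',
  regex: /[?&](foo|bar)=([^&$#]+)/gi,
  replacement: '[REDACTED]'
}, {
  name: 'EMAIL',
  regex: /(([a-zA-Z0-9_\-\.]+)(@|%40)([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5}))/gi,
  replacement: '[REDACTED_EMAIL]'
}];

// Same loop as in the 'return redactData function' variable.
function redactData(data, patterns) {
  for (var i = 0; i < patterns.length; i++) {
    data = data.replace(patterns[i].regex, patterns[i].replacement);
  }
  return data;
}

// Hypothetical URLs, for illustration only.
var samples = [
  'https://www.example.com/thanks?email=jan%40example.com&utm_source=mail',
  'https://www.example.com/?foo=secret&page=2',
  'https://www.example.com/products?page=2'
];

samples.forEach(function (url) {
  var redacted = redactData(url, PII);
  console.log(redacted !== url ? 'changed:   ' : 'unchanged: ', redacted);
});
```

Note that the ‘BLACKLISTED PARAMETER’ pattern replaces the whole match, including the leading “?foo=”, so the second URL becomes “https://www.example.com/[REDACTED]&page=2”. If the parameter name should survive, the prefix can be captured and put back with “$1”, as the whitelist pattern later in this blog does.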
Tool-independent whitelisting – how do I do that?
With tool-independent whitelisting, you define the data that is not personal data (the whitelist). You then turn the whitelist into blacklist patterns, because you want to replace every value that is not on the whitelist. An example will make this clear.
As an illustration, I want to replace the value of all URL parameters with “[REDACTED]”, except for the parameters “foo” and “bar” – this is my whitelist. Specifically, this means that the URL “https://www.domein.nl?foo=waarde&bar=waarde&email=siemon@i-spark.nl&foobar=waarde” will be replaced by “https://www.domein.nl?foo=waarde&bar=waarde&email=[REDACTED]&foobar=[REDACTED]”. Below I explain step by step how you can achieve this using Google Tag Manager:
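a) Create a Custom JavaScript variable named ‘PII’ that returns the pattern for non-whitelisted parameters:

    function() {
      var piiRegex = [{
        name: 'NON-WHITELISTED PARAMETER',
        // "foo" and "bar" inside the negative lookahead are the whitelist.
        regex: /([?&](?!((foo|bar)=))[^=]+=)([^&$#])+/gi,
        // "$1" puts the separator and parameter name back in front of the replacement.
        replacement: '$1[REDACTED]'
      }];
      return piiRegex;
    }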
The regular expression has become more complicated because it uses a negative lookahead. For the enthusiast, here it is piece by piece:
- ([?&](?!((foo|bar)=))[^=]+=)
  First capturing group: match a “?” or “&” that is not followed by “foo=” or “bar=”, plus the parameter name up to and including “=”.
  - [?&]
    Match a “?” or “&”.
  - (?!((foo|bar)=))
    Negative lookahead: match the above only if it is not followed by the enclosed expression.
    - ((foo|bar)=)
      The string “foo” or “bar” followed by “=”.
      “foo|bar” is your whitelist!
  - [^=]+=
    Match every character except “=” at least once, up to and including “=”.
- ([^&$#])+
  Last capturing group (the parameter value): match every character except “&”, “$” or “#” at least once. Inside a character class, “$” is a literal dollar sign and not the end-of-string anchor; the match simply stops when the string ends.
With the above regular expression, the URL “https://www.domein.nl?foo=waarde&bar=waarde&email=siemon@i-spark.nl&foobar=waarde” gives two matches, namely “&email=siemon@i-spark.nl” and “&foobar=waarde”. After all, there are two parameters that do not match the whitelist within the negative lookahead, namely “email” and “foobar”. I only want to replace the parameter value with “[REDACTED]”, so the first capturing group of each match (“&email=” and “&foobar=”) must be preserved. I do this by replacing each full regex match with “$1[REDACTED]”.
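If the whitelist grows, editing the regular expression by hand becomes error-prone. The following is a minimal sketch that builds the pattern from a list of allowed parameter names and checks it against the example URL. The helper names `escapeRegex` and `buildWhitelistRegex` are hypothetical. The sketch also deviates slightly from the pattern above: it uses a non-capturing group in the lookahead and `[^&#]+` for the value, so a literal “$” in a value is redacted as well.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build the "non-whitelisted parameter" pattern from the allowed names.
function buildWhitelistRegex(allowedNames) {
  var alternatives = allowedNames.map(escapeRegex).join('|');
  return new RegExp('([?&](?!(?:' + alternatives + ')=)[^=]+=)[^&#]+', 'gi');
}

var regex = buildWhitelistRegex(['foo', 'bar']);
var url = 'https://www.domein.nl?foo=waarde&bar=waarde&email=siemon@i-spark.nl&foobar=waarde';

// "$1" keeps the separator and parameter name; only the value is replaced.
var redacted = url.replace(regex, '$1[REDACTED]');
console.log(redacted);
// https://www.domein.nl?foo=waarde&bar=waarde&email=[REDACTED]&foobar=[REDACTED]
```

The parameter “foobar” is redacted even though it starts with “foo”, because the lookahead requires the whitelisted name to be followed directly by “=”.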
Steps b to e remain the same for the tool-independent whitelisting, as described above for the tool-independent blacklisting.
Takeaways
Would you like help implementing the solutions above? Or would you like advice on the best solution for your business? Get in touch!
Siemon