This is how you prevent privacy-sensitive data collection – part 2/2: Blacklisting and whitelisting in Google Analytics

3 March 2020 | 6 minutes of reading time

Published by Marketingfacts

Have you missed part 1? Read it here!

Personal data may only be collected if the processing complies with the principles of processing, such as lawfulness, transparency, purpose limitation and accuracy. If these requirements are not met, collecting the data is not permitted. There is, however, a chance that you process personal data unlawfully without being aware of it, for example because you use tools or scripts on your website that record which URL a visitor is on. This URL may contain an email address, for instance when the customer reaches your website via a link in a customer email.

In part 1 of this blog, I shared the so-called ‘PII prevention matrix’ and showed how you can apply blacklisting and whitelisting of personal data tool-independently. It also became clear that the tool-independent solution only prevents the unintentional collection of personal data in the metadata of a page (its URL and title). Therefore, in this second part of the blog, I will discuss blacklisting and whitelisting using a tool-dependent solution: I will show you how you can blacklist and whitelist PII in Google Analytics. I conclude with a recommendation on the most suitable solution and explain the implementation step by step.


Blacklisting in Google Analytics

In his blog, Simo Ahava shows how the customTask field can be used to check the hit sent to the Google servers against regular expressions that match personal data. Personal data found in the hit is then replaced by, for example, the text “[REDACTED_EMAIL]”.

 

The neat thing about the method described by Simo is that the entire hit is checked against the regular expressions: not only the URL or page title, but all hit parameters that are sent to the Google servers. In addition, the list of regular expressions used within the customTask is flexible: it can easily be adjusted and expanded to fit the needs of your organization.
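To illustrate why checking the whole hit matters, here is a minimal sketch (not Simo’s actual code) that runs a single, simplified email pattern over every parameter of a made-up hit payload. The helper name redactPayload, the pattern and the sample values are assumptions for illustration only; dl, dt and cd1 are the Universal Analytics payload parameters for document location, document title and custom dimension 1.

```javascript
// Assumed, simplified email pattern for this illustration.
var EMAIL = /[a-z0-9_.\-]+@[a-z0-9_.\-]+\.[a-z]{2,5}/gi;

// Hypothetical helper: redact emails in every key=value pair of a payload.
function redactPayload(payload) {
  return payload.split('&').map(function (pair) {
    var idx = pair.indexOf('=');
    if (idx === -1) { return pair; }
    var key = pair.slice(0, idx);
    var value = decodeURIComponent(pair.slice(idx + 1));
    return key + '=' + encodeURIComponent(value.replace(EMAIL, '[REDACTED_EMAIL]'));
  }).join('&');
}

// Made-up payload: the email sits in the URL (dl), the title (dt)
// and a custom dimension (cd1).
var payload = [
  'v=1',
  't=pageview',
  'dl=' + encodeURIComponent('https://example.com/?email=jane@example.com'),
  'dt=' + encodeURIComponent('Welcome jane@example.com'),
  'cd1=' + encodeURIComponent('jane@example.com')
].join('&');

var cleaned = redactPayload(payload);
console.log(decodeURIComponent(cleaned));
```

A check limited to the URL would have caught only the first of these three occurrences.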


Whitelisting in Google Analytics

The disadvantage of blacklisting is that you need to know in advance which patterns the different types of personal data conform to. And often you don’t know that. Think, for example, of a search term entered on your website: how do you distinguish a search term for a product from a search term that contains personal data? In that case, the safe option is not to measure the search field at all. 

Sometimes, however, you can use whitelisting: only the values that match your specified patterns are collected, and values that do not match are replaced. To do this, the whitelist must first be turned into a blacklist, as discussed in part 1 of this blog.

The use of whitelisting reduces the risk of inadvertently collecting personal data. On the other hand, if your whitelist is not complete, you run the risk of losing data that does not contain any personal data. And that risk is considerable. After all, how can you specify all the texts you want to whitelist?
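As a rough sketch of what “turning a whitelist into a blacklist” can look like for URL parameters: a negative lookahead matches every parameter whose name is not on the whitelist, and only its value is replaced. The pattern has the same shape as the one used in the implementation further down; the parameter names foo and bar are placeholders, and the helper name keepOnlyWhitelisted and the sample URL are assumptions.

```javascript
// Whitelist written as a negative lookahead: any parameter whose name
// is NOT exactly foo or bar gets its value replaced.
var nonWhitelisted = /([?&](?!((foo|bar)=))[^=]+=)([^&$#])+/gi;

// Hypothetical helper for this illustration.
function keepOnlyWhitelisted(url) {
  // $1 is the "?name=" or "&name=" prefix, so only the value is replaced.
  return url.replace(nonWhitelisted, '$1[REDACTED]');
}

var result = keepOnlyWhitelisted(
  'https://example.com/page?foo=1&name=Jane&bar=2&food=pizza#top'
);
console.log(result);
```

Note that food is redacted even though it starts with foo, because the lookahead requires the name to be followed directly by an equals sign. It also shows the downside described above: a harmless value such as pizza is lost because its parameter is not on the whitelist.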

 

So, what is the best solution to prevent the unintentional collection of personal data?

The purpose of this and the previous blog was to show you that there are different approaches to preventing the collection of personal data and to give you some guidance on how to do this. But which of the solutions described in parts 1 and 2 of this blog is the best? Disclaimer: implementing the described solutions is by no means sufficient on its own to comply with privacy legislation.

My personal preference is always to whitelist where possible. After all, this way you run the lowest risk of inadvertently collecting personal data. We have seen that URL parameters are ideally suited for this. However, in theory it is possible that even the value of a whitelisted parameter, for example a campaign tracking parameter, contains personal data. Also, not all data that is allowed to be collected can be captured in whitelists. For these reasons, it is essential to use blacklists as well. Where possible, a combination of whitelisting and blacklisting is therefore recommended. 

Do you only use Google Analytics on your site and are you not planning to use other scripts in the near future? Then the Google Analytics-specific solution suffices to prevent the collection of personal data (again, credit for this goes to Simo Ahava). The tool-independent solution, in which personal data is replaced in the URL and title of the web page, has no added value in this case. After all, with the Google Analytics-specific solution, the entire payload is checked for PII patterns, and this payload includes the page URL and page title. However, it is advisable to rewrite your whitelist as a blacklist and include it within the customTask.

Do you also use scripts other than Google Analytics? If so, it is advisable to use a combination of blacklisting and whitelisting independently of tools to prevent the collection of personal data in the URL and title of a web page. In addition, you can prevent the collection of personal data within Google Analytics by using the customTask field. Here, too, a combination of blacklisting and whitelisting is advisable. The combination of tool-independent and tool-dependent blacklisting and whitelisting is described in the next chapter.

Using a combination of tool-independent and tool-dependent blacklisting and whitelisting – the implementation

It is important to keep your Google Tag Manager organized. This will ultimately save you time and prevent mistakes. For the implementation of tool-independent and tool-dependent black- and whitelisting this means 2 things:

1) Define a blacklist once, including the inverted whitelist.

2) Define a function once that replaces the blacklist patterns in the data.

Then call these variables from both the tool-independent and the tool-dependent solution to prevent the collection of personal data.

a. Define your blacklist

  • Create a new variable of the type “Custom JavaScript macro” and call it “PII”. See the example below.
  • Within the new variable, define an array containing an object for each type of data whose collection you want to prevent. In our example we are dealing with two types of data, namely non-whitelisted parameters and e-mail addresses.
  • Give the objects you defined for each type of data 3 keys: ‘name’, ‘regex’ and ‘replacement’. For the name key, enter a string describing the type of data. This is especially useful for yourself. For the regex key, enter the regular expression that the type of data complies with. For the replacement key, enter the string with which the pattern is to be replaced.
  • Return the defined array. 

Because the regular expressions are in a separate variable, they can be used to check the metadata of a page for the specified patterns (tool independent) as well as within the customTask in Google Analytics (tool dependent). This means that you only need to make an adjustment in your blacklist in 1 place.

 

function() {
  var piiRegex = [{
    // Inverted whitelist: matches every URL parameter except foo and bar
    name: 'NON-WHITELISTED PARAMETER',
    regex: /([?&](?!((foo|bar)=))[^=]+=)([^&$#])+/gi,
    replacement: '$1[REDACTED]'
  }, {
    // E-mail addresses, with a literal or URL-encoded @
    name: 'EMAIL',
    regex: /(([a-zA-Z0-9_\-\.]+)(@|%40)([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5}))/gi,
    replacement: '[REDACTED_EMAIL]'
  }];

  return piiRegex;
}
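If your whitelist grows, writing the alternation by hand becomes error-prone. As a possible variation, the sketch below generates the same “non-whitelisted parameter” pattern from an array of allowed names. The helpers escapeRegex and buildNonWhitelistedRegex and the parameter names are hypothetical and not part of the implementation described here.

```javascript
// Hypothetical helper: escape regex metacharacters so that a
// parameter name is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build the "non-whitelisted parameter" pattern
// from a list of allowed parameter names.
function buildNonWhitelistedRegex(allowedParams) {
  var names = allowedParams.map(escapeRegex).join('|');
  return new RegExp('([?&](?!((' + names + ')=))[^=]+=)([^&$#])+', 'gi');
}

// Assumed whitelist for this illustration.
var regex = buildNonWhitelistedRegex(['utm_source', 'utm_medium', 'page']);
var cleaned = '/search?utm_source=mail&q=jane&page=2'.replace(regex, '$1[REDACTED]');
console.log(cleaned);
```

The generated expression could then be used as the regex value of the first object in the blacklist array.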

b. Create a new function that replaces personal data

  • Create a new variable of the type “Custom JavaScript macro” and call it “return redactData function”. Use the Javascript below.
  • The variable returns a function with 2 parameters: the 1st parameter is a data string, the 2nd parameter is the previously defined blacklist we called “PII”. For each entry in the blacklist, the function replaces every match of the entry’s regular expression in the data string with the value of the corresponding replacement key. As a result, the function returns a data string in which all data matching our blacklist “PII” has been replaced.

Again, by placing the function in a separate variable, you can call it from any tag or Javascript variable within Google Tag Manager, and therefore from both the tool-independent and the tool-dependent solution. To do this, you pass the data string and the defined blacklist “PII” as arguments. This looks as follows: {{return redactData function}}(data string, {{PII}})

function() {
  return function(data, PII) {
    // Apply every blacklist entry in turn to the data string
    for (var i = 0; i < PII.length; i++) {
      data = data.replace(PII[i].regex, PII[i].replacement);
    }
    return data;
  };
}
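Outside Google Tag Manager the {{…}} references do not resolve, so to try out the two variables in a browser console or in Node.js you can use plain stand-ins. The sketch below does that; the names pii and redactData and the sample URL are assumptions. It also shows why the blacklist is needed next to the whitelist: foo is a whitelisted parameter, yet its value holds an e-mail address.

```javascript
// Stand-ins for the GTM variables {{PII}} and {{return redactData function}}.
var pii = [{
  name: 'NON-WHITELISTED PARAMETER',
  regex: /([?&](?!((foo|bar)=))[^=]+=)([^&$#])+/gi,
  replacement: '$1[REDACTED]'
}, {
  name: 'EMAIL',
  regex: /(([a-zA-Z0-9_\-\.]+)(@|%40)([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5}))/gi,
  replacement: '[REDACTED_EMAIL]'
}];

function redactData(data, list) {
  for (var i = 0; i < list.length; i++) {
    data = data.replace(list[i].regex, list[i].replacement);
  }
  return data;
}

// "foo" is whitelisted but contains an e-mail address;
// "token" is not whitelisted at all.
var url = 'https://example.com/?foo=jane%40example.com&token=abc123';
var cleaned = redactData(url, pii);
console.log(cleaned);
```

The first entry removes the value of token, the second one catches the e-mail address that slipped through the whitelist.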

With the creation of the blacklist (step a) and the function to replace blacklisted patterns in data (step b), the basis for our implementation is ready. What remains for us to do is to let both the tool-independent and the tool-dependent solution make use of these variables. For the tool-independent solution I refer you to steps c to e of the previous part of this blog. For the Google Analytics specific solution the customTask field needs to be adjusted as follows, so that the above variables are called: 

function() {
  return function(model) {
    var piiRegex = {{PII}};
    var globalSendTaskName = '_' + model.get('trackingId') + '_sendHitTask';

    // Fetch reference to the original sendHitTask
    var originalSendTask = window[globalSendTaskName] = window[globalSendTaskName] || model.get('sendHitTask');

    var i, hitPayload, parts, val;

    // Overwrite sendHitTask with PII purger
    model.set('sendHitTask', function(sendModel) {
      hitPayload = sendModel.get('hitPayload').split('&');

      for (i = 0; i < hitPayload.length; i++) {
        parts = hitPayload[i].split('=');

        // Double-decode, to account for web server encode + analytics.js encode
        try {
          val = decodeURIComponent(decodeURIComponent(parts[1]));
        } catch(e) {
          val = decodeURIComponent(parts[1]);
        }

        // The redact function already loops over the whole blacklist,
        // so a single call per parameter value suffices
        val = {{return redactData function}}(val, piiRegex);

        parts[1] = encodeURIComponent(val);
        hitPayload[i] = parts.join('=');
      }

      sendModel.set('hitPayload', hitPayload.join('&'), true);
      originalSendTask(sendModel);
    });
  };
}
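The double decoding with its fallback is easiest to understand in isolation. The sketch below mirrors the try/catch from the customTask above in a hypothetical helper called decodeTwice; the sample values are made up.

```javascript
// Hypothetical helper mirroring the try/catch in the customTask above.
function decodeTwice(raw) {
  try {
    return decodeURIComponent(decodeURIComponent(raw));
  } catch (e) {
    // The second decode failed (for example on a literal "%"),
    // so fall back to decoding once.
    return decodeURIComponent(raw);
  }
}

// An "@" that was encoded twice: first to %40, then to %2540.
var doubleEncoded = decodeTwice('jane%2540example.com');

// A value with a literal percent sign, encoded once.
// Decoding it a second time would throw a URIError.
var singleEncoded = decodeTwice('discount%3D100%25');

console.log(doubleEncoded, singleEncoded);
```

Without the second decode, a doubly encoded address would still contain %2540 instead of @ or %40 and would slip past the e-mail pattern.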

With this implementation you have applied all four solutions from the PII matrix. The next step is to use Google Analytics to monitor how often and where data needs to be replaced, which can be recognized by the text “REDACTED”. Use this as input to completely prevent the use of this data in URLs. Good luck!
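For the monitoring step, one possible approach is to export the page paths from a report and count which parameters are redacted most often. The sketch below assumes such an export is available as an array of strings; the helper countRedactedParams and the sample paths are made up for illustration.

```javascript
// Hypothetical helper: count, per parameter name, how often its value
// was replaced by a [REDACTED...] marker.
function countRedactedParams(pagePaths) {
  var counts = {};
  var pattern = /[?&]([^=&#]+)=\[REDACTED[^\]]*\]/g;
  pagePaths.forEach(function (path) {
    var match;
    pattern.lastIndex = 0; // start every path from the beginning
    while ((match = pattern.exec(path)) !== null) {
      counts[match[1]] = (counts[match[1]] || 0) + 1;
    }
  });
  return counts;
}

// Made-up page paths, as they might appear in a pages report.
var report = [
  '/account?token=[REDACTED]&bar=1',
  '/newsletter?foo=[REDACTED_EMAIL]',
  '/account?token=[REDACTED]'
];

var counts = countRedactedParams(report);
console.log(counts);
```

Parameters that show up often in such a count are good candidates to remove from your URLs at the source.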

Would you like to have help implementing the above mentioned solutions? Or do you want advice on what is the best solution for your business? Get in touch!


Siemon