How to keep Personally Identifiable Information out of Google Analytics

In May 2018, the EU’s General Data Protection Regulation (GDPR) came into force, causing panic amongst marketers and crowding everyone’s inbox with emails from companies updating their privacy policies. One major aspect of this that marketers have to be aware of is the implication for analytics. Google Analytics (GA) is bringing in a new restriction on data retention past a set number of months (26 by default), and all analytics and tracking tools are going to be a lot more sensitive to Personally Identifiable Information being collected.

What Happens If You Send Personally Identifiable Information to Google Analytics

Regardless of GDPR, however, Google Analytics has for some time prohibited the sending of personally identifiable information (PII) to GA. The potential risks of this are high: if your GA account is found to be in breach of this policy, the guidelines specify:

“Your Analytics account could be terminated and your data destroyed if you use any of this information.”

The Google Analytics documentation specifies the types of data that it counts as PII. This includes:

  • email addresses
  • mailing addresses
  • phone numbers
  • full names or usernames

Importantly, this applies to data that has been collected in the past. This means that if you’ve ever accidentally (or intentionally) collected email addresses, full names or phone numbers in your GA account, the whole account and all of your historical data are at risk of being deleted. This is more common than you might think - it’s definitely worth checking all of your GA accounts. It is also why you should back up any historical data in your GA account if you think you’re at risk of falling foul of the terms of service.

This also applies to data filtered out at the view level. This means that if you’re sending PII such as email addresses in Google Analytics hits, but filtering it out using filters set up within the GA interface, you are still in breach of the GA terms of service.

Find out whether you’re collecting PII in GA

If you’re sending PII to GA, you’re probably sending the same information to AdWords, DoubleClick, Bing, Facebook, Optimizely, Hotjar, etc., etc., etc. - any analytics, tracking or remarketing tools that are implemented on your site are potential PII holes.

One silver lining of this cloud is that the AdWords team tends to be more proactive than the GA team in communicating with advertisers - they may spot what you’re doing and warn you about it. The AdWords and DoubleClick terms of service have similar rules to Google Analytics on PII, and the consequence of breaching them is losing access to AdWords features such as remarketing. The fix I’m outlining later in this post will stop PII from being sent to AdWords and DoubleClick, as well as any other marketing tags, as long as they’re being fired through Google Tag Manager.

This may be waiting for you in your inbox if you don’t fix any PII collection vulnerabilities on your site.

This notification from AdWords is rare compared to how often PII such as email addresses is inadvertently sent to Google Analytics. The most common way this happens is through URL parameters containing email addresses, phone numbers or usernames.

In order to find if/where this is happening on your site, you should pull up the Behaviour > Site Content > All Pages report, and use a custom filter for the “@” symbol (over as wide a time period as possible). This will unearth any URLs that have been visited on your site that contain that symbol.
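If you export that report, a quick script can do the same check offline. The following is a minimal sketch, assuming you have the page paths as a plain array of strings; `findEmailLikePaths` is a hypothetical helper name, and the pattern is deliberately loose. It also looks for `%40`, the URL-encoded form of "@", which a filter on the literal symbol would miss.

```javascript
// Hypothetical helper: returns the page paths that appear to contain an
// email address, written either with a literal "@" or URL-encoded as "%40".
function findEmailLikePaths(pagePaths) {
  // something + (@ or %40) + something + dot + at least two letters
  var emailLike = /[^\s\/?&=]+(@|%40)[^\s\/?&=]+\.[a-z]{2,}/i;
  return pagePaths.filter(function (path) {
    return emailLike.test(path);
  });
}

// Example input (made-up paths, for illustration only)
var flagged = findEmailLikePaths([
  '/thanks?email=sam%40example.com',
  '/contact?user=jo@example.org',
  '/blog/post-1',
  '/search?q=%40mentions'
]);
console.log(flagged);
// → [ '/thanks?email=sam%40example.com', '/contact?user=jo@example.org' ]
```

A path like `/search?q=%40mentions` is not flagged because nothing resembling a full address surrounds the "@", which keeps the list focused on likely PII.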

An example of a GA account with email addresses captured in URL parameters.

As mentioned above, if you have been sending PII to GA, you’re at risk of your GA account and data being deleted, in which case you’d be wise to back up your data. My colleague Dom Woodman has been developing a Python package to download data from the API and upload it to Google BigQuery - this won’t be a full backup of every hit in GA, but will give you the ability to have a record of key metrics and dimensions just in case the worst happens. Follow Dom on Twitter to hear more about this when it’s ready.

Once you’ve stopped PII from being sent to GA, it’s also a good idea to create a second clean GA account (as opposed to a property or a view) to start collecting data with no risk of losing that data. Unfortunately, that will only be possible for data after this has been set up as there is no way to retrospectively load data into GA.

Preventing PII from being sent to GA

Note: this solution requires a bit of knowledge of Google Tag Manager (GTM) and JavaScript. I highly recommend Simo Ahava’s blog for anyone learning how to use GTM - he’s written the best article on pretty much every GTM topic out there.

This solution works for anyone who is using Google Tag Manager to implement Google Analytics (and any other tracking tools) code on their site, either through custom HTML tags or the inbuilt Universal Analytics tag.

Other people (including Simo Ahava and Brian Clifton) have recommended a similar fix in the past: their approach is to overwrite the page path variable that is sent to GA. While this approach definitely works, and is a good way to prevent PII from reaching GA, it does rely on the variable being amended for every different tracking tag that you’re using in GTM, whereas the approach outlined here will, by default, apply to every tracking tag in your GTM container.

The way my method works is by the following steps:

  1. Rewrite URLs to remove any offending parameters and redact email addresses
  2. Change the URL in the browser using history.replaceState()
  3. Rewrite page titles to remove any email addresses
  4. Send a custom event to the DataLayer
  5. Trigger all tracking tags off of this custom event.

In GTM, the way this is done is by introducing a new tag, a new trigger and a new variable. I’ll outline each of these below.

The Tag

The new tag used in this fix is a custom HTML <script> tag, which should be triggered to load on all pages, at page view. This tag performs the first four actions above:

  1. Rewrite URLs to remove any offending parameters and redact email addresses
    1. Firstly, it extracts all URL parameters from the URL
    2. It then checks these parameters to see if they have been whitelisted (see below) - if a parameter is not on the whitelist, the parameter will be deleted.
    3. If a parameter is whitelisted, the value of the parameter is checked for email addresses using a regular expression (regex). If this regex finds an email address, it will be replaced with “EMAIL_REDACTED”
  2. Change the URL in the browser using history.replaceState()
    1. If changes have been made to URL parameters, the code uses the JavaScript history API to update the URL in the browser. It is important that no tags have fired by this point, which is why triggers need to be changed for all tracking tags in GTM.
    2. This has the secondary benefit of making URLs cleaner and more shareable, and ensuring that links are more likely to be to the canonical version of URLs.
  3. Rewrite page titles to remove any email addresses
    1. The code also checks for email addresses in page <title> tags using the same regex as above.
    2. If email addresses are found, each one is overwritten with “EMAIL_REDACTED”
  4. Send a custom event to the DataLayer
    1. Once the above operations are complete, a DataLayer event with the name “parametersRemoved” is sent, that can be used to trigger other tags.

How the Custom HTML tag should be set up.

Below is the code to copy and paste as the Custom HTML tag.
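A minimal sketch of a tag along these lines follows. It walks through the four steps described above but should not be taken as the exact code from this post: the function names (`safeDecode`, `redactEmails`, `cleanQuery`, `removePii`) and the email pattern are assumptions made for illustration. It is written in ES5-style syntax as a cautious choice for GTM, and in a Custom HTML tag it would be wrapped in `<script>` tags.

```javascript
// Illustrative sketch only; names and the regex are assumptions.
// Deliberately simple email pattern; tune it against your own data.
var EMAIL_REGEX = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi;

function redactEmails(text) {
  return text.replace(EMAIL_REGEX, 'EMAIL_REDACTED');
}

// decodeURIComponent throws on malformed input, so fall back to the raw text.
function safeDecode(text) {
  try { return decodeURIComponent(text); } catch (e) { return text; }
}

// Step 1: takes a query string (without the leading "?"), drops parameters
// that are not whitelisted, and redacts emails in the values that remain.
function cleanQuery(query, whitelist) {
  var kept = [];
  var pairs = query ? query.split('&') : [];
  for (var i = 0; i < pairs.length; i++) {
    if (!pairs[i]) continue;
    var eq = pairs[i].indexOf('=');
    var rawKey = eq === -1 ? pairs[i] : pairs[i].slice(0, eq);
    var rawValue = eq === -1 ? null : pairs[i].slice(eq + 1);
    if (whitelist.indexOf(safeDecode(rawKey)) === -1) continue; // not whitelisted
    if (rawValue === null) { kept.push(rawKey); continue; }
    // Decode first so that "%40" is seen as "@" by the regex.
    var decoded = safeDecode(rawValue);
    var redacted = redactEmails(decoded);
    kept.push(rawKey + '=' +
      (redacted === decoded ? rawValue : encodeURIComponent(redacted)));
  }
  return kept.join('&');
}

// Steps 2 to 4: update the URL and title, then push the custom event.
// win and doc are passed in so the logic can be exercised outside a browser.
function removePii(win, doc, whitelist) {
  var loc = win.location;
  var oldQuery = loc.search.replace(/^\?/, '');
  var newQuery = cleanQuery(oldQuery, whitelist);
  // If the history API is unavailable, the URL is left as it was.
  if (newQuery !== oldQuery && win.history && win.history.replaceState) {
    var newUrl = loc.pathname + (newQuery ? '?' + newQuery : '') + loc.hash;
    win.history.replaceState(win.history.state, '', newUrl);
  }
  var cleanTitle = redactEmails(doc.title);
  if (cleanTitle !== doc.title) doc.title = cleanTitle;
  win.dataLayer = win.dataLayer || [];
  win.dataLayer.push({ event: 'parametersRemoved' });
}

// In the GTM tag itself, the call would look like this
// (GTM substitutes the variable before the script runs):
//   removePii(window, document, {{parameterWhitelist}});
```

Keeping `cleanQuery` and `redactEmails` as plain functions of strings means they can be tried in a browser console or in Node against sample URLs before the tag is published.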

Triggers

Once the above tag has run, the URL and title will have changed for the page if they contained PII, and all email addresses will have been removed. It is now safe to send tracking information to Google Analytics, and other tracking tools.

In order to do this, set up custom triggers for all of your tracking tags. These triggers should replace the standard pageview trigger you would normally use for analytics tags, to ensure that the PII removal is in place before the tracking hits are sent. The trigger should fire on a Custom Event, with the name “parametersRemoved”.

You can create multiple triggers if certain tags are only to be fired on some pages. For example, you can add a hostname filter for a tag that is only to be fired on a certain subdomain.

How the trigger should be set up.

Whitelisted Parameters Variable

In order to not lose important tagging and tracking information, it is important to make a whitelist of parameters. This will consist of a JavaScript array that includes all URL parameters that tracking tools such as GA need to see.

This list will vary depending on the functionality of your site and its analytics set up, but should generally include:

  • utm_source
  • utm_medium
  • utm_campaign
  • utm_content
  • gclid
  • Site search parameters (e.g. “search” or “q”)
  • Affiliate tracking parameters

This list of parameters should be set as a Custom JS GTM variable, which returns an array. See below for an example:
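A sketch of what such a variable might contain, based on the list above. The `q` entry is only an example of a site search parameter, and affiliate parameters are left as a comment because they differ from site to site. The `var parameterWhitelist =` assignment is there so the snippet runs on its own; in GTM's Custom JavaScript variable editor you would paste just the anonymous function.

```javascript
// Assumed example of the whitelist; adjust the entries to your own site.
var parameterWhitelist = function () {
  return [
    'utm_source',
    'utm_medium',
    'utm_campaign',
    'utm_content',
    'gclid',
    'q' // site search parameter (example only)
    // add any affiliate tracking parameters your site relies on here
  ];
};
```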

For the tag to work, this variable must be named “parameterWhitelist”.

How the parameterWhitelist variable should be set up.

When this won’t work

It’s important to note that this fix only works under certain circumstances. It won’t strip PII from any tracking tags fired using on-page code (including gtag.js) rather than through GTM, and it won’t work for any GTM tags fired using pageview triggers.

The version of the code above also only checks for email addresses. Deleting non-whitelisted parameters will generally deal with most other forms of PII, but there’s a chance that things like phone numbers, names and postal addresses will still be tracked in GA. If this is the case, deeper action will need to be taken on your site to prevent this.
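If phone numbers are a concern, the same redaction idea could be extended with a second pattern. The sketch below is an assumption rather than part of the tag described in this post: `PHONE_REGEX` and `redactPhones` are hypothetical names, and the pattern is deliberately loose, so it will also catch long numeric IDs such as order numbers.

```javascript
// Hypothetical extension: redact anything that looks like a phone number.
// Matches an optional "+", then at least nine characters that start and end
// with a digit and contain only digits, spaces, brackets, dots or hyphens.
var PHONE_REGEX = /\+?\d[\d\s().-]{7,}\d/g;

function redactPhones(text) {
  return text.replace(PHONE_REGEX, 'PHONE_REDACTED');
}

console.log(redactPhones('call +44 20 7946 0958 now'));
// → call PHONE_REDACTED now
```

Because of the over-matching, it would be safer to apply something like this only to parameters where phone numbers are actually expected, rather than to every whitelisted value.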

Summing Up

With GDPR coming into force, everyone in digital marketing needs to be more vigilant about the data they’re collecting and where it’s being stored. This is a specific, slightly hacky fix for PII in Google Analytics, but if these sorts of issues are likely on your site, that’s probably symptomatic of a website’s or company’s wider attitude to user data not being up to scratch.

Here’s what you’ll need to do if you want to make sure you’re not risking your GA data:

  • Identify whether you’re collecting PII by checking for email addresses in URLs and titles
  • Back up your GA data
  • Implement the GTM fix to stop sending PII to GA and other tracking tools
  • Create a clean GA account just in case the worst happens

Let me know in the comments what you think of this method, and whether you have any other tips for keeping your site and its analytics GDPR compliant.

About the author
Sam Nemzer

Sam joined Distilled in 2015 as an analyst on the consulting team. He graduated from Oxford University in 2013 with a degree in Physics and worked in the NHS before pursuing a career in digital marketing. In his spare time, Sam enjoys taking part in...