A simple change to the way Google Analytics is installed is (unfortunately) a common way to render your analytics data worthless. To make matters worse, if you aren’t aware of the setup change, you might falsely attribute any differences in reported traffic to something else, rather than the Google Analytics change.
In an ideal world we would have a ‘set it and forget it’ warning system that would alert you any time the analytics code on a page changes. I’m going to walk you through some code that can be used to detect which Google Analytics resources a page is loading.
Watching for the Google Analytics beacon to fire
To follow along, you’ll need to have PhantomJS installed. There are executables for Windows and Mac on the download page. If, like me, you are running in a Linux environment, things are a little trickier (aren’t they always?). I found this guide helpful.
We’ll build the full code up slowly, starting by simply opening a page and outputting the final URL we end on. Download or copy the following code and run it from a command line. (If the command line is an alien concept, this is a great resource to help you learn the command line.)
Step 1 - Use PhantomJS to download a URL
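A minimal version of this first script might look like the following. It uses PhantomJS’s webpage module to open a URL and report where we end up; the specific crawlbin path is a placeholder - any URL that redirects will demonstrate the same behaviour.

```javascript
// example1.js - run with: phantomjs example1.js
// Opens a crawlbin test URL that redirects, then prints the final URL.
var page = require('webpage').create();

// Placeholder test URL - any redirecting URL will do.
page.open('http://www.crawlbin.com/301_to/index/', function (status) {
    console.log('Load status: ' + status);
    console.log('Final URL: ' + page.url);
    phantom.exit();
});
```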
If successful, the script should output two lines, showing that we loaded a URL on crawlbin.com which redirected us to the crawlbin homepage.
Step 2 - Watching for resources being loaded
The goal of the script we are building is to notify us when a Google Analytics resource is loaded. To make things super easy for us, PhantomJS has a function that gets called whenever the page requests a resource. The function we want is aptly named onResourceRequested. We will make use of this function and simply output all the resources that are requested.
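A sketch of that version of the script - the only change from Step 1 is assigning a handler to onResourceRequested before opening the page:

```javascript
// example2.js - run with: phantomjs example2.js
// Logs every resource the page requests as it loads.
var page = require('webpage').create();

// Called by PhantomJS each time the page requests a resource.
page.onResourceRequested = function (requestData, networkRequest) {
    console.log('Requested: ' + requestData.url);
};

page.open('http://www.crawlbin.com/', function (status) {
    phantom.exit();
});
```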
Step 3 - Filtering the resources to ones we care about
For a site like Crawlbin you’ll notice there aren’t many resources that get requested. However, there are plenty of sites that make hundreds of requests. Let’s make the output easier to follow by filtering the requests to only show the ones that are relevant to Google Analytics.
Since Crawlbin doesn’t have any analytics code installed, I’ve updated this script to use distilled.net. In order to get this example to work you’ll have to add an additional command line parameter to the PhantomJS call. The default SSL protocol (SSLv3) doesn’t work, so we allow PhantomJS to choose an alternative.
The onResourceRequested function has been updated to only log requests from two specific domains. The .test function returns a boolean indicating whether the string (in our case the URL we are requesting) matches the regular expression. I’ve filtered the requests to just two domains - google-analytics.com and stats.g.doubleclick.net.
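A sketch of the filtered version, with the two domains combined into a single regular expression:

```javascript
// example3.js - run with: phantomjs --ssl-protocol=any example3.js
// Logs only requests made to Google Analytics domains.
var page = require('webpage').create();

// Matches the two domains Google Analytics requests go to.
var gaRegex = /google-analytics\.com|stats\.g\.doubleclick\.net/;

page.onResourceRequested = function (requestData, networkRequest) {
    // .test returns true if the requested URL matches either domain.
    if (gaRegex.test(requestData.url)) {
        console.log('GA request: ' + requestData.url);
    }
};

page.open('https://www.distilled.net/', function (status) {
    phantom.exit();
});
```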
phantomjs --ssl-protocol=any example3.js
Step 4 - The final script
At this point we have a script that you can run against any URL to see what, if any, Google Analytics requests are being made. I’ve gone ahead and created a more robust version of the script that you can download and use. I’ve removed the hard-coded URL and allowed you to pass it in as a parameter.
phantomjs --ssl-protocol=any get_ga_resources.js http://www.yoururl.com
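If you’d rather build it yourself, a sketch of what such a script might look like is below - it reads the URL from the command line via PhantomJS’s system module and reuses the filter from Step 3:

```javascript
// get_ga_resources.js - sketch of the parameterised version.
// Usage: phantomjs --ssl-protocol=any get_ga_resources.js http://www.yoururl.com
var system = require('system');
var page = require('webpage').create();

// system.args[0] is the script name; args[1] is the first parameter.
if (system.args.length < 2) {
    console.log('Usage: phantomjs --ssl-protocol=any get_ga_resources.js <url>');
    phantom.exit(1);
}

var url = system.args[1];
var gaRegex = /google-analytics\.com|stats\.g\.doubleclick\.net/;

page.onResourceRequested = function (requestData, networkRequest) {
    if (gaRegex.test(requestData.url)) {
        console.log(requestData.url);
    }
};

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
    }
    phantom.exit();
});
```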
Making sense of the output
The output from this script is a list of URLs used by Google Analytics. On their own, the raw URLs aren’t much use: they don’t immediately tell you which install is in use, and they change on each request. However, you can translate these URLs into the type of request being made, and keep track of that to ensure it never changes. For the moment, I’m leaving that as an exercise for the reader.
Google Tag Assistant
The easiest way I’ve found to understand the URLs is to use Google Tag Assistant. This is a Chrome extension written by Google to help you troubleshoot installations of their tools. If you do a lot of work across multiple sites you’ll find the extension invaluable. It can help with Google Analytics, AdWords Conversion Tracking, Google Tag Manager and more.
If you load up the Distilled homepage, and run Tag Assistant you’ll see something like the following, showing that we have Tag Manager and Google Analytics installed.
Clicking on Google Analytics will show that our install makes two requests, a 'Pageview Request' and a 'Keep-alive' request. Clicking on any of those requests and expanding 'URL' shows the URL that was requested. You can then map the URLs being returned from the script to a classification and watch for that classification changing.
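To illustrate what that mapping might look like, here is a hypothetical classifier in plain JavaScript. The function name and the URL patterns are my own assumptions, based on what Tag Assistant reports for a standard analytics.js install - you should adjust them to match the requests you actually see on your site:

```javascript
// Hypothetical helper: map a raw Google Analytics URL to a request type.
// The patterns are assumptions for a standard analytics.js install.
function classifyGaUrl(url) {
    if (/google-analytics\.com\/analytics\.js/.test(url)) {
        return 'Library (analytics.js)';
    }
    if (/google-analytics\.com\/ga\.js/.test(url)) {
        return 'Library (legacy ga.js)';
    }
    // Hit beacons go to /collect (or /r/collect); t=pageview marks a pageview.
    if (/google-analytics\.com\/(r\/)?collect/.test(url) && /[?&]t=pageview/.test(url)) {
        return 'Pageview Request';
    }
    if (/stats\.g\.doubleclick\.net/.test(url)) {
        return 'Keep-alive';
    }
    return 'Unknown - investigate';
}
```

For example, classifyGaUrl('https://www.google-analytics.com/collect?v=1&t=pageview&tid=UA-1') would return 'Pageview Request'.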
Alerting when the URL classification changes
Wherever possible, I believe you should be aiming for tools that allow you to 'set-it-and-forget-it'. Your time is too valuable to waste checking things that can be easily checked by a computer.
Once you have classified the URLs you want to be alerted if that classification ever changes. If we were to re-run the script on the Distilled homepage and get something other than a 'Pageview Request' and a 'Keep-alive' request then it is worth spending some time checking that we haven’t done something that could impact the validity of our Google Analytics data.
For example, if there were suddenly multiple snippets firing we would see twice the number of URLs being requested. If the install reverted to an old version of Google Analytics (e.g. ga.js) we would see the ga.js URL instead of the analytics.js URL we see at the moment. Any other change to the Google Analytics setup will result in a different format of URL being passed, and can be easily picked up by monitoring these URLs.
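As a sketch of that monitoring step (the function name and alert format here are my own - hook the result into whatever alerting you already use):

```javascript
// Hypothetical check: compare the classifications from the latest run
// against the baseline you recorded, and flag any difference.
function checkClassifications(expected, actual) {
    // Sort copies so ordering differences don't trigger false alerts.
    var a = expected.slice().sort().join('|');
    var b = actual.slice().sort().join('|');
    if (a !== b) {
        return 'ALERT: expected [' + expected.join(', ') + '] but saw [' + actual.join(', ') + ']';
    }
    return 'OK';
}
```

A duplicated snippet, for instance, would show up as an extra 'Pageview Request' in the actual list and trigger the alert.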
As more and more companies rely on data to make decisions, it is more important than ever that this data can be trusted. Putting in place simple passive monitoring to ensure data quality is becoming an increasingly important part of what we all need to be doing.
Losing data can have a big impact on the speed and accuracy of decision making, but far worse is to make decisions based on incorrect data without realising it.