Using PhantomJS to monitor Google Analytics

A simple change to the way Google Analytics is installed is (unfortunately) a common way to render your analytics data worthless. To make matters worse, if you aren’t aware of the setup change, you might falsely attribute any differences in reported traffic to something else, rather than the Google Analytics change.

In an ideal world we would have a ‘set it and forget it’ warning system that alerts you any time the analytics code on a page changes. I’m going to walk you through some code that can be used to extract which Google Analytics resources are being used.

Watching for the Google Analytics beacon to fire

Getting data about visitors to your website into Google Analytics is a two-part process. When a browser requests a page on your site, it also downloads a JavaScript file from Google. Depending on your setup, this script will make further requests to Google. It is these secondary requests that register a page view. We can watch for these requests to fire (the first request is often called a beacon) and use the URL being requested to infer what Google Analytics setup is in place.

Because this process uses JavaScript, a regular web crawler won’t help us here, since it doesn’t execute the JavaScript code. We need something that acts like a regular browser, executing JavaScript on our behalf. PhantomJS is a scriptable ‘headless browser’, and fits the bill perfectly.

To follow along, you’ll need to have PhantomJS installed. There are executables for Windows and Mac on the download page. If, like me, you are running in a Linux environment things are a little trickier (aren’t they always?). I found this guide helpful.

The Code

We’ll build the full code up slowly, starting by simply opening a page and outputting the final URL we end on. Download or copy the following code and run it from a command line. (If the command line is an alien concept, this is a great resource for learning it.)

Step 1 - Use PhantomJS to download a URL

phantomjs example1.js

If successful, the script should output two lines, showing that the URL we loaded redirected us to the crawlbin homepage.
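The original example1.js isn’t reproduced here, but a minimal sketch looks like the following (the crawlbin start URL is an assumption based on the surrounding text):

```javascript
// example1.js — sketch: open a URL and report where we ended up.
// The start URL is an assumption; crawlbin is a redirect-testing site.
var page = require('webpage').create();
var startUrl = 'http://crawlbin.com/';

page.open(startUrl, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + startUrl);
        phantom.exit(1);
    } else {
        console.log('Requested: ' + startUrl);
        console.log('Ended on:  ' + page.url); // final URL after any redirects
        phantom.exit(0);
    }
});
```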

Step 2 - Watching for resources being loaded

The goal of the script we are building is to notify us when a Google Analytics resource is loaded. To make things super easy for us, PhantomJS has a function that gets called whenever the page requests a resource. The function we want is aptly named onResourceRequested. We will make use of this function and simply output all the resources that are requested.

phantomjs example2.js
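A sketch of what example2.js might look like — the onResourceRequested callback is the real PhantomJS hook, while the target URL is again assumed:

```javascript
// example2.js — sketch: log every resource the page requests.
var page = require('webpage').create();

// PhantomJS calls this for every resource the page asks for.
page.onResourceRequested = function (requestData) {
    console.log('Requested: ' + requestData.url);
};

page.open('http://crawlbin.com/', function (status) {
    phantom.exit(status === 'success' ? 0 : 1);
});
```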

Step 3 - Filtering the resources to ones we care about

For a site like Crawlbin you’ll notice there aren’t many resources that get requested. However, there are plenty of sites that make hundreds of requests. Let’s make the output easier to follow by filtering the requests to only show ones that are relevant to Google Analytics.

Since crawlbin doesn’t have any analytics code installed, I’ve updated this script to use a site that does. In order to get this example to work you’ll have to add an additional command line parameter to the PhantomJS call. The default SSL protocol (SSLv3) doesn’t work, so we allow PhantomJS to choose an alternative.

The onResourceRequested function has been updated to only log requests from two specific Google domains. The .test function returns a boolean indicating whether the string (in our case the URL being requested) matches the regular expression.

phantomjs --ssl-protocol=any example3.js
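A sketch of example3.js. The two filtered domains and the target URL are assumptions: google-analytics.com and doubleclick.net are the domains a standard analytics.js install talks to, and the Distilled homepage is the analytics-enabled page used later in this article.

```javascript
// example3.js — sketch: only log requests to Google Analytics domains.
var page = require('webpage').create();

// Assumed domain list for a standard analytics.js install;
// adjust the pattern to match your own setup.
var gaPattern = /google-analytics\.com|doubleclick\.net/;

page.onResourceRequested = function (requestData) {
    // .test returns true when the requested URL matches the pattern.
    if (gaPattern.test(requestData.url)) {
        console.log(requestData.url);
    }
};

page.open('https://www.distilled.net/', function (status) {
    phantom.exit(status === 'success' ? 0 : 1);
});
```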

Step 4 - The final script

At this point we have a script that you can run against any URL to see what Google Analytics requests, if any, are being made. I’ve gone ahead and created a more robust version of the script that you can download and use. I’ve removed the hard-coded URL and allowed you to pass it in as a parameter.

phantomjs --ssl-protocol=any get_ga_resources.js
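The downloadable script isn’t shown in the article; a sketch of how get_ga_resources.js could read the URL from the command line follows (the domain pattern is the same assumption as before):

```javascript
// get_ga_resources.js — sketch: take the target URL as a parameter.
var system = require('system');

if (system.args.length < 2) {
    console.log('Usage: phantomjs --ssl-protocol=any get_ga_resources.js <url>');
    phantom.exit(1);
}

var page = require('webpage').create();
var url = system.args[1];

// Assumed Google Analytics domains; adjust for your setup.
var gaPattern = /google-analytics\.com|doubleclick\.net/;

page.onResourceRequested = function (requestData) {
    if (gaPattern.test(requestData.url)) {
        console.log(requestData.url);
    }
};

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
    }
    phantom.exit(status === 'success' ? 0 : 1);
});
```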

Making sense of the output

The output from this script is a list of URLs used by Google Analytics. On their own, the raw URLs aren’t of much use: they don’t immediately tell you which install is in use, and they change on each request. However, you can translate these URLs into the type of request being made, and keep track of that to ensure it never changes. For the moment, I’m leaving that as an exercise for the reader.

Google Tag Assistant

The easiest way I’ve found to understand the URLs is to use Google Tag Assistant. This is a Chrome extension written by Google to help you troubleshoot installations of their tools. If you do a lot of work across multiple different sites you’ll find the extension invaluable. It can help with Google Analytics, AdWords Conversion Tracking, Google Tag Manager and more.

If you load up the Distilled homepage, and run Tag Assistant you’ll see something like the following, showing that we have Tag Manager and Google Analytics installed.

Clicking on Google Analytics will show that our install makes two requests, a 'Pageview Request' and a 'Keep-alive' request. Click on any of those requests, and the 'URL' field shows the URL that was requested. You can then map the URLs returned from the script to a classification and watch for that classification changing.
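As a sketch of that mapping, a small classifier might look like the following. The URL patterns are assumptions based on the analytics.js 'collect' endpoint and should be cross-checked against what Tag Assistant reports for your own site:

```javascript
// Sketch: classify a logged URL by the kind of GA request it represents.
// Pattern assumptions: analytics.js sends hits to a '/collect' endpoint
// with a 't=' parameter naming the hit type.
function classifyGaRequest(url) {
    if (url.indexOf('/collect') !== -1) {
        if (url.indexOf('t=pageview') !== -1) { return 'Pageview Request'; }
        if (url.indexOf('t=event') !== -1) { return 'Event Request'; }
        return 'Other Hit';
    }
    if (url.indexOf('analytics.js') !== -1) { return 'Library Download (analytics.js)'; }
    if (url.indexOf('ga.js') !== -1) { return 'Library Download (ga.js)'; }
    return 'Unclassified';
}
```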

Alerting when the URL classification changes

Wherever possible, I believe you should be aiming for tools that allow you to 'set-it-and-forget-it'. Your time is too valuable to waste checking things that can be easily checked by a computer.

Once you have classified the URLs you want to be alerted if that classification ever changes. If we were to re-run the script on the Distilled homepage and get something other than a 'Pageview Request' and a 'Keep-alive' request then it is worth spending some time checking that we haven’t done something that could impact the validity of our Google Analytics data.

For example, if there were suddenly multiple snippets firing, we would see twice the number of URLs being requested. If the install reverted to an old version of Google Analytics (e.g. ga.js) we would see the ga.js URL instead of the analytics.js URL we see at the moment. Any other change to the Google Analytics setup will result in a different format of URL being requested, and can easily be picked up by monitoring these URLs.
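A minimal sketch of the comparison that would drive such an alert — the expected list reflects the Distilled install described above, and is an assumption for any other site:

```javascript
// Sketch: flag when the set of observed classifications differs from
// the expected set (order-insensitive, but counts matter — a doubled
// snippet shows up as a duplicate 'Pageview Request').
function classificationsChanged(expected, observed) {
    if (expected.length !== observed.length) { return true; }
    var sortedExpected = expected.slice().sort();
    var sortedObserved = observed.slice().sort();
    for (var i = 0; i < sortedExpected.length; i++) {
        if (sortedExpected[i] !== sortedObserved[i]) { return true; }
    }
    return false;
}
```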

Final thoughts

As more and more companies rely on data to make decisions, it is more important than ever that this data can be trusted. Putting in place simple passive monitoring to ensure data quality is becoming an increasingly important part of what we all need to be doing.

Losing data can have a big impact on the speed and accuracy of decision making, but far worse is to make decisions based on incorrect data without realising it.
