In case you haven’t used it before, Screaming Frog is a web crawler you can use to crawl a website the way a search engine does. As it crawls, it spits out pretty much every piece of SEO-related data you could think of for each page.
One of the great things about Screaming Frog is that there are regular updates, which consistently have great new features. In fact, one of my colleagues has commented (with a straight face) that it’s the only program for which he actually downloads the updates straight away and is excited to do so.
In the last year, three major updates have been released, which included a glut of new features that we’ve been playing with here at Distilled. In this post I’ll outline a few of these new features, and give you some examples of how they’ll make your life easier.
Custom extraction
Screaming Frog has had a custom search feature for a long time, which allows you to filter for pages that contain a specific string. This has many applications, such as checking that Google Analytics tracking code is consistent across a site, or looking for pages which have social sharing buttons.
In the July version 4.0 update, Screaming Frog expanded the custom search functionality by adding a custom extraction option. This allows you not only to find pages which contain a string, but also to extract specific parts of the HTML on the page.
This feature is incredibly powerful, as it allows you to easily generate a report of almost any element that is present on a page, as the below examples outline.
There are three ways you can select the element to extract: CSSPath, XPath and regex match. I have found that CSSPath is normally the best way to consistently scrape the correct element from all pages, but it really depends on the structure of the pages on the site, and how consistent they are. Regex matching can be very useful if you’re looking at something on pages from various sites which don’t share a consistent page structure, as I’ll show you in the second example below.
To find the CSSPath (or the XPath) of an element on a page, there’s a very quick and easy way to do so in Chrome. Simply right-click on the element you want to scrape, click Inspect Element, then right-click on the snippet of HTML in the DevTools window and copy either the XPath or the CSS path. This can then be pasted into Screaming Frog.
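To make the two formats a bit more concrete, here’s a minimal sketch of what a copied CSS path and XPath might look like, checked with Python’s lxml library. The HTML, class names and selectors are all made up for illustration; your real selectors will come straight out of DevTools.

```python
# Sketch: what a copied CSS path and XPath typically look like, verified with lxml.
# The HTML, class names and selectors below are made up for illustration only.
from lxml import html  # requires lxml (and the cssselect package for .cssselect())

doc = html.fromstring("""
<html><body>
  <div class="author-profile">
    <span class="results-count">About 312 results</span>
  </div>
</body></html>
""")

css_path = "div.author-profile > span.results-count"   # roughly what 'Copy CSS path' gives you
xpath = "//div[@class='author-profile']/span"           # roughly what 'Copy XPath' gives you

print(doc.cssselect(css_path)[0].text_content())   # -> About 312 results
print(doc.xpath(xpath)[0].text_content())          # -> About 312 results
```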
Here are a few examples of this feature in action:
Number of items in paginated list
You may want to find out how many items there are in a list. For example, you might want to know how many articles each author on a news site has written. To do that, I would use the following process:
1. Find an example page of the type that you want to scrape. In this instance I am looking to see how many articles a sample of Guardian journalists have written. On each author’s profile page, the total number of results is reported. Right-click on this number and go to Inspect Element.
2. In the DevTools window, find the HTML to be scraped, right-click, and copy the CSS path.
3. Open the custom extraction window in Screaming Frog.
4. Paste the CSSpath into the extraction window.
5. Crawl your list of sample URLs, and see the text from your selected HTML element reported in the Custom tab, with Extraction selected in the Filter dropdown. You will note that in this example some of the pages didn’t return anything in the extraction filter. This is because in those cases the authors only had one page of results, and the element we were scraping does not exist. This is not completely fruitless, as it still tells us that the author has written fewer than the 25 articles required to push the list onto a second page.
Once you have this extraction list, you can export it and manipulate it in, for example, Excel to pull out the numbers you need.
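If you’d rather skip the Excel step, a few lines of Python will do the same clean-up. This is only a sketch: the file name, the column headers and the “About 312 results” text format are assumptions, so adjust them to match your actual export.

```python
# Sketch: pull the article counts out of a Screaming Frog custom extraction export.
# 'custom_extraction.csv', the column names and the text format are assumptions --
# check them against your own export before running.
import csv
import re

counts = {}
with open("custom_extraction.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        text = row.get("Extraction 1", "")                 # e.g. "About 312 results"
        match = re.search(r"(\d[\d,]*)\s+results", text)
        if match:
            counts[row["Address"]] = int(match.group(1).replace(",", ""))
        else:
            counts[row["Address"]] = None                  # element missing: single page of results

for url, count in sorted(counts.items(), key=lambda item: item[1] or 0, reverse=True):
    print(url, count)
```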
Backlinks
An example of this functionality using regex matching instead of CSSPath or XPath is when performing a backlink audit. It may often be the case that you have a list of backlinks to your site, but you don’t know exactly which page each one points to. In this case you can use a regex match to extract the linked URL.
1. Create a regex pattern that will match any link on any page to your domain, and set it up in the custom extraction window. In this example, the regex pattern for distilled.net is:
(distilled\.net(.){0,50})
This pattern will match anything in the HTML that begins with distilled.net and report up to the next 50 characters. This method is a bit messy, and will require some cleanup after the crawl (there’s a sketch of one way to do that after this list).
2. Crawl your list of backlinks in list mode.
3. Export the Custom tab, with the extraction filter, and use Excel’s Text to Columns with both ‘ and “ as delimiters to isolate the URLs. You now have a clean list of the pages on your site that are being linked to!
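Alternatively, you can do that cleanup outside of Excel. Here’s a rough sketch in Python that applies the same regex to a backlink page’s raw HTML and trims each match at the first quote mark; the linking page URL is a made-up example.

```python
# Sketch: apply the same regex as above to a linking page's raw HTML, then trim
# the trailing junk at the first quote mark. The backlink URL is a made-up example.
import re
import urllib.request

LINK_PATTERN = re.compile(r"(distilled\.net(.){0,50})")

backlink_page = "http://example.com/some-page-that-links-to-us"   # hypothetical
html_source = urllib.request.urlopen(backlink_page).read().decode("utf-8", errors="ignore")

for match in LINK_PATTERN.finditer(html_source):
    # The capture runs past the end of the URL, so cut it off at the first quote.
    linked_url = re.split(r'["\']', match.group(1))[0]
    print(linked_url)
```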
Pull GA data directly into crawl
This feature was added when version 4.0 was released in July. It enables you to pull Google Analytics data directly into a Screaming Frog crawl using the API. This saves you the hassle of setting up API calls in a script or Google Sheets and then cross-referencing the results with Screaming Frog data.
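For a sense of what that saves you, here’s a rough sketch of the equivalent manual call against the Core Reporting API in Python. The view ID, date range and credentials file are placeholders, and the OAuth setup is glossed over.

```python
# Sketch of a manual Core Reporting API (v3) call -- roughly what Screaming Frog
# now handles for you. The view ID, dates and key file are placeholders.
from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

credentials = ServiceAccountCredentials.from_json_keyfile_name(
    "service-account.json",                      # placeholder key file
    ["https://www.googleapis.com/auth/analytics.readonly"],
)
analytics = build("analytics", "v3", credentials=credentials)

response = analytics.data().ga().get(
    ids="ga:12345678",                           # placeholder view ID
    start_date="2015-01-01",
    end_date="2015-09-30",
    metrics="ga:sessions,ga:bounceRate",
    dimensions="ga:pagePath",
    max_results=10000,
).execute()

for page_path, sessions, bounce_rate in response.get("rows", []):
    print(page_path, sessions, bounce_rate)
```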
In order to enable the GA API in Screaming Frog, go to Configuration -> API Access -> Google Analytics. Then click ‘Connect to New Account’ and log in via the browser window that opens. You will then be able to choose any Accounts, Properties and Views that you have access to.
You can pick up to 30 metrics to display, and choose whether each crawled URL is matched as the landing page of a session, or just as a page which was visited at some point in the session.
There’s nothing groundbreaking here - you can access all of this data through the Google Analytics interface, or using the Google Sheets add-on - but it can make things a lot simpler. I’ll now show you an example of how to use this feature:
Pageviews
You might want to crawl your site and, for a specific segment of your audience, find the pages that appear in the most sessions. You can do this with the following steps:
1. Before you start crawling, connect your GA account and select the view you’re interested in, and the segment.
2. Deselect all metrics except the two defaults (sessions and bounce rate), keep the dimension ‘page path’ for each one, and set the date range you’re interested in.
3. Then simply run a crawl as you usually would. The API calls are made concurrently with the crawl, so it shouldn’t run noticeably slower than a normal crawl.
You can use this feature to look at any metric you like, including conversion rates to find the best-converting landing pages, or any other (non-custom) dimension you can think of.
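Once the crawl is exported, sorting by a GA metric is trivial. Here’s a small sketch with pandas; the file name and the column headers (‘Address’, ‘GA Sessions’) are assumptions, so check them against your own export.

```python
# Sketch: find the top pages by sessions from an exported Screaming Frog crawl.
# 'internal_all.csv' and the column names are assumptions -- check your export headers.
import pandas as pd

crawl = pd.read_csv("internal_all.csv")
top_pages = (
    crawl[["Address", "GA Sessions"]]
    .dropna()
    .sort_values("GA Sessions", ascending=False)
    .head(20)
)
print(top_pages.to_string(index=False))
```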
Pull search analytics data into crawl
Similarly to the GA API, you can now (as of version 5.0) connect to the Google Search Console API, which allows you to pull search analytics data into the crawl. The data is what you’ll find in Search Console’s Search Analytics report, such as the number of clicks, impressions, and the average position at which the page appears in search results.
In the configuration drop-down, you can select which Search Console account you’d like to pull data from, and other parameters, such as the date range over which you’d like to see data.
The benefit of using Screaming Frog to pull in this data is that going through the Search Console interface and entering URLs one by one to see the data on each is laborious. Screaming Frog makes this much simpler, especially as there is no Google Analytics-style Sheets integration for Search Console.
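If you did want to script it yourself, a manual query against the Search Analytics API looks roughly like this. The site URL, dates and credentials file are placeholders, and the auth setup mirrors the GA example above.

```python
# Sketch of a manual Search Analytics query via the Webmasters API (v3).
# The site URL, dates and key file are placeholders.
from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

credentials = ServiceAccountCredentials.from_json_keyfile_name(
    "service-account.json",                      # placeholder key file
    ["https://www.googleapis.com/auth/webmasters.readonly"],
)
webmasters = build("webmasters", "v3", credentials=credentials)

response = webmasters.searchanalytics().query(
    siteUrl="http://www.example.com/",           # placeholder property
    body={
        "startDate": "2015-06-01",
        "endDate": "2015-09-30",
        "dimensions": ["page"],
        "rowLimit": 5000,
    },
).execute()

for row in response.get("rows", []):
    page = row["keys"][0]
    print(page, row["clicks"], row["impressions"], row["position"])
```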
Checking indexation status
This feature can be used to flag up if Google is indexing something you don’t want to be indexed (e.g. if it is canonicalised to a different page or has a noindex tag). It isn’t a perfect method by any means - the downside is that it could very easily miss pages if they’re indexed but happen not to have received any impressions or clicks.
The method is as follows:
1. Connect to the Search Console API. Select your date range. For this purpose, it should be as long as possible within the time period that the pages should not be indexed.
2. Upload a list of pages which should not be indexed.
3. Run the crawl on those pages, and check the Search Console tab to see if any of those pages received impressions or clicks. In this case, none did, which means that these pages are not being shown to searchers.
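If the list of pages is long, you can do the same check programmatically on the exported Search Console tab. The file name and column names here are assumptions; check them against your own export.

```python
# Sketch: flag any supposedly-unindexed pages that picked up impressions or clicks.
# 'search_console_all.csv' and the column names are assumptions -- check your export.
import pandas as pd

sc = pd.read_csv("search_console_all.csv")
flagged = sc[(sc["Clicks"].fillna(0) > 0) | (sc["Impressions"].fillna(0) > 0)]

if flagged.empty:
    print("No pages on the list are being shown to searchers.")
else:
    print(flagged[["Address", "Clicks", "Impressions"]].to_string(index=False))
```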
Robots.txt checker
This feature was introduced in version 5.0, which was released in September of this year. It allows you to check whether pages in a crawl or a list are blocked by a site’s robots.txt file. Previously, any URLs blocked by robots.txt were hidden, unless you chose to ignore robots.txt directives, in which case you wouldn’t know at all whether pages were blocked or not.
Now any URLs which are discovered by the spider appear in the crawl with the status ‘blocked by robots.txt’. This appears in the Internal and Response Codes tabs, and makes it a lot easier to see which pages are being blocked. If, in the Response Codes tab, you filter down to just the URLs blocked by robots.txt, you can also see which line of the robots.txt file is responsible for each page being blocked - a great way to debug your robots.txt and check which rule is causing the problem.
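As a quick cross-check outside of Screaming Frog, Python’s standard library can tell you whether a given URL is blocked (although, unlike Screaming Frog, it won’t tell you which line of the file is responsible). The domain and URLs here are placeholders.

```python
# Sketch: check a handful of URLs against a site's robots.txt using the standard
# library. The domain and the URL list are placeholders.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("http://www.example.com/robots.txt")
parser.read()

urls_to_check = [
    "http://www.example.com/blog/",
    "http://www.example.com/admin/login",
]

for url in urls_to_check:
    status = "allowed" if parser.can_fetch("*", url) else "blocked by robots.txt"
    print(url, "-", status)
```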
A practical implementation of this feature is below:
XML sitemap check
When uploading an XML sitemap, it is wise to test it using Screaming Frog. Search Console allows you to test a sitemap for errors, but only live sitemaps that you’ve already uploaded to your site. Screaming Frog is the only way that I’m aware of to check sitemaps at scale before you put them on your site.
Screaming Frog’s list mode has allowed you to upload XML sitemaps for a while, and check for many of the basic requirements of URLs within sitemaps. For example, the Directives report tells you if a page is noindexed by meta robots, and the Response Codes report will tell you if the URLs are returning 3XX or 4XX codes.
The new capability to check if pages are blocked by a site’s robots.txt allows you to check one more thing that would cause sitemap errors.
You can do this as follows (this example is with a completely imaginary sitemap, to demonstrate the concept):
1. In list mode, upload the XML sitemap to be crawled.
2. In the response codes tab, filter by Blocked by Robots.txt. You’ll now see all of the URLs in the sitemap that can’t be accessed due to the robots.txt file.
This can then be exported as a CSV/XLSX file to be manipulated and investigated further.
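For completeness, here’s a rough Python equivalent of that whole check: parse the sitemap, then run each URL through a robots.txt parser. As with the example above, the sitemap and domain are completely imaginary.

```python
# Sketch: parse an XML sitemap and flag any URLs blocked by robots.txt --
# a rough equivalent of the Screaming Frog check above. The URLs are imaginary.
import urllib.request
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_URL = "http://www.example.com/sitemap.xml"      # imaginary sitemap
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

parser = RobotFileParser()
parser.set_url("http://www.example.com/robots.txt")
parser.read()

sitemap_xml = urllib.request.urlopen(SITEMAP_URL).read()
urls = [loc.text for loc in ET.fromstring(sitemap_xml).findall(".//sm:loc", NS)]

blocked = [url for url in urls if not parser.can_fetch("*", url)]
for url in blocked:
    print("Blocked by robots.txt:", url)
```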
Find Out More
If you want to find out more about what you can use Screaming Frog for, the guide over at Seer Interactive is extensive (a wild understatement) and has a lot of great tips. If you have any other ways of utilising these new features or any other hacks or tips, go ahead and leave a comment.