Using SEO Spider Data in Excel

We’ve been using Screaming Frog’s SEO Spider in the office for a few months now and, along with other spider software, it provides a great means of identifying technical and on-page issues quickly and easily.

My personal preference is to export the data to Excel where you can manipulate it however you like.  In addition, you can then email the spreadsheet to a client, development team etc. once you’ve highlighted the problem you’re trying to fix.

Here are some of the Excel features and formulas that I’ve been using to generate meaningful, actionable reports.

Sorting and Filtering

This is the first step that I always take once I’ve exported a crawl.  You can either hit Ctrl + T to turn the sheet into a table, or if you’d prefer just to Sort and Filter choose Data > Filter from the main menu.

Now you can pull out some really quick reports:

  • Filter the Content column to show only HTML pages (typically text/html; charset=utf-8).  You’ll want to apply this filter for the majority of your SEO work
  • Filter Status Codes to show 404 errors, 301 and 302 redirects.  This is an easy way to create a list of 404s that ought to be fixed/redirected
  • Filter Titles, Meta Descriptions, H1/H2 etc. to show blanks.  This gives you a list of pages where this information ought to be added
  • Sort Title Length or Meta Description Length in descending order so the longest appear at the top.  Any titles over 70 characters need shortening, and the same applies to meta descriptions over 155 characters
  • Sort Level in ascending order.  This should give you a rough idea of the hierarchy of your site.  Are there any key pages which are buried beyond level 3 or 4?
The list goes on – have a go yourself and you’ll see that there’s quite a lot you can do with sorting and filters.
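
If you’d rather flag these issues with a formula instead of relying on sorting alone, a helper column works well.  Here’s a quick sketch – it assumes Title 1 sits in column B and Title 1 Length in column AB, so adjust the references to match your own export.  The first formula flags missing titles, the second flags titles over the 70 character limit:

=IF(LEN(B3)=0,"Missing","OK")
=IF(AB3>70,"Too long","OK")

Copy each formula down the sheet and filter the helper column to pull out the pages that need attention.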

Conditional Formatting

Another feature that can be used to highlight errors is conditional formatting.  The first example I’ll use is flagging up titles with a length over the 70 character limit.

Simply select the cells in the Title Length column and go to Home > Conditional Formatting > New Rule.  Now select Format only cells that contain and under the rule description you should have Cell value greater than 70 as shown below.

Hit Format… and choose something that pops – in this case I’ve got bold white text over a red fill.  Once you’re done, click OK and you should now be able to visually scan your page title lengths.  Obviously the same can be done for Meta Description Length above 155.
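
If you’d rather write the rule yourself, choose Use a formula to determine which cells to format instead and enter something along these lines (this assumes your selection starts in row 3 and that Title 1 Length is in column AB – swap in your own column letter):

=$AB3>70

Excel applies the formatting wherever the formula evaluates to TRUE, which is also handy if you later decide to highlight the entire row rather than a single cell.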

You could also apply conditional formatting to columns such as Size, Inlinks, and Outlinks.  The built-in colour scales are a quick way to make the information in your spreadsheet easy to scan.  The example below uses Colour Scales to show pages with a high number of Outlinks.

Finding Duplicates

Screaming Frog’s SEO Spider generates a hash key for each page based on its content.  If two or more different URLs have the same hash key (and therefore the same content), your site most likely has duplicate content problems.

A quick formula is all that’s needed to find URLs where the hash key is duplicated.  Label the column to the right of the Hash column as Hash Duplicate and paste the following formula into the first cell:

=IF(COUNTIF($AA$3:$AA$10545,AA3)>1,"Yes","No")

Double click the bottom right corner of the cell to copy the formula down through the rest of the spreadsheet.  You should now be able to filter this column to show only cells containing Yes.  Order the Hash column in ascending order to group the duplicate hash keys together.

A quick explanation of the formula: the countif() function counts how many times the hash value appears in the whole column.  If the count is greater than one, we know there are duplicates, so the if() function outputs “Yes”.

The same technique can be applied to the Title and Meta Description columns to identify duplicates.
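
For example, assuming Title 1 sits in column B and Meta Description 1 in column E (check the letters against your own export), the equivalent formulas would be:

=IF(COUNTIF($B$3:$B$10545,B3)>1,"Yes","No")
=IF(COUNTIF($E$3:$E$10545,E3)>1,"Yes","No")

As before, filter on Yes and sort the column in question to group the duplicates together.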

Finding URLs With Query Strings

If you’re not familiar with SEO friendly URLs, I’d recommend having a quick read of this page over on SEOmoz.  One of the most common characteristics of poor URLs is the use of query strings such as www.example.com/?category=example. You can easily find all of the URLs on your site which contain query strings if you use the following formula:

=IF(IFERROR(FIND("?",A3),"No")="No","No","Yes")

Again, create a column to the right of the URL column and call it Query String.  Paste in the formula above and drag it down to the bottom of the sheet.  You should now be able to filter to show Yes, giving you a list of URLs which you can improve using URL rewrites.

If you’re interested, the formula above uses the find() function to look for a ? in the URL.  If there isn’t one, a #VALUE! error is returned so a combination of iferror() and if() is used to output a yes/no answer.
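
If you find the nested if() a little hard to read, an alternative that does the same job is to wrap find() in isnumber(), which returns TRUE when the ? is found and FALSE otherwise (using the same A3 reference as above):

=IF(ISNUMBER(FIND("?",A3)),"Yes","No")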

Text to Columns

You can easily reveal the structure of a site by using Text to Columns on the list of URLs found by the spider tool.  Here’s the process:
  • Copy the list of URLs and paste them into a new sheet
  • Sort the list alphabetically
  • Go to Data > Text to Columns
  • Make sure Delimited is selected, then click Next
  • In Delimiters, click Other and enter /
  • Hit Finish
  • You can get rid of the first 2 columns (they’ll only contain the protocol and the blank column left by the double slash).
Hit Ctrl + T to turn the sheet into a table and you can now sort and filter.  For example, if your blog post URLs contain a date, you could filter to show only pages in /blog/ and order by the date to show how often you’ve been publishing content.

You could also create a chart from this information.  Take the first column after your domain for example – copy and paste it into a new column and click Remove Duplicates on the Data tab.  Now run a countif() for each row against the original column, which should look something like this:

=COUNTIF(B:B,J3)

This gives you a count of the number of pages in each top level subdirectory on the site, from which you can create a pie chart, for example.
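
If you want the chart to show each section’s share of the crawl rather than a raw count, divide by the total number of URLs – something like this, again assuming the full list of first-level directories is in column B and your de-duplicated list starts at J3 (knock one off the COUNTA if column B has a header row):

=COUNTIF(B:B,J3)/COUNTA(B:B)

Format the cells as percentages and your chart labels are ready-made.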

You could use this as a comparison against your competitors – are they more prolific bloggers for example?  Or do they have a faceted navigation that generates lots of long tail pages?

The techniques outlined above are just a snapshot of what is possible using these two tools.  If you have any further ideas about how you might manipulate crawl data, I’d love to hear about them in the comments.