SEO is one of those industries that makes you want the latest and greatest tools. I think this is partly down to SEO still being a bit of a black box, and partly down to people looking for the magic bullet that will rocket them to the top of the SERPs.
While the search engines are always changing, it's fair to say that for a long time the basics have more or less stayed the same. With that in mind, I thought I would do a post on how to do a site audit using only the basics: Google Webmaster Tools and Google search.
There are occasions when you need the extra tools with all the bells and whistles, but to be honest, using just the tools that Google provide you with for free is an excellent starting point and should cover most of the broad strokes of a comprehensive SEO site audit.
Crawling & Indexation
It goes without saying that ensuring Google can read and index your content is crucial.
Webmaster Tools makes this really easy. Just go to Sitemaps under the Site configuration tab, which will bring up any sitemaps you have submitted for the site. If you don't see any, look on the right-hand side where it says "Show submissions: By me – All"; the sitemap could be under the All section, as shown in the image to the right:
Once you find the sitemap you will be able to see whether there are any indexation problems. Google show the number of "URLs submitted" compared to "URLs in web index". In an ideal world these would be the same, but quite often they won't be. The "URLs in web index" figure can only ever be equal to or smaller than "URLs submitted", as Google is only checking for the pages in the sitemap, not every page on the site.
Quite often, the reason the number is lower is that the missing pages are blocked by robots.txt. To check this, go to the Diagnostics section and click Crawl errors. If this is the case, there should be a tab called "In Sitemaps", and within that a list of URLs which are in the sitemap but being blocked by robots.txt.
Another reason this could happen is duplicate URLs in the sitemap. For example, if you have a sitemap with 500 URLs but only 250 of them are unique, "URLs in web index" will only ever show 250, not 500.
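As a quick sanity check, you can count total versus unique URLs in a sitemap file yourself. Here is a rough sketch using Python's standard library; the sitemap content below is made up for illustration:

```python
# Sketch: count total vs unique <loc> entries in a sitemap.
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_url_counts(xml_text):
    """Return (total, unique) counts of <loc> entries in a sitemap."""
    root = ET.fromstring(xml_text)
    locs = [loc.text.strip() for loc in root.iter(NS + "loc")]
    return len(locs), len(set(locs))

# Invented example sitemap with one duplicate URL.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
  <url><loc>http://example.com/</loc></url>
</urlset>"""

total, unique = sitemap_url_counts(sitemap)
print(total, unique)  # 3 submitted, only 2 unique
```

If the two numbers differ, you know before even opening Webmaster Tools that the "URLs in web index" figure will come up short.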
If neither of these is the reason, it's time to take a deeper dive into the structure and navigation to identify potential problems.
As mentioned, the second part of getting content indexed is making sure the content can actually be seen by the search engines. There are various ways to check this, but personally I prefer to go to Google, look at the text-only cached version and compare it to the version the user can see.
If there are discrepancies, look for common culprits like content being served in iframes or in some form of code that Google finds hard to read.
To find the cache of any page in the Google index, simply go to Google and type "cache:" before the URL. So for the Distilled home page it would be "cache:www.distilled.net":
That will bring up the view seen in the image below, you then simply hit the button that says “Text only version”. This ensures that you see the content the same way Google do.
Robots.txt can cause all sorts of problems if not used correctly. Webmaster Tools makes it really easy to see exactly which pages are being blocked. Normally a robots.txt file is easy enough to read, but you may want to get familiar with regex-style wildcards, as some files include them.
To be sure you're not missing anything, head over to the Diagnostics section again and go to Crawl errors. In there you will see a tab called "Restricted by robots.txt", as shown below:
Have a look through the list of URLs; it should be really easy to spot pages that shouldn't be blocked from the search engines. This is much more reliable than reading the rules yourself, where mistakes are easily made.
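If you do want to double-check the rules programmatically, Python's built-in robots.txt parser can test specific URLs against a set of rules. The rules and URLs below are invented for illustration:

```python
# Sketch: test which URLs a robots.txt blocks, instead of reading
# the rules by hand. Rules and URLs are made-up examples.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /admin/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

for url in ["http://example.com/products/widget",
            "http://example.com/admin/login",
            "http://example.com/search?q=widgets"]:
    allowed = parser.can_fetch("Googlebot", url)
    print(url, "allowed" if allowed else "BLOCKED")
```

Feeding it the pages you actually care about is a quick way to catch a rule that blocks more than intended.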
Response Headers/Crawl Errors
This is one of the few areas where I would recommend using a tool to crawl your entire site. I like to use something like Screaming Frog, which has a free version, although for larger sites you would need the paid version. Even without such a tool, you can get lots of data from Webmaster Tools about the problems Google has found crawling your site.
Head to the Diagnostics section and click Crawl errors. Hopefully there won't be much to see in here, but more often than not there will be a bunch of common problems like 404s, HTTP errors and so on. This section is great for picking up quick SEO wins. Often you can find external links that point to the wrong URL; in that case you can simply contact the person and ask them to update the link, or, if all else fails and you can't get the link updated, you could always 301 that page. Some of the things you can expect to see on this page are:
- Not found
- URLs not followed
- URLs restricted by robots.txt
- URLs timed out
- HTTP errors
- URL unreachable
- Soft 404s
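If you export the crawl errors, a few lines of Python can summarise them by type so you know where the quick wins are. The CSV layout below is invented for the sketch, not Google's exact export format:

```python
# Sketch: summarise a crawl-errors export by error type.
# The CSV columns and data here are assumed examples.
import csv
import collections
import io

export = io.StringIO(
    "URL,Error\n"
    "http://example.com/old-page,Not found\n"
    "http://example.com/temp,Not found\n"
    "http://example.com/api,HTTP errors\n"
)

counts = collections.Counter(row["Error"] for row in csv.DictReader(export))
print(counts.most_common())  # [('Not found', 2), ('HTTP errors', 1)]
```

For a real export you would replace the `io.StringIO` with `open("crawl_errors.csv")` and adjust the column names to match the file.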
For more details on how to deal with these see these guidelines from Google.
Canonicalisation problems happen when the exact same content can be accessed at different URLs. This usually happens because of the default settings on web servers. Common examples of how this can happen are shown below:
- http://mydomain.com
- http://mydomain.com/index.html
- https://mydomain.com/index.html
All of the above could potentially return the same page and be linked to from external pages. It goes without saying that this isn't ideal, because it dilutes the links and the potential ranking ability of that page. Google often try to work out when you have problems like this, and sometimes they can, but it's best to be safe and make it as clear for them as possible, as these issues can cause your pages to compete with each other unnecessarily.
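As a rough illustration, the variants above can be collapsed with a small script. The normalisation rules here (force http, force www, treat index.html as the root) are just examples; the right rules depend on the site:

```python
# Sketch: collapse common canonicalisation variants of a URL so
# duplicates can be spotted. Rules and hostnames are examples only.
from urllib.parse import urlsplit, urlunsplit

def canonicalise(url):
    scheme, netloc, path, query, frag = urlsplit(url)
    netloc = netloc.lower()
    if not netloc.startswith("www."):
        netloc = "www." + netloc          # example rule: prefer www
    if path in ("", "/index.html"):
        path = "/"                        # example rule: index.html is the root
    return urlunsplit(("http", netloc, path, query, ""))

variants = ["http://mydomain.com",
            "http://mydomain.com/index.html",
            "https://mydomain.com/index.html"]
print({canonicalise(u) for u in variants})  # all collapse to one URL
```

Running your site's URLs through something like this makes it obvious when several addresses are really the same page.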
Identifying these problems is the same process as finding duplicate content, so let's cover that as well.
The process for finding duplicate content is simple but effective. Pick a few of your most important pages and copy the most unique sentence you can find on each. Head over to Google and do a search for that exact sentence inside quotes.
Hopefully the only result will be the page you took the content from. If, however, you see more than one result, you potentially have duplicate content issues. Take a close look at the URLs and try to identify whether the duplication is being caused by one of the canonicalisation issues mentioned earlier.
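If you want to script the first step, here is a rough sketch that picks the longest sentence from a page's copy (a crude proxy for "most unique") and wraps it in quotes, ready to paste into Google. The sample copy is invented:

```python
# Sketch: build the exact-match query for the duplicate-content check
# by picking the longest sentence in the page copy. Sample text is
# invented; "longest" is only a rough stand-in for "most unique".

def exact_match_query(text):
    sentences = [s.strip() for s in text.split(".") if len(s.strip()) > 20]
    longest = max(sentences, key=len)
    return '"%s"' % longest

copy = ("We sell widgets. Our hand-finished blue widgets are machined "
        "from a single billet of aerospace-grade aluminium. Buy now.")
print(exact_match_query(copy))
```

Long, specific sentences make the best queries; short generic ones ("We sell widgets") will match half the web.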
If you scroll to the bottom of the search results you will also often see the following message:
At this stage you have to make a decision on how to deal with duplicate content. Assuming the content is on your site or other sites that you have control over, this can be done by using one of the following methods:
301 redirect
A 301 redirect is normally the best way to deal with duplicate content caused by content being moved. It tells the search engines that the content has moved permanently to the new location and passes almost all of the link juice earned by the old page to the new one. This is normally the best scenario; however, there are occasions when it isn't possible. What if you don't want the user to be redirected? That's what the next two options are for.
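For example, on an Apache server a 301 can be added in .htaccess; the paths here are hypothetical, and other servers have their own equivalents:

```apache
# .htaccess: permanently redirect the old URL to the new one
Redirect 301 /old-page.html http://www.mydomain.com/new-page/
```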
Rel=canonical
Use rel=canonical when you don't want the user to change location. Paddy wrote a great post on the difference between using a 301 and rel=canonical. See the example below from that post:
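The tag itself goes in the head of the duplicate page and points at the version you want indexed; the URL here is an example:

```html
<link rel="canonical" href="http://www.mydomain.com/preferred-page/" />
```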
Meta robots
If you are not able to add rel=canonical to the pages, the last option is using the meta robots tag to tell Google not to index the page but still pass link juice from it. To do this, simply add the following to the head section of the offending pages:
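The tag looks like this; "noindex,follow" keeps the page out of the index while still letting link juice flow through its links:

```html
<meta name="robots" content="noindex,follow">
```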
If the duplicate content is being caused by things like session IDs or tracking codes, Webmaster Tools has a feature to help you deal with it. I cover this in the next section on URL structure.
URL structure
This is quite a broad topic and includes things like parameters and pagination. Google do a really great job of identifying all the parameters that your site uses. To see them, go to the Site configuration section and click URL parameters.
The idea of this feature is to allow your site to be crawled efficiently. URL parameters can cause needless pages to be crawled and indexed by the search engines (as mentioned in the duplicate content section), so by telling Google how to handle these parameters you can stop pages being indexed that don't need to be. There is a post on the Webmaster Tools blog explaining the changes and how you could potentially use them; see it here.
Site speed is something that I've spoken about in great depth before, so I won't go into too much detail. There is a section in Webmaster Tools called Labs, which has a sub-section called Site performance (see below):
This shows the average speed of your site compared to others and plots a graph to show if on average your site is fast or slow.
This will give you a rough idea of your site speed, if it’s pretty slow I would recommend optimising your site using the following site speed recommendations.
Getting the basics correct is still an important aspect of SEO. Google makes this pretty easy by making suggestions like those seen below:
This is great, as you can then export these items and play around with them in Excel to see which changes need to be made. If there are a lot of changes, it's probably best to send this to a developer who can help make them on a large scale. For more information on how to get your on-page optimisation correct, see this great post by Rand.
I hope you find this a useful starting point for self-diagnosis of site problems. Once you feel confident doing this, take a deep dive and follow Geoff’s full site audit checklist. Thanks for reading, get me on Twitter @CraigBradford
Craig Bradford is an SEO Analyst at Distilled London. He loves learning anything and is currently learning to be a code master. SEO interests include site speed and deep diving into competitor analysis.