Advanced SEO troubleshooting: Why isn’t this page indexed?

UPDATED: APRIL 3RD 2013

Some of you seasoned pros will likely find this post of n00b status, but chances are that you’ve made a silly mistake at one point (or two) in your SEO career. This happened to me and a few of my esteemed colleagues, so I decided that it was time for a bit of a refresher, or a “basics” kick in the butt. If a page isn’t indexed in 2013, we can blame it on one of the following:

  • Specific directives
  • Google’s almighty boot (algorithmic or manual)
  • Site architecture issues (orphaning pages isn’t cool, give your pages a family)
In this post, we’re only going to go over the directives that can yank a page out of the index, or rather never let it appear in the first place. We’ll start with the blatantly obvious, and then work our way to sneakier stuff. Just in case you don’t know how to check if a page is indexed:
  1. Copy URL
  2. Paste into Google search (any country)**
  3. The page should come up as the first result. Visit the page to ensure it’s the same URL that you copied and pasted in!
  4. If it’s not the exact URL, then you’ll need to start doing manual checks (see list below)

** - Ok ok, so there’s only one way that you wouldn’t see the URL in specific countries. If the URL you’re copying and pasting in is using HREFLANG + CANONICAL, then you might not see it in certain countries. There’s an example below in the honourable mentions section.

1 ) Meta robots NOINDEX in <head>

  • Difficulty to spot rating: It’s like getting slapped in the face
  • Easiest way to spot it? Look in the source code of your browser, in the <head> section
  • What does it look like? <META NAME="ROBOTS" CONTENT="NOINDEX"> and remember it’s in the head section.
  • What does it do? Tells Google that you don’t want this page in Google’s index, ever. Example here: http://www.mattcutts.com/blog/2013/02/
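If you’d rather not eyeball the source every time, the check above can be roughly sketched in Python. This is only an illustration using the standard library; the names RobotsMetaParser and has_meta_noindex are made up for this example, and it also treats "none" as a noindex, since that value is shorthand for noindex plus nofollow.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content attribute can hold several comma-separated directives.
        for part in attr_map.get("content", "").split(","):
            self.directives.append(part.strip().lower())


def has_meta_noindex(html):
    """Return True if the HTML carries a robots meta tag that blocks indexing."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives or "none" in parser.directives
```

Feed it the raw HTML of the page you’re debugging. It doesn’t check that the tag actually sits inside the <head>, so treat a positive as a prompt to look at the source yourself.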

2 ) Rel=canonical in <head>

  • Difficulty to spot rating: A 50 metre ditch, 1 metre in front of you.
  • Easiest way to spot it? Look in the source code of your browser, in the <head> section
  • What does it look like? <link rel="canonical" href="http://www.canonical-target.com"/> in the <head> portion of the HTML.
  • What does it do? Tells Google that the value (PageRank, link authority) of this page should be passed onto another page. It’s not a directive, but Google does honour the canonical suggestion we provide most of the time, which can pull this page out of the index while the canonical target (page) stays in the index.
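Here’s a minimal sketch of how you might automate that check in Python. It pulls the first rel="canonical" link out of the HTML and tells you whether it points somewhere other than the page you’re on. The names CanonicalLinkParser and canonical_points_elsewhere are invented for this example, and the URL comparison is deliberately crude (it only ignores a trailing slash).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated values.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def canonical_points_elsewhere(html, page_url):
    """Return True if the page declares a canonical target other than itself."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if not parser.canonical:
        return False
    # Resolve relative canonicals against the page URL before comparing.
    target = urljoin(page_url, parser.canonical)
    return target.rstrip("/") != page_url.rstrip("/")
```

A True result doesn’t prove the page has been dropped, since Google treats the canonical as a suggestion, but it tells you where to look next.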

3 ) NOINDEX in robots.txt

  • Difficulty to spot rating: Like trying to pick out Phil Nottingham from a L’Oreal model in a crowd, or relatively easy.
  • Easiest way to spot it? Check in Robots.txt
  • What does it look like? Noindex: /folder/  Special note here: my all caps test (NOINDEX) doesn’t work. However, it does de-index a page if you only capitalize the N in Noindex:. Thanks to Richard Falconer for the heads up and his working example here.
  • What does it do? Same as the regular noindex tag, but this is a weird robots.txt implementation.
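If the robots.txt file is long, a quick script can pull out any Noindex lines for you. The sketch below is just one way to do it; find_noindex_rules is a made-up name. It returns the field exactly as written, so you can see at a glance whether someone used Noindex or NOINDEX.

```python
def find_noindex_rules(robots_txt):
    """Return (field_as_written, path) for every noindex line in robots.txt."""
    rules = []
    for raw_line in robots_txt.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        # Match case-insensitively, but keep the original spelling so the
        # capitalisation can be inspected afterwards.
        if field.strip().lower() == "noindex":
            rules.append((field.strip(), value.strip()))
    return rules
```

For example, a file containing the line Noindex: /folder/ gives you [("Noindex", "/folder/")]. The script doesn’t track which User-agent group a rule belongs to, so check that by hand.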

4 ) NOINDEX in the HTTP header (or None in HTTP header)

  • Difficulty to spot rating: Finding an American penny in a pile of English pence
  • Easiest way to spot it? Check HTTP headers with Chrome or something like this HTTP viewer
  • What does it look like?
Headers (response)

CF-RAY: 66cdb31ea060165
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html
Date: Mon, 29 Apr 2013 14:59:27 GMT
Server: cloudflare-nginx
Transfer-Encoding: chunked
X-Robots-Tag: noindex

  • What does it do? It’s the same <meta name="robots" content="noindex"> you’re used to, except it’s delivered in the HTTP header. You can see a live example here. Hat tip to Ian Macfarlane for pointing out that “none” also means NOINDEX in the X-Robots-Tag directive.
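To check this without opening dev tools, something like the following Python sketch could work. Both function names are hypothetical. The parsing is simplified: it splits each header value on commas and, if a piece has a user-agent style prefix such as "googlebot: noindex", it only looks at the part after the last colon.

```python
import urllib.request


def x_robots_blocks_indexing(header_values):
    """Return True if any X-Robots-Tag value contains noindex or none."""
    for value in header_values:
        for piece in value.split(","):
            # Keep only the directive after an optional "useragent:" prefix.
            directive = piece.rsplit(":", 1)[-1].strip().lower()
            if directive in ("noindex", "none"):
                return True
    return False


def fetch_x_robots_values(url):
    """Send a HEAD request and return every X-Robots-Tag header value."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get_all("X-Robots-Tag") or []
```

You’d call x_robots_blocks_indexing(fetch_x_robots_values(url)) with the URL you’re investigating. Some servers answer HEAD differently from GET, so if the result looks wrong, confirm it in the browser.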

5 ) Rel canonical in the HTTP header

  • Difficulty to spot rating: Spotting a penny in a pool without goggles.
  • Easiest way to spot it? Check HTTP headers with Chrome or something like this HTTP viewer
  • What does it look like? Google says to do it like this: Link: <http://www.davidsottimano.com>; rel="canonical". You can see a working example here: http://www.davidsottimano.com/http-canonical-example.php
  • What does it do? Same thing as the regular canonical tag (in the <head>).
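Once you have the value of the Link header, a small parser can pull out the canonical target. The sketch below is a rough illustration, not a full implementation of the Link header syntax: canonical_from_link_header is a made-up name, and it will trip over commas inside quoted parameter values.

```python
import re


def canonical_from_link_header(link_header):
    """Return the URL marked rel="canonical" in a Link header value, or None."""
    # Each entry looks like: <url>; param=value; param=value
    for match in re.finditer(r"<([^>]*)>([^,]*)", link_header):
        target, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, value = param.partition("=")
            if name.strip().lower() != "rel":
                continue
            # rel may list several space-separated relation types.
            relations = value.strip().strip('"').lower().split()
            if "canonical" in relations:
                return target
    return None
```

Pass it only the header value, without the leading "Link:" name. If it returns a URL different from the one you requested, the page is asking to be canonicalized away.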

6 ) Meta refresh with delay > 0 (example: 5)

  • Difficulty to spot rating: Being slapped 5 times in 5 seconds
  • Easiest way to spot it? Go to the page, and watch for any redirects OR just look in the source code
  • What does it look like? <meta http-equiv="Refresh" content="5;url=http://soliddelivery.co.uk/finsdanishpack.html"> in the <head> portion of the HTML.
  • What does it do? Waits for the specified amount of time, then sends your browser on a road trip. Google only indexes the destination when there’s a time delay greater than 0. Not sure why, but now you know. Here’s a live test: http://soliddelivery.co.uk/feratsodaman.html (copy and paste that into search and also visit that URL to see what happens, or just click here)
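If you want to catch these in bulk, here’s a minimal Python sketch that finds a meta refresh and reports its delay and target. MetaRefreshParser and find_meta_refresh are names invented for this example, and it assumes the usual "seconds;url=target" layout of the content attribute.

```python
from html.parser import HTMLParser


class MetaRefreshParser(HTMLParser):
    """Records the first <meta http-equiv="refresh"> as (delay, url)."""

    def __init__(self):
        super().__init__()
        self.refresh = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh is not None:
            return
        attr_map = dict(attrs)
        if (attr_map.get("http-equiv") or "").strip().lower() != "refresh":
            return
        # content looks like "5;url=http://example.com/" or just "5".
        delay_part, _, rest = (attr_map.get("content") or "").partition(";")
        try:
            delay = float(delay_part.strip())
        except ValueError:
            return  # Not a delay we can read, so ignore the tag.
        name, _, value = rest.strip().partition("=")
        url = value.strip().strip("'\"") if name.strip().lower() == "url" else ""
        self.refresh = (delay, url or None)


def find_meta_refresh(html):
    """Return (delay_seconds, target_url) for the page, or None if absent."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.refresh
```

A delay greater than 0 with a target URL is the case described above; a delay of exactly 0 behaves differently (see the honourable mentions).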
 

7 ) Parameter canonicalization in GWMT (Google webmaster tools)

  • Difficulty to spot rating: Remembering what you got your teacher for Christmas in 1994, or hard
  • Easiest way to spot it? You need to either have access to webmaster tools for the site or know someone who does.
  • What does it look like? It’s in the Configuration > URL Parameters section.
  • What does it do? When you change these settings in WMT, Google usually obeys. Depending on what you tell Googlebot to do per parameter, you might end up dropping loads of pages out of the index. Example: If you instruct Googlebot not to crawl any URLs, it might drop all of these pages with parameters out of the index. Similarly, if you tell Googlebot that these pages paginate or do not change page content, then it can drop them too. Remember to check URL param configuration when you can’t figure out why these pages aren’t showing.
More information:
  • http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687
  • BONUS ROUND! If you’re reading this, then I’ll give you a little tip. If you need to canonicalize a page and you don’t want anyone to easily find out, tack on a fake parameter to the URL and add the custom parameter to WMT. That’s all I’m telling you, and you shouldn’t use it for naughty purposes.

8 ) URL removal request in WMT

  • Difficulty to spot rating: Like getting your boss to agree to a 3 month paid holiday in Thailand... or Hard.
  • Easiest way to spot it? You need to either have access to webmaster tools for the site or know someone who does.
  • What does it look like? Optimization > Remove URLs section of WMT
  • What does it do again? Yanks that page out of the index within 24 hours. In my personal experience, I didn’t need to supplement with a NOINDEX tag or Robots.txt block; this URL removal request works and is efficient. The only problem is that you’ll never know a page has been “blocked” until you check this report.

Honourable mentions

  • Just repeating it, but blocking in Robots.txt doesn’t prevent indexation!
  • Originally, I had rel=“next” / “prev” in as a potential cause of de-indexation (because of its likeness to canonical), but I can’t list it without a public example.
  • 301 redirects will likely only show the destination target, but not always, and not forever. I’ve left this out because the tests are always flaky - but if you can back this up, please do.
  • 302s (test here, example here) and meta refresh 0 delay (test here, example here) will show the URL in search with the content of the target
  • Similar to the behaviour of the 302s and meta refresh 0, Hreflang + canonical might cause a page to disappear OR reappear depending on what country you search in (also dependent on hreflang alternates)
Did I miss anything? If I did, and you can prove it, I’ll add it to the post and happily credit you.