Advanced SEO troubleshooting: Why isn’t this page indexed?

UPDATED: APRIL 3RD 2013

Some of you seasoned pros will likely find this post of n00b status, but chances are that you’ve made a silly mistake at one point (or two) in your SEO career. This happened to me and a few of my esteemed colleagues, so I decided that it was time for a bit of a refresher, or a “basics” kick in the butt. If a page isn’t indexed in 2013, we can blame it on one of the following:

  • Specific directives
  • Google’s almighty boot (algorithmic or manual)
  • Site architecture issues (orphaning pages isn’t cool, give your pages a family)
In this post, we’re only going to go over the directives that can yank a page out of the index, or rather never let it appear in the first place. We’ll start with the blatantly obvious, and then work our way to sneakier stuff. Just in case you don’t know how to check if a page is indexed:
  1. Copy URL
  2. Paste into Google search (any country)**
  3. The page should come up as the first result. Visit the page to ensure it’s the same URL that you copy-pasted in!
  4. If it’s not the exact URL, then you’ll need to start doing manual checks (see list below)

** - Ok ok, so there’s only one way that you wouldn’t see the URL in specific countries. If the URL you’re copy / pasting in is using HREFLANG + CANONICAL, then you might not see it in certain countries. There’s an example below in the honourable mentions section.

1 ) Meta robots NOINDEX in <head>

  • Difficulty to spot rating: It’s like getting slapped in the face
  • Easiest way to spot it? Look in the source code of your browser, in the <head> section
  • What does it look like? <META NAME="ROBOTS" CONTENT="NOINDEX"> and remember it’s in the head section.
  • What does it do? Tells Google that you don’t want this page in Google’s index, ever. Example here: http://www.mattcutts.com/blog/2013/02/
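If you’d rather not eyeball the source, the check above can be sketched in a few lines of Python. This is a minimal sketch, not part of the original tests: the class and function names are made up for illustration, and it treats "none" as equivalent to "noindex" (as noted elsewhere in this post and its comments).

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content attribute may hold several comma-separated values.
            for directive in attr.get("content", "").split(","):
                if directive.strip():
                    self.directives.append(directive.strip().lower())


def has_meta_noindex(html):
    """Return True if the HTML carries a meta robots noindex (or none)."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return any(d in ("noindex", "none") for d in parser.directives)
```

Feed it the raw HTML of the page you are troubleshooting; it only looks at meta tags and does not check that they sit inside the <head>.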

2 ) Rel=canonical in <head>

  • Difficulty to spot rating: A 50 metre ditch, 1 metre in front of you.
  • Easiest way to spot it? Look in the source code of your browser, in the <head> section
  • What does it look like? <link rel="canonical" href="http://www.canonical-target.com"/> in the <head> portion of the HTML.
  • What does it do? Tells Google that the value (PageRank, link authority) of this page should be passed onto another page. It’s not a directive, but Google does honour the canonical suggestion we provide most of the time, which can pull this page out of the index while the canonical target (page) stays in the index.
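To check this without reading the source by hand, here is a minimal Python sketch that pulls the canonical target out of a page and compares it with the URL you expected to be indexed. The names are illustrative, and the trailing-slash comparison is a deliberate simplification rather than real URL normalization.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        # rel can hold several space-separated values.
        if "canonical" in attr.get("rel", "").lower().split() and attr.get("href"):
            self.canonical = attr["href"]


def canonical_target(page_url, html):
    """Return the absolute canonical URL declared in the HTML, or None."""
    parser = CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        return None
    return urljoin(page_url, parser.canonical)


def canonical_points_elsewhere(page_url, html):
    """True if the page declares a canonical other than itself."""
    target = canonical_target(page_url, html)
    return target is not None and target.rstrip("/") != page_url.rstrip("/")
```

A True result only tells you the page is suggesting another URL; whether Google honours the suggestion is still up to Google.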

3 ) NOINDEX in robots.txt

  • Difficulty to spot rating: Like trying to pick out Phil Nottingham from a L’Oreal model in a crowd, or relatively easy.
  • Easiest way to spot it? Check in Robots.txt
  • What does it look like? Noindex: /folder/ (special note: my all-caps test, NOINDEX, doesn’t work; however, it does de-index a page if you capitalize only the N, as in Noindex:). Thanks to Richard Falconer for the heads up and his working example here.
  • What does it do? Same as the regular noindex tag, but this is a weird robots.txt implementation.
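Scanning a robots.txt file for these lines is easy to script. The sketch below is a rough illustration (the function name is made up): it lists every noindex line it finds and flags whether the line uses the exact "Noindex" spelling that worked in the test described above.

```python
def find_noindex_rules(robots_txt):
    """Return (line_number, path, exact_case) for each noindex line.

    exact_case is True only for the "Noindex" spelling, which is the
    one reported as working in this post.
    """
    rules = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "noindex":
            rules.append((number, value.strip(), field.strip() == "Noindex"))
    return rules
```

It does not work out which User-agent group a rule belongs to, so treat the output as a list of lines to go and read, not a verdict.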

4 ) NOINDEX in the HTTP header (or None in HTTP header)

  • Difficulty to spot rating: Finding an American penny in a pile of English pence
  • Easiest way to spot it? Check HTTP headers with Chrome or something like this HTTP viewer
  • What does it look like?
Headers (response)

CF-RAY: 66cdb31ea060165
Connection: keep-alive
Content-Encoding: gzip
Content-Type: text/html
Date: Mon, 29 Apr 2013 14:59:27 GMT
Server: cloudflare-nginx
Transfer-Encoding: chunked
X-Robots-Tag: noindex

  • What does it do? It’s the same <meta name="robots" content="noindex"> you’re used to, except it’s delivered in the HTTP header. You can see a live example here. Hat tip to Ian Macfarlane for pointing out that “none” also means NOINDEX in the X-robots directive.
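Once you have the response headers as name/value pairs, the check is a one-liner in spirit. Here is a minimal Python sketch (the function name is illustrative) that treats both noindex and none as blocking, and also tolerates a user-agent prefix such as "googlebot: noindex", which is an assumption about the header format beyond what this post tested.

```python
def x_robots_blocks_indexing(headers):
    """headers: iterable of (name, value) pairs from an HTTP response."""
    for name, value in headers:
        if name.lower() != "x-robots-tag":
            continue
        for part in value.split(","):
            # Keep only what follows an optional "useragent:" prefix.
            directive = part.rpartition(":")[2].strip().lower()
            if directive in ("noindex", "none"):
                return True
    return False
```

With the standard library you could pass it urllib.request.urlopen(url).headers.items(); that part is left out of the sketch so it runs without a network connection.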

5 ) Rel canonical in the HTTP header

  • Difficulty to spot rating: Spotting a penny in a pool without goggles.
  • Easiest way to spot it? Check HTTP headers with Chrome or something like this HTTP viewer
  • What does it look like? Google says to do it like this: Link: <http://www.davidsottimano.com>; rel="canonical" and you can see a working example here: http://www.davidsottimano.com/http-canonical-example.php
  • What does it do? Same thing as the regular canonical tag (in the <head>).
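Pulling the canonical target out of a Link header value can be sketched as below. This is a simplified parser written for illustration, not a full implementation of the Link header syntax: it assumes parameter values contain no commas or semicolons, and the function name is made up.

```python
import re

# One <url> followed by its ;param=value pairs, up to the next comma.
_LINK_PART = re.compile(r"<([^>]*)>\s*((?:;\s*[^;,]+)*)")


def canonical_from_link_header(value):
    """Return the URL marked rel="canonical" in a Link header, or None."""
    for match in _LINK_PART.finditer(value):
        url, params = match.group(1), match.group(2)
        for param in params.split(";"):
            key, _, val = param.strip().partition("=")
            rels = val.strip("\"' ").lower().split()
            if key.strip().lower() == "rel" and "canonical" in rels:
                return url
    return None
```

If the returned URL is not the page you requested, you have found the same situation as a canonical tag in the <head>, just hidden one layer down.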

6 ) Meta refresh with delay > 0 (example: 5)

  • Difficulty to spot rating: Being slapped 5 times in 5 seconds
  • Easiest way to spot it? Go to the page, and watch for any redirects OR just look in the source code
  • What does it look like? <meta http-equiv="Refresh" content="5;url=http://soliddelivery.co.uk/finsdanishpack.html"> in the <head> portion of the HTML.
  • What does it do? Waits for the specified amount of time, then sends your browser on a road trip. In my test, Google only indexed the destination when the delay was greater than 0. Not sure why, but now you know. Here’s a live test: http://soliddelivery.co.uk/feratsodaman.html (copy paste that into search and also visit that URL to see what happens or just click here)
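The content attribute of a meta refresh is just "delay;url=target", so it is simple to pick apart. The following Python sketch (function names are illustrative) parses that value and flags the delay-greater-than-0 case described above; it assumes you have already extracted the content attribute from the tag.

```python
import re


def parse_meta_refresh(content):
    """Split a meta refresh content value into (delay_seconds, target_url).

    Returns None if the delay is not a number; target_url is None when
    the tag only reloads the current page.
    """
    delay_part, _, rest = content.partition(";")
    try:
        delay = float(delay_part.strip())
    except ValueError:
        return None
    # Strip the optional "url=" prefix (any case) and surrounding quotes.
    target = re.sub(r"^url\s*=\s*", "", rest.strip(), flags=re.IGNORECASE)
    return delay, target.strip("\"' ") or None


def refresh_hides_source(content):
    """True for a delayed refresh to another URL (delay > 0)."""
    parsed = parse_meta_refresh(content)
    return bool(parsed and parsed[0] > 0 and parsed[1])
```

A zero-delay refresh returns False here on purpose, since that case behaves differently in search.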
 

7 ) Parameter canonicalization in GWMT (Google webmaster tools)

  • Difficulty to spot rating: Remembering what you got your teacher for Christmas in 1994, or hard
  • Easiest way to spot it? You need to either have access to webmaster tools for the site or know someone who does.
  • What does it look like? It’s in the Configuration > URL Parameters section.
  • What does it do? When you change these settings in WMT, Google usually obeys. Depending on what you tell Googlebot to do per parameter, you might end up dropping loads of pages out of the index. Example: If you instruct Googlebot not to crawl any URLs, it might drop all of these pages with parameters out of the index. Similarly, if you tell Googlebot that these pages paginate or do not change page content, then it can drop them too. Remember to check URL param configuration when you can’t figure out why these pages aren’t showing.
More information:
  • http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687
  • BONUS ROUND! If you’re reading this, then I’ll give you a little tip. If you need to canonicalize a page and you don’t want anyone to easily find out, tack on a fake parameter to the URL and add the custom parameter to WMT. That’s all I’m telling you, and you shouldn’t use it for naughty purposes.
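When a batch of missing pages all carry query strings, it helps to know which parameter names they share before opening the URL Parameters report. Here is a minimal Python sketch for that; the function name and the example URLs in the usage note are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def parameter_counts(urls):
    """Count how many URLs in the list use each query parameter name."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # A set, so a parameter repeated within one URL is counted once.
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts
```

For example, parameter_counts(["http://example.com/shoes?colour=red&page=2", "http://example.com/shoes?page=3"]) shows that page appears in both URLs, which makes it the first parameter to look up in WMT.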

8 ) URL removal request in WMT

  • Difficulty to spot rating: Like getting your boss to agree to a 3 month paid holiday in Thailand... or Hard.
  • Easiest way to spot it? You need to either have access to webmaster tools for the site or know someone who does.
  • What does it look like? Optimization > Remove URLs section of WMT
  • What does it do again? Yanks that page out of the index within 24 hours. In my personal experience, I didn’t need to supplement with a NOINDEX tag or Robots.txt block, this URL removal request works and is efficient. The only problem is that you’ll never know a page has been “blocked” until you check this report.

Honourable mentions

  • Just repeating it, but blocking in Robots.txt doesn’t prevent indexation!
  • Originally, I had rel=“next” / “prev” in as a potential cause of de-indexation (because of its likeness to canonical), but I can’t list it without a public example.
  • 301 redirects will likely only show the destination target, but not always, and not forever. I’ve left this out because the tests are always flaky - but if you can back this up, please do.
  • 302s (test here, example here) and meta refresh 0 delay (test here, example here) will show the URL in search with the content of the target
  • Similar to the behaviour of the 302s and meta refresh 0, Hreflang + canonical might cause a page to disappear OR reappear depending on what country you search in (also dependent on hreflang alternates)
Did I miss anything? If I did, and you can prove it, I’ll add it to the post and happily credit you.

Dave Sottimano


David Sottimano comes from a varied background in Corporate Marketing and Professional Sales. His love affair between the internet and marketing has finally found the perfect balance at Distilled, and continues to flourish each day. He graduated...


23 Comments

  1. I've gotten into trouble in the past with the canonical tag and managed to get a page not indexed by mistake.
    Good refresher article, thanks.

  2. Let's see if I deserve a mention :)
    If off-site issues count, though it's uncommon, I'd cite Hijacking: copy-paste a phrase from your page and check if someone has stolen your entire page (tested by Dan Petrovic here http://dejanseo.com.au/hijacked/).

    Good and comprehensive post anyway!
    Giuseppe

    • David Sottimano

      Ooh, I remember reading that post but I'm still hazy on the exact details. I think this is a result of algorithmic filters and not of a directive. While it does make perfect sense, I didn't go into algorithmic factors for de-indexation here and was only talking about directives.

  3. Great post. I agree some of it is noob status, but some of it is things that just slip by or don't always appear as an option at first sight. This should almost be a checklist for anyone who asks about duplicate issues or has an issue with getting pages indexed - of course it's not an exhaustive list, but it's a good list and a good jumping-off point is given for each "idea". Nice work David.

  4. Tom

    Thanks for the insight, David.

    I was a little surprised you added rel next/prev to the list. Do you have any examples of when applying those tags you found paginated pages removed from the index or pages that never made it to the index?

    I imagine that Google would generally keep all pages in the index and only serve the correct page as it matched a user's search (i.e. the search matched a specific piece of text on page 3).

    Thanks again,
    Tom

    • David Sottimano

      Hey Tom,

      Yes, this is a weird one and I was surprised by this behaviour as well when I saw it. I can't give client examples that would confirm it unfortunately, but a good place to look is any site with Yoast's SEO WP plugin. Paddy's rel=next and rel=prev pages ARE indexed, for example www.paddymoogan.com/category/seo/page/2/ - which completely contradicts my point.

      You are right - if I can't show a working example I shouldn't add it to the list. I'll do some more testing and come back shortly.

      Thanks for challenging me.

  5. Not really an addition, but site:mysite.com inurl:xyz and site:mysite.com -inurl:xyz will help you check several pages that include / don't include xyz in their URLs at once. Useful if you check huge sites imho.

  6. I've also found that a JS redirect can cause a page to drop out. I'll write about it soon, but I looked at different ways to get pages deindexed (similar to what you did, I think). When I tried the canonical, I found that it was ignored - the pages were completely different though...Not sure how similar the pages were in your test.

  7. The Noindex directive in robots.txt does still seem to work in Google for the site I have it set up on.

    A couple of other reasons why a page might not be indexed:


    noindex X-Robots-Tag in HTTP header (X-Robots-Tag: noindex)
    Incorrect mime type declared in HTTP header. For example, for an HTML document, if it was declared as Content-Type: text/javascript

    • David Sottimano

      I can't believe I forgot to include the noindex in HTTP header... I'm going to credit you for your awesome answers, but did you have any live examples of these by chance?

    • Yeah, I thought that was a strange omission since you pointed out the canonical X-Robots-Tag :-)

      Tumblr tag pages use noindex X-Robots.
      (thanks Ian Macfarlane, LBi)

      I've only found the MIME type problem once and had it fixed so I don't have any examples. I've no idea how widespread this issue might be though.

  8. Hi David, thanks for the awesome refresher post. Will you be putting the "Almighty Boot" and "Architectural" posts at some point?

    Cheers!

  9. Here's a Noindex example from an old blog I'd forgotten about.

    I'm guessing Google recognises "Noindex:" in the robots.txt but not "NOINDEX:"

  10. To add to the list...

    307 - works just like a 302.

    Not having a 200 OK status code (a while back we had a client whose site served a 404 status code for all URLs). There are obviously some caveats around a few other statuses like 302s.

    You can get edge case scenarios where a 301 will be treated like a 302 instead - the most common example would be the homepage redirecting to an internal page with a 301.

    Also a meta robots with a value of "NONE" works as well as "NOINDEX" (it implies both noindex & nofollow).

    If your robots.txt file serves a 503, your whole site might get de-indexed :-)

    If your site is doing any sort of geo-targeting or language detection (or flat out cloaking, especially if you've gotten hacked), it might be redirecting differently to Googlebot than to your browser - best to test Fetch as Googlebot to be sure.

    I know you've excluded "Site architecture issues", but rel=nofollow kinda falls into the first category (it won't by itself prevent indexation if other links point there, but it's a directive that can result in a page not being indexed).

    I know you've also excluded "Google’s almighty boot" but XML Sitemaps are also used as a hint a bit like the canonical link element is, which is effectively a directive which prevents search engines from indexing that particular page if another dup is included in the Sitemap index. It probably sits somewhere between a canonical link element and Google's built-in de-duping in this respect. This is getting into semantics though really :-)

    Lastly, for checking if a page is indexed, the "info:" operator is one of the more reliable methods I've found.

  11. Excellent and very helpful post... thanks for sharing

  12. Nice technical information. I loved your post.

  13. Thanks for a very informative post David, that has cleared up a few points for me.

  14. David Sottimano

    Hi Eric, the easiest way for you to control robot directives and other SEO specific tags on Wordpress is Yoast's SEO plugin. Don't worry, it's free ;) You say that you don't want certain pages indexed, so I would advise you to use NOINDEX to ensure they aren't indexed. The canonical tag isn't really a directive, and is more of a suggestion that may / may not result in indexation while in use. Hope that helps.

  15. Geoff - I'd like to read your post about JS redirects when you write it. Great refresher post David. I'm surprised by how often a site's robots.txt file causes indexing problems.

  16. Nice post, I've been planning on doing one just like you did, only in Swedish, as I've had several readers asking for a similar article. It was well written and you cover the most important aspects.

  17. Great post David, this a nice checklist to use for indexing problems.

