Data Inaccuracies in Indexed Pages Lists

A client recently asked if there was a way to identify which pages on their site were not being indexed. Google, Bing and Yahoo all have very different systems, and getting a definitive answer from any of them is next to impossible. Anyone who has researched indexed pages in the past 10 years knows that the three engines don't agree on how many pages to index, much less which ones. So we decided to run some tests to see how to discern which pages were being indexed by any engine without hand-checking every URL.

The first idea was to use the exportable list of internal links from Google Webmaster Tools to identify pages that have been indexed. The logic was that if internal links were identified from a page, then that page was most likely indexed. The accuracy of Google Webmaster Tools has been somewhat inconsistent, so we wanted to test it on some smaller sites first.



Click 'Your site on the web' -> 'Internal Links' -> 'Download this table' to find a list of pages

Our clients have anywhere from 100,000 to 10 million pages, which makes it hard to pull a list of all indexed pages using a "site:" search, and that list was integral to testing how many pages were "accurately indexed." We say "accurately" loosely because the "site:" search can be flaky depending on the datacenter, time of day, etc.

The Tiny Test Group

We used four sites with very few pages (<200) to perform this small test. This is in no way scientific, nor can the data be used for full analysis or correlation. It is just a quick look at how the internal links report might help identify indexed pages. The sites used are a financially based consumer site, a personal blog, a health services site, and a medical surgeon's site.

Please note that once we pulled the internal links information for the original client, we came across a major caveat: only 100,000 pages are shown in the downloaded report. So for sites with more than 100,000 indexed pages, this idea won't ever work, no matter the accuracy.

The test started by downloading all internal links from Google Webmaster Tools into Excel and de-duplicating them. Then we used the SEO for Firefox plugin to pull a "site:" search for each domain, downloading all pages into CSV. The two lists were compared, and the results were dreadful. The only time the two came close was when the site had fewer than 20 pages.
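For anyone who would rather script the comparison than do it in Excel, here is a minimal sketch in Python. It is not the process we used; the function names, the normalization rules and the example.com URLs are all assumptions made up for illustration, and the two input lists stand in for the exported internal links and the "site:" results.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash,
    so that trivially different spellings of one page compare as equal."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def compare_lists(internal_links, site_search):
    """De-duplicate both lists and report where they overlap and differ."""
    linked = {normalize(u) for u in internal_links if u.strip()}
    indexed = {normalize(u) for u in site_search if u.strip()}
    return {
        "in_both": sorted(linked & indexed),
        "linked_only": sorted(linked - indexed),
        "indexed_only": sorted(indexed - linked),
    }


# Hypothetical sample data, not taken from any of the test sites.
internal_links = [
    "http://example.com/",
    "http://example.com/about/",
    "http://example.com/private",
]
site_search = [
    "http://EXAMPLE.com",
    "http://example.com/about",
    "http://example.com/whitepaper.pdf",
]

result = compare_lists(internal_links, site_search)
for bucket, urls in result.items():
    print(bucket, len(urls))
```

The "linked_only" and "indexed_only" buckets are the interesting ones: they are the pages that show up in one list but not the other.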

Given the results of this test, and the fact that we could not use this tactic for the client anyway, we threw out the idea of using the listed internal links. It had been our best idea for automating such a list. But it did push us to do more research into the many ways a client or webmaster might look at the number of indexed pages. And with some clear oddities in the data, this test spurred a deeper look into the numbers we were pulling.

Identifying the Outliers

The main issue we wanted to explore was the fact that there were pages in the internal links list that were not in the indexed list. We realized that the site reporting this has a select number of pages set to noindex. The links to these pages were therefore identified, but the pages themselves were not in the index. That led us to wonder: if we pulled the list of indexed pages for a site in every possible way, how would the results differ, and why? So using the same sites, we pulled numbers from all the engines, from both their logged-in webmaster tools and a site colon search, and compared those numbers to the sitemap for the site and the number of visible pages.
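We counted the noindex pages by hand, but a check like this could be scripted. The sketch below is only an illustration: it uses Python's standard html.parser module to look for a robots meta tag in HTML you have already fetched, the class and function names are made up, and it does not cover a noindex sent through the X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Sets self.noindex when a robots meta tag asks engines not to index."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # Assumption: only the generic "robots" and "googlebot" names are checked.
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            # "none" is shorthand for "noindex, nofollow".
            if "noindex" in directives or "none" in directives:
                self.noindex = True


def has_noindex(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


# Hypothetical pages keyed by URL.
pages = {
    "http://example.com/about": "<html><head><title>About</title></head></html>",
    "http://example.com/private": '<html><head><meta name="robots" content="noindex, follow"></head></html>',
}
noindex_pages = [url for url, html in pages.items() if has_noindex(html)]
print(noindex_pages)
```

Any URL that lands in this list can be removed from the "missing from the index" pile, since it is missing on purpose.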



Known Pages - those pages that a webmaster would see as "pages" on their site. Typically product pages, informational, and others that are easily navigable from the navigation.
Sitemap URLs - How many URLs are listed in the XML sitemap. (These sites only have one.)
Noindex Pages - Pages on the site that have a noindex tag.
Google Indexed - The number of pages for the site domain using a site colon search.
Google Sitemap Indexed - The number of pages of the sitemap Google has indexed.
Yahoo Indexed - The number of pages for the site domain using a site colon search.
Yahoo My Sites - The number of pages the My Sites section states as indexed.
Bing Indexed - The number of pages for the site domain using a site colon search.
Bing WMC Indexed - The number of pages Bing's Webmaster Center says it has indexed.
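The "Sitemap URLs" number is the easiest of these to pull automatically. As a rough sketch, assuming a single standard XML sitemap (not a sitemap index file) that has already been downloaded as text, Python's standard library can count the listed URLs. The function name and the sample sitemap are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value listed under <url> in a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical two-page sitemap.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

urls = sitemap_urls(sample)
print(len(urls), "URLs in sitemap")
```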

What the ****?

The numbers don't line up at all, do they? They are all over the board for every site. The reasoning is fairly simple but can easily be missed by anyone looking for a way to locate a list of pages not in the index.

First, remember that "Known Pages" is a relative number. That was really just me counting the pages that could be identified externally by looking at the blog or site. The noindex-tagged pages were counted by hand as well. Those two numbers and the sitemaps are all prone to human error. We have already noted that the site colon search isn't always spot on, but the real kicker is in the data from the engines.

Google - The only way to get a "count" of indexed pages from Webmaster Tools is to have a sitemap, and sitemaps can be flawed in many ways.

Bing - They are notoriously slow to crawl and index sites. So the newer and smaller site is not yet indexed.

Yahoo - There was only one major outlier here, but notice that their numbers are much higher than everyone else's; that is common for Yahoo. Site Explorer is known for showing every link to a site under the sun, so if a file exists in any form, we could expect them to list it.

The Six Things to Remember

Beyond the fact that all of the engines are different, here are a few things we realized when pulling the numbers from all the engines. These are all easily fixable if you are serious about getting a clear list of indexed pages, but can be easily overlooked, causing confusion when the numbers are pulled.

  1. Parameters
    For many sites, there are pages being indexed with dynamic parameters such as a session ID. If these are not ignored by the engines, there is a possibility that there are a number of pages in a site colon search and internal linking report that are in essence duplicates.
  2. Noindex
    If you have set some pages in your site navigation to noindex, naturally they are not going to appear in a list of indexed pages. But those pages can be linking to other pages, and would therefore appear in the report of internally linked pages.
  3. Outdated Sitemap
    If you are relying on a sitemap to tell you how many URLs are indexed, the sitemap needs to be constantly updated. If possible, set it to be automatically generated server side. This will cut down on confusion like what we saw on Site 2.
  4. Other File Types (PDF, Flash, Word, PowerPoint)
    The engines index more than just web pages. Any document on your domain that is linked to from anywhere on the Internet can be indexed if it is a supported file type and not restricted by robots.txt or an .htaccess file. This includes any Flash files, presentations or marketing material you might have pushed at some point. Think press releases.
  5. Pagination
    If you own a blog or a site with products that span a number of pages, pagination is most likely occurring. A category on a blog that the owner would count as just one page could be two or three pages long. This can make the number of indexed pages larger than expected.
  6. Subdomains
    Are you running a mobile site? Anything on your domain, including mobile site files, subdomains, and duplicate content from those subdomains could be throwing off your numbers of indexed files.
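Several of the items above (parameters, other file types, pagination and subdomains) can be spotted from the URL alone, so a list of indexed URLs can be sorted into rough buckets before anyone starts counting. The following Python sketch shows one way to do that. The session parameter names, the "/page/N" pagination pattern, the file extensions and the example.com URLs are all assumptions for illustration and would need adjusting for a real site.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names and patterns; adjust them to match the site being checked.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}
DOC_EXTENSIONS = (".pdf", ".swf", ".doc", ".docx", ".ppt", ".pptx")
PAGINATION_PATH = re.compile(r"/page/\d+/?$")


def strip_session_params(url):
    """Remove assumed session parameters so duplicate URLs collapse into one."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def classify(url, main_host):
    """Put a URL into one rough bucket; the first matching rule wins."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # treat www as the main site, not a subdomain
    query_keys = {key.lower() for key, _ in parse_qsl(parts.query)}
    if host != main_host and host.endswith("." + main_host):
        return "subdomain"
    if parts.path.lower().endswith(DOC_EXTENSIONS):
        return "document"
    if PAGINATION_PATH.search(parts.path) or "page" in query_keys:
        return "pagination"
    if query_keys & SESSION_PARAMS:
        return "session_parameter"
    return "page"


# Hypothetical list of indexed URLs.
indexed = [
    "http://example.com/products",
    "http://example.com/products?sid=abc123",
    "http://example.com/blog/category/news/page/2",
    "http://example.com/press/release.pdf",
    "http://m.example.com/products",
]

counts = Counter(classify(url, "example.com") for url in indexed)
unique_pages = {strip_session_params(url) for url in indexed}
print(counts)
print(len(unique_pages), "unique URLs after removing session parameters")
```

In this made-up list, five indexed URLs shrink to four once the session ID duplicate is collapsed, and only one of them is a plain page on the main site. The noindex and outdated sitemap items cannot be detected from a URL, so they are not covered here.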

Just remember that in the end, pulling a number of indexed pages from anywhere without digging into the data is most likely going to be misleading. It is just like pulling rank numbers for a keyword and expecting that to be your rank for everyone searching for that term. The internet is not a perfect place, and it is not scientific by any means. Take every number with a grain of salt and do your homework. Things change all the time, and the best way to combat that is to keep your analytical brain engaged at every moment.
