Let me begin by taking a moment to thank John Doherty and the rest of the Distilled Team for giving me a chance to share some of this research with you today. I’ve had the pleasure of getting to know the members of Distilled through their annual SearchLove conference, which was an amazing experience. Lynsey and Lauren definitely know how to plan an event, and their hospitality was amazing. I experienced this firsthand as a very last-minute attendee of SearchLove (I emailed Lynsey at 3am the day of the event and she got back to me in plenty of time to catch the first speakers; thanks again Lynsey!).
On that note, if you didn’t attend, not only did you miss a special appearance by EmCee Hammer, you also missed Mike King’s awesome presentation “Targeting Humans,” which contained a brief section on my recent paper “GoogleBot is Chrome: The Jig is Up… or why a search giant built a browser.”
You missed plenty of other things too… The speakers were great, and all the presentations were packed with usable information and handy resources. The after-hours meet and greets were no less interesting, and that’s really where Distilled’s hospitality shone through. If you didn’t attend, you really lost out. Be smart, learn from your mistake, and plan to be there next time around. Also, don’t forget to check out the Live Blogs and Wrap-Ups that are sure to be floating around on Distilled and other places like Outspoken Media or SEOMoz.
In case you haven’t read it, “GoogleBot is Chrome” outlines a theory that Google’s search crawler, GoogleBot, is actually built on the Chrome web browser, and may even have been the primary reason for Chrome’s development. If this is true, it suggests that GoogleBot is far more intelligent and capable than Google is letting on, and we’ll have to totally abandon the dated notion that GoogleBot looks at a website the way Lynx or other text-only browsers do.
You can catch the full version of the original paper at Mike King’s site if you want to dive into the initial evidence. This post is a follow-up to the original, since a wealth of additional data has come to light after the launch at SearchLove.
Stay tuned, and we’ll cover some recent revelations from Google, plus some potentially game-changing proofs of concept that show Google isn’t being totally straight with us about its indexing capabilities.
A few hours after the paper debuted at SearchLove, Matt Cutts tweeted a confirmation regarding a recent ”Digital Inspiration” article that observed GoogleBot is capable of crawling the AJAX content found in the Facebook Comments Plugin that’s so popular across the web. Based on the patent evidence recently uncovered, this was a strong indicator that GoogleBot is closer to Chrome than Lynx, especially since this functionality was only casually confirmed rather than publicly announced. This capability probably wasn’t something new to the search stack, just something the SEO industry recently noticed.
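To make the distinction concrete, here’s a minimal sketch (in Python, using the requests and Selenium libraries) of the difference between a “Lynx-like” fetch and a rendering crawler. The URL and comment text are placeholders I made up for illustration; the point is simply that AJAX-injected content, like the Facebook Comments Plugin, only exists in the DOM after scripts execute.

```python
# Contrast a text-only fetch with a rendering fetch of the same page.
# A "Lynx-like" crawler sees only the raw HTML; a browser-based crawler
# executes the JavaScript that injects the Facebook comments.
import requests
from selenium import webdriver

url = "http://example.com/post-with-facebook-comments"  # hypothetical page

# Text-only view: the comments container in the raw HTML is empty.
raw_html = requests.get(url).text
print("comment visible in raw HTML:", "great post" in raw_html.lower())

# Rendered view: after script execution the comment text is in the DOM.
driver = webdriver.Chrome()
driver.get(url)
rendered_html = driver.page_source
print("comment visible in rendered DOM:", "great post" in rendered_html.lower())
driver.quit()
```

If GoogleBot really is built on Chrome, it gets that second view essentially for free.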
Google also confirmed that they are applying analysis to content “above the fold,” something that is hinted at in their Visual Gap Analysis patents.
This functionality already seems to have rolled out via some Instant Previews (which are absolutely generated via a browser – http://sites.google.com/site/webmasterhelpforum/en/faq-instant-previews#02).
The Instant Preview shows a jagged cut representing the boundary of the non-scrolling viewable space, commonly referred to as “The Fold.” The content extracted by GoogleBot and highlighted in the Instant Preview comes from below The Fold, which Google seems to have identified accurately. It was highlighted because Google appears to have considered it relevant for ranking purposes. While more testing on this is definitely needed, it suggests that Google’s capabilities are more in line with the patent evidence than their public announcements would have us believe.
For an example of how precise content extraction below the fold can get, we need only turn to the query “GoogleBot is Chrome,” where the highly accurate excerpt “What if Chrome was in fact a repackaging of their search crawler; affectionately known as GoogleBot, for the consumer environment?” can be found in the Instant Preview for the article over on IPullRank.com.
That sentence was actually written as a summary of the introduction, and seeing my intent so accurately extracted is pretty impressive. Google’s ability to accurately identify, extract, and visually present content is a sign of an increasingly sophisticated spider moving well beyond the average “Lynx-like” crawler we tend to think of in the SEO world.
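If Google really is rendering pages in a browser, classifying content as above or below The Fold is fairly straightforward. Here’s a rough sketch of the idea using Selenium, with an assumed 1024x768 viewport and a placeholder URL; none of these specifics come from Google, they’re just my guesses at how such a classification could work.

```python
# Render the page at a fixed viewport size, then compare each element's
# rendered y-coordinate to the viewport height to decide whether it sits
# above or below "The Fold."
from selenium import webdriver

VIEWPORT_WIDTH, VIEWPORT_HEIGHT = 1024, 768  # assumed preview viewport

driver = webdriver.Chrome()
driver.set_window_size(VIEWPORT_WIDTH, VIEWPORT_HEIGHT)
driver.get("http://example.com/article")  # placeholder URL

for p in driver.find_elements_by_tag_name("p"):
    position = "above the fold" if p.location["y"] < VIEWPORT_HEIGHT else "below the fold"
    print(position, "-", p.text[:60])

driver.quit()
```

A crawler with this view of the page could easily pull a ranking-relevant excerpt from below The Fold and highlight it in a preview, exactly like the example above.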
What we learned is a Game Changer…
Thanks to all the feedback I received, I was dead set on trying to uncover some “Smoking Gun” evidence that GoogleBot behaves more like a browser than a semi-smart text-based crawler. Patent evidence is great, but it often lacks context… protecting intellectual property via patenting often comes well before public implementation.
A good portion of the things I found were to be expected, a few were pretty odd, and one example was so interesting it may change the way we think about GoogleBot and the Link Graph.
Even among the expected results there are some interesting standouts, like locally hosted Google Analytics JS files, jQuery, and Drupal theme style sheets appearing in the index with surprising frequency.
It’s definitely an odd set of items to train a custom script parser to execute, as they don’t really add any value to the index. On the other hand, it would be trivial to index these items if your crawler were a browser already executing scripts by default.
This wasn’t even the oddest example from the bunch though… Google seems to have indexed 530 versions of TweetMeme’s ad serving script at ads.tweetmeme.com/serve.js.
This leaves us with a few possible explanations:
1. The indexer is very “greedy” from a programming standpoint; that is to say, it tries to capture as much content as it can, whether or not it can do anything with it at this point in time.
2. The indexer is very ignorant and doesn’t notice the duplication it’s indexing.
3. The indexer is very intelligent and is able to identify differences in these URLs beyond the HTML markup.
I personally find option 2 pretty unlikely this late in the game, though there was a time when it was probably quite true. Option 1 is very likely considering Google’s stated goal of indexing the world’s information and making it useful, but it doesn’t take the full scope of Google’s capabilities into consideration. Either way, without some intelligence in place, both of these options have scary implications for search quality.
The Instant Preview clearly shows a rendered ad, and we know from our previous digging that Google is deploying some form of Visual Analysis to the Index, and is able to correlate that analysis back to the Instant Preview.
Taking all of this into consideration, it’s very possible that Google is monitoring file size and page load time, and perhaps deploying OCR or visual analysis to identify differences in these files… making option 3 a very viable explanation as well.
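To give a sense of how little machinery option 3 actually requires, here’s a hedged sketch of one way an indexer could separate true duplicates from meaningfully different responses at the same script URL: compare content hashes and file sizes across fetches. This is pure speculation about approach on my part, not a description of Google’s pipeline.

```python
# Fetch the same script URL repeatedly and bucket the responses by content
# hash; size and hash differences reveal variation that HTML markup alone
# would never show.
import hashlib
import requests

url = "http://ads.tweetmeme.com/serve.js"
seen = {}

for _ in range(5):  # a handful of fetches, just for illustration
    body = requests.get(url).content
    digest = hashlib.sha1(body).hexdigest()
    entry = seen.setdefault(digest, {"size": len(body), "hits": 0})
    entry["hits"] += 1

for digest, info in seen.items():
    print(digest[:12], "size:", info["size"], "seen:", info["hits"])
# Many distinct hashes of similar size would point to an ad server rotating
# content, not 530 genuinely unique documents worth keeping.
```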
Not only did Google manage to index some of the text, but this time the Instant Preview worked… and it showed the redirect’s destination page!
The indexed text and the displayed Instant Preview both matched the redirect’s destination, while the displayed URL matched the redirect’s source.
This could be due to search quality, or technological limitations, but either way it’s interesting…
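For what it’s worth, this source-URL/destination-content pairing is exactly what you’d expect if the crawler follows the redirect, indexes what it lands on, and keeps the originally requested URL for display. A minimal sketch of that bookkeeping, with a placeholder URL:

```python
# Follow a redirect, index the destination's content, but keep the original
# request URL as the display URL -- matching the observed SERP behavior.
import requests

source_url = "http://example.com/old-page"  # hypothetical redirecting URL
response = requests.get(source_url, allow_redirects=True)

record = {
    "display_url": source_url,                         # what the SERP shows
    "resolved_url": response.url,                      # where the redirect landed
    "redirect_chain": [r.url for r in response.history],
    "indexed_text": response.text,                     # destination content gets indexed
}
print(record["display_url"], "->", record["resolved_url"])
```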
Google seems to favor a cycle of internal innovation, followed by public announcement and user experience enhancement. When Google deploys new technology to the indexing stack, they probably want to let it burn in and gather data to compute meaningful features for extraction. The Google N-Gram Corpus, for example, helped make spam detection and topic modeling based on natural language processing a reality for Google, and was only possible once a significant amount of textual data had been absorbed into the index.
Once the functionality is actively affecting the index, rather than being used primarily as a learning tool, we tend to see a related user experience enhancement, which not only exposes the new functionality to improve the search experience, but also allows Google to gather usability data on the behavioral impact of the new data.
For example, Universal Search came about as Google began gathering lots of business data, map data, and news data.
In time both video and social were integrated into the experience… an outgrowth of the real time indexing capabilities Google had been focusing on adding.
I believe Google Instant Preview is the user experience enhancement that heralds the inclusion of Visual Analysis and Browser-Based Crawlers.
Ultimately, all we can do is make educated guesses about the exact nature of Google’s (and Bing’s) true indexing capabilities, as we’re on the outside looking in. The patent evidence and the public statements tell conflicting stories, and with the research lining up more with the patent evidence than the public announcements, we should probably wonder what benefit Google derives from keeping us in the dark about its indexing capabilities.
It’s highly possible that these innovations are still being tested, or are only rolled out to certain crawling servers, though it’s more likely that Google is seeking to avoid contaminating their data pool. Google has used the power of Big Data very effectively in the past, and the ability to learn from the trends of the Web gives them a competitive edge.
Search Quality is probably the primary motivator for keeping their full capabilities under wraps… the more we know, the more feasible it becomes to attempt to “game” the Algorithm, which ultimately taints their data pool. Features are most meaningful when they’re a natural outgrowth of the behaviors of real users.
Profit can never be overlooked as a factor either; Google is built on the back of Google Web Search. The AdWords empire only matters because of the Algorithm… and people use Google Analytics and Webmaster Tools to learn to please the Algorithm. If more people could effectively game the Algorithm, it would hurt search quality and ultimately affect Google’s bottom line.
And perhaps the most persuasive argument is from the perspective of innovation… the less brash young upstarts understand about Google’s capabilities, the less likely they are to field a truly viable competitor. Google has tons of potentially world-changing innovations simply sitting dormant, waiting for the right market and model to help monetize them effectively.
My personal favorite example of this is Google Translate, which is one of the most accurate machine translation tools on the planet. Google almost scrapped it because it wasn’t profitable, and had it not been for public outcry we might have lost access to this technology altogether. Tools like this truly have the ability to impact the entire world. Imagine combining it with the speech-to-text commonly found on Android devices to create an instant translator for disaster relief operations.
This amazing capability is simply a side effect of indexing volumes of data… it’s an accidental innovation that almost got mothballed in favor of making sure you can +1 that YouTube video of the kitten sleeping in a tea cup.
That said, the jig is still up Google… and it is game time. Who’s ready to help build the future?