Google… Stop Playing; The Jig Is Still Up! [Guest Post]

Intro

Let me begin by taking a moment to thank John Doherty and the rest of the Distilled Team for giving me a chance to share some of this research with you today. I’ve had the pleasure of getting to know the members of Distilled through their annual SearchLove conference, which was an amazing experience. Lynsey and Lauren definitely know how to plan an event, and their hospitality was amazing. I experienced this firsthand, as I was a very last-minute attendee of SearchLove (I emailed Lynsey at 3am the day of and she got back to me in plenty of time to arrive for the first speakers; thanks again Lynsey!).

On that note, if you didn’t attend, not only did you miss a special appearance by EmCee Hammer; you missed Mike King’s awesome presentation, “Targeting Humans,” which contained a brief section on my recent paper “GoogleBot is Chrome: The Jig is Up… or why a search giant built a browser.”

You missed plenty of other things too… The speakers were great, and all the presentations were packed with usable information or handy resources. The after-hours meet and greets were no less interesting, and that’s really where Distilled’s hospitality shone through. If you didn’t attend, you really lost out. Be smart, learn from your mistake, and plan to be there next time around. Also, don’t forget to check out the Live Blogs and Wrap Ups that are sure to be floating around on Distilled and other places like Outspoken Media or SEOmoz.

In case you haven’t read it, “GoogleBot is Chrome” outlines a theory that Google’s Search Crawler, GoogleBot, is actually built from the Chrome Web Browser, and may even have been the primary reason for Chrome’s development. If this is true, it suggests that GoogleBot is a lot more intelligent and capable than Google is letting on, and that we’ll have to totally abandon the dated notion that GoogleBot looks at a website in a manner similar to Lynx or other text-only browsers.

Instead, we may have to face a reality where GoogleBot’s capabilities rival those of any other web browser, with the ability to crawl and index DOM Transformations produced by AJAX or JavaScript Libraries like jQuery.
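
To make that idea concrete, here is a minimal, purely hypothetical sketch of the kind of DOM Transformation at stake (the selector and text are invented for illustration, and jQuery is assumed to be loaded on the page):

    // Content injected after the page loads. A Lynx-style crawler parsing raw
    // HTML sees only an empty container; a crawler that actually executes
    // JavaScript, the way Chrome does, also sees the injected paragraph.
    $(document).ready(function () {
      $('#reviews').append('<p>Rated 4.8 out of 5 by 1,200 readers</p>');
    });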

You can catch the full version of the original paper at Mike King’s site if you want to dive into the initial evidence. This post is a follow-up to the original; since the launch at SearchLove, a wealth of other data has come to light.

Stay tuned, and we’ll cover some recent revelations from Google, along with some potentially game-changing proofs of concept that show Google isn’t being totally straight with us about their indexing capabilities.

Recent Revelations

https://twitter.com/#!/mattcutts/status/131425949597179904

A few hours after the paper debuted at SearchLove, Matt Cutts tweeted a confirmation regarding a recent ”Digital Inspiration” article that observed GoogleBot is capable of crawling the AJAX content found in the Facebook Comments Plugin that’s so popular across the web. Based on the patent evidence recently uncovered, this was a strong indicator that GoogleBot is closer to Chrome than Lynx, especially since this functionality was only casually confirmed rather than publicly announced. This capability probably wasn’t something new to the search stack, just something the SEO Industry recently noticed.

Google also confirmed that they are applying analysis to content “above the fold,” something that is hinted at in their Visual Gap Analysis patents.

http://www.searchenginejournal.com/pubcon-matt-cutts-amit-singhal/36015/
This functionality seems to have already rolled out via some Instant Previews (which are absolutely generated via a Browser - http://sites.google.com/site/webmasterhelpforum/en/faq-instant-previews#02).

The Instant Preview shows a jagged cut representing the boundaries of the non-scrolling viewable space, commonly referred to as “The Fold.” The content extracted by GoogleBot and highlighted in the Instant Preview comes from below The Fold, which Google seems to have been able to accurately identify, and it was highlighted because Google appears to have considered it relevant for ranking purposes. While more testing on this is definitely needed, it still seems to suggest that Google’s capabilities are more in line with the patent evidence than their public announcements tend to suggest.

For an example of how precise the content extraction below the fold can get, we need only turn to the query “GoogleBot is Chrome,” where the highly accurate excerpt “What if Chrome was in fact a repackaging of their search crawler; affectionately known as GoogleBot, for the consumer environment?” can be found in the Instant Preview for the article over on IPullRank.com.

That sentence was actually written as a summary of the introduction, and to see my intent so accurately extracted is pretty impressive. Google’s ability to accurately identify, extract, and visually present content is a sign of an increasingly complex spider moving well beyond the average “Lynx-like” crawler we tend to think of in the SEO World.

With that in mind, and with some prompting from John Doherty (thanks for the heads up!), I took some time to do a little digging and see how advanced GoogleBot really is when it comes to JavaScript.

What we learned is a Game Changer…

GoogleBot is crawling, indexing, and even executing all sorts of JavaScript!

Thanks to all the feedback I received, I was really dead set on trying to uncover some “Smoking Gun” evidence that GoogleBot was behaving more like a browser than a semi-smart text-based crawler. Patent evidence is great, but is often without context… protecting intellectual property via patenting often comes well before public implementation.

A good portion of the things I found were to be expected, a few were pretty odd, and one example was so interesting it may change the way we think about GoogleBot and the Link Graph.

The Expected

A quick Google Search using the “filetype” advanced operator (a query like filetype:js analytics, for example) shows Google is definitely finding and indexing JavaScript files, and even CSS files. Repositories like Google Code and GitHub make it a little harder to find literal examples, but if you set your search settings to return 100 results at a time and scroll, you’ll find more than a few on the first page. Look for the URLs without text descriptions or titles, as those tend to be actual CSS or JS files.

Even in the expected, there are some interesting standouts, like locally hosted Google Analytics JS files, jQuery, and Drupal Theme Style Sheets occurring with an interesting frequency.

The Odd

This next example could belong in either “The Odd” or “The Expected” depending on your own understanding of Google’s crawling and indexing technology. I’m inclined to think that this kind of JavaScript indexing is common… but if you follow the party line that GoogleBot is a text-based crawler with specialized JavaScript capabilities, then it’s definitely odd.

Google indexes the phrase “first name” on a LasVegasTickets.com ticket results page thanks to the newsletter widget embedded at the top of the right-hand navigation. Oddly enough, this is actually a JavaScript widget from a third-party contact solution.

A quick visit to the same page using the Firefox NoScript plugin gives us the default NOSCRIPT text, warning us that we need JavaScript for the widget to work. Google also indexes this warning text as part of the page content. A quick scan of the HTML Source using Firebug shows that there’s no DIV with the proper name anywhere in the source.

Re-enabling JavaScript allows the script to execute, and instead of the warning message we’re greeted with a straightforward opt-in form for a newsletter. A search of the source with Firebug now shows a hit for the DIV ID “emma_member_first_name,” allowing us to infer that the script is using DOM injection techniques to add the DIV to the DOM after the page has loaded, and then probably using some CSS to ensure that the layer shows on top of the other element.

This elegant, low-tech solution is a great example of progressive enhancement in web development, which serves to accommodate users who disable or lack JavaScript support. The Flash replacement tool swfObject is another example of this DOM injection methodology being used to support a User/Search Friendly experience through progressive enhancement techniques... that aside, it’s a JavaScript DOM Transformation, and Google really shouldn’t be able to index it if GoogleBot is closer to a text-only browser than to Chrome.
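
For reference, the pattern looks roughly like the sketch below. This is an assumption-laden illustration, not the widget’s actual code; the container ID is invented, and only the “emma_member_first_name” ID comes from what was observed in the page source.

    // Without JavaScript, visitors and text-only crawlers see only the static
    // NOSCRIPT warning in the markup. With JavaScript, the widget injects the
    // real opt-in field into the DOM after the page has loaded.
    document.addEventListener('DOMContentLoaded', function () {
      var container = document.getElementById('newsletter-widget'); // hypothetical container ID
      var field = document.createElement('div');
      field.id = 'emma_member_first_name'; // the ID only exists once the script has run
      field.innerHTML = '<label>First Name</label><input type="text" name="first_name">';
      container.appendChild(field);
    });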

It’s definitely an odd item to train a custom script parser to execute, as it doesn’t really add any value to the index. On the other hand, it would be trivial to index this item if your crawler were a browser, already executing scripts by default.

This wasn’t even the oddest example from the bunch though… Google seems to have indexed 530 versions of TweetMeme’s ad serving script at ads.tweetmeme.com/serve.js.

This behavior is odd and unexpected largely because the differing versions don’t add anything substantial to the index. The file only outputs a little inline CSS and the JavaScript to embed the image, so there’s nothing for the crawler to really get at. Ultimately this kind of behavior speaks to the quality of the Indexing technology and we can infer one of three things:
  1. The indexer is very “greedy” from a programming standpoint; that is to say it tries to capture as much content as it can whether or not it can do anything with it at this point in time.
  2. The indexer is very ignorant and doesn’t see the duplication it’s indexing.
  3. The indexer is very intelligent and is able to identify differences in these URLs beyond HTML Markup.
I personally find option 2 pretty unlikely this late in the game, though there was a time when it was probably quite true. Option 1 is very likely considering Google’s stated goal of indexing the world’s information and making it useful, but it doesn’t take the full scope of Google’s capabilities into consideration. Either way, without some intelligence in place, both of these options have scary implications for search quality.

The Instant Preview clearly shows a rendered ad, and we know from our previous digging that Google is deploying some form of Visual Analysis to the Index, and is able to correlate that analysis back to the Instant Preview.

Taking all of this into consideration, it’s very possible that Google is monitoring file size and page load time, and maybe deploying OCR or other visual analysis to identify differences in these files… making option 3 very viable as well.

The Interesting

When I first started putting together this research, I ended up running into a lot of dead ends; Search Friendly design has spread far and wide, leading to many instances where content I thought was being created by JavaScript was really just a hidden DIV being brought into view through CSS Transformations. I literally sorted through hundreds of ticket sites, hotels, airlines, and news sites before I found this surprising example.

The first time I tried to grab an Instant Preview of “wcyb.sportsdirectinc.com,” I couldn’t help but notice that no preview was generated. Curious, I opened the page with the NoScript Plugin enabled, and found myself facing a warning about a JavaScript-based redirect.
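
For anyone who hasn’t bumped into one, a JavaScript redirect is nothing more than a client-side location change. A minimal, hypothetical sketch (not the actual script from the site, and with an invented destination URL) looks like this:

    // Runs as soon as the script executes and sends the visitor elsewhere. A
    // crawler that executes JavaScript lands on the destination URL; one that
    // doesn't sees only the original page.
    window.location.replace('http://example.com/destination-page');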

This made for the perfect opportunity to test the limits of GoogleBot’s JavaScript Capability, so I let the page load without NoScript and took some text from the page I ended up on after the redirect.

Not only did Google manage to Index some of the text, but this time the Instant Preview worked… and it showed the redirect’s destination page!

The indexed text and the Instant Preview being displayed both ended up matching the redirect destination, while the URL displayed matched the source of the redirect.

This suggests that not only is GoogleBot executing JavaScript, it seems to be following JavaScript Redirects and treating them as 302 Redirects. That isn’t even the most interesting part though; it seems Bing is doing this too!

The true scope of Bing’s ability to execute and index JavaScript is unknown, though Microsoft is also the holder of patents related to Headless Browser Based Crawlers. Oddly enough, the BingBot was able to keep pace with Google when it came to the JavaScript redirect, but for whatever reason did not index the contact form from LasVegasTickets.com.

This could be due to search quality, or technological limitations, but either way it’s interesting…

Conclusions

Google seems to favor a cycle of internal innovation, followed by public announcement and user experience enhancement. When Google deploys new technology to the indexing stack, they probably want to let it burn in and gather data to compute meaningful features for extraction. The Google N-Gram Corpus, for example, helped make spam detection and natural language processing-based topic modeling a reality for Google, and it was only possible once significant amounts of textual data were absorbed into the index.

Once the functionality is actively affecting the Index, rather than being used primarily as a learning tool, we tend to see a related User Experience enhancement, which not only exposes the new functionality to enhance the search experience, but also allows Google to gather usability data on the behavioral impact of the new data.

For example, Universal Search came about as Google began gathering lots of business data, map data, and news data.

In time both video and social were integrated into the experience… an outgrowth of the real time indexing capabilities Google had been focusing on adding.

I believe Google Instant Preview is the user experience enhancement that heralds the inclusion of Visual Analysis and Browser Based Crawlers.

Ultimately, all we can do is make educated guesses about the exact nature of Google’s (and Bing’s) true indexing capabilities, as we’re on the outside looking in. The Patent Evidence and the public statements tell conflicting stories, and with the research lining up more with the Patent Evidence than the public announcements, we should probably wonder what benefit Google may derive from keeping us in the dark about their indexing capabilities.

It’s highly possible that these innovations are still being tested, or are only rolled out to certain crawling servers, though it’s more likely that Google is seeking to avoid contaminating their data pool. Google has used the power of Big Data very effectively in the past, and the ability to learn from the trends of the Web gives them a competitive edge.

Search Quality is probably the primary motivator for keeping their full capabilities under wraps… the more we know, the more feasible it becomes to attempt to “game” the Algorithm, which ultimately taints their data pool. Features are most meaningful when they’re a natural outgrowth of the behaviors of real users.

Profit can never be overlooked as a factor either; Google is built on the back of Google Web Search. The Adwords Empire only matters because of the Algorithm… and people use Google Analytics and Webmaster Tools to learn to please the Algorithm. If more people could effectively game the Algorithm it would impact search quality and ultimately affect Google’s bottom line.

And perhaps the most persuasive argument is from the perspective of innovation… the less brash young upstarts understand about Google’s capabilities, the less likely they will be to field a truly viable competitor. Google has tons of potentially world changing innovations simply sitting dormant waiting for the right market and model to help them monetize it effectively.

My personal favorite example of this is Google Translate, which is one of the most accurate machine translation tools on the planet. Google almost scrapped it because it was not profitable, and had it not been for public outcry we may have lost access to this technology altogether. Tools such as this truly have the ability to impact the entire world. Imagine combining this with the Speech to Text commonly found on Android Devices to create an Instant Translator for disaster relief operations.

This amazing capability is simply a side effect of indexing volumes of data… it’s an accidental innovation that almost got mothballed in favor of making sure you can +1 that YouTube video of the kitten sleeping in a tea cup.

That said, the jig is still up Google… and it is game time. Who’s ready to help build the future?

16 Comments

  1. Charlie

    Maybe I'm missing the boat here... why does this matter?

    So you can't hide content with javascript now... and? You shouldn't have been doing that in the first place.

    Anything that's behind a java file/form that you don't want indexed, add noindex or block via robots.txt.

    The idea that Google's been using UX has been floating around since Panda first hit in the Spring. It started with the fact that they could already tell which Adsense units were above the fold (since you can buy that inventory specifically as an advertiser).

    It's nice to see this idea getting a near-sure confirmation but... again, so what? I thought we've all been operating under the idea that it was the case for months now. It's interesting to see that it's being done via Chrome - but I don't understand how people are making this out to be the biggest thing to hit the industry in recent weeks.

    What best practice has this changed?
    How would you have to change a strategy because of this?

    I'm hoping I'm totally missing the big deal here because this whole thing has everyone excited and I'm left totally underwhelmed.

    • Joshua Giardino

      Charlie, thanks for sharing your thoughts.

      I think the most clear cut "best practice" that could be perceived as being impacted by this data is the use of JavaScript to do things like obfuscate affiliate links or insert navigational links after the page has rendered to reduce the number of on-page outbound links.

      Techniques like this would often be used in an attempt to sculpt the transfer of internal link popularity.

      The data also has implications for progressive enhancement techniques like SIFR or swfObject; if Google is indexing inconsequential email opt-in boxes, what does this mean for more advanced techniques relying on JavaScript replacement?

      Lastly, there are direct implications for the Link Graph... Google and Bing appear to be treating JS Redirects in the same fashion as they treat 302 Redirects. Combine this with the recent announcements that Google follows GET Forms, and will follow POST forms when it feels it is safe to do so, and you've got a totally different spectrum of inbound links that need to be assessed.

      Beyond that, it just confirms the futility of Black Hat techniques such as cloaking and reiterates how important it is to invest in a sustainable and reputable Search Strategy... Google's only getting smarter.

      I hope that gives some insight all around, and thanks again for your thoughts.

  2. Alan Bleiweiss

    Excellent article Josh. I really appreciate the time you took for research on this one. I hadn't till now read your Googlebot is Chrome article. Your conclusions, in my opinion, make perfect sense. I've held the position for a while now that we need to think of search engines as attempting to grasp user experience as seen through the lenses of their algorithms, but this takes it to the next (more appropriate and accurate) level.

    Without first being able to read a page as a browser does, there's no way the algorithm can even begin to assess a web page/site's intent due to the presentation layer that falls by the wayside in a lynx type experience.

    As for the fact that you discovered the indexing of 530 versions of TweetMeme’s ad serving script, well, that too makes sense to me. There could very well be code / content within any script that, when run through a browser, delivers intent modifications as compared to a page where scripts are stripped out.

    And for all the good work engines do, I think 2011 was a resounding confirmation as to how far they have yet to go (hence the roll-out of schema.org). So for now at least, a lot of what GoogleBot or BingBot can scrape/find/see is still likely to slip into the final index.

  3. Vincent

    Tweetmeme's ad javascript is served as text/html. That would explain why it's being indexed :) (just sayin')

    • Joshua Giardino

      Haha thanks for the insight Vincent.

      The ad is clearly HTML based, but the markup is largely identical across all the ads... the fact that it's indexed is less interesting than the fact that Google indexed so many of them.

      It hints at some non-markup based indexing behavior, and I wouldn't be entirely surprised if it were computer vision based.

  4. I think the statement: "And perhaps the most persuasive argument is from the perspective of innovation… the less brash young upstarts understand about Google’s capabilities, the less likely they will be to field a truly viable competitor. Google has tons of potentially world changing innovations simply sitting dormant waiting for the right market and model to help them monetize it effectively." is probably the most accurate assumption. There are countless examples of Google's reasoning for keeping their cards close in the book "In The Plex," and some accounts where employees have been fired for letting the cat out of the bag too early. Besides first-mover advantage and initial market saturation, they can acquire related and supportive companies/technologies for pennies on the dollar, long before these companies learn that they probably have more leverage at the table than they realize. Look at the acquisition of Android Inc. in 2005.

    Great article!

  5. Bryant

    http://code.google.com/web/ajaxcrawling/docs/faq.html#whatifnot

    is pretty telling of what they want to do with googlebot

    • Joshua Giardino

      Hello Bryant,

      Thanks for sharing, I haven't seen that before. In my opinion that's just another attempt at misdirection on the part of Google...

      They claim they want it to behave more like a browser, but I think there's enough evidence to say that at least part of their Search Stack IS a browser.

      Instant Preview is absolutely a browser, but beyond that it appears that they are deploying some highly intelligent indexing capabilities which are most likely coming from a browser being used as a crawler.

  6. Harry Venus

    I gathered that Googlebot was rendering pages in early 2010 with the introduction of Google Pagespeed (how does Google know how fast a page loads without loading it?).

    I wonder if and how javascript execution will influence A/B testing that relies on JavaScript. Couldn't software other than Google Website Optimizer be very confusing to a JS bot?

    The execution of JavaScript is indeed very nice. I've seen music player pages where all content is loaded via JavaScript and the page title is nondescriptive.

    Google had indexed all the JavaScript-loaded content, and get this: used this data to repair the page title to be more specific and relevant.

    Stuff like call-to-action inside the fold, or the distribution of content to advertisements in the fold, or telephone/contact info inside the fold can all start playing a role now, especially after the Panda quality update. We should judge our sites visually now, and care more about design too.

    • Joshua Giardino

      Hello Harry,

      Page Speed is definitely a good example. I theorize in the original paper that by 2009, with the launch of the Chromium project, Google had fully deployed a browser based crawler.

      If this were true, 2010 would be plenty of time for them to have deployed Machine Learning on their index to see what the most popular or authoritative sites look like in relation to Page Load Times to perhaps derive a meaningful ranking feature for their algorithm.

      Regarding A/B Testing... I haven't given it much thought honestly. I imagine it definitely presents issues, though I also think Google could compensate for this by detecting common A/B Testing Scripts by their tag and basically setting a break point in V8 on the execution of that code or removing it from their rendering of the document via DOM transformation.

  7. Another interesting Instant Preview feature: it seems to show SVG files only if they are 1) embedded in a web page 2) the web page ends in a .svg extension (even if it's not an SVG file and the HTTP headers are set to text/html).

  8. There's a reason Matt Cutts has said on several occasions in his videos and at conferences to just focus on user experience. They've been reading our sites like a full browser for years.

    Looking at page structure, JavaScript, etc., I just wonder if they've been able to interpret compiled .SWFs. It's probably safe to assume GoogleBot hasn't been a Lynx-style crawler since 2006 or so. I also wonder how much the V8 JavaScript engine has to do with their indexing.

  9. Ian Bowden

    Page speed is formulated from the Google toolbar, no?

  10. Thanks for the time spent "distilling" this info. Seriously though, great post. I wish I could reach out to the bot and find out what I will be having for dinner next Thurs. It's amazing how intelligent the bots are becoming and the speed at which it is happening. I'm the dumbest/slowest kid on this block and I would love to know what those reading this post are going to do with this info in terms of future SEO strategy.

  11. Some interesting points, but the proof is in the pudding. Instant preview does not correlate to Google seeing "all". Take a look at GWT and SmartGWT content which is 100% js generated. Google cannot generate a description that reflects the content yet it generates a 100% accurate preview. Google "Smartgwt showcase" to have a look see. Google can only "see" the loader js but no content. So yes, it can see some js but does not look like they can see all.

  12. rose

    Thanks for the time spent “distilling” this info. I wonder if and how javascript execution will influence A/B testing that relies on JavaScript.

