The ImportXML Guide for Google Docs

The Importxml guide for Google Docs has been written to primarily empower SEOs, SEMs and digital professionals across the world to create their own tools.  Inspiration, innovation and all around ingenuity can be attributed to the talented engineers at Google for bringing us this invaluable service. Next time you speak to a Googler, show them some love.

This guide is suitable for absolute beginners to the diabolically clever coders that will no doubt expand on what they learn here. If you’re looking for advanced level Google docs, Tom Critchlow created a full video guide on Appsumo. Enough talk, let’s get it on...

Chapter 1 - Prerequisites - How to look at source code and interpret HTML

Chapter 2 - Introduction to Xpath

Chapter 3 - I’m ready - What’s ImportXML and why should I use it?

Chapter 4 - Your first ImportXML lesson

Chapter 5 - Advanced ImportXML

Chapter 6 - Your first Google Docs script with ImportXML

Chapter 7 - Advanced scripts and API magic

Chapter 8 - Regular Expressions and data extraction

Chapter 9 - Efficiently lightening the load on the spreadsheet

Chapter 1 - Prerequisites

back to top

First things first, go get yourself a Google account. Secondly, if you haven’t used Google Docs and specifically Google spreadsheets you’ll need to take the tour and familarize yourself. If you’re already an Excel user, you’ll be right at home - hurry up we’re waiting!

Great, now we’ll need to make sure you can read some source code. In this instance we’re talking about HTML markup that your browser needs to bring those beautiful websites to life.


Try it! Right click on this web page and choose view source.


Don’t get scared, and more importantly DO NOT give up now :)

Funnily enough this will all make perfect sense to you in a bit, trust us you’ll love it. What you’re seeing up there is mostly HTML markup which is how web pages are built.


If you want to become an expert on HTML visit this introduction at W3 Schools

If you want a quick and dirty explanation, stick with me.

HTML isn’t exactly programming, rather it’s a markup language made up of tags. The tags are surrounded by angle brackets < > and come in pairs - one to open and another to close ( We add a forward slash to close like this: . Think of writing a quote in text “You would always open up the quote by using apostrophes, but the end must come as well so we close them by adding some more”


Try it!  Take your mouse and rest it on the tab of the current page you’re on. You should see something like this:


That information is actually stored in an HTML tag called


Try doing this right now with a live test


*Notice how the second title tag is closed using the / slash. There are many tags which you’ll need to reference in your quest for importing data, so use this cheat sheet to help you.

Chapter 2 - Introduction to Importxml: Exploring Xpath

back to top

Great, you made it! Now that we’re familiar with HTML and such, we’re going to learn how to import some of that data using Google Docs Spreadsheets. Before we do, we’ll need to learn some XPATH first.

Xpath (simple definition): A query language used to pick out pieces of information from web pages.


If you want to become an expert on Xpath, get back to W3 and start the introduction.

Or we’ll take you through another quick and dirty explanation here.


To get you through quickly, we’re going to explain only what you need to know for ImportXML - Parents and Children.

For the following example we’re going to use:

For the following example we’re going to use:

<div> Starring as “the parent”, <p> as “aww the cute child” and co-starring is: “Plain old text” as himself.

<div>

<p>Plain old text</p>

</div>

So we’ve got our stars and now we want to pluck out “Plain old text” as he hasn’t been performing to his best these days.

To do this using Xpath we would write: //p

Ok so you might not get it, let’s get some information from the Godfather (specific to Google Docs):

  • // – this means select all elements of the type<
  • //h3 – this means select all h3 elements
  • [@class=’’] – this means only select those elements that meet the criteria given - Example: look for H1
  • //h3[@class=’storytitle’] – this means only select elements that look like:h3 class=”storytitle”Title/h3

Chapter 3 - Why should I care about using ImportXML?

back to top

Well you’ve come this far and it’s fair that we tell you. ImportXML allows you to import data from web pages directly into your Google Docs Spreadsheet. It’s like magic! We can now look at a page, decide what information we want by looking in the source code, finding the HTML element and by using Xpath we can dig it out.

ImportXML is how we’re going to pull it in. So why exactly should you care about it?

Instead of copying and pasting information from a bunch of websites, this is will do it in 2.2 seconds.

It automatically pulls it into a spreasheet so you can work with your data however you want.

If you’re a trader, you can automatically pull in all your stocks into one spreadsheet effortlessly.

If you’re an SEO, you can acquire 100’s of link targets from around the web.

If you’re an online retailer, check up on your competitor’s pricing for Product XYZ as soon as it changes!

It is truly endless, and only limited by your imagination....and um Google. Have we convinced you? OK come on let’s get to the first lesson!

Chapter 4 - Your first ImportXML lesson

back to top

You’ve got 2 options: You can watch a quick video we made below or continue past the video to get hands on straight away.

The hands on approach: We’re getting used to the code, so let’s go ahead and step through a practical example. A common SEO task is “how can I find as many blogs in specific niches as possible”.

Performing a a quick search turns up with a site listing 100 blogs like this: http://www.onlineschools.org/2009/10/27/100-best-book-blogs-for-kids-tweens-and-teens/. It’s manual and time consuming having to click on each one to copy the link. I want to get the list of URLs into a spreadsheet as quick as possible.

1) First we take a look at the source code of the page and we see something like this:

You can see that all of the blogs are in the li element. Make a mental note!

2) Now fire up your G docs spreadsheet and add the URL in cell A1: http://www.onlineschools.org/2009/10/27/100-best-book-blogs-for-kids-tweens-and-teens

We can see that all the blogs are a separate li element so let’s try something like

=importxml(A1,“//li”)
(where A1 is the cell with the URL of the page). We get this back:

3) As you can see, it contains the blog names so we’re getting there. But our query is also getting a whole load of other stuff we don’t want. So let’s look in the code and see if we can isolate the list of blog items. I find the “inspect element” control in Google Chrome excellent for visualising this. As you hover over the code, it highlights the section of the page that applies to it.

4) We refine our guess to limit ourselves to the div with class ‘intro-box-wide’ using a query like

=importxml(A1,“//div[@class=’intro-box-wide’]//li”)
Which loosely translated says “get the div tag with that class (intro-box-wide) and then select all li (list) elements within it” which results in:

5) We’re nearly there! We now have a list of all the blog elements. The next step is to pull the URL. So we modify our function to be:

=importxml(A1,“//div[@class=’intro-box-wide’]//li/a/@href”)
Which says, from the li elements select the href contents from the a element. This /a/@href is a very common thing to tag on the end of importxml functions so I suggest you memorise it. This results in:

And we’re done! Have a look at the Google Docs spreadsheet go here and make a copy then you can play around to your heart’s content :)

Chapter 5 - Advanced ImportXML

back to top

Ok, now we have the basics down let’s move on. You might have seen the next section, as it’s been detailed by Tom Critchlow in a different post. I’m going to help explain the genius behind his rank tracking - try and keep up. This is a common task that we might want to do so let’s quickly discuss how to do that. Firstly we construct the search URL like this:

=“http://www.google.com/search?hl=en&gl=US&q=”&a2
Let’s break this down a bit shall we?
  • We’re going to get data from google, so naturally we have http://www.google.com/
  • Next is /search? - this means we’re pulling up the search results page
  • We add in hl=en which is the language code. EN=english
  • gl=US is helping us get United States Results, for UK we would use gl=UK
  • q= is your search query ex. cool places in london, cats etc...
  • Optional: You can specify the number of results by adding &num=20 or =50,=100
  • Most importantly, this URL looks like a formula because he adds the =. Also, he adds “ because this is essentially telling G docs that anything in the ” “ is text. When the text finishes he adds the cell reference &a2. Why? This way you can type your next query in A2 and not have to modify the URL again :)
Ok back to Tom:Then we parse the Google URL using importxml like this:
=importxml(C2,”//h3[@class=’r’]/a/@href“)
I’m not going to break that down, hopefully you can figure out what I’m getting off the page (Psst - he’s taking all the links from the listings). Again, check the source code for the page if you’re not sure what to write in your importxml function. Output like this:

As before, Grab your own copy here.

Of course, since this is a common function we might want to roll it into a single cell like this:

=ArrayFormula(MATCH(1, FIND(”https://www.distilled.net“,importxml(concatenate(”http://www.google.co.uk/search?q=“&A2&”&pws=0&gl=UK&num=50“),”//h3[@class=’r’]/a/@href“)),0))

I was going to add some explanation here as to what this formula does but actually it gets pretty complicated. Either you already know what an arrayformula does (in which case it should be straightforward) or you don’t. In which case you probably just want to copy and paste for now :)

I should note at this stage that there is a limit for 50 importxml calls per spreadsheet which limits us from building a full web crawler but for most agile tools this is sufficient.

Cool, so now we have a bunch of moving parts let’s start combining things. Here’s a spreadsheet where you enter a keyword, 4 location variants and a domain and it gives you the ranking position for all the keywords that search suggest spits out for that domain. Nifty.

As always, have a poke around with the Google Doc to understand what’s happening.

Chapter 6 - Your first Google Docs script with ImportXML

back to top


Welcome to your first script in Google Docs. If you’re a beginner you’ll probably want to head to Google Apps and try your first script.

Or, get your hands dirty right away with a very simple tutorial...


Script (simple definition): A set of instructions where Google docs will execute and return values, perform actions etc..

What coding language does Google docs script use? In simple terms, it’s a ”bastardized“ version of Javascript - tutorials here.

We’re going to walk you through a very basic script that puts a a value into a cell. Wooohoo, let’s go!

1) Fire up the old G docs spreadsheet, click on Tools > Script Editor

2) The easiest way to explain the function name is to relate it to something you already know. Any formula in Excel ex. =SUM is a function. This will work the same way if we need it to.

3) First thing we need to do is define the current spreadsheet. Google has already pre-made this:

SpreadsheetApp.getActiveSheet();
That’s pretty long and I don’t feel like typing it out every time. Let’s put it in a box (real name is a variable) and let’s call this box ”sheet“. Much nicer name don’t you think?
var sheet = SpreadsheetApp.getActiveSheet();
4) Alright, now let’s try and put something in a cell. You’ll have to do 2 things here: Tell the script which cell you want to put things in and what you want to put in. To ”get“ the cell we use: .getRange() and to put something in, we use .setValue(). Notice the uppercase letters, they’re important! Ok let’s put it all together:
sheet.getRange(”b3“).setValue(”Hi there!“);
5) Ok, the moment of truth - put all of this in the script and it should look like this:
function myFunction() {
var sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange(”b3“).setValue(”hi there“);
}
6) Press the play button and just save the script as it comes. Now go back to your spreadsheet, what do you see?

Well done!

Chapter 7 - Advanced scripts and API magic

back to top

Let’s combine the magic of ImportXML + APIs + Google scripts to make some seriously cool tools. For the following example we put Tom Anthony in a cage until he made us a Twitter followers scraper. We wanted to know how many crazy people actually followed him on Twitter using scripting magic.

As always, have a poke around with the Google Doc to understand what’s happening.

The only cell that needs user input is A2 - which is his twitter handle. The rest of the magic happens behind the GO button. Here is your step by step:

1) Select Tools from top menu > Script Editor

2) Copy and paste in the script below. Anything with ”//“ in front of it is a comment and isn’t required to make the script run.

function loadImport()

{

// This variable, ‘ss’, will store a reference to the Spreadsheet.

var ss = SpreadsheetApp.getActiveSpreadsheet();

// And ‘sheet’ will store the Sheet.

var sheet = ss.getActiveSheet();

// In more complex scenarios we may want to be referencing different

// sheets or even different spreadsheets. Normally though, the above

// will be what you want.

// Get a reference to the A2 Cell, where the URL has been placed.

var cellreference = sheet.getRange(”A2“);

// Now we use getValue() to get the URL stored in that cell.

var url = cellreference.getValue();

// Now join that url onto the API URL.

var apistring = ”http://api.twitter.com/1/statuses/followers/“ + url + ”.xml?cursor=-1“;

// Now build our importxml() command

var command = ”importxml(\“” + apistring + “\”, \“//users_list/users/user/screen_name\”)“;

// Now we want to reset our reference and point it to the target cell

var cellreference = sheet.getRange(”B2“);

// Now we put the forumla into the cell and it will pull in some followers for us! Woo!

cellreference.setFormula(command);

}

See the line that begins with var command= ? This is a bit tricky, he’s basically trying to formulate this ImportXML call: =importxml(”http://api.twitter.com/1/statuses/followers/tomanthonyseo.xml?cursor=-1“, ”//users_list/users/user/screen_name“)

The only problem is that when he’s trying to put in the Xpath like so:

\”//users_list/users/user/screen_name\“ it looks kinda funny. Look at the start of this line: \” - Why is that there? Google will assume the quotations are needed to run this script, but really all we want to do is make this regular text in a cell. To make sure Google doesn’t execute the quotes, we use \ (regex) to escape the proceeding character.

3) Save the script!

4) Go to the Insert Menu and choose drawing

Yep, we’re going to make us a shiny button. Draw away friends and when you’re done we need you to right click on your button and “assign script”

In this instance it’s going to be called loadImport. That’s it, you’re done :)

Chapter 8 - Regular Expressions and data extraction

back to top

ImportXML and meta data seem to disagree. I’m sure you’ve tried helplessly trying to get meta descriptions from different web pages with fruitless results.

Ladies and Gentlemen, there is an answer and it’s not ImportXML but Importdata and Regex.

Importdata is a function in Google docs that allows you to import the entire source code of a webpage directly into the spreadsheet. Here’s how it works:

=Importdata(“http://www.example.com”) < Simply call the function and ensure the URL is between the quotes and parentheses and you’re done.

It’s taken over my entire spreadsheet!!?? No sweat, just wrap it in the concatenate function like this:

=Concatenate(Importdata(“http://www.example.com”))

There are a few Regex functions available to you in Gdocs, but for the sake of this tutorial we’ll be working with =Regexreplace(text, regular expression, replace)


We strongly advise you to read about Regex and test your skillz.

However, we’ll break down the following formula to save you the pain of importing meta descriptions...this time ;)


In order to pull out meta descriptions from a web page you need to first Importdata then reference that cell with this formula:

=REGEXREPLACE(Importdata cell,“(.*)(meta name=.description. content=.)(.*?)(\x22)(.*)”,“$3”)

As always, have a poke around with the Google Doc to understand what’s happening.

Chapter 9 - Efficiently lightening the load on the spreadsheet

back to top

You might have the most complex super duper Google docs spreadsheet, but it’s soooo slow. How do we speed this thing up?

  1. Store formulas in scripts
  2. Use script triggers (buttons) to populate cells with formulas
  3. Hide the cells that you don’t need to see. Trust us, it works
  4. Incorporate “clear” functions - all you need to do is set “” values in cells for each set of importXML calls, this helps with clearing out old data and formulas that have gone wrong
As we find more speed fixes we’ll continue to update this section.

Pre-Made Toys

back to top

This is a collection of free tools made by the folks at Distilled, friends, and other very talented people who have shared their tools with us.

Have one you’d like to submit? Tweet @dsottimano, or add it in the comments below and I’ll be happy to look at it!

Tom Critchlow

David Sottimano Tom Anthony  


Resources & Further Learning

back to top

Reference

Blog Posts  


Special Thank-you’s

back to top

A very special thank you goes to the guy who mentored me: Tom Critchlow. Another thank you goes to Tom Anthony for helping me hack tools together everyday at the office. They have both generously contributed to this guide and it wouldn’t have been possible without them. Also, I love solid delivery

 


FAQ

back to top

We need more questions folks, feel free to submit as comments!

Q: How can I add a pause or sleep function in Google Docs with ImportXML? A: There is a function in Google scripts called utilities.sleep and will allow you to pause for a maximum of 5000 milliseconds