I know what’s going through your head: “Noko what? Web scraping?” Learning how to write a web scraping program will significantly up your game as a resourceful, data-driven online marketer. If you’re a web developer of any skill level, you’ll also enjoy this guide, as it will expand your horizons with a bunch of fun new project possibilities.
Web scraping is a programmatic method of extracting data from websites. It is both loathed and loved by web developers and is as much an art as it is a science. When you browse the web you consume a ton of publicly available information. As a user, all of this information is presented to you as unstructured data in the form of HTML documents. Now imagine: what if you could take all of these pages and turn them into structured data, pick out the pieces you like and export it all to a database or spreadsheet?
There are many ways that you, as a novice, can choose to scrape data from websites using free software and your computer. Google Docs ImportXML is a great solution for extracting data from an individual page and it only requires some basic knowledge about HTML and CSS. There are a lot of great resources on this topic if you just need to quickly scrape an individual page:
ImportXML will only get you so far if you need access to a large data set from a site with thousands of pages of data. Developing a simple web scraping program is a better method for gathering a site’s unstructured data at scale when an API, RSS feed or public database is not available.
Ruby, which we’ll be using, is a great language to build your first web scraper with for a couple of reasons:
It’s a very straight-forward and powerful language, so it’s a great place to start for newbies to web development and programming.
It’s got an awesome community of developers with tons of documentation online.
It comes preinstalled on all new Mac computers so there will be minimal work setting up your machine to execute your web scraping program.
So what is Nokogiri? It’s a Ruby gem that will transform a webpage into a Ruby object and make all of this web scraping stuff really easy. Ruby gems are optional add-on libraries of code that members of the Ruby community make available to other developers so that they don’t have to reinvent the wheel each time they build an app with a common use case. We have Aaron Patterson and Mike Dalessio to thank for developing Nokogiri, which makes parsing HTML and XML a breeze.
Use case for marketers
If you’re an online marketer and are still not sold on learning how to build a web scraper I have a couple of use cases for you:
At Distilled we love building data visualizations for our clients and our work is some of the best out there, but building a great interactive piece is not easy. You have to come up with a compelling concept and then you need to gather the data to build your concept on top of. Web scraping opens the door to a whole new world of data sources for these types of projects. You’re no longer limited to working with APIs and publicly available data sources.
SEO use cases
Automate a site audit.
Let’s say you have a ton of pages that you need to optimize for your own site or a client. Nobody wants to spend hours copying and pasting things like title tags, meta-description tags and page headers into a spreadsheet. Instead you could write a web scraping program to loop through the pages of a site and save all this information to a CSV.
Gather all of a site’s URLs from an XML site map
The value of this is pretty self explanatory. A scraping program can help you parse one or many sitemaps and conveniently dump all the URLs into a CSV file for some other use.
Crawl a site to build a keyword list
Pretend your main competitor is ranking really well in SERPs and you want to figure out what keywords they’re targeting across a large number of pages. You could write a web scraping program to crawl a list of pages and aggregate on-page content like product descriptions or SEO blurbs so that you can better understand what terms they may be targeting.
Compare your product catalog
There are many reasons online marketers should gather intelligence on their competitors’ catalogs. Most importantly, you should know how your assortment compares to your competition’s and how this will impact your SEO and PPC strategy.
Gather product information
Amazon, Walmart and most major players in e-commerce use automated web scraping to set their pricing based on their competitors’ listings. These are very sophisticated use cases; however, even simple one-time web scraping efforts can give you access to your competitors’ average prices in a category, their most popular items and which items receive the best product reviews.
Now that you’ve learned a bit more about web scraping, its use cases and Nokogiri, let’s build a simple web scraping program together.
In this tutorial we’ll write a simple web scraping program in Ruby that uses Nokogiri. Our objective will be to scrape the headline text from the 100 most recent listings in the Pets section of Craigslist NYC. At the end of this tutorial we’ll have 100 headlines neatly stored in a CSV file. Please note that this tutorial is written from the point of view of a Mac user. The code used in the screenshots of this tutorial is available in this GitHub Repo.
To complete the tutorial in the next section of this post you will need the following:
A computer with the Ruby programming language installed on it. If you have a Mac you’re already set.
A text editor to write your Ruby web scraping program in. If you don’t already have one on your machine, I recommend downloading Sublime Text. Sublime Text has lots of cool features to make coding a more enjoyable experience.
A basic understanding of HTML and CSS. Specifically, how web developers apply CSS classes and IDs to HTML elements. If this is new to you I recommend completing the free HTML and CSS course from Codecademy. W3schools is also a great resource if you’re new to web development.
Some baseline knowledge of Ruby (optional). If Ruby is totally new to you then check out Codecademy’s Ruby track for free. You should still be able to follow along without this though.
Step 1 - Install dependencies
Let’s get started by installing three Ruby gems that we’ll use for our web scraper: Nokogiri, HTTParty and Pry.
We’ve already discussed Nokogiri, so let’s talk about HTTParty and Pry. HTTParty is the gem our web scraper will use to send an HTTP GET request to the page(s) we’re scraping, which returns all the HTML of the page as a string. Pry is an awesome Ruby debugging gem that we’ll use throughout this tutorial to help us parse the code from the site we’re scraping.
Install each of these gems on your machine by opening your terminal and running the following commands:
gem install nokogiri
gem install httparty
gem install pry
Hint: If you don’t know how to open your terminal, hit Command + Spacebar, type “terminal” and then hit Enter.
Depending on how your Ruby environment is configured, you may have to run these commands with “sudo”, such as:
sudo gem install nokogiri
This will prompt you to input your machine’s password. I do not recommend you do this if anyone else uses your computer for web development, as you’re altering your computer’s Ruby environment.
Step 2 - Create your scraper file and require dependencies
Create a folder on your desktop (or wherever you prefer) called nokogiri_tutorial. Next open up Sublime Text or another text editor and save a Ruby file to this folder called “web_scraper.rb”. Next we’ll require all the dependencies our program needs, as shown below:
Step 3 - Send an HTTP request to the page
Let’s create a variable called “page” and set it equal to an HTTParty GET request of the URL we’re going to scrape. After this insert “Pry.start(binding)”.
In your terminal navigate to the folder that contains your web_scraper.rb file. If you saved the folder to your desktop you can get there by opening your terminal and entering the following command:
cd ~/Desktop/nokogiri_tutorial
Then, run your web scraping program by entering the following command in your terminal:
ruby web_scraper.rb
Your terminal screen should now be in Pry and look like the following:
In your terminal type “page” and this will return, as a string, the HTML of the New York Craigslist page we’re scraping.
Pretty cool, right?! We just accessed a website using a program we wrote instead of using our browser. But this giant string is not of much use to us if we want to scrape all the pet listings in New York. Before we move on to step four, type “exit” in your terminal to leave Pry and return to the folder your program is located in.
Step 4 - Use Nokogiri
Now let’s use Nokogiri to convert the HTML string of the pet listings page into a Nokogiri object so that we can begin parsing the data and extracting the information we want from the page. To do this, create a new variable called “parse_page” and set it equal to Nokogiri’s method for converting an HTML string into a Nokogiri object. Leave your Pry at the bottom of the document:
Save your Ruby file and run it again in your terminal. Pry will open again and this time you will enter the name of the new variable we created, “parse_page”. This will return the Craigslist page as a Nokogiri object and you should see something similar to the image below.
At this point I recommend creating an HTML file in the same folder called pets.html and copying and pasting what was returned from parse_page into this file. This formatted HTML will come in handy for reference purposes when we begin parsing the data. Before beginning the next section remember to exit out of Pry in your terminal.
Step 5 - Parsing the data
This next part is where some basic programming knowledge will come in handy. Before we begin parsing the data let’s create an array called “pets_array”. After we parse the code from the parse_page variable we’ll push each individual Craigslist NYC pets listing into this array and then eventually into a CSV file.
Our parse_page variable is a Nokogiri object, which lets us use Nokogiri’s .css method to drill down into the HTML document and find the headline text of all the pet listings (you can find all of Nokogiri’s methods in its documentation). This is where the pets.html file we saved in step 4 will come in handy. We first need to locate the HTML element that all of the pet listings are in. You can also do this by using the inspect element tool in Chrome or by viewing the page source code.
It looks like all of our listings are located in a div with class="content". We select this div by using the .css method on our parse_page object like so:
Move your “Pry.start(binding)” before this new line of code and then enter it manually to see what it returns. You should see just the HTML within this div. So let’s drill deeper! Each listing is in its own div with class="row", and the text for the listing within each div is in an anchor tag (<a></a>) with class="hdrlnk". We can extract these anchor tags that wrap around our text by chaining additional .css methods to the first one like so:
If you do this in Pry, you’ll see a Nokogiri object that contains all the anchor tags that had the class of hdrlnk. It will look something like this in your terminal:
So we need to get all the headline text out of this object (that’s the green text above), convert it to a string and push it into an array so that we can ultimately move each headline from the array into a CSV file. To do this I’m going to use Nokogiri’s .text method and Ruby’s .map method on each anchor tag in the Nokogiri object as shown below:
Run the program in your terminal. When Pry opens, type in the name of the blank array we created at the beginning of step 5, “pets_array”:
In your terminal you should see a beautiful array with the headlines of all the pet listings like the one below.
Step 6 - Exporting data to a CSV
Success! You’ve scraped a website and converted unstructured data to structured data! From here you could take many paths. You could scrape more information about each post and turn each post into an object with more attributes than just the headline text. That’s a little more advanced, but I encourage you to figure it out. You could also create a more complicated scraper that uses a defined list of pages, or a while loop, to iterate through dozens, hundreds or even thousands of pages on the site and scrape many more pet listings.
For now let’s just focus on getting this array of listings into a CSV file so that you can manipulate them however you want in Microsoft Excel. If you can complete this next step you can use your new Nokogiri scraping skills to quickly rip data from all sorts of new sources that previously would have required you to manually extract it via copy and paste.
Let’s go back to our terminal. If you’re still in Pry be sure to exit out so that your terminal is in the nokogiri_tutorial folder that contains your Ruby scraping program and the pets.html file. Type the following command into your terminal:
You now have a blank CSV file that we can save the data from pets_array to. To do this, we’re going to remove the Pry.start(binding) from our program and add one more section of code:
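The export could look roughly like the sketch below, which uses Ruby’s built-in CSV library. The filename pets.csv is hypothetical (use the name of the file you created), and the two sample headlines stand in for the array built in step 5.

```ruby
require 'csv'

# Sample data standing in for the pets_array built in step 5.
pets_array = ['Friendly tabby kitten', 'Golden retriever puppy']

# Hypothetical filename; replace it with the name of your own CSV file.
csv_path = 'pets.csv'

# Open the file for writing and add one headline per row.
CSV.open(csv_path, 'w') do |csv|
  pets_array.each { |headline| csv << [headline] }
end
```

Writing csv << pets_array instead would put every headline into a single row, one per column. One headline per row is usually easier to sort and filter in a spreadsheet.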
You should now be able to open your CSV file in Sublime Text or Excel and see all the pet listings. Congrats!
For more info on legality, I'd start here.
Combine your web scraping with these cool things
Want to turn your web scraper into a scraping bot? Combine your web scraping program with another Ruby gem called Mechanize. Mechanize will allow your program to fill out forms and mimic other tasks normal users must complete to access content. Mechanize can be used to download images and other content, but (as before) only do this if you are in compliance with the site’s terms and conditions.
So, there you have it. Give it a whirl and let me know what you think in the comments below.