Building Your Own Scraper for Link Analysis

As an SEO, there are a lot of questions I’d like to answer that don’t have simple solutions. Many of them involve profiling a list of URLs based on the information found in their page content. Looking at the available tools, I came up with the idea of using a Google Custom Search Engine (GCSE), but that wasn’t a full solution. I wanted to answer more advanced questions with a method that would output the results in a usable format.

Some Tasks I’d Like to Automate

  • Mining for bloggers in a particular niche.
  • Finding link sources from particular CMS systems.
  • Building a seed set of keywords for keyword research.
  • Evaluating the relevance of a linking URL.
  • Tracking the use of embeddable content (widgets, infographics, videos, etc.).
  • Breaking down a link network.

So last weekend, my developer friend Matthew Callis, owner of Super Famicom, and I set out to develop a basic scraper as a proof of concept for more sophisticated link analysis.

Caveats: This is a proof of concept, not a polished tool. It requires some technical and programming knowledge to leverage. It’s also not designed with scalable efficiency in mind, but the concept could be expanded on by anyone with programming chops. It’s all up on GitHub, open source, and free to use.

Solving the First Question

As a link builder, I’d like to develop a list of bloggers in a niche who may be linking to my competitors. These are blogs I could target for guest blogging, content pitching, blog commenting, or social networking. A tool like Open Site Explorer gives you an export of linking domains, but no specific information about the content on those domains. I could drop up to 5,000 of these into a GCSE and perform advanced search queries to find blogs, but this isn’t automated and there is no pretty export.

So first we looked at ways of identifying the platform behind a site. There are many, none of them perfectly accurate, but a robust script could account for several of them at once. For this tool, we started with the generator meta tag output by many major CMSes, and while building it out we added several other checks as well.

WordPress generator code

So I wanted a tool that would process a list of links, determine the CMS behind each one, and output that data back to me in a CSV while keeping the original OSE data intact.
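
The generator check itself can be very simple. As a rough sketch (not the exact code in seemes.php), a generator meta tag can be pulled out with a regular expression like this, assuming the page source is already in an $html variable:

<?php
// Rough sketch of a generator-tag check (not the exact code from seemes.php).
// Assumes the name attribute appears before content, which is the common order.
function detect_generator($html) {
    if (preg_match('/<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)["\']/i', $html, $match)) {
        return $match[1]; // e.g. "WordPress 3.3.1"
    }
    return 'unknown';
}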

Basics of Our Goal

Our goal is simple.

scraper flow chart

The program will go to every URL, cache its content, and then parse the code for anything I’m looking for. In this example, it’s the CMS, but this can be a wide range of information, which I’ll get into later.
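
In simplified PHP terms, the loop looks something like this. It’s a sketch rather than the real implementation in seemes.php: get_urls_from_csv() is a stand-in helper, and detect_generator() is the check sketched above.

<?php
// Simplified sketch of the fetch -> cache -> parse loop.
// get_urls_from_csv() is a stand-in helper; the real logic lives in seemes.php.
$urls    = get_urls_from_csv('input.csv', 7); // skip OSE's header rows
$results = array();

foreach ($urls as $url) {
    $cache_file = 'cache/' . md5($url) . '.html';

    if (file_exists($cache_file)) {
        $html = file_get_contents($cache_file);   // reuse the cached copy
    } else {
        $html = file_get_contents($url);          // download the page
        file_put_contents($cache_file, $html);    // cache it for later runs
    }

    $results[$url] = detect_generator($html);     // parse for whatever footprint we want
}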

-> Get Seemes on GitHub – Thanks to Matthew for his work on this tool.

Aside: This tool is built in PHP, which is the language I know and can run easily from my Mac. You can run it on Windows using something like WAMP or on a server. You can also read about installing PHP on Windows.

The files:

  • custom.php – A simple command-line wrapper for the Seemes class (seemes.php), based on the command_line_example.php provided on GitHub.
  • seemes.php – The Seemes class, which scans and parses our CSV for the data we need.

You can find this code on GitHub. It’s fairly well documented for anyone interested in digging in. I recommend looking at the source, because that’s really where a big chunk of this post lives.

How to Use

If you’re a Mac user, it’s easy to use. If not, get a Mac, or use the links above to install it on your Windows machine.

starting script

To get started, I first changed the file permissions on the files using chmod 0777 so the script is executable. Then, to run the script, pass it the input file and set the offset:

./custom.php input.csv offset output.csv

You can specify the output file name, but it defaults to output.csv, so I didn’t define one in my example. The offset tells the script what row to start at. Since Open Site Explorer outputs a CSV that starts the URLs at row 8, I set the offset to 7.

From there, it’s just a matter of waiting for the script to do all the heavy lifting.

running script

The OSE CSV for distilled.co.uk had 1,156 linking root domains, and the script took 47 minutes to complete. The script currently caches page content, which lets you parse the same pages again without having to re-download them. This information could also be pushed into a database, but we didn’t build out that functionality for this post.

The script is currently designed to pull information on CMS and analytics. It can be set up to check for all sorts of information, which I’ll get into later, but let’s first deal with our example.

Finding Blogs

Our basic setup allows us to easily find blogs linking to distilled.co.uk. Although not perfectly accurate (some sites remove the generator tag, and not all WordPress sites are blogs, for example), this is a good starting point for discovering new blogs.

list of WordPress sites linking to distilled.co.uk

So if you’re looking at the “SEO” niche, here is a nice set of blogs to start with. You could compile outputs from multiple competitors and even cross-check linking domains to find link hubs in your niche. These are prime targets for outreach.

Other Uses For CMS

We only used a limited number of CMS checks, but there are plenty of other footprints that could be used to identify a CMS. These include things like CMS-specific folders (wp-admin, wp-content), login pages, and themes; a rough sketch of this kind of check follows the list below. A much more robust system could be built out to identify the CMS running a site. Here are a few reasons I can think of that this might be useful:

  • Find forums – Communities for participation, networking, content sharing, and maybe link dropping.
  • Find social CMS – An example might be Pligg, which runs a lot of the Digg clone sites.
  • CMS with profile pages – Maybe a chance for a link drop.
  • CMS with dofollow – If you know that certain CMSes have common ways of getting a dofollow link, you could add checks for these.
  • Wiki sites – Find wiki sites for link building and article creation.
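
Here’s a rough sketch of what a footprint-based check might look like. The footprints listed are common conventions, not a definitive or guaranteed list, and they will produce some false positives and misses.

<?php
// Rough sketch of footprint-based CMS detection beyond the generator tag.
// The footprints below are common conventions, not guarantees.
function detect_cms_by_footprint($html) {
    $footprints = array(
        'WordPress' => array('/wp-content/', '/wp-admin/', 'wp-login.php'),
        'Drupal'    => array('sites/default/files', 'Drupal.settings'),
        'vBulletin' => array('vbulletin_global.js', 'Powered by vBulletin'),
        'MediaWiki' => array('/index.php?title=', 'Powered by MediaWiki'),
    );

    foreach ($footprints as $cms => $needles) {
        foreach ($needles as $needle) {
            if (stripos($html, $needle) !== false) {
                return $cms;
            }
        }
    }
    return 'unknown';
}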

What Else You Can Do

Once you have a basic scraper built, you can parse page content for just about anything you might want.

- Check Adoption of Embeds

If I recommend content licensing, embeds, widgets, or infographics as a link building tactic, I may want to track the effectiveness of that campaign. Currently, the way to track this is by setting up alerts, running advanced searches, or digging manually through a link profile.

I can set up this script to check for the existence of a footprint associated with the embed, such as the image name or alt attribute of an infographic, and have the script flag all URLs where this footprint exists.
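
As a rough sketch, run inside the main loop from earlier, the check might look like this. The filename and alt text are made-up placeholders for whatever your embed actually uses.

<?php
// Rough sketch: flag URLs whose cached content contains an embed footprint.
// The filename and alt text below are hypothetical placeholders.
$embed_footprints = array('my-infographic.png', 'alt="My Infographic Title"');

foreach ($embed_footprints as $footprint) {
    if (stripos($html, $footprint) !== false) {
        echo $url . ' contains the embed footprint: ' . $footprint . "\n";
        break;
    }
}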

- Check Link Relevance

Backlink exports from OSE give me basic information like URL and title, but little else in terms of relevance. I can run a check against page content for the mention and use of specific keyword phrases. This could be as simple as checking whether a phrase appears at all, or it could count the number of uses or calculate more complicated metrics. For example, I might want to parse Distilled’s backlinks for every URL that mentions the phrase “link building” in the body content and count the number of times it’s used at each URL.
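
A minimal sketch of that kind of check, again run inside the main loop:

<?php
// Rough sketch: count how many times a phrase appears in the page body.
// strip_tags() is a crude way to limit the count to visible text.
$phrase = 'link building';
$count  = substr_count(strtolower(strip_tags($html)), $phrase);

if ($count > 0) {
    echo $url . ' mentions "' . $phrase . '" ' . $count . " times\n";
}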

- Find Directories

Tom wrote a post on SEOmoz about using OSE outputs to find directory links by using a check against the title and URL in Excel.

This is a great check, but it depends on the word “directory” appearing in the URL or title. A scraper lets you also check page content, directory footprints, and common directory CMSes.
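
A rough sketch of how that might look against page content; the footprint strings are examples, not a complete list.

<?php
// Rough sketch: flag likely directories by URL and on-page footprints.
// The footprint strings are illustrative examples only.
$directory_footprints = array('suggest a link', 'submit your site', 'add your site', 'reciprocal link');

$is_directory = (stripos($url, 'directory') !== false);
foreach ($directory_footprints as $footprint) {
    if (stripos($html, $footprint) !== false) {
        $is_directory = true;
        break;
    }
}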

- Mine for Social Accounts

You could easily check for links to Twitter and Facebook pages and parse out items like Twitter usernames.
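
For example, a quick sketch of pulling Twitter usernames out of a page’s links:

<?php
// Rough sketch: pull Twitter usernames out of twitter.com links on the page.
if (preg_match_all('#twitter\.com/([A-Za-z0-9_]{1,15})#i', $html, $matches)) {
    // filter out common non-profile paths
    $usernames = array_diff(array_unique($matches[1]), array('share', 'intent', 'home', 'search'));
    print_r($usernames);
}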

- Mine for Email Data

You could run pattern matches for email addresses. You could also check for instances of about or contact URLs, store those URLs, and then scrape those pages in search of email addresses.
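
A rough sketch of the email pattern match; the regex is simplified and won’t catch everything.

<?php
// Rough sketch: find email addresses with a simplified pattern that will
// miss some valid addresses and match some junk.
if (preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}/', $html, $matches)) {
    print_r(array_unique($matches[0]));
}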

- Check for Adsense

There are a lot of reasons for this one, but a simple one is to find sites that are attempting to monetize their content. An advertising opportunity, maybe…
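
A sketch of what the check might look like; the footprint strings are common pieces of AdSense code, but ad code changes over time, so treat them as examples.

<?php
// Rough sketch: look for common AdSense footprints in the page source.
$adsense_footprints = array('googlesyndication.com', 'google_ad_client', 'show_ads.js');

$has_adsense = false;
foreach ($adsense_footprints as $footprint) {
    if (stripos($html, $footprint) !== false) {
        $has_adsense = true;
        break;
    }
}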

- Evaluate Penalties

You could use this to evaluate link quality: check for instances of spammy link patterns, such as outbound anchor text matching a list of high-value keywords (paid links?) or negative keyword usage.
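
A rough sketch of the anchor check, assuming a hand-built keyword list; the keywords shown are only illustrative.

<?php
// Rough sketch: flag links whose anchor text matches a hand-built list of
// high-value keywords. A fuller version would also confirm the link points off-site.
$keywords = array('payday loans', 'online casino', 'cheap viagra'); // illustrative list

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings on sloppy markup

foreach ($dom->getElementsByTagName('a') as $link) {
    $anchor = strtolower(trim($link->textContent));
    if (in_array($anchor, $keywords)) {
        echo $url . ' links out with the anchor "' . $anchor . '"' . "\n";
    }
}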

- Build a Keyword Seed List

Compile a list of competitive sites; scrape their meta keywords, meta description, and title to build a seed list for keyword research.
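
A sketch of that scrape using PHP’s built-in get_meta_tags() plus a simple title match:

<?php
// Rough sketch: collect the title, meta description, and meta keywords for a URL.
// get_meta_tags() fetches the URL itself; you could also parse the cached $html.
$meta  = @get_meta_tags($url); // array keyed by meta name, e.g. 'keywords', 'description'
$title = '';
if (preg_match('/<title>(.*?)<\/title>/is', $html, $match)) {
    $title = trim($match[1]);
}

$seed_terms = array(
    'title'       => $title,
    'description' => isset($meta['description']) ? $meta['description'] : '',
    'keywords'    => isset($meta['keywords']) ? $meta['keywords'] : '',
);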

- Cloaking

Scrape the content using different user agents, such as Googlebot and then Chrome, and compare the results. If they’re not the same, the site is doing some form of cloaking based on user agent. This can also be automated to a degree with checksums and diff files.
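
A sketch of that comparison using cURL and two user agents. The user-agent strings are examples, and dynamic content (ads, timestamps) can cause false positives, so a flagged URL deserves a manual diff.

<?php
// Rough sketch: fetch the same URL as Googlebot and as a browser, then compare checksums.
function fetch_as($url, $user_agent) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

$as_googlebot = fetch_as($url, 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
$as_browser   = fetch_as($url, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.75 Safari/535.7');

if (md5($as_googlebot) !== md5($as_browser)) {
    echo $url . " returns different content by user agent - possible cloaking\n";
}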

Where to Next

This script isn’t polished, but it helps demonstrate the value of developing a basic scraper for more sophisticated link analysis. The script is open source and free to use, so feel free to build on top of it. I’ll likely keep working with Matthew to build out some of the more sophisticated checks I discussed in this post. There is a lot you can check for, but I hope this post helps set you in the right direction.

You can always find me on Twitter if you’re interested in talking more about link analysis. I’ll be talking more on this topic at SMX West in March.