Building Your Own Scraper for Link Analysis

As an SEO, there are a lot of questions I’d like to answer that don’t have simple solutions. Many of them involve profiling a list of URLs based on the information in their page content. Working with available tools, I came up with the idea of using a Google Custom Search Engine (GCSE), but this wasn’t a full solution. I wanted to answer more advanced questions using a method that would output the results in a usable format.

Some Tasks I’d Like to Automate

  • Mining for bloggers in a particular niche.
  • Finding link sources from particular CMS systems.
  • Building a seed set of keywords for keyword research.
  • Evaluating the relevance of a linking URL.
  • Tracking the use of embeddable content (widgets, infographics, videos, etc.).
  • Breaking down a link network.

So last weekend, along with my developer friend Matthew Callis, owner of Super Famicom, I set out to develop a basic scraper as a proof of concept for more sophisticated link analysis.

Caveats: This is a proof of concept, not a polished tool. It requires some technical and programming knowledge to leverage. It’s also not designed with scalable efficiency in mind, but the concept could be expanded on by anyone with programming chops. It’s all up on GitHub, open source, and free to use.

Solving the First Question

As a link builder, I’d like to develop a list of bloggers in a niche who may be linking to my competitors. These are blogs I could target for guest blogging, content pitching, blog commenting, or social networking. A tool like Open Site Explorer gives you a list of linking domains, but no specific information about the content on those domains. I could drop up to 5,000 of these into a GCSE and perform advanced search queries to find blogs, but this isn’t automated and there is no pretty export.

So first we looked at ways of identifying a site’s platform. There are many, and none are perfectly accurate, but a robust script could account for several of them at once. For this tool, we started with the generator meta tag output by many major CMS platforms. While building it out, we added several other checks as well.

[Image: WordPress generator meta tag]
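
For illustration, here’s a minimal sketch of what a generator-tag check could look like in PHP. This is not the actual Seemes code; the regular expression and the CMS list are my own assumptions, and real-world markup will trip it up in places (attribute order, odd quoting, and so on).

<?php
// Hypothetical generator-tag check (not the code from seemes.php).
function detect_cms_from_generator($html)
{
    // Pull the content attribute out of a <meta name="generator" ...> tag.
    // Assumes name comes before content, which is common but not guaranteed.
    if (!preg_match('/<meta[^>]*name=["\']generator["\'][^>]*content=["\']([^"\']+)["\']/i', $html, $match)) {
        return 'Unknown';
    }

    $generator = $match[1];
    foreach (array('WordPress', 'Joomla', 'Drupal', 'vBulletin', 'MediaWiki') as $cms) {
        if (stripos($generator, $cms) !== false) {
            return $cms;
        }
    }

    return $generator; // unrecognized generator string; return it as-is
}

// Usage (hypothetical):
// $html = file_get_contents('http://example.com/');
// echo detect_cms_from_generator($html); // e.g. "WordPress"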

So I wanted a tool that would process a list of URLs, determine the CMS behind each one, and output that data back to me as a CSV, while keeping the OSE data intact.

Basics of Our Goal

Our goal is simple.

[Image: scraper flow chart]

The program will go to every URL, cache its content, and then parse the code for anything I’m looking for. In this example, it’s the CMS, but this can be a wide range of information, which I’ll get into later.
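
To make the flow concrete, here’s a rough sketch of that loop in PHP. It is not the real seemes.php; the input file, the cache folder, and the detect_cms_from_generator() helper (from the sketch above) are assumptions for the example, and a cache/ directory is assumed to exist.

<?php
// Sketch of the fetch -> cache -> parse loop (not the actual Seemes implementation).
$urls = array_map('trim', file('urls.txt')); // hypothetical input: one URL per line

foreach ($urls as $url) {
    $cacheFile = 'cache/' . md5($url) . '.html';

    if (file_exists($cacheFile)) {
        // Re-use the cached copy instead of downloading again.
        $html = file_get_contents($cacheFile);
    } else {
        $html = @file_get_contents($url);
        if ($html === false) {
            continue; // skip URLs that fail to download
        }
        file_put_contents($cacheFile, $html);
    }

    // Parse the page for whatever we're looking for -- here, the CMS.
    $cms = detect_cms_from_generator($html);
    echo $url . ',' . $cms . PHP_EOL; // crude CSV-style output
}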

-> Get Seemes on GitHub – Thanks to Matthew for his work on this tool.

Aside: This tool is built in PHP, which is the language I know and can run easily from my Mac. You can run it on Windows using something like WAMP or on a server. You can also read about installing PHP on Windows.

The files:

  • custom.php – A simple command-line wrapper for the Seemes class (seemes.php), based on the command_line_example.php provided on GitHub.
  • seemes.php – The Seemes class that scans and parses our CSV for the data we need.

You can find this code on GitHub. It’s fairly well documented for anyone interested in digging in. I recommend looking at the source, because that’s really where a big chunk of this post lives.

How to Use

If you’re a Mac user, it’s easy to use. If not, get a Mac, or use the links above to get PHP running on your Windows machine.

[Image: starting the script]

To get started, I first changed the permissions on the files with chmod 0777. Then, to run the script, pass it the input file and set the offset:

./custom.php input.csv offset output.csv

You can specify the output file name, but it defaults to output.csv, so I didn’t define one in my example. The offset tells the script which row to start at. Since Open Site Explorer outputs a CSV with the URLs starting at row 8, I set the offset to 7.

From there, it’s just a matter of waiting for the script to do all the heavy lifting.

[Image: script running]

The OSE CSV for distilled.co.uk had 1,156 root linking domains, and the script took 47 minutes to complete. The script currently caches page content, which lets you parse the pages again without having to download them again. This information could also be written to a database, but we didn’t build out that functionality for this post.

The script is currently designed to pull information on CMS and analytics. It can be set up to check for all sorts of information, which I’ll get into later, but let’s first deal with our example.

Finding Blogs

Our basic setup allows us to easily find blogs linking to Distilled.co.uk. Although not perfectly accurate (some sites remove the generator tag, and not all WordPress sites are blogs, for example), this is a good starting point for discovering new blogs.

[Image: list of WordPress sites linking to distilled.co.uk]

So if you’re looking at the “SEO” niche, here is a nice set of blogs to start with. You could compile outputs from multiple competitors and even cross-check linking domains to find link hubs in your niche. These are prime targets for outreach.

Other Uses For CMS

We only used a limited number of checks for CMS, but there are a number of other footprints that could be used to identify a CMS. These include things like CMS-specific folders (wp-admin/wp-content), login pages, and themes. A much more robust system could be built to identify the CMS running a site; a rough footprint check is sketched after the list below. There are a few reasons I can think of that this might be useful:

  • Find forums – Communities for participation, networking, content sharing, and maybe link dropping.
  • Find social CMS – An example might be Pligg, which runs a lot of the Digg clone sites.
  • CMS with profile pages – Maybe a chance for a link drop.
  • CMS with dofollow – If you know that certain CMS platforms have common ways of getting a dofollow link, you could add checks for these.
  • Wiki sites – Find wiki sites for link building and article creation.
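
To give a flavor of what a footprint check beyond the generator tag might look like, here’s a small sketch. The footprints listed are common ones, but treat them as my assumptions rather than the checks in the released script.

<?php
// Hypothetical footprint checks for pages whose generator tag is missing or stripped.
function detect_cms_from_footprints($html)
{
    $footprints = array(
        'WordPress' => array('/wp-content/', '/wp-includes/'),
        'vBulletin' => array('vbulletin_global.js'),
        'MediaWiki' => array('/index.php?title='),
        'Pligg'     => array('pligg'),
    );

    foreach ($footprints as $cms => $needles) {
        foreach ($needles as $needle) {
            if (stripos($html, $needle) !== false) {
                return $cms;
            }
        }
    }

    return 'Unknown';
}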

What Else You Can Do

Once you have a basic scraper built, you can parse the content for just about anything you might want.

- Check Adoption of Embeds

If I recommend content licensing, embeds, widgets, or infographics as a link building tactic, I may want to track the effectiveness of the campaign. Currently, tracking this means setting up alerts, running advanced searches, or digging manually through a link profile.

I can set up this script to check for the existence of a footprint associated with the embed, such as the image file name or alt attribute of an infographic, and have it flag every URL where that footprint appears.
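
A check like that can be as simple as flagging any cached page that contains a known string from the embed code. The file name and alt text below are placeholders, not real footprints.

<?php
// Hypothetical embed check: flag pages that contain our widget/infographic footprint.
function has_embed($html)
{
    $footprints = array(
        'my-infographic.png',         // placeholder image file name
        'alt="My Infographic Title"', // placeholder alt attribute
    );

    foreach ($footprints as $footprint) {
        if (stripos($html, $footprint) !== false) {
            return true;
        }
    }

    return false;
}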

- Check Link Relevance

Backlink exports from OSE give me basic information like URL and title, but little else in terms of relevance. I can run a check against page content for the mention and use of specific keyword phrases. This could be as simple as checking whether a phrase is used at all, or it could look at the number of uses or more complicated metrics. For example, I might want to parse Distilled’s backlinks for every URL that mentions the phrase “link building” in the body content and count the number of times it’s used at each URL.
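
A crude version of that check is just a case-insensitive count of the phrase in the page text, something like the sketch below (tags are stripped first so markup and attribute values don’t inflate the count):

<?php
// Count how many times a phrase appears in a page's text -- a rough relevance signal.
function count_phrase($html, $phrase)
{
    $text = strtolower(strip_tags($html));
    return substr_count($text, strtolower($phrase));
}

// Usage (hypothetical cache path):
// $html = file_get_contents('cache/' . md5($url) . '.html');
// echo count_phrase($html, 'link building');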

- Find Directories

Tom wrote a post on SEOmoz about using OSE outputs to find directory links by using a check against the title and URL in Excel.

This is a great check, but it depends on the word “directory” appearing in the URL or title. A scraper lets you also check page content, directory footprints, and common directory CMS platforms.

- Mine for Social Accounts

You could easily check for links to Twitter and Facebook pages and parse out items like Twitter usernames.
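
For instance, a quick regex against each cached page could pull out linked Twitter profiles. The pattern below is a simplification (it skips twitter.com/share buttons but will still miss some URL formats):

<?php
// Extract Twitter usernames from links like http://twitter.com/someuser
function find_twitter_usernames($html)
{
    preg_match_all('~https?://(?:www\.)?twitter\.com/(?!share)([A-Za-z0-9_]{1,15})~i', $html, $matches);
    return array_unique($matches[1]);
}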

- Mine for Email Data

You could pattern match for email addresses. You could also check for about or contact URLs, store those URLs, and then scrape those pages in search of email addresses.
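
A rough version of both steps in PHP might look like this. The email regex is deliberately simple and will produce some false positives:

<?php
// Find email addresses and likely about/contact page links in a page.
function find_emails($html)
{
    preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $html, $matches);
    return array_unique($matches[0]);
}

function find_contact_urls($html)
{
    // Grab hrefs that look like about/contact pages so they can be scraped next.
    preg_match_all('/href=["\']([^"\']*(?:about|contact)[^"\']*)["\']/i', $html, $matches);
    return array_unique($matches[1]);
}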

- Check for Adsense

There are a lot of reasons for this one, but a simple one is finding sites that are trying to monetize their content. An advertising opportunity, maybe…
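
One simple way to spot it is to look for the AdSense script include or a publisher ID in the page source. The footprints below are common ones, but treat them as assumptions rather than an exhaustive list.

<?php
// Rough AdSense check: look for the ad-serving domain, the ad client variable,
// or a publisher ID (ca-pub-...).
function has_adsense($html)
{
    return stripos($html, 'googlesyndication.com') !== false
        || stripos($html, 'google_ad_client') !== false
        || preg_match('/\bca-pub-\d+\b/', $html) === 1;
}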

- Evaluate Penalties

You could use this to evaluate link quality.

Check for instances of spammy link patterns, such as outbound anchor text that matches a list of high-value keywords (paid links?) or negative keyword usage.
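
A sketch of that kind of check: pull the anchor text out of each link and flag anything that matches a list of suspicious keywords (the keyword list is just an example):

<?php
// Flag anchor text that matches a list of suspect keywords.
function find_spammy_anchors($html, array $keywords)
{
    $hits = array();
    preg_match_all('/<a\b[^>]*>(.*?)<\/a>/is', $html, $matches);

    foreach ($matches[1] as $anchor) {
        $anchor = strtolower(trim(strip_tags($anchor)));
        foreach ($keywords as $keyword) {
            if ($anchor !== '' && strpos($anchor, strtolower($keyword)) !== false) {
                $hits[] = $anchor;
            }
        }
    }

    return array_unique($hits);
}

// Usage (example keyword list):
// find_spammy_anchors($html, array('payday loans', 'casino', 'viagra'));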

- Build a Keyword Seed List

Compile a list of competitive sites; scrape their meta keywords, meta description, and title to build a seed list for keyword research.
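
A minimal version of that extraction using PHP’s DOM extension might look like the sketch below (it assumes you’re willing to suppress libxml warnings on messy real-world markup):

<?php
// Pull title, meta keywords, and meta description out of a page.
function extract_seed_terms($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // real-world HTML is rarely valid; ignore parser warnings

    $out = array('title' => '', 'keywords' => '', 'description' => '');

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $out['title'] = trim($titles->item(0)->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'keywords' || $name === 'description') {
            $out[$name] = trim($meta->getAttribute('content'));
        }
    }

    return $out;
}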

- Cloaking

Scrape the content using different user agents, such as Googlebot and then Chrome, and compare the results. If they’re not the same, the site is doing some form of cloaking based on user agent. This can be automated to a degree with checksums and diffs.
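
Here’s a rough sketch using cURL with two user agents and a checksum comparison. The user agent strings are examples, and since many pages legitimately vary between requests (ads, timestamps, session tokens), a straight md5 comparison will throw false positives; in practice you’d diff the content or strip out the noisy parts first.

<?php
// Fetch a URL while presenting a particular user agent.
function fetch_as($url, $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$url = 'http://example.com/'; // hypothetical target

$asGooglebot = fetch_as($url, 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
$asBrowser   = fetch_as($url, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7) AppleWebKit/535 (KHTML, like Gecko) Chrome/16 Safari/535');

// Naive comparison: identical checksums mean identical content.
if (md5($asGooglebot) !== md5($asBrowser)) {
    echo "Content differs by user agent -- possible cloaking (or just dynamic content).\n";
}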

Where to Next

This script isn’t polished, but helps demonstrate the value of developing a basic scraper for more sophisticated link analysis. The script is open source and free to use, so feel free to build on top of it. I’ll likely keep working with Matthew to build out some of the more sophisticated checks I discussed in the post. There is a lot that you can check, but I hope this post helps set you in the right direction.

You can always find me on Twitter if you’re interested in talking more about link analysis. I’ll be talking more on this topic at SMX West in March.


15 Comments

  1. I built something similar using Google Apps Script, which runs from Google Docs. So you load up a Google spreadsheet with URLs, and it fetches them and examines them for useful things.

    I'm ashamed to say this script will find email addresses, email a customised email to any addresses found, and identify pages that should be looked at by a human in more detail. It was a proof of concept more than something I think is a useful strategy!

    Here is the code as a starter for ten... (also not a polished script!). There are issues with fetching a lot of URLs via Google Apps Script, which is why the script makes a note of how far down the list it got, so you can restart it without repeating URLs. I hope it spurs you on to play with Google Apps Script for good, not evil :)

    function clarion() {
      var emailblob = "";
      var ss = SpreadsheetApp.getActiveSpreadsheet();
      var sheet = ss.getSheets()[0];
      var lastrow = sheet.getLastRow();
      var i = 0;
      var templateSheet = ss.getSheets()[1];
      var emailTemplate = templateSheet.getRange("A1").getValue();
      var start = sheet.getRange("F1");

      //Browser.msgBox(lastrow);

      for (i = start.getValue(); i <= lastrow; i++) {
        // Browser.msgBox(i);
        var Acellrange = sheet.getRange("A" + i);
        var Bcellrange = sheet.getRange("B" + i);
        var Ccellrange = sheet.getRange("C" + i);
        var Dcellrange = sheet.getRange("D" + i);

        var url = Acellrange.getValue();

        //var msg = "The URL is " + url;
        // Bcellrange.setValue(msg);

        try {
          var response = UrlFetchApp.fetch(url);
        }
        catch (e) {
          //Browser.msgBox("Removing bad url row:" + url);
          Bcellrange.setValue("Bad URL");
        }

        if (!(Bcellrange.getValue() == "Bad URL") && !(Dcellrange.getValue() == "Processed")) {
          Dcellrange.setValue("Processed");
          var content = response.getContentText();
          var content = content.toLowerCase();
          var woohoo = findEmailAddresses(content);

          Ccellrange.setValue(woohoo);
        }
      }
    }

    function findEmailAddresses(StrObj) {
      var separateEmailsBy = ", ";
      var email = ""; // if no match, use this
      var emailsArray = StrObj.match(/([a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+.[a-zA-Z0-9._-]+)/gi);
      if (emailsArray) {
        email = "";
        for (var i = 0; i < emailsArray.length; i++) {
          if (i != 0) email += separateEmailsBy;
          email += emailsArray[i];
        }
      }
      return email;
    }

    function fillInTemplateFromObject(template, theurl) {
      var email = template;
      // Search for all the variables to be replaced, for instance ${"Column name"}
      // var templateVars = template.match(/${"[^"]+"}/g);
      var templateVars = "${yourpage}";

      // Replace variables from the template with the actual values from the data object.
      // If no value is available, replace with the empty string.
      // normalizeHeader ignores ${"} so we can call it directly here.
      email = email.replace(templateVars, theurl || "");

      return email;
    }

    function doemails(startrow) {

      // Browser.msgBox(w... [continued below]

    [continued from above] ...oohoo);
      if (!(woohoo == "") && !(Bcellrange.value == "Email sent") && !(emailblob.match(woohoo))) {
        //Browser.msgBox(Bcellrange.value);
        var emailblob = emailblob + woohoo;
        // Browser.msgBox(emailblob);
        var emailText = fillInTemplateFromObject(emailTemplate, Acellrange.value);
        var emailSubject = "Quick and easy link exchange";

        try {
          MailApp.sendEmail(woohoo, emailSubject, emailText);
        }
        catch (e) {
          //Browser.msgBox(e);
        }

        Bcellrange.setValue("Email sent");
      }
      else if (content.match("Links Engine Powered By|add url|add site| add your site|link exchange|exchange link|mylinkhelper|linx|reciprocal link")) {
        Bcellrange.setValue("Worth investigation");
      }
    }

  3. Seriously bad &** stuff. Thanks for posting this.

  4. Nice scripts and thanks for sharing. How long did it take to build this little prototype?

  5. Very nice script; I really appreciate the work you did detecting the CMS type and all the other data. Creating scrapers is a huge advantage for SEOs; you can get excellent data in a very short time.

    For developers who want to increase the speed of the script, think about launching several processes at the same time to fetch the data ;)

  6. Excellent info Justin! I find it v.interesting as I'm working on a scraper myself in PHP which allows me to scrape and search the sites I input. You've mentioned some really good and useful points. Looking forward to more :P

  7. Interesting script; it got me thinking about creating my own using Ruby. There is a project out there, "http://anemone.rubyforge.org/", which could fit quite nicely. BTW, there are a couple of errors in the script. I can't remember what they were offhand, but if you add error_reporting(-1) to the top of the custom.php script, they will get reported. If anyone out there is interested in collaborating on something like this, please contact me.

  8. Hey Justin, very good! Thanks!

  9. Justin,

    Great article! I'm currently building a PHP script just to go out and find the backlinks. This identifies some great next steps. And Jeremy Webb -- thanks for all you shared in your comment -- great Google Apps Script!

    For anyone wanting to learn more about building screen scrapers, I highly recommend Michael Schrenk's book "Webbots, Spiders, and Screen Scrapers". Published in 2007 but not showing its age at all :)

  10. Chris

    The only problem I'm having is that it refuses to recognize my input.csv file in Windows 7.

  11. Mike

    So I just watched Justin's webinar on SEOmoz and discovered this. I'd actually made something similar in PHP while I was at Razorfish that I called "LinkScrape." It would grab 22 fields of data on every URL, including whether the page was spam, the number of links, and the link position, and it would guess what the ideal anchor text and target would be based on what the page was about.

    As you can imagine, it's a very taxing script that could have used some optimization, or at least needed to be built in the cloud... but anyway, I'm thinking of doing all that, and your meta generator check is a GREAT call that I will be adding to it.

    Thanks for sharing!

  12. Red

    WGET - all ya need ;)

  13. Woah! This is seriously good stuff. *virtually drools*

  14. Justin, nice write-up. I also think that discovering additional attributes about link networks is hugely valuable: knowing what types of links your competitors / market have, and giving links attributes beyond basic anchor text, such as page and site type, social data, etc.

    Have you tried Linkdex before doing this exercise? From what you are describing, you will find that Linkdex has those features built in and gets unique market intelligence beyond anything I've used before. Thought it may be helpful for these types of tasks.

  15. This is a very nice script; however, I think it could be performance-optimized by integrating rollingCURL to run parallel requests. If I can find the free time, I plan on forking this project to implement a faster scraping routine.

    Thanks again for the share.

