Understanding Site Architecture with Xenu and Excel

One of the often under-considered factors in a strong SEO strategy is the site’s architecture and its effect on how PageRank flows through the site. We know that PageRank dissipates as we get further from the home page, but our conceptualization of page depth is clouded by the modern website’s sidebars, dynamically updating widgets, site-wide footers, and more. The difficulty of assessing a site’s architecture is compounded on very large sites with rich histories, especially for an SEO working at an agency, where there is a need to develop a strong understanding of a new client’s site relatively quickly.

There are a few things the SEO can do to quickly get a sense of how a search engine might crawl the site and discover deep pages, aside from opening Firefox, disabling JavaScript, setting the user agent to Googlebot, and clicking around starting from the home page. Let’s discuss a few of these options:

Google Webmaster Tools

One place to start is the Webmaster Tools ‘Internal Links’ report. Within this report we can see which pages link to which, and get a count of internal links to any individual page. Unfortunately, the data is sampled (and often misleading) and does not report page depth.

[Screenshot: the Google Webmaster Tools Internal Links report]

Black Widow

Black Widow by SoftBytes will crawl through your site discovering new pages link by link. The biggest benefit of Black Widow is the Windows Explorer-like visualization of site architecture:

[Screenshot: SoftBytes' Black Widow offers a nice site architecture visualization]

Xenu

Xenu is a great tool for discovering broken links by doing a full site crawl, much like Black Widow. Additionally, Xenu reports “LEVEL” and “LINKS IN”, which are particularly useful for building an understanding of site architecture. After running a full crawl, I like to import the results into Microsoft Excel and do some quick manipulation to rate the internal link juice that flows into each page. I have found that this lets me get a quick sense of the site’s architecture in a fairly painless and repeatable manner. This process is what I’ll be detailing in this blog post:

Step 1: Run Your Crawl

We won’t need anything but the internal pages crawled. You can speed up a crawl significantly by allowing Xenu to crawl only the pages that matter. That is, if you’d like to crawl only the www subdomain, you should specify so in Xenu, as crawling the whole root domain could take a lot more time.

Step 2: Import and Clean Up the Tab-Separated File in Excel

After importing, remove all but the “Address”, “Type”, “Level”, and “Links In” columns. Next, we’ll remove all of the non-HTML pages by deleting every address that does not have a “Type” of text/html.

[Screenshot: filtering in Excel]
Use the filter above to show only the non-text/html entries, then delete them all. Once the filter is removed, we’re left with just HTML pages.

Once we’ve done this we can delete the “Type” column, leaving us with “Address”, “Level” and “Links In”.

Because of various crawling oddities, many sites will show odd level counts. Unless you’re working with a MASSIVE site, most normal pages will not have a level higher than 10. Sort by level, find the point where the levels begin to jump and/or unimportant pages start appearing, and remove everything from that point on.
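If you want a quick sanity check before deleting anything, a simple formula can count how many rows sit deeper than your cutoff. Assuming the “Level” values are in column B at this point (adjust the reference if your layout differs):

=COUNTIF(B:B,">10")

If that count turns out to be a large share of the crawl, the level-10 cutoff is probably too aggressive for your site and worth revisiting.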

Step 3: Assign a “Level” and “Links In” Score

Knowing that more PageRank flows to pages closer to level zero (the home page), I use the following formula to score “Level”:

=1-(Table1[[#This Row],[Level]])/(MAX([Level])+AVERAGE([Level]))

The home page (level zero) will receive a score of 1, all of the level one pages will receive a score that is a fraction of 1, level two will be scored less than level one, and so on.
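To make that concrete with some made-up numbers: if the deepest page in your crawl sits at level 8 and the average level is 3, a level-2 page scores 1 - 2/(8+3) ≈ 0.82, a level-5 page scores 1 - 5/11 ≈ 0.55, and the home page still scores exactly 1.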

I score the “Links In” column using the following formula:

=Table1[[#This Row],[Links In]]/MAX([Links In])

This formula works similarly in that the strongest score is 1, and lower “Links In” counts will be a fraction of 1.
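Again with purely illustrative numbers: if the most-linked-to page on the site has 5,000 internal links pointing at it, it scores 5,000/5,000 = 1, while a page with 250 links in scores 250/5,000 = 0.05.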

[Screenshot: scoring internal links with Xenu and Excel]
Your Excel table should look something like this.

Step 4: Rank Your Pages

Once we have our scores, we can add them together and/or use Excel’s RANK function to get a quick reference number.
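For example, if your two score columns are named “Level Score” and “Links In Score” and the sum goes into a “Total” column (these names are just placeholders; use whatever your own table has), the formulas could look like this:

=Table1[[#This Row],[Level Score]]+Table1[[#This Row],[Links In Score]]

=RANK(Table1[[#This Row],[Total]],Table1[Total],0)

With RANK’s order argument set to 0 (descending), the page with the highest combined score gets rank 1.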

[Screenshot: the final Excel table]
A higher total score (or a lower rank number) indicates higher importance.

Utility and Caveats

There are some obvious issues and shortcomings with this method of scoring internal pages. The most obvious is that external link weight plays no part in the formula. It’s important to understand that our score is based purely on internal link weight.

I have found, however, that it can be quite useful to have early in the life of a project as a reference. For instance, as I’m auditing a new site, I can copy the URL of a page in question and do a quick Ctrl+F in my Excel score sheet to get a feel for the page’s internal “importance”. Another great use is to compare these scores with other KPIs, such as conversion rate or organic traffic. If you’ve got a page that converts like crazy but has a poor internal link score, perhaps it should be moved closer to the home page, or linked to from more internal pages.
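Purely as an illustration of that last idea: if you paste an analytics export into a sheet called “KPI” with URLs in column A and conversion rate in column B (and the URL format matches the addresses from your crawl), a lookup along these lines would pull the KPI alongside each page’s internal link score:

=VLOOKUP(Table1[[#This Row],[Address]],KPI!A:B,2,FALSE)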

What helps you visualize site architecture? Let me know in the comments or on Twitter, @MikeCP.


15 Comments

  1. Good article Mike, I would totally agree it is an under-utilised approach to determining the prominence of pages on a site.

    We have used something similar at MediaCo for a few years now based on the PageRank algorithm. As in your approach, it allows us to determine if the big-money pages are prominent on the site, but also can be used to assign weighting to other issues, such as summing the value in pages that appear to be duplicates or return error codes - this can then give a good indication of how big a problem the issue is on the site.

  2. Joe Amadon

    Good article. I work with a very large site and it can be difficult to visualize where all the PageRank is flowing using only GWT, so option 2 and 3 here are very useful.
    You noted that "Unless you’re working with a MASSIVE site, most normal pages will not have a level higher than 10." Will there be any other differences when performing this analysis on a massive site?

  3. Aaron Luckie

    Great post Mike. In the past I have just utilised Google Webmaster Tools to analyse the internal link structure so this has most definitely provided me with a new insight.

    I am interested to hear the answer to Joe's question regarding the differences when performing the same analysis on a massive site.

    Secondly, after you have these results and have analysed them, what would be your next step? Would you be looking to point new internal links at pages that don't rank as highly, etc.?

  4. Joe and Aaron,

    Re: "Will there be any other differences when performing this analysis on a massive site?"
    I'm sure there would be. One example: I would imagine that the "Links In" spread would be so great, that my method of scoring might be a bit unreliable. If that were the case I might work to normalize in some way. Perhaps not scoring 20,000 'Links In' any better or worse than '100,000'.

    Aaron,
    Indeed, that is exactly what I would do. Or if things seem to be completely amiss, I might propose a grand site restructuring.

  5. I also like to look at GWT as it gives a rough idea of how Google sees your internal pages and structure.

    FYI, there's an OS X program called Integrity that crawls sites much like Xenu; not much info about it and a small community though...

  6. excellent stuff Mike
    thx for sharing

    have you ever used the Microsoft IIS SEO Toolkit instead of Xenu? Richard Baxter wrote a couple of articles praising it and I tried it a couple of times as well. But I am still trying to figure out which crawling software is best/more efficient for an agency to work with.

    matteo

  7. Thanks for the post, Distilled - I agree with Samuel's comment - Link Sleuth coupled with GWMT is a killer combination for defining SEO IA.

    I would also think that this post might have some bearing here - the use of anchor text and the first link - http://moz.com/ugc/3-ways-to-avoid-the-first-link-counts-rule

    Thanks for the read guys! And the tip!

    BC

  8. John

    Sorry for being stupid, but I can't figure out what to substitute for the fields in the formulas:

    =1-(Table1[[#This Row],[Level]])/(MAX([Level])+AVERAGE([Level]))

    • Try with:

      =1-(Table1[@Level])/(MAX(Table1[Level])+Average(Table1[Level]))

      =Table1[@Links In]/MAX(Table1[Links In])

  9. John, try the formula with the new Excel.

  10. Hi Mike,

    Thanks for the great guide and Excel formulas. It gave me a bunch of sweet findings (a.k.a. errors) that must be fixed. The analysis was on a domain with 20+ subdomains - imagine the data...

    +1

    Frederik Trovatten

  11. Formula example for Excel 2007:

    =1-(B2/(MAX($B$2:$B$999)+AVERAGE($B$2:$B$999)))

  12. Hi Mike -

    Thanks again for the info offered in this post. I'm still looking around for alternative methods for analyzing internal linking. In particular, I'm looking for a tool that will allow me to create visualizations of internal linking structures.

    I'm not having a great deal of joy on that front - but if you or anyone else knows a good tool to create visual representations of internal linking structures I'd be very pleased to hear about it!

    Oliver

  13. After my crawl is complete, how do I export to Excel? I'm saving, but all I'm getting is gibberish when I open in Excel.

  14. Trenton Erker

    I'm so confused about that formula.

