Understanding Site Architecture with Xenu and Excel

One of the often under-considered factors for a strong SEO strategy is the site’s architecture and its affect on the PageRank flow through a site. We know that PageRank dissipates as we get further from the home page, but our conceptualization of page depth is clouded by the modern website’s sidebars, dynamically updating widgets, site-wide footers, and more. The difficulty in assessing a site’s architecture is compounded with very large sites with rich histories, especially as an SEO working for an agency where there is a need to develop a strong understanding of a new client’s site relatively quickly.

There are a few things the SEO can do to quickly get a sense of how the search engine might crawl and discover deep pages aside from opening Firefox, disabling JavaScript, setting User Agent to Googlebot, and clicking around starting from the home page. Let’s discuss a few of these options:

Google Webmaster Tools

One such tool that doesn’t fail to disappoint time and again is the Webmaster Tools ‘Internal Links’ report. Within this report we can get a look at which pages are linking to which, and get a count of internal links to any individual page. Unfortunately, the data is sampled (and often misleading) and does not report page depth.

Google Webmaster Tools Internal Link Report

Black Widow

Black Widow by SoftBytes will crawl through your site discovering new pages link by link. The biggest benefit of Black Widow is the Windows Explorer-like visualization of site architecture:

SoftBytes' Black Widow offers a nice site architecture visualization

Xenu

Xenu is a great tool for discovering broken links by doing a full site crawl much like Black Widow. Additionally, Xenu reports “LEVEL” and “LINKS IN”, which are particularly useful for developing a site architecture understanding. After running a full crawl, I like to import the results into Microsoft Excel and do some quick manipulation to rate the internal link juice that flows into each page. I have found that this has enabled me to get a quick sense of the site’s architecture in a pretty painless and repeatable manner. This process is what I’ll be detailing in this blog post:

Step 1: Run Your Crawl

We won’t need anything but links to internal pages crawled. You can speed up a crawl significantly be allowing Xenu to crawl only the pages that matter. That is, if you’d like to crawl only the www subdomain, you should specify so in Xenu, as crawling the root domain could take a lot more time.

Step 2: Import and Clean up Tab Separated File into Excel

After importing, remove all but the “Address”, “Type”, “Level”, and “Links In” columns. Next, we’ll remove all of the non-html pages by deleting all addresses that do not have a “Type” of text/html.

filter with excel
Use the above filter to show only non-text/html entries, then delete them all. Once the filter is removed, we’re left with just html pages.

Once we’ve done this we can delete the “Type” column, leaving us with “Address”, “Level” and “Links In”.

Because of various crawling oddities, many sites will include odd level counts. Unless you’re working with a MASSIVE site, most normal pages will not have a level higher than 10. Sort by level, and find that point where the levels begin to jump and/or non-important pages are crawled, and remove all thereafter.

Step 3: Assign a “Level” and “Links In” Score

Knowing that more PageRank flows to pages closer to zero (the home page), I use the following formula to score “Level”:

=1-(Table1[[#This Row],[Level]])/(MAX([Level])+AVERAGE([Level]))

The home page (level zero) will receive a score of 1, all of the level one pages will receive a score that is a fraction of 1, level two will be scored less than level one, and so on.

I score the “Links In” column using the following formula:

=Table1[[#This Row],[Links In]]/MAX([Links In])

This formula works similarly in that the strongest score is 1, and lower “Links In” counts will be a fraction of 1.

Scoring Internal Links with Xenu and Excel
Your Excel table should look something like this

Step 4: Rank Your Pages

Once we have our scores, we can add them together and/or use the RANK formula to get a quick reference number.

Excel Final Table
The higher total score or lower PageRank score indicates higher importance

Utility and Caveats

There are some obvious issues and shortcomings with this method of scoring internal pages. The most obvious is the lack of external link weight into the formula. It’s important to understand that our score is simply based on internal weight.

I have found, however, that it can be quite useful to have early in the life of a project as a reference. For instance, as I’m auditing a new site I can copy the URL of a page in question and do a quick CTRL+F in my Excel score sheet to get a quick feel for a page’s internal “importance”. Another great utility would be to compare these scores with other KPI, such as conversion rate or organic traffic. If you’ve got a page that converts like crazy, but has a poor internal link score, perhaps it should be moved closer to the home page, or linked to from more internal pages.

What helps you visualize site architecture? Let me know in the comments or on Twitter, @MikeCP.