One of the often under-considered factors for a strong SEO strategy is the site’s architecture and its affect on the PageRank flow through a site. We know that PageRank dissipates as we get further from the home page, but our conceptualization of page depth is clouded by the modern website’s sidebars, dynamically updating widgets, site-wide footers, and more. The difficulty in assessing a site’s architecture is compounded with very large sites with rich histories, especially as an SEO working for an agency where there is a need to develop a strong understanding of a new client’s site relatively quickly.
There are a few things the SEO can do to quickly get a sense of how the search engine might crawl and discover deep pages aside from opening Firefox, disabling JavaScript, setting User Agent to Googlebot, and clicking around starting from the home page. Let’s discuss a few of these options:
Google Webmaster Tools
One such tool that doesn’t fail to disappoint time and again is the Webmaster Tools ‘Internal Links’ report. Within this report we can get a look at which pages are linking to which, and get a count of internal links to any individual page. Unfortunately, the data is sampled (and often misleading) and does not report page depth.

Black Widow
Black Widow by SoftBytes will crawl through your site discovering new pages link by link. The biggest benefit of Black Widow is the Windows Explorer-like visualization of site architecture:

Xenu
Xenu is a great tool for discovering broken links by doing a full site crawl much like Black Widow. Additionally, Xenu reports “LEVEL” and “LINKS IN”, which are particularly useful for developing a site architecture understanding. After running a full crawl, I like to import the results into Microsoft Excel and do some quick manipulation to rate the internal link juice that flows into each page. I have found that this has enabled me to get a quick sense of the site’s architecture in a pretty painless and repeatable manner. This process is what I’ll be detailing in this blog post:
Step 1: Run Your Crawl
We won’t need anything but links to internal pages crawled. You can speed up a crawl significantly be allowing Xenu to crawl only the pages that matter. That is, if you’d like to crawl only the www subdomain, you should specify so in Xenu, as crawling the root domain could take a lot more time.
Step 2: Import and Clean up Tab Separated File into Excel
After importing, remove all but the “Address”, “Type”, “Level”, and “Links In” columns. Next, we’ll remove all of the non-html pages by deleting all addresses that do not have a “Type” of text/html.

Use the above filter to show only non-text/html entries, then delete them all. Once the filter is removed, we’re left with just html pages.
Once we’ve done this we can delete the “Type” column, leaving us with “Address”, “Level” and “Links In”.
Because of various crawling oddities, many sites will include odd level counts. Unless you’re working with a MASSIVE site, most normal pages will not have a level higher than 10. Sort by level, and find that point where the levels begin to jump and/or non-important pages are crawled, and remove all thereafter.
Step 3: Assign a “Level” and “Links In” Score
Knowing that more PageRank flows to pages closer to zero (the home page), I use the following formula to score “Level”:
=1-(Table1[[#This Row],[Level]])/(MAX([Level])+AVERAGE([Level]))
The home page (level zero) will receive a score of 1, all of the level one pages will receive a score that is a fraction of 1, level two will be scored less than level one, and so on.
I score the “Links In” column using the following formula:
=Table1[[#This Row],[Links In]]/MAX([Links In])
This formula works similarly in that the strongest score is 1, and lower “Links In” counts will be a fraction of 1.

Your Excel table should look something like this
Step 4: Rank Your Pages
Once we have our scores, we can add them together and/or use the RANK formula to get a quick reference number.

The higher total score or lower PageRank score indicates higher importance
Utility and Caveats
There are some obvious issues and shortcomings with this method of scoring internal pages. The most obvious is the lack of external link weight into the formula. It’s important to understand that our score is simply based on internal weight.
I have found, however, that it can be quite useful to have early in the life of a project as a reference. For instance, as I’m auditing a new site I can copy the URL of a page in question and do a quick CTRL+F in my Excel score sheet to get a quick feel for a page’s internal “importance”. Another great utility would be to compare these scores with other KPI, such as conversion rate or organic traffic. If you’ve got a page that converts like crazy, but has a poor internal link score, perhaps it should be moved closer to the home page, or linked to from more internal pages.
What helps you visualize site architecture? Let me know in the comments or on Twitter, @MikeCP.

Mike Pantoliano is a lead consultant at Distilled in Seattle. At one point he had an Orkut profile.
Pingback: SearchCap: The Day In Search, January 6, 2011
Good article Mike, I would totally agree it is an under-utilised approach determining the prominence of pages on a site.
We have used something similar at MediaCo for a few years now based on the PageRank algorithm. As in your approach, it allows us to determine if the big-money pages are prominent on the site, but also can be used to assign weighting to other issues, such as summing the value in pages that appear to be duplicates or return error codes – this can then give a good indication of how big a problem the issue is on the site.
Good article. I work with a very large site and it can be difficult to visualize where all the PageRank is flowing using only GWT, so option 2 and 3 here are very useful. You noted that “Unless you’re working with a MASSIVE site, most normal pages will not have a level higher than 10.” Will there be any other differences when performing this analysis on a massive site?
Great post Mike. In the past I have just utilised Google Webmaster Tools to analyse the internal link structure so this has most definitely provided me with a new insight.
I am interested to hear the answer to Joe’s question regarding the differences when performing the same analysis on a massive site.
Secondly, after you have these results and have analysed them what would be your next step? Would you be looking to parse new internal links to pages without as high ranking etc.
Joe and Aaron,
Re: “Will there be any other differences when performing this analysis on a massive site?” I’m sure there would be. One example: I would imagine that the “Links In” spread would be so great, that my method of scoring might be a bit unreliable. If that were the case I might work to normalize in some way. Perhaps not scoring 20,000 ‘Links In’ any better or worse than ’100,000′.
Aaron, Indeed, that is exactly what I would do. Or if things seem to be completely amiss, I might propose a grand site restructuring.
Pingback: Weekly Search & Social News: 01/11/2011 | Search Engine Journal
I also like to look at the GWT as it give a rough idea on how GG see your internal pages and structure.
FYI there’s a OS X program called Integrity on the mac that does crawl sites similar to Xenu, not much info about it and small community tho’…
excellent stuff Mike thx for sharing
have you ever used the Microsoft IIS SEO toolkit instead of Xenu ? Richard Baxter wrote a couple of articles praising it and I tried it a couple of times as well. But I am still trying to figure it out which crawling software is best/more efficient for an agency to work with
matteo
Thanks for the post Distilled – I agree with Samuels comment – Link Sluth coupled with GWMT is a killer SEO IA definition.
I would also think that thins post might have some bearing here – the use of anchor text and the first link – http://moz.com/ugc/3-ways-to-avoid-the-first-link-counts-rule
Thanks for the read guys! And the tip!
BC
Sorry for being stupid but I can’t figure out what to substitute the fields for in the formulas?
=1-(Table1[[#This Row],[Level]])/(MAX([Level])+AVERAGE([Level]))
Try with:
=1-(Table1[@Level])/(MAX(Table1[Level])+Average(Table1[Level]))
=Table1[@Links In]/MAX(Table1[Links In])
Pingback: SearchFest 2011 Interview: Matthew Brown | Search Engine Journal
Pingback: Do Your Very Own Site Structure Audit | Distilled blog
John, try the formula with the new Excel.
Hi Mike,
Thanks for the great guide and excel formulas. It gave me a bunch of sweet findings (aka. erros) that must be fixed. The analysis was on a domain with 20+ subdomains – Imagine the data..
+1
Frederik Trovatten
Formula example for Excel 2007:
=1-(B2/(MAX($B$2:$B$999)+AVERAGE($B$2:$B$999)))Hi Mike -
Thanks again for the info offered in this post. I’m still looking around for alternative methods for analyzing internal linking. In particular, I’m looking for a tool that will allow me to create visualizations of internal linking structures.
I’m not having a great deal of joy on that front – but if you or anyone else knows a good tool to create visual representations of internal linking structures I’d be very pleased to hear about it!
Oliver
After my crawl is complete, how do I export to Excel? I’m saving, but all I’m getting is gibberish when I open in Excel.