The Beginner's Guide to Using the Command Line for SEO

The command line can look very intimidating if you've never used it before. By its very nature the interface is basic, but once you know how to use it, it can be very powerful; even learning the basics can quickly turn the terminal into one of your greatest tools. As an SEO you should learn to love the command line. Once you get over the fact that it's ugly and looks like something out of The Matrix, it can make your job a lot easier: what takes hours in Excel can take seconds in the command line.

If you don't even know what the command line is, don't worry, it's that horrible screen that looks something like this:

[Image: intro to command line]

Setup

Sorry, but this isn't going to be a post on how to set up the command line, so in the interest of getting the most out of this post, I'm going to assume that you are using Terminal if you are on a Mac, or that you have installed Cygwin if you are on a Windows machine.

Orientation

Open up the command line and if you are using a Mac, you will be presented with something similar to the screen shown below:

[Image: mac command line]

 

If you are using Cygwin, you will most likely see something like the screen below:

[Image: windows command line]

 

Tip – If you get confused or want to start again, you can clear the screen by pressing "ctrl+l". Also, if a command has been running for a long time, you can quit it by pressing "ctrl+c".

Okay, the first thing you will want to know is where you are. To find out just type:

“pwd”  (print working directory) and hit enter. You will then see the screen below:

[Image: pwd]

This is handy to remember; it can be easy to get lost or forget which folder you are in, so hitting "pwd" can be useful.

Moving Between Folders

Before you move anywhere, you need to know what's in your working directory, so you want the command line to list everything in the folder you are currently in. To do this, simply type:

“ls” (list) and hit enter. You will then see the screen below:

[Image: list]

For privacy reasons, I've blacked out some of my folders, but you can see that it lists all of the folders within the directory. Let's say I want to go into my Desktop folder. To move up or down directories, the command always starts with "cd" (change directory), so to move into the Desktop directory, we simply type:

"cd desktop" and hit enter. If you then use the list command, you will see the items on your desktop, as in the image below:

 

As you can see, I have 3 folders on my desktop. Let's go into Folder 3 using the "cd" command I've just shown you:

Important – As there is a space in the folder name, we need to do one of two things:

  1. Either escape the space using a backslash – "cd Folder\ 3"
  2. Or wrap the folder name in quotation marks – cd "Folder 3"

To make sure we are in there, let’s list all items in folder 3. See the screen below:

[Image: lists items]

As you can see, there is only one folder, named Folder 4. Let's assume we've changed our mind and we actually want to be in Folder 2. To move back up a level we again use the change directory command, but this time we add a space and two full stops. This would look like this:

“cd ..” – This would then take you to the folder above, which in this case is the desktop.

Tip – When typing out a folder or file name, you don't need to type the whole thing; just press the tab key and the command line will auto-complete as far as possible. In the example above, we could just type "cd F(tab)" and it would auto-complete up to "Folder\ ", meaning we just need to add the 4.

Summary of what we’ve covered so far

  1. Find out where you are using “pwd”
  2. List the items in current directory “ls”
  3. Move down a directory “cd FolderName”
  4. Move up a directory “cd ..”
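
Putting those four commands together, a typical navigation session using the folders above might look something like this:

pwd            # where am I?
ls             # what's in this folder?
cd Desktop     # move down into the Desktop folder
cd Folder\ 1   # folder names with spaces need a backslash or quotes
cd ..          # changed my mind, move back up a level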

Working with Documents

Now that we know the basics and how to get around in the command line environment, let’s get into something more interesting. Let’s look at how you can open, search, edit and export documents in seconds. As this is designed to be used by SEOs, the files that we are going to use are log files from my personal site. It’s a tiny site with very little traffic so the files won’t be too large, which is perfect for our needs. I’ve included 4 months of logs, from April to July.

Let’s navigate to the files.  I don’t know where I am at the moment and I’ve put the server logs into “Folder 1,” so you can follow the steps below:

"pwd" – Turns out I was in Folder 3, so I want to move up to the Desktop
"cd .." – I'm now on the Desktop
"cd Folder\ 1" – Move into Folder 1
"ls" – List Folder 1's items – there's only one, named Server Logs, so let's go into it
"cd Server\ Logs"
"ls" – Shows there are 4 folders (April, May, June, July)
"cd April" – Go into the April folder
"ls" – Show all the log files in April

I'm now in the April folder and can see all of the log files. The image below shows some of them, but there are a lot more:

[Image: april logs]
Let's look inside the first log file. As you can see, it has a long file name that's annoying to type out; this is where the tab trick comes in handy, as we don't need to type it all. For this we are going to use a new command named "cat" (short for concatenate), which, to make it easier, basically means "show me the contents of the file I name, right here in the command line". Let's "cat" the first file in that list, the one that ends in 114.log.

“cat ex20120329000001-184.168.193.114.log”

The output will be the contents of that log file and will look something like the image below (note I have chopped off some of the contents to keep it readable).

[Image: log content]
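
As an aside, if a log file is very long and you only want a quick peek rather than the whole thing dumped to the screen, the "head" command prints just the first few lines:

head -n 10 ex20120329000001-184.168.193.114.log    # show only the first 10 lines of the file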

Finding Googlebot

What can we do with these log files? What if I wanted to find out how many times Googlebot visited my site in a given month? For this I'm going to assume you know how to read your server logs. Here's the thought process of what we need to do (note I'm searching for "ooglebot" so we don't miss entries with or without a capital G):

1 – Output all content for all server logs in April
2 – Search for the string "ooglebot" in that content
3 – Output every occurrence of this
4 – Count the number of lines in the output (this is equal to the number of hits)

Below are the individual steps to do all of the above; the numbers are consistent for ease of following along.

1 – We already know how to output one file, but to do all the files in April we need to tell the command line to "cat" any file in the April folder that ends in ".log". We can do this using a wildcard (the shell's pattern matching, similar in spirit to regex). While in the April folder, type the following:

"cat *\.log" – Note the backslash before the "." – this makes sure the full stop is treated literally (strictly speaking the shell's wildcards treat a dot literally anyway, so a plain "cat *.log" works just as well). The output from this is all the server logs in April.
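
If you want to double-check which files a wildcard will match before cat-ing them all, you can list them first with the same pattern:

ls *.log    # list every file in the current folder ending in .log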

2 – Now we need to search for any mentions of "ooglebot" in all that mess. To do that we use a new command called "grep". Using grep is really easy: you just type grep followed by the string you want to search for. For example, if I wanted to find ooglebot I would type:

“grep ooglebot”

But before we can do that, we need the search to happen after the contents have been output to the terminal. This is going to require "piping", which basically chains a series of commands together using the "|" symbol. Shown below is me asking the command line, in one line, to output all the log content (as in step one) but only show the lines that mention "ooglebot":

“cat *\.log | grep ooglebot”

This will output a bunch of lines that mention the string “ooglebot”. A section of the output can be seen in the image below:

[Image: googlebot]
The final step is simply to get a count of the number of times the site was hit by Google. For this we need one more command, word count, which we pipe onto the end of the command as "wc". This would look like this:

cat *\.log | grep ooglebot | wc

The output will be three numbers; these correspond to the number of lines, words and characters in the output, as shown below.

craig-bradfords-MacBook-Pro:April Craig$ cat *\.log | grep ooglebot | wc
    189    3230   40576

As you can see from above, that returned a count of 189 for the number of times Googlebot hit my site in April.
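
If you only care about the first of those three numbers, wc also has an -l flag that prints just the line count:

cat *\.log | grep ooglebot | wc -l    # prints only the number of matching lines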

Summary of what we’ve covered so far

"cat" – Output the contents of a file
"\" – Backslash to escape characters
"grep" – The grep utility searches any given input files, selecting lines that match one or more patterns
"|" – The pipe character allows us to chain commands
"wc" – The wc command outputs the number of lines, words and characters in the document.
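
As an aside, you can get the same Googlebot count a little more directly: grep can read the files itself (no cat needed), and its -i flag makes the match case-insensitive, so you don't need the "ooglebot" trick. A minimal alternative would be:

grep -i googlebot *.log | wc -l    # case-insensitive match across all .log files, then count the lines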

How often does Googlebot visit your most important pages?

I want to finish off with something more advanced. What if you had a site that you suspected Googlebot wasn't crawling correctly? Let's say you wanted to plot a graph of the number of hits to your website from Google over the last 12 months. This would essentially show you the increases and decreases in how often Google is hitting your website, which can be useful for a whole number of things. OK, here we go, the command is as follows:

cat */*\.log | grep Googlebot | awk '{print $4}' | awk -F/ '{print $2}' | sort -M | uniq -c

Don’t panic! Let’s talk through it:

The first part should look familiar; all we do differently is add "*/", which means "any folder", since we are no longer inside the directory we are searching. We want to search across the April, May, June and July folders, therefore we need the "*/" at the start. Then, up to the end of "grep Googlebot", it is just the same as before: it basically says print all occurrences of "Googlebot".

If we just do that, we get the following output: (note: If you actually did this there would be a lot more data)

108.162.226.47 - - [28/May/2012:23:44:40 -0700] "GET www.craigbradford.co.uk/category/seo/ HTTP/1.1" 200 3421 "-" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
108.162.226.47 - - [29/May/2012:01:59:32 -0700] "GET www.craigbradford.co.uk/robots.txt HTTP/1.0" 200 66 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
108.162.226.47 - - [29/May/2012:13:30:41 -0700] "GET www.craigbradford.co.uk/robots.txt HTTP/1.0" 200 66 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
108.162.226.47 - - [29/May/2012:13:30:47 -0700] "GET www.craigbradford.co.uk/index.php HTTP/1.1" 200 4945 "-" "DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"
craig-bradfords-MacBook-Pro:server logs Craig$

What do we want to do with this now? Well, we need to strip out the date so we can tell in which months Google hit the site. To do that we use a new command called "awk", which recognises patterns and splits the input into columns using spaces. To make this simple, we are going to take just one line of the Googlebot output above to help explain the awk command. See the log line below:

108.162.226.47 - - [29/May/2012:13:30:47 -0700] "GET www.craigbradford.co.uk/index.php HTTP/1.1" 200 4945 "-" "DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)"

The "awk" command basically splits the line above into columns using spaces, so column 1 would be the IP address (108.162.226.47), column 2 a dash (-), column 3 another dash (-), column 4 the date ([29/May/2012:13:30:47) and so on. Since we are only interested in the part with the month in it, we can tell the command line to only print column 4. The $4 in the command above represents the column we want to print, so we now run the command again, but this time only up to the end of awk '{print $4}', like this:

cat */*\.log | grep Googlebot | awk '{print $4}'

The output looks like this:

[24/June/2012:11:54:02
[25/June/2012:00:28:24
[25/Apr/2012:08:20:52
[25/Apr/2012:08:20:58
[25/July/2012:17:18:09
[25/May/2012:17:18:11
[25/May/2012:23:31:02

We actually don't need all of that, though; we just need the month part, e.g. "May". To do this we pipe into awk again, but tell it to use a "/" instead of a space to separate the data. This is done by adding "-F/" (note the / after the F) to the awk command. See the new breakdown of columns below:

[25/May/2012:23:31:02

Splitting that on "/" gives column 1 = "[25", column 2 = "May" and column 3 = "2012:23:31:02".

We now tell it to print just column 2 ("May"), since that's all we are interested in. If we run the pipeline again up to the end of this new awk section, like this:
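
cat */*\.log | grep Googlebot | awk '{print $4}' | awk -F/ '{print $2}'

we get the following output: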

Apr
Apr
Apr
Jun
Jun
Jun
Jun
July
July
May
May
May

We're almost there; all that remains is to sort the output by month, collapse the duplicates and count the number of occurrences of each. That's what the last two parts of the command do:

sort -M | uniq -c

"sort -M" sorts all the months above into month order, then we pipe that into "uniq -c", which collapses the duplicate lines and counts how many of each there were. (Note that uniq only collapses adjacent duplicates, which is why we sort first.)
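
If it helps to see what those two commands do in isolation, here's a tiny made-up example (the three month names are just dummy input fed in with printf):

printf "May\nApr\nMay\n" | sort -M | uniq -c
   1 Apr
   2 May

When we run the whole pipeline together for real, we get the following: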

craig-bradfords-MacBook-Pro:server logs Craig$ cat */*.log | grep Googlebot | awk '{print $4}' | awk -F/ '{print $2}' | sort -M | uniq -c
 28 Mar
204 Apr
294 May
236 Jun
 16 Jul
craig-bradfords-MacBook-Pro:server logs Craig$

We can now see how many times per month Googlebot hit my site, and if I spotted anything weird I could look into why. The final thing that would be nice to do would be to put this into a graph to show my client. To do this, just add:

"> filename.txt" to the end of the command and it will output all the numbers you need into a text file. Just to be clear, the final command would look like this:

cat */*.log | grep Googlebot | awk '{print $4}' | awk -F/ '{print $2}' | sort -M | uniq -c > googlebot.txt

You can then throw that into Excel and make the following graph:

[Image: google hits per month]

I think that's plenty for today, wouldn't you agree? Just before I finish, think about how you could use the above process on a larger scale. Obviously I've only used it for a few months and for all pages of my site, but what if you wanted to check whether Googlebot was visiting certain pages or sections of your site less often than others? You could use the above process, but instead of pulling out months, you could pull out specific sections of the site and perhaps produce a graph like the one below.

[Image: google hits per directory]
You could plot the number of times Google has hit your blog, your shop, etc. over the space of 12 months. If there are important sections of your site not getting crawled as often as you would like, you could use this as a signal to revise your internal navigation or build more links to the important pages that make you money or drive conversions.
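
As a rough sketch of how that per-section version might look with the same sort of pipeline: the idea is to print the requested URL instead of the date, then pull out the first directory of that URL. Treat the field number as an assumption here; in the log lines above the requested URL happens to be column 7, but that depends entirely on your log format, so check a sample line first.

cat */*\.log | grep Googlebot | awk '{print $7}' | awk -F/ '{print $2}' | sort | uniq -c | sort -rn
# $7 – the requested URL column in this particular log format (yours may differ)
# $2 – the first directory in that URL, e.g. "category" from www.craigbradford.co.uk/category/seo/
# sort -rn at the end puts the most-crawled sections at the top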

I hope you've found this useful. As always, please leave comments, ask questions and share with others. Feel free to drop me a message on Twitter @CraigBradford. For further reading, I also recommend this post by Ian Lurie on how to mine server logs for broken links.

Craig Bradford


Craig joined Distilled in March 2011. Originally from Scotland, Craig moved to London in search of becoming an SEO ninja. After spending 4 years at Strathclyde University studying Sports Engineering, he decided that he didn't fancy designing sports...


15 Comments

  1. The best part of this tutorial is that if you have SSH access to your hosting account you can do this probably right there, without needing to download all .log files to your machine.

    • Craig Bradford

      Hey UD, yup SSH is amazing but I thought I'd keep it simple for this time around. Thanks for reading and leaving a comment.

  2. Sometimes it might be useful to look for the "time-spent" value from the logs. If it takes too much time for Googlebot to crawl the page, then there might be something wrong.

    Great tool for handling the logs is Splunk: http://www.splunk.com

    • +1 for Splunk. Manually parsing logs with awk and grep are awesome skills to know when you're doing things on the fly ... but Splunk is seriously a time saver and lets you quickly discover trends that you wouldn't be able to glean from basic command line stuff. Great intro to command line none the less. I think more SEOs really should be familiar with it.

  3. Ben

    That's really cool, I have found myself using the command line through SSH more and more these days; once you get the hang of it you are so much quicker than using GUIs, especially web based interfaces like Plesk, phpMyAdmin and so on.

    • Craig Bradford

      Hey Ben, absolutely, once you get the hang of it it's a great tool. Thanks for commenting

  4. A terminal console is not ugly. It is a beautiful thing to people who know how to use one efficiently. In fact, we should outlaw GUIs and all use command line software from now on!! Well, maybe not quite that strong, but you get my point.

  5. Larry Heart

    Hello, Craig!
    Thanks for the great tutorial, I've tried it on my personal website and I've noticed only minor fluctuations.
    It took some time to make that line work for me, but the best thing is that now I can easily test any other sites in future.
    I love reading the Distilled blog, because you guys always show us some unexpected SEO aspects, which make us 1 step better SEOs.

    • Craig Bradford

      Thanks Larry, glad you got it to work and yeah it's great that once you have it set up once, you can just repeat it again. Thanks for reading.

  6. Great posting, you don't see this sort of explanation much.

    The best nix command, *ever is simply:

    apropos

    You basically type, for instance, "apropos file" and it will list out all the manual pages (i.e. help topics) that are available (for files it would list commands such as cp [copy], mv [move], etc.), then you just type "man <command>" for more detail on the particular command.

    If you only memorize one thing, memorize that one!

    Great piece,


    Ted

  7. This is a totally new thing for me after a long time. Thanks Craig for sharing this type of different post.

  8. Hey,

    Thank you for sharing this. This is totally something new for me. I have never heard about these command lines. :)

    Thank you

    Saif

  9. Craig, great post! Looking forward to a follow up post with some more in depth analysis because the possibilities are numerous! Are you going to be at Brighton SEO again this year?

    • Craig Bradford

      Hey Jan, good to hear from you. I can't make it this time but some of the other consultants are going.

  10. Hey

    Thanks for sharing this. After reading it I got so many new points that are really informative for me. Keep on sharing.

    Thank you.

    Rick

