The Beginner's Guide to Using the Command Line for SEO

The command line can look very intimidating if you’ve never used it before. By its very nature the interface is basic, but once you know how to use it, it can be very powerful. Even learning the basics of the command line can quickly turn the terminal into one of your greatest tools. As an SEO you should learn to love the command line; once you get over the fact that it’s ugly and looks like something out of The Matrix, it can make your job a lot easier. What takes you hours in Excel can take you seconds in the command line.

If you don’t even know what the command line is, don’t worry, it’s that horrible screen that looks something like this:

intro to command line

Set up

Sorry, but this isn’t going to be a post on how to set up the command line, so in the interest of getting the most out of this post, I’m going to assume that you are using the Terminal if you are on a Mac or that you have installed Cygwin if you are on a Windows machine.


Open up the command line and if you are using a Mac, you will be presented with something similar to the screen shown below:

mac command line


If you are using Cygwin, you will most likely see something like the screen below:

windows command line


Tip – If you get confused or want to clear the screen and start again, you can clear it using “ctrl+l”. Also, if a command has been running for a long time, you can quit it by pressing “ctrl+c”.

Okay, the first thing you will want to know is where you are. To find out just type:

“pwd”  (print working directory) and hit enter. You will then see the screen below:

pwd

This is handy to remember; it can be easy to get lost or forget which folder you are in, so hitting “pwd” can be useful.

Moving Between Folders

Before you move anywhere, you need to know what’s in your working directory. To do this, you want the command line to list everything in the folder you are currently in. To do this simply type:

“ls” (list) and hit enter. You will then see the screen below:

list

For privacy reasons, I’ve blacked out some of my folders, but you can see that it lists all of the folders within the directory. Let’s say I want to go into my desktop folder. To move up or down directories, the command always starts with “cd” (change directory), so to move into the desktop directory, we simply type:

“cd desktop” and hit enter. If you then use the list command, you will see the items on your desktop as in the image below:


As you can see, I have 3 folders on my desktop. Let’s go into folder 3 using the cd command I’ve just shown you:

Important – As there is a space in the folder name, we need to do one of two things:

  1. Either escape the space using a backslash – “cd Folder\ 3”
  2. Use quotation marks around the folder name – cd "Folder 3"
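Both options can be tried on a scratch folder – here “mkdir” (not covered above) just creates a “Folder 3” to practise on:

```shell
# Create a scratch folder with a space in its name to practise on
mkdir -p "Folder 3"

# Option 1: escape the space with a backslash
cd Folder\ 3
cd ..    # back up a level

# Option 2: wrap the whole name in quotes
cd "Folder 3"
cd ..
```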

To make sure we are in there, let’s list all items in folder 3. See the screen below:

lists items
As you can see, there is only one folder named Folder 4. Let’s assume we’ve changed our mind and we actually want to be in folder 2. To move back up a level we again use the change directory command, but this time we use a space and two full stops. This would look like this:

“cd ..” – This would then take you to the folder above, which in this case is the desktop.

Tip – When typing out a folder or file name, you don’t need to type the full thing; just press the tab key and the command line will auto-complete as far as possible. In the case of the example above, we could just type “cd F(tab)” and it would auto-complete up to “Folder\ ”, meaning we just need to add the number.

Summary of what we’ve covered so far

  1. Find out where you are using “pwd”
  2. List the items in current directory “ls”
  3. Move down a directory “cd FolderName”
  4. Move up a directory “cd ..”
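Putting those four together, a minimal session might look like this (“demo” is a made-up folder; “mkdir”, which creates it, isn’t covered above but is handy to know):

```shell
pwd            # 1. find out where we are
ls             # 2. list the items here
mkdir -p demo  # create a folder to move into ("demo" is made up)
cd demo        # 3. move down into it
pwd            # confirm we moved
cd ..          # 4. move back up
```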

Working with Documents

Now that we know the basics and how to get around in the command line environment, let’s get into something more interesting. Let’s look at how you can open, search, edit and export documents in seconds. As this is designed to be used by SEOs, the files that we are going to use are log files from my personal site. It’s a tiny site with very little traffic so the files won’t be too large, which is perfect for our needs. I’ve included 4 months of logs, from April to July.

Let’s navigate to the files.  I don’t know where I am at the moment and I’ve put the server logs into “Folder 1,” so you can follow the steps below:

“pwd” – Turns out I was in Folder 3, so I want to move up to the Desktop
“cd ..” – I’m now on the Desktop
“cd Folder\ 1” – Move into Folder 1
“ls” – List Folder 1’s items – there’s only one, named Server Logs; let’s go into it
“cd Server\ Logs”
“ls” – Shows there are 4 folders (April, May, June, July)
“cd April” – Go into the April folder
“ls” – Show all the log files in April

I’m now in the April folder and can see all of the log files in there. The image below shows some of the files in there, but there are a lot more:

april logs
Let’s look inside the first log file. As you can see, it’s a long file name that’s annoying to type out, so this is where the tab trick comes in handy. For this, we are going to use a new command named “cat” (short for concatenate), which simply outputs the contents of the file you name to the command line. Let’s “cat” the first file in that list, the one that ends in 114.log.

“cat ex20120329000001-”

The output will be the contents of that log file and will look something like the image below (note I have chopped off some of the contents to make the content readable).  

log content
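If you want to try “cat” safely first, you can make a tiny sample file of your own (the file name and contents here are made up for illustration):

```shell
# Make a tiny two-line sample file, then output it with cat
printf 'line one\nline two\n' > sample.log
cat sample.log
# prints:
# line one
# line two
```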

Finding Googlebot

What can we do with these log files? What if I wanted to find out how many times Googlebot visited my site in this month? For this I’m going to assume you know how to read your server logs. Here’s the thought process of what we need to do (note: I’m searching for “ooglebot” so we catch entries with or without a capital G):

1 – Output all content for all server logs in April
2 – Search for the string “ooglebot” in that content
3 – Output every occurrence of this
4 – Count the number of lines in the output (This is equal to the number of hits)

Below are the individual steps to do all of the above; the numbers are consistent for ease of following along.

1 – We already know how to output one file, but to do all the files in April we need to tell the command line to “cat” any file in the April folder that ends in “.log”. We can do this using a shell wildcard (this is glob expansion, not a regular expression). While in the April folder, type the following:

“cat *\.log” – Note the backslash before the “.” – this escapes the full stop. Strictly speaking it isn’t needed, as a full stop is already literal in shell wildcards, so “cat *.log” works just as well. The output from this is all the server logs in April.
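To see that the backslash is optional, here’s a tiny sketch with two made-up log files:

```shell
# Two made-up log files
printf 'april entry 1\n' > one.log
printf 'april entry 2\n' > two.log

# The "*" is expanded by the shell before cat runs; "." is literal
# in globs, so both of these output the same two files:
cat *\.log
cat *.log
```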

2 – Now we need to search for any mentions of “ooglebot” in all that mess. To do that we use a new command called “grep”. Using grep is really easy: you just type grep followed by the string you want to search for. For example, if I wanted to find ooglebot I would type:

“grep ooglebot”

But before we can do that, we need to run the search on the contents after they have been output. This is going to require “piping”, which chains a series of commands together using the “|” symbol, feeding the output of one command into the input of the next. Shown below is me asking the command line, in one line, to output all log content (as in step one) but only print the lines that mention “ooglebot”:

“cat *\.log | grep ooglebot”

This will output a bunch of lines that mention the string “ooglebot”. A section of the output can be seen in the image below:

The final step is simply to get a count of the number of times the site was hit by Google. For this we need one more command: “wc” (word count). We pipe “wc” onto the end of the command, which looks like this:

cat *\.log | grep ooglebot | wc

The output will be three numbers; these correspond to the number of lines, words and characters of the output, as shown below.

craig-bradfords-MacBook-Pro:April Craig$ cat *\.log | grep ooglebot | wc
    189    3230   40576

As you can see from above, that returned a count of 189 for the number of times Googlebot hit my site in April.
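As an aside, grep has two standard flags that shorten this pipeline: “-i” makes the match case-insensitive (so the “ooglebot” trick isn’t needed) and “-c” prints the count of matching lines directly, replacing the trailing “| wc”. A small sketch on a made-up log:

```shell
# A made-up three-line log
printf 'Googlebot hit\nsomething else\ngooglebot hit\n' > april.log

# -i ignores case, -c prints the number of matching lines
grep -i -c googlebot april.log
# prints: 2
```

Note that “wc” counts lines, words and characters, while “grep -c” counts matching lines only, which is usually what you want here.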

Summary of what we’ve covered so far

“cat” – Output the contents of a file
“\” – Backslash to escape characters
“grep” – Searches the given input, selecting lines that match one or more patterns
“|” – The pipe character allows us to chain commands
“wc” – Outputs the number of lines, words and characters in the input

How often does Googlebot visit your most important pages?

I want to finish off with something more advanced. What if you had a site that you suspected Googlebot wasn’t crawling correctly? Let’s say you wanted to plot a graph of the number of hits to your website from Google over the last 12 months. This would essentially show you increases and decreases in how often Google is hitting your website, which can be useful for a whole number of things. OK, here we go; the command is as follows:

cat */*\.log | grep Googlebot | awk '{print $4}' | awk -F/ '{print $2}' | sort -M | uniq -c

Don’t panic! Let’s talk through it:

The first part should look familiar; all we do differently is add “*/”, which means any folder, since we are no longer in the directory we are searching – we want to search across the April, May, June and July folders, so we need the “*/” at the start. Then, up to the end of “grep Googlebot”, it is just the same as before: it prints all occurrences of “Googlebot”.

If we just do that, we get the following output (note: if you actually did this there would be a lot more data):

- - [28/May/2012:23:44:40 -0700] "GET HTTP/1.1" 200 3421 "-" "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +"
- - [29/May/2012:01:59:32 -0700] "GET HTTP/1.0" 200 66 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"
- - [29/May/2012:13:30:41 -0700] "GET HTTP/1.0" 200 66 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +"
- - [29/May/2012:13:30:47 -0700] "GET HTTP/1.1" 200 4945 "-" "DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +"
craig-bradfords-MacBook-Pro:server logs Craig$

What do we want to do with this now? Well, we need to strip out the date, so we can tell which months Google hit the site. To do that we use a new command called “awk”, which splits each line of input into columns using spaces. To make this simple, we are going to take just one log entry from the Googlebot output above to help explain the awk command. See the log input below:

- - [29/May/2012:13:30:47 -0700] "GET HTTP/1.1" 200 4945 "-" "DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +"

The “awk” command splits the data above into columns using spaces: column 1 is the host (redacted above), columns 2 and 3 are the two dashes (-), column 4 is the timestamp ([29/May/2012:13:30:47) and so on. Since we are only interested in the part with the month in it, we can tell the command line to print only column 4 – the $4 in the command above represents the column we want. If we now run the command again, but this time only up to the end of awk '{print $4}', like this:

cat */*\.log | grep Googlebot | awk '{print $4}'

The output looks like this:


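The same splitting can be tried on a single sample line – here 192.0.2.1 is a placeholder standing in for the redacted host:

```shell
# One sample log line; 192.0.2.1 stands in for the redacted host
line='192.0.2.1 - - [29/May/2012:13:30:47 -0700] "GET / HTTP/1.1" 200 4945'

# awk splits on whitespace; $4 is the fourth column -- the timestamp
printf '%s\n' "$line" | awk '{print $4}'
# prints: [29/May/2012:13:30:47
```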
We actually don’t need all of that, though; we just need the month part, “May”. To do this we pipe the output into awk again, but tell it to use a “/” instead of a space to separate the data. This is done by adding “-F/” (note the / after the F) to the awk command. See the new breakdown of columns below:


We now tell it to just print column 2, since that’s all we are interested in. If we run the command line again up to the end of the new section, we get the following output:


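The “-F/” split can be sketched in isolation on just the timestamp field:

```shell
# With -F/ awk splits on "/" instead of spaces, so the timestamp
# "[29/May/2012:13:30:47" becomes: $1="[29", $2="May", $3="2012:13:30:47"
printf '[29/May/2012:13:30:47\n' | awk -F/ '{print $2}'
# prints: May
```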
We’re almost there; all that remains is to sort the output by month, remove duplicates and count the number of occurrences. That’s what the last two parts of the command do:

sort -M | uniq -c

“sort -M” sorts all the months above by month; we then pipe that into “uniq -c”, which removes the duplicates and counts the number of each. When we run it all together we get the following:

craig-bradfords-MacBook-Pro:server logs Craig$ cat */*.log | grep Googlebot | awk '{print $4}' | awk -F/ '{print $2}' | sort -M | uniq -c
 28 Mar
204 Apr
294 May
236 Jun
 16 Jul
craig-bradfords-MacBook-Pro:server logs Craig$
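The last two stages can be tried on their own with some made-up month data:

```shell
# Made-up month data to show the last two stages on their own:
# sort -M orders by month name, uniq -c counts adjacent duplicates
printf 'May\nApr\nMay\nJun\nApr\n' | sort -M | uniq -c
```

The counts come out as Apr 2, May 2, Jun 1. Note that “uniq” only collapses adjacent duplicate lines, which is exactly why the “sort” has to come first.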

We can now see how many times per month Googlebot hit my site, and if I spotted anything weird I could look into why. The final thing that would be nice to do is put this into a nice graph to show my client. To do this, just add:

“> filename.txt” to the end of the command and it will output all the numbers you need into a text file. Just to be clear, the final command would look like this:

cat */*.log | grep Googlebot | awk '{print $4}' | awk -F/ '{print $2}' | sort -M | uniq -c > googlebot.txt

You can then throw that into Excel and make the following graph:

google hits per month

I think that’s plenty for today, wouldn’t you agree? Just before I finish, think about how you could use the above process on a larger scale. Obviously I’ve only used it for a few months and for all pages of my site, but what if you wanted to check whether Googlebot was visiting certain pages or sections of your site less often than others? You could use the above process, but instead of pulling out months, you could pull out specific sections of the site and produce a similar graph.

You could plot the number of times Google has hit your blog, your shop etc over the space of 12 months. If there are important sections of your site not getting crawled as often as you would like, you could use this as a signal to perhaps revise your internal navigation or build more links to the important pages that make you money or drive conversions.

I hope you’ve found this useful; as always, please leave comments, ask questions and share with others. Feel free to drop me a message on Twitter @CraigBradford. For further reading, I also recommend this post by Ian Lurie on how to mine server logs for broken links.


About the author

Craig Bradford

Craig is VP of operations for Distilled's SEO split-testing platform, the ODN. Craig moved from Glasgow, Scotland to join Distilled in 2011 as an SEO analyst. Since joining, he has consulted with a range of...