How SEOs Can Use Python

Python for SEO

Over the last six months or so, I’ve been learning more and more about coding. It all started with WordPress and quickly moved on to other things.

As an SEO, one of the things I notice is how much work is sometimes involved in simple tasks like writing title or description tags for clients. As this is SEO 101, we end up doing it for every client. Most of the time the sorts of things we want in a good title are already on the page. We just need a way to extract them. Crawlers like Xenu make finding all the pages easy. But then what?

Enter Python

Python is my scripting language of choice. Its easy syntax, batteries-included attitude, and libraries for just about everything make it a great choice for a great many things. It makes tasks like writing titles and descriptions go quicker.

In short, knowing Python (or any other scripting language) gives an SEO the tools to get results for their clients more quickly.

An Example: Your Client Can’t Give You a List of URLs

When you ask a client how many pages their site has, chances are you’ll get a pretty inconclusive answer. “Maybe about 10,000?” they’ll say. As an SEO, you need to know, so you send out a crawler, like Xenu, to find everything.

There’s another way, however: the sitemap. Almost every ecommerce or CMS platform generates a sitemap. All you need is an XML parser and a way to fetch the sitemap URL to find every URL your client deemed important enough to put in a sitemap. This becomes especially useful when clients have multiple sitemaps (categories, products, static pages, etc.). It lets you find specific pages to optimize first, like product pages on an ecommerce site.

Sitemaps are also good Python practice because the spec is well known and widely used. You can count on most sitemaps having the same structure and well-formed XML. That makes it easy to use a parser like BeautifulSoup (see below).
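To make that concrete, here is a minimal sketch of what a sitemap typically looks like and how little code it takes to read one. It is written for Python 3 and uses the standard library’s xml.etree.ElementTree rather than BeautifulSoup, purely so it runs without installing anything. The sample XML, the example.com URLs, and the function name list_locations are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up sitemap following the sitemaps.org protocol (illustration only).
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2011-10-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/products/guitar</loc>
  </url>
</urlset>"""

# Sitemap tags live in this XML namespace, so lookups have to include it.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def list_locations(xml_text):
    """Return the text of every <loc> found inside a <url> element."""
    root = ET.fromstring(xml_text.encode('utf-8'))
    return [loc.text.strip() for loc in root.findall('sm:url/sm:loc', NS)]


locations = list_locations(SAMPLE_SITEMAP)
print(locations)
```

Note that the second entry has only a loc tag: the other three tags are optional in the protocol, which is worth keeping in mind when reading real sitemaps.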

Your Python Sitemap Parser

We’ll use two third-party Python libraries for this example. Requests is a nice wrapper around many of Python’s URL and HTTP libraries with a much prettier API. My XML/HTML parser of choice is BeautifulSoup, which, apart from its humorous name, has great documentation and works well. We’ll import these two libraries and the with_statement at the top of our file.

from __future__ import with_statement # we'll use this later, has to be here

import requests
from BeautifulSoup import BeautifulStoneSoup as Soup

To get started, we’ll write a function that takes a URL as its only argument. It will then grab the content of the page with a simple GET request. We’ll fetch the URL with requests.get, which returns a response object. That object has several attributes, but we’re only going to worry about two: the status_code and content. After getting the URL, we’ll check to make sure it returned a 200 OK response.

def parse_sitemap(url):
	resp = requests.get(url)

	# we didn't get a valid response, bail
	if 200 != resp.status_code:
		return False

With that done, we can use BeautifulStoneSoup to parse the XML. This returns an object which contains several useful methods, but we’ll only use find and findAll. We’ll use findAll, which takes a tag name as its only required, positional argument, to find all the url tags in the sitemap.

def parse_sitemap(url):
	resp = requests.get(url)

	# we didn't get a valid response, bail
	if 200 != resp.status_code:
		return False

	# BeautifulStoneSoup to parse the document
	soup = Soup(resp.content)

	# find all the <url> tags in the document
	urls = soup.findAll('url')

findAll returns a list, and each of its items is also a BeautifulStoneSoup object. If we didn’t get any URLs, findAll will return an empty list, which we’ll check for. Next we’ll iterate through our list of URLs, and extract each of the elements with a call to find. The .string attribute at the end of each find extracts the text of the element only. After that’s all done, we can return what we found. The entire function looks like this.

def parse_sitemap(url):
	resp = requests.get(url)

	# we didn't get a valid response, bail
	if 200 != resp.status_code:
		return False

	# BeautifulStoneSoup to parse the document
	soup = Soup(resp.content)

	# find all the <url> tags in the document
	urls = soup.findAll('url')

	# no urls? bail
	if not urls:
		return False

	# storage for later...
	out = []

	# extract what we need from each <url>
	for u in urls:
		row = []
		for name in ('loc', 'priority', 'changefreq', 'lastmod'):
			tag = u.find(name)
			# only <loc> is required; use an empty string when a tag is missing
			row.append(tag.string if tag is not None and tag.string else u'')
		out.append(row)
	return out
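When a client splits their sitemap into several files (categories, products, static pages), those files are usually tied together by a sitemap index, which lists sitemap entries instead of url entries. The sketch below tells the two kinds of document apart and pulls out the loc values. It is written for Python 3 with the standard library only; the function name read_sitemap and the sample data are hypothetical and not part of the script above.

```python
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Made-up sitemap index pointing at two child sitemaps (illustration only).
SAMPLE_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.example.com/product-sitemap.xml</loc></sitemap>
  <sitemap><loc>http://www.example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>"""


def read_sitemap(xml_bytes):
    """Return ('index', [child sitemap URLs]) or ('urlset', [page URLs])."""
    root = ET.fromstring(xml_bytes)
    # root.tag looks like '{namespace}sitemapindex'; keep the part after '}'
    kind = 'index' if root.tag.split('}')[-1] == 'sitemapindex' else 'urlset'
    path = 'sm:sitemap/sm:loc' if kind == 'index' else 'sm:url/sm:loc'
    return kind, [loc.text.strip() for loc in root.findall(path, NS)]


kind, children = read_sitemap(SAMPLE_INDEX)
# Pick out the product sitemap to work on first.
product_sitemaps = [u for u in children if 'product' in u]
print(kind, product_sitemaps)
```

Each child URL found this way could then be passed to parse_sitemap in turn.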

That’s it! You could use this function from the Python shell:

>>> from xml import parse_sitemap
>>> l = parse_sitemap('http://www.classicalguitar.org/post-sitemap.xml')
>>> for url in l:
...     # do stuff here, like write to file

Or you could write an if __name__ == '__main__' clause at the bottom of the file with the logic of what’s supposed to happen if the script is run directly like this:

shell> python xml.py

That clause might look a bit like this:

if __name__ == '__main__':
	# imported here so the clause stands alone; these could also go at the top of the file
	import sys
	from argparse import ArgumentParser

	options = ArgumentParser()
	options.add_argument('-u', '--url', action='store', dest='url', required=True, help='The URL of the sitemap to parse')
	options.add_argument('-o', '--output', action='store', dest='out', default='out.txt', help='Where you would like to save the data')
	args = options.parse_args()
	urls = parse_sitemap(args.url)
	if not urls:
		print('There was an error!')
		sys.exit(1)  # nothing to write, so stop here
	# one line per URL, fields separated by tabs
	with open(args.out, 'w') as out:
		for u in urls:
			out.write('\t'.join([i.encode('utf-8') for i in u]) + '\n')

And then from the command line:

shell> python xml.py -u http://www.classicalguitar.org/post-sitemap.xml -o output.txt
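Joining fields with tabs by hand works as long as no value contains a tab or a newline. As an alternative sketch, Python’s standard csv module can write the same tab-separated rows and take care of quoting. This version is written for Python 3, and the helper name write_rows and the sample rows are assumptions for illustration.

```python
import csv
import io


def write_rows(rows, handle):
    """Write each [loc, priority, changefreq, lastmod] row as a tab-separated line."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerow(['loc', 'priority', 'changefreq', 'lastmod'])  # header row
    for row in rows:
        # missing optional values become empty cells
        writer.writerow(['' if value is None else value for value in row])


# Made-up rows in the same shape parse_sitemap returns.
rows = [
    ['http://www.example.com/', '1.0', 'daily', '2011-10-01'],
    ['http://www.example.com/about', None, None, None],
]

# StringIO stands in for a real file opened with open('out.txt', 'w', newline='')
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

The resulting file opens cleanly in a spreadsheet, which is handy when handing title and description work to someone else.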

This entire script is available on GitHub if you’re interested.

Why Every SEO Should Know Python

Well, every SEO should know some scripting language. It’s a tool that allows you to quickly do work for your clients and get things in place faster so you can start getting results.

Posted on October 24, 2011 in Python.