Python: Web Scraping with lxml

Introduction

I recently needed a list of ISO2-coded countries that have provinces instead of states. It turns out I couldn't find a nice list of these codes, so I decided to build it myself and, in the process, learn a little about web scraping with Python.

Design

We will use urllib2 and lxml to open a connection to the website and parse the response into a navigable DOM. Once we have a DOM, we can use CSS selectors to find the information we care about.
Notes:

  • you can also use XPath selectors, but I had trouble getting them to work with lxml (see the sketch after this list)
  • BeautifulSoup is a popular web scraping library that can actually use lxml to do the parsing, but I had trouble installing it on my machine, so I went with plain lxml
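
For reference, here is a minimal sketch of the two selector styles side by side (example.com stands in for a real URL):

from urllib2 import urlopen
from lxml.html import fromstring

dom = fromstring(urlopen('http://example.com').read())

rows_css = dom.cssselect('tr')    # CSS selector style
rows_xpath = dom.xpath('//tr')    # the XPath equivalent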

We will be using data from the CIA World Factbook to determine if a country has states or provinces. Our starting point will be an index page containing links to each specific country. Our basic algorithm looks like:

  1. Parse index page, grabbing the links to each individual country page
  2. Parse each country page, looking for information about provinces
  3. Look up the country code based on the name of the country (only if it has provinces)

Implementation

from urllib2 import urlopen
from lxml.html import fromstring
from lxml.html import tostring
import pycountry

def get_page(url):
  html = urlopen(url).read()
  dom = fromstring(html)
  dom.make_links_absolute(url)
  return dom

def get_country_code_map():
  # build a country name -> ISO2 code lookup from pycountry
  cc = {}
  for country in pycountry.countries:
    cc[country.name.lower()] = country.alpha2

  return cc

def get_iso2_code(country):
  global countryCodes
  if countryCodes is None:
    countryCodes = get_country_code_map()

  # returns None when the country name isn't in the map
  return countryCodes.get(country.lower())

def parse_country_page(country, url):
  dom = get_page(url)
  isoCode = get_iso2_code(country)
  global unknownCodes
  global countriesWithProvinces
  
  # it's about to get nasty:
  # the web page for each country is laid out such that it is practically
  # impossible to uniquely identify elements with the CSS selectors available to lxml
  for row in dom.cssselect("tr"):
    cells = row.getchildren()
    if len(cells) < 2:
      continue
    td = cells[0].getchildren()

    # look for the tr with 2 td's, where the left td identifies the section that
    # tells us if the country has provinces (the Administrative divisions section)
    if len(td) > 0 and 'Administrative divisions' in tostring(td[0]) and 'province' in tostring(cells[1]):
      temp = tostring(cells[1])

      # to help identify possible false positives, print out the text that
      # mentions provinces, e.g. "Afghanistan has 34 provinces"
      #print country + ' has ' + temp[temp.find('>')+1:temp.find('province')+len('province')+1].strip()

      # look up the ISO code and add the country to the appropriate output list
      if isoCode is None:
        unknownCodes.append(country)
      else:
        countriesWithProvinces.append(country + ' (' + isoCode + ')')

def get_countries_with_provinces():
  dom = get_page("http://geography.about.com/library/cia/blcindex.htm")

  for node in dom.cssselect('#articlebody>ul>li>a'):
    country = node.text
    url = node.attrib['href']
    parse_country_page(country, url)

countryCodes = None
unknownCodes = []
countriesWithProvinces = []

get_countries_with_provinces()

print '\nCountries with Provinces - ISO2 Codes'
for country in countriesWithProvinces:
  print country
  
print '\nCountries with Provinces - Unable to find country code'
for country in unknownCodes:
  print country
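
For illustration, the first list ends up looking something like this (a trimmed, hand-picked sample; the real output is much longer):

Countries with Provinces - ISO2 Codes
Afghanistan (AF)
Argentina (AR)
Canada (CA)
China (CN)
...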

Most of the real work is done in get_countries_with_provinces(), which parses the index page for all the countries (getting our list of URLs to visit), and parse_country_page(country, url), which parses each individual country page looking for information about provinces. Besides that we have a few helper methods to avoid repetition, improve readability, etc. Yay abstraction! The script also uses a few global variables (e.g. countriesWithProvinces) to keep track of the country codes and errors across method calls.

I didn't spend too much time hardening this script, as it is meant to be a quick and dirty approach to solving a one-off problem. So I have things like global variables, and the script isn't encapsulated in an if __name__ == "__main__" guard. These are bad practices I normally wouldn't partake in, but I am the only one using this code, it won't be incorporated into any other code, and I probably won't use it again. Forgive me if I cut a few corners for the sake of time.
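
For the record, tidying that up would just mean wrapping the driver code in the usual guard, something like this sketch (not what I actually ran):

def main():
  get_countries_with_provinces()

  print '\nCountries with Provinces - ISO2 Codes'
  for country in countriesWithProvinces:
    print country

  print '\nCountries with Provinces - Unable to find country code'
  for country in unknownCodes:
    print country

if __name__ == "__main__":
  main()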

Results

Once you find and set up your libraries/frameworks, this is a pretty simple problem to solve. Most of the headaches came from working with the HTML itself. Parsing the individual country pages was pretty bad because of the way the pages were written. There were almost no ids or unique class names used anywhere, so there was no nice or easy way to identify the section of the page containing the province information. FireBug was giving me CSS selectors like:

html body#tt0.gt5.education.nd div#abw div#abb div#abm.clear div#abc div#articlebody table tbody tr td

That might work for the page I pulled the selector from, but do you think it would be the same for all 200-something country pages? So I ended up with custom code to parse the page, like this:

for row in dom.cssselect('tr'):
  cells = row.getchildren()
  if len(cells) < 2:
    continue
  td = cells[0].getchildren()

  # look for the tr with 2 td's, where the left td identifies the section that
  # tells us if the country has provinces (the Administrative divisions section)
  if len(td) > 0 and 'Administrative divisions' in tostring(td[0]) and 'province' in tostring(cells[1]):
    ...

Still not that nice, but it's a little clearer (I think) than that CSS selector. It is also kind of slow, taking maybe 30-45 seconds to run. I didn't time it, but it was long enough for me to switch over to another task and come back.
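
If I had cared about an exact number, a quick measurement with the standard time module would have done it, e.g.:

import time

start = time.time()
get_countries_with_provinces()
print 'took %.1f seconds' % (time.time() - start)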

I now have my nice reduced list of countries that have provinces instead of states. It turns out we only do business with China, Canada, and Argentina, so the list wasn't really needed, but you never know when an order might come in from Tajikistan, right? In any case, it took about an hour to an hour and a half, and I got my answer and learned a little about web scraping. I'd call that a success.

