Scripting a Website Scraper

This is a follow-up to my previous post, Please learn to script?. A friend needed a list of companies in Beijing for a research project. Alibaba has a list of such companies, but getting the information requires manually clicking through all 90 pages of results and copying and pasting each link. This was a perfect application of the automation I talked about last time: writing a simple script to let the computer do the dirty work. Here's one script that could be used for that purpose:

# Outputs a tab-separated file of company names and links
import re
import requests

# Matches a result-list <a> tag and captures its href and title attributes
listitem_re = re.compile(r'\s*<a id="lsubject_(\d+)" onmousedown="(.*)" href="(?P<href>http://.*)" onclick="(.*)" title="(?P<title>.*)" target="_blank">')

with open('output.txt', 'w') as outfile:
    outfile.write('Name\tLink\n')
    for p in range(1,91):  # the results are spread over 90 pages
        url = "http://www.alibaba.com/corporations/beijing/--CN------TR1------------------OFFSET1/{}.html".format(p)
        r = requests.get(url)
        if r.status_code == 404:
            print("URL {} not found!".format(url))
            break

        lines = r.text.splitlines()
        for l in lines:
            m = listitem_re.match(l)
            if m is not None:
                # Unescape '&amp;' so company names print correctly
                title = m.group('title').replace('&amp;', '&')
                outfile.write('{}\t{}\n'.format(title, m.group('href')))

        print("Processed page {}".format(p))

The script is made slightly simpler by Alibaba's convenient handling of search URLs: the search query seems to be encoded in the long dash-separated segment, and simply appending {number}.html to it gets you that page of results.

What about beginner scripters?

If someone learning to script were to write this, the following points might give them trouble:

  • listitem_re: regular expressions are hard, and this one would be difficult for a beginner programmer. I wish we had a less opaque method of English text processing.
  • with open(...) as f: it would also be fine if the scripter wrote outfile = open(...) and later outfile.close(). That's not as Pythonic, but it's probably more likely to show up in beginner Python tutorials (see the sketch after this list).
  • for p in range(1,91): url = ".../{}.html".format(p): this is a trick that a beginner might have trouble coming up with immediately.
  • r = requests.get(url): the script writer needs to know about the excellent Requests library.
  • if r.status_code == 404: the script would work fine without this error-checking.
  • m = listitem_re.match(l): again, REs are hard, but this is one of the simplest uses of regular expressions.
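
For the with open(...) point above, here's a minimal sketch of what the open()/close() version would look like; the stand-in line marks where the scraping loop (unchanged) would go:

# Beginner-style file handling: open and close explicitly
outfile = open('output.txt', 'w')
outfile.write('Name\tLink\n')
outfile.write('Example Co.\thttp://example.com\n')  # stand-in for the scraping loop
outfile.close()  # easy to forget; 'with' closes the file even if an exception is raised
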

Regular expressions are the most difficult thing about this "assignment", but with the help of an experienced mentor (or maybe Stack Overflow), this script should certainly be within reach of a budding script-writer.
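
As for that less opaque alternative I was wishing for: an HTML parser would do the same job without the regex. Here's a rough sketch using the third-party BeautifulSoup library (pip install beautifulsoup4). Note this is not what the script above does, and the a[id^="lsubject_"] selector is my guess at the page structure, inferred from the regular expression:

# Alternative: extract the links with an HTML parser instead of a regex
import requests
from bs4 import BeautifulSoup

url = "http://www.alibaba.com/corporations/beijing/--CN------TR1------------------OFFSET1/1.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Find every <a> whose id starts with "lsubject_" and read its title and href
for a in soup.select('a[id^="lsubject_"]'):
    print(a.get('title'), a.get('href'), sep='\t')

A nice side effect is that the parser unescapes entities like &amp; for you, which the regex version has to handle by hand.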

The ability to write scripts like this, or at least the ability to recognize when scripting can be useful, is the difference between days of tedious "manual labor" and a few minutes of hacking together a quick script and getting the job done.
