Crawling CodeSnipers

In Detecting Dead Links, I looked at some simple crawler principles using the example of a script that checks a web page for broken links. Today I want to look at a more dynamic example. It is actually fairly easy to create a crawler that visits a broad range of pages, jumping from site to site and collecting data. In this example, however, I want to take specific measures to avoid that: the crawler should only look at pages that are reachable within a given set of base URLs.

The flow is pretty straightforward:

  1. Import libraries.
  2. Initialize data structures.
  3. Crawl.

A few helper libraries are imported first:

import string
import urllib
import re
import Queue

The parsing is again done with a regular expression:

url_pattern = re.compile(r'''
                     (?x)(
                     http
                     ://
                     (\w+[:.]?){2,}
                     (/?|
                     [^ \n\r"'])+
                     [\w/!?.=]
                     (?=[\s\.,#\<\>)"'\]])
                     )
                     ''')
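
To get a feel for what this pattern picks up, here is a quick test against a made-up HTML snippet (the sample string is purely illustrative):

# a made-up snippet of HTML, just to see what the pattern extracts
sample = '<p>See <a href="http://www.codesnipers.com/?q=node/227">this post</a> or http://codesnipers.com for details.</p>'
# findall returns one tuple per match; the full URL is the first group
for match in re.findall(url_pattern, sample):
    print match[0]
# prints:
# http://www.codesnipers.com/?q=node/227
# http://codesnipers.com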

To start the crawl, a URL is required. In addition, a queue is set up, which at all times holds the links that are yet to be crawled. At the end of the crawl, visited_links should ideally be a list of all URLs that could be reached starting from the initial URL. If all pages are connected, this may well be every single page on the site.

url = 'http://www.codesnipers.com'
crawl_queue = Queue.Queue(0)          # maxsize 0 means the queue can grow without bound
crawl_queue.put(string.lower(url))    # URLs are stored in lowercase
visited_links = []

Finally, we'll need a means of keeping this crawler focused. We don't want it to start looking at our website, only to wander off and peruse cnn.com.

For this purpose, a list of base URLs can be specified; the crawler will not look at any pages outside of those. The function get_base_url() determines the base URL of a given address. For example, it would turn http://codesnipers.com/?q=node/227&&title=MSDN-Channel-9-Interview-with-Gavin-Bowman into http://codesnipers.com.

def get_base_url(url):
    # skip past the scheme ('http://'), if there is one
    start_of_base = url.find('://')
    if start_of_base == -1:
        start_of_base = 0
    else:
        start_of_base += 3
    # the base URL ends at the first slash after the scheme
    end_of_base = url[start_of_base:].find('/')
    if end_of_base >= 0:
        return url[:start_of_base + end_of_base]
    else:
        return url

base_urls = ['http://www.codesnipers.com', 'http://codesnipers.com']
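
Called on the example address from above, and on a URL without a path, the function behaves like this:

print get_base_url('http://codesnipers.com/?q=node/227&&title=MSDN-Channel-9-Interview-with-Gavin-Bowman')
# prints: http://codesnipers.com
print get_base_url('http://www.codesnipers.com')
# prints: http://www.codesnipers.com (no path, so the URL comes back unchanged)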

The crawler logic itself is simple. While there are links in the crawl queue, the pages referenced by the queued URLs are visited (provided they belong to one of the specified base URLs). Links are collected from those pages and added to the queue (if they have not been visited yet). Here's the code for the crawl:

# process queue
while not crawl_queue.empty():
    url = crawl_queue.get()
    if url not in visited_links:
        try:
            # get the base address
            base_url = get_base_url(url)
            # only look at the URL if it is off (one of) the base URL(s)
            if base_url in base_urls:
                # create a list of links on the page
                link_list = re.findall(url_pattern, urllib.urlopen(url).read())
                # queue up those links that were found but not visited yet
                for link in link_list:
                    if link[0] not in visited_links and link[0] != url:
                        crawl_queue.put(string.lower(link[0]))
                # add the current URL to visited_links
                visited_links.append(url)
                print 'visited:', url
        except Exception, e:
            print e

In the end, visited_links contains the list of those links that were actually visited.
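
For instance, a quick summary could be printed once the queue runs dry:

print 'crawled %d pages:' % len(visited_links)
for link in visited_links:
    print '  ', link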

There is really nothing preventing you from combining this with the functionality discussed in Detecting Dead Links to create a script that scans an entire website for dead links. On the other hand, this can also serve as a starting point for a more elaborate web crawler. There are lots of interesting areas that can be approached from here, such as politeness policies, page content storage mechanisms, multiple crawl threads, a more elaborate parser, and so forth.
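
To sketch the first of those ideas: a small helper along these lines could be pointed at every link the crawler encounters. The is_dead() name and the decision to count any failed request or HTTP error status as a dead link are simplifications for illustration, not necessarily the approach from Detecting Dead Links.

import urllib2

def is_dead(link):
    # treat a link as dead if the request fails outright
    # or comes back with an HTTP error status (4xx/5xx)
    try:
        urllib2.urlopen(link)
        return False
    except (urllib2.HTTPError, urllib2.URLError):
        return True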
