Detecting Dead Links

I have been spending some of my free time lately on the theory and practice of web crawling, searching, and so forth. Let's talk about a very quick and easy application: a script to check for dead links on a web site. It's easy to come up with various use cases for such a script, so this not only incorporates some simple crawler elements, it also does something useful!

The script simply follows these steps:

  1. Import necessary libraries.
  2. Read web page content.
  3. Collect all links on page.
  4. Check if links are dead.

This script makes use of regular expressions and needs to open URLs. This is accomplished using functionality from re and urllib, respectively:

import re
import urllib

The URL to check will be CodeSnipers.com:

url = 'http://www.codesnipers.com'
link_list = []

Now, parsing the page content can obviously be done in various ways. In this case I picked a snippet from David Mertz's very useful Text Processing in Python, Chapter 3 - Regular Expressions, and adjusted it slightly for this example, mostly to allow more flexibility regarding the characters with which a URL can end:

url_pattern = (r'''
                     (?x)(              # verbose mode; the outer group captures the whole URL
                     (http)             # the scheme
                     ://
                     (\w+[:.]?){2,}     # host name, e.g. www.codesnipers.com
                     (/?|               # either just an optional slash, or
                     [^ \n\r"']+        # a path free of whitespace and quotes
                     [\w/!?.=#])        # ending in a "reasonable" character
                     (?=[\s\.,>)"'\]])  # followed by a delimiter such as whitespace, a quote or a bracket
                     )
                     ''')

Next, the page at the given URL needs to be accessed and read into a string. A call to re.findall(), applying the pattern to the page content, then yields a list of the links embedded in the page. I did not bother compiling the regular expression, as it is only used once.

try:
    page = urllib.urlopen(url).read()
    link_list = re.findall(url_pattern, page)
except Exception, e:
    print e
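
Because the pattern contains several groups, re.findall() returns a tuple of group matches for each hit rather than a plain string; the first element of each tuple is the complete URL. A quick, made-up example (the sample string below is purely for illustration) shows what the entries in link_list look like:

sample = 'Visit <a href="http://www.example.com/about.html">us</a> today.'
for match in re.findall(url_pattern, sample):
    # Each match is a tuple of the pattern's groups; index 0 holds the full URL.
    print match[0]    # prints: http://www.example.com/about.html

That is why the loop below refers to link[0] rather than link itself.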

Lastly, we need to iterate through the links and check their availability. This really just involves attempting to read a URL and displaying status information.

for link in link_list:
    print 'Checking:', link[0], '...', 
    try:
        page = urllib.urlopen(link[0]).read()
        print 'OK!'
    except IOError, e:
        print 'PROBLEM:', e

BTW: The comma at the end of the print statement ensures that the following text is displayed on the same line, separated by a single space. Very convenient: the above typically yields one line of output per checked link.

This is it. The good news: as of Wednesday night, there are no dead links on CodeSnipers' main page. I found some initially, but those turned out to be errors in the parser pattern rather than actual dead links.

Threat of a sequel: There is of course room for improvement here. We could do a little more in response to errors. The parser can certainly be improved and made more flexible: do we really want every single URL, or just those enclosed within anchor tags? What kinds of resource types do we want to support? Finally, the script can be made more versatile by crawling an entire site and reporting on broken links throughout. That's when performance will become a more important topic, too.
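
To make the whole-site idea a little more concrete, here is a rough sketch of how such a crawl might look, reusing url_pattern from above. The crawl_site name and the limit of 50 pages are just assumptions for illustration; a real version would also have to deal with relative links, duplicate pages and being polite to the server:

import re
import urllib
import urlparse

def crawl_site(start_url, max_pages=50):
    # Breadth-first walk: every queued URL gets checked, but only pages
    # on the starting host are parsed for further links.
    host = urlparse.urlparse(start_url)[1]
    seen = set([start_url])
    queue = [start_url]
    checked = 0
    while queue and checked < max_pages:
        current = queue.pop(0)
        checked += 1
        try:
            page = urllib.urlopen(current).read()
        except IOError, e:
            print 'PROBLEM:', current, e
            continue
        print 'OK:', current
        if urlparse.urlparse(current)[1] != host:
            continue    # do not follow links found on other sites
        for link in re.findall(url_pattern, page):
            if link[0] not in seen:
                seen.add(link[0])
                queue.append(link[0])

crawl_site('http://www.codesnipers.com')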

what about

sites that start with https? I'm no RegEx expert by any means, but I don't see how those would be caught by your expression.

Neat article.. I really like Python. :)

You are correct, of course.

You are correct, of course. The above expression is really only looking for those urls starting with http. Replacing (http) with (http|https) should do the trick.
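
One quick way to apply that change without retyping the pattern would be something along these lines (just a sketch, relying on the literal text (http) appearing exactly once in the pattern defined above):

url_pattern = url_pattern.replace('(http)', '(http|https)')

The number of groups stays the same, so link[0] still holds the complete URL.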

the http part is optional though

and it will include links that aren't actually linked (i.e. not inside an href=), but that may be desired.
This is a great article by the way; exactly the kind of article I find very useful.

SGMLParser

You could also use Python's SGMLParser to easily pull out the links enclosed in anchors. Mark Pilgrim provides some good coverage of this in Chapter 8 of his book Dive into Python.
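
For what it's worth, a minimal sketch of that approach, closely modelled on the URLLister example from that chapter, could look roughly like this; note that it returns href values exactly as they appear in the page, so relative links such as /about come back unresolved:

import urllib
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    # Collects the href attribute of every anchor tag fed to the parser.
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [value for key, value in attrs if key == 'href']
        if href:
            self.urls.extend(href)

parser = URLLister()
parser.feed(urllib.urlopen('http://www.codesnipers.com').read())
parser.close()
for url in parser.urls:
    print url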