Indexing Experiment

Not that the discussion of web crawling is over - far from it - but I thought it would be nice to start tinkering with indexing a little bit. This post presents a very simple example of creating such an index. The example is intentionally simple to show how easy it is to get started on writing an indexing scheme in Python. Basically, I am looking for a way to create a word index based on the content of a list of files. For the sake of this example, the list consists of two popular URLs:

import re
import urllib

url_list = ['http://www.codesnipers.com', 'http://www.python.org']

In practice this is not very realistic: I would rather have the web crawler collect the documents and then run an indexer locally to create the index, instead of having the indexer grab the documents directly from the web. Anyway - keeping things brief... Next, a string pattern is defined. In its current form, it detects sequences of alphanumeric characters:

url_pattern = re.compile(r'''
                     (\w+)       # a run of alphanumeric characters
                     ''', re.VERBOSE)
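
To get a feel for what this pattern extracts, here is a quick check against a small HTML fragment (the fragment is made up purely for illustration):

import re

url_pattern = re.compile(r'''
                     (\w+)
                     ''', re.VERBOSE)

# made-up HTML fragment, just to show what the pattern picks up
sample = '<p>The Python Software Foundation</p>'
print(re.findall(url_pattern, sample))
# ['p', 'The', 'Python', 'Software', 'Foundation', 'p']

Note that the tag names show up as words too, which already hints at the cleanup discussed further below.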

Here's the logic that creates the word index using the given URLs:

word_hash = {}
for url in url_list:
    # fetch the page content
    page = urllib.urlopen(url).read()
    # extract the words into a list
    word_list = re.findall(url_pattern, page)
    # populate the dictionary containing word indices:
    # each word maps to a list of [position, URL] pairs
    for index, word in enumerate(word_list):
        if word in word_hash:
            word_hash[word].append([index, url])
        else:
            word_hash[word] = [[index, url]]

This goes through the list of URLs and extracts a list of words from the page at each URL. A dictionary (word_hash) is then populated: each term found on a given URL is used as a key, and the associated value ends up being a list of (position, URL) pairs. With the aforementioned URLs, the entries for the term Foundation looked like this during my last test run:

[780, 'http://www.python.org'], [1097, 'http://www.python.org']

This means that the term Foundation was found at word positions 780 and 1097 on http://www.python.org. Given the string pattern used, the index contains a lot of terms of dubious value. It would make sense, for example, to filter out stop words, words that are really just part of the HTML markup, numbers, and so forth. Regardless, the index is not without use. Next week, I'll look at various ways to formulate queries against such an index.
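
As a rough sketch of what such a cleanup could look like - the stop word list and the tag-stripping pattern below are made up for illustration and are not part of the code above - the markup could be stripped and uninteresting tokens skipped before anything reaches the index:

import re
import urllib

# made-up examples; a real stop word list would be much longer
stop_words = set(['the', 'a', 'an', 'and', 'or', 'of', 'to', 'in'])
tag_pattern = re.compile(r'<[^>]+>')        # crude removal of HTML tags
word_pattern = re.compile(r'[A-Za-z]\w*')   # words must start with a letter, which drops plain numbers

def build_index(urls):
    word_hash = {}
    for url in urls:
        page = urllib.urlopen(url).read()
        # drop the markup before extracting words
        text = tag_pattern.sub(' ', page)
        for index, word in enumerate(word_pattern.findall(text)):
            if word.lower() in stop_words:
                continue
            word_hash.setdefault(word, []).append([index, url])
    return word_hash

Looking up a term in the resulting dictionary is then a plain dictionary access, for example word_hash.get('Foundation', []).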