Indexing Experiment - Expressing Queries

In last week's post, I started this indexing experiment by creating a simple index based on the words found at specified URLs. Today, I want to look at querying such an index. Using last week's example, my goal is to find all indexed documents containing the words

  • Python and PHP
  • Python or PHP
  • Python not PHP

And so forth. Let's take a look.

It's fairly straightforward to come up with a function that figures out a collection of documents containing a given term. If the term exists in our determined index (see last week's post), it is associated to a list of document-location pairs. All different document identifiers are added to the results collection.

from sets import Set

def search(term):
    matches = Set()
    if word_hash.has_key(term):
        occurances = word_hash[term]
        for item in occurances:
            if item[1] not in matches:
                matches.add(item[1])
    return matches

Note that the Set class is used to store the result of the search function. This greatly simplifies more interesting queries, such as those containing logical operators.

The following pattern is used to break down search queries:

pattern = re.compile((r'''
                     (?x)
                     (\w+|&|-|\|)
                     '''))

This will match sequences of alphanumeric characters (much like those contained in the index), as well as the &, | and - symbols. These are being used as logical operators and, or and not, respectively.

The following function query evaluates a search string, performs searches and logical operations on individual result sets. The absense of an operator is viewed as a logical or.

def query(search_string):
    result = Set()
    terms = re.findall(pattern, search_string)
    operator = ''
    for term in terms:
        if term in ['&', '|', '-']:
            operator = term
        else:
            if operator in ['|', '']:
                result = result | search(term)
            elif operator == '&':
                result = result & search(term)
            elif operator == '-':
                result = result - search(term)
            operator = ''
    return result

Trying this with the following:

print query('Python & PHP') # and
print query('Python | PHP') # or
print query('Python PHP') # or
print query('Python - PHP') # not

Predictably returns:

Set(['http://www.codesnipers.com'])
Set(['http://www.python.org', 'http://www.codesnipers.com'])
Set(['http://www.python.org', 'http://www.codesnipers.com'])
Set(['http://www.python.org'])

Experimenting with this by adding more data to the index and bigger combinations of logical operations will invariably reveal the limitation of the current very strict one-set-at-a-time, left-to-right order of precedence. Parentheses would be really convenient, I think. For this, query will have to be modified. I will take a look at that next week.