Indexing Experiment - Queries With Parentheses

As indicated in Indexing Experiment - Expressing Queries, I want to allow for more interesting queries with this week's post. Using last week's sample data, I'd like to form expressions, such as

  • Python and PHP
  • Python or PHP
  • (Python not PHP) or Javascript
  • ((Python not PHP) and Buildbot) or Javascript
  • Python and (PHP and AJAX)

In short, last week's example will be expanded to allow usage of parentheses.

The changes affect two areas:

  • infixToPostfix - a new function that turns a search expression from infix notation into postfix notation.
  • query - a modified version of query that processes a search string

One nice thing about postfix notation is that it eliminates the need for parentheses. Also, I am currently assuming the operators and, or and not, each of which have equal priority within an expression.

This makes the algorithm and its implementation pretty straightforward: The function initializes an empty list postfixNotation, a helper structure stack and then iterates through the given infix expression. Each operand (a regular string) is appended to postfixNotation. If any operators &, | - and ( are encountered, they are placed on stack. Encountering a closing bracket, results in adding all operators previously added to stack (LIFO order) to postfixNotation, except the opening bracket. In the end, any operators still remaining in stack are also added to postfixNotation.

def infixToPostfix(expressionList):
    postfixNotation = [];
    stack = []
    for item in expressionList:
        if item in ['&', '|', '-', '(']:
            stack.append(item)
        elif item == ')':
            from_stack = stack.pop()
            while from_stack <> '(':
                postfixNotation.append(from_stack)
                from_stack = stack.pop()
        else:
            postfixNotation.append(item)
    while len(stack) > 0:
        postfixNotation.append(stack.pop())
    return postfixNotation

Query first converts the search expression to list form and then uses infixToPostfix to create a representation in postfix notation. Then, it iterates through each element of that list to determine the final value. Again, the mechanism is relatively straightforward. Any operand is pushed onto a stack, while encountering an operator results into performing the appropriate operation on the top two elements of the stack and then pushing the result back on the stack. In the end, the value of the expression will be the one remaining element on the stack.

Elements on the stack can either be search terms or result sets. The routine does have to check for that.

def query(search_string):
    terms = infixToPostfix(re.findall(pattern, search_string))
    intermediate_results = []
    for term in terms:
        if term in ['&', '|', '-']:
            operand2 = intermediate_results.pop()
            if type(operand2) == type(''):
                operand2 = search(operand2)
            operand1 = intermediate_results.pop()
            if type(operand1) == type(''):
                operand1 = search(operand1)
            if term == '&':
                intermediate_results.append(operand1 & operand2)
            elif term == '|':
                intermediate_results.append(operand1 | operand2)
            elif term == '-':
                intermediate_results.append(operand1 - operand2)
        else:
            intermediate_results.append(term)
    return intermediate_results.pop()

The example pointed out at the beginning of this post can now be expressed like this:

print query('Python & PHP')
print query('Python | PHP')
print query('(Python - PHP) | Javascript')
print query('((Python - PHP) & Buildbot) | Javascript')
print query('Python & (PHP & AJAX)')

As of last testing, this returned the following results:

Set(['http://www.codesnipers.com'])
Set(['http://www.python.org', 'http://www.codesnipers.com'])
Set(['http://www.python.org', 'http://www.codesnipers.com'])
Set(['http://www.python.org', 'http://www.codesnipers.com'])
Set(['http://www.codesnipers.com'])

There are still open issues. The presented code is not offering a whole lot in terms of error-checking, incl. checking for balanced parentheses. Also, last week's example allowed us to ommit or operators, which won't work as nicely now anymore. This does however allow us to construct more meaningful queries and do more interesting things with our index.

More to come ...