Extracting Emails

In last week's post I talked about a simple application of web crawling. This week I want to discuss another application that came to mind while I was thinking about the topic.

Let's talk about extracting email addresses from a web page.

Again, the logic is pretty straightforward and Python's readable syntax makes for an easy script. The flow is like this:

  1. Import libraries.
  2. Open web page.
  3. Extract email addresses.

Just like last week, functionality from re and urllib is used:

    import re
    import urllib

I found that the page at http://www.terc.org/emaillist.html has a nice variety of email addresses. For the sake of this example, it is certainly useful to have a list of differently formatted addresses to work with.

The list of email addresses will be stored in email_list.

    url = 'http://www.terc.org/emaillist.html'
    email_list = []

Regular expressions are used here, and the pattern for finding email addresses is actually pretty simple:

    pattern = (r'''
               (?xm)
               [\w\-\.]+
               @
               [\w\-\.]{2,}
               ''')

Basically, this is looking for strings that begin with a sequence of alphanumeric characters, hyphens and periods, followed by the @ symbol, and concluded by at least two characters that can likewise be alphanumeric characters, hyphens or periods.
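
As a quick sanity check, here is the pattern (collapsed onto one line, since the verbose flags are just for readability) run against a small hand-made string; the addresses in it are made up for illustration:

    import re

    # Same character classes as above, just written on a single line.
    pattern = r'[\w\-\.]+@[\w\-\.]{2,}'

    sample = 'Contact jane.doe@example.org or <a href="mailto:info@sub.example.com">mail us</a>.'
    print(re.findall(pattern, sample))
    # ['jane.doe@example.org', 'info@sub.example.com']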

Given a URL, the page content gets read into a string, which is then used as a parameter for a call to re.findall().

    page = urllib.urlopen(url).read()
    email_list = re.findall(pattern, page)
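
The snippet above is Python 2. On Python 3 the same step would look roughly like this, with urllib.request in place of urllib and an explicit decode of the response bytes (assuming the page is delivered as UTF-8 or something close to it):

    import re
    import urllib.request

    url = 'http://www.terc.org/emaillist.html'
    pattern = r'[\w\-\.]+@[\w\-\.]{2,}'

    # urlopen() returns bytes in Python 3, so decode before matching.
    page = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
    email_list = re.findall(pattern, page)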

Web pages often contain the same email address more than once, especially since the above pattern does not distinguish between email addresses that are part of a mailto: string and those appearing in other parts of a document.

It is pretty straightforward to create a list of unique email addresses, given the data in email_list.

The following snippet creates a new list unique_emails that will contain only the unique email addresses from the opened web page.

    unique_emails = []
    for email in email_list:
        if email not in unique_emails:
            unique_emails.append(email)
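
An equivalent approach that avoids the linear membership test on the growing list is to track the addresses seen so far in a set; this sketch keeps the first occurrence of each address, just like the loop above:

    seen = set()
    unique_emails = []
    for email in email_list:
        # Set lookup is cheap, so this stays fast even for long lists.
        if email not in seen:
            seen.add(email)
            unique_emails.append(email)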

Nowadays, it makes a lot of sense not to include plain email addresses on web pages at all. A lot of people have begun modifying addresses slightly, in a way that keeps the intent obvious to human visitors but (ideally) makes them much harder for computer programs to detect.

Simple examples include replacing the @ symbol with different characters, inserting obvious out-of-place characters in the domain portion, and so forth.
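
To see why this works against a harvester like the one above, here is the earlier pattern run against a made-up address obfuscated in that style; without a literal @ there is simply nothing for it to match:

    import re

    pattern = r'[\w\-\.]+@[\w\-\.]{2,}'

    # 'jane.doe (at) example (dot) org' is a hypothetical, human-readable form.
    print(re.findall(pattern, 'jane.doe (at) example (dot) org'))
    # []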

Modification of existing email address strings is simple enough that it can be automated. First the pattern is changed slightly:

    pattern = (r'''
               (?xm)
               (
               ([\w\-\.]+)
               (@)
               ([\w\-\.]{2,})
               )
               ''')

Several pairs of parentheses are added to form groups of sub-results, which can be referenced individually. Here are the groups in order of appearance:

  1. The entire email address.
  2. The part before the @ symbol.
  3. The @ symbol.
  4. The part after the @ symbol.

This is useful for the following replacement logic.

    replaced_content = re.sub(pattern, r'\2 at \4', page)

The string replaced_content will contain a copy of the original page, except that in all detected email addresses the @ symbol is replaced with ' at '.
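
Applied to a small hand-made snippet (the addresses are again made up), the substitution looks like this; the grouped pattern is written on one line here, but it is the same expression as above:

    import re

    pattern = r'(([\w\-\.]+)(@)([\w\-\.]{2,}))'

    sample = 'Write to jane.doe@example.org or info@example.com for details.'
    print(re.sub(pattern, r'\2 at \4', sample))
    # Write to jane.doe at example.org or info at example.com for details.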

This is of course just a simple example of a replacement mechanism. For a larger website with lots of dynamic content and frequently changing contact information, it might be quite useful to use a similar method to "pre-process" page content and modify any email addresses before displaying them to the user.
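
A minimal sketch of such a pre-processing step might look like the following; obfuscate_emails is a hypothetical helper name, and where exactly it gets called depends entirely on how the site generates its pages:

    import re

    EMAIL_PATTERN = r'(([\w\-\.]+)(@)([\w\-\.]{2,}))'

    def obfuscate_emails(content):
        # Replace the @ in every detected address with ' at ' before the
        # content is handed to the browser.
        return re.sub(EMAIL_PATTERN, r'\2 at \4', content)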

Script Kiddies Rejoice

You've just given birth to 10,000 new spammers. :-/

I jest.

Yeah, it was definitely a consideration. In the end, I honestly didn't believe the risk was all that great though.