CS 124 / LINGUIST 180   -     Winter 2009
Homework 1: Harvesting emails and phone numbers
Due: January 13 before the start of class
Here's your chance to be a SpamLord! Yes, you too can build regexes to help spread evil throughout the galaxy!
More specifically, your goal in this assignment is to write regular expressions that extract phone numbers and regular expressions that extract email addresses.
To start with a trivial example, given
jurafsky@stanford.edu
your job is to return
jurafsky@stanford.edu
But you also need to deal with more complex examples created by people trying to shield their addresses, such as the following types of examples that I'm sure you've all seen:
jurafsky at [university].edu
jurafsky(at)cs.stanford.edu
jurafsky at csli dot stanford dot edu
You should even handle examples that look like the following (as it appears on your screen; we've used metachars on this page to make it display properly):
<script type="text/javascript">obfuscate('stanford.edu','jurafsky')</script>
For all of the above you should return
jurafsky@stanford.edu
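As a rough illustration of how a few of these patterns might be caught, here is a minimal Java sketch. It is not the starter code: the class name EmailSketch, the method extract, and both patterns are assumptions made for this example. It only handles .edu addresses written with "@", "(at)", " at ", and " dot ", plus the obfuscate('domain','user') script call shown above. It keeps subdomains such as "cs." as written, and it does not handle the "[university]" form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailSketch {
    // Hypothetical pattern: a user name, then "@", "(at)" or " at ", then a
    // domain whose labels are joined by "." or " dot " and that ends in "edu".
    private static final Pattern AT_DOT = Pattern.compile(
        "([A-Za-z0-9._-]+)(?:\\s*@\\s*|\\s*\\(at\\)\\s*|\\s+at\\s+)"
        + "((?:[A-Za-z0-9-]+(?:\\.|\\s+dot\\s+))+edu)\\b",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical pattern for the script form: obfuscate('domain','user').
    private static final Pattern SCRIPT = Pattern.compile(
        "obfuscate\\('([^']+)'\\s*,\\s*'([^']+)'\\)");

    // Returns every address found in the line, rewritten as user@domain.
    static List<String> extract(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = AT_DOT.matcher(line);
        while (m.find()) {
            // Turn " dot " separators back into real dots.
            String domain = m.group(2).replaceAll("(?i)\\s+dot\\s+", ".");
            found.add(m.group(1) + "@" + domain);
        }
        Matcher s = SCRIPT.matcher(line);
        while (s.find()) {
            // The script call lists the domain first and the user second.
            found.add(s.group(2) + "@" + s.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("jurafsky at csli dot stanford dot edu"));
        System.out.println(extract("obfuscate('stanford.edu','jurafsky')"));
    }
}
```

A pattern this loose will also fire on ordinary prose such as "look at stanford.edu", so expect to tighten it against the development data.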
Similarly, for phone numbers, you need to handle examples like the following:
TEL +1 650 723 0293
Phone: (650) 723-0293
Tel (+1): 650-723-0293
<a href="contact.html">TEL</a> +1 650 723 0293
all of which should return the following canonical form:
650-723-0293
(You can assume all phone numbers we give you will be inside North America.)
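One possible way to normalize the four phone examples above is sketched below in Java. The class name PhoneSketch, the method extract, and the pattern are assumptions for illustration, not part of the starter code. The pattern accepts an optional "+1" or "1" prefix, optional parentheses around the area code, and spaces, dots, or hyphens as separators, then rejoins the three digit groups with hyphens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhoneSketch {
    // Hypothetical pattern: optional country code, a 3-digit area code
    // (optionally parenthesized), a 3-digit exchange, and a 4-digit number.
    // The lookarounds keep the match from starting or ending inside a longer
    // run of digits.
    private static final Pattern PHONE = Pattern.compile(
        "(?<!\\d)(?:\\+?1[\\s.-]*)?\\(?(\\d{3})\\)?[\\s.-]*"
        + "(\\d{3})[\\s.-]*(\\d{4})(?!\\d)");

    // Returns every phone number found in the line as AAA-EEE-NNNN.
    static List<String> extract(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = PHONE.matcher(line);
        while (m.find()) {
            found.add(m.group(1) + "-" + m.group(2) + "-" + m.group(3));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("TEL +1 650 723 0293"));
        System.out.println(extract("Phone: (650) 723-0293"));
        System.out.println(extract("Tel (+1): 650-723-0293"));
    }
}
```

Because the separators are optional, this sketch will also match any bare ten-digit number, which may or may not be what you want on real pages.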
In order to make it easier for you to do this and other homeworks, we will be giving you some data to test your code on, what is technically called a development test set. This is a document with some emails and phone numbers, together with the correct answers, so that you can make sure your code extracts all and only the right information.
You will be graded on how well your regular expressions find emails and phone numbers in a different test set that we have. Because you don't know exactly what trickery goes on in this test set, you should be creative: look around the web and think of different ways of encoding emails and phone numbers, rather than relying only on the methods we've listed here.
You won't have to deal with: really difficult examples like images of any kind, or examples that require parsing names into first/last like:
"first name"@cs.stanford.edu
or difficult examples like my friend Jim Martin, whose email is listed only as:
To send me email, try the simplest address that makes sense.
You may use any programming language you like, but since we have to run your code on our test sets, if we have trouble running your code you will not receive full credit. For this reason, we suggest sticking to widespread and portable languages like Java, Python, or Perl.
We have provided starter code written in Java, available in /usr/class/linguist180/assignments/hw1/ , which has a directory structure like:
hw1/
  build          # A script to build your code.
  config         # A configuration script
  example/       # a minimal example set that shows how everything works.
  exampleGOLD    # the correct answers
  run            # to run your code
  SpamLord.java  # starter code.
  train/         # The training set
  trainGold/     # What we think are the right answers. (If you see something we don't, let us know!)
By default, if you execute:
./run example exampleGOLD
it will run your code on the minimal example, which extracts a name (not an email address) and prints it. This example just shows you how Java regexes work. The output is then compared with the gold answers using the Unix utility diff, which compares two files and prints the differences between them. No printing means the files were the same, which is the case here. Try again, replacing example with train (s/example/train/g), and you should see the phone numbers and email addresses of many of your professors! A "<" indicates that the "GOLD" output (our output) has something that you do not. A ">" would indicate the reverse. You can safely ignore the weird numbers like "37a3". Your goal, then, is to reduce the printed output to nothing. Note that if you just want to see what you're printing (and not the diff), run "java -cp classes/ SpamLord train".
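To make the "<" and ">" markers concrete, here is a simplified Java sketch of the comparison. It is an assumption-laden stand-in, not how diff or the run script actually works: real diff compares the files line by line in order, while this hypothetical DiffSketch.compare treats each file as an unordered set of lines. The intent is only to show which side each marker points to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DiffSketch {
    // Reports gold lines missing from your output with "<" and lines that
    // only your output contains with ">". An empty report means a match.
    static List<String> compare(List<String> gold, List<String> mine) {
        Set<String> missing = new TreeSet<>(gold);
        missing.removeAll(mine);   // in GOLD but not in your output
        Set<String> extra = new TreeSet<>(mine);
        extra.removeAll(gold);     // in your output but not in GOLD

        List<String> report = new ArrayList<>();
        for (String line : missing) {
            report.add("< " + line);
        }
        for (String line : extra) {
            report.add("> " + line);
        }
        return report;
    }

    public static void main(String[] args) {
        // Made-up lines for illustration; "jurafsky" is a hypothetical file name.
        List<String> gold = Arrays.asList("jurafsky\te\tjurafsky@stanford.edu",
                                          "jurafsky\tp\t650-723-0293");
        List<String> mine = Arrays.asList("jurafsky\te\tjurafsky@stanford.edu");
        for (String line : compare(gold, mine)) {
            System.out.println(line);
        }
    }
}
```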
You may change anything you want in this starter directory. However, we expect to be able to say
./build && ./run (some set) (some answers)
and for it to work! If you only edit SpamLord.java, you should be fine.
The development data lives in /usr/class/linguist180/hw1/train/ with answers in /usr/class/linguist180/hw1/trainGOLD.
The results should be printed to standard out (System.out.println) with the following format:
<filename> <TAB> <p or e> <TAB> <email or phone #>
The canonical forms we expect are: user@example.com and 650-555-1234.
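A minimal Java sketch of producing one such output line follows. The class name OutputSketch, the helper formatLine, and the file name "jurafsky" are hypothetical, chosen only for illustration. The starter SpamLord.java may organize its printing differently.

```java
class OutputSketch {
    // Builds one result line: file name, TAB, 'e' or 'p', TAB, canonical value.
    static String formatLine(String filename, char type, String value) {
        return filename + "\t" + type + "\t" + value;
    }

    public static void main(String[] args) {
        // "jurafsky" stands in for whatever file the match was found in.
        System.out.println(formatLine("jurafsky", 'e', "user@example.com"));
        System.out.println(formatLine("jurafsky", 'p', "650-555-1234"));
    }
}
```

Since the output is compared with diff, stray spaces or a different separator would show up as mismatches even when the extracted value is right.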
First, how. In the directory you plan to submit, execute
/usr/class/cs124/bin/submit
This is almost exactly the same script that CS107 uses, so most of you should be familiar with it. Just follow the directions and email us if you have any problems.
Second, what. Your code, your build, config, and run files, and then a README with a description of the kinds of things you can extract, and anything else you want to tell us.