Linguist 278: Programming for Linguists Stanford Linguistics, Fall 2009 Christopher Potts Assignment 8 - Code for the long term Distributed 2009-11-17 Due 2009-12-07 ====================================================================== Overview: The goal for this assignment is to get you to write some code that you will continue to use and improve indefinitely. You're at a critical period in your programming careers: * keep coding, and you'll soon find that it is productive, satisfying, and rewarding; * stop coding now, and your Python smarts will fade quickly away. So I aim to at least push you in the direction of continued coding. I can't do it alone, though. The important thing is that you find a project that will indeed be useful to you long-term, something that you will alias in your .bash_profile and use regularly. Some examples from my own library: -- A sentegrep program like the one described in problem 2 above. -- A deluxe version of the reference resolving program you wrote for assignment 4. -- A script that takes my the latex source of my CV and converts it to HTML, for use on the Web. -- A script that updates the bib references for my own work and creates a Web version of it (http://www.stanford.edu/~cgpotts/reference-info/potts-bibtex.html) -- Classes like those described in option 3 of assignment 7. (These are really important to my research right now.) ====================================================================== optparse If you write a command-line program, I encourage you to check out the optparse library: http://docs.python.org/library/optparse.html It makes it fairly easy to write interactive command-line arguments that behave well if someone gives them the wrong kind of input. ====================================================================== 1. Closures For people who work in theories with heavy-duty combinatorics --- probably, this is the semanticists and phonologists --- the challenge problem called 'Set-theoretic closure operations' is a good choice, and the algorithms involved are rewarding to study in their own right. ====================================================================== 2. Sentence egrep Write a command-line program that works like this: python sentegrep.py pattern filename1 filename2 ... where the program reads in each file, heuristically parses it into sentences, and then returns sentences (not lines!) that match pattern. You'll also want to be able to do python sentegrep.py pattern dirname/* and allow recursive file-search with an -r flag, but you might save this for the months ahead. ====================================================================== 3. A novel corpus, with tools for exploring it Gather a new corpus (perhaps continuing what you did for the previous assignment), and then develop useful classes for exploring the corpus. Think first about intuitive design and access, and secondarily about efficiency. Expect the design to evolve as your research does. ====================================================================== 4. Naive Bayes classifier A Naive Bayes document classifier is built upon the following simple statistical model: P(cls|doc) ~ P(cls) * P(word_1|cls) * (P(word_2|cls) * ... * (P(word_n|cls) where word_1 .. word_n are the words in doc. This breaks down as follows: (i) P(cls) = the number of documents of class cls / the total number of documents (ii) P(w|cls) = the probability that word w appears given that we're in a doc in cls To make this concrete, let's suppose that our task is to distinguish spam email from ham email (legit email) in this dataset: http://www.cs.umass.edu/~mccallum/courses/cl2006/data/spamham.tar.gz There are thus two classes: 'spam' and 'ham'. Calculating (i) is easy: count the documents in each subdirectory and then create a probability distribution over classes/subdirectories. To calculate (ii), build a dictionary mapping words to classes to counts. Thus, you'll have mappings like this: the:{'spam':count1,'ham':count2} And thus P(the|spam) = (count1/(count1+count2)) / P(cls) Training involves building the distributions P(cls) and P(w|cls) from data. Then, when a new document doc comes in, estimate P(cls|doc) for each class cls and assign doc the cls with the highest value. With the above very simple model, my classifier typically achieves 88% accuracy when I train on a randomly chosen 80% of the spamham data and test on the remaining 20%. ====================================================================== 5. Your idea here If you'd like to do something else, please send me a short description of it by email. ======================================================================