CS 124 / LINGUIST 180   -     Winter 2009
Homework 3: Movie Reviews Sentiment Classification
Due: January 27 before the start of class
Your goal for this homework is to do sentiment analysis: classifying movie reviews as positive or negative. Recall from lecture that sentiment analysis can be used to extract people's opinions about all sorts of things (congressional debates, presidential speeches, reviews, blogs) and at many levels of granularity (the sentence, the paragraph, the entire document). Our goal in this task is to look at an entire movie review and classify it as positive or negative.
You will be using Naive Bayes with Laplace smoothing, following the pseudocode in Manning, Raghavan, and Schütze on page 241. Your classifier will use words as features, add the log-probability scores for each token, and make a binary decision between positive and negative.
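To make the description above concrete, here is a minimal Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is an illustration under stated assumptions, not the starter code and not a copy of the textbook pseudocode: the function names `train_naive_bayes` and `classify` are made up for this example, documents are assumed to be already tokenized into lists of words, and words never seen in training are simply skipped at classification time, which is one common choice.

```python
import math
from collections import Counter

def train_naive_bayes(docs):
    """docs: list of (tokens, label) pairs. Returns (log_prior, log_likelihood, vocab)."""
    vocab = {tok for tokens, _ in docs for tok in tokens}
    labels = sorted({label for _, label in docs})
    log_prior = {}
    log_likelihood = {}
    for label in labels:
        class_docs = [tokens for tokens, doc_label in docs if doc_label == label]
        # Prior: fraction of training documents that carry this label.
        log_prior[label] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for tokens in class_docs for tok in tokens)
        # Laplace smoothing: add 1 to every word count, so the denominator
        # grows by the vocabulary size.
        denom = sum(counts.values()) + len(vocab)
        log_likelihood[label] = {
            tok: math.log((counts[tok] + 1) / denom) for tok in vocab
        }
    return log_prior, log_likelihood, vocab

def classify(tokens, log_prior, log_likelihood, vocab):
    """Return the label with the highest summed log-probability score."""
    scores = {}
    for label in log_prior:
        score = log_prior[label]
        for tok in tokens:
            if tok in vocab:  # assumption: unseen words are ignored
                score += log_likelihood[label][tok]
        scores[label] = score
    return max(scores, key=scores.get)

# Tiny made-up training set, just to show the call pattern.
toy_train = [("great fun great".split(), "pos"), ("boring dull".split(), "neg")]
toy_model = train_naive_bayes(toy_train)
print(classify("great".split(), *toy_model))
```

Summing log probabilities instead of multiplying raw probabilities avoids floating-point underflow on long reviews, which is why the score is accumulated with `+=`.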
Instead of using a test set as you did in the previous homeworks, you will be doing 10-fold cross-validation and reporting your own accuracy rather than having us compute it for you. Recall that in cross-validation, you divide the data into 10 sections, and then you run your classifier 10 times, each time choosing a different section as the test set and the other 9 as the training set. Your final accuracy is the average of the 10 runs. The data comes with the cross-validation sections; they are defined in the file:
/usr/class/linguist180/assignments/hw3/poldata.README.2.0
Thus, when using a movie review for training, you use the fact that it is positive or negative (the hand-labeled "true class") to help compute the correct statistics. But when the same review is used for testing, you only use this label to compute your accuracy.
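The cross-validation loop described above can be sketched in Python as follows. This is only an outline under assumptions: each review is assumed to arrive as a `(tokens, label, fold)` triple with `fold` in 0..9 already assigned according to the README, every fold is assumed to be non-empty, and `train_majority` / `predict_majority` are hypothetical stand-ins (a majority-class baseline) for whatever classifier you actually write.

```python
from collections import Counter

def cross_validate(docs, train_fn, predict_fn, num_folds=10):
    """docs: list of (tokens, label, fold) triples.
    Returns (per-fold accuracies, mean accuracy)."""
    accuracies = []
    for test_fold in range(num_folds):
        # The held-out fold is used only for testing; the rest is training data.
        train = [(toks, label) for toks, label, fold in docs if fold != test_fold]
        test = [(toks, label) for toks, label, fold in docs if fold == test_fold]
        model = train_fn(train)
        # The true label of a test review is used only to score the prediction.
        correct = sum(1 for toks, label in test if predict_fn(model, toks) == label)
        accuracies.append(correct / len(test))  # assumes each fold is non-empty
    return accuracies, sum(accuracies) / num_folds

# Hypothetical placeholder classifier: always predicts the most common training label.
def train_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]

def predict_majority(model, tokens):
    return model

# Made-up data: 20 one-word "reviews" spread evenly over 10 folds.
toy_docs = [(["word"], "pos" if i % 3 else "neg", i % 10) for i in range(20)]
fold_accuracies, mean_accuracy = cross_validate(toy_docs, train_majority, predict_majority)
for fold, acc in enumerate(fold_accuracies):
    print(f"fold {fold}: {acc:.3f}")
print(f"mean: {mean_accuracy:.3f}")
```

Passing the training and prediction functions as arguments keeps the fold logic separate from the classifier, so the same loop can be reused for the basic algorithm and for any variants you try.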
This algorithm is a simplified version of the one in this paper that you read for class:
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79--86.
In case you want to see other papers related to this data, you can look at http://www.cs.cornell.edu/People/pabo/movie-review-data/
For extra credit, you can try various augmentations to this basic algorithm and see if you can improve your accuracy. For example, you could throw away stop words (very frequent words like "the", "of", "it", etc.). Or you could use only adjectives, or try to eliminate proper names, since those are unlikely to generalize well from one movie to another. To help you with this extra credit, we've put a small (gzip'ed) dictionary of part-of-speech tags here:
/usr/class/linguist180/assignments/hw3/mobypos.txt.gz
with a README file in
/usr/class/linguist180/assignments/hw3/aaREADME.txt
If you want to do the extra credit, you have to tell us your accuracy with the basic algorithm, and then your (improved) accuracy using stop-word removal or part-of-speech tags.
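As a rough illustration of the stop-word and adjective-only ideas, here is a small Python sketch of two token filters that could run before training and classification. Everything named here is an assumption made for the example: `STOP_WORDS` is a deliberately tiny, made-up list, and `pos_lookup` is a hypothetical dictionary mapping a lowercase word to a set of tag strings. The actual format and tag codes of the provided part-of-speech dictionary are described in its README and are not assumed here.

```python
# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"the", "of", "it", "a", "an", "and", "is", "to", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

def keep_adjectives(tokens, pos_lookup, adjective_tag="ADJ"):
    """Keep only tokens that pos_lookup marks with adjective_tag.
    Words missing from pos_lookup are dropped. The tag name "ADJ" is a
    placeholder, not the code used by the provided dictionary."""
    return [tok for tok in tokens
            if adjective_tag in pos_lookup.get(tok.lower(), ())]

toy_lookup = {"great": {"ADJ"}, "movie": {"NOUN"}}  # made-up entries
print(remove_stop_words("The plot of it was great".split()))
print(keep_adjectives(["a", "great", "movie"], toy_lookup))
```

Applying the same filter to both the training and the test reviews keeps the two token streams comparable, so the basic and filtered accuracies can be reported side by side.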
You may use any programming language you like, but since we have to run your code on our test sets, if we have trouble running your code you will not receive full credit. For this reason, we suggest sticking to widespread and portable languages like Java, Python, or Perl.
We have provided starter code written in Java, available in /usr/class/linguist180/assignments/hw3/, which has a directory structure like:
hw3/
    build               # A script to build your code
    configure           # A configuration script
    run                 # builds and runs your code
    NaiveBayes.java     # starter code
    txt_sentoken/       # The training directory
    poldata.README.2.0  # README for how to do CV, and about the data.
    mobypos.txt.gz      # Part of Speech tags for extra credit
    aaREADME.txt        # description of the above
By default, if you execute:
./run train/
you'll get a 0% accuracy. This time, you'll need to write the code to perform CV and evaluate your performance in addition to building your classifier.
You may change anything you want in this starter directory.
However, we expect to be able to say ./run [train]
and for it to work! If you only edit NaiveBayes.java, you should be fine.
The training data lives in /usr/class/linguist180/assignments/hw3/txt_sentoken/
It's up to you; we're more interested in your report in your README this time.
First, how. In the directory you plan to submit, execute
/usr/class/cs124/bin/submit. This is almost exactly the same script that CS107 uses, so most of you should be familiar with it. Just follow the directions and email us if you have any problems.
Second, what. We need the usual: your code and your build, configure, and run files. We also need a README that includes a table of accuracies for each fold and a description of whatever you tried. If you attempted any extra credit, let us know what you tried and what worked.