![]() |
CS 124 / LINGUIST 180   -     Winter 2009
Homework 4: Person Name ('Named Entity') Classification |
| Due: Feb 5 before the start of class |
For this assignment, you will be building a maximum entropy Markov model (MEMM) for identifying person names in newswire text. We have provided all of the machinery for training and testing your MEMM, but we have left the feature set woefully inadequate. Your job is to modify the code for generating features so that it produces a much more sensible, complete, a higher-performing set of features.
This assignment has significantly more starter code associated with it than your other previous (or future) assignments. Because of this, we highly recommend that you use our starter code, and program in Java. However, if you really want to reimplement an entire MEMM in another language, you are allowed to do so. In that case, please email the staff (cs124-win0809-staff@lists.stanford.edu) and we can give you some pointers on getting started.
The file you will be modifying is FeatureFactory.java. The class
currently looks something like this:
public class FeatureFactory {
public List computeFeatures(List<String> words, String previousLabel, int position) {
List features = new ArrayList();
// ADD YOUR FEATURES HERE
return features;
}
}
You will create the features for the word at the given position, with the
given previous label. You may condition on any word in the sequence (and its
relative position), not just
the current word, because they are all observed. You may not condition on any
labels other than the previous one.
Each function you build will be a binary function of some kind of feature.
The features are stored in a list because we are using a sparse
representation. Features which have a value of true (or
1.0) will be present in the list and everything not present is
assumed to be false (or 0.0).
The argument to features.add() is the name of the feature.
you need to give a name for your feature, and then the system will use
this unique name in training, to set the weight for that feature, and then in testing
to use the feature and its weight to make a decision.
Consider the following sample feature:
features.add("word="+currentWord);
and imagine that the current word is "Oakland".
This means "I am a binary function that is true if the current word is 'Oakland'". For every such feature you write, the code will take this function and actually learn two separate weights for this feature: one for "label=PERSON" and one for "label=O".
So just by specifying this feature, the system will learn two functions, call them f1 and f2, where f1, which means "I am a binary function that is true if the current word is 'Oakland' and the output label is 'Person'" will get a low weight, and f2, which means "I am a binary function that is true if the current word is 'Oakland' and the output label is 'O''" will get a high weight. You can think of the name of both f1 and f2 as the string "word=Oakland" and not worry about the fact that we are learning two separate weights. Then, when the system is running on a test sentence that has the word Oakland, it will call your code to create the feature "word=Oakland", and it will automatically create two features, "word=Oakland,label=PERSON" and "word=Oakland,label=O", and it will use those weights in the classifier.
Similarly, this sample feature:
features.add("word=Jenny, prevLabel=O")
will create two features, each with its own weight, one for
"word=Jenny, prevLabel=O, label=PERSON"
and one for
"word=Jenny, prevLabel=O, label=O".
In the starter code we have provided you with three simple starter features, but you should be able to improve substantially on them. We recommend experimenting with orthographic information, gazeteers, and the surrounding words, and we also encourage you to think beyond these suggestions. The better your test set performance, the better your grade!
We have provided you with a training set, called train and a
developemnt test set called dev. We will be running your
programs on an unseen test set, so you should try to make your features as
general as possible. Your goal should be to increase F1, which is the
harmonic mean of the precision and the recall. If you run the program as-is,
you should get the following score:
precision = 0.8015075376884422 recall = 0.5223799126637555 F1 = 0.6325181758096496
When you run the program, you will see it print out a lot of information as it
does the optimization, and you can pretty much ignore that. Afterwards, it
will print out your score as above. You can give it an additional flag,
-print and have it print the test set along with the real answers
and your guesses. The first column is the word, the second column is the true
answer, and the third column is your program's guess. This should help you do
error analysis to see what kinds of things you are getting right, and what
kind you are getting wrong. This will help you properly target your features.
All of the starter code and the train and dev files are located in
/usr/class/cs124/assignments/hw4/
As with previous assignments, we have provided starter build
and run files, which you will probably not need to change, and
which we expect to be able to run in your submission. To run the program, we
should be able to type:
./run [train file] [test file]and we should also be able to run it with the optional extra flag:
./run [train file] [test file] -printand have it print your output in three columns, as discussed above.
First, how. In the directory you plan to submit, execute
/usr/class/cs124/bin/submit. This is almost exactly the same script as CS107 uses, so most of you should be familiar with it. Just follow the direction and email us if you have any problems.
Second, what. We need the usual: your code, your build, config, and run files. We also need a README, which includes a table of accuracies for each fold and whatever you tried. If you attempted any extra credit, let us know what you tried and what worked.