We're going to build a supervised WSD system to disambiguate two senses of the made-up word vek, as described in Section 20.2 of Jurafsky & Martin. The annotated training corpus consists of just five sentences:
foo vek1 bar baz
foo foo vek1 zub
zip vek1 zip zub
foo vek2 zip zip
foo zub vek2 bar
Our feature representation will be exceedingly simple: we'll use a bag of words containing the one word before and one word after the target word. That is, for every word wi in our vocabulary, we'll have a boolean feature fi whose value is 1 if wi appears as one of these two context words, and 0 otherwise. Assume that the vocabulary includes every word observed in the training data.
Using this feature representation, we will train a standard Naive Bayes classifier on our small training corpus, and then use the classifier to make disambiguation predictions on test examples.
(If you're familiar with Naive Bayes, you may recognize that the feature representation we've specified corresponds to the Bernoulli formulation of Naive Bayes, rather than the multinomial formulation. The difference between the two is explained here — but you don't need to grok that explanation in order to answer this question.)
zip vek? zub
Which sense will our classifier predict to be more likely?
foo vek? baz
What difficulties might we encounter, and how can we address them?
We've put a 1GB fragment of Wikipedia here.
(Alternatively, if you're on AFS, you can find it at /afs/ir/class/cs224u/WWW/data/enwiki-20081008-pages-articles-first-1gb.xml.zip.)
Please download it to some convenient location and unzip it using the
unzip command.
By grepping through the unzipped XML file, you can find occurrences of (Wikipedia-style) hyperlinks like the ones that [Mihalcea 2007] used to create her sense-annotated corpus. For example, try the following grep command to find hyperlinks with anchor text "bar":
egrep -o "\[{2}[a-z\(\) \-\_]+\|bar\]{2}" enwiki-20081008-pages-articles-first-1gb.xml
Play around with this command and some variations to get familiar with the possibilities. (Pro tip: add "| sort | uniq -c" to the end of your command line to combine and count duplicates.)
Then, please find a noun having the following properties:
Show us how you'd map the Wikipedia senses you've discovered into WordNet senses. We're looking for mappings that look like this:
[[bar (law)|bar]] --> 'legal_profession.n.01' (the canonical name for the synset)or
[[bar (law)|bar]] --> 'the body of individuals qualified to practice law in a particular jurisdiction' (synset definition of the sort one gets from the Web interface)Be sure to list any Wikipedia senses that don't map to any WordNet sense, and conversely, any WordNet senses to which no Wikipedia senses are mapped.