## CS224U Homework 2

1. We're going to build a supervised WSD system to disambiguate two senses of the made-up word vek, as described in Section 20.2 of Jurafsky & Martin. The annotated training corpus consists of just five sentences:

```
foo vek1 bar baz
foo foo vek1 zub
zip vek1 zip zub
foo vek2 zip zip
foo zub vek2 bar
```

Our feature representation will be exceedingly simple: we'll use a bag of words containing the one word before and one word after the target word. That is, for every word wᵢ in our vocabulary, we'll have a boolean feature fᵢ whose value is 1 if wᵢ appears as one of these two context words, and 0 otherwise. Assume that the vocabulary includes every word observed in the training data.

Using this feature representation, we will train a standard Naive Bayes classifier on our small training corpus, and then use the classifier to make disambiguation predictions on test examples.

(If you're familiar with Naive Bayes, you may recognize that the feature representation we've specified corresponds to the Bernoulli formulation of Naive Bayes, rather than the multinomial formulation. The difference between the two is explained here — but you don't need to grok that explanation in order to answer this question.)
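Before tackling the questions, it may help to see the feature extraction spelled out in code. The following is a minimal sketch (the function and variable names are illustrative, not from the text) that builds the boolean context features for a tokenized sentence:

```python
# A minimal sketch of the feature representation described above:
# one boolean feature per vocabulary word, set to 1 iff that word
# appears immediately before or immediately after the target word.
VOCAB = ["foo", "bar", "baz", "zub", "zip"]

def extract_features(tokens, target_index):
    """Return the bag-of-words context features for the target word."""
    context = set()
    if target_index > 0:
        context.add(tokens[target_index - 1])
    if target_index + 1 < len(tokens):
        context.add(tokens[target_index + 1])
    return {w: int(w in context) for w in VOCAB}

# The first training sentence, "foo vek1 bar baz":
print(extract_features(["foo", "vek1", "bar", "baz"], 1))
# {'foo': 1, 'bar': 1, 'baz': 0, 'zub': 0, 'zip': 0}
```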

1. Write down the equation that our Naive Bayes classifier will use to make predictions.

2. What is the prior probability of sense 1?

3. What is the conditional probability of feature `foo`, given sense 1?

4. Suppose we apply our trained classifier to the following test example.
`zip vek? zub`
Which sense will our classifier predict to be more likely?

5. What specific probabilities will our classifier assign to each of the two senses?

6. [Extra credit] Now suppose we apply our trained classifier to this test example.
`foo vek? baz`
What difficulties might we encounter, and how can we address them?

2. We've put a 1GB fragment of Wikipedia here. (Alternatively, if you're on AFS, you can find it at `/afs/ir/class/cs224u/WWW/data/enwiki-20081008-pages-articles-first-1gb.xml.zip`.) Please download it to some convenient location and unzip it using the `unzip` command.

By grepping through the unzipped XML file, you can find occurrences of (Wikipedia-style) hyperlinks like the ones that [Mihalcea 2007] used to create her sense-annotated corpus. For example, try the following grep command to find hyperlinks with anchor text "bar":

`egrep -o "\[{2}[a-z() _-]+\|bar\]{2}" enwiki-20081008-pages-articles-first-1gb.xml`

Play around with this command and some variations to get familiar with the possibilities. (Pro tip: add `| sort | uniq -c` to the end of your command line to combine and count duplicates.)
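If you'd rather do the counting in code, the same extraction can be sketched in Python with `re` and `collections.Counter` (the regular expression mirrors the egrep pattern above; the sample string and function name are our own):

```python
import re
from collections import Counter

# Mirrors the egrep pattern: [[target|bar]] links whose target consists
# of lowercase letters, parentheses, spaces, underscores, or hyphens.
LINK_RE = re.compile(r"\[{2}([a-z() _-]+)\|bar\]{2}")

def count_bar_senses(text):
    """Count link targets whose anchor text is 'bar' (like sort | uniq -c)."""
    return Counter(LINK_RE.findall(text))

# A made-up sample; in practice, pass the contents of the XML file.
sample = "[[bar (law)|bar]] ... [[bar (music)|bar]] ... [[bar (law)|bar]]"
print(count_bar_senses(sample))
# Counter({'bar (law)': 2, 'bar (music)': 1})
```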

Then, please find a noun having the following properties:

• It has at least five senses in WordNet.
• It has at least five senses in (our fragment of) Wikipedia, using Mihalcea's strategy for extracting sense annotations from Wikipedia.
• It does not appear in Table 2 of [Mihalcea 2007].

Show us how you'd map the Wikipedia senses you've discovered into WordNet senses. We're looking for mappings that look like this:

`[[bar (law)|bar]] --> 'legal_profession.n.01'` (the canonical name for the synset)

or

`[[bar (law)|bar]] --> 'the body of individuals qualified to practice law in a particular jurisdiction'` (a synset definition of the sort one gets from the Web interface)
Be sure to list any Wikipedia senses that don't map to any WordNet sense, and conversely, any WordNet senses to which no Wikipedia senses are mapped.