CS 124 / LINGUIST 180   -     Winter 2009
Homework 2: Language Identification
Due: January 20, before the start of class
In order to extract any kind of information from text, the first thing we have to know is what language the text is in. In this assignment you are going to use character N-gram grammars to solve the problem of language identification.
Given a document, your goal is to say what language it is written in. We will give you a set of training documents (one in each of 10 languages) and a set of development test documents. You will be graded on an unseen set of 10 test documents. To make the problem tractable, we guarantee that each test document will be in one of the 10 languages you have seen in the training set.
The data you will use is 10 translations of the Universal Declaration of Human Rights, a document that has been translated into many languages. We have set up the data for you locally, so you do not need to download it; see below.
The algorithm you will use requires that you build 10 separate character bigram grammars, one for each language, on the training data. In lecture we mostly talked about word bigrams; a character bigram is computed over characters instead of words. You should use the simple Bayesian Unigram Prior smoothing method that we introduced in class.
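As a rough illustration of what "computed on characters" means, the sketch below counts adjacent character pairs in a string. The class and method names are hypothetical and not part of the starter code, and the sketch assumes Java 8 or later and treats each Java `char` as one character, which should hold for text without supplementary Unicode characters.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the provided starter code.
final class CharBigramCounter {
    // Counts every adjacent pair of characters in the text.
    // Each key is a two-character string such as "th".
    static Map<String, Integer> countBigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            counts.merge(text.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "banana" contains the pairs ba, an, na, an, na.
        System.out.println(countBigrams("banana"));
    }
}
```

You would build one such table per language from that language's training file.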
For each test document, for each of your 10 bigram grammars, you compute the log-likelihood of the test document given the bigram grammar (use the log-likelihood instead of the likelihood since it's less likely to underflow). Then you choose as your answer for that document the language that gave the highest log-likelihood.
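To see why the log-likelihood is the safer quantity, here is a small sketch comparing a raw product of probabilities with the corresponding sum of logs. The probability 0.05 and the length 1000 are made-up values for illustration only, and the class name is hypothetical.

```java
// Hypothetical demo class, not part of the provided starter code.
final class UnderflowDemo {
    // Multiplies n copies of p directly, as a raw likelihood would be computed.
    static double product(double p, int n) {
        double result = 1.0;
        for (int i = 0; i < n; i++) {
            result *= p;
        }
        return result;
    }

    // Adds n copies of log(p), as a log-likelihood would be computed.
    static double logSum(double p, int n) {
        double result = 0.0;
        for (int i = 0; i < n; i++) {
            result += Math.log(p);
        }
        return result;
    }

    public static void main(String[] args) {
        // 0.05 and 1000 stand in for a per-character probability
        // and a document length; both are invented for this demo.
        System.out.println(product(0.05, 1000)); // prints 0.0: the product has underflowed
        System.out.println(logSum(0.05, 1000));  // a finite negative value
    }
}
```

Once the product reaches 0.0 for every language, the languages can no longer be compared, whereas the log sums stay finite and distinct.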
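Here's the formal description of the equations you should be computing.
First, you want to pick the language, out of the 10 languages, which assigns the highest log probability to the document:

$$\hat{L} = \operatorname*{argmax}_{L} \; \log P(c_1 \ldots c_n \mid L)$$

To compute the log probability for each language, you make the Markov (N-gram) assumption, and use a bigram grammar that has been trained on that language:

$$\log P(c_1 \ldots c_n \mid L) \approx \sum_{i=1}^{n} \log P_L(c_i \mid c_{i-1})$$

(That was the equation in log-space; in non-log space it would be:

$$P(c_1 \ldots c_n \mid L) \approx \prod_{i=1}^{n} P_L(c_i \mid c_{i-1})$$

Here $c_1 \ldots c_n$ are the characters of the document, $c_0$ is the marker that precedes the first character, and $P_L$ is the bigram grammar trained on language $L$.)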
Don't forget to add some sort of special START and END characters at the beginning and end of the file.
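The scoring and selection steps described above, including the boundary markers, might be sketched as follows. Everything named here is an assumption made for illustration: the `LanguageScorer` class, the `BigramModel` interface, and the choice of the control characters U+0002 and U+0003 as START and END (picked on the assumption that they never occur in the documents). None of it comes from the starter code.

```java
import java.util.Map;

// Hypothetical helper, not part of the provided starter code.
final class LanguageScorer {
    // Assumed boundary markers: control characters U+0002 and U+0003,
    // chosen on the assumption that they never appear in the documents.
    static final char START = '\u0002';
    static final char END = '\u0003';

    // Hypothetical interface: P(cur | prev) under one language's bigram grammar.
    // It must return a value greater than 0, which smoothing should guarantee.
    interface BigramModel {
        double prob(char prev, char cur);
    }

    // Sums log P(c_i | c_{i-1}) over the document, including START and END.
    static double logLikelihood(BigramModel model, String text) {
        String padded = START + text + END;
        double sum = 0.0;
        for (int i = 1; i < padded.length(); i++) {
            sum += Math.log(model.prob(padded.charAt(i - 1), padded.charAt(i)));
        }
        return sum;
    }

    // Returns the language whose model gives the highest log-likelihood.
    static String bestLanguage(Map<String, BigramModel> models, String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, BigramModel> entry : models.entrySet()) {
            double score = logLikelihood(entry.getValue(), text);
            if (best == null || score > bestScore) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        return best;
    }
}
```

The same padding has to be applied to the training text, so that bigrams involving START and END are counted when each grammar is built.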
To train your bigram grammars, use Bayesian Unigram Prior smoothing:
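One common way to write unigram prior smoothing is shown here as an assumption; the notation and the constant used in class may differ:

$$P(c_i \mid c_{i-1}) = \frac{C(c_{i-1}\,c_i) + m\,P(c_i)}{C(c_{i-1}) + m}$$

where $C(\cdot)$ is a count from the training data, $P(c_i)$ is the unigram probability of the character, and $m > 0$ controls how strongly the estimate is pulled toward the unigram prior. A minimal Java sketch of this form follows. The class name `UnigramPriorModel`, the `vocabSize` parameter, and the add-one smoothing of the unigram estimate (used so that a character never seen in training still gets a nonzero probability) are all choices made for this sketch, not requirements of the assignment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model class, not part of the provided starter code.
final class UnigramPriorModel {
    private final Map<Character, Integer> unigram = new HashMap<>(); // C(c)
    private final Map<String, Integer> bigram = new HashMap<>();     // C(prev cur)
    private final Map<Character, Integer> context = new HashMap<>(); // C(prev) as a left context
    private int total = 0;        // number of characters seen in training
    private final double m;       // assumed prior strength, must be > 0
    private final int vocabSize;  // assumed alphabet size for add-one unigram smoothing

    UnigramPriorModel(String training, double m, int vocabSize) {
        this.m = m;
        this.vocabSize = vocabSize;
        for (int i = 0; i < training.length(); i++) {
            char c = training.charAt(i);
            unigram.merge(c, 1, Integer::sum);
            total++;
            if (i > 0) {
                char prev = training.charAt(i - 1);
                bigram.merge("" + prev + c, 1, Integer::sum);
                context.merge(prev, 1, Integer::sum);
            }
        }
    }

    // Add-one smoothed unigram probability (an assumption of this sketch).
    double unigramProb(char c) {
        return (unigram.getOrDefault(c, 0) + 1.0) / (total + vocabSize);
    }

    // Bigram probability smoothed toward the unigram prior.
    double prob(char prev, char cur) {
        double pair = bigram.getOrDefault("" + prev + cur, 0);
        double ctx = context.getOrDefault(prev, 0);
        return (pair + m * unigramProb(cur)) / (ctx + m);
    }
}
```

With this form, a context that never appeared in training falls back entirely to the unigram estimate, and a frequent context is dominated by its observed bigram counts.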
You may use any programming language you like, but we have to run your code on our test sets, so if we have trouble running it you will not receive full credit. For this reason, we suggest sticking to widespread and portable languages like Java, Python, or Perl.
We have provided starter code written in Java, available in /usr/class/linguist180/assignments/hw2/ , which has a directory structure like:
hw2/
    build              # A script to build your code
    config             # A configuration script
    run                # Builds and runs your code
    LanguageID.java    # Starter code
    train/             # The training directory
    dev/               # The development test directory
By default, if you execute:
./run train/ dev/
it will randomly guess a language for each file.
You may change anything you want in this starter directory.
However, we expect to be able to say ./run [train] [test]
and for it to work! If you only edit LanguageID.java, you should be fine.
The training data lives in /usr/class/linguist180/hw2/train/ and the development data lives in /usr/class/linguist180/hw2/dev/.
The results should be printed to standard output (System.out.println), one line per test file, in the following format:
<filename> <TAB> <language>
First, how. In the directory you plan to submit, execute /usr/class/cs124/bin/submit. This is almost exactly the same script that CS107 uses, so most of you should be familiar with it. Just follow the directions and email us if you have any problems.
Second, what. Your code, your build, config, and run files, and then a README with a description of the kinds of things you can extract, and anything else you want to tell us.