CS 124/LINGUIST 180     -     Winter 2009
Homework 7
Read this entire page before starting!!
Working in groups of 2, you are going to implement the Direct approach to machine translation (MT) in a small way. Here's what you do!
- Choose a partner, and then choose a language X. Pick one that at least one of you
knows well enough to work with.
If neither of you knows a language besides English this well, just choose the one you know best,
perhaps getting a friend who is a native (or just a good-enough)
speaker of language X to help you.
- Create a test document in Language X; one paragraph is probably fine, let's say 10 sentences.
Don't write these 10 sentences yourself; take 10 real sentences from some source
(a newspaper, a novel, a web site, etc.).
- Create a bilingual English-X dictionary covering each word in your test document.
It's very difficult to get a good downloadable dictionary,
so do this using a web-based or print English-X dictionary
(for example, here's a web-based dictionary for a couple of languages, but you can find others).
Just create a little dictionary file that lists each word in your test
document with a corresponding English translation. *Don't* use
your knowledge of the context of the sentence to pick the English translation.
Just use the first (or most frequent, or something like that) definition in the dictionary.
- Now write code (again, any programming language is ok) to implement the following "Direct MT" system:
- Use your bilingual dictionary to translate each word from Language X into English.
- Run a simple part-of-speech tagger on the English target words.
- Now write at least 10 simple part-of-speech-based reordering transformations
to reorder the words in your 10 new "English" sentences to look more like real English!
- See how close you can get to a real translation!!
- Now do an error analysis: what kinds of errors are still left,
and what kind of knowledge would you have needed to fix them?
- Now run your 10 sentences through Google Translate, and discuss any
errors that Google makes.
- Turn in, in the normal way:
- Your code
- Your input file and your dictionary.
- The output of running your code on your input.
- A description of each rule you wrote: what
it was supposed to do and what differences between Language X and
English it was supposed to fix (and make sure the output of your code
includes a good example of the rules being applied).
- Your error analysis of the remaining errors.
- The output of Google Translate and your error analysis of Google's errors.
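
To make the "Direct MT" steps above concrete, here is a minimal sketch of the pipeline in Python: word-by-word dictionary lookup, part-of-speech tagging, and a single reordering rule. Everything in it is an assumption made for illustration: Spanish stands in for Language X, the dictionary entries and tags are made up, the tagger is just a lookup table standing in for a real one, and the function names are hypothetical. It is not a reference solution, and a real submission needs at least 10 rules.

```python
# Toy Direct MT pipeline: dictionary lookup, POS tagging, one reordering rule.
# All data below (dictionary entries, tags, the Spanish example) is hypothetical.

# Bilingual dictionary: one English gloss per Language X word (first sense only).
DICTIONARY = {
    "la": "the",
    "casa": "house",
    "roja": "red",
    "es": "is",
    "grande": "big",
}

# Stand-in for a real part-of-speech tagger: a lookup table over English words.
TOY_TAGS = {
    "the": "DET",
    "house": "NOUN",
    "red": "ADJ",
    "is": "VERB",
    "big": "ADJ",
}


def translate_words(sentence):
    """Replace each source word with its dictionary gloss; keep unknown words."""
    return [DICTIONARY.get(word, word) for word in sentence.lower().split()]


def tag(words):
    """Pair each English word with a tag; unknown words get the tag 'UNK'."""
    return [(word, TOY_TAGS.get(word, "UNK")) for word in words]


def swap_noun_adjective(tagged):
    """Reordering rule: NOUN ADJ -> ADJ NOUN (e.g. 'house red' -> 'red house')."""
    result = list(tagged)
    i = 0
    while i < len(result) - 1:
        if result[i][1] == "NOUN" and result[i + 1][1] == "ADJ":
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return result


RULES = [swap_noun_adjective]  # a full solution would list at least 10 rules


def direct_translate(sentence):
    """Run the whole pipeline: translate, tag, apply each rule in order."""
    tagged = tag(translate_words(sentence))
    for rule in RULES:
        tagged = rule(tagged)
    return " ".join(word for word, _ in tagged)


print(direct_translate("la casa roja es grande"))
```

Keeping each rule as a separate function that maps a tagged sentence to a tagged sentence makes it easy to describe the rules one by one in the write-up and to see which rule produced which change.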