STANFORD LINGUIST 180     -       Autumn 2007
Homework 3
Due: Thursday October 18 at the start of class

Read this entire page before starting!!

Build two bigram grammars, one each from two corpora, and generate 10 random sentences each from your two bigram grammars. The first corpus is the 1 million word Wall Street Journal corpus /afs/ir/class/linguist180/WWW/restricted/wsj.txt, and the second corpus is the complete works of Shakespeare, also about a million words, at /afs/ir/class/linguist180/WWW/shaks12.txt.

Hint 1: First, you should compute bigram word probabilities from each corpus. You will have to do some cleanup (for example remove comments and things, and put spaces around all the punctuation). You can either write a large single program to do all this, or you can do it quite quickly using various convenient UNIX utilities. For example, one useful thing you will need is to put every word on a separate line, which you can do with this command:

tr ' ' '\012'
And before that, you can use perl or java regular expressions to add a space around each punctuation mark, so they appear on their own line.

Once you have a cleaned-up file with one word on each line, you can make a second copy of the file, delete the first line in emacs, and paste the two files together using the unix command paste:

paste wordlistmissingfirstword wordlist > bigramlist
The resulting file will have one line for each bigram in the corpus. Now you can write a program to count how many times each bigram occurs, how many times each unigram occurs, and divide correctly to get the bigram probability.

Hint 2: To generate random sentences, chose a random probability between 0 and 1, and use that to pick a word with the appropriate probability. For example, you could store each word in your unigram list with its own probability but also with a running total of the probabilities of all earlier words in the list. Then you could just walk down the list until the word probabilities sum to your random number. Once you have a first word of your sentence, you switch to choosing bigrams with a similar method. Alternatively, you can create a special "first and last word", called SENTENCEBOUNDARY or something, and then you can do everything with bigrams instead of fooling around with unigrams. That will also make it easier to know when your sentence has ended.

What to turn in:

  1. Your programs.
  2. A description of what you did.
  3. The sentences generated by your programs: 10 for the Wall Street Journal, 10 for the Shakespeare corpus.

How to turn it in: