STANFORD CS 224S/LINGUIST 236   -     Winter 2009
Homework 5: A Digit Recognizer
Due: Tuesday February 10 before the start of class.

The purpose of this homework is to familiarize you with the Hidden Markov Model Toolkit (HTK), a portable toolkit for building and manipulating Hidden Markov Models.

In this assignment, you will build a simple digit recognizer with monophone models and report digit recognition accuracy.

HTK:

You can download HTK from the following webpage and build your own binaries:
http://htk.eng.cam.ac.uk/download.shtml
Alternatively, we provide all the HTK 3.4 binary files you need on AFS. You can either copy them to your working directory or link to them.
/afs/ir/class/cs224s/htk-3.4
(These binaries have been tested on the myth, bramble, hedge, vine, and pod clusters. If you have issues getting HTK to run on one of these machines, please let us know.)

Data:

We provide part of TIDIGITS, a continuous digit sequence corpus, for both training and testing. The data is located in
/afs/ir/class/cs224s/tidigits
The corpus includes both training and testing data in the train and test directories. In the training set, there are 13 males and 14 females, totaling 27 speakers. In the test set, there are 3 males and 3 females, totaling 6 speakers. (You may copy the tidigits directory to your machine, but it is several hundred MB, so you might not want to.)

The digit sequences were made up of the digits: "zero", "oh", "one", "two", "three", "four", "five", "six", "seven", "eight", and "nine". The digit sequences spoken by each speaker can be broken down as follows:

22 isolated digits (2 productions of each of 11 digits)
11 2-digit sequences
11 3-digit sequences
11 4-digit sequences
11 5-digit sequences
11 7-digit sequences
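
As a quick sanity check on the breakdown above, the following sketch tallies how many utterances and digit tokens each speaker contributes. It assumes every speaker in the provided subset recorded the full set listed above; the function names are illustrative only.

```python
# Per-speaker counts, taken from the breakdown above.
ISOLATED = 2 * 11                   # 2 productions of each of 11 digits
SEQUENCE_LENGTHS = [2, 3, 4, 5, 7]  # digits per sequence
SEQUENCES_PER_LENGTH = 11


def utterances_per_speaker():
    """Number of recorded utterances per speaker."""
    return ISOLATED + SEQUENCES_PER_LENGTH * len(SEQUENCE_LENGTHS)


def digits_per_speaker():
    """Number of digit tokens per speaker (one per isolated utterance)."""
    return ISOLATED + SEQUENCES_PER_LENGTH * sum(SEQUENCE_LENGTHS)


if __name__ == "__main__":
    per_speaker = utterances_per_speaker()
    print("utterances per speaker:", per_speaker)
    print("digit tokens per speaker:", digits_per_speaker())
    # Speaker counts from the corpus description: 27 training, 6 test.
    print("training utterances:", 27 * per_speaker)
    print("test utterances:", 6 * per_speaker)
```

If the number of files you find under the train or test directory differs from these totals, check your file lists before training.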

Scripts:

We set up a tar file with all the scripts and configuration files you'll need for training a digit recognizer. Your first job is to go through those scripts and make sure everything works to get your initial training results.

Follow each step in the scripts directory in order, changing the paths to fit your working directory. The paths also refer to the tidigits directory on the class AFS space, so you'll probably want to run the scripts on a machine with AFS access.

You can download the tar file here.

The scripts directory includes all the scripts for training the digit recognizer, from extracting MFCC features to evaluating the Word Error Rate. A brief description of each follows.

(5.1) 01_HCopy.sh: Generate MFCC from wave files.

(5.2) 02_HCompV.sh: Train an initial HMM with three states and a single Gaussian per state from the proto file.

(5.3) 03_hmmdef.sh: Generate the initial HMM for each phone from step (5.2).

(5.4) 04_HLEd.sh: Generate the Master Label File (MLF).

(5.5) 05_HERest.sh: Use the Baum-Welch algorithm to train the HMMs.

(5.6) 06_mix02.sh: Split into 2 Gaussians and do Baum-Welch training.

(5.7) 07_mix04.sh: Split into 4 Gaussians and do Baum-Welch training.

(5.10) 10_HParse.sh: Generate a digit network for decoding.

(5.11) 11_HVite.sh: Do Viterbi decoding.

(5.12) 12_HResult.sh: Do evaluation.
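
Because each step depends on the output of the previous one, it can help to run the scripts in order and stop at the first failure. The sketch below is one way to do that. It assumes the scripts can be invoked with sh from inside the scripts directory, which you should verify against the tar file; run_pipeline and its parameters are illustrative names, not part of the provided scripts.

```python
import subprocess
from pathlib import Path

# Script names exactly as listed above, in execution order.
STEPS = [
    "01_HCopy.sh", "02_HCompV.sh", "03_hmmdef.sh", "04_HLEd.sh",
    "05_HERest.sh", "06_mix02.sh", "07_mix04.sh",
    "10_HParse.sh", "11_HVite.sh", "12_HResult.sh",
]


def run_pipeline(scripts_dir, steps=STEPS, runner=subprocess.run):
    """Run each step in order and raise as soon as one fails."""
    for name in steps:
        script = Path(scripts_dir) / name
        # Assumption: the scripts are plain sh scripts run from scripts_dir.
        result = runner(["sh", str(script)], cwd=str(scripts_dir))
        if result.returncode != 0:
            raise RuntimeError(
                f"{name} failed with exit code {result.returncode}"
            )
```

For example, run_pipeline("scripts") would run all ten steps. Passing a shorter list as steps lets you resume from a later step after fixing a path.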

More Gaussians:

In acoustic modeling, using only four Gaussians in each state is generally not enough for good performance. Your second job is to extend the scripts to train a digit recognizer with 16 Gaussians in each state.
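
One plausible way to reach 16 Gaussians is to keep doubling the mixture count (4 to 8 to 16), re-estimating with Baum-Welch after each split, as the existing 2- and 4-Gaussian steps do. The sketch below generates that schedule and the matching one-line HHEd edit script using the MU command. The state range 2-4 is an assumption (a prototype with three emitting states between non-emitting entry and exit states); check it against the proto file and the existing mix scripts before using it.

```python
def mixture_schedule(start, target):
    """Mixture counts to pass through, doubling until target is reached."""
    counts = []
    n = start
    while n < target:
        n = min(n * 2, target)
        counts.append(n)
    return counts


def hhed_script(num_mixtures, states="2-4"):
    """Contents of an HHEd edit script that raises the mixture count.

    MU is the HHEd command for increasing the number of mixture
    components. The default state range is an assumption about the
    prototype, not something taken from the provided scripts.
    """
    return f"MU {num_mixtures} {{*.state[{states}].mix}}\n"


if __name__ == "__main__":
    for n in mixture_schedule(4, 16):
        print(f"mix{n:02d}.hed:", hhed_script(n), end="")
```

Each generated edit script would be applied with HHEd and followed by a few HERest passes before the next split.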

More Data:

The current scripts use only about half the training data (the male speakers) for training. Performance should improve if you use the female speakers as well. Your last job is to change the scripts so that they use all the training data.
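
HTK tools accept a script file listing one data file per line, so using all the training data largely comes down to building a longer list. The sketch below collects feature files from several speaker-group directories and writes such a list. The group directory names and the .mfc suffix in the usage note are hypothetical; substitute whatever the tidigits directory and the provided scripts actually use.

```python
from pathlib import Path


def collect_features(root, groups, suffix=".mfc"):
    """Return sorted feature-file paths under root/<group> for each group."""
    paths = []
    for group in groups:
        paths.extend(sorted((Path(root) / group).rglob(f"*{suffix}")))
    return paths


def write_scp(paths, scp_file):
    """Write one path per line, the layout HTK expects in a script file."""
    Path(scp_file).write_text("".join(f"{p}\n" for p in paths))
```

For example, assuming the features were written under hypothetical directories mfcc/train/man and mfcc/train/woman, write_scp(collect_features("mfcc/train", ["man", "woman"]), "train_all.scp") would produce a list covering both groups.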

Extra Point:

Can you improve the accuracy further? You can try anything you've learned from class or the HTK handbook to improve performance.

What you should turn in:

(1) The digit accuracy with all male speakers as training data and 4 Gaussians in each state.

(2) The digit accuracy with all male speakers as training data and 16 Gaussians in each state.

(3) The digit accuracy with all male and female speakers as training data and 4 Gaussians in each state.

(4) The digit accuracy with all male and female speakers as training data and 16 Gaussians in each state.

(5) A brief error analysis of what you found across those four digit recognizers.

(6) Optional extra point.
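
For items (1) through (4), it helps to know how the reported figures are defined. HResults reports both a percent-correct figure and an accuracy figure, computed from the number of reference labels N and the deletion, substitution, and insertion counts D, S, and I. The sketch below reproduces those two formulas; the counts in the usage note are made up for illustration and are not results from this task.

```python
def percent_correct(n, deletions, substitutions):
    """Percent correct: ignores insertion errors."""
    return 100.0 * (n - deletions - substitutions) / n


def percent_accuracy(n, deletions, substitutions, insertions):
    """Accuracy: also penalizes insertion errors."""
    return 100.0 * (n - deletions - substitutions - insertions) / n
```

With illustrative counts N = 1000, D = 10, S = 20, and I = 5, percent_correct gives 97.0 and percent_accuracy gives 96.5. The gap between the two shows how much insertions are costing the recognizer, which is a useful starting point for the error analysis in item (5).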

How to turn in the homework:

* Email your homework to David Borowitz at cs224s-win0809-ta@lists.stanford.edu
* Please put it in the form of a single PDF file with all your answers to the above questions.
* The filename should be in the following format:
lastname_firstname_hw#.pdf

Reference:

You can find more details about how to use HTK in the HTK handbook.