STANFORD CS 224S/LINGUIST 136/236
Speech Recognition and Synthesis  
Winter 2005
Course Information

Time: Tu, Th 10:30-11:50am, Classroom 460-126
Professor: Dan Jurafsky. Office Hours: 113 Margaret Jacks, Tuesdays 3:00-4:30pm
TA: Rion Snow. Office Hours: 110 Gates, Mondays/Thursdays 2:45-4:00pm

This course is an introduction to automatic speech recognition, speech understanding, and speech synthesis/text-to-speech from the computer science and linguistics (as opposed to EE) perspective. The focus is on understanding key algorithms, including the noisy channel model, Hidden Markov Models (HMMs), A* and Viterbi decoding, N-gram language modeling, and unit selection synthesis, and on the roles of linguistic knowledge (esp. phonetics, intonation, pronunciation variation, disfluencies). Prerequisites: programming experience. Recommended: basic familiarity with probability.

Course Requirements:

Readings:

To be handed out. Readings are either papers from the literature, or selections from one of three reference/text books. The textbooks are:

Schedule

WK
DATE
NOTES
TOPIC
HW DUE TODAY
1
Jan 4
lec1.ppt
lec1.PDF
lec1.4up.PDF
Introduction to ASR, History, Articulatory Phonetics, ARPAbet transcription
1
Jan 6
lec2.ppt
lec2.pdf
lec2.4up.pdf
Acoustic Phonetics

J+M Chapter 4 pages 93-105.
HW 0 (turn in w/HW 1)
Speech Synthesis Week 1: Written Sentences to Phonemes
2
Jan 11
lec3.ppt
lec3.pdf
lec3.4up.pdf
Introduction to TTS, TTS Architectures, How to use Festival
Thanks to Alan Black! Much of the TTS material is from his course at CMU this spring and in previous springs.
Read sections 1, 2, 3, and 4 from Alan Black's lecture notes on TTS and Festival.
You should also be looking at the Festival manual. You don't have to read the whole thing through, but you should skim it so you know where things are.
For those who don't know Scheme, which is used as Festival's scripting language, here's an Introduction to Scheme for C Programmers, from Caltech.
For those who get really excited by Scheme and want to know more, here's the homepage for the text Structure and Interpretation of Computer Programs, by Abelson, Sussman, and Sussman.
HW 1
2
Jan 13
lec4.ppt
lec4.pdf
lec4.4up.pdf
Text normalization, grapheme-to-phoneme
Huang Chapter 14 and J+M Chapter 8
Read sections 5 and 6.1.1 from Alan Black's lecture notes on TTS and Festival.

Speech Synthesis Week 2: Phonemes to Speech
3
Jan 18
lec5.ppt
lec5.pdf
lec5.6up.pdf
Prosody: Intonation, Boundaries, and Duration
Huang Chapter 15, and Huang 4.5 for those who haven't seen decision trees before
Read the rest of Section 6 of Alan Black's lecture notes
HW 2
3
Jan 20
lec6.ppt
lec6.pdf
lec6.6up.pdf
Waveform Synthesis: Diphone and Unit Selection Synthesis
Huang Chapter 16
Section 7 from Alan Black's lecture notes on TTS and Festival.


Speech Recognition: HMMs
4
Jan 25
lec7.ppt
lec7.pdf
lec7.6up.pdf
Noisy Channel Model, Bayes, HMMs, Forward, Viterbi
Knill and Young. 1997. Hidden Markov Models in Speech and Language Processing. In S. Young and G. Bloothooft (eds.), Corpus-Based Methods in Language and Speech Processing. Kluwer, 27-68.
Huang Chapter 8 pages 377-393
Additional reading to give more background on HMMs:
Manning+Schuetze Chapter 9
HW 3
4
Jan 27
lec8.ppt
lec8.pdf
lec8.6up.pdf
HMMs continued, up to Baum-Welch (Forward-Backward) algorithm
Rest of Huang Chapter 8 (pages 394-409)
Additional reading to give more background on HMMs:
Manning+Schuetze Chapter 9

Getting a speech recognizer for your project: where are HTK, Sonic, and Sphinx?
HTK
Sonic
Sphinx
Here are instructions for the Digit Recognizer tutorial with Sonic, Bryan Pellom's recognizer. Those of you interested in digit recognition for your projects will want to look at this.

Speech Recognition: Acoustic Modeling
5
Feb 1
lec9.ppt
lec9.pdf
lec9.6up.pdf
Acoustic Modeling: GMMs (Gaussian Mixture Models), triphones, state tying, decision trees
Huang Chapter 9
JOB LISTINGS

5
Feb 3
lec10.pdf
More on Acoustic Modeling: Guest Lecture by Mark Mao
1-paragraph Project Proposal
Speech Recognition: Language Modeling
6
Feb 8
lec11.ppt
N-grams and Language Modeling: TA Lecture by Rion Snow
Manning+Schuetze Chapter 6
Optional additional reading: J+M Chapter 6
HW 4
ASR: Search
6
Feb 10
lec12.ppt
lec12.pdf
lec12.6up.pdf
Search: Lattices, N-best lists, A*, etc.

Pages 244-259 of J+M Chapter 7
Dialogue and Conversational Agents
7
Feb 15
lec13.ppt
lec13.pdf
lec13.6up.pdf
Simple dialogue systems: components, frame-based dialogue systems, VXML, Evaluation
J+M New Chapter 19 pages 1-25
7
Feb 17
lec14.ppt
lec14.pdf
lec14.6up.pdf
Grounding, Confirmation, Speech and Dialogue Acts, Dialogue Act Tagging
J+M New Chapter 19 pages 26-42
HW 5
Advanced Issues in Speech Understanding and Synthesis: Prosody
8
Feb 22
Guest Lecture by Jared Bernstein, Ordinate: "Measure the speaker, not the speech. Validating speech processing for language proficiency tests"

8
Feb 24
lec16.ppt
lec16.pdf
lec16.6up.pdf
Disfluencies and Metadata: Boundaries, Fillers, Edit Terms

Read the following paper:
Shriberg, Elizabeth, Andreas Stolcke, Dilek Hakkani-Tur, and Gokhan Tur. 2000. Prosody-based automatic segmentation of speech into sentences and topics. Speech Communication (Special issue on accessing information in spoken audio) 32:1-2, 127-154.
In addition, read any one of the following 6 papers:
J. Bear, J. Dowding, and E. Shriberg. 1992. Integrating Multiple Knowledge Sources for the Detection and Correction of Repairs in Human-Computer Dialog. Proceedings of the 30th ACL.
Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Barbara Peskin, Jeremy Ang, Dustin Hillard, Mari Ostendorf, Marcus Tomalin, Phil Woodland, and Mary Harper. 2005. Structural Metadata Research in the EARS Program. Invited paper, to appear in ICASSP 2005.
Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, Barbara Peskin, and Mary Harper. 2004. The ICSI-SRI-UW Metadata Extraction System, ICSLP 2004.
Yang Liu, Elizabeth Shriberg, and Andreas Stolcke. 2003. Automatic Disfluency Identification in Conversational Speech Using Multiple Knowledge Sources, EuroSpeech 2003.
Yang Liu. 2003. Word Fragment Identification Using Acoustic-Prosodic Features in Conversational Speech, HLT/NAACL 2003 Student Workshop.
Snover, Matthew, Bonnie Dorr and Richard Schwartz. 2004. A Lexically-Driven Algorithm for Disfluency Detection. Short Papers Proceedings of HLT-NAACL 2004. Boston: ACL. 157--160.
HW 6
Speech Recognition: Advanced Issues
9
Mar 1
lec17.ppt
lec17.pdf
lec17.6up.pdf
Variation and Adaptation: Speaker adaptation, MLLR, Pronunciation variation, etc.
9
Mar 3
lec18.ppt
lec18.pdf
lec18.6up.pdf
Advanced Issues in Dialogue: Markov Decision Processes (MDPs), PARADISE, etc.

J+M New Chapter 19 pages 22-25, 46-50

10
Mar 8
Advanced Topic TBD: Perhaps Emotional Speech
10
Mar 10
Final Project Poster Presentations

Course Newsgroup: su.class.linguist236
URL: http://www.stanford.edu/class/linguist236/