CS 224S/LINGUIST 136/236
Speech Recognition and Synthesis Winter 2005 |
| Course Information |
| Time : | Tu,Th 10:30-11:50am | Classroom | 460-126 |
| Professor: | Dan Jurafsky | Office Hours | 113 Margaret Jacks, Tuesdays 3:00-4:30pm |
| TA: | Rion Snow | Office Hours | 110 Gates, Mondays/Thursdays 2:45-4:00pm |
This course is an introduction to automatic speech recognition, speech understanding, and speech synthesis/text-to-speech from the computer science and linguistics (as opposed to EE) perspective. The focus is on understanding key algorithms, including the noisy channel model, Hidden Markov Models (HMMs), A* and Viterbi decoding, N-gram language modeling, and unit selection synthesis, and on the roles of linguistic knowledge (esp. phonetics, intonation, pronunciation variation, disfluencies). Prerequisites: programming experience. Recommended: basic familiarity with probability.
| WK |
DATE |
NOTES |
TOPIC |
HW DUE TODAY |
|
| 1 |
Jan 4 |
lec1.ppt lec1.PDF lec1.4up.PDF |
Introduction to ASR, History, Articulatory Phonetics, ARPAbet transcription | |
|
| 1 |
Jan 6 |
lec2.ppt lec2.pdf lec2.4up.pdf |
Acoustic Phonetics J+M Chapter 4 pages 93-105. |
HW 0 (turn in w/HW 1) |
Speech Synthesis Week 1: Written Sentences to Phonemes |
| 2 |
Jan 11 |
lec3.ppt lec3.pdf lec3.4up.pdf |
Introduction to TTS, TTS Architectures, How to use Festival
Thanks to Alan Black! Much of the TTS material is from his course at CMU this spring and in previous springs. Read sections 1, 2, 3, and 4 from Alan Black's lecture notes on TTS and Festival. You should also look at the Festival manual; you don't have to read the whole thing, but skim it so you know where things are. For those who don't know Scheme, which Festival uses as its scripting language, here's an Introduction to Scheme for C Programmers, from Caltech. For those who get really excited by Scheme and want to know more, here's the homepage for the text Structure and Interpretation of Computer Programs, by Abelson, Sussman, and Sussman. | HW 1 | |
| 2 |
Jan 13 |
lec4.ppt lec4.pdf lec4.4up.pdf |
Text normalization, grapheme-to-phoneme
Huang Chapter 14 and J+M Chapter 8. Read sections 5 and 6.1.1 from Alan Black's lecture notes on TTS and Festival. |
|
Speech Synthesis Week 2: Phonemes to Speech |
| 3 |
Jan 18 |
lec5.ppt lec5.pdf lec5.6up.pdf |
Prosody: Intonation, Boundaries, and Duration
Huang Chapter 15, and Huang 4.5 for those who haven't seen decision trees before. Read the rest of Section 6 of Alan Black's lecture notes. |
HW 2 | |
| 3 |
Jan 20 |
lec6.ppt lec6.pdf lec6.6up.pdf |
Waveform Synthesis: Diphone and Unit Selection Synthesis
Huang Chapter 16. Section 7 from Alan Black's lecture notes on TTS and Festival. | |
|
Speech Recognition: HMMs |
| 4 |
Jan 25 |
lec7.ppt lec7.pdf lec7.6up.pdf |
Noisy Channel Model, Bayes, HMMs, Forward, Viterbi
Knill and Young. 1997. Hidden Markov Models in Speech and Language Processing. In S. Young and G. Bloothooft (eds.), Corpus-Based Methods in Language and Speech Processing. Kluwer, 27-68. Huang Chapter 8, pages 377-393. Additional reading to give more background on HMMs: Manning+Schuetze Chapter 9. |
HW 3 | |
| 4 |
Jan 27 |
lec8.ppt lec8.pdf lec8.6up.pdf |
HMMs continued, up to Baum-Welch (Forward-Backward) algorithm
Rest of Huang Chapter 8 (pages 394-409). Additional reading to give more background on HMMs: Manning+Schuetze Chapter 9. Getting a speech recognizer for your project: where are HTK, Sonic, and Sphinx? HTK, Sonic, Sphinx. Here are instructions for the Digit Recognizer tutorial with Sonic, Bryan Pellom's recognizer. Those of you interested in digit recognition for your projects will want to look at this. |
|
Speech Recognition: Acoustic Modeling |
| 5 |
Feb 1 |
lec9.ppt lec9.pdf lec9.6up.pdf |
Acoustic Modeling: GMMs (Gaussian Mixture Models), triphones, state tying, decision trees
Huang Chapter 9 JOB LISTINGS |
|
|
| 5 |
Feb 3 |
lec10.pdf | More on Acoustic Modeling: Guest Lecture by Mark Mao
|
1-paragraph Project Proposal |
Speech Recognition: Language Modeling |
| 6 |
Feb 8 |
lec11.ppt | N-grams and Language Modeling: TA Lecture by Rion Snow
Manning+Schuetze Chapter 6 Optional additional reading: J+M Chapter 6 |
HW 4 | ASR: Search |
| 6 |
Feb 10 |
lec12.ppt lec12.pdf lec12.6up.pdf |
Search: Lattices, N-best lists, A*, etc. Pages 244-259 of J+M Chapter 7 | Dialogue and Conversational Agents | |
| 7 |
Feb 15 |
lec13.ppt lec13.pdf lec13.6up.pdf |
Simple dialogue systems: components, frame-based dialogue systems, VXML, Evaluation
J+M New Chapter 19 pages 1-25 | ||
| 7 |
Feb 17 |
lec14.ppt lec14.pdf lec14.6up.pdf |
Grounding, Confirmation, Speech and Dialogue Acts, Dialogue Act Tagging
J+M New Chapter 19 pages 26-42 | HW 5 | Advanced Issues in Speech Understanding and Synthesis: Prosody |
| 8 |
Feb 22 |
Guest Lecture by Jared Bernstein, Ordinate:
"Measure the speaker, not the speech. Validating speech processing for language proficiency tests"
| |||
| 8 |
Feb 24 |
lec16.ppt lec16.pdf lec16.6up.pdf |
Disfluencies and Metadata: Boundaries, Fillers, Edit Terms
Read the following paper:
Shriberg, Elizabeth, Andreas Stolcke, Dilek Hakkani-Tur, and Gokhan Tur. 2000. Prosody-based automatic segmentation of speech into sentences and topics. Speech Communication (Special issue on accessing information in spoken audio) 32:1-2, 127-154.
In addition, read any one of the following 6 papers:
J. Bear, J. Dowding, and E. Shriberg. 1992. Integrating Multiple Knowledge Sources for the Detection and Correction of Repairs in Human-Computer Dialog. Proceedings of the 30th ACL.
Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Barbara Peskin, Jeremy Ang, Dustin Hillard, Mari Ostendorf, Marcus Tomalin, Phil Woodland, and Mary Harper. 2005. Structural Metadata Research in the EARS Program. Invited paper. To appear in ICASSP 2005.
Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, Barbara Peskin, and Mary Harper. 2004. The ICSI-SRI-UW Metadata Extraction System. ICSLP 2004.
Yang Liu, Elizabeth Shriberg, and Andreas Stolcke. 2003. Automatic Disfluency Identification in Conversational Speech Using Multiple Knowledge Sources. EuroSpeech 2003.
Yang Liu. 2003. Word Fragment Identification Using Acoustic-Prosodic Features in Conversational Speech. HLT/NAACL 2003 Student Workshop.
Snover, Matthew, Bonnie Dorr, and Richard Schwartz. 2004. A Lexically-Driven Algorithm for Disfluency Detection. Short Papers Proceedings of HLT-NAACL 2004. Boston: ACL. 157-160. |
HW 6 | Speech Recognition: Advanced Issues |
| 9 |
Mar 1 |
lec17.ppt lec17.pdf lec17.6up.pdf |
Variation and Adaptation: Speaker adaptation, MLLR, Pronunciation variation, etc | |
|
| 9 |
Mar 3 |
lec18.ppt lec18.pdf lec18.6up.pdf |
Advanced Issues in Dialogue: Markov Decision Processes (MDPs), PARADISE, etc. J+M New Chapter 19 pages 22-25, 46-50 |
||
| 10 |
Mar 8 |
Advanced Topic TBD: Perhaps Emotional Speech
|
|||
| 10 |
Mar 10 |
Final Project Poster Presentations
|
Course Newsgroup: su.class.linguist236
URL: http://www.stanford.edu/class/linguist236/