CS 224S / LINGUIST 281
Speech Recognition and Synthesis
Winter 2007

COURSE INFORMATION
Instructor
Dan Jurafsky, jurafsky@stanford.edu
Office: Margaret Jacks Hall (bld 460) 113
Office Hours: Mon 10:00-11:00am
TA
Yun-Hsuan Sung cs224s-ta@cs.stanford.edu
Office: Margaret Jacks Hall (bld 460) 022
Office Hours: Mon 12:40-1:30pm and Wed 11:10am-12:00pm
Newsgroup
We have set up a class newsgroup, su.class.cs224s. Please post questions there, and check whether your question has already been asked before posting it.
Time
Tuesdays and Thursdays, 10:00-11:15am (NOTE SLIGHT CHANGE IN TIME TO START AT 10:00am)
Location
200-217
Textbook

Description
This course is an introduction to automatic speech recognition, speech understanding and speech synthesis/text-to-speech from the computer science and linguistics (rather than EE) perspective. The focus is on understanding key algorithms, including the noisy channel model, Hidden Markov Models (HMMs), GMM (Gaussian Mixture Model) acoustic models, A* and Viterbi decoding, N-gram language modeling, unit selection synthesis, and dialogue architectures, as well as the role of linguistic knowledge (esp. phonetics, intonation, pronunciation variation, disfluencies, emotion). Prerequisites: programming experience and familiarity with probability. Recommended: CS 221 or CS 229.
Required Work
  • Homeworks: 6 homeworks (see the Homework Collaboration Policy). Homework is due at 9:59am on its due date (i.e. before class starts). LATE HOMEWORK WILL NOT BE ACCEPTED. However, I will drop your lowest homework grade.
  • Readings: To be read before the class period in which they will be discussed. THERE IS A LOT OF READING IN THIS COURSE!!! We are covering what are really three entire fields (speech recognition, speech synthesis, dialogue systems) in one 10-week quarter, and not everything can be covered in each lecture, so you need to do all the reading.
  • Final Project: Any project in speech recognition, speech synthesis, speech understanding, dialogue design, speech user interface design, etc. Both individual and joint projects are fine. The final project will be presented as a poster at the poster session on Thursday March 15, and is due by email on Monday March 19 at noon PST. Details here. Some sample publishable projects involving collaboration with students or postdocs in my lab are listed here.
  • Determination of final grade:
    • 45%: final project
    • 45%: best 5 homeworks out of 6
    • 10%: class participation


SCHEDULE
Wk | Date | HW | Lec | Topic and Readings
1 | Jan 9 | | Lec 1 (ppt), Lec 1 (6up PDF) | Course Overview and History, Articulatory Phonetics and ARPAbet transcription
1 | Jan 11 | | Lec 2 (ppt), Lec 2 (6up PDF) | Phonetics: Acoustic Phonetics
2 | Jan 16 | HW 1 due | Lec 3 (ppt), Lec 3 (6up PDF) | TTS: Background (machine learning, classification, NLP) and Text Normalization
2 | Jan 18 | | Lec 4 (ppt), Lec 4 (6up PDF) | Grapheme-to-phoneme and the Festival software (plus Text Normalization spillover)
3 | Jan 23 | HW 2 due | Lec 5 (ppt), Lec 5 (6up PDF) | TTS: Prosody (Intonation, Boundaries, and Duration)
3 | Jan 25 | | Lec 6 (ppt), Lec 6 (6up PDF) | TTS: Waveform Synthesis (Diphone and Unit Selection Synthesis)
4 | Jan 30 | HW 3 due | Lec 7 (ppt), Lec 7 (6up PDF) | ASR: Noisy Channel Model, Bayes, HMMs, Forward, Viterbi, Start of Baum-Welch
4 | Feb 1 | | Lec 8 (ppt), Lec 8 (6up PDF) | ASR: HMMs continued, and application to ASR
5 | Feb 6 | 1-paragraph project proposal | Lec 9 (ppt), Lec 9 (pdf) | ASR: Feature Extraction
5 | Feb 8 | | Lec 10 (ppt), Lec 10 (6up PDF) | ASR: Acoustic Modeling
6 | Feb 13 | HW 4 due | Lec 11 (ppt), Lec 11 (6up PDF) | N-grams and Language Modeling
6 | Feb 15 | | Lec 12 (ppt), Lec 12 (6up PDF) | ASR: Search (Lattices, N-best lists, A*, etc.)
7 | Feb 20 | HW 5 due | Lec 13 (ppt), Lec 13 (6up PDF) | Conversational Agents: Simple dialogue systems (frame-based systems, VXML, evaluation)
7 | Feb 22 | | Lec 14 (ppt), Lec 14 (6up PDF) | Conversational Agents: Grounding, Confirmation, Dialogue Acts
8 | Feb 27 | HW 6 due | Lec 15 (ppt), Lec 15 (6up PDF) | Conversational Agents III: Markov Decision Processes (MDPs), etc.
8 | Mar 1 | | Lec 16 (ppt), Lec 16 (6up PDF) | ASR: Dealing with Variation (Adaptation, MLLR, Pronunciation modeling)
9 | Mar 6 | | Lec 17 (ppt), Lec 17 (6up PDF) | Disfluencies and Metadata: Boundaries, Fillers, Edit Terms
9 | Mar 8 | | Lec 18 (ppt), Lec 18 (6up PDF) | Emotional speech
10 | Mar 13 | | | Guest Lecture: Luciana Ferrer on Speaker Recognition
10 | Mar 15 | | | Final Project Presentations