CS 124/LINGUIST 180
From Languages to Information

Winter 2009

COURSE INFORMATION
Instructor
Dan Jurafsky, jurafsky@stanford.edu
Office: Margaret Jacks Hall (Bldg. 460) 117
Office Hours: M 12:00-12:30, Tu 4:30-5:30
TAs
Jenny Finkel
Office: Gates 232
Office hours: Thursday 10:45am - 12:15pm
David Hall
Office: Gates 232
Office hours: Monday 1:30-2:15, Wednesday 2:00-4:00
Time: Tues/Thur, 9:30-10:45am
Staff Email: cs124-win0809-staff@lists.stanford.edu (for any questions about the homework or anything else)
Location: 200-030
Textbooks
  • Required: Jurafsky and Martin. 2008. Speech and Language Processing (2nd Edition). Pearson
  • Recommended: Manning, Raghavan, and Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
Note that the readings from MR+S are required, not recommended. Only the physical book is optional, because you may do the reading online.
Description: Automated processing of less structured information (human language text and speech, web pages, social networks, genome sequences), with the goal of automatically extracting meaning and structure. Methods include string algorithms, automata and transducers, hidden Markov models, graph algorithms, and XML processing. Applications include information retrieval, text classification, social network models, machine translation, genomic sequence alignment, word meaning extraction, and speech recognition.
Prerequisite: CS 106B, CS 103 or 103B, and CS 107 (or familiarity with Linux shell scripts)
Required Work
  • Homeworks: 7 homeworks. Each homework is due at 9:29am on its due date (i.e., before class starts).
    • Homework Collaboration Policy: You may talk to anybody you want about the assignments and bounce ideas off each other. But you must write the actual homeworks and programs yourself.
    • Late homeworks: You have 4 free late (calendar) days to use on the homeworks. Once these late days are exhausted, any homework turned in late will be penalized 20% per late day. Each 24 hours or part thereof that a homework is late uses up one full late day.
  • Readings: To be read before the class period in which they will be discussed. I will expect you to do a significant amount of textbook reading in this course.
  • Final Exam: Friday March 20, 12:15pm-3:15pm in 300-300

  • Determination of final grade:
    • 55% homeworks
    • 5% class participation
    • 40% final exam
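
The late-homework rule above can be worked through with a small sketch. This is an illustration only, not an official grading script: it assumes free late days are consumed automatically in submission order, and that the 20% penalty is taken from that homework's score and capped at 100%. The function and constant names are hypothetical.

```python
import math

FREE_LATE_DAYS = 4        # free late (calendar) days, from the policy above
PENALTY_PER_DAY = 0.20    # 20% per late day once free days are exhausted


def late_days_used(hours_late: float) -> int:
    """Each 24 hours or part thereof counts as one full late day."""
    if hours_late <= 0:
        return 0
    return math.ceil(hours_late / 24)


def apply_late_policy(hours_late_per_hw):
    """Return (penalized_days, penalty_fraction) for each homework.

    Assumption: free late days are spent in submission order, and the
    penalty fraction is capped at 1.0 (the whole homework score).
    """
    free_left = FREE_LATE_DAYS
    results = []
    for hours in hours_late_per_hw:
        days = late_days_used(hours)
        covered = min(days, free_left)   # late days absorbed by free days
        free_left -= covered
        penalized = days - covered       # late days that incur a penalty
        penalty = min(1.0, penalized * PENALTY_PER_DAY)
        results.append((penalized, penalty))
    return results
```

For example, `apply_late_policy([0, 30, 50, 1])` treats the 30-hour submission as 2 late days (both free), the 50-hour submission as 3 late days (2 free, 1 penalized), and the final 1-hour-late submission as 1 penalized late day.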




SCHEDULE
Wk
Date
HW
Lec

Topic and Readings

1
Jan 6
Lec 1 (ppt)
Lec 1 (pdf)

Strings, Formal languages, and Automata

  • J+M Chapter 2: Regular Expressions and Automata
1
Jan 8
Lec 2 (ppt)
Lec 2 (pdf)

Edit Distance (in Text and Genes) and start of Tokenization

  • J+M Chapter 3: Words and Transducers, pages 68-77
  • MR+S Chapter 2: Term vocabulary and postings lists, pages 26-32
2
Jan 13
HW 1: Harvesting email addresses and phone numbers
Lec 3 (ppt)
Lec 3 (pdf)

Language Modeling (and Probability Theory Background)

2
Jan 15
Lec 4 (ppt)
Lec 4 (pdf)

Naive Bayes and Text Classification

  • MR+S Chapter 13: Text classification and Naive Bayes, pages 234-250

3
Jan 20
HW 2: Language Identification
Lec 5 (ppt)
Lec 5 (pdf)

Text Classification for Sentiment Analysis

3
Jan 22
Lec 6 (ppt)
Lec 6 (pdf)

Hidden Markov Models

4
Jan 27
HW 3: Sentiment analysis of movie reviews
Lec 7 (ppt)
Lec 7 (pdf)

Named Entity Tagging

  • J+M Chapter 22: Information Extraction, pages 727-734, 743-749
  • J+M Chapter 6: Logistic Regression and MEMMs, pages 193-211
4
Jan 29
Lec 8 (ppt)
Lec 8 (pdf)

Information Retrieval (I)

  • MR+S Chapter 1: Boolean Retrieval
  • MR+S Chapter 2: Term vocabulary and postings lists
5
Feb 3

Lec 9 (ppt)
Lec 9 (pdf)

Information Retrieval (II)

  • MR+S Chapter 6: Scoring, term weighting, and the vector space model
5
Feb 5
HW 4: Person name extraction
Lec 10 (ppt)
Lec 10 (pdf)

Information Retrieval (III)

  • MR+S Chapter 8: Evaluation in Information Retrieval
6
Feb 10
Lec 11 (ppt)
Lec 11 (pdf)

XML: accessing structured information (I)

6
Feb 12
Lec 12 (ppt)
Lec 12 (pdf)
XML: accessing structured information (II)
  • XML in a Nutshell via Safari Tech Books, Chapter 8 (XSLT)
  • XML in a Nutshell via Safari Tech Books, Chapter 9 (XPath)
    To get these, go to library.stanford.edu/ezproxy/, choose Safari Tech Books, and search for XML in a Nutshell.
7
Feb 17

HW 5: Exercises and Search Engine analysis
Lec 13 (ppt)
Lec 13 (pdf)

Computational Lexical Semantics
  • J+M Chapter 19: Lexical Semantics (pages 611-619)
  • J+M Chapter 20: Computational Lexical Semantics (pages 652-670)
7
Feb 19
Lec 14 (ppt)
Lec 14 (pdf)

Relation and Information Extraction
  • J+M Chapter 22: Information Extraction (including Biomedical Information Extraction), pages 734-743, 749-764
8
Feb 24
Lec 15 (ppt)
Lec 15 (pdf)

Machine Translation

  • J+M Chapter 25: Machine Translation, pages 859-879
8
Feb 26
HW 6: Relation Extraction
Lec 16 (ppt)
Lec 16 (pdf)

Machine Translation

  • J+M Chapter 25: Machine Translation, pages 879-897
9
Mar 3
Lec 17 (ppt)
Lec 17 (pdf)

Web graphs, Links, and PageRank
9
Mar 5
HW 7: Machine Translation
Lec 18 (ppt)
Lec 18 (pdf)

Understanding Social and Technological Networks: Small Worlds, Fat Tails, and Whatnot
10
Mar 10

No Class Today
10
Mar 12

Lec 19 (ppt)
Lec 19 (pdf)

Speech Recognition
  • J+M Chapter 9: Automatic Speech Recognition, pages 285-314