CS 276 / LING 286
IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008.
This book is available from the Stanford bookstore (or your favorite book purveyor). You can also download and print chapters at the book website. (We’d appreciate any reports of typos or of higher-level problems for the third printing. Thanks.)
MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
IRAH = Information Retrieval: Algorithms and Heuristics, by D. Grossman and O. Frieder.
IR = Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
FOA = Finding Out About, by R. Belew.
MTW = Mining the Web, by S. Chakrabarti.
FSNLP = Foundations of Statistical Natural Language Processing, by C. Manning and H. Schütze.
These books all have useful information on topics that we cover and are recommended as references. MG is particularly good as a detailed reference for technical IR in the first half of the course. MTW covers many of the topics from the latter part of the course.
More detailed resources can be found here.
PN = Pandu Nayak
PR = Prabhakar Raghavan
TA = TA
All lectures will be held in Gates B03.
Lectures are Tuesdays and Thursdays from 4:15 to 5:30.
Six review sessions are scheduled for assignments and exams.
The final exam is on Tuesday, 7 June, from 12:15 to 3:15pm in Gates B01 and B03.
Details of the schedule, slides and reading lists will be updated as the quarter progresses.
| Date | Topics | Notes | Who | Readings | Assignments |
| --- | --- | --- | --- | --- | --- |
| Tue 29 Mar | IR 1. Introduction to Information Retrieval. Inverted indices and boolean queries. Query optimization. The nature of unstructured and semi-structured text. Course administrivia. | [powerpoint] | PN | IIR Ch. 1 | |
| Thu 31 Mar | IR 2. The term vocabulary and postings lists. Text encoding: tokenization, stemming, lemmatization, stop words, phrases. Optimizing indices with skip lists. Proximity and phrase queries. Positional indices. | [powerpoint] | PN | IIR Ch. 2 | |
| Tue 5 Apr | IR 3. Dictionaries and tolerant retrieval. Dictionary data structures. Wild-card queries, permuterm indices, n-gram indices. Spelling correction and synonyms: edit distance, soundex, language detection. | [powerpoint] | PN | IIR Ch. 3 | [PS1 out] |
| Thu 7 Apr | IR 4. Index construction. Postings size estimation, sort-based indexing, dynamic indexing, positional indexes, n-gram indexes, distributed indexing, real-world issues. | [powerpoint] | PR | IIR Ch. 4 | |
| Tue 12 Apr | Review session for PS1 (Location: Huang 018, Time: 2:15pm-3:05pm) | | TA | | |
| Tue 12 Apr | IR 5. Index compression: lexicon compression and postings lists compression. Gap encoding, gamma codes, Zipf's Law, variable-byte encoding. Blocking. Extreme compression. | [powerpoint] | PN | IIR Ch. 5 | [PE1 out] |
| Thu 14 Apr | IR 6. Scoring, term weighting, and the vector space model. Parametric or fielded search. Document zones. The vector space retrieval model. tf.idf weighting. The cosine measure. Scoring documents. | [powerpoint] | PR | | PS1 due |
| Tue 19 Apr | Review session for PE1 (Location: Huang 018, Time: 2:15pm-3:05pm) | | TA | | |
| Tue 19 Apr | IR 7. Computing scores in a complete search system: Components of an IR system. Efficient vector space scoring. Nearest neighbor techniques, reduced dimensionality approximations, random projection. | [powerpoint] | PR | | |
| Thu 21 Apr | IR 8. Results summaries: static and dynamic. Evaluating search engines. User happiness, precision, recall, F-measure. Creating test collections: kappa measure, interjudge agreement. Relevance, approximate vector retrieval. | [powerpoint] | PR | IIR Ch. 8 | PE1 due |
| Tue 26 Apr | Review session for midterm (Location: Huang 018, Time: 2:15pm-3:05pm) | | TA | | |
| Tue 26 Apr | IR 9. Relevance feedback. Pseudo relevance feedback. Query expansion. Automatic thesaurus generation. Sense-based retrieval. Experimental results of performance effectiveness. | [powerpoint] | PR | IIR Ch. 9 | |
| Thu 28 Apr | Midterm to be held in-class | | PN, PR | | |
| Tue 3 May | CLASSIFICATION 1. Introduction to text classification. Naive Bayes models. Spam filtering. | [powerpoint] | PN | IIR Ch. 13; Tackling the poor assumptions of Naive Bayes classifier (Rennie et al. 2003) (for PE2) | |
| Thu 5 May | CLASSIFICATION 2. K Nearest Neighbors, Decision boundaries, Vector space classification using centroids. Comparative results. | [powerpoint] | PN | IIR Ch. 14 | |
| Tue 10 May | CLUSTERING 1. Introduction to the problem. Partitioning methods: k-means clustering; Hierarchical clustering. | [powerpoint] | PR | | [PS2 out] |
| Thu 12 May | CLUSTERING 2. Latent semantic indexing (LSI). Applications to clustering and to information retrieval. | [powerpoint] | PN | | |
| Tue 17 May | Review session for PS2 (Location: Huang 018, Time: 2:15pm-3:05pm) | | TA | | |
| Tue 17 May | CLASSIFICATION 3. Support vector machine classifiers. Kernel Function. Evaluation of classification. Micro- and macro-averaging. Learning rankings. | [powerpoint] | PN | IIR Ch. 15 | |
| Thu 19 May | Web 1: What makes the web different. Web search overview, web structure, the user, paid placement, search engine optimization/spam. Web size measurement. | [powerpoint] | PN | | PS2 due |
| Tue 24 May | Review session for PE2 (Location: Huang 018, Time: 2:15pm-3:05pm) | | TA | | |
| Tue 24 May | Web 2: Crawling and web indexes. Near-duplicate detection. | [powerpoint] | PN | IIR Ch. 20 | |
| Thu 26 May | Web 3: Link analysis. | [powerpoint] | PR | IIR Ch. 21 | PE2 due |
| Tue 31 May | Review session for final (Location: Huang 018, Time: 2:15pm-3:05pm) | | TA | | |
| Tue 31 May | Web 4: Learning to rank. | [powerpoint] | PR | IIR 6.1.2-3, IIR 15.4 | |
| Thu 2 Jun | No classes. | | | | |
| Tue 7 Jun | Final Exam (from 12:15-3:15pm, location: Gates B01, B03.) | | | | |