Stanford University

CS 276 / LING 286
Information Retrieval and Web Search
Autumn 2009


Course Syllabus

 

Required Textbook:

IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008.

This book is available from the Stanford bookstore (or your favorite book purveyor). You can also download and print chapters at the book website. (We’d appreciate any reports of typos or of higher-level problems for the third printing. Thanks.)

 

Other Good IR etc. Books:

MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
IRAH = Information Retrieval: Algorithms and Heuristics by D. Grossman and O. Frieder.
IR = Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
FOA = Finding Out About, by R. Belew.
MTW = Mining the Web, by S. Chakrabarti.
FSNLP = Foundations of Statistical Natural Language Processing, by C. Manning and H. Schütze.

These books all have useful information on topics that we cover and are recommended as references. MG is particularly good as a detailed reference for technical IR in the first half of the course. MTW covers many of the topics from the latter part of the course.

 

Lecturers:

CM = Christopher Manning
PR = Prabhakar Raghavan
TA = TA

 

Locations and times:

All lectures will be held at Gates B01.
Lectures are Tuesdays and Thursdays from 4:15 to 5:30.

Six review sessions are scheduled for assignments and exams.
Review sessions time and place is Tuesdays from 1:15 to 2:05 in Skilling 193.
Scroll down to see the specific dates for the review sessions.

Final exam is scheduled for December 9th, Wed 12:15-3:15pm.

 

Schedule:

Details of the schedule, slides and reading lists will be updated as the quarter progresses.

Date

Topics

Notes

Who

Readings

Assignments

Tue 22 Sep

IR 1. Introduction to Information Retrieval. Inverted indices and boolean queries. Query optimization. The nature of unstructured and semi-structured text. Course administrivia.

[powerpoint]
[PDF/6]
[PDF/1]

CM

IIR Ch. 1
MG 3.2
MIR 8.2
Shakespeare plays

 

Thu 24 Sep

IR 2. The term vocabulary and postings lists. Text encoding: tokenization, stemming, lemmatization, stop words, phrases. Optimizing indices with skip lists. Proximity and phrase queries. Positional indices.

[powerpoint]
[PDF/6]
[PDF/1]

CM

IIR Ch. 2
MG 3.6, 4.3
MIR 7.2
Porter's stemmer (MIR), Porter stemming algorithm (Official)
A skip list cookbook (Pugh 1990)
Fast phrase querying with combined indexes (Williams, Zobel, Bahle 2004)
Efficient phrase querying with an auxiliary index (Bahle, Williams, Zobel 2002)

 

Tue 29 Sep

IR 3. Dictionaries and tolerant retrieval. Dictionary data structures. Wild-card queries, permuterm indices, n-gram indices. Spelling correction and synonyms: edit distance, soundex, language detection.

[powerpoint]
[PDF/6]
[PDF/1]

PR

IIR Ch. 3
MG 4.2
Techniques for automatically correcting words in text (Kukich 1992)
Finding approximate matches in large lexicons (Zobel and Dart 1995)
Efficient Generation and Ranking of Spelling Error Corrections (Tillenius)
How to write a spelling corrector (Peter Norvig)

[PS1 out]

Thu 1 Oct

IR 4. Index construction. Postings size estimation, sort-based indexing, dynamic indexing, positional indexes, n-gram indexes, distributed indexing, real-world issues.

[powerpoint]
[PDF/6]
[PDF/1]

CM

IIR Ch. 4
MG Ch. 5
MapReduce: simplified data processing on large clusters (Dean and Ghemawat 2004)
Efficient single-pass index construction for text databases (Heinz and Zobel 2003)

 

Tue 6 Oct

Review session for PS1

 

TA

 

 

Tue 6 Oct

IR 5. Index compression: lexicon compression and postings lists compression. Gap encoding, gamma codes, Zipf's Law, variable-byte encoding. Blocking. Extreme compression.

[powerpoint]
[PDF/6]
[PDF/1]

CM

IIR Ch. 5
MG 3.3, 3.4
Compression of inverted indexes for fast query evaluation (Scholer et al. 2002)
Inverted index compression using word-aligned binary codes (Anh and Moffat 2005)

[PE1 out]

Thu 8 Oct

IR 6. Scoring, term weighting, and the vector space model. Parametric or fielded search. Document zones. The vector space retrieval model. tf.idf weighting. The cosine measure. Scoring documents.  [YK]

[powerpoint]
[PDF/6]
[PDF/1]

CM

IIR 6.2 - 6.4.3

PS1 due
[Solutions]

Tue 13 Oct

Review session for PE1

[powerpoint]

TA

 

 

Tue 13 Oct

IR 7. Computing scores in a complete search system: Components of an IR system. Efficient vector space scoring. Nearest neighbor techniques, reduced dimensionality approximations, random projection.

[powerpoint]
[PDF/6]
[PDF/1]

PR

IIR Ch. 7, IIR 6.1

 

Thu 15 Oct

IR 8. Results summaries: static and dynamic. Evaluating search engines. User happiness, precision, recall, F-measure. Creating test collections: kappa measure, interjudge agreement. Relevance, approximate vector retrieval.

[powerpoint]
[PDF/6]
[PDF/1]

PR

IIR Ch. 8
MG 4.5
MIR Ch. 3
The use of MMR, diversity-based reranking for reordering documents and producing summaries (Carbonell and Goldstein 1998)

PE1 due

Tue 20 Oct

Review session for midterm

 

TA

 

[Practice Midterm]
[Solution]

Tue 20 Oct

IR 9. Relevance feedback. Pseudo relevance feedback. Query expansion. Automatic thesaurus generation. Sense-based retrieval. Experimental results of performance effectiveness.

[powerpoint]
[PDF/6]
[PDF/1]

PR

IIR Ch. 9
MG 4.7
MIR 5.2-5.4

 

Thu 22 Oct

Midterm to be held in-class

 cm

Midterm Statistics
Mean=55.85
Stdev=16.54
20th Percentile=41.6
40th Percentile=50.2
60th Percentile=59.4
80th Percentile=72

Midterm
[Solutions]

Tue 27 Oct

CLASSIFICATION 1. Introduction to text classification. Naive Bayes models. Spam filtering.

[powerpoint]
[PDF/6]
[PDF/1]

CM

IIR Ch. 13
Machine learning in automated text categorization (Sebastiani 2002)
A re-examination of text categorization methods (Yang et al. 1999)
A Comparison of event models for naive Bayes text classification (McCallum et al. 1998)
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
Open Calais
Weka
Reuters-21578

Tackling the poor assumptions of Naive Bayes classifier (Rennie et al. 2003) (for PE2)

 

Thu 29 Oct

CLASSIFICATION 2. K Nearest Neighbors, Decision boundaries, Vector space classification using centroids, Decision Trees. Comparative results.

[powerpoint]
[PDF/6]
[PDF/1]

CM

IIR Ch. 14
Machine learning in automated text categorization (Sebastiani 2002)
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
A re-examination of text categorization methods (Yang et al. 1999)

Evaluating and optimizing autonomous text classification systems (Lewis 1995)
Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
Open Calais
Weka

 

Tue 3 Nov

CLUSTERING 1. Introduction to the problem. Partitioning methods: k-means clustering; Hierarchical clustering.

 [powerpoint]
[PDF/6]
[PDF/1]

PR

IIR Ch. 16, IIR 17.1-3

 

[PS2 out]

Thu 5 Nov

CLUSTERING 2. Latent semantic indexing (LSI). Applications to clustering and to information retrieval.

[powerpoint]
[PDF/6]

[PDF/1]

PR

IIR Ch. 18

 

Tue 10 Nov

Review session for PS2

[powerpoint]

TA

 

 

Tue 10 Nov

CLASSIFICATION 3. Support vector machine classifiers. Kernel Function. Evaluation of classification. Micro- and macro-averaging. Learning rankings.

 

[powerpoint]
[PDF/6]
[PDF/1]

CM

IIR Ch. 15
A tutorial on support vector machines for pattern recognition (Burges 1998)
Using SVMs for text categorization (Dumais 1998)
Inductive learning algorithms and representations for text categorization (Dumais et a. 1998)
A Re-examination of text categorization methods (Yang et al. 1999)
Text categorization based on regularized linear classification methods (Zhang et al. 2001)

Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
Reuters-21578
Thorsten Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
A loss function analysis for classification methods in text categorization (Li et al. 2003)

 

[PE2 out]
[Reading]

Thu 12 Nov

Web 4:  Learning to rank

[powerpoint]
[PDF/6]
[PDF/1]

CM

IIR 6.1.2-3, IIR 15.4
Discriminative models for information retrieval (Nallapati 2004)
Adapting ranking SVM to document retrieval (Cao et al. 2006)
A support vector method for optimizing average precision (Yue et al. 2007)
LETOR benchmark datasets

PS2 due
[Solutions]

Tue 17 Nov

Review session for PE2

 

TA

 

 

Tue 17 Nov

Web 1: What makes the web different. Web search overview, web structure, the user, paid placement, search engine optimization/spam. Web size measurement.

 

[powerpoint]
[PDF/6]
[PDF/1]

PR

IIR Ch.19

 

 

Thu 19 Nov

Web 2:  Crawling and web indexes. Near-duplicate detection.

 

[powerpoint]
[PDF/6]
[PDF/1]

PR

IIR Ch. 20
Mercator: A scalable, extensible web crawler (Heydon et al. 1999)
A standard for robot exclusion
The WebGraph framework I: Compression techniques (Boldi et al. 2004)

PE2 due

Tue 24 Nov

Thanksgiving (no class)

 

--

 

 

Thu 26 Nov

Thanksgiving (no class)

 

--

 

 

Tue 1 Dec

Review session for final

 

TA

 

[Practice Final]
[Solutions]

Tue 1 Dec

Web 3: Link analysis

[powerpoint]
[PDF/6]
[PDF/1]

PR

IIR Ch. 21
Ranking the web frontier (Eiron et al. 2004)
The WebGraph framework I: Compression techniques (Boldi et al. 2004)
Extrapolation methods for accelerating PageRank computations (Kamvar et al. 2003)
Searching the workplace web (Fagin et al. 2003)

 

 

Thu 3 Dec

Text understanding and mining: Question Answering

[powerpoint]
[PDF/6]
[PDF/1]

CM

AskMSR: Question answering using the worldwide web (Banko et al.)
Web question answering: Is more always better? (Dumais et al.)
Extracting product features and opinions from reviews (Popescu et al.)
The tradeoffs between open and traditional relation extraction (Banko et al.)
FALCON: Boosting knowledge for answer engines (Harabagiu et a.l)
High performance question/answering (Pasca et al.)

 

 

 

 

 

 

 

7-11 Dec

Final Exam: December 9th, Wed 12:15-3:15pm.

 

 

 

Final
[Solutions]


Back to the CS276 home page