CS 276 / LING 286: Information Retrieval and Web Search

Lecture: 3 units, Tu/Th 4:15-5:30 at NVIDIA Auditorium (available online through SCPD)
Staff e-mail: cs276-spr1213-staff@lists.stanford.edu

Course Description

Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean and vector space retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification; text mining.

Policies Information (grading, etc.)

Prerequisites: CS 107, CS 109, CS 161. Ideally the whole CS Major Core.

Online Resources

Coursera: CS 276 will be making use of online videos and quizzes as well as live class lectures, presentations, and labs. Visit Coursera to find video chunks and online exercises. We will be posting the required programming assignments and problem sets through Coursera, so signup soon (with your Stanford ID)!

Piazza: We strongly recommend that you post any questions about assignments, lectures or general course material on Piazza. It facilitates lively discussion among students, and allows others to benefit from the discussion too. Even if you think your question is specific to your implementation, you can use the 'private question' feature to address a question to the staff specifically. We will be using Piazza to post course announcements, so be sure to signup on Piazza soon too. Here is a quick introduction video.

Staff Email: If you really have a question about your situation in specific, that you do not think is not appropriate for the class forum (not even a private question), please email the staff mailing list at cs276-spr1213-staff@lists.stanford.edu.

Announcements:
If you are not registered in the course but wish to receive course announcements, you may subscribe to the guest mailing list cs276-spr1213-guests.

CS 276 Staff Information

Professors:
Chris Manning, Office Hours: Wednesday 3-4 PM, Gates 158
Pandu Nayak, Office Hours: after class with prior arrangement
Prabhakar Raghavan, Office Hours: after class with prior arrangement

TAs: Sonal Gupta (Head TA), Shui Hu, Anshul Mittal, Rukmani RaviSundaram, Thang Luong

TA Office Hours

For a detailed schedule, check out the Google calendar using the ID: f0qejqeg52qo292jk3i44lqd04@group.calendar.google.com

Daytime Office Hours

Coding Sessions

For SCPD students, we will be available via Google hangout, links of which will be posted on Piazza.

Syllabus

Details of the schedule, videos, slides and reading lists will be updated as the quarter progresses.

Date In Class Assignments Who Materials
Tue 2 Apr Introduction to course: Discussion of issues in search plus Web 1: Web search, ads, SEO PN Coursera:
Boolean Retrieval (IIR Ch. 1)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 1
MG 3.2
MIR 8.2
Shakespeare plays
Thu 4 Apr Guest speaker: Jeff Dean on the evolution of Google's search and retrieval system PA1 out Jeff Coursera:
Term Vocabulary and Postings Lists (IIR Ch. 2)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 2
MG 3.6, 4.3
MIR 7.2
Porter's stemmer (MIR), Porter stemming algorithm (Official)
A skip list cookbook (Pugh 1990)
Fast phrase querying with combined indexes (Williams, Zobel, Bahle 2004)
Efficient phrase querying with an auxiliary index (Bahle, Williams, Zobel 2002) MapReduce: simplified data processing on large clusters (Dean and Ghemawat 2004)
Tue 9 Apr Class lab: Writing a merge algorithm for proximity queries using a positional index
PrimaryPad
Shared gdoc: http://bit.ly/HgCAdP
Intersection skeleton
Positional skeleton
Positional answer
[requires SUNet ID]
PS1 Out CM Coursera:
Index Construction (IIR Ch. 4)
Notes
[powerpoint]
[PDF/6]
[PDF/1]
Readings
IIR Ch. 4
MG Ch. 5
Efficient single-pass index construction for text databases (Heinz and Zobel 2003)
Thu 11 Apr Class lab: Algorithms for postings list compression
PrimaryPad
compress.py
Results gdoc spreadsheet
Jeff Dean's compression slides
Simple-9 (Anh/Moffat)
Wikipedia: Golumb/Rice, Huffman, Elias gamma code, unary code
CM Coursera:
Index Compression (IIR Ch. 5)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 5
MG 3.3, 3.4
Compression of inverted indexes for fast query evaluation (Scholer et al. 2002)
Inverted index compression using word-aligned binary codes (Anh and Moffat 2005)
Tue 16 Apr Spelling correction
[powerpoint]
[PDF/6]
[PDF/1]
PS1 Due PR Coursera:
Dictionaries and Tolerant Retrieval (IIR Ch. 3)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
How to write a spelling corrector (Peter Norvig)
IIR Ch. 3
MG 4.2
Techniques for automatically correcting words in text (Kukich 1992)
Finding approximate matches in large lexicons (Zobel and Dart 1995)
Efficient Generation and Ranking of Spelling Error Corrections (Tillenius)
Thu 18 Apr Class lab: Mapreduce with Python
PrimaryPad
mapreduce.py
count_words.py
lmtrain.py
idf.py
index.py
anchors.py
index_with_anchors.py
PA1 due
PA2 out
PN Coursera:
Vector Space Model (IIR Ch. 6)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
[Jeff Dean's Slides]
Readings:
IIR 6.2 - 6.4.3
Tue 23 Apr Probabilistic IR: Binary Independence Model
[powerpoint]
[PDF/6]
[PDF/1]
PS2 Out PN Readings:
IIR 11
Thu 25 Apr BM25, BM25F, and ranking signals [powerpoint]
[PDF/6]
[PDF/1]
PN Readings:
IIR 11
Tue 30 Apr Evaluation. Precision/recall, NDCG, using clickthrough rates
[powerpoint]
[PDF/6]
[PDF/1]
PS2 Due PR Readings:
IIR Ch. 8
MG 4.5
MIR Ch. 3
Thu 2 May Systems issues in efficient retrieval and scoring
[powerpoint]
[PDF/6]
[PDF/1]
PA2 due
PA3 out
PR Coursera:
Computing Scores (IIR Ch. 7)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 7, IIR 6.1
Tue 7 May Lucene Tutorial
[powerpoint]
[PDF/6]
[PDF/1]
PS3 Out PN
Thu 9 May CLASSIFICATION 1 + 2: Naive Byes, kNN, decision boundaries
[powerpoint]
[PDF/6]
[PDF/1]
PR Readings:
IIR Ch. 11
IIR Ch. 13
Machine learning in automated text categorization (Sebastiani 2002)
A re-examination of text categorization methods (Yang et al. 1999)
A Comparison of event models for naive Bayes text classification (McCallum et al. 1998)
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
Open Calais
Weka
Reuters-21578
Tackling the poor assumptions of Naive Bayes classifier (Rennie et al. 2003)
IIR Ch. 14
Machine learning in automated text categorization (Sebastiani 2002)
Tom Mitchell. Machine Learning.McGraw-Hill, 1997.
A re-examination of text categorization methods (Yang et al. 1999)

Evaluating and optimizing autonomous text classification systems (Lewis 1995)
Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
Tue 14 May CLASSIFICATION 3. Support vector machines
[powerpoint]
[PDF/6]
[PDF/1]
PS3 Due CM Readings:
IIR Ch. 15
A tutorial on support vector machines for pattern recognition (Burges 1998)
Using SVMs for text categorization (Dumais1998)
Inductive learning algorithms and representations for text categorization (Dumaiset a. 1998)
A Re-examination of text categorization methods (Yang et al. 1999)
Text categorization based on regularized linear classification methods (Zhang et al. 2001)

Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag,New York, 2001.
Reuters-21578
Thorsten Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
A loss function analysis for classification methods in text categorization (Li et al. 2003)
Thu 16 May Web 2: Learning to rank.
[powerpoint]
[PDF/6]
[PDF/1]
PA3 due
PA4 out
CM Readings:
IIR 6.1.2-3, IIR 15.4
Discriminative models for information retrieval (Nallapati 2004)
Adapting ranking SVM to document retrieval (Cao et al. 2006)
A support vector method for optimizing average precision (Yue et al. 2007)
LETOR benchmark datasets
Tue 21 May Invited talk: Sriram Sankar from Facebook on Graph Search Sriram Sankar
Thu 23 May CLUSTERING 1: k-means, HAC
[powerpoint]
[PDF/6]
[PDF/1]
PN Readings:
IIR Ch. 16, IIR 17.1-3
Tue 28 May CLUSTERING 2. LSI
[powerpoint]
[PDF/6]
[PDF/1]
PS4 Out CM Readings:
IIR Ch. 18
Thu 30 May Web 3: Link analysis.
[powerpoint]
[PDF/6]
[PDF/1]
PA4 due PR Readings:
IIR Ch. 21
Ranking the web frontier (Eiron et al. 2004)
The WebGraph framework I: Compression techniques (Boldi et al. 2004)
Extrapolation methods for accelerating PageRank computations (Kamvar et al. 2003)
Searching the workplace web (Fagin et al. 2003
Tue 4 Jun Web 4: Crawling, near-dups PS4 due PR Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 20
Mercator: A scalable, extensible web crawler (Heydon et al. 1999)
A standard for robot exclusion
Thu 6 Jun No classes.
Fri 7 Jun Final Exam (12:15PM - 3:15PM)
Practice Final
Practice Final Solutions

Required Textbook:

IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008.

This book is available from the Stanford bookstore (or your favorite book purveyor). You can also download and print chapters at the book website. (We’d appreciate any reports of typos or of higher-level problems for the third printing. Thanks.)

Other Good IR Books:

MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
IRAH = Information Retrieval: Algorithms and Heuristics by D. Grossman and O. Frieder.
MIR = Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
FOA = Finding Out About, by R. Belew.
MTW = Mining the Web, by S. Chakrabarti.
FSNLP = Foundations of Statistical Natural Language Processing, by C. Manning and H. Schütze.

These books all have useful information on topics that we cover and are recommended as references. MG is particularly good as a detailed reference for technical IR in the first half of the course. MTW covers many of the topics from the latter part of the course.

More detailed resources can be found here.

Lecturers:

CM = Chris Manning
PN = Pandu Nayak
PR = Prabhakar Raghavan

Related Courses:

Introduction to Computational Advertising (http://www.stanford.edu/class/msande239/)