CS 276 / LING 286: Information Retrieval and Web Search
Lecture: 3 units, Tu/Th 4:15-5:30 at NVIDIA Auditorium (available online through SCPD)
Staff e-mail: cs276-spr1213-staff@lists.stanford.edu
Course Description
Basic
and advanced techniques for text-based information systems: efficient text indexing;
Boolean and vector space retrieval models; evaluation and interface issues;
Web search including crawling, link-based algorithms, and Web metadata;
text/Web clustering, classification; text mining.
Policies
Information (grading, etc.)
Prerequisites: CS 107, CS 109, CS 161. Ideally the whole CS Major Core.
Online Resources
Coursera:
CS 276 will be making use of online videos and quizzes as well as live class lectures, presentations, and
labs. Visit Coursera to find video chunks
and online exercises. We will be posting the required programming assignments and problem sets through
Coursera, so signup soon (with your Stanford ID)!
Piazza:
We strongly recommend that you post any questions about assignments, lectures or general course material
on Piazza. It facilitates lively discussion among
students, and allows others to benefit from the discussion too.
Even if you think your question is specific to your implementation, you can use the 'private question'
feature to address a question to the staff specifically.
We will be using Piazza to post course announcements, so be sure to signup on Piazza soon too.
Here is a quick
introduction video.
Staff Email:
If you really have a question about your situation in specific, that you do not think is
not appropriate for the class forum (not even a private question), please email the staff mailing list at
cs276-spr1213-staff@lists.stanford.edu.
Announcements:
If you are not registered in the course but wish to receive course announcements,
you may subscribe to the guest mailing list
cs276-spr1213-guests.
CS 276 Staff Information
Professors:
Chris Manning, Office Hours: Wednesday 3-4 PM, Gates 158
Pandu Nayak, Office Hours:
after class with prior arrangement
Prabhakar Raghavan,
Office Hours: after class with prior arrangement
TAs: Sonal Gupta (Head TA),
Shui Hu, Anshul Mittal, Rukmani RaviSundaram, Thang Luong
TA Office Hours
For a detailed schedule, check out the Google calendar using the ID: f0qejqeg52qo292jk3i44lqd04@group.calendar.google.com
Daytime Office Hours
- Monday: Gates B24A 10-11am
- Wednesday: Gates B26B 11am-12pm
- Friday: Gates B26B 10-11am
Coding Sessions
- Tuesday: Bldg 160 Room 332 6-9pm
- Wednesday: Bldg 160 Room 332 6-9pm
For SCPD students, we will be available via Google hangout, links of which will be posted on Piazza.
Syllabus
Details of the schedule, videos, slides and reading lists will be updated as the quarter progresses.
| Date |
In Class |
Assignments |
Who |
Materials |
| Tue 2 Apr |
Introduction to course: Discussion of issues in search plus Web 1: Web search, ads, SEO
|
|
PN |
Coursera:
Boolean Retrieval (IIR Ch. 1)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 1
MG 3.2
MIR 8.2
Shakespeare plays
|
| Thu 4 Apr |
Guest speaker: Jeff Dean on the evolution of Google's search and retrieval system
|
PA1 out
|
Jeff |
Coursera:
Term Vocabulary and Postings Lists (IIR Ch. 2)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 2
MG 3.6, 4.3
MIR 7.2
Porter's stemmer (MIR),
Porter stemming algorithm (Official)
A skip list cookbook (Pugh 1990)
Fast phrase querying with combined indexes (Williams, Zobel, Bahle 2004)
Efficient phrase querying with an auxiliary index (Bahle, Williams, Zobel 2002)
MapReduce: simplified data processing on large clusters (Dean and Ghemawat 2004)
|
| Tue 9 Apr |
Class lab: Writing a merge algorithm for
proximity queries using a positional index
PrimaryPad
Shared gdoc: http://bit.ly/HgCAdP
Intersection skeleton
Positional skeleton
Positional answer
[requires SUNet ID]
|
PS1 Out |
CM |
Coursera:
Index Construction (IIR Ch. 4)
Notes
[powerpoint]
[PDF/6]
[PDF/1]
Readings
IIR Ch. 4
MG Ch. 5
Efficient single-pass index construction for text databases (Heinz and Zobel 2003)
|
| Thu 11 Apr |
Class lab: Algorithms for postings list compression
PrimaryPad
compress.py
Results
gdoc spreadsheet
Jeff Dean's
compression slides
Simple-9 (Anh/Moffat)
Wikipedia: Golumb/Rice,
Huffman,
Elias gamma code,
unary code
|
|
CM |
Coursera:
Index Compression (IIR Ch. 5)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 5
MG 3.3, 3.4
Compression of inverted indexes for fast query evaluation (Scholer et al. 2002)
Inverted index compression using word-aligned binary codes (Anh and Moffat 2005)
|
| Tue 16 Apr |
Spelling correction
[powerpoint]
[PDF/6]
[PDF/1]
|
PS1 Due |
PR |
Coursera:
Dictionaries and Tolerant Retrieval (IIR Ch. 3)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
How to write a
spelling corrector (Peter Norvig)
IIR Ch. 3
MG 4.2
Techniques for automatically correcting words in text (Kukich 1992)
Finding approximate matches in large lexicons (Zobel and Dart 1995)
Efficient Generation and Ranking of Spelling Error Corrections (Tillenius)
|
| Thu 18 Apr |
Class lab: Mapreduce with Python
PrimaryPad
mapreduce.py
count_words.py
lmtrain.py
idf.py
index.py
anchors.py
index_with_anchors.py
|
PA1 due
PA2 out
|
PN |
Coursera:
Vector Space Model (IIR Ch. 6)
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
[Jeff Dean's Slides]
Readings:
IIR 6.2 - 6.4.3
|
| Tue 23 Apr |
Probabilistic IR: Binary Independence Model
[powerpoint]
[PDF/6]
[PDF/1]
|
PS2 Out |
PN |
Readings:
IIR 11
|
| Thu 25 Apr |
BM25, BM25F, and ranking signals
[powerpoint]
[PDF/6]
[PDF/1]
|
|
PN |
Readings:
IIR 11
|
| Tue 30 Apr |
Evaluation. Precision/recall, NDCG, using clickthrough rates
[powerpoint]
[PDF/6]
[PDF/1]
|
PS2 Due |
PR |
Readings:
IIR Ch. 8
MG 4.5
MIR Ch. 3
|
| Thu 2 May |
Systems issues in efficient retrieval and scoring
[powerpoint]
[PDF/6]
[PDF/1]
|
PA2 due
PA3 out
|
PR |
Coursera:
Computing Scores (IIR Ch. 7)
Readings:
IIR Ch. 7,
IIR 6.1
|
| Tue 7 May |
Lucene Tutorial
[powerpoint]
[PDF/6]
[PDF/1]
|
PS3 Out |
PN |
|
| Thu 9 May |
CLASSIFICATION 1 + 2: Naive Byes, kNN, decision boundaries
[powerpoint]
[PDF/6]
[PDF/1]
|
|
PR |
Readings:
IIR Ch. 11
IIR Ch. 13
Machine learning in automated text categorization (Sebastiani 2002)
A re-examination of text categorization methods (Yang et al. 1999)
A Comparison of event models for naive Bayes text classification (McCallum et al. 1998)
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
Open Calais
Weka
Reuters-21578
Tackling the poor assumptions of Naive Bayes classifier (Rennie et al. 2003)
IIR Ch. 14
Machine learning in automated text categorization (Sebastiani 2002)
Tom Mitchell. Machine Learning.McGraw-Hill, 1997.
A re-examination of text categorization methods (Yang et al. 1999)
Evaluating and optimizing autonomous text classification systems (Lewis 1995)
Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
|
| Tue 14 May |
CLASSIFICATION 3. Support vector machines
[powerpoint]
[PDF/6]
[PDF/1]
|
PS3 Due |
CM |
Readings:
IIR Ch. 15
A tutorial on support vector machines for pattern recognition (Burges 1998)
Using SVMs for text categorization (Dumais1998)
Inductive learning algorithms and representations for text categorization (Dumaiset a. 1998)
A Re-examination of text categorization methods (Yang et al. 1999)
Text categorization based on regularized linear classification methods (Zhang et al. 2001)
Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag,New York, 2001.
Reuters-21578
Thorsten Joachims. Learning to Classify Text using Support Vector Machines.
Kluwer, 2002.
A loss function analysis for classification methods in text categorization (Li et al. 2003)
|
| Thu 16 May |
Web 2: Learning to rank.
[powerpoint]
[PDF/6]
[PDF/1]
|
PA3 due
PA4 out
|
CM |
Readings:
IIR 6.1.2-3, IIR 15.4
Discriminative models for information retrieval (Nallapati 2004)
Adapting ranking SVM to document retrieval (Cao et al. 2006)
A support vector method for optimizing average precision (Yue et al. 2007)
LETOR benchmark datasets
|
| Tue 21 May |
Invited talk: Sriram Sankar from Facebook on Graph Search
|
|
Sriram Sankar |
|
| Thu 23 May |
CLUSTERING 1: k-means, HAC
[powerpoint]
[PDF/6]
[PDF/1]
|
|
PN |
Readings:
IIR Ch. 16,
IIR 17.1-3
|
| Tue 28 May |
CLUSTERING 2. LSI
|
PS4 Out |
CM |
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 18
|
| Thu 30 May |
Web 3: Link analysis.
|
PA4 due
|
PR |
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 21
Ranking the web frontier (Eiron et al. 2004)
The WebGraph framework I: Compression techniques (Boldi et al. 2004)
Extrapolation methods for accelerating PageRank computations (Kamvar et al. 2003)
Searching the workplace web (Fagin et al. 2003
|
| Tue 4 Jun |
Web 4: Crawling, near-dups
|
PS4 due |
PR |
Notes:
[powerpoint]
[PDF/6]
[PDF/1]
Readings:
IIR Ch. 20
Mercator: A scalable, extensible web crawler (Heydon et al. 1999)
A standard for robot exclusion
|
| Thu 6 Jun |
No classes. |
|
|
|
| Fri 7 Jun |
Final Exam (12:15PM - 3:15PM) |
|
|
|
Required Textbook:
IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan,
and H. Schütze. Cambridge University Press, 2008.
This book is available from the Stanford bookstore (or your favorite book purveyor). You can also download and print chapters at
the book website.
(We’d appreciate any reports of typos or of higher-level problems for the third printing. Thanks.)
Other Good IR Books:
MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
IRAH = Information Retrieval: Algorithms and Heuristics by D. Grossman and O. Frieder.
MIR = Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
FOA = Finding Out About, by R. Belew.
MTW = Mining the Web, by S. Chakrabarti.
FSNLP = Foundations of Statistical Natural Language Processing, by C. Manning and H. Schütze.
These books all have useful information on topics that we cover and are recommended as references. MG is particularly good as a
detailed reference for technical IR in the first half of the course. MTW covers many of the topics from the latter part of the
course.
More detailed resources can be found here.
Lecturers:
CM = Chris Manning
PN = Pandu Nayak
PR = Prabhakar Raghavan
Related Courses:
Introduction to Computational Advertising (
http://www.stanford.edu/class/msande239/)