|
CS 276 / LING 286 |
Lecture: 3 units, Tu/Th 4:15-5:30 at NVIDIA Auditorium (available online through SCPD)
Staff e-mail: cs276-spr1112-staff@lists.stanford.edu
Basic and advanced techniques for text-based information systems: efficient text indexing; Boolean and vector space retrieval models; evaluation and interface issues; Web search including crawling, link-based algorithms, and Web metadata; text/Web clustering, classification; text mining.
Policies Information (grading, etc.)
Prerequisites: CS 107, CS 109, CS 161. Ideally the whole CS Major Core.
Coursera: CS 276 will be making use of online videos and quizzes as well as live class lectures, presentations, and labs. Visit Coursera to find video chunks and online exercises. We will be posting the required programming assignments and problem sets through Coursera, so signup soon (with your Stanford ID)!
Piazza: We strongly recommend that you post any questions about assignments, lectures or general course material on Piazza. It facilitates lively discussion among students, and allows others to benefit from the discussion too. Even if you think your question is specific to your implementation, you can use the 'private question' feature to address a question to the staff specifically. We will be using Piazza to post course announcements, so be sure to signup on Piazza soon too. Here is a quick introduction video.
Staff Email: If you really have a question about your situation in specific, that you do not think is not appropriate for the class forum (not even a private question), please email the staff mailing list at cs276-spr1112-staff@lists.stanford.edu.
Announcements: If you are not registered in the course but wish to receive course announcements, you may subscribe to the guest mailing list cs276-spr1112-guests.
Professors:
Chris Manning, Office Hours: Monday 1-2 PM, Gates 158
Pandu Nayak, Office Hours: By appointment after class (Tuesday/Thursday 5:30 PM)
Prabhakar Raghavan
TAs: Mengqiu Wang, Dilli Raj Paudel, Zahan Malkani, Aditya Ramesh, Anshul Mittal
Details of the schedule, videos, slides and reading lists will be updated as the quarter progresses.
| Date | Topics | In Class | Video | Notes | Who | Readings | Assignments |
| Tue 3 Apr | IR 1. Introduction to Information Retrieval. Inverted indices and boolean queries. Query optimization. The nature of unstructured and semi-structured text. Course administrivia. | Introduction to course: Discussion of issues in search | Boolean Retrieval (IIR Ch. 1) |
[powerpoint] [PDF/6] [PDF/1] |
PN |
IIR Ch. 1 MG 3.2 MIR 8.2 Shakespeare plays |
|
| Thu 5 Apr | IR 2. The term vocabulary and postings lists. Text encoding: tokenization, stemming, lemmatization, stop words, phrases. Optimizing indices with skip lists. Proximity and phrase queries. Positional indices. | Class lab: Writing a merge algorithm for proximity queries
using a positional index PrimaryPad Shared gdoc: http://bit.ly/HgCAdP Intersection skeleton Positional skeleton |
Term Vocabulary and Postings Lists (IIR Ch. 2) |
[powerpoint] [PDF/6] [PDF/1] |
CM |
IIR Ch. 2 MG 3.6, 4.3 MIR 7.2 Porter's stemmer (MIR), Porter stemming algorithm (Official) A skip list cookbook (Pugh 1990) Fast phrase querying with combined indexes (Williams, Zobel, Bahle 2004) Efficient phrase querying with an auxiliary index (Bahle, Williams, Zobel 2002) |
PS1 out PA1 out |
| Tue 10 Apr | IR 4. Index construction. Postings size estimation, sort-based indexing, dynamic indexing, positional indexes, n-gram indexes, distributed indexing, real-world issues. | Guest speaker: Jeff Dean on the evolution of Google's search and retrieval systems. | Index Construction (IIR Ch. 4) |
[powerpoint] [PDF/6] [PDF/1] |
PN |
IIR Ch. 4 MG Ch. 5 MapReduce: simplified data processing on large clusters (Dean and Ghemawat 2004) Efficient single-pass index construction for text databases (Heinz and Zobel 2003) |
|
| Thu 12 Apr | IR 5. Index compression: lexicon compression and postings lists compression. Gap encoding, gamma codes, Zipf's Law, variable-byte encoding. Blocking. Extreme compression. | Class lab: Algorithms for postings list compression PrimaryPad compress.py Results gdoc spreadsheet Jeff Dean's compression slides Simple-9 (Anh/Moffat) Wikipedia: Golumb/Rice, Huffman |
Index Compression (IIR Ch. 5) |
[powerpoint] [PDF/6] [PDF/1] |
CM |
IIR Ch. 5 MG 3.3, 3.4 Compression of inverted indexes for fast query evaluation (Scholer et al. 2002) Inverted index compression using word-aligned binary codes (Anh and Moffat 2005) |
PS1 due |
| Tue 17 Apr | IR 3. Dictionaries and tolerant retrieval. Dictionary data structures. Wild-card queries, permuterm indices, n-gram indices. Spelling correction and synonyms: edit distance, soundex, language detection. |
Spelling correction Slides for spelling and bigram language model: [powerpoint], [PDF/6], [PDF/1] |
Dictionaries and Tolerant Retrieval (IIR Ch. 3) |
[powerpoint] [PDF/6] [PDF/1] |
CM |
IIR Ch. 3 MG 4.2 Techniques for automatically correcting words in text (Kukich 1992) Finding approximate matches in large lexicons (Zobel and Dart 1995) Efficient Generation and Ranking of Spelling Error Corrections (Tillenius) How to write a spelling corrector (Peter Norvig) |
|
| Thu 19 Apr | IR 6. Scoring, term weighting, and the vector space model. Parametric or fielded search. Document zones. The vector space retrieval model. tf.idf weighting. The cosine measure. Scoring documents. |
Class lab: Mapreduce with Python PrimaryPad mapreduce.py count_words.py lmtrain.py index.py anchors.py index_with_anchors.py |
Vector Space Model (IIR Ch. 6) |
[powerpoint] [PDF/6] [PDF/1] |
PN | IIR 6.2 - 6.4.3 |
PA1 due PS2 out PA2 out |
| Tue 24 Apr | IR 7. Computing scores in a complete search system: Components of an IR system. Efficient vector space scoring. Nearest neighbor techniques, reduced dimensionality approximations, random projection. |
Lucene Tutorial [powerpoint], [PDF/6], [PDF/1] |
Computing Scores (IIR Ch. 7) |
[powerpoint] [PDF/6] [PDF/1] |
PN | IIR Ch. 7, IIR 6.1 | |
| Thu 26 Apr | IR 8. Results summaries: static and dynamic. Evaluating search engines. User happiness, precision, recall, F-measure. Creating test collections: kappa measure, interjudge agreement. Relevance, approximate vector retrieval. |
Recent evaluation, NDCG, using clickthrough; rate queries & results In-class Slides: [powerpoint], [PDF/6], [PDF/1] |
IR Evaluation (IIR Ch. 8) |
[powerpoint] [PDF/6] [PDF/1] |
CM |
IIR Ch. 8 MG 4.5 MIR Ch. 3 The use of MMR, diversity-based reranking for reordering documents and producing summaries (Carbonell and Goldstein 1998) |
PS2 due |
| Tue 1 May | Probabilistic IR. Binary Independence Model on BM25. |
[powerpoint] [PDF/1] |
CM | IIR Ch. 11 | |||
| Thu 3 May | Web 1: What makes the web different. Web search overview, web structure, the user, paid placement, search engine optimization/spam. Web size measurement. |
[powerpoint] [PDF/6] [PDF/1] |
PN | IIR Ch.19 |
PA2 due PS3 out PA3 out |
||
| Tue 8 May | CLASSIFICATION 1. Introduction to text classification. Naive Bayes models. Spam filtering. Probabilistic IR. |
[powerpoint] [PDF/6] [PDF/1] |
CM |
IIR Ch. 11 IIR Ch. 13 Machine learning in automated text categorization (Sebastiani 2002) A re-examination of text categorization methods (Yang et al. 1999) A Comparison of event models for naive Bayes text classification (McCallum et al. 1998) Tom Mitchell. Machine Learning. McGraw-Hill, 1997. Open Calais Weka Reuters-21578 Tackling the poor assumptions of Naive Bayes classifier (Rennie et al. 2003) |
|||
| Thu 10 May | CLASSIFICATION 2. K Nearest Neighbors, Decision boundaries, Vector space classification using centroids. Comparative results. |
[powerpoint] [PDF/6] [PDF/1] |
PN |
IIR Ch. 14 Machine learning in automated text categorization (Sebastiani 2002) Tom Mitchell. Machine Learning.McGraw-Hill, 1997. A re-examination of text categorization methods (Yang et al. 1999) Evaluating and optimizing autonomous text classification systems (Lewis 1995) Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001. Open Calais Weka |
PS3 due | ||
| Tue 15 May | CLASSIFICATION 3. Support vector machine classifiers. Kernel Function. Evaluation of classification. Micro- and macro-averaging. Learning rankings. |
[powerpoint] [PDF/6] [PDF/1] |
CM |
IIR Ch. 15 A tutorial on support vector machines for pattern recognition (Burges 1998) Using SVMs for text categorization (Dumais1998) Inductive learning algorithms and representations for text categorization (Dumaiset a. 1998) A Re-examination of text categorization methods (Yang et al. 1999) Text categorization based on regularized linear classification methods (Zhang et al. 2001) Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag,New York, 2001. Reuters-21578 Thorsten Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002. A loss function analysis for classification methods in text categorization (Li et al. 2003) |
|||
| Thu 17 May | Invited Talk with Michael Busch from Twitter: Real-time Search with Lucene |
PA3 due PA4 out |
|||||
| Tue 22 May | CLUSTERING 1. Introduction to the problem. Partitioning methods: k-means clustering; Hierarchical clustering. |
[powerpoint] [PDF/6] [PDF/1] |
PN | IIR Ch. 16, IIR 17.1-3 | |||
| Thu 24 May | Web 2: Learning to rank. |
[powerpoint] [PDF/6] [PDF/1] |
CM |
IIR 6.1.2-3, IIR 15.4 Discriminative models for information retrieval (Nallapati 2004) Adapting ranking SVM to document retrieval (Cao et al. 2006) A support vector method for optimizing average precision (Yue et al. 2007) LETOR benchmark datasets |
PS4 out | ||
| Tue 29 May | CLUSTERING 2. Latent semantic indexing (LSI). Applications to clustering and to information retrieval. |
[powerpoint] [PDF/6] [PDF/1] |
CM | IIR Ch. 18 | |||
| Thu 31 May | Web 3: Link analysis. |
[powerpoint] [PDF/6] [PDF/1] |
PN |
IIR Ch. 21 Ranking the web frontier (Eiron et al. 2004) The WebGraph framework I: Compression techniques (Boldi et al. 2004) Extrapolation methods for accelerating PageRank computations (Kamvar et al. 2003) Searching the workplace web (Fagin et al. 2003 |
PA4 due | ||
| Tue 5 Jun | Web 4: Crawling and web indexes. Near-duplicate detection. |
[powerpoint] [PDF/6] [PDF/1] |
PN |
IIR Ch. 20 Mercator: A scalable, extensible web crawler (Heydon et al. 1999) A standard for robot exclusion The WebGraph framework I: Compression techniques (Boldi et al. 2004) |
PS4 due | ||
| Thu 7 Jun | No classes. | ||||||
| Mon 11 Jun | Final Exam (12:15PM - 3:15PM) |
IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008.
This book is available from the Stanford bookstore (or your favorite book purveyor). You can also download and print chapters at the book website. (We’d appreciate any reports of typos or of higher-level problems for the third printing. Thanks.)
MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
IRAH = Information Retrieval: Algorithms and Heuristics by D. Grossman and O. Frieder.
IR = Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
FOA = Finding Out About, by R. Belew.
MTW = Mining the Web, by S. Chakrabarti.
FSNLP = Foundations of Statistical Natural Language Processing, by C. Manning and H. Schütze.
These books all have useful information on topics that we cover and are recommended as references. MG is particularly good as a
detailed reference for technical IR in the first half of the course. MTW covers many of the topics from the latter part of the
course.
More detailed resources can be found here.
CM = Chris Manning
PN = Pandu Nayak
Related Courses:
Introduction to Computational Advertising (http://www.stanford.edu/class/msande239/)