Stanford
	    University

CS 276 / LING 286
Information Retrieval and Web Mining
Autumn 2006


Course Syllabus

Required Textbook:

IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze.

This book is available in draft form, as a reader, from the Stanford bookstore. You can also download and print (perhaps updated) chapters at the book website or from the course syllabus.  We’re trying to improve and finish this book, so any comments from typos through higher-level content and organization suggestions would be greatly appreciated! (It’d be useful if you could mention the date of the version you were reading.)

Other Good IR etc. Books:

MG = Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
MIR = Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
FOA = Finding Out About, by R. Belew.
MTW = Mining the Web, by S. Chakrabarti.
FSNLP = Foundations of Statistical Natural Language Processing, by C. Manning and H. Schütze.
IRAH = Information Retrieval: Algorithms and Heuristics by D. Grossman and O. Frieder.

These books all have useful information on topics that we cover and are recommended as references. MG is particularly good as a detailed reference for technical IR in the first half of the course. MTW covers many of the topics from the latter part of the course.

Lecturers:

CM = Christopher Manning
PR = Prabhakar Raghavan
AN = Ani Nenkova
TA = TA

Locations and times:

All lectures will be held at Gates B03 and review sessions will be held at Skilling 193. Final will be at Skilling Aud.
Lectures are Tuesdays and Thursdays from 4:15 to 5:30.
Review sessions, when scheduled, are Friday from 3:15 to 4:05.

Schedule:

Details of the schedule, slides and reading lists will be updated as the quarter progresses.
We have scheduled review sessions for assignments and exams on Friday afternoons.

Date

Topics

Notes

Who

Readings

Assignments

Tue 26 Sep

IR 1. Introduction to Information Retrieval. Inverted indices and boolean queries. Query optimization. The nature of unstructured and semi-structured text. Course administrivia.


[powerpoint]
[PDF]

CM

IIR 1: pdf
MG Sec. 3.2; MIR Sec. 8.2
Shakespeare plays

 

Thu 28 Sep

IR 2. Text encoding: tokenization, stemming, lemmatization, stop words, phrases. Further optimizing indices for query processing. Proximity and phrase queries. Positional indices.

[powerpoint]
[PDF]

CM

IIR 2: pdf
MG Ch. 3.6, 4.3; MIR Ch. 7.2
Porter's stemmer
Porter from the author
Skip Lists theory: Pugh (1990) Multilevel skip lists give same O(log n) efficiency as trees
D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002
Fast Phrase Querying with Combined Indexes

 

Tue 3 Oct

IR 3. Tolerant retrieval: spelling correction and synonyms. Wild-card queries, permuterm indices, n-gram indices. Edit distance, soundex, language detection.

[powerpoint]
[PDF]

PR

IIR 3 pdf
MG Ch. 4.2;
J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software - practice and experience 25(3), March 1995.

Mikael Tillenius: Efficient Generation and Ranking of Spelling Error Corrections. Master's thesis at Sweden's Royal Institute of Technology.

Efficient spell retrieval:n K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys 24(4), Dec 1992.

PS1 out

Thu 5 Oct

IR 4. Index construction. Postings size estimation, merge sort, dynamic indexing, positional indexes, n-gram indexes, real-world issues.

[powerpoint]
[PDF]

PR

IIR 4 pdf
MG Ch. 5

 

 

Fri 6 Oct

Review session for PS1

 

TA

 

 

Tue 10 Oct

IR 5. Index compression: lexicon compression and postings lists compression. Gap encoding, gamma codes, Zipf's Law. Blocking. Extreme compression.

[powerpoint]
[PDF

AN

IIR 5 pdf
MG 3.3, 3.4
Compression of inverted indexes for fast query evaluation
V. N. Anh and A. Moffat. 2005. Inverted Index Compression Using Word-Aligned Binary Codes. Information Retrieval 8: 151-166.

PE1 out

Thu 12 Oct

IR 6. Parametric or fielded search. Document zones. The vector space retrieval model. tf.idf weighting. Scoring documents.

[powerpoint]
[PDF]

AN

IIR 6 pdf

Exploring the Similarity Space - Zobel and Moffat (1998).

PS1 due

Fri 13 Oct

Review session for PE1

 

TA

 

 

Tue 17 Oct

IR 7. Vector space scoring. The cosine measure. Efficiency considerations. Nearest neighbor techniques, reduced dimensionality approximations, random projection.

[powerpoint]
[PDF]   

CM

IIR 7. pdf
MG Ch. 4.4-4.6; MIR 2.5, 2.7.2; FSNLP 15.4

Anh, V.N., de Krester, O, and A. Moffat. Vector-Space Ranking with Effective Early Termination, SIGIR 2001.
Anh, V.N. and A. Moffat. Pruned query evaluation using pre-computed impacts. SIGIR 2006.

Random projection theorem- Dasgupta and Gupta. An elementary proof of the Johnson-Lindenstrauss Lemma (1999).

Faster random projection - A.M. Frieze, R. Kannan, S. Vempala. Fast Monte-Carlo Algorithms for finding low-rank approximations. IEEE Symposium on Foundations of Computer Science, 1998.

 

Thu 19 Oct

IR 8. Results summaries: static and dynamic. Evaluating search engines. User happiness, precision, recall, F-measure. Creating test collections: kappa measure, interjudge agreement. Relevance, approximate vector retrieval.

[powerpoint]
[PDF]

CM

IIR 8 pdf

MG 4.5 MIR Chapter 3
Carbonell and Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR 1998

PE1 due

Fri 21 Oct

Review session for midterm

 

TA

 

 

Tue 24 Oct

IR 9. Relevance feedback. Pseudo relevance feedback. Query expansion. Automatic thesaurus generation. Sense-based retrieval. Experimental results of performance effectiveness.

[powerpoint]
[PDF]

CM

IIR 9. pdf
MG Ch 4.7; MIR Ch 5.2-5.4
Learning Routing Queries in a Query Zone
Concept Based Query Expansion
New Retrieval Approaches Using SMART: TREC 4
Gerard Salton and Chris Buckley. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.
Jinxi Xu and W. Bruce Croft: Query Expansion Using Local and Global Document Analysis. SIGIR 1996: 4-11
Schuetze: Automatic Word Sense Discrimination, Computational Linguistics, 1998.
Harman: Relevance feedback revisited. SIGIR 1992.
Chris Buckley, Gerard Salton, and James Allan. The effect of adding relevance information in a relevance feedback environment. SIGIR 1994.
Use of query reformulation and relevance feedback by Excite users

 

Thu 26 Oct

Midterm to be held in-class

 

 

 

 

Tue 31 Oct

CLUSTERING 1. Introduction to the problem. Partitioning methods: k-means clustering; Hierarchical clustering.

[powerpoint]
[PDF]

PR

IIR 16 and 17 (Except 16.5, 17.3)

 

Thu 2 Nov

CLUSTERING 2. Latent semantic indexing (LSI). Applications to clustering and to information retrieval.

[powerpoint]
[PDF]

PR

IIR 18.

 

Tue 7 Nov

CLASSIFICATION 1. Introduction to text classification. Naive Bayes models. Spam filtering.

[powerpoint]
[PDF]

CM

IIR 13.
Machine Learning in Automated Text Categorization
A Comparison of Event Models for Naive Bayes Text Classification
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
A Re-examination of Text Categorization Methods

PS2 out

Thu 9 Nov

CLASSIFICATION 2. K Nearest Neighbors, Decision boundaries, Vector space classification using centroids, Decision Trees

[powerpoint]
[PDF]

CM

IIR 14.
Machine Learning in Automated Text Categorization
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
A Re-examination of Text Categorization Methods
David Lewis: Evaluating and Optimizing Autonomous Text Classification Systems. SIGIR 1995.
Manning and Schutze: Foundations of Statistical Natural Language Processing. Chapter 16.
Trevor Hastie, Robert Tibshirani, Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.

 

Fri 10 Nov

Review session for PS2

 

TA

 

 

Tue 14 Nov

CLASSIFICATION 3. Support vector machine classifiers. Kernel Function. Evaluation of classification. Micro- and macro-averaging. Comparative results.

[powerpoint]
[PDF]

CM

IIR 15.
A Tutorial on Support Vector Machines for Pattern Recognition
Dumais: Using SVMs for text categorization, IEEE Intelligent Systems, 13(4), 1998.
Dumais, Platt, Heckerman and Sahami: Inductive learning algorithms and representations for text categorization. CIKM 1998.
A Re-examination of Text Categorization Methods
Text Categorization Based on Regularized Linear Classification Methods
Hastie, Tibshirani and Friedman: Elements of Statistical Learning: Data Mining, Inference and Prediction.
'Classic' Reuters data set
T. Joachims: Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
Li, Yang: A Loss Function Analysis for Classification Methods in Text Categorization. ICML 2003.
Tackling the Poor Assumptions of Naive Bayes Classifier (for PE2)

PE2 out

Thu 16 Nov

Web 1: What makes the web different. Web search overview, web structure, the user, paid placement, search engine optimization/spam.

 

[powerpoint]
[PDF

 

PR

  IIR 19.

PS2 due

Fri 17 Nov

Review session for PE2

 

TA

 

 

Tue 21 Nov

Thanksgiving (no class)

 

 

 

 

Thu 23 Nov

Thanksgiving (no class)

 

 

 

 

Tue 28 Nov

Web 2: Web characteristics II: Web size measurement, Near-duplicate detection

[powerpoint]
[PDF]   

PR

IIR 19.
Phelps & Wilensky. Robust Hyperlinks & Locations, 2002
Ziv Bar-Yossef and Maxim Gurevich. Random Sampling from a Search Engine's Index, WWW 2006.
Broder et al. Estimating corpus size via queries. CIKM 2006.

 

Thu 30 Nov

Web 3: Crawling and web indexes

[powerpoint]
[PDF]  

PR

IIR 20
Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web Crawler
A Standard for Robot Exclusion
Paolo Boldi and Sebastiano Vigna. The WebGraph Framework I: Compression Techniques

PE2 due

Tue 5 Dec

Web 4: Link analysis

[powerpoint]
[PDF]   

PR

IIR 21.

Ranking the Web Frontier
The WebGraph Framework I: Compression Techniques
Extrapolation Methods for Accelerating PageRank Computations
Searching the Workplace Web

 

Thu 7 Dec

Text understanding and mining: Question Answering

[powerpoint]
[PDF]   

CM

 

 

Fri 8 Dec

Review session for final

 

TA

 

 

Mon 11 Dec

Final Exam 12:15-3:15, Skilling Aud

 

 

 

 


Back to the CS276 home page