Algorithms for Modern Data Models
MS&E 317/CS 263
Autumn 2011-12, Stanford University
MW 2:15-3:30, 380-380W
Instructor: Ashish Goel
Office hours: Mon 4-5, Huang
359.
Description: We traditionally
think of algorithms as running on data available in a
single location, typically main memory. In many modern applications
including web analytics, search and data mining, computational biology,
finance, and scientific computing, the data is often too large to
reside in a single location, is arriving incrementally over time, is
noisy/uncertain, or all of the above. Paradigms such as map-reduce,
streaming, sketching, Distributed Hash Tables, Bulk Synchronous
Processing, and random walks have proved useful for these applications.
This course will provide an introduction to the design and analysis of
algorithms for these modern data models. The class will be
largely theoretical, but also involve at least two
hands-on programming exercises.
We will post scribed-class notes, and update the class plan and reading
list as the class progresses. Guests, please sign-up for
msande317-aut1112-guests at http://mailman.stanford.edu
.
Pre-requisites: Algorithms at
the level of CS 261. Being able to competently program in any
main-stream high level language (C, C++, Ruby, Java, Scala, Python,
Perl). All code we hand out is likely to be in Ruby, since it is a
concise language for expressing algorithms.
Class work: There will be 4 HWs
of weight 15% each, everyone will be required to produce high-quality
scribed notes for at-least one class (15%), and a fifth HW (25%) which
will also double as a take-home exam. You can write a survey of any of
the topics we cover if we run out of scribing slots.
Schedule: Please keep track of
these deliverables proactively. We will only send one reminder.
09/30/2011: Send the instructor an email with three class dates, in
order of preference, that you can scribe for.
10/05/2011: Send the instructor an email with your group for
programming problems (2-3 people)
10/19/2011: HW 1 due (handed out 10/12)
11/02/2011: HW 2 due (handed out 10/26)
11/14/2011: HW 3 due (handed out 11/07)
12/5/2011: HW 4 due (handed out 11/28) (CANCELLED)
12/12/2011: Take-home exam due (handed out 12/5/2011)
All due times are 2:15pm. Collaboration policy: You can discuss general
strategies, but must write your own final answer. The programming parts
of each HW can be done in groups of upto 3. No collaboration is allowed
on the final HW/exam.
Handouts
Class Plan and Reading List
Introduction (2
lectures): Motivation, data models (eg. Streaming,
Semi-streaming, Map-Reduce, Continuous Map-Reduce, Pregel, Active DHTs,
uncertainty resolution) with examples, probabilistic techniques
(Linearity of Expectations, Concentration and tail inequalities).
Reading list:
Muthukrishnan, "Data Streams: Algorithms and Applications". Chapter 1.
A copy of the book is available from his webpage where there is
also a link to purchase it.
Motwani and Raghavan, Randomized Algorithms. Appendix C, Section 3.2,
and 4.1. Available
on-line for Stanford students.
Dean and Ghemawat, Map-Reduce:
Simplified Data Processing on Large Clusters.
Streaming (3-4 lectures): Sampling,
counting distinct elements, moment estimation, clustering, sketching,
key-word similarity.
Reading list
Muthukrishnan, "Data Streams: Algorithms and Applications". Chapters 3,
4, 5. A copy of the book is available from his webpage where there is
also a link to purchase it.
Cormode, Muthukrishnan An Improved Data Stream Summary:
The Count-Min Sketch and its Applications .
Meyerson,
Online Facility Location .
Datar, Gionis, Indyk, Motwani, Maintaining Stream Statistics over Sliding
Windows.
Additional Reading:
Babcock, Babu, Datar, Motwani, Widom, Models and
Issues in Data Stream Systems .
Alon, Matias, Szegedy, The space
complexity of approximating the frequency moments.
Guha, Koudas,
Shim, Approximation
and Streaming Algorithms for Histogram Construction Problems. If you want
to dig deeper, also read the papers on wavelet approximations mentioned in
this paper.
Semi-streaming (1
lecture): Union-Find, with applications.
Some
Algorithms for Large Scale Data (2 lectures): Introduction
to Locality Sensitive Hashing; Dimensionality Reduction; Min-Hash.
Reading list (will be updated):
Andoni, Indyk, Near-Optimal
Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions
(Survey, along with some code)
Broder, On the resemblance and containment of
documents. Describes the original ideas behind min-hash.
Charikar, Similarity Estimation Techniques from Rounding Algorithms.
Map-reduce algorithms (3 lectures): Simple
examples; shortest-paths and the problems with using Map-Reduce for
graph algorithms; Simple Locality Sensitive Hashing (LSH) and
Similarity Search using Map-Reduce; Co-occurrence counts; Covering and
Clustering algorithms; Efficient Random Walks.
Reading list:
Suri, Vassilvitskii,
Counting Triangles & The Curse of the Last Reducer.
Bahman, Kumar, Vassilvitskii, Densest
Subgraph in Streaming and MapReduce .
Bahman, Goel, Shinde, Efficient Distributed Locality Sensitive Hashing .
Active Distributed Hash Tables (2 lectures): Continuous
Map-Reduce; Incremental PageRank; Advanced LSH; Advanced Union-Find;
Social Search.
Bahman, Chowdhury, Goel, Fast Incremental and Personalized PageRank .
Bahman, Goel, Shinde, Efficient Distributed Locality Sensitive Hashing .
Goel, Kapralov, Kapralova, Khanna, Single pass graph sparsification in
distributed stream processing systems .
Graph-processing in the Bulk
Synchronous Model (2 lectures): Connectivity; Random Walks;
Differences and relationship between this model, active DHT, Continuous
Map-Reduce, and Map-Reduce.
Uncertainty Resolution (2 lectures): Explore vs Exploit; Models
of multi-armed bandit problems; Gittins' Index; Non-Bayesian algorithms
for the multi-armed bandit problem.
Tsitsiklis, A short proof
of the Gittins Index theorem Notice that this proof
assumes that the time for which an arm gets played is a continuous
distribution, whereas we assumed discrete in the proof in class.
Frostig, Weiss, Four
proofs of Gittins' multiarmed bandit theorem. Specially, see the proof which involves the fair and prevailing
prices.
Auer, Cesa-Bianchi, Fischer,
Finite-time Analysis of the Multiarmed Bandit Problem. Specifically, read
the analysis of UCB1. Then contemplate the following question: why is the
performance of this algorithm apparently worse if the arms have very similar
means? Can you derive a bound that does not depend on the difference in the
means?
CAM based computing (1 lecture): CAMs
and TCAMs; LSH with TCAMs; Route lookups and string-matching. Note: We will probably run out of time and
not finish this one.
Shinde, Goel, Gupta, Dutta,
Similarity Search and Locality Sensitive Hashing using TCAMs.