Algorithms for Modern Data Models

MS&E 317/CS 263
Autumn 2011-12, Stanford University
MW 2:15-3:30, 380-380W

Instructor: Ashish Goel
Office hours: Mon 4-5, Huang 359.

Description: We traditionally think of algorithms as running on data available in a single location, typically main memory. In many modern applications including web analytics, search and data mining, computational biology, finance, and scientific computing, the data is often too large to reside in a single location, is arriving incrementally over time, is noisy/uncertain, or all of the above. Paradigms such as map-reduce, streaming, sketching, Distributed Hash Tables, Bulk Synchronous Processing, and random walks have proved useful for these applications. This course will provide an introduction to the design and analysis of algorithms for these modern data models.  The class will be largely theoretical, but also involve at least two hands-on programming exercises.

We will post scribed-class notes, and update the class plan and reading list as the class progresses. Guests, please sign-up for msande317-aut1112-guests at http://mailman.stanford.edu .

Pre-requisites: Algorithms at the level of CS 261. Being able to competently program in any main-stream high level language (C, C++, Ruby, Java, Scala, Python, Perl). All code we hand out is likely to be in Ruby, since it is a concise language for expressing algorithms.

Class work: There will be 4 HWs of weight 15% each, everyone will be required to produce high-quality scribed notes for at-least one class (15%), and a fifth HW (25%) which will also double as a take-home exam. You can write a survey of any of the topics we cover if we run out of scribing slots.

Schedule: Please keep track of these deliverables proactively. We will only send one reminder.

09/30/2011: Send the instructor an email with three class dates, in order of preference, that you can scribe for.
10/05/2011: Send the instructor an email with your group for programming problems (2-3 people)
10/19/2011: HW 1 due (handed out 10/12)
11/02/2011: HW 2 due (handed out 10/26)
11/14/2011: HW 3 due (handed out 11/07)
12/5/2011: HW 4 due (handed out 11/28) (CANCELLED)
12/12/2011: Take-home exam due (handed out 12/5/2011)

All due times are 2:15pm. Collaboration policy: You can discuss general strategies, but must write your own final answer. The programming parts of each HW can be done in groups of upto 3. No collaboration is allowed on the final HW/exam.



Handouts



Class Plan and Reading List

Introduction (2 lectures): Motivation, data models (eg. Streaming, Semi-streaming, Map-Reduce, Continuous Map-Reduce, Pregel, Active DHTs, uncertainty resolution) with examples, probabilistic techniques (Linearity of Expectations, Concentration and tail inequalities).

Reading list:

Muthukrishnan, "Data Streams: Algorithms and Applications". Chapter 1. A copy of the book is available from his webpage where there is also a link to purchase it.
Motwani and Raghavan, Randomized Algorithms. Appendix C, Section 3.2, and 4.1. Available on-line for Stanford students.
Dean and Ghemawat, Map-Reduce: Simplified Data Processing on Large Clusters.

Streaming (3-4 lectures): Sampling, counting distinct elements, moment estimation, clustering, sketching, key-word similarity.

Reading list

Muthukrishnan, "Data Streams: Algorithms and Applications". Chapters 3, 4, 5. A copy of the book is available from his webpage where there is also a link to purchase it.
Cormode, Muthukrishnan An Improved Data Stream Summary: The Count-Min Sketch and its Applications .
Meyerson, Online Facility Location .
Datar, Gionis, Indyk, Motwani, Maintaining Stream Statistics over Sliding Windows.

Additional Reading:

Babcock, Babu, Datar, Motwani, Widom, Models and Issues in Data Stream Systems .
Alon, Matias, Szegedy, The space complexity of approximating the frequency moments.
Guha, Koudas, Shim, Approximation and Streaming Algorithms for Histogram Construction Problems. If you want to dig deeper, also read the papers on wavelet approximations mentioned in this paper.

Semi-streaming (1 lecture): Union-Find, with applications.

Some Algorithms for Large Scale Data (2 lectures): Introduction to Locality Sensitive Hashing; Dimensionality Reduction; Min-Hash.

Reading list (will be updated):

Andoni, Indyk, Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions (Survey, along with some code)
Broder, On the resemblance and containment of documents. Describes the original ideas behind min-hash.
Charikar, Similarity Estimation Techniques from Rounding Algorithms.

Map-reduce algorithms (3 lectures)
: Simple examples; shortest-paths and the problems with using Map-Reduce for graph algorithms; Simple Locality Sensitive Hashing (LSH) and Similarity Search using Map-Reduce; Co-occurrence counts; Covering and Clustering algorithms; Efficient Random Walks.

Reading list:

Suri, Vassilvitskii, Counting Triangles & The Curse of the Last Reducer.
Bahman, Kumar, Vassilvitskii, Densest Subgraph in Streaming and MapReduce .
Bahman, Goel, Shinde, Efficient Distributed Locality Sensitive Hashing .


Active Distributed Hash Tables (2 lectures):
Continuous Map-Reduce; Incremental PageRank; Advanced LSH; Advanced Union-Find; Social Search.

Bahman, Chowdhury, Goel, Fast Incremental and Personalized PageRank .
Bahman, Goel, Shinde, Efficient Distributed Locality Sensitive Hashing .
Goel, Kapralov, Kapralova, Khanna, Single pass graph sparsification in distributed stream processing systems .

Graph-processing in the Bulk Synchronous Model (2 lectures): Connectivity; Random Walks; Differences and relationship between this model, active DHT, Continuous Map-Reduce, and Map-Reduce.

Uncertainty Resolution (2 lectures):
Explore vs Exploit; Models of multi-armed bandit problems; Gittins' Index; Non-Bayesian algorithms for the multi-armed bandit problem.

Tsitsiklis, A short proof of the Gittins Index theorem Notice that this proof assumes that the time for which an arm gets played is a continuous distribution, whereas we assumed discrete in the proof in class.
Frostig, Weiss, Four proofs of Gittins' multiarmed bandit theorem. Specially, see the proof which involves the fair and prevailing prices.
Auer, Cesa-Bianchi, Fischer, Finite-time Analysis of the Multiarmed Bandit Problem. Specifically, read the analysis of UCB1. Then contemplate the following question: why is the performance of this algorithm apparently worse if the arms have very similar means? Can you derive a bound that does not depend on the difference in the means?

CAM based computing (1 lecture): CAMs and TCAMs; LSH with TCAMs; Route lookups and string-matching. Note: We will probably run out of time and not finish this one.

Shinde, Goel, Gupta, Dutta, Similarity Search and Locality Sensitive Hashing using TCAMs.