Algorithms for Modern Data Models

There are two threads in this research.

Algorithms for Ternary Content Addressable Memories

A Content Addressable Memory (CAM) is an associative lookup memory containing a set of fixed-width cells that can hold arbitrary data bits. A CAM takes a search key as a query and returns the address of the entry that contains the key, if any. In a Ternary CAM (TCAM), each data bit can store one of three states: 0, 1, or *, where a * matches both 0 and 1. This is a powerful primitive that has found wide use in high-performance network routers and switches. This project will develop general techniques for using TCAMs to implement sophisticated randomized data structures, and use them in multiple data-intensive applications.
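The ternary match can be sketched in a few lines of software. The model below is a hypothetical illustration, not a hardware interface: each TCAM entry is a fixed-width string over {'0','1','*'}, and a search returns the address (index) of the first entry that matches a binary key.

```python
def tcam_search(entries, key):
    """Return the index of the first TCAM entry matching the binary key.

    Each entry is a string over {'0','1','*'}; '*' matches either bit.
    Returns None if no entry matches.
    """
    for addr, pattern in enumerate(entries):
        assert len(pattern) == len(key)
        if all(p == '*' or p == k for p, k in zip(pattern, key)):
            return addr
    return None

entries = ["10*1", "0**0", "111*"]
assert tcam_search(entries, "1011") == 0   # '*' matches the third bit
assert tcam_search(entries, "0110") == 1
assert tcam_search(entries, "0001") is None
```

A real TCAM performs all these comparisons in parallel in a single clock cycle, which is what the project's data structures exploit.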

Distributed Algorithms at Scale

Map-Reduce and other modern processing paradigms have made it very easy to develop short pieces of software that solve substantial and complex problems over offline data. We believe that similar paradigms can be developed for real-time processing. We have developed algorithms for locality sensitive hashing, for keyword similarity given a background model, for social search, and for UNION-FIND that work well in both Map-Reduce and distributed real-time systems.
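For context, the following is the classic sequential UNION-FIND kernel (with path halving and union by rank) that the distributed variants build on; this sketch is illustrative and is not the Map-Reduce algorithm itself.

```python
# Sequential UNION-FIND: near-constant amortized time per operation.
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Path halving: point every other node on the path at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra          # attach the shallower tree under the deeper
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

uf = UnionFind(5)
uf.union(0, 1)
uf.union(1, 2)
assert uf.find(0) == uf.find(2)   # 0, 1, 2 are now one component
assert uf.find(3) != uf.find(0)   # 3 remains separate
```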

This project is funded primarily by NSF grant 0915040.

Project Participants

PI: Ashish Goel (Stanford)

Senior Personnel: Pankaj Gupta

Graduate Students: Rajendra Shinde, Bahman Bahmani.

Results

  • Nearest Neighbors and Similarity Search
  • Similarity search methods are widely used as kernels in various machine learning applications. Nearest neighbor search (NNS) algorithms are often used to retrieve similar entries, given a query. While there exist efficient techniques for exact query lookup using hashing, similarity search using exact nearest neighbors is known to be a hard problem, and in high dimensions the best known solutions offer little improvement over a linear scan. Fast solutions to the approximate NNS problem include Locality Sensitive Hashing (LSH) based techniques, which need storage polynomial in n with exponent greater than 1, and query time sublinear, but still polynomial in n, where n is the size of the database. In this work we present a new technique for solving the approximate NNS problem in Euclidean space using a Ternary Content Addressable Memory (TCAM), which needs near linear space and has O(1) query time. In fact, this method also works around the best known lower bounds in the cell probe model for the query time using a data structure near linear in the size of the database. TCAMs are high performance associative memories widely used in networking applications such as access control lists. A TCAM can query for a bit vector within a database of ternary vectors, where every bit position represents 0, 1 or *. The * is a wildcard representing either a 0 or a 1. We leverage TCAMs to design a variant of LSH, called Ternary Locality Sensitive Hashing (TLSH), wherein we hash database entries represented by vectors in the Euclidean space into {0,1,*}. By using the added functionality of a TLSH scheme with respect to the * character, we solve an instance of the approximate nearest neighbor problem with 1 TCAM access and storage nearly linear in the size of the database. We believe that this work can open new avenues in very high speed data mining.

    Similarity Search and Locality Sensitive Hashing using TCAMs
    Authors: Rajendra Shinde, Ashish Goel, Pankaj Gupta, Debojyoti Dutta
    Proceedings of ACM SIGMOD, 2010.
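The core idea of hashing Euclidean vectors into {0,1,*} can be illustrated with random projections. This is a simplified sketch, not the scheme from the paper: the margin parameter and the number of projection bits below are hypothetical choices. Each signature bit is the sign of a random projection; projections that land within a margin of zero are stored as '*', so near-boundary points still match.

```python
import random

def tlsh_signature(vec, projections, margin=0.25):
    """Hash a vector into a ternary string over {'0','1','*'}.

    '*' marks projections close to the separating hyperplane, so small
    perturbations of the input cannot cause a hard mismatch on that bit.
    """
    sig = []
    for proj in projections:
        dot = sum(p * v for p, v in zip(proj, vec))
        if abs(dot) < margin:
            sig.append('*')                      # near the boundary: wildcard
        else:
            sig.append('1' if dot > 0 else '0')  # sign of the projection
    return ''.join(sig)

random.seed(0)
dim, bits = 8, 16
projections = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

a = [1.0] * dim
b = [1.0] * dim
b[0] += 0.01   # a tiny perturbation of a

sig_a = tlsh_signature(a, projections)
sig_b = tlsh_signature(b, projections)
# Nearby points agree on every position where neither signature has a '*'.
agree = all(x == y or '*' in (x, y) for x, y in zip(sig_a, sig_b))
assert agree
```

Storing such ternary signatures as TCAM entries is what lets a single TCAM lookup retrieve approximate near neighbors.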

  • Subset queries and string matching via TCAMs
  • We design and analyze an architecture which uses a single lookup into a Ternary Content Addressable Memory (TCAM) to solve the subset query problem for small sets, i.e., to check whether a given set (the query) contains (or alternately, is contained in) any one of a large collection of sets in a database. We use each TCAM entry as a small Ternary Bloom Filter (each "bit" of which is one of {0,1,*}) to store one of the sets in the collection. Like Bloom filters, our architecture is susceptible to false positives. Since each TCAM entry is quite small, asymptotic analyses of Bloom filters do not directly apply. Surprisingly, we are able to show that the asymptotic false positive probability formula can be safely used if we penalize the small Bloom filter by taking away just one bit of storage and adding just half an extra set element before applying the formula. We believe that this analysis is independently interesting. The subset query problem has applications in databases, network intrusion detection, packet classification in Internet routers, and Information Retrieval. We demonstrate our architecture on one illustrative streaming application -- intrusion detection in network traffic. By shingling the strings in the database, we can perform a single subset query, and hence a single TCAM search, to skip many bytes in the stream. We evaluate our scheme on the open-source ClamAV anti-virus database, for worst-case as well as random streams. Our architecture appears to be at least one order of magnitude faster than previous approaches. Since the individual Bloom filters must fit in a single TCAM entry (currently 72 to 576 bits), our solution applies only when each set has small cardinality. However, this is sufficient for many typical applications. Also, recent algorithms for the subset-query problem use a small-set version as a subroutine.

    Small Subset Queries and Bloom Filters Using Ternary Associative Memories, with Applications
    Authors: Ashish Goel and Pankaj Gupta
    Proc. ACM SIGMETRICS, 2010.
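The ternary-Bloom-filter encoding can be sketched as follows; the hash construction and entry width here are hypothetical choices for illustration. To test whether the query set contains a stored set, each stored set S becomes a TCAM entry with '1' at the Bloom positions of S's elements and '*' everywhere else, while the query set Q becomes an ordinary Bloom filter bit vector. The entry matches exactly when every '1' bit of S is also set in Q's filter, i.e., S appears contained in Q (up to Bloom false positives).

```python
import hashlib

WIDTH = 72   # one small TCAM entry
K = 2        # hash positions per set element (illustrative choice)

def positions(item):
    # Derive K positions in [0, WIDTH) from a SHA-256 digest of the item.
    h = hashlib.sha256(str(item).encode()).digest()
    return [(h[2 * i] * 256 + h[2 * i + 1]) % WIDTH for i in range(K)]

def entry_for_set(s):
    # Stored set: '1' at its Bloom positions, '*' (don't care) elsewhere.
    bits = ['*'] * WIDTH
    for item in s:
        for p in positions(item):
            bits[p] = '1'
    return ''.join(bits)

def bloom_of_query(q):
    # Query set: a plain Bloom filter bit vector.
    bits = ['0'] * WIDTH
    for item in q:
        for p in positions(item):
            bits[p] = '1'
    return ''.join(bits)

def matches(entry, key):
    # The ternary match a TCAM would perform in hardware.
    return all(e == '*' or e == k for e, k in zip(entry, key))

stored = entry_for_set({"ab", "cd"})
# A query that is a superset of the stored set always matches.
assert matches(stored, bloom_of_query({"ab", "cd", "ef"}))
```

A real deployment stores one such entry per set, so one TCAM search tests the query against the entire collection at once.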

  • Social Search at Scale
  • To answer search queries on a social network rich with user-generated content, it is desirable to give a higher ranking to content that is closer to the individual issuing the query. Queries occur at nodes in the network, documents are also created by nodes in the same network, and the goal is to find the document that matches the query and is closest in network distance to the node issuing the query. In this paper, we present the "Partitioned Multi-Indexing" scheme, which provides an approximate solution to this problem. We believe this is the first feasible solution for the problem of large scale social search.
    Our solution adapts to distributed, real-time implementations on Active Distributed Hash Tables (basic distributed stream processing systems) with O(1) network calls per search or update.

    Partitioned Multi-Indexing: Bringing Order to Social Search
    Authors: Bahman Bahmani and Ashish Goel
    Proc. WWW, 2012.
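The objective being approximated can be stated as a tiny exact computation. The toy below is not the partitioned multi-indexing scheme itself: it ranks matching documents by exact hop distance to the querier using a BFS per query, which is precisely the cost the paper's distributed indexes avoid.

```python
from collections import deque

def bfs_distances(graph, source):
    """Hop distance from source to every reachable node (graph: node -> neighbors)."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def social_search(graph, docs, querier, keyword):
    """Return the matching document whose creator is closest to the querier.

    docs is a list of (creator_node, document_text) pairs.
    """
    dist = bfs_distances(graph, querier)
    hits = [(dist.get(node, float('inf')), doc)
            for node, doc in docs if keyword in doc]
    return min(hits, default=(None, None))[1]

graph = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
docs = [(3, "cats are great"), (2, "cats and dogs"), (0, "weather report")]
# Node 2 (distance 1 from node 0) beats node 3 (distance 2).
assert social_search(graph, docs, 0, "cats") == "cats and dogs"
```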

  • Fast Incremental and Personalized PageRank
  • In this paper, we analyze the efficiency of Monte Carlo methods for incremental computation of PageRank, personalized PageRank, and similar random walk based methods (with focus on SALSA), on large-scale dynamically evolving social networks. We assume that the graph of friendships is stored in distributed shared memory, as is the case for large social networks such as Twitter. For global PageRank, we assume that the social network has n nodes, and m adversarially chosen edges arrive in a random order. We show that with a reset probability of ε, the total work needed to maintain an accurate estimate (using the Monte Carlo method) of the PageRank of every node at all times is O(n ln m / ε²). This is significantly better than all known bounds for incremental PageRank. For instance, if we naively recompute the PageRanks as each edge arrives, the simple power iteration method needs Ω(m² / ln(1/(1−ε))) total time and the Monte Carlo method needs O(mn/ε) total time; both are prohibitively expensive. Furthermore, we also show that we can handle deletions equally efficiently. We then study the computation of the top k personalized PageRanks starting from a seed node, assuming that personalized PageRanks follow a power-law with exponent α < 1. We show that if we store R > q ln n random walks starting from every node for a large enough constant q (using the approach outlined for global PageRank), then the expected number of calls made to the distributed social network database is O(k / R^((1−α)/α)). We also present experimental results from the social networking site Twitter, verifying our assumptions and analyses. The overall result is that this algorithm is fast enough for real-time queries over a dynamic social network, and can be implemented on an active DHT.

    Fast Incremental and Personalized PageRank
    Authors: Bahman Bahmani, Abdur Chowdhury, and Ashish Goel
    Proc. VLDB, 2011.
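The Monte Carlo estimator at the heart of this approach can be sketched on a single machine; the incremental edge-update machinery of the paper is omitted, and the parameters below are illustrative. From every node we run R random walks with reset probability ε, and estimate each node's PageRank as its share of all walk visits.

```python
import random
from collections import Counter

def monte_carlo_pagerank(graph, eps=0.15, R=200, seed=0):
    """Estimate PageRank from R reset-probability-eps walks per node.

    graph maps each node to its list of out-neighbors.
    """
    rng = random.Random(seed)
    visits = Counter()
    for start in graph:
        for _ in range(R):
            u = start
            while True:
                visits[u] += 1
                # Reset (end the walk) with probability eps, or at a dead end.
                if rng.random() < eps or not graph[u]:
                    break
                u = rng.choice(graph[u])
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}

graph = {0: [1], 1: [2], 2: [0], 3: [0]}   # node 3 feeds into the 0-1-2 cycle
pr = monte_carlo_pagerank(graph)
assert pr[0] > pr[3]   # node 0 collects walks from 2 and 3; nothing points at 3
```

The paper's incremental result rests on the observation that when an edge arrives, only the stored walks passing through its endpoint need to be redone, which keeps the total maintenance work near linear.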