Algorithms for Modern Data Models
There are two threads in this research.
Algorithms for Ternary Content Addressable Memories
A Content Addressable Memory (CAM) is an associative lookup memory containing
a set of fixed-width cells that can hold arbitrary data bits. A CAM takes a
search key as a query, and returns the address of the entry that contains the
key, if any. In a Ternary CAM (TCAM), each data bit is capable of storing one
of three states: 0, 1, or *, where a * matches both 0 and 1. This is a
powerful primitive that has found wide use in high-performance network routers
and switches. This project will develop general techniques for using TCAMs to
implement sophisticated randomized data structures, and use them in multiple
data intensive applications.
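To make the matching semantics concrete, here is a minimal software sketch of a TCAM lookup. The class name and methods are illustrative only; a real TCAM compares the key against every stored entry in parallel and answers in a single cycle, whereas this sketch scans sequentially.

```python
# Illustrative software model of TCAM lookup semantics (not a hardware model).

class TCAM:
    def __init__(self, width):
        self.width = width
        self.entries = []  # ternary words over {'0', '1', '*'}

    def write(self, ternary_word):
        """Append one ternary entry; its address is its position."""
        assert len(ternary_word) == self.width
        assert set(ternary_word) <= {"0", "1", "*"}
        self.entries.append(ternary_word)

    def search(self, key):
        """Return the address of the first matching entry, else None.
        A '*' in an entry matches both 0 and 1 in the key."""
        assert len(key) == self.width and set(key) <= {"0", "1"}
        for addr, entry in enumerate(self.entries):
            if all(e == "*" or e == k for e, k in zip(entry, key)):
                return addr
        return None

tcam = TCAM(width=4)
tcam.write("10*1")   # matches 1001 and 1011
tcam.write("0***")   # matches any key starting with 0
print(tcam.search("1011"))  # -> 0
print(tcam.search("0110"))  # -> 1
print(tcam.search("1100"))  # -> None
```

Note that when several entries match, a TCAM conventionally reports the lowest matching address, which the sequential scan above reproduces.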
Distributed Algorithms at Scale
Map-Reduce and other modern processing paradigms have made it very easy to
develop short pieces of software that solve substantial and complex problems
over offline data. We believe that similar paradigms can be developed for
real-time processing. We have developed algorithms for locality sensitive
hashing, for keyword similarity given a background model, for social search,
and for UNION-FIND, all of which work well in either Map-Reduce or
distributed real-time systems.
This project is funded primarily by NSF grant 0915040.
Project Participants
PI: Ashish Goel (Stanford)
Senior Personnel: Pankaj Gupta
Graduate Students: Rajendra Shinde, Bahman Bahmani
Results
Nearest Neighbors and Similarity Search
Similarity search methods are widely used as kernels in various machine
learning applications. Nearest neighbor search (NNS) algorithms are often used
to retrieve similar entries, given a query. While there exist efficient
techniques for exact query lookup using hashing, similarity search via exact
nearest neighbors is known to be a hard problem, and in high dimensions the
best known solutions offer little improvement over a linear scan. Fast solutions to
the approximate NNS problem include Locality Sensitive Hashing (LSH) based
techniques, which need storage polynomial in n (with exponent greater than
1) and query time that is sublinear but still polynomial in n, where n is the
size of the database. In this work we present a new technique of solving the
approximate NNS problem in Euclidean space using a Ternary Content Addressable
Memory (TCAM), which needs near linear space and has O(1) query time. In fact,
this method also works around the best known lower bounds in the cell probe
model for query time with a data structure near linear in the size of the
database. TCAMs are high-performance associative memories widely used in
networking applications such as access control lists. A TCAM can query for a
bit vector within a database of ternary vectors, where every bit position
represents 0, 1 or *. The * is a wildcard representing either a 0
or a 1. We leverage TCAMs to design a variant of LSH, called Ternary
Locality Sensitive Hashing (TLSH) wherein we hash database entries represented
by vectors in the Euclidean space into {0,1,*}. By using the added
functionality of a TLSH scheme with respect to the * character, we solve an
instance of the approximate nearest neighbor problem with one TCAM access and
storage nearly linear in the size of the database. We believe that this work
can open new avenues in very high speed data mining.
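One way to realize a ternary locality-sensitive hash is sketched below. This is an illustrative guess at the idea, not the paper's exact TLSH construction: database points are projected onto random Gaussian directions and each coordinate is hashed to the projection's sign, except that projections falling within a margin of zero are stored as '*', so a nearby query landing on either side of that boundary still matches. The function names, the margin parameter, and the sign-with-margin rule are all assumptions of this sketch.

```python
import random

def tlsh_store(point, dirs, margin):
    """Hash a database point into a ternary word: the sign of each random
    projection, with '*' whenever the projection lies within `margin` of zero
    (a nearby query can then land on either side and still match)."""
    word = []
    for d in dirs:
        proj = sum(x * y for x, y in zip(point, d))
        word.append("*" if abs(proj) < margin else ("1" if proj > 0 else "0"))
    return "".join(word)

def tlsh_query(point, dirs):
    """Hash a query point into a binary word by projection sign."""
    return "".join(
        "1" if sum(x * y for x, y in zip(point, d)) > 0 else "0" for d in dirs
    )

def matches(entry, key):
    """TCAM match: every non-'*' position must agree."""
    return all(e == "*" or e == k for e, k in zip(entry, key))

random.seed(0)
dim, bits = 8, 16
dirs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
p = [random.gauss(0, 1) for _ in range(dim)]
entry = tlsh_store(p, dirs, margin=0.5)
# A point always matches its own stored word: each non-'*' bit is its own sign.
print(matches(entry, tlsh_query(p, dirs)))  # -> True
```

A query close to p differs from it only on projections near zero, and those are exactly the positions stored as '*', which is what lets one TCAM access retrieve approximate nearest neighbors.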
Similarity Search and Locality Sensitive Hashing using TCAMs
Authors: Rajendra Shinde, Ashish Goel, Pankaj Gupta, Debojyoti Dutta
Proceedings of ACM SIGMOD, 2010.
Subset queries and string matching via TCAMs
We design and analyze an architecture which uses a
single lookup into a Ternary Content Addressable Memory (TCAM) to
solve the subset query problem for small sets, i.e., to check whether a
given set (the query) contains (or alternately, is contained in) any one of
a large collection of sets in a database. We use each TCAM entry as a small
Ternary Bloom Filter (each "bit" of which is one of {0,1,*}) to
store one of the sets in the collection. Like Bloom filters, our
architecture is susceptible to false positives. Since each TCAM entry is
quite small, asymptotic analyses of Bloom filters do not directly apply.
Surprisingly, we are able to show that the asymptotic false positive
probability formula can be safely used if we penalize the small Bloom filter
by taking away just one bit of storage and adding just half an extra set
element before applying the formula. We believe that this analysis is
independently interesting.
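Writing the penalty out explicitly: a standard Bloom filter with $m$ bits, $n$ elements, and $k$ hash functions has asymptotic false positive probability $(1 - e^{-kn/m})^k$, and as we read the claim above, the same formula safely bounds a small, TCAM-entry-sized filter after the substitution $m \to m - 1$, $n \to n + \tfrac{1}{2}$:

```latex
% Asymptotic false-positive rate of a standard (m, n, k) Bloom filter:
%   p \approx \left(1 - e^{-kn/m}\right)^{k}
% Penalized form claimed safe for a small, TCAM-entry-sized filter:
\[
  p_{\text{small}} \;\le\; \left(1 - e^{-k\,(n + 1/2)/(m - 1)}\right)^{k}
\]
```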
The subset query problem has applications in databases, network intrusion
detection, packet classification in Internet routers, and Information
Retrieval. We demonstrate our architecture on one illustrative
streaming application -- intrusion detection in network traffic. By
shingling the strings in the database, we can perform a single subset query,
and hence a single TCAM search, to skip many bytes in the stream. We
evaluate our scheme on the open-source ClamAV anti-virus database, for
worst-case as well as random streams. Our architecture appears to be at
least one order of magnitude faster than previous approaches. Since the
individual Bloom filters must fit in a single TCAM entry (currently 72 to
576 bits), our solution applies only when each set is of a small
cardinality. However, this is sufficient for many typical applications.
Also, recent algorithms for the subset-query problem use a small-set version
as a subroutine.
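A natural construction consistent with this description runs as follows (an illustrative sketch, not necessarily the paper's exact scheme): each stored set becomes one TCAM entry with '1' at its Bloom positions and '*' everywhere else, while a query set is hashed to a binary word; the entry then matches exactly when the query's 1-bits cover the stored set's Bloom bits, i.e. when the stored set is (up to false positives) contained in the query. The hash choice and parameter names here are assumptions.

```python
import hashlib

def bloom_positions(item, m, k):
    """k hash positions for `item` in an m-bit filter (illustrative hashing)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:4], "big") % m
        for i in range(k)
    ]

def ternary_entry(s, m, k):
    """Store set `s` as a ternary word: '1' at its Bloom positions, '*' elsewhere."""
    word = ["*"] * m
    for item in s:
        for pos in bloom_positions(item, m, k):
            word[pos] = "1"
    return "".join(word)

def query_word(q, m, k):
    """Binary Bloom word for the query set `q`."""
    bits = ["0"] * m
    for item in q:
        for pos in bloom_positions(item, m, k):
            bits[pos] = "1"
    return "".join(bits)

def matches(entry, key):
    """TCAM match: every non-'*' position must agree."""
    return all(e == "*" or e == k for e, k in zip(entry, key))

M, K = 72, 3  # a 72-bit TCAM entry, 3 hash functions
db = [{"a", "b"}, {"c", "d", "e"}]
entries = [ternary_entry(s, M, K) for s in db]
q = query_word({"a", "b", "x"}, M, K)
hits = [i for i, e in enumerate(entries) if matches(e, q)]
print(0 in hits)  # -> True: {a, b} is a subset of the query, so entry 0 matches
```

Entry 1 can match only on a Bloom false positive, which is the event the penalized formula above bounds; the reverse direction (query contained in a stored set) follows by swapping the roles of '*' and '0'.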
Small Subset Queries and Bloom Filters Using Ternary
Associative Memories, with Applications
Authors: Ashish Goel and Pankaj Gupta
Proc. ACM SIGMETRICS, 2010.
Social Search at Scale
To answer search queries on a social network rich with user-generated content,
it is desirable to give a higher ranking to content that is closer to the
individual issuing the query. Queries occur at nodes in the network, documents
are also created by nodes in the same network, and the goal is to find the
document that matches the query and is closest in network distance to the node
issuing the query. In this paper, we present the "Partitioned
Multi-Indexing" scheme, which provides an approximate solution to this
problem. We believe this is the first feasible solution for the problem of
large scale social search.
Our solution adapts to distributed, real-time implementations on Active
Distributed Hash Tables (basic distributed stream processing systems) with
O(1) network calls per search or update.
Partitioned Multi-Indexing: Bringing Order to Social Search
Authors: Bahman Bahmani and Ashish Goel
Proc. WWW, 2012.
Fast Incremental and Personalized PageRank
In this paper, we analyze the efficiency of Monte Carlo methods for
incremental computation of PageRank, personalized PageRank, and similar random
walk based methods (with focus on SALSA), on large-scale dynamically evolving
social networks. We assume that the graph of friendships is stored in
distributed shared memory, as is the case for large social networks such as
Twitter. For global PageRank, we assume that the social network has $n$
nodes, and $m$ adversarially chosen edges arrive in a random order. We show
that with a reset probability of $\epsilon$, the total work needed to maintain
an accurate estimate (using the Monte Carlo method) of the PageRank of every
node at all times is $O(\frac{n\ln m}{\epsilon^{2}})$. This is significantly
better than all known bounds for incremental PageRank. For instance, if we
naively recompute the PageRanks as each edge arrives, the simple power
iteration method needs $\Omega(\frac{m^2}{\ln(1/(1-\epsilon))})$ total time
and the Monte Carlo method needs $O(mn/\epsilon)$ total time; both are
prohibitively expensive. We also show that deletions can be handled
equally efficiently. We then study the computation of the top $k$
personalized PageRanks starting from a seed node, assuming that personalized
PageRanks follow a power-law with exponent $\alpha < 1$. We show that if we
store $R>q\ln n$ random walks starting from every node for large enough
constant $q$ (using the approach outlined for global PageRank), then the
expected number of calls made to the distributed social network database is
$O(k/(R^{(1-\alpha)/\alpha}))$. We also present experimental results from the
social networking site Twitter, verifying our assumptions and analyses. The
overall result is that this algorithm is fast enough for real-time queries
over a dynamic social network, and can be implemented on an active DHT.
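The Monte Carlo estimator underlying this approach can be sketched as follows. This is a static toy version under assumed names and parameters; the paper's contribution is maintaining these walks incrementally as edges arrive (re-routing only the walks that pass through an endpoint of a new edge, which is what yields the $O(\frac{n\ln m}{\epsilon^{2}})$ total-work bound), which this sketch omits.

```python
import random
from collections import defaultdict

def random_walk(graph, start, eps, rng):
    """One epsilon-reset walk from `start`: stop with probability eps at each
    step, otherwise move to a uniformly random out-neighbor."""
    path = [start]
    node = start
    while rng.random() > eps:
        nbrs = graph.get(node)
        if not nbrs:  # dangling node: terminate the walk
            break
        node = rng.choice(nbrs)
        path.append(node)
    return path

def build_walks(graph, nodes, R, eps, rng):
    """Store R walks from every node (the structure maintained incrementally)."""
    return {v: [random_walk(graph, v, eps, rng) for _ in range(R)] for v in nodes}

def pagerank_estimates(walks, nodes):
    """PageRank estimate of a node = its share of all walk visits."""
    visits, total = defaultdict(int), 0
    for ws in walks.values():
        for w in ws:
            for v in w:
                visits[v] += 1
                total += 1
    return {v: visits[v] / total for v in nodes}

rng = random.Random(42)
nodes = list(range(5))
graph = {0: [1], 1: [2], 2: [0], 3: [0], 4: [0]}  # toy graph: three edges into 0
walks = build_walks(graph, nodes, R=200, eps=0.2, rng=rng)
pr = pagerank_estimates(walks, nodes)
print(max(pr, key=pr.get))  # node 0 should rank highest: it has three in-edges
```

For personalized PageRank the same stored walks are read starting only from the seed node, which is what bounds the number of calls to the distributed social network database.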
Fast Incremental and Personalized PageRank
Authors: Bahman Bahmani, Abdur Chowdhury, and Ashish Goel
Proc. VLDB, 2011.