Abstract
In this paper, we consider the problem of calculating fast and accurate approximations
to the personalized PageRank score ([8, 16]) of a webpage. We focus on techniques
to improve speed by limiting the amount of webgraph data we need to access.
PageRank scores are mainly used for ranking purposes, and generally only the scores
exceeding a given threshold are relevant. In practice, and relative to the size of the web,
only a small number of pages have a non-negligible personalized PageRank score. We
capitalize on this property of the PageRank score to reduce the amount of webgraph
data needed for computation. Our algorithms provide both the approximation to the
personalized PageRank score as well as guidance in using only the necessary information
— and therefore sensibly reduce not only the computational cost of the algorithm, but
also the memory and memory bandwidth requirements.
Our algorithms assume that we have random access to a sparse representation of
a webgraph (perhaps by crawling the web if necessary); some of them further require
knowledge of a hostgraph ([19]). All of them provide a fast approximate calculation of
the personalized PageRank score for pages where the score exceeds a given threshold.
We report experiments with these algorithms on webgraphs of up to 118 million
pages and prove theoretical approximation bound for all. We conclude by proposing
an application scenario of personalized Web Search that inspired and motivated our
work.
Available
Final publication
- Internet Mathematics Website
- Approximating Personalized PageRank with minimal usage of web graph data
(not yet available) - Project Euclid
- Approximating Personalized PageRank with minimal usage of web graph data
- Personal site
- Approximating Personalized PageRank with minimal usage of web graph data
Early draft available
Bibtex
@ARTICLE{gleich2007-approximate,
author = {David Gleich and Marzia Polito},
title = {Approximating Personalized {PageRank} with Minimal Use of Webgraph
Data},
journal = {Internet Mathematics},
year = {2007},
volume = {3},
pages = {257--294},
number = {3},
month = {December},
key = {GP2007},
keywords = {pagerank, hostrank},
owner = {David Gleich},
timestamp = {2006.11.27}
}