This dataset contains anonymized web access logs collected from the main Stanford web server, restricted to requests that originated from a search engine query. It covers all web pages that were reached through a query on one of the three major search engines and that appear in the server's access logs during a period of 12 months beginning March 2007.
Each entry in the dataset has the following form:
[IP and User-Agent hash] [timestamp] [requested_url] [http_referer_that_contains_search_query]
The size of the raw access log is 6.5GB.
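As an illustration of how the last field of an entry might be used, the sketch below pulls the search query out of a referer URL. It is only a minimal example: the parameter names in QUERY_PARAMS and the function name extract_query are assumptions made for illustration, since the description above does not say which search engines or URL parameters are involved.

```python
from urllib.parse import urlparse, parse_qs

# Assumed names of the URL parameters that carry the search terms.
# The dataset description does not specify them.
QUERY_PARAMS = ("q", "p", "query")


def extract_query(referer):
    """Return the search query embedded in a referer URL, or None."""
    params = parse_qs(urlparse(referer).query)
    for name in QUERY_PARAMS:
        values = params.get(name)
        if values:
            # parse_qs already decodes percent-escapes and '+' signs.
            return values[0].strip()
    return None


if __name__ == "__main__":
    example = "http://www.google.com/search?q=stanford+computer+science&hl=en"
    print(extract_query(example))
```

Referers that carry none of the assumed parameters yield None, so such entries can be skipped or inspected separately.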
In addition to the raw data, we have a processed dataset whose entries are pairs <x,y>, where x is a URL and y is a query tag (a 1-gram extracted from a query). It consists of 33,579,286 such pairs, covering 359,749 unique URLs and 10,997,818 unique queries, from which we extracted 937,075 unique query tags (1-grams). The size of this dataset is 2GB.
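To make the <x,y> representation concrete, here is a small sketch of how one (URL, query) observation could be expanded into URL/tag pairs. The tokenization used here (lowercasing and splitting on non-alphanumeric characters) is an assumption, as are the function names query_tags and url_tag_pairs; the actual preprocessing behind the processed dataset may differ.

```python
import re


def query_tags(query):
    """Split a query into lowercase 1-gram tags (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", query.lower())


def url_tag_pairs(url, query):
    """Build <x, y> pairs: x is the URL, y is one 1-gram from the query."""
    return [(url, tag) for tag in query_tags(query)]


if __name__ == "__main__":
    for pair in url_tag_pairs("http://www.stanford.edu/", "Stanford campus map"):
        print(pair)
```

A three-word query therefore contributes three pairs for the same URL, which is consistent with the number of pairs being larger than the number of unique queries.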
You can find more information about the dataset in the paper: "Tagging with Queries: How and Why?", Ioannis Antonellis, Hector Garcia-Molina, Jawed Karim, to appear in WSDM 2009.
To gain access to the dataset, please send an e-mail to antonell at stanford dot edu, or talk to any of the instructors of cs345a.