Compression for Similarity Search

Amir Ingber
Postdoc, Stanford University
Given on: Nov. 30, 2012

Abstract

We consider the problem of performing similarity queries on compressed data, and study the fundamental tradeoff between compression rate, sequence length, and the reliability of queries performed on the compressed data. For a Gaussian source and a quadratic similarity criterion, we show that queries can be answered reliably if and only if the compression rate exceeds a given threshold, the identification rate, which we explicitly characterize. When compression is performed at a rate greater than the identification rate, responses to queries on the compressed data can be made exponentially reliable. We give a complete characterization of this exponent, which is analogous to the error and excess-distortion exponents in channel and source coding, respectively.
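As a purely illustrative toy version of this query model (not the scheme from the talk), the sketch below compresses a Gaussian vector with a naive uniform scalar quantizer and answers the quadratic query "is ||x - y||^2 <= n*D?" with a one-sided decision: a "no" is guaranteed correct by the triangle inequality (up to rare clipping), while a "maybe" may be a false alarm. All names and parameter choices (scalar_quantize, similarity_query, n, D, bits) are ours, and NumPy is assumed.

    import numpy as np

    def scalar_quantize(x, bits, x_max=4.0):
        # Uniform scalar quantizer on [-x_max, x_max] with 2**bits levels.
        # Samples outside the range are clipped, so for a Gaussian source the
        # per-sample error bound returned below is only approximate.
        levels = 2 ** bits
        step = 2.0 * x_max / levels
        idx = np.clip(np.floor((x + x_max) / step), 0, levels - 1)
        x_hat = (idx + 0.5) * step - x_max
        return x_hat, step / 2.0  # reconstruction and per-sample error bound

    def similarity_query(x_hat, per_sample_err, y, D):
        # Decide "is ||x - y||^2 <= n*D?" from the compressed x alone.
        # Triangle inequality: ||x - y|| >= ||x_hat - y|| - ||x - x_hat||,
        # so a "no" is safe; anything else is a (possibly false) "maybe".
        n = len(y)
        slack = per_sample_err * np.sqrt(n)  # worst-case ||x - x_hat||
        if np.linalg.norm(x_hat - y) - slack > np.sqrt(n * D):
            return "no"
        return "maybe"

    rng = np.random.default_rng(0)
    n, D, bits = 512, 0.5, 3                       # rate = 3 bits per sample
    x = rng.standard_normal(n)
    x_hat, err = scalar_quantize(x, bits)

    y_far = rng.standard_normal(n)                 # independent, dissimilar query
    y_near = x + 0.1 * rng.standard_normal(n)      # genuinely similar query
    print(similarity_query(x_hat, err, y_far, D))  # expected: "no"
    print(similarity_query(x_hat, err, y_near, D)) # "maybe" (never a false "no")

The one-sided "no"/"maybe" semantics is one natural way to make answers on compressed data reliable: false alarms ("maybe" on dissimilar data) can be driven down by spending more bits per sample, which is the rate/reliability tradeoff the talk quantifies.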

For a general source, we prove that the identification rate is at most that of a Gaussian source with the same variance (restated formally below); therefore, as with classical compression, the Gaussian source requires the largest compression rate. Moreover, we describe a universal scheme that achieves reliable identification at this maximal rate for any source distribution.

Joint work with Thomas Courtade and Tsachy Weissman.
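For concreteness, the worst-case claim above can be written as follows, where the notation R_ID(P_X, D) for the identification rate of a source P_X under similarity threshold D is our own shorthand, not taken from the abstract:

    \[
      R_{\mathrm{ID}}(P_X, D) \;\le\; R_{\mathrm{ID}}\!\left(\mathcal{N}(0,\sigma^{2}), D\right)
      \qquad \text{for every source } P_X \text{ with } \operatorname{Var}(X)=\sigma^{2},
    \]

with equality for the Gaussian source itself, which is why a universal scheme operating at the Gaussian identification rate suffices for every source of the same variance.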