
Stanford University

Compressed Image-based Retrieval


David Chen, Sam Tsai, Bernd Girod


Streaming Mobile Augmented Reality


We have developed a real-time CD/DVD/book cover recognition system for mobile phones. The user can point the phone's camera at a media cover and see the object's identity in the viewfinder in about 1 second. The boundary of the object is also displayed for easy visibility against a cluttered background. Both the object's identity and the geometry are obtained from a server hosting a large database of media covers. As the user pans across a set of media covers, the system automatically recognizes new objects that come into view, without the user ever having to press a button. Since we employ state-of-the-art recognition algorithms, our system accurately recognizes objects in challenging environments and is robust against occlusions, clutter, geometric deformations, and photometric distortions.

The recognition system uses speed-optimized operations on the phone and server to achieve low latency. On the phone, efficient motion estimation, specifically designed for mobile devices, is performed on the viewfinder frames to determine how quickly the user is moving the phone. Our user attention model assumes that when motion is low, the user is very likely interested in the objects shown in the viewfinder, so the current viewfinder frame is uploaded to the server. Over a WiFi connection, the server's response arrives in approximately 1 second. The reply message contains the identity of the media cover as well as the location of the object within the viewfinder. A boundary for the recognized object is shown as long as the user maintains visual focus on the object. By moving the viewfinder again, the user indicates that the focus is shifting to another media cover.
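The upload decision described above can be illustrated with a minimal sketch. It is not the system's actual motion estimator: the mean absolute frame difference used here is a crude stand-in, and the names `motion_score` and `frames_to_upload`, the threshold, and the `still_frames` parameter are all assumptions made for illustration.

```python
from typing import List, Sequence

# A grayscale frame as rows of pixel intensities (hypothetical representation).
Frame = Sequence[Sequence[int]]


def motion_score(prev: Frame, curr: Frame) -> float:
    """Mean absolute pixel difference between two frames.

    This is only a stand-in for motion estimation; the real system
    uses an estimator designed for mobile devices.
    """
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += abs(p - c)
            count += 1
    return total / count if count else 0.0


def frames_to_upload(frames: List[Frame],
                     threshold: float = 2.0,
                     still_frames: int = 2) -> List[int]:
    """Return indices of frames that would be sent to the server.

    A frame is uploaded once the motion score has stayed below
    `threshold` for `still_frames` consecutive frame pairs. Nothing
    more is uploaded until the user moves the phone again.
    """
    uploads = []
    still = 0
    sent = False
    for i in range(1, len(frames)):
        if motion_score(frames[i - 1], frames[i]) < threshold:
            still += 1
            if still >= still_frames and not sent:
                uploads.append(i)
                sent = True
        else:
            # High motion: the user is shifting focus, so reset the state.
            still = 0
            sent = False
    return uploads
```

Uploading once per low-motion period, rather than on every still frame, keeps network traffic low while the user holds the phone on a single cover.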

On the server, accurate recognition is achieved through feature-based image matching. Robust local features are extracted from the uploaded viewfinder frame and matched against a database of pre-extracted features. This approach, currently popular in computer vision, enables us to reliably find the correct media cover despite many distortions in the viewfinder frame. Feature-based recognition also enables us to localize the position of the object within the viewfinder frame. All stages of the recognition process, from feature extraction to feature matching, have been optimized to run with very low latencies while maintaining high recognition accuracy.
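As a rough illustration of feature-based matching, the sketch below counts descriptor matches between a query frame and each database cover and returns the cover with the most matches. It rests on assumptions not stated above: descriptors are plain tuples of floats, matching is brute-force nearest neighbor with a distance-ratio test, and the names `count_matches` and `best_cover` are hypothetical. The speed-optimized server pipeline and the localization step are not reproduced here.

```python
import math
from typing import Dict, List, Sequence, Tuple

Descriptor = Sequence[float]


def l2(a: Descriptor, b: Descriptor) -> float:
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def count_matches(query: List[Descriptor],
                  candidate: List[Descriptor],
                  ratio: float = 0.8) -> int:
    """Count query descriptors with a distinctive nearest neighbor.

    A match is accepted only if the nearest candidate descriptor is
    clearly closer than the second nearest (ratio test, an assumption).
    """
    matches = 0
    for q in query:
        dists = sorted(l2(q, c) for c in candidate)
        if len(dists) >= 2 and dists[0] < ratio * dists[1]:
            matches += 1
    return matches


def best_cover(query: List[Descriptor],
               database: Dict[str, List[Descriptor]]
               ) -> Tuple[str, Dict[str, int]]:
    """Return the database entry with the most matches, plus all scores."""
    scores = {name: count_matches(query, feats)
              for name, feats in database.items()}
    return max(scores, key=scores.get), scores
```

Brute-force matching of this kind scales linearly with database size, which is why large databases typically rely on an index structure instead.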





Tree Histogram Coding for Mobile Image Search


We have also developed a coding scheme for large-scale image search that significantly reduces the bit rate compared to other state-of-the-art feature coding techniques. Previous image retrieval systems transmit compressed feature descriptors, an approach well suited to pairwise image matching. For fast retrieval from large databases, however, scalable vocabulary trees are commonly employed. In our work, we demonstrate a rate-efficient codec designed for tree-based retrieval. By encoding a tree histogram, our codec achieves a more than 5x rate reduction compared to sending compressed feature descriptors. By discarding the order among a list of features, histogram coding requires a 1.5x lower rate than sending a tree node index for every feature. Probability models are developed for the tree histogram symbols to enable arithmetic coding. Recently, our codec has been integrated into a real-time system for CD/DVD cover recognition with a camera-phone. In the following video, we compare the bit rates and upload times of a system that sends compressed descriptors and a system that sends a compressed tree histogram.
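The saving from discarding feature order can be illustrated with a simple counting argument. The sketch below is not the codec described above: it uses neither the probability models nor arithmetic coding, and it assumes every tree node is equally likely. The function names are hypothetical. It compares the bits needed for an ordered list of node indices with the bits needed to identify an unordered histogram (a multiset) of the same indices.

```python
import math
from collections import Counter
from typing import Iterable


def tree_histogram(node_indices: Iterable[int]) -> Counter:
    """Count how many features fall into each tree node.

    The result is the same for any ordering of the input.
    """
    return Counter(node_indices)


def ordered_index_bits(n_features: int, n_nodes: int) -> float:
    """Bits for one fixed-length node index per feature, in order."""
    return n_features * math.log2(n_nodes)


def histogram_bits(n_features: int, n_nodes: int) -> float:
    """Bits to identify one multiset of n_features over n_nodes.

    The number of such multisets is C(n_features + n_nodes - 1,
    n_features), so its base-2 logarithm is the cost of an ideal
    code under the uniform assumption.
    """
    return math.log2(math.comb(n_features + n_nodes - 1, n_features))


if __name__ == "__main__":
    # Illustrative sizes only; not taken from the system described above.
    n, k = 500, 10**6
    print(f"ordered indices: {ordered_index_bits(n, k):.0f} bits")
    print(f"histogram:       {histogram_bits(n, k):.0f} bits")
```

When the number of nodes is much larger than the number of features, the gap between the two costs is close to the base-2 logarithm of the number of orderings of the features, which is the information the histogram discards.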





Multiview Scalable Vocabulary Trees


A scalable vocabulary tree (SVT) built from fronto-parallel database images is ineffective at classifying query images that suffer from perspective distortion. In this work, we propose an efficient server-side extension of the single-view SVT to a set of multiview SVTs that may be simultaneously employed for image classification. Our solution results in significantly better retrieval performance when perspective distortion is present. Multiview SVTs are used in our real-time CD/DVD cover recognition demo. In the following video, we compare the recognition accuracies of a system without multiview SVTs and a system that employs multiview SVTs.
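One way to picture several trees being employed simultaneously is score fusion across views. The sketch below is an assumption, not the method described above: it supposes that each view's tree has already produced a similarity score per database image, and it keeps the best score for each image across views. The view names and the functions `fuse_view_scores` and `classify` are hypothetical.

```python
from typing import Dict

# Similarity scores per database image, as produced by one view's tree.
Scores = Dict[str, float]


def fuse_view_scores(per_view_scores: Dict[str, Scores]) -> Scores:
    """Keep, for each database image, its best score over all views."""
    fused: Scores = {}
    for scores in per_view_scores.values():
        for image_id, score in scores.items():
            fused[image_id] = max(score, fused.get(image_id, float("-inf")))
    return fused


def classify(per_view_scores: Dict[str, Scores]) -> str:
    """Return the database image with the highest fused score."""
    fused = fuse_view_scores(per_view_scores)
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Hypothetical scores for a query photographed at an angle.
    per_view = {
        "fronto_parallel": {"cover_a": 0.2, "cover_b": 0.3},
        "tilted_left": {"cover_a": 0.9, "cover_b": 0.1},
    }
    print(classify(per_view))
```

In this toy example the fronto-parallel tree alone would pick the wrong cover, while the tilted view's score lets the fused result recover the correct one.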




