STANFORD

 CS294A
 Projects in Holistic Scene Understanding
 Autumn 2009



Project Descriptions

Straight Lines in Images

This project explores the use of straight lines in images for predicting geometric properties of a scene, for example, the location of the horizon in outdoor scenes, or the location of the walls, ceiling and floor in indoor scenes. Lines may also be useful for detecting windows in buildings or the edge between a building and the sky. These latter properties may be useful for providing scale. We can provide a dataset of outdoor scenes; indoor scenes will need to be collected separately.
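
As one illustration of how lines can constrain the horizon, the sketch below intersects the images of parallel ground-plane lines to obtain vanishing points and joins two such points into a horizon line. It is only a sketch under stated assumptions: line segments are taken as already detected and grouped into parallel pairs, and all function names are hypothetical rather than part of any course code.

```python
def cross(a, b):
    """Cross product of two 3-vectors (homogeneous points or lines)."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line (a, b, c) with a*x + b*y + c = 0 through two image points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersection(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel in the image."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        return None
    return (x / w, y / w)


def horizon_from_segments(pair_a, pair_b):
    """Estimate the horizon from two pairs of segments.

    Assumption: each pair contains two segments ((x1, y1), (x2, y2)) that are
    images of parallel lines on the ground plane, and the two pairs have
    different 3D directions. Returns (vp_a, vp_b, horizon_line) or None.
    """
    vp_a = intersection(line_through(*pair_a[0]), line_through(*pair_a[1]))
    vp_b = intersection(line_through(*pair_b[0]), line_through(*pair_b[1]))
    if vp_a is None or vp_b is None:
        return None
    return vp_a, vp_b, line_through(vp_a, vp_b)
```

With real detections, many noisy segments would vote for each vanishing point, so a robust estimator would replace the exact two-line intersection used here.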

Oversegmentation Study

Good contours and regions help provide good object boundaries, speed up the region configuration search and escape strong local optima. In this project, the effect of a state-of-the-art contour detector and its segmentation output will be studied in a scene understanding framework. In particular, augment the current oversegmentation dictionary with the hierarchical segmentation from Gould et al. (ICCV 2009), which will provide a richer set of proposal moves. In addition, take the contour detector output and use it as a region boundary penalty, which will be incorporated into the global energy function and interact with its other components.

Global (Contrastive) Learning for Scene Understanding

Learning large CRF models is a difficult machine learning problem. The goal of this project is to express the region-based scene understanding models of Gould et al. (ICCV 2009; NIPS 2009) as a standard log-linear CRF and use a global contrastive learning objective to optimize the weights. This is a challenging project: you will need strong programming experience to integrate with an existing C/C++ framework, and a good background in machine learning.
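
To make the terms concrete, the sketch below shows a log-linear model, in which a labeling is scored by a weighted sum of its features, together with one possible reading of a contrastive objective: the log-probability of the ground-truth labeling normalized over a small set of contrasting labelings instead of the full label space. The project description does not specify the objective, so this choice, the feature vectors and the function names are all assumptions made for illustration.

```python
import math


def score(weights, features):
    """Log-linear score: dot product of weights and a labeling's feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def _candidate_probs(weights, candidates):
    """Softmax over the scores of a small set of candidate labelings."""
    scores = [score(weights, f) for f in candidates]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def contrastive_log_likelihood(weights, truth_features, contrast_features):
    """Log-probability of the ground truth, normalized over truth plus contrast set."""
    probs = _candidate_probs(weights, [truth_features] + list(contrast_features))
    return math.log(probs[0])


def gradient(weights, truth_features, contrast_features):
    """Gradient of the objective: truth features minus expected features."""
    candidates = [truth_features] + list(contrast_features)
    probs = _candidate_probs(weights, candidates)
    expected = [sum(p * f[k] for p, f in zip(probs, candidates))
                for k in range(len(weights))]
    return [t - e for t, e in zip(truth_features, expected)]
```

A learning loop would repeatedly add a small multiple of this gradient to the weights, with the contrasting labelings typically drawn from the inference procedure itself.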

Object-based Proposal Moves

The aim of this project is to design and exploit different ways of proposing object-like moves (that is, generating object region hypotheses) for the region-based scene understanding model of Gould et al. (NIPS 2009). Three examples follow. First, take the set of candidate object windows returned by a sliding-window object detector and determine the best set of pixels inside each window based on the object model and the appearance inside the region (e.g., graph cut based on object masks). Second, explicitly detect parts of objects and vote for the position of the object based on the likelihood of each part and the spatial constraints among them. Third, match the current regions at some point during inference to a dictionary of object regions and propose object regions based on the matches.
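
The second example, part-based voting, can be sketched as follows. Each detected part casts a vote for the object center, weighted by its likelihood, and the votes are accumulated on a coarse grid. This is a simplified Hough-style scheme written for illustration: the fixed per-part offsets stand in for the spatial constraints, and the data layout and names are hypothetical.

```python
from collections import defaultdict


def vote_for_object_center(part_detections, expected_offsets, cell=10):
    """Accumulate likelihood-weighted votes for the object center.

    part_detections: list of (part_name, x, y, likelihood).
    expected_offsets: part_name -> (dx, dy), the assumed displacement from
        the part to the object center.
    cell: size in pixels of the square voting bins.
    Returns ((center_x, center_y), total_vote) for the best bin, or None.
    """
    votes = defaultdict(float)
    for name, x, y, likelihood in part_detections:
        if name not in expected_offsets:
            continue  # unknown part type: it cannot vote
        dx, dy = expected_offsets[name]
        key = (int((x + dx) // cell), int((y + dy) // cell))
        votes[key] += likelihood
    if not votes:
        return None
    best = max(votes, key=votes.get)
    center = ((best[0] + 0.5) * cell, (best[1] + 0.5) * cell)
    return center, votes[best]
```

The winning bin gives a hypothesized object location, which could then seed a proposed object region for the inference procedure.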

Understanding Indoor Scenes

Most scene understanding models make strong assumptions about the scene. For example, the work of Gould et al. (ICCV 2009) assumes an outdoor scene with a flat ground plane. Indoor scenes have very different geometry from outdoor scenes. This project involves developing a scene understanding model that captures the geometry and semantics of indoor scenes. A new dataset will need to be collected for this project. This is a challenging project.

Experimenting with Object Detection Methods

A large number of object detection algorithms have been proposed. Many of these use a sliding-window approach for searching the image. The aim of this project is to adapt these sliding-window based algorithms (and others) to work in a holistic scene understanding framework. Such algorithms will include Berkeley region descriptors, patch-based detectors, HoG features and variants of these. Some of the code for these algorithms already exists or is available online.
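
For reference, the sliding-window search itself can be sketched in a few lines: a fixed-aspect window is moved across the image at several scales and each placement is scored by a classifier. The window size, stride, scales and the `score_window` callback below are placeholder assumptions, not settings from any particular detector.

```python
def sliding_windows(image_width, image_height, window=(64, 128),
                    stride=8, scales=(1.0, 1.5, 2.0)):
    """Yield (x, y, w, h) boxes covering the image at several scales."""
    for s in scales:
        w, h = int(round(window[0] * s)), int(round(window[1] * s))
        step = max(1, int(round(stride * s)))
        # Windows larger than the image produce empty ranges and are skipped.
        for y in range(0, image_height - h + 1, step):
            for x in range(0, image_width - w + 1, step):
                yield (x, y, w, h)


def detect(image_width, image_height, score_window, threshold=0.5, **window_args):
    """Score every window with a user-supplied function and keep the confident ones."""
    detections = []
    for box in sliding_windows(image_width, image_height, **window_args):
        s = score_window(box)
        if s >= threshold:
            detections.append((box, s))
    return detections
```

In a holistic framework the resulting boxes would be treated as hypotheses to be reconciled with the region labeling, not as final decisions.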

Global Scale for Improving Object Detection

Under perspective projection, the image size of objects in the same category varies with their distance from the camera. The aim of this project is to take advantage of the relationship between the scale/size of an object and the viewpoint (the distance between the object and the camera, and the camera parameters) to help object detection. This work builds on the ideas of Hoiem et al. (IJCV, 2008).
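
The relationship between image size and viewpoint can be made concrete under simplifying assumptions: a pinhole camera with a horizontal optical axis, a flat ground plane, an object standing on the ground, and image rows that increase downward. Under these assumptions the height of an object follows from its top and bottom rows, the horizon row and the camera height. The sketch below is derived from that geometry for illustration; it is not taken from the cited paper, and the function names are hypothetical.

```python
def object_height_from_image(v_top, v_bottom, v_horizon, camera_height):
    """World height of an object standing on the ground plane.

    For a ground point at depth z, v_bottom - v_horizon = f * camera_height / z,
    and the object's pixel height is v_bottom - v_top = f * height / z, so
    height = camera_height * (v_bottom - v_top) / (v_bottom - v_horizon).
    """
    if v_bottom <= v_horizon:
        raise ValueError("object base must lie below the horizon row")
    return camera_height * (v_bottom - v_top) / (v_bottom - v_horizon)


def expected_pixel_height(object_height, v_bottom, v_horizon, camera_height):
    """Pixel height a detector should expect for an object of known world height."""
    if v_bottom <= v_horizon:
        raise ValueError("object base must lie below the horizon row")
    return object_height * (v_bottom - v_horizon) / camera_height
```

A detector could use `expected_pixel_height` to restrict the scales it searches at each image row, or to down-weight detections whose size is implausible for their position.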

Person Pose Detection

The aim of this project is to detect the pose of a person (or people) in images. This is made particularly difficult by the large variation in poses that a person can take, as well as by complicating factors such as self-occlusion.

Learn to Describe a Scene

The aim of this project is to automatically extract a short textual description of an image which captures the relationships between objects in the scene. Initially, the project will use groundtruth segmentations and labels to construct a 2.5D model of the scene from which relationships can be extracted (e.g., the cow is eating grass or the horse is jumping over the fence). Later the model will need to handle noisy input such as from an automatic scene understanding algorithm.

High-level Scene Descriptor Variables

A simple description or classification of a scene can help segmentation and detection of objects and activities in a scene. For example, if a scene is labeled as sailing, then it is very likely to contain the sea and sailboats and very unlikely to contain grass. However, in most cases we don't have labels for the images, and sometimes a scene can be given multiple labels. For example, the sailing scene may also be labeled as sea, sunset and so on. If we treat the scene label as a hidden variable which links to objects/regions in the scene, then, given its cardinality, we can learn the distribution over this variable and the conditional distributions over other variables in the scene (e.g., objects, background regions). Given this model, not only might the segmentation and detection results improve, but the hidden variable will also induce a clustering of the test images, which will help organize images by semantic meaning.
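
As a small illustration of how a scene variable links to objects, the sketch below computes a posterior over scene labels from a set of observed object classes, assuming the objects are conditionally independent given the scene. The probability tables in the usage example are made-up numbers; in the project they would be learned, with the scene variable hidden during training.

```python
import math


def scene_posterior(observed_objects, prior, likelihood, floor=1e-6):
    """Posterior P(scene | observed objects) under a naive Bayes assumption.

    prior: scene -> P(scene).
    likelihood: scene -> {object: P(object present | scene)}.
    floor: probability assigned to objects missing from a scene's table.
    """
    log_post = {}
    for scene, p in prior.items():
        lp = math.log(p)
        for obj in observed_objects:
            lp += math.log(likelihood[scene].get(obj, floor))
        log_post[scene] = lp
    m = max(log_post.values())  # subtract the max for numerical stability
    z = sum(math.exp(v - m) for v in log_post.values())
    return {scene: math.exp(v - m) / z for scene, v in log_post.items()}


# Illustrative, made-up tables for two scene types.
prior = {"sailing": 0.5, "pasture": 0.5}
likelihood = {
    "sailing": {"sea": 0.9, "sailboat": 0.8, "grass": 0.05},
    "pasture": {"sea": 0.05, "sailboat": 0.01, "grass": 0.9},
}
posterior = scene_posterior(["sea", "sailboat"], prior, likelihood)
```

Grouping test images by their most probable scene value gives the kind of clustering described above.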

Learning from Google/Flickr Retrieval Data

Learning an object model requires collecting a large set of training data and associating each sample with a label. This paradigm requires many man-hours and is hard to scale to hundreds of thousands of object categories. In this project, the aim is to explore the possibility of collecting training samples from search engines and image collection websites by simply typing the name of the object category. This requires dealing with the noise in the search engine output and handling intra-class variation.

Using Noisy Text Annotations to Improve Scene Understanding

Many image databases (such as Flickr) allow users to tag images with textual descriptions or provide short captions. This project aims to make use of such information at test time to improve scene understanding tasks (segmentation, object detection, scene classification). For example, an image tagged with the word "polo" will most likely contain horses and people as well as background classes like grass. However, while these tags are often informative, they can also be noisy, e.g., "trunk" may refer to the back of a car or the nose of an elephant. This project will need to take such ambiguities into account.

Modeling Attentive Vision

Human vision relies on two different mechanisms to help understand a scene: Low-resolution peripheral vision provides us with a global gist of the scene and allows us to track objects as they move through the scene. High-resolution foveal vision, on the other hand, allows us to obtain detailed information for object recognition by fixating on certain areas of the scene. The movement of the fovea from one fixation to the next (known as a saccade) is controlled by an attentive mechanism which attempts to move the gaze to the most salient region of the scene (for the current task). The aim of this project is to develop an attentive computer vision system for understanding static scenes. The idea is to start with a low-resolution image and progressively increase the resolution of parts of the image as needed for object recognition. This approach should result in significant computational cost savings.
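
The coarse-to-fine idea can be sketched on a toy grayscale image stored as a list of rows: build a low-resolution version by block averaging, pick the most salient coarse cell, and return only that cell at full resolution. The saliency measure used here (distance from the mean intensity) is a deliberately crude placeholder, and all names are hypothetical.

```python
def downsample(image, factor):
    """Average non-overlapping factor x factor blocks of a 2D list of numbers."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    return [[sum(image[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor)) / factor ** 2
             for c in range(cols)]
            for r in range(rows)]


def most_salient_cell(coarse):
    """Return (row, col) of the coarse cell that differs most from the mean intensity."""
    values = [v for row in coarse for v in row]
    mean = sum(values) / len(values)
    best = max((abs(v - mean), r, c)
               for r, row in enumerate(coarse) for c, v in enumerate(row))
    return best[1], best[2]


def fixate(image, factor):
    """Choose a fixation on the coarse image and return that region at full resolution."""
    coarse = downsample(image, factor)
    r, c = most_salient_cell(coarse)
    crop = [row[c * factor:(c + 1) * factor]
            for row in image[r * factor:(r + 1) * factor]]
    return (r, c), crop
```

Only the returned crop would be passed to an expensive recognizer, which is where the computational savings would come from; a task-dependent saliency model would replace the placeholder.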

Parts/Regions of Objects

Visually, objects can be viewed as a collection of parts/regions arranged at certain spatial locations. Different lines of research use different representations of the parts, such as landmarks/key points, patches and regions. There are also different ways to represent the spatial constraints between parts, ranging from parametric probability distributions over part locations to implicit non-parametric models. This is an open-ended project in which you can design and experiment with object models that use hybrid representations based on key points, patches and regions.

Propose Your Own

Propose your own project that fits into the theme of the class. Your proposal will need to be approved by the instructor or TAs.


Comments to cs294a-qa@cs.stanford.edu.
