CS294A
Projects in Holistic Scene Understanding
Autumn 2009
Project Descriptions
Straight Lines in Images
This project explores the use of straight lines in images for
predicting geometric properties of a scene, for example, the location
of the horizon in outdoor scenes, or the location of walls, ceiling
and floor in indoor scenes. Lines may also be useful for detecting
windows in buildings or the edge between a building and the sky. These
latter properties may be useful for providing scale. We can provide a
dataset of outdoor scenes; indoor scenes will need to be collected
separately.
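As one illustration of how lines can predict the horizon, the sketch
below intersects two pairs of image segments that are assumed to be
parallel in the world (for example, building edges in two different
horizontal directions) to get two vanishing points, and takes the line
through them as the horizon. The function names and the choice of
hand-picked segment pairs are assumptions made for illustration; a
real system would need line detection and robust grouping first.

```python
def cross(a, b):
    """Cross product of two 3-vectors (homogeneous points or lines)."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line (a, b, c) with ax + by + c = 0 through points p and q."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersection(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel in the image."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        return None
    return (x / w, y / w)


def horizon_from_segments(pair_a, pair_b):
    """Estimate the horizon from two pairs of segments.

    Each pair holds two segments ((x1, y1), (x2, y2)) assumed parallel in
    the world and lying in horizontal planes (an assumption of this sketch).
    """
    vp_a = intersection(line_through(*pair_a[0]), line_through(*pair_a[1]))
    vp_b = intersection(line_through(*pair_b[0]), line_through(*pair_b[1]))
    if vp_a is None or vp_b is None:
        return None
    return line_through(vp_a, vp_b)


def horizon_row_at(line, x):
    """Image row of the horizon at column x (requires a non-vertical horizon)."""
    a, b, c = line
    return -(a * x + c) / b
```

With only two segment pairs the estimate is very sensitive to noise;
voting over many detected lines would be the natural next step.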
Oversegmentation Study
Good contours and regions are helpful for providing good object
boundaries, speeding up the region configuration search process and
escaping strong local optima. In this project, the effect of a
state-of-the-art contour detector and its segmentation output will be
studied in a scene understanding framework. In particular, augment the
current oversegmentation dictionary with the hierarchical segmentation
from Gould et al. (ICCV 2009), which will provide a richer set of
proposal moves. In addition, take the contour detector output and use
it as the region boundary penalty, which will be incorporated in the
global energy function and interact with other components.
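A minimal sketch of such a boundary penalty follows. It assumes the
contour detector returns a per-pixel strength in [0, 1] and charges
each pair of neighboring pixels with different region labels a cost
that decays with contour strength. The exponential form and the
parameter beta are assumptions for illustration, not the term used in
the existing framework.

```python
import math


def boundary_penalty(labels, contour, beta=5.0):
    """Sum a penalty over 4-connected neighbor pairs with different region labels.

    labels  : 2D list of region ids
    contour : 2D list of contour strengths in [0, 1] (hypothetical detector output)
    Each cut pair costs exp(-beta * mean contour strength), so region
    boundaries that follow strong contours are cheap.
    """
    rows, cols = len(labels), len(labels[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so each neighbor pair is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and labels[r][c] != labels[rr][cc]:
                    strength = 0.5 * (contour[r][c] + contour[rr][cc])
                    total += math.exp(-beta * strength)
    return total
```

Added to the global energy with a weight, a term like this would
favor region configurations whose boundaries coincide with detected
contours.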
Global (Contrastive) Learning for Scene Understanding
Learning large CRF models is a difficult machine learning problem. The
goal of this project is to express the region-based scene
understanding models of Gould et al. (ICCV 2009; NIPS 2009) as a
standard log-linear CRF and use a global contrastive learning
objective to optimize the weights. This is a challenging
project: You will need to have strong programming experience to
integrate with an existing C/C++ framework and a good background in
machine learning.
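To make the idea concrete, here is a small sketch of one possible
contrastive objective for a log-linear model: the score of an
assignment is the dot product of a weight vector with its feature
vector, and the loss is the negative log-probability of the
groundtruth assignment normalized over the groundtruth plus a small
set of contrastive (incorrect) assignments. The feature vectors, the
contrastive set and the plain gradient-descent loop are all
assumptions for illustration; the actual objective and optimizer are
for the project to decide.

```python
import math


def contrastive_loss_and_grad(w, f_true, f_contrast):
    """Loss and gradient for one example.

    w          : weight vector
    f_true     : feature vector of the groundtruth assignment
    f_contrast : list of feature vectors of contrastive assignments
    """
    candidates = [f_true] + list(f_contrast)
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in candidates]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    loss = -(scores[0] - m - math.log(z))
    # Gradient = expected features under the model minus groundtruth features.
    expected = [sum(p * f[k] for p, f in zip(probs, candidates))
                for k in range(len(w))]
    grad = [e - t for e, t in zip(expected, f_true)]
    return loss, grad


def train(f_true, f_contrast, dims, steps=100, lr=0.5):
    """Plain gradient descent on the contrastive loss (illustrative only)."""
    w = [0.0] * dims
    for _ in range(steps):
        _, grad = contrastive_loss_and_grad(w, f_true, f_contrast)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w
```

Because the normalization runs only over the contrastive set, the
partition function over all assignments never has to be computed,
which is what makes this kind of objective attractive for large CRFs.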
Object-based Proposal Moves
The aim of this project is to exploit and design different ways to
propose object-like moves (generating object region hypotheses) for
the region-based scene understanding model of Gould et al. (NIPS
2009). There are three examples: The first one is to take
the set of object candidate windows returned by a sliding window
object detector, and determine the best set of pixels inside the
window based on the object model and the appearance inside the region
(e.g. graphcut based on object masks). The second one is to explicitly
detect parts of objects and vote for the position of the object based
on the likelihood of each part and the spatial constraints among
them. The last one is to match the current regions at some point of
the inference to a dictionary of object regions and propose the object
regions based on the matching.
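The second example (part-based voting) can be sketched as follows.
Each detected part casts a vote for the object center at its own
position plus an expected offset, weighted by its detection
likelihood, and the votes are accumulated on a coarse grid. The part
names, the fixed offsets and the grid cell size are hypothetical; a
real model would learn the offsets and account for scale.

```python
from collections import defaultdict


def vote_for_object_center(part_detections, expected_offsets, cell=10):
    """Accumulate likelihood-weighted votes for the object center.

    part_detections  : list of (part_name, x, y, likelihood)
    expected_offsets : dict part_name -> (dx, dy) from part to object center
                       (hypothetical, assumed known here)
    Returns the center of the winning grid cell and its accumulated score.
    """
    votes = defaultdict(float)
    for name, x, y, likelihood in part_detections:
        if name not in expected_offsets:
            continue  # ignore parts the model does not know about
        dx, dy = expected_offsets[name]
        cx, cy = x + dx, y + dy
        votes[(int(cx // cell), int(cy // cell))] += likelihood
    if not votes:
        return None, 0.0
    best = max(votes, key=votes.get)
    center = ((best[0] + 0.5) * cell, (best[1] + 0.5) * cell)
    return center, votes[best]
```

The winning cell could then seed an object region proposal, for
example by growing a region around the voted center.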
Understanding Indoor Scenes
Most scene understanding models make strong assumptions about the
scene. For example, the work of Gould et al. (ICCV 2009) assumes an
outdoor scene with a flat ground plane. Indoor scenes have very
different geometry from outdoor scenes. This project involves
developing a scene-understanding model which captures the geometry and
semantics of indoor scenes. A new dataset will need to be collected
for this project. This is a challenging project.
Experimenting with Object Detection Methods
A large number of object detection algorithms have been proposed. Many
of these use a sliding-window approach for searching the image. The
aim of this project is to adapt these sliding-window based algorithms
(and others) to work in a holistic scene understanding framework.
Such algorithms will include Berkeley region descriptors, patch-based
detectors, HoG features and variants of these. Some of the code for
these algorithms already exists or is available online.
Global Scale for Improving Object Detection
Due to the nature of perspective projection, the scale/sizes of
objects in the same category vary as the distance between the objects
and the camera varies. The aim of this project is to take advantage of
the relationship between the scale/size of the object and the
viewpoint (distance between the object and the camera and the camera
parameters) to help object detection. This work builds on the ideas
of Hoiem et al. (IJCV, 2008).
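The relationship can be made concrete under simplifying assumptions:
a level camera (no tilt or roll) at a known height above a flat
ground plane, and objects standing on that plane. Then the ratio of an
object's pixel height to the distance of its base below the horizon
equals the ratio of its world height to the camera height. The sketch
below uses image rows that increase downward; the function names and
the example numbers in the note after it are assumptions for
illustration.

```python
def estimate_world_height(camera_height, horizon_row, top_row, bottom_row):
    """World height of an object standing on the ground plane.

    Assumes a level camera and rows increasing downward, so the object's
    base must lie below the horizon (bottom_row > horizon_row).
    """
    if bottom_row <= horizon_row:
        raise ValueError("object base must be below the horizon")
    return camera_height * (bottom_row - top_row) / (bottom_row - horizon_row)


def expected_pixel_height(world_height, camera_height, horizon_row, bottom_row):
    """Pixel height expected for an object of known world height at bottom_row."""
    if bottom_row <= horizon_row:
        raise ValueError("object base must be below the horizon")
    return world_height / camera_height * (bottom_row - horizon_row)
```

For instance, with the camera at 1.6 m and the horizon at row 200, a
candidate detection whose base is at row 300 should be about 100
pixels tall if the object is 1.6 m high; detections far from that
size can be down-weighted.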
Person Pose Detection
The aim of this project is to detect the pose of a person (or people)
in images. This is made particularly difficult by the large
variation of poses that a person can take as well as complicating
factors such as self-occlusion.
Learn to Describe a Scene
The aim of this project is to automatically extract a short textual
description of an image which captures the relationships between
objects in the scene. Initially, the project will use groundtruth
segmentations and labels to construct a 2.5D model of the scene from
which relationships can be extracted (e.g., the cow is eating grass or
the horse is jumping over the fence). Later the model will need to
handle noisy input such as from an automatic scene understanding
algorithm.
High-level Scene Descriptor Variables
A simple description or classification of a scene can help
segmentation and detection of objects and activities in a scene. For
example, if a scene is labeled as sailing, then it is very likely to
contain the sea and sailboats and very unlikely to contain
grass. However, in most cases we don't have labels for the images, and
sometimes a scene can be given multiple labels. For example, the
sailing scene may also be labeled as sea, sunset and so on. If we
treat the scene label as a hidden variable which links to
objects/regions in the scene, then we can learn the distribution over
this variable given its cardinality and the conditional
distributions over other variables in the scene (e.g. objects,
background regions). Given this model, not only might the segmentation
and detection results improve, but the hidden variable will also
induce a clustering of the test images, which will help organize
images according to their semantic content.
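As a sketch of the inference side of such a model, the code below
computes a posterior over a hidden scene variable from the classes
observed in an image, assuming the observations are conditionally
independent given the scene. The scene names, the probability tables
and the independence assumption are all hypothetical; in the project
the tables would be learned (for example with EM) rather than written
by hand.

```python
def scene_posterior(prior, likelihood, observed, floor=1e-6):
    """Posterior over the hidden scene variable.

    prior      : dict scene -> P(scene)
    likelihood : dict scene -> dict class -> P(class present | scene)
    observed   : list of class names found in the image
    floor      : probability used for classes missing from a table
    """
    unnormalized = {}
    for scene, p in prior.items():
        for cls in observed:
            p *= likelihood[scene].get(cls, floor)
        unnormalized[scene] = p
    z = sum(unnormalized.values())
    return {scene: value / z for scene, value in unnormalized.items()}
```

Assigning each test image to its most probable scene value gives the
clustering mentioned above.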
Learning from Google/Flickr Retrieval Data
Learning an object model requires collecting a large set of training
data and associating each sample with a label. This paradigm requires a
lot of man hours and is hard to scale to hundreds of thousands of
object categories. In this project, the aim is to explore the
possibility of collecting training samples from search engines and
image collection websites by simply typing the name of the object
category. This requires dealing with the noise in the output of the
search engine and handle the intra-class variations.
Using Noisy Text Annotations to Improve Scene Understanding
Many image databases (such as Flickr) allow users to tag images with
textual descriptions or provide short captions. This project aims to
make use of such information during test time to improve scene
understanding tasks (segmentation; object detection; scene
classification). For example, an image tagged with the word "polo"
will most likely contain horses and people as well as background
classes like grass. However, while these tags are often informative,
they can also be noisy, e.g. "trunk" may refer to the back of a car or
the nose of an elephant. This project will need to take into account
such ambiguities.
Modeling Attentive Vision
Human vision relies on two different mechanisms to help understand a
scene: Low-resolution peripheral vision provides us with a global gist
of the scene and allows us to track objects as they move through the
scene. High-resolution foveal vision, on the other hand, allows us to
obtain detailed information for object recognition by fixating on
certain areas of the scene. The shift of the fovea between fixations
(known as a saccade) is controlled by an attentive mechanism which attempts to
move the gaze to the most salient region of the scene (for the current
task). The aim of this project is to develop an attentive computer
vision system for understanding static scenes. The idea is to start
with a low-resolution image and progressively increase the resolution
of parts of the image as needed for object recognition. This approach
should result in significant computational cost savings.
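A toy version of this coarse-to-fine loop is sketched below for a
grayscale image stored as a 2D list. It builds a low-resolution view
by block averaging, picks the block that deviates most from the global
mean as a stand-in for saliency, and returns the full-resolution crop
of that block for further processing. The saliency measure and the
single fixation are assumptions for illustration; note also that block
averaging itself reads every pixel, so real savings depend on having a
cheap low-resolution input.

```python
def downsample(image, factor):
    """Low-resolution view: mean of each factor x factor block."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    return [[sum(image[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor)) / factor ** 2
             for c in range(cols)]
            for r in range(rows)]


def most_salient_cell(low_res):
    """Cell whose value deviates most from the global mean (assumed saliency)."""
    flat = [v for row in low_res for v in row]
    mean = sum(flat) / len(flat)
    _, r, c = max((abs(v - mean), r, c)
                  for r, row in enumerate(low_res)
                  for c, v in enumerate(row))
    return r, c


def foveate(image, factor):
    """Return the chosen low-res cell and its full-resolution crop."""
    low = downsample(image, factor)
    r, c = most_salient_cell(low)
    crop = [row[c * factor:(c + 1) * factor]
            for row in image[r * factor:(r + 1) * factor]]
    return (r, c), crop
```

An object recognizer would then run only on the returned crop, and
the loop could repeat with the next most salient cell.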
Parts/Regions of Objects
Visually, objects can be viewed as a collection of parts/regions
arranged at certain spatial locations. Different lines of research use
different representations of the parts, such as landmarks/key points,
patches and regions. Also, there are different ways to represent the
spatial constraints between parts, ranging from parametric
probabilistic distributions over part locations to implicit
non-parametric modeling. This is an open-ended project in which you
can experiment and design object models using hybrid representations
of objects using key points, patches and regions.
Propose Your Own
Propose your own project that fits into the theme of the class. Your
proposal will need to be approved by the instructor or TAs.