SELF-PACED LEARNING FOR LATENT VARIABLE MODELS
M. Pawan Kumar, B. Packer and D. Koller
Overview
We develop an accurate iterative algorithm for learning the parameters of a latent variable model, such as the latent structural SVM. Our approach uses the intuition that the learner should be presented with the training samples in a meaningful order: easy samples first, hard samples later. At each iteration, our approach simultaneously chooses the easy samples and updates the parameters.
Latent Variable Models
Latent variable models contain the following types of variables:
• Observed variables or data x, shown using filled circles in the graphical model representation. The value of these variables is known during both training and testing.
• Unobserved variables or output y, shown using unfilled circles in the graphical model. The value of these variables is known during training but not during testing.
• Latent or hidden variables h, shown using a dashed circle in the graphical model. The value of these variables is unknown during both training and testing.
Learning Latent Variable Models
Given a training dataset D = {(x_{i},y_{i}), i = 1,2,...,n}, we wish to learn the parameters w of the model. Typically, the parameters are learned by iteratively optimizing a function F(w) in two steps: (i) perform inference over latent variables; and (ii) update the parameters. Examples of such methods include the expectation-maximization (EM) algorithm for maximizing the likelihood and the concave-convex procedure (CCCP) algorithm for latent structural SVM.
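As a minimal, self-contained illustration of this two-step scheme, the sketch below runs hard EM on a toy one-dimensional clustering problem: the latent variable h_i is the cluster assignment of point x_i, step (i) infers h given the current parameters (two cluster means), and step (ii) refits the parameters given h. The specific model is our choice for illustration; EM for a richer latent variable model, or CCCP for a latent structural SVM, follows the same alternation.

```python
import numpy as np

def hard_em_two_means(x, m0, m1, n_iter=20):
    """Toy instance of the two-step scheme: infer latent variables,
    then update parameters (hard EM for two cluster means)."""
    for _ in range(n_iter):
        # (i) inference: assign each point to its nearest mean
        h = (np.abs(x - m1) < np.abs(x - m0)).astype(int)
        # (ii) parameter update: refit each mean on its assigned points
        if (h == 0).any():
            m0 = x[h == 0].mean()
        if (h == 1).any():
            m1 = x[h == 1].mean()
    return m0, m1, h
```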
Self-Paced Learning
Self-paced learning modifies the update step such that the new parameters are learned using only a subset of easy samples at each iteration. In more detail, let the original update step involve solving the following optimization problem:
min_{w} f(w) where f(w) = r(w) + Σ_{i} l_{i}(w).
The function r(w) is the regularization term and l_{i}(w) is the loss of the i^{th} sample. Self-paced learning replaces the above problem with the following:
min_{w,v} g(w,v) where g(w,v) = r(w) + Σ_{i} v_{i} l_{i}(w) - 1/K Σ_{i} v_{i}.
The term v_{i} is a binary variable that indicates whether the i^{th} sample is easy (v_{i} = 1) or not (v_{i} = 0), and K is the self-paced weight that determines how many samples are considered easy: since g(w,v) is linear in each v_{i} for a fixed w, the optimal setting is v_{i} = 1 exactly when l_{i}(w) < 1/K. As an illustration, consider the binary classification problem (circles vs. triangles) shown below. Easy samples are shown in green, difficult samples in red. As K decreases, the threshold 1/K increases and the number of samples deemed easy increases.
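For fixed w, the v-update therefore has a closed form: the coefficient of v_{i} in g(w,v) is l_{i}(w) - 1/K, so the minimizer sets v_{i} = 1 exactly when the loss falls below the threshold 1/K. A one-line sketch:

```python
import numpy as np

def update_v(losses, K):
    """Closed-form v-update for fixed w: a sample is easy (v_i = 1)
    exactly when its loss l_i(w) is below the threshold 1/K."""
    return (losses < 1.0 / K).astype(float)
```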
The overall self-paced learning algorithm is summarized below.
Initialization
• Input: Training dataset D = {(x_{i},y_{i}), i = 1,2,...,n}, anneal factor μ and tolerance ε.
• Set w ← w_{0} and K ← K_{0}.
Iteration
• Using the current estimate of the parameters, perform inference on the hidden variables for all samples.
• Update the parameters by optimizing g(w,v) using alternate search:
• Fix w and update v.
• Fix v and update w.
• Iterate the above two steps until w and v do not change.
• K ← K/μ
Termination
• Stop if v_{i} = 1 for all samples and the objective function F(w) cannot be improved by more than the tolerance ε.
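The full loop can be sketched on a concrete model. The version below uses a ridge regularizer r(w) = λ‖w‖² and the squared loss as stand-ins for the model-specific terms, and omits the latent-inference step (which depends on the model); it is a sketch of the annealing schedule and the alternate search, not the authors' latent structural SVM implementation. For this choice of loss, the w-update on the easy samples is a weighted ridge regression with a closed-form solution.

```python
import numpy as np

def self_paced_learn(X, y, K0=1.0, mu=1.3, lam=0.1, eps=1e-4, max_iter=100):
    """Self-paced learning loop on a toy least-squares model:
    alternate the closed-form v-update and the w-update, then anneal K."""
    n, d = X.shape
    w = np.zeros(d)
    K = K0
    prev_obj = np.inf
    for _ in range(max_iter):
        losses = (X @ w - y) ** 2                    # l_i(w)
        v = (losses < 1.0 / K).astype(float)         # fix w, update v
        # Fix v, update w: ridge regression weighted by v (easy samples only).
        W = np.diag(v)
        w = np.linalg.solve(X.T @ W @ X + lam * np.eye(d), X.T @ W @ y)
        # Original objective F(w) over all samples, for the stopping test.
        obj = lam * w @ w + np.sum((X @ w - y) ** 2)
        if v.all() and prev_obj - obj < eps:
            break
        prev_obj = obj
        K /= mu                                      # anneal: 1/K grows,
    return w                                         # more samples become easy
```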
Results
We tested self-paced learning on four standard machine learning applications that can be formulated as a latent structural SVM. We provide a comparison of self-paced learning with the state-of-the-art CCCP algorithm. The details of all the experiments are provided in our NIPS 2010 paper.
Object Classification
Input x is an image. Output y is the category of the object present in the image. For simplicity, we assume that each image only contains instances of a single object category. The hidden variables h model the bounding box of the object in the image.
We use the Mammals dataset, which consists of 271 images each containing one of 6 different mammals. The dataset is split in a 90%-10% train-test fold.
The figure below shows the imputed bounding box during four different iterations of the self-paced learning algorithm. Blue bounding boxes indicate easy samples, while red bounding boxes indicate difficult samples. Note that the algorithm automatically discards those samples whose bounding boxes are incorrect.
The figure below shows the value of the objective function, the training error and the testing error for five different folds of the dataset. Note that self-paced learning outperforms CCCP on all three measures.
Motif Finding
Input x is a DNA sequence. Output y indicates whether x binds to a protein of interest. The hidden variables h model the location of the motif that provides useful cues about the binding affinity of a sequence.
We use five randomly selected proteins from the UniProbe dataset. For each protein we have around 40,000 sequences, which we split into five different folds with 50% for training and 50% for testing.
The figure below shows the average Hamming distance between the imputed motifs over different iterations. The low value in the initial iterations of self-paced learning indicates that the algorithm is successful in selecting easy samples first. Note that the Hamming distance for the motifs obtained using self-paced learning is lower than that of the motifs obtained using CCCP at convergence.
The figure below shows the mean (over five folds) value of the objective function, the training error and the testing error for five different proteins. Once again, self-paced learning outperforms CCCP on all three measures.
Digit Recognition
Input x is an image. Output y is the digit shown in the image. For simplicity, we consider a binary classification problem where each class is a digit. The hidden variables h model the rotation of digits in the images.
We use the standard MNIST dataset that specifies a training and testing split for each digit.
The figure below shows the relative objective function value (between CCCP and self-paced learning), the training error and the testing error for four different binary classification tasks. While the performance of CCCP and self-paced learning is comparable for three of the tasks, self-paced learning significantly outperforms CCCP on the 1 vs. 7 classification task for several values of the parameter C.
Noun Phrase Coreference
Input x is a document. Output y is the clustering of the nouns in the document such that each cluster corresponds to a single object. The hidden variables h model a forest over the nouns, with each tree corresponding to a cluster in y.
We use the MUC6 dataset that consists of 60 documents. We use the 50-50 split into training and testing that was employed in [1].
The figure below shows the relative objective function value (between CCCP and self-paced learning), the training error and the testing error for different values of C. The bottom plot corresponds to the pairwise loss, while the top plot corresponds to the MITRE loss. Once again, self-paced learning outperforms CCCP.
Resources
All the datasets used in our experiments are publicly available. An API for self-paced learning of a latent structural SVM is available for download.
Publications
M. Pawan Kumar, B. Packer and D. Koller. Self-Paced Learning for Latent Variable Models. In NIPS, 2010.
References
[1] C.-N. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, 2009.