MPI example for alternating direction method of multipliers

This page gives a sample MPI implementation of an ADMM-based solver for the Lasso problem, as described in sections 6.4 and 9.2 of our paper. This implementation is intended to be pedagogical, so it is heavily commented, is not optimized for performance, and attempts to mirror the Matlab version of the solver.

Source code

The source code is available to view as syntax-highlighted HTML or for download.

The linear algebra operations in the solver use the GNU Scientific Library (GSL). This package is easy to install, well documented, and has a number of convenient data structures and functions for manipulating vectors and matrices, including a high-level BLAS interface. Users who are interested in better performance can link GSL to an optimized CBLAS library, such as one produced by ATLAS or provided by their hardware vendor (e.g., Accelerate.framework, ACML, or Intel MKL).
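
For readers who have not used GSL before, the following standalone example (not taken from the solver) is a minimal sketch of the vector and matrix types and the BLAS-level calls just described; it simply computes y = Ax for a small dense matrix.

#include <stdio.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>

/* gsl_demo.c: compile with something like
 *   gcc gsl_demo.c -lgsl -lgslcblas -lm -o gsl_demo */
int main(void) {
    size_t i, j;
    gsl_matrix *A = gsl_matrix_alloc(2, 3);   /* 2x3 dense matrix */
    gsl_vector *x = gsl_vector_alloc(3);
    gsl_vector *y = gsl_vector_alloc(2);

    /* fill A with A(i,j) = i + j and set every entry of x to 1 */
    for (i = 0; i < 2; i++)
        for (j = 0; j < 3; j++)
            gsl_matrix_set(A, i, j, (double)(i + j));
    gsl_vector_set_all(x, 1.0);

    /* y <- 1.0 * A * x + 0.0 * y, via GSL's high-level BLAS wrapper */
    gsl_blas_dgemv(CblasNoTrans, 1.0, A, x, 0.0, y);

    printf("y = (%g, %g)\n", gsl_vector_get(y, 0), gsl_vector_get(y, 1));

    gsl_matrix_free(A);
    gsl_vector_free(x);
    gsl_vector_free(y);
    return 0;
}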

Compiling and installing

  1. Download the tarball above. The package contains a Makefile, the solver, and a library for reading in matrix data.

  2. Install GSL and an MPI implementation like OpenMPI or MPICH. This is straightforward with any standard package manager on Linux. OpenMPI is bundled with Mac OS X 10.5 and later, so no additional installation is needed there. Ensure that the mpicc command is on your path after installing; some Linux distributions have mispackaged their MPI libraries. The mpicc command is a wrapper around the C compiler that adds the necessary MPI include paths and libraries.

  3. Edit the Makefile so that the GSLROOT variable points to the location where you installed GSL and the ARCH variable is set appropriately (most likely to i386 or x86_64). On some machines, it may be necessary to remove the ARCH flag entirely.

  4. Run make. This produces a binary called lasso.

This code has been tested on Mac OS X 10.6, Debian 6, and Ubuntu 10.04.

Running MPI programs

Here, we provide a brief tutorial on running MPI programs.

Running on one machine

MPI programs can be run in single-machine mode, which is convenient for testing. In this case, each worker is just a separate process on the machine, and the processes communicate directly, without needing any server running. Simply run the following command:

$ mpirun -np 4 lasso

This will spawn 4 processes, all of which will run lasso (i.e., MPI follows the single program, multiple data (SPMD) model). For ADMM, the first process will act as the master, and the remaining three will act as the slaves.
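
If you have not written MPI programs before, the toy program below (separate from the lasso solver, with made-up messages) shows the basic SPMD pattern the solver relies on: every process runs the same code, queries its rank, and rank 0 takes the master role while the remaining ranks act as workers.

#include <mpi.h>
#include <stdio.h>

/* spmd_demo.c: compile with mpicc spmd_demo.c -o spmd_demo
 * and run with mpirun -np 4 spmd_demo */
int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id, 0..size-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (rank == 0)
        printf("master: coordinating %d worker(s)\n", size - 1);
    else
        printf("worker %d: would read its own data shard and run its updates here\n", rank);

    MPI_Finalize();
    return 0;
}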

Running on multiple machines

Running on multiple machines requires that all the relevant machines are set up to communicate with each other appropriately; a discussion of how to do so is outside the scope of this tutorial. We assume you either have access to an existing MPI cluster or will use a cloud platform like Amazon EC2 (see below).

First, create a hostfile (the example below could be saved as sample_hostfile). This file tells MPI which machines can be used to run processes and how many processes can run on each node. The number of slots should never be set higher than the number of processor cores on each machine (see the Open MPI documentation on hostfiles for details).

foo.example.com slots=1
bar.example.com slots=1
baz.example.com slots=1
bat.example.com slots=1

This means MPI will run a single process on each machine. (This is not ideal computationally, since each machine likely has multiple CPUs, but it makes the bookkeeping simpler, since we don't need to think about which data each process on a given machine should use.)

To run a program using this hostfile, type the following:

$ mpirun --hostfile sample_hostfile -np 4 lasso

If you do not have access to an existing MPI cluster, it is possible to rent one from Amazon Web Services (AWS) at relatively low rates. Essentially, one can rent (virtual) machines from Amazon's Elastic Compute Cloud (EC2) by the hour and use Amazon's cloud storage services to store datasets and experimental results. The primary downside is that the machines must be set up more or less from scratch. Fully explaining how AWS works is outside the scope of this document, but to become familiar with it, we suggest the following steps.

  1. Go through the EC2 Getting Started Guide. This will take you through setting up an EC2 account and launching and connecting to a test machine instance.

  2. Install StarCluster, a suite of Python scripts that make it fairly straightforward to bring up a cluster of machines already set up with MPI and other scientific computing software.

  3. GSL is not installed on the default StarCluster machine images. Customize the default images by installing GSL.

  4. You can now launch a cluster using your custom image by following the StarCluster documentation, and make data available to the cluster as needed by following the appropriate EC2 documentation. It may suffice to just copy the local datasets onto each instance's local instance store, then copy any generated results that should be preserved back into an S3 bucket or attached EBS volume.

Example output

The package linked above includes a small dataset that can be used to verify that the code is working correctly. The dataset is sliced into four shards, A1.dat and b1.dat through A4.dat and b4.dat, stored in a subdirectory called data in the source tree. Solving the full problem requires 5 processes (one master and four slaves); the expected output is below:

$ mpirun -np 5 lasso
[0] reading A1.dat
[1] reading A1.dat
[2] reading A2.dat
[3] reading A3.dat
[4] reading A4.dat
[1] reading b1.dat
[3] reading b3.dat
[0] reading b1.dat
using lambda: 0.5000
[4] reading b4.dat
[2] reading b2.dat
 #     r norm    eps_pri     s norm   eps_dual  objective
 0     0.0000     0.0430     0.1692     0.0045    12.0262
 1     3.8267     0.0340     0.9591     0.0427    11.8101
 2     2.6698     0.0349     1.5638     0.0687    12.1617
 3     1.5666     0.0476     1.6647     0.0831    13.2944
 4     0.8126     0.0614     1.4461     0.0886    14.8081
 5     0.6825     0.0721     1.1210     0.0886    16.1636
 6     0.7332     0.0793     0.8389     0.0862    17.0764
 7     0.6889     0.0838     0.6616     0.0831    17.5325
 8     0.5750     0.0867     0.5551     0.0802    17.6658
 9     0.4539     0.0885     0.4675     0.0778    17.6560
10     0.3842     0.0897     0.3936     0.0759    17.5914
11     0.3121     0.0905     0.3389     0.0744    17.5154
12     0.2606     0.0912     0.2913     0.0733    17.4330
13     0.2245     0.0917     0.2558     0.0725    17.3519
14     0.1847     0.0923     0.2276     0.0720    17.2874
15     0.1622     0.0928     0.2076     0.0716    17.2312
16     0.1335     0.0934     0.1858     0.0713    17.1980
17     0.1214     0.0939     0.1689     0.0712    17.1803
18     0.1045     0.0944     0.1548     0.0710    17.1723
19     0.0931     0.0950     0.1344     0.0708    17.1768
20     0.0919     0.0954     0.1243     0.0707    17.1824
21     0.0723     0.0958     0.1152     0.0705    17.1867
22     0.0638     0.0962     0.1079     0.0704    17.1896
23     0.0570     0.0965     0.1019     0.0702    17.1900
24     0.0507     0.0968     0.0964     0.0701    17.1898
25     0.0460     0.0971     0.0917     0.0700    17.1885
26     0.0416     0.0973     0.0874     0.0699    17.1866
27     0.0382     0.0976     0.0834     0.0698    17.1846
28     0.0354     0.0978     0.0798     0.0697    17.1827
29     0.0329     0.0980     0.0762     0.0697    17.1815
30     0.0311     0.0983     0.0701     0.0696    17.1858
31     0.0355     0.0985     0.0667     0.0696    17.1890

The problem should take around a second to solve on an average laptop. The objective column does not report the objective value of the full problem, since computing that would require an additional round of message passing (evaluating the objective requires all the data). Instead, the column shows the value of the objective evaluated on the first data shard, A1.dat and b1.dat, only. This makes it easy to compare against results from other solvers without extra network communication. Modifying the source code above to compute the exact objective is straightforward; one possible approach is sketched below. Similarly, it is straightforward to modify the code to look for the data files somewhere other than the data subdirectory.
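
In the sketch below, the function and variable names (full_objective, A, b, z, lambda) are placeholders rather than the solver's actual identifiers. Each process computes the loss term for its own data shard, the terms are summed onto the master with one extra MPI_Reduce, and the master adds the lambda*||z||_1 regularization term.

#include <mpi.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_vector.h>
#include <gsl/gsl_blas.h>

/* Sketch only: A and b hold this process's data shard, z is the consensus
 * variable, and lambda is the regularization parameter. Returns the full
 * lasso objective on rank 0; other ranks only contribute their loss term. */
double full_objective(const gsl_matrix *A, const gsl_vector *b,
                      const gsl_vector *z, double lambda, int rank) {
    double nrm, local, total = 0.0;
    gsl_vector *r = gsl_vector_alloc(b->size);

    gsl_vector_memcpy(r, b);
    gsl_blas_dgemv(CblasNoTrans, 1.0, A, z, -1.0, r);   /* r = A z - b */

    nrm   = gsl_blas_dnrm2(r);
    local = 0.5 * nrm * nrm;                            /* (1/2)||A_i z - b_i||^2 */

    /* the extra round of message passing: sum the loss terms onto rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        total += lambda * gsl_blas_dasum(z);            /* add lambda * ||z||_1 */

    gsl_vector_free(r);
    return total;
}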