\documentclass[twoside]{article}

\usepackage{amssymb, amsmath}
\usepackage{mathrsfs}


\oddsidemargin  0in \evensidemargin 0in \topmargin -0.5in
\headheight 0.2in \headsep 0.2in \textwidth   6.5in \textheight 9in
\parskip 1.5ex  \parindent 0ex \footskip 40pt

\newcommand{\Nystrom}{Nystr$\ddot{o}$m  }

\begin{document}

\framebox[6.4in]{
\begin{minipage}{6.4in}
  \vspace{1mm}
  \center \makebox[6.2in]{{\bf CS369M: Algorithms for Modern Massive Data Set Analysis \hfill Lecture 16 - 11/11/2009}}
  \vspace{2mm} \\
  \center \makebox[6.2in]{{\Large Multiplicative Update Algorithms, Boosting and Ensemble Methods}}
  \vspace{1mm} \\
  \center \makebox[6.2in]{{\it Lecturer: Michael Mahoney \hfill Scribes: Mark Wagner and Yunting Sun }}
  \vspace{1mm}
\end{minipage}
} \vspace{2mm} \\
\mbox{{ \it *Unedited Notes}}

\section{Graph Partitioning}
 Expansion of random cuts
\[
\phi=\min_{S\subset V}\frac{E\left(S,\overline{S}\right)}{\frac{1}{n}\left|S\right|\left|\overline{S}\right|}\left(*1\right)\]


This can be relaxed to real numbers:\[
d-\lambda_{2}=\min_{x\in\mathbb{R}^{V}}\frac{\sum_{ij}A_{ij}\left(x_{i}-x_{j}\right)^{2}}{\sum_{ij}\left(x_{i}-x_{j}\right)^{2}}\rightarrow\text{spectral}\quad\left(*2\right)\]


or it can be relaxed to vectors.

\underbar{Claim:}
\[d-\lambda_{2}=\min_{x_{1},\ldots,x_{n}\in\mathbb{R}^{n}}\frac{\sum_{ij}A_{ij}\left|\left|x_{i}-x_{j}\right|\right|_{2}^{2}}{\sum_{ij}\left|\left|x_{i}-x_{j}\right|\right|_{2}^{2}}\quad\left(*3\right)\]

\underbar{ Proof:}

$\left(*3\right)$ is a relaxation of $\left(*2\right)$, the direct
solution of $\left(*3\right)$ is

\underbar{Claim}

$\left(*3\right)$ is equal to the SDP\begin{eqnarray*}
 & \min & \sum_{ij}A_{ij}\left|\left|x_{i}-x_{j}\right|\right|_{2}^{2}\\
 & \text{s.t.} & \sum_{ij}\left|\left|x_{i}-x_{j}\right|\right|_{2}^{2}=n\end{eqnarray*}
 which is equal to\begin{eqnarray*}
 & \min & \mathrm{tr}\left(L_{G}X\right)\\
 & \text{s.t.} & \mathrm{tr}\left(L_{K_{n}}X\right)=n\\
 &  & X\succeq0\quad\left(*4\right)\end{eqnarray*}
 This problem is harder to solve (an SDP versus an eigenvalue problem).

It is nevertheless useful, because we can
\begin{itemize}
\item look at duals
\item include {}``extra'' information
\end{itemize}
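
The step from the vector program to the matrix program $\left(*4\right)$ can be checked directly. The following is a sketch under conventions that the notes do not fix and that are assumed here: sums run over unordered pairs $\left\{ i,j\right\}$, the graph has no self-loops, $L_{G}=D-A$ is the combinatorial Laplacian with degrees $d_{i}$, and $X$ is the Gram matrix $X_{ij}=x_{i}^{T}x_{j}$.
\begin{eqnarray*}
\left|\left|x_{i}-x_{j}\right|\right|_{2}^{2} & = & X_{ii}+X_{jj}-2X_{ij}\\
\sum_{i<j}A_{ij}\left|\left|x_{i}-x_{j}\right|\right|_{2}^{2} & = & \sum_{i}d_{i}X_{ii}-\sum_{i\ne j}A_{ij}X_{ij}\\
 & = & \mathrm{tr}\left(\left(D-A\right)X\right)=\mathrm{tr}\left(L_{G}X\right)\end{eqnarray*}
The same computation with $A$ replaced by the adjacency matrix of the complete graph $K_{n}$ gives the constraint, and a symmetric matrix is the Gram matrix of some set of vectors exactly when it is positive semidefinite.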
\underbar{Fact:}

The dual of $\left(*4\right)$ is\begin{eqnarray*}
 & \max & y\\
 & \text{s.t.} & L_{G}\succeq\frac{y}{n}L_{K_{n}}\end{eqnarray*}
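
A short Lagrangian sketch of where this dual comes from, reading $\left(*4\right)$ as $\min\mathrm{tr}\left(L_{G}X\right)$ subject to $\mathrm{tr}\left(L_{K_{n}}X\right)=n$ and $X\succeq0$, where $L_{K_{n}}$ denotes the Laplacian of the complete graph. The multiplier $z$ is a name introduced here for illustration.
\begin{eqnarray*}
\mathcal{L}\left(X,z\right) & = & \mathrm{tr}\left(L_{G}X\right)+z\left(n-\mathrm{tr}\left(L_{K_{n}}X\right)\right)\\
 & = & zn+\mathrm{tr}\left(\left(L_{G}-zL_{K_{n}}\right)X\right)\end{eqnarray*}
Minimizing over $X\succeq0$ gives $zn$ if $L_{G}-zL_{K_{n}}\succeq0$ and $-\infty$ otherwise, so the dual is $\max zn$ subject to $L_{G}\succeq zL_{K_{n}}$. Substituting $y=zn$ gives the form stated above.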


A feasible solution is a number $y$ and a positive semidefinite matrix $Y\succeq0$ such that\[
L_{G}=\frac{y}{n}L_{K_{n}}+Y\]


Recall:\begin{eqnarray*}
\left|S\right|\left|\overline{S}\right| & = & 1_{S}^{T}L_{K_{n}}1_{S}\\
E\left(S,\overline{S}\right) & = & 1_{S}^{T}L_{G}1_{S}\end{eqnarray*}
 but\begin{eqnarray*}
1_{S}^{T}L_{G}1_{S} & = & 1_{S}^{T}\left(\frac{y}{n}L_{K_{n}}+Y\right)1_{S}\\
 & \ge & \frac{y}{n}1_{S}^{T}L_{K_{n}}1_{S}=\frac{y}{n}\left|S\right|\left|\overline{S}\right|\end{eqnarray*}
 since $1_{S}^{T}Y1_{S}\ge0$ when $Y$ is positive semidefinite. So the cost of every cut in $\left(*1\right)$ is $\ge y$.
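
The two recalled identities can be verified in a few lines. This is a sketch assuming an unweighted graph, $L_{G}=D-A$, and $L_{K_{n}}=nI-J$ with $J$ the all-ones matrix (the symbol $J$ is introduced here).
\begin{eqnarray*}
1_{S}^{T}L_{K_{n}}1_{S} & = & n\left|S\right|-\left|S\right|^{2}=\left|S\right|\left(n-\left|S\right|\right)=\left|S\right|\left|\overline{S}\right|\\
x^{T}L_{G}x & = & \sum_{\left\{ i,j\right\} \in E}\left(x_{i}-x_{j}\right)^{2}\end{eqnarray*}
With $x=1_{S}$, an edge contributes $1$ to the second sum exactly when it has one endpoint in $S$ and one in $\overline{S}$, so $1_{S}^{T}L_{G}1_{S}=E\left(S,\overline{S}\right)$.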

What's going on here?
\begin{itemize}
\item embedding a scaled version of the complete graph in $G$
\item we know the expansion and cut values for $K_{n}$ and so relate it
to $G$. Note: $K_{n}$ is an expander.
\end{itemize}
Recall

Flow - if graph $H$ of known expansion can be embedded in $G$ {}``as
a flow'' then $h_{H}\le h_{G}$.
\begin{itemize}
\item Then the optimal solution for a fixed $H$ can be computed as the
solution to a concurrent multicommodity flow problem.
\item $O\left(\log n\right)$ approximation, which is tight
\end{itemize}
Spectral - relax to an {}``eigenvalue problem'' and use Cheeger.

ARV-type methods
\begin{itemize}
\item Can I iteratively construct a graph $H$ (and test its expansion),
stop when it is a good expander, and get a bound on $h_{G}$?
\item yes - write as an SDP.
\item Can compute faster by using primal-dual ideas
\end{itemize}
ARV - original $O\left(\sqrt{\log n}\right)$

AHK - {}``primal-dual'' method in theoretical computer science.

both using multicommodity flows

KRV - single-commodity flows using cut matching game

OSVV - extended KRV

LMO - empirical evaluation. Describes as spectral modified

\section{Online Learning}

prediction/inference problem - given data predict something.

Ways to formalize this, different assumptions on
\begin{itemize}
\item what the data are (real numbers, graphs, strings)
\item where they come from or how they are generated (according to an underlying distribution; access to side information)
\end{itemize}
{}``Traditional Statistics''
\begin{itemize}
\item data generated according to an underlying distribution
\item learn parameters describing the distribution
\item evaluate quality by \emph{Risk} - expected value of some loss function
over the distribution of the data
\item ERM$\rightarrow$SRM
\end{itemize}
What if the data are not generated by some underlying process?

with no assumptions, hard to predict

\underbar{Idea}:

get data elements sequentially $\left\{ y_{i},x_{i}\right\} \in\mathbb{R}$

predict the next element.

Evaluated by the loss function e.g. number of incorrect predictions

Access to side information, namely the predictions of a set of {}``experts.''

Experts make predictions according to some rule: deterministic, random,
adversarial, etc.

At each time step, the experts also have a loss

Goal: want loss not too much worse than the {}``best'' expert.

Also: in prediction at time $t$ you have access to
\begin{itemize}
\item your prediction and losses in the past
\item predictions and losses of the experts in the past
\end{itemize}
What are the experts?
\begin{itemize}
\item oracle
\item statistical model
\item certain steps in an algorithm
\item basis functions
\end{itemize}

\section{Multiplicative weights update rule}
\begin{itemize}
\item maintain probability distribution over experts
\item at each step, increase or decrease the weight multiplicatively, i.e.
by multiplying by a factor such as $\left(1+\epsilon\right)$
\item $\epsilon$ = parameter that determines how much confidence to place in an expert's
prediction; it acts as a regularization parameter
\end{itemize}
Discrete Experts:
\begin{itemize}
\item set of experts $E$ that make predictions $f_{E_{i},t}\in\mathbb{R}^{n}$
\item set of vectors $\left\{ x\in\mathbb{R}^{n}:x_{i}\ge0,\ \sum_{i=1}^{n}x_{i}=1\right\} =\text{weights on experts}$
\item $l_{t}\left(i\right)=$ loss of expert $i$ at stage $t$
\item $l\left(\hat{p}_{t},y_{t}\right)=$ loss of algorithm = $\sum_{i=1}^{n}x_{i}l_{t}\left(i\right)$
\end{itemize}
Algorithm
\begin{enumerate}
\item $W_{1}=\vec{1}$
\item when $y_{t}$ and the experts' predictions are revealed, the algorithm uses this update
rule
\end{enumerate}
\begin{eqnarray*}
W_{t+1,i} & = & W_{t,i}\left(1-\epsilon\right)^{l_{t}\left(i\right)}\end{eqnarray*}
so that after $T$ rounds
\begin{eqnarray*}
W_{T+1,i} & = & \left(1-\epsilon\right)^{\sum_{t=1}^{T}l_{t}\left(i\right)}\\
 & = & e^{-\eta\sum_{t=1}^{T}l_{t}\left(i\right)},\textrm{ where }\quad\eta=-\log\left(1-\epsilon\right)\end{eqnarray*}
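
A small worked instance may help fix ideas. The numbers are made up for illustration: three experts, $\epsilon=\frac{1}{2}$, all weights starting at $1$, losses $\left(0,1,1\right)$ in the first round and $\left(0,1,0\right)$ in the second, and the algorithm playing the normalized weights $x=W/\sum_{i}W_{i}$.
\begin{eqnarray*}
\text{round 1:}\quad x=\left(\tfrac{1}{3},\tfrac{1}{3},\tfrac{1}{3}\right), & \text{loss}=\tfrac{2}{3}, & W\rightarrow\left(1,\tfrac{1}{2},\tfrac{1}{2}\right)\\
\text{round 2:}\quad x=\left(\tfrac{1}{2},\tfrac{1}{4},\tfrac{1}{4}\right), & \text{loss}=\tfrac{1}{4}, & W\rightarrow\left(1,\tfrac{1}{4},\tfrac{1}{2}\right)\end{eqnarray*}
After two rounds the distribution over experts is $\left(\tfrac{4}{7},\tfrac{1}{7},\tfrac{2}{7}\right)$: the expert with no losses carries most of the weight, and the algorithm's total loss is $\tfrac{2}{3}+\tfrac{1}{4}=\tfrac{11}{12}$.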




\underbar{Thm:}

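Assume the losses satisfy $l_{t}\left(i\right)\in\left[0,1\right]$. For any expert $i\in\left[n\right]$,\[
\sum_{t=1}^{T}l_{t}\left(\hat{p}_{t}\right)\le\frac{\log n}{\epsilon}+\frac{\eta}{\epsilon}\sum_{t=1}^{T}l_{t}\left(i\right),\quad\text{where }\eta=-\log\left(1-\epsilon\right)\]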


\underbar{Proof}

Use a potential function argument. Define the potential

\[
W_{t}=\sum_{i=1}^{n}W_{t,i}\]
 First, relate the potential function to the performance of expert $i$, with $\eta=-\log\left(1-\epsilon\right)$: \begin{eqnarray*}
W_{T+1} & \ge & W_{T+1,i}=\left(1-\epsilon\right)^{\sum_{t=1}^{T}l_{t}\left(i\right)}\\
 & = & e^{-\eta\sum_{t=1}^{T}l_{t}\left(i\right)}\end{eqnarray*}


Next relate the potential function to the performance of the algorithm:\[
W_{t+1}=\sum_{i=1}^{n}W_{t+1,i}=\sum_{i=1}^{n}W_{t,i}\left(1-\epsilon\right)^{l_{t}\left(i\right)}\]


Note $\left(1-\epsilon\right)^{x}\le1-\epsilon x$ for $0\le\epsilon\le1$ and $0\le x\le1$.

So \begin{eqnarray*}
W_{t+1} & \le & \sum_{i}W_{t,i}\left(1-\epsilon l_{t}\left(i\right)\right)\\
 & = & W_{t}\left(1-\frac{\epsilon}{W_{t}}\sum_{i}W_{t,i}l_{t}\left(i\right)\right)\\
 & = & W_{t}\left(1-\epsilon l_{t}\left(\hat{p}_{t}\right)\right)\\
 & \le & W_{t}\exp\left(-\epsilon l_{t}\left(\hat{p}_{t}\right)\right)\end{eqnarray*}
 Applying this for $t=1,\ldots,T$, and using that the initial potential is $n$ (all $n$ weights start at $1$),\[
W_{T+1}\le n\exp\left(-\epsilon\sum_{t=1}^{T}l_{t}\left(\hat{p}_{t}\right)\right)\]
 Combining the two bounds, taking logarithms, and dividing by $\epsilon$, \begin{eqnarray*}
e^{-\eta\sum_{t}l_{t}\left(i\right)} & \le & W_{T+1}\le ne^{-\epsilon\sum_{t}l_{t}\left(\hat{p}_{t}\right)}\\
-\frac{\eta}{\epsilon}\sum_{t}l_{t}\left(i\right) & \le & \frac{\log n}{\epsilon}-\sum_{t}l_{t}\left(\hat{p}_{t}\right)\\
\sum_{t}l_{t}\left(\hat{p}_{t}\right) & \le & \frac{\log n}{\epsilon}+\frac{\eta}{\epsilon}\sum_{t}l_{t}\left(i\right)\\
 & \le & \frac{\log n}{\epsilon}+\left(1+\epsilon\right)\sum_{t}l_{t}\left(i\right)\end{eqnarray*}
 where the last step uses $\eta=-\log\left(1-\epsilon\right)\le\epsilon+\epsilon^{2}$ for $0\le\epsilon\le\frac{1}{2}$.


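Define the regret\begin{eqnarray*}
R_{T} & = & \sum_{t=1}^{T}l_{t}\left(\hat{p}_{t}\right)-\min_{i\in\left[n\right]}\sum_{t=1}^{T}l_{t}\left(i\right)\\
 & \le & \frac{\log n}{\epsilon}+\epsilon\sum_{t=1}^{T}l_{t}\left(i^{*}\right)\le\frac{\log n}{\epsilon}+\epsilon T\end{eqnarray*}
 where $i^{*}$ is the best expert and the losses lie in $\left[0,1\right]$. If $\epsilon=\sqrt{\frac{\log n}{T}}$, then\[
R_{T}\le2\sqrt{T\log n}\]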
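
The choice of $\epsilon$ comes from balancing the two terms of the bound. As a sketch, bound $\sum_{t}l_{t}\left(i\right)\le T$ (losses in $\left[0,1\right]$) and minimize $f\left(\epsilon\right)=\frac{\log n}{\epsilon}+\epsilon T$, where $f$ is a name introduced here:
\begin{eqnarray*}
f'\left(\epsilon\right)=-\frac{\log n}{\epsilon^{2}}+T=0 & \Rightarrow & \epsilon=\sqrt{\frac{\log n}{T}}\\
f\left(\sqrt{\tfrac{\log n}{T}}\right) & = & \sqrt{T\log n}+\sqrt{T\log n}=2\sqrt{T\log n}\end{eqnarray*}
For illustration, with hypothetical values $n=100$ experts and $T=10^{4}$ rounds (natural logarithm), $\log n\approx4.605$, so $\epsilon\approx0.0215$ and the bound is about $429$, i.e. roughly $0.043$ extra loss per round compared with the best expert.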
 Q: is $\log n$ large or small?

If {}``extra'' information is given that one expert will be perfect,
then one can find the best expert after at most $\log n$ mistakes.

The multiplicative weights update rule says that, in more general cases, you are not much worse off than
in this scenario.
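
The $\log n$ mistake bound for the perfect-expert case can be seen by a halving argument. This is a sketch under assumptions not spelled out in the notes: predictions are binary, and the algorithm predicts with the majority of the experts that have made no mistake so far. Let $N_{m}$ (a symbol introduced here) be the number of such experts after the algorithm's $m$-th mistake. Each mistake means the majority was wrong, so\[
N_{0}=n,\qquad N_{m}\le\frac{N_{m-1}}{2}\quad\Rightarrow\quad1\le N_{m}\le\frac{n}{2^{m}}\quad\Rightarrow\quad m\le\log_{2}n\]
where $N_{m}\ge1$ because the perfect expert is never eliminated.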

\underbar{applications to algorithms}
\begin{itemize}
\item AHK $\rightarrow$ generalize the losses to matrix losses to solve
SDPs$\rightarrow O\left(n^{2}\right)$ time
\item KRV - {}``cut-matching'' game to solve sparsest cuts. 2 players:
a cut player, and a matching player
\end{itemize}
\begin{enumerate}
\item $G_{0}=\emptyset$ (the empty graph on the vertex set)
\item in each round, the cut player chooses a bisection $\left(S,\overline{S}\right)$
and the matching player chooses a perfect matching $M$ across $\left(S,\overline{S}\right)$.
Then $G_{t+1}=G_{t}+M$.
\item the game stops when $G$ is an expander, e.g. $l_{G}\ge\frac{1}{10}$
\item the value of the game is the number of steps it took. Goal: cut player - stop
soon (find an expander fast); matching player - delay the stop.
\end{enumerate}
Dual algorithm.
\begin{enumerate}
\item Let $G'=\gamma G$
\item approximate the 2nd eigenvector of $L_{G'}$. Degree of approximation
governed by regularization parameter.
\item use the bisection $\left(S_{n/2t},\overline{S}_{n/2t}\right)$ from
the sweep cut. Call a flow-based improvement algorithm to get a cut
$\left(T_{t},\overline{T_{t}}\right)$ and a matching $M_{t}$ until
the stopping rule is satisfied. Let $G'=G'+M_{t}$. Return the {}``best''
cut among $\left(T_{t}\right)_{t=1}^{T}$
\end{enumerate}
Why would you hope/expect that these multiplicative weight update
algorithms would perform well in practice?
\begin{itemize}
\item faster than naive computation
\item often give better answers than the exact algorithm
\end{itemize}
\underbar{Boosting} - example of an ensemble method

Given $X$, learn $c:X\rightarrow\left\{ 0,1\right\} $, a classification
rule from some concept class $C$

Risk = $E\left[error\right]$.

Define: a $\gamma$-weak learning algorithm is one that has error $\le\frac{1}{2}-\gamma$.

An $\epsilon$-strong learning algorithm is one that has error $\le\epsilon$.

Can one combine a set of weak learners into a strong learner?

Idea - weak learners are a little better than chance, so combining
them doesn't make things worse. If they are {}``different'' then
we can hope for improvement by averaging predictions.
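
To make the averaging intuition concrete, here is a sketch under an idealized assumption that boosting itself does not make: the $T$ weak learners err \emph{independently} on a given point, each with probability at most $\frac{1}{2}-\gamma$. Let $Z_{t}\in\left\{ 0,1\right\}$ (notation introduced here) indicate that learner $t$ errs. A majority vote errs only if at least half of the learners err, and Hoeffding's inequality gives\[
\Pr\left[\frac{1}{T}\sum_{t=1}^{T}Z_{t}\ge\frac{1}{2}\right]\le\exp\left(-2\gamma^{2}T\right)\]
For example, with the hypothetical values $\gamma=0.1$ and $T=500$, the bound is $e^{-10}\approx4.5\times10^{-5}$. In practice the learners are not independent, which is why boosting reweights the data to force each new rule to differ from the previous ones.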

Boosting - AdaBoost - do boosting by sampling. Take a sample of data
and use algorithm to boost on that sample
\begin{itemize}
\item do this in an iterative manner by {}``updating'' weights on data
points to find new classification rule for data points that are misclassified
\end{itemize}
Events - hypotheses for the classification rule output at each step

At each step - get a classification rule $h_{t}$ (weak learner); the
final classification algorithm uses $\sum_{t}h_{t}$ as its prediction
\end{document}
