Model Selection

web.stanford.edu/class/stats202

Sergio Bacallado, Jonathan Taylor

Autumn 2022

Best subset selection

Best subset with regsubsets

library(ISLR2) # where Credit is stored
library(leaps) # where regsubsets is found
summary(regsubsets(Balance ~ ., data=Credit))
## Subset selection object
## Call: regsubsets.formula(Balance ~ ., data = Credit)
## 11 Variables  (and intercept)
##             Forced in Forced out
## Income          FALSE      FALSE
## Limit           FALSE      FALSE
## Rating          FALSE      FALSE
## Cards           FALSE      FALSE
## Age             FALSE      FALSE
## Education       FALSE      FALSE
## OwnYes          FALSE      FALSE
## StudentYes      FALSE      FALSE
## MarriedYes      FALSE      FALSE
## RegionSouth     FALSE      FALSE
## RegionWest      FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          Income Limit Rating Cards Age Education OwnYes StudentYes MarriedYes
## 1  ( 1 ) " "    " "   "*"    " "   " " " "       " "    " "        " "       
## 2  ( 1 ) "*"    " "   "*"    " "   " " " "       " "    " "        " "       
## 3  ( 1 ) "*"    " "   "*"    " "   " " " "       " "    "*"        " "       
## 4  ( 1 ) "*"    "*"   " "    "*"   " " " "       " "    "*"        " "       
## 5  ( 1 ) "*"    "*"   "*"    "*"   " " " "       " "    "*"        " "       
## 6  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       " "    "*"        " "       
## 7  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       "*"    "*"        " "       
## 8  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       "*"    "*"        " "       
##          RegionSouth RegionWest
## 1  ( 1 ) " "         " "       
## 2  ( 1 ) " "         " "       
## 3  ( 1 ) " "         " "       
## 4  ( 1 ) " "         " "       
## 5  ( 1 ) " "         " "       
## 6  ( 1 ) " "         " "       
## 7  ( 1 ) " "         " "       
## 8  ( 1 ) " "         "*"

Choosing \(k\)

Naturally, \(\text{RSS}\) decreases and \[ R^2 = 1-\frac{\text{RSS}}{\text{TSS}} \] increases as we increase \(k\).

To optimize \(k\), we want to minimize the test error, not the training error.

We could use cross-validation, or alternative estimates of test error:

Mallows' \(C_p\) (proportional to AIC for least squares): \[\frac{1}{n}(\text{RSS}+2k\hat\sigma^2)\]

BIC: \[\frac{1}{n}(\text{RSS}+\log(n)\,k\,\hat\sigma^2)\]

Adjusted \(R^2\) (to be maximized): \[R^2_a = 1-\frac{\text{RSS}/(n-k-1)}{\text{TSS}/(n-1)}\]

How do these criteria compare to cross-validation?
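
These estimates are all returned by summary.regsubsets, so a sketch of comparing them on the Credit data (raising nvmax so subsets of every size are considered) could look like this:

library(ISLR2) # Credit
library(leaps) # regsubsets
reg.sum = summary(regsubsets(Balance ~ ., data=Credit, nvmax=11))
which.min(reg.sum$cp)    # k minimizing Mallows' Cp
which.min(reg.sum$bic)   # k minimizing BIC
which.max(reg.sum$adjr2) # k maximizing adjusted R^2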

Example: best subset selection for the Credit dataset

Best subset selection for the Credit dataset

Recall: In \(K\)-fold cross validation, we can estimate a standard error for our test error estimate, i.e. a measure of its accuracy. Then, we can apply the one-standard-error (1SE) rule.
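
A minimal sketch of the 1SE rule, assuming hypothetical vectors cv.err and cv.se that hold the \(K\)-fold CV estimate of test error and its standard error for each subset size:

best = which.min(cv.err)                # size with the smallest estimated test error
threshold = cv.err[best] + cv.se[best]  # one standard error above the minimum
k.1se = min(which(cv.err <= threshold)) # smallest model within one SE of the best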

Stepwise selection methods

Best subset selection has 2 problems:

  1. It is often very expensive computationally. We have to fit \(2^p\) models!

  2. For a fixed \(k\), there are so many possible subsets that we increase our chances of overfitting: the selected model has high variance.

In order to mitigate these problems, we can restrict our search space for the best model. This reduces the variance of the selected model at the expense of an increase in bias.

Forward selection

Algorithm (6.2 of ISLR)

  1. Let \({\cal M}_0\) denote the null model, which contains no predictors.

  2. For \(k=0, \dots, p-1\):

    1. Consider all \(p-k\) models that augment the predictors in \({\cal M}_k\) with one additional predictor.
    2. Choose the best among these \(p-k\) models, and call it \({\cal M}_{k+1}\). Here best is defined as having smallest RSS or highest \(R^2\).
  3. Select a single best model from among \({\cal M}_0, \dots, {\cal M}_p\) using cross-validated prediction error, AIC, BIC or adjusted \(R^2\).

Forward selection using step

M.full = lm(Balance ~ ., data=Credit)
M.null = lm(Balance ~ 1, data=Credit)
step(M.null, scope=list(upper=M.full), direction='forward', trace=0)
## 
## Call:
## lm(formula = Balance ~ Rating + Income + Student + Limit + Cards + 
##     Age, data = Credit)
## 
## Coefficients:
## (Intercept)       Rating       Income   StudentYes        Limit        Cards  
##   -493.7342       1.0912      -7.7951     425.6099       0.1937      18.2119  
##         Age  
##     -0.6241

Backward selection

Algorithm (6.3 of ISLR)

  1. Let \({\cal M}_p\) denote the full model, which contains all \(p\) predictors.

  2. For \(k=p,p-1,\dots,1\):

    1. Consider all \(k\) models that contain all but one of the predictors in \({\cal M}_k\) for a total of \(k-1\) predictors.
    2. Choose the best among these \(k\) models, and call it \({\cal M}_{k-1}\). Here best is defined as having smallest RSS or highest \(R^2\).
  3. Select a single best model from among \({\cal M}_0, \dots, {\cal M}_p\) using cross-validated prediction error, AIC, BIC or adjusted \(R^2\).

Backward selection

step(M.full, scope=list(lower=M.null), direction='backward', trace=0)
## 
## Call:
## lm(formula = Balance ~ Income + Limit + Rating + Cards + Age + 
##     Student, data = Credit)
## 
## Coefficients:
## (Intercept)       Income        Limit       Rating        Cards          Age  
##   -493.7342      -7.7951       0.1937       1.0912      18.2119      -0.6241  
##  StudentYes  
##    425.6099

Forward selection with BIC

step(M.null, scope=list(upper=M.full), direction='forward', trace=0, k=log(nrow(Credit))) # k defaults to 2, i.e. AIC
## 
## Call:
## lm(formula = Balance ~ Rating + Income + Student + Limit + Cards, 
##     data = Credit)
## 
## Coefficients:
## (Intercept)       Rating       Income   StudentYes        Limit        Cards  
##   -526.1555       1.0879      -7.8749     426.8501       0.1944      17.8517

Forward vs. backward selection

Consider a simulated example in which \(X_3\) is an exact linear combination of \(X_1\) and \(X_2\):

\[ \begin{aligned} X_3 &= X_1 + 3 X_2 \\ Y &= X_1 + 2X_2 + \epsilon \end{aligned} \]

Forward vs. backward selection

n = 5000
X1 = rnorm(n)
X2 = rnorm(n)
X3 = X1 + 3 * X2 
Y = X1 + 2 * X2 + rnorm(n)
D = data.frame(X1, X2, X3, Y)

Forward

step(lm(Y ~ 1, data=D), list(upper=~ X1 + X2 + X3), direction='forward')
## Start:  AIC=9000.85
## Y ~ 1
## 
##        Df Sum of Sq     RSS    AIC
## + X3    1   24872.6  5368.7  359.7
## + X2    1   20513.5  9727.8 3331.7
## + X1    1    4871.7 25369.6 8124.6
## <none>              30241.3 9000.9
## 
## Step:  AIC=359.7
## Y ~ X3
## 
##        Df Sum of Sq    RSS    AIC
## + X1    1    400.51 4968.1 -25.95
## + X2    1    400.51 4968.1 -25.95
## <none>              5368.7 359.70
## 
## Step:  AIC=-25.95
## Y ~ X3 + X1
## 
##        Df Sum of Sq    RSS     AIC
## <none>              4968.1 -25.952
## 
## Call:
## lm(formula = Y ~ X3 + X1, data = D)
## 
## Coefficients:
## (Intercept)           X3           X1  
##    -0.01336      0.67367      0.29766

Backward

step(lm(Y ~ X1 + X2 + X3, data=D), list(lower=~ 1), direction='backward')
## Start:  AIC=-25.95
## Y ~ X1 + X2 + X3
## 
## 
## Step:  AIC=-25.95
## Y ~ X1 + X2
## 
##        Df Sum of Sq     RSS    AIC
## <none>               4968.1  -26.0
## - X1    1    4759.6  9727.8 3331.7
## - X2    1   20401.4 25369.6 8124.6
## 
## Call:
## lm(formula = Y ~ X1 + X2, data = D)
## 
## Coefficients:
## (Intercept)           X1           X2  
##    -0.01336      0.97134      2.02102

Other stepwise selection strategies

Shrinkage methods

A mainstay of modern statistics!

Why would shrunk coefficients be better?

Ridge regression

Ridge regression solves the following optimization problem:

\[\min_{\beta} \;\;\; \color{Blue}{ \sum_{i=1}^n \left(y_i -\beta_0 -\sum_{j=1}^p \beta_jx_{i,j}\right)^2} + \color{Red}{\lambda \sum_{j=1}^p \beta_j^2}\]

Ridge regression

Ridge regression of balance in the Credit dataset

Ridge regression

Selecting \(\lambda\) by cross-validation for ridge regression
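
A sketch of how figures like these can be produced with glmnet (alpha=0 selects the ridge penalty; glmnet expects a numeric design matrix, hence model.matrix):

library(glmnet)
x = model.matrix(Balance ~ ., data=Credit)[, -1] # drop the intercept column
y = Credit$Balance
ridge.fit = glmnet(x, y, alpha=0)   # ridge path over a grid of lambda values
plot(ridge.fit, xvar='lambda')      # coefficients shrink towards zero but never reach it
set.seed(1)
ridge.cv = cv.glmnet(x, y, alpha=0) # 10-fold cross-validation over lambda
plot(ridge.cv)                      # CV error with one-standard-error bars
c(ridge.cv$lambda.min, ridge.cv$lambda.1se)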

Lasso regression

Lasso regression solves the following optimization problem:

\[\min_{\beta} \;\;\; \color{Blue}{ \sum_{i=1}^n \left(y_i -\beta_0 -\sum_{j=1}^p \beta_jx_{i,j}\right)^2} + \color{Red}{\lambda \sum_{j=1}^p |\beta_j|}\]

Why would we use the Lasso instead of Ridge regression?

Ridge regression of balance in the Credit dataset

A lot of pesky small coefficients throughout the regularization path

Lasso regression of balance in the Credit dataset

An alternative formulation for regularization

\[\underset{\beta}{\text{minimize}} ~~\left\{\sum_{i=1}^n \left(y_i -\beta_0 -\sum_{j=1}^p \beta_j x_{i,j}\right)^2 \right\} ~~~ \text{subject to}~~~ \sum_{j=1}^p \beta_j^2<s.\]

Visualizing Ridge and the Lasso with 2 predictors

\(\color{BlueGreen}{\text{Diamond}}:\ \sum_{j=1}^p |\beta_j| < s \qquad\qquad \color{BlueGreen}{\text{Circle}}:\ \sum_{j=1}^p \beta_j^2 < s\). The best subset constraint region with \(s=1\) is the union of the axes…

When is the Lasso better than Ridge?


Selecting \(\lambda\) by cross-validation for lasso regression
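
As with ridge, a glmnet sketch (alpha=1, the default, gives the lasso penalty):

library(glmnet)
x = model.matrix(Balance ~ ., data=Credit)[, -1]
y = Credit$Balance
lasso.fit = glmnet(x, y, alpha=1)   # lasso path over a grid of lambda values
plot(lasso.fit, xvar='lambda')      # coefficients hit exactly zero along the path
set.seed(1)
lasso.cv = cv.glmnet(x, y, alpha=1)
plot(lasso.cv)                      # CV error; dotted lines mark lambda.min and lambda.1se
coef(lasso.cv, s='lambda.1se')      # sparse coefficients at the one-standard-error lambda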

A very special case: Ridge

Suppose \(n = p\) and \(\mathbf{X} = I\) (no intercept). The ridge objective then separates into \(p\) one-dimensional problems,

\[(y_j-\beta_j)^2 + \lambda \beta_j^2,\]

and setting the derivative \(-2(y_j-\beta_j)+2\lambda\beta_j\) to zero gives

\[\hat \beta^R_j = \frac{y_j}{1+\lambda}.\]

A very special case: Lasso

In the same setting (\(n = p\), \(\mathbf{X} = I\)), the lasso minimizes

\[\sum_{j=1}^p (y_j-\beta_j)^2 + \lambda\sum_{j=1}^p |\beta_j|,\]

which also separates across coordinates.
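
For completeness, the standard result in this orthonormal case: the coordinate-wise minimizer is the soft-thresholding of \(y_j\), which is the function displayed in the next figure,

\[
\hat\beta^L_j =
\begin{cases}
y_j - \lambda/2 & \text{if } y_j > \lambda/2,\\
y_j + \lambda/2 & \text{if } y_j < -\lambda/2,\\
0 & \text{if } |y_j| \le \lambda/2.
\end{cases}
\]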

Lasso and Ridge coefficients as a function of \(y\)

Bayesian interpretation

Dimensionality reduction

Regularization methods

Principal Component Regression

princomp(USArrests, cor=TRUE)$loadings[,1] # cor=TRUE uses the correlation matrix, i.e. standardized variables
##    Murder   Assault  UrbanPop      Rape 
## 0.5358995 0.5831836 0.2781909 0.5434321

Interpretation: The first principal component measures the overall rate of crime.

Principal Component Regression

princomp(USArrests, cor=TRUE)$scores[,1] # cor=TRUE uses the correlation matrix, i.e. standardized variables
##        Alabama         Alaska        Arizona       Arkansas     California 
##     0.98556588     1.95013775     1.76316354    -0.14142029     2.52398013 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##     1.51456286    -1.35864746     0.04770931     3.01304227     1.63928304 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##    -0.91265715    -1.63979985     1.37891072    -0.50546136    -2.25364607 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##    -0.79688112    -0.75085907     1.56481798    -2.39682949     1.76336939 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##    -0.48616629     2.10844115    -1.69268181     0.99649446     0.69678733 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##    -1.18545191    -1.26563654     2.87439454    -2.38391541     0.18156611 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##     1.98002375     1.68257738     1.12337861    -2.99222562    -0.22596542 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##    -0.31178286     0.05912208    -0.88841582    -0.86377206     1.32072380 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##    -1.98777484     0.99974168     1.35513821    -0.55056526    -2.80141174 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##    -0.09633491    -0.21690338    -2.10858541    -2.07971417    -0.62942666

Interpretation: The scores for the first principal component measure the overall rate of crime in each state.

Principal Component Regression

\[ \begin{aligned} y_i &= \theta_0 + \theta_1 z_{i1} + \theta_2 z_{i2} + \dots + \theta_M z_{iM} + \varepsilon_i\\ & = \theta_0 + \theta_1 \sum_{j=1}^p \phi_{j1} x_{ij} + \theta_2 \sum_{j=1}^p \phi_{j2} x_{ij} + \dots + \theta_M \sum_{j=1}^p \phi_{jM} x_{ij} + \varepsilon_i \\ & = \theta_0 + \left[\sum_{m=1}^M \theta_m \phi_{1m}\right] x_{i1} + \dots + \left[\sum_{m=1}^M \theta_m \phi_{pm}\right] x_{ip} + \varepsilon_i \end{aligned} \]

Application to the Credit dataset
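
One way to carry out this analysis (a sketch, assuming the pls package is installed) is with pcr(), standardizing the predictors and cross-validating over the number of components:

library(pls)
set.seed(1)
pcr.fit = pcr(Balance ~ ., data=Credit, scale=TRUE, validation='CV')
validationplot(pcr.fit, val.type='MSEP') # CV error vs. number of components
summary(pcr.fit)                         # cross-validated error and % variance explained for each M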

Relationship between PCR and Ridge regression

\[RSS = (y-\mathbf{X}\beta)^T(y-\mathbf{X}\beta)\]

\[\frac{\partial RSS}{\partial \beta} = -2\mathbf{X}^T(y-\mathbf{X}\beta) = 0\]

\[\implies \hat \beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T y\]

\[(\mathbf{X}^T\mathbf{X})^{-1} = V D^{-1} V^T,\] where \(\mathbf{X}^T\mathbf{X} = VDV^T\) is its eigendecomposition, \(D = \text{diag}(d_1,\dots,d_p)\), and \(D^{-1} = \text{diag}(1/d_1,1/d_2,\dots,1/d_p)\).

Relationship between PCR and Ridge regression

\[RSS + \lambda \|\beta\|_2^2 = (y-\mathbf{X}\beta)^T(y-\mathbf{X}\beta) + \lambda \beta^T\beta\]

\[RSS = (y-\mathbf{X}\beta)^T(y-\mathbf{X}\beta)\]

\[\frac{\partial (RSS + \lambda \|\beta\|_2^2)}{\partial \beta} = -2\mathbf{X}^T(y-\mathbf{X}\beta) + 2\lambda\beta= 0\]

\[\implies \hat \beta^R_\lambda = (\mathbf{X}^T\mathbf{X} + \lambda I)^{-1}\mathbf{X}^T y\]

Relationship between PCR and Ridge regression

\[\hat y = \mathbf{X}\hat \beta = \sum_{j=1}^p u_j u_j^T y, \qquad u_j \text{ is the } j\text{th column of } U,\] where \(U\) is the matrix of left singular vectors of \(\mathbf{X}\).

\[\hat y = \mathbf{X}\hat \beta^R_\lambda = \sum_{j=1}^p u_j \frac{d_j}{d_j + \lambda} u_j^T y\]

\[\hat y = \mathbf{X}\hat \beta^\text{PC} = \sum_{j=1}^p u_j \mathbf{1}(j\leq M) u_j^T y\]
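
A quick numerical check of these shrinkage factors (a sketch on simulated data, using the convention above that the \(d_j\) are the eigenvalues of \(\mathbf{X}^T\mathbf{X}\), i.e. the squared singular values of \(\mathbf{X}\)):

set.seed(1)
n = 100; p = 3; lam = 2
X = scale(matrix(rnorm(n * p), n, p), center=TRUE, scale=FALSE)
y = rnorm(n)
sv = svd(X) # X = U diag(sigma) V^T
d = sv$d^2  # eigenvalues of X^T X
yhat.ridge = X %*% solve(t(X) %*% X + lam * diag(p), t(X) %*% y)
yhat.factors = sv$u %*% diag(d / (d + lam)) %*% t(sv$u) %*% y
max(abs(yhat.ridge - yhat.factors)) # essentially zero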

Principal Component Regression

Partial least squares

Algorithm

  1. Define \(Z_1 = \sum_{j=1}^p \phi_{j1} X_j\), where \(\phi_{j1}\) is the coefficient of a simple linear regression of \(Y\) onto \(X_j\).

  2. Let \(X_j^{(2)}\) be the residual of regressing \(X_j\) onto \(Z_1\).

  3. Define \(Z_2 = \sum_{j=1}^p \phi_{j2} X_j^{(2)}\), where \(\phi_{j2}\) is the coefficient of a simple linear regression of \(Y\) onto \(X_j^{(2)}\).

  4. Let \(X_j^{(3)}\) be the residual of regressing \(X_j^{(2)}\) onto \(Z_2\).

  5. Iterate to obtain \(M\) directions \(Z_1, \dots, Z_M\), then regress \(Y\) onto \(Z_1, \dots, Z_M\) by least squares (see the sketch below).
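
A minimal R sketch of the first two PLS directions on simulated data, assuming centered predictors and response (in practice one would use, e.g., plsr() in the pls package):

set.seed(1)
n = 200; p = 4
X = scale(matrix(rnorm(n * p), n, p), center=TRUE, scale=FALSE)
Y = drop(X %*% c(1, 2, 0, 0)) + rnorm(n)
Y = Y - mean(Y)
phi1 = apply(X, 2, function(xj) coef(lm(Y ~ xj))[2])  # step 1: simple regression slopes
Z1 = X %*% phi1
X2 = apply(X, 2, function(xj) resid(lm(xj ~ Z1)))     # step 2: residualize each X_j on Z1
phi2 = apply(X2, 2, function(xj) coef(lm(Y ~ xj))[2]) # step 3: slopes on residualized X_j
Z2 = X2 %*% phi2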

Partial least squares

High-dimensional regression

Some problems


Interpreting coefficients when \(p>n\)

Interpreting inference when \(p>n\)