\[y_i = \beta_0 + \beta_1 x_i +\varepsilon_i\]
Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)
Fit: the estimates \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to minimize the (training) residual sum of squares (RSS):
\[ \begin{aligned} \text{RSS}(\beta_0,\beta_1) &= \sum_{i=1}^n (y_i -\hat y_i(\beta_0, \beta_1))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_i)^2. \end{aligned} \]
library(ISLR2)
Advertising = read.csv("https://www.statlearning.com/s/Advertising.csv", header=TRUE, row.names=1)
M.sales = lm(sales ~ TV, data=Advertising)
M.sales
##
## Call:
## lm(formula = sales ~ TV, data = Advertising)
##
## Coefficients:
## (Intercept) TV
## 7.03259 0.04754
A little calculus shows that the minimizers of the RSS are:
\[ \begin{aligned} \hat \beta_1 & = \frac{\sum_{i=1}^n (x_i-\overline x)(y_i-\overline y)}{\sum_{i=1}^n (x_i-\overline x)^2} \\ \hat \beta_0 & = \overline y- \hat\beta_1\overline x. \end{aligned} \]
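As a sanity check, these closed-form estimates can be computed directly in R and compared with the lm coefficients shown above (a minimal sketch reusing the Advertising data and M.sales fit loaded earlier):
x = Advertising$TV; y = Advertising$sales
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope estimate
b0 = mean(y) - b1 * mean(x)                                      # intercept estimate
c(b0, b1)   # should agree with coef(M.sales): about 7.03259 and 0.04754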
\[ \begin{aligned} \text{SE}(\hat\beta_0)^2 &= \sigma^2\left[\frac{1}{n}+\frac{\overline x^2}{\sum_{i=1}^n(x_i-\overline x)^2}\right] \\ \text{SE}(\hat\beta_1)^2 &= \frac{\sigma^2}{\sum_{i=1}^n(x_i-\overline x)^2}. \end{aligned}\]
\[ \begin{aligned} \hat\beta_0 &\pm 2\cdot\text{SE}(\hat\beta_0) \\ \hat\beta_1 &\pm 2\cdot\text{SE}(\hat\beta_1) \end{aligned} \]
Null hypothesis \(H_0\): There is no relationship between \(X\) and \(Y\).
Alternative hypothesis \(H_a\): There is some relationship between \(X\) and \(Y\).
Based on our model, this translates to \(H_0:\beta_1=0\) versus \(H_a:\beta_1\neq 0\).
Test statistic:
\[\quad t = \frac{\hat\beta_1 -0}{\text{SE}(\hat\beta_1)}.\]
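The \(t\)-statistic and its \(p\)-value appear in the standard regression summary; the output below is presumably produced by:
summary(M.sales)   # the "t value" and "Pr(>|t|)" columns give the test statistic and p-value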
##
## Call:
## lm(formula = sales ~ TV, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
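The 95% confidence intervals shown next can be obtained with confint (again assuming the M.sales fit from above):
confint(M.sales, level=0.95)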
## 2.5 % 97.5 %
## (Intercept) 6.12971927 7.93546783
## TV 0.04223072 0.05284256
If we reject the null hypothesis, can we assume there is an exact linear relationship?
No. A quadratic relationship may be a better fit, for example. This test assumes the simple linear regression model is correct, which precludes a quadratic relationship.
If we don’t reject the null hypothesis, can we assume there is no relationship between \(X\) and \(Y\)?
No. This test is based on the model we posited above and is only powerful against certain monotone alternatives. There could be more complex non-linear relationships.
\[ \begin{aligned} y_i &= \beta_0 + \beta_1 x_{i1}+\dots+\beta_p x_{ip}+\varepsilon_i \\ Y &= \beta_0 + \beta_1 X_{1}+\dots+\beta_p X_{p}+\varepsilon \\ \end{aligned} \]
Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)
Fit: the estimates \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_p\) are chosen to minimize the residual sum of squares (RSS):
\[ \begin{aligned} \text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\ &= \|Y-X\beta\|^2_2 \end{aligned} \]
\[E({Y}) = {X}\beta\]
Is at least one of the variables \(X_i\) useful for predicting the outcome \(Y\)?
Which subset of the predictors is most important?
How good is a linear model for these data?
Given a set of predictor values, what is a likely value for \(Y\), and how accurate is this prediction?
Our goal again is to minimize the RSS: \[ \begin{aligned} \text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\ &= \|Y-X\beta\|^2_2 \end{aligned} \]
One can show that this is minimized by the vector \(\hat\beta\): \[\hat\beta = ({X}^T{X})^{-1}{X}^T{y}.\]
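A minimal sketch of this closed-form solution in R, assuming the Advertising data loaded above; model.matrix builds the design matrix \(X\), including the intercept column:
X = model.matrix(sales ~ TV, data=Advertising)    # design matrix with intercept column
y = Advertising$sales
beta.hat = solve(t(X) %*% X, t(X) %*% y)          # (X^T X)^{-1} X^T y
beta.hat                                          # matches coef(lm(sales ~ TV, data=Advertising))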
We usually write \(RSS=RSS(\hat{\beta})\) for the minimized RSS.
Consider the hypothesis: \(H_0:\) the last \(q\) predictors have no relation with \(Y\).
Based on our model: \(H_0:\beta_{p-q+1}=\beta_{p-q+2}=\dots=\beta_p=0.\)
Let \(\text{RSS}_0\) be the minimized residual sum of squares for the model which excludes these variables.
The \(F\)-statistic is defined by: \[F = \frac{(\text{RSS}_0-\text{RSS})/q}{\text{RSS}/(n-p-1)}.\]
Under the null hypothesis (of our model), this has an \(F\)-distribution with \(q\) and \(n-p-1\) degrees of freedom.
Example: If \(q=p\), we test whether any of the variables is important. \[\text{RSS}_0 = \sum_{i=1}^n(y_i-\overline y)^2 \]
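The Boston output below (medv regressed on all other variables) can presumably be reproduced with a full-model fit; M.boston is a name chosen here so later examples can refer to it:
M.boston = lm(medv ~ ., data=Boston)   # the Boston data set is included in ISLR2
summary(M.boston)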
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1304 -2.7673 -0.5814 1.9414 26.2526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.617270 4.936039 8.431 3.79e-16 ***
## crim -0.121389 0.033000 -3.678 0.000261 ***
## zn 0.046963 0.013879 3.384 0.000772 ***
## indus 0.013468 0.062145 0.217 0.828520
## chas 2.839993 0.870007 3.264 0.001173 **
## nox -18.758022 3.851355 -4.870 1.50e-06 ***
## rm 3.658119 0.420246 8.705 < 2e-16 ***
## age 0.003611 0.013329 0.271 0.786595
## dis -1.490754 0.201623 -7.394 6.17e-13 ***
## rad 0.289405 0.066908 4.325 1.84e-05 ***
## tax -0.012682 0.003801 -3.337 0.000912 ***
## ptratio -0.937533 0.132206 -7.091 4.63e-12 ***
## lstat -0.552019 0.050659 -10.897 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared: 0.7343, Adjusted R-squared: 0.7278
## F-statistic: 113.5 on 12 and 493 DF, p-value: < 2.2e-16
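The ANOVA table that follows compares the full Boston model with the model that drops indus and zn; it can presumably be reproduced with update and anova (M.boston0 is a hypothetical name):
M.boston0 = update(M.boston, . ~ . - indus - zn)   # reduced model without indus and zn
anova(M.boston0, M.boston)                         # F-test for dropping both predictors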
## Analysis of Variance Table
##
## Model 1: medv ~ (crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + lstat) - indus - zn
## Model 2: medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + lstat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 495 11614
## 2 493 11349 2 264.07 5.7355 0.003449 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The \(t\)-statistic associated with the \(i\)th predictor is the square root of the \(F\)-statistic for the null hypothesis that sets only \(\beta_i=0\).
A low \(p\)-value indicates that the predictor is important.
Warning: If there are many predictors, some of the \(t\)-tests will have low \(p\)-values just by chance, even when the model has no explanatory power.
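A small simulation (not part of the original notes) illustrates the warning: with 100 pure-noise predictors, roughly 5% of the \(t\)-tests come out "significant" at the 0.05 level by chance alone.
set.seed(1)
n = 200; p = 100
X = matrix(rnorm(n * p), n, p)                 # predictors unrelated to the response
y = rnorm(n)                                   # response is pure noise
pvals = summary(lm(y ~ X))$coefficients[-1, 4] # p-values for the 100 slope coefficients
mean(pvals < 0.05)                             # close to 0.05 despite no real signal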
When we select a subset of the predictors, we have \(2^p\) choices.
A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.
Choosing one model from the range produced is a form of tuning. This tuning can invalidate some of our methods, such as hypothesis tests and confidence intervals…
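One way to build such a range of models is forward stepwise selection; a hedged sketch using regsubsets from the leaps package (not used elsewhere in these notes), applied to the Boston data:
library(leaps)
fits = regsubsets(medv ~ ., data=Boston, nvmax=12, method="forward")  # best model of each size
summary(fits)$adjr2                                                   # adjusted R^2 by model size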
predict in R outputs predictions and confidence intervals from a linear model:
Advertising = read.csv("https://www.statlearning.com/s/Advertising.csv", header=TRUE, row.names=1)
M.sales = lm(sales ~ TV, data=Advertising)
predict(M.sales, data.frame(TV=c(50,150,250)), interval='confidence', level=0.95)
## fit lwr upr
## 1 9.409426 8.722696 10.09616
## 2 14.163090 13.708423 14.61776
## 3 18.916754 18.206189 19.62732
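The second set of intervals below is noticeably wider; it presumably comes from requesting prediction intervals instead, which also account for the noise in a single new observation:
predict(M.sales, data.frame(TV=c(50,150,250)), interval='prediction', level=0.95)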
## fit lwr upr
## 1 9.409426 2.946709 15.87214
## 2 14.163090 7.720898 20.60528
## 3 18.916754 12.451461 25.38205
Example: Credit dataset trying to predict \(Y=\) Balance.
For each qualitative predictor, e.g. Region:
Choose a baseline category, e.g. East.
For every other category, define a new predictor:
\(X_\text{South}\) is 1 if the sample is in the South region and 0 otherwise.
\(X_\text{West}\) is 1 if the sample is in the West region and 0 otherwise.
The model will be: \[Y = \beta_0 + \beta_1 X_1 +\dots +\beta_7 X_7 + \color{Red}{\beta_\text{South}} X_\text{South} + \beta_\text{West} X_\text{West} +\varepsilon.\]
The parameter \(\color{Red}{\beta_\text{South}}\) is the relative effect on Balance (our \(Y\)) of being from the South compared to the baseline category (East).
The model fit and predictions are independent of the choice of the baseline category.
However, hypothesis tests derived from these variables are affected by the choice.
Solution: To check whether Region is important, use an \(F\)-test for the hypothesis \(\beta_\text{South}=\beta_\text{West}=0\) by dropping Region from the model. This does not depend on the coding.
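A hedged sketch of this \(F\)-test on the Credit data from ISLR2 (column names Balance and Region assumed from that package; M.credit and M.noregion are names chosen here):
M.credit   = lm(Balance ~ ., data=Credit)        # Region enters via dummy variables
M.noregion = update(M.credit, . ~ . - Region)    # same model with Region dropped
anova(M.noregion, M.credit)                      # F-test for beta_South = beta_West = 0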
Note that there are other ways to encode qualitative predictors that produce the same fit \(\hat f\), but the coefficients have different interpretations.
So far, we have:
Defined Multiple Linear Regression
Discussed how to test the importance of variables.
Described one approach to choose a subset of variables.
Explained how to code qualitative variables.
Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?
To assess the fit, we focus on the residuals \[ e = Y - \hat{Y} \]
The RSS always decreases as we add more variables.
The residual standard error (RSE) corrects this: \[\text{RSE} = \sqrt{\frac{1}{n-p-1}\text{RSS}}.\]
Interactions between predictors
Non-linear relationships
Correlation of error terms
Non-constant variance of error (heteroskedasticity)
Outliers
High leverage points
Collinearity
Linear regression has an additive assumption: \[\mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\varepsilon\]
i.e. an increase of 100 USD in TV ads causes a fixed increase of \(100 \beta_1\) USD in sales on average, regardless of how much you spend on radio ads.
We saw that in Fig 3.5 above. If we visualize the fit and the observed points, we see they are not evenly scattered around the plane. This could be caused by an interaction.
\[\mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\color{Red}{\beta_3\times(\mathtt{tv}\cdot\mathtt{radio})}+\varepsilon\]
tv \(\cdot\) radio is high when both tv and radio are high.
##
## Call:
## lm(formula = sales ~ TV + radio + radio:TV, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
M = lm(sales ~ TV + radio + radio:TV, data=Advertising)
predict(M, data.frame(TV=50, radio=40), interval='confidence')
## fit lwr upr
## 1 11.03268 10.73224 11.33311
Example: Auto dataset.
A scatterplot between a predictor and the response may reveal a non-linear relationship.
\[\mathtt{MPG} = \beta_0 + \beta_1\times\mathtt{horsepower} + \beta_2 \times\mathtt{horsepower}^2 + \beta_3 \times\mathtt{horsepower}^3 + \beta_4 \times\mathtt{horsepower}^4 + \dots + \varepsilon \]
\[\mathtt{MPG} = \beta_0 + \beta_1\times h_1(\mathtt{horsepower}) + \beta_2 \times h_2(\mathtt{horsepower}) + \beta_3 \times h_3(\mathtt{horsepower}) + \beta_4 \times h_4(\mathtt{horsepower}) + \dots + \varepsilon \]
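A hedged sketch on the Auto data from ISLR2 (columns mpg and horsepower), where poly() supplies the basis functions \(h_1,\dots,h_4\) as an orthogonal polynomial basis:
M.auto = lm(mpg ~ poly(horsepower, 4), data=Auto)   # degree-4 polynomial in horsepower
summary(M.auto)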
In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?
\[y_i = f(x_i) + \varepsilon_i \quad;\quad \varepsilon_i \sim N(0,\sigma^2) \text{ i.i.d.}\]
What if this breaks down?
The main effect is that this invalidates any assertions about standard errors, confidence intervals, and hypothesis tests…
Example: Suppose that by accident, we duplicate the data (we use each sample twice). Then, the standard errors would be artificially smaller by a factor of \(\sqrt{2}\).
When could this happen in real life?
Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.
Spatial data: Each sample corresponds to a different location in space.
Grouped data: Imagine a study on predicting height from weight at birth. If some of the subjects in the study are in the same family, their shared environment could make them deviate from \(f(x)\) in similar ways.
Simulations of time series with increasing correlations between \(\varepsilon_i\)
The variance of the error depends on some characteristics of the input features.
To diagnose this, we can plot residuals vs. fitted values:
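For a fitted lm object, such as the Boston fit sketched earlier, the first built-in diagnostic plot shows residuals against fitted values:
plot(M.boston, which=1)   # residuals vs fitted values
# equivalently: plot(fitted(M.boston), resid(M.boston)); abline(h=0, lty=2)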
If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.
Outliers from a model are points with very high errors.
While they may not affect the fit, they might affect our assessment of model quality.
If we believe an outlier is due to an error in data collection, we can remove it.
An outlier might be evidence of a missing predictor, or the need to specify a more complex model.
Some samples with extreme inputs have an outsized effect on \(\hat \beta\).
This can be measured with the leverage statistic or self influence:
\[h_{ii} = \frac{\partial \hat y_i}{\partial y_i} = (\underbrace{ X ( X^T X)^{-1} X^T}_{\text{Hat matrix}})_{i,i} \in [1/n,1].\]
The residual \(e_i = y_i - \hat y_i\) is an estimate for the noise \(\varepsilon_i\).
The standard error of \(e_i\) is \(\sigma \sqrt{1-h_{ii}}\).
A studentized residual is \(e_i\) divided by its standard error (with an appropriate estimate of \(\sigma\)).
When the model is correct, it follows a Student \(t\)-distribution with \(n-p-2\) degrees of freedom.
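Both quantities are available directly in base R; a minimal sketch, again using a fitted lm object such as the Boston fit sketched earlier:
h = hatvalues(M.boston)   # leverage statistics h_ii
r = rstudent(M.boston)    # (externally) studentized residuals
plot(h, r, xlab="Leverage", ylab="Studentized residual")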
\[\mathtt{limit} \approx a\times\mathtt{rating}+b\]
Problem: The coefficients become unidentifiable.
Consider the extreme case of using two identical predictors limit:
\[
\begin{aligned}
\mathtt{balance} &= \beta_0 + \beta_1\times\mathtt{limit} + \beta_2\times\mathtt{limit} + \varepsilon \\
& = \beta_0 + (\beta_1+100)\times\mathtt{limit} + (\beta_2-100)\times\mathtt{limit} + \varepsilon
\end{aligned}
\]
For every \((\beta_0,\beta_1,\beta_2)\) the fit at \((\beta_0,\beta_1,\beta_2)\) is just as good as at \((\beta_0,\beta_1+100,\beta_2-100)\).
If 2 variables are collinear, we can easily diagnose this using their correlation.
A group of \(q\) variables is multicollinear if these variables “contain less information” than \(q\) independent variables.
Pairwise correlations may not reveal multicollinear variables.
The Variance Inflation Factor (VIF) measures how predictable \(X_j\) is from the other predictors; a high VIF indicates a largely redundant variable:
\[VIF(\hat \beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}},\] where \(R^2_{X_j|X_{-j}}\) is the \(R^2\) of the regression of \(X_j\) onto the remaining predictors.
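VIF can be computed from this definition by regressing each predictor on the others; the sketch below does this for the Boston fit used earlier (the car package's vif() function gives the same values):
X = model.matrix(M.boston)[, -1]                       # predictors only, intercept dropped
vif = sapply(colnames(X), function(j) {
  r2 = summary(lm(X[, j] ~ X[, -which(colnames(X) == j)]))$r.squared
  1 / (1 - r2)                                         # VIF_j = 1 / (1 - R^2_j)
})
round(vif, 2)
# or simply: car::vif(M.boston)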
KNN regression predicts at \(x\) by averaging the responses of the \(K\) nearest training points \(N_K(x)\): \[\hat f(x) = \frac{1}{K} \sum_{i\in N_K(x)} y_i\]
Linear regression: prototypical parametric method. Easy for inference.
KNN regression: prototypical nonparametric method. Inference less clear.
KNN is only better when the function \(f\) is far from linear (in which case the linear model is misspecified).
When \(n\) is not much larger than \(p\), even if \(f\) is nonlinear, linear regression can outperform KNN.
KNN has smaller bias, but this comes at the price of higher variance.
When \(p\gg n\), each sample has no near neighbors; this is known as the curse of dimensionality.
The variance of KNN regression is very large.