\[y_i = \beta_0 + \beta_1 x_i +\varepsilon_i\]
Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)
Fit: the estimates \(\hat\beta_0\) and \(\hat\beta_1\) are chosen to minimize the (training) residual sum of squares (RSS):
\[ \begin{aligned} \text{RSS}(\beta_0,\beta_1) &= \sum_{i=1}^n (y_i -\hat y_i(\beta_0, \beta_1))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_i)^2. \end{aligned} \]
library(ISLR2)
Advertising = read.csv("https://www.statlearning.com/s/Advertising.csv", header=TRUE, row.names=1)
M.sales = lm(sales ~ TV, data=Advertising)
M.sales
##
## Call:
## lm(formula = sales ~ TV, data = Advertising)
##
## Coefficients:
## (Intercept) TV
## 7.03259 0.04754
A little calculus shows that the minimizers of the RSS are:
\[ \begin{aligned} \hat \beta_1 & = \frac{\sum_{i=1}^n (x_i-\overline x)(y_i-\overline y)}{\sum_{i=1}^n (x_i-\overline x)^2} \\ \hat \beta_0 & = \overline y- \hat\beta_1\overline x. \end{aligned} \]
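As a sanity check, these closed-form estimates can be computed directly in R and compared with the lm coefficients shown above (a minimal sketch reusing the Advertising data and M.sales fit loaded earlier):
x = Advertising$TV; y = Advertising$sales
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope estimate
b0 = mean(y) - b1 * mean(x)                                      # intercept estimate
c(b0, b1)   # should agree with coef(M.sales): about 7.03259 and 0.04754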
\[ \begin{aligned} \text{SE}(\hat\beta_0)^2 &= \sigma^2\left[\frac{1}{n}+\frac{\overline x^2}{\sum_{i=1}^n(x_i-\overline x)^2}\right] \\ \text{SE}(\hat\beta_1)^2 &= \frac{\sigma^2}{\sum_{i=1}^n(x_i-\overline x)^2}. \end{aligned}\]
\[ \begin{aligned} \hat\beta_0 &\pm 2\cdot\text{SE}(\hat\beta_0) \\ \hat\beta_1 &\pm 2\cdot\text{SE}(\hat\beta_1) \end{aligned} \]
Null hypothesis \(H_0\): There is no relationship between \(X\) and \(Y\).
Alternative hypothesis \(H_a\): There is some relationship between \(X\) and \(Y\).
Based on our model, this translates to \(H_0:\beta_1=0\) versus \(H_a:\beta_1\neq 0\).
Test statistic:
\[\quad t = \frac{\hat\beta_1 -0}{\text{SE}(\hat\beta_1)}.\]
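The \(t\)-statistic and its \(p\)-value appear in the standard regression summary; the output below is presumably produced by:
summary(M.sales)   # the "t value" and "Pr(>|t|)" columns give the test statistic and p-value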
##
## Call:
## lm(formula = sales ~ TV, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
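The 95% confidence intervals shown next can be obtained with confint (again assuming the M.sales fit from above):
confint(M.sales, level=0.95)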
## 2.5 % 97.5 %
## (Intercept) 6.12971927 7.93546783
## TV 0.04223072 0.05284256
If we reject the null hypothesis, can we assume there is an exact linear relationship?
No. A quadratic relationship may be a better fit, for example. This test assumes the simple linear regression model is correct, which precludes a quadratic relationship.
If we don’t reject the null hypothesis, can we assume there is no relationship between \(X\) and \(Y\)?
No. This test is based on the model we posited above and is only powerful against certain monotone alternatives. There could be more complex non-linear relationships.
\[ \begin{aligned} y_i &= \beta_0 + \beta_1 x_{i1}+\dots+\beta_p x_{ip}+\varepsilon_i \\ Y &= \beta_0 + \beta_1 X_{1}+\dots+\beta_p X_{p}+\varepsilon \\ \end{aligned} \]
Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)
Fit: the estimates \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_p\) are chosen to minimize the residual sum of squares (RSS):
\[ \begin{aligned} \text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\ &= \|Y-X\beta\|^2_2 \end{aligned} \]
\[E({Y}) = {X}\beta\]
Is at least one of the variables \(X_i\) useful for predicting the outcome \(Y\)?
Which subset of the predictors is most important?
How good is a linear model for these data?
Given a set of predictor values, what is a likely value for \(Y\), and how accurate is this prediction?
Our goal again is to minimize the RSS: \[ \begin{aligned} \text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\ &= \|Y-X\beta\|^2_2 \end{aligned} \]
One can show that this is minimized by the vector \(\hat\beta\): \[\hat\beta = ({X}^T{X})^{-1}{X}^T{y}.\]
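A minimal sketch of this closed-form solution in R, assuming the Advertising data loaded above; model.matrix builds the design matrix \(X\), including the intercept column:
X = model.matrix(sales ~ TV, data=Advertising)    # design matrix with intercept column
y = Advertising$sales
beta.hat = solve(t(X) %*% X, t(X) %*% y)          # (X^T X)^{-1} X^T y
beta.hat                                          # matches coef(lm(sales ~ TV, data=Advertising))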
We usually write \(RSS=RSS(\hat{\beta})\) for the minimized RSS.
Consider the hypothesis: \(H_0:\) the last \(q\) predictors have no relation with \(Y\).
Based on our model: \(H_0:\beta_{p-q+1}=\beta_{p-q+2}=\dots=\beta_p=0.\)
Let \(\text{RSS}_0\) be the minimized residual sum of squares for the model which excludes these variables.
The \(F\)-statistic is defined by: \[F = \frac{(\text{RSS}_0-\text{RSS})/q}{\text{RSS}/(n-p-1)}.\]
Under the null hypothesis (of our model), this has an \(F\)-distribution with \(q\) and \(n-p-1\) degrees of freedom.
Example: If \(q=p\), we test whether any of the variables is important. \[\text{RSS}_0 = \sum_{i=1}^n(y_i-\overline y)^2 \]
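The Boston output below (medv regressed on all other variables) can presumably be reproduced with a full-model fit; M.boston is a name chosen here so later examples can refer to it:
M.boston = lm(medv ~ ., data=Boston)   # the Boston data set is included in ISLR2
summary(M.boston)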
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1304 -2.7673 -0.5814 1.9414 26.2526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.617270 4.936039 8.431 3.79e-16 ***
## crim -0.121389 0.033000 -3.678 0.000261 ***
## zn 0.046963 0.013879 3.384 0.000772 ***
## indus 0.013468 0.062145 0.217 0.828520
## chas 2.839993 0.870007 3.264 0.001173 **
## nox -18.758022 3.851355 -4.870 1.50e-06 ***
## rm 3.658119 0.420246 8.705 < 2e-16 ***
## age 0.003611 0.013329 0.271 0.786595
## dis -1.490754 0.201623 -7.394 6.17e-13 ***
## rad 0.289405 0.066908 4.325 1.84e-05 ***
## tax -0.012682 0.003801 -3.337 0.000912 ***
## ptratio -0.937533 0.132206 -7.091 4.63e-12 ***
## lstat -0.552019 0.050659 -10.897 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared: 0.7343, Adjusted R-squared: 0.7278
## F-statistic: 113.5 on 12 and 493 DF, p-value: < 2.2e-16
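The ANOVA table that follows compares the full Boston model with the model that drops indus and zn; it can presumably be reproduced with update and anova (M.boston0 is a hypothetical name):
M.boston0 = update(M.boston, . ~ . - indus - zn)   # reduced model without indus and zn
anova(M.boston0, M.boston)                         # F-test for dropping both predictors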
## Analysis of Variance Table
##
## Model 1: medv ~ (crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + lstat) - indus - zn
## Model 2: medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + lstat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 495 11614
## 2 493 11349 2 264.07 5.7355 0.003449 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The \(t\)-statistic associated with the \(i\)th predictor is the square root of the \(F\)-statistic for the null hypothesis that sets only \(\beta_i=0\).
A low \(p\)-value indicates that the predictor is important.
Warning: If there are many predictors, some of the \(t\)-tests will have low \(p\)-values just by chance, even when the model has no explanatory power.
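A small simulation (not part of the original notes) illustrates the warning: with 100 pure-noise predictors, roughly 5% of the \(t\)-tests come out "significant" at the 0.05 level by chance alone.
set.seed(1)
n = 200; p = 100
X = matrix(rnorm(n * p), n, p)                 # predictors unrelated to the response
y = rnorm(n)                                   # response is pure noise
pvals = summary(lm(y ~ X))$coefficients[-1, 4] # p-values for the 100 slope coefficients
mean(pvals < 0.05)                             # close to 0.05 despite no real signal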
When we select a subset of the predictors, we have \(2^p\) choices.
A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.
Choosing one model from the range produced is a form of tuning. This tuning can invalidate some of our methods, such as hypothesis tests and confidence intervals…
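One way to build such a range of models is forward stepwise selection; a hedged sketch using regsubsets from the leaps package (not used elsewhere in these notes), applied to the Boston data:
library(leaps)
fits = regsubsets(medv ~ ., data=Boston, nvmax=12, method="forward")  # best model of each size
summary(fits)$adjr2                                                   # adjusted R^2 by model size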
predict in R outputs predictions and confidence intervals from a linear model:
Advertising = read.csv("https://www.statlearning.com/s/Advertising.csv", header=TRUE, row.names=1)
M.sales = lm(sales ~ TV, data=Advertising)
predict(M.sales, data.frame(TV=c(50,150,250)), interval='confidence', level=0.95)
## fit lwr upr
## 1 9.409426 8.722696 10.09616
## 2 14.163090 13.708423 14.61776
## 3 18.916754 18.206189 19.62732
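The second set of intervals below is noticeably wider; it presumably comes from requesting prediction intervals instead, which also account for the noise in a single new observation:
predict(M.sales, data.frame(TV=c(50,150,250)), interval='prediction', level=0.95)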
## fit lwr upr
## 1 9.409426 2.946709 15.87214
## 2 14.163090 7.720898 20.60528
## 3 18.916754 12.451461 25.38205
Example: Credit dataset trying to predict \(Y=\) Balance.
For each qualitative predictor, e.g. Region:
Choose a baseline category, e.g. East.
For every other category, define a new predictor:
\(X_\text{South}\) is 1 if the sample is in the South region and 0 otherwise.
\(X_\text{West}\) is 1 if the sample is in the West region and 0 otherwise.
The model will be: \[Y = \beta_0 + \beta_1 X_1 +\dots +\beta_7 X_7 + \color{Red}{\beta_\text{South}} X_\text{South} + \beta_\text{West} X_\text{West} +\varepsilon.\]
The parameter \(\color{Red}{\beta_\text{South}}\) is the relative effect on Balance (our \(Y\)) of being from the South compared to the baseline category (East).
The model fit and predictions are independent of the choice of the baseline category.
However, hypothesis tests derived from these variables are affected by the choice.
Solution: To check whether Region is important, use an \(F\)-test for the hypothesis \(\beta_\text{South}=\beta_\text{West}=0\) by dropping Region from the model. This does not depend on the coding.
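A hedged sketch of this \(F\)-test on the Credit data from ISLR2 (column names Balance and Region assumed from that package; M.credit and M.noregion are names chosen here):
M.credit   = lm(Balance ~ ., data=Credit)        # Region enters via dummy variables
M.noregion = update(M.credit, . ~ . - Region)    # same model with Region dropped
anova(M.noregion, M.credit)                      # F-test for beta_South = beta_West = 0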
Note that there are other ways to encode qualitative predictors that produce the same fit \(\hat f\), but the coefficients have different interpretations.
So far, we have:
Defined Multiple Linear Regression
Discussed how to test the importance of variables.
Described one approach to choose a subset of variables.
Explained how to code qualitative variables.
Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?
To assess the fit, we focus on the residuals \[ e = Y - \hat{Y} \]
The RSS always decreases as we add more variables.
The residual standard error (RSE) corrects this: \[\text{RSE} = \sqrt{\frac{1}{n-p-1}\text{RSS}}.\]
Interactions between predictors
Non-linear relationships
Correlation of error terms
Non-constant variance of error (heteroskedasticity)
Outliers
High leverage points
Collinearity
Linear regression has an additive assumption: \[\mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\varepsilon\]
i.e. an increase of 100 USD in TV ads causes a fixed increase of \(100 \beta_1\) USD in sales on average, regardless of how much you spend on radio ads.
We saw that in Fig 3.5 above. If we visualize the fit and the observed points, we see they are not evenly scattered around the plane. This could be caused by an interaction.
\[\mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\color{Red}{\beta_3\times(\mathtt{tv}\cdot\mathtt{radio})}+\varepsilon\]
tv \(\cdot\) radio is high when both tv and radio are high.
##
## Call:
## lm(formula = sales ~ TV + radio + radio:TV, data = Advertising)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3366 -0.4028 0.1831 0.5948 1.5246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
## TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
## radio 2.886e-02 8.905e-03 3.241 0.0014 **
## TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
## F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16
M = lm(sales ~ TV + radio + radio:TV, data=Advertising)
predict(M, data.frame(TV=50, radio=40), interval='confidence')
## fit lwr upr
## 1 11.03268 10.73224 11.33311
Example: Auto dataset.
A scatterplot between a predictor and the response may reveal a non-linear relationship.
\[\mathtt{MPG} = \beta_0 + \beta_1\times\mathtt{horsepower} + \beta_2 \times\mathtt{horsepower}^2 + \beta_3 \times\mathtt{horsepower}^3 + \beta_4 \times\mathtt{horsepower}^4 + \dots + \varepsilon \]
\[\mathtt{MPG} = \beta_0 + \beta_1\times h_1(\mathtt{horsepower}) + \beta_2 \times h_2(\mathtt{horsepower}) + \beta_3 \times h_3(\mathtt{horsepower}) + \beta_4 \times h_4(\mathtt{horsepower}) + \dots + \varepsilon \]
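A hedged sketch on the Auto data from ISLR2 (columns mpg and horsepower), where poly() supplies the basis functions \(h_1,\dots,h_4\) as an orthogonal polynomial basis:
M.auto = lm(mpg ~ poly(horsepower, 4), data=Auto)   # degree-4 polynomial in horsepower
summary(M.auto)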
In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?
\[y_i = f(x_i) + \varepsilon_i \quad;\quad \varepsilon_i \sim N(0,\sigma^2) \text{ i.i.d.}\]
What if this breaks down?
The main effect is that this invalidates any assertions about standard errors, confidence intervals, and hypothesis tests…
Example: Suppose that by accident, we duplicate the data (we use each sample twice). Then, the standard errors would be artificially smaller by a factor of \(\sqrt{2}\).
When could this happen in real life?
Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.
Spatial data: Each sample corresponds to a different location in space.
Grouped data: Imagine a study on predicting height from weight at birth. If some of the subjects in the study are in the same family, their shared environment could make them deviate from \(f(x)\) in similar ways.
Simulations of time series with increasing correlations between \(\varepsilon_i\)
The variance of the error depends on some characteristics of the input features.
To diagnose this, we can plot residuals vs. fitted values:
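For a fitted lm object, such as the Boston fit sketched earlier, the first built-in diagnostic plot shows residuals against fitted values:
plot(M.boston, which=1)   # residuals vs fitted values
# equivalently: plot(fitted(M.boston), resid(M.boston)); abline(h=0, lty=2)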
If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.
Outliers from a model are points with very high errors.
While they may not affect the fit, they might affect our assessment of model quality.
If we believe an outlier is due to an error in data collection, we can remove it.
An outlier might be evidence of a missing predictor, or the need to specify a more complex model.
Some samples with extreme inputs have an outsized effect on \(\hat \beta\).
This can be measured with the leverage statistic or self influence:
\[h_{ii} = \frac{\partial \hat y_i}{\partial y_i} = (\underbrace{ X ( X^T X)^{-1} X^T}_{\text{Hat matrix}})_{i,i} \in [1/n,1].\]
The residual \(e_i = y_i - \hat y_i\) is an estimate for the noise \(\varepsilon_i\).
The standard error of \(e_i\) is \(\sigma \sqrt{1-h_{ii}}\).
A studentized residual is \(e_i\) divided by its standard error (with an appropriate estimate of \(\sigma\)).
When the model is correct, it follows a Student \(t\)-distribution with \(n-p-2\) degrees of freedom.
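Both quantities are available directly in base R; a minimal sketch, again using a fitted lm object such as the Boston fit sketched earlier:
h = hatvalues(M.boston)   # leverage statistics h_ii
r = rstudent(M.boston)    # (externally) studentized residuals
plot(h, r, xlab="Leverage", ylab="Studentized residual")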
\[\mathtt{limit} \approx a\times\mathtt{rating}+b\]
Problem: The coefficients become unidentifiable.
Consider the extreme case of using two identical predictors limit:
\[
\begin{aligned}
\mathtt{balance} &= \beta_0 + \beta_1\times\mathtt{limit} + \beta_2\times\mathtt{limit} + \varepsilon \\
& = \beta_0 + (\beta_1+100)\times\mathtt{limit} + (\beta_2-100)\times\mathtt{limit} + \varepsilon
\end{aligned}
\]
For every \((\beta_0,\beta_1,\beta_2)\) the fit at \((\beta_0,\beta_1,\beta_2)\) is just as good as at \((\beta_0,\beta_1+100,\beta_2-100)\).
If 2 variables are collinear, we can easily diagnose this using their correlation.
A group of \(q\) variables is multicollinear if these variables “contain less information” than \(q\) independent variables.
Pairwise correlations may not reveal multicollinear variables.
The Variance Inflation Factor (VIF) measures how predictable \(X_j\) is from the other predictors; a high VIF indicates a largely redundant variable:
\[VIF(\hat \beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}},\] where \(R^2_{X_j|X_{-j}}\) is the \(R^2\) of the regression of \(X_j\) onto the remaining predictors.
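VIF can be computed from this definition by regressing each predictor on the others; the sketch below does this for the Boston fit used earlier (the car package's vif() function gives the same values):
X = model.matrix(M.boston)[, -1]                       # predictors only, intercept dropped
vif = sapply(colnames(X), function(j) {
  r2 = summary(lm(X[, j] ~ X[, -which(colnames(X) == j)]))$r.squared
  1 / (1 - r2)                                         # VIF_j = 1 / (1 - R^2_j)
})
round(vif, 2)
# or simply: car::vif(M.boston)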
KNN regression predicts at \(x\) by averaging the responses of the \(K\) nearest training points \(N_K(x)\): \[\hat f(x) = \frac{1}{K} \sum_{i\in N_K(x)} y_i\]
Linear regression: prototypical parametric method. Easy for inference.
KNN regression: prototypical nonparametric method. Inference less clear.
KNN is only better when the function \(f\) is far from linear (in which case the linear model is misspecified).
When \(n\) is not much larger than \(p\), even if \(f\) is nonlinear, linear regression can outperform KNN.
KNN has smaller bias, but this comes at the price of higher variance.
When \(p\gg n\), each sample has no near neighbors; this is known as the curse of dimensionality.
The variance of KNN regression is very large.