Multiple Linear Regression#

Outline#

  • Case studies:

    A. Effect of light on meadowfoam flowering

    B. Studying the brain sizes of mammals

  • Specifying the model.

  • Fitting the model: least squares.

  • Interpretation of the coefficients.

  • \(F\)-statistic revisited

  • Matrix approach to linear regression.

  • Investigating the design matrix

Case study A:#

url = 'https://raw.githubusercontent.com/StanfordStatistics/stats191-data/main/Sleuth3/light_flowering.csv'
flower = read.table(url, sep=',', header=TRUE)
head(flower)
A data.frame: 6 × 3
  Flowers Time Intensity
  <dbl>  <int>     <int>
1    62.3    1       150
2    77.4    1       150
3    55.3    1       300
4    54.2    1       300
5    49.6    1       450
6    61.9    1       450
  • Researchers manipulate the timing and intensity of light to investigate their effect on the number of flowers.

Case study B:#

url = 'https://raw.githubusercontent.com/StanfordStatistics/stats191-data/main/Sleuth3/brains.csv'
brains = read.csv(url, header=TRUE, row.names=1)
head(brains)
A data.frame: 6 × 4
                  Brain    Body Gestation Litter
                  <dbl>   <dbl>     <int>  <dbl>
Aardvark            9.6    2.20        31    5.0
Acouchis            9.9    0.78        98    1.2
African elephant 4480.0 2800.00       655    1.0
Agoutis            20.3    2.80       104    1.3
Axis deer         219.0   89.00       218    1.0
Badger             53.0    6.00        60    2.2
  • How are litter size and gestation period associated with brain size in mammals?

A model for the brains data#

The figure depicts our model for generating \(Y_i\):#

  • First fix \(X=(X_1,\dots,X_p)\), form the mean \(\beta_0 + \sum_j \beta_j X_{j}\), then add an error \(\epsilon\)


A model for brains#

Multiple linear regression model#

\[ \texttt{Brain}_i = \beta_0 + \beta_1 \cdot \texttt{Body}_i + \beta_2 \cdot \texttt{Gestation}_i + \beta_3 \cdot \texttt{Litter}_i + \epsilon_i \]
brains.lm = lm(Brain ~ Body + Gestation + Litter, data=brains)
summary(brains.lm)$coef
summary(brains.lm)$fstatistic
A matrix: 4 × 4 of type dbl
                Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) -225.2921328 83.05875218 -2.712443 7.971598e-03
Body           0.9858781  0.09428263 10.456624 2.517636e-17
Gestation      1.8087434  0.35444885  5.102974 1.790007e-06
Litter        27.6486394 17.41429351  1.587698 1.157857e-01

value 130.729924871479
numdf   3
dendf  92

Another model for brains#

\[ \texttt{Brain}_i = \beta_0 + \beta_1 \cdot \texttt{Body}_i + \beta_2 \cdot \texttt{Litter}_i + \epsilon_i \]
simpler.lm = lm(Brain ~ Body + Litter, data=brains)
summary(simpler.lm)$coef
summary(simpler.lm)$fstatistic
A matrix: 3 × 4 of type dbl
              Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 145.028380 45.51865579  3.186131 1.963384e-03
Body          1.301621  0.08014619 16.240583 6.755835e-29
Litter      -29.022067 15.11200472 -1.920464 5.786311e-02

value 144.23837946366
numdf   2
dendf  93

Fitting a multiple linear regression model#

  • Just as in simple linear regression, the model is fit by minimizing

    \[\begin{split}\begin{aligned} SSE(\beta_0, \dots, \beta_p) &= \sum_{i=1}^n\left(Y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j X_{ij} \right) \right)^2 \\ &= \|Y - \widehat{Y}(\beta)\|^2 \end{aligned}\end{split}\]
  • The minimizers \(\widehat{\beta} = (\widehat{\beta}_0, \dots, \widehat{\beta}_p)\) are the “least squares estimates”; as in simple linear regression, they are normally distributed.

Estimating \(\sigma^2\)#

  • As in simple regression

    \[\widehat{\sigma}^2 = \frac{SSE}{n-p-1} \sim \sigma^2 \cdot \frac{\chi^2_{n-p-1}}{n-p-1}\]

    independent of \(\widehat{\beta}\).

  • Why \(\chi^2_{n-p-1}\)? In general, the degrees of freedom in the estimate of \(\sigma^2\) is \(n - \#\{\text{parameters in the regression function}\}\); here the regression function has \(p+1\) parameters.
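As a quick check, reusing the `brains.lm` fit from above, we can compute \(\widehat{\sigma}\) by hand and compare it to the residual standard error that `summary` reports:

```r
# Residual standard error by hand: sqrt(SSE / (n - p - 1))
sigma.hat = sqrt(sum(resid(brains.lm)^2) / brains.lm$df.residual)
sigma.hat
summary(brains.lm)$sigma   # should agree with sigma.hat
```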

Interpretation of \(\beta_j\) in brains.lm#

  • Take \(\beta_1=\beta_{\tt Body}\) for example. This is the amount by which the average Brain weight increases for a one kg increase in Body, holding everything else constant.

  • We refer to this as the effect of Body allowing for or controlling for the other variables.

Example#

  • Let’s take the Beaked whale, artificially add one kg to its Body, and compute the predicted Brain weight

case1 = brains['Beaked whale',]
case2 = case1
case2['Body'] = case1['Body'] + 1
coef(brains.lm)
predict(brains.lm, rbind(case1, case2))
(Intercept) -225.292132761766
Body           0.985878094591525
Gestation      1.808743390259
Litter        27.6486394142719

Beaked whale  505.043355493964
Beaked whale1 506.029233588555

Same example in simpler.lm#

  • To emphasize that the coefficients depend on which other variables are in the model, let’s redo this in the simpler.lm model

case1 = brains['Beaked whale',]
case2 = case1
case2['Body'] = case1['Body'] + 1
coef(simpler.lm)
predict(simpler.lm, rbind(case1, case2))
(Intercept)  145.028380229758
Body           1.30162092388685
Litter       -29.0220669146293

Beaked whale  418.193890755138
Beaked whale1 419.495511679024

\(R^2\) for multiple regression#

\[\begin{split}\begin{aligned} SSE &= \sum_{i=1}^n(Y_i - \widehat{Y}_i)^2 \\ SSR &= \sum_{i=1}^n(\overline{Y} - \widehat{Y}_i)^2 \\ SST &= \sum_{i=1}^n(Y_i - \overline{Y})^2 = SSE + SSR \\ R^2 &= \frac{SSR}{SST} \end{aligned}\end{split}\]

\(R^2\) is now called the coefficient of multiple determination; its square root \(R\) is the multiple correlation coefficient of the model.

The sums of squares and \(R^2\) are defined analogously to those in simple linear regression.


Computing \(R^2\) by hand#

Y = brains$Brain
SST = sum((Y - mean(Y))^2)
SSE = sum(resid(brains.lm)^2)
SSR = SST - SSE
1 - SSE/SST
summary(brains.lm)$r.squared
0.80999185686385
0.80999185686385

Adjusted \(R^2\)#

  • As we add more and more variables to the model – even pure noise variables – \(R^2\) increases toward 1.

  • Adjusted \(R^2\) accounts for this by replacing sums of squares with mean squares

(10)#\[\begin{equation} R^2_a = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{MSE}{MST}. \end{equation}\]

Computing \(R^2_a\) by hand#

n = length(Y)
MST = SST / (n - 1)
MSE = SSE / brains.lm$df.residual
MSR = SSR / (n - 1 - brains.lm$df.residual)
1 - MSE/MST
summary(brains.lm)$adj.r.squared
0.803795939152889
0.803795939152889

\(F\)-test in summary(brains.lm)#

  • Full model:

\[\texttt{Brain}_i = \beta_0 + \beta_1 \cdot \texttt{Body}_i + \beta_2 \cdot \texttt{Gestation}_i + \beta_3 \cdot \texttt{Litter}_i + \epsilon_i\]
  • Reduced model:

    \[\texttt{Brain}_i = \beta_0 + \epsilon_i\]
  • Statistic:

    \[F=\frac{(SSE_R - SSE_F) / (df_R - df_F)}{SSE_F / df_F} = \frac{SSR/df(SSR)}{SSE/df(SSE)} = \frac{MSR}{MSE}.\]
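The statistic above can be computed directly from the `brains.lm` fit and compared to the value `summary` reports:

```r
# F statistic by hand: MSR / MSE for the full brains model
SST = sum((brains$Brain - mean(brains$Brain))^2)
SSE = sum(resid(brains.lm)^2)
MSR = (SST - SSE) / 3              # df_R - df_F = 3 predictors
MSE = SSE / brains.lm$df.residual  # df_F = 92
MSR / MSE                          # matches summary(brains.lm)$fstatistic
```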

Right triangle again#

  • Sides of the triangle: \(df_R-df_F=3\), \(df_F=92\)

  • Hypotenuse: \(df_R=95\)


summary(brains.lm)$fstatistic
value
130.729924871479
numdf
3
dendf
92

Matrix formulation#

\[{ Y}_{n \times 1} = {X}_{n \times (p + 1)} {\beta}_{(p+1) \times 1} + {\varepsilon}_{n \times 1}\]
  • \({X}\) is called the design matrix of the model

  • \({\varepsilon} \sim N(0, \sigma^2 I_{n \times n})\) is multivariate normal

\(SSE\) in matrix form#

(11)#\[\begin{equation} SSE(\beta) = ({Y} - {X} {\beta})'({Y} - {X} {\beta}) = \|Y-X\beta\|^2 \end{equation}\]
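Evaluating this matrix expression at \(\widehat{\beta}\) recovers the residual sum of squares, using the `brains.lm` fit from above:

```r
# SSE in matrix form at the least squares estimate
Y = brains$Brain
X = model.matrix(brains.lm)
r = Y - X %*% coef(brains.lm)      # residual vector Y - X beta-hat
t(r) %*% r                         # equals sum(resid(brains.lm)^2)
```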

Design matrix#

  • The design matrix is the \(n \times (p+1)\) matrix with entries

(12)#\[\begin{equation} X = \begin{pmatrix} 1 & X_{11} & X_{12} & \dots & X_{1p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \dots & X_{np} \\ \end{pmatrix} \end{equation}\]

n = nrow(brains)
X = with(brains, cbind(rep(1,n), Body, Gestation, Litter))
colnames(X)[1] = '(Intercept)'
head(X)
A matrix: 6 × 4 of type dbl
(Intercept)    Body Gestation Litter
          1    2.20        31    5.0
          1    0.78        98    1.2
          1 2800.00       655    1.0
          1    2.80       104    1.3
          1   89.00       218    1.0
          1    6.00        60    2.2

The matrix X is the same as the one formed by R:

head(model.matrix(brains.lm))
A matrix: 6 × 4 of type dbl
                 (Intercept)    Body Gestation Litter
Aardvark                   1    2.20        31    5.0
Acouchis                   1    0.78        98    1.2
African elephant           1 2800.00       655    1.0
Agoutis                    1    2.80       104    1.3
Axis deer                  1   89.00       218    1.0
Badger                     1    6.00        60    2.2

Math aside: least squares solution#

  • Normal equations

    \[\frac{\partial}{\partial \beta_j} SSE \biggl|_{\beta = \widehat{\beta}_{}} = -2 \left({Y\ } - {X} \widehat{\beta}_{} \right)^T {X}_j = 0, \qquad 0 \leq j \leq p.\]
  • Equivalent to

\[\begin{split}\begin{aligned} ({Y} - {X}{\widehat{\beta}_{}})^T{X} &= 0 \\ {\widehat{\beta}} &= ({X}^T{X})^{-1}{X}^T{Y} \end{aligned}\end{split}\]
  • Distribution: \(\widehat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1}).\)

Math aside: multivariate normal#

  • To obtain the distribution of \(\hat{\beta}\) we used the following fact about the multivariate Normal.

  • Suppose \(Z \sim N(\mu,\Sigma)\). Then, for any fixed matrix \(A\)

\[ AZ \sim N(A\mu, A\Sigma A^T). \]

Math aside: how did we derive the distribution of \(\hat{\beta}\)?#

Above, we saw that \(\hat{\beta}\) is equal to a matrix times \(Y\). The matrix form of our model is

\[ Y \sim N(X\beta, \sigma^2 I). \]

Therefore,

\[\begin{split} \begin{aligned} \hat{\beta} &\sim N\left((X^TX)^{-1}X^T (X\beta), (X^TX)^{-1}X^T (\sigma^2 I) X (X^TX)^{-1}\right) \\ &\sim N(\beta, \sigma^2 (X^TX)^{-1}). \end{aligned} \end{split}\]
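We can check the covariance formula numerically: the square roots of the diagonal of \(\widehat{\sigma}^2 (X^TX)^{-1}\) should reproduce the standard errors in `summary(brains.lm)` (here `X` is formed from `brains.lm` itself):

```r
# Standard errors from the estimated covariance sigma^2 (X^T X)^{-1}
X = model.matrix(brains.lm)
sigma2.hat = sum(resid(brains.lm)^2) / brains.lm$df.residual
sqrt(diag(sigma2.hat * solve(t(X) %*% X)))
summary(brains.lm)$coef[, 'Std. Error']   # should agree
```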

Math aside: checking the equation#

beta = as.numeric(solve(t(X) %*% X) %*% t(X) %*% Y)
names(beta) = colnames(X)
beta
coef(brains.lm)
(Intercept) -225.292132761768
Body           0.985878094591526
Gestation      1.808743390259
Litter        27.648639414272

(Intercept) -225.292132761766
Body           0.985878094591525
Gestation      1.808743390259
Litter        27.6486394142719

Categorical variables#

  • Recall case study A: the flower experiment

flower1.lm = lm(Flowers ~ factor(Time), data=flower)
summary(flower1.lm)$coef
A matrix: 2 × 4 of type dbl
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept)   50.05833   3.615510 13.845440 2.433418e-12
factor(Time)2 12.15833   5.113104  2.377877 2.652620e-02
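With a single two-level factor, the fit just reproduces group means: the intercept is the mean of the Time==1 group and the factor(Time)2 coefficient is the difference in group means. A quick check:

```r
# Group means of Flowers by Time
means = with(flower, tapply(Flowers, Time, mean))
means
means[2] - means[1]     # equals the factor(Time)2 coefficient
```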

Design matrix with categorical variables#

  • R has used a binary column for factor(Time).

model.matrix(flower1.lm)
A matrix: 24 × 2 of type dbl
   (Intercept) factor(Time)2
1            1             0
2            1             0
3            1             0
4            1             0
5            1             0
6            1             0
7            1             0
8            1             0
9            1             0
10           1             0
11           1             0
12           1             0
13           1             1
14           1             1
15           1             1
16           1             1
17           1             1
18           1             1
19           1             1
20           1             1
21           1             1
22           1             1
23           1             1
24           1             1

How categorical variables are encoded#

  • We can change the columns in the design matrix:

model.matrix(lm(Flowers ~ factor(Time) - 1, data=flower))
A matrix: 24 × 2 of type dbl
   factor(Time)1 factor(Time)2
1              1             0
2              1             0
3              1             0
4              1             0
5              1             0
6              1             0
7              1             0
8              1             0
9              1             0
10             1             0
11             1             0
12             1             0
13             0             1
14             0             1
15             0             1
16             0             1
17             0             1
18             0             1
19             0             1
20             0             1
21             0             1
22             0             1
23             0             1
24             0             1
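In this no-intercept coding, each indicator’s coefficient is simply its group mean, which we can verify:

```r
# Without an intercept, the coefficient on each indicator is its group mean
coef(lm(Flowers ~ factor(Time) - 1, data=flower))
with(flower, tapply(Flowers, Time, mean))   # same numbers
```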

Design matrix with categorical variables#

  • By default, R discards one of the indicator columns. Why? With the intercept included, the full set of indicators would sum to the intercept column, so the design matrix would not have full rank.

flower2.lm = lm(Flowers ~ factor(Intensity), data=flower)
model.matrix(flower2.lm)
A matrix: 24 × 6 of type dbl
   (Intercept) factor(Intensity)300 factor(Intensity)450 factor(Intensity)600 factor(Intensity)750 factor(Intensity)900
1            1                    0                    0                    0                    0                    0
2            1                    0                    0                    0                    0                    0
3            1                    1                    0                    0                    0                    0
4            1                    1                    0                    0                    0                    0
5            1                    0                    1                    0                    0                    0
6            1                    0                    1                    0                    0                    0
7            1                    0                    0                    1                    0                    0
8            1                    0                    0                    1                    0                    0
9            1                    0                    0                    0                    1                    0
10           1                    0                    0                    0                    1                    0
11           1                    0                    0                    0                    0                    1
12           1                    0                    0                    0                    0                    1
13           1                    0                    0                    0                    0                    0
14           1                    0                    0                    0                    0                    0
15           1                    1                    0                    0                    0                    0
16           1                    1                    0                    0                    0                    0
17           1                    0                    1                    0                    0                    0
18           1                    0                    1                    0                    0                    0
19           1                    0                    0                    1                    0                    0
20           1                    0                    0                    1                    0                    0
21           1                    0                    0                    0                    1                    0
22           1                    0                    0                    0                    1                    0
23           1                    0                    0                    0                    0                    1
24           1                    0                    0                    0                    0                    1

Some additional models#

~ Intensity#

flower3.lm = lm(Flowers ~ Intensity, data=flower)
summary(flower3.lm)$coef
A matrix: 2 × 4 of type dbl
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 77.38500000 4.161186164 18.596861 6.059011e-15
Intensity   -0.04047143 0.007123293 -5.681562 1.029503e-05

Some additional models#

~ Intensity + factor(Time)#

flower4.lm = lm(Flowers ~ Intensity + factor(Time), data=flower)
summary(flower4.lm)$coef
A matrix: 3 × 4 of type dbl
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)   71.30583333 3.27377202 21.780940 6.767274e-16
Intensity     -0.04047143 0.00513237 -7.885525 1.036787e-07
factor(Time)2 12.15833333 2.62955696  4.623719 1.463776e-04

Some additional models#

~ factor(Intensity) + factor(Time)#

flower5.lm = lm(Flowers ~ factor(Intensity) + factor(Time), data=flower)
summary(flower5.lm)$coef
A matrix: 7 × 4 of type dbl
                      Estimate Std. Error   t value     Pr(>|t|)
(Intercept)           67.19583   3.628693 18.517916 1.049173e-12
factor(Intensity)300  -9.12500   4.751074 -1.920618 7.171506e-02
factor(Intensity)450 -13.37500   4.751074 -2.815153 1.191898e-02
factor(Intensity)600 -23.22500   4.751074 -4.888368 1.384982e-04
factor(Intensity)750 -27.75000   4.751074 -5.840784 1.965699e-05
factor(Intensity)900 -29.35000   4.751074 -6.177550 1.012760e-05
factor(Time)2         12.15833   2.743034  4.432440 3.648933e-04

Interactions#

  • Suppose we believe that Flowers varies linearly with Intensity, but with a slope that depends on Time.

  • We then need two slope parameters for Intensity, one per Time group: an interaction.

flower6.lm = lm(Flowers ~ Intensity + factor(Time) + factor(Time):Intensity, data=flower)
summary(flower6.lm)$coef
A matrix: 4 × 4 of type dbl
                            Estimate  Std. Error    t value     Pr(>|t|)
(Intercept)             71.623333333 4.343304599 16.4905158 4.143572e-13
Intensity               -0.041076190 0.007435051 -5.5246682 2.083392e-05
factor(Time)2           11.523333333 6.142360270  1.8760432 7.532164e-02
Intensity:factor(Time)2  0.001209524 0.010514750  0.1150312 9.095675e-01
  • What is the regression line when Time==1? And Time==2?

Different models across groups#

  • Set \(\beta_1=\beta_{\tt Intensity}\), \(\beta_2=\beta_{\tt Time2}\), \(\beta_3=\beta_{\tt Time2:Intensity}\).

  • In Time==1 group, one unit change of Intensity leads to \(\beta_1\) units of change in Flower.

  • In Time==2 group, one unit change of Intensity leads to \(\beta_1 + \beta_3\) units of change in Flower.

  • Test \(H_0: \beta_3 = 0\), i.e. that the slope is the same in each group.

summary(flower6.lm)$coef
A matrix: 4 × 4 of type dbl
                            Estimate  Std. Error    t value     Pr(>|t|)
(Intercept)             71.623333333 4.343304599 16.4905158 4.143572e-13
Intensity               -0.041076190 0.007435051 -5.5246682 2.083392e-05
factor(Time)2           11.523333333 6.142360270  1.8760432 7.532164e-02
Intensity:factor(Time)2  0.001209524 0.010514750  0.1150312 9.095675e-01
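Putting the bullets above together, the two group slopes can be read off the fitted flower6.lm coefficients directly:

```r
# Slope for Time==1 is the Intensity coefficient;
# the Time==2 slope adds the interaction term
b = coef(flower6.lm)
slope1 = b['Intensity']
slope2 = b['Intensity'] + b['Intensity:factor(Time)2']
c(Time1 = unname(slope1), Time2 = unname(slope2))
```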

Visualizing interaction#

with(flower, plot(Intensity, Flowers, pch=Time, ylim=c(40, 80)))
cur.lm = lm(Flowers ~ Intensity + factor(Time) + factor(Time):Intensity, data=flower)
Ival = seq(100, 1000, length=10)
Yhat1 = predict(cur.lm, list(Time=rep(1, 10), Intensity=Ival))
Yhat2 = predict(cur.lm, list(Time=rep(2, 10), Intensity=Ival))
lines(Ival, Yhat1, col='red')
lines(Ival, Yhat2, col='blue')
(Figure: scatter plot of Flowers against Intensity, with fitted lines for Time==1 in red and Time==2 in blue.)