Binary outcomes
options(repr.plot.width=5, repr.plot.height=5)
Generalized Linear Model
logit.inv = function(x) {
    return(exp(x) / (1 + exp(x)))
}
x = seq(-4, 4, length=200)
plot(x, logit.inv(x), lwd=2, type='l', col='red', cex.lab=1.2)
The logit function itself:
logit = function(p) {
    return(log(p / (1 - p)))
}
p = seq(0.01,0.99,length=200)
plot(p, logit(p), lwd=2, type='l', col='red', cex.lab=1.2)
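As a quick sanity check (a line I've added, not in the original notes), logit and logit.inv are inverses of each other:

p.test = c(0.1, 0.5, 0.9)
logit.inv(logit(p.test))   # recovers p.test, up to floating point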
The logistic curve is the inverse of the logit function (the function logit.inv plotted above): $$F(x)=\frac{e^x}{1+e^x}.$$ For logistic regression, coefficients have a nice interpretation in terms of odds ratios (to be defined shortly).
What about inference?
Instead of the sum of squares, logistic regression uses the deviance:
Minimizing deviance $\iff$ Maximum Likelihood
For any binary regression model, $\pi=\pi(\beta)$.
The deviance is: $$\begin{aligned} DEV(\beta| Y) &= -2 \sum_{i=1}^n \left( Y_i {\text{logit}}(\pi_i(\beta)) + \log(1-\pi_i(\beta)) \right) \end{aligned}$$
For the logistic model, since $\log(1-\pi_i(\beta)) = -\log\left(1 + \exp\left(\sum_{j=1}^p X_{ij}\beta_j\right)\right)$, the RHS is: $$ \begin{aligned} -2 \left[ (X\beta)^Ty - \sum_{i=1}^n\log \left(1 + \exp \left(\sum_{j=1}^p X_{ij} \beta_j\right) \right)\right] \end{aligned}$$
The logistic model is special in that $\text{logit}(\pi(\beta))=X\beta$. If we used a different transformation, the first part would not be linear in $X\beta$.
For ease of notation, I'm assuming that X[,1]=1, the column corresponding to the intercept $\beta_0$.
flu.table = read.table('http://stats191.stanford.edu/data/flu.table',
                       header=TRUE)
flu.glm = glm(Shot ~ Age + Health.Aware, data=flu.table,
              family=binomial())
summary(flu.glm)
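As a sanity check (a sketch I've added, not part of the original notes; X.flu and y.flu are names of my own), we can evaluate the deviance formula above directly at the fitted coefficients; it should match deviance(flu.glm):

# DEV(beta) = -2 [ (X beta)^T y - sum_i log(1 + exp(x_i^T beta)) ]
X.flu = model.matrix(flu.glm)      # design matrix, including the intercept
y.flu = flu.glm$y                  # 0/1 response
eta = X.flu %*% coef(flu.glm)      # linear predictor X beta
c(-2 * (sum(y.flu * eta) - sum(log(1 + exp(eta)))), deviance(flu.glm))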
logodds = predict(flu.glm, list(Age=c(35,45), Health.Aware=c(50,50)),
                  type='link')
logodds
exp(logodds[2] - logodds[1])
The estimated probabilities are below, yielding a ratio of $0.1932/0.0254 \approx 7.61$, not too far from the odds ratio of 9.18 computed above.
prob = exp(logodds)/(1+exp(logodds))
prob
prob[2] / prob[1]
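In contrast (a check I've added), the ratio of the odds, rather than of the probabilities, recovers the 9.18 exactly:

odds = prob / (1 - prob)
odds[2] / odds[1]   # equals exp(logodds[2] - logodds[1]) from above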
The Newton-Raphson updates for logistic regression are $$ \hat{\beta} \mapsto \hat{\beta} - \nabla^2 DEV(\hat{\beta})^{-1} \nabla DEV(\hat{\beta}) $$
These turn out to be the same as the updates above.
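Here is a minimal sketch of this iteration (my own code, not from the original notes; irls.logistic is a made-up name), applied to the flu data. It should converge to coef(flu.glm):

# Newton-Raphson for the logistic deviance:
#   beta <- beta + (X^T W X)^{-1} X^T (y - pi),  with W = diag(pi (1 - pi))
irls.logistic = function(X, y, niter=25) {
    beta = rep(0, ncol(X))
    for (it in 1:niter) {
        eta = X %*% beta
        pi.hat = exp(eta) / (1 + exp(eta))
        W = as.vector(pi.hat * (1 - pi.hat))
        beta = beta + solve(t(X) %*% (W * X), t(X) %*% (y - pi.hat))
    }
    beta
}
cbind(irls.logistic(model.matrix(flu.glm), flu.glm$y), coef(flu.glm))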
In earlier statistical software, one might have had access only to a weighted least squares estimator, which is exactly what each IRLS step requires. One thing the IRLS procedure hints at is the approximate limiting distribution of $\hat{\beta}$.
center = coef(flu.glm)['Age']
SE = sqrt(vcov(flu.glm)['Age', 'Age'])
U = center + SE * qnorm(0.975)
L = center - SE * qnorm(0.975)
data.frame(L, center, U)
pi.hat = fitted(flu.glm)
W.hat = pi.hat * (1 - pi.hat)      # variance weights pi(1 - pi)
X = model.matrix(flu.glm)
C = solve(t(X) %*% (W.hat * X))    # (X^T W X)^{-1}, the estimated covariance of beta.hat
c(SE, sqrt(C['Age', 'Age']))       # agrees with the SE from vcov()
The intervals above are slightly different from what R will give you if you ask it for confidence intervals: R uses so-called profile intervals. For large samples the two methods should agree quite closely.
CI = confint(flu.glm)
CI
mean(CI[2,]) # profile intervals are not symmetric around the estimate...
data.frame(L, center, U)
What about comparing full and reduced models? In linear regression we compared them via the difference in $SSE$; for logistic regression this difference in $SSE$ is replaced with the difference in deviance: $$DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty, H_0:{\cal M}_R}{\sim} \chi^2_{df_R-df_F}.$$ This is closely related to the $F$ statistic when $df_F$ is large: $\chi^2_{df_R-df_F}$ is approximately $F_{df_R-df_F, df_F} \cdot (df_R-df_F)$.
anova(glm(Shot ~ 1, data=flu.table, family=binomial()),
      flu.glm)
anova(glm(Shot ~ Health.Aware, data=flu.table, family=binomial()),
      flu.glm)
We should compare these differences in deviance with $\chi^2$ random variables: $\chi^2_2$ for the first comparison and $\chi^2_1$ for the second.
1 - pchisq(35.61, 2) # testing ~1 vs ~1 + Health.Aware + Age
1 - pchisq(16.863, 1) # testing ~ 1 + Health.Aware vs ~1 + Health.Aware + Age
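Equivalently (a usage note I've added, not in the original), anova() will compute these p-values itself if given test='Chisq':

anova(glm(Shot ~ Health.Aware, data=flu.table, family=binomial()),
      flu.glm, test='Chisq')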
Let's compare this with the Wald test:
summary(flu.glm)
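For reference (my own sketch), the Wald $z$-statistic and two-sided p-value for Age can be reproduced by hand from the estimate and its standard error:

z = coef(flu.glm)['Age'] / sqrt(vcov(flu.glm)['Age', 'Age'])
c(z, 2 * pnorm(-abs(z)))   # matches the z value and Pr(>|z|) in summary()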
Similar to least squares regression, except that the residuals usually used are the deviance residuals $r_i = \text{sign}(Y_i-\widehat{\pi}_i) \sqrt{DEV(\widehat{\pi}_i|Y_i)}.$ These agree with the usual residuals in least squares regression.
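As a check (a sketch I've added, not from the notes; r.dev is my own name), the deviance residuals can be computed directly from this formula and compared with residuals(flu.glm, type='deviance'):

# For binary Y: DEV(pi.hat | Y) = -2 [ Y log(pi.hat) + (1 - Y) log(1 - pi.hat) ]
pi.hat = fitted(flu.glm)
y = flu.glm$y
r.dev = sign(y - pi.hat) * sqrt(-2 * (y * log(pi.hat) + (1 - y) * log(1 - pi.hat)))
max(abs(r.dev - residuals(flu.glm, type='deviance')))   # should be ~0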
par(mfrow=c(2,2))
plot(flu.glm)
influence.measures(flu.glm)
As the model is likelihood-based, each fitted model has an AIC, so stepwise selection can be used easily:
step(flu.glm, scope=list(upper= ~.^2), direction='both')
library(glmnet)
X = model.matrix(flu.glm)[,-1]
Y = as.numeric(flu.table$Shot)
G = glmnet(X, Y, family="binomial")
plot(G)
library(ElemStatLearn)
data(spam)
dim(spam)
X = model.matrix(spam ~ ., data=spam)[,-1]
Y = as.numeric(spam$spam == 'spam')
G = glmnet(X, Y, family='binomial')
plot(G)
CV = cv.glmnet(X, Y, family='binomial')
plot(CV)
c(CV$lambda.min, CV$lambda.1se)
Extracting the coefficients from glmnet at the chosen value of $\lambda$:
beta.hat = coef(G, s=CV$lambda.1se)
beta.hat
Probit regression model: $$\Phi^{-1}(E(Y|X))= \sum_{j=1}^{p} \beta_j X_{j}$$ where $\Phi$ is the CDF of $N(0,1)$, i.e. $\Phi(t) = {\tt pnorm}(t)$, $\Phi^{-1}(q) = {\tt qnorm}(q)$.
Regression function $$ \begin{aligned} E(Y|X) &= E(Y|X_1,\dots,X_p) \\ &= P(Y=1|X_1, \dots, X_p) \\ & = {\tt pnorm}\left(\sum_{j=1}^p \beta_j X_j \right)\\ \end{aligned} $$
In the logit, probit, and cloglog models ${\text{Var}}(Y_i)=\pi_i(1-\pi_i)$, but the model for the mean is different.
Coefficients no longer have an odds ratio interpretation.
summary(glm(Shot ~ Age + Health.Aware, data=flu.table,
            family=binomial(link='probit')))
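As a quick comparison (a sketch I've added; probit.glm is my own name), the fitted probabilities from the probit fit are typically very close to those from the logit fit:

probit.glm = glm(Shot ~ Age + Health.Aware, data=flu.table,
                 family=binomial(link='probit'))
plot(fitted(flu.glm), fitted(probit.glm),
     xlab='logit fitted probabilities', ylab='probit fitted probabilities')
abline(0, 1, lty=2)   # points near the 45-degree line => similar fits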
Given a dataset $(Y_i, X_{i1}, \dots, X_{ip}), 1 \leq i \leq n$ we consider a model for the distribution of $Y|X_1, \dots, X_p$.