Logistic regression

Binary outcomes

  • Most models so far have had response $Y$ as continuous.
  • Many responses in practice fall into the $YES/NO$ framework.
  • Examples:
    1. medical: presence or absence of cancer
    2. financial: bankrupt or solvent
    3. industrial: passes a quality control test or not
In [1]:
options(repr.plot.width=5, repr.plot.height=5)

Modelling probabilities

  • For $0-1$ responses we need to model $$\pi(x_1, \dots, x_p) = P(Y=1|X_1=x_1,\dots, X_p=x_p)$$
  • That is, $Y$ is Bernoulli with a probability that depends on covariates $\{X_1, \dots, X_p\}.$
  • Note: $\text{Var}(Y) = \pi ( 1 - \pi) = E(Y) \cdot ( 1- E(Y))$
  • Or, the binary nature forces a relation between mean and variance of $Y$.
  • This makes logistic regression a Generalized Linear Model.

Flu shot example

  • A local health clinic sent fliers to its clients to encourage everyone, but especially older persons at high risk of complications, to get a flu shot in time for protection against an expected flu epidemic.
  • In a pilot follow-up study, 50 clients were randomly selected and asked whether they actually received a flu shot. $Y={\tt Shot}$
  • In addition, data were collected on their age $X_1={\tt Age}$ and their health awareness $X_2={\tt Health.Aware}$

A possible model

  • Simplest model $\pi(X_1,X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
  • Problems / issues:
    • We must have $0 \leq E(Y) = \pi(X_1,X_2) \leq 1$; ordinary least squares will not enforce this (see the sketch below).
    • Ordinary least squares is also not appropriate because of the relation between the mean and variance of $Y$: the errors cannot have constant variance.
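To see the first issue concretely, here is a minimal sketch: fit the linear model by ordinary least squares on the flu data described above (read here from the course URL used again later) and inspect the range of the fitted values, which nothing constrains to lie in $[0,1]$.

In [ ]:
flu.table = read.table('http://stats191.stanford.edu/data/flu.table',
                       header=TRUE)
# OLS fit of the binary response on the covariates
flu.lm = lm(Shot ~ Age + Health.Aware, data=flu.table)
# fitted values are not forced into [0,1]
range(fitted(flu.lm))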

Logistic model

  • Logistic model $\pi(X_1,X_2) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}$
  • This automatically fixes $0 \leq E(Y) = \pi(X_1,X_2) \leq 1$.
  • Define: $\text{logit}(\pi(X_1, X_2)) = \log\left(\frac{\pi(X_1, X_2)}{1 - \pi(X_1,X_2)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$

Logistic distribution

In [2]:
logit.inv = function(x) {
  return(exp(x) / (1 + exp(x)))
}
x = seq(-4, 4, length=200)
plot(x, logit.inv(x), lwd=2, type='l', col='red', cex.lab=1.2)

Logistic transform: logit

In [3]:
logit = function(p) {
  return(log(p / (1 - p)))
}
p = seq(0.01,0.99,length=200)
plot(p, logit(p), lwd=2, type='l', col='red', cex.lab=1.2)

Binary regression models

  • Models $E(Y)$ as $F(\beta_0 + \beta_1 X_1 + \beta_2 X_2)$ for some increasing function $F$ (usually a distribution function).
  • The logistic model uses the function (we called logit.inv above) $$F(x)=\frac{e^x}{1+e^x}.$$
  • Can be fit using Maximum Likelihood / Iteratively Reweighted Least Squares.
  • For logistic regression, coefficients have nice interpretation in terms of odds ratios (to be defined shortly).

  • What about inference?

Criterion used to fit model

Instead of sum of squares, logistic regression uses deviance:

  • $DEV(\mu| Y) = -2 \log L(\mu| Y) + 2 \log L(Y| Y)$ where $\mu$ is a location estimator for $Y$.
  • If $Y$ is Gaussian with independent $N(\mu_i,\sigma^2)$ entries then $DEV(\mu| Y) = \frac{1}{\sigma^2}\sum_{i=1}^n(Y_i - \mu_i)^2$
  • If $Y$ is a binary vector, with mean vector $\pi$ then $DEV(\pi| Y) = -2 \sum_{i=1}^n \left( Y_i \log(\pi_i) + (1-Y_i) \log(1-\pi_i) \right)$

Minimizing deviance $\iff$ Maximum Likelihood

Deviance for logistic regression

  • For any binary regression model, $\pi=\pi(\beta)$.

  • The deviance is: $$\begin{aligned} DEV(\beta| Y) &= -2 \sum_{i=1}^n \left( Y_i {\text{logit}}(\pi_i(\beta)) + \log(1-\pi_i(\beta)) \right) \end{aligned}$$

  • For the logistic model, the RHS is: $$ \begin{aligned} -2 \left[ (X\beta)^TY - \sum_{i=1}^n\log \left(1 + \exp \left(\sum_{j=1}^p X_{ij} \beta_j\right) \right)\right] \end{aligned}$$

  • The logistic model is special in that $\text{logit}(\pi(\beta))=X\beta$. If we used a different transformation, the first part would not be linear in $X\beta$.

  • For ease of notation I'm assuming that X[,1]=1 corresponding to $\beta_0$

In [4]:
flu.table = read.table('http://stats191.stanford.edu/data/flu.table', 
                       header=TRUE)
flu.glm = glm(Shot ~ Age + Health.Aware, data=flu.table, 
              family=binomial())
summary(flu.glm)
Call:
glm(formula = Shot ~ Age + Health.Aware, family = binomial(), 
    data = flu.table)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5522  -0.2962  -0.1124   0.4208   2.3244  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -21.58458    6.41824  -3.363 0.000771 ***
Age            0.22178    0.07436   2.983 0.002858 ** 
Health.Aware   0.20351    0.06273   3.244 0.001178 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 32.416  on 47  degrees of freedom
AIC: 38.416

Number of Fisher Scoring iterations: 6
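As a quick check of the deviance formula above, we can recompute the residual deviance directly from the fitted probabilities (a small sketch using the flu.glm fit above):

In [ ]:
pi.hat = fitted(flu.glm)
Y = flu.table$Shot
# binary deviance: -2 * sum(Y log(pi) + (1 - Y) log(1 - pi))
DEV = -2 * sum(Y * log(pi.hat) + (1 - Y) * log(1 - pi.hat))
# should match the "Residual deviance" reported by summary(flu.glm)
c(DEV, deviance(flu.glm))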

Odds Ratios

  • One reason logistic models are popular is that the parameters have simple interpretations in terms of odds $$ODDS(A) = \frac{P(A)}{1-P(A)}.$$
  • Logistic model: $$OR_{X_j} = \frac{ODDS(Y=1|\dots, X_j=x_j+h, \dots)}{ODDS(Y=1|\dots, X_j=x_j, \dots)} = e^{h \beta_j}$$
  • If $X_j \in \{0, 1\}$ is dichotomous, then the odds for the group with $X_j = 1$ are $e^{\beta_j}$ times the odds for the group with $X_j = 0$, all other covariates being equal (see the sketch below).
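For instance, a minimal sketch using the fitted flu model: the estimated odds ratio for a one-year increase in Age, holding Health.Aware fixed, is $e^{\widehat{\beta}_{\tt Age}}$.

In [ ]:
# odds ratio per one-year increase in Age: exp(0.22178), about 1.25
exp(coef(flu.glm)['Age'])
# odds ratio per ten-year increase in Age: about 9.19
exp(10 * coef(flu.glm)['Age'])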

Rare disease hypothesis

  • When incidence is rare, $P(Y=0)\approxeq 1$ no matter what the covariates $X_j$’s are.
  • In this case, odds ratios are almost ratios of probabilities: $$OR_{X_j} \approxeq \frac{{\mathbb{P}}(Y=1|\dots, X_j=x_j+1, \dots)}{{\mathbb{P}}(Y=1|\dots, X_j=x_j, \dots)}$$
  • Hypothetical example: in a lung cancer study, if $X_j$ is an indicator of smoking, a $\beta_j$ of 5 means smokers are $e^5 \approx 150$ times more likely to develop lung cancer than non-smokers.
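A tiny numerical illustration of the approximation (a sketch with made-up probabilities): when both probabilities are small, the odds ratio and the ratio of probabilities nearly coincide.

In [ ]:
p = c(0.001, 0.002)
# odds ratio
(p[2] / (1 - p[2])) / (p[1] / (1 - p[1]))
# ratio of probabilities
p[2] / p[1]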

Odds ratios: flu example

  • In the flu example, the odds ratio for a 45 year old with health awareness 50 compared to a 35 year old with the same health awareness is $$e^{-1.429284+3.647052} \approx 9.19.$$
In [5]:
logodds = predict(flu.glm, list(Age=c(35,45),Health.Aware=c(50,50)),
                 type='link')
logodds
exp(logodds[2] - logodds[1])
logodds:
   1: -3.64705158221393
   2: -1.4292837652598
exp(logodds[2] - logodds[1]): 9.18680133938049

The estimated probabilities are below, yielding a ratio of $0.1932/0.0254 \approx 7.61$, not too far from the odds ratio of 9.19.

In [6]:
prob = exp(logodds)/(1+exp(logodds))
prob
prob[2] / prob[1]
prob:
   1: 0.0254056044584861
   2: 0.193210306432725
prob[2] / prob[1]: 7.60502694389495

Iteratively reweighted least squares

An algorithm to fit the model

  1. Initialize $\widehat{\pi}_i = \bar{Y}, 1 \leq i \leq n$
  2. Define $$Z_i = g(\widehat{\pi}_i) + g'(\widehat{\pi}_i) (Y_i - \widehat{\pi_i})$$
  3. Fit weighted least squares model $$Z_i \sim \sum_{j=1}^p \beta_j X_{ij}, \qquad w_i = \widehat{\pi_i} (1 - \widehat{\pi}_i)$$
  4. Set $\widehat{\pi}_i = \text{logit}^{-1} \left(\widehat{\beta}_0 + \sum_{j=1}^p \widehat{\beta}_j X_{ij}\right)$.
  5. Repeat steps 2-4 until convergence.
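Below is a minimal R sketch of this algorithm on the flu data (it reuses flu.glm only to extract the design matrix); after a few iterations the coefficients should essentially match those from glm().

In [ ]:
X = model.matrix(flu.glm)            # design matrix, first column all 1s
Y = flu.table$Shot
pi.hat = rep(mean(Y), length(Y))     # step 1: initialize
for (it in 1:25) {
  # step 2: working response, g = logit, g'(p) = 1 / (p (1 - p))
  Z = log(pi.hat / (1 - pi.hat)) + (Y - pi.hat) / (pi.hat * (1 - pi.hat))
  # step 3: weighted least squares with weights pi (1 - pi)
  W = pi.hat * (1 - pi.hat)
  beta.hat = solve(t(X) %*% (W * X), t(X) %*% (W * Z))
  # step 4: update fitted probabilities
  eta = X %*% beta.hat
  pi.hat = as.vector(exp(eta) / (1 + exp(eta)))
}
cbind(IRLS=as.vector(beta.hat), glm=coef(flu.glm))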

Newton-Raphson

The Newton-Raphson updates for logistic regression are $$ \hat{\beta} \mapsto \hat{\beta} - \nabla^2 DEV(\hat{\beta})^{-1} \nabla DEV(\hat{\beta}) $$

  • These turn out to be the same as the updates above.

  • In earlier statistical software one might only have access to a weighted least squares estimator.
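A small sketch of the pieces, reusing X and Y from the IRLS sketch above: for the deviance, $\nabla DEV(\beta) = -2X^T(Y - \pi(\beta))$ and $\nabla^2 DEV(\beta) = 2X^TWX$ with $W = \text{diag}(\pi_i(1-\pi_i))$, so one Newton step taken from the fitted $\widehat{\beta}$ should barely move it.

In [ ]:
pi.hat = fitted(flu.glm)
W = pi.hat * (1 - pi.hat)
grad = -2 * t(X) %*% (Y - pi.hat)   # gradient of the deviance
hess = 2 * t(X) %*% (W * X)         # Hessian of the deviance
# one Newton-Raphson step from the MLE: essentially coef(flu.glm) again
coef(flu.glm) - as.vector(solve(hess, grad))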

Inference

The IRLS procedure also hints at the approximate limiting distribution of $\widehat{\beta}$.

  • The IRLS procedure suggests using approximation $\widehat{\beta} \approx N(\beta, (X^TWX)^{-1})$
  • This allows us to construct CIs, test linear hypotheses, etc.
  • Intervals formed this way are called Wald intervals.
In [7]:
center = coef(flu.glm)['Age']
SE = sqrt(vcov(flu.glm)['Age', 'Age'])
U = center + SE * qnorm(0.975)
L = center - SE * qnorm(0.975)
data.frame(L, center, U)
             L    center         U
Age  0.0760395 0.2217768 0.3675141

Covariance

  • The estimated covariance uses the weights computed from the fitted model.
In [8]:
pi.hat = fitted(flu.glm)
W.hat = pi.hat * (1 - pi.hat)
X = model.matrix(flu.glm)
C = solve(t(X) %*% (W.hat * X))
c(SE, sqrt(C['Age', 'Age']))
  1. 0.0743571225661872
  2. 0.0743580666906677

Confidence intervals

  • The intervals above are slightly different from what R will give you if you ask it for confidence intervals.

  • R uses so-called profile intervals.

  • For large samples the two methods should agree quite closely.

In [9]:
CI = confint(flu.glm)
CI
mean(CI[2,]) # profile intervals are not symmetric around the estimate...
data.frame(L, center, U)
Waiting for profiling to be done...
                   2.5 %      97.5 %
(Intercept)  -38.0402235 -11.6669218
Age            0.1004533   0.4046856
Health.Aware   0.1026984   0.3595480

mean(CI[2,]): 0.252569449283972

             L    center         U
Age  0.0760395 0.2217768 0.3675141

Testing in logistic regression

What about comparing full and reduced model?

  • For a model ${\cal M}$, $DEV({\cal M})$ replaces $SSE({\cal M})$.
  • In least squares regression (with $\sigma^2$ known), we use $$\frac{1}{\sigma^2}\left( SSE({\cal M}_R) - SSE({\cal M}_F) \right) \overset{H_0:{\cal M}_R}{\sim} \chi^2_{df_R-df_F}$$
  • This is closely related to the $F$ statistic when $df_F$ is large: the difference is approximately $(df_R-df_F) \cdot F_{df_R-df_F,\, df_F}$.

  • For logistic regression this difference in $SSE$ is replaced with $$DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty, H_0:{\cal M}_R}{\sim} \chi^2_{df_R-df_F}$$

  • Resulting tests do not agree numerically with those coming from IRLS (Wald tests). Both are often used.
In [10]:
anova(glm(Shot ~ 1,
          data=flu.table, 
          family=binomial()), 
      flu.glm)
anova(glm(Shot ~ Health.Aware,
          data=flu.table, 
          family=binomial()), 
      flu.glm)
  Resid. Df Resid. Dev Df Deviance
1        49   68.02920 NA       NA
2        47   32.41631  2 35.61289

  Resid. Df Resid. Dev Df Deviance
1        48   49.27886 NA       NA
2        47   32.41631  1 16.86254

We compare each difference in deviance with a $\chi^2$ random variable with the corresponding degrees of freedom (2 for the first comparison, 1 for the second).

In [11]:
1 - pchisq(35.61, 2) # testing ~1 vs ~1 + Health.Aware + Age
1 - pchisq(16.863, 1) # testing ~ 1 + Health.Aware vs ~1 + Health.Aware + Age
1.85091617588284e-08
4.01771904346981e-05
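Equivalently, anova() can compute these p-values for us if we ask for a chi-squared test (a small sketch):

In [ ]:
anova(glm(Shot ~ Health.Aware, data=flu.table, family=binomial()),
      flu.glm, test='Chisq')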

Let's compare this with the Wald test:

In [12]:
summary(flu.glm)
Call:
glm(formula = Shot ~ Age + Health.Aware, family = binomial(), 
    data = flu.table)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5522  -0.2962  -0.1124   0.4208   2.3244  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -21.58458    6.41824  -3.363 0.000771 ***
Age            0.22178    0.07436   2.983 0.002858 ** 
Health.Aware   0.20351    0.06273   3.244 0.001178 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 32.416  on 47  degrees of freedom
AIC: 38.416

Number of Fisher Scoring iterations: 6

Diagnostics

  • Similar to least squares regression, except that the residuals used are usually the deviance residuals $r_i = \text{sign}(Y_i-\widehat{\pi}_i) \sqrt{DEV(\widehat{\pi}_i|Y_i)}.$

  • These agree with the usual residuals for least squares regression (a quick numerical check is sketched below).
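A sketch of that check: compute the deviance residuals by hand from the formula above and compare with residuals(flu.glm, type='deviance').

In [ ]:
pi.hat = fitted(flu.glm)
Y = flu.table$Shot
# per-observation deviance for a binary response
dev.i = -2 * (Y * log(pi.hat) + (1 - Y) * log(1 - pi.hat))
r = sign(Y - pi.hat) * sqrt(dev.i)
# should be essentially zero
max(abs(r - residuals(flu.glm, type='deviance')))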

In [13]:
par(mfrow=c(2,2))
plot(flu.glm)
In [14]:
influence.measures(flu.glm)
Influence measures of
	 glm(formula = Shot ~ Age + Health.Aware, family = binomial(),      data = flu.table) :

      dfb.1_  dfb.Age dfb.Hl.A    dffit cov.r   cook.d     hat inf
1  -0.017208  0.01543  0.01564 -0.01763 1.083 3.66e-05 0.01604    
2  -0.102897  0.09560  0.10416  0.13491 1.102 2.24e-03 0.05231    
3  -0.015043  0.01280  0.01442 -0.01556 1.081 2.85e-05 0.01454    
4  -0.058889  0.02418  0.11920  0.28755 1.040 1.18e-02 0.05850    
5  -0.103488  0.03853  0.17167  0.23475 1.144 6.94e-03 0.09789    
6  -0.130919  0.11898  0.11098 -0.15203 1.109 2.86e-03 0.06040    
7  -0.066734  0.06691  0.06003  0.07761 1.104 7.22e-04 0.04249    
8  -0.043827  0.04399  0.03438 -0.04704 1.101 2.62e-04 0.03508    
9  -0.018883  0.01881  0.01515 -0.01980 1.085 4.62e-05 0.01867    
10 -0.112978  0.09695  0.13265  0.20841 1.076 5.68e-03 0.05526    
11 -0.009578  0.09584 -0.04305  0.36490 1.004 2.08e-02 0.06318    
12  0.020212 -0.13373  0.05812 -0.43566 0.979 3.20e-02 0.07059    
13  0.192770 -0.50776  0.11790 -0.93778 1.040 1.63e-01 0.18972   *
14  0.372025 -0.47261 -0.14549  0.68814 0.902 1.02e-01 0.09827    
15 -0.038716  0.03596  0.03351 -0.04035 1.095 1.93e-04 0.02970    
16 -0.057484  0.06480  0.03735 -0.06892 1.121 5.65e-04 0.05393    
17 -0.046412  0.13914 -0.10858 -0.47973 0.993 3.86e-02 0.08404    
18 -0.029026  0.02499  0.03004  0.03257 1.090 1.25e-04 0.02416    
19  0.111762  0.19947 -0.49074 -1.00669 0.979 2.10e-01 0.18084   *
20 -0.063893  0.05199  0.06246 -0.06972 1.105 5.80e-04 0.04169    
21 -0.148884  0.11621  0.14276 -0.19327 1.096 4.76e-03 0.06199    
22 -0.046451  0.03665  0.05220  0.05658 1.101 3.81e-04 0.03615    
23 -0.058938  0.03695  0.06933 -0.07587 1.116 6.87e-04 0.05121    
24 -0.046668  0.04039  0.04338 -0.04918 1.098 2.87e-04 0.03329    
25  0.181347 -0.10844 -0.17866  0.43049 0.921 3.50e-02 0.05453    
26 -0.009162  0.00759  0.00903 -0.00953 1.077 1.07e-05 0.00994    
27 -0.085845  0.05982  0.10771  0.12539 1.116 1.91e-03 0.05923    
28 -0.030981  0.02315  0.03287 -0.03466 1.094 1.42e-04 0.02750    
29 -0.017942  0.01549  0.01840  0.01972 1.083 4.58e-05 0.01681    
30 -0.140841  0.11498  0.13148 -0.17269 1.103 3.74e-03 0.06098    
31 -0.160987  0.27877  0.03379  0.40526 1.178 2.18e-02 0.14807    
32 -0.005600  0.00447  0.00572 -0.00592 1.074 4.12e-06 0.00685    
33 -0.094999  0.09844  0.08375  0.11838 1.107 1.71e-03 0.05249    
34 -0.003504  0.00331  0.00324  0.00365 1.071 1.57e-06 0.00442    
35 -0.106728  0.08563  0.10366 -0.12263 1.111 1.83e-03 0.05562    
36 -0.019690  0.01858  0.01689 -0.02026 1.085 4.84e-05 0.01820    
37 -0.094324  0.08615  0.09622  0.11987 1.104 1.76e-03 0.05039    
38  0.652043 -0.52099 -0.63916  0.73958 0.630 2.72e-01 0.05275   *
39 -0.019865  0.16852 -0.19502 -0.60555 0.998 6.42e-02 0.11145    
40 -0.114652  0.12911  0.09175  0.15634 1.110 3.03e-03 0.06186    
41 -0.094999  0.09844  0.08375  0.11838 1.107 1.71e-03 0.05249    
42 -0.019865  0.16852 -0.19502 -0.60555 0.998 6.42e-02 0.11145    
43 -0.069257 -0.23785  0.34314 -0.81279 1.267 9.58e-02 0.25544   *
44  0.050764 -0.06530  0.01674  0.34876 0.982 1.96e-02 0.05344    
45 -0.079968  0.08479  0.05701 -0.09176 1.121 1.01e-03 0.05710    
46 -0.038716  0.03596  0.03351 -0.04035 1.095 1.93e-04 0.02970    
47 -0.000879 -0.00838  0.05023  0.31156 1.001 1.49e-02 0.05090    
48  0.040334 -0.19110  0.07980 -0.50899 0.992 4.40e-02 0.08986    
49 -0.025666  0.02050  0.02589 -0.02749 1.089 8.92e-05 0.02260    
50 -0.016308  0.01570  0.01369 -0.01682 1.083 3.33e-05 0.01600    

Model selection

As the model is likelihood based, each fitted model has an AIC, and stepwise selection can be used just as for linear regression.

In [15]:
step(flu.glm, scope=list(upper= ~.^2), direction='both')
Start:  AIC=38.42
Shot ~ Age + Health.Aware

                   Df Deviance    AIC
+ Age:Health.Aware  1   24.283 32.283
<none>                  32.416 38.416
- Age               1   49.279 53.279
- Health.Aware      1   56.078 60.078

Step:  AIC=32.28
Shot ~ Age + Health.Aware + Age:Health.Aware

                   Df Deviance    AIC
<none>                  24.283 32.283
- Age:Health.Aware  1   32.416 38.416
Call:  glm(formula = Shot ~ Age + Health.Aware + Age:Health.Aware, family = binomial(), 
    data = flu.table)

Coefficients:
     (Intercept)               Age      Health.Aware  Age:Health.Aware  
        26.75944          -0.88151          -0.82239           0.02365  

Degrees of Freedom: 49 Total (i.e. Null);  46 Residual
Null Deviance:	    68.03 
Residual Deviance: 24.28 	AIC: 32.28

Penalized regression

LASSO

  • Instead of just minimizing deviance, we can also look at penalized versions $$ \text{minimize}_{\beta} \frac{1}{2n} DEV(\beta) + \lambda \|\beta\|_1 $$
In [16]:
library(glmnet)
X = model.matrix(flu.glm)[,-1]
Y = as.numeric(flu.table$Shot)
G = glmnet(X, Y, family="binomial")
plot(G)
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-13

In [17]:
library(ElemStatLearn)
data(spam)
dim(spam)
X = model.matrix(spam ~ ., data=spam)[,-1]
Y = as.numeric(spam$spam == 'spam')
G = glmnet(X, Y, family='binomial')
plot(G)
  1. 4601
  2. 58
In [18]:
CV = cv.glmnet(X, Y, family='binomial')
plot(CV)
c(CV$lambda.min, CV$lambda.1se)
  1. 0.000442786380969069
  2. 0.00312377038361533

Extracting coefficients from glmnet

In [19]:
beta.hat = coef(G, s=CV$lambda.1se)
beta.hat
58 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept) -1.5830855234
A.1         -0.1388801655
A.2         -0.0775978236
A.3          0.1285587024
A.4          0.0807647798
A.5          0.5038182922
A.6          0.5319787173
A.7          2.2708935572
A.8          0.5479196034
A.9          0.4813025001
A.10         0.0787952926
A.11         .           
A.12        -0.1096373568
A.13         .           
A.14         0.0459229935
A.15         0.3977581954
A.16         0.7414061384
A.17         0.7359683360
A.18         0.2063671075
A.19         0.0781643265
A.20         0.4718180642
A.21         0.2181105007
A.22         0.2418330361
A.23         2.0363734630
A.24         0.5643225286
A.25        -1.1523055800
A.26        -0.6031629994
A.27        -0.6966886852
A.28         0.0993959886
A.29        -0.3397729762
A.30        -0.0636310823
A.31         .           
A.32         .           
A.33        -0.5811093985
A.34         .           
A.35        -0.2071110084
A.36         0.2347021221
A.37        -0.0617667077
A.38        -0.1658177936
A.39        -0.3297833503
A.40        -0.0952199518
A.41        -0.5874911315
A.42        -0.9004504494
A.43        -0.3994998977
A.44        -0.5770754114
A.45        -0.5186107765
A.46        -0.8827392284
A.47        -1.0180447520
A.48        -0.9635108591
A.49        -0.7854952727
A.50        -0.0093209799
A.51        -0.3422283121
A.52         0.4502944695
A.53         4.6954119068
A.54         0.1211366744
A.55         .           
A.56         0.0029582500
A.57         0.0005368089

Probit model

  • Probit regression model: $$\Phi^{-1}(E(Y|X))= \sum_{j=1}^{p} \beta_j X_{j}$$ where $\Phi$ is CDF of $N(0,1)$, i.e. $\Phi(t) = {\tt pnorm(t)}$, $\Phi^{-1}(q) = {\tt qnorm}(q)$.

  • Regression function $$ \begin{aligned} E(Y|X) &= E(Y|X_1,\dots,X_p) \\ &= P(Y=1|X_1, \dots, X_p) \\ & = {\tt pnorm}\left(\sum_{j=1}^p \beta_j X_j \right)\\ \end{aligned} $$

  • In logit, probit and cloglog ${\text{Var}}(Y_i)=\pi_i(1-\pi_i)$ but the model for the mean is different.

  • Coefficients no longer have an odds ratio interpretation.

In [20]:
summary(glm(Shot ~ Age + Health.Aware, 
            data=flu.table, 
            family=binomial(link='probit')))
Call:
glm(formula = Shot ~ Age + Health.Aware, family = binomial(link = "probit"), 
    data = flu.table)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5471  -0.2883  -0.0648   0.4060   2.2955  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -12.35039    3.22797  -3.826 0.000130 ***
Age            0.12786    0.03887   3.289 0.001005 ** 
Health.Aware   0.11642    0.03237   3.596 0.000323 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 32.076  on 47  degrees of freedom
AIC: 38.076

Number of Fisher Scoring iterations: 7

Generalized linear models

Given a dataset $(Y_i, X_{i1}, \dots, X_{ip}), 1 \leq i \leq n$ we consider a model for the distribution of $Y|X_1, \dots, X_p$.

  • If $\eta_i=g(E(Y_i|X_i)) = g(\mu_i) = \sum_{j=1}^p \beta_j X_{ij}$ then $g$ is called the link function for the model.
  • If ${\text{Var}}(Y_i) = \phi \cdot V({\mathbb{E}}(Y_i)) = \phi \cdot V(\mu_i)$ for $\phi > 0$ and some function $V$, then $V$ is called the variance function for the model.
  • Canonical reference: McCullagh and Nelder, Generalized Linear Models.

Binary regression as GLM

  • For a logistic model, $g(\mu)={\text{logit}}(\mu), \qquad V(\mu)=\mu(1-\mu).$
  • For a probit model, $g(\mu)=\Phi^{-1}(\mu), \qquad V(\mu)=\mu(1-\mu).$
  • For a cloglog model, $g(\mu)=-\log(-\log(\mu)), \qquad V(\mu)=\mu(1-\mu).$
  • All of these have dispersion $\phi=1$.
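In R, these choices are encoded in the family object passed to glm(); a small illustration of the link and variance functions:

In [ ]:
fam = binomial()              # logit link by default
fam$linkfun(0.25)             # logit(0.25) = log(1/3)
fam$variance(0.25)            # 0.25 * (1 - 0.25) = 0.1875
probit = binomial(link='probit')
probit$linkfun(0.25)          # qnorm(0.25)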