Classification

web.stanford.edu/class/stats202

Sergio Bacallado, Jonathan Taylor

Autumn 2022

Basic approach

  1. Medical diagnosis: Given the symptoms a patient shows, predict which of three conditions is causing them.

  2. Online banking: Determine whether a transaction is fraudulent or not, on the basis of the IP address, client’s history, etc.

  3. Web searching: Based on a user’s history, location, and the string of a web search, predict which link a person is likely to click.

  4. Online advertising: Predict whether a user will click on an ad or not.

Bayes classifier

The Bayes classifier assigns a point \(x_0\) to its most likely class given the predictors:

\[\hat y_0 = \text{argmax}_{\;y}\; P(Y=y \mid X=x_0).\]

For example, in a two-class problem with \(P(Y=1 \mid X=x_0) = 0.7\), it predicts \(\hat y_0 = 1\). Among all classifiers, this rule minimizes the expected error rate over \(m\) test points,

\[ E\left[ \frac{1}{m} \sum_{i=1}^m \mathbf{1}(\hat y_i \neq y_i) \right].\]

Basic strategy: estimate \(P(Y\mid X)\)

In practice \(P(Y \mid X)\) is unknown, so we estimate it and plug the estimate into the Bayes rule:

\[\hat y_0 = \text{argmax}_{\;y}\; \hat P(Y=y \mid X=x_0).\]

A first idea is a linear model for the probability,

\[P(Y=1 \mid X) = \beta_0 + \beta_1X_1 + \dots+ \beta_pX_p, \]

but its output is not constrained to lie between 0 and 1.

Logistic regression

\[ \begin{aligned} P(Y=1 \mid X) &= \frac{e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}}{1+e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}} \\ P(Y=0 \mid X) &= \frac{1}{1+e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}}. \end{aligned} \]

This is the same as using a linear model for the log odds:

\[\log\left[\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)}\right] = \beta_0 + \beta_1 X_1 +\dots+\beta_p X_p.\]
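In R, the two directions of this transformation are the built-in qlogis() (probability to log odds) and plogis() (log odds back to probability), which are handy for interpreting fitted coefficients:

qlogis(0.8)       # log odds of probability 0.8: log(0.8/0.2) = 1.386...
plogis(1.386294)  # back to the probability scale: 0.8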

Fitting logistic regression

\[\log\left[\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)}\right] = \beta_0 + \beta_1 X_1 +\dots+\beta_p X_p,\]

Likelihood

We choose \(\beta_0,\dots,\beta_p\) to maximize the likelihood of the observed labels,

\[\prod_{i=1}^n P(Y=y_i \mid X=x_i), \]

which under the logistic model equals

\[ \prod_{i=1}^n \left(\frac{e^{\beta_0 + \beta_1 x_{i1} +\dots+\beta_p x_{ip}}}{1+e^{\beta_0 + \beta_1 x_{i1} +\dots+\beta_p x_{ip}}}\right)^{y_i} \left(\frac{1}{1+e^{\beta_0 + \beta_1 x_{i1} + \dots+\beta_p x_{ip}}}\right)^{1-y_i}. \]
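A minimal sketch of fitting by direct likelihood maximization, using optim() on a single predictor from the Smarket data introduced below (for illustration only; glm() maximizes the same likelihood via iteratively reweighted least squares):

library(ISLR2)
x = Smarket$Lag1
y = as.numeric(Smarket$Direction == "Up")
negloglik = function(beta) {
  eta = beta[1] + beta[2] * x        # linear predictor
  -sum(y * eta - log(1 + exp(eta)))  # negative Bernoulli log-likelihood
}
optim(c(0, 0), negloglik)$par  # close to coef(glm(y ~ x, family = binomial))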

Logistic regression in R

library(ISLR2)
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
              family=binomial, data=Smarket)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.446  -1.203   1.065   1.145   1.326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3
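The fitted probabilities come from predict() with type = "response"; a standard follow-up (not part of the output above) converts them to class predictions with a 0.5 cutoff:

glm.probs = predict(glm.fit, type = "response")   # P(Direction = "Up" | X)
glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")  # predict the more likely class
mean(glm.pred == Smarket$Direction)               # training accuracy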

Inference for logistic regression

  1. We can estimate the standard error of each coefficient.

  2. The \(z\)-statistic is the equivalent of the \(t\)-statistic in linear regression:

\[z = \frac{\hat \beta_j}{\text{SE}(\hat\beta_j)}.\]

  3. The \(p\)-values test the null hypothesis \(\beta_j=0\) (the Wald test).

  4. Other possible hypothesis tests: the likelihood ratio test, which compares deviances against a chi-square distribution; see the sketch below.
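For example, a likelihood ratio test of the full Smarket model against the intercept-only model (a sketch; anova() compares the drop in deviance to a \(\chi^2\) distribution with 6 degrees of freedom):

glm.null = glm(Direction ~ 1, family = binomial, data = Smarket)
anova(glm.null, glm.fit, test = "Chisq")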

Example: Predicting credit card default

Predictors: student status, credit card balance, and income.

Confounding

In this dataset, there is confounding, but little collinearity.

Results: predicting credit card default

Confounding in Default data

Using only balance

summary(glm(default ~ balance,
        family=binomial, data=Default))
## 
## Call:
## glm(formula = default ~ balance, family = binomial, data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2697  -0.1465  -0.0589  -0.0221   3.7589  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
## balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1596.5  on 9998  degrees of freedom
## AIC: 1600.5
## 
## Number of Fisher Scoring iterations: 8

Using only student

summary(glm(default ~ student,
        family=binomial, data=Default))
## 
## Call:
## glm(formula = default ~ student, family = binomial, data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.2970  -0.2970  -0.2434  -0.2434   2.6585  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.50413    0.07071  -49.55  < 2e-16 ***
## studentYes   0.40489    0.11502    3.52 0.000431 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 2908.7  on 9998  degrees of freedom
## AIC: 2912.7
## 
## Number of Fisher Scoring iterations: 6

Using both balance and student

summary(glm(default ~ balance + student,
        family=binomial, data=Default))
## 
## Call:
## glm(formula = default ~ balance + student, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4578  -0.1422  -0.0559  -0.0203   3.7435  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.075e+01  3.692e-01 -29.116  < 2e-16 ***
## balance      5.738e-03  2.318e-04  24.750  < 2e-16 ***
## studentYes  -7.149e-01  1.475e-01  -4.846 1.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.7  on 9997  degrees of freedom
## AIC: 1577.7
## 
## Number of Fisher Scoring iterations: 8

Note the sign flip on studentYes: on its own, being a student is associated with a higher default rate, but at any given balance students are less likely to default. The resolution is that students tend to carry higher balances, and balance is what drives default; this is the confounding referred to above.

Using all 3 predictors

summary(glm(default ~ balance + income + student,
        family=binomial, data=Default))
## 
## Call:
## glm(formula = default ~ balance + income + student, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8

Multinomial logistic regression

With \(K > 2\) classes, we pick a baseline class (here class 1) and fit a linear model for the log odds of every other class \(j\) against it:

\[\log\left[\frac{P(Y=j \mid X)}{P(Y=1 \mid X)}\right] = \beta_{0,j} + \beta_{1,j} X_1 +\dots+\beta_{p,j} X_p\]
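One way to fit this in R is multinom() from the nnet package; here is a minimal sketch on the iris data (a hypothetical example, chosen because Species has three classes; the first level acts as the baseline class \(Y=1\)):

library(nnet)
fit = multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
summary(fit)$coefficients  # one row of (beta_0j, ..., beta_pj) per non-baseline class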

Some potential problems

Linear Discriminant Analysis (LDA)

\[ \begin{aligned} \hat P(Y = k \mid X = x) &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\hat P(X=x)} \\ &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\sum_{j=1}^K\hat P(X = x \mid Y=j) \hat P(Y=j)} \end{aligned} \]

LDA: multivariate normal with equal covariance

Decision boundaries

Density contours and decision boundaries for LDA with three classes.

LDA has (piecewise) linear decision boundaries

Suppose that:

  1. We know \(P(Y=k) = \pi_k\) exactly.

  2. \(P(X=x \mid Y=k)\) is multivariate normal with density:

\[f_k(x) = \frac{1}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]

  3. Above, \(\mu_k\) is the mean of the inputs for category \(k\), and \(\mathbf\Sigma\) is the covariance matrix, common to all categories.

Then, what is the Bayes classifier?

By Bayes' rule,

\[P(Y=k \mid X=x) = \frac{f_k(x) \pi_k}{P(X=x)}.\]

The denominator does not depend on \(k\), so for a fixed \(x\) it is a constant \(C\):

\[P(Y=k \mid X=x) = C \, f_k(x) \pi_k.\]

Substituting the normal density,

\[P(Y=k \mid X=x) = \frac{C\pi_k}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)},\]

and absorbing the factors that do not depend on \(k\) into a new constant \(C'\),

\[P(Y=k \mid X=x) = C'\pi_k\, e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}.\]

Decision boundaries

Taking logarithms and expanding the quadratic form, the terms that do not depend on \(k\) (including \(-\frac{1}{2}x^T\mathbf{\Sigma}^{-1}x\)) drop out, leaving the linear discriminant functions

\[\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k.\]

The boundary between classes \(k\) and \(\ell\) is the set of points where the discriminants agree:

\[ \begin{aligned} \delta_k(x) &= \delta_\ell(x) \\ \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k & = \log \pi_\ell - \frac{1}{2}\mu_\ell^T \mathbf{\Sigma}^{-1}\mu_\ell + x^T \mathbf{\Sigma}^{-1}\mu_\ell \end{aligned} \]

Decision boundaries revisited

Density contours and decision boundaries for LDA with three classes.

Estimating \(\pi_k\)

\[\hat \pi_k = \frac{\#\{i\;;\;y_i=k\}}{n}\]

Estimating the parameters of \(f_k(x)\)

Estimate the center of each class \(\mu_k\):

\[\hat\mu_k = \frac{1}{\#\{i\;;\;y_i=k\}}\sum_{i\;;\; y_i=k} x_i\]

Estimate the common variance (shown here for \(p=1\); for \(p>1\) the pooled covariance matrix \(\hat{\mathbf\Sigma}\) is defined analogously):

\[\hat \sigma^2 = \frac{1}{n-K}\sum_{k=1}^K \;\sum_{i:y_i=k} (x_i-\hat\mu_k)^2.\]

Final decision rule

\[\hat\delta_k(x) = \log \hat\pi_k - \frac{1}{2}\hat\mu_k^T \mathbf{\hat\Sigma}^{-1}\hat\mu_k + x^T \mathbf{\hat\Sigma}^{-1}\hat\mu_k\]
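As a sanity check, here is a minimal by-hand sketch of this rule on the iris data (a hypothetical example; the resulting class assignments should agree with MASS::lda):

X = as.matrix(iris[, 1:2])  # two predictors, for simplicity
y = iris$Species
n = nrow(X); K = nlevels(y)
pi.hat = table(y) / n                                             # class proportions
mu.hat = t(sapply(levels(y), function(k) colMeans(X[y == k, ])))  # class means
# Pooled covariance: within-class squared deviations, divided by n - K
S = Reduce(`+`, lapply(levels(y), function(k) {
  Xc = scale(X[y == k, ], center = mu.hat[k, ], scale = FALSE)
  t(Xc) %*% Xc
})) / (n - K)
S.inv = solve(S)
delta = sapply(levels(y), function(k) {  # discriminant scores, one column per class
  const = log(pi.hat[[k]]) - 0.5 * drop(mu.hat[k, ] %*% S.inv %*% mu.hat[k, ])
  const + drop(X %*% S.inv %*% mu.hat[k, ])
})
pred = levels(y)[max.col(delta)]  # assign each point to the largest discriminant
mean(pred == y)                   # training accuracy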

Quadratic discriminant analysis (QDA)

Comparison of LDA and QDA boundaries

QDA: multivariate normal with differing covariance
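In R, qda() from the MASS package has the same interface as lda(); a sketch on the Default data used below:

library(MASS); library(ISLR2)
qda.fit = qda(default ~ balance + student, data = Default)
table(predict(qda.fit)$class, Default$default)  # confusion matrix, training set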

Naive Bayes: special case of QDA
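A corresponding sketch with naiveBayes() from the e1071 package (assuming that package is installed; it models each predictor as independent within a class):

library(e1071)
nb.fit = naiveBayes(default ~ balance + student, data = Default)
table(predict(nb.fit, Default), Default$default)  # predicted vs. true classes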

Evaluating a classification method

The simplest summary is the misclassification rate on \(m\) test observations:

\[\frac{1}{m}\sum_{i=1}^m \mathbf{1}(y_i \neq \hat y_i).\]
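In R this is a one-liner, where y.hat and y are hypothetical vectors of predicted and true labels:

mean(y.hat != y)  # proportion of misclassified observations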

Confusion matrix for a 2 class problem

Confusion matrix for Default example

library(MASS) # where the `lda` function lives
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
## 
##     Boston
lda.fit = predict(lda(default ~ balance + student, data=Default))
table(lda.fit$class, Default$default)
##      
##         No  Yes
##   No  9644  252
##   Yes   23   81

  1. The error rate among people who do not default (the false positive rate) is very low: 23/9667, about 0.2%.

  2. However, the false negative rate (the error rate among people who do default) is 252/333, about 76%.

  3. False negatives may well be the bigger source of concern!

  4. One possible solution: change the threshold.

Changing decision rule

new.class = rep("No", length(Default$default))      # start by predicting "No" for everyone
new.class[lda.fit$posterior[,"Yes"] > 0.2] = "Yes"  # predict "Yes" when P(default) > 0.2
table(new.class, Default$default)
##          
## new.class   No  Yes
##       No  9432  138
##       Yes  235  195

With the lower threshold, the false negative rate falls to 138/333 (about 41%) at the cost of a higher false positive rate of 235/9667 (about 2.4%). Let’s visualize the dependence of the errors on the threshold:

Error rates for LDA classifier on Default dataset

In the figure, the dashed line is the false negative rate (error for defaulting customers), the dotted line is the false positive rate (error for non-defaulting customers), and the solid line is the overall error rate.

The ROC curve

ROC curve for LDA classifier on Default dataset.
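One way to reproduce such a curve is the pROC package (an assumption; any ROC utility works), applied to the LDA posterior probabilities computed earlier:

library(pROC)
roc.obj = roc(Default$default, lda.fit$posterior[, "Yes"])  # truth, then scores
plot(roc.obj)  # sensitivity vs. specificity across all thresholds
auc(roc.obj)   # area under the curve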

Comparing classification methods through simulation

Scenario 1

Instance for simulation scenario #1.

Scenario 2

Instance for simulation scenario #2.

Scenario 3

Instance for simulation scenario #3.

Results for first 3 scenarios

Simulation results for linear scenarios #1-3.

Scenario 4

Instance for simulation scenario #4.

Scenario 5

Scenario 6

Results for scenarios 4-6

Simulation results for nonlinear scenarios #4-6.