Supervised learning with a qualitative or categorical response.
Classification is just as common as regression, if not more so. Some examples:
Medical diagnosis: Given the symptoms a patient shows, predict which of three conditions the patient has.
Online banking: Determine whether a transaction is fraudulent or not, on the basis of the IP address, client’s history, etc.
Web searching: Based on a user’s history, location, and the text of the search query, predict which link the person is likely to click.
Online advertising: Predict whether a user will click on an ad or not.
The Bayes classifier predicts the most probable class given the input: \[\hat y_0 = \text{argmax}_{\;y}\; P(Y=y \mid X=x_0).\]
Performance is measured by the expected error rate (0-1 loss) over \(m\) test observations: \[ E\left[ \frac{1}{m} \sum_{i=1}^m \mathbf{1}(\hat y_i \neq y_i) \right].\]
Since the true conditional probabilities are unknown, we plug in estimates: \[\hat y_0 = \text{argmax}_{\;y}\; \hat P(Y=y \mid X=x_0).\]
One could try a linear model for the probability, \[P(Y=1 \mid X) = \beta_0 + \beta_1X_1 + \dots+ \beta_pX_p, \] but the right-hand side is not constrained to lie in \([0,1]\).
Logistic regression instead models the class probabilities as \[ \begin{aligned} P(Y=1 \mid X) &= \frac{e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}}{1+e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}} \\ P(Y=0 \mid X) &= \frac{1}{1+e^{\beta_0 + \beta_1 X_1 +\dots+\beta_p X_p}}. \end{aligned} \]
This is the same as using a linear model for the log odds:
\[\log\left[\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)}\right] = \beta_0 + \beta_1 X_1 +\dots+\beta_p X_p.\]
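To recover a probability from the log odds, invert the logit. A minimal sketch in R; the coefficient values and the input below are made up purely for illustration:

# hypothetical coefficients (illustration only): intercept plus two predictors
beta = c(-3, 0.05, -0.7)
x0 = c(1, 40, 1)              # a new input, with a leading 1 for the intercept
log.odds = sum(beta * x0)     # beta_0 + beta_1 x_1 + beta_2 x_2
plogis(log.odds)              # e^eta / (1 + e^eta), the implied P(Y = 1 | X = x0)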
The training data is a list of pairs \((y_1,x_1), (y_2,x_2), \dots, (y_n,x_n)\).
We don’t observe the left hand side in the model
\[\log\left[\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)}\right] = \beta_0 + \beta_1 X_1 +\dots+\beta_p X_p,\]
so we cannot fit it by least squares. Instead, we use maximum likelihood: the likelihood of the observed responses is \[\prod_{i=1}^n P(Y=y_i \mid X=x_i), \] which under the logistic model equals
\[ \prod_{i=1}^n \left(\frac{e^{\beta_0 + \beta_1 x_{i1} +\dots+\beta_p x_{ip}}}{1+e^{\beta_0 + \beta_1 x_{i1} +\dots+\beta_p x_{ip}}}\right)^{y_i} \left(\frac{1}{1+e^{\beta_0 + \beta_1 x_{i1} + \dots+\beta_p x_{ip}}}\right)^{1-y_i} \]
Choose estimates \(\hat \beta_0, \dots,\hat \beta_p\) which maximize the likelihood.
Solved with numerical methods (e.g. Newton’s algorithm).
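In R, glm fits this by iteratively reweighted least squares (Fisher scoring). For intuition, here is a bare-bones Newton update for logistic regression; the simulated data and starting values are made up for illustration:

# simulate a small logistic-regression data set
set.seed(1)
n = 500
X = cbind(1, rnorm(n), rnorm(n))            # design matrix with an intercept column
beta.true = c(-1, 2, -0.5)
y = rbinom(n, 1, plogis(X %*% beta.true))

beta = rep(0, ncol(X))                      # starting values
for (iter in 1:25) {
  p = as.vector(plogis(X %*% beta))         # current fitted probabilities
  W = p * (1 - p)                           # weights
  step = solve(t(X) %*% (W * X), t(X) %*% (y - p))   # Newton step: (X'WX)^{-1} X'(y - p)
  beta = beta + step
  if (max(abs(step)) < 1e-8) break
}
drop(beta)                                  # close to coef(glm(y ~ X[, -1], family = binomial))

In practice we simply call glm with family = binomial, as in the example below.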
library(ISLR2)
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
family=binomial, data=Smarket)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Smarket)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
We can estimate the standard error of each coefficient.
The \(z\)-statistic is the equivalent of the \(t\)-statistic in linear regression:
\[z = \frac{\hat \beta_j}{\text{SE}(\hat\beta_j)}.\]
The \(p\)-values are tests of the null hypothesis \(\beta_j=0\) (Wald test).
Other hypothesis tests are possible, e.g. the likelihood ratio test (compared against a chi-squared distribution).
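Both are easy to extract in R. A short sketch using glm.fit from the Smarket fit above:

coef(summary(glm.fit))                    # Estimate, Std. Error, z value, Pr(>|z|)
# likelihood ratio test: full model vs. intercept-only model
null.fit = glm(Direction ~ 1, family = binomial, data = Smarket)
anova(null.fit, glm.fit, test = "Chisq")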
Example: the Default data set. The response is default (whether the customer defaulted on their credit card debt). Predictors:
student: 1 if student, 0 otherwise
balance: credit card balance
income: the person’s income.
In this data set there is confounding, but little collinearity. Students tend to have higher balances, so balance is explained by student, though not very well. People with a high balance are more likely to default. However, among people with a given balance, students are less likely to default.
Logistic regression of default on balance:
##
## Call:
## glm(formula = default ~ balance, family = binomial, data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2697 -0.1465 -0.0589 -0.0221 3.7589
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
## balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1596.5 on 9998 degrees of freedom
## AIC: 1600.5
##
## Number of Fisher Scoring iterations: 8
Logistic regression of default on student:
##
## Call:
## glm(formula = default ~ student, family = binomial, data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.2970 -0.2970 -0.2434 -0.2434 2.6585
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.50413 0.07071 -49.55 < 2e-16 ***
## studentYes 0.40489 0.11502 3.52 0.000431 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 2908.7 on 9998 degrees of freedom
## AIC: 2912.7
##
## Number of Fisher Scoring iterations: 6
Logistic regression of default on balance and student:
##
## Call:
## glm(formula = default ~ balance + student, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4578 -0.1422 -0.0559 -0.0203 3.7435
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.075e+01 3.692e-01 -29.116 < 2e-16 ***
## balance 5.738e-03 2.318e-04 24.750 < 2e-16 ***
## studentYes -7.149e-01 1.475e-01 -4.846 1.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.7 on 9997 degrees of freedom
## AIC: 1577.7
##
## Number of Fisher Scoring iterations: 8
Logistic regression of default on balance, income, and student:
##
## Call:
## glm(formula = default ~ balance + income + student, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
Multinomial logistic regression: an extension of logistic regression to more than 2 categories.
Suppose \(Y\) takes values in \(\{1,2,\dots,K\}\), then we can use a linear model for the log odds against a baseline category (e.g. 1): for \(j \neq 1\)
\[\log\left[\frac{P(Y=j \mid X)}{P(Y=1 \mid X)}\right] = \beta_{0,j} + \beta_{1,j} X_1 +\dots+\beta_{p,j} X_p\]
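One way to fit this in R is with multinom from the nnet package; a minimal sketch on the built-in iris data (three classes), assuming nnet is installed:

library(nnet)
multi.fit = multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
summary(multi.fit)                               # one row of coefficients per non-baseline class
predict(multi.fit, iris[1:5, ], type = "probs")  # estimated class probabilities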
The coefficient estimates become unstable when there is collinearity; this also affects the convergence of the fitting algorithm.
When the classes are well separated, the coefficient estimates also become unstable. This is always the case when \(p\geq n-1\). In that case, the prediction error is low, but \(\hat{\beta}\) is highly variable.
Discriminant analysis takes a generative approach: model the distribution of \(X\) within each class, estimate the class probabilities, and combine them with Bayes’ rule: \[ \begin{aligned} \hat P(Y = k \mid X = x) &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\hat P(X=x)} \\ &= \frac{\hat P(X = x \mid Y = k) \hat P(Y = k)}{\sum_{j=1}^K\hat P(X = x \mid Y=j) \hat P(Y=j)} \end{aligned} \]
LDA is the special case of the above strategy when \(P(X \mid Y=k) = N(\mu_k, \mathbf\Sigma)\).
That is, within each class the features have multivariate normal distribution with center depending on the class and common covariance \(\mathbf\Sigma\).
The probabilities \(P(Y=k)\) are estimated by the fraction of training samples of class \(k\).
Suppose that:
We know \(P(Y=k) = \pi_k\) exactly.
\(P(X=x \mid Y=k)\) is multivariate normal with density:
\[f_k(x) = \frac{1}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
Then, what is the Bayes classifier?
\[P(Y=k \mid X=x) = \frac{f_k(x) \pi_k}{P(X=x)}\]
\[P(Y=k \mid X=x) = C \times f_k(x) \pi_k,\] where \(C = 1/P(X=x)\) does not depend on \(k\).
\[P(Y=k \mid X=x) = \frac{C\pi_k}{(2\pi)^{p/2}|\mathbf\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
\[P(Y=k \mid X=x) = C'\pi_k e^{-\frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}\]
Take the logarithm of both sides: \[\log P(Y=k \mid X=x) = \color{Red}{\log C'} + \color{blue}{\log \pi_k - \frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k)}.\]
The \(\log C'\) term (in red) is the same for every category \(k\), so it can be ignored.
We want to find the maximum of this expression over \(k\).
Goal is to maximize the following over \(k\): \[ \begin{aligned} &\log \pi_k - \frac{1}{2}(x-\mu_k)^T \mathbf{\Sigma}^{-1}(x-\mu_k).\\ =&\log \pi_k - \frac{1}{2}\left[x^T \mathbf{\Sigma}^{-1}x + \mu_k^T \mathbf{\Sigma}^{-1}\mu_k\right] + x^T \mathbf{\Sigma}^{-1}\mu_k \\ =& C'' + \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k \end{aligned} \]
We define the objectives (called discriminant functions): \[\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k\] At an input \(x\), we predict the response with the highest \(\delta_k(x)\).
The boundary between classes \(k\) and \(\ell\) is the set of \(x\) where the discriminants are equal: \[ \begin{aligned} \delta_k(x) &= \delta_\ell(x) \\ \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k + x^T \mathbf{\Sigma}^{-1}\mu_k & = \log \pi_\ell - \frac{1}{2}\mu_\ell^T \mathbf{\Sigma}^{-1}\mu_\ell + x^T \mathbf{\Sigma}^{-1}\mu_\ell, \end{aligned} \] which is linear in \(x\).
In practice, the parameters are estimated from the training data. Estimate the class probabilities by the class proportions: \[\hat \pi_k = \frac{\#\{i\;;\;y_i=k\}}{n}\]
Estimate the class means by the within-class averages: \[\hat\mu_k = \frac{1}{\#\{i\;;\;y_i=k\}}\sum_{i\;;\; y_i=k} x_i\]
Estimate the common covariance matrix \(\mathbf{\Sigma}\):
One predictor (\(p=1\)):
\[\hat \sigma^2 = \frac{1}{n-K}\sum_{k=1}^K \;\sum_{i:y_i=k} (x_i-\hat\mu_k)^2.\] With \(p>1\) predictors, the analogous pooled sample covariance matrix is \[\hat{\mathbf\Sigma} = \frac{1}{n-K}\sum_{k=1}^K \;\sum_{i:y_i=k} (x_i-\hat\mu_k)(x_i-\hat\mu_k)^T.\]
Plug the estimates into the discriminant functions: \[\hat\delta_k(x) = \log \hat\pi_k - \frac{1}{2}\hat\mu_k^T \mathbf{\hat\Sigma}^{-1}\hat\mu_k + x^T \mathbf{\hat\Sigma}^{-1}\hat\mu_k\]
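A minimal sketch of these plug-in estimates in R, on simulated data with two classes and two predictors (in practice one would call lda from the MASS package instead):

set.seed(1)
n = 100
y = rep(0:1, each = n / 2)
x = matrix(rnorm(2 * n), n, 2) + 2 * y          # class 1 is shifted by (2, 2)
pi.hat = table(y) / n                           # class proportions
mu.hat = rbind(colMeans(x[y == 0, ]), colMeans(x[y == 1, ]))
centered = x - mu.hat[y + 1, ]                  # subtract each point's class mean
Sigma.hat = crossprod(centered) / (n - 2)       # pooled covariance (K = 2)
x0 = c(1, 1)                                    # a new input
delta = sapply(1:2, function(k)
  log(pi.hat[k]) - 0.5 * t(mu.hat[k, ]) %*% solve(Sigma.hat) %*% mu.hat[k, ] +
    t(x0) %*% solve(Sigma.hat) %*% mu.hat[k, ])
which.max(delta) - 1                            # predicted class (0 or 1)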
The assumption that the inputs of every class have the same covariance \(\mathbf{\Sigma}\) can be quite restrictive.
Figure: Bayes decision boundary (dashed), LDA (dotted), QDA (solid).
In quadratic discriminant analysis we estimate a mean \(\hat\mu_k\) and a covariance matrix \(\hat{\mathbf \Sigma}_k\) for each class separately.
Repeating the derivation with class-specific covariances gives the discriminant functions: \[\delta_k(x) = \log \pi_k - \frac{1}{2}\mu_k^T \mathbf{\Sigma}_k^{-1}\mu_k + x^T \mathbf{\Sigma}_k^{-1}\mu_k - \frac{1}{2}x^T \mathbf{\Sigma}_k^{-1}x -\frac{1}{2}\log |\mathbf{\Sigma}_k|\]
These discriminants are quadratic in \(x\), so the decision boundaries are zero sets of quadratic functions.
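In R, QDA is available through qda in the MASS package; a short sketch on the Default data using the continuous predictors:

library(MASS)
qda.fit = qda(default ~ balance + income, data = Default)
qda.pred = predict(qda.fit, Default)
table(qda.pred$class, Default$default)     # confusion matrix on the training data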
Naive Bayes (with Gaussian features) adds the extra constraint that the covariance matrices \(\mathbf{\Sigma}_k\) are diagonal.
Fewer parameters to estimate than full QDA.
Might be misspecified.
Still, can perform quite well with many features… better than QDA.
A diagonal Gaussian covariance is equivalent to assuming that the features \(X_j\) are conditionally independent given the label \(Y\).
Naive Bayes more generally allows categorical variables, and models other than the Gaussian for the feature densities.
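One readily available implementation is naiveBayes in the e1071 package (assuming that package is installed); it uses Gaussian densities for numeric predictors and frequency tables for factors:

library(e1071)
nb.fit = naiveBayes(default ~ balance + income + student, data = Default)
nb.pred = predict(nb.fit, Default)          # predicted classes
table(nb.pred, Default$default)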
So far we have measured performance with the overall error rate (0-1 loss): \[\frac{1}{m}\sum_{i=1}^m \mathbf{1}(y_i \neq \hat y_i).\]
It is possible to make the wrong prediction for some classes more often than others. The 0-1 loss doesn’t tell you anything about this.
A much more informative summary of the error is a confusion matrix:
Confusion matrix for the Default example:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
lda.fit = predict(lda(default ~ balance + student, data=Default))
table(lda.fit$class, Default$default)
##
## No Yes
## No 9644 252
## Yes 23 81
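The entries of this table give the error rates directly:

mean(lda.fit$class != Default$default)     # overall training error: (252 + 23) / 10000
252 / (252 + 81)                           # error among true defaulters: about 76%
23 / (23 + 9644)                           # error among non-defaulters: about 0.2%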
The error rate among people who do not default (false positive rate) is very low.
However, the error rate among people who do default (the false negative rate) is 76%.
It is possible that false negatives are a bigger source of concern!
One possible solution: Change the threshold
new.class = rep("No", length(Default$default))
new.class[lda.fit$posterior[,"Yes"] > 0.2] = "Yes"
table(new.class, Default$default)
##
## new.class No Yes
## No 9432 138
## Yes 235 195
Predict Yes if \(P(\mathtt{default}=\text{Yes} \mid X) > \color{Red}{0.2}\). Changing the threshold to 0.2 makes it easier to classify an observation as Yes.
Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.
Let’s visualize the dependence of the error on the threshold:
Figure: error rates as a function of the threshold. Dashed: false negative rate (error for defaulting customers); dotted: false positive rate (error for non-defaulting customers); solid: overall error rate.
The ROC curve displays the performance of the method for every choice of threshold.
The area under the curve (AUC) measures the quality of the classifier.
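A sketch of how to compute the ROC curve and the AUC in R, assuming the pROC package is installed and lda.fit is the fit from the Default example above:

library(pROC)
roc.obj = roc(Default$default, lda.fit$posterior[, "Yes"])   # response, then predicted probability
plot(roc.obj)                                                # ROC curve
auc(roc.obj)                                                 # area under the curve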
Simulate data from several different known distributions with \(2\) predictors and a binary response variable.
Compare the test error (0-1 loss) of the classification methods discussed above.
Scenario 1: \(X_1,X_2\) normal with identical variance; no correlation in either class.
Scenario 2: \(X_1,X_2\) normal with identical variance; correlation \(-0.5\) in both classes.
Scenario 3: \(X_1,X_2\) Student \(t\)-distributed; no correlation in either class.
Scenario 4: \(X_1, X_2\) normal with identical variance; the first class has correlation \(0.5\), the second class has correlation \(-0.5\).
Scenario 5: \(X_1, X_2\) normal with identical variance; the response \(Y\) was sampled from \[ P(Y=1 \mid X) = \frac{e^{\beta_0+\beta_1 X_1^2+\beta_2X_2^2+\beta_3X_1X_2}}{1+e^{\beta_0+\beta_1X_1^2+\beta_2X_2^2+\beta_3X_1X_2}}. \] The true decision boundary is quadratic, but this is not a QDA model. (Why?)
Scenario 6: \(X_1, X_2\) normal with identical variance; the response \(Y\) was sampled from \[ P(Y=1 \mid X) = \frac{e^{f_\text{nonlinear}(X_1,X_2)}}{1+e^{f_\text{nonlinear}(X_1,X_2)}}. \] The true decision boundary is very rough.
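A minimal sketch of how such a comparison can be set up, here for the first scenario only (uncorrelated normal predictors with equal variance); the means, sample sizes, and the set of methods compared are chosen for illustration:

library(MASS)                                # mvrnorm, lda, qda
set.seed(1)
sim.one = function(n) {
  y = rbinom(n, 1, 0.5)
  x = t(sapply(y, function(k) mvrnorm(1, mu = c(k, k), Sigma = diag(2))))
  data.frame(y = factor(y), X1 = x[, 1], X2 = x[, 2])
}
train = sim.one(200)
test = sim.one(2000)

err = function(pred) mean(pred != test$y)    # 0-1 test error
p.glm = predict(glm(y ~ X1 + X2, family = binomial, data = train),
                test, type = "response")
c(logistic = err(ifelse(p.glm > 0.5, "1", "0")),
  lda = err(predict(lda(y ~ X1 + X2, data = train), test)$class),
  qda = err(predict(qda(y ~ X1 + X2, data = train), test)$class))

In the actual comparison, this would be repeated over many simulated training sets and the resulting test errors averaged.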