Classification (ISL 4)

Econ 425T

Author

Dr. Hua Zhou @ UCLA

Published

February 7, 2023

Credit: This note heavily uses material from the books An Introduction to Statistical Learning: with Applications in R (ISL2) and Elements of Statistical Learning: Data Mining, Inference, and Prediction (ESL2).

Display system information for reproducibility.

import IPython
print(IPython.sys_info())

{'commit_hash': 'add5877a4',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/IPython',
 'ipython_version': '8.8.0',
 'os_name': 'posix',
 'platform': 'macOS-10.16-x86_64-i386-64bit',
 'sys_executable': '/Library/Frameworks/Python.framework/Versions/3.10/bin/python3',
 'sys_platform': 'darwin',
 'sys_version': '3.10.9 (v3.10.9:1dd9be6584, Dec  6 2022, 14:37:36) [Clang '
                '13.0.0 (clang-1300.0.29.30)]'}

sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9        here_1.0.1        lattice_0.20-45   png_0.1-8        
 [5] rprojroot_2.0.3   digest_0.6.30     grid_4.2.2        lifecycle_1.0.3  
 [9] jsonlite_1.8.4    magrittr_2.0.3    evaluate_0.18     rlang_1.0.6      
[13] stringi_1.7.8     cli_3.4.1         rstudioapi_0.14   Matrix_1.5-1     
[17] reticulate_1.26   vctrs_0.5.1       rmarkdown_2.18    tools_4.2.2      
[21] stringr_1.5.0     glue_1.6.2        htmlwidgets_1.6.0 xfun_0.35        
[25] yaml_2.3.6        fastmap_1.1.0     compiler_4.2.2    htmltools_0.5.4  
[29] knitr_1.41

1 Overview of classification

Qualitative variables take values in an unordered set \(\mathcal{C}\), such as: \(\text{eye color} \in \{\text{brown}, \text{blue}, \text{green}\}\).
Given a feature vector \(X\) and a qualitative response \(Y\) taking values in the set \(\mathcal{C}\), the classification task is to build a function \(C(X)\) that takes as input the feature vector \(X\) and predicts its value for \(Y\), i.e. \(C(X) \in \mathcal{C}\).
Often we are more interested in estimating the probabilities that \(X\) belongs to each category in \(\mathcal{C}\).

2 Credit `Default` data

Response variable: Credit default (yes or no).
Predictors:
- balance: credit card balance.
- income: income of customer.
- student: customer is a student or not.

# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 1.2)
# Display all columns
pd.set_option('display.max_columns', None)

# Import Credit Default data
Default = pd.read_csv("../data/Default.csv")
Default

     default student      balance        income
0         No      No   729.526495  44361.625074
1         No     Yes   817.180407  12106.134700
2         No      No  1073.549164  31767.138947
3         No      No   529.250605  35704.493935
4         No      No   785.655883  38463.495879
...      ...     ...          ...           ...
9995      No      No   711.555020  52992.378914
9996      No      No   757.962918  19660.721768
9997      No      No   845.411989  58636.156984
9998      No      No  1569.009053  36669.112365
9999      No     Yes   200.922183  16862.952321

[10000 rows x 4 columns]

# Visualize income ~ balance
plt.figure()
sns.relplot(
  data = Default, 
  x = 'balance', 
  y = 'income', 
  hue = 'default', 
  style = 'default', 
  height = 10
  ).set(
  xlabel = 'Balance', 
  ylabel = 'Income'
  );
plt.show()

# Visualize balance ~ default
plt.figure()
sns.boxplot(
  data = Default, 
  x = 'default', 
  y = 'balance'
  ).set(
  xlabel = 'Default', 
  ylabel = 'Balance'
  );
plt.show()

# Visualize income ~ default
plt.figure()
sns.boxplot(
  data = Default, 
  x = 'default', 
  y = 'income'
  ).set(
  xlabel = 'Default', 
  ylabel = 'Income'
  );
plt.show()

# Visualize student ~ default
plt.figure()
sns.countplot(
  data = Default, 
  x = 'default', 
  hue = 'student'
  ).set(
  xlabel = 'Default', 
  ylabel = 'Count'
  );
plt.show()

library(ISLR2)
library(GGally) # ggpairs function
library(tidyverse)

# Credit default data
Default <- as_tibble(Default) %>% 
  print(width = Inf)

# A tibble: 10,000 × 4
   default student balance income
   <fct>   <fct>     <dbl>  <dbl>
 1 No      No         730. 44362.
 2 No      Yes        817. 12106.
 3 No      No        1074. 31767.
 4 No      No         529. 35704.
 5 No      No         786. 38463.
 6 No      Yes        920.  7492.
 7 No      No         826. 24905.
 8 No      Yes        809. 17600.
 9 No      No        1161. 37469.
10 No      No           0  29275.
# … with 9,990 more rows

# Visualization by pair plot
ggpairs(
  data = Default, 
  mapping = aes(), 
  lower = list(continuous = "smooth", combo = "box", discrete = "ratio")
  ) + 
  labs(title = "Credit Default Data")

3 Linear vs logistic regression

If we code default as \[ Y = \begin{cases} 0 & \text{if No} \\ 1 & \text{if Yes} \end{cases}, \] can we simply perform a linear regression of \(Y\) on \(X\) and classify as Yes if \(\hat Y > 0.5\)?
- Since \(\operatorname{E}(Y \mid X = x) = \operatorname{Pr}(Y=1 \mid X = x)\), we might think that linear regression is perfect for this task.
- The issue is that linear regression may produce probabilities <0 or >1, which does not make sense.

Code

# 0-1 (No-Yes) coding
Default['y'] = (Default['default'] == 'Yes').astype(int)
# Linear regression fit vs logistic regression fit
plt.figure()
sns.regplot(
  data = Default, 
  x = 'balance', 
  y = 'y',
  label = 'Linear regression'
  )
sns.regplot(
  data = Default, 
  x = 'balance', 
  y = 'y', 
  logistic = True, 
  label = 'Logistic regression',
  line_kws = {'color': 'g'}
  ).set(
    xlabel = 'Balance', 
    ylabel = 'Probability of Default'
  )
plt.show()

Figure 1: Predicted probabilities by linear regression (blue line) vs logistic regression (green line).

Code

Default %>%
  mutate(y = as.integer(default == "Yes")) %>%
  ggplot(mapping = aes(x = balance, y = y)) + 
  geom_point() + 
  geom_smooth(
    method = "lm",
    color = "blue",
    show.legend = TRUE
    ) + 
  geom_smooth(
    method = "glm", 
    method.args = list(family = binomial),
    color = "green",
    show.legend = TRUE
    ) +
  labs(
    x = "Balance",
    y = "Probability of Default"
    )

Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms. \[ Y = \begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}. \] This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure.

Linear regression is not appropriate here. Multiclass Logistic Regression or Discriminant Analysis are more appropriate.

4 Logistic regression

Let’s write \[ p(X) = \operatorname{Pr}(Y = 1 \mid X) \] for short. Logistic regression assumes \[ p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}. \] \(p(X) \in [0, 1]\), no matter what values of \(\beta_j\) and \(X\) take.
A bit of rearrangement gives \[ \log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p. \] This monotone transformation is called the log odds or logit transformation of \(p(X)\).
We use maximum likelihood to estimate the parameters. That is find \(\beta_0, \ldots, \beta_p\) that maximizes the likelihood function \[ \ell(\beta_0, \ldots, \beta_p) = \prod_{i:y_i=1} p(x_i) \prod_{i: y_i = 0} [1 - p(x_i)]. \]

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

# Code student as numerical 0/1
col_tf = make_column_transformer(
  # OHE transformer for categorical variables
  (OneHotEncoder(drop = 'first'), ['student']),
  remainder = 'passthrough'
  )
X = Default[['student', 'balance', 'income']]
y = Default['default']

# Pipeline
pipe_logit = Pipeline(steps = [
  ("col_tf", col_tf),
  ("model", LogisticRegression())
])

# Fit logistic regression
logit_fit = pipe_logit.fit(X, y)
logit_fit

# Predicted labels from logistic regression

Pipeline(steps=[('col_tf',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['student'])])),
                ('model', LogisticRegression())])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

logit_pred = logit_fit.predict(X)

# Confusion matrix from the logistic regression
logit_cfm = pd.crosstab(
  logit_pred, 
  y,
  margins = True, 
  rownames = ['Predicted Default Stats'],
  colnames = ['True Default Status']
  )
logit_cfm

True Default Status        No  Yes    All
Predicted Default Stats                  
No                       9609  271   9880
Yes                        58   62    120
All                      9667  333  10000

Overall training accuracy of logistic regression (using 0.5 as threshold) is

(logit_cfm.loc['Yes', 'Yes'] + logit_cfm.loc['No', 'No']) / logit_cfm.loc['All', 'All']

0.9671

The area-under-ROC curve (AUC) of logistic regression (sklearn) is

from sklearn.metrics import roc_auc_score

logit_auc = roc_auc_score(
  y,
  logit_fit.predict_proba(X)[:, 1]
)
logit_auc

0.9070324073944639

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit logistic regression
logit_mod = smf.logit(
  formula = 'y ~ balance + income + student', 
  data = Default
  ).fit()

Optimization terminated successfully.
         Current function value: 0.078577
         Iterations 10

logit_mod.summary()

# Predicted labels from logistic regression

Logit Regression Results
Dep. Variable:	y	No. Observations:	10000
Model:	Logit	Df Residuals:	9996
Method:	MLE	Df Model:	3
Date:	Tue, 07 Feb 2023	Pseudo R-squ.:	0.4619
Time:	10:44:24	Log-Likelihood:	-785.77
converged:	True	LL-Null:	-1460.3
Covariance Type:	nonrobust	LLR p-value:	3.257e-292

	coef	std err	z	P>\|z\|	[0.025	0.975]
Intercept	-10.8690	0.492	-22.079	0.000	-11.834	-9.904
student[T.Yes]	-0.6468	0.236	-2.738	0.006	-1.110	-0.184
balance	0.0057	0.000	24.737	0.000	0.005	0.006
income	3.033e-06	8.2e-06	0.370	0.712	-1.3e-05	1.91e-05

Possibly complete quasi-separation: A fraction 0.15 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

logit_smpred = np.where(logit_mod.predict(X) > 0.5, 'Yes', 'No')
  
# Confusion matrix from the logistic regression
logit_smcfm = pd.crosstab(
  logit_smpred, 
  y,
  margins = True, 
  rownames = ['Predicted Default Stats'],
  colnames = ['True Default Status']
  )

Overall accuracy of logistic regression (by statsmodels)

(logit_smcfm.loc['Yes', 'Yes'] + logit_smcfm.loc['No', 'No']) / logit_smcfm.loc['All', 'All']

0.9732

The area-under-ROC curve (AUC) of logistic regression (statsmodels) is

logit_smauc = roc_auc_score(
  y,
  logit_mod.predict(X)
)
logit_smauc

0.9495581233452343

library(gtsummary)

logit_mod <- glm(
  default ~ balance + income + student, 
  family = binomial, 
  data = Default
  )
summary(logit_mod)


Call:
glm(formula = default ~ balance + income + student, family = binomial, 
    data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4691  -0.1418  -0.0557  -0.0203   3.7383  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152    
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8

# logit_mod %>% tbl_regression()

# Predicted labels from logistic regression
logit_pred = ifelse(
  predict(logit_mod, Default, type = "response") > 0.5,
  "Yes",
  "No"
)

# Confusion matrix
logit_cfm = table(Predicted = logit_pred, Default = Default$default)

# Accuracy
(logit_cfm['Yes', 'Yes'] + logit_cfm['No', 'No']) / sum(logit_cfm)

[1] 0.9732

We interpret the logistic regression coefficient as the expected change in log odds associated with one-unit increase in the corresponding predictor.
Wait! Why the coefficient of student is negative, contradicting with the plot?
Confounding: student status is confounded with balance. Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students.

But for each level of balance, students default less than non-students.

Multiple logistic regression can tease this out.

Figure 2: Student status confounds with credit card balance. Students tend to have higher balances than non-students. Even though students default less at than non-students for each level of balance, marginally their default rate is higher than for non-students.

Code

ggplot(data = Default) + 
  geom_boxplot(mapping = aes(x = student, y = balance)) + 
  labs(
    x = "Student Status",
    y = "Credit Card Balance",
    title = "Students tend to carry larger credit card balance."
    )

Code

Default %>%
  mutate(yhat = logit_mod$fitted.values) %>%
  ggplot() + 
    geom_point(mapping = aes(x = balance, y = yhat, color = student)) +
    labs(
      x = "Credit Card Balance",
      y = "Default Rate",
      subtitle = "Even though students default less at than non-students for each level of balance, marginally their default rate is higher than for non-students."
    )

5 Multinomial logit regression

For more than two classes, we generalize logistic regression to \[ \operatorname{Pr}(Y = k \mid X) = \frac{e^{\beta_{0k} + \beta_{1k} X_1 + \cdots + \beta_{pk} X_p}}{\sum_{\ell = 1}^K e^{\beta_{0\ell} + \beta_{1\ell} X_1 + \cdots + \beta_{p\ell} X_p}}. \] Note each class has its own set of regression coefficients.

In Python, we can fit multinomial logit model by statsmodels.discrete.discrete_model.MNLogit. In R, we can use glmnet package or nnet package.

6 Discriminant analysis

Another approach for classification is to model the distribution of \(X\) in each of the classes separately, and then use Bayes theorem to flip things around and obtain \(\operatorname{Pr}(Y = j \mid X)\).
When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis.
However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions.
Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem: \[\begin{eqnarray*} p_k(x) &=& \operatorname{Pr}(Y = k \mid X = x) \\ &=& \frac{\operatorname{Pr}(X = x, Y = k)}{\operatorname{Pr}(X = x)} \\ &=& \frac{\operatorname{Pr}(X = x \mid Y = k) \cdot \operatorname{Pr}(Y = k)}{\operatorname{Pr}(X = x)} \\ &=& \frac{\pi_k f_k(x)}{\sum_{\ell=1}^K \pi_\ell f_\ell(x)}, \end{eqnarray*}\] where
- \(f_k(x) = \operatorname{Pr}(X = x \mid Y = k)\) is the density of \(X\) in class \(k\)
- \(\pi_k = \operatorname{Pr}(Y = k)\) is the marginal or prior probability for class \(k\).
Advantages of discriminant analysis.
- When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable (separation or quasi-separation). Linear discriminant analysis does not suffer from this problem.
- If \(n\) is small and the distribution of the predictors \(X\) is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.

6.1 Linear discriminant analysis (LDA)

Assume \(X \in \mathbb{R}^p\) in the \(k\)-th class is distributed as as multivariate normal \(N(\mu_k, \boldsymbol{\Sigma})\) with density \[ f_k(x) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} e^{- \frac{1}{2} (x - \mu_k)^T \boldsymbol{\Sigma}^{-1} (x - \mu_k)}. \] To classify \(X=x\), we need to see which of \(p_k(x)\) is small. Taking logs, and discarding terms that do not depend on \(k\), we just need to assign \(x\) to the class with the largest discriminant score \[ \delta_k(x) = x^T \boldsymbol{\Sigma}^{-1} \mu_k - \frac{1}{2} \mu_k^T \boldsymbol{\Sigma}^{-1} \mu_k + \log \pi_k, \] which is a linear function in \(x\)!
We estimate the unknown parameters \(\pi_k\), \(\mu_k\), and \(\boldsymbol{\Sigma}\) by \[\begin{eqnarray*} \hat{\pi}_k &=& \frac{n_k}{n} \\ \hat{\mu}_k &=& \frac{1}{n_k} \sum_{i: y_k = k} x_i \\ \hat{\boldsymbol{\Sigma}} &=& \frac{1}{n-K} \sum_{k=1}^K \sum_{i: y_i = k} (x_i - \hat \mu_k)(x_i - \hat \mu_k)^T. \end{eqnarray*}\]
Once we have estimated \(\hat \delta_k(x)\), we can turn these into class probabilities \[ \hat{\operatorname{Pr}}(Y = k \mid X = x) = \frac{e^{\hat \delta_k(x)}}{\sum_{\ell=1}^K e^{\hat \delta_{\ell}(x)}}. \]

Figure 3: Here \(\pi_1 = \pi_2 = \pi_3 = 1/3\). The dashed lines are known as the Bayes decision boundaries. Were they known, they would yield the fewest misclassification errors, among all possible classifiers. The black line is the LDA decision boundary.

LDA on the credit Default data.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Pipeline
pipe_lda = Pipeline(steps = [
  ("col_tf", col_tf),
  ("model", LinearDiscriminantAnalysis())
])

# Fit LDA
lda_fit = pipe_lda.fit(X, y)

# Predicted labels from LDA
lda_pred = lda_fit.predict(X)

# Confusion matrix
lda_cfm = pd.crosstab(
  lda_pred, 
  y, 
  margins = True, 
  rownames = ['Predicted Default Stats'],
  colnames = ['True Default Status']
  )
lda_cfm

True Default Status        No  Yes    All
Predicted Default Stats                  
No                       9645  254   9899
Yes                        22   79    101
All                      9667  333  10000

Overall training accuracy of LDA (using 0.5 as threshold) is

(lda_cfm.loc['Yes', 'Yes'] + lda_cfm.loc['No', 'No']) / lda_cfm.loc['All', 'All']

0.9724

The area-under-ROC curve (AUC) of LDA is

lda_auc = roc_auc_score(
  y,
  lda_fit.predict_proba(X)[:, 1]
)
lda_auc

0.9495202246831501

library(MASS)

# Fit LDA
lda_mod <- lda(
  default ~ balance + income + student, 
  data = Default
  )
lda_mod

Call:
lda(default ~ balance + income + student, data = Default)

Prior probabilities of groups:
    No    Yes 
0.9667 0.0333 

Group means:
      balance   income studentYes
No   803.9438 33566.17  0.2914037
Yes 1747.8217 32089.15  0.3813814

Coefficients of linear discriminants:
                     LD1
balance     2.243541e-03
income      3.367310e-06
studentYes -1.746631e-01

# Predicted labels from LDA
lda_pred = predict(lda_mod, Default)

# Confusion matrix
lda_cfm = table(Predicted = lda_pred$class, Default = Default$default)

# Accuracy
(lda_cfm['Yes', 'Yes'] + lda_cfm['No', 'No']) / sum(lda_cfm)

[1] 0.9724

6.2 Quadratic discriminant analysis (QDA)

In LDA, the normal distribution for each class shares the same covariance \(\boldsymbol{\Sigma}\).
If we assume that the normal distribution for class \(k\) has covariance \(\boldsymbol{\Sigma}_k\), then it leads to the quadratic discriminant analysis (QDA).
The discriminant function takes the form \[ \delta_k(x) = - \frac{1}{2} (x - \mu_k)^T \boldsymbol{\Sigma}_k^{-1} (x - \mu_k) + \log \pi_k - \frac{1}{2} \log |\boldsymbol{\Sigma}_k|, \] which is a quadratic function in \(x\).

Figure 4: The Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem. Left: \(\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2\). Right: \(\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2\).

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Pipeline
pipe_qda = Pipeline(steps = [
  ("col_tf", col_tf),
  ("model", QuadraticDiscriminantAnalysis())
])

# Fit QDA
qda_fit = pipe_qda.fit(X, y)

# Predicted labels from QDA
qda_pred = qda_fit.predict(X)

# Confusion matrix from the QDA
qda_cfm = pd.crosstab(
  qda_pred, 
  y,
  margins = True, 
  rownames = ['Predicted Default Stats'],
  colnames = ['True Default Status']
  )
qda_cfm

True Default Status        No  Yes    All
Predicted Default Stats                  
No                       9636  239   9875
Yes                        31   94    125
All                      9667  333  10000

Overall training accuracy of QDA (using 0.5 as threshold) is

(qda_cfm.loc['Yes', 'Yes'] + qda_cfm.loc['No', 'No']) / qda_cfm.loc['All', 'All']

0.973

The area-under-ROC curve (AUC) of QDA is

qda_auc = roc_auc_score(
  y,
  qda_fit.predict_proba(X)[:, 1]
)
qda_auc

0.9492120650701389

# Fit QDA
qda_mod <- qda(
  default ~ balance + income + student, 
  data = Default
  )
qda_mod

Call:
qda(default ~ balance + income + student, data = Default)

Prior probabilities of groups:
    No    Yes 
0.9667 0.0333 

Group means:
      balance   income studentYes
No   803.9438 33566.17  0.2914037
Yes 1747.8217 32089.15  0.3813814

# Predicted probabilities from QDA
qda_pred = predict(qda_mod, Default)

# Confusion matrix
qda_cfm = table(Predicted = qda_pred$class, Default = Default$default)

# Accuracy
(qda_cfm['Yes', 'Yes'] + qda_cfm['No', 'No']) / sum(qda_cfm)

[1] 0.973

6.3 Naive Bayes

If we assume \(f_k(x) = \prod_{j=1}^p f_{jk}(x_j)\) (conditional independence model) in each class, we get naive Bayes.

For Gaussian this means the \(\boldsymbol{\Sigma}_k\) are diagonal.
Naive Bayes is useful when \(p\) is large (LDA and QDA break down).
Can be used for \(mixed\) feature vectors (both continuous and categorical). If \(X_j\) is qualitative, replace \(f_{kj}(x_j)\) with probability mass function (histogram) over discrete categories.
Despite strong assumptions, naive Bayes often produces good classification results.

from sklearn.naive_bayes import GaussianNB

# Pipeline
pipe_nb = Pipeline(steps = [
  ("col_tf", col_tf),
  ("model", GaussianNB())
])

# Fit Naive Bayes classifier
nb_fit = pipe_nb.fit(X, y)

# Predicted labels by NB classifier
nb_pred = nb_fit.predict(X)

# Confusion matrix of NB classifier
nb_cfm = pd.crosstab(
  nb_pred, 
  y, 
  margins = True, 
  rownames = ['Predicted Default Stats'],
  colnames = ['True Default Status']
  )
nb_cfm

True Default Status        No  Yes    All
Predicted Default Stats                  
No                       9620  246   9866
Yes                        47   87    134
All                      9667  333  10000

Overall training accuracy of Naive Bayes classifier (using 0.5 as threshold) is

(nb_cfm.loc['Yes', 'Yes'] + nb_cfm.loc['No', 'No']) / nb_cfm.loc['All', 'All']

0.9707

The area-under-ROC curve (AUC) of NB is

nb_auc = roc_auc_score(
  y,
  nb_fit.predict_proba(X)[:, 1]
)
nb_auc

0.9450012751967858

library(e1071)

# Fit Naive Bayes classifier
nb_mod <- naiveBayes(
  default ~ balance + income + student, 
  data = Default
  )
nb_mod


Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
    No    Yes 
0.9667 0.0333 

Conditional probabilities:
     balance
Y          [,1]     [,2]
  No   803.9438 456.4762
  Yes 1747.8217 341.2668

     income
Y         [,1]     [,2]
  No  33566.17 13318.25
  Yes 32089.15 13804.22

     student
Y            No       Yes
  No  0.7085963 0.2914037
  Yes 0.6186186 0.3813814

# Predicted labels from Naive Bayes
nb_pred = predict(nb_mod, Default)

# Confusion matrix
nb_cfm = table(Predicted = nb_pred, Default = Default$default)

# Accuracy
(nb_cfm['Yes', 'Yes'] + nb_cfm['No', 'No']) / sum(nb_cfm)

[1] 0.9707

7 \(K\)-nearest neighbor (KNN) classifier

KNN is a nonparametric classifier.
Given a positive integer \(K\) and a test observation \(x_0\), the KNN classifier first identifies the \(K\) points in the training data that are closest to \(x_0\), represented by \(\mathcal{N}_0\). It estimates the conditional probability by \[ \operatorname{Pr}(Y=j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j) \] and then classifies the test observation \(x_0\) to the class with the largest probability.
We illustrate KNN with \(K=5\) on the credit default data.

from sklearn.neighbors import KNeighborsClassifier

# Pipeline
pipe_knn = Pipeline(steps = [
  ("col_tf", col_tf),
  ("model", KNeighborsClassifier(n_neighbors = 5))
])

# Fit KNN with K = 5
knn_fit = pipe_knn.fit(X, y)

# Predicted labels from KNN
knn_pred = knn_fit.predict(X)

# Confusion matrix of KNN
knn_cfm = pd.crosstab(
  knn_pred, 
  y, 
  margins = True, 
  rownames = ['Predicted Default Stats'],
  colnames = ['True Default Status']
  )
knn_cfm

True Default Status        No  Yes    All
Predicted Default Stats                  
No                       9648  231   9879
Yes                        19  102    121
All                      9667  333  10000

Overall training accuracy of KNN classifier with \(K=5\) (using 0.5 as threshold) is

(knn_cfm.loc['Yes', 'Yes'] + knn_cfm.loc['No', 'No']) / knn_cfm.loc['All', 'All']

0.975

The area-under-ROC curve (AUC) of KNN (\(K=5\)) is

knn_auc = roc_auc_score(
  y,
  knn_fit.predict_proba(X)[:, 1]
)
knn_auc

0.9806299006154183

library(class)

X_default <- Default %>% 
  mutate(x_student = as.integer(student == "Yes")) %>%
  dplyr::select(x_student, balance, income)

# KNN prediction with K = 5
knn_pred <- knn(
  train = X_default, 
  test = X_default,
  cl = Default$default,
  k = 5
  )

# Confusion matrix
knn_cfm = table(Predicted = knn_pred, Default = Default$default)

# Accuracy
(knn_cfm['Yes', 'Yes'] + knn_cfm['No', 'No']) / sum(knn_cfm)

[1] 0.975

8 Evaluation of classification performance: false positive, false negative, ROC and AUC

Let’s summarize the classification performance of different classifiers on the training data.

# Confusion matrix from the null classifier (always 'No')
null_cfm = pd.DataFrame(
  data = {
    'No': [9667, 0, 9667],
    'Yes': [333, 0, 333],
    'All': [10000, 0, 10000]
    },
  index = ['No', 'Yes', 'All']
  )
null_pred = np.repeat('No', 10000)
# Fitted classifiers
classifiers = [logit_fit, lda_fit, qda_fit, nb_fit, knn_fit]
# Confusion matrices
cfms = [logit_cfm, lda_cfm, qda_cfm, nb_cfm, knn_cfm, null_cfm]

sm_df = pd.DataFrame(
  data = {
  'acc': [(cfm.loc['Yes', 'Yes'] + cfm.loc['No', 'No']) / cfm.loc['All', 'All'] for cfm in cfms],
  'fpr': [(cfm.loc['Yes', 'No'] / cfm.loc['All', 'No']) for cfm in cfms],
  'fnr': [(cfm.loc['No', 'Yes'] / cfm.loc['All', 'Yes']) for cfm in cfms],
  'auc': np.append([roc_auc_score(y, c.predict_proba(X)[:, 1]) for c in classifiers], roc_auc_score(y, np.repeat(0, 10000)))
  },
  index = ['logit', 'LDA', 'QDA', 'NB', 'KNN', 'Null']
  )
sm_df.sort_values('acc')

          acc       fpr       fnr       auc
Null   0.9667  0.000000  1.000000  0.500000
logit  0.9671  0.006000  0.813814  0.907032
NB     0.9707  0.004862  0.738739  0.945001
LDA    0.9724  0.002276  0.762763  0.949520
QDA    0.9730  0.003207  0.717718  0.949212
KNN    0.9750  0.001965  0.693694  0.980630

There are two types of classification errors:
- False positive rate: The fraction of negative examples that are classified as positive.
- False negative rate: The fraction of positive examples that are classified as negative.

Above table is the training classification performance of classifiers using their default thresholds. Varying thresholds lead to varying false positive rates (1 - specificity) and true positive rates (sensitivity). These can be plotted as the receiver operating characteristic (ROC) curve. The overall performance of a classifier is summarized by the area under ROC curve (AUC).

from sklearn.metrics import roc_curve
from sklearn.metrics import RocCurveDisplay

# plt.rcParams.update({'font.size': 12})
for idx, m in enumerate(classifiers):
  plt.figure();
  RocCurveDisplay.from_estimator(m, X, y, name = sm_df.iloc[idx].name);
  plt.show()
  
# ROC curve of the null classifier (always No or always Yes)

<Figure size 1400x1000 with 0 Axes>
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x129301120>
<Figure size 1400x1000 with 0 Axes>
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x129301120>
<Figure size 1400x1000 with 0 Axes>
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x129301120>
<Figure size 1400x1000 with 0 Axes>
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x129301120>
<Figure size 1400x1000 with 0 Axes>
<sklearn.metrics._plot.roc_curve.RocCurveDisplay object at 0x129301120>

plt.figure()
RocCurveDisplay.from_predictions(y, np.repeat(0, 10000), pos_label = 'Yes', name = 'Null Classifier');

/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/metrics/_plot/roc_curve.py:125: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  fig, ax = plt.subplots()

plt.show()

See sklearn.metrics for other popularly used metrics for classification tasks.

9 Comparison between classifiers

For a two-class problem, we can show that for LDA, \[ \log \left( \frac{p_1(x)}{1 - p_1(x)} \right) = \log \left( \frac{p_1(x)}{p_2(x)} \right) = c_0 + c_1 x_1 + \cdots + c_p x_p. \] So it has the same form as logistic regression. The difference is in how the parameters are estimated.
Logistic regression uses the conditional likelihood based on \(\operatorname{Pr}(Y \mid X)\) (known as discriminative learning).

LDA, QDA, and Naive Bayes use the full likelihood based on \(\operatorname{Pr}(X, Y)\) (known as generative learning).
Despite these differences, in practice the results are often very similar.
Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.
Logistic regression is very popular for classification, especially when \(K = 2\).
LDA is useful when \(n\) is small, or the classes are well separated, and Gaussian assumptions are reasonable. Also when \(K > 2\).
Naive Bayes is useful when \(p\) is very large.
LDA is a special case of QDA.
Under normal assumption, Naive Bayes leads to linear decision boundary, thus a special case of LDA.
KNN classifier is non-parametric and can dominate LDA and logistic regression when the decision boundary is highly nonlinear, provided that \(n\) is very large and \(p\) is small.
See ISL Section 4.5 for theoretical and empirical comparisons of logistic regression, LDA, QDA, NB, and KNN.

1 Overview of classification

2 Credit Default data

3 Linear vs logistic regression

4 Logistic regression

5 Multinomial logit regression

6 Discriminant analysis

6.1 Linear discriminant analysis (LDA)

6.2 Quadratic discriminant analysis (QDA)

6.3 Naive Bayes

7 \(K\)-nearest neighbor (KNN) classifier

8 Evaluation of classification performance: false positive, false negative, ROC and AUC

9 Comparison between classifiers

2 Credit `Default` data