Statistical Learning (ISL 2)

Econ 425T

Author

Dr. Hua Zhou @ UCLA

Published

January 12, 2023

Credit: This note heavily uses material from the books An Introduction to Statistical Learning: with Applications in R (ISL2) and Elements of Statistical Learning: Data Mining, Inference, and Prediction (ESL2).

Display system information for reproducibility.

import IPython
print(IPython.sys_info())

{'commit_hash': 'add5877a4',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/IPython',
 'ipython_version': '8.8.0',
 'os_name': 'posix',
 'platform': 'macOS-10.16-x86_64-i386-64bit',
 'sys_executable': '/Library/Frameworks/Python.framework/Versions/3.10/bin/python3',
 'sys_platform': 'darwin',
 'sys_version': '3.10.9 (v3.10.9:1dd9be6584, Dec  6 2022, 14:37:36) [Clang '
                '13.0.0 (clang-1300.0.29.30)]'}

sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9        here_1.0.1        lattice_0.20-45   png_0.1-8        
 [5] rprojroot_2.0.3   digest_0.6.30     grid_4.2.2        lifecycle_1.0.3  
 [9] jsonlite_1.8.4    magrittr_2.0.3    evaluate_0.18     rlang_1.0.6      
[13] stringi_1.7.8     cli_3.4.1         rstudioapi_0.14   Matrix_1.5-1     
[17] reticulate_1.26   vctrs_0.5.1       rmarkdown_2.18    tools_4.2.2      
[21] stringr_1.5.0     glue_1.6.2        htmlwidgets_1.6.0 xfun_0.35        
[25] yaml_2.3.6        fastmap_1.1.0     compiler_4.2.2    htmltools_0.5.4  
[29] knitr_1.41

using InteractiveUtils

versioninfo()

1 What is statistical learning

1.1 `income` data

The Income data contains Income, Years of Education, and Seniority of individuals.

# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 2)
# Display all columns
pd.set_option('display.max_columns', None)

# Import income2 data
income = pd.read_csv(
  "../data/Income2.csv",
  index_col = 0
  )
income

    Education   Seniority     Income
1   21.586207  113.103448  99.917173
2   18.275862  119.310345  92.579135
3   12.068966  100.689655  34.678727
4   17.034483  187.586207  78.702806
5   19.931034   20.000000  68.009922
6   18.275862   26.206897  71.504485
7   19.931034  150.344828  87.970467
8   21.172414   82.068966  79.811030
9   20.344828   88.275862  90.006327
10  10.000000  113.103448  45.655529
11  13.724138   51.034483  31.913808
12  18.689655  144.137931  96.282997
13  11.655172   20.000000  27.982505
14  16.620690   94.482759  66.601792
15  10.000000  187.586207  41.531992
16  20.344828   94.482759  89.000701
17  14.137931   20.000000  28.816301
18  16.620690   44.827586  57.681694
19  16.620690  175.172414  70.105096
20  20.344828  187.586207  98.834012
21  18.275862  100.689655  74.704699
22  14.551724  137.931034  53.532106
23  17.448276   94.482759  72.078924
24  10.413793   32.413793  18.570665
25  21.586207   20.000000  78.805784
26  11.241379   44.827586  21.388561
27  19.931034  168.965517  90.814035
28  11.655172   57.241379  22.636163
29  12.068966   32.413793  17.613593
30  17.034483  106.896552  74.610960

# Plot Income ~ Education
sns.lmplot(
  data = income, 
  x = "Education", 
  y = "Income",
  lowess = True,
  scatter_kws = {'alpha' : 0.5},
  height = 8
  ).set(
  xlabel = 'Years of Education', 
  ylabel = 'Income (k$)'
  )

Figure 1: Income increases nonlinearly with education.

# Plot Income ~ Seniority
sns.lmplot(
  data = income, 
  x = "Seniority", 
  y = "Income",
  lowess = True,
  scatter_kws = {'alpha' : 0.5},
  height = 8
  ).set(
  xlabel = 'Seniority', 
  ylabel = 'Income (k$)'
  )

Figure 2: Income increases nonlinearly with seniority.

library(tidyverse)

# Import Income2 data
income <- read_csv("../data/Income2.csv", col_select = Education:Income) %>% 
  print(width = Inf)

# A tibble: 30 × 3
   Education Seniority Income
       <dbl>     <dbl>  <dbl>
 1      21.6     113.    99.9
 2      18.3     119.    92.6
 3      12.1     101.    34.7
 4      17.0     188.    78.7
 5      19.9      20     68.0
 6      18.3      26.2   71.5
 7      19.9     150.    88.0
 8      21.2      82.1   79.8
 9      20.3      88.3   90.0
10      10       113.    45.7
# … with 20 more rows

# Plot income ~ Education
income %>%
  ggplot(mapping = aes(x = Education, y = Income)) + 
  geom_point() + 
  geom_smooth() +
  labs(title = "Income increases nonlinearly with education",
       x = "Years of Education",
       y = "Income (k$)")

# Plot income ~ Seniority
income %>%
  ggplot(mapping = aes(x = Seniority, y = Income)) + 
  geom_point() + 
  geom_smooth() +
  labs(title = "Income increases nonlinearly with seniority",
       x = "Seniority",
       y = "Income (k$)")

Can we predict Income using Years of Education? Can we predict Income using Seniority? Perhaps we can do better with a model using both? $Income \approx f (Education, Seniority)$
$Y$ : Income is the response, or target, or output, or outcome, or dependent variables that we wish to predict.
$X$ : Education and Seniority. $X = (\begin{matrix} X_{1} \\ ⋮ \\ X_{p} \end{matrix})$ is the vector of features, or inputs, or predictors, or independent variables.
We assume the model $\begin{matrix} (1) & Y = f (X) + ϵ, \end{matrix}$ where
- $f$ represents the systematic information that $X$ provides about $Y$ ;
- The error term $ϵ$ captures measurement errors and other discrepancies.
In essence, statistical learning refers to a set of approaches for estimating the function $f$ .

1.2 Why estimate $f$ ?

Prediction. With a good estimate $\hat{f}$ , we can predict $Y$ at new points $\hat{Y} = \hat{f} (X) .$
Inference.
- We can understand which components of $X = (X_{1}, \dots, X_{p})$ are important in explaining $Y$ , and which are irrelevant. For example, Seniority and Years of Education have a big impact on Income, but Marital Status typically does not.
- Depending on the complexity of $f$ , we may be able to understand how each component $X_{j}$ of $X$ affects $Y$ . For example, how much extra income will one earn with two more year of education? Does a linear relationship hold?
- Causal inference. We may infer whether purposedly changing the values of certain predictors will change outcomes. For example, does an advertising campaign increase the sales? Or it’s just seasonal change in sales?

1.3 The optimal (but unrealistic) $f$

It can be shown (HW1) that $f_{opt} = E (Y | X)$ minimizes the mean-squared prediction error $E [Y - f (X)]^{2},$ where the expectations averages over variations in both $X$ and $Y$ .
However we almost never know the conditional distribution of $Y$ given $X$ . In practice, we use various learning methods to estimate $f$ .

1.4 Reducible and irreducible errors

Assuming $\hat{f}$ and $X$ are fixed, then

$\begin{array}{rcl} E (Y - \hat{Y})^{2} & = & E [f (X) + ϵ - \hat{f} (X)]^{2} \\ = & E \underset{Reducible}{\underset{⏟}{[f (X) - \hat{f} (X)]^{2}}} + \underset{Irreducible}{\underset{⏟}{Var (ϵ)}} . \end{array}$

Statistical learning techniques may yield better $\hat{f}$ , thus decreasing the reducible errors.
Even if we knew the truth $f$ , we would still make errors in prediction due to the irreducible error.

1.5 How to estimate $f$ ?

Our goal is to apply a statistical learning method to find a function $\hat{f}$ such that $Y \approx \hat{f} (X)$ .

1.5.1 Parametric or structured model

Step 1. We make an assumption about the functional form, or shape, of $f$ .

For example, one may assume that $f$ is linear in $X$ : $f (X) = β_{0} + β_{1} X_{1} + \dots β_{p} X_{p} .$
Step 2. We use the training data ${(x_{1}, y_{1}), \dots, (x_{n}, y_{n})}$ to train or fit the model.

The most common approach for fitting the linear model is least squares. However there are many other possible ways to fit the linear model (to be discussed later).

Blue surface is the true

f

Yellow surface is the linear model fit

\hat{f}

Figure 3: Linear approximation of

f

Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function $f$ .

1.5.2 Non-parametric model

Non-parametric methods do not make explicit assumptions about the functional form of $f$ . Instead they seek an estimate of $f$ that gets as close to the data points as much as possible without being too rough or wiggly.
- Advantages: better fit.
- Disadvantage: many more paremeters; need more training samples to accurately estimate $f$ .
Thin-plate spline (discussed later) approximates the true $f$ better than the linear model.

Blue surface is the true

f

Yellow surface is the thin-plate spline fit

\hat{f}

Figure 4: Thin-plate spline fit of

f

We may even fit an extremely flexible spline model such that the fitted model makes no errors on the training data! By doing this, we are in the danger of overfitting. Overfit models do not generalize well, which means the prediction accuracy on a separate test data can be very bad.

Blue surface is the true

f

Overfitting

Figure 5: Overfitting of

f

1.6 Some trade-offs

Prediction accuracy vs interpretability.
- Linear models are easy to interpret; thin-plate splines are not.
Good fit vs over-fit or under-fit.
- How do we know when the fit is just right?
Parsimony vs black-box.
- Practitioners often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

Figure 6: Trade-off of model flexibility vs interpretability.

1.7 Assessing model accuracy

Given training data ${(x_{1}, y_{1}), \dots, (x_{n}, y_{n})}$ , we fit a model $\hat{f}$ . We can evaluate the model accuracy on the training data by the mean squared error ${MSE}_{train} = \frac{1}{n} \sum_{i = 1}^{n} [y_{i} - \hat{f} (x_{i})]^{2} .$ The smaller ${MSE}_{train}$ , the better model fit.
However, in most situations, we are not interested in the training MSE. Rather, we are interested in the accuracy of the predictions on previously unseen test data.
- If we have a separate test set with both predictors and outcomes. Then the task is easy, we choose the learning method that yields the best test MSE ${MSE}_{test} = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} [y_{i} - \hat{f} (x_{i})]^{2} .$
- In many applications, we don’t have a separate test set. Is this a good idea to choose the learning method with smallest training MSE?

Figure 7: Black curve is truth. Red curve on right is the test MSE, grey curve is the training MSE. Orange, blue and green curves/squares correspond to fits of different flexibility.

Figure 8: Here the truth is smoother, so the smoother fit and linear model do really well.

Figure 9: Here the truth is wiggly and the noise is low, so the more flexible fits do the best.

As the previous three examples illustrate, the flexibility level corresponding to the model with the minimal test MSE can vary considerably among data sets.
Later we will discuss the cross-validation strategy to estimate test MSE using only the training data.

1.8 Bias-variance trade-off

The U-shaped observed in the test MSE curves (Figure 7-Figure 9) reflects the bias-variance trade-off.
Let $(x_{0}, y_{0})$ be a test observation. Under the model Equation 1, the expected prediction error (EPE) at $x_{0}$ , or the test error, or generalization error, can be decomposed as (HW1) $E [y_{0} - \hat{f} (x_{0})]^{2} = \underset{MSE of \hat{f} (x_{0}) for estimating f (x_{0})}{\underset{⏟}{Var (\hat{f} (x_{0})) + [Bias (\hat{f} (x_{0}))]^{2}}} + \underset{irreducible}{\underset{⏟}{Var (ϵ)}},$ where
- $Bias (\hat{f} (x_{0})) = E [\hat{f} (x_{0})] - f (x_{0})$ ;
- the expectation averages over the variability in $y_{0}$ and $\hat{f}$ (function of training data).
Typically as the flexibility of $\hat{f}$ increases, its variance increases and its bias decreases.

Figure 10: Bias-variance trade-off.

1.9 Classical regime vs modern regime

Above U-shaped test MSE curves are in the so-called classical regime where the number of features (or the degree of freedom) is less than the training samples.

In the modern regime, where the number of features (or the degree of freedom) can be order of magnitude larger than the training samples (recall that ChatGPT3 model has 175 billion parameters!), the double descent phenomenon is observed and being actively studied. See the recent paper and references therein.

Figure 11: Double descent phenomenon (OpenAI Blog).

1.10 Classification problems

When the outcome $Y$ is discrete, for example, email is one of $C =$ {spam, ham} (ham=good email), handwritten digit is one of $C = {0, 1, \dots, 9}$ .
Our goals are to
- build a classifier $f (X)$ that assigns a class label from $C$ to a future unlabeled observation $X$ ;
- assess the uncertainty in each classification;
- understand the roles of the different predictors among $X = (X_{1}, \dots, X_{p})$ .
To evaluate the performance of classification algorithms, the training error rate is $\frac{1}{n} \sum_{i = 1}^{n} I (y_{i} \neq {\hat{y}}_{i}),$ where ${\hat{y}}_{i} = \hat{f} (x_{i})$ is the predicted class label for the $i$ th observation using $\hat{f}$ .
As in the regression setting, we are most interested in the test error rate associated with a set of test observations $\frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} I (y_{i} \neq {\hat{y}}_{i}) .$
Suppose $C = {1, 2, \dots, K}$ , the Bayes classifier assigns a test observation with predictor vector $x_{0}$ to the class $j \in C$ for which $\Pr (Y = j ∣ X = x_{0})$ is largest.

In a two-class problem $K = 2$ , the Bayes classifier assigns a test case to class 1 if $\Pr (Y = 1 ∣ X = x_{0}) > 0.5$ , and to class 2 otherwise.
The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate $1 - max_{j} \Pr (Y = j ∣ X = x_{0})$ at $X = x_{0}$ . The overall Bayes error is given by $1 - E [max_{j} \Pr (Y = j ∣ X)],$ where the expectation averages over all possible values of $X$ .
Unfortunately, for real data, we don’t know the conditional distribution of $Y$ given $X$ , and computing the Bayes classifier is impossible.
Various learning algorithms attempt to estimate the conditional distribution of $Y$ given $X$ , and then classify a given observation to the class with the highest estimated probability.
One simple classifier is the $K$ -nearest neighbor (KNN) classifier. Given a positive integer $K$ and a test observation $x_{0}$ , the KNN classifier first identifies the $K$ points in the training data that are closest to $x_{0}$ , represented by $N_{0}$ . It then estimates the conditional probability by $\Pr (Y = j ∣ X = x_{0}) = \frac{1}{K} \sum_{i \in N_{0}} I (y_{i} = j)$ and then classifies the test observation $x_{0}$ to the class with the largest probability.

Figure 12: Black curve is the KNN decision boundary using

K = 10

. The purple dashed line is the Bayes decision boundary.

Smaller $K$ yields more flexible classification rule.

Figure 13: Left panel: KNN with

K = 1

. Right panel: KNN with

K = 100

Bias-variance trade-off of KNN.

Figure 14: KNN with

K \approx 10

achieves the Bayes error rate (black dashed line).

1 What is statistical learning

1.1 income data

1.2 Why estimate f?

1.3 The optimal (but unrealistic) f

1.4 Reducible and irreducible errors

1.5 How to estimate f?

1.5.1 Parametric or structured model

1.5.2 Non-parametric model

1.6 Some trade-offs

1.7 Assessing model accuracy

1.8 Bias-variance trade-off

1.9 Classical regime vs modern regime

1.10 Classification problems

1.1 `income` data

1.2 Why estimate $f$ ?

1.3 The optimal (but unrealistic) $f$

1.5 How to estimate $f$ ?