Linear Regression (ISL 3)

Econ 425T

Author

Dr. Hua Zhou @ UCLA

Published

January 17, 2023

Credit: This note heavily uses material from the books An Introduction to Statistical Learning: with Applications in R (ISL2) and The Elements of Statistical Learning: Data Mining, Inference, and Prediction (ESL2).

Display system information for reproducibility.

import IPython
print(IPython.sys_info())
{'commit_hash': 'add5877a4',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/IPython',
 'ipython_version': '8.8.0',
 'os_name': 'posix',
 'platform': 'macOS-10.16-x86_64-i386-64bit',
 'sys_executable': '/Library/Frameworks/Python.framework/Versions/3.10/bin/python3',
 'sys_platform': 'darwin',
 'sys_version': '3.10.9 (v3.10.9:1dd9be6584, Dec  6 2022, 14:37:36) [Clang '
                '13.0.0 (clang-1300.0.29.30)]'}
sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9        here_1.0.1        lattice_0.20-45   png_0.1-8        
 [5] rprojroot_2.0.3   digest_0.6.30     grid_4.2.2        lifecycle_1.0.3  
 [9] jsonlite_1.8.4    magrittr_2.0.3    evaluate_0.18     rlang_1.0.6      
[13] stringi_1.7.8     cli_3.4.1         rstudioapi_0.14   Matrix_1.5-1     
[17] reticulate_1.26   vctrs_0.5.1       rmarkdown_2.18    tools_4.2.2      
[21] stringr_1.5.0     glue_1.6.2        htmlwidgets_1.6.0 xfun_0.35        
[25] yaml_2.3.6        fastmap_1.1.0     compiler_4.2.2    htmltools_0.5.4  
[29] knitr_1.41       
using InteractiveUtils

versioninfo()

1 Linear regression

  • This lecture serves as a review of your Econ 430+442A material.

  • Linear regression is a simple approach to supervised learning. It assumes that the dependence of \(Y\) on \(X_1, X_2, \ldots, X_p\) is linear.

  • True regression functions are never linear!

  • Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

Essentially, all models are wrong, but some are useful.

George Box

2 Advertising data

  • Response variable: sales (in thousands of units).

  • Predictor variables:

    • TV: budget for TV advertising (in thousands of dollars).
    • radio: budget for radio advertising (in thousands of dollars).
    • newspaper: budget for newspaper advertising (in thousands of dollars).
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 2)
# Display all columns
pd.set_option('display.max_columns', None)

# Import Advertising data
Advertising = pd.read_csv("../data/Advertising.csv", index_col = 0)
Advertising
        TV  radio  newspaper  sales
1    230.1   37.8       69.2   22.1
2     44.5   39.3       45.1   10.4
3     17.2   45.9       69.3    9.3
4    151.5   41.3       58.5   18.5
5    180.8   10.8       58.4   12.9
..     ...    ...        ...    ...
196   38.2    3.7       13.8    7.6
197   94.2    4.9        8.1    9.7
198  177.0    9.3        6.4   12.8
199  283.6   42.0       66.2   25.5
200  232.1    8.6        8.7   13.4

[200 rows x 4 columns]
# Visualization by pair plot
sns.pairplot(data = Advertising)

library(GGally) # ggpairs function
library(tidyverse)

# Advertising
Advertising <- read_csv("../data/Advertising.csv",  col_select = TV:sales) %>% 
  print(width = Inf)
# A tibble: 200 × 4
      TV radio newspaper sales
   <dbl> <dbl>     <dbl> <dbl>
 1 230.   37.8      69.2  22.1
 2  44.5  39.3      45.1  10.4
 3  17.2  45.9      69.3   9.3
 4 152.   41.3      58.5  18.5
 5 181.   10.8      58.4  12.9
 6   8.7  48.9      75     7.2
 7  57.5  32.8      23.5  11.8
 8 120.   19.6      11.6  13.2
 9   8.6   2.1       1     4.8
10 200.    2.6      21.2  10.6
# … with 190 more rows
# Visualization by pair plot
ggpairs(
  data = Advertising, 
  mapping = aes(alpha = 0.25), 
  lower = list(continuous = "smooth")
  ) + 
  labs(title = "Advertising Data")

  • A few questions we might ask

    • Is there a relationship between advertising budget and sales?
    • How strong is the relationship between advertising budget and sales?
    • Which media contribute to sales?
    • How accurately can we predict future sales?
    • Is the relationship linear?
    • Is there synergy among the advertising media?

3 Simple linear regression vs multiple linear regression

  • In simple linear regression, we assume a model \[ Y = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} X + \epsilon. \] For example, \[ \text{sales} \approx \beta_0 + \beta_1 \times \text{TV}. \] The lower-left panel of the pair plot (R) displays the fitted line \[ \hat y = \hat \beta_0 + \hat \beta_1 \times \text{TV}. \] A quick fit of this simple regression is sketched after this list.

  • In multiple linear regression, we assume the model \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon. \] For example, \[ \text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper}. \] We review multiple linear regression below.
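A minimal statsmodels sketch of the simple regression of sales on TV (assuming the Advertising data frame loaded above); the estimated intercept and slope define the fitted line in the pair plot.

# Simple linear regression of sales on TV (assumes Advertising is loaded above)
import statsmodels.formula.api as smf

slr = smf.ols(formula = 'sales ~ TV', data = Advertising).fit()
# Estimated intercept (beta_0 hat) and slope (beta_1 hat)
slr.params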

4 Estimation in multiple linear regression

  • We estimate regression coefficients \(\beta_0, \beta_1, \ldots, \beta_p\) by the method of least squares, which minimizes the sum of squared residuals \[ \text{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2. \]
  • The minimizer \(\hat \beta_0, \hat \beta_1, \ldots, \hat \beta_p\) admits the analytical solution \[ \hat \beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \] assuming \(\mathbf{X}^T \mathbf{X}\) is invertible, and we can make predictions using the formula \[ \hat y = \hat \beta_0 + \hat \beta_1 x_1 + \cdots + \hat \beta_p x_p. \]
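Before turning to the library fits, here is a minimal numpy sketch of the closed-form solution (assuming the Advertising data frame loaded above); in practice a factorization-based routine such as np.linalg.lstsq is numerically preferable to working with the normal equations directly.

import numpy as np

# Design matrix with an intercept column, and the response vector
X_mat = np.column_stack([
  np.ones(len(Advertising)),
  Advertising[['TV', 'radio', 'newspaper']].to_numpy()
  ])
y_vec = Advertising['sales'].to_numpy()

# Closed-form least squares solution beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y_vec)
beta_hat

# Numerically stabler alternative based on an orthogonal factorization
np.linalg.lstsq(X_mat, y_vec, rcond = None)[0]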
# sklearn does not offer much besides coefficient estimate
from sklearn.linear_model import LinearRegression

X = Advertising[['TV', 'radio', 'newspaper']]
y = Advertising['sales']
lmod = LinearRegression().fit(X, y)
lmod.intercept_
2.938889369459412
lmod.coef_
array([ 0.04576465,  0.18853002, -0.00103749])
# Score is the R^2
lmod.score(X, y)
0.8972106381789522
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit linear regression
lmod = smf.ols(formula = 'sales ~ TV + radio + newspaper', data = Advertising).fit()
lmod.summary()
OLS Regression Results
Dep. Variable: sales R-squared: 0.897
Model: OLS Adj. R-squared: 0.896
Method: Least Squares F-statistic: 570.3
Date: Tue, 17 Jan 2023 Prob (F-statistic): 1.58e-96
Time: 08:51:15 Log-Likelihood: -386.18
No. Observations: 200 AIC: 780.4
Df Residuals: 196 BIC: 793.6
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.9389 0.312 9.422 0.000 2.324 3.554
TV 0.0458 0.001 32.809 0.000 0.043 0.049
radio 0.1885 0.009 21.893 0.000 0.172 0.206
newspaper -0.0010 0.006 -0.177 0.860 -0.013 0.011
Omnibus: 60.414 Durbin-Watson: 2.084
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.241
Skew: -1.327 Prob(JB): 1.44e-33
Kurtosis: 6.332 Cond. No. 454.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
library(gtsummary)

# Linear regression
lmod <- lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(lmod)

Call:
lm(formula = sales ~ TV + radio + newspaper, data = Advertising)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8277 -0.8908  0.2418  1.1893  2.8292 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,    Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
# Better summary table
lmod %>%
  tbl_regression() %>%
  bold_labels() %>%
  bold_p(t = 0.05)
Characteristic Beta 95% CI1 p-value
TV 0.05 0.04, 0.05 <0.001
radio 0.19 0.17, 0.21 <0.001
newspaper 0.00 -0.01, 0.01 0.9
1 CI = Confidence Interval

5 Interpreting regression coefficients

  • We interpret \(\beta_j\) as the average effect on \(Y\) of a one-unit increase in \(X_j\), holding all other predictors fixed.

a regression coefficient \(\beta_j\) estimates the expected change in \(Y\) per unit change in \(X_j\), with all other predictors held fixed. But predictors usually change together!

“Data Analysis and Regression”, Mosteller and Tukey 1977

  • Only when the predictors are uncorrelated (a balanced or orthogonal design) can each coefficient be estimated, interpreted, and tested separately.

    In other words, only in a balanced design are the regression coefficients estimated from multiple linear regression the same as those from separate simple linear regressions. The Advertising data illustrate the difference; see the sketch after this list.

  • Claims of causality should be avoided for observational data.
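A minimal sketch illustrating this point on the Advertising data (assuming the data frame and statsmodels import above): the newspaper coefficient is sizable in a simple regression but essentially vanishes once TV and radio are in the model, because newspaper spending is correlated with radio spending rather than driving sales itself.

import statsmodels.formula.api as smf

# newspaper coefficient from a simple regression
smf.ols(formula = 'sales ~ newspaper', data = Advertising).fit().params['newspaper']
# newspaper coefficient after adjusting for TV and radio
smf.ols(formula = 'sales ~ TV + radio + newspaper', data = Advertising).fit().params['newspaper']
# Correlation between the radio and newspaper budgets
Advertising[['radio', 'newspaper']].corr()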

6 Some important questions

6.1 Goodness of fit

Question: Is at least one of the predictors \(X_1, X_2, \ldots, X_p\) useful in predicting the response?

  • We can use the F-statistic to assess the fit of the overall model \[ F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n-p-1)} \sim F_{p, n-p-1}, \] where \[ \text{TSS} = \sum_{i=1}^n (y_i - \bar y)^2. \] Formally we are testing the null hypothesis \[ H_0: \beta_1 = \cdots = \beta_p = 0 \] versus the alternative \[ H_a: \text{at least one } \beta_j \text{ is non-zero}. \]

  • In general, the F-statistic for testing \[ H_0: \text{ a restricted model } \omega \] versus \[ H_a: \text{a more general model } \Omega \] takes the form \[ F = \frac{(\text{RSS}_\omega - \text{RSS}_\Omega) / (\text{df}(\Omega) - \text{df}(\omega))}{\text{RSS}_\Omega / (n - \text{df}(\Omega))}, \] where \(\text{df}\) denotes the degrees of freedom (roughly speaking, the number of free parameters) of a model.
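A minimal sketch of this general formula, computed by hand from the residual sums of squares of two nested statsmodels fits (assuming the Advertising data frame above); it should agree with the lmod.f_test and anova() calls below.

import scipy.stats as stats
import statsmodels.formula.api as smf

# Restricted model (intercept only) and full model
fit_null = smf.ols(formula = 'sales ~ 1', data = Advertising).fit()
fit_full = smf.ols(formula = 'sales ~ TV + radio + newspaper', data = Advertising).fit()

# F-statistic assembled from the RSS and degrees of freedom of the nested fits
df_num = fit_null.df_resid - fit_full.df_resid
f_stat = ((fit_null.ssr - fit_full.ssr) / df_num) / (fit_full.ssr / fit_full.df_resid)
p_value = stats.f.sf(f_stat, df_num, fit_full.df_resid)
(f_stat, p_value)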

# F test for R \beta = 0
R = [[0, 1, 0 , 0], [0, 0, 1, 0], [0, 0, 0, 1]]
lmod.f_test(R)
<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=570.2707036590946, p=1.5752272560923742e-96, df_denom=196, df_num=3>
lmod_null <- lm(sales ~ 1, data = Advertising)
anova(lmod_null, lmod)
Analysis of Variance Table

Model 1: sales ~ 1
Model 2: sales ~ TV + radio + newspaper
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    199 5417.1                                  
2    196  556.8  3    4860.3 570.27 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.2 Model selection

Question: Do all the predictors help to explain \(Y\) , or is only a subset of the predictors useful?

  • For a naive implementation of best subset, forward selection, and backward selection regressions in Python, interested readers can consult this blog.

6.2.1 Best subset regression

  • The most direct approach is called all subsets or best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.

  • We try the highly-optimized abess package here.

from abess.linear import LinearRegression
from patsy import dmatrices

# Create design matrix and response vector using R-like formula
ym, Xm = dmatrices(
  'sales ~ TV + radio + newspaper', 
  data = Advertising, 
  return_type = 'matrix'
  )
lmod_abess = LinearRegression(support_size = range(3))
lmod_abess.fit(Xm, ym)
LinearRegression(support_size=range(0, 3))
lmod_abess.coef_
array([nan,  0.,  0.,  0.])
library(leaps)

bs_mod <- regsubsets(
  sales ~ ., 
  data = Advertising, 
  method = "exhaustive"
  ) %>% summary()
bs_mod
Subset selection object
Call: regsubsets.formula(sales ~ ., data = Advertising, method = "exhaustive")
3 Variables  (and intercept)
          Forced in Forced out
TV            FALSE      FALSE
radio         FALSE      FALSE
newspaper     FALSE      FALSE
1 subsets of each size up to 3
Selection Algorithm: exhaustive
         TV  radio newspaper
1  ( 1 ) "*" " "   " "      
2  ( 1 ) "*" "*"   " "      
3  ( 1 ) "*" "*"   "*"      
tibble(
  K = 1:3, 
  R2 = bs_mod$rsq,
  adjR2 = bs_mod$adjr2,
  BIC = bs_mod$bic,
  CP = bs_mod$cp
) %>% print(width = Inf)
# A tibble: 3 × 5
      K    R2 adjR2   BIC     CP
  <int> <dbl> <dbl> <dbl>  <dbl>
1     1 0.612 0.610 -179. 544.  
2     2 0.897 0.896 -439.   2.03
3     3 0.897 0.896 -434.   4   
  • However, we often can't examine all possible models, since there are \(2^p\) of them. For example, when \(p = 30\), there are \(2^{30} = 1,073,741,824\) models!

  • Recent advances make it possible to solve best subset regression with \(p\) in the range \(10^3\) to \(10^5\) with a high-probability guarantee of optimality. See this article.

  • For really big \(p\), we need heuristic approaches that search through a subset of the models. We discuss three commonly used approaches next.

6.2.2 Forward selection

  • Begin with the null model, a model that contains an intercept but no predictors.

  • Fit \(p\) simple linear regressions and add to the null model the variable that results in the lowest RSS (equivalently the lowest AIC).

  • Add to that model the variable that results in the lowest RSS among all two-variable models.

  • Continue until some stopping rule is satisfied, for example when the AIC does not decrease anymore.
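Roughly, the same greedy search can be sketched in Python as below (assuming the Advertising data frame and statsmodels import above); the R step() call that follows does the same thing with its own AIC bookkeeping.

import statsmodels.formula.api as smf

def forward_selection_aic(data, response, candidates):
    """Greedy forward selection: repeatedly add the predictor that lowers AIC most."""
    selected = []
    current_aic = smf.ols(f'{response} ~ 1', data = data).fit().aic
    while candidates:
        # AIC of every one-variable extension of the current model
        trials = [
            (smf.ols(f'{response} ~ ' + ' + '.join(selected + [var]), data = data).fit().aic, var)
            for var in candidates
        ]
        best_aic, best_var = min(trials)
        if best_aic >= current_aic:
            break  # no addition improves AIC; stop
        selected.append(best_var)
        candidates = [var for var in candidates if var != best_var]
        current_aic = best_aic
    return selected, current_aic

forward_selection_aic(Advertising, 'sales', ['TV', 'radio', 'newspaper'])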

step(lmod_null,
     scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper),
     trace = TRUE, 
     direction = "forward"
     )
Start:  AIC=661.8
sales ~ 1

            Df Sum of Sq    RSS    AIC
+ TV         1    3314.6 2102.5 474.52
+ radio      1    1798.7 3618.5 583.10
+ newspaper  1     282.3 5134.8 653.10
<none>                   5417.1 661.80

Step:  AIC=474.52
sales ~ TV

            Df Sum of Sq     RSS    AIC
+ radio      1   1545.62  556.91 210.82
+ newspaper  1    183.97 1918.56 458.20
<none>                   2102.53 474.52

Step:  AIC=210.82
sales ~ TV + radio

            Df Sum of Sq    RSS    AIC
<none>                   556.91 210.82
+ newspaper  1  0.088717 556.83 212.79

Call:
lm(formula = sales ~ TV + radio, data = Advertising)

Coefficients:
(Intercept)           TV        radio  
    2.92110      0.04575      0.18799  

6.2.3 Backward selection

  • Start with all variables in the model.

  • Remove the variable with the largest \(p\)-value, that is, the variable that is least statistically significant.

  • The new \((p-1)\)-variable model is fit, and the variable with the largest p-value is removed.

  • Continue until a stopping rule is reached. For instance, we may stop when the AIC does not decrease anymore, or when all remaining variables have a \(p\)-value below some significance threshold.

  • Backward selection cannot be used when \(p > n\), since the full model cannot be fit.
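A minimal sketch of this p-value-based backward elimination in Python (assuming the Advertising data frame and statsmodels import above); note that the R step() call below instead drops terms by AIC.

import statsmodels.formula.api as smf

def backward_selection_pvalue(data, response, predictors, alpha = 0.05):
    """Drop the least significant predictor until all p-values fall below alpha."""
    predictors = list(predictors)
    while predictors:
        fit = smf.ols(f'{response} ~ ' + ' + '.join(predictors), data = data).fit()
        pvalues = fit.pvalues.drop('Intercept')  # ignore the intercept
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:
            break  # every remaining predictor is significant
        predictors.remove(worst)
    return predictors

backward_selection_pvalue(Advertising, 'sales', ['TV', 'radio', 'newspaper'])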

step(lmod,
     scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper),
     trace = TRUE, 
     direction = "backward"
     )
Start:  AIC=212.79
sales ~ TV + radio + newspaper

            Df Sum of Sq    RSS    AIC
- newspaper  1      0.09  556.9 210.82
<none>                    556.8 212.79
- radio      1   1361.74 1918.6 458.20
- TV         1   3058.01 3614.8 584.90

Step:  AIC=210.82
sales ~ TV + radio

        Df Sum of Sq    RSS    AIC
<none>                556.9 210.82
- radio  1    1545.6 2102.5 474.52
- TV     1    3061.6 3618.5 583.10

Call:
lm(formula = sales ~ TV + radio, data = Advertising)

Coefficients:
(Intercept)           TV        radio  
    2.92110      0.04575      0.18799  

6.2.4 Mixed selection

  • We alternately perform forward and backward steps until all variables in the model have a sufficiently low \(p\)-value, and all variables outside the model would have a large \(p\)-value if added to the model.
step(lmod_null,
     scope = list(lower = sales ~ 1, upper = sales ~ TV + radio + newspaper),
     trace = TRUE, 
     direction = "both"
     )
Start:  AIC=661.8
sales ~ 1

            Df Sum of Sq    RSS    AIC
+ TV         1    3314.6 2102.5 474.52
+ radio      1    1798.7 3618.5 583.10
+ newspaper  1     282.3 5134.8 653.10
<none>                   5417.1 661.80

Step:  AIC=474.52
sales ~ TV

            Df Sum of Sq    RSS    AIC
+ radio      1    1545.6  556.9 210.82
+ newspaper  1     184.0 1918.6 458.20
<none>                   2102.5 474.52
- TV         1    3314.6 5417.1 661.80

Step:  AIC=210.82
sales ~ TV + radio

            Df Sum of Sq    RSS    AIC
<none>                    556.9 210.82
+ newspaper  1      0.09  556.8 212.79
- radio      1   1545.62 2102.5 474.52
- TV         1   3061.57 3618.5 583.10

Call:
lm(formula = sales ~ TV + radio, data = Advertising)

Coefficients:
(Intercept)           TV        radio  
    2.92110      0.04575      0.18799  

6.2.5 Other model selection criteria

Commonly used model selection criteria include Mallows's \(C_p\), the Akaike information criterion (AIC), the Bayesian information criterion (BIC), adjusted \(R^2\), and cross-validation (CV).
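For instance, the AIC, BIC, and adjusted \(R^2\) of the full model and the TV + radio model can be read directly off their statsmodels fits; a minimal sketch assuming the Advertising data frame and imports above.

import pandas as pd
import statsmodels.formula.api as smf

fits = {
  'TV + radio + newspaper': smf.ols('sales ~ TV + radio + newspaper', data = Advertising).fit(),
  'TV + radio': smf.ols('sales ~ TV + radio', data = Advertising).fit()
  }
# The smaller TV + radio model is favored by AIC and BIC here
pd.DataFrame({
  name: {'AIC': fit.aic, 'BIC': fit.bic, 'Adjusted R2': fit.rsquared_adj}
  for name, fit in fits.items()
  })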

6.3 Model fit

Question: How well does the selected model fit the data?

  • \(R^2\): the fraction of variance explained \[ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}. \] It turns out (HW1 bonus question) that \[ R^2 = \operatorname{Cor}(Y, \hat Y)^2. \] Adding more predictors to a model always increases \(R^2\).

  • From the computer output, we see that the fitted linear model using TV, radio, and newspaper has \(R^2 = 0.8972\).

# R^2
lmod.rsquared
0.8972106381789522
# Cor(Y, \hat Y)^2
np.corrcoef(Advertising['sales'], lmod.predict())
array([[1.        , 0.94721203],
       [0.94721203, 1.        ]])
np.corrcoef(Advertising['sales'], lmod.predict())[0, 1]**2
0.8972106381789521
# R^2
summary(lmod)$r.squared
[1] 0.8972106
# Cor(Y, \hat Y)^2
cor(predict(lmod), Advertising$sales)^2
[1] 0.8972106
  • The adjusted \(R_a^2\) adjusts for the fact that a larger model also uses more parameters, \[ R_a^2 = 1 - \frac{\text{RSS}/(n-p-1)}{\text{TSS}/(n-1)}. \] Adding a predictor will only increase \(R_a^2\) if it has some predictive value.
# Adjusted R^2
lmod.rsquared_adj
0.8956373316204668
# Adjusted R^2
summary(lmod)$adj.r.squared
[1] 0.8956373
  • The residual standard error (RSE) \[ \text{RSE} = \sqrt{\frac{\text{RSS}}{n - p - 1}} \] is an estimator of the noise standard deviation \(\sqrt{\operatorname{Var}(\epsilon)}\). The smaller the RSE, the better the model fits the data.
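A quick check of the RSE from the statsmodels fit lmod above; it should match the residual standard error of 1.686 reported by R's summary(lmod).

import numpy as np

# RSE = sqrt(RSS / (n - p - 1)) from the full-model fit above
np.sqrt(lmod.ssr / lmod.df_resid)
# Equivalent shortcut: lmod.scale equals RSS / (n - p - 1) for OLS
np.sqrt(lmod.scale)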

6.4 Prediction

Question: Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

  • Given a new set of predictors \(x\), we predict \(y\) by \[ \hat y = \hat \beta_0 + \hat \beta_1 x_1 + \cdots + \hat \beta_p x_p. \]

  • Under the selected model \[ \text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio}, \] how many units of sales do we expect with a $100k budget for TV and a $20k budget for radio?

# Create design matrix and response vector using R-like formula
y_small, X_small = dmatrices(
  'sales ~ TV + radio', 
  data = Advertising, 
  return_type = 'dataframe'
  )
# Fit linear regression
lmod_small = sm.OLS(y_small, X_small).fit()
lmod_small_pred = lmod_small.get_prediction(pd.DataFrame({'Intercept': [1], 'TV': [100], 'radio': [20]})).summary_frame()
lmod_small_pred
        mean   mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  \
0  11.256466  0.137526      10.985254      11.527677      7.929616   

   obs_ci_upper  
0     14.583316  
x_new <- list(TV = 100, radio = 20)
lmod_small <- lm(sales ~ TV + radio, data = Advertising)
predict(lmod_small, x_new, interval = "confidence")
       fit      lwr      upr
1 11.25647 10.98525 11.52768
predict(lmod_small, x_new, interval = "prediction")
       fit      lwr      upr
1 11.25647 7.929616 14.58332
  • 95% Confidence interval: the probability of covering the true \(f(X)\) is 0.95.

  • 95% Prediction interval: the probability of covering the true \(Y\) is 0.95.

    The prediction interval is always wider than the confidence interval because it accounts for the randomness in both \(\hat \beta\) and \(\epsilon\).

7 Qualitative (or categorical) predictors

  • In the Advertising data, all predictors are quantitative (or continuous).

  • The Credit data has a mix of quantitative (Balance, Age, Cards, Education, Income, Limit, Rating) and qualitative predictors (Own, Student, Married, Region).

# Import Credit data
Credit = pd.read_csv("../data/Credit.csv")
Credit
      Income  Limit  Rating  Cards  Age  Education  Own Student Married  \
0     14.891   3606     283      2   34         11   No      No     Yes   
1    106.025   6645     483      3   82         15  Yes     Yes     Yes   
2    104.593   7075     514      4   71         11   No      No      No   
3    148.924   9504     681      3   36         11  Yes      No      No   
4     55.882   4897     357      2   68         16   No      No     Yes   
..       ...    ...     ...    ...  ...        ...  ...     ...     ...   
395   12.096   4100     307      3   32         13   No      No     Yes   
396   13.364   3838     296      5   65         17   No      No      No   
397   57.872   4171     321      5   67         12  Yes      No     Yes   
398   37.728   2525     192      1   44         13   No      No     Yes   
399   18.701   5524     415      5   64          7  Yes      No      No   

    Region  Balance  
0    South      333  
1     West      903  
2     West      580  
3     West      964  
4    South      331  
..     ...      ...  
395  South      560  
396   East      480  
397  South      138  
398  South        0  
399   West      966  

[400 rows x 11 columns]
# Visualization by pair plot
sns.pairplot(data = Credit)

library(ISLR2)

# Credit
Credit <- as_tibble(Credit) %>% print(width = Inf)
# A tibble: 400 × 11
   Income Limit Rating Cards   Age Education Own   Student Married Region
    <dbl> <dbl>  <dbl> <dbl> <dbl>     <dbl> <fct> <fct>   <fct>   <fct> 
 1   14.9  3606    283     2    34        11 No    No      Yes     South 
 2  106.   6645    483     3    82        15 Yes   Yes     Yes     West  
 3  105.   7075    514     4    71        11 No    No      No      West  
 4  149.   9504    681     3    36        11 Yes   No      No      West  
 5   55.9  4897    357     2    68        16 No    No      Yes     South 
 6   80.2  8047    569     4    77        10 No    No      No      South 
 7   21.0  3388    259     2    37        12 Yes   No      No      East  
 8   71.4  7114    512     2    87         9 No    No      No      West  
 9   15.1  3300    266     5    66        13 Yes   No      No      South 
10   71.1  6819    491     3    41        19 Yes   Yes     Yes     East  
   Balance
     <dbl>
 1     333
 2     903
 3     580
 4     964
 5     331
 6    1151
 7     203
 8     872
 9     279
10    1350
# … with 390 more rows
# Pair plot of continuous predictors
ggpairs(
  data = Credit[, c(1:6, 11)],
  mapping = aes(alpha = 0.25)
  ) + 
  labs(title = "Credit Data")

# Pair plot of categorical predictors
ggpairs(
  data = Credit[, c(7:10, 11)],
  mapping = aes(alpha = 0.25)
  ) + 
  labs(title = "Credit Data")

  • By default, most software creates dummy variables for categorical variables (one-hot coding with the baseline level dropped). Both Python (patsy/statsmodels) and R use the first level (in alphabetical order) as the baseline.
# Hack to get all predictors
my_formula = "Balance ~ " + "+".join(Credit.columns.difference(["Balance"]))
# Create design matrix and response vector using R-like formula
y, X = dmatrices(
  my_formula,
  data = Credit, 
  return_type = 'dataframe'
  )
# One-hot coding for categorical variables  
X
     Intercept  Married[T.Yes]  Own[T.Yes]  Region[T.South]  Region[T.West]  \
0          1.0             1.0         0.0              1.0             0.0   
1          1.0             1.0         1.0              0.0             1.0   
2          1.0             0.0         0.0              0.0             1.0   
3          1.0             0.0         1.0              0.0             1.0   
4          1.0             1.0         0.0              1.0             0.0   
..         ...             ...         ...              ...             ...   
395        1.0             1.0         0.0              1.0             0.0   
396        1.0             0.0         0.0              0.0             0.0   
397        1.0             1.0         1.0              1.0             0.0   
398        1.0             1.0         0.0              1.0             0.0   
399        1.0             0.0         1.0              0.0             1.0   

     Student[T.Yes]   Age  Cards  Education   Income   Limit  Rating  
0               0.0  34.0    2.0       11.0   14.891  3606.0   283.0  
1               1.0  82.0    3.0       15.0  106.025  6645.0   483.0  
2               0.0  71.0    4.0       11.0  104.593  7075.0   514.0  
3               0.0  36.0    3.0       11.0  148.924  9504.0   681.0  
4               0.0  68.0    2.0       16.0   55.882  4897.0   357.0  
..              ...   ...    ...        ...      ...     ...     ...  
395             0.0  32.0    3.0       13.0   12.096  4100.0   307.0  
396             0.0  65.0    5.0       17.0   13.364  3838.0   296.0  
397             0.0  67.0    5.0       12.0   57.872  4171.0   321.0  
398             0.0  44.0    1.0       13.0   37.728  2525.0   192.0  
399             0.0  64.0    5.0        7.0   18.701  5524.0   415.0  

[400 rows x 12 columns]
# Fit linear regression
sm.OLS(y, X).fit().summary()
OLS Regression Results
Dep. Variable: Balance R-squared: 0.955
Model: OLS Adj. R-squared: 0.954
Method: Least Squares F-statistic: 750.3
Date: Tue, 17 Jan 2023 Prob (F-statistic): 1.11e-253
Time: 08:51:57 Log-Likelihood: -2398.7
No. Observations: 400 AIC: 4821.
Df Residuals: 388 BIC: 4869.
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -479.2079 35.774 -13.395 0.000 -549.543 -408.873
Married[T.Yes] -8.5339 10.363 -0.824 0.411 -28.908 11.841
Own[T.Yes] -10.6532 9.914 -1.075 0.283 -30.145 8.839
Region[T.South] 10.1070 12.210 0.828 0.408 -13.899 34.113
Region[T.West] 16.8042 14.119 1.190 0.235 -10.955 44.564
Student[T.Yes] 425.7474 16.723 25.459 0.000 392.869 458.626
Age -0.6139 0.294 -2.088 0.037 -1.192 -0.036
Cards 17.7245 4.341 4.083 0.000 9.190 26.259
Education -1.0989 1.598 -0.688 0.492 -4.241 2.043
Income -7.8031 0.234 -33.314 0.000 -8.264 -7.343
Limit 0.1909 0.033 5.824 0.000 0.126 0.255
Rating 1.1365 0.491 2.315 0.021 0.171 2.102
Omnibus: 34.899 Durbin-Watson: 1.968
Prob(Omnibus): 0.000 Jarque-Bera (JB): 41.766
Skew: 0.782 Prob(JB): 8.52e-10
Kurtosis: 3.241 Cond. No. 3.87e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.87e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
lm(Balance ~ ., data = Credit) %>%
  tbl_regression() %>%
  bold_labels() %>%
  bold_p(t = 0.05)
Characteristic Beta 95% CI1 p-value
Income -7.8 -8.3, -7.3 <0.001
Limit 0.19 0.13, 0.26 <0.001
Rating 1.1 0.17, 2.1 0.021
Cards 18 9.2, 26 <0.001
Age -0.61 -1.2, -0.04 0.037
Education -1.1 -4.2, 2.0 0.5
Own
    No
    Yes -11 -30, 8.8 0.3
Student
    No
    Yes 426 393, 459 <0.001
Married
    No
    Yes -8.5 -29, 12 0.4
Region
    East
    South 10 -14, 34 0.4
    West 17 -11, 45 0.2
1 CI = Confidence Interval
  • There are many different ways of coding qualitative variables, measuring particular contrasts.

8 Interactions

  • In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media \[ \text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio}. \]

  • But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.

  • In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.

  • In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect.

  • Assume the interaction model: \[ \text{sales} \approx \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times (\text{TV} \times \text{radio}). \]

# Design matrix with interaction
y, X_int = dmatrices(
  'sales ~ 1 + TV * radio',
  data = Advertising, 
  return_type = 'dataframe'
  )

# Fit linear regression with interaction
sm.OLS(y, X_int).fit().summary()
OLS Regression Results
Dep. Variable: sales R-squared: 0.968
Model: OLS Adj. R-squared: 0.967
Method: Least Squares F-statistic: 1963.
Date: Tue, 17 Jan 2023 Prob (F-statistic): 6.68e-146
Time: 08:52:03 Log-Likelihood: -270.14
No. Observations: 200 AIC: 548.3
Df Residuals: 196 BIC: 561.5
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 6.7502 0.248 27.233 0.000 6.261 7.239
TV 0.0191 0.002 12.699 0.000 0.016 0.022
radio 0.0289 0.009 3.241 0.001 0.011 0.046
TV:radio 0.0011 5.24e-05 20.727 0.000 0.001 0.001
Omnibus: 128.132 Durbin-Watson: 2.224
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1183.719
Skew: -2.323 Prob(JB): 9.09e-258
Kurtosis: 13.975 Cond. No. 1.80e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
lmod_int <- lm(sales ~ 1 + TV * radio, data = Advertising) 
summary(lmod_int)

Call:
lm(formula = sales ~ 1 + TV * radio, data = Advertising)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3366 -0.4028  0.1831  0.5948  1.5246 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
TV:radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9435 on 196 degrees of freedom
Multiple R-squared:  0.9678,    Adjusted R-squared:  0.9673 
F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16
lmod_int %>%
  tbl_regression(estimate_fun = function(x) style_sigfig(x, digits = 4)) %>%
  bold_labels() %>%
  bold_p(t = 0.05)
Characteristic Beta 95% CI1 p-value
TV 0.0191 0.0161, 0.0221 <0.001
radio 0.0289 0.0113, 0.0464 0.001
TV * radio 0.0011 0.0010, 0.0012 <0.001
1 CI = Confidence Interval
  • The results in this table suggest that the TV \(\times\) radio interaction is important.

  • The \(R^2\) for the interaction model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term.

  • This means that (96.8 - 89.7)/(100 - 89.7) = 69% of the variability in sales that remains after fitting the additive model has been explained by the interaction term.

  • The coefficient estimates in the table suggest that an increase in TV advertising of $1,000 is associated with increased sales of \[ (\hat \beta_1 + \hat \beta_3 \times \text{radio}) \times 1000 = 19 + 1.1 \times \text{radio units}. \]

  • An increase in radio advertising of $1,000 will be associated with an increase in sales of \[ (\hat \beta_2 + \hat \beta_3 \times \text{TV}) \times 1000 = 29 + 1.1 \times \text{TV units}. \]

  • Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.

    The hierarchy principle states:

If we include an interaction in a model, we should also include the main effects, even if the \(p\)-values associated with their coefficients are not significant.

  • The rationale for this principle is that interactions are hard to interpret in a model without main effects: their meaning changes, because without main-effect terms the interaction terms end up absorbing the main effects.

9 Non-linear effects of predictors

  • Auto data.

  • Linear fit vs polynomial fit.

# Import Auto data (missing data represented by '?')
Auto = pd.read_csv("../data/Auto.csv", na_values = '?').dropna()
Auto
      mpg  cylinders  displacement  horsepower  weight  acceleration  year  \
0    18.0          8         307.0       130.0    3504          12.0    70   
1    15.0          8         350.0       165.0    3693          11.5    70   
2    18.0          8         318.0       150.0    3436          11.0    70   
3    16.0          8         304.0       150.0    3433          12.0    70   
4    17.0          8         302.0       140.0    3449          10.5    70   
..    ...        ...           ...         ...     ...           ...   ...   
392  27.0          4         140.0        86.0    2790          15.6    82   
393  44.0          4          97.0        52.0    2130          24.6    82   
394  32.0          4         135.0        84.0    2295          11.6    82   
395  28.0          4         120.0        79.0    2625          18.6    82   
396  31.0          4         119.0        82.0    2720          19.4    82   

     origin                       name  
0         1  chevrolet chevelle malibu  
1         1          buick skylark 320  
2         1         plymouth satellite  
3         1              amc rebel sst  
4         1                ford torino  
..      ...                        ...  
392       1            ford mustang gl  
393       2                  vw pickup  
394       1              dodge rampage  
395       1                ford ranger  
396       1                 chevy s-10  

[392 rows x 9 columns]
plt.clf()
# Linear fit
ax = sns.regplot(
  data = Auto,
  x = 'horsepower',
  y = 'mpg', 
  order = 1
  )
# Quadratic fit  
sns.regplot(
  data = Auto,
  x = 'horsepower',
  y = 'mpg', 
  order = 2,
  ax = ax
  )
# 5-degree fit  
sns.regplot(
  data = Auto,
  x = 'horsepower',
  y = 'mpg', 
  order = 5,
  ax = ax
  )
ax.set(
  xlabel = 'Horsepower',
  ylabel = 'Miles per gallon'
)
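The seaborn calls above only visualize the polynomial fits. Here is a minimal sketch of actually fitting the quadratic model in Python with statsmodels (assuming the Auto data frame loaded above); the R code below reports the analogous poly() fits.

import statsmodels.formula.api as smf

# Quadratic fit of mpg on horsepower; I() protects the power inside the formula
quad_fit = smf.ols(formula = 'mpg ~ horsepower + I(horsepower**2)', data = Auto).fit()
quad_fit.params
# R^2 of the quadratic fit, to compare with the linear fit
quad_fit.rsquared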

Auto <- as_tibble(Auto) %>% print(width = Inf)
# A tibble: 392 × 9
     mpg cylinders displacement horsepower weight acceleration  year origin
   <dbl>     <int>        <dbl>      <int>  <int>        <dbl> <int>  <int>
 1    18         8          307        130   3504         12      70      1
 2    15         8          350        165   3693         11.5    70      1
 3    18         8          318        150   3436         11      70      1
 4    16         8          304        150   3433         12      70      1
 5    17         8          302        140   3449         10.5    70      1
 6    15         8          429        198   4341         10      70      1
 7    14         8          454        220   4354          9      70      1
 8    14         8          440        215   4312          8.5    70      1
 9    14         8          455        225   4425         10      70      1
10    15         8          390        190   3850          8.5    70      1
   name                     
   <fct>                    
 1 chevrolet chevelle malibu
 2 buick skylark 320        
 3 plymouth satellite       
 4 amc rebel sst            
 5 ford torino              
 6 ford galaxie 500         
 7 chevrolet impala         
 8 plymouth fury iii        
 9 pontiac catalina         
10 amc ambassador dpl       
# … with 382 more rows
# Linear fit
lm(mpg ~ horsepower, data = Auto) %>% summary()

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
# Degree 2
lm(mpg ~ poly(horsepower, 2), data = Auto) %>% summary()

Call:
lm(formula = mpg ~ poly(horsepower, 2), data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.7135  -2.5943  -0.0859   2.2868  15.8961 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            23.4459     0.2209  106.13   <2e-16 ***
poly(horsepower, 2)1 -120.1377     4.3739  -27.47   <2e-16 ***
poly(horsepower, 2)2   44.0895     4.3739   10.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared:  0.6876,    Adjusted R-squared:  0.686 
F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16
# Degree 5
lm(mpg ~ poly(horsepower, 5), data = Auto) %>% summary()

Call:
lm(formula = mpg ~ poly(horsepower, 5), data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.4326  -2.5285  -0.2925   2.1750  15.9730 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            23.4459     0.2185 107.308  < 2e-16 ***
poly(horsepower, 5)1 -120.1377     4.3259 -27.772  < 2e-16 ***
poly(horsepower, 5)2   44.0895     4.3259  10.192  < 2e-16 ***
poly(horsepower, 5)3   -3.9488     4.3259  -0.913  0.36190    
poly(horsepower, 5)4   -5.1878     4.3259  -1.199  0.23117    
poly(horsepower, 5)5   13.2722     4.3259   3.068  0.00231 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.326 on 386 degrees of freedom
Multiple R-squared:  0.6967,    Adjusted R-squared:  0.6928 
F-statistic: 177.4 on 5 and 386 DF,  p-value: < 2.2e-16
ggplot(data = Auto, mapping = aes(x = horsepower, y = mpg)) +
  geom_point() + 
  geom_smooth(method = lm, color = 1) + 
  geom_smooth(method = lm, formula = y ~ poly(x, 2), color = 2) + 
  geom_smooth(method = lm, formula = y ~ poly(x, 5), color = 3) +
  labs(x = "Horsepower", y = "Miles per gallon")

10 What we did not cover

  • Outliers.

  • Non-constant variance of error terms.

  • High leverage points.

  • Collinearity.

See ISL Section 3.3.3.

11 Generalizations of linear models

In much of the rest of this course, we discuss methods that expand the scope of linear models and how they are fit.

  • Classification problems: logistic regression, discriminant analysis, support vector machines.

  • Non-linearity: kernel smoothing, splines and generalized additive models, nearest neighbor methods.

  • Interactions: tree-based methods, bagging, random forests, boosting, neural networks.

  • Regularized fitting: ridge regression and lasso.