Machine Learning Workflow: Adding Nonlinearities to Predictive Modeling

Econ 425T

Author

Dr. Hua Zhou @ UCLA

Published

January 31, 2023

Display system information for reproducibility.

import IPython
print(IPython.sys_info())
{'commit_hash': 'add5877a4',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/IPython',
 'ipython_version': '8.8.0',
 'os_name': 'posix',
 'platform': 'macOS-10.16-x86_64-i386-64bit',
 'sys_executable': '/Library/Frameworks/Python.framework/Versions/3.10/bin/python3',
 'sys_platform': 'darwin',
 'sys_version': '3.10.9 (v3.10.9:1dd9be6584, Dec  6 2022, 14:37:36) [Clang '
                '13.0.0 (clang-1300.0.29.30)]'}
sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9        here_1.0.1        lattice_0.20-45   png_0.1-8        
 [5] rprojroot_2.0.3   digest_0.6.30     grid_4.2.2        lifecycle_1.0.3  
 [9] jsonlite_1.8.4    magrittr_2.0.3    evaluate_0.18     rlang_1.0.6      
[13] stringi_1.7.8     cli_3.4.1         rstudioapi_0.14   Matrix_1.5-1     
[17] reticulate_1.26   vctrs_0.5.1       rmarkdown_2.18    tools_4.2.2      
[21] stringr_1.5.0     glue_1.6.2        htmlwidgets_1.6.0 xfun_0.35        
[25] yaml_2.3.6        fastmap_1.1.0     compiler_4.2.2    htmltools_0.5.4  
[29] knitr_1.41       

1 Overview

We illustrate the typical machine learning workflow for regression problems using the Wage data set from the R package ISLR2. The steps are:

  1. Initial splitting into test and non-test sets.

  2. Pre-processing of data: one-hot encoding of categorical variables, adding nonlinear (spline) features for some continuous predictors.

  3. Choose a learner/method: elastic net in this example.

  4. Tune the hyper-parameter(s) (alpha and l1_ratio in elastic net) using \(K\)-fold cross-validation (CV) on the non-test data.

  5. Choose the best model by CV and refit it on the whole non-test data.

  6. Final prediction on the test data.

These steps complete the process of training and evaluating one machine learning method. We repeat the same process for other learners, e.g., random forest or neural network, using the same test/non-test and CV splits. The final report compares the learners based on their CV and test errors; one way to share the CV folds across learners is sketched below.
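
To keep that comparison fair, the CV fold assignment should be fixed once and reused for every learner. A minimal sketch in scikit-learn (the enet_pipe, rf_pipe, and grid names are hypothetical placeholders; the actual elastic net tuning is carried out in the sections below):

from sklearn.model_selection import KFold, GridSearchCV

# One fixed fold assignment, shared by every learner under comparison
shared_cv = KFold(n_splits = 10, shuffle = True, random_state = 425)

# Tune each learner on the same folds, e.g.
# enet_search = GridSearchCV(enet_pipe, enet_grid, cv = shared_cv, scoring = "neg_root_mean_squared_error")
# rf_search   = GridSearchCV(rf_pipe, rf_grid, cv = shared_cv, scoring = "neg_root_mean_squared_error")
# then compare -enet_search.best_score_ and -rf_search.best_score_ (CV RMSE), plus test RMSE.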

2 Wage data

Documentation of the Wage data is available in the ISLR2 package. The goal is to predict wage.

# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font sizes in plots
sns.set(font_scale = 1.2)
# Display all columns
pd.set_option('display.max_columns', None)

Wage = pd.read_csv("../data/Wage.csv").drop(['logwage', 'region'], axis = 1)
Wage
      year  age            maritl      race        education        jobclass  \
0     2006   18  1. Never Married  1. White     1. < HS Grad   1. Industrial   
1     2004   24  1. Never Married  1. White  4. College Grad  2. Information   
2     2003   45        2. Married  1. White  3. Some College   1. Industrial   
3     2003   43        2. Married  3. Asian  4. College Grad  2. Information   
4     2005   50       4. Divorced  1. White       2. HS Grad  2. Information   
...    ...  ...               ...       ...              ...             ...   
2995  2008   44        2. Married  1. White  3. Some College   1. Industrial   
2996  2007   30        2. Married  1. White       2. HS Grad   1. Industrial   
2997  2005   27        2. Married  2. Black     1. < HS Grad   1. Industrial   
2998  2005   27  1. Never Married  1. White  3. Some College   1. Industrial   
2999  2009   55      5. Separated  1. White       2. HS Grad   1. Industrial   

              health health_ins        wage  
0          1. <=Good      2. No   75.043154  
1     2. >=Very Good      2. No   70.476020  
2          1. <=Good     1. Yes  130.982177  
3     2. >=Very Good     1. Yes  154.685293  
4          1. <=Good     1. Yes   75.043154  
...              ...        ...         ...  
2995  2. >=Very Good     1. Yes  154.685293  
2996  2. >=Very Good      2. No   99.689464  
2997       1. <=Good      2. No   66.229408  
2998  2. >=Very Good     1. Yes   87.981033  
2999       1. <=Good     1. Yes   90.481913  

[3000 rows x 9 columns]
# Numerical summaries
Wage.describe()
              year          age         wage
count  3000.000000  3000.000000  3000.000000
mean   2005.791000    42.414667   111.703608
std       2.026167    11.542406    41.728595
min    2003.000000    18.000000    20.085537
25%    2004.000000    33.750000    85.383940
50%    2006.000000    42.000000   104.921507
75%    2008.000000    51.000000   128.680488
max    2009.000000    80.000000   318.342430

The graphical summary takes longer to run, so it is suppressed here.

# Graphical summaries
plt.figure()
sns.pairplot(data = Wage);
plt.show()

library(GGally)
library(ISLR2)
library(tidymodels)
library(tidyverse)

Wage <- as_tibble(Wage) %>%
  select(-region) %>%
  print(width = Inf)
# A tibble: 3,000 × 10
    year   age maritl           race     education       jobclass      
   <int> <int> <fct>            <fct>    <fct>           <fct>         
 1  2006    18 1. Never Married 1. White 1. < HS Grad    1. Industrial 
 2  2004    24 1. Never Married 1. White 4. College Grad 2. Information
 3  2003    45 2. Married       1. White 3. Some College 1. Industrial 
 4  2003    43 2. Married       3. Asian 4. College Grad 2. Information
 5  2005    50 4. Divorced      1. White 2. HS Grad      2. Information
 6  2008    54 2. Married       1. White 4. College Grad 2. Information
 7  2009    44 2. Married       4. Other 3. Some College 1. Industrial 
 8  2008    30 1. Never Married 3. Asian 3. Some College 2. Information
 9  2006    41 1. Never Married 2. Black 3. Some College 2. Information
10  2004    52 2. Married       1. White 2. HS Grad      2. Information
   health         health_ins logwage  wage
   <fct>          <fct>        <dbl> <dbl>
 1 1. <=Good      2. No         4.32  75.0
 2 2. >=Very Good 2. No         4.26  70.5
 3 1. <=Good      1. Yes        4.88 131. 
 4 2. >=Very Good 1. Yes        5.04 155. 
 5 1. <=Good      1. Yes        4.32  75.0
 6 2. >=Very Good 1. Yes        4.85 127. 
 7 2. >=Very Good 1. Yes        5.13 170. 
 8 1. <=Good      1. Yes        4.72 112. 
 9 2. >=Very Good 1. Yes        4.78 119. 
10 2. >=Very Good 1. Yes        4.86 129. 
# … with 2,990 more rows
# Numerical summaries
summary(Wage)
      year           age                     maritl           race     
 Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
 1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
 Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
 Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
 3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
 Max.   :2009   Max.   :80.00                                          
              education             jobclass               health    
 1. < HS Grad      :268   1. Industrial :1544   1. <=Good     : 858  
 2. HS Grad        :971   2. Information:1456   2. >=Very Good:2142  
 3. Some College   :650                                              
 4. College Grad   :685                                              
 5. Advanced Degree:426                                              
                                                                     
  health_ins      logwage           wage       
 1. Yes:2083   Min.   :3.000   Min.   : 20.09  
 2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
               Median :4.653   Median :104.92  
               Mean   :4.654   Mean   :111.70  
               3rd Qu.:4.857   3rd Qu.:128.68  
               Max.   :5.763   Max.   :318.34  

The graphical summary takes longer to run, so it is suppressed here.

# Graphical summaries
ggpairs(
  data = Wage, 
  mapping = aes(alpha = 0.25), 
  lower = list(continuous = "smooth")
  ) + 
  labs(title = "Wage Data")

3 Initial split into test and non-test sets

from sklearn.model_selection import train_test_split

Wage_other, Wage_test = train_test_split(
  Wage, 
  train_size = 0.75,
  random_state = 425, # seed
  )
Wage_test.shape
(750, 9)
Wage_other.shape
(2250, 9)

Separate \(X\) and \(y\).

# Non-test X and y
X_other = Wage_other.drop(['wage'], axis = 1)
y_other = Wage_other.wage
# Test X and y
X_test = Wage_test.drop(['wage'], axis = 1)
y_test = Wage_test.wage
# For reproducibility
set.seed(425)
data_split <- initial_split(
  Wage, 
  # # stratify by percentiles
  # strata = "wage", 
  prop = 0.75
  )

Wage_other <- training(data_split)
dim(Wage_other)
[1] 2250   10
Wage_test <- testing(data_split)
dim(Wage_test)
[1] 750  10

4 Preprocessing (Python) or recipe (R)

For regularization methods such as ridge and lasso, it is essential to center and scale predictors.
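
As a quick standalone illustration (synthetic data, not part of the Wage workflow): the lasso penalty \(\|\beta\|_1\) acts on coefficients in whatever units the predictors happen to be measured, so rescaling a column changes how strongly it is penalized.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size = (200, 2))
X[:, 1] *= 1000.0                      # same signal, much larger units
y = X[:, 0] + X[:, 1] / 1000.0 + rng.normal(scale = 0.1, size = 200)

# The fitted coefficients (and which ones get shrunk) differ before and after scaling
print(Lasso(alpha = 0.1).fit(X, y).coef_)
print(Lasso(alpha = 0.1).fit(StandardScaler().fit_transform(X), y).coef_)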

The pre-processor one-hot encodes the categorical variables, adds spline features for age and year, and then standardizes all predictors.

from sklearn.preprocessing import OneHotEncoder, StandardScaler, SplineTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline


col_tf = make_column_transformer(
  # OHE transformer for categorical variables
  (OneHotEncoder(drop = 'first'), ['maritl', 'race', 'education', 'jobclass', 'health', 'health_ins']),
  # Nonlinear features by splines of age and year
  (SplineTransformer(
    degree = 3,
    n_knots = 5,
    extrapolation = 'linear'
    ), ['age']),
  (SplineTransformer(
    degree = 3,
    n_knots = 4,
    extrapolation = 'linear'
    ), ['year']),
  remainder = 'passthrough'
)
# Standardization transformer
std_tf = StandardScaler()
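
Optionally, one can fit just the column transformer on the non-test predictors to check how many features it creates; per the scikit-learn documentation, each SplineTransformer contributes n_knots + degree - 1 basis columns per input variable.

# Sanity check (assumes X_other from Section 3 is in scope)
col_tf.fit_transform(X_other).shape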
norm_recipe <- 
  recipe(
    wage ~ ., 
    data = Wage_other
  ) %>%
  # create traditional dummy variables
  step_dummy(all_nominal()) %>%
  # zero-variance filter
  step_zv(all_predictors()) %>% 
  # B-splines of age
  step_bs(age, deg_free = 5) %>%
  # B-splines of year
  step_bs(year, deg_free = 4) %>%
  # center and scale numeric data
  step_normalize(all_predictors()) %>%
  # estimate the means and standard deviations
  prep(training = Wage_other, retain = TRUE)
norm_recipe
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          9

Training data contained 2250 data points and no missing data.

Operations:

Dummy variables from maritl, race, education, jobclass, health, health_ins [trained]
Zero variance filter removed <none> [trained]
B-splines on age [trained]
B-splines on year [trained]
Centering and scaling for logwage, maritl_X2..Married, maritl_X3..Widowed... [trained]

5 Model

from sklearn.linear_model import ElasticNet

enet_mod = ElasticNet()
enet_mod
ElasticNet()
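
Per the scikit-learn documentation, ElasticNet minimizes

\[
\frac{1}{2n} \|y - \beta_0 - X\beta\|_2^2 + \alpha \cdot \text{l1\_ratio} \cdot \|\beta\|_1 + \frac{\alpha \, (1 - \text{l1\_ratio})}{2} \|\beta\|_2^2,
\]

so alpha sets the overall penalty strength and l1_ratio interpolates between ridge (l1_ratio = 0) and lasso (l1_ratio = 1). These play the same roles as penalty and mixture in the glmnet specification below.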
enet_mod <- 
  # mixture = 0 (ridge), mixture = 1 (lasso)
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")
enet_mod
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet 

6 Pipeline (Python) or workflow (R)

Here we bundle the preprocessing step (Python) or recipe (R) and model.

pipe = Pipeline(steps = [
  ("col_tf", col_tf),
  ("std_tf", std_tf),
  ("model", enet_mod)
  ])
pipe
Pipeline(steps=[('col_tf',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['maritl', 'race',
                                                   'education', 'jobclass',
                                                   'health', 'health_ins']),
                                                 ('splinetransformer-1',
                                                  SplineTransformer(extrapolation='linear'),
                                                  ['age']),
                                                 ('splinetransformer-2',
                                                  SplineTransformer(extrapolation='linear',
                                                                    n_knots=4),
                                                  ['year'])])),
                ('std_tf', StandardScaler()), ('model', ElasticNet())])
lr_wf <- 
  workflow() %>%
  add_model(enet_mod) %>%
  add_recipe(norm_recipe)
lr_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_dummy()
• step_zv()
• step_bs()
• step_bs()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet 

7 Tuning grid

Set up the 2D grid for tuning.

# Tune hyper-parameter(s)
alphas = np.logspace(start = -6, stop = -1, base = 10, num = 50)
l1_ratios = np.linspace(start = 0, stop = 1, num = 6)
# n_knots = [2, 3, 4, 5]
tuned_parameters = {
  "model__alpha": alphas,
  "model__l1_ratio": l1_ratios
  # "bs_tf__n_knots": n_knots
  }
tuned_parameters  
{'model__alpha': array([1.00000000e-06, 1.26485522e-06, 1.59985872e-06, 2.02358965e-06,
       2.55954792e-06, 3.23745754e-06, 4.09491506e-06, 5.17947468e-06,
       6.55128557e-06, 8.28642773e-06, 1.04811313e-05, 1.32571137e-05,
       1.67683294e-05, 2.12095089e-05, 2.68269580e-05, 3.39322177e-05,
       4.29193426e-05, 5.42867544e-05, 6.86648845e-05, 8.68511374e-05,
       1.09854114e-04, 1.38949549e-04, 1.75751062e-04, 2.22299648e-04,
       2.81176870e-04, 3.55648031e-04, 4.49843267e-04, 5.68986603e-04,
       7.19685673e-04, 9.10298178e-04, 1.15139540e-03, 1.45634848e-03,
       1.84206997e-03, 2.32995181e-03, 2.94705170e-03, 3.72759372e-03,
       4.71486636e-03, 5.96362332e-03, 7.54312006e-03, 9.54095476e-03,
       1.20679264e-02, 1.52641797e-02, 1.93069773e-02, 2.44205309e-02,
       3.08884360e-02, 3.90693994e-02, 4.94171336e-02, 6.25055193e-02,
       7.90604321e-02, 1.00000000e-01]), 'model__l1_ratio': array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])}
param_grid <- grid_regular(
  penalty(range = c(-5, 0), trans = log10_trans()), 
  mixture(range = c(0, 1)),
  levels = c(penalty = 50, mixture = 6)
  )
param_grid
# A tibble: 300 × 2
     penalty mixture
       <dbl>   <dbl>
 1 0.00001         0
 2 0.0000126       0
 3 0.0000160       0
 4 0.0000202       0
 5 0.0000256       0
 6 0.0000324       0
 7 0.0000409       0
 8 0.0000518       0
 9 0.0000655       0
10 0.0000829       0
# … with 290 more rows

8 Cross-validation (CV)

Set up CV partitions and CV criterion.

from sklearn.model_selection import GridSearchCV

# Set up CV
n_folds = 10
search = GridSearchCV(
  pipe,
  tuned_parameters,
  cv = n_folds, 
  scoring = "neg_root_mean_squared_error",
  # Refit the best model on the whole data set
  refit = True
  )

Fit CV. This is typically the most time-consuming step.

# Fit CV
search.fit(X_other, y_other)
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('col_tf',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('onehotencoder',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['maritl',
                                                                          'race',
                                                                          'education',
                                                                          'jobclass',
                                                                          'health',
                                                                          'health_ins']),
                                                                        ('splinetransformer-1',
                                                                         SplineTransformer(extrapolation='linear'),
                                                                         ['age']),
                                                                        ('splinetransformer-2',
                                                                         SplineTransformer(extrapolation='...
       1.84206997e-03, 2.32995181e-03, 2.94705170e-03, 3.72759372e-03,
       4.71486636e-03, 5.96362332e-03, 7.54312006e-03, 9.54095476e-03,
       1.20679264e-02, 1.52641797e-02, 1.93069773e-02, 2.44205309e-02,
       3.08884360e-02, 3.90693994e-02, 4.94171336e-02, 6.25055193e-02,
       7.90604321e-02, 1.00000000e-01]),
                         'model__l1_ratio': array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])},
             scoring='neg_root_mean_squared_error')
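With 50 × 6 = 300 grid points and 10 folds, this step fits 3,000 models plus the final refit. If it becomes slow, GridSearchCV can parallelize over folds and grid points with its n_jobs argument, e.g. (same call as above, with one extra argument):

search = GridSearchCV(
  pipe,
  tuned_parameters,
  cv = n_folds, 
  scoring = "neg_root_mean_squared_error",
  # Refit the best model on the whole data set
  refit = True,
  n_jobs = -1  # use all available cores
  )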

Visualize the CV results.

# CV RMSE for each (alpha, l1_ratio) grid point
cv_res = pd.DataFrame({
  "alpha": np.asarray(search.cv_results_["param_model__alpha"], dtype = float),
  "l1_ratio": np.asarray(search.cv_results_["param_model__l1_ratio"], dtype = float),
  "rmse": -search.cv_results_["mean_test_score"]
  })

plt.figure()
sns.relplot(
  data = cv_res,
  x = "alpha",
  y = "rmse",
  hue = "l1_ratio"
  ).set(
    xlabel = "alpha",
    ylabel = "CV RMSE",
    xscale = "log"
);
plt.show()

Best CV RMSE:

-search.best_score_
33.80941591090395
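
The hyper-parameter combination achieving this CV RMSE can be inspected directly (the values are whatever the search selected; none are hard-coded here):

# Selected values of alpha and l1_ratio
search.best_params_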

Set cross-validation partitions.

set.seed(250)
folds <- vfold_cv(Wage_other, v = 10)
folds
#  10-fold cross-validation 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [2025/225]> Fold01
 2 <split [2025/225]> Fold02
 3 <split [2025/225]> Fold03
 4 <split [2025/225]> Fold04
 5 <split [2025/225]> Fold05
 6 <split [2025/225]> Fold06
 7 <split [2025/225]> Fold07
 8 <split [2025/225]> Fold08
 9 <split [2025/225]> Fold09
10 <split [2025/225]> Fold10

Fit cross-validation.

enet_fit <- 
  lr_wf %>%
  tune_grid(
    resamples = folds,
    grid = param_grid,
    )
enet_fit
# Tuning results
# 10-fold cross-validation 
# A tibble: 10 × 4
   splits             id     .metrics           .notes          
   <list>             <chr>  <list>             <list>          
 1 <split [2025/225]> Fold01 <tibble [600 × 6]> <tibble [0 × 3]>
 2 <split [2025/225]> Fold02 <tibble [600 × 6]> <tibble [0 × 3]>
 3 <split [2025/225]> Fold03 <tibble [600 × 6]> <tibble [0 × 3]>
 4 <split [2025/225]> Fold04 <tibble [600 × 6]> <tibble [0 × 3]>
 5 <split [2025/225]> Fold05 <tibble [600 × 6]> <tibble [0 × 3]>
 6 <split [2025/225]> Fold06 <tibble [600 × 6]> <tibble [0 × 3]>
 7 <split [2025/225]> Fold07 <tibble [600 × 6]> <tibble [0 × 3]>
 8 <split [2025/225]> Fold08 <tibble [600 × 6]> <tibble [0 × 3]>
 9 <split [2025/225]> Fold09 <tibble [600 × 6]> <tibble [0 × 3]>
10 <split [2025/225]> Fold10 <tibble [600 × 6]> <tibble [0 × 3]>

Visualize CV criterion.

enet_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rmse") %>%
  ggplot(mapping = aes(x = penalty, y = mean)) + 
  geom_point() + 
  geom_line(aes(group = mixture)) + 
  labs(x = "Penalty", y = "CV RMSE") + 
  scale_x_log10(labels = scales::label_number())
# A tibble: 600 × 8
     penalty mixture .metric .estimator   mean     n std_err
       <dbl>   <dbl> <chr>   <chr>       <dbl> <int>   <dbl>
 1 0.00001         0 rmse    standard   13.4      10 0.362  
 2 0.00001         0 rsq     standard    0.905    10 0.00427
 3 0.0000126       0 rmse    standard   13.4      10 0.362  
 4 0.0000126       0 rsq     standard    0.905    10 0.00427
 5 0.0000160       0 rmse    standard   13.4      10 0.362  
 6 0.0000160       0 rsq     standard    0.905    10 0.00427
 7 0.0000202       0 rmse    standard   13.4      10 0.362  
 8 0.0000202       0 rsq     standard    0.905    10 0.00427
 9 0.0000256       0 rmse    standard   13.4      10 0.362  
10 0.0000256       0 rsq     standard    0.905    10 0.00427
   .config               
   <chr>                 
 1 Preprocessor1_Model001
 2 Preprocessor1_Model001
 3 Preprocessor1_Model002
 4 Preprocessor1_Model002
 5 Preprocessor1_Model003
 6 Preprocessor1_Model003
 7 Preprocessor1_Model004
 8 Preprocessor1_Model004
 9 Preprocessor1_Model005
10 Preprocessor1_Model005
# … with 590 more rows

Show the top 5 models (penalty \(\lambda\) and mixture combinations).

enet_fit %>%
  show_best("rmse")
# A tibble: 5 × 8
  penalty mixture .metric .estimator  mean     n std_err .config               
    <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
1  0.0471     1   rmse    standard    12.7    10   0.414 Preprocessor1_Model287
2  0.0373     1   rmse    standard    12.7    10   0.414 Preprocessor1_Model286
3  0.0596     0.8 rmse    standard    12.7    10   0.414 Preprocessor1_Model238
4  0.0471     0.8 rmse    standard    12.7    10   0.414 Preprocessor1_Model237
5  0.0596     1   rmse    standard    12.7    10   0.414 Preprocessor1_Model288

Let’s select the best model.

best_enet <- enet_fit %>%
  select_best("rmse")
best_enet
# A tibble: 1 × 3
  penalty mixture .config               
    <dbl>   <dbl> <chr>                 
1  0.0471       1 Preprocessor1_Model287

9 Finalize our model

Now we are done tuning. Finally, let’s fit this final model on the whole non-test data and use the test data to estimate the model performance we expect to see on new data.

Since we called GridSearchCV with refit = True, the best model fit on the whole non-test data is readily available.

search.best_estimator_
Pipeline(steps=[('col_tf',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['maritl', 'race',
                                                   'education', 'jobclass',
                                                   'health', 'health_ins']),
                                                 ('splinetransformer-1',
                                                  SplineTransformer(extrapolation='linear'),
                                                  ['age']),
                                                 ('splinetransformer-2',
                                                  SplineTransformer(extrapolation='linear',
                                                                    n_knots=4),
                                                  ['year'])])),
                ('std_tf', StandardScaler()),
                ('model', ElasticNet(alpha=0.1, l1_ratio=1.0))])

The final prediction RMSE on the test set is

from sklearn.metrics import mean_squared_error

mean_squared_error(y_test, search.best_estimator_.predict(X_test), squared = False)
33.77484584348061
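
To see which transformed features the selected (lasso-leaning) fit retains, the coefficients can be pulled out of the fitted pipeline; a sketch, assuming a scikit-learn version recent enough to support get_feature_names_out on these transformers:

best_pipe = search.best_estimator_
feat_names = best_pipe.named_steps['col_tf'].get_feature_names_out()
coef = pd.Series(best_pipe.named_steps['model'].coef_, index = feat_names)
# Nonzero coefficients survive the l1 shrinkage
coef[coef != 0].sort_values()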
# Final workflow
final_wf <- lr_wf %>%
  finalize_workflow(best_enet)
final_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_dummy()
• step_zv()
• step_bs()
• step_bs()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.0471486636345739
  mixture = 1

Computational engine: glmnet 
# Fit the whole training set, then predict the test cases
final_fit <- 
  final_wf %>%
  last_fit(data_split)
final_fit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits             id               .metrics .notes   .predictions .workflow 
  <list>             <chr>            <list>   <list>   <list>       <list>    
1 <split [2250/750]> train/test split <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_fit %>% collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard      12.0   Preprocessor1_Model1
2 rsq     standard       0.915 Preprocessor1_Model1