Econ 425T Homework 6

Due Mar 24, 2023 @ 11:59PM

Author

YOUR NAME and UID

Published

March 13, 2023

Load Python libraries.

# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt
# For reading a file from a URL
import io
import requests

# Set font sizes in plots
sns.set(font_scale = 1.)
# Display all columns
pd.set_option('display.max_columns', None)

1 New York Stock Exchange (NYSE) data (1962-1986) (10 pts)

Figure 1: Historical trading statistics from the New York Stock Exchange. Daily values of the normalized log trading volume, DJIA return, and log volatility are shown for a 24-year period from 1962-1986. We wish to predict trading volume on any day, given the history on all earlier days. To the left of the red bar (January 2, 1980) is training data, and to the right test data.

The NYSE.csv file contains three daily time series from the New York Stock Exchange (NYSE) for the period Dec 3, 1962-Dec 31, 1986 (6,051 trading days).

  • Log trading volume (\(v_t\)): This is the fraction of all outstanding shares that are traded on that day, relative to a 100-day moving average of past turnover, on the log scale.

  • Dow Jones return (\(r_t\)): This is the difference between the log of the Dow Jones Industrial Index on consecutive trading days.

  • Log volatility (\(z_t\)): This is based on the absolute values of daily price movements.

# Read in NYSE data from url
url = "https://raw.githubusercontent.com/ucla-econ-425t/2023winter/master/slides/data/NYSE.csv"
s = requests.get(url).content.decode('utf-8')
NYSE = pd.read_csv(io.StringIO(s), index_col = 0)
NYSE
           day_of_week  DJ_return  log_volume  log_volatility  train
date                                                                
1962-12-03         mon  -0.004461    0.032573      -13.127403   True
1962-12-04        tues   0.007813    0.346202      -11.749305   True
1962-12-05         wed   0.003845    0.525306      -11.665609   True
1962-12-06        thur  -0.003462    0.210182      -11.626772   True
1962-12-07         fri   0.000568    0.044187      -11.728130   True
...                ...        ...         ...             ...    ...
1986-12-24         wed   0.006514   -0.236104       -9.807366  False
1986-12-26         fri   0.001825   -1.322425       -9.906025  False
1986-12-29         mon  -0.009515   -0.371237       -9.827660  False
1986-12-30        tues  -0.001837   -0.385638       -9.926091  False
1986-12-31         wed  -0.006655   -0.264986       -9.935527  False

[6051 rows x 5 columns]

The autocorrelation at lag \(\ell\) is the correlation of all pairs \((v_t, v_{t-\ell})\) that are \(\ell\) trading days apart. These sizable correlations give us confidence that past values will be helpful in predicting the future.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

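# plot_acf creates its own figure unless given an axis, so pass one explicitly
fig, ax = plt.subplots()
plot_acf(NYSE['log_volume'], lags = 20, ax = ax)
plt.show()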

Figure 2: The autocorrelation function for log volume. We see that nearby values are fairly strongly correlated, with correlations above 0.2 as far as 20 days apart.

Make a similar plot for (1) the correlation between \(v_t\) and the lag-\(\ell\) Dow Jones return \(r_{t-\ell}\) and (2) the correlation between \(v_t\) and the lag-\(\ell\) log volatility \(z_{t-\ell}\).
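The lagged correlation described above can be sketched as follows. This is a minimal illustration on synthetic data, not the NYSE series; the helper name `lagged_corr` and the simulated arrays `v` and `z` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def lagged_corr(target, feature, max_lag=20):
    """Correlation between target[t] and feature[t - lag] for lag = 1..max_lag."""
    target = pd.Series(target).reset_index(drop=True)
    feature = pd.Series(feature).reset_index(drop=True)
    # shift(lag) moves feature[t - lag] into position t; corr() drops the NaN pairs
    return pd.Series(
        {lag: target.corr(feature.shift(lag)) for lag in range(1, max_lag + 1)},
        name="correlation",
    )

# Synthetic stand-in data (hypothetical): v depends on the previous value of z
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
v = 0.8 * np.roll(z, 1) + 0.2 * rng.normal(size=2000)

cross_corr = lagged_corr(v, z, max_lag=5)
print(cross_corr.round(2))
```

With the real data, the same helper could presumably be called as `lagged_corr(NYSE['log_volume'], NYSE['DJ_return'])` and the result drawn with a bar or stem plot.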

2 Project goal

Our goal is to forecast daily log trading volume using the machine learning algorithms covered in this class.

The data set is already split into train (before Jan 1st, 1980, \(n_{\text{train}} = 4,281\)) and test (after Jan 1st, 1980, \(n_{\text{test}} = 1,770\)) sets.

In general, we would tune the lag \(L\) to achieve the best forecasting performance. In this project, we fix \(L=5\). That is, we always use the previous five trading days’ data to forecast today’s log trading volume.

Pay attention to the nuance of splitting time series data for cross validation. Study and use the TimeSeriesSplit in Scikit-Learn. Make sure to use the same splits when tuning different machine learning algorithms.
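As a small sketch of how `TimeSeriesSplit` behaves, the example below splits a placeholder array of 20 time-ordered rows into 4 folds. The array and the fold count are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder design matrix with 20 time-ordered rows (hypothetical data)
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X))  # materialize once so the folds can be reused

for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: train 0..{train_idx[-1]}, "
          f"validate {val_idx[0]}..{val_idx[-1]}")
```

Each validation block comes strictly after its training block, so no future observation leaks into training. One way to keep the folds identical across algorithms is to pass the same `splits` list (or the same configured splitter) as the `cv` argument of every search.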

Use the \(R^2\) between forecast and actual values as the cross validation and test evaluation criterion.

3 Baseline method (20 pts)

We use the straw man (use yesterday’s value of log trading volume to predict that of today) as the baseline method. Evaluate the \(R^2\) of this method on the test data.
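A minimal sketch of the straw-man forecast and its \(R^2\), run on a simulated series rather than the NYSE data. The series `v`, the mask `is_test`, and the split point are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# Simulated autocorrelated series standing in for log volume (hypothetical)
rng = np.random.default_rng(1)
n = 500
v = np.zeros(n)
for t in range(1, n):
    v[t] = 0.7 * v[t - 1] + rng.normal(scale=0.5)

is_test = np.arange(n) >= 350  # hypothetical train/test boundary

# Yesterday's value is today's forecast; the first day has no forecast
forecast = np.empty(n)
forecast[0] = np.nan
forecast[1:] = v[:-1]

baseline_r2 = r2_score(v[is_test], forecast[is_test])
print(round(baseline_r2, 3))
```

The forecast for the first test day uses the last training day's value, which is available at forecast time and therefore does not leak test information.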

4 Autoregression (AR) forecaster (30 pts)

  • Let \[ y = \begin{pmatrix} v_{L+1} \\ v_{L+2} \\ v_{L+3} \\ \vdots \\ v_T \end{pmatrix}, \quad M = \begin{pmatrix} 1 & v_L & v_{L-1} & \cdots & v_1 \\ 1 & v_{L+1} & v_{L} & \cdots & v_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & v_{T-1} & v_{T-2} & \cdots & v_{T-L} \end{pmatrix}. \]

  • Fit an ordinary least squares (OLS) regression of \(y\) on \(M\), giving \[ \hat v_t = \hat \beta_0 + \hat \beta_1 v_{t-1} + \hat \beta_2 v_{t-2} + \cdots + \hat \beta_L v_{t-L}, \] known as an order-\(L\) autoregression model or AR(\(L\)).

  • Tune AR(5) with elastic net (lasso + ridge) regularization on the training data, using the lags of all 3 features (log volume, Dow Jones return, and log volatility), and evaluate the test performance.

  • Hint: Workflow: Lasso is a good starting point.
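The construction of \(M\) and the two fits above could look roughly like the sketch below. It runs on simulated series; the helper `make_lagged`, the split point `n_train`, and the `l1_ratio` grid are assumptions for illustration, and the intercept column of \(M\) is supplied by the estimators' default `fit_intercept=True`.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LinearRegression
from sklearn.model_selection import TimeSeriesSplit

def make_lagged(features, target, L=5):
    """Return (X, y) where the row for day t holds features from days t-1, ..., t-L."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        features = features[:, None]
    T = len(target)
    # Block k holds the lag-k values for every target day t = L, ..., T-1
    X = np.hstack([features[L - k : T - k] for k in range(1, L + 1)])
    y = np.asarray(target, dtype=float)[L:]
    return X, y

# Simulated stand-ins for v, r, z (hypothetical data)
rng = np.random.default_rng(2)
T = 600
v = np.zeros(T)
for t in range(2, T):
    v[t] = 0.5 * v[t - 1] + 0.2 * v[t - 2] + rng.normal(scale=0.5)
r = rng.normal(size=T)
z = rng.normal(size=T)

n_train = 400  # hypothetical number of training rows

# Plain AR(5): lags of v only
X_ar, y = make_lagged(v, v, L=5)
ols = LinearRegression().fit(X_ar[:n_train], y[:n_train])
ols_test_r2 = ols.score(X_ar[n_train:], y[n_train:])

# Elastic net on the lags of all three series, tuned with time-ordered folds
X_all, _ = make_lagged(np.column_stack([v, r, z]), v, L=5)
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=TimeSeriesSplit(n_splits=5))
enet.fit(X_all[:n_train], y[:n_train])
enet_test_r2 = enet.score(X_all[n_train:], y[n_train:])

print(ols.coef_.round(2), round(ols_test_r2, 3), round(enet_test_r2, 3))
```

Because the penalty depends on feature scale, standardizing the columns (for example with `StandardScaler` in a `Pipeline`) is usually worth considering when the real series have very different ranges.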

5 Autoregression by MLP (30 pts)

6 LSTM forecaster (30 pts)

  • We extract many short mini-series of input sequences \(X=\{X_1,X_2,\ldots,X_L\}\) with a predefined lag \(L\): \[\begin{eqnarray*} X_1 = \begin{pmatrix} v_{t-L} \\ r_{t-L} \\ z_{t-L} \end{pmatrix}, X_2 = \begin{pmatrix} v_{t-L+1} \\ r_{t-L+1} \\ z_{t-L+1} \end{pmatrix}, \cdots, X_L = \begin{pmatrix} v_{t-1} \\ r_{t-1} \\ z_{t-1} \end{pmatrix}, \end{eqnarray*}\] and \[ Y = v_t. \]

  • Tune LSTM and evaluate the test performance.
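The mini-series above amount to a three-dimensional array of shape (samples, \(L\), 3). A minimal NumPy sketch of that reshaping is shown here; the helper `make_sequences` and the toy array `demo` are hypothetical, and no LSTM is fitted.

```python
import numpy as np

def make_sequences(features, target, L=5):
    """Return X of shape (T - L, L, n_features) and y of shape (T - L,)."""
    features = np.asarray(features, dtype=float)
    T = len(target)
    # X[i] covers days i, ..., i+L-1 in chronological order; its target is day i+L
    X = np.stack([features[i : i + L] for i in range(T - L)])
    y = np.asarray(target, dtype=float)[L:]
    return X, y

# Toy data: 8 days, 3 columns playing the roles of (v, r, z)
days = np.arange(8)
demo = np.column_stack([days, days * 10, days * 100])

X_seq, y_seq = make_sequences(demo, demo[:, 0], L=5)
print(X_seq.shape)  # (3, 5, 3)
```

This (samples, time steps, features) layout is the input convention of Keras LSTM layers and of PyTorch's `nn.LSTM` when `batch_first=True`.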

7 Random forest forecaster (30 pts)

  • Use the same features as in AR(\(L\)) for the random forest. Tune the random forest and evaluate the test performance.

  • Hint: Workflow: Random Forest for Prediction is a good starting point.
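One possible tuning setup is a grid search scored by \(R^2\) over time-ordered folds. The sketch below uses random placeholder features in place of the lagged NYSE matrix, and the grid values are illustrative rather than recommended.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Placeholder training data with 15 columns (5 lags x 3 series); hypothetical
rng = np.random.default_rng(3)
X_train = rng.normal(size=(300, 15))
y_train = 0.6 * X_train[:, 0] + rng.normal(scale=0.3, size=300)

# Illustrative grid; a real search would likely cover more values
param_grid = {"max_features": [0.3, 1.0], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    scoring="r2",
    cv=TimeSeriesSplit(n_splits=3),
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated \(R^2\) of the selected configuration, which is the quantity the summary table asks for in its CV column.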

8 Boosting forecaster (30 pts)

  • Use the same features as in AR(\(L\)) for the boosting. Tune the boosting algorithm and evaluate the test performance.

  • Adventurous students should try to learn and use XGBoost instead of Scikit-Learn.
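A sketch of the same tune-then-test pattern with Scikit-Learn's `GradientBoostingRegressor`, again on placeholder data. The grid values and the 300/100 split are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Placeholder data in time order; the last 100 rows act as the test set (hypothetical)
rng = np.random.default_rng(4)
X = rng.normal(size=(400, 15))
y = 0.6 * X[:, 0] + rng.normal(scale=0.3, size=400)
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

# Illustrative grid over learning rate and tree depth
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=100, random_state=0),
    param_grid,
    scoring="r2",
    cv=TimeSeriesSplit(n_splits=3),
)
search.fit(X_train, y_train)

cv_r2 = search.best_score_               # mean R^2 across the time-ordered folds
test_r2 = search.score(X_test, y_test)   # R^2 of the refit best model on held-out rows
print(search.best_params_, round(cv_r2, 3), round(test_r2, 3))
```

XGBoost's `XGBRegressor` follows the scikit-learn estimator interface, so it should be possible to swap it in for the estimator here, with the parameter grid adjusted to XGBoost's own parameter names.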

9 Summary (30 pts)

Your score for this question is largely determined by your final test performance.

Summarize the performance of different machine learning forecasters in the following format.

Method        | CV \(R^2\) | Test \(R^2\)
Baseline      |            |
AR(5)         |            |
AR(5) MLP     |            |
LSTM          |            |
Random Forest |            |
Boosting      |            |