# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt
# For reading a file from a URL
import io
import requests
# Set font sizes in plots
sns.set(font_scale = 1.)
# Display all columns
pd.set_option('display.max_columns', None)
1 New York Stock Exchange (NYSE) data (1962-1986) (10 pts)
The NYSE.csv file contains three daily time series from the New York Stock Exchange (NYSE) for the period Dec 3, 1962-Dec 31, 1986 (6,051 trading days).
Log trading volume (\(v_t\)): This is the fraction of all outstanding shares that are traded on that day, relative to a 100-day moving average of past turnover, on the log scale.
Dow Jones return (\(r_t\)): This is the difference between the log of the Dow Jones Industrial Index on consecutive trading days.
Log volatility (\(z_t\)): This is based on the absolute values of daily price movements.
# Read in NYSE data from url
url = "https://raw.githubusercontent.com/ucla-econ-425t/2023winter/master/slides/data/NYSE.csv"
s = requests.get(url).content.decode('utf-8')
NYSE = pd.read_csv(io.StringIO(s), index_col = 0)
NYSE
The autocorrelation at lag \(\ell\) is the correlation of all pairs \((v_t, v_{t-\ell})\) that are \(\ell\) trading days apart. These sizable correlations give us confidence that past values will be helpful in predicting the future.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plt.figure()
plot_acf(NYSE['log_volume'], lags = 20)
plt.show()
Do a similar plot for (1) the correlation between \(v_t\) and the lag-\(\ell\) Dow Jones return \(r_{t-\ell}\) and (2) the correlation between \(v_t\) and the lag-\(\ell\) log volatility \(z_{t-\ell}\).
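One way to compute these lagged cross-correlations is sketched below. The helper `lagged_corr` is hypothetical (not part of any library), and the series are synthetic stand-ins for the NYSE columns, so the numbers say nothing about the real data.

```python
import numpy as np

def lagged_corr(v, x, max_lag=20):
    """Correlation between v_t and x_{t-lag} for lag = 0, ..., max_lag."""
    v = np.asarray(v, dtype=float)
    x = np.asarray(x, dtype=float)
    out = []
    for lag in range(max_lag + 1):
        # Pair v_t with x_{t-lag}: drop the first `lag` values of v
        # and the last `lag` values of x so the two arrays line up.
        a = v[lag:]
        b = x[:len(x) - lag]
        out.append(np.corrcoef(a, b)[0, 1])
    return np.array(out)

# Synthetic stand-in (assumption): v depends on the previous value of x.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
v = np.empty_like(x)
v[0] = 0.0
v[1:] = 0.8 * x[:-1] + 0.2 * rng.normal(size=1999)

cc = lagged_corr(v, x, max_lag=5)
print(np.round(cc, 2))
```

With the real data, the same helper could be applied to `NYSE['log_volume']` and the return or volatility column (whose names are not shown in this document and would need to be checked), and the result drawn with `plt.stem`.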
2 Project goal
Our goal is to forecast daily Log trading volume, using various machine learning algorithms we learnt in this class.
The data set is already split into train (before Jan 1st, 1980, \(n_{\text{train}} = 4,281\)) and test (after Jan 1st, 1980, \(n_{\text{test}} = 1,770\)) sets.
In general, we would tune the lag \(L\) to achieve the best forecasting performance. In this project, we fix \(L=5\). That is, we always use the previous five trading days’ data to forecast today’s log trading volume.
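A minimal sketch of building the lagged feature matrix for \(L=5\) follows. The helper name `make_lagged` is hypothetical, and the toy series `0, 1, ..., 9` is used only so the alignment is easy to check by eye.

```python
import numpy as np

def make_lagged(X, L=5):
    """Return an array whose row i holds X[t-1], ..., X[t-L] for t = i + L.

    X has shape (T,) or (T, p); the result has shape (T - L, L * p).
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    T = X.shape[0]
    # The slice for lag k starts at L - k and ends at T - k, so every
    # slice has T - L rows and row i corresponds to time t = i + L.
    lags = [X[L - k : T - k] for k in range(1, L + 1)]
    return np.hstack(lags)

series = np.arange(10.0)     # toy series (assumption, not NYSE data)
M = make_lagged(series, L=5)
y = series[5:]               # responses aligned with the rows of M

print(M[0], y[0])            # lags of the first usable day and its target
```

The first \(L\) days have no complete history, so they are dropped from both the features and the response.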
Pay attention to the nuance of splitting time series data for cross validation. Study and use the TimeSeriesSplit in Scikit-Learn. Make sure to use the same splits when tuning different machine learning algorithms.
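The sketch below illustrates how `TimeSeriesSplit` keeps every validation fold strictly after its training fold. The sample size and number of splits are placeholder assumptions, not values prescribed by the assignment.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_train = 100                          # placeholder size (assumption)
X = np.arange(n_train).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
# Materialize the folds once so every model can be tuned on the same splits.
splits = list(tscv.split(X))

for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: train {train_idx[0]}-{train_idx[-1]}, "
          f"validate {val_idx[0]}-{val_idx[-1]}")
```

Because the splitter involves no randomness, passing either the same `TimeSeriesSplit` configuration or the stored `splits` list as the `cv` argument of a scikit-learn search object should give identical folds across algorithms.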
Use the \(R^2\) between forecast and actual values as the cross validation and test evaluation criterion.
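For reference, here is a small sketch of the criterion, assuming the usual coefficient of determination (the same definition used by `sklearn.metrics.r2_score`). The function name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(r_squared([1, 2, 3], [1, 2, 3]))   # perfect forecast
print(r_squared([1, 2, 3], [2, 2, 2]))   # forecasting the mean
print(r_squared([1, 2, 3], [3, 2, 1]))   # worse than the mean
```

Note that \(R^2\) can be negative on held-out data when a forecast does worse than predicting the mean of the actual values.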
3 Baseline method (20 pts)
We use the straw man (use yesterday’s value of log trading volume to predict that of today) as the baseline method. Evaluate the \(R^2\) of this method on the test data.
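The straw man can be sketched as follows. The series is a synthetic AR(1) process standing in for log trading volume, and the boolean `is_train` mask with its cutoff at 700 is a hypothetical substitute for the data set's own train/test split.

```python
import numpy as np
from sklearn.metrics import r2_score

# Synthetic AR(1) series (assumption): stands in for log trading volume.
rng = np.random.default_rng(1)
T = 1000
v = np.zeros(T)
for t in range(1, T):
    v[t] = 0.7 * v[t - 1] + rng.normal(scale=0.5)

is_train = np.arange(T) < 700     # hypothetical train/test indicator

# Straw man: the forecast for day t is simply the value on day t - 1.
pred = v[:-1]
actual = v[1:]
test_mask = ~is_train[1:]         # keep only targets that fall in the test set

r2_strawman = r2_score(actual[test_mask], pred[test_mask])
print(round(r2_strawman, 3))
```

The first test day is forecast from the last training day, which is legitimate because that value is already known when the forecast is made.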
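Fit an ordinary least squares (OLS) regression of \(y\) (the vector of responses \(v_t\)) on \(M\) (the design matrix whose columns are an intercept and the lagged values \(v_{t-1}, \ldots, v_{t-L}\)), giving \[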
\hat v_t = \hat \beta_0 + \hat \beta_1 v_{t-1} + \hat \beta_2 v_{t-2} + \cdots + \hat \beta_L v_{t-L},
\] known as an order-\(L\) autoregression model or AR(\(L\)).
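A minimal sketch of the AR(\(L\)) fit with plain NumPy least squares is given below. The series is synthetic (a stationary AR(2) process chosen for illustration), and the construction of `y` and `M` is an assumption based on the equation above rather than the assignment's own code.

```python
import numpy as np

# Synthetic AR(2) series (assumption): true coefficients 0.5 and 0.2.
rng = np.random.default_rng(2)
T, L = 3000, 5
v = np.zeros(T)
for t in range(2, T):
    v[t] = 0.5 * v[t - 1] + 0.2 * v[t - 2] + rng.normal(scale=0.3)

# Response: v_t for every t that has L past values available.
y = v[L:]
# Design matrix: a column of ones, then v_{t-1}, ..., v_{t-L}.
M = np.column_stack(
    [np.ones(T - L)] + [v[L - k : T - k] for k in range(1, L + 1)]
)

# OLS estimate of (beta_0, beta_1, ..., beta_L).
beta_hat, *_ = np.linalg.lstsq(M, y, rcond=None)
v_hat = M @ beta_hat

print(np.round(beta_hat, 2))
```

For the assignment itself, the coefficients would be estimated from the training rows only and then applied to the test rows.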
Tune AR(5) with elastic net (lasso + ridge) regularization using all 3 features on the training data, and evaluate the test performance.
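One possible shape for this tuning step is sketched below, under several assumptions: the three-column array is synthetic, the train/test cutoff and the hyperparameter grid are placeholders, and standardizing the features inside a pipeline is a design choice rather than something the assignment specifies.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): columns play the roles of
# return, log volume, and log volatility.
rng = np.random.default_rng(3)
T, L = 600, 5
data = rng.normal(size=(T, 3))
for t in range(1, T):
    data[t, 1] = 0.6 * data[t - 1, 1] + 0.3 * data[t - 1, 2] + 0.3 * rng.normal()

# Lagged features from all three series; the target is column 1.
X = np.hstack([data[L - k : T - k] for k in range(1, L + 1)])
y = data[L:, 1]

n_train = 400                      # placeholder chronological cutoff
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNet(max_iter=10000)),
])
grid = {
    "enet__alpha": np.logspace(-3, 0, 6),   # overall penalty strength
    "enet__l1_ratio": [0.1, 0.5, 0.9],      # mix between ridge and lasso
}
search = GridSearchCV(pipe, grid, cv=TimeSeriesSplit(n_splits=5), scoring="r2")
search.fit(X_train, y_train)

test_r2 = search.score(X_test, y_test)
print(search.best_params_, round(test_r2, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from a validation fold leaks into the penalized fit.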