Putting It All Together
A Summary of lecture "Time Series Analysis in Python", via datacamp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
Cointegration Models
- What is Cointegration?
- Two series, $P_t$ and $Q_t$ can be random walks
- But the linear combination $P_t - c Q_t$ may not be a random walk!
- If that's true
- $P_t - c Q_t$ is forecastable
- $P_t$ and $Q_t$ are said to be cointegrated
- Analogy: Dog on a Leash
- $P_t =$ Owner
- $Q_t =$ Dog
- Both series look like a random walk
- Difference of distance between them, looks mean reverting
- If dog falls too far behind, it gets pulled forward
- If dog gets too far ahead, it gets pulled back
- What types of series are cointegrated?
- Economic substitutes
- Heating Oil and Natural Gas
- Platinum and Palladium
- Corn and Wheat
- Corn and Sugar
- $\dots$
- Bitcoin and Ethereum
- Economic substitutes
- Two steps to test for Cointegration
- Regress $P_t$ on$Q_t$ and get slope $c$
- Run Augmented Dickey-Fuller test on $P_t - c Q_t$ to test for random walk
- Alternatively, can use
coint
function in statsmodels that combines both steps
A Dog on a Leash? (Part 1)
The Heating Oil and Natural Gas prices are pre-loaded in DataFrames HO and NG. First, plot both price series, which look like random walks. Then plot the difference between the two series, which should look more like a mean reverting series (to put the two series in the same units, we multiply the heating oil prices, in $/gallon, by 7.25, which converts it to $/millionBTU, which is the same units as Natural Gas).
The data for continuous futures (each contract has to be spliced together in a continuous series as contracts expire) was obtained from Quandl.
- Preprocess
HO = pd.read_csv('./dataset/CME_HO1.csv', index_col=0)
NG = pd.read_csv('./dataset/CME_NG1.csv', index_col=0)
HO.index = pd.to_datetime(HO.index, format='%m/%d/%Y')
NG.index = pd.to_datetime(NG.index, format='%m/%d/%Y')
HO = HO.sort_index()
NG = NG.sort_index()
HO.head()
NG.head()
plt.subplot(2, 1, 1)
plt.plot(7.25 * HO, label='Heating Oil');
plt.plot(NG, label='Natural Gas');
plt.legend(loc='best', fontsize='small');
# Plot the spread
plt.subplot(2, 1, 2)
plt.plot(7.25 * HO - NG, label='Spread');
plt.legend(loc='best', fontsize='small');
plt.axhline(y=0, linestyle='--', color='k');
from statsmodels.tsa.stattools import adfuller
# Compute the ADF for HO and NG
result_HO = adfuller(HO['Close'])
print("The p-value for the ADF test on HO is ", result_HO[1])
result_NG = adfuller(NG['Close'])
print("The p-value for the ADF test on NG is ", result_NG[1])
# Compute the ADF of the spread
result_spread = adfuller(7.25 * HO['Close'] - NG['Close'])
print("The p-value for the ADF test on the spread is ", result_spread[1])
Are Bitcoin and Ethereum Cointegrated?
Cointegration involves two steps: regressing one time series on the other to get the cointegration vector, and then perform an ADF test on the residuals of the regression. In the last example, there was no need to perform the first step since we implicitly assumed the cointegration vector was (1,−1). In other words, we took the difference between the two series (after doing a units conversion). Here, you will do both steps.
You will regress the value of one cryptocurrency, bitcoin (BTC), on another cryptocurrency, ethereum (ETH). If we call the regression coefficient b, then the cointegration vector is simply (1,−b). Then perform the ADF test on BTC −b ETH.
- Preprocess
BTC = pd.read_csv('./dataset/BTC.csv', index_col=0)
ETH = pd.read_csv('./dataset/ETH.csv', index_col=0)
BTC.index = pd.to_datetime(BTC.index, format='%Y-%m-%d')
ETH.index = pd.to_datetime(ETH.index, format='%Y-%m-%d')
import statsmodels.api as sm
# Regress BTC on ETH
ETH = sm.add_constant(ETH)
result = sm.OLS(BTC, ETH).fit()
# Compare ADF
b = result.params[1]
adf_stats = adfuller(BTC['Price'] - b * ETH['Price'])
print("The p-value for the ADF test is ", adf_stats[1])
Is Temperature a Random Walk (with Drift)?
An ARMA model is a simplistic approach to forecasting climate changes, but it illustrates many of the topics covered in this class.
The DataFrame temp_NY
contains the average annual temperature in Central Park, NY from 1870-2016 (the data was downloaded from the NOAA here). Plot the data and test whether it follows a random walk (with drift).
- Preprocess
temp_NY = pd.read_csv('./dataset/NOAA_TAVG.csv', index_col=0)
temp_NY.index = pd.to_datetime(temp_NY.index, format='%Y')
temp_NY.head()
temp_NY.plot();
# Compute and print ADF p-value
result = adfuller(temp_NY['TAVG'])
print("The p-value for the ADF test is ", result[1])
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Take first difference of the temperature Series
chg_temp = temp_NY.diff()
chg_temp = chg_temp.dropna()
# Plot the ACF and PACF on the same page
fig, axes = plt.subplots(2, 1)
# Plot the ACF
plot_acf(chg_temp, lags=20, ax=axes[0]);
# Plot the PACF
plot_pacf(chg_temp, lags=20, ax=axes[1]);
plt.tight_layout()
Which ARMA Model is Best?
Recall from Chapter 3 that the Akaike Information Criterion (AIC) can be used to compare models with different numbers of parameters. It measures goodness-of-fit, but places a penalty on models with more parameters to discourage overfitting. Lower AIC scores are better.
Fit the temperature data to an AR(1), AR(2), and ARMA(1,1) and see which model is the best fit, using the AIC criterion. The AR(2) and ARMA(1,1) models have one more parameter than the AR(1) has.
from statsmodels.tsa.arima_model import ARMA
# Fit the data to an AR(1) model and print AIC:
mod_ar1 = ARMA(chg_temp, order=(1, 0))
res_ar1 = mod_ar1.fit()
print("The AIC for an AR(1) is: ", res_ar1.aic)
# Fit the data to an AR(2) model and print AIC
mod_ar2 = ARMA(chg_temp, order=(2, 0))
res_ar2 = mod_ar2.fit()
print("The AIC for an AR(2) is: ", res_ar2.aic)
# fit the data to an ARMA(1, 1) model and print AIC
mod_arma11 = ARMA(chg_temp, order=(1, 1))
res_arma11 = mod_arma11.fit()
print("The AIC for an ARMA(1,1) is: ", res_arma11.aic)
Don't Throw Out That Winter Coat Yet
Finally, you will forecast the temperature over the next 30 years using an ARMA(1,1) model, including confidence bands around that estimate. Keep in mind that the estimate of the drift will have a much bigger impact on long range forecasts than the ARMA parameters.
Earlier, you determined that the temperature data follows a random walk and you looked at first differencing the data. In this exercise, you will use the ARIMA module on the temperature data (before differencing), which is identical to using the ARMA module on changes in temperature, followed by taking cumulative sums of these changes to get the temperature forecast.
from statsmodels.tsa.arima_model import ARIMA
# Forecast temperatures using an ARIMA(1,1,1) model
mod = ARIMA(temp_NY, order=(1,1,1))
res = mod.fit()
# Plot the original series and the forecasted series
res.plot_predict(start='1872-01-01', end='2046-01-01');
plt.savefig('../images/tavg_predict.png')