Seasonal ARIMA Models
In this final chapter, you'll learn how to use seasonal ARIMA models to fit more complex data. You'll learn how to decompose this data into seasonal and non-seasonal parts and then you'll get the chance to utilize all your ARIMA tools on one last global forecast challenge. This is the Summary of lecture "ARIMA Models in Python", via datacamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')
Seasonal decompose
You can think of a time series as being composed of trend, seasonal and residual components. This can be a good way to think about the data when you go about modeling it. If you know the period of the time series you can decompose it into these components.
In this exercise you will decompose a time series showing the monthly milk production per cow in the USA. This will give you a clearer picture of the trend and the seasonal cycle. Since the data is monthly you will guess that the seasonality might be 12 time periods, however this won't always be the case.
milk_production = pd.read_csv('./dataset/milk_production.csv', index_col='date', parse_dates=True)
milk_production = milk_production.asfreq('MS')
milk_production.head()
from statsmodels.tsa.seasonal import seasonal_decompose
# Perform additive decomposition
decomp = seasonal_decompose(milk_production['pounds_per_cow'], period=12)
# Plot decomposition
decomp.plot();
plt.tight_layout();
water = pd.read_csv('./dataset/water.csv', index_col='date', parse_dates=True)
water = water.asfreq('MS')
water.head()
water.plot();
In this exercise you will use the ACF and PACF to test this data for seasonality. You can see from the plot above that the time series isn't stationary, so you should probably detrend it. You will detrend it by subtracting the moving average. Remember that you could use a window size of any value bigger than the likely period.
from statsmodels.graphics.tsaplots import plot_acf
# Create figure and subplot
fig, ax1 = plt.subplots()
# Plot the ACF on ax1
plot_acf(water['water_consumers'], lags=25, zero=False, ax=ax1);
water_2 = water - water.rolling(15).mean()
# Drop the NaN values
water_2 = water_2.dropna()
# Create figure and subplots
fig, ax1 = plt.subplots()
# Plot the ACF
plot_acf(water_2['water_consumers'], lags=25, zero=False, ax=ax1);
Based on this figure, 12 time steps is the time period of the seasonal component.
SARIMA models
- Seasonal ARIMA = SARIMA
$$ \text{SARIMA}(p, d, q)(P, D, Q)_S $$
- Non-seasonal orders
- p: autoregressive order
- d: differencing order
- q: moving average order
- Seasonal orders
- P: seasonal autoregressive order
- D: seasonal differencing order
- Q: seasonal moving average order
- S: Number of time steps per cycle
- Non-seasonal orders
-
The SARIMA model
- ARIMA(2, 0, 1) model: $$ y_t = a_1 y_{t-1} + a_2 y_{t-2} + m_1 \epsilon_{t-1} + \epsilon_t $$
- $\text{SARIMA}(0,0,0)(2, 0, 1)_7$ model: $$ y_t = a_7 y_{t-7} + a_{14} y_{t-14} + m_y \epsilon_{t-7} + \epsilon_t $$
-
Seasonal differencing
- Subtract the time series value of one season ago $$ \Delta y_t = y_t - y_{t-S} $$
Fitting SARIMA models
Fitting SARIMA models is the beginning of the end of this journey into time series modeling.
It is important that you get to know your way around the SARIMA model orders and so that's what you will focus on here.
In this exercise, you will practice fitting different SARIMA models to a set of time series.
df1 = pd.read_csv('./dataset/df1.csv', index_col=0, parse_dates=True)
df2 = pd.read_csv('./dataset/df2.csv', index_col=0, parse_dates=True)
df3 = pd.read_csv('./dataset/df3.csv', index_col=0, parse_dates=True)
df1 = df1.asfreq('d')
df2 = df2.asfreq('d')
df3 = df3.asfreq('d')
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Create a SARIMAX model
model = SARIMAX(df1, order=(1, 0, 0), seasonal_order=(1, 1, 0, 7))
# Fit the model
results = model.fit()
# Print the results summary
results.summary()
model = SARIMAX(df2, order=(2, 1, 1), seasonal_order=(1, 0, 0, 4))
# Fit the model
results = model.fit()
# Print the results summary
results.summary()
model = SARIMAX(df3, order=(1, 1, 0), seasonal_order=(0, 1, 1, 12))
# Fit the model
results = model.fit()
# Print the results summary
results.summary()
Choosing SARIMA order
In this exercise you will find the appropriate model order for a new set of time series. This is monthly series of the number of employed persons in Australia (in thousands). The seasonal period of this time series is 12 months.
You will create non-seasonal and seasonal ACF and PACF plots and use the table below to choose the appropriate model orders.
AR(p) | MA(q) | ARMA(p, q) | |
---|---|---|---|
ACF | Tails off | Cuts off after lag q | Tails off |
PACF | Cuts off after lag p | Tails off | Tails off |
aus_employment = pd.read_csv('./dataset/aus_employment.csv', index_col='date', parse_dates=True)
aus_employment = aus_employment.asfreq('MS')
aus_employment.head()
from statsmodels.graphics.tsaplots import plot_pacf
# Take the first and seasonal differences and drop NaNs
aus_employment_diff = aus_employment.diff().diff(12).dropna()
# Create the figure
fig, (ax1, ax2) = plt.subplots(2,1,figsize=(8,6))
# Plot the ACF on ax1
plot_acf(aus_employment_diff, lags=11, zero=False, ax=ax1);
# Plot the PACF on ax2
plot_pacf(aus_employment_diff, lags=11, zero=False, ax=ax2);
plt.tight_layout();
lags = [12, 24, 36, 48, 60]
# Create the figure
fig, (ax1, ax2) = plt.subplots(2,1,figsize=(8,6))
# Plot the ACF on ax1
plot_acf(aus_employment_diff, lags=lags, zero=False, ax=ax1);
# Plot the PACF on ax2
plot_pacf(aus_employment_diff, lags=lags, zero=False, ax=ax2);
plt.tight_layout();
The non-seasonal ACF doesn't show any of the usual patterns of MA, AR or ARMA models so we choose none of these. The Seaosnal ACF and PACF look like an MA(1) model. : $\text{SARIMAX}(0,1,0)(0,1,1)_{12}$
SARIMA vs ARIMA forecasts
In this exercise, you will see the effect of using a SARIMA model instead of an ARIMA model on your forecasts of seasonal time series.
Two models, an ARIMA(3,1,2) and a SARIMA(0,1,1)(1,1,1)12, have been fit to the Wisconsin employment time series. These were the best ARIMA model and the best SARIMA model available according to the AIC.
In the exercise you will use these two models to make dynamic future forecast for 25 months and plot these predictions alongside held out data for this period, wisconsin_test.
wisconsin_test = pd.read_csv('./dataset/wisconsin_test.csv', index_col='date', parse_dates=True)
wisconsin_test = wisconsin_test.asfreq('MS')
dates = wisconsin_test.index
wisconsin_test.head()
model = SARIMAX(wisconsin_test, order=(3, 1, 2), trend='c',
enforce_stationarity=True, enforce_invertibility=True)
arima_results = model.fit()
model = SARIMAX(wisconsin_test, order=(0, 1, 1), seasonal_order=(1, 1, 1, 12), trend='c',
enforce_stationarity=True, enforce_invertibility=True)
sarima_results = model.fit()
arima_results.specification
sarima_results.specification
arima_pred = arima_results.get_forecast(25)
arima_mean = arima_pred.predicted_mean
# Create SARIMA mean forecast
sarima_pred = sarima_results.get_forecast(25)
sarima_mean = sarima_pred.predicted_mean
# Plot mean ARIMA and SARIMA predictions and observed
plt.plot(dates, sarima_mean, label='SARIMA');
plt.plot(dates, arima_mean, label='ARIMA');
plt.plot(wisconsin_test, label='observed');
plt.legend();
Automated model selection
The pmdarima
package is a powerful tool to help you choose the model orders. You can use the information you already have from the identification step to narrow down the model orders which you choose by automation.
Remember, although automation is powerful, it can sometimes make mistakes that you wouldn't. It is hard to guess how the input data could be imperfect and affect the test scores.
In this exercise you will use the pmdarima package to automatically choose model orders for some time series datasets.
import pmdarima as pm
# Create auto_arima model
model1 = pm.auto_arima(df1,
seasonal=True, m=7,
d=0, D=1,
max_p=2, max_q=2,
trace=True,
error_action='ignore',
suppress_warnings=True)
# Print model summary
print(model1.summary())
model2 = pm.auto_arima(df2,
seasonal=False,
d=1,
trend='c',
max_p=2, max_q=2,
trace=True,
error_action='ignore',
suppress_warnings=True)
# Print model summary
print(model2.summary())
model3 = pm.auto_arima(df3,
seasonal=True, m=7,
d=1, D=1,
start_p=1, start_q=1,
max_p=1, max_q=1,
max_P=1, max_Q=1,
trace=True,
error_action='ignore',
suppress_warnings=True)
# Print model summary
print(model3.summary())
Saving and updating models
Once you have gone through the steps of the Box-Jenkins method and arrived at a model you are happy with, you will want to be able to save that model and also to incorporate new measurements when they are available. This is key part of putting the model into production.
In this exercise you will save a freshly trained model to disk, then reload it to update it with new data.
import joblib
# Set model name
filename='candy_model.pkl'
# Pickle it
joblib.dump(model1, filename)
# Load the model back in
loaded_model = joblib.load(filename)
# Update the model
# loaded_model.update(df)
SARIMA model diagnostics
Usually the next step would be to find the order of differencing and other model orders. However, this time it's already been done for you. The time series is best fit by a $\text{SARIMA}(1, 1, 1)(0, 1, 1)_{12}$ model with an added constant.
In this exercise you will make sure that this is a good model by first fitting it using the SARIMAX
class and going through the normal model diagnostics procedure.
co2 = pd.read_csv('./dataset/co2.csv', index_col='date', parse_dates=True)
co2 = co2.asfreq('MS')
co2.head()
model = SARIMAX(co2, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12), trend='c')
# Fit model
results = model.fit()
results.summary()
results.plot_diagnostics(figsize=(20, 15));
plt.tight_layout();
SARIMA forecast
In the previous exercise you confirmed that a SARIMA (1,1,1) x (0,1,1)12 model was a good fit to the CO2 time series by using diagnostic checking.
Now its time to put this model into practice to make future forecasts. Climate scientists tell us that we have until 2030 to drastically reduce our CO2 emissions or we will face major societal challenges.
In this exercise, you will forecast the CO2 time series up to the year 2030 to find the CO2 levels if we continue emitting as usual.
forecast_object = results.get_forecast(136)
# Extract predicted mean attribute
mean = forecast_object.predicted_mean
# Calculate the confidence intervals
conf_int = forecast_object.conf_int()
# Extract the forecast dates
dates = mean.index
plt.figure()
# Plot past CO2 levels
plt.plot(co2.index, co2, label='past');
# Plot the prediction means as line
plt.plot(dates, mean, label='predicted');
# Shade between the confidence intervals
plt.fill_between(dates, conf_int.loc[:, 'lower CO2_ppm'],
conf_int.loc[:, 'upper CO2_ppm'], alpha=0.4);
# Plot legend
plt.legend();
plt.savefig('../images/co2_predicted.png')
print(mean.iloc[-1])
# Print last confidence interval
print(conf_int.iloc[-1])