- Autocorrelation and partial autocorrelation
- Seasonality, trend and noise in time series data
- A quick review

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.rcParams['figure.figsize'] = (10, 5)
    plt.style.use('fivethirtyeight')

- Autocorrelation in time series data
- Autocorrelation is measured as the correlation between a time series and a delayed copy of itself
- For example, an autocorrelation of order 3 returns the correlation between a time series at points ($t_1, t_2, t_3, \dots$) and its own values lagged by 3 time points ($t_4, t_5, t_6, \dots$)
- It is used to find repetitive patterns or periodic signals in time series
- Partial autocorrelation in time series data
- Contrary to autocorrelation, partial autocorrelation removes the effect of previous time points
- For example, a partial autocorrelation function of order 3 returns the correlation between our time series ($t_1, t_2, t_3, \dots$) and its own values lagged by 3 time points ($t_4, t_5, t_6, \dots$), but only after removing all effects attributable to lags 1 and 2
In the field of time series analysis, autocorrelation refers to the correlation of a time series with a lagged version of itself. For example, an autocorrelation of order 3 returns the correlation between a time series and its own values lagged by 3 time points.
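As a minimal sketch of this definition, the lag-3 autocorrelation can be computed by correlating a series with a copy of itself shifted by three steps. The random-walk series below is synthetic and exists only for illustration; pandas' `Series.autocorr()` performs the same computation. The values drawn by statsmodels' ACF plot may differ slightly, because it estimates the mean and variance from the full sample.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk series, used only for illustration.
rng = np.random.default_rng(0)
series = pd.Series(rng.normal(size=200).cumsum())

lag = 3

# Pair each value with the value observed `lag` steps earlier.
lagged = series.shift(lag)

# Pearson correlation between the series and its lagged copy
# (pairs that contain NaN after the shift are ignored by pandas).
manual = series.corr(lagged)

# pandas offers the same computation as a built-in method.
builtin = series.autocorr(lag=lag)

print(f"manual: {manual:.4f}, built-in: {builtin:.4f}")
```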
It is common to use the autocorrelation function (ACF) plot, also called a correlogram, to visualize the autocorrelation of a time series. The plot_acf() function in the statsmodels library can be used to measure and plot the autocorrelation of a time series.

    from statsmodels.graphics import tsaplots

    co2_levels = pd.read_csv('./dataset/ch2_co2_levels.csv')
    co2_levels.set_index('datestamp', inplace=True)
    # Backfill missing values (fillna(method='bfill') is deprecated in recent pandas)
    co2_levels = co2_levels.bfill()

    # Display the autocorrelation plot of the CO2 series
    fig = tsaplots.plot_acf(co2_levels['co2'], lags=24);

If an autocorrelation value is close to 0, then observations separated by that lag are not linearly correlated with one another. Conversely, autocorrelation values close to 1 or -1 indicate strong positive or negative correlation, respectively, between observations separated by that lag.
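To make this interpretation concrete, here is a small sketch on a synthetic signal (an assumption for illustration, not the CO2 data): a noisy sine wave that repeats every 12 steps. Its autocorrelation should be strongly positive at the full period, strongly negative at half the period, and near zero at a quarter of the period.

```python
import numpy as np
import pandas as pd

# Synthetic periodic signal: a sine wave with period 12 plus a little noise.
rng = np.random.default_rng(1)
t = np.arange(240)
series = pd.Series(np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.1, size=t.size))

# Autocorrelation at a quarter period, half period, and full period.
acf_by_lag = {lag: series.autocorr(lag=lag) for lag in (3, 6, 12)}

for lag, value in acf_by_lag.items():
    print(f"lag {lag:>2}: {value:+.2f}")
```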
To help you assess how trustworthy these autocorrelation values are, the plot_acf() function also draws confidence intervals (represented as blue shaded regions). If an autocorrelation value falls outside the confidence interval region, you can assume that the observed autocorrelation value is statistically significant.
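The idea behind the shaded region can be sketched with a common approximation: for white noise with $N$ observations, sample autocorrelations fall within roughly $\pm 1.96/\sqrt{N}$ about 95% of the time. The example below uses synthetic white noise and this simplified bound. It is an assumption for illustration; statsmodels computes its bands differently by default (using Bartlett's formula), so the plotted region will not match it exactly.

```python
import numpy as np
import pandas as pd

# Synthetic white noise: by construction it has no true autocorrelation.
rng = np.random.default_rng(2)
n = 500
noise = pd.Series(rng.normal(size=n))

# Approximate 95% bound for the sample autocorrelation of white noise.
bound = 1.96 / np.sqrt(n)

# Sample autocorrelation for lags 1 to 24.
lags = range(1, 25)
acf_values = pd.Series([noise.autocorr(lag=k) for k in lags], index=lags)

# Lags whose autocorrelation falls outside the approximate bound.
significant = acf_values[acf_values.abs() > bound]

print(f"approximate bound: ±{bound:.3f}")
print(f"{len(significant)} of {len(acf_values)} lags fall outside the bound")
```

Because the bound is a 95% interval, roughly one lag in twenty can be expected to fall outside it by chance alone, even when no real autocorrelation exists.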
Like autocorrelation, the partial autocorrelation function (PACF) measures the correlation coefficient between a time-series and lagged versions of itself. However, it extends upon this idea by also removing the effect of previous time points. For example, a partial autocorrelation function of order 3 returns the correlation between our time series ($t_1, t_2, t_3, \dots$) and its own values lagged by 3 time points ($t_4, t_5, t_6, \dots$), but only after removing all effects attributable to lags 1 and 2.
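One way to see what "removing the effects of lags 1 and 2" means is a regression sketch: regress each value on its previous three values and read off the coefficient on the lag-3 term. The helper `pacf_by_regression` and the simulated AR(1) series below are hypothetical and for illustration only. statsmodels offers several PACF estimators, so its plotted values may differ slightly from this one.

```python
import numpy as np

# Simulated AR(1) process: each value depends directly only on the previous one.
rng = np.random.default_rng(3)
n, phi = 2000, 0.8
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()


def pacf_by_regression(values, lag):
    """Coefficient on x_{t-lag} when regressing x_t on x_{t-1}, ..., x_{t-lag}."""
    values = np.asarray(values, dtype=float)
    y = values[lag:]
    # Design matrix columns: intercept, x_{t-1}, ..., x_{t-lag}.
    columns = [np.ones(len(y))]
    columns += [values[lag - k : len(values) - k] for k in range(1, lag + 1)]
    design = np.column_stack(columns)
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[-1]


def acf(values, lag):
    """Plain correlation between the series and its copy shifted by `lag`."""
    return np.corrcoef(values[lag:], values[:-lag])[0, 1]


acf3 = acf(x, 3)
pacf1 = pacf_by_regression(x, 1)
pacf3 = pacf_by_regression(x, 3)

print(f"ACF  at lag 3: {acf3:.2f}")
print(f"PACF at lag 1: {pacf1:.2f}")
print(f"PACF at lag 3: {pacf3:.2f}")
```

In this simulated series, the lag-3 autocorrelation is clearly positive because the dependence propagates through lags 1 and 2, while the lag-3 partial autocorrelation stays close to zero once those shorter lags are accounted for.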
The plot_pacf() function in the statsmodels library can be used to measure and plot the partial autocorrelation of a time series.
fig = tsaplots.plot_pacf(co2_levels['co2'], lags=24);
If a partial autocorrelation value is close to 0, then observations and their lagged counterparts are not correlated once the shorter lags are accounted for. Conversely, partial autocorrelations close to 1 or -1 indicate strong positive or negative correlation between the time series and its lagged observations, after removing the effect of the shorter lags.
The plot_pacf() function also draws confidence intervals, which are represented as blue shaded regions. If partial autocorrelation values fall outside this confidence interval region, then you can assume that the observed partial autocorrelation values are statistically significant.
- The properties of time series
- Seasonality: does the data display a clear periodic pattern?
- Trend: does the data follow a consistent upwards or downwards slope?
- Noise: are there any outlier points or missing values that are not consistent with the rest of the data?
You can rely on a method known as time-series decomposition to automatically extract and quantify the structure of time-series data. The statsmodels library provides the seasonal_decompose() function to perform time series decomposition out of the box.

    import statsmodels.api as sm

    decomposition = sm.tsa.seasonal_decompose(time_series)

You can extract a specific component, for example seasonality, by accessing the seasonal attribute of the decomposition object.

    import statsmodels.api as sm

    # Convert the index to a DatetimeIndex so the period can be inferred
    co2_levels.index = pd.to_datetime(co2_levels.index)

    # Perform time series decomposition
    decomposition = sm.tsa.seasonal_decompose(co2_levels)

    # Print the seasonality component
    print(decomposition.seasonal)


    datestamp
    1958-03-29    1.028042
    1958-04-05    1.235242
    1958-04-12    1.412344
    1958-04-19    1.701186
    1958-04-26    1.950694
                    ...
    2001-12-01   -0.525044
    2001-12-08   -0.392799
    2001-12-15   -0.134838
    2001-12-22    0.116056
    2001-12-29    0.285354
    Name: seasonal, Length: 2284, dtype: float64

It is also possible to extract other inferred quantities from your time-series decomposition object. The following code shows you how to extract the observed, trend and noise (or residual) components.

    observed = decomposition.observed
    trend = decomposition.trend
    residuals = decomposition.resid

You can then use the extracted components and plot them individually.

    trend = decomposition.trend

    # Plot the values of the trend
    ax = trend.plot(figsize=(12, 6), fontsize=10);

    # Specify axis labels
    ax.set_xlabel('Date', fontsize=10);
    ax.set_title('Trend component of the CO2 time-series', fontsize=10);

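To give a feel for where the trend, seasonal and residual components come from, here is a simplified sketch of a classical additive decomposition built with pandas only. It runs on a synthetic monthly series (an assumption for illustration) and is not a reproduction of the statsmodels implementation: the trend is a centered moving average over one year, the seasonal component is the centered average of the detrended values for each calendar month, and the residual is what remains.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend + yearly seasonality + noise.
rng = np.random.default_rng(4)
index = pd.date_range("2000-01-01", periods=120, freq="MS")
t = np.arange(len(index))
month = index.month.to_numpy()
true_seasonal = 10 * np.sin(2 * np.pi * month / 12)
observed = pd.Series(
    100 + 0.5 * t + true_seasonal + rng.normal(scale=0.5, size=len(t)),
    index=index,
    name="observed",
)

# Trend: centered moving average spanning one full period (12 months).
# With an even period, the two end points of the 13-value window get half weight.
period = 12
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend = observed.rolling(period + 1, center=True).apply(
    lambda window: np.dot(window, weights), raw=True
)

# Seasonal: average detrended value per calendar month, centered around zero.
detrended = observed - trend
monthly_means = detrended.groupby(detrended.index.month).mean()
monthly_means -= monthly_means.mean()
seasonal = pd.Series(monthly_means.loc[index.month].to_numpy(), index=index, name="seasonal")

# Residual: whatever the trend and seasonal components do not explain.
resid = observed - trend - seasonal

print(pd.concat([trend, seasonal, resid.rename("resid")], axis=1).head(8))
```

The trend is undefined for the first and last six months because the centered window does not fit there, which is why the decomposed trend of a monthly series starts with missing values.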

    airline = pd.read_csv('./dataset/ch3_airline_passengers.csv', parse_dates=['Month'], index_col='Month')
    airline.info()


    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 144 entries, 1949-01-01 to 1960-12-01
    Data columns (total 1 columns):
     #   Column         Non-Null Count  Dtype
    ---  ------         --------------  -----
     0   AirPassengers  144 non-null    int64
    dtypes: int64(1)
    memory usage: 2.2 KB


    ax = airline.plot(color='blue', fontsize=12);

    # Add a red vertical line at the date 1955-12-01
    ax.axvline('1955-12-01', color='red', linestyle='--');

    # Specify the labels in your plot
    ax.set_xlabel('Date', fontsize=12);
    ax.set_title('Number of Monthly Airline Passengers', fontsize=12);

In Chapter 2 you learned:
- How to check for the presence of missing values, and how to collect summary statistics of time series data contained in a pandas DataFrame.
- How to generate boxplots to quickly gain insight into your data.
- How to display aggregate statistics of your data using groupby().

    # Print out the number of missing values in each column
    print(airline.isnull().sum())

    # Print out summary statistics of the airline DataFrame
    print(airline.describe())


    AirPassengers    0
    dtype: int64
           AirPassengers
    count     144.000000
    mean      280.298611
    std       119.966317
    min       104.000000
    25%       180.000000
    50%       265.500000
    75%       360.500000
    max       622.000000


    ax = airline.boxplot();

    # Specify the title of your plot
    ax.set_title('Boxplot of Monthly Airline\nPassengers Count', fontsize=20);


    # Get the month of each date in the index
    index_month = airline.index.month

    # Compute the mean number of passengers for each month of the year
    mean_airline_by_month = airline.groupby(index_month).mean()

    # Plot the mean number of passengers for each month of the year
    mean_airline_by_month.plot();
    plt.legend(fontsize=20);


    # Perform time series decomposition
    decomposition = sm.tsa.seasonal_decompose(airline)

    # Extract the trend and seasonal components
    trend = decomposition.trend
    seasonal = decomposition.seasonal


    # Combine the trend and seasonal components into one DataFrame
    airline_decomposed = pd.concat([trend, seasonal], axis=1)

    print(airline_decomposed.head(5))

    # Plot the values of the airline_decomposed DataFrame
    ax = airline_decomposed.plot(figsize=(12, 6), fontsize=15);

    # Specify axis labels
    ax.set_xlabel('Date', fontsize=15);
    plt.legend(fontsize=15);
    plt.savefig('../images/trend_seasonal.png')


                trend   seasonal
    Month
    1949-01-01    NaN -24.748737
    1949-02-01    NaN -36.188131
    1949-03-01    NaN  -2.241162
    1949-04-01    NaN  -8.036616
    1949-05-01    NaN  -4.506313