import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')

Autocorrelation and Partial autocorrelation

Autocorrelation in time series data
- Autocorrelation is measured as the correlation between a time series and a delayed copy of itself
- For example, an autocorrelation of order 3 returns the correlation between a time series at points($t_1, t_2, t_3$) and its own values lagged by 3 time points. ($t_4, t_5, t_6$)
- It is used to find repetitive paterns or periodic signal it time series
Partial autocorrelation in time series data
- Contrary to autocorrelation, partial autocorrelation removes the effect of previous time points
- For example, a partial autocorrelatio nfunction of order 3 returns the correlation between out time series ($t_1, t_2, t_3$) and lagged values of itself by 3 time points ($t_4, t_5, t_6$), but only after removing all effects attributable to lags 1 and 2

Autocorrelation in time series data

In the field of time series analysis, autocorrelation refers to the correlation of a time series with a lagged version of itself. For example, an autocorrelation of order 3 returns the correlation between a time series and its own values lagged by 3 time points.

It is common to use the autocorrelation (ACF) plot, also known as self-autocorrelation, to visualize the autocorrelation of a time-series. The plot_acf() function in the statsmodels library can be used to measure and plot the autocorrelation of a time series.

co2_levels = pd.read_csv('./dataset/ch2_co2_levels.csv')
co2_levels.set_index('datestamp', inplace=True)
co2_levels = co2_levels.fillna(method='bfill')

from statsmodels.graphics import tsaplots

# Display 
fig = tsaplots.plot_acf(co2_levels['co2'], lags= 24);

Interpret autocorrelation plots

If autocorrelation values are close to 0, then values between consecutive observations are not correlated with one another. Inversely, autocorrelations values close to 1 or -1 indicate that there exists strong positive or negative correlations between consecutive observations, respectively.

In order to help you asses how trustworthy these autocorrelation values are, the plot_acf() function also returns confidence intervals (represented as blue shaded regions). If an autocorrelation value goes beyond the confidence interval region, you can assume that the observed autocorrelation value is statistically significant.

Partial autocorrelation in time series data

Like autocorrelation, the partial autocorrelation function (PACF) measures the correlation coefficient between a time-series and lagged versions of itself. However, it extends upon this idea by also removing the effect of previous time points. For example, a partial autocorrelation function of order 3 returns the correlation between our time series ($t_1, t_2, t_3, \dots$) and its own values lagged by 3 time points ($t_4, t_5, t_6, \dots$), but only after removing all effects attributable to lags 1 and 2.

The plot_pacf() function in the statsmodels library can be used to measure and plot the partial autocorrelation of a time series.

fig = tsaplots.plot_pacf(co2_levels['co2'], lags=24);

Interpret partial autocorrelation plots

If partial autocorrelation values are close to 0, then values between observations and lagged observations are not correlated with one another. Inversely, partial autocorrelations with values close to 1 or -1 indicate that there exists strong positive or negative correlations between the lagged observations of the time series.

The .plot_pacf() function also returns confidence intervals, which are represented as blue shaded regions. If partial autocorrelation values are beyond this confidence interval regions, then you can assume that the observed partial autocorrelation values are statistically significant.

Seasonality, trend and noise in time series data

The properties of time series
- Seasonality: does the data display a clear periodic pattern?
- Trend: does the data follow a consistent upwards or downwards slope?
- Noise: are there any outlier points or missing values that are not consistent with the rest of the data?

Time series decomposition

You can rely on a method known as time-series decomposition to automatically extract and quantify the structure of time-series data. The statsmodels library provides the seasonal_decompose() function to perform time series decomposition out of the box.

decomposition = sm.tsa.seasonal_decompose(time_series)

You can extract a specific component, for example seasonality, by accessing the seasonal attribute of the decomposition object.

co2_levels.index = pd.to_datetime(co2_levels.index)

import statsmodels.api as sm

# Perform time series decomposition
decomposition = sm.tsa.seasonal_decompose(co2_levels)

# Print the seasonality component
print(decomposition.seasonal)

datestamp
1958-03-29    1.028042
1958-04-05    1.235242
1958-04-12    1.412344
1958-04-19    1.701186
1958-04-26    1.950694
                ...   
2001-12-01   -0.525044
2001-12-08   -0.392799
2001-12-15   -0.134838
2001-12-22    0.116056
2001-12-29    0.285354
Name: seasonal, Length: 2284, dtype: float64

Plot individual components

It is also possible to extract other inferred quantities from your time-series decomposition object. The following code shows you how to extract the observed, trend and noise (or residual, resid) components.

observed = decomposition.observed
trend = decomposition.trend
residuals = decomposition.resid

You can then use the extracted components and plot them individually.

trend = decomposition.trend

# Plot the values of the trend
ax = trend.plot(figsize=(12, 6), fontsize=10);

# Specify axis labels
ax.set_xlabel('Date', fontsize=10);
ax.set_title('Seasonal component the CO2 time-series', fontsize=10);

A quick review

Visualize the airline dataset

You will now review the contents of chapter 1. You will have the opportunity to work with a new dataset that contains the monthly number of passengers who took a commercial flight between January 1949 and December 1960.

airline = pd.read_csv('./dataset/ch3_airline_passengers.csv', parse_dates=['Month'], index_col='Month')

airline.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 144 entries, 1949-01-01 to 1960-12-01
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   AirPassengers  144 non-null    int64
dtypes: int64(1)
memory usage: 2.2 KB

ax = airline.plot(color='blue', fontsize=12);

# Add a red vertical line at the date 1955-12-01
ax.axvline('1955-12-01', color='red', linestyle='--');

# Specify the labels in your plot
ax.set_xlabel('Date', fontsize=12);
ax.set_title('Number of Monthly Airline Passengers', fontsize=12);

Analyze the airline dataset

In Chapter 2 you learned:

How to check for the presence of missing values, and how to collect summary statistics of time series data contained in a pandas DataFrame.
To generate boxplots of your data to quickly gain insight in your data.
Display aggregate statistics of your data using groupby().

print(airline.isnull().sum())

# Print out summary statistics of the airline DataFrame
print(airline.describe())

AirPassengers    0
dtype: int64
       AirPassengers
count     144.000000
mean      280.298611
std       119.966317
min       104.000000
25%       180.000000
50%       265.500000
75%       360.500000
max       622.000000

ax = airline.boxplot();

# Specify the title of your plot
ax.set_title('Boxplot of Monthly Airline\nPassengers Count', fontsize=20);

index_month = airline.index.month

# Compute the mean number of passengers for each month of the year
mean_airline_by_month = airline.groupby(index_month).mean()

# Plot the mean number of passengers for each month of the year
mean_airline_by_month.plot();
plt.legend(fontsize=20);

Time series decomposition of the airline dataset

In this exercise, you will apply time series decomposition to the airline dataset, and visualize the trend and seasonal componenets.

decomposition = sm.tsa.seasonal_decompose(airline)

# Extract the trend and seasonal components
trend = decomposition.trend
seasonal = decomposition.seasonal

airline_decomposed = pd.concat([trend, seasonal], axis=1)

print(airline_decomposed.head(5))

# Plot the values of the airline_decomposed DataFrame
ax = airline_decomposed.plot(figsize=(12, 6), fontsize=15);

# Specify axis labels
ax.set_xlabel('Date', fontsize=15);
plt.legend(fontsize=15);
plt.savefig('../images/trend_seasonal.png')

            trend   seasonal
Month                       
1949-01-01    NaN -24.748737
1949-02-01    NaN -36.188131
1949-03-01    NaN  -2.241162
1949-04-01    NaN  -8.036616
1949-05-01    NaN  -4.506313