Seasonality, Trend and Noise
You will go beyond summary statistics by learning about autocorrelation and partial autocorrelation plots. You will also learn how to automatically detect seasonality, trend and noise in your time series data. This is a summary of the lecture "Visualizing Time-Series data in Python", via DataCamp.
- Autocorrelation and Partial autocorrelation
- Seasonality, trend and noise in time series data
- A quick review
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')
Autocorrelation and Partial autocorrelation
- Autocorrelation in time series data
- Autocorrelation is measured as the correlation between a time series and a delayed copy of itself
- For example, an autocorrelation of order 3 returns the correlation between a time series at points ($t_1, t_2, t_3, \dots$) and its own values lagged by 3 time points ($t_4, t_5, t_6, \dots$)
- It is used to find repetitive patterns or periodic signals in time series
- Partial autocorrelation in time series data
- Contrary to autocorrelation, partial autocorrelation removes the effect of previous time points
- For example, a partial autocorrelation function of order 3 returns the correlation between our time series ($t_1, t_2, t_3, \dots$) and lagged values of itself by 3 time points ($t_4, t_5, t_6, \dots$), but only after removing all effects attributable to lags 1 and 2
Autocorrelation in time series data
In the field of time series analysis, autocorrelation refers to the correlation of a time series with a lagged version of itself. For example, an autocorrelation of order 3 returns the correlation between a time series and its own values lagged by 3 time points.
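To make the definition concrete, here is a minimal sketch that computes a lag-3 autocorrelation by hand on a made-up series (a noisy sine wave with a period of 12 samples, not one of the course datasets). It correlates the series with a shifted copy of itself and compares the result with the pandas `Series.autocorr()` shortcut. Note that the ACF values reported by statsmodels use a slightly different normalization, so they may differ a little from these numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a sine wave with a period of 12 samples plus mild noise
rng = np.random.default_rng(0)
t = np.arange(120)
series = pd.Series(np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(t.size))

lag = 3
# Pearson correlation between the series and a copy of itself delayed by `lag` steps
manual = series.corr(series.shift(lag))
# pandas exposes the same computation directly
builtin = series.autocorr(lag=lag)

print(f"lag-{lag} autocorrelation: manual={manual:.4f}, builtin={builtin:.4f}")
# One full period later, the series lines up with itself again
print(f"lag-12 autocorrelation: {series.autocorr(lag=12):.4f}")
```

A quarter of a period (lag 3) puts peaks next to zero crossings, so the correlation is weak, while a full period (lag 12) gives a value close to 1. This is the kind of periodic signal an autocorrelation plot makes visible.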
It is common to use the autocorrelation function (ACF) plot, also known as a correlogram, to visualize the autocorrelation of a time series. The plot_acf() function in the statsmodels library can be used to measure and plot the autocorrelation of a time series.
co2_levels = pd.read_csv('./dataset/ch2_co2_levels.csv')
co2_levels.set_index('datestamp', inplace=True)
# Back-fill missing values (fillna(method='bfill') is deprecated in recent pandas)
co2_levels = co2_levels.bfill()
from statsmodels.graphics import tsaplots
# Display the autocorrelation plot for the first 24 lags
fig = tsaplots.plot_acf(co2_levels['co2'], lags=24);
Interpret autocorrelation plots
If an autocorrelation value is close to 0, observations separated by that lag are essentially uncorrelated with one another. Conversely, autocorrelation values close to 1 or -1 indicate a strong positive or negative correlation, respectively, between observations that lag apart.
To help you assess how trustworthy these autocorrelation values are, the plot_acf() function also draws confidence intervals (represented as blue shaded regions). If an autocorrelation value falls outside the confidence interval region, you can treat it as statistically significant.
Partial autocorrelation in time series data
Like autocorrelation, the partial autocorrelation function (PACF) measures the correlation coefficient between a time-series and lagged versions of itself. However, it extends upon this idea by also removing the effect of previous time points. For example, a partial autocorrelation function of order 3 returns the correlation between our time series ($t_1, t_2, t_3, \dots$) and its own values lagged by 3 time points ($t_4, t_5, t_6, \dots$), but only after removing all effects attributable to lags 1 and 2.
The plot_pacf() function in the statsmodels library can be used to measure and plot the partial autocorrelation of a time series.
fig = tsaplots.plot_pacf(co2_levels['co2'], lags=24);
Interpret partial autocorrelation plots
If a partial autocorrelation value is close to 0, observations and their lagged counterparts are not correlated once the shorter lags have been accounted for. Conversely, partial autocorrelation values close to 1 or -1 indicate a strong positive or negative correlation between the lagged observations of the time series.
The plot_pacf() function also draws confidence intervals, which are represented as blue shaded regions. If a partial autocorrelation value falls outside this confidence interval region, you can treat it as statistically significant.
Seasonality, trend and noise in time series data
- The properties of time series
- Seasonality: does the data display a clear periodic pattern?
- Trend: does the data follow a consistent upwards or downwards slope?
- Noise: are there any outlier points or missing values that are not consistent with the rest of the data?
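The three properties above can be illustrated by building a series from known ingredients. The following sketch uses invented numbers (a linear trend, a 12-month sine cycle and Gaussian noise). It then recovers a rough trend with a centered 12-month rolling mean and a rough seasonal profile by averaging the detrended values per calendar month. This is a hand-rolled approximation for intuition only, not the method statsmodels uses.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series covering six years
rng = np.random.default_rng(7)
index = pd.date_range("2015-01-01", periods=72, freq="MS")
t = np.arange(len(index))

trend = 100 + 0.5 * t                           # steady upward slope
seasonality = 10 * np.sin(2 * np.pi * t / 12)   # pattern repeating every 12 months
noise = rng.normal(scale=2.0, size=len(index))  # irregular fluctuations

synthetic = pd.Series(trend + seasonality + noise, index=index, name="value")

# A 12-month centered rolling mean averages out the yearly cycle, leaving the trend
rolling_mean = synthetic.rolling(window=12, center=True).mean()
# Averaging the detrended values by calendar month exposes the seasonal profile
monthly_profile = (synthetic - rolling_mean).groupby(index.month).mean()

print(rolling_mean.dropna().iloc[[0, -1]])
print(monthly_profile.round(2))
```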
Time series decomposition
You can rely on a method known as time-series decomposition to automatically extract and quantify the structure of time-series data. The statsmodels library provides the seasonal_decompose() function to perform time series decomposition out of the box.
decomposition = sm.tsa.seasonal_decompose(time_series)
You can extract a specific component, for example seasonality, by accessing the seasonal attribute of the decomposition object.
co2_levels.index = pd.to_datetime(co2_levels.index)
import statsmodels.api as sm
# Perform time series decomposition
decomposition = sm.tsa.seasonal_decompose(co2_levels)
# Print the seasonality component
print(decomposition.seasonal)
Plot individual components
It is also possible to extract other inferred quantities from your time-series decomposition object. The following code shows you how to extract the observed, trend and noise (or residual, resid) components.
observed = decomposition.observed
trend = decomposition.trend
residuals = decomposition.resid
You can then use the extracted components and plot them individually.
trend = decomposition.trend
# Plot the values of the trend
ax = trend.plot(figsize=(12, 6), fontsize=10);
# Specify axis labels
ax.set_xlabel('Date', fontsize=10);
ax.set_title('Trend component of the CO2 time-series', fontsize=10);
A quick review
airline = pd.read_csv('./dataset/ch3_airline_passengers.csv', parse_dates=['Month'], index_col='Month')
airline.info()
ax = airline.plot(color='blue', fontsize=12);
# Add a red vertical line at the date 1955-12-01
ax.axvline('1955-12-01', color='red', linestyle='--');
# Specify the labels in your plot
ax.set_xlabel('Date', fontsize=12);
ax.set_title('Number of Monthly Airline Passengers', fontsize=12);
Analyze the airline dataset
In Chapter 2 you learned how to:
- Check for the presence of missing values and collect summary statistics of time series data contained in a pandas DataFrame.
- Generate boxplots of your data to quickly gain insight into it.
- Display aggregate statistics of your data using groupby().
# Print the number of missing values in each column
print(airline.isnull().sum())
# Print out summary statistics of the airline DataFrame
print(airline.describe())
ax = airline.boxplot();
# Specify the title of your plot
ax.set_title('Boxplot of Monthly Airline\nPassengers Count', fontsize=20);
index_month = airline.index.month
# Compute the mean number of passengers for each month of the year
mean_airline_by_month = airline.groupby(index_month).mean()
# Plot the mean number of passengers for each month of the year
mean_airline_by_month.plot();
plt.legend(fontsize=20);
decomposition = sm.tsa.seasonal_decompose(airline)
# Extract the trend and seasonal components
trend = decomposition.trend
seasonal = decomposition.seasonal
airline_decomposed = pd.concat([trend, seasonal], axis=1)
print(airline_decomposed.head(5))
# Plot the values of the airline_decomposed DataFrame
ax = airline_decomposed.plot(figsize=(12, 6), fontsize=15);
# Specify axis labels
ax.set_xlabel('Date', fontsize=15);
plt.legend(fontsize=15);
plt.savefig('../images/trend_seasonal.png')