Work with Multiple Time Series
In the field of Data Science, it is common to be involved in projects where multiple time series need to be studied simultaneously. In this chapter, we will show you how to plot multiple time series at once, and how to discover and describe relationships between multiple time series. This is a summary of the lecture "Visualizing Time-Series data in Python", via DataCamp.
- Working with more than one time series
- Plot multiple time series
- Find relationships between multiple time series
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')
Load multiple time series
Whether it is during personal projects or your day-to-day work as a Data Scientist, it is likely that you will encounter situations that require the analysis and visualization of multiple time series at the same time.
Provided that the data for each time series is stored in distinct columns of a file, the pandas library makes it easy to work with multiple time series. In the following exercises, you will work with a new time series dataset that contains the amount of different types of meat produced in the USA between 1944 and 2012.
meat = pd.read_csv('./dataset/ch4_meat.csv')
# Review the first five lines of the meat DataFrame
print(meat.head(5))
# Convert the date column to a datestamp type
meat['date'] = pd.to_datetime(meat['date'])
# Set the date column as the index of your DataFrame meat
meat = meat.set_index('date')
# Print the summary statistics of the DataFrame
print(meat.describe())
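As a minimal sketch of the same loading steps, the snippet below reads a small in-memory CSV instead of the course file. The column names `series_a` and `series_b` and all values are made up for illustration. It also shows that `pd.read_csv` can parse the dates and set the index in one call through its `parse_dates` and `index_col` parameters.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file (values are invented)
csv_text = """date,series_a,series_b
2000-01-01,10.0,100.0
2000-02-01,12.5,95.0
2000-03-01,11.0,102.5
"""

# Two-step version, as in the walkthrough above
two_step = pd.read_csv(io.StringIO(csv_text))
two_step["date"] = pd.to_datetime(two_step["date"])
two_step = two_step.set_index("date")

# One-step version: let read_csv parse the dates and set the index
one_step = pd.read_csv(io.StringIO(csv_text),
                       parse_dates=["date"],
                       index_col="date")

print(type(one_step.index).__name__)
print(one_step.describe())
```

Either route should leave you with a DataFrame indexed by timestamps, which is what the plotting calls in the rest of this chapter rely on.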
Visualize multiple time series
If there are multiple time series in a single DataFrame, you can still use the .plot() method to plot a line chart of all the time series. Another interesting way to plot these is to use area charts. Area charts are commonly used when dealing with multiple time series, and can be used to display cumulative totals.
With the pandas library, you can simply leverage the .plot.area() method to produce area charts of the time series data in your DataFrame.
ax = meat.plot(linewidth=2, fontsize=12);
# Additional customizations
ax.set_xlabel('Date');
ax.legend(fontsize=12);
ax = meat.plot.area(fontsize=12);
# Additional customizations
ax.set_xlabel('Date');
ax.legend(fontsize=12);
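To make the "cumulative totals" idea concrete, here is a small sketch on a made-up DataFrame (`toy` and its columns are hypothetical). In a stacked area chart each band sits on top of the previous ones, so the upper edge of the last band corresponds to the row-wise sum of all columns.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data: three series on a monthly index
idx = pd.date_range("2020-01-01", periods=6, freq="MS")
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                    "b": [2, 2, 2, 2, 2, 2],
                    "c": [3, 1, 3, 1, 3, 1]}, index=idx)

# Row-wise cumulative sum: the upper edge of each band in the stacked chart
stacked_tops = toy.cumsum(axis=1)
print(stacked_tops)

# stacked=True is the default for area plots; it is spelled out here for clarity
ax = toy.plot.area(stacked=True, alpha=0.6)
ax.set_xlabel("Date")
```

Passing `stacked=False` instead draws overlapping, semi-transparent areas, which can be easier to read when the individual levels matter more than the total.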
Define the color palette of your plots
When visualizing multiple time series, it can be difficult to differentiate between various colors in the default color scheme.
To remedy this, you can define each color manually, but this may be time-consuming. Fortunately, it is possible to leverage the colormap argument to .plot() to automatically assign specific color palettes with varying contrasts. You can either provide a matplotlib colormap object as an input to this parameter, or provide the name of one of matplotlib's built-in colormaps as a string (the full list is in the matplotlib documentation).
For example, you can specify the 'viridis' colormap using the following command:
df.plot(colormap='viridis')
ax = meat.plot(colormap='cubehelix', fontsize=15);
# Additional customizations
ax.set_xlabel('Date');
ax.legend(fontsize=12);
ax = meat.plot(colormap='PuOr', fontsize=15);
# Additional customizations
ax.set_xlabel('Date');
ax.legend(fontsize=12);
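The two ways of supplying a colormap can be sketched as follows, using a made-up DataFrame (`toy` is hypothetical). The snippet also samples one color per column by hand, which is handy when you want to reuse the same colors in another plot; treat the even spacing as an illustrative choice rather than a statement about what pandas does internally.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data: three series on a monthly index
idx = pd.date_range("2020-01-01", periods=12, freq="MS")
toy = pd.DataFrame(np.arange(36).reshape(12, 3),
                   index=idx, columns=["a", "b", "c"])

# Option 1: pass the colormap by name
ax_name = toy.plot(colormap="viridis")

# Option 2: pass a matplotlib colormap object
cmap = plt.get_cmap("viridis")
ax_obj = toy.plot(colormap=cmap)

# Manual alternative: sample one evenly spaced color per column
colors = [cmap(x) for x in np.linspace(0, 1, toy.shape[1])]
ax_manual = toy.plot(color=colors)
```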
Add summary statistics to your time series plot
It is possible to visualize time series plots and numerical summaries on one single graph by using the pandas API to matplotlib along with the table method:
# Plot the time series data in the DataFrame
ax = df.plot()
# Compute summary statistics of the df DataFrame
df_summary = df.describe()
# Add summary table information to the plot
ax.table(cellText=df_summary.values,
         colWidths=[0.3]*len(df.columns),
         rowLabels=df_summary.index,
         colLabels=df_summary.columns,
         loc='top')
meat_mean = meat.mean().to_frame().T
meat_mean.rename({0:'mean'}, inplace=True)
meat_mean
ax = meat.plot(fontsize=10, linewidth=1);
# Add x-axis labels
ax.set_xlabel('Date', fontsize=10);
# Add summary table information to the plot
ax.table(cellText=meat_mean.values,
         colWidths=[0.15] * len(meat_mean.columns),
         rowLabels=meat_mean.index,
         colLabels=meat_mean.columns,
         loc='top');
# Specify the fontsize and location of your legend
ax.legend(loc='upper center', bbox_to_anchor=(0.5, 0.95), ncol=3, fontsize=10);
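Raw floating-point values can make the table hard to read. One possible refinement, sketched here on a made-up DataFrame (`toy` is hypothetical), is to format the numbers as strings before handing them to `ax.table`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data
idx = pd.date_range("2020-01-01", periods=4, freq="MS")
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]}, index=idx)

# Keep only the mean and standard deviation rows of the summary
summary = toy.describe().loc[["mean", "std"]]

# Format every value with one decimal place so the table stays compact
cell_text = [[f"{value:.1f}" for value in row] for row in summary.values]
print(cell_text)

ax = toy.plot()
ax.set_xlabel("Date")
table = ax.table(cellText=cell_text,
                 colWidths=[0.2] * len(summary.columns),
                 rowLabels=list(summary.index),
                 colLabels=list(summary.columns),
                 loc="top")
```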
Plot your time series on individual plots
It can be beneficial to plot individual time series on separate graphs as this may improve clarity and provide more context around each time series in your DataFrame.
It is possible to create a "grid" of individual graphs by "faceting" each time series by setting the subplots argument to True. In addition, the arguments that can be added are:
- layout: specifies the number of rows x columns to use.
- sharex and sharey: specify whether the x-axis and y-axis values should be shared between your plots.
meat.plot(subplots=True,
          layout=(2, 4),
          sharex=False,
          sharey=False,
          colormap='viridis',
          fontsize=8,
          legend=False,
          linewidth=0.2);
plt.tight_layout();
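With `legend=False`, the individual panels no longer say which series they show. The sketch below, on a made-up DataFrame (`toy` is hypothetical), derives the grid shape from the number of columns and uses the array of Axes returned by `.plot(subplots=True, ...)` to give each panel a title.

```python
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data: four random-walk series
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
toy = pd.DataFrame(rng.normal(size=(24, 4)).cumsum(axis=0),
                   index=idx, columns=["a", "b", "c", "d"])

# Derive the grid shape from the number of series
n_cols = 2
n_rows = math.ceil(toy.shape[1] / n_cols)

axes = toy.plot(subplots=True,
                layout=(n_rows, n_cols),
                sharex=True,
                sharey=False,
                legend=False)

# Label each panel with the name of the series it shows
for ax, name in zip(axes.ravel(), toy.columns):
    ax.set_title(name, fontsize=10)

plt.tight_layout()
```

Setting `sharey=False` lets each series use its own scale, which helps when the series differ a lot in magnitude; `sharey=True` makes levels comparable across panels instead.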
Find relationships between multiple time series
- Correlations between two variables
  - In the field of Statistics, the correlation coefficient is a measure used to determine the strength or lack of relationship between two variables:
    - Pearson's coefficient can be used to compute the correlation coefficient between variables for which the relationship is thought to be linear
    - Kendall Tau or Spearman rank can be used to compute the correlation coefficient between variables for which the relationship is thought to be non-linear (both are rank-based, so they measure monotonic association)
- What is a correlation matrix?
  - When computing the correlation coefficient between more than two variables, you obtain a correlation matrix
    - Range: [-1, 1]
    - 0: No relationship
    - 1: Perfect positive relationship
    - -1: Perfect negative relationship
  - A correlation matrix is always "symmetric"
  - The diagonal values will always be equal to 1
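The properties listed above (values in [-1, 1], symmetry, ones on the diagonal) can be checked numerically. This is a small sketch on randomly generated data; `toy` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three unrelated random series
rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

corr = toy.corr()  # Pearson by default
print(corr)

# The matrix equals its transpose, i.e. it is symmetric
is_symmetric = np.allclose(corr.values, corr.values.T)

# Every series is perfectly correlated with itself
has_unit_diagonal = np.allclose(np.diag(corr.values), 1.0)

# All coefficients lie in [-1, 1] (small tolerance for floating-point error)
in_range = bool((corr.abs().values <= 1.0 + 1e-12).all())

print(is_symmetric, has_unit_diagonal, in_range)
```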
Compute correlations between time series
The correlation coefficient can be used to determine how multiple variables (or a group of time series) are associated with one another. The result is a correlation matrix that describes the correlation between time series. Note that the diagonal values in a correlation matrix will always be 1, since a time series will always be perfectly correlated with itself.
Correlation coefficients can be computed with the pearson, kendall and spearman methods. A full discussion of these different methods is outside the scope of this course, but the pearson method should be used when relationships between your variables are thought to be linear, while the kendall and spearman methods should be used when relationships between your variables are thought to be non-linear.
# Print the correlation between the beef and pork columns
print(meat[['beef', 'pork']].corr(method='spearman'))
# Print the correlation matrix between the pork, veal and turkey columns
print(meat[['pork', 'veal', 'turkey']].corr(method='pearson'))
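To illustrate why the choice of method matters, the sketch below builds a made-up pair of variables (`x` and `y` are hypothetical) in which `y` grows exponentially with `x`. The relationship is perfectly monotonic but far from linear, so the rank-based Spearman coefficient reaches 1 while Pearson's coefficient stays noticeably lower.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y increases with x, but exponentially rather than linearly
x = np.linspace(0, 10, 50)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"pearson:  {pearson:.3f}")
print(f"spearman: {spearman:.3f}")
```

Because Spearman works on ranks, any strictly increasing transformation of `y` would leave its value unchanged, whereas Pearson's value depends on how close the points are to a straight line.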
Visualize correlation matrices
The correlation matrix generated in the previous exercise can be plotted using a heatmap. To do so, you can leverage the heatmap() function from the seaborn library which contains several arguments to tailor the look of your heatmap.
df_corr = df.corr()
sns.heatmap(df_corr)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
You can use the plt.xticks() and plt.yticks() functions to rotate the axis labels so they don't overlap.
To learn about the arguments to the heatmap() function, refer to the seaborn documentation.
corr_meat = meat.corr(method='spearman')
# Customize the heatmap of the corr_meat correlation matrix
sns.heatmap(corr_meat,
            annot=True,
            linewidths=0.4,
            annot_kws={'size': 10});
plt.xticks(rotation=90);
plt.yticks(rotation=0);
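By default the color scale of a heatmap adapts to the values it is given, which can make weak correlations look stronger than they are. One option, sketched here on made-up data (`toy` is hypothetical), is to pin the scale to the full [-1, 1] range with the `vmin` and `vmax` arguments and use a diverging colormap so that positive and negative correlations get distinct hues.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data: "b" follows "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a,
                    "b": a + 0.5 * rng.normal(size=200),
                    "c": rng.normal(size=200)})

corr = toy.corr(method="spearman")

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr,
            annot=True,
            fmt=".2f",        # two decimals in the cell annotations
            vmin=-1, vmax=1,  # fix the color scale to the full correlation range
            cmap="RdBu_r",    # diverging colormap, roughly neutral near zero
            linewidths=0.4,
            ax=ax)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
```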
Clustered heatmaps
Heatmaps are extremely useful to visualize a correlation matrix, but clustermaps can reveal more. A clustermap helps uncover structure in a correlation matrix by producing a hierarchically-clustered heatmap in which similar rows and columns are placed next to each other:
df_corr = df.corr()
fig = sns.clustermap(df_corr)
plt.setp(fig.ax_heatmap.xaxis.get_majorticklabels(), rotation=90)
plt.setp(fig.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
To prevent overlapping of axis labels, you can reference the heatmap Axes (ax_heatmap) from the object returned by clustermap(), named fig here, and specify the rotation. You can learn about the arguments to the clustermap() function in the seaborn documentation.
corr_meat = meat.corr(method='pearson')
# Plot a clustermap of the corr_meat correlation matrix
fig = sns.clustermap(corr_meat,
                     row_cluster=True,
                     col_cluster=True,
                     figsize=(10, 10));
plt.setp(fig.ax_heatmap.xaxis.get_majorticklabels(), rotation=90);
plt.setp(fig.ax_heatmap.yaxis.get_majorticklabels(), rotation=0);
plt.savefig('../images/clustermap.png')
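The reordering that a clustermap applies can also be read back programmatically. The sketch below uses made-up data with two groups of related series (`toy` and its column names are hypothetical) and reads the row order from `dendrogram_row.reordered_ind` on the object returned by `clustermap()`. It assumes SciPy is installed, since seaborn relies on it for the hierarchical clustering.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: a1/a2 share one signal, b1/b2 share another
rng = np.random.default_rng(1)
signal_a = rng.normal(size=200)
signal_b = rng.normal(size=200)
toy = pd.DataFrame({
    "a1": signal_a + 0.1 * rng.normal(size=200),
    "b1": signal_b + 0.1 * rng.normal(size=200),
    "a2": signal_a + 0.1 * rng.normal(size=200),
    "b2": signal_b + 0.1 * rng.normal(size=200),
})

corr = toy.corr(method="pearson")
grid = sns.clustermap(corr, figsize=(5, 5))

# Positions of the original rows in the order the clustermap displays them
order = grid.dendrogram_row.reordered_ind
reordered = corr.index[order].tolist()
print(reordered)
```

In the printed order, series driven by the same underlying signal should end up next to each other, which is the structure the clustered heatmap makes visible.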