Case Study in time series analysis
This chapter gives you a chance to practice all the concepts covered in the course. You will visualize the unemployment rate in the US from 2000 to 2010. This is a summary of the lecture "Visualizing Time-Series data in Python" from DataCamp.
- Apply your knowledge to a new dataset
- Beyond summary statistics
- Decompose time series data
- Compute correlations between time series
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')
Explore the Jobs dataset
In this exercise, you will explore the new jobs DataFrame, which contains the unemployment rate of different industries in the USA during the years 2000-2010. As you will see, the dataset contains time series for 16 industries across 122 timepoints (one per month). The typical workflow of a data science project involves data cleaning and exploration, so we will begin by reading in the data and checking for missing values.
jobs = pd.read_csv('./dataset/ch5_employment.csv')
# Print first five lines of your DataFrame
print(jobs.head(5))
# Check the type of each column in your DataFrame
print(jobs.dtypes)
# Convert datestamp column to a datetime object
jobs['datestamp'] = pd.to_datetime(jobs['datestamp'])
# Set the datestamp column as the index of your DataFrame
jobs = jobs.set_index('datestamp')
# Check the number of missing values in each column
print(jobs.isnull().sum())
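The same cleaning steps can be tried without the course CSV. The sketch below uses a small made-up frame (the column names "Industry A" and "Industry B" and all values are hypothetical, not taken from the jobs dataset) to show what the datetime conversion and the missing-value count return.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course CSV: two made-up industries, six months
raw = pd.DataFrame({
    "datestamp": ["2000-01-01", "2000-02-01", "2000-03-01",
                  "2000-04-01", "2000-05-01", "2000-06-01"],
    "Industry A": [4.1, 4.3, np.nan, 4.0, 3.9, 4.2],
    "Industry B": [6.0, 6.2, 6.1, 5.9, 6.3, 6.4],
})

# Parse the date strings and use them as the index
raw["datestamp"] = pd.to_datetime(raw["datestamp"])
demo = raw.set_index("datestamp")

# Count missing values per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```

A non-zero count for a column is a signal to decide how to treat the gaps (for example dropping or filling them) before plotting or decomposing that series.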
Describe time series data with boxplots
You should always explore the distribution of the variables, and because you are working with time series, you will explore their properties using boxplots and numerical summaries. As a reminder, you can plot data in a DataFrame as boxplots with the command:
df.boxplot(fontsize=6, vert=False)
Notice the introduction of the new parameter vert, which specifies whether to plot the boxplots horizontally or vertically.
jobs.boxplot(fontsize=8, vert=False);
# Generate numerical summaries
print(jobs.describe())
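The quartiles reported by .describe() are the same quantities a boxplot draws. As a rough sketch on made-up values (not the jobs data), and assuming matplotlib's default whisker rule of 1.5 times the interquartile range, the box edges and the points that would be drawn as outliers can be recovered like this:

```python
import pandas as pd

# Made-up unemployment rates for a single hypothetical industry
values = pd.Series([3.0, 4.0, 4.5, 5.0, 5.5, 6.0, 12.0], name="rate")

summary = values.describe()

# The box spans the 25th to 75th percentile
iqr = summary["75%"] - summary["25%"]

# Assumed whisker rule: points beyond 1.5 * IQR from the box count as outliers
lower = summary["25%"] - 1.5 * iqr
upper = summary["75%"] + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(summary)
print(outliers)
```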
Plot all the time series in your dataset
The jobs DataFrame contains 16 time series representing the unemployment rate of various industries between 2000 and 2010. This may seem like a large number of time series to visualize at the same time, but Chapter 4 introduced you to faceted plots. In this exercise, you will explore some of the time series in the jobs DataFrame and look to extract meaningful information from these plots.
jobs_subset = jobs[['Finance', 'Information', 'Manufacturing', 'Construction']]
# Print the first 5 rows of jobs_subset
print(jobs_subset.head(5))
# Create a faceted graph with 2 rows and 2 columns
ax = jobs_subset.plot(subplots=True,
                      layout=(2, 2),
                      sharex=False,
                      sharey=False,
                      linewidth=0.7,
                      fontsize=8,
                      legend=False);
Annotate significant events in time series data
When plotting the Finance, Information, Manufacturing and Construction time series of the jobs DataFrame, you observed a distinct increase in unemployment rates during 2001 and 2008. In general, time series plots become more informative when you add annotations that emphasize specific observations or events. This lets you quickly highlight parts of the graph for viewers and can help infer what may have caused a specific event.
Recall that you have already set the datestamp column as the index of the jobs DataFrame, so you are prepared to directly annotate your plots with vertical or horizontal lines.
ax = jobs.plot(colormap='Spectral', fontsize=6, linewidth=0.8);
# Set labels and legend
ax.set_xlabel('Date', fontsize=10);
ax.set_ylabel('Unemployment Rate', fontsize=10);
ax.set_title('Unemployment rate of U.S. workers by industry', fontsize=10);
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
# Annotate your plots with vertical lines
ax.axvline('2001-07-01', color='blue', linestyle='--', linewidth=0.8);
ax.axvline('2008-09-01', color='blue', linestyle='--', linewidth=0.8);
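The annotation calls can also be tried in isolation. The following is a minimal sketch on a made-up monthly series (the values and the choice to mark the series mean are assumptions for illustration), using plain matplotlib with one vertical line at a date and one horizontal line at a level.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up monthly series standing in for one industry
idx = pd.date_range("2000-01-01", periods=36, freq="MS")
rate = pd.Series(5.0 + 0.1 * np.arange(36), index=idx)

fig, ax = plt.subplots()
ax.plot(rate.index, rate.to_numpy(), linewidth=0.8)

# Vertical line marking a date of interest
event_line = ax.axvline(pd.Timestamp("2001-07-01"),
                        color="blue", linestyle="--", linewidth=0.8)

# Horizontal line marking the mean level of the series
mean_line = ax.axhline(rate.mean(), color="red", linestyle=":", linewidth=0.8)
```

Passing an explicit pd.Timestamp rather than a date string is a defensive choice here, since the sketch plots with matplotlib directly instead of through the pandas plotting wrapper.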
Plot monthly and yearly trends
As we saw in Chapter 2, when the index of a DataFrame is of the datetime type, it is possible to directly extract the day, month or year of each date in the index. As a reminder, you can extract the year of each date in the index using the .index.year attribute. You can then use the .groupby() and .mean() methods to compute the mean annual value of each time series in your DataFrame:
index_year = df.index.year
df_by_year = df.groupby(index_year).mean()
You will now apply what you have learned to display the aggregate mean values of each time series in the jobs DataFrame.
index_month = jobs.index.month
# Compute the mean unemployment rate for each month
jobs_by_month = jobs.groupby(index_month).mean()
# Plot the mean unemployment rate for each month
ax = jobs_by_month.plot(fontsize=8, linewidth=1);
# Set axis labels and legend
ax.set_xlabel('Month', fontsize=10);
ax.set_ylabel('Mean unemployment rate', fontsize=10);
ax.legend(bbox_to_anchor=(0.8, 0.6), fontsize=10);
index_year = jobs.index.year
# Compute the mean unemployment rate for each year
jobs_by_year = jobs.groupby(index_year).mean()
# Plot the mean unemployment rate for each year
ax = jobs_by_year.plot(fontsize=8, linewidth=1);
# Set axis labels and legend
ax.set_xlabel('Year', fontsize=10);
ax.set_ylabel('Mean unemployment rate', fontsize=10);
ax.legend(bbox_to_anchor=(0.1, 0.5), fontsize=10);
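To see exactly what the two groupings compute, here is a small sketch on a made-up two-year series whose values are chosen so the averages are easy to check by hand (the column name "rate" and the values are hypothetical). Grouping by year averages the twelve months within each year, while grouping by month averages the same calendar month across years.

```python
import numpy as np
import pandas as pd

# Two years of made-up monthly data: the value is the month number,
# plus 10 for every observation in the second year
idx = pd.date_range("2000-01-01", periods=24, freq="MS")
values = idx.month.to_numpy() + 10.0 * (idx.year.to_numpy() - 2000)
toy = pd.DataFrame({"rate": values}, index=idx)

# One row per calendar year: mean of the 12 months in that year
by_year = toy.groupby(toy.index.year).mean()

# One row per calendar month: mean of that month across both years
by_month = toy.groupby(toy.index.month).mean()

print(by_year)
print(by_month)
```

The monthly grouping therefore hides the year-to-year level change and exposes the within-year pattern, while the yearly grouping does the opposite.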
Apply time series decomposition to your dataset
You will now perform time series decomposition on multiple time series. You can achieve this by using a Python dictionary to store the results of each time series decomposition.
In this exercise, you will initialize an empty dictionary with a set of curly braces, {}, use a for loop to iterate through the columns of the DataFrame and apply time series decomposition to each time series. After each time series decomposition, you place the results in the dictionary by using the command my_dict[key] = value, where my_dict is your dictionary, key is the name of the column/time series, and value is the decomposition object of that time series.
import statsmodels.api as sm
# Initialize dictionary
jobs_decomp = {}
# Get the names of each time series in the DataFrame
jobs_names = jobs.columns
# Run time series decomposition on each time series of the DataFrame
for ts in jobs_names:
    ts_decomposition = sm.tsa.seasonal_decompose(jobs[ts])
    jobs_decomp[ts] = ts_decomposition
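To make concrete what each dictionary entry holds, the sketch below hand-rolls a classical additive decomposition in plain pandas and applies it column by column with the same dictionary pattern. This is a swapped-in illustration, not the statsmodels implementation: the helper additive_decompose is hypothetical, it assumes an even period (12 for monthly data), and the two toy columns are made up.

```python
import numpy as np
import pandas as pd

def additive_decompose(series, period=12):
    """Hypothetical helper: classical additive decomposition, even period only."""
    half = period // 2
    # Trend: centered 2 x period moving average (undefined for the first
    # and last `half` observations)
    trend = series.rolling(period).mean().rolling(2).mean().shift(-half)
    detrended = series - trend
    # Seasonal: average detrended value at each position in the cycle,
    # shifted so the seasonal effects sum to zero over one period
    position = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.loc[position].to_numpy(),
                         index=series.index)
    # Residual: whatever trend and seasonal do not explain
    resid = series - trend - seasonal
    return {"trend": trend, "seasonal": seasonal, "resid": resid}

# Made-up data: a linear trend plus a repeating 12-month pattern that sums to zero
pattern = np.array([0.4, 0.2, 0.0, -0.3, -0.5, -0.2,
                    0.1, 0.3, 0.2, 0.0, -0.1, -0.1])
idx = pd.date_range("2000-01-01", periods=48, freq="MS")
t = np.arange(48)
toy = pd.DataFrame({
    "Industry A": 5.0 + 0.05 * t + np.tile(pattern, 4),
    "Industry B": 8.0 - 0.02 * t + 2.0 * np.tile(pattern, 4),
}, index=idx)

# Same dictionary pattern as in the exercise: one decomposition per column
toy_decomp = {}
for name in toy.columns:
    toy_decomp[name] = additive_decompose(toy[name])

print(toy_decomp["Industry A"]["seasonal"].head(12))
```

Because the toy series are built as trend plus a fixed pattern with no noise, the recovered seasonal component matches the pattern and the residuals are numerically zero wherever the trend is defined.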
Visualize the seasonality of multiple time series
You will now extract the seasonal component of each decomposition stored in jobs_decomp to visualize the seasonality in these time series. Note that before plotting, you will have to convert the dictionary of seasonal components into a DataFrame using the pd.DataFrame.from_dict() function.
jobs_seasonal = {}
for ts in jobs_names:
    jobs_seasonal[ts] = jobs_decomp[ts].seasonal
# Create a DataFrame from the jobs_seasonal dictionary
seasonality_df = pd.DataFrame.from_dict(jobs_seasonal)
# Remove the label for the index
seasonality_df.index.name = None
# Create a faceted plot of the seasonality_df DataFrame
seasonality_df.plot(subplots=True,
                    layout=(4, 4),
                    sharey=False,
                    fontsize=2,
                    linewidth=0.3,
                    legend=False);
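The dictionary-to-DataFrame step is easy to check on a tiny example. In this sketch (keys and values are made up), each dictionary key becomes a column name and the Series are aligned on their shared date index.

```python
import pandas as pd

idx = pd.date_range("2000-01-01", periods=4, freq="MS")

# Hypothetical seasonal components keyed by series name
parts = {
    "Industry A": pd.Series([0.1, -0.1, 0.2, -0.2], index=idx),
    "Industry B": pd.Series([0.3, 0.0, -0.3, 0.0], index=idx),
}

# Keys become columns; rows follow the shared DatetimeIndex
wide = pd.DataFrame.from_dict(parts)
print(wide)
```

With the default orientation, this produces the same result as passing the dictionary straight to pd.DataFrame().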
Correlations between multiple time series
In the previous exercise, you extracted the seasonal component of each time series in the jobs DataFrame and stored those results in a new DataFrame called seasonality_df. In the context of jobs data, it can be interesting to compare seasonality behavior, as this may help uncover which job industries are the most similar or the most different.
This can be achieved by using the seasonality_df DataFrame and computing the correlation between each time series in the dataset. In this exercise, you will leverage what you have learned in Chapter 4 to compute and create a clustermap visualization of the correlations between time series in the seasonality_df DataFrame.
seasonality_corr = seasonality_df.corr(method='spearman')
# Customize the clustermap of the seasonality_corr correlation matrix
fig = sns.clustermap(seasonality_corr,
                     annot=True,
                     annot_kws={"size": 4},
                     linewidths=.4,
                     figsize=(15, 10));
plt.setp(fig.ax_heatmap.yaxis.get_majorticklabels(), rotation=0);
plt.setp(fig.ax_heatmap.xaxis.get_majorticklabels(), rotation=90);
plt.savefig('../images/jobs_clustermap.png')
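The correlation above uses method='spearman', which compares the rank order of values, not the values themselves. As a small sketch on made-up columns, the Spearman matrix equals the Pearson correlation computed on the ranks, so any monotone relationship scores a perfect 1 or -1 even when it is not linear.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["b"] = toy["a"] ** 3               # increasing in a, but not linear
toy["c"] = -toy["a"]                   # strictly decreasing in a
toy["d"] = [2.0, 1.0, 4.0, 3.0, 5.0]   # mostly, but not perfectly, increasing

# Spearman correlation as used for the clustermap
spearman = toy.corr(method="spearman")

# Equivalent construction: rank each column, then take Pearson correlation
pearson_of_ranks = toy.rank().corr(method="pearson")

print(spearman)
```

For seasonal components this means two industries are scored as similar when their highs and lows fall in the same months, regardless of how large the swings are.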