Summary Statistics and Diagnostics
In this chapter, you will gain a deeper understanding of your time series data by computing summary statistics and plotting aggregated views of it. This is a summary of the lecture "Visualizing Time-Series data in Python", via DataCamp.
- Clean your time series data
- Plot aggregates of your data
- Summarize the values in your time series data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')
Find missing values
In data science, it is common to encounter datasets with missing values. This is especially true of time series data, where a value goes missing whenever a measurement fails to be recorded at a specific timestamp. To count the number of missing values in each column of a DataFrame called df that contains time series data, you can use the command:
missing_values = df.isnull().sum()
In this exercise, you will learn how to find whether your data contains any missing values.
co2_levels = pd.read_csv('./dataset/ch2_co2_levels.csv')
print(co2_levels.head(7))
# Set datestamp column as index
co2_levels = co2_levels.set_index('datestamp')
# Print out the number of missing values
print(co2_levels.isnull().sum())
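Beyond counting the gaps, it often helps to see where they fall in time. The following is a minimal sketch on a small made-up weekly series (the `demo` DataFrame and its `value` column are hypothetical and stand in for the CO2 data):

```python
import numpy as np
import pandas as pd

# Hypothetical weekly series with two gaps (not the CO2 dataset)
idx = pd.date_range('2020-01-05', periods=6, freq='W')
demo = pd.DataFrame({'value': [1.0, np.nan, 3.0, 4.0, np.nan, 6.0]}, index=idx)

# Number of missing values per column
missing_counts = demo.isnull().sum()
print(missing_counts)

# Timestamps at which the 'value' column is missing
missing_dates = demo.index[demo['value'].isnull()]
print(missing_dates)
```

Knowing the timestamps of the gaps makes it easier to judge whether they are isolated or clustered before choosing how to fill them.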
Handle missing values
In order to replace missing values in your time series data, you can use the command:
df = df.fillna(method="ffill")
where the method argument specifies how the gaps are filled. For example, specifying bfill (i.e. backfilling) replaces each missing value with the next valid observation, while ffill (i.e. forward-filling) replaces it with the last valid observation.
co2_levels = co2_levels.fillna(method='bfill')
# Print out the number of missing values
print(co2_levels.isnull().sum())
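To make the difference between the two strategies concrete, here is a small sketch on a made-up daily Series (the values are hypothetical). It uses the `.ffill()` and `.bfill()` methods, which do the same job as `fillna(method=...)`; recent pandas releases deprecate the `method` argument of `fillna` in favor of these methods.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps at the start, middle, and end
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan],
              index=pd.date_range('2020-01-01', periods=6, freq='D'))

forward = s.ffill()   # propagate the last valid observation forward
backward = s.bfill()  # pull the next valid observation backward

# Show the three versions side by side
print(pd.DataFrame({'original': s, 'ffill': forward, 'bfill': backward}))
```

Note that forward-filling cannot fill a gap at the very start of the series and backfilling cannot fill one at the very end, because there is no valid observation to copy from in that direction.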
Display rolling averages
It is also possible to visualize rolling averages of the values in your time series. This is equivalent to "smoothing" your data, and can be particularly useful when your time series contains a lot of noise or outliers. For a given DataFrame df, you can obtain the rolling average of the time series by using the command:
df_mean = df.rolling(window=12).mean()
The window parameter counts observations, so it should be set according to the granularity of your time series. For example, if your time series contains daily data and you are looking for rolling values over a whole year, you should set window=365. In addition, it is easy to get rolling values for other metrics, such as the standard deviation (.std()) or variance (.var()).
# Compute the 52-week rolling mean of the co2_levels DataFrame
ma = co2_levels.rolling(window=52).mean()
# Compute the 52-week rolling standard deviation of the co2_levels DataFrame
mstd = co2_levels.rolling(window=52).std()
# Add the upper bound column to the ma DataFrame
ma['upper'] = ma['co2'] + (2 * mstd['co2'])
# Add the lower bound column to the ma DataFrame
ma['lower'] = ma['co2'] - (2 * mstd['co2'])
# Plot the content of the ma DataFrame
ax = ma.plot(linewidth=0.8, fontsize=6);
# Specify labels, legend, and show the plot
ax.set_xlabel('Date', fontsize=10);
ax.set_ylabel('CO2 levels in Maui Hawaii', fontsize=10);
ax.set_title('Rolling mean and standard deviation bands of CO2 levels\nin Maui Hawaii from 1958 to 2001',
             fontsize=10);
plt.savefig('../images/rolling_minmax.png')
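The same mean plus or minus two standard deviations construction can be checked on a small synthetic series. This is only a sketch: the `noisy` DataFrame, its `value` column, and the 7-observation window are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical noisy daily series with an upward trend
rng = np.random.default_rng(0)
idx = pd.date_range('2020-01-01', periods=30, freq='D')
noisy = pd.DataFrame({'value': np.arange(30) + rng.normal(0, 2, 30)}, index=idx)

window = 7  # number of observations per window (one week of daily data)
ma = noisy['value'].rolling(window=window).mean()
mstd = noisy['value'].rolling(window=window).std()

# Rolling mean with bands two rolling standard deviations wide
bands = pd.DataFrame({'mean': ma,
                      'upper': ma + 2 * mstd,
                      'lower': ma - 2 * mstd})

# The first window - 1 rows are NaN because the window is not yet full
print(bands.head(8))
```

The leading NaN rows explain why a rolling-average line starts later than the raw series when both are plotted together.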
Display aggregated values
You may sometimes be required to display your data in a more aggregated form. For example, the co2_levels data contains weekly data, but you may need to display its values aggregated by month of the year. In datasets such as the co2_levels DataFrame, where the index is a datetime type, you can extract the year of each date in the index:
# Extract the year of each date in the index of the df DataFrame
index_year = df.index.year
To extract the month or day of the dates in the index of the df DataFrame, you would use df.index.month and df.index.day, respectively. You can then pass the extracted year of each index entry to the groupby function to compute the mean CO2 levels by year:
df_by_year = df.groupby(index_year).mean()
co2_levels.index = pd.to_datetime(co2_levels.index)
index_month = co2_levels.index.month
# Compute the mean CO2 levels for each month of the year
mean_co2_levels_by_month = co2_levels.groupby(index_month).mean()
# Plot the mean CO2 levels for each month of the year
mean_co2_levels_by_month.plot(fontsize=6);
# Specify the fontsize on the legend
plt.legend(fontsize=10);
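Grouping by a component of the index can be verified on a toy series where the answer is known in advance. In this sketch the `demo` DataFrame is hypothetical: each daily value is simply its own month number, so the mean for month m must be m.

```python
import numpy as np
import pandas as pd

# Hypothetical two-year daily series; each value equals its month number
idx = pd.date_range('2019-01-01', '2020-12-31', freq='D')
demo = pd.DataFrame({'value': idx.month.to_numpy(dtype=float)}, index=idx)

# One row per calendar year
by_year = demo.groupby(demo.index.year).mean()

# One row per month of the year, pooled across both years
by_month = demo.groupby(demo.index.month).mean()

print(by_year)
print(by_month)
```

Grouping by `index.month` pools every January together regardless of year, which is what exposes a seasonal pattern; grouping by `index.year` instead exposes the long-term trend.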
Compute numerical summaries
You have learnt how to display and annotate time series data in multiple ways, but it is also informative to collect summary statistics of your data. These statistics let you share and discuss properties of your data that support the plots you generate. In pandas, it is possible to quickly obtain summaries of the columns in your DataFrame by using the command:
print(df.describe())
This will print statistics including the mean, the standard deviation, the minima and maxima and the number of observations for all numeric columns in your pandas DataFrame.
print(co2_levels.describe())
# Print out the minimum of the co2 column in the co2_levels DataFrame
print(co2_levels['co2'].min())
# Print out the maximum of the co2 column in the co2_levels DataFrame
print(co2_levels['co2'].max())
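The output of `.describe()` is itself a DataFrame, so individual statistics can be looked up by row label. A minimal sketch, using a hypothetical five-value column:

```python
import pandas as pd

# Hypothetical column of five values
demo = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Rows: count, mean, std, min, 25%, 50%, 75%, max
summary = demo.describe()
print(summary)

# Pick out single statistics by their row label
print(summary.loc['mean', 'value'])
print(summary.loc['50%', 'value'])
```

This makes it straightforward to reuse a statistic, for example to draw the median as a reference line on a plot.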
Boxplots and Histograms
Boxplots represent a graphical rendition of the minimum, median, quartiles, and maximum of your data. You can generate a boxplot by calling the .boxplot() method on a DataFrame.
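The numbers behind a boxplot can be computed directly with `.quantile()`. The sketch below uses a hypothetical Series of nine values. One caveat: matplotlib, which pandas uses to draw boxplots, by default extends the whiskers only to the most extreme points within 1.5 times the interquartile range of the box and draws anything beyond as individual points, so the whisker ends coincide with the minimum and maximum only when there are no such outliers.

```python
import pandas as pd

# Hypothetical sample of nine values
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Minimum, lower quartile, median, upper quartile, maximum
five_number = values.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_number)

# Interquartile range: the height of the box
iqr = five_number.loc[0.75] - five_number.loc[0.25]
print(iqr)
```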
Another method to produce visual summaries is by leveraging histograms, which allow you to inspect the data and uncover its underlying distribution, as well as the presence of outliers and overall spread. An example of how to generate a histogram is shown below:
ax = co2_levels.plot(kind='hist', bins=100)
Here, we used the standard .plot() method but specified the kind argument to be 'hist'. In addition, we also added the bins=100 parameter, which specifies how many intervals (i.e. bins) we should cut our data into.
ax = co2_levels.boxplot();
# Set the labels and display the plot
ax.set_xlabel('CO2', fontsize=10);
ax.set_ylabel('Boxplot CO2 levels in Maui Hawaii', fontsize=10);
plt.legend(fontsize=10);
ax = co2_levels.plot(kind='hist', bins=50, fontsize=6);
# Set the labels and display the plot
ax.set_xlabel('CO2', fontsize=10);
ax.set_ylabel('Histogram of CO2 levels in Maui Hawaii', fontsize=10);
plt.legend(fontsize=10);
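To see what the bins parameter does numerically, the sketch below bins the same sample twice with `np.histogram`. The sample is synthetic (normally distributed values chosen only for illustration), not the CO2 data.

```python
import numpy as np

# Hypothetical sample of 1000 values
rng = np.random.default_rng(42)
sample = rng.normal(loc=340, scale=15, size=1000)

# Bin the same data coarsely and finely
counts_coarse, edges_coarse = np.histogram(sample, bins=5)
counts_fine, edges_fine = np.histogram(sample, bins=50)

# Every observation lands in exactly one bin either way
print(counts_coarse.sum(), counts_fine.sum())

# The bin width shrinks as the number of bins grows
print(edges_coarse[1] - edges_coarse[0], edges_fine[1] - edges_fine[0])
```

The data are identical in both cases; only the bin width changes, and with it how smooth or jagged the resulting histogram looks.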
Density plots
In practice, histograms can be a substandard method for assessing the distribution of your data because their shape can be strongly affected by the number of bins that have been specified. Kernel density plots avoid this dependence on bin edges and often give a clearer view of the distribution of your data. An example of how to generate a density plot is shown below:
ax = df.plot(kind='density', linewidth=2)
The standard .plot() method is specified with the kind argument set to 'density'. We also specified an additional parameter linewidth, which controls the width of the line to be plotted.
ax = co2_levels.plot(kind='density', linewidth=4, fontsize=6);
# Annotate x-axis labels
ax.set_xlabel('CO2', fontsize=10);
# Annotate y-axis labels
ax.set_ylabel('Density plot of CO2 levels in Maui Hawaii', fontsize=10);
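For intuition about what a density plot draws, here is a minimal NumPy sketch of a Gaussian kernel density estimate. It is not the pandas implementation (pandas delegates density plots to SciPy); the function name `gaussian_kde_sketch`, the synthetic sample, and the rule-of-thumb bandwidth are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_sketch(sample, grid, bandwidth):
    """Average of Gaussian bumps centred on each observation (illustrative only)."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every observation
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (sample.size * bandwidth)

# Hypothetical sample of 500 values
rng = np.random.default_rng(1)
sample = rng.normal(340, 15, 500)

# Evaluation grid extending well past the data on both sides
grid = np.linspace(sample.min() - 60, sample.max() + 60, 400)

# Rule-of-thumb bandwidth (an assumption; other choices are possible)
bandwidth = 1.06 * sample.std(ddof=1) * sample.size ** (-1 / 5)

density = gaussian_kde_sketch(sample, grid, bandwidth)

# A density should integrate to roughly 1 (simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(area)
```

The bandwidth plays a role similar to the bin count of a histogram: a smaller value produces a bumpier curve and a larger one a smoother curve, but there are no bin edges for the result to depend on.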