import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')

Clean your time series data

Find missing values

Datasets with missing values are common in data science. This is especially true of time series data, where a value goes missing whenever a measurement fails to be recorded at a specific timestamp. To count the missing values in each column of a DataFrame called df that contains time series data, you can use the command:

missing_values = df.isnull().sum()

In this exercise, you will check whether your data contains any missing values.

# Read in the CO2 levels data and display the first seven rows
co2_levels = pd.read_csv('./dataset/ch2_co2_levels.csv')
print(co2_levels.head(7))

# Set datestamp column as index
co2_levels = co2_levels.set_index('datestamp')

# Print out the number of missing values
print(co2_levels.isnull().sum())
    datestamp    co2
0  1958-03-29  316.1
1  1958-04-05  317.3
2  1958-04-12  317.6
3  1958-04-19  317.5
4  1958-04-26  316.4
5  1958-05-03  316.9
6  1958-05-10    NaN
co2    59
dtype: int64
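
Counting the gaps tells you how many values are missing, but not where they are. The following is a minimal sketch on a small hypothetical DataFrame (the dates and values are illustrative and are not taken from the CO2 dataset) that counts the missing values and then lists the timestamps at which they occur:

```python
import numpy as np
import pandas as pd

# Hypothetical weekly series with two gaps (illustrative values only)
dates = pd.date_range("1958-03-29", periods=6, freq="7D")
df = pd.DataFrame(
    {"co2": [316.1, 317.3, np.nan, 317.5, np.nan, 316.9]},
    index=dates,
)

# Count the missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)

# List the timestamps at which the co2 value is missing
missing_dates = df.index[df["co2"].isnull()]
print(missing_dates)
```

Knowing where the gaps fall can help you decide whether a simple fill strategy is reasonable or whether a long run of consecutive missing values needs closer attention.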

Handle missing values

In order to replace missing values in your time series data, you can use the command:

df = df.fillna(method="ffill")

where the method argument specifies the fill strategy. For example, bfill (i.e. backfilling) replaces each missing value with the next valid observation, while ffill (i.e. forward-filling) replaces it with the last valid observation. Recent pandas releases deprecate the method argument of fillna in favor of the dedicated df.ffill() and df.bfill() methods, which behave the same way.

# Backfill missing values; equivalent to fillna(method='bfill'),
# which is deprecated in recent pandas releases
co2_levels = co2_levels.bfill()

# Print out the number of missing values
print(co2_levels.isnull().sum())
co2    0
dtype: int64
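
To make the difference between the two strategies concrete, here is a small sketch on a hypothetical Series (the values are made up for illustration). It uses the ffill() and bfill() methods directly. Note that a gap at the very end of the series cannot be backfilled, just as a gap at the very start cannot be forward-filled:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap in the middle and a gap at the end
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward-fill: carry the last valid observation forward
forward = s.ffill()

# Backfill: pull the next valid observation backward
backward = s.bfill()

print(forward.tolist())
print(backward.tolist())

# The trailing gap has no later observation, so it survives backfilling
print(backward.isnull().sum())
```

If both ends of your series may contain gaps, chaining the two calls (for example s.ffill().bfill()) is one way to cover both cases.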

Plot aggregates of your data

  • Moving averages
    • In the field of time series analysis, a moving average can be used for many different purposes:
      • Smoothing out short-term fluctuations
      • Dampening the effect of outliers
      • Highlighting long-term trends or cycles

Display rolling averages

It is also possible to visualize rolling averages of the values in your time series. This is equivalent to "smoothing" your data, and can be particularly useful when your time series contains a lot of noise or outliers. For a given DataFrame df, you can obtain the rolling average of the time series by using the command:

df_mean = df.rolling(window=12).mean()

The window parameter should be set according to the granularity of your time series. For example, if your time series contains daily data and you want rolling values over a whole year, you should set window=365. It is just as easy to get rolling values for other metrics, such as the standard deviation (.std()) or the variance (.var()).

# Compute the 52-week rolling mean of the co2_levels DataFrame
ma = co2_levels.rolling(window=52).mean()

# Compute the 52-week rolling standard deviation of the co2_levels DataFrame
mstd = co2_levels.rolling(window=52).std()

# Add the upper bound column to the ma DataFrame
ma['upper'] = ma['co2'] + (2 * mstd['co2'])

# Add the lower bound column to the ma DataFrame
ma['lower'] = ma['co2'] - (2 * mstd['co2'])

# Plot the content of the ma DataFrame
ax = ma.plot(linewidth=0.8, fontsize=6);

# Specify labels, legend, and show the plot
ax.set_xlabel('Date', fontsize=10);
ax.set_ylabel('CO2 levels in Maui Hawaii', fontsize=10);
ax.set_title('Rolling mean and 2-standard-deviation bounds of CO2 levels\nin Maui Hawaii from 1958 to 2001',
             fontsize=10);
plt.savefig('../images/rolling_minmax.png')
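
One detail worth knowing about rolling windows: by default, the first window - 1 results are NaN, because a full window is not yet available. The sketch below uses a short hypothetical Series (not the CO2 data) to show this, along with the min_periods parameter of rolling(), which allows partial windows at the start. It also builds the same mean plus or minus two standard deviations bounds as above:

```python
import numpy as np
import pandas as pd

# Hypothetical series: the values 1.0 through 10.0
s = pd.Series(np.arange(1.0, 11.0))

# Default behavior: the first window - 1 entries are NaN
strict = s.rolling(window=3).mean()

# min_periods=1 computes the mean over however many values are available
relaxed = s.rolling(window=3, min_periods=1).mean()

# Bounds at two rolling standard deviations around the rolling mean
rolling_std = s.rolling(window=3).std()
upper = strict + 2 * rolling_std
lower = strict - 2 * rolling_std

print(pd.DataFrame({"strict": strict, "relaxed": relaxed,
                    "upper": upper, "lower": lower}).head())
```

With window=52 on weekly data, this means the first 51 weeks of the rolling mean are empty unless you lower min_periods.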

Display aggregated values

You may sometimes be required to display your data in a more aggregated form. For example, the co2_levels data contains weekly data, but you may need to display its values aggregated by month of the year. In datasets such as the co2_levels DataFrame, where the index is a datetime type, you can extract the year of each date in the index:

# Extract the year of each date in the index of the df DataFrame
index_year = df.index.year

To extract the month or day of the dates in the index of the df DataFrame, you would use df.index.month and df.index.day, respectively. You can then pass the extracted year of each date to the groupby function to compute the mean CO2 levels by year:

df_by_year = df.groupby(index_year).mean()

# Convert the index to datetime and extract the month of each date
co2_levels.index = pd.to_datetime(co2_levels.index)
index_month = co2_levels.index.month

# Compute the mean CO2 levels for each month of the year
mean_co2_levels_by_month = co2_levels.groupby(index_month).mean()

# Plot the mean CO2 levels for each month of the year
mean_co2_levels_by_month.plot(fontsize=6);

# Specify the fontsize on the legend
plt.legend(fontsize=10);
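
The difference between grouping by year and grouping by month of the year is easy to miss, so here is a minimal sketch on hypothetical monthly data spanning two years (the values 0 to 23 are made up). The index starts out as strings, as it would after reading a CSV file, so it has to be converted with pd.to_datetime before .year and .month become available:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 24 monthly values with a string index, as read from a CSV
dates = pd.date_range("2000-01-01", periods=24, freq="MS").strftime("%Y-%m-%d")
df = pd.DataFrame({"value": np.arange(24.0)}, index=dates)

# String indexes have no .year or .month attribute, so convert first
df.index = pd.to_datetime(df.index)

# One row per calendar year (2000 and 2001)
by_year = df.groupby(df.index.year).mean()

# One row per month of the year (1 to 12), pooling both years together
by_month = df.groupby(df.index.month).mean()

print(by_year)
print(by_month.head(3))
```

Grouping by month pools every January together, every February together, and so on, which is what makes it useful for revealing a seasonal pattern.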

Summarize the values in your time series data

Compute numerical summaries

You have learnt how to display and annotate time series data in multiple ways, but it is also informative to collect summary statistics of your data. These statistics let you share and discuss properties of your data that support the plots you generate. In pandas, you can quickly obtain summaries of the columns in your DataFrame by using the command:

print(df.describe())

This will print statistics including the number of observations, the mean, the standard deviation, the minimum, the quartiles, and the maximum for all numeric columns in your pandas DataFrame.

# Print out summary statistics of the co2_levels DataFrame
print(co2_levels.describe())

# Print out the minimum of the co2 column in the co2_levels DataFrame
print(co2_levels.min())

# Print out the maximum of the co2 column in the co2_levels DataFrame
print(co2_levels.max())
               co2
count  2284.000000
mean    339.657750
std      17.100899
min     313.000000
25%     323.975000
50%     337.700000
75%     354.500000
max     373.900000
co2    313.0
dtype: float64
co2    373.9
dtype: float64
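
Because describe() returns a DataFrame rather than just printing text, you can also pull out individual statistics by label. The following sketch assumes a small hypothetical co2 column (illustrative values, not the real dataset):

```python
import pandas as pd

# Hypothetical co2 column with five observations
df = pd.DataFrame({"co2": [313.0, 320.0, 330.0, 340.0, 373.9]})

# describe() returns a DataFrame indexed by statistic name
summary = df.describe()

# Look up individual statistics by row label and column name
co2_count = summary.loc["count", "co2"]
co2_min = summary.loc["min", "co2"]
co2_median = summary.loc["50%", "co2"]
co2_max = summary.loc["max", "co2"]

print(co2_count, co2_min, co2_median, co2_max)
```

This can be handy when you want to reuse a statistic, such as the median, in a plot annotation instead of retyping it by hand.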

Boxplots and Histograms

Boxplots represent a graphical rendition of the minimum, median, quartiles, and maximum of your data. You can generate a boxplot by calling the .boxplot() method on a DataFrame.
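
To see what a boxplot actually draws, it can help to compute the underlying numbers yourself. The sketch below uses a hypothetical Series containing one extreme value and assumes the common convention, which is also matplotlib's default, of placing the whisker limits at 1.5 times the interquartile range beyond the quartiles. Points outside those limits are drawn individually as outliers:

```python
import pandas as pd

# Hypothetical data with one extreme value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# Quartiles and median: the edges and center line of the box
q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR convention
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are plotted as individual outliers
outliers = s[(s < lower_fence) | (s > upper_fence)]

print(q1, median, q3)
print(lower_fence, upper_fence)
print(outliers.tolist())
```

Under this convention, the whiskers end at the most extreme data points inside the fences, so they coincide with the minimum and maximum only when the data contains no outliers.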

Another method to produce visual summaries is by leveraging histograms, which allow you to inspect the data and uncover its underlying distribution, as well as the presence of outliers and overall spread. An example of how to generate a histogram is shown below:

ax = co2_levels.plot(kind='hist', bins=100)

Here, we used the standard .plot() method but set the kind argument to 'hist'. We also passed bins=100, which specifies how many intervals (i.e. bins) the data should be cut into.

# Generate a boxplot of the co2 column
ax = co2_levels.boxplot();

# Set the labels and display the plot
# A boxplot has no labeled artists, so no legend is added
ax.set_xlabel('CO2', fontsize=10);
ax.set_ylabel('Boxplot CO2 levels in Maui Hawaii', fontsize=10);

# Generate a histogram of the co2 column
ax = co2_levels.plot(kind='hist', bins=50, fontsize=6);

# Set the labels and display the plot
ax.set_xlabel('CO2', fontsize=10);
ax.set_ylabel('Histogram of CO2 levels in Maui Hawaii', fontsize=10);
plt.legend(fontsize=10);
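
The choice of bins can change how a histogram looks quite a bit. As a rough illustration, the sketch below bins the same synthetic sample (random values generated for this example, not the CO2 data) into 10 and then 100 intervals using np.histogram, which computes the bin counts without drawing anything:

```python
import numpy as np

# Synthetic sample, generated only for illustration
rng = np.random.default_rng(0)
values = rng.normal(loc=340.0, scale=17.0, size=500)

# Same data, two different bin counts
counts_10, edges_10 = np.histogram(values, bins=10)
counts_100, edges_100 = np.histogram(values, bins=100)

# Every observation lands in exactly one bin in both cases
print(counts_10.sum(), counts_100.sum())

# Wide bins hold many points each; narrow bins hold few, and some may be empty
print("largest bin with 10 bins:", counts_10.max())
print("largest bin with 100 bins:", counts_100.max())
print("empty bins with 100 bins:", int((counts_100 == 0).sum()))
```

With few bins the shape looks smooth but coarse, while with many bins it looks detailed but noisy, even though the underlying data is identical.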

Density plots

In practice, histograms can be a poor way to assess the distribution of your data because their shape depends strongly on the number of bins you specify. Kernel density plots avoid this dependence and are often a more effective way to view the distribution of your data. An example of how to generate a density plot is shown below:

ax = df.plot(kind='density', linewidth=2)

Here, the standard .plot() method is called with the kind argument set to 'density'. The additional linewidth parameter controls the width of the plotted line.

ax = co2_levels.plot(kind='density', linewidth=4, fontsize=6);

# Annotate x-axis labels
ax.set_xlabel('CO2', fontsize=10);

# Annotate y-axis labels
ax.set_ylabel('Density plot of CO2 levels in Maui Hawaii', fontsize=10);
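
For intuition about what a density plot computes, here is a hand-rolled sketch of a Gaussian kernel density estimate written with NumPy only. It is not the routine pandas uses internally (as far as I know, pandas delegates this to SciPy). The synthetic sample and the Scott's rule bandwidth are assumptions made for this example:

```python
import numpy as np

# Synthetic sample, generated only for illustration
rng = np.random.default_rng(0)
values = rng.normal(loc=340.0, scale=17.0, size=300)
n = values.size

# Assumed bandwidth choice: Scott's rule for one-dimensional data
bandwidth = values.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid that extends past the data on both sides
grid = np.linspace(values.min() - 4 * bandwidth,
                   values.max() + 4 * bandwidth, 400)

# Place one Gaussian bump on every observation and average them
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A valid density integrates to roughly 1 (trapezoidal rule)
area = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid)))
print(round(area, 3))
```

The bandwidth plays a role similar to the bin width of a histogram: larger values give a smoother curve, smaller values a bumpier one, but there are no bin edges whose placement can change the picture.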