import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')

Predicting data over time

  • Correlation and regression
    • Regression is similar to calculating correlation, with some key differences
      • Regression: A process that results in a formal model of the data
      • Correlation: A statistic that describes the data. Less information than regression model
  • Correlation between variables often changes over time
    • Time series often have patterns that change over time
    • Two timeseries that seem correlated at one moment may not remain so over time.
  • Scoring regression models
    • Two most common methods:
      • Correlation ($r$)
      • Coefficient of Determination ($R^2$)
        • The value of $R^2$ is bounded above by 1, but it has no lower bound (it can be arbitrarily negative)
        • Values closer to 1 mean the model does a better job of predicting outputs: $R^2 = 1 - \frac{\text{error}(\text{model})}{\text{variance}(\text{test data})}$ (see the sketch after this list)
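
A minimal sketch of that formula, using made-up arrays (y_true and y_pred are illustrative, not from the dataset below): computing $1 - \text{error}/\text{variance}$ by hand should match scikit-learn's r2_score.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.1])   # hypothetical observed values
y_pred = np.array([2.8, 2.9, 4.2, 4.6])   # hypothetical model predictions

error = np.sum((y_true - y_pred) ** 2)             # squared error of the model
variance = np.sum((y_true - y_true.mean()) ** 2)   # total squared deviation from the mean (proportional to the variance)
print(1 - error / variance)       # manual R^2
print(r2_score(y_true, y_pred))   # same value from scikit-learn
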
prices = pd.read_csv('./dataset/tsa_prices.csv', index_col='date', parse_dates=True)
prices.head()
                 EBAY       YHOO
date
2010-01-04  23.900000  17.100000
2010-01-05  23.650000  17.230000
2010-01-06  23.500000  17.170000
2010-01-07  23.229998  16.700001
2010-01-08  23.509999  16.700001
prices.plot();
prices.plot.scatter('EBAY', 'YHOO');
prices.plot.scatter('EBAY', 'YHOO', c=prices.index, cmap=plt.cm.viridis, colorbar=False);
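
If your version of matplotlib complains about using the raw DatetimeIndex as the color values, a safe fallback (a sketch, same prices DataFrame) is to convert the timestamps to floats first with matplotlib.dates.date2num:

import matplotlib.dates as mdates

# Convert the DatetimeIndex to matplotlib's float date representation
date_nums = mdates.date2num(prices.index)
prices.plot.scatter('EBAY', 'YHOO', c=date_nums, cmap=plt.cm.viridis, colorbar=False);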

Fitting a simple regression model

Now we'll look at a larger number of companies. We have historical price values for many of them, so let's use several companies' prices to predict the value of a target company: you'll predict the Apple stock price from the prices of Nvidia, Ebay, and Yahoo. Each of these is stored as a column in the all_prices DataFrame. Below is a mapping from company name to column name:

ebay: "EBAY"
nvidia: "NVDA"
yahoo: "YHOO"
apple: "AAPL"

We'll use these columns to define the input/output arrays in our model.

all_prices = pd.read_csv('./dataset/all_prices.csv', index_col=0, parse_dates=True)
all_prices.head()
AAPL ABT AIG AMAT ARNC BAC BSX C CHK CMCSA ... QCOM RF SBUX T V VZ WFC XOM XRX YHOO
date
2010-01-04 214.009998 54.459951 29.889999 14.30 16.650013 15.690000 9.01 3.40 28.090001 16.969999 ... 46.939999 5.42 23.049999 28.580000 88.139999 33.279869 27.320000 69.150002 8.63 17.100000
2010-01-05 214.379993 54.019953 29.330000 14.19 16.130013 16.200001 9.04 3.53 28.970002 16.740000 ... 48.070000 5.60 23.590000 28.440001 87.129997 33.339868 28.070000 69.419998 8.64 17.230000
2010-01-06 210.969995 54.319953 29.139999 14.16 16.970013 16.389999 9.16 3.64 28.650002 16.620001 ... 47.599998 5.67 23.420000 27.610001 85.959999 31.919873 28.110001 70.019997 8.56 17.170000
2010-01-07 210.580000 54.769952 28.580000 14.01 16.610014 16.930000 9.09 3.65 28.720002 16.969999 ... 48.980000 6.17 23.360001 27.299999 86.760002 31.729875 29.129999 69.800003 8.60 16.700001
2010-01-08 211.980005 55.049952 29.340000 14.55 17.020014 16.780001 9.00 3.59 28.910002 16.920000 ... 49.470001 6.18 23.280001 27.100000 87.000000 31.749874 28.860001 69.519997 8.57 16.700001

5 rows × 50 columns

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Use stock symbols to extract training data
X = all_prices[['EBAY', 'NVDA', 'YHOO']]
y = all_prices[['AAPL']]

# Fit and score the model with cross-validation
scores = cross_val_score(Ridge(), X, y, cv=3)
print(scores)
[-6.09050633 -0.3179172  -3.72957284]
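
The scores above are negative because cross_val_score uses the estimator's default scorer, which for regressors is $R^2$: a value below zero means the model predicts the held-out data worse than simply predicting its mean. Note also that the default 3-fold split holds out contiguous blocks of rows, which here are contiguous blocks of time - a hard test for a model fit on the remaining dates. A sketch making the scorer explicit (same data, just a named scoring argument):

# Same cross-validation with the R^2 scorer spelled out explicitly
scores = cross_val_score(Ridge(), X, y, cv=3, scoring='r2')
print(scores.mean())
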

Visualizing predicted values

When dealing with time series data, it's useful to visualize model predictions on top of the "actual" values that are used to test the model.

In this exercise, after splitting the data (stored in the variables X and y) into training and test sets, you'll build a model and then visualize the model's predictions on top of the testing data in order to estimate the model's performance.

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Split our data into training and test sets, preserving the time ordering (shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8,
                                                    shuffle=False)

# Fit our model and generate predictions
model = Ridge()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = r2_score(y_test, predictions)
print(score)
-0.33806972562289417
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(range(len(y_test)), y_test, color='k', lw=3);
ax.plot(range(len(predictions)), predictions, color='r', lw=2);
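
Since y_test keeps its DatetimeIndex after the split, a small variation (a sketch, assuming predictions and y_test line up one-to-one) plots against that index so the x-axis shows actual dates:

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(y_test.index, y_test, color='k', lw=3, label='actual');
ax.plot(y_test.index, predictions, color='r', lw=2, label='predicted');
ax.legend();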

Advanced time series prediction

  • Data is messy
    • Real-world data is often messy
    • The two most common problems are missing data and outliers
    • This often happens because of human error, machine malfunction, database failure, etc.
    • Visualizing your raw data makes it easier to spot these problems
  • Interpolation: using time to fill in missing data
    • A common way to deal with missing data is to interpolate missing values
    • With timeseries data, you can use time to assist in interpolation.
    • In this case, interpolation means using the known values on either side of a gap in the data to make assumptions about what's missing
  • Using a rolling window to transform data
    • Another common use of rolling windows is to transform the data
  • Finding outliers in your data
    • Outliers are data points that are statistically very different from the rest of the dataset.
    • They can have negative effects on the predictive power of your model, biasing it away from its "true" value
    • One solution is to remove or replace outliers with a more representative value
      Note: Be very careful about doing this - it is often difficult to determine whether a point is a legitimately extreme value or an aberration.

Visualizing messy data

Let's take a look at a new dataset - this one is a bit less clean than the ones you've seen before.

As always, you'll first start by visualizing the raw data. Take a close look and try to find datapoints that could be problematic for fitting models.

prices = pd.read_csv('./dataset/prices_null.csv', index_col=0, parse_dates=True)
prices.plot(legend=False);
plt.tight_layout();

# Count the missing values of each time series
missing_values = prices.isnull().sum()
print(missing_values)
EBAY    273
NVDA    502
YHOO    232
dtype: int64

Imputing missing values

When you have missing data points, how can you fill them in?

In this exercise, you'll practice using different interpolation methods to fill in some missing values, visualizing the result each time. But first, you will create the function (interpolate_and_plot()) you'll use to interpolate missing data points and plot them.

def interpolate_and_plot(prices, interpolation):
    
    # Create a boolean mask for missing values
    missing_values = prices.isna()
    
    # Interpolate the missing values
    prices_interp = prices.interpolate(interpolation)
    
    # Plot the full interpolated series in black
    fig, ax = plt.subplots(figsize=(10, 5))
    prices_interp.plot(color='k', alpha=0.6, ax=ax, legend=False);
    
    # Now plot the interpolated values on top in red
    prices_interp[missing_values].plot(ax=ax, color='r', lw=3, legend=False);
interpolation_type = 'zero'
interpolate_and_plot(prices, interpolation_type)
interpolation_type = 'linear'
interpolate_and_plot(prices, interpolation_type)
interpolation_type = 'quadratic'
interpolate_and_plot(prices, interpolation_type)
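
The notes above also mention using time itself to guide interpolation. Because this DataFrame has a DatetimeIndex, pandas can weight interpolated values by the actual spacing between timestamps (a sketch using the same helper):

# 'time' interpolation accounts for uneven gaps between observations
# (e.g. weekends and holidays in daily price data)
interpolation_type = 'time'
interpolate_and_plot(prices, interpolation_type)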

Transforming raw data

In the last chapter, you calculated the rolling mean. In this exercise, you will define a function that calculates the percent change of the latest data point from the mean of a window of previous data points. This function will help you calculate the percent change over a rolling window.

This is a more stable kind of time series that is often useful in machine learning.

def percent_change(series):
    # Collect all *but* the last value of this window, then the final value
    previous_values = series.iloc[:-1]
    last_value = series.iloc[-1]
    
    # Calculate the % difference between the last value and the mean of earlier values
    percent_change = (last_value - np.mean(previous_values)) / np.mean(previous_values)
    return percent_change

# Apply your custom function and plot
prices_perc = prices.rolling(20).apply(percent_change)
prices_perc.loc["2014":"2015"].plot();

Handling outliers

In this exercise, you'll handle outliers - data points that are so different from the rest of your data that you treat them differently from other "normal-looking" data points. You'll use the output from the previous exercise (percent change over time) to detect the outliers. First, you will write a function that replaces outlier data points with the median value from the entire time series.

def replace_outliers(series):
    # Calculate the absolute difference of each timepoint from the series mean
    absolute_differences_from_mean = np.abs(series - np.mean(series))
    
    # Calculate a mask for differences that are > 3 standard deviations from zero
    this_mask = absolute_differences_from_mean > (np.std(series) * 3)
    
    # Replace these values with the median across the data
    series[this_mask] = np.nanmedian(series)
    return series

# Apply your preprocessing function to the timeseries and plot the results
prices_perc = prices_perc.apply(replace_outliers)
prices_perc.loc["2014":"2015"].plot();

Creating features over time

  • Calculating "date-based" features
    • Thus far we've focused on calculating "statistical" features - features that correspond to statistical properties of the data, like the mean, standard deviation, etc.
    • However, don't forget that timeseries data often has more "human" features associated with it, like days of the week, holidays, etc.
    • These features are often useful when dealing with timeseries data that spans multiple years (such as stock value over time)

Engineering multiple rolling features at once

Now that you've practiced some simple feature engineering, let's move on to something more complex. You'll calculate a collection of features for your time series data and visualize what they look like over time. This process resembles how many other time series models operate.

prices_perc_rolling = prices_perc['EBAY'].rolling(20, min_periods=5, closed='right')

# Define the features you'll calculate for each window
features_to_calculate = [np.min, np.max, np.mean, np.std]

# Calculate these features for your rolling window object
features = prices_perc_rolling.aggregate(features_to_calculate)

# Plot the results
ax = features.loc[:"2011-01"].plot();
prices_perc['EBAY'].loc[:"2011-01"].plot(ax=ax, color='k', alpha=0.2, lw=3);
ax.legend(loc=(1.01, 0.6));
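
Depending on your pandas version, passing raw NumPy functions to aggregate may emit deprecation warnings; the built-in string aliases (sketched below on the same rolling window object) compute the same statistics and give tidier column names:

# Equivalent aggregation using pandas' string aliases
features = prices_perc_rolling.aggregate(['min', 'max', 'mean', 'std'])
print(features.columns.tolist())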

Percentiles and partial functions

In this exercise, you'll practice pre-setting a function's arguments so that you can pre-configure how it runs. You'll use this to calculate several percentiles of your data using numpy's percentile() function.

from functools import partial
percentiles = [1, 10, 25, 50, 75, 90, 99]

# Use a list comprehension to create a partial function for each quantile
percentile_functions = [partial(np.percentile, q=percentile) for percentile in percentiles]

# Calculate each of these quantiles on the data using a rolling window
prices_perc_rolling = prices_perc['EBAY'].rolling(20, min_periods=5, closed='right')
features_percentiles = prices_perc_rolling.aggregate(percentile_functions)

# Plot a subset of the result
ax = features_percentiles.loc[:"2011-01"].plot(cmap=plt.cm.viridis);
ax.legend(percentiles, loc=(1.01, 0.5));
plt.savefig('../images/price_percentile.png')
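
To see what partial is doing (a toy check with illustrative numbers, not the price data): each entry of percentile_functions is simply np.percentile with q already filled in, which is why it can be called with a single array argument inside aggregate.

# Each partial object behaves like np.percentile with q fixed in advance
median_fn = partial(np.percentile, q=50)
print(median_fn([1, 2, 3, 4, 5]))          # 3.0
print(np.percentile([1, 2, 3, 4, 5], 50))  # same result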

Using "date" information

It's easy to think of timestamps as pure numbers, but don't forget they generally correspond to things that happen in the real world. That means there's often extra information encoded in the data such as "is it a weekday?" or "is it a holiday?". This information is often useful in predicting timeseries data.

prices_perc['day_of_week'] = prices_perc.index.dayofweek
# `Index.weekofyear` was removed in pandas 2.0; isocalendar().week is the replacement
prices_perc['week_of_year'] = prices_perc.index.isocalendar().week
prices_perc['month_of_year'] = prices_perc.index.month

# Print prices_perc
print(prices_perc)
                EBAY      NVDA      YHOO  day_of_week  week_of_year  \
date                                                                  
2010-01-04       NaN       NaN       NaN            0             1   
2010-01-05       NaN       NaN       NaN            1             1   
2010-01-06       NaN       NaN       NaN            2             1   
2010-01-07       NaN       NaN       NaN            3             1   
2010-01-08       NaN       NaN       NaN            4             1   
...              ...       ...       ...          ...           ...   
2016-12-23  0.024842  0.147720 -0.038006            4            51   
2016-12-27  0.038030  0.014446 -0.027985            1            52   
2016-12-28  0.026925  0.117737 -0.029311            2            52   
2016-12-29  0.021850  0.129669 -0.028632            3            52   
2016-12-30  0.007285  0.068559 -0.026603            4            52   

            month_of_year  
date                       
2010-01-04              1  
2010-01-05              1  
2010-01-06              1  
2010-01-07              1  
2010-01-08              1  
...                   ...  
2016-12-23             12  
2016-12-27             12  
2016-12-28             12  
2016-12-29             12  
2016-12-30             12  

[1762 rows x 6 columns]