Model Selection
In this post, we will walk through the process of model selection using cross-validation.
Resources & Credits
The dataset we use is from the book An Introduction to Statistical Learning
by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. You can find the details here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import itertools
from copy import deepcopy
advertising = pd.read_csv('./dataset/Advertising.csv', index_col=0)
advertising.head()
advertising.shape
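The loaded frame has 200 rows and 4 columns (TV, radio, newspaper, sales), so the slices below give a 100/50/50 train/validation/test split.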
# Simple sequential split: first 100 rows for training,
# next 50 for validation, last 50 for testing
train = advertising[:100]
val = advertising[100:150]
test = advertising[150:]
Here, we generate the set of predictor subsets for best-subset selection.
predictors = ['TV', 'radio', 'newspaper']
# Start with the three single-feature subsets, then add every
# combination of two and three features
bestsubsets = deepcopy(predictors)
for i in range(2, len(predictors) + 1):
    bestsubsets.extend([list(x) for x in itertools.combinations(predictors, i)])
bestsubsets
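Given the Advertising columns, bestsubsets evaluates to these 7 subsets (note that the three singles stay as plain strings, which pandas indexing handles just as well as lists):
['TV', 'radio', 'newspaper',
 ['TV', 'radio'], ['TV', 'newspaper'], ['radio', 'newspaper'],
 ['TV', 'radio', 'newspaper']]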
Above, we generated all 2^3 - 1 = 7 non-empty subsets of the three features. We need to find which of these models is best, so we measure the mean squared error, MSE = (1/n) * sum((y_hat_i - y_i)^2), of each candidate model on the validation set. For the fitting, we use OLS from statsmodels.
val_mse = []
for subset in bestsubsets:
    # Fit on the training set using only this subset of predictors,
    # then score it on the validation set
    ols = sm.OLS(train['sales'], train[subset]).fit()
    ols_pred = np.array(ols.predict(val[subset]))
    mse = ((ols_pred - np.array(val['sales'])) ** 2).mean()
    val_mse.append(mse)
val_mse
From the result, we find that the last subset (the model using all three features) has the lowest validation MSE.
min_mse_subset = bestsubsets[np.argmin(val_mse)]
min_mse_subset
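One caveat: every model above is fit without an intercept, since we never add a constant column. If you want an intercept, a minimal variant using statsmodels' sm.add_constant would look like this (X_train, X_test, and ols_const are illustrative names):
# Prepend a constant column so OLS estimates an intercept as well
X_train = sm.add_constant(train[min_mse_subset])
X_test = sm.add_constant(test[min_mse_subset])
ols_const = sm.OLS(train['sales'], X_train).fit()
ols_const_pred = np.array(ols_const.predict(X_test))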
After that, we can measure the MSE on the test set. The test set is never used for training or model selection, so the chosen model carries no information about it, such as its distribution or trends.
best_subset_ols = sm.OLS(train['sales'], train[min_mse_subset]).fit()
best_subset_ols_pred = np.array(best_subset_ols.predict(test[min_mse_subset]))
test_mse = ((best_subset_ols_pred - np.array(test['sales'])) ** 2).mean()
test_mse
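So far we have relied on a single fixed validation split. Next we turn to k-fold cross-validation: we divide the data into k equal folds, hold one fold out for validation, and train on the remaining k - 1 folds.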
k = 5
val_size = len(advertising) // k
val_size
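With 200 rows and k = 5, each fold holds 40 observations, and since 5 * 40 = 200 the folds tile the dataset exactly.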
# Build the index arrays for the first fold (i = 0): rows in the
# fold are used for validation, everything else for training
i = 0
idx = np.arange(len(advertising))
cv_val_idx = np.arange(i * val_size, (i + 1) * val_size)
cv_train_idx = np.array([x for x in idx if x not in cv_val_idx])
cv_val_idx
cv_train_idx
We have now separated the dataset into a training part and a validation fold, so we can train the model on the training indices and evaluate it on the held-out fold.
ols = sm.OLS(advertising['sales'].iloc[cv_train_idx], advertising[predictors].iloc[cv_train_idx]).fit()
cv_pred = np.array(ols.predict(advertising[predictors].iloc[cv_val_idx]))
mse = ((cv_pred - np.array(advertising['sales'].iloc[cv_val_idx])) ** 2).mean()
mse
That was a single fold. In k-fold cross-validation, we repeat the procedure k times, rotating which fold is held out.
cv_mse = []
idx = np.arange(len(advertising))
for i in range(k):
    # Fold i is held out for validation; the rest is used for training
    cv_val_idx = np.arange(i * val_size, (i + 1) * val_size)
    cv_train_idx = np.array([x for x in idx if x not in cv_val_idx])
    ols = sm.OLS(advertising['sales'].iloc[cv_train_idx],
                 advertising[predictors].iloc[cv_train_idx]).fit()
    cv_pred = np.array(ols.predict(advertising[predictors].iloc[cv_val_idx]))
    mse = ((cv_pred - np.array(advertising['sales'].iloc[cv_val_idx])) ** 2).mean()
    cv_mse.append(mse)
cv_mse
Finally, we estimate the cross-validation MSE by averaging the per-fold results.
np.mean(cv_mse)
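To close the loop, the subset selection itself can be driven by cross-validation instead of a single validation split. Here is a minimal sketch reusing bestsubsets, k, and val_size from above (subset_cv_mse is an illustrative name, and the models are again fit without an intercept):
subset_cv_mse = []
for subset in bestsubsets:
    fold_mse = []
    for i in range(k):
        # Same contiguous folds as before
        cv_val_idx = np.arange(i * val_size, (i + 1) * val_size)
        cv_train_idx = np.setdiff1d(np.arange(len(advertising)), cv_val_idx)
        ols = sm.OLS(advertising['sales'].iloc[cv_train_idx],
                     advertising[subset].iloc[cv_train_idx]).fit()
        pred = np.array(ols.predict(advertising[subset].iloc[cv_val_idx]))
        fold_mse.append(((pred - np.array(advertising['sales'].iloc[cv_val_idx])) ** 2).mean())
    # Average the k fold MSEs for this subset
    subset_cv_mse.append(np.mean(fold_mse))
bestsubsets[np.argmin(subset_cv_mse)]
If you prefer not to build the index arrays by hand, sklearn.model_selection.KFold(n_splits=k, shuffle=False) produces the same contiguous folds.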