Resources & Credits

The dataset we use is from the book An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. You can check the details here.

Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

import itertools
from copy import deepcopy

Defining and Using a Validation Set

Dataset

We will use the Advertising dataset included in the ISL book.

advertising = pd.read_csv('./dataset/Advertising.csv', index_col=0)
advertising.head()
      TV  radio  newspaper  sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9
advertising.shape
(200, 4)
We split the 200 rows into 100 for training, 50 for validation, and 50 for testing.

train = advertising[:100]
val = advertising[100:150]
test = advertising[150:]

Here, we generate the set of predictor subsets for best-subset selection.

predictors = ['TV', 'radio', 'newspaper']
bestsubsets = deepcopy(predictors)  # start from the three single-feature models

# append every combination of two or more features
for i in range(2, len(predictors) + 1):
    bestsubsets.extend([list(x) for x in itertools.combinations(predictors, i)])
bestsubsets
['TV',
 'radio',
 'newspaper',
 ['TV', 'radio'],
 ['TV', 'newspaper'],
 ['radio', 'newspaper'],
 ['TV', 'radio', 'newspaper']]
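
As an aside, the same collection can be built in a single comprehension, with every entry represented as a list rather than mixing plain strings and lists (a minimal sketch):

bestsubsets = [list(c)
               for r in range(1, len(predictors) + 1)
               for c in itertools.combinations(predictors, r)]

Either form works below, because pandas accepts both a single column name and a list of names for indexing.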

Above, we generated all 7 possible predictor subsets (2^3 - 1 non-empty combinations of the three features). To decide which model is best, we fit a model on the training set for each subset and measure its MSE on the validation set. Here we use OLS from statsmodels.

val_mse = []

for subset in bestsubsets:
    ols = sm.OLS(train['sales'], train[subset]).fit()
    ols_pred = np.array(ols.predict(val[subset]))
    mse = ((ols_pred - np.array(val['sales'])) ** 2).mean()
    val_mse.append(mse)
val_mse
[25.055058944864008,
 49.07576335191685,
 76.2856426923714,
 5.199428939050649,
 19.634679234831346,
 41.522704231424875,
 4.927541779411094]

From the result, we find that the last subset (the model that uses all three features) has the lowest validation MSE.

min_mse_subset = bestsubsets[np.argmin(val_mse)]
min_mse_subset
['TV', 'radio', 'newspaper']

After that, we can measure the MSE on the test set. The test set was never used for training or model selection, so the chosen model has no information about it, such as its distribution or trends.

best_subset_ols = sm.OLS(train['sales'], train[min_mse_subset]).fit()
best_subset_ols_pred = np.array(best_subset_ols.predict(test[min_mse_subset]))
test_mse = ((best_subset_ols_pred - np.array(test['sales'])) ** 2).mean()
test_mse
3.6851575325875667
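
One caveat: sm.OLS does not add an intercept automatically, so every model above is fit through the origin. If you want an intercept, add a constant column to the design matrix with sm.add_constant. A minimal sketch (the resulting coefficients and MSE would differ from the values above):

X_train = sm.add_constant(train[min_mse_subset])  # prepend a column of ones
X_test = sm.add_constant(test[min_mse_subset])

ols_const = sm.OLS(train['sales'], X_train).fit()
pred_const = np.array(ols_const.predict(X_test))
test_mse_const = ((pred_const - np.array(test['sales'])) ** 2).mean()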

Cross Validation

This time, we will implement k-fold cross validation.

k = 5
val_size = int(len(advertising) / k)
val_size
40
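
Since 200 is exactly divisible by k, every fold holds 40 rows. If the length were not divisible by k, slicing with a fixed val_size would silently drop the leftover rows; np.array_split distributes the remainder across the folds instead (a minimal sketch):

folds = np.array_split(np.arange(len(advertising)), k)  # k index arrays, lengths as even as possible
[len(fold) for fold in folds]
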
i = 0  # index of the fold to hold out
idx = np.arange(len(advertising))

# the i-th block of rows becomes the validation fold; the rest is used for training
cv_val_idx = np.arange(i * val_size, (i + 1) * val_size)
cv_train_idx = np.array([x for x in idx if x not in cv_val_idx])
cv_val_idx
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39])
cv_train_idx
array([ 40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182,
       183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
       196, 197, 198, 199])

We have separated the dataset into a validation fold and a training set, so we can fit the model on the training rows and evaluate it on the held-out fold.

ols = sm.OLS(advertising['sales'].iloc[cv_train_idx], advertising[predictors].iloc[cv_train_idx]).fit()
cv_pred = np.array(ols.predict(advertising[predictors].iloc[cv_val_idx]))
mse = ((cv_pred - np.array(advertising['sales'].iloc[cv_val_idx])) ** 2).mean()
mse
4.5884410560588424

That was a single fold. In k-fold cross validation, we repeat the procedure k times, holding out a different fold each time.

cv_mse = []
idx = np.arange(len(advertising))  # the row index never changes, so build it once

for i in range(k):
    # hold out the i-th block of rows, train on the rest
    cv_val_idx = np.arange(i * val_size, (i + 1) * val_size)
    cv_train_idx = np.array([x for x in idx if x not in cv_val_idx])
    ols = sm.OLS(advertising['sales'].iloc[cv_train_idx],
                 advertising[predictors].iloc[cv_train_idx]).fit()
    cv_pred = np.array(ols.predict(advertising[predictors].iloc[cv_val_idx]))
    mse = ((cv_pred - np.array(advertising['sales'].iloc[cv_val_idx])) ** 2).mean()
    cv_mse.append(mse)
cv_mse
[4.5884410560588424,
 3.8196496216111955,
 3.5214234286826027,
 4.780892181762023,
 4.135290964081366]

Finally, we estimate the cross-validation MSE by averaging the MSE over all k folds.

np.mean(cv_mse)
4.169139450439206
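
Note that the folds above are contiguous blocks of rows, which is fine when the rows are in no particular order but could bias the estimate otherwise. A common variant shuffles the indices once before splitting. A minimal sketch, with a hypothetical fixed seed for reproducibility:

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
shuffled_idx = rng.permutation(len(advertising))

cv_mse_shuffled = []
for i in range(k):
    cv_val_idx = shuffled_idx[i * val_size:(i + 1) * val_size]
    cv_train_idx = np.setdiff1d(shuffled_idx, cv_val_idx)  # everything not in the fold
    ols = sm.OLS(advertising['sales'].iloc[cv_train_idx],
                 advertising[predictors].iloc[cv_train_idx]).fit()
    cv_pred = np.array(ols.predict(advertising[predictors].iloc[cv_val_idx]))
    cv_mse_shuffled.append(((cv_pred - np.array(advertising['sales'].iloc[cv_val_idx])) ** 2).mean())
np.mean(cv_mse_shuffled)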