import pandas as pd
import numpy as np
from pprint import pprint


### Build Grid Search functions

Building algorithms, models and processes 'from scratch' is a useful way to understand what is happening beneath the library calls. Mature packages and libraries exist for this work (and we will get to them very soon!), but writing the logic yourself first makes their behavior easier to reason about.

In this exercise, you will create a function to take in 2 hyperparameters, build models and return results. You will use this function in a future exercise.

from sklearn.model_selection import train_test_split

# credit_card is assumed to be a DataFrame already loaded with the credit card default data
# Replace the categorical variables with dummy variables
credit_card = pd.get_dummies(credit_card, columns=['SEX', 'EDUCATION', 'MARRIAGE'], drop_first=True)

X = credit_card.drop(['ID', 'default payment next month'], axis=1)
y = credit_card['default payment next month']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Create the function
def gbm_grid_search(learn_rate, max_depth):
    # Create the model with the two hyperparameters being tuned
    model = GradientBoostingClassifier(learning_rate=learn_rate, max_depth=max_depth)

    # Use the model to make predictions
    predictions = model.fit(X_train, y_train).predict(X_test)

    # Return the hyperparameters and score
    return [learn_rate, max_depth, accuracy_score(y_test, predictions)]


### Iteratively tune multiple hyperparameters

In this exercise, you will build on the function you previously created, which takes in 2 hyperparameters, builds a model and returns the results. You will first loop it over a small set of values, then extend both the function and the loop with a third hyperparameter.

results_list = []
learn_rate_list = [0.01, 0.1, 0.5]
max_depth_list = [2, 4, 6]

# Create the for loop
for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        results_list.append(gbm_grid_search(learn_rate, max_depth))

# Print the results
pprint(results_list)

[[0.01, 2, 0.8214444444444444],
[0.01, 4, 0.8198888888888889],
[0.01, 6, 0.8172222222222222],
[0.1, 2, 0.8205555555555556],
[0.1, 4, 0.8204444444444444],
[0.1, 6, 0.8204444444444444],
[0.5, 2, 0.8188888888888889],
[0.5, 4, 0.8042222222222222],
[0.5, 6, 0.7894444444444444]]
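
A list of lists is hard to scan for the best combination. One option, sketched below, is to load the results into a pandas DataFrame and sort by accuracy. The column names are labels chosen for this sketch (an assumption, not part of the exercise), and the values are copied from the printed output above.

```python
import pandas as pd

# Results copied from the printed output above: [learn_rate, max_depth, accuracy]
results_list = [
    [0.01, 2, 0.8214444444444444],
    [0.01, 4, 0.8198888888888889],
    [0.01, 6, 0.8172222222222222],
    [0.1, 2, 0.8205555555555556],
    [0.1, 4, 0.8204444444444444],
    [0.1, 6, 0.8204444444444444],
    [0.5, 2, 0.8188888888888889],
    [0.5, 4, 0.8042222222222222],
    [0.5, 6, 0.7894444444444444],
]

# Column names are illustrative labels, not something the exercise defines
results_df = pd.DataFrame(results_list, columns=["learn_rate", "max_depth", "accuracy"])

# Sort so the best-scoring combination comes first
results_df = results_df.sort_values("accuracy", ascending=False).reset_index(drop=True)
best = results_df.iloc[0]

print(results_df)
print(best["learn_rate"], best["max_depth"])
```

In this run the differences between most combinations are small, so a single train/test split is a weak basis for choosing between them; that is one motivation for the cross-validated search later in these notes.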

def gbm_grid_search_extended(learn_rate, max_depth, subsample):
    # Extend the model creation section
    model = GradientBoostingClassifier(learning_rate=learn_rate,
                                       max_depth=max_depth,
                                       subsample=subsample)

    predictions = model.fit(X_train, y_train).predict(X_test)

    # Extend the return part
    return [learn_rate, max_depth, subsample, accuracy_score(y_test, predictions)]

subsample_list = [0.4, 0.6]

# results_list is not reset here, so the new 4-element rows are appended
# after the 9 rows produced by the previous loop
for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        # Extend the for loop
        for subsample in subsample_list:
            # Extend the results to include the new hyperparameter
            results_list.append(gbm_grid_search_extended(learn_rate, max_depth, subsample))

# Print the results
pprint(results_list)

[[0.01, 2, 0.8214444444444444],
[0.01, 4, 0.8198888888888889],
[0.01, 6, 0.8172222222222222],
[0.1, 2, 0.8205555555555556],
[0.1, 4, 0.8204444444444444],
[0.1, 6, 0.8204444444444444],
[0.5, 2, 0.8188888888888889],
[0.5, 4, 0.8042222222222222],
[0.5, 6, 0.7894444444444444],
[0.01, 2, 0.4, 0.8192222222222222],
[0.01, 2, 0.6, 0.8208888888888889],
[0.01, 4, 0.4, 0.8183333333333334],
[0.01, 4, 0.6, 0.8195555555555556],
[0.01, 6, 0.4, 0.8177777777777778],
[0.01, 6, 0.6, 0.8196666666666667],
[0.1, 2, 0.4, 0.821],
[0.1, 2, 0.6, 0.8201111111111111],
[0.1, 4, 0.4, 0.8207777777777778],
[0.1, 4, 0.6, 0.8196666666666667],
[0.1, 6, 0.4, 0.8155555555555556],
[0.1, 6, 0.6, 0.8183333333333334],
[0.5, 2, 0.4, 0.8128888888888889],
[0.5, 2, 0.6, 0.8156666666666667],
[0.5, 4, 0.4, 0.7945555555555556],
[0.5, 4, 0.6, 0.8065555555555556],
[0.5, 6, 0.4, 0.7714444444444445],
[0.5, 6, 0.6, 0.7743333333333333]]
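
Each extra hyperparameter adds another level of nesting to the loop. As a rough alternative, `itertools.product` from the standard library can enumerate the same combinations in one flat loop. In the sketch below, `score_model` is a hypothetical stand-in for `gbm_grid_search_extended` so that the block runs without the credit card data; it does not fit any model.

```python
from itertools import product

learn_rate_list = [0.01, 0.1, 0.5]
max_depth_list = [2, 4, 6]
subsample_list = [0.4, 0.6]


def score_model(learn_rate, max_depth, subsample):
    # Hypothetical stand-in for gbm_grid_search_extended: returns the
    # hyperparameters with a placeholder score of None instead of fitting a model
    return [learn_rate, max_depth, subsample, None]


# product() yields every combination with the last list varying fastest,
# which matches the order produced by the three nested for loops
results = [
    score_model(lr, md, ss)
    for lr, md, ss in product(learn_rate_list, max_depth_list, subsample_list)
]

print(len(results))  # 3 * 3 * 2 = 18 combinations
print(results[:2])
```

The count also shows why manual grid search scales poorly: the number of models is the product of the list lengths, so every new hyperparameter multiplies the work.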


## Grid Search with Scikit Learn

• Steps in a Grid Search
1. Choosing an algorithm (the estimator) whose hyperparameters will be tuned
2. Defining which hyperparameters to tune
3. Defining a range of values for each hyperparameter
4. Setting a cross-validation scheme
5. Defining a score function so we can decide which square on our grid was 'the best'
6. Including extra useful information or functions

### GridSearchCV with Scikit Learn

The GridSearchCV class from Scikit Learn provides many useful features for running a grid search efficiently. You will now put your learning into practice by creating a GridSearchCV object with specific parameters.

The desired options are:

• A Random Forest Estimator, with the split criterion as 'entropy'
• 5-fold cross validation
• The hyperparameters max_depth (2, 4, 8, 15) and max_features ('auto' vs 'sqrt')
• Use roc_auc to score the models
• Use 4 cores for processing in parallel
• Ensure you refit the best model and return training scores

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest Classifier with specified criterion
rf_class = RandomForestClassifier(criterion='entropy')

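# Create the parameter grid
# Note: max_features='auto' was deprecated in scikit-learn 1.1 and removed in 1.3;
# for RandomForestClassifier it behaved the same as 'sqrt'
param_grid = {
    'max_depth': [2, 4, 8, 15],
    'max_features': ['auto', 'sqrt']
}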

# Create a GridSearchCV object
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=4,
    cv=5,
    refit=True,
    return_train_score=True
)

print(grid_rf_class)

GridSearchCV(cv=5, estimator=RandomForestClassifier(criterion='entropy'),
             n_jobs=4,
             param_grid={'max_depth': [2, 4, 8, 15],
                         'max_features': ['auto', 'sqrt']},
             return_train_score=True, scoring='roc_auc')
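
Before fitting, it can help to know how many models the search will train. The sketch below uses scikit-learn's `ParameterGrid` helper to enumerate the squares of the grid defined above; the `cv_folds` and `n_fits` names are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [2, 4, 8, 15],
    'max_features': ['auto', 'sqrt'],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV object above

# ParameterGrid only enumerates the combinations; it does not fit anything
grid = list(ParameterGrid(param_grid))
n_fits = len(grid) * cv_folds

print(len(grid))  # number of squares on the grid
print(n_fits)     # cross-validation fits, excluding the final refit
```

With `refit=True`, one more fit on the full training set follows the cross-validation fits.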


## Understanding a grid search output

### Exploring the grid search results

You will now explore the cv_results_ property of the GridSearchCV object defined above. This is a dictionary that we can read into a pandas DataFrame, and it contains a lot of useful information about the grid search.

A reminder of the different column types in this property:

• time_ columns
• param_ columns (one for each hyperparameter) and the singular params column (with all hyperparameter settings)
• a train_score column for each cv fold including the mean_train_score and std_train_score columns
• a test_score column for each cv fold including the mean_test_score and std_test_score columns
• a rank_test_score column with a number from 1 to n (the number of hyperparameter combinations) ranking the rows based on their mean_test_score

# Fit the grid search so that cv_results_ is populated
grid_rf_class.fit(X_train, y_train)

cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
print(cv_results_df)

# Extract and print the column with a dictionary of hyperparameters used
column = cv_results_df.loc[:, ["params"]]
print(column)

# Extract and print the row that had the best mean test score
best_row = cv_results_df[cv_results_df['rank_test_score'] == 1]
print(best_row)

   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       0.633362      0.012019         0.022155        0.000390
1       0.641156      0.005938         0.021984        0.000733
2       1.075554      0.013504         0.025514        0.000493
3       1.078453      0.010363         0.025199        0.000147
4       1.944722      0.012599         0.033582        0.000308
5       1.945425      0.028632         0.034088        0.000777
6       3.206863      0.097319         0.053727        0.002878
7       3.147933      0.044344         0.051127        0.000389

param_max_depth param_max_features  \
0               2               auto
1               2               sqrt
2               4               auto
3               4               sqrt
4               8               auto
5               8               sqrt
6              15               auto
7              15               sqrt

params  split0_test_score  \
0   {'max_depth': 2, 'max_features': 'auto'}           0.762140
1   {'max_depth': 2, 'max_features': 'sqrt'}           0.764059
2   {'max_depth': 4, 'max_features': 'auto'}           0.770780
3   {'max_depth': 4, 'max_features': 'sqrt'}           0.771145
4   {'max_depth': 8, 'max_features': 'auto'}           0.777029
5   {'max_depth': 8, 'max_features': 'sqrt'}           0.775533
6  {'max_depth': 15, 'max_features': 'auto'}           0.764570
7  {'max_depth': 15, 'max_features': 'sqrt'}           0.768063

split1_test_score  split2_test_score  ...  mean_test_score  std_test_score  \
0           0.765329           0.761279  ...         0.766216        0.006197
1           0.763412           0.762643  ...         0.766549        0.006738
2           0.767779           0.767878  ...         0.772387        0.005807
3           0.767072           0.768481  ...         0.772561        0.006485
4           0.774657           0.774855  ...         0.778609        0.005532
5           0.773319           0.775794  ...         0.777846        0.005482
6           0.767184           0.773903  ...         0.771833        0.005240
7           0.770179           0.775381  ...         0.773331        0.003961

rank_test_score  split0_train_score  split1_train_score  \
0                8            0.769725            0.770268
1                7            0.769879            0.770792
2                5            0.780189            0.780651
3                4            0.780138            0.781665
4                1            0.829683            0.829852
5                2            0.830955            0.830563
6                6            0.977494            0.973979
7                3            0.975494            0.973033

split2_train_score  split3_train_score  split4_train_score  \
0            0.770262            0.769099            0.766369
1            0.771637            0.767746            0.765142
2            0.781025            0.780711            0.777592
3            0.781055            0.780815            0.777521
4            0.829912            0.829288            0.829285
5            0.830177            0.831415            0.827246
6            0.974724            0.974242            0.976732
7            0.976973            0.974084            0.973877

mean_train_score  std_train_score
0          0.769145         0.001453
1          0.769039         0.002340
2          0.780034         0.001249
3          0.780239         0.001444
4          0.829604         0.000270
5          0.830071         0.001471
6          0.975434         0.001412
7          0.974692         0.001388

[8 rows x 22 columns]
params
0   {'max_depth': 2, 'max_features': 'auto'}
1   {'max_depth': 2, 'max_features': 'sqrt'}
2   {'max_depth': 4, 'max_features': 'auto'}
3   {'max_depth': 4, 'max_features': 'sqrt'}
4   {'max_depth': 8, 'max_features': 'auto'}
5   {'max_depth': 8, 'max_features': 'sqrt'}
6  {'max_depth': 15, 'max_features': 'auto'}
7  {'max_depth': 15, 'max_features': 'sqrt'}
mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
4       1.944722      0.012599         0.033582        0.000308

param_max_depth param_max_features  \
4               8               auto

params  split0_test_score  \
4  {'max_depth': 8, 'max_features': 'auto'}           0.777029

split1_test_score  split2_test_score  ...  mean_test_score  std_test_score  \
4           0.774657           0.774855  ...         0.778609        0.005532

rank_test_score  split0_train_score  split1_train_score  \
4                1            0.829683            0.829852

split2_train_score  split3_train_score  split4_train_score  \
4            0.829912            0.829288            0.829285

mean_train_score  std_train_score
4          0.829604          0.00027

[1 rows x 22 columns]
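
Because `return_train_score=True` was set, the results also show how far the training scores drift from the cross-validated scores. The sketch below rebuilds a small summary from the mean scores copied out of the output above; the `gap` column is a label introduced here, not something `cv_results_` provides.

```python
import pandas as pd

# Mean scores copied from the cv_results_ output printed above
summary = pd.DataFrame({
    "param_max_depth": [2, 2, 4, 4, 8, 8, 15, 15],
    "param_max_features": ["auto", "sqrt"] * 4,
    "mean_test_score": [0.766216, 0.766549, 0.772387, 0.772561,
                        0.778609, 0.777846, 0.771833, 0.773331],
    "mean_train_score": [0.769145, 0.769039, 0.780034, 0.780239,
                         0.829604, 0.830071, 0.975434, 0.974692],
})

# 'gap' is an illustrative column: a large value suggests overfitting
summary["gap"] = summary["mean_train_score"] - summary["mean_test_score"]

print(summary.sort_values("mean_test_score", ascending=False))
```

At max_depth 15 the training ROC AUC is about 0.975 while the cross-validated score stays near 0.77, so the extra depth mostly fits the training folds rather than improving generalization.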


### Analyzing the best results

In practice, we primarily care about the best-performing 'square' in a grid search. Scikit Learn's GridSearchCV objects have a number of attributes that provide key information on just the best square (or row in cv_results_).

Three properties you will explore are:

• best_score_ – The score (here ROC_AUC) from the best-performing square.
• best_index_ – The index of the row in cv_results_ containing information on the best-performing square.
• best_params_ – A dictionary of the parameters that gave the best score, for example {'max_depth': 10}

# Print the ROC_AUC score from the best-performing square
best_score = grid_rf_class.best_score_
print(best_score)

# Create a variable from the row related to the best-performing square
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
best_row = cv_results_df.loc[[grid_rf_class.best_index_]]
print(best_row)

# Get the max_depth parameter from the best-performing square and print
best_max_depth = grid_rf_class.best_params_['max_depth']
print(best_max_depth)

0.7786085423910816
mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
4       1.944722      0.012599         0.033582        0.000308

param_max_depth param_max_features  \
4               8               auto

params  split0_test_score  \
4  {'max_depth': 8, 'max_features': 'auto'}           0.777029

split1_test_score  split2_test_score  ...  mean_test_score  std_test_score  \
4           0.774657           0.774855  ...         0.778609        0.005532

rank_test_score  split0_train_score  split1_train_score  \
4                1            0.829683            0.829852

split2_train_score  split3_train_score  split4_train_score  \
4            0.829912            0.829288            0.829285

mean_train_score  std_train_score
4          0.829604          0.00027

[1 rows x 22 columns]
8
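
The three attributes all describe the same row of `cv_results_`. A minimal sketch of that relationship follows, assuming a small synthetic dataset from `make_classification` in place of the credit card data and a deliberately tiny grid so it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; the exercise uses the credit card data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 4]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)

cv_results_df = pd.DataFrame(grid.cv_results_)
best_row = cv_results_df.loc[grid.best_index_]

# best_score_ and best_params_ are the values stored in the best_index_ row
print(best_row["mean_test_score"] == grid.best_score_)
print(best_row["params"] == grid.best_params_)
print(best_row["rank_test_score"])
```

Note that `best_score_` is the mean cross-validated score, not a score on a held-out test set.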


### Using the best results

While it is interesting to analyze the results of our grid search, our final goal is practical in nature; we want to make predictions on our test set using our estimator object.

We can access this object through the best_estimator_ property of our grid search object.

In this exercise we will take a look inside the best_estimator_ property and then use this to make predictions on our test set for credit card defaults and generate a variety of scores. Remember to use predict_proba rather than predict since we need probability values rather than class labels for our roc_auc score. We use a slice [:,1] to get probabilities of the positive class.

from sklearn.metrics import confusion_matrix, roc_auc_score

# See what type of object the best_estimator_ property is
print(type(grid_rf_class.best_estimator_))

# Create an array of predictions directly using the best_estimator_ property
predictions = grid_rf_class.best_estimator_.predict(X_test)

# Take a look to confirm it worked; this should be an array of 1s and 0s
print(predictions[0:5])

# Now create a confusion matrix
print("Confusion Matrix \n", confusion_matrix(y_test, predictions))

# Get the ROC-AUC score
predictions_proba = grid_rf_class.best_estimator_.predict_proba(X_test)[:, 1]
print("ROC-AUC Score \n", roc_auc_score(y_test, predictions_proba))

<class 'sklearn.ensemble._forest.RandomForestClassifier'>
[1 0 0 1 0]
Confusion Matrix
[[6685  323]
[1292  700]]
ROC-AUC Score
0.7767071783137115


The best_estimator_ property is worth understanding because it streamlines the model building process: you can run a grid search and then use the best model from that search directly to make predictions.
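
To make the `predict` versus `predict_proba` distinction concrete, here is a small sketch on synthetic data (an assumption; it does not use the credit card data or reproduce the scores above). It also illustrates that, when `refit=True`, calling `predict` on the search object gives the same result as calling it on `best_estimator_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the credit card data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 4]},
    scoring="roc_auc",
    cv=3,
    refit=True,
)
grid.fit(X_train, y_train)

# One column per class, ordered as in classes_; column 1 is the positive class here
proba = grid.best_estimator_.predict_proba(X_test)
print(grid.best_estimator_.classes_)
print(proba.shape)

# With refit=True, the search object forwards predict calls to best_estimator_
same = np.array_equal(grid.predict(X_test), grid.best_estimator_.predict(X_test))
print(same)

# ROC AUC needs the positive-class probabilities, not the hard class labels
print(roc_auc_score(y_test, proba[:, 1]))
```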