import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb

Why tune your model?

Tuning the number of boosting rounds

Let's start with parameter tuning by seeing how the number of boosting rounds (the number of trees you build) impacts the out-of-sample performance of your XGBoost model. You'll use xgb.cv() inside a for loop and build one model per value of num_boost_round.

Here, you'll continue working with the Ames housing dataset. The features are available in X, and the target vector is contained in y.

df = pd.read_csv('./dataset/ames_housing_trimmed_processed.csv')
X, y = df.iloc[:, :-1], df.iloc[:, -1]
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of number of boosting rounds
num_rounds = [5, 10, 15]

# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []

# Iterate over num_rounds and build one model per num_boost_round value
for curr_num_rounds in num_rounds:
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, 
                        num_boost_round=curr_num_rounds, metrics='rmse', 
                        as_pandas=True, seed=123)
    
    # Append final round RMSE
    final_rmse_per_round.append(cv_results['test-rmse-mean'].tail().values[-1])
    
# Print the result DataFrame
num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
print(pd.DataFrame(num_rounds_rmses, columns=['num_boosting_rounds', 'rmse']))
   num_boosting_rounds          rmse
0                    5  50903.299479
1                   10  34774.194010
2                   15  32895.097656
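
Since matplotlib is already imported, a quick plot makes the trend easier to see. This is a minimal sketch that reuses the num_rounds and final_rmse_per_round lists built above; the labels are illustrative.

# Plot the final cross-validated RMSE against the number of boosting rounds
plt.plot(num_rounds, final_rmse_per_round, marker='o')
plt.xlabel('Number of boosting rounds')
plt.ylabel('Final test RMSE (CV mean)')
plt.title('Out-of-sample RMSE vs. number of boosting rounds')
plt.show()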

Automated boosting round selection using early_stopping

Now, instead of attempting to cherry-pick the best possible number of boosting rounds, you can easily have XGBoost select the number of boosting rounds for you automatically within xgb.cv(). This is done using a technique called early stopping.

Early stopping works by testing the XGBoost model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds. Here you will use the early_stopping_rounds parameter in xgb.cv() with a large possible number of boosting rounds (50). Bear in mind that if the hold-out metric continuously improves up through when num_boost_round is reached, then early stopping does not occur; that is essentially what happens in the run below, where the test RMSE never stalls for 10 consecutive rounds and all 50 rounds complete.

housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params
params = {"objective":"reg:squarederror", "max_depth":4}

# Perform cross-validation with early-stopping: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, nfold=3, params=params, metrics="rmse", 
                    early_stopping_rounds=10, num_boost_round=50, as_pandas=True, seed=123)

# Print cv_results
print(cv_results)
    train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0     141871.635417      403.636200   142640.651042     705.559164
1     103057.033854       73.769960   104907.664063     111.119966
2      75975.966146      253.726099    79262.054687     563.764349
3      57420.529948      521.658354    61620.136719    1087.694282
4      44552.955729      544.170190    50437.561198    1846.446330
5      35763.947917      681.797248    43035.660156    2034.472841
6      29861.464193      769.572153    38600.881510    2169.798065
7      25994.675781      756.519243    36071.817708    2109.795430
8      23306.836588      759.238254    34383.184896    1934.546688
9      21459.770833      745.624687    33509.141276    1887.375284
10     20148.720703      749.612103    32916.807292    1850.894702
11     19215.382162      641.387014    32197.832682    1734.456935
12     18627.389323      716.256596    31770.852214    1802.156241
13     17960.694661      557.043073    31482.781901    1779.124534
14     17559.736979      631.412969    31389.990234    1892.319927
15     17205.712891      590.171393    31302.881511    1955.166733
16     16876.572591      703.632148    31234.059896    1880.707172
17     16597.663086      703.677647    31318.348308    1828.860391
18     16330.460937      607.274494    31323.634766    1775.910706
19     16005.972656      520.471365    31204.135417    1739.076156
20     15814.300456      518.605216    31089.863281    1756.020773
21     15493.405599      505.616447    31047.996094    1624.673955
22     15270.734375      502.018453    31056.916015    1668.042812
23     15086.382161      503.912447    31024.984375    1548.985605
24     14917.607747      486.205730    30983.686198    1663.131107
25     14709.589518      449.668010    30989.477865    1686.668378
26     14457.286458      376.787206    30952.113281    1613.172049
27     14185.567383      383.102234    31066.901042    1648.534606
28     13934.067057      473.464991    31095.641927    1709.225000
29     13749.645182      473.671021    31103.887370    1778.880069
30     13549.836263      454.898488    30976.085938    1744.515079
31     13413.485026      399.603470    30938.469401    1746.054047
32     13275.916016      415.408786    30930.999349    1772.470428
33     13085.878255      493.792509    30929.056641    1765.541578
34     12947.181641      517.790106    30890.629557    1786.510976
35     12846.027344      547.732372    30884.492839    1769.730062
36     12702.378906      505.523126    30833.541667    1691.002487
37     12532.244140      508.298516    30856.688151    1771.445059
38     12384.055013      536.225042    30818.016927    1782.786053
39     12198.444010      545.165502    30839.392578    1847.326928
40     12054.583333      508.841412    30776.964844    1912.779587
41     11897.036784      477.177937    30794.702474    1919.674832
42     11756.221354      502.992395    30780.957031    1906.820066
43     11618.846680      519.837469    30783.753906    1951.260704
44     11484.080404      578.428621    30776.731771    1953.446772
45     11356.552734      565.368794    30758.542969    1947.454481
46     11193.557943      552.299272    30729.972005    1985.699338
47     11071.315104      604.089876    30732.663411    1966.999196
48     10950.778320      574.862779    30712.240885    1957.751118
49     10824.865885      576.665756    30720.853516    1950.511977
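
xgb.cv() handles the hold-out folds for you, but the same early-stopping mechanism is available when training a single model with xgb.train(). Below is a minimal sketch assuming a simple hold-out split made with scikit-learn's train_test_split; the variable names (X_train, dvalid, and so on) are illustrative.

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows to monitor the evaluation metric
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=123)
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dvalid = xgb.DMatrix(data=X_valid, label=y_valid)

# Stop adding trees once the validation RMSE fails to improve for 10 rounds
bst = xgb.train(params=params, dtrain=dtrain, num_boost_round=50,
                evals=[(dvalid, 'valid')], early_stopping_rounds=10)

# Boosting round with the best validation RMSE (set when early stopping is configured)
print(bst.best_iteration)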

Overview of XGBoost's hyperparameters

  • Common tree tunable parameters
    • learning rate (eta): shrinkage applied to each new tree's contribution
    • gamma: min loss reduction to create new tree split
    • lambda: L2 regularization on leaf weights
    • alpha: L1 regularization on leaf weights
    • max_depth: max depth per tree
    • subsample: % samples used per tree
    • colsample_bytree: % features used per tree
  • Linear tunable parameters
    • lambda: L2 reg on weights
    • alpha: L1 reg on weights
    • lambda_bias: L2 reg term on bias
  • You can also tune the number of estimators used for both base model types (see the sketch below)!
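
To make the distinction between the two base learners concrete, here is a minimal sketch of what the two parameter dictionaries might look like; the specific values are placeholders, not recommendations.

# Tree base learner (the default, booster='gbtree')
tree_params = {"objective": "reg:squarederror", "booster": "gbtree",
               "eta": 0.1, "gamma": 0, "lambda": 1, "alpha": 0,
               "max_depth": 3, "subsample": 0.8, "colsample_bytree": 0.8}

# Linear base learner (booster='gblinear'): only the regularization terms to tune
# (older XGBoost releases also expose lambda_bias, the L2 term on the bias)
linear_params = {"objective": "reg:squarederror", "booster": "gblinear",
                 "lambda": 1, "alpha": 0}

# The number of boosting rounds (estimators) applies to both base learner types
cv_results = xgb.cv(dtrain=housing_dmatrix, params=tree_params, nfold=3,
                    num_boost_round=10, metrics="rmse", as_pandas=True, seed=123)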

Tuning eta

It's time to practice tuning other XGBoost hyperparameters in earnest and observing their effect on model performance! You'll begin by tuning the "eta", also known as the learning rate.

The learning rate in XGBoost is a parameter that can range between 0 and 1. It shrinks the contribution of each new tree, so lower values of "eta" make the boosting process more conservative (more heavily regularized) and typically require more boosting rounds to reach the same level of fit, as you'll see in the results below.

housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []

# Systematically vary the eta
for curr_val in eta_vals:
    params['eta'] = curr_val
    
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                        early_stopping_rounds=5, num_boost_round=10, metrics='rmse', seed=123, 
                       as_pandas=True)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])
    
# Print the result DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=['eta', 'best_rmse']))
     eta      best_rmse
0  0.001  195736.401042
1  0.010  179932.187500
2  0.100   79759.408854

Tuning max_depth

In this exercise, your job is to tune max_depth, which is the parameter that dictates the maximum depth that each tree in a boosting round can grow to. Smaller values will lead to shallower trees, and larger values to deeper trees.

housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary
params = {"objective":"reg:squarederror"}

# Create list of max_depth values
max_depths = [2, 5, 10, 20]
best_rmse = []

for curr_val in max_depths:
    params['max_depth'] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, 
                       early_stopping_rounds=5, num_boost_round=10, metrics='rmse', seed=123,
                        as_pandas=True)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])
    
# Print the result DataFrame
print(pd.DataFrame(list(zip(max_depths, best_rmse)), columns=['max_depth', 'best_rmse']))
   max_depth     best_rmse
0          2  37957.468750
1          5  35596.599610
2         10  36065.546875
3         20  36739.576172

Tuning colsample_bytree

Now, it's time to tune "colsample_bytree". You've seen a similar idea if you've ever worked with scikit-learn's RandomForestClassifier or RandomForestRegressor, where the analogous parameter is called max_features. Whereas max_features controls how many features are considered at each split, colsample_bytree specifies the fraction of features (columns) randomly sampled for each tree. In xgboost, colsample_bytree must be specified as a float between 0 and 1.

housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params={"objective":"reg:squarederror", "max_depth":3}

# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
best_rmse = []

# Systematically vary the hyperparameter value 
for curr_val in colsample_bytree_vals:
    params['colsample_bytree'] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                 num_boost_round=10, early_stopping_rounds=5,
                 metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), 
                   columns=["colsample_bytree","best_rmse"]))
   colsample_bytree     best_rmse
0               0.1  48193.453125
1               0.5  36013.544922
2               0.8  35932.962891
3               1.0  35836.042968

There are several other individual parameters that you can tune, such as "subsample", which dictates the fraction of the training data that is used during any given boosting round; a quick sketch of tuning it follows below. After that: Grid Search and Random Search to tune XGBoost hyperparameters more efficiently!
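
This sketch reuses the cross-validation loop pattern from the exercises above; the candidate values are placeholders rather than recommendations, and params still carries the settings from the colsample_bytree exercise.

# Create list of subsample values to try
subsample_vals = [0.3, 0.6, 0.9]
best_rmse = []

# Systematically vary the fraction of training rows sampled per boosting round
for curr_val in subsample_vals:
    params['subsample'] = curr_val

    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                        num_boost_round=10, early_stopping_rounds=5,
                        metrics="rmse", as_pandas=True, seed=123)

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the result DataFrame
print(pd.DataFrame(list(zip(subsample_vals, best_rmse)), columns=["subsample", "best_rmse"]))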

  • Grid search: review
    • Search exhaustively over a given set of hyperparameter values, training one model per combination
    • Number of models = number of distinct values per hyperparameter multiplied across each hyperparameter (e.g., 2 × 1 × 2 = 4 candidate models in the grid search below, which 4-fold cross-validation turns into 16 fits)
    • Pick final model hyperparameter values that give best cross-validated evaluation metric value
  • Random search: review
    • Create a (possibly infinite) range of hyperparameter values per hyperparameter that you would like to search over
    • Set the number of iterations you would like for the random search to continue
    • During each iteration, randomly draw a value in the range of specified values for each hyperparameter searched over and train/evaluate a model with those hyperparameters
    • After you've reached the maximum number of iterations, select the hyperparameter configuration with the best evaluated score

Grid search with XGBoost

Now that you've learned how to tune parameters individually with XGBoost, let's take your parameter tuning to the next level by using scikit-learn's GridSearchCV and RandomizedSearchCV classes, which perform internal cross-validation for you. You will use these to find the best model exhaustively from a collection of possible parameter values across multiple parameters simultaneously. Let's get to work, starting with GridSearchCV!

from sklearn.model_selection import GridSearchCV

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
grid_mse = GridSearchCV(param_grid=gbm_param_grid, estimator=gbm, 
                        scoring='neg_mean_squared_error', cv=4, verbose=1)

# Fit grid_mse to the data
grid_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
Fitting 4 folds for each of 4 candidates, totalling 16 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best parameters found:  {'colsample_bytree': 0.7, 'max_depth': 5, 'n_estimators': 50}
Lowest RMSE found:  30744.105707685176
[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed:    0.4s finished

Random search with XGBoost

Often, GridSearchCV can be really time-consuming, so in practice you may want to use RandomizedSearchCV instead, as you will do in this exercise. The good news is that you only have to make a few modifications to your GridSearchCV code to use RandomizedSearchCV: the key difference is that you specify a param_distributions argument instead of param_grid.

from sklearn.model_selection import RandomizedSearchCV

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'n_estimators': [25],
    'max_depth': range(2, 12)
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(n_estimators=10)

# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(param_distributions=gbm_param_grid, estimator=gbm, 
                                    scoring='neg_mean_squared_error', n_iter=5, cv=4, 
                                   verbose=1)

# Fit randomized_mse to the data
randomized_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))
Fitting 4 folds for each of 5 candidates, totalling 20 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Best parameters found:  {'n_estimators': 25, 'max_depth': 4}
Lowest RMSE found:  29998.4522530019
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.5s finished
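
Once either search object has been fit, it behaves like a fitted estimator (with refit=True, the default), so you can generate predictions with the best configuration directly. A brief sketch reusing the randomized_mse object from above:

# The refitted best model is available directly on the search object
best_model = randomized_mse.best_estimator_

# Predict with the best hyperparameter configuration found
preds = best_model.predict(X)
print(preds[:5])
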
  • Limitations
    • Grid Search
      • The number of models you must build grows very quickly with every additional hyperparameter and value
    • Random Search
      • The parameter space to explore can be massive
      • Randomly jumping throughout the space looking for the "best" result becomes a waiting game