Goal: Understand, through visualization, the effect of changing the primary hyperparameters on the decision boundaries of extreme gradient boosting machines built with the xgboost library.
Introduction
If you are like me, you grasp complex concepts best when you can see their effects visually. There are a lot of different machine learning algorithms, and even more hyperparameters that need to be tuned. Although methods like randomized search and grid search remove some of the need to understand how the parameters work, it’s still a good idea to have an intuitive sense of their effects in case you’re not getting the performance you desire.
In part 1 of this series, I’m going to focus on XGBoost. I’ve spent a lot of time with XGBoost, and its performance on most problems is exceptional. Most tuning guides and best practices you find on the internet provide numerical heuristics and rules of thumb for tuning the parameters. But what effect do these parameters have on the decision boundaries, and what does that effect look like? All else being equal, what does increasing the gamma parameter do to the decision boundary on different datasets? We’re going to explore this in detail in this post.
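For reference, the hands-off tuning mentioned above looks something like the sketch below: a minimal grid search over an XGBClassifier. The synthetic dataset and the candidate values in the grid are arbitrary choices of mine for illustration, not recommendations.
# Minimal grid-search sketch over a few XGBoost hyperparameters.
# The dataset and the candidate values are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {'max_depth': [2, 4, 6],
              'learning_rate': [0.05, 0.1, 0.3],
              'n_estimators': [50, 100]}

search = GridSearchCV(XGBClassifier(), param_grid, cv=3, scoring='accuracy')
search.fit(X, y)
print(search.best_params_)  # best combination found by cross-validation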
Attribution: The plotting solution used in this tutorial was borrowed from the excellent classifier comparison tutorial on the sklearn website. In this series I adapt it to visualize the effect of hyperparameter tuning on a variety of top-performing algorithms.
I’m going to change each parameter in isolation and plot the effect on the decision boundary. All hyperparameters will be set to their defaults except for the parameter in question (a quick way to inspect those defaults is shown right after this list). We’ll do this for:
n_estimators
learning_rate
max_depth
gamma
min_child_weight
subsample
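If you want to check what those defaults actually are, the classifier will tell you. This is a small snippet of my own; the exact values reported depend on the version of xgboost you have installed.
# Inspect the default hyperparameters of XGBClassifier.
# Exact values vary by xgboost version.
from xgboost import XGBClassifier

defaults = XGBClassifier().get_params()
for param in ['n_estimators', 'learning_rate', 'max_depth',
              'gamma', 'min_child_weight', 'subsample']:
    print(param, '=', defaults.get(param))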
Our first step is to import our libraries and declare our plotting function because we’re going to be re-using this code a lot.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore")
def plot_decision_bounds(names, classifiers):
    '''
    This function takes in a list of classifier variants and their names and plots
    the decision boundaries for each on three datasets that call for different
    decision boundary shapes.

    Parameters:
        names: list, list of names for labelling the subplots.
        classifiers: list, list of classifier variants for building decision boundaries.

    Returns:
        None
    '''
    h = .02  # step size in the mesh

    # Build a linearly separable dataset to accompany the moons and circles datasets.
    X, y = make_classification(n_features=2,
                               n_redundant=0,
                               n_informative=2,
                               random_state=1,
                               n_clusters_per_class=1)
    rng = np.random.RandomState(2)
    X += 2 * rng.uniform(size=X.shape)
    linearly_separable = (X, y)

    datasets = [make_moons(noise=0.3, random_state=0),
                make_circles(noise=0.25, factor=0.5, random_state=1),
                linearly_separable]

    figure = plt.figure(figsize=(13, 8))
    i = 1
    # iterate over datasets
    for ds_cnt, ds in enumerate(datasets):
        # preprocess dataset, split into training and test part
        X, y = ds
        X = StandardScaler().fit_transform(X)
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=.4, random_state=42)

        x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
        y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))

        # just plot the dataset first
        cm = plt.cm.cool
        cm_bright = ListedColormap(['#FF0000', '#0000FF'])
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        if ds_cnt == 0:
            ax.set_title("Input data")
        # Plot the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                   edgecolors='k')
        # and testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
                   edgecolors='k')
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        i += 1

        # iterate over classifiers
        for name, clf in zip(names, classifiers):
            ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)

            # Plot the decision boundary. For that, we will assign a color to each
            # point in the mesh [x_min, x_max]x[y_min, y_max].
            if hasattr(clf, "decision_function"):
                Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
            else:
                Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

            # Put the result into a color plot
            Z = Z.reshape(xx.shape)
            ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

            # Plot also the training points
            ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                       edgecolors='k')
            # and testing points
            ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                       edgecolors='k', alpha=0.6)

            ax.set_xlim(xx.min(), xx.max())
            ax.set_ylim(yy.min(), yy.max())
            ax.set_xticks(())
            ax.set_yticks(())
            if ds_cnt == 0:
                ax.set_title(name)
            # print the test-set accuracy in the corner of each subplot
            ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                    size=15, horizontalalignment='right')
            i += 1

    plt.tight_layout()
    plt.show()
    return
Effects of XGBoost Params on Decision Boundaries
max_depth
The first parameter we are going to visualize is max_depth. It controls how deep each tree is allowed to grow. In general, the deeper the tree, the finer the structure it can capture, noise included, so deep trees tend to overfit, although this can be mitigated to some extent. XGBoost is designed to iteratively add “weak learners” that correct the errors of the trees before them, and weak learners are typically shallow trees. Because of this, growing trees that are too deep can negate some of the benefits of the ensemble.
names = ['Max Depth = 1', 'Max Depth = 3', 'Max Depth = 10', 'Max Depth = 20']
classifiers = [XGBClassifier(max_depth=1),
               XGBClassifier(max_depth=3),
               XGBClassifier(max_depth=10),
               XGBClassifier(max_depth=20)]
plot_decision_bounds(names, classifiers)
gamma
gamma sets the minimum loss reduction required to make a further split on a leaf node. The larger gamma is, the more conservative the algorithm becomes and the fewer splits the trees make.
names = ['gamma = 0', 'gamma = 0.5', 'gamma = 2', 'gamma = 10']
classifiers = [XGBClassifier(gamma=0),
               XGBClassifier(gamma=0.5),
               XGBClassifier(gamma=2),
               XGBClassifier(gamma=10)]
plot_decision_bounds(names, classifiers)
subsample
subsample is the fraction of the training rows sampled to grow each tree. Values below 1.0 inject randomness into the ensemble, which can help reduce overfitting.
names = ['subsample = 1.0', 'subsample = 0.9', 'subsample = 0.5',
         'subsample = 0.1']
classifiers = [XGBClassifier(subsample=1.0),
               XGBClassifier(subsample=0.9),
               XGBClassifier(subsample=0.5),
               XGBClassifier(subsample=0.1)]
plot_decision_bounds(names, classifiers)
n_estimators
n_estimators is the number of boosting rounds, i.e. how many trees are added to the ensemble.
names = ['n_estimators = 2', 'n_estimators = 5', 'n_estimators = 25',
         'n_estimators = 100']
classifiers = [XGBClassifier(n_estimators=2),
               XGBClassifier(n_estimators=5),
               XGBClassifier(n_estimators=25),
               XGBClassifier(n_estimators=100)]
plot_decision_bounds(names, classifiers)
learning_rate
learning_rate shrinks the contribution of each new tree; smaller values generally require more trees (n_estimators) to reach the same fit.
names = ['learning_rate = 0.001', 'learning_rate = 0.01', 'learning_rate = 0.1',
         'learning_rate = 0.5']
classifiers = [XGBClassifier(learning_rate=0.001),
               XGBClassifier(learning_rate=0.01),
               XGBClassifier(learning_rate=0.1),
               XGBClassifier(learning_rate=0.5)]
plot_decision_bounds(names, classifiers)
min_child_weight
min_child_weight is the minimum sum of instance weights (hessian) required in a child node for a split to be kept; larger values make the algorithm more conservative.
names = ['min_child_weight = 0.01', 'min_child_weight = 0.05',
         'min_child_weight = 0.25', 'min_child_weight = 0.75']
classifiers = [XGBClassifier(min_child_weight=0.01),
               XGBClassifier(min_child_weight=0.05),
               XGBClassifier(min_child_weight=0.25),
               XGBClassifier(min_child_weight=0.75)]
plot_decision_bounds(names, classifiers)
Summary
- Algorithm tuning is an important and often complex aspect of machine learning.
- Visualizing the effect of each hyperparameter can help solidify your understanding of the algorithm.
- As I mentioned at the beginning, all of these effects are observed here in isolation. Try experimenting with combinations of the hyperparameters to see what the effects are; a small sketch of one way to do this follows this list.
- In these examples, increasing gamma has the effect of pushing the algorithm toward a simpler, more linear decision boundary.
- Performance gains or losses are rarely a simple, monotonic function of these parameters. Rather, each hyperparameter has “sweet spots” that exist both in isolation and in combination with the other hyperparameters.
- In future posts in this series, we’ll be visualizing the effects of other popular machine learning algorithms.
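As promised above, here is a minimal sketch of one way to experiment with combinations, reusing the plot_decision_bounds helper defined earlier. The particular values of max_depth and learning_rate are arbitrary choices of mine, picked only to illustrate the idea.
# Vary two hyperparameters together and plot the resulting boundaries.
# The grid of values is illustrative only.
from itertools import product

depths = [2, 10]
rates = [0.05, 0.3]

names = ['max_depth = %d, lr = %.2f' % (d, lr) for d, lr in product(depths, rates)]
classifiers = [XGBClassifier(max_depth=d, learning_rate=lr)
               for d, lr in product(depths, rates)]

plot_decision_bounds(names, classifiers)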
Thanks for reading!