Introduction
In this post I’m going to repeat the experiment we did in our XGBoost post, but for Support Vector Machines - if you haven’t read that one I encourage you to view that first!
Support Vector Machines are one of my favourite machine learning algorithms because they're elegant and intuitive (if explained in the right way). All this algorithm tries to do is draw a line through the dataset that separates the classes with as little error as possible.
SVM Intuition
Imagine you had a whole bunch of chocolate M&M's on your counter top. Also, suppose that you only have two colors of M&M's for this example: red and blue. A linear support vector machine would be equivalent to trying to separate the M&M's with a ruler, in such a way that you get the best color separation possible.
Using a poly (polynomial) kernel support vector machine would be like using a ruler that you can bend and then use to separate the M&M's. A degree-1 poly kernel support vector machine is equivalent to a straight line. Increasing the degree allows you to have more bends in your ruler. You can imagine this might be handy depending on how mixed the pile of M&M's is.
Using an RBF kernel support vector machine is for situations where you simply can't use a straight ruler or bent ruler to effectively separate the M&M's. An analogy for RBF support vector machines would be where the M&M's are so mixed that you have to (if you could) suspend the M&M's in three dimensions and then try to separate the two colors with a sheet of paper instead of a ruler (a hyperplane instead of a line).
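To make the ruler analogy concrete before we dive in, here is a minimal sketch (assuming scikit-learn and a small toy dataset) of how each of these "rulers" is chosen - it comes down to the kernel argument of SVC:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# A small, noisy two-class "pile of M&M's" to practice on.
X, y = make_moons(noise=0.3, random_state=0)

# Straight ruler: a linear decision boundary.
linear_svm = SVC(kernel='linear').fit(X, y)

# Bendable ruler: a polynomial boundary whose flexibility grows with degree.
poly_svm = SVC(kernel='poly', degree=3).fit(X, y)

# Sheet of paper in a higher-dimensional space: the RBF kernel.
rbf_svm = SVC(kernel='rbf').fit(X, y)

for name, model in [('linear', linear_svm), ('poly', poly_svm), ('rbf', rbf_svm)]:
    print(name, model.score(X, y))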
During the demonstrations below, keep this analogy in mind. The datasets we show can be thought of as the M&M piles. There are three types of datasets and they're designed to be separated effectively by different types of support vector machines.
Below you're going to see multiple lines and multiple color bands - this is because we've tasked the support vector machines with scoring how likely each data point is to be a blue dot or a red dot (a blue M&M or a red M&M). The different shades represent varying degrees of that confidence (the plotting function below uses the decision function value where available, otherwise the predicted probability).
The parameter C in each sub-experiment tells the support vector machine how heavily to penalise misclassifications during the training process. A larger C (here C=1.0) means very little tolerance for errors, while a smaller C (here C=0.2) means much more tolerance. In most real-world datasets there can never be a perfect separating boundary without overfitting, so some tolerance is usually a good thing.
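As a rough sketch of both ideas (assuming scikit-learn and the same toy moons data used later in this post), an SVC trained with probability=True exposes predict_proba for those confidence bands, and C controls how harshly training errors are penalised:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.3, random_state=0)

# Lenient model: a small C tolerates more training errors (smoother boundary).
lenient = SVC(kernel='rbf', C=0.2, probability=True).fit(X, y)

# Strict model: a larger C penalises training errors more heavily.
strict = SVC(kernel='rbf', C=1.0, probability=True).fit(X, y)

# Class probabilities for the first point - these drive the shaded bands below.
print(lenient.predict_proba(X[:1]))   # [[P(class 0), P(class 1)]]
print(strict.predict_proba(X[:1]))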
Effects of Changing the SVM Hyperparameters
Our first step is to import our libraries and declare our plotting function because we’re going to be re-using this code a lot.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.svm import SVC
import warnings
warnings.filterwarnings("ignore")
def plot_decision_bounds(names, classifiers):
    '''
    This function takes in a list of classifier variants and names and plots the decision
    boundaries for each on three different datasets that require different decision
    boundary solutions.

    Parameters:
        names: list, list of names for labelling the subplots.
        classifiers: list, list of classifier variants for building decision boundaries.

    Returns:
        None
    '''
    h = .02  # step size in the mesh

    X, y = make_classification(n_features=2,
                               n_redundant=0,
                               n_informative=2,
                               random_state=1,
                               n_clusters_per_class=1)
    rng = np.random.RandomState(2)
    X += 2 * rng.uniform(size=X.shape)
    linearly_separable = (X, y)

    datasets = [make_moons(noise=0.3, random_state=0),
                make_circles(noise=0.25, factor=0.5, random_state=1),
                linearly_separable]

    figure = plt.figure(figsize=(13, 8))
    i = 1

    # iterate over datasets
    for ds_cnt, ds in enumerate(datasets):
        # preprocess dataset, split into training and test part
        X, y = ds
        X = StandardScaler().fit_transform(X)
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=.4, random_state=42)

        x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
        y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))

        # just plot the dataset first
        cm = plt.cm.cool
        cm_bright = ListedColormap(['#FF0000', '#0000FF'])
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        if ds_cnt == 0:
            ax.set_title("Input data")
        # Plot the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                   edgecolors='k')
        # and testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
                   edgecolors='k')
        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        i += 1

        # iterate over classifiers
        for name, clf in zip(names, classifiers):
            ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)

            # Plot the decision boundary. For that, we will assign a color to each
            # point in the mesh [x_min, x_max]x[y_min, y_max].
            if hasattr(clf, "decision_function"):
                Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
            else:
                Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

            # Put the result into a color plot
            Z = Z.reshape(xx.shape)
            ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

            # Plot also the training points
            ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                       edgecolors='k')
            # and testing points
            ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                       edgecolors='k', alpha=0.6)

            ax.set_xlim(xx.min(), xx.max())
            ax.set_ylim(yy.min(), yy.max())
            ax.set_xticks(())
            ax.set_yticks(())
            if ds_cnt == 0:
                ax.set_title(name)
            ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                    size=15, horizontalalignment='right')
            i += 1

    plt.tight_layout()
    plt.show()
    return
Changing the Degree Parameter for Poly Kernel SVM
names = ['Degree=1', 'Degree=2', 'Degree=3', 'Degree=4']

classifiers = [SVC(probability=True, kernel='poly', degree=1, C=0.8),
               SVC(probability=True, kernel='poly', degree=2, C=0.8),
               SVC(probability=True, kernel='poly', degree=3, C=0.8),
               SVC(probability=True, kernel='poly', degree=4, C=0.8)]

plot_decision_bounds(names, classifiers);

We can see visually from the results below what we talked about above - that the amount of "bend" in our ruler can determine how well we can separate our pile of M&M's (a short cross-validation sketch for choosing the degree automatically follows the list below).
- Degree 1 works best for dataset 1.
- Degree 4 works best for dataset 2.
- Degree 1 works best for dataset 3.
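Because the best degree is data-dependent, in practice you would normally not pick it by eye from plots like these. Here is a minimal sketch (using scikit-learn's GridSearchCV and one of the toy datasets from above) of cross-validating the choice instead:

from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)

# Search over the polynomial degree - the amount of "bend" in the ruler.
search = GridSearchCV(SVC(kernel='poly', C=0.8),
                      param_grid={'degree': [1, 2, 3, 4]},
                      cv=5)
search.fit(X, y)

print(search.best_params_)   # degree chosen by cross-validation
print(search.best_score_)    # mean cross-validated accuracy for that degree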
Using the RBF Kernel with different C Values
Recall that the RBF kernel is like suspending our pile of M&M's in the air and trying to separate them with a sheet of paper, instead of using a ruler when they're all flat on the counter top. The effect you see below is a 2-D projection of how the plane slices through the 3-D pile of M&M's.
We can see here that the effect the C-value has is very much dependent on the dataset. This highlights the importance of visualizing your data at the beginning of a machine learning project so that you can see what you’re dealing with!
names = ['C=1.0', 'C=0.8', 'C=0.6', 'C=0.2']

classifiers = [SVC(probability=True, kernel='rbf', C=1.0),
               SVC(probability=True, kernel='rbf', C=0.8),
               SVC(probability=True, kernel='rbf', C=0.6),
               SVC(probability=True, kernel='rbf', C=0.2)]

plot_decision_bounds(names, classifiers);

Using the Sigmoid Kernel with different C Values
The sigmoid kernel is another type of kernel that allows more bend patterns to be used by the algorithm in the training process. The effect is visualized below.
names = ['C=1.0', 'C=0.8', 'C=0.6', 'C=0.2']

classifiers = [SVC(kernel='sigmoid', C=1.0),
               SVC(kernel='sigmoid', C=0.8),
               SVC(kernel='sigmoid', C=0.6),
               SVC(kernel='sigmoid', C=0.2)]

plot_decision_bounds(names, classifiers);
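If you want to see what this particular ruler actually computes, scikit-learn exposes the sigmoid kernel, K(x, x') = tanh(gamma * <x, x'> + coef0), in sklearn.metrics.pairwise. A small sketch (the gamma and coef0 values here are illustrative, not SVC's defaults):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import sigmoid_kernel

X, _ = make_moons(noise=0.3, random_state=0)

# Kernel matrix the sigmoid-kernel SVM reasons with:
# K[i, j] = tanh(gamma * <X[i], X[j]> + coef0)
gamma, coef0 = 0.5, 0.0   # illustrative values only

K = sigmoid_kernel(X, gamma=gamma, coef0=coef0)
K_manual = np.tanh(gamma * (X @ X.T) + coef0)

print(np.allclose(K, K_manual))   # True - the two computations agree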

Summary
- Using M&M's as an analogy, we can see that support vector machines attempt to separate our pile of M&M's as effectively as possible.
- Specifying the kernel type is akin to using different shaped rulers for separating the M&M pile.
- The specific method that works best will be data-dependent.
- Support Vector Machines are, to this day, a top-performing machine learning algorithm, and the method they use is intuitive if presented in the right way.
- One drawback of SVMs is that the computation time to train them scales roughly quadratically (or worse) with the size of the dataset, so on large datasets the training time can be astronomical (see the rough timing sketch below).
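As a rough illustration of that cost (a hedged sketch - exact timings depend on your machine, the data, and the solver), you can time SVC.fit on increasingly large synthetic datasets:

import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

for n in [1000, 5000, 10000]:
    X, y = make_classification(n_samples=n, n_features=2, n_redundant=0,
                               n_informative=2, random_state=1)
    start = time.time()
    SVC(kernel='rbf', C=1.0).fit(X, y)
    print('n=%d: fit took %.2f seconds' % (n, time.time() - start))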
The top performers were:
- Dataset 1: RBF Kernel with C=1.0 (Score=0.95)
- Dataset 2: Poly Kernel with Degree=4 (Score=0.88)
- Dataset 3: Tie between Poly Kernel, Degree=1 and all four C-variants of the RBF Kernel (Score=0.95)
I hope you enjoyed this post!