Selecting a Machine Learning Algorithm

April 14, 2019
scikit-learn ml data engineering

Let’s Code: Dependencies and Toy Data

I am going to begin by creating a synthetic dataset that we can work with.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, Y = make_classification(n_samples = 5000,
                           n_features = 15,
                           n_informative = 2,
                           n_redundant = 10,
                           n_classes = 2,
                           random_state = 8)

We’re going to split our dataset up into a training set and a test set. The test set will not get exposed to the machine learning algorithms during training and will only be used to test algorithms once they’ve been trained. We’re going to set our test size to 20% of the dataset size - in this case 1000 samples.

test_size = 0.2
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size = test_size,
                                                    random_state = 8,
                                                    shuffle = True)
print(X_train.shape)
(4000, 15)
print(X_test.shape)
(1000, 15)

Now that we have our training and testing datasets, we're going to set up a pipeline of algorithms to test. We're going to spot-check the following algorithms in our baseline:

- Random Forest (RF)
- Gaussian Naive Bayes (NB)
- Linear Discriminant Analysis (LDA)
- XGBoost (XGB)
- k-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Logistic Regression (LOG)

We're going to use the default parameters for all of these in our baseline, with the exception of the SVM, which needs probability to be set to True if you want to score using log loss.
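As a quick aside (using a tiny throwaway dataset, not the one above): the neg_log_loss scorer needs class-probability estimates via predict_proba, which SVC only exposes when probability=True. A minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=100, n_features=5,
                                     random_state=0)

# probability=True enables Platt-scaled probability estimates,
# which the 'neg_log_loss' scorer consumes via predict_proba.
clf = SVC(probability=True).fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo[:3])
print(proba.shape)  # (3, 2): one probability per class, per sample
```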

%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
import warnings
import time

warnings.filterwarnings("ignore")

def algo_spotcheck(X, Y):
    
    #Declare an empty list that will contain all of our pipelines.
    algorithms=[]
    algorithms.append(['RF', RandomForestClassifier()])
    algorithms.append(['NB', GaussianNB()])
    algorithms.append(['LDA', LinearDiscriminantAnalysis()])
    algorithms.append(['XGB', XGBClassifier()])
    algorithms.append(['KNN', KNeighborsClassifier()])
    algorithms.append(['SVM', SVC(probability = True)])
    algorithms.append(['LOG', LogisticRegression()])
    
    results = []
    names = []
    scoring = 'neg_log_loss'
    start=time.time()
    summary=[]
    
    #For each algorithm, test performance using 10-fold cross validation 
    #and log the results
    for name, algo in algorithms:
        kfold = KFold(n_splits=10, shuffle=True, random_state=7)
        cv_results = cross_val_score(algo, X, Y, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)    
        summary.append([cv_results.mean(), cv_results.std(), name]) 
    
    #Rank the results and display the top performer
    
    summary = sorted(summary, reverse=True)
    print('')
    print('Top performer: %s %f (%f)' % (summary[0][2], summary[0][0], summary[0][1]))
    
    # boxplot algorithm comparison
    mpl.style.use('fivethirtyeight')

    print('')
    print('Time to spot-check baseline algorithms: %.2f seconds' % (time.time() - start))
    fig = plt.figure(figsize = (12,8))
    fig.suptitle('Algorithm Spot-Check Comparison')
    ax = fig.add_subplot(111)
    plt.boxplot(results)
    ax.set_xticklabels(names)
        
    return results

results = algo_spotcheck(X_train, Y_train)
RF: -0.258714 (0.133219)
NB: -0.665659 (0.127053)
LDA: -0.229887 (0.033906)
XGB: -0.157936 (0.050807)
KNN: -0.563349 (0.160962)
SVM: -0.134562 (0.026416)
LOG: -0.230040 (0.033473)

Top performer: SVM -0.134562 (0.026416)

Time to spot-check baseline algorithms: 14.48 seconds

(Box plot: Algorithm Spot-Check Comparison)

With negative log loss, a score closer to zero is better. Log loss is a stronger metric than plain accuracy for binary classification because it punishes confident predictions that are wrong more heavily than uncertain predictions that are wrong.
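To make that concrete, here's a quick sketch using scikit-learn's log_loss on two hypothetical single-sample predictions where the true label is class 1:

```python
from sklearn.metrics import log_loss

# True label is class 1; columns are P(class 0), P(class 1)
confident_wrong = log_loss([1], [[0.99, 0.01]], labels=[0, 1])
neutral = log_loss([1], [[0.50, 0.50]], labels=[0, 1])

print(round(confident_wrong, 3))  # 4.605 = -ln(0.01)
print(round(neutral, 3))          # 0.693 = -ln(0.5)
```

The confidently wrong prediction costs roughly 6.6x as much as the fence-sitting one, which is exactly the behaviour we want when ranking classifiers.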

Running our baseline spot-check shows that the top-performing algorithm on this test harness and dataset is the SVM, with XGBoost close behind. Keep in mind that this spot-check is done without any data transformations, using only the baseline defaults for each algorithm. It might be that we didn't give some of the algorithms a fair shot because the data wasn't presented properly: distance-based algorithms such as KNN and SVM can suffer performance degradation on un-scaled data, while tree-based methods such as XGBoost are largely unaffected.

Let’s Be Fair: Giving the Algorithms a Fair Shake

We can modify our spot-checking function to make it more robust and give all the algorithms a fair shake. Let's change it so that it runs two comparisons - one using unscaled data and the other using scaled data. Then we can merge the results, plot, and compare.

We can leverage the Pipeline class in scikit-learn to apply a standard scaler to the data during each fold of the cross-validation process.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def algo_spotcheck(X,Y):
    
    #Declare an empty list that will contain all of our pipelines.
    algorithms=[]
    algorithms.append(['RF', RandomForestClassifier()])
    algorithms.append(['RF(s)',  Pipeline([('Scale',StandardScaler()),
                                           ('RF',RandomForestClassifier())])])
    algorithms.append(['NB', GaussianNB()])
    algorithms.append(['NB(s)',  Pipeline([('Scale',StandardScaler()),
                                           ('NB',GaussianNB())])])
    algorithms.append(['LDA', LinearDiscriminantAnalysis()])
    algorithms.append(['LDA(s)',  Pipeline([('Scale',StandardScaler()),
                                            ('LDA',LinearDiscriminantAnalysis())])])
    algorithms.append(['XGB', XGBClassifier()])
    algorithms.append(['XGB(s)',  Pipeline([('Scale',StandardScaler()),
                                            ('XGB',XGBClassifier())])])
    algorithms.append(['KNN', KNeighborsClassifier()])
    algorithms.append(['KNN(s)',  Pipeline([('Scale',StandardScaler()),
                                            ('KNN',KNeighborsClassifier())])])
    algorithms.append(['SVM', SVC(probability=True)])
    algorithms.append(['SVM(s)',  Pipeline([('Scale',StandardScaler()),
                                            ('SVM',SVC(probability=True))])])
    algorithms.append(['LOG', LogisticRegression()])
    algorithms.append(['LOG(s)',  Pipeline([('Scale',StandardScaler()),
                                            ('LOG',LogisticRegression())])])
        
    results = []
    names = []
    scoring = 'neg_log_loss'
    start=time.time()
    summary=[]
    
    #For each algorithm, test performance using 10-fold cross validation 
    #and log the results
    for name, algo in algorithms:
        kfold = KFold(n_splits=10, shuffle=True, random_state=7)
        cv_results = cross_val_score(algo, X, Y, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)    
        summary.append([cv_results.mean(), cv_results.std(), name]) 
    
    #Rank the results and display the top performer
    
    summary = sorted(summary,reverse=True)
    print('')
    print('Top performer: %s %f (%f)' % (summary[0][2], summary[0][0], summary[0][1]))
    
    # boxplot algorithm comparison
    mpl.style.use('fivethirtyeight')

    print('')
    print('Time to spot-check baseline algorithms: %.2f seconds' % (time.time() - start))
    fig = plt.figure(figsize = (12,8))
    fig.suptitle('Algorithm Spot-Check Comparison')
    ax = fig.add_subplot(111)
    plt.boxplot(results)
    ax.set_xticklabels(names)
        
    return

algo_spotcheck(X_train, Y_train)
RF: -0.253002 (0.116676)
RF(s): -0.242979 (0.133891)
NB: -0.665659 (0.127053)
NB(s): -0.665659 (0.127053)
LDA: -0.229887 (0.033906)
LDA(s): -0.229887 (0.033906)
XGB: -0.157936 (0.050807)
XGB(s): -0.157936 (0.050807)
KNN: -0.563349 (0.160962)
KNN(s): -0.588838 (0.210365)
SVM: -0.134595 (0.026408)
SVM(s): -0.133659 (0.026629)
LOG: -0.230040 (0.033473)
LOG(s): -0.230038 (0.033456)

Top performer: SVM(s) -0.133659 (0.026629)

Time to spot-check baseline algorithms: 29.24 seconds

(Box plot: Algorithm Spot-Check Comparison, scaled and unscaled variants)

Conclusion

We now have a more robust test harness that spot-checks the algorithms on both scaled and un-scaled data. Note that when a Pipeline object gets passed to cross_val_score, the pipeline's steps are executed sequentially within each fold - the scaler is fit only on that fold's training split, so no information leaks from the held-out split into the scaling.
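A minimal sketch of this per-fold behaviour, using a fresh toy dataset and logistic regression standing in for the full algorithm list:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=15,
                                     random_state=8)

pipe = Pipeline([('Scale', StandardScaler()),
                 ('LOG', LogisticRegression())])

# Within each fold, cross_val_score fits the whole pipeline on the
# training split (scaler fit_transform, then classifier fit) and only
# applies the already-fitted scaler to the held-out split.
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(pipe, X_demo, y_demo, cv=kfold,
                         scoring='neg_log_loss')
print(scores.mean())
```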

So far so good. But can we do better? Of course. The issue at hand is that each of these algorithms has hyperparameters that can be tuned. One of the other algorithms, appropriately tuned, might exceed the performance of the SVM - and the SVM's own score might improve further with proper tuning.

The problem with doing this, especially on big datasets, is that you need a lot of computing power to scan the entire solution space. In Part II we'll adapt these functions to incorporate intelligent search with Hyperopt.
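To see why the cost balloons, consider a brute-force grid search over just two SVM hyperparameters: every grid point is refit once per cross-validation fold. The grid values below are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=300, n_features=15,
                                     n_informative=2, n_redundant=10,
                                     random_state=8)

# 3 values of C x 3 values of gamma x 3 folds = 27 model fits,
# and the count multiplies with every parameter added to the grid.
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01, 0.1]}
grid = GridSearchCV(SVC(probability=True), param_grid,
                    scoring='neg_log_loss', cv=3)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

Intelligent search methods like Hyperopt spend those same fit budgets on more promising regions of the space instead of sweeping it exhaustively.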

Thanks for reading!
