Introduction
In a previous post, I demonstrated an algorithm to spot-check the out-of-the-box performance of machine learning algorithms. In this post, we're going to build a new spot-checking algorithm using Hyperopt. In the first part of this series, our spot-checking algorithm used the default sklearn configuration for each algorithm and returned the score of the top-performing one.
By integrating Hyperopt into the spot-checking algorithm, we can perform a quick, intelligent search of each solution space and return the best result Hyperopt finds after tuning each algorithm's parameters.
With some alteration and sufficient computing power, this code can be used as a framework for completely automating the algorithm selection and tuning process (set it and forget it).
We start by importing our libraries and declaring the same dataset that we used in Part 1.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
import warnings
import time
import pprint
warnings.filterwarnings("ignore")
mpl.style.use('fivethirtyeight')
#create training and testing dataset
X, Y = make_classification(n_samples=5000,
                           n_features=15,
                           n_informative=2,
                           n_redundant=10,
                           n_classes=2,
                           random_state=8)

test_size = 0.2
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=test_size,
                                                    random_state=8,
                                                    shuffle=True)
The next step is to declare objective functions for each of our algorithms. If you have read my other posts on Hyperopt, you might recall that Hyperopt (like other optimization algorithms) requires a cost function to minimize. In this case, we're trying to minimize our log-loss score.
To that end, we declare an objective function for each machine learning algorithm, with the mean log-loss score from the cross-validation cycle as one of the outputs.
def ob_1(space):
    model = XGBClassifier(**space)
    kfold = KFold(n_splits=10, random_state=7, shuffle=True)
    score = -cross_val_score(model, X_train, Y_train, cv=kfold,
                             scoring='neg_log_loss', verbose=False, n_jobs=-1)
    return space, score.mean(), score.std()

def ob_2(space):
    model = RandomForestClassifier(**space)
    kfold = KFold(n_splits=10, random_state=7, shuffle=True)
    score = -cross_val_score(model, X_train, Y_train, cv=kfold,
                             scoring='neg_log_loss', verbose=False, n_jobs=-1)
    return space, score.mean(), score.std()

def ob_3(space):
    #GaussianNB takes no tuning parameters, so the dummy space is ignored.
    model = GaussianNB()
    kfold = KFold(n_splits=10, random_state=7, shuffle=True)
    score = -cross_val_score(model, X_train, Y_train, cv=kfold,
                             scoring='neg_log_loss', verbose=False, n_jobs=-1)
    return space, score.mean(), score.std()

def ob_4(space):
    model = LinearDiscriminantAnalysis(**space)
    kfold = KFold(n_splits=10, random_state=7, shuffle=True)
    score = -cross_val_score(model, X_train, Y_train, cv=kfold,
                             scoring='neg_log_loss', verbose=False, n_jobs=-1)
    return space, score.mean(), score.std()

def ob_5(space):
    model = KNeighborsClassifier(**space)
    kfold = KFold(n_splits=10, random_state=7, shuffle=True)
    score = -cross_val_score(model, X_train, Y_train, cv=kfold,
                             scoring='neg_log_loss', verbose=False, n_jobs=-1)
    return space, score.mean(), score.std()

def ob_6(space):
    model = SVC(**space)
    kfold = KFold(n_splits=10, random_state=7, shuffle=True)
    score = -cross_val_score(model, X_train, Y_train, cv=kfold,
                             scoring='neg_log_loss', verbose=False, n_jobs=-1)
    return space, score.mean(), score.std()

def ob_7(space):
    model = LogisticRegression(**space)
    kfold = KFold(n_splits=10, random_state=7, shuffle=True)
    score = -cross_val_score(model, X_train, Y_train, cv=kfold,
                             scoring='neg_log_loss', verbose=False, n_jobs=-1)
    return space, score.mean(), score.std()
The next function we declare is our hyperparameter search space function.
Each algorithm in sklearn has parameters that can be 'tuned'. Hyperopt requires a search space as an input, so we need to declare one for each of our machine learning algorithms. I won't go into what each parameter does here; the sklearn documentation explains them very well.
I try to clean this process up by using a dictionary of search spaces (a dictionary of dictionaries).
def GetSpace(space):
    Space_List = {
        #XGBoost hyperparameter search space.
        'XGB': {
            'max_depth': hp.choice('x_max_depth', [2, 3, 4, 5, 6]),
            'min_child_weight': hp.choice('x_min_child_weight', np.round(np.arange(0.0001, 0.2, 0.0001), 5)),
            'gamma': hp.choice('x_gamma', np.round(np.arange(0.0, 10.0, 0.005), 5)),
            'learning_rate': hp.choice('x_learning_rate', np.round(np.arange(0.005, 0.3, 0.01), 5)),
            'subsample': hp.choice('x_subsample', np.round(np.arange(0.01, 1.0, 0.01), 5)),
            'colsample_bylevel': hp.choice('x_colsample_bylevel', np.round(np.arange(0.1, 1.0, 0.01), 5)),
            'colsample_bytree': hp.choice('x_colsample_bytree', np.round(np.arange(0.1, 1.0, 0.01), 5)),
            'n_estimators': hp.choice('x_n_estimators', [50, 100, 150, 200, 500])
        },
        #Random Forest hyperparameter search space.
        'RFC': {
            'n_estimators': hp.choice('x_n_estimators', np.arange(50, 750, 5)),
            'max_depth': hp.choice('x_max_depth', np.arange(1, 8, 1)),
            'max_features': hp.choice('x_max_features', ['sqrt', 'log2', None]),
            'criterion': hp.choice('x_criterion', ['gini', 'entropy']),
            'min_samples_split': hp.choice('x_min_samples_split', np.arange(0.000001, 0.3, 0.00005)),
            'min_samples_leaf': hp.choice('x_min_samples_leaf', np.arange(0.000001, 0.3, 0.00005)),
            'min_impurity_decrease': hp.choice('x_min_impurity_decrease', np.arange(0.000001, 0.3, 0.00005)),
            'n_jobs': hp.choice('x_n_jobs', [-1])
        },
        #Naive Bayes has no major hyperparameters to tune, so we pass a dummy space.
        'NBC': {'dummy': hp.choice('x_dummy', [1])},
        #LDA hyperparameter search space.
        'LDA': {
            'solver': hp.choice('x_solver', ['lsqr', 'eigen']),
            'shrinkage': hp.choice('x_shrinkage', np.arange(0.0, 1, 0.01))
        },
        #KNN hyperparameter search space.
        'KNN': {
            'n_neighbors': hp.choice('x_n_neighbors', np.arange(1, 75, 1))
        },
        #Support Vector Machine hyperparameter search space.
        'SVM': {
            #C must be strictly positive, so the range starts above zero.
            'C': hp.choice('x_C', np.arange(0.00005, 1.0, 0.00005)),
            'kernel': hp.choice('x_kernel', ['poly', 'rbf']),
            'degree': hp.choice('x_degree', [2, 3, 4, 5]),
            'probability': hp.choice('x_probability', [True])
        },
        #Logistic Regression hyperparameter search space.
        'LRC': {
            'penalty': hp.choice('x_penalty', ['l1']),
            #C must be strictly positive, so the range starts above zero.
            'C': hp.choice('x_C', np.round(np.arange(0.00001, 1.0, 0.00001), 5)),
            'solver': hp.choice('x_solver', ['liblinear', 'saga'])
        },
    }
    return Space_List[space]
Now that we have declared all of our search spaces, I create an iterable list of all our search space keys and their corresponding objective functions.
We then loop through this list, run a Hyperopt optimization for each algorithm, and append the results to the list 'results'.
algorithms = [['XGB', ob_1],
              ['RFC', ob_2],
              ['NBC', ob_3],
              ['LDA', ob_4],
              ['KNN', ob_5],
              ['SVM', ob_6],
              ['LRC', ob_7]]

results = []
start = time.time()

for spc, function in algorithms:
    print('Scanning ' + spc + '...')
    #This is the objective function that is passed to Hyperopt.
    def f(space):
        space, mean_score, score_std = function(space)
        results.append([mean_score, score_std, space, spc])
        return {'loss': mean_score, 'status': STATUS_OK}
    #fetch our search space and allow Hyperopt 25 iterations
    #to find an optimal solution.
    space = GetSpace(spc)
    trials = Trials()
    best = fmin(fn=f,
                space=space,
                algo=tpe.suggest,
                max_evals=25,
                trials=trials)

#sort by mean log-loss, so the lowest (best) score comes first
results = sorted(results)
print('')
print('Best score %f (%f) achieved using: %s' % (results[0][0], results[0][1], results[0][3]))
print('')
print('Best parameters: ')
print('')
pprint.pprint(results[0][2])
print('')
print('Time to spot-check using Hyperopt: %.2f seconds' % (time.time() - start))
Scanning XGB...
100%|██████████| 25/25 [00:32<00:00, 1.30s/trial, best loss: 0.11899569997331128]
Scanning RFC...
100%|██████████| 25/25 [03:20<00:00, 8.02s/trial, best loss: 0.18186662213868968]
Scanning NBC...
100%|██████████| 25/25 [00:00<00:00, 37.88trial/s, best loss: 0.6668993490975412]
Scanning LDA...
100%|██████████| 25/25 [00:01<00:00, 20.64trial/s, best loss: 0.2297824756575503]
Scanning KNN...
100%|██████████| 25/25 [00:02<00:00, 10.19trial/s, best loss: 0.1847869290989434]
Scanning SVM...
100%|██████████| 25/25 [02:03<00:00, 4.93s/trial, best loss: 0.133770693552004]
Scanning LRC...
100%|██████████| 25/25 [03:25<00:00, 8.22s/trial, best loss: 0.22981320885740003]
Best score 0.118996 (0.014104) achieved using: XGB
Best parameters:
{'colsample_bylevel': 0.77,
'colsample_bytree': 0.83,
'gamma': 7.38,
'learning_rate': 0.135,
'max_depth': 5,
'min_child_weight': 0.0249,
'n_estimators': 100,
'subsample': 0.36}
Time to spot-check using Hyperopt: 602.05 seconds