Nerdy Introduction to Neural Networks - from Scratch

April 14, 2019
scikit-learn ml data engineering

Introduction

In this notebook, I will explain how to implement a neural network from scratch and test it on the version of the MNIST dataset provided within Scikit-Learn. I will specifically illustrate the use of Python classes to define the layers of the network as objects. Each layer object has forward and backward propagation methods, which leads to compact, easily readable code.

# Imports

import numpy as np
from sklearn.datasets import load_digits

Data Preparation

After loading the data, we divide it into three parts: training, validation, and testing sets. The validation set is used to choose the hyperparameters (i.e. the number and size of hidden layers and the regularization parameter), and the testing set is a separate piece of data for assessing the final model's performance.

## Load the data and reshape images
digits = load_digits()
n_samples = len(digits['images'])
data = digits['images'].reshape((n_samples, -1))
targets = digits['target']

## Train-test splitting
from sklearn.model_selection import train_test_split
X, X_test, y, y_test = train_test_split(data, targets,
                                        test_size=0.33,
                                        random_state=111)

## Train and validation splitting
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      test_size=0.40,
                                                      random_state=123)

Activation Function and Classifier

We will use the sigmoid activation function \(\sigma(x) = 1/(1+e^{-x})\) in the layers of the network. For the classifier, we will use the softmax function, which results in class probabilities:

\[ {\rm softmax}_i({\bf Y}) = \frac{e^{Y_i}}{\sum_{k=1}^{K} e^{Y_k}} \] where \(Y_i\) is the \(i\)-th component of the neural network's output vector for a given example, and \(K\) is the number of classes (10 in our case). Both functions are implemented below:

# define the activation function, the Softmax classifier

# sigmoid function
def sigmoid(X):
    return 1.0 / (1.0 + np.exp(-X))

# Softmax function
def softmax(X):
    temp = np.max(X, axis = 0) # determine the maximum of each column
    X = X - temp # subtracting the max doesn't change the outcome
    
    return np.exp(X) / np.sum(np.exp(X), axis = 0)

In the above implementations, the assumption is that the arguments of both functions are cast into a matrix columnwise, so that each column represents one example: \({\bf X} \in {\mathbb R}^{n_{\rm feat} \times N}\), where \(N\) is the number of examples and \(n_{\rm feat}\) is the number of features (i.e. the number of pixels in our case). In softmax, we subtract the maximum of each column from each training example vector, an operation that does not change the outcome, for numerical stability.
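As a quick sanity check of the stability trick (a standalone sketch, with `softmax` re-defined inline): shifting each column by its maximum leaves the probabilities unchanged, each column sums to one, and large inputs no longer overflow:

```python
import numpy as np

def softmax(X):
    X = X - np.max(X, axis=0)  # column-wise shift for numerical stability
    return np.exp(X) / np.sum(np.exp(X), axis=0)

# Two example columns; the second would overflow np.exp without the shift
X = np.array([[1.0, 1000.0],
              [2.0, 1001.0],
              [3.0, 1002.0]])
P = softmax(X)

print(np.sum(P, axis=0))  # each column sums to 1
# The two columns differ only by a constant shift, so their
# softmax outputs are identical
print(np.allclose(P[:, 0], P[:, 1]))
```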

Now that we have our data, the activation function and the classifier, we can construct the layers of the network.

Linear Update

First, we define the class LinearUpdate which performs the linear transformation of the (derived) features in the current layer:

\[{\bf Y}^{(i)} = {\bf W}^{T} \cdot {\bf A}^{(i)} + {\bf b}\]

In this equation, \({\bf A}^{(i)} \in {\mathbb R}^{n_{\rm in}}\) represents the current layer state (i.e. features in the case of the input layer and derived features in the case of hidden layers), with \(n_{\rm in}\) being the number of neurons. \({\bf Y}^{(i)} \in {\mathbb R}^{n_{\rm out}}\) is the linear output of the current layer (which will later be the argument of an activation function) and is passed as the input to the next layer with \(n_{\rm out}\) neurons. The superscript \((i)\) refers to the training example being considered. Instead of using a for loop over the training examples, we can cast them into a matrix where each column is one training example vector \({\bf Y}^{(i)}\), i.e.

\[Y^{(i)}_j \rightarrow Y_{ji} : {\bf Y} \in {\mathbb R}^{n_{\rm out} \times N}\]

where \(N\) is the number of training examples. Similarly, \(A^{(i)}_j \rightarrow A_{ji} : {\bf A} \in {\mathbb R}^{n_{\rm in} \times N}\).

# Define Linear Update
class LinearUpdate(object):
    
    def __init__(self, n_in, n_out, W = None, b = None):
        # Initialize W randomly (Glorot/Xavier uniform range)
        eps = np.sqrt(6.0) / np.sqrt(n_in + n_out)
        self.W = (np.random.uniform(-eps, eps, (n_in, n_out)) if 
                  W is None else W)
        # Initialize biases as zero unless given
        self.b = np.zeros((n_out, 1)) if b is None else b
        
    def forward(self, A):
        """
        Forward propagation method: A is the current state; the
        method returns the next linear state Y = W.T @ A + b.
        """
        return np.dot(self.W.T, A) + self.b
    
    def backward(self, A, gY):
        """
        Backward propagation method: A is current state, gY is
        backpropagated derivative from next layer.
        """
        gW = np.dot(A, gY.T) # dL/dW_ab = sum_i A_ai * (gY)_bi
        gA = np.dot(self.W, gY) # dL/dA_ab = sum_j W_aj * (gY)_jb
        gb = np.sum(gY, axis = 1) # dL/db_l = sum_i (gY)_li
        
        return gA, gW, gb
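As a sanity check on the backward method, we can compare the analytic gradient `gW = A @ gY.T` against central finite differences. The sketch below is self-contained (it re-implements the forward pass inline rather than using the class) and uses a scalar loss \(L = \sum_{ij} Y_{ij}\), for which the upstream gradient `gY` is all ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, N = 4, 3, 5
W = rng.normal(size=(n_in, n_out))
b = rng.normal(size=(n_out, 1))
A = rng.normal(size=(n_in, N))

forward = lambda W: np.dot(W.T, A) + b  # Y = W^T A + b
loss = lambda W: np.sum(forward(W))     # scalar loss L = sum(Y)

gY = np.ones((n_out, N))  # dL/dY for L = sum(Y)
gW = np.dot(A, gY.T)      # analytic gradient: dL/dW = A gY^T

# Numerical gradient by central differences
h = 1e-6
gW_num = np.zeros_like(W)
for i in range(n_in):
    for j in range(n_out):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        gW_num[i, j] = (loss(Wp) - loss(Wm)) / (2 * h)

print(np.max(np.abs(gW - gW_num)))  # agreement to within floating-point error
```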
## Define Logistic Update
class LogisticUpdate(object):
    
    def forward(self, Z):
        """ Sigmoid activation: """
        
        return sigmoid(Z) # Y = sigmoid(Z)
    
    def backward(self, Y, grad_out):
        """Backward propagation: """
        return np.multiply(Y * (1 - Y), grad_out) # dL/dZ = dL/dY * Y * (1-Y)
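The identity \(dY/dZ = Y(1-Y)\) used in `backward` is easy to verify numerically (a standalone sketch, with `sigmoid` re-defined inline):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

Z = np.linspace(-3, 3, 7)
Y = sigmoid(Z)

analytic = Y * (1 - Y)  # derivative expressed through the output Y

# Central finite differences for comparison
h = 1e-6
numeric = (sigmoid(Z + h) - sigmoid(Z - h)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # a very small finite-difference error
```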
# Define Softmax classifier layer
class SoftmaxClassifier(object):
    
    def forward(self, A):
        """
        Given state A, produces output probs P by softmax
        """
        return softmax(A)
    
    def backward(self, P, T):
        """
        Given output probs P and targets T, produces output
        gradient.
        """
        expansionMatrix = np.eye(P.shape[0], dtype = int)
        expandedTarget = expansionMatrix[:, T]
        
        return P - expandedTarget # No division by number of samples yet
    
    def crossEntropy(self, P, T):
        """
        Computes cross entropy
        """
        expansionMatrix = np.eye(P.shape[0], dtype = int)
        expandedTarget = expansionMatrix[:, T]
        
        CE = -np.sum(expandedTarget * np.log(P + 1e-30)) / P.shape[1] # average over the N examples
        
        return CE
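The one-hot expansion trick used in `backward` and `crossEntropy` deserves a quick illustration: indexing the columns of an identity matrix by the integer targets builds the one-hot target matrix directly, so the output gradient is simply `P` minus the one-hot targets. A standalone sketch (with hypothetical small values for the class count and targets):

```python
import numpy as np

K, N = 4, 3              # 4 classes, 3 examples (hypothetical)
T = np.array([2, 0, 3])  # integer targets for the 3 examples

# Column j of the result is the standard basis vector e_{T[j]}
onehot = np.eye(K, dtype=int)[:, T]
print(onehot)
# [[0 1 0]
#  [0 0 0]
#  [1 0 0]
#  [0 0 1]]
```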
## Define Layers (combine linear and logistic updates)
class Layer(object):
    def __init__ (self, n_in, n_out):
        self.linear = LinearUpdate(n_in, n_out)
        self.logistic = LogisticUpdate()
        
    def forward(self, A):
        """ Forward propagation method """
        return self.logistic.forward(self.linear.forward(A))
    
    def backward(self, A, Y, grad_out):
        """ Backward propagation method """
        # First the derivative of the logistic State
        gZ = self.logistic.backward(Y, grad_out)
        # Then, gZ becomes gY for the linear state
        gA, gW, gb = self.linear.backward(A, gZ) 
        
        return gA, gW, gb
## Construct a two hidden layer network
class nnet(object):
    def __init__ (self, nInput, nHidden1, nHidden2, K):
        """ Initiate method for the net """
        self.inputLayer = Layer(nInput, nHidden1)     # Input layer
        self.hiddenLayer1 = Layer(nHidden1, nHidden2) # 1st hidden layer
        self.hiddenLayer2 = Layer(nHidden2, K)        # 2nd hidden layer
        self.classifier = SoftmaxClassifier()         # Output: classification
    
    def forward(self, input_train):
        """ Perform forward propagation through all layers """
        A1 = input_train.T                 # Initial data
        A2 = self.inputLayer.forward(A1)   # inp -> hid1 
        A3 = self.hiddenLayer1.forward(A2) # hid1 -> hid2
        A4 = self.hiddenLayer2.forward(A3) # hid2 -> out
        P = self.classifier.forward(A4)    # output probabilities
        
        return A1, A2, A3, A4, P
    
    def backward(self, P, A4, A3, A2, A1, T, lam = 0.0):
        """ Back propagation method through all layers """
        grad_out = self.classifier.backward(P, T)                    # output grads
        gA3, gW3, gb3 = self.hiddenLayer2.backward(A3, A4, grad_out) # hid2 grads
        gA2, gW2, gb2 = self.hiddenLayer1.backward(A2, A3, gA3)      # hid1 grads
        gA1, gW1, gb1 = self.inputLayer.backward(A1, A2, gA2)        # input grads
        
        # Add regularization terms to the gradients
        if (lam > 0.0):
            gW3 += lam * self.hiddenLayer2.linear.W
            gW2 += lam * self.hiddenLayer1.linear.W
            gW1 += lam * self.inputLayer.linear.W
        
        return grad_out, gW3, gb3, gW2, gb2, gW1, gb1
    
    def Loss(self, P, T, lam = 0.0):
        """ Method for computing loss """        
        # Regularization term
        reg = np.sum(self.inputLayer.linear.W**2) + np.sum(self.hiddenLayer1.linear.W**2) + \
              np.sum(self.hiddenLayer2.linear.W**2)
        
        # Full loss
        return self.classifier.crossEntropy(P, T) + (0.5*lam/P.shape[1]) * reg
def trainNetwork(epochs, learningRate, momentum, lam, n_hidden_1, n_hidden_2, K):
    """ Function for training the network """
    
    ## Construct network
    network = nnet(n_input, n_hidden_1, n_hidden_2, K)
    
    ## Initiate velocities
    v_W1 = np.zeros_like(network.inputLayer.linear.W) 
    v_W2 = np.zeros_like(network.hiddenLayer1.linear.W)
    v_W3 = np.zeros_like(network.hiddenLayer2.linear.W)
    v_b1 = np.zeros_like(network.inputLayer.linear.b) 
    v_b2 = np.zeros_like(network.hiddenLayer1.linear.b)
    v_b3 = np.zeros_like(network.hiddenLayer2.linear.b)
    
    ## Train
    losses = {'train':[], 'validation':[]}
    for epoch in range(epochs):
        # Feed forward 
        A1, A2, A3, A4, P = network.forward(X_train)
        
        # Backprop
        grad_out, gW3, gb3, gW2, gb2, gW1, gb1 = network.backward(P, A4, A3, A2, A1, y_train, lam)
    
        # Update weights by momentum method (Divide by the number of examples in the batch)
        v_W1 = momentum * v_W1 - learningRate * gW1/n_train
        v_W2 = momentum * v_W2 - learningRate * gW2/n_train
        v_W3 = momentum * v_W3 - learningRate * gW3/n_train
    
        network.hiddenLayer2.linear.W += v_W3
        network.hiddenLayer1.linear.W += v_W2
        network.inputLayer.linear.W += v_W1
    
        # Update biases by momentum method (Divide by the number of examples in the batch)
        v_b1 = momentum * v_b1 - learningRate * gb1[:,None]/n_train
        v_b2 = momentum * v_b2 - learningRate * gb2[:,None]/n_train
        v_b3 = momentum * v_b3 - learningRate * gb3[:,None]/n_train
    
        network.hiddenLayer2.linear.b += v_b3
        network.hiddenLayer1.linear.b += v_b2
        network.inputLayer.linear.b += v_b1
    
        # Compute loss on training set
        loss_train = network.Loss(P, y_train, lam)
        losses['train'].append(loss_train)
    
        # Compute loss on validation set
        A1, A2, A3, A4, P = network.forward(X_valid)
        loss_valid = network.Loss(P, y_valid, lam)
        losses['validation'].append(loss_valid)
        
        # Compute validation accuracy
        pred = np.argmax(P, axis = 0)
        
        # Accuracy
        acc = np.sum((pred == y_valid).astype(float))/y_valid.shape[0]

    # Return the trained network and the losses
    return network, losses, acc
## Setup: For this example, just vary the regularization parameter
epochs=1000
learningRate = 0.2
momentum = 0.9

# Network structure
n_hidden_1 = 48
n_hidden_2 = 32
K = 10
n_input = X_train.shape[1]
n_train = X_train.shape[0]

# Values of lambda to try
lambda_l2 = np.logspace(-2,2,10)

# Train
dict_acc = {'lambda':[], 'acc':[]}; count=1
for lam in lambda_l2:
    network, _, acc = trainNetwork(epochs, learningRate, momentum, lam, n_hidden_1, n_hidden_2, K)
    dict_acc['lambda'].append(lam)
    dict_acc['acc'].append(acc)
    print("Trained %d of %d networks" % (count, lambda_l2.shape[0]))
    count +=1
Trained 1 of 10 networks
Trained 2 of 10 networks
Trained 3 of 10 networks
Trained 4 of 10 networks
Trained 5 of 10 networks
Trained 6 of 10 networks
Trained 7 of 10 networks
Trained 8 of 10 networks
Trained 9 of 10 networks
Trained 10 of 10 networks
## Plot the training and validation losses
%matplotlib inline

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.0;

plt.plot(dict_acc['lambda'],dict_acc['acc'])
plt.xlabel('lambda')
plt.ylabel('Accuracy');

(figure: validation accuracy vs. regularization parameter lambda)

# Best lambda
lam_best = lambda_l2[np.argmax(dict_acc['acc'])]
best_validation_acc = max(dict_acc['acc'])

# One final training with the best model
network, losses, acc = trainNetwork(epochs, learningRate,
                                    momentum, lam_best,
                                    n_hidden_1, n_hidden_2, K)

## Predict on the test set
A1, A2, A3, A4, P = network.forward(X_test)

# Predict digits
pred_test = np.argmax(P, axis = 0)
        
# Accuracy
acc_test = np.sum((pred_test == y_test).astype(float))/y_test.shape[0]

print("Best lambda: %.6f, Validation set accuracy: %.6f" % (lam_best, acc))
print("Test set accuracy: %.6f" % acc_test)
Best lambda: 0.027826, Validation set accuracy: 0.966805
Test set accuracy: 0.962963
## Plot the training and validation losses
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

losses_df = pd.DataFrame()
losses_df['train'] = losses['train']
losses_df['validation'] = losses['validation']

plt.plot(losses_df['train'], label='Training loss')
plt.plot(losses_df['validation'], label='Validation loss')
plt.legend();

(figure: training and validation loss vs. epoch)
