janmr blog

Neural Networks - Implementation

This post will describe how to implement a simple, trainable neural network in Python using NumPy.

The components needed have already been described in previous posts: evaluating the network, back-propagation, and gradient descent. We will look at each in turn.

Evaluating the network

We use the expressions from the Multiple Inputs post. We can simply loop over the layers and compute the $Z$'s and the $A$'s:

for layer in self.layers:
    Z = np.dot(layer.W, A) + layer.b
    A = layer.g[0](Z)

Note that both the activation function g[0] and its derivative g[1] must operate element-wise, so that they can be applied to a whole matrix at once.
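As an illustration, an activation pair could be stored as a tuple of two element-wise functions. This is only a sketch: the names sigmoid, sigmoid_prime, SIGMOID and IDENTITY are hypothetical and not taken from the full implementation.

```python
import numpy as np

def sigmoid(Z):
    # NumPy ufuncs such as np.exp act on every element of Z
    return 1.0 / (1.0 + np.exp(-Z))

def sigmoid_prime(Z):
    # Derivative of the sigmoid, also evaluated element-wise
    S = sigmoid(Z)
    return S * (1.0 - S)

# (activation, derivative) pairs, indexed as g[0] and g[1]
SIGMOID = (sigmoid, sigmoid_prime)
IDENTITY = (lambda Z: Z, lambda Z: np.ones_like(Z))

Z = np.array([[0.0, 1.0], [-1.0, 2.0]])
A = SIGMOID[0](Z)    # same shape as Z
dA_dZ = SIGMOID[1](Z)
```

Because the functions are built from NumPy operations, they work unchanged on a single column or on a matrix holding many samples.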

Note also that it is necessary to save the $Z$'s and the $A$'s for each layer, as they will be referenced during back-propagation.
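One way to keep these values around is to collect them in a list during the forward pass. The sketch below assumes a hypothetical LayerValues record and a stand-alone evaluate function; the layer attributes W, b and g follow the snippet above, everything else is made up for illustration.

```python
import numpy as np
from collections import namedtuple

# Hypothetical containers; only the attribute names W, b, g, Z, A
# mirror the snippets in this post.
Layer = namedtuple("Layer", ["W", "b", "g"])
LayerValues = namedtuple("LayerValues", ["Z", "A"])

def evaluate(layers, X):
    # Entry 0 holds the input: layer 0 has an A (the input X) but no Z
    values = [LayerValues(Z=None, A=X)]
    A = X
    for layer in layers:
        Z = np.dot(layer.W, A) + layer.b
        A = layer.g[0](Z)
        values.append(LayerValues(Z=Z, A=A))
    return values

rng = np.random.default_rng(0)
tanh = (np.tanh, lambda Z: 1.0 - np.tanh(Z) ** 2)
identity = (lambda Z: Z, lambda Z: np.ones_like(Z))
layers = [
    Layer(W=rng.standard_normal((3, 2)), b=np.zeros((3, 1)), g=tanh),
    Layer(W=rng.standard_normal((1, 3)), b=np.zeros((1, 1)), g=identity),
]
X = rng.standard_normal((2, 5))  # 2 input units, 5 samples
values = evaluate(layers, X)
```

With this layout, values[l] holds the Z and A of layer l, while layers[l - 1] holds the weights of layer l.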


Back-propagation

Back-propagation can be performed using the expressions from the Back-propagation Matrix-style post. Some other things to note:

  • Remember to loop through the layers in reverse.
  • There is no need to save the $dA$'s and $dZ$'s for each layer; the variables can be overwritten as we move back through the layers.

First, we need to compute $dA$, where there is a special case for the output layer:

if l == L:
    dA = (values[L].A - Y) / m
else:
    dA = np.dot(self.layers[l].W.T, dZ)

Here, $dZ$ will be from the previous iteration (and, therefore, from layer $l+1$). (Note that self.layers[l] corresponds to layer $l+1$, since the self.layers array is shifted by one; layer 0 is not needed in the array.)

The $dZ$ matrix is updated as

dZ = dA * self.layers[l - 1].g[1](values[l].Z)

Some things to note here: * does element-wise multiplication, g[1] is the first derivative of the activation function for layer $l$, and values[l].Z is $Z^l$ from the evaluation of the network.

Now we can compute

dW = np.dot(dZ, values[l - 1].A.T)
db = np.sum(dZ, axis=1, keepdims=True)

The expression for db is just an efficient way of multiplying the matrix dZ by a column vector of 1's, which sums each row over the samples.
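The equivalence is easy to check numerically. The following small sketch uses an arbitrary random matrix in place of a real dZ; the shapes (3 units, 5 samples) are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dZ = rng.standard_normal((3, 5))  # stand-in for dZ: 3 units, m = 5 samples

# Row sums, kept as a (3, 1) column so the shape matches the bias b
db_sum = np.sum(dZ, axis=1, keepdims=True)

# The same result as an explicit product with a column vector of 1's
db_ones = np.dot(dZ, np.ones((5, 1)))

same = np.allclose(db_sum, db_ones)
```

Using keepdims=True matters here: without it the result would have shape (3,) rather than (3, 1), and would no longer match the shape of the bias vector.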

Gradient Descent

What remains is a training algorithm using gradient descent. The following snippet assumes that the network's weights and biases have been initialized with (pseudo-)random numbers and performs a fixed number of steps:

for epoch in range(epochs):
    values = self.evaluate(Xs)
    dWs, dbs = self.compute_gradient(values, Ys)
    for layer, dW, db in zip(self.layers, dWs, dbs):
        layer.W -= learning_rate * dW
        layer.b -= learning_rate * db
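To see how the three pieces fit together, here is a compact sketch that assembles the fragments from this post into one runnable script. It is not the full implementation: the class names Network, Layer and Values, the cost helper, the random initialization and the hyperparameters are all assumptions made for this example. The cost is assumed to be the squared error divided by 2m, which is consistent with the output-layer expression (A - Y) / m used above.

```python
import numpy as np
from collections import namedtuple

Values = namedtuple("Values", ["Z", "A"])  # hypothetical per-layer record

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

SIGMOID = (sigmoid, lambda Z: sigmoid(Z) * (1.0 - sigmoid(Z)))
IDENTITY = (lambda Z: Z, lambda Z: np.ones_like(Z))

class Layer:
    def __init__(self, n_in, n_out, g, rng):
        self.W = rng.standard_normal((n_out, n_in))  # assumed initialization
        self.b = np.zeros((n_out, 1))
        self.g = g  # (activation, derivative)

class Network:
    def __init__(self, layers):
        self.layers = layers

    def evaluate(self, X):
        values = [Values(None, X)]  # entry 0 is the input layer
        A = X
        for layer in self.layers:
            Z = np.dot(layer.W, A) + layer.b
            A = layer.g[0](Z)
            values.append(Values(Z, A))
        return values

    def compute_gradient(self, values, Y):
        L = len(self.layers)
        m = Y.shape[1]
        dWs, dbs = [None] * L, [None] * L
        dZ = None
        for l in range(L, 0, -1):  # layers L, L-1, ..., 1
            if l == L:
                dA = (values[L].A - Y) / m
            else:
                dA = np.dot(self.layers[l].W.T, dZ)  # dZ is from layer l+1
            dZ = dA * self.layers[l - 1].g[1](values[l].Z)
            dWs[l - 1] = np.dot(dZ, values[l - 1].A.T)
            dbs[l - 1] = np.sum(dZ, axis=1, keepdims=True)
        return dWs, dbs

    def train(self, Xs, Ys, epochs, learning_rate):
        for epoch in range(epochs):
            values = self.evaluate(Xs)
            dWs, dbs = self.compute_gradient(values, Ys)
            for layer, dW, db in zip(self.layers, dWs, dbs):
                layer.W -= learning_rate * dW
                layer.b -= learning_rate * db

def cost(net, Xs, Ys):
    # Assumed cost: squared error divided by 2m
    A = net.evaluate(Xs)[-1].A
    return np.sum((A - Ys) ** 2) / (2 * Ys.shape[1])

rng = np.random.default_rng(0)
net = Network([Layer(1, 20, SIGMOID, rng), Layer(20, 1, IDENTITY, rng)])
Xs = np.linspace(0.0, np.pi, 50).reshape(1, -1)  # one sample per column
Ys = np.sin(Xs)
cost_before = cost(net, Xs, Ys)
net.train(Xs, Ys, epochs=2000, learning_rate=0.02)
cost_after = cost(net, Xs, Ys)
```

Comparing cost_before and cost_after gives a quick sanity check that the updates move in the right direction. The learning rate and the number of epochs are untuned guesses and would likely need adjusting for other data.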


A full implementation is available. The code includes a small example of training a network (single input unit, a 20-unit hidden layer with a sigmoid activation function and a single output unit) to fit a part of a sine wave:

[Figure: Neural Network]