The components needed have already been described in previous posts: evaluating the network, back-propagation and gradient descent. We will look at each in turn.
Evaluating the network
We use the expressions from the Multiple Inputs post. We can simply loop over the layers and compute the Z's and the A's:
    for layer in self.layers:
        Z = np.dot(layer.W, A) + layer.b
        A = layer.g(Z)
Note that the activation function g (and similarly its derivative) must be applied to each element of its input, so it has to accept a whole matrix and return a matrix of the same shape.
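As a minimal sketch of what elementwise means here, a sigmoid written with NumPy operations works on a whole matrix at once. The names sigmoid and sigmoid_prime are illustrative and are not taken from the original code.

```python
import numpy as np

def sigmoid(Z):
    # np.exp acts on every element, so Z can be a scalar, a vector or a matrix.
    return 1.0 / (1.0 + np.exp(-Z))

def sigmoid_prime(Z):
    # First derivative of the sigmoid, also evaluated element by element.
    S = sigmoid(Z)
    return S * (1.0 - S)

Z = np.array([[0.0, 2.0], [-2.0, 0.5]])
A = sigmoid(Z)          # same shape as Z
dA_dZ = sigmoid_prime(Z)  # same shape as Z
```

Writing the functions this way avoids any explicit Python loop over the entries of Z.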
Note also that it is necessary to save the Z's and the A's for each layer, as they will be referenced during back-propagation.
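One way to keep the intermediate results is to append them to a list while looping over the layers. The sketch below assumes hypothetical Layer and Values containers and a free function evaluate; the original code may organise these differently.

```python
import numpy as np
from collections import namedtuple

# Hypothetical containers, introduced only for this sketch.
Layer = namedtuple("Layer", ["W", "b", "g"])
Values = namedtuple("Values", ["Z", "A"])

def evaluate(layers, X):
    # values[0] holds the input, so values[l] belongs to layer l.
    values = [Values(Z=None, A=X)]
    A = X
    for layer in layers:
        Z = np.dot(layer.W, A) + layer.b  # b broadcasts over the m columns
        A = layer.g(Z)
        values.append(Values(Z=Z, A=A))
    return values

rng = np.random.default_rng(0)
layers = [
    Layer(W=rng.standard_normal((3, 2)), b=np.zeros((3, 1)), g=np.tanh),
    Layer(W=rng.standard_normal((1, 3)), b=np.zeros((1, 1)), g=lambda Z: Z),
]
X = rng.standard_normal((2, 5))  # 2 input features, m = 5 examples
values = evaluate(layers, X)
```

Storing the input as values[0] means the list has one more entry than the list of layers, which is why the two lists end up indexed differently.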
Back-propagation
Back-propagation can be performed using the expressions from the Back-propagation Matrix-style post. Some other things to note:
- Remember to loop through the layers in reverse.
- There is no need to save the dA's and dZ's for each layer; the variables can be overwritten as we move back through the layers.
First, we need to compute dA, where there is a special case for the output layer:
    if l == L:
        dA = (values[L].A - Y) / m
    else:
        dA = np.dot(self.layers[l].W.T, dZ)
Here, dZ will be from the previous iteration (and, therefore, from layer l + 1). Note that self.layers[l] corresponds to layer l + 1, since the layers array is shifted by one (layer 0 is not needed in the array).
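The index shift is easy to get wrong, so here is a small shape check. It assumes hypothetical layer sizes and a plain list Ws in place of self.layers, and uses random numbers as a stand-in for the dZ of the following layer.

```python
import numpy as np

# Hypothetical sizes: n0 = 2 inputs, n1 = 4 hidden units, n2 = 1 output.
sizes = [2, 4, 1]
m = 5  # number of examples
rng = np.random.default_rng(0)

# Ws[i] holds the weights of layer i + 1, with shape (n_{i+1}, n_i).
Ws = [rng.standard_normal((sizes[i + 1], sizes[i]))
      for i in range(len(sizes) - 1)]

l = 1  # the hidden layer
dZ_next = rng.standard_normal((sizes[l + 1], m))  # stand-in for dZ of layer l + 1
W_next = Ws[l]                   # list index l is layer l + 1
dA = np.dot(W_next.T, dZ_next)   # shape (n_l, m), the same as A of layer l
```

If the wrong list index were used, the shapes in the matrix product would generally not match, which makes this kind of mistake fairly easy to spot.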
The matrix dZ is updated as
    dZ = dA * self.layers[l - 1].dg(values[l].Z)
Some things to note here: dg is the first derivative of the activation function g for layer l (it is named dg here to keep it distinct from the activation g itself), and values[l].Z is from the evaluation of the network.
Now we can compute dW and db:
    dW = np.dot(dZ, values[l - 1].A.T)
    db = np.sum(dZ, axis=1, keepdims=True)
The expression for db is just an efficient way of multiplying a matrix by a column vector of 1's.
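The steps above can be put together and checked numerically. The sketch below is not the original implementation: it assumes sigmoid activations in every layer, a cost of the form sum((A - Y)^2) / (2m), which is consistent with dA = (A - Y) / m for the output layer, and a plain list of (W, b) tuples in place of self.layers. It compares one back-propagated derivative with a central finite difference, and also confirms the remark about db.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def sigmoid_prime(Z):
    S = sigmoid(Z)
    return S * (1.0 - S)

def forward(params, X):
    # Zs[l] and As[l] belong to layer l; Zs[0] is unused and As[0] is the input.
    Zs, As = [None], [X]
    for W, b in params:
        Zs.append(np.dot(W, As[-1]) + b)
        As.append(sigmoid(Zs[-1]))
    return Zs, As

def cost(params, X, Y):
    # Assumed cost: half the squared error, averaged over the m examples.
    A = forward(params, X)[1][-1]
    return np.sum((A - Y) ** 2) / (2 * X.shape[1])

def gradients(params, X, Y):
    Zs, As = forward(params, X)
    L, m = len(params), X.shape[1]
    grads = [None] * L
    dZ = None
    for l in range(L, 0, -1):  # loop through the layers in reverse
        if l == L:
            dA = (As[L] - Y) / m
        else:
            dA = np.dot(params[l][0].T, dZ)  # params[l] is layer l + 1
        dZ = dA * sigmoid_prime(Zs[l])
        dW = np.dot(dZ, As[l - 1].T)
        db = np.sum(dZ, axis=1, keepdims=True)
        grads[l - 1] = (dW, db)
    return grads

rng = np.random.default_rng(0)
params = [(rng.standard_normal((3, 2)), rng.standard_normal((3, 1))),
          (rng.standard_normal((1, 3)), rng.standard_normal((1, 1)))]
X = rng.standard_normal((2, 4))
Y = rng.uniform(size=(1, 4))
grads = gradients(params, X, Y)

# Central finite difference for one weight of the first layer.
eps = 1e-6
W1 = params[0][0]
W1[0, 0] += eps
cost_plus = cost(params, X, Y)
W1[0, 0] -= 2 * eps
cost_minus = cost(params, X, Y)
W1[0, 0] += eps  # restore the weight
numeric = (cost_plus - cost_minus) / (2 * eps)
analytic = grads[0][0][0, 0]

# Summing over the columns equals multiplying by a column vector of 1's.
dZ_demo = rng.standard_normal((3, 4))
db_sum = np.sum(dZ_demo, axis=1, keepdims=True)
db_ones = np.dot(dZ_demo, np.ones((4, 1)))
```

If numeric and analytic agree to several digits, the indexing and the order of the matrix products are very likely right. A check like this is cheap to run on a tiny network before training a larger one.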
Gradient descent
What remains is a training algorithm using gradient descent. The following snippet assumes that the network's weights and biases have been initialized with (pseudo-)random numbers and performs a fixed number of steps:
    for epoch in range(epochs):
        values = self.evaluate(Xs)
        dWs, dbs = self.compute_gradient(values, Ys)
        for layer, dW, db in zip(self.layers, dWs, dbs):
            layer.W -= learning_rate * dW
            layer.b -= learning_rate * db
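To see the loop in action without the surrounding class, here is a self-contained sketch with the two layers unrolled by hand. It makes several assumptions that are not taken from the original code: a linear output unit, a cost of the form sum((A - Y)^2) / (2m), zero-initialized biases, and illustrative values for learning_rate, epochs and the weight scale.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0.0, np.pi, 50).reshape(1, -1)  # inputs, shape (1, m)
Y = np.sin(X)
m = X.shape[1]

# 1 input unit, 20 sigmoid hidden units, 1 linear output unit (assumed).
W1, b1 = rng.standard_normal((20, 1)), np.zeros((20, 1))
W2, b2 = 0.1 * rng.standard_normal((1, 20)), np.zeros((1, 1))

learning_rate = 0.05  # illustrative value
epochs = 1000         # illustrative value
losses = []
for epoch in range(epochs):
    # Evaluate the network.
    Z1 = np.dot(W1, X) + b1
    A1 = 1.0 / (1.0 + np.exp(-Z1))
    A2 = np.dot(W2, A1) + b2  # linear output: A2 = Z2
    losses.append(np.sum((A2 - Y) ** 2) / (2 * m))

    # Back-propagate, output layer first.
    dZ2 = (A2 - Y) / m  # derivative of the linear output is 1
    dW2 = np.dot(dZ2, A1.T)
    db2 = np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * A1 * (1.0 - A1)
    dW1 = np.dot(dZ1, X.T)
    db1 = np.sum(dZ1, axis=1, keepdims=True)

    # Gradient descent update.
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
```

Recording the cost in losses at every step gives a simple way to confirm that it goes down. If it grows instead, the learning rate is probably too large.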
A full implementation is available. The code includes a small example of training a network (single input unit, a 20-unit hidden layer with a sigmoid activation function and a single output unit) to fit a part of a sine wave: