This post will describe how to implement a simple, trainable neural network in Python using NumPy.

The components needed have already been described in previous posts: Evaluating the network, back-propagation and gradient descent. We will look at each in turn.

**Evaluating the network**

We use the expressions from the Multiple Inputs post. We can simply loop over the layers and compute the $Z$'s and the $A$'s:

`for layer in self.layers:`

Z = np.dot(layer.W, A) + layer.b

A = layer.g[0](Z)

Note that the activation function `g[0]`

(and similarly for the derivative
`g[1]`

) should be able to apply the activation function to each element of the input.

Note also that it is necessary to save the $Z$'s and the $A$'s for *each* layer,
as they will be referenced during back-propagation.

**Back-propagation**

Back-propagation can be performed using the expressions from the Back-propagation Matrix-style post. Some other things to note:

- Remember to loop through the layers in reverse.
- There is no need to save the $dA$'s and $dZ$'s for each layer and the variables can be overwritten as we move back through the layers.

First, we need to compute $dA$ where there is a special case for the output layer:

`if l == L:`

dA = (values[L].A - Y) / m

else:

dA = np.dot(self.layers[l].W.T, dZ)

Here, $dZ$ will be from the previous iteration (and, therefore, from layer $l+1$).
(Note that `self.layers[l]`

corresponds to layer $l+1$, since the `self.layers`

array
is shifted by one—layer 0 is not needed in the array).

The $dZ$ matrix is updated as

`dZ = dA * self.layers[l - 1].g[1](values[l].Z)`

Some things to note here: `*`

does
element-wise multiplication,
`g[1]`

is the first derivative of the activation function
for layer $l$ and `values[l].Z`

is $Z^l$ from the evaluation of the network.

Now we can compute

`dW = np.dot(dZ, values[l - 1].A.T)`

db = np.sum(dZ, axis=1, keepdims=True)

The expression for `db`

is just an efficient way of multiplying a matrix by a
column vector of 1's.

**Gradient Descent**

Left to do is a training algorithm using Gradient Descent. The following snippet assumes that the network's weights and biases have been initialized with (pseudo-)random numbers and performs a fixed number of steps:

`for epoch in range(epochs):`

values = self.evaluate(Xs)

dWs, dbs = self.compute_gradient(values, Ys)

for layer, dW, db in zip(self.layers, dWs, dbs):

layer.W -= learning_rate * dW

layer.b -= learning_rate * db

**Code**

A full implementation is available. The code includes a small example of training a network (single input unit, a 20-unit hidden layer with a sigmoid activation function and a single output unit) to fit a part of a sine wave: