Consider a neural network as previously described.
The structure of the neural network is fixed, that is, the number of layers $L$, and the number of nodes $n_l$ and the activation function $\sigma_l$ for each layer $l = 1, \dots, L$.
We furthermore have a training set consisting of input/output pairs $(x_i, y_i)$ for $i = 1, \dots, N$, and for each input the weighted sums $z^l_j$ and activations $a^l_j = \sigma_l(z^l_j)$ are computed as described in the
multiple inputs post.
We now have the error function
$$E = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{n_L} \left( a^L_j(x_i) - y_{ij} \right)^2,$$
and in order to compute the gradient we need the partial derivatives with respect to the
weights and biases,
$$\frac{\partial E}{\partial w^l_{jk}} \quad \text{and} \quad \frac{\partial E}{\partial b^l_j},$$
for $l = 1, \dots, L$, $j = 1, \dots, n_l$, $k = 1, \dots, n_{l-1}$. Since $E$ is a sum of per-example errors, its derivatives are sums of per-example derivatives, so in what follows we fix one example $(x, y)$ and suppress the index $i$.
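To make these partial derivatives concrete before deriving them analytically, here is a small numerical sketch. The sigmoid activation, the quadratic error, the example weights, and the helper names (`error`, `numeric_dE_dw`) are all illustrative assumptions, not taken from this series; central finite differences approximate $\partial E / \partial w^l_{jk}$ directly from the definition.

```python
import math

def sigma(z):
    """Activation function (assumed sigmoid for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))

# A tiny fixed network: 2 inputs -> 2 hidden nodes -> 1 output.
# Hypothetical example weights W[l][j][k] and biases B[l][j]
# (weight layers indexed from 0 here).
W = [[[0.1, -0.2], [0.4, 0.3]],   # hidden-layer weights
     [[0.5, -0.6]]]               # output-layer weights
B = [[0.0, 0.1], [-0.1]]

def error(x, y):
    """Forward pass: E = 1/2 * sum_j (a^L_j - y_j)^2 for one example (x, y)."""
    a = x
    for Wl, Bl in zip(W, B):
        a = [sigma(sum(w * ak for w, ak in zip(wj, a)) + bj)
             for wj, bj in zip(Wl, Bl)]
    return 0.5 * sum((aj - yj) ** 2 for aj, yj in zip(a, y))

def numeric_dE_dw(x, y, l, j, k, h=1e-5):
    """Central-difference estimate of dE/dw^l_{jk}: (E(w+h) - E(w-h)) / (2h)."""
    W[l][j][k] += h
    ep = error(x, y)
    W[l][j][k] -= 2 * h
    em = error(x, y)
    W[l][j][k] += h                  # restore the original weight
    return (ep - em) / (2 * h)

print(numeric_dE_dw([1.0, 2.0], [1.0], 0, 0, 1))
```

This is how the gradient *could* be computed, but it needs one forward pass per weight; the point of the derivation below is to get all the derivatives from a single backward sweep.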
The key to computing these quantities lies in applying the chain rule
in appropriate ways (see also, e.g., Theorem 9.15 of Walter Rudin's
Principles of Mathematical Analysis).
First, for fixed $l < L$, we view $E$ as a function of $z^{l+1}_1, \dots, z^{l+1}_{n_{l+1}}$ and
each $z^{l+1}_k$, in turn, as a function of $z^l_j$:
$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \, \sigma_l'(z^l_j)$$
for $l = 1, \dots, L-1$, $k = 1, \dots, n_{l+1}$, $j = 1, \dots, n_l$, where we use
the definition of $z^{l+1}_k = \sum_{j=1}^{n_l} w^{l+1}_{kj} \, \sigma_l(z^l_j) + b^{l+1}_k$.
Similarly for the last layer $l = L$, viewing $E$ as a function of $a^L_1, \dots, a^L_{n_L}$ and each $a^L_j$, in turn, as a function of $z^L_j$:
$$\frac{\partial E}{\partial z^L_j} = \left( a^L_j - y_j \right) \sigma_L'(z^L_j)$$
for $j = 1, \dots, n_L$.
We then get, writing $\delta^l_j = \partial E / \partial z^l_j$,
$$\delta^L_j = \left( a^L_j - y_j \right) \sigma_L'(z^L_j) \quad \text{and} \quad \delta^l_j = \sum_{k=1}^{n_{l+1}} \delta^{l+1}_k \, \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sigma_l'(z^l_j) \sum_{k=1}^{n_{l+1}} w^{l+1}_{kj} \, \delta^{l+1}_k$$
for $l = 1, \dots, L-1$ and $j = 1, \dots, n_l$. Here, the requirement that
each activation function is differentiable becomes apparent.
The remaining quantities are
$$\frac{\partial E}{\partial w^l_{jk}} = \delta^l_j \, \frac{\partial z^l_j}{\partial w^l_{jk}} \quad \text{and} \quad \frac{\partial E}{\partial b^l_j} = \delta^l_j \, \frac{\partial z^l_j}{\partial b^l_j}$$
for $l = 1, \dots, L$, $j = 1, \dots, n_l$, $k = 1, \dots, n_{l-1}$, with $\delta^l_j = \partial E / \partial z^l_j$, where the final piece of the puzzle
can be obtained by differentiating $z^l_j = \sum_{k=1}^{n_{l-1}} w^l_{jk} \, a^{l-1}_k + b^l_j$ directly (no chain rule!):
$$\frac{\partial z^l_j}{\partial w^l_{jk}} = a^{l-1}_k \quad \text{and} \quad \frac{\partial z^l_j}{\partial b^l_j} = 1$$
for $j = 1, \dots, n_l$ and $k = 1, \dots, n_{l-1}$.
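Putting the pieces together, the backward recursion for $\delta^l_j$ and the two gradient formulas can be sketched in code. Sigmoid activations and the quadratic error are again assumptions for illustration (the derivation above works for any differentiable $\sigma_l$), and `backprop` is a hypothetical helper name:

```python
import math

def sigma(z):
    """Assumed activation: sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def dsigma(z):
    """sigma'(z) for the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigma(z)
    return s * (1.0 - s)

def backprop(W, B, x, y):
    """Gradients dE/dw^l_{jk} and dE/db^l_j for one example (x, y),
    with E = 1/2 * sum_j (a^L_j - y_j)^2.
    W[l][j][k], B[l][j]; weight layers indexed 0..L-1 here."""
    # Forward pass: store z^l and a^l for every layer.
    a, zs, acts = x, [], [x]
    for Wl, Bl in zip(W, B):
        z = [sum(w * ak for w, ak in zip(wj, a)) + bj
             for wj, bj in zip(Wl, Bl)]
        a = [sigma(zj) for zj in z]
        zs.append(z)
        acts.append(a)
    # Base case: delta^L_j = (a^L_j - y_j) * sigma'(z^L_j).
    delta = [(aj - yj) * dsigma(zj)
             for aj, yj, zj in zip(acts[-1], y, zs[-1])]
    dW = [None] * len(W)
    dB = [None] * len(W)
    for l in range(len(W) - 1, -1, -1):
        # dE/dw^l_{jk} = delta^l_j * a^{l-1}_k,  dE/db^l_j = delta^l_j.
        dW[l] = [[dj * ak for ak in acts[l]] for dj in delta]
        dB[l] = delta[:]
        if l > 0:
            # Recursion: delta^{l-1}_j = sigma'(z^{l-1}_j) * sum_k w^l_{kj} delta^l_k.
            delta = [dsigma(zs[l - 1][j]) *
                     sum(W[l][k][j] * delta[k] for k in range(len(delta)))
                     for j in range(len(zs[l - 1]))]
    return dW, dB
```

Note that one forward pass followed by one backward sweep yields every partial derivative at once; for the full error over the training set, the per-example gradients are simply summed.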
By careful observation, we see that the quantities above can be computed by working
backwards through the layers: $\delta^L$ first, then $\delta^{L-1}$, and so on down to $\delta^1$. Hence the name back-propagation, which was first
described for neural networks by David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams
in the paper Learning representations by back-propagating errors.
If all these partial derivatives and indices are making you dizzy, I don't blame you.
The next post
will look at how to compute the gradient using matrix notation, which
should be easier to comprehend.