Consider a neural network as previously described.
The structure of the network is fixed, that is, the number of layers, the number of
nodes in each layer, and the activation functions are all given.
We furthermore have a training set consisting of $m$ input/output pairs
$(x_c, y_c)$ for $c = 1, \ldots, m$.
We set
$$
A^0 = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix},
\qquad
Y = \begin{bmatrix} y_1 & y_2 & \cdots & y_m \end{bmatrix},
$$
and $A^1, Z^1, \ldots, A^L, Z^L$ are computed as described in the
multiple inputs post.
We now have the error function
defined as
$$
E = \frac{1}{2m} \sum_{c=1}^{m} \sum_{i=1}^{n_L} \left( A^L_{ic} - Y_{ic} \right)^2,
$$
and in order to compute the gradient $\nabla E$ we need the partial derivatives with respect to the
weights and biases,
$$
\frac{\partial E}{\partial W^l_{ij}} \quad \text{and} \quad \frac{\partial E}{\partial b^l_i}
$$
for $i = 1, \ldots, n_l$, $j = 1, \ldots, n_{l-1}$, $l = 1, \ldots, L$.
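For concreteness, here is a minimal NumPy sketch of this error function, assuming $A^L$ and $Y$ are stored as $n_L \times m$ arrays with one column per training pair; the function and variable names are illustrative only.

```python
import numpy as np

def error(AL, Y):
    """E = (1 / 2m) * sum over c and i of (A^L_ic - Y_ic)^2.

    AL and Y are assumed to be (n_L x m) arrays with one column
    per training pair, matching A^L and Y above.
    """
    m = Y.shape[1]
    return np.sum((AL - Y) ** 2) / (2 * m)
```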
The key to computing these quantities is to apply the
chain rule
in appropriate ways (see also, e.g., Theorem 9.15 of Walter Rudin's
Principles of Mathematical Analysis).
First, for fixed $l$, $i$, $j$, we view $E$ as a function of $Z^l_{i1}, \ldots, Z^l_{im}$ and
each $Z^l_{ic}$, in turn, as a function of $W^l_{ij}$:
$$
\frac{\partial E}{\partial W^l_{ij}}
= \sum_{c=1}^{m} \frac{\partial E}{\partial Z^l_{ic}} \frac{\partial Z^l_{ic}}{\partial W^l_{ij}}
= \sum_{c=1}^{m} \frac{\partial E}{\partial Z^l_{ic}} A^{l-1}_{jc},
$$
for $i = 1, \ldots, n_l$, $j = 1, \ldots, n_{l-1}$, $l = 1, \ldots, L$, where we use
the definition of $Z^l_{ic}$.
Similarly for $b^l_i$:
$$
\frac{\partial E}{\partial b^l_i}
= \sum_{c=1}^{m} \frac{\partial E}{\partial Z^l_{ic}} \frac{\partial Z^l_{ic}}{\partial b^l_i}
= \sum_{c=1}^{m} \frac{\partial E}{\partial Z^l_{ic}},
$$
for $i = 1, \ldots, n_l$ and $l = 1, \ldots, L$.
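In code, both of these sums over the training examples collapse into a matrix product and a row sum. The sketch below is a hypothetical helper, assuming `dZl` holds $\partial E / \partial Z^l_{ic}$ as an $n_l \times m$ array and `A_prev` holds $A^{l-1}$:

```python
import numpy as np

def weight_and_bias_gradients(dZl, A_prev):
    """Gradients of E with respect to W^l and b^l, given dE/dZ^l.

    dZl:    (n_l x m) array with dZl[i, c] = dE/dZ^l_ic
    A_prev: (n_{l-1} x m) array holding A^{l-1}
    """
    dW = dZl @ A_prev.T        # dE/dW^l_ij = sum_c dE/dZ^l_ic * A^{l-1}_jc
    db = np.sum(dZl, axis=1)   # dE/db^l_i  = sum_c dE/dZ^l_ic
    return dW, db
```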
We then get
$$
\frac{\partial E}{\partial Z^l_{ic}}
= \frac{\partial E}{\partial A^l_{ic}} \frac{\partial A^l_{ic}}{\partial Z^l_{ic}}
= \frac{\partial E}{\partial A^l_{ic}} \, g_l'(Z^l_{ic}),
$$
for $i = 1, \ldots, n_l$, $l = 1, \ldots, L$, $c = 1, \ldots, m$. Here, the requirement that
each activation function $g_l$ is differentiable becomes apparent.
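This step is just an elementwise product. A minimal sketch, assuming `dAl` holds $\partial E / \partial A^l_{ic}$, `Zl` holds $Z^l$, and `g_prime` is the derivative of the layer's activation function (all names are illustrative):

```python
import numpy as np

def dZ_from_dA(dAl, Zl, g_prime):
    """dE/dZ^l_ic = dE/dA^l_ic * g_l'(Z^l_ic), applied elementwise.

    g_prime is the derivative of the activation function g_l; for example,
    for g_l = tanh one would pass lambda z: 1 - np.tanh(z) ** 2.
    """
    return dAl * g_prime(Zl)
```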
The remaining quantities are
$$
\frac{\partial E}{\partial A^l_{ic}}
= \sum_{j=1}^{n_{l+1}} \frac{\partial E}{\partial Z^{l+1}_{jc}} \frac{\partial Z^{l+1}_{jc}}{\partial A^l_{ic}}
= \sum_{j=1}^{n_{l+1}} \frac{\partial E}{\partial Z^{l+1}_{jc}} W^{l+1}_{ji},
$$
for $i = 1, \ldots, n_l$, $l = 1, \ldots, L-1$, $c = 1, \ldots, m$, where the final piece of the puzzle
can be obtained by differentiating $E$ directly (no chain rule!):
$$
\frac{\partial E}{\partial A^L_{ic}} = \frac{1}{m} \left( A^L_{ic} - Y_{ic} \right)
$$
for $i = 1, \ldots, n_L$, $c = 1, \ldots, m$.
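Both of these also translate directly into array operations. A sketch under the same assumed $n_l \times m$ layout, with hypothetical names:

```python
import numpy as np

def dA_output_layer(AL, Y):
    """dE/dA^L_ic = (1/m) * (A^L_ic - Y_ic), from differentiating E directly."""
    m = Y.shape[1]
    return (AL - Y) / m

def dA_hidden_layer(W_next, dZ_next):
    """dE/dA^l_ic = sum_j dE/dZ^{l+1}_jc * W^{l+1}_ji, i.e. (W^{l+1})^T times dZ^{l+1}.

    W_next:  (n_{l+1} x n_l) array holding W^{l+1}
    dZ_next: (n_{l+1} x m) array holding dE/dZ^{l+1}
    """
    return W_next.T @ dZ_next
```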
Looking carefully at these formulas, we see that the quantities above can be computed by working
backwards through the layers, since the derivatives at layer $l$ only depend on those at layer $l+1$.
Hence the name back-propagation, which was first
described for neural networks by David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams
in the paper Learning representations by back-propagating errors.
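To make the backward order concrete, here is a sketch of a full backward pass in NumPy. It assumes a data layout in the spirit of the forward pass from the earlier posts, but the names are my own assumptions: lists `W`, `A`, `Z` indexed by layer (with `A[0]` the input matrix and index 0 of `W` unused) and a list `g_prime` of activation-function derivatives.

```python
import numpy as np

def backward_pass(W, A, Z, Y, g_prime):
    """Compute dE/dW^l and dE/db^l for l = L, ..., 1 by back-propagation.

    W[l]:       (n_l x n_{l-1}) weight matrix of layer l (index 0 unused)
    A[l], Z[l]: (n_l x m) activations and pre-activations, A[0] = input
    Y:          (n_L x m) matrix of target outputs
    g_prime[l]: derivative of the activation function g_l
    """
    L = len(W) - 1
    m = Y.shape[1]
    dW = [None] * (L + 1)
    db = [None] * (L + 1)

    dA = (A[L] - Y) / m                 # dE/dA^L
    for l in range(L, 0, -1):
        dZ = dA * g_prime[l](Z[l])      # dE/dZ^l
        dW[l] = dZ @ A[l - 1].T         # dE/dW^l
        db[l] = np.sum(dZ, axis=1)      # dE/db^l
        if l > 1:
            dA = W[l].T @ dZ            # dE/dA^{l-1}
    return dW, db
```

Running the loop once per layer, from $L$ down to $1$, produces every partial derivative needed for $\nabla E$.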
If all these partial derivatives and indices are making you dizzy, I don't blame you.
The next post
will look at how to compute the gradient using matrix notation, which
should be easier to comprehend.