janmr blog

Neural Networks - Back-propagation Derivation

Consider a neural network as previously described. The structure of the network is fixed: the number of layers, the number of nodes in each layer, and the activation function for each layer.

We furthermore have a training set consisting of $m$ input/output pairs $(\textbf{x}_c, \textbf{y}_c)$ for $c=1,\ldots,m$. We set

$$\begin{aligned} A^0 &= [\textbf{x}_1 \; \textbf{x}_2 \; \cdots \; \textbf{x}_m], \\ Y &= [\textbf{y}_1 \; \textbf{y}_2 \; \cdots \; \textbf{y}_m], \end{aligned}$$

and $A^1, Z^1, \ldots, A^L, Z^L$ are computed as described in the multiple inputs post.
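To make the shapes concrete, the setup and forward pass can be sketched in NumPy. The layer sizes, the sigmoid activation, and all variable names here are illustrative assumptions, not taken from the earlier posts:

```python
import numpy as np

# Hypothetical example: layer sizes n^0 = 3, n^1 = 4, n^2 = 2 (so L = 2)
# and m = 5 training examples. Each column of A0 is one input vector x_c.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
m = 5

A0 = rng.standard_normal((sizes[0], m))  # A^0 = [x_1 ... x_m], shape (n^0, m)
W = [rng.standard_normal((sizes[l], sizes[l - 1])) for l in range(1, len(sizes))]
b = [rng.standard_normal((sizes[l], 1)) for l in range(1, len(sizes))]

def g(Z):
    """Sigmoid activation, assumed for every layer in this sketch."""
    return 1.0 / (1.0 + np.exp(-Z))

# Forward pass: Z^l = W^l A^{l-1} + b^l and A^l = g^l(Z^l), for l = 1, ..., L.
A, Z = [A0], [None]
for Wl, bl in zip(W, b):
    Z.append(Wl @ A[-1] + bl)
    A.append(g(Z[-1]))

print(A[-1].shape)  # → (2, 5): n^L rows, one column per training example
```

Note that each $A^l$ and $Z^l$ has one column per training example, so a whole layer is processed for all $m$ examples at once.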

We now have the error function defined as

$$E = \tfrac{1}{2m} \sum_{c=1}^m \sum_{i=1}^{n^L} \left( A_{ic}^L - Y_{ic} \right)^2,$$
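The double sum in $E$ is just a sum over all entries of the matrix $A^L - Y$, which makes it a one-liner in NumPy (the function name `error` is illustrative):

```python
import numpy as np

# E = 1/(2m) * sum_c sum_i (A^L_ic - Y_ic)^2, summed over all entries at once.
def error(AL, Y):
    m = Y.shape[1]
    return np.sum((AL - Y) ** 2) / (2 * m)

AL = np.array([[1.0, 2.0], [3.0, 4.0]])  # hypothetical network outputs A^L
Y  = np.array([[1.0, 0.0], [3.0, 2.0]])  # hypothetical targets
print(error(AL, Y))  # → 2.0, i.e. (0 + 4 + 0 + 4) / (2 * 2)
```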

and in order to compute the gradient $\nabla E$ we need the partial derivatives with respect to the weights and biases,

$$\frac{\partial E}{\partial W^l_{ij}} \quad \text{and} \quad \frac{\partial E}{\partial b^l_i}$$

for $i=1,\ldots,n^l$, $j=1,\ldots,n^{l-1}$, $l=1,\ldots,L$.

The key to computing these quantities is applying the chain rule in appropriate ways (see also, e.g., Theorem 9.15 of Walter Rudin's Principles of Mathematical Analysis).

First, for fixed $l, i, j$, we view $E$ as a function of $Z^l_{i1}, \ldots, Z^l_{im}$ and each $Z^l_{ic}$, in turn, as a function of $W^l_{ij}$:

$$\frac{\partial E}{\partial W^l_{ij}} = \sum_{c=1}^m \frac{\partial E}{\partial Z^l_{ic}} \frac{\partial Z^l_{ic}}{\partial W^l_{ij}} = \sum_{c=1}^m \frac{\partial E}{\partial Z^l_{ic}} A^{l-1}_{jc},$$

for $i=1,\ldots,n^l$, $j=1,\ldots,n^{l-1}$, $l=1,\ldots,L$, where we use the definition of $Z^l_{ic}$. Similarly for $b^l_i$:

$$\frac{\partial E}{\partial b^l_i} = \sum_{c=1}^m \frac{\partial E}{\partial Z^l_{ic}} \frac{\partial Z^l_{ic}}{\partial b^l_i} = \sum_{c=1}^m \frac{\partial E}{\partial Z^l_{ic}},$$

for $i=1,\ldots,n^l$ and $l=1,\ldots,L$.
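With $\partial E/\partial Z^l$ stored as a matrix of shape $(n^l, m)$, the two sums over $c$ above become a matrix product and a row sum, respectively. A minimal sketch (the function name and variables are illustrative):

```python
import numpy as np

# Given dZl = dE/dZ^l, shape (n^l, m), and Al_prev = A^{l-1}, shape (n^{l-1}, m):
def weight_bias_gradients(dZl, Al_prev):
    dW = dZl @ Al_prev.T                     # dE/dW^l_ij = sum_c dE/dZ^l_ic * A^{l-1}_jc
    db = np.sum(dZl, axis=1, keepdims=True)  # dE/db^l_i  = sum_c dE/dZ^l_ic
    return dW, db

dZl = np.array([[1.0, 2.0]])                  # n^l = 1, m = 2
Al_prev = np.array([[1.0, 0.0], [0.0, 1.0]])  # n^{l-1} = 2
dW, db = weight_bias_gradients(dZl, Al_prev)
print(dW)  # → [[1. 2.]]
print(db)  # → [[3.]]
```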

We then get

$$\frac{\partial E}{\partial Z^l_{ic}} = \frac{\partial E}{\partial A^l_{ic}} \frac{\partial A^l_{ic}}{\partial Z^l_{ic}} = \frac{\partial E}{\partial A^l_{ic}} \, {g^l}'(Z^l_{ic}),$$

for $i=1,\ldots,n^l$, $l=1,\ldots,L$, $c=1,\ldots,m$. Here, the requirement that each activation function $g^l$ is differentiable becomes apparent.
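Since the relation holds entry by entry, it is an elementwise product in matrix form. A sketch assuming the sigmoid, whose derivative is conveniently $g(z)\,(1 - g(z))$:

```python
import numpy as np

def g(Z):
    """Sigmoid activation (assumed here for every layer)."""
    return 1.0 / (1.0 + np.exp(-Z))

def g_prime(Z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    s = g(Z)
    return s * (1.0 - s)

# dE/dZ^l_ic = dE/dA^l_ic * g^l'(Z^l_ic), an elementwise product of (n^l, m) matrices.
def dZ_from_dA(dAl, Zl):
    return dAl * g_prime(Zl)

print(dZ_from_dA(np.array([[2.0]]), np.array([[0.0]])))  # → [[0.5]], since g'(0) = 0.25
```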

The remaining quantities are

$$\frac{\partial E}{\partial A^l_{ic}} = \sum_{j=1}^{n^{l+1}} \frac{\partial E}{\partial Z^{l+1}_{jc}} \frac{\partial Z^{l+1}_{jc}}{\partial A^l_{ic}} = \sum_{j=1}^{n^{l+1}} \frac{\partial E}{\partial Z^{l+1}_{jc}} W^{l+1}_{ji},$$

for $i=1,\ldots,n^l$, $l=1,\ldots,L-1$, $c=1,\ldots,m$, where the final piece of the puzzle can be obtained by differentiating $E$ directly (no chain rule!):

$$\frac{\partial E}{\partial A^L_{ic}} = \tfrac{1}{m} \left( A^L_{ic} - Y_{ic} \right)$$

for $i=1,\ldots,n^L$, $c=1,\ldots,m$.
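These last two formulas also have direct matrix forms: the output-layer gradient comes straight from the error function, and the sum over $j$ is a matrix product with $W^{l+1}$ transposed. A sketch with illustrative function names:

```python
import numpy as np

# dE/dA^L_ic = (A^L_ic - Y_ic) / m, directly from the error function.
def dA_output(AL, Y):
    return (AL - Y) / Y.shape[1]

# dE/dA^l_ic = sum_j W^{l+1}_ji * dE/dZ^{l+1}_jc, i.e. (W^{l+1})^T dZ^{l+1}.
def dA_hidden(W_next, dZ_next):
    return W_next.T @ dZ_next  # shape (n^l, m)

print(dA_output(np.array([[2.0, 4.0]]), np.array([[0.0, 0.0]])))  # → [[1. 2.]]
```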

By careful observation, we see that the quantities above can be computed by working backwards through the layers. Hence the name back-propagation, which was first described for neural networks by David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams in the paper Learning representations by back-propagating errors.
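Putting the pieces together, the whole backward pass can be sketched as a single loop from layer $L$ down to layer $1$. This is a minimal illustration assuming sigmoid activations throughout; the function names are not from the earlier posts:

```python
import numpy as np

def g(Z):
    """Sigmoid activation (assumed for every layer)."""
    return 1.0 / (1.0 + np.exp(-Z))

def g_prime(Z):
    s = g(Z)
    return s * (1.0 - s)

def forward(A0, W, b):
    """Compute Z^l = W^l A^{l-1} + b^l and A^l = g(Z^l) for l = 1, ..., L."""
    A, Z = [A0], [None]
    for Wl, bl in zip(W, b):
        Z.append(Wl @ A[-1] + bl)
        A.append(g(Z[-1]))
    return A, Z

def backward(A, Z, Y, W):
    """Back-propagate: dE/dA^L first, then dZ^l, dW^l, db^l, dA^{l-1} per layer."""
    m = Y.shape[1]
    L = len(W)
    dW, db = [None] * L, [None] * L
    dA = (A[L] - Y) / m                                 # dE/dA^L
    for l in range(L, 0, -1):
        dZ = dA * g_prime(Z[l])                         # dE/dZ^l
        dW[l - 1] = dZ @ A[l - 1].T                     # dE/dW^l
        db[l - 1] = np.sum(dZ, axis=1, keepdims=True)   # dE/db^l
        if l > 1:
            dA = W[l - 1].T @ dZ                        # dE/dA^{l-1}
    return dW, db
```

A quick sanity check is to compare `backward` against a finite-difference approximation of $\partial E/\partial W^l_{ij}$; the two should agree to several decimal places.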

If all these partial derivatives and indices are making you dizzy, I don't blame you. The next post will look at how to compute the gradient using matrix notation, which should be easier to comprehend.