Neural Networks - Back-propagation Derivation
Published on 2023-01-19
Consider a neural network as previously described.
The structure of the neural network is fixed, that is, the number of layers, the number of
nodes in each layer, and the activation functions for each layer are all given.
We furthermore have a training set consisting of $m$ input/output pairs,
$(\textbf{x}_c, \textbf{y}_c)$ for $c = 1, \ldots, m$.
We set

$$
\begin{aligned}
A^0 &= [\textbf{x}_1 \; \textbf{x}_2 \; \cdots \; \textbf{x}_m], \\
Y &= [\textbf{y}_1 \; \textbf{y}_2 \; \cdots \; \textbf{y}_m],
\end{aligned}
$$
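In code, $A^0$ and $Y$ are simply the training inputs and targets stacked as columns. A minimal NumPy sketch (the variable names are my own, not from the posts):

```python
import numpy as np

# Three training pairs (m = 3) with 2-dimensional inputs and 1-dimensional outputs.
xs = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
ys = [np.array([0.0]), np.array([1.0]), np.array([0.0])]

# Stack each example as a column: A0 has shape (n^0, m), Y has shape (n^L, m).
A0 = np.column_stack(xs)
Y = np.column_stack(ys)
```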
and $A^1, Z^1, \ldots, A^L, Z^L$ are computed as described in the
multiple inputs post.
We now have the error function
defined as

$$
E = \tfrac{1}{2m} \sum_{c=1}^m \sum_{i=1}^{n^L} \left( A_{ic}^L - Y_{ic} \right)^2,
$$
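Concretely, $E$ sums the squared differences entry by entry and scales by $\tfrac{1}{2m}$. A sketch (my own naming, assuming the column-per-example layout above):

```python
import numpy as np

def error(AL, Y):
    """E = 1/(2m) * sum over examples c and output nodes i of (A^L_ic - Y_ic)^2."""
    m = Y.shape[1]  # number of training examples (columns)
    return np.sum((AL - Y) ** 2) / (2 * m)

AL = np.array([[0.5, 1.0], [0.0, 0.5]])  # hypothetical network outputs, m = 2
Yv = np.array([[0.0, 1.0], [0.0, 1.0]])  # targets
# error(AL, Yv) = (0.25 + 0 + 0 + 0.25) / 4 = 0.125
```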
and in order to compute the gradient $\nabla E$ we need the partial derivatives with respect to the
weights and biases,

$$
\frac{\partial E}{\partial W^l_{ij}}
\quad \text{and} \quad
\frac{\partial E}{\partial b^l_i}
$$

for $i = 1, \ldots, n^l$, $j = 1, \ldots, n^{l-1}$, $l = 1, \ldots, L$.
The key to computing these quantities is applying the
chain rule
in appropriate ways (see also, e.g., Theorem 9.15 of Walter Rudin's
Principles of Mathematical Analysis).
First, for fixed $l, i, j$, we view $E$ as a function of $Z^l_{i1}, \ldots, Z^l_{im}$ and
each $Z^l_{ic}$, in turn, as a function of $W^l_{ij}$:

$$
\frac{\partial E}{\partial W^l_{ij}}
= \sum_{c=1}^m \frac{\partial E}{\partial Z^l_{ic}} \frac{\partial Z^l_{ic}}{\partial W^l_{ij}}
= \sum_{c=1}^m \frac{\partial E}{\partial Z^l_{ic}} A^{l-1}_{jc},
$$

for $i = 1, \ldots, n^l$, $j = 1, \ldots, n^{l-1}$, $l = 1, \ldots, L$, where the last equality uses
the definition $Z^l_{ic} = \sum_{k=1}^{n^{l-1}} W^l_{ik} A^{l-1}_{kc} + b^l_i$.
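The sum over $c$ is exactly a matrix product: if a matrix `dZ` holds $\partial E / \partial Z^l_{ic}$ in entry `(i, c)`, the full matrix of weight derivatives is `dZ @ A_prev.T`. A sketch with made-up shapes (names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n_l, n_prev, m = 3, 4, 5
dZ = rng.standard_normal((n_l, m))         # dZ[i, c] = dE/dZ^l_{ic}
A_prev = rng.standard_normal((n_prev, m))  # A^{l-1}, activations of layer l-1

# dW[i, j] = sum_c dZ[i, c] * A_prev[j, c], i.e. the sum in the formula above.
dW = dZ @ A_prev.T

# Check one entry against the explicit sum over examples.
i, j = 1, 2
explicit = sum(dZ[i, c] * A_prev[j, c] for c in range(m))
```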
Similarly for $b^l_i$:

$$
\frac{\partial E}{\partial b^l_i}
= \sum_{c=1}^m \frac{\partial E}{\partial Z^l_{ic}} \frac{\partial Z^l_{ic}}{\partial b^l_i}
= \sum_{c=1}^m \frac{\partial E}{\partial Z^l_{ic}},
$$

for $i = 1, \ldots, n^l$ and $l = 1, \ldots, L$.
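In the same matrix picture, the bias derivatives are just row sums of `dZ` (again my own notation):

```python
import numpy as np

dZ = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])  # dZ[i, c] = dE/dZ^l_{ic}, m = 3 examples

# dE/db^l_i = sum over examples c of dE/dZ^l_{ic}: a sum along each row.
db = dZ.sum(axis=1)
```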
We then get

$$
\frac{\partial E}{\partial Z^l_{ic}}
= \frac{\partial E}{\partial A^l_{ic}} \frac{\partial A^l_{ic}}{\partial Z^l_{ic}}
= \frac{\partial E}{\partial A^l_{ic}} {g^l}'(Z^l_{ic}),
$$

for $i = 1, \ldots, n^l$, $l = 1, \ldots, L$, $c = 1, \ldots, m$. Here, the requirement that
each activation function $g^l$ is differentiable becomes apparent.
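Since ${g^l}'$ acts entry by entry, this step is an elementwise product in matrix form. A sketch using the sigmoid as the activation (an assumption for illustration; the derivation holds for any differentiable $g^l$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # derivative of the sigmoid

Z = np.array([[0.0, 1.0], [-1.0, 2.0]])
dA = np.array([[0.5, 0.5], [0.5, 0.5]])  # dA[i, c] = dE/dA^l_{ic}

# Each entry of dA is multiplied by g'(Z^l_{ic}) at the matching entry of Z.
dZ = dA * sigmoid_prime(Z)
# sigmoid_prime(0) = 0.25, so dZ[0, 0] = 0.5 * 0.25 = 0.125
```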
The remaining quantities are

$$
\frac{\partial E}{\partial A^l_{ic}}
= \sum_{j=1}^{n^{l+1}} \frac{\partial E}{\partial Z^{l+1}_{jc}} \frac{\partial Z^{l+1}_{jc}}{\partial A^l_{ic}}
= \sum_{j=1}^{n^{l+1}} \frac{\partial E}{\partial Z^{l+1}_{jc}} W^{l+1}_{ji},
$$

for $i = 1, \ldots, n^l$, $l = 1, \ldots, L-1$, $c = 1, \ldots, m$, where the final piece of the puzzle
can be obtained by differentiating $E$ directly (no chain rule!):

$$
\frac{\partial E}{\partial A^L_{ic}}
= \tfrac{1}{m} \left( A^L_{ic} - Y_{ic} \right)
$$

for $i = 1, \ldots, n^L$, $c = 1, \ldots, m$.
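Both the base case at the output layer and the recursion back to layer $l$ are again simple matrix operations: the sum over $j$ is the product $\left(W^{l+1}\right)^{\mathsf T}$ times the matrix of $\partial E / \partial Z^{l+1}_{jc}$ values. A sketch (all names mine):

```python
import numpy as np

# Base case at the output layer: dE/dA^L = (A^L - Y) / m, entry by entry.
AL = np.array([[0.5, 1.0]])   # hypothetical outputs, m = 2
Yv = np.array([[0.0, 1.0]])
dAL = (AL - Yv) / Yv.shape[1]

# Recursion for a hidden layer l:
# dA[i, c] = sum_j dZ_next[j, c] * W_next[j, i] == (W_next.T @ dZ_next)[i, c].
rng = np.random.default_rng(1)
n_l, n_next, m = 3, 2, 4
W_next = rng.standard_normal((n_next, n_l))  # W^{l+1}
dZ_next = rng.standard_normal((n_next, m))   # dE/dZ^{l+1}
dA = W_next.T @ dZ_next

# Check one entry against the explicit sum over j.
i, c = 2, 1
explicit = sum(dZ_next[j, c] * W_next[j, i] for j in range(n_next))
```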
By careful observation, we see that the quantities above can be computed by working your
way backwards through the layers. Hence the name back-propagation, which was first
described for neural networks by David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams
in the paper Learning representations by back-propagating errors.
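Putting all the pieces together, one backward sweep for a small two-layer network might look like the following sketch (a hypothetical implementation with sigmoid activations, not code from the posts); as a sanity check, one weight derivative is compared against a numerical finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, b1, W2, b2, A0):
    """Forward pass: Z^l = W^l A^{l-1} + b^l, A^l = g(Z^l), for l = 1, 2."""
    Z1 = W1 @ A0 + b1
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

def error(A2, Y):
    m = Y.shape[1]
    return np.sum((A2 - Y) ** 2) / (2 * m)

def backward(W1, b1, W2, b2, A0, Y):
    """Back-propagation: compute dE/dW^l and dE/db^l for l = 2, then l = 1."""
    m = Y.shape[1]
    Z1, A1, Z2, A2 = forward(W1, b1, W2, b2, A0)
    dA2 = (A2 - Y) / m                            # output layer, no chain rule
    dZ2 = dA2 * sigmoid(Z2) * (1 - sigmoid(Z2))   # multiply by g'(Z^2)
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dA1 = W2.T @ dZ2                              # propagate back through W^2
    dZ1 = dA1 * sigmoid(Z1) * (1 - sigmoid(Z1))
    dW1 = dZ1 @ A0.T
    db1 = dZ1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(42)
A0 = rng.standard_normal((3, 5))  # 3 input nodes, m = 5 examples
Y = rng.standard_normal((2, 5))
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))

dW1, db1, dW2, db2 = backward(W1, b1, W2, b2, A0, Y)

# Central finite difference for a single weight, to cross-check dW1[1, 2].
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[1, 2] += eps
Wm[1, 2] -= eps
numeric = (error(forward(Wp, b1, W2, b2, A0)[3], Y)
           - error(forward(Wm, b1, W2, b2, A0)[3], Y)) / (2 * eps)
```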
If all these partial derivatives and indices are making you dizzy, I don't blame you.
The next post
will look at how to compute the gradient using matrix notation, which
should be easier to comprehend.