
Neural Networks - The Optimization Problem

Consider a neural network as previously described. As before, we fix the structure of the neural network: the number of layers, the number of nodes in each layer, and the activation functions for each layer. Now, given the weights and biases for each layer, we can compute the output vector $\textbf{a}^L = N(\textbf{x}) \in \mathbb{R}^{n^L}$ for any input vector $\textbf{x} \in \mathbb{R}^{n^0}$.
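
To make the setup concrete, here is a minimal NumPy sketch of such a forward pass through a fully connected network. The names (`forward`, `Ws`, `bs`, `activations`) are illustrative choices for this post, not notation from the earlier description; each layer applies an affine map followed by its activation function.

```python
import numpy as np

def forward(x, Ws, bs, activations):
    """Compute a^L = N(x) for a fixed network structure.

    Ws[l], bs[l] and activations[l] hold the weight matrix, bias vector
    and activation function of layer l+1 (illustrative names).
    """
    a = x
    for W, b, g in zip(Ws, bs, activations):
        a = g(W @ a + b)  # affine map of the previous layer's output, then activation
    return a

# Example: a small 2-3-1 network with sigmoid activations
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
Ws = [np.random.randn(3, 2), np.random.randn(1, 3)]
bs = [np.random.randn(3), np.random.randn(1)]
a_L = forward(np.array([0.5, -1.2]), Ws, bs, [sigmoid, sigmoid])
print(a_L)  # output vector in R^{n^L}, here with n^L = 1
```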

How close does $\textbf{a}^L$ come to some desired output vector $\textbf{y} \in \mathbb{R}^{n^L}$? A good way to measure this closeness is the sum of the squares of the element-wise differences:

$$\tfrac{1}{2} \sum_{i=1}^{n^L} (a^L_i - y_i)^2 = \tfrac{1}{2} \| \textbf{a}^L - \textbf{y} \|_2^2,$$

where $\|\cdot\|_2$ is the 2-norm.
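
As a quick NumPy sketch (the function name `pair_error` is just an illustrative choice), this closeness measure for a single input/output pair is:

```python
import numpy as np

def pair_error(a_L, y):
    """Half the sum of squared element-wise differences, i.e. 0.5 * ||a_L - y||_2^2."""
    return 0.5 * np.sum((a_L - y) ** 2)

# Example: computed output (0.2, 0.9) versus desired output (0, 1)
print(pair_error(np.array([0.2, 0.9]), np.array([0.0, 1.0])))  # 0.025
```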

Assume now that we have a set of $m$ input/output pairs:

$$(\textbf{x}_c, \textbf{y}_c) \in \mathbb{R}^{n^0} \times \mathbb{R}^{n^L},$$

for $c=1,\ldots,m$. How close do the outputs $N(\textbf{x}_c)$ come to the desired outputs $\textbf{y}_c$? We measure this closeness by setting

$$A^L = [N(\textbf{x}_1) \; N(\textbf{x}_2) \cdots N(\textbf{x}_m)], \quad Y = [\textbf{y}_1 \; \textbf{y}_2 \cdots \textbf{y}_m]$$

and then computing the error/cost function $E$ by averaging over the errors of the individual pairs:

$$E = \tfrac{1}{2m} \sum_{c=1}^m \sum_{i=1}^{n^L} \left( A_{ic}^L - Y_{ic} \right)^2 = \tfrac{1}{2m} \sum_{c=1}^m \| A^L_{\ast,c} - Y_{\ast,c} \|_2^2 = \tfrac{1}{2m} \left\| A^L - Y \right\|_F^2,$$

where $\|\cdot\|_F$ is the Frobenius norm and $A^L_{\ast,c}$ denotes the $c$th column of $A^L$.
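
A sketch of the cost computation, assuming the columns of `A_L` hold the computed outputs $N(\textbf{x}_c)$ and the columns of `Y` the desired outputs $\textbf{y}_c$ (names are illustrative). The three expressions below mirror the three forms of the equation above and agree numerically:

```python
import numpy as np

def cost(A_L, Y):
    """E = (1/2m) * ||A^L - Y||_F^2, the average of the per-pair squared errors."""
    m = Y.shape[1]  # number of input/output pairs
    return np.sum((A_L - Y) ** 2) / (2 * m)

# Example with m = 2 pairs and n^L = 2 output nodes
A_L = np.array([[0.2, 0.8],
                [0.9, 0.1]])
Y = np.array([[0.0, 1.0],
              [1.0, 0.0]])
m = Y.shape[1]
e_sum = np.sum((A_L - Y) ** 2) / (2 * m)                                             # element-wise double sum
e_cols = sum(np.linalg.norm(A_L[:, c] - Y[:, c]) ** 2 for c in range(m)) / (2 * m)   # per-column 2-norms
e_fro = np.linalg.norm(A_L - Y, 'fro') ** 2 / (2 * m)                                # Frobenius norm
assert np.isclose(e_sum, e_cols) and np.isclose(e_sum, e_fro)
print(e_sum)  # 0.025
```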

Note how $E$ can, and should, be seen as a function of the weights and biases. This way $E$ becomes a map from $\mathbb{R}^p$ into $\mathbb{R}$, where $p$ is the total number of weights and biases, $p=(n^0+1)n^1 + (n^1+1)n^2 + \cdots + (n^{L-1}+1)n^L$.
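
The count $p$ follows directly from the layer sizes. A small sketch, where `n` is an assumed list of layer sizes `[n^0, n^1, ..., n^L]`:

```python
def parameter_count(n):
    """Total number of weights and biases: sum over layers of (n^{l-1} + 1) * n^l."""
    return sum((n[l - 1] + 1) * n[l] for l in range(1, len(n)))

# Example: a 2-3-1 network has (2+1)*3 + (3+1)*1 = 13 parameters
print(parameter_count([2, 3, 1]))  # 13
```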

The quantity $E$ has some obvious, and useful, properties:

  • $E$ is always non-negative.
  • The closer $E$ is to zero, the closer the computed outputs $N(\textbf{x}_c)$ are to the desired outputs $\textbf{y}_c$. (This follows from the fact that $\| N(\textbf{x}_c) - \textbf{y}_c \|_2^2 \leq 2m E$ for all $c=1,\ldots,m$, since $2mE$ is a sum of non-negative terms, one of which is $\| N(\textbf{x}_c) - \textbf{y}_c \|_2^2$; see the small numerical check below.)
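
A quick numerical check of the bound, reusing the small assumed example matrices from the cost sketch above:

```python
import numpy as np

# Columns are per-pair computed outputs / desired outputs (assumed example values)
A_L = np.array([[0.2, 0.8],
                [0.9, 0.1]])
Y = np.array([[0.0, 1.0],
              [1.0, 0.0]])
m = Y.shape[1]
E = np.sum((A_L - Y) ** 2) / (2 * m)

# Each pair's squared error is bounded by 2mE
for c in range(m):
    assert np.linalg.norm(A_L[:, c] - Y[:, c]) ** 2 <= 2 * m * E
```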

The set of $m$ input/output pairs $(\textbf{x}_c, \textbf{y}_c)$ is typically called a training set. It is called so because, given a training set, we can seek the weights and biases of the neural network that minimize the error $E$.

How do you find the parameters that minimize a given function? That is the subject of the next post.