Consider a neural network as previously described.
As before, we fix the structure of the neural network: the number of layers, the number of nodes in each layer, and the activation functions. Now, given the weights and biases for each layer, we can compute the output vector $a^L = N(x) \in \mathbb{R}^{n_L}$ for any input vector $x \in \mathbb{R}^{n_0}$.
How close does $a^L$ come to some desired output vector $y \in \mathbb{R}^{n_L}$?
A good way to compute this closeness is using the sum of the squares of the element-wise differences. Given $m$ input/output pairs $(x_c, y_c)$, define the error

$$E = \frac{1}{2m} \sum_{c=1}^{m} \left\| N(x_c) - y_c \right\|_2^2.$$
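As a quick illustration, here is a minimal NumPy sketch of this error; the name `squared_error` and the forward-pass callable `N` are stand-ins for illustration, not code from this series.

```python
import numpy as np

def squared_error(N, pairs):
    """Error E over a list of (x, y) training pairs.

    N is any callable mapping an input vector to an output vector
    (hypothetical stand-in for the network's forward pass).
    """
    m = len(pairs)
    return sum(np.sum((N(x) - y) ** 2) for x, y in pairs) / (2 * m)
```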
Note how $E$ can, and should, be seen as a function of the weights and biases. This way $E$ becomes a map from $\mathbb{R}^p$ into $\mathbb{R}$, where $p$ is the total number of weights and biases:

$$p = (n_0 + 1)n_1 + (n_1 + 1)n_2 + \cdots + (n_{L-1} + 1)n_L.$$
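Counting the parameters this way is a one-liner; the layer widths below are made up for illustration.

```python
# Layer l contributes n_{l-1} * n_l weights plus n_l biases,
# i.e. (n_{l-1} + 1) * n_l parameters in total.
sizes = [3, 5, 5, 2]  # hypothetical layer widths n_0, ..., n_L
p = sum((n_prev + 1) * n for n_prev, n in zip(sizes, sizes[1:]))
print(p)  # (3+1)*5 + (5+1)*5 + (5+1)*2 = 62
```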
The quantity $E$ has some obvious, and useful, properties:

- $E$ is always non-negative.
- The closer $E$ is to zero, the closer the computed outputs $N(x_c)$ are to the desired outputs $y_c$. (This follows from the fact that $\|N(x_c) - y_c\|_2^2 \le 2mE$ for all $c = 1, \dots, m$: since $2mE$ is the sum of the non-negative terms $\|N(x_c) - y_c\|_2^2$, each individual term is at most $2mE$.)
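Both properties are easy to confirm numerically; in the sketch below, `np.tanh` stands in for a trained network $N$ and the training pairs are random.

```python
import numpy as np

rng = np.random.default_rng(0)
N = np.tanh  # hypothetical stand-in for a network's forward pass
pairs = [(rng.standard_normal(3), rng.standard_normal(3)) for _ in range(4)]

m = len(pairs)
E = sum(np.sum((N(x) - y) ** 2) for x, y in pairs) / (2 * m)
assert E >= 0  # non-negativity
# Each individual squared distance is bounded by 2mE:
assert all(np.sum((N(x) - y) ** 2) <= 2 * m * E for x, y in pairs)
```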
The set of $m$ input/output pairs $(x_c, y_c)$ is typically called a training set. It is called so because, given a training set, we can seek the weights and biases of the neural network that minimize the error $E$.
How do you find the parameters that minimize a given function? That is the subject of the next post.