Let us forget the specifics of neural networks for a moment and consider some function $f \colon \mathbb{R}^n \to \mathbb{R}$. Now let $f$ be differentiable at some point $a \in \mathbb{R}^n$. The gradient at this point is

$$\nabla f(a) = \left( \frac{\partial f}{\partial x_1}(a), \ldots, \frac{\partial f}{\partial x_n}(a) \right).$$
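As a small worked example (a function chosen purely for illustration, not one used later in the post), take

$$f(x) = x_1^2 + x_2^2, \qquad \nabla f(x) = (2x_1,\; 2x_2),$$

so at the point $a = (3, -4)$ the gradient is $\nabla f(a) = (6, -8)$.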
As the title of this post suggests, the gradients are important. This is due to the fact that the gradient is the direction in which the function increases the most. Conversely, the negative gradient is the direction in which the function decreases the most. So if $f$ is differentiable in a neighborhood of some point $a$, then a step size $\gamma > 0$ exists such that

$$f(a - \gamma \nabla f(a)) \leq f(a).$$
That is the general idea of the Gradient Descent method (also called Steepest Descent): Iteratively find $a_1$, $a_2$, $a_3, \ldots$ via

$$a_{n+1} = a_n - \gamma_n \nabla f(a_n),$$

and then, hopefully, arrive at a point where $\nabla f(a_n) = 0$ (or close to it, for practical purposes). When the gradient is zero we have a stationary point and, hopefully, a local minimum.
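To make the iteration concrete, here is a minimal sketch in Python. The function `gradient_descent`, the fixed step size `gamma`, and the stopping tolerance are illustrative choices of mine, not part of any particular library:

```python
import numpy as np

def gradient_descent(grad_f, x0, gamma=0.1, tol=1e-8, max_iter=10_000):
    """Iterate a_{n+1} = a_n - gamma * grad_f(a_n) until the gradient is (almost) zero."""
    a = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(a)
        if np.linalg.norm(g) < tol:   # stationary point (hopefully a local minimum)
            break
        a = a - gamma * g             # step in the direction of steepest descent
    return a

# Illustrative run on f(x) = x_1^2 + x_2^2, whose gradient is (2*x_1, 2*x_2).
print(gradient_descent(lambda a: 2 * a, x0=[3.0, -4.0]))  # ends up very close to (0, 0)
```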
The previous paragraph says "hopefully" twice, and that is because the Gradient Descent algorithm does not always converge to a local minimum.
If $f$ has certain nice properties ($f$ convex and $\nabla f$ Lipschitz continuous) it can be proved that the method converges to the global minimum (this also requires that the step sizes are chosen carefully).
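For reference, one standard textbook formulation of such a result (my phrasing, assuming $\nabla f$ is $L$-Lipschitz, the step size is fixed at $\gamma = 1/L$, and $a^*$ is a global minimizer) is

$$f(a_n) - f(a^*) \leq \frac{L \, \lVert a_0 - a^* \rVert^2}{2n},$$

so the error in function value shrinks like $1/n$.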
If $f$ is defined and continuously differentiable on a closed set $D$, then the Gradient Descent method will either run into the boundary of $D$ or converge to a stationary point. This was shown by Haskell B. Curry (yes, that Haskell Curry; both currying and Haskell are named after him) in the paper The method of steepest descent for non-linear minimization problems.
That was some theory, but what happens when we apply the Gradient Descent method to some neural network? Firstly, we do have one requirement if we plan to use Gradient Descent for a neural network: The activation functions must be differentiable, which, in turn, will make the error function differentiable.
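As an illustration of that requirement, here is a sketch of one commonly used differentiable activation function, the sigmoid, together with its derivative (the function names are mine, chosen for this example only):

```python
import numpy as np

def sigmoid(z):
    """Differentiable activation: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)
```

By contrast, something like a hard step function would be useless here: it is not differentiable at zero and its derivative is zero everywhere else, so the gradient carries no information.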
In general, the error function of a neural network is not convex and may contain several local minima. There is also the question of choosing the step size $\gamma_n$ at each iteration. What to choose? In practice, a single global learning rate is often used for every step, and that may lead to divergence because it may be too large. Finally, even though the error function is bounded below by zero, a minimum may not even exist (just think of the function $f(x) = e^{-x}$, which never attains its infimum), which will also lead to divergence.
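To see the step-size problem concretely, here is a small sketch (the quadratic and the two learning rates are purely illustrative): the same function converges for a modest learning rate and blows up for a slightly too large one.

```python
# Illustrative function f(x) = x^2 with gradient f'(x) = 2x.
grad = lambda x: 2 * x

for gamma in (0.1, 1.1):            # a "safe" and a too-large global learning rate
    x = 1.0
    for _ in range(50):
        x = x - gamma * grad(x)     # each step multiplies x by (1 - 2 * gamma)
    print(gamma, x)                 # gamma = 0.1 ends near 0, gamma = 1.1 has exploded
```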
There is also the question of how to compute the gradient of the error function. But, fortunately, this is surprisingly straightforward and is exactly what the next post on back-propagation deals with.