Let us forget the specifics of neural networks for a moment and consider some function $G: \mathbb{R}^n \mapsto \mathbb{R}$.

Now let $G$ be differentiable at some point $\textbf{x} = (x_1,x_2,\ldots,x_n)$. The gradient at this point is

$\nabla G(\textbf{x}) = \left( \frac{\partial G}{\partial x_1}(\textbf{x}), \frac{\partial G}{\partial x_2}(\textbf{x}), \ldots, \frac{\partial G}{\partial x_n}(\textbf{x}) \right).$

As the title of this post suggests, the gradients are important.
This is due to the fact that the gradient is the direction in which the function *increases the most*.
Conversely, the negative gradient is the direction in which the function *decreases this most*.
So if $G$ is differentiable in a
neighborhood
of some point $\textbf{x}^n$, then a $\gamma_n > 0$ exists such that

$G(\textbf{x}^{n+1}) \leq G(\textbf{x}^n) \quad \text{for} \quad \textbf{x}^{n+1} = \textbf{x}^n - \gamma_n \nabla G(\textbf{x}^n).$

That is the general idea of the Gradient Descent method (also called Steepest Descent): Iteratively find $\textbf{x}^0$, $\textbf{x}^1$, $\textbf{x}^2$, $\ldots$ and then, hopefully, arrive at a point $\textbf{x}^n$ where $\nabla G(\textbf{x}^n) = \textbf{0}$ (or close to, for practical purposes). When the gradient is zero we have a stationary point and, hopefully, a local minimum.

The previous paragraph says "hopefully" twice and that is because the Gradient Descent algorithm may not always converge to a local minimum.

If $G$ has certain nice properties ($G$ convex and $\nabla G$ Lipschitz) it can be proved that the method converges to the global minimum (also requires that the steps $\gamma_n$ are chosen carefully).

If $G$ is defined and
continuously differentiable
on a closed set $S$, then the Gradient Descent method will either run into the boundary of $S$
or converge to a stationary point. This was shown by Haskell B. Curry
(yes, *that* Haskell Curry, both currying
and Haskell are named after him) in the paper
The method of steepest descent for non-linear minimization problems.

That was some theory, but what happens when we apply the Gradient Descent method to some neural network? Firstly, we do have one requirement if we plan to use Gradient Descent for a neural network: The activation functions must be differentiable, which, in turn, will make the error function $E$ differentiable.

In general, a neural network is not convex and may contain several local minima. There is
also the question of choosing the step size $\gamma_n$ at each iteration. What to choose?
In practise, a global *learning rate* $\gamma$ is often chosen for every step. And that may
lead to divergence because it may be too large. Finally, even though $E \geq 0$, a minimum
may not even *exist* (just think of the function $x \mapsto e^{-x}$), which will also
lead to divergence.

There is also the question of how to compute the gradient of the error function. But, fortunately, this is surprisingly straightforward and is exactly what the next post on back-propagation deals with.