
Neural Networks - Gradient Descent

Let us forget the specifics of neural networks for a moment and consider some function $G: \mathbb{R}^n \to \mathbb{R}$.

Now let $G$ be differentiable at some point $\textbf{x} = (x_1, x_2, \ldots, x_n)$. The gradient at this point is

$$\nabla G(\textbf{x}) = \left( \frac{\partial G}{\partial x_1}(\textbf{x}), \frac{\partial G}{\partial x_2}(\textbf{x}), \ldots, \frac{\partial G}{\partial x_n}(\textbf{x}) \right).$$
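
As a small illustration (my own example, not part of the original post), take $G(\textbf{x}) = x_1^2 + 3 x_2^2$. Differentiating with respect to each coordinate gives

$$\nabla G(\textbf{x}) = (2 x_1, 6 x_2),$$

so at the point $\textbf{x} = (1, 2)$, for instance, the gradient is $(2, 12)$.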

As the title of this post suggests, the gradient is important. This is because the gradient is the direction in which the function increases the most. Conversely, the negative gradient is the direction in which the function decreases the most. So if $G$ is differentiable in a neighborhood of some point $\textbf{x}^n$, then a $\gamma_n > 0$ exists such that

$$G(\textbf{x}^{n+1}) \leq G(\textbf{x}^n) \quad \text{for} \quad \textbf{x}^{n+1} = \textbf{x}^n - \gamma_n \nabla G(\textbf{x}^n).$$

That is the general idea of the Gradient Descent method (also called Steepest Descent): Iteratively compute $\textbf{x}^0$, $\textbf{x}^1$, $\textbf{x}^2$, $\ldots$ and then, hopefully, arrive at a point $\textbf{x}^n$ where $\nabla G(\textbf{x}^n) = \textbf{0}$ (or close to it, for practical purposes). When the gradient is zero we have a stationary point and, hopefully, a local minimum.
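
To make the iteration concrete, here is a minimal sketch in Python (my own illustration; the example function, the fixed step size, and the stopping criterion are assumptions made for this sketch, not something prescribed by the method):

```python
import numpy as np

def G(x):
    # Example function: G(x) = x1^2 + 3*x2^2, minimized at (0, 0)
    return x[0]**2 + 3 * x[1]**2

def grad_G(x):
    # Gradient of G, computed by hand: (2*x1, 6*x2)
    return np.array([2 * x[0], 6 * x[1]])

def gradient_descent(x0, gamma=0.1, tol=1e-8, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_G(x)
        if np.linalg.norm(g) < tol:  # gradient close to zero: stationary point
            break
        x = x - gamma * g  # step in the direction of the negative gradient
    return x

print(gradient_descent([1.0, 2.0]))  # approaches the minimum at (0, 0)
```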

The previous paragraph says "hopefully" twice and that is because the Gradient Descent algorithm may not always converge to a local minimum.

If $G$ has certain nice properties ($G$ convex and $\nabla G$ Lipschitz continuous) it can be proved that the method converges to the global minimum (this also requires that the step sizes $\gamma_n$ are chosen carefully).
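
For instance, a standard result from convex optimization (not stated in the original post) is that if $G$ is convex with a minimizer $\textbf{x}^*$ and $\nabla G$ is Lipschitz continuous with constant $L$, then the fixed step size $\gamma_n = 1/L$ guarantees

$$G(\textbf{x}^n) - G(\textbf{x}^*) \leq \frac{L \|\textbf{x}^0 - \textbf{x}^*\|^2}{2n},$$

so the function values approach the global minimum at a rate of $O(1/n)$.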

If $G$ is defined and continuously differentiable on a closed set $S$, then the Gradient Descent method will either run into the boundary of $S$ or converge to a stationary point. This was shown by Haskell B. Curry (yes, that Haskell Curry; both currying and Haskell are named after him) in the paper "The method of steepest descent for non-linear minimization problems".

That was some theory, but what happens when we apply the Gradient Descent method to a neural network? Firstly, we do have one requirement if we plan to use Gradient Descent for a neural network: The activation functions must be differentiable, which, in turn, will make the error function $E$ differentiable.
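
A common concrete choice (my example; the post does not name a specific activation function) is the sigmoid $\sigma(x) = 1/(1 + e^{-x})$, which is differentiable everywhere and has the convenient derivative

$$\sigma'(x) = \sigma(x)(1 - \sigma(x)).$$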

In general, the error function of a neural network is not convex and may contain several local minima. There is also the question of choosing the step size $\gamma_n$ at each iteration. What to choose? In practice, a global learning rate $\gamma$ is often chosen and used for every step, and if it is too large, the iteration may diverge. Finally, even though $E \geq 0$, a minimum may not even exist (just think of the function $x \mapsto e^{-x}$), which will also lead to divergence.
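
A tiny sketch (my own illustration, using the one-dimensional function $x \mapsto x^2$ instead of an actual network) shows how a too-large fixed learning rate causes divergence:

```python
# For G(x) = x^2 the gradient is 2x, so the update becomes
# x <- x - gamma * 2x = (1 - 2*gamma) * x,
# which diverges whenever |1 - 2*gamma| > 1, i.e. gamma > 1.
def descend(gamma, x=1.0, steps=10):
    for _ in range(steps):
        x -= gamma * 2 * x  # gradient descent step on G(x) = x^2
    return x

print(descend(gamma=0.1))  # approx. 0.107: converges toward the minimum at 0
print(descend(gamma=1.1))  # approx. 6.19: |x| grows each step, so it diverges
```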

There is also the question of how to compute the gradient of the error function. But, fortunately, this is surprisingly straightforward and is exactly what the next post on back-propagation deals with.