
Neural Networks - Activation Functions

As mentioned in a previous post, the activation functions used in a neural network can be any differentiable function $g: \mathbb{R} \mapsto \mathbb{R}$. Such functions make the output of a neural network well-defined, they make it possible to compute the gradient at any point, and that, in turn, makes it possible to apply the Gradient Descent method.

There are some considerations when choosing activation functions:

  1. Their shape dictates which non-linear properties the neural network can have.
  2. Their properties can affect if and how fast Gradient Descent converges.
  3. They affect how many local minima the error function can have.
  4. The activation function for the output layer determines which values the neural network can produce.

One thing that seems to improve/ensure convergence is smoothness. Recall from the Gradient Descent post that both a continuous derivative and a Lipschitz condition for the gradient helped prove certain convergence theorems.

Most activation functions are monotonic. There is nothing wrong with using an activation function $g$ for which $g(z_1)=g(z_2)$ for some $z_1 \neq z_2$ (which is exactly what will be the case for a non-monotonic function), but it means that the error function may have multiple minima.

Let us consider some common activation functions.

Linear

Any linear function $\mathbb{R} \mapsto \mathbb{R}$ will have the signature $z \mapsto \alpha z$ for some real number $\alpha$. But it is easy to see that using $W^l, b^l$ and $g^l(z) = \alpha z$ for some layer $l \geq 1$ is equivalent to using $\alpha W^l, \alpha b^l$ and $g^l(z) = z$. So the only linear activation function worth considering is the identity

$$g_{\text{id}}(z) = z.$$
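To illustrate the scaling argument, here is a minimal sketch (assuming NumPy and made-up layer dimensions and weights) checking numerically that a layer using $W^l, b^l$ with activation $g(z) = \alpha z$ produces the same output as a layer using $\alpha W^l, \alpha b^l$ with the identity activation:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.7                          # made-up scaling factor
W = rng.normal(size=(3, 4))          # made-up weights for a 4 -> 3 layer
b = rng.normal(size=(3, 1))          # made-up biases
a_prev = rng.normal(size=(4, 1))     # made-up activations from the previous layer

out_scaled_activation = alpha * (W @ a_prev + b)             # W, b with g(z) = alpha * z
out_identity_activation = (alpha * W) @ a_prev + alpha * b   # alpha*W, alpha*b with g(z) = z

assert np.allclose(out_scaled_activation, out_identity_activation)
```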

It can make perfect sense to use a linear activation function for the output layer, but it does not make much sense to use it for a hidden layer. To see this, assume that a hidden layer $l$ has $g^l(z)=z$. Then we have, in general,

$$\begin{bmatrix} z^l \\ 1 \end{bmatrix} = \begin{bmatrix} W^l & b^l \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a^{l-1} \\ 1 \end{bmatrix}, \quad \begin{bmatrix} z^{l+1} \\ 1 \end{bmatrix} = \begin{bmatrix} W^{l+1} & b^{l+1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a^l \\ 1 \end{bmatrix},$$

but because of the linear activation function we have $a^l = z^l$, which means that

$$\begin{bmatrix} z^{l+1} \\ 1 \end{bmatrix} = \begin{bmatrix} W^{l+1} & b^{l+1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} W^l & b^l \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a^{l-1} \\ 1 \end{bmatrix},$$

which implies that layer $l$ is essentially redundant (removing the layer is not always equivalent to keeping it in the special case where $n^l < \min(n^{l-1}, n^{l+1})$, since the linear map from $\begin{bmatrix} a^{l-1} \\ 1 \end{bmatrix}$ to $\begin{bmatrix} z^{l+1} \\ 1 \end{bmatrix}$ then has rank at most $n^l+1$).
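The collapse of the two consecutive affine maps into one can also be seen directly in code. A small sketch (assuming NumPy and made-up dimensions $n^{l-1}=4$, $n^l=3$, $n^{l+1}=2$):

```python
import numpy as np

rng = np.random.default_rng(1)
W_l, b_l = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))      # layer l, identity activation
W_lp1, b_lp1 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))  # layer l+1
a_prev = rng.normal(size=(4, 1))                                 # activations a^{l-1}

# Forward pass through both layers, using a^l = z^l.
z_lp1 = W_lp1 @ (W_l @ a_prev + b_l) + b_lp1

# A single affine map with composed weights gives the same z^{l+1}.
W_combined = W_lp1 @ W_l
b_combined = W_lp1 @ b_l + b_lp1
assert np.allclose(z_lp1, W_combined @ a_prev + b_combined)
```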

Sigmoid

The sigmoid activation function is defined as

$$g_\sigma(z) = \frac{1}{1 + e^{-z}}$$

with derivative

$$g'_\sigma(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = g_\sigma(z) (1-g_\sigma(z)).$$
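As a small sketch (assuming NumPy), the sigmoid and its derivative can be implemented directly from these formulas; note how the derivative reuses the function value:

```python
import numpy as np

def sigmoid(z):
    """The sigmoid activation 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """The derivative, expressed through the function value itself."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-6.0, 6.0, 7)
# The two expressions for the derivative agree.
assert np.allclose(sigmoid_prime(z), np.exp(-z) / (1.0 + np.exp(-z)) ** 2)
```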

A plot of $g_\sigma$ and $g'_\sigma$ looks like this:

Figure 1. The sigmoid activation function and its derivative.

ReLU

The Rectified Linear Unit, ReLU, is defined as

$$g_{\scriptscriptstyle \text{ReLU}}(z) = \begin{cases} 0 & \text{if $z \leq 0$} \\ z & \text{if $z > 0$} \end{cases}$$

with derivative

$$g'_{\scriptscriptstyle \text{ReLU}}(z) = \begin{cases} 0 & \text{if $z \leq 0$} \\ 1 & \text{if $z > 0$.} \end{cases}$$
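A corresponding sketch for ReLU (again assuming NumPy); following the definition above, the derivative at $z = 0$ is taken to be $0$:

```python
import numpy as np

def relu(z):
    """ReLU: 0 for z <= 0, z for z > 0, applied elementwise."""
    return np.maximum(0.0, z)

def relu_prime(z):
    """Derivative of ReLU: 0 for z <= 0, 1 for z > 0."""
    return np.where(z > 0, 1.0, 0.0)

z = np.array([-2.0, 0.0, 3.5])
print(relu(z))        # [0.  0.  3.5]
print(relu_prime(z))  # [0. 0. 1.]
```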

A plot of $g_{\scriptscriptstyle \text{ReLU}}$ and $g'_{\scriptscriptstyle \text{ReLU}}$ looks like this:

Figure 2. The ReLU activation function and its derivative.

Let us now open our favourite code editor and look at an implementation.