As mentioned in a previous post, the activation functions used in a neural network can be any differentiable functions. Such functions make the output of a neural network well-defined and make it possible to compute the gradient at any point, which in turn makes it possible to apply the Gradient Descent method.
There are some considerations when choosing activation functions:
- Their shape dictates which non-linear properties the neural network can have.
- Their properties can affect whether and how fast Gradient Descent converges.
- They influence how many local minima the error function can have.
- The activation function of the output layer determines which values the neural network can produce.
One thing that seems to improve/ensure convergence is smoothness. Recall from the Gradient Descent post that both a continuous derivative and a Lipschitz condition for the gradient helped prove certain convergence theorems.
Most activation functions are monotonic. There is nothing wrong with using an activation function for which $\sigma'(x) < 0$ for some $x$ (which is exactly what will be the case for a non-monotonic function), but it means that the error function may have multiple minima.
Let us consider some common activation functions.
Linear
Any linear activation function will have the form $\sigma(x) = ax$ for some real number $a$. But it is easy to see that using $\sigma(x) = ax$ with weights $W$ for some layer is equivalent to using $\sigma(x) = x$ with weights $aW$. So the only linear activation function worth considering is the identity

$$\sigma(x) = x$$
It can make perfect sense to use a linear activation function for the output layer, but it does not make much sense to use it for a hidden layer. To see this, assume that a hidden layer $l$ has $\sigma(x) = x$. Then we have, in general,

$$a^{(l+1)} = \sigma\!\left(W^{(l+1)} a^{(l)} + b^{(l+1)}\right)$$

but because of the linear activation function we have $a^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, which means that

$$a^{(l+1)} = \sigma\!\left(W^{(l+1)} W^{(l)} a^{(l-1)} + W^{(l+1)} b^{(l)} + b^{(l+1)}\right)$$

which implies that layer $l$ is essentially redundant (removing the layer will not always be equivalent to keeping it: in the special case where layer $l$ has fewer units than layers $l-1$ and $l+1$, the linear map from $a^{(l-1)}$ to $a^{(l+1)}$ has rank at most the width of layer $l$).
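To make this concrete, here is a minimal NumPy sketch (the layer sizes and random weights are made up purely for illustration) showing that a hidden layer with the identity activation collapses into a single affine map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: 4 inputs -> 8 hidden units (identity activation) -> 3 outputs.
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((3, 8)), rng.standard_normal(3)

x = rng.standard_normal(4)

# Two layers where the hidden layer uses the identity activation ...
two_layers = W2 @ (W1 @ x + b1) + b2

# ... collapse into a single affine map with combined weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```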
Sigmoid
The sigmoid activation function is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

with derivative

$$\sigma'(x) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2} = \sigma(x)\left(1 - \sigma(x)\right)$$

A plot of $\sigma$ and $\sigma'$ looks like this:
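As a quick sketch (assuming NumPy; the function names here are my own, not from any particular library), the sigmoid and its derivative can be implemented like this:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: 1 / (1 + e^(-x)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))        # [0.1192...  0.5   0.8807...]
print(sigmoid_prime(x))  # [0.1049...  0.25  0.1049...]
```

Reusing `sigmoid` inside `sigmoid_prime` avoids recomputing the exponential and mirrors the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ above.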
ReLU
The Rectified Linear Unit, ReLU, is defined as

$$\mathrm{ReLU}(x) = \max(0, x)$$

with derivative

$$\mathrm{ReLU}'(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x > 0 \end{cases}$$

(the derivative is not defined at $x = 0$; in practice it is usually set to $0$ or $1$ there). A plot of $\mathrm{ReLU}$ and its derivative looks like this:
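And a corresponding NumPy sketch for ReLU (again, the function names are my own, and the derivative at $x = 0$ is set to $0$ by convention):

```python
import numpy as np

def relu(x):
    """ReLU activation: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_prime(x):
    """Derivative of ReLU: 0 for x < 0 and 1 for x > 0.

    The derivative is undefined at x = 0; here it is set to 0 by convention.
    """
    return (x > 0).astype(float)

x = np.array([-1.5, 0.0, 2.0])
print(relu(x))        # [0.  0.  2.]
print(relu_prime(x))  # [0.  0.  1.]
```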
Let us now open our favourite code editor and look at an implementation.