Activation Functions

Introduction

A neural network loosely mimics how the brain works: it is built from interconnected neurons that pass information to one another. Just as the brain decides which information is worth keeping, a neural network must decide which features of its input, such as text, sound, or images, are useful. This is where the activation function comes into the picture. The activation function lets the network learn complex, non-linear patterns in the data so that it predicts accurately on unseen examples, and it also helps normalize the output of each neuron, typically to a range such as 0 to 1 or -1 to 1.

Types of Activation Functions:

1. Binary Step Function:

Binary Step Function

As seen in the figure above, this is a threshold-based classifier: when the input is greater than the threshold, the neuron is activated; otherwise, it stays idle. For example, with a threshold of zero, the neuron fires for any input greater than zero and is deactivated otherwise. The gradient of the step function is zero everywhere, since the derivative of f(x) with respect to x is zero, so weights and biases cannot be updated during backpropagation.
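
To make this concrete, here is a minimal NumPy sketch (the names binary_step and binary_step_grad are our own, chosen for illustration) showing the step output and its zero gradient:

import numpy as np

def binary_step(x, threshold=0.0):
    # Fires 1 when the input exceeds the threshold, otherwise 0.
    return np.where(x > threshold, 1.0, 0.0)

def binary_step_grad(x):
    # The derivative is 0 everywhere (undefined exactly at the threshold),
    # so gradient-based weight updates receive no signal.
    return np.zeros_like(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(binary_step(x))       # [0. 0. 0. 1. 1.]
print(binary_step_grad(x))  # [0. 0. 0. 0. 0.]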

2. Linear Function:

Linear Function

f(x) = mx + b

Differentiating with respect to x we get

df(x)/dx = m

Here the derivative is a constant because the activation depends linearly on the input. The gradient does not become zero, but it is the same everywhere, so during backpropagation the weights and biases are always updated by the same factor. A stack of linear layers collapses into a single linear layer, so the network cannot learn complex patterns from the data.
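
A small sketch (again with illustrative names, linear and linear_grad) makes the constant gradient visible:

import numpy as np

def linear(x, m=1.0, b=0.0):
    # f(x) = m*x + b
    return m * x + b

def linear_grad(x, m=1.0):
    # df/dx = m, a constant: every input backpropagates the same factor,
    # so stacking linear layers never adds representational power.
    return np.full_like(x, m)

x = np.linspace(-2, 2, 5)
print(linear(x))       # [-2. -1.  0.  1.  2.]
print(linear_grad(x))  # [1. 1. 1. 1. 1.]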

3. Sigmoid:

Sigmoid Function

Sigmoid is one of the most widely used non-linear activation functions, with output values ranging between 0 and 1. A network whose neurons/perceptrons use sigmoid as their activation function therefore produces a non-linear output. The sigmoid function is differentiable and has a smooth gradient.

f(x) = 1 / (1 + e^(-x))

Differentiating f(x) with respect to x we get

df(x)/dx = (1 / (1 + e^(-x))) * (1 - 1 / (1 + e^(-x))) = f(x) * (1 - f(x))

The main problem with sigmoid networks is the “vanishing gradient”: because the output is squashed into the range 0 to 1, a large change in the input produces only a small change in the output, so the gradient is tiny. During backpropagation, the weight updates therefore become very small or negligible.

Another problem is that the sigmoid output is not zero-centered, which slows down optimization and increases the time needed to converge toward the minimum.
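
The following NumPy sketch (helper names are our own) computes the sigmoid and its derivative and shows how the gradient collapses for large inputs:

import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # df/dx = f(x) * (1 - f(x)); its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 10.0]:
    print(x, sigmoid(x), sigmoid_grad(x))
# At x = 10 the gradient is about 4.5e-05: almost nothing flows back,
# which is exactly the vanishing gradient effect described above.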

4. Tanh:

Tanh Function

Although it looks similar to the sigmoid function, the Tanh function ranges from -1 to 1, and unlike sigmoid it is zero-centered, which is good for optimization. However, it also suffers from the vanishing gradient problem: when the input is very large or very small, the output saturates and the gradient becomes tiny, which is bad for the weight updates.

        In general, for binary classification, Tanh is used at the hidden layers and sigmoid is used at the output layer.
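
A short sketch (names are illustrative) shows that Tanh, like sigmoid, saturates for large inputs:

import numpy as np

def tanh(x):
    # Zero-centered squashing into (-1, 1)
    return np.tanh(x)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; it also shrinks toward 0 for large |x|
    return 1.0 - np.tanh(x) ** 2

for x in [0.0, 2.0, 10.0]:
    print(x, tanh(x), tanh_grad(x))
# At x = 10 the gradient is about 8e-09, so Tanh saturates even faster
# than sigmoid, although its zero-centered output helps optimization.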

5. ReLU (Rectified Linear Unit):

ReLU Function

Formula is:

ReLU(x) = max(0, x)

ReLU is one of the important achievements of recent years and is one of the most popular and widely used non-linear activation functions for hidden layers. One of its main advantages is that it does not activate all neurons at the same time: whenever x is greater than zero the output is x, otherwise it is zero. This makes it computationally more efficient than the Sigmoid and Tanh functions.

ReLU avoids the vanishing gradient problem of the Sigmoid and Tanh functions because its derivative is either 0 or 1: in the positive region the gradient is passed through unchanged instead of being shrunk.

The main problem appears during backpropagation: for any neuron whose input (pre-activation) is negative, the derivative is zero, so its weights receive no update and the new weight equals the old weight. A neuron stuck in this state never recovers, which leads to the “Dying ReLU” or “dead activation” problem.
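
A minimal sketch (names are our own) of ReLU and its gradient makes the dead-activation behaviour easy to see:

import numpy as np

def relu(x):
    # max(0, x): negative inputs are clipped to zero
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 1 for x > 0 and 0 otherwise; a neuron whose inputs
    # stay negative gets a zero gradient forever ("Dying ReLU")
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]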

6. Leaky ReLU:

Leaky ReLU Function

Formula is:

f(x) = max(0.01x,x)

Leaky ReLU addresses the Dying ReLU problem by multiplying negative inputs by a small factor instead of setting them to zero. The derivative of 0.01x is 0.01, so even when the input is negative (for example, due to negative weights) the gradient is 0.01 rather than zero, and the weights can still be updated.

However, when most of the inputs fall on the negative side, the gradient is only 0.01, so the updates can still become very small, which behaves much like a vanishing gradient.
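
Here is a small sketch (illustrative names) showing that the negative side keeps a non-zero gradient:

import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha * x, x): the negative side keeps a small slope
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Gradient is 1 for x > 0 and alpha (0.01) otherwise, never exactly 0,
    # so neurons with negative pre-activations still receive a small update
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(leaky_relu(x))       # [-0.03  -0.005  0.5    3.   ]
print(leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ]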

7. ELU (Exponential Linear Unit):

ELU Function

Formula is:

f(x) = x                if x > 0
       α(e^x - 1)       otherwise

ELU was created to solve the Dying ReLU problem. Its advantage over ReLU lies specifically in the negative part of the graph. Here α is a parameter whose value is chosen depending on the problem.

So it has no Dead ReLU problem. Second, as the graph above shows, the curve passes smoothly through zero, so the mean output stays close to zero (more zero-centered). The drawback is that ELU contains an exponential term, which takes more time to compute, so it is computationally more expensive than ReLU.
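
A minimal sketch (names are our own) of ELU and its gradient, assuming the common default α = 1.0:

import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (e^x - 1) otherwise; saturates smoothly at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    # Gradient is 1 for x > 0 and alpha * e^x otherwise, so it never dies,
    # but the exponential makes it costlier than plain ReLU
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(elu(x))
print(elu_grad(x))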

8. PReLU (Parametric ReLU):

PReLU Function

PReLU outputs x whenever x > 0 and α*x otherwise. If α is zero it reduces to the ReLU activation function, and if α = 0.01 it becomes Leaky ReLU. In PReLU, however, α is not fixed in advance: it is a parameter that is learned dynamically during training, which is why it is called parametric ReLU.
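
As a sketch (names are illustrative), the function and the gradient with respect to α, which is the signal that lets the network learn α during training:

import numpy as np

def prelu(x, alpha):
    # x for x > 0, alpha * x otherwise; alpha is a learnable parameter
    # (alpha = 0 recovers ReLU, alpha = 0.01 recovers Leaky ReLU)
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    # Gradient of the output w.r.t. alpha: 0 for x > 0, x otherwise
    return np.where(x > 0, 0.0, x)

x = np.array([-2.0, -0.5, 1.0])
print(prelu(x, alpha=0.25))   # [-0.5   -0.125  1.   ]
print(prelu_grad_alpha(x))    # [-2.  -0.5  0. ]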

9. Swish Function(A Self-Gated Function):

Swish Function

Formula is:

f(x) = x * sigmoid(x)

The self-gating idea is closely related to the sigmoid gating used in LSTMs. Because of the sigmoid term, Swish is computationally more expensive than ReLU. It tends to pay off in very deep networks, for example those with more than about 40 layers.

It avoids the Dead ReLU problem because small negative inputs still produce non-zero outputs and gradients, and its smooth curve passes through zero. Its output is unbounded above, while on the negative side it dips only slightly below zero (to about -0.28) before returning toward zero.
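
A minimal sketch (helper names are our own) of the Swish function:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # f(x) = x * sigmoid(x): smooth, non-monotonic, allows small negative outputs
    return x * sigmoid(x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))
# approximately [-0.0335 -0.2689  0.      0.7311  4.9665]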

10. Softmax:

S(x_j) = e^(x_j) / Σ_{k=1}^{K} e^(x_k),     j = 1, 2, 3, ..., K

For an arbitrary real vector of length K, the Softmax function compresses it into a real vector whose elements lie between 0 and 1 and sum to 1, so the output can be read as a probability distribution over K classes.

For a binary classification problem we use the Sigmoid activation function at the output layer, whereas for multiclass classification problems we use the Softmax activation function.
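
A short sketch (names are our own; subtracting the maximum is a common numerical-stability trick and does not change the result, since softmax is shift-invariant):

import numpy as np

def softmax(x):
    # Stable softmax: exponentiate shifted scores, then normalize to sum to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0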
