# 10 Activation Functions: Basics of Neural Network

So what are activation functions? In a neural network, an activation function decides whether a neuron gets activated or not. If you look at a simple neuron model, the inputs are multiplied by the weights, and their sum plus the bias is passed to the activation function: x1*w1 + x2*w2 + ... + xm*wm + bias. This computed value is fed as input to the activation function, and depending on the threshold of the function, it decides whether the neuron fires or not.
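As a rough illustration of the weighted sum and the fire-or-not decision described above, here is a minimal sketch. The function names, the sample numbers, and the simple step threshold are assumptions made for the example, not something taken from the original model.

```python
def weighted_sum(inputs, weights, bias):
    """Compute x1*w1 + x2*w2 + ... + xm*wm + bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def step_activation(z, threshold=0.0):
    """Hypothetical threshold rule: fire (1) if z exceeds the threshold, else 0."""
    return 1 if z > threshold else 0


# Illustrative values only
inputs = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 0.1

z = weighted_sum(inputs, weights, bias)  # 1.0*0.5 + 2.0*(-0.25) + 0.1
fired = step_activation(z)               # z is positive, so the neuron fires
```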

Okay, that's cool. But why do we use an activation function? Without one, the weights and bias would only apply a linear transformation, and no matter how many layers you stack, the neural network behaves like a linear regression model. Derivatives of activation functions are also very important in backpropagation, where they are used to compute the gradient of the loss function with respect to the weights. A well-chosen activation function helps the gradient descent algorithm converge to a local minimum.
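To see why a network without activation functions stays linear, here is a small sketch with two scalar "layers". The weights and biases are made-up numbers; the point is only that stacking two linear steps gives the same result as one linear step with combined parameters.

```python
def linear(x, w, b):
    """A single linear layer on a scalar input: w*x + b."""
    return w * x + b


# Hypothetical parameters for two stacked layers with no activation in between
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5


def two_layers(x):
    return linear(linear(x, w1, b1), w2, b2)


# Algebraically: w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2


def one_layer(x):
    return linear(x, w_combined, b_combined)
```

For any input, `two_layers(x)` and `one_layer(x)` agree, so the extra layer adds no expressive power until a non-linear activation is placed between the two.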

So let's start with the very first activation function, the sigmoid or logistic function. The formula is 1/[1+e^(-βx)]. Here β controls the steepness of the curve: if you keep increasing the value of β, there will be a point where the sigmoid function looks like a step function. For a particular neuron, x = Σ inputs*weights + bias, and this value of x goes in as the input to the sigmoid function. So basically the sigmoid function squashes this input into the range 0 to 1: whatever your input value is, the output will always lie between 0 and 1.
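Here is a minimal sketch of the sigmoid formula above, with β exposed as a `beta` argument. The two-branch form is an implementation choice made here to avoid overflow for large negative inputs; it computes the same function.

```python
import math


def sigmoid(x, beta=1.0):
    """Sigmoid 1 / (1 + e^(-beta*x)), written to avoid overflow."""
    z = beta * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z is negative here, so exp(z) cannot overflow
    return ez / (1.0 + ez)


# With a large beta the curve gets close to a step function around x = 0
soft = [sigmoid(x) for x in (-0.5, 0.0, 0.5)]
sharp = [sigmoid(x, beta=50.0) for x in (-0.5, 0.0, 0.5)]
```

With `beta=50.0` the outputs at -0.5 and 0.5 are already very close to 0 and 1, while the output at 0 stays at 0.5.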

So there are a few advantages, such as being differentiable, having a smooth gradient, and normalizing the output, along with some disadvantages. You can check out this web app from the description. So, let's quickly jump to our next activation function, the hyperbolic tangent. If you compare it with the sigmoid function, you'll observe that the sigmoid function is not zero-centered, which is actually a disadvantage of the sigmoid function.

The hyperbolic tangent overcomes the sigmoid function's disadvantage of not being zero-centered. Just like β in the sigmoid function, θ is responsible for the curvature. Unlike the sigmoid function, the range of the hyperbolic tangent function is -1 to +1. If you are working on a binary classification problem, this is a good fit: use it in the hidden layers and the sigmoid function in the output layer. The major advantages are that it is zero-centered and has a smooth gradient; the disadvantages are vanishing gradients, slow convergence, and slower computation due to the exponential calculations.
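The sketch below shows the hyperbolic tangent with θ as a scale on the input, which is an assumption about how θ enters the formula. It also includes the derivative, to show where the vanishing gradient comes from, and the standard identity tanh(x) = 2*sigmoid(2x) - 1 that links it to the sigmoid.

```python
import math


def tanh_activation(x, theta=1.0):
    """Hyperbolic tangent of theta*x (theta as an input scale is an assumption)."""
    return math.tanh(theta * x)


def tanh_grad(x):
    """Derivative of tanh(x): 1 - tanh(x)^2. It shrinks toward 0 for large |x|."""
    t = math.tanh(x)
    return 1.0 - t * t


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Zero-centered: negative and positive inputs give mirrored outputs
left, centre, right = tanh_activation(-2.0), tanh_activation(0.0), tanh_activation(2.0)

# Same curve as a rescaled, shifted sigmoid
from_sigmoid = 2.0 * sigmoid(2.0 * 0.7) - 1.0
```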

The third activation function is the rectified linear unit (ReLU). Let me put it in simple terms: if the value of x is greater than 0, the output is x, and otherwise the output is 0. So for any positive value x the output is x, but if your value is negative the output is zero. This means the neuron is dead for all negative values, which leads to its main disadvantage: the dying ReLU problem. Also, it is not a zero-centered function. Its advantages are that the function and its derivative are both monotonic and that it is computationally efficient.
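A minimal sketch of ReLU and its derivative follows. The derivative at exactly 0 is set to 0 here, which is a common convention rather than something stated above.

```python
def relu(x):
    """ReLU: x for positive inputs, 0 otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0


outputs = [relu(x) for x in (-3.0, -0.5, 0.0, 2.0)]
grads = [relu_grad(x) for x in (-3.0, -0.5, 0.0, 2.0)]
```

For the negative inputs both the output and the gradient are zero, so no weight update flows back through the neuron. That is the dying ReLU problem in miniature.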

And it's non-linear. To avoid the dying of neurons for negative values, Leaky ReLU comes to the rescue. The formula stays the same for positive values, that is, linear, and for negative values it is 0.01*x, which is 1% of the negative value. This helps the neuron not to die for negative values. Here 0.01 is a constant multiplier of x. What if we convert this into a learnable parameter that is updated through backpropagation? Oh, yes. We already have it in the parametric ReLU function.
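The leaky ReLU rule above can be sketched in a couple of lines. The argument name `slope` is just a label chosen for this example.

```python
def leaky_relu(x, slope=0.01):
    """Leaky ReLU: x for positive inputs, slope*x (1% of x by default) otherwise."""
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    """Derivative: 1 for positive inputs, slope otherwise, so it is never exactly 0."""
    return 1.0 if x > 0 else slope


positive = leaky_relu(5.0)   # unchanged for positive inputs
negative = leaky_relu(-5.0)  # 1% of the negative input
```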

In parametric ReLU, negative inputs are multiplied by a parameter alpha. If you replace this alpha with 0.01, it acts as a leaky ReLU function. By making alpha learnable, we give more power to the neuron in terms of what it can learn through backpropagation. What if for negative values we want exponential behaviour instead of a linear multiplier? Well, that's our next activation function, the exponential linear unit (ELU). Note that for positive input values ELU is the same as ReLU.
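Here is a small sketch of the parametric ReLU idea described above. The function names and the starting value of `alpha` are assumptions for illustration; the second function gives the derivative of the output with respect to alpha, which is what backpropagation would use to update it.

```python
def prelu(x, alpha):
    """Parametric ReLU: x for positive inputs, alpha*x otherwise."""
    return x if x > 0 else alpha * x


def prelu_grad_alpha(x):
    """Derivative of the output with respect to alpha: x for non-positive inputs, else 0."""
    return 0.0 if x > 0 else x


# With alpha fixed at 0.01 this is exactly the leaky ReLU
as_leaky = prelu(-2.0, alpha=0.01)

# With another alpha (hypothetical value) the negative slope changes
steeper = prelu(-2.0, alpha=0.25)
```

Only negative inputs produce a non-zero gradient for alpha, so alpha is adjusted using the examples that land on the negative side.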

Here alpha is a multiplier that scales the exponential part for negative inputs; it is usually kept as a fixed constant. If we talk about advantages, there is no dead ReLU problem, and the outputs are closer to zero-centered. Since exponent calculations are involved, you need to pay for them in computation. Moving further, we have the swish function. This is simply x * sigmoid(βx), which reduces to x * sigmoid(x) when β is 1. This function was released by researchers at Google Brain in 2017. If the value of β is very high, it becomes a replica of the ReLU function.
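The following sketch covers both ELU and swish as described above. It assumes the usual ELU form alpha*(e^x - 1) for negative inputs, and it treats `beta` in swish as a plain argument rather than a trained value.

```python
import math


def elu(x, alpha=1.0):
    """ELU: x for positive inputs, alpha*(e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def sigmoid(z):
    """Sigmoid written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta*x)."""
    return x * sigmoid(beta * x)


# ELU saturates toward -alpha for very negative inputs instead of going to 0
elu_negative = elu(-10.0)

# With a large beta, swish behaves almost like ReLU
swish_pos = swish(2.0, beta=50.0)   # close to 2
swish_neg = swish(-2.0, beta=50.0)  # close to 0
```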

Let's move to the next activation function, softmax. The softmax function takes a set of input values and converts them into probabilities. So if you have 5 inputs, there will be 5 outputs, and each one depends on all the others. If some input values are large, their probabilities will also be on the larger side. Let's try it out. If there's a single element, then obviously its probability will be 1. Now suppose there is more than one element, for example 6 and 4.

The probability will be higher for 6 and lower for 4. One thing to notice is that the sum of the outputs always equals 1, which is a major advantage of using the softmax function. Here comes the next activation function, SoftPlus. This is just a smoother version of the ReLU function.
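Going back to the softmax example with the inputs 6 and 4, here is a minimal sketch. Subtracting the maximum before exponentiating is a standard stability trick added here; it does not change the result.

```python
import math


def softmax(values):
    """Convert a list of numbers into probabilities that sum to 1."""
    shift = max(values)                      # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


single = softmax([7.0])      # a single element always gets probability 1
pair = softmax([6.0, 4.0])   # roughly 0.88 for 6 and 0.12 for 4
```

The gap between the two probabilities depends only on the difference between the inputs, so `softmax([6.0, 4.0])` and `softmax([2.0, 0.0])` give the same result.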

And as I said previously, if there is an exponent you need to pay for it with higher computation. Suppose the input is -0.001. In the case of ReLU, the negative value deactivates the neuron, whereas SoftPlus still gives a positive output. Let's go to our last function, maxout. The maxout activation is a generalization of the ReLU and leaky ReLU functions. It basically outputs the maximum of the calculated inputs. Let's try it with our examples.
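To make the SoftPlus versus ReLU comparison above concrete, here is a small sketch using the usual SoftPlus definition ln(1 + e^x). The rearranged form in the code is an implementation choice to avoid overflow for large inputs.

```python
import math


def relu(x):
    return max(0.0, x)


def softplus(x):
    """SoftPlus ln(1 + e^x), rewritten as max(x, 0) + ln(1 + e^(-|x|)) for stability."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


x = -0.001
relu_out = relu(x)          # exactly 0: the neuron is switched off
softplus_out = softplus(x)  # still positive, a little below ln(2)
```

For large positive inputs the two functions nearly coincide, which is why SoftPlus is described as a smooth version of ReLU.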

Let's select the input value x1 as 0 and its corresponding weight as 2, with a bias of -2. So 0 * 2 - 2 = -2, and the output for this input is -2. Suppose we have another input neuron with value 3 and its corresponding weight 3. So 3 * 3 - 2 = 7, and max(-2, 7) = 7. Now let's change the value of w1 and the bias to 0. I am doing this to create a replica of the ReLU function.

So now we have max(0, x). And if we further change the value of x2 to some negative value, the output of the maxout function will be zero. This is how the maxout function can act like the ReLU function.
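The maxout walkthrough above can be reproduced with a few lines. This sketch follows the simplified setup used in the example, one weight per input and a shared bias, which is an assumption of this illustration and not the general maxout formulation.

```python
def maxout(inputs, weights, bias):
    """Return the largest of the values x_i * w_i + bias."""
    return max(x * w + bias for x, w in zip(inputs, weights))


# First example: 0*2 - 2 = -2 and 3*3 - 2 = 7, so the maximum is 7
first = maxout([0, 3], [2, 3], -2)

# Set w1 and the bias to 0: the first value is pinned at 0, giving max(0, w2*x2)
relu_like_positive = maxout([0, 3], [0, 3], 0)

# A negative x2 now produces 0, just like ReLU
relu_like_negative = maxout([0, -3], [0, 3], 0)
```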