Sigmoid Activation Function – We saw what activation functions are. We had a gentle introduction to the step function and why we don't use it in our neural networks for learning the weights. To overcome that problem, we have one special function that we have already come across in logistic regression: the sigmoid function.
We start with the properties that we have seen for an activation function. For a function to be applicable at the nonlinear part, first, it should be nonlinear, and second, it should be differentiable. So take your unit. First there is the linear part that we compute: you have the bias weight w0, and you have the inputs x1 and x2 with their weights w1 and w2, and from these you compute the linear portion.
That is z = w0 + w1 x1 + w2 x2. This is the linear part, and what you apply to it is your sigmoid function. Then you get the output, which we call ol. For now we are focusing on just one particular unit, or one particular neuron. The output function is given as ol = 1 / (1 + e^(-z)). Now, about the range: in linear regression, the problem we mainly had was that the output is not bounded, so when we want to classify, it can shoot past 1 or go negative, and we can't use that as a probability.
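As a minimal sketch of the single unit described above, the following Python computes the linear part and passes it through the sigmoid. The function names and the sample weights and inputs are made up for illustration; they are not from the article.

```python
import math

def linear_part(w0, w1, w2, x1, x2):
    """Linear portion of the unit: z = w0 + w1*x1 + w2*x2."""
    return w0 + w1 * x1 + w2 * x2

def sigmoid(z):
    """Output function: ol = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and inputs, chosen only for illustration.
z = linear_part(w0=0.5, w1=1.0, w2=-2.0, x1=2.0, x2=1.0)
ol = sigmoid(z)
print(z, ol)
```

Here z works out to 0.5, and the output ol lands between 0 and 1, as the sigmoid guarantees.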
Since a probability cannot be negative, we want to squeeze the value into some bounded range. That's why we mainly use the sigmoid function. The sigmoid has the property that it squeezes your input signal into the range between 0 and 1: as z goes to minus infinity it approaches 0, and as z goes to positive infinity it approaches 1. Then you have some threshold, say 0.5. So this is essentially used for binary classification.
That is, 0 or 1. It is also used as an activation function and for calculating probability values. It is also a smooth function, which means you can differentiate it. So it basically satisfies the properties of an activation function that we have stated: it is nonlinear and it is differentiable, which is the basic requirement for any activation function. What we do now is compute the derivative of this function.
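The squeezing behaviour and the 0.5 threshold can be sketched in a few lines of Python. The helper name classify and the choice to treat an output of exactly 0.5 as class 1 are assumptions made for this example; the sigmoid is written in a branch form only so that math.exp does not overflow for very negative inputs.

```python
import math

def sigmoid(z):
    """Sigmoid written to avoid overflow in math.exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def classify(z, threshold=0.5):
    """Binary decision: 1 if the sigmoid output reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

# Outputs stay between 0 and 1 and move toward the ends as |z| grows.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, round(sigmoid(z), 5), classify(z))
```

Negative inputs fall below the threshold and are labelled 0, while inputs of zero or more are labelled 1.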
That is, the derivative of the output function, and we'll see what it looks like graphically and how to interpret it. We have this ol, and we need a variable to differentiate with respect to, which is z. So we differentiate ol with respect to z, writing ol as 1 / (1 + e^(-z)). You can take the derivative in either of two ways. The first way is the one I discussed in the previous article.
That's where you treat this as 1/x and take the derivative as -1/x^2. Here we'll use a different technique, the u by v rule (the quotient rule). We'll mainly be using it after this for tanh, the hyperbolic tangent. So, to have a quick recap of that rule, we'll apply it here.
What the rule says, if you want to differentiate with respect to x, is: denominator times the derivative of the numerator, minus numerator times the derivative of the denominator, all upon the denominator squared. It's simple: we write the denominator squared underneath, and on top we first have the denominator times the derivative of the numerator.
Now, the numerator is 1 and holds no terms in z, so its derivative is 0. Then we have minus the numerator times the derivative of the denominator. The derivative of e^(-z) is e^(-z) times the chain-rule factor for -z, which is -1, so it becomes -e^(-z). So what do you obtain? The first portion becomes 0, and the minus sign outside multiplies the -1, which gives you e^(-z) with a positive sign in the numerator, and in the denominator you have (1 + e^(-z))^2.
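Written out in symbols, the quotient-rule step described above looks like this. The letters u and v for the numerator and denominator follow the usual textbook convention and are assumed here to match the "u by v" rule; o_l stands for the output ol.

```latex
\[
\frac{d}{dz}\left(\frac{u}{v}\right) = \frac{v\,u' - u\,v'}{v^{2}},
\qquad u = 1,\quad v = 1 + e^{-z}
\]
\[
\frac{d\,o_l}{dz}
= \frac{\left(1 + e^{-z}\right)\cdot 0 \;-\; 1\cdot\left(-e^{-z}\right)}{\left(1 + e^{-z}\right)^{2}}
= \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}}
\]
```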
Whichever of the two ways you use, you arrive at this expression: e^(-z) / (1 + e^(-z))^2. Now we need to adjust the numerator. You add 1, and to compensate you subtract 1, so the numerator becomes (1 + e^(-z)) - 1. Now you split the fraction over the denominator: you get (1 + e^(-z)) / (1 + e^(-z))^2 minus 1 / (1 + e^(-z))^2. In the first term, one power of the denominator cancels out, so it becomes 1 / (1 + e^(-z)), and the second term stays as it is, 1 / (1 + e^(-z))^2. Now take the term 1 / (1 + e^(-z)) common.
In the bracket, what is left is 1 - 1 / (1 + e^(-z)). Now 1 / (1 + e^(-z)) is nothing but our output ol, so we have ol outside and (1 - ol) in the bracket. So if you take the derivative of your sigmoid function, the s-shaped curve, you obtain output times (1 minus output), that is, ol (1 - ol). How do you interpret this graphically? That is also essential for you to know.
Say you have zero here, one here, and 0.5 in between. The graph of the derivative looks like a bell-shaped curve with its peak at z = 0. Since the sigmoid is a nonlinear function, you can see that this graph is also nonlinear, and it is also differentiable. So that is how we arrive at this graph, which is the geometrical representation of the derivative of the sigmoid function.
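One way to sanity-check the result ol (1 - ol) is to compare it against a numerical derivative. The sketch below does that with a central finite difference; the function names and the step size h are choices made for this example, not part of the article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Closed form from the derivation: ol * (1 - ol)."""
    ol = sigmoid(z)
    return ol * (1.0 - ol)

def numerical_derivative(f, z, h=1e-5):
    """Central finite difference, used only as an independent check."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# The two columns should agree to several decimal places.
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    closed = sigmoid_derivative(z)
    approx = numerical_derivative(sigmoid, z)
    print(z, round(closed, 6), round(approx, 6))
```

The closed form reaches its largest value, 0.25, at z = 0 and falls off symmetrically on both sides, which is the bell shape of the derivative graph.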
Now, if you look at the two extremities of this graph, they approach zero. I have a space constraint here, but if I extend the number line in both directions, the curve gets closer and closer to 0. This is essentially the problem that we have seen, which is called the vanishing gradient. So the sigmoid satisfies non-linearity.
It is differentiable, so you can backpropagate through the layers of a neural network, but at some point it falls victim to the vanishing gradient. The sigmoid itself offers no practical way around the vanishing gradient, and that is the reason why we don't use the sigmoid function in certain scenarios. If your application requires the sigmoid function, then feel free to use it, but you have other options: you can use the tanh function, or ReLU, or advanced ReLU variants like Leaky ReLU, parametric ReLU, shifted ReLU, and exponential ReLU (ELU). There are many, and several of them handle this situation much more easily.
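To get a rough feel for the vanishing gradient, the toy sketch below multiplies the same local derivative over several layers and compares the sigmoid with a few of the alternatives named above. It is a deliberate simplification: weights are ignored, every layer is assumed to see the same input z, the Leaky ReLU slope alpha = 0.01 is an assumed value, and the ReLU derivative at exactly 0 is taken as 0 by convention.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    return 1.0 if z > 0 else 0.0

def d_leaky_relu(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

def chained_gradient(derivative, z, layers):
    """Product of the same local derivative over `layers` layers (weights ignored)."""
    return derivative(z) ** layers

for name, d in (("sigmoid", d_sigmoid), ("tanh", d_tanh),
                ("relu", d_relu), ("leaky_relu", d_leaky_relu)):
    print(name, chained_gradient(d, z=2.0, layers=10))
```

Even at its best point, z = 0, the sigmoid derivative is only 0.25, so ten layers scale the gradient by 0.25^10, roughly one millionth, while ReLU keeps a derivative of 1 for positive inputs. In this toy setting tanh also saturates for large |z|, so it eases the problem near zero rather than removing it.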
If you have this geometrical representation and you are asked to visualize it in a hyperplane, how it basically looks is something like this: it is very narrow. This is the small trough seen in this plane. So this is a hyperplane, and in the hyperplane you can see the peak, which is not very broad but narrow in shape.
That is the geometrical representation in a hyperplane, in two dimensions. If you look at it in three dimensions, you can see it like this: on the back side, you have one more peak, or mountain, like this. Well, that was all regarding the sigmoid activation function in deep learning.
reference – Sigmoid Activation Function