Vanishing & Exploding Gradients – The topic for today’s article is vanishing and exploding gradients. This has to do with ordinary gradient descent, which we have already seen in the past few articles. As a quick recap: gradient descent is mainly used to update the weights. We take our loss function and differentiate it with respect to the weights. Starting from the output, we backtrack through the layers one by one.

Say the activation of our first layer is a2. Going backwards, the chain rule gives us terms like ∂a2/∂z1 and ∂z1/∂W1, and so on; with n layers, the product of such terms runs all the way from the output back to the first layer. So to propagate backwards we deploy the backpropagation algorithm, and to update the weights at each step we use gradient descent.

In this process we can write a generalized update for the weights at any layer, say layer l: the new weight is the old weight minus a learning parameter, eta, times the gradient at that layer, i.e. W[l] := W[l] − η · dW[l]. This is a compact representation of our weights in matrix form, where dW[l] collects the derivative of the loss with respect to each weight in layer l. Now, in this particular equation, look at what happens when we actually compute this gradient.
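As a small sketch of the update rule W[l] := W[l] − η · dW[l] (the shapes and learning rate here are illustrative assumptions, not values from the article):

```python
import numpy as np

# Illustrative shapes and learning rate; none of these values come from the article.
rng = np.random.default_rng(0)
eta = 0.01                        # learning rate (the eta above)
W = rng.standard_normal((4, 3))   # old weight matrix for some layer l
dW = rng.standard_normal((4, 3))  # gradient of the loss w.r.t. W, from backpropagation

# Gradient descent update: new weight = old weight - eta * gradient
W_new = W - eta * dW
print(W_new.shape)
```

In a real network, backpropagation produces one such dW per layer, and the same update is applied layer by layer.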

By the chain rule, the gradient at an early layer is a product of many per-layer factors multiplied together. If you have a large number of layers, you can end up multiplying together many small gradients. If we want to represent this geometrically, picture our loss curve on a coordinate axis, with the minimum at the bottom. Our aim is to reach that bottom point, so we take baby steps and try to converge to it.

Now, suppose the per-layer gradient factors at some point are 0.5, 0.2, 0.1, 0.001, and so on, and the product of all the gradients we have multiplied so far comes out to around 10^-10. What this essentially means is that the value is almost equal to 0, and if your gradient computation is 0, your learning has stopped.
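To see how quickly such a product collapses toward zero, here is a small sketch using the illustrative factors above (the ten-layer case is an assumption for the demo):

```python
import math

# Per-layer gradient factors from the chain rule (the illustrative values above)
factors = [0.5, 0.2, 0.1, 0.001]
product = math.prod(factors)   # about 1e-05 after only four layers

# A deep network multiplies many such factors, e.g. 0.1 per layer for 10 layers:
deep_product = 0.1 ** 10       # about 1e-10, numerically indistinguishable from 0
print(product, deep_product)
```

Each extra layer multiplies in another small factor, so the early layers' gradients shrink exponentially with depth.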

That means the algorithm concludes that learning has stopped and that it has reached its minimum point. But ideally your learning should not stop merely because your gradient values are so small that they converge to 0. You wanted to take a decent step and keep moving down, but somewhere in between you got lost. That is essentially called the vanishing gradient problem: you vanished somewhere in the middle. You can think of it like going out to buy something and getting kidnapped along the way; you just vanished, and your parents don't know where you went.

The case is similar with the vanishing gradient: you don't know what happened somewhere in the middle, and the network simply stops learning this particular function. So you need to handle the vanishing gradient carefully. Then there is what we call the exploding gradient. The setup is similar, but instead of small values, the per-layer factors shoot up to larger numbers, say 10, 20, 25, 30, 50. So starting from some point on the curve, the step becomes so large that you miss the small region around the minimum.
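Multiplying a few large per-layer factors like those shows the opposite failure. This is a small sketch using the values from the example (the 30-layer case is an assumption for illustration):

```python
import math

# Large per-layer gradient factors from the example
factors = [10.0, 20.0, 25.0, 30.0, 50.0]
product = math.prod(factors)   # 7,500,000 after only five layers

# With a factor of 10 per layer over 30 layers, the gradient reaches 10^30:
huge = 10.0 ** 30
print(product, huge)
```

The same exponential compounding that shrank the gradient before now blows it up.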

Our original aim was to converge to that minimum point, and earlier we were taking roughly one unit per step. Now each step is more than one unit, so you have a high chance of just missing the converging point, that is, the minimum. This is essentially what is called exploding gradients, where your gradient value can reach something like 10^30. That is a very high value, which means you are learning nothing useful. The exploding gradient problem, however, can be solved by using something called clipping.

That is, gradient clipping. So what is gradient clipping? You have this product of many small derivatives, say a hundred or more gradient factors. You keep a threshold, say 100 (smaller or larger, depending on your choice), and you decide you only want gradient values up to that threshold. So if, for example, a gradient comes out to 105 but your threshold is 100, you simply limit that gradient to 100 and then continue the computation.
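A minimal sketch of clipping, assuming a threshold of 100 as in the example. The element-wise version matches the 105-to-100 description; frameworks also commonly rescale the whole gradient vector when its norm exceeds the threshold (norm-based clipping):

```python
import numpy as np

THRESHOLD = 100.0  # assumed threshold, as in the example

# Element-wise value clipping: any component beyond the threshold is capped.
g = np.array([105.0, -3.0])
clipped_values = np.clip(g, -THRESHOLD, THRESHOLD)   # [100., -3.]

def clip_by_norm(grad, threshold=THRESHOLD):
    """Rescale grad so its overall norm is at most `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

clipped_norm = clip_by_norm(np.array([105.0, 0.0]))  # norm shrunk from 105 to 100
print(clipped_values, clipped_norm)
```

Norm-based clipping is often preferred in practice because it preserves the direction of the gradient while bounding its size.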

That way you escape the exploding gradient problem. So exploding gradients have a workable solution, but in practice the vanishing gradient is the more crucial problem when you deal with gradient descent, because there is no single known fix that treats it directly. What you can do is closely inspect your algorithm for signs of trouble, monitor where it is converging, and so on. That is how you can keep the problem under control.

But as of now there is no one particular solution for minimizing it. Well, that was all regarding the vanishing gradient and exploding gradient problems in deep learning.