 Gradient Descent Variations – In the last article, we studied Gradient Descent. We understood the intuition behind gradient descent and looked at broad steps that are involved in gradient descent. In this class, we will continue to understand gradient descent. Some of the nuances of gradient descent and some of the variations of gradient descent that are being employed in the practical machine learning training.

Let us begin. So, this is where we stopped in our previous class where so, let us quickly recap this particular scheme. So, we looked at gradient descent which has got 7 steps. So, we randomly initialize the parameter values. Then, we repeat until convergence, the following steps.

First so, the randomly initialize parameter values gives us a point in the lost space, this point will correspond to a model in the model space. Since, we get a model with these parameter values, we can predict the output for each data point to the training using the particular model. Then, we calculate loss which is J b w you know using this particular equation.

And you know this is the predicted value and we the loss is calculated as the difference between the predicted value and the actual value and we square the difference. We sum it across all the end points and we you know use half as a factor for mathematical convenience and that is how we get our loss function.

After calculating the loss function, we calculate gradient of the loss function at those parameter values and we use those gradients to update each of the parameter value to a new parameter value by subtracting the product of learning rate and a gradient with respect to that parameter.

And once we complete, this particular process of calculating gradient with respect to each of the parameter, we update all the parameters to their new values simultaneously and then, we go back to step two, where we go to a new point and we repeat the process. We continue this iteration until the algorithm is converged.

So, let us try to understand the notion of convergence and then, we will come back to understand how exactly we can calculate the gradient. So, let us try to understand how the convergence works in gradient descent. So, we have loss on y axis and we have the value of parameter on the x axis and it is a loss function in the context of linear regression and this is the minima. So, what happens at minima?

So, at minima value of gradient is 0 and because value of gradient is 0, we do not see any change in the parameter value. Because this is how we calculate the change in the parameter value, we take the old value and we subtract the product of learning rate and gradient from that and if gradient is 0, we do not have anything to subtract, so, value do not change.

So, this is the first way in which we calculate the convergence is by observing that there is no change in parameter value, either in subsequent iterations or for past few iterations. What if so, we can also apply this gradient descent algorithm on some tricky functions which may not be convex. So, the another safeguard, so what happens is so let us say we have a function which is like a flat function after some time.

So, what will happen is we will not see any change in the parameter value after some time or it will be very very slow. So, we do not want to keep on making the changes and keep running this particular algorithm forever. So, we also define some kind of maximum number of steps or if we define a parameter called number of steps for which we want to run this. So, we either check for one of these two conditions and if one of these two conditions are being met, we stop the gradient descent updates.

So, the first condition is no change in the parameter value in the subsequent iterations or in past few iterations or the number of steps that we decided to run gradient descent algorithm are done. So, we do not do any more updates to the parameter values. So, this is how we handle convergence part of it. Let us try to understand how do we handle the gradient calculation part of it.

As I earlier said gradient is nothing but the derivative of the loss function with respect to the parameter value. So, let us try to take an example of; so just for your reference I am going to write the loss function again. So, predicted value minus the actual value, we take the square of it and we sum the difference. The square of a difference across all the endpoints and now, we want to let us say calculate and let us it. So, our w, its actually there are a bunch of parameters here.

So, there is w 1, w 2 all the way up to w m and you have parameter b. So, what we do is we calculate a partial derivative with respect to the parameter value. Let us say with respect to w 1; partial derivative of the loss with respect to w 1, del del w 1 of j w comma b.

And so, let us apply derivative operator on both the sides and try to calculate the whole thing. So, this is like typo here; this should this should go from 1 right? It should all in examples right. So, what we do is get half out and you apply this on this particular function. So, we get 2 into; so, we have.

So, we get the difference and we calculate slightly complex looking though it is not complex. So, have some patience and be with me, then I was talking about mathematical convenience. So, this two gets cancelled with this two. So, we have this factor of half deliberate deliberately added to get an advantage or mathematical convenience while calculating the partial derivative value.

So, we get is the predicted value minus the actual value, the difference and what happens is we will do a small expansion here. So, this is going to be this particular part is nothing but b plus 1 to m w i x is a j. 6 points like this right plus w 1 x 1 plus w 2 x 2 all the way to w m x m. Now, if we calculate the derivative of the whole thing with respect to this. What happens is all other values are constant and derivative of constant is 0. So, only this particular term will survive, this term has parameter w 1 with respect to which you are calculating the partial derivative.

So, only this particular term, we will survive; all other, for all other terms the derivative will be 0. So, what happens is we get and in the derivative of this particular term with respect to w 1 is x 1. So, the derivative of the loss with respect to w 1 turns out to be the summation over difference into the value of the feature x 1. So, is the this is the derivative or this is the gradient with respect to w 1.

Now, let us also write the equation of gradient with respect to b, the bias term. In case of bias term, what happens is let us write it down. Del del b, again we take the difference between the predicted value minus actual value and we multiply it by 1 for the bias term. So, this is how we calculate it. Whereas, in case of other all other parameter values, it was something like this del del w j. I am writing it here so that you can see it on a single screen and its easier for you to j. So, you can see that only difference is that this x j a in case of each of the parameter.

But in case of biased, we have x j is equal to 1. So, this is how we calculate the gradient with respect to the particular parameter w j. So, you can see that this particular parameter calculation or this sorry this particular gradient calculation, involves summing up over all the input data points, which is denoted by this particular equation. So, this equation what it is doing is we calculate or we first do the prediction on each of the input data points and we take the difference between prediction and the actual value on all the data points and sum it across them.

So, this particular process, think of it think of you doing this on a very large training data set and this becomes the bottleneck. So, this particular step takes quite a lot of time and hence, we are going to study couple of techniques in a short while from now, which will help us to do this particular step more efficiently. So, one possible way, so, this particular version of gradient descent is called as a Batch gradient descent; batch gradient descent. So, here we take all “n” data points. We consider all n data points to calculate the gradient.

So, we have slightly up, we have couple of variations of this gradient descent idea. So, in one variation, we use we instead of using all n points, we use a batch of “k” points. Where, this number k is much lesser than n; it is a much smaller than n and what happens now is so, here instead of calculating the gradient on all input points, this sum gets reduced to k points.

So, this is everywhere it has to be i. So, x j value of i data point ok. Now, what really happens is instead of taking k points; instead of taking n points, you are taking k points. So, what is happening is let us say this is our training data of all n points 1 to n, we divide this training data into. This is so we divide it into smaller chunk; each chunk has exactly k elements and this k is called as batch size. We normally use batch size which is a number which is power of 2; 32, 64, 512, 1024 etcetera.

So, this particular version of gradient descent is called as Mini batch gradient descent. Where only difference is instead of taking all the endpoints, we select a chunk of k data points and calculate the gradient update. Now, there is there are a couple of new terms that comes into play here. So, in each iteration, we are using exactly k points to calculate the gradient descent update. So, and when we complete one gradient descent update for all the data, we call this as 1 epoch. So, 1 epoch is a full iteration of the complete training data of n points.

So, in an epoch, we will have n by k batches or n by k batches or n by k iterations. So, in an epoch we will have multiple iterations. So, n by k 2 to be precise and so, we will be doing multiple epochs to this. So, there will be more iterations. So, normally we specify the number of iterations in terms of epochs. So, it is very important to understand what epoch is and we also specify the batch size which is an important parameter in the mini batch gradient descent and you can see when we go into neural network and any other machine learning models also, mini batch gradient descent performs better or does more efficient learning than the batch gradient descent.

So, extending this idea further if you use the value of k. So, if you use k is equal to 1, you have a very special case; where we are changing or we are making a gradient update just look by looking at a single example and this is called as stochastic gradient descent. So, in case of stochastic gradient descent the value of k is equal to 1. So, overall, we have this scenario. Let us say this is the entire; so at size 1, we have SGD and we will be use complete training data to make 1 gradient descent update, it is called Grade gradient descent and we have mini batch which is somewhere in between.

So, that is a mini batch is preferred or S.G.D or of gradient descent in general, but if you have extremely large amount of training data even some people use SGD and you will see later in the neural network that we use some variation of SGD to make or for learning the neural networks ok. So, let us write down the gradient descent process again in the context of mini batch gradient descent and stochastic gradient descent.

We draw a bunch of k examples; an example contains features and the labels. We first randomly initialize parameters; repeat until convergence. What do we repeat until convergence? We first obtain predictions from the model. Second, we calculate loss on the batch. Then, we calculate we calculate gradient of loss with respect to each parameter. Next one is each parameter w we say w to w old and finally, it update simultaneously.

So, you can see that the only difference between the gradient descent and batch gradient descent is the first step, where we select a batch of example and we calculate it on the batch. So, the same process can be applied here, if we instead of k example if you select all n examples and if we calculate the loss on the batch of all the examples here, then you know it is a same process as the gradient descent.

So, these are. So, we studied gradient descent and couple of its variations. So, this makes us ready to train any machine learning model. So, you can use one of these techniques to train machine learning model. Now, the key question is how do we know whether the machine learning model is getting trained or is it facing some issues and if it is facing some issues, how do we really fix those issues? So, before watching the video further, I would strongly encourage you to stop here and think for a couple of minutes how will you; how will you measure or how will you monitor or how will you infer that your model is learning by looking at subsequent losses in the gradient descent updates or in the gradient descent step.

Any ideas, I mean it will be a good exercise for you to pause here for a couple of minutes and try to understand how you can or try to think how you can solve this problem. So, remember that in gradient descent what is happening is this is our toy example, where we have a parameter value here and j or loss on y axis and this is our loss function. So, if your model is training correctly; what will happen is we start somewhere and we get updates and with each update, you can see that the loss is coming down. The loss was somewhere over here to begin with when we made this update, the loss became lesser and lesser and further lesser.

So, what happens is that if you plot a graph of the iteration number and the loss. So, ideally what should happen as we train the loss should come down and should come down to 0. Smoothly, it should come down to 0. In real life unfortunately getting 0 loss is sometimes difficult because of noise in the training data and the noise in the training data occurs due to some issues in labeling or in data collection. So, what happens? So, if you have a learning curve, so these are the iteration numbers at iteration 1, we had very high loss iteration 2, iteration 3 and so on.

So, you should see in if your model is learning in subsequent iteration, you should see that your loss will gradually decrease. Now, there are different characteristics of this learning curve. So, this particular curve is called as a learning curve. And this is one particular type of a learning curve, you can see that the learning curve strongly depends on the learning rate. So, learning rate what happens if you have very small learning rate? Then, you will possibly be taking baby steps to reach the minima which is over here.

So, in that case your learning will be very very slow. So, your graph will look something like this. It is learning all right, but it is learning very very slowly ok. What if you have slightly larger learning rate? Then, what might happen is that you will be making longer stride and you will reach obviously, faster. So, your you might come down faster larger in rate; you are you will reach maybe 0 error quite fast, but in this is a slow learning rate where you are learning very slowly. Here, you are learning very fast. So, slightly larger learning rate larger than the first one.

Then, you might wonder if I make learning rate very high; I might be able to reach the minimum very in no time, but unfortunately that is not the case. If you have too large; if you have too large learning rate; then, essentially your gradient updates. So, we set like this right. So, w new is w old minus lambda times gradient. So, the minus alpha times gradient.

Now, if this value is too large, what will happen is let us say I am starting over here and if this value is very large, I might end up crossing the minima to this side. And then, when I calculate the gradient here. So, I will probably oscillate again on the other side and I may not converge, I will keep on going from one side of the minima to the other side. So, what you will observe is either you will not learn or sometimes your loss will come down and sometimes your loss will come down and then, it will go up.

So, we will see some such kind of. So, this kind of learning curve is a problematic learning curve and it tells you that there is something wrong with your learning rate and you should probably reduce your learning rate, if you see this kind of a pattern in your learning rate. If you see the if you see this particular pattern, then you have a room to increase the learning rate and if you see this particular pattern this, these are all perfect patterns.

So, you probably found the right learning rate that helps you to reach the goal faster. So, ideally what you should do is you should start with some learning rate like 0.001 and then, increase it by order of magnitude something like this. So, this is how you can diagnose if your learning rate is too high or learning rate is too small and take corrective measures so that your machine learning algorithm trains efficiently and even before efficiency, you have to make sure that it is training and learning as per your expectations ok.

So, this brings us to the end of this particular module. In this module, we studied different optimization algorithms mainly of the flavor of gradient descent and its variations and we also understood how to monitor and diagnose problems in the learning or problems during the training and how to fix them. See you in the next article.