 Gradient Descent – Welcome to the next article of our course Practical Machine Learning with TensorFlow 2.0. In this session, we will study about optimization algorithms and their roles in machine learning. We will specifically study a specific optimization algorithm called Gradient Descent and its variation mainly minimize gradient descent and stochastic gradient descent.

We studied what is called as loss function of machine learning model. We used the term or we use symbol J w, b. So, loss function is parameterized by w and b, which are parameters of our model. So, just to remind you our model that we considered for linear regression was as follows. The output y is equal to b which is a bias term plus w1 x 1 + w 2 x 2 ……w m x m.

Here, we have training data with m features and we have modelled with m plus 1 parameters and these are the list of parameters. So, we can think of model as parameterized by b w1, w2 ……w m. We will use w as a vector to represent, all these m weights. So, that is why we say that the loss is the function of parameters and in case of linear regression the loss function looks something like this.

So, we calculated the loss at each point and the loss was calculated as the predicted value on ith input. The difference between the predicted value and the actual value and we squared it. This is a predicted value. Let us write this, this is the predicted value and this is the actual value. And just to remind you, this i over here represent that this is ith point. This is ith input and this is label for ith input. So, this is the loss function.

So, geometrically; so, lets consider there are there are m plus 1 parameters. So, loss function is essentially a function in this case, so the loss function can be shown. Let us say this is the loss, this is parameter w1, this is parameter b and there will be several such kind of parameters let us say w2, w3 …….wm. And loss will be a surface in m plus 2 dimensional space. So, it will be some kind of a hyperbola kind of a shape.

What we will do is in order to understand this; we will take a simplified model and try to visualize the loss function in the 2D space. Let us take a simplified model where there is a single parameter; our data is of the following form. So, we have data D is we have this pairs. We have the feature and the label. We have n such data points. So, we can compactly present this as, we have ordered pair of features and labels, and there are n such kind of ordered pairs. And each data point has a single feature which is represented by x1 and or y is a real number.

So, this is a regression problem and the model that we use is y = b + w1 x1. So, the loss function is J(b, w1) is 1 over 2. We compute the difference between the actual value and the predicted value is a predicted value minus the actual value and we take square of the difference. Geometrically, these loss function look something like this. So, we have J b w1 or the loss on y axis. We have weight w1 on one axis and b which is a bias on the other axis.

Since, this is a contradict function this will look something like this. It is a bowl shape function right, it will be bowl shape function; it will be a bowl shape function. So, what will happen is we have to find out optimal point on this bowl. Let us spend couple of minutes to understand the duality of the loss space and the model space. So, we have a model, where we have input x1 and we want to predict value y. Each of this point on this surface if you take this particular point, this particular point; lets show some data points here these are data points, ok.

If you use this particular point over here, this point would correspond to a line, this point is represented by two numbers which is w1 and b and for w1 and b it gives us a specific loss and our model also has two parameters which is w1 and b. So, for this specific value of w1 and b we get such a model. If we select another point on this surface, let us say this particular point, this point will represent some other line. So, let us call this as point 1 and let us call this as point 2.

So, this is w12, and this is second point and this was a first point. If you choose some other point this might represent some other line. Let us call this as point number 3. This is essentially a line which is w1, third point and b 3. And for each one of them there is a loss and in order to recall what the loss is; so, for this particular line for the third line the loss is essentially a distance between the actual value and the predicted value.

This is the predicted value and this is the actual value. We calculate the difference and we take square of the difference. So, we find out all this differences and sum them up and that represents our loss. So, if we sum all this numbers up that you will get a loss corresponding to w13 and b 3 which is some number on the y axis which is a loss, ok. So, this is very important to understand the duality in the loss space and in the model space. So, point in the loss function represents a model and we want to get a model that gives us the minimum possible loss. Is our objective clear?

Our objective is to find a model or model parameters in such a way that the loss incurred due to those parameters is minimized. So, of course, I mean you must be wondering that you can also try some kind of a brute force approach where you will explore this particular space and try to find out points that gives the minimum value of loss. But this is not really efficient. If we take let us say a parameter space of or the loss function with m parameters with the value of m being very large and large ms are kind of their routine in our day to day machine learning problems.

So, if you are trying to solve this problem in the context of m which is some large number then you know a brute force is almost impossible. So, we cannot really do brute force. So, we have to do something more intelligent. And you know the way we are phrasing this particular problem, we are saying that we want to find the values of parameters in such a way that the loss function is minimized.

So, this is a minimization problem find w and b or find parameters such that the loss function is minimized. Let us see how to do that. And let us first develop the intuition of it, and then get into the details of our first optimization algorithm which is gradient descent which is work hours of machine learning algorithms we will see that in a minute after understanding the intuition behind it. So, what is our learning problem? The learning problem find w and b such that loss is minimized, ok.

And loss we represent with J. We want to find out the values of w and b in such a way that J is minimized. Let us try to understand how we can do this intuitively. So, what I will do is I will again consider our linear regression model, y is equal to b + w1 x1. And in order to give order to keep the expression simple we will assume that b is equal to 0, so we get very simple model which is y is equal to w1 x1.

Now, our job is to find out now the optimization problem is find out w1 such that J (w1, b) is minimized. So, there is exactly one parameter here because you have already said b = 0, so there is exactly one parameter. And now we will see we will first visualize how the loss function looks like. So, loss function is parameterized by w1 or the value of w1. For each value of w1 we get some loss, job done.

And since we have a squared error mean squared error or a squared error as a loss function. So, you can see that it is a bowl shape function. This function in the language of mathematics is called as a convex function, ok. So, what we will do is, so essentially what is happening here is for each of the value of w1 there is a corresponding value of the loss, and you can visually see that this is the point where loss is minimized where there is a the value of loss is minimum.

So, since this is a problem with a single variable or with a single parameter we can visually find it. But the problem here is if we have multiple parameters, we cannot even visualize the loss function, how can we algorithmically or how can we programmatically find out this particular point over here. So, what is given to us is we know the loss function and we know we essentially know the loss function, and we want to find out the value of the parameter such that the loss is minimized.

Now, there is this particular method which is called as gradient descent, gradient descent that helps us to do this programmatically. Let us try to understand gradient descent intuitively before getting into the details of the steps involved in the process. So, what happens here is we first initialize the value of w1 to some random value. Let us say isolate this particular value of w1. So, for this particular value of w1 there is a loss associated with it.

So, this is the point where, so this is the point where my initial guess landed. So, my initial guess is this for the value of w1 and at this point what we do is we calculate the loss. So, what is it representing? I am selecting, I am randomly selecting or randomly setting the value of w1 to some parameter. So, this will actually get me a model. This will actually define a model for me. So, remember the duality of the loss space and the model space, so we have x1 here and we have y here.

Let us say these are the points, ok. And we have model which is a line passing through the origin. Now, what we do is, this is a difference between the predicted values and the actual values. And what we do is, we calculate square of the difference and sum them up across all the points and get loss corresponding to this particular model. Now, so the first thing that we did is we randomly initialize w1. Then what we did?

We calculate the loss value. We calculate the loss. So, we know the loss. Now, this is the point where we want to reach, right. How do we really reach this point from here? Now, think of this and as a task that is analogous to, let us say you know climbing down from the mountain top. So, what happens is while we are climbing down the mountain top, we are let say at a specific point, we look around and find out what is a direction that will take me down to the valley. So, gradient, in gradient descent we exactly do to we exactly do it at a particular point on the loss surface.

So, at this point what we will do is we will calculate the slope or the direction in which I should be moving the direction of slope. So, let us say this is a slope, this point. So, this is a tangent to this point. So, we can calculate slope of this tangent. And we get the direction of the slope. So, we will move in the direction that is opposite to the slope. So, the slope is negative here, so we will move in the negative directions, we have to move in this direction. So, first is we calculated the loss. Second is, we calculate the gradient. And once we know the gradient, the next question is how much we move from the original point so that we reach the valley.

So, there are multiple options that we have. We can move or we can step, we can have a longer strides or we can take shorter strides. So, the length of the stride it is decided by a parameter called learning rate, which we denote by letter alpha. So, learning rate helps us to control how long strides are we going to take from a particular point. So, let us say if we are at this point and if we have some learning rate, we are going to take a stride and we will end up over here.

So, this becomes our new point. And what we do is we repeat the same process that we did at this point at this point, we first get the predictions on this particular model, we calculate the loss and then we calculate the gradient. In order to get the loss, we should also have predictions. We need to get predictions at every point. As you can see here this is a prediction for this particular point, so predicted value and actual value; because in order to calculate loss we need to know what is the actual value and what is the predicted value.

So, we first make the predictions with the value of the parameter. We substitute that in the model and we do the predictions, and based on predictions we calculate the loss. And after calculating the loss we calculate gradient of the loss. So, we will again do the same thing at this particular point and we see that it is a direction of the slope. So, we will be moving in this direction by taking some step. So, let us say I come we come here. Now, you can see that as we approach this particular point which is our golden point or point whether loss has got the minimum value, as we go closer and closer to that point the derivative or the gradient will become smaller and smaller.

At this particular point the gradient will be 0 because it is a minimum point. So, you can see that as we as we move closer to the gradient on the derivative or the gradient as we move closer to the minima, the gradient value will become smaller and smaller. So, our, so our strides. So, we have a constant learning rate, but our gradient is so, so we calculate gradient, we have learning rate. And then what we do is we have new point.

So, let us say w1(new) = w1(old) minus learning rate times the gradient. So, we can see that learning rate is constant, gradient is gradient becomes smaller and smaller. So, we will be making eventually shorter and shorter stride, our stride will become shorter and shorter as we approach the actual minima. Is this clear? So, this is how this is how we reach to this particular point. Now, you can you can work yourself, you know you can take a pause here and you can work out yourself how this particular calculation work if I randomly initialize the point over here.

You can see that the gradient, this is the direction this is the slope and we will be moving in the opposite direction of gradient. So, we will be since gradient is positive here, we will be making we will be moving in the opposite direction. So, whatever value we have we are, so here the gradient is positive. So, what will happen is since gradient is positive at this point, we will be effectively moving in the opposite direction because this is the value of w1(old), we are going to subtract something from it. So, we will be moving in this particular direction, if I am starting from here.

On the other hand, when we started from here, here the gradient was negative and since we are going to move in the negative direction of this. So, you can see that gradient is negative, this negative and negative becomes positive. So, we move in the positive direction or we have some value we are adding something to it, so we are getting a value which is greater than the old value. So, if we start from here we are going to get the value of w1, which will be greater than the previous value. If we start over here we will get values of w1 which will be lesser than the previous value. So, this is the intuition behind gradient descent.

I hope you understand this clearly. If you have not understood it, it might be a good idea to take a pause, go back in the video and check it out. And ones you understand this, then we will move into the next part where we are going to write down the steps that are involved in the gradient descent algorithm. So, let us write down steps in gradient descent.

Now, what we will do is, we will try to generalize the gradient descent for m parameter case. So, when we are trying to establish the intuition of the gradient descent we did it intentionally with a single parameters, so that it is easy to geometrically show what is happening. But when we go to the to the m parameter setting it becomes difficult to visualize what is happening.

So, let us consider a setting where we have m plus 1 parameters in the model b w1, w2 all the way up to wm, and we are trying to solve for a regression, so the model is y is equal to b + wi xi. So, this is a linear combination or linear combination the parameter and the feature value. And this is short form, these are short hand form of writing the model, and obviously, our loss function is the squared loss which is done over n examples, ok. So, the first step is we randomly initialize b, w1, w2 all the way up to wm.

So, we randomly initialize all the parameter values. Then, we are going to repeat until convergence. We will see how to identify that we have converged to a solution, repeat until convergence. We first use all this parameter values and we predict y hat for each data point in training. This quantity is nothing but y i hat.

So, we calculate y i hat or the predictions first, then we calculate loss, we calculate loss which is J (b, w). We know that the predicted value and we know the actual value based on that we can calculate the loss. Then we calculate the gradient of the loss. So, we will take. So, is this clear to everyone? We will we will see how to calculate gradient of loss. Fifth step is b(new) = b(old) minus alpha times the value of gradient. So, gradient with respect to b, gradient with respect to w1 and we do it for all the parameter values wm (old) gradient with respect to wm, ok. Update b, w1, w2, wm simultaneously. This  part is very very important.

Note that when we are calculating; so, couple of points note here. We are calculating this gradients with respect to each of the parameter values and we are, notice that this is not a equal to sign, we are using some kind of us other notation where the effect of this is we are setting the value of b to a new value which is coming from the equation on the right hand side in this case which is this equation, from here to here. And when we are calculating gradient with respect to w1, we are not going to use this b(new), we will still be using the value b (old) and all the old values for all other parameters.

And we change the values from old to new right at the end in the step number 7. So, this is what is called as simultaneous update. So, this is important to note that all this parameters have to be updated simultaneously at the end of the loop. And then you again go back to this two, and go to step two, we check for the convergence, and we essentially repeat. So, what happens? After this step is we get a new point on the on the loss surface and we repeat the same process at that particular point in the loss surface.

So, this is the basic gradient descent algorithm or a sketch of gradient descent algorithm for your reference. We have not yet talked about how to determine the convergence of this or how to calculate the gradients which we will see in the next session .