Loss Functions in Machine Learning (Explained!)

In the previous articles, we studied what a machine learning model looks like, and before that, what the input data looks like. Now that we know what training data is in machine learning and what a model looks like, we will change gears to understand how to estimate the parameters of the model. Let us begin.

Let us recap the models we have looked at. The linear regression model looked like b + w1x1 + w2x2 + … + wmxm, which we can write compactly as h_w(x) = b + Σ_{i=1}^{m} w_i x_i. We also looked at logistic regression. What does the logistic regression model look like? It predicts the probability of y = 1 given x as 1 / (1 + e^(−z)), where z is nothing but a linear combination of the features and their parameters.
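As a quick illustration, here is a minimal NumPy sketch of both models; the feature values and parameters below are made up purely for illustration:

```python
import numpy as np

def linear_regression(x, w, b):
    """h_w(x) = b + sum_i w_i * x_i for a single example x."""
    return b + np.dot(w, x)

def logistic_regression(x, w, b):
    """Predicted probability of y = 1: sigmoid of the linear combination z."""
    z = b + np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2.0, 3.0])    # one example with m = 2 features (illustrative)
w = np.array([0.5, -1.0])   # weights: the parameters we want to estimate
b = 0.1                     # bias term

print(linear_regression(x, w, b))    # a real-valued prediction
print(logistic_regression(x, w, b))  # a probability in (0, 1)
```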

In the case of a neural network, we learned a more complex function that also has many parameters. Now that we have a model and training data, our job is to come up with values for these parameters. Let us build some intuition based on a very simple example: linear regression with a single variable.

Say we have a single input variable x1 and an output value, or label, y, and we have plotted the data points. We want to fit a line; linear regression represents a line. Suppose we have a line that passes through the origin. The equation of the line is y = b + w1x1, and since the line passes through the origin, b is 0, leaving y = w1x1 as the equation of the line.

Our job now is to estimate the value of w1. A different value of w1 results in a different line: we have a line through the origin with some slope, and by changing the slope we can draw other lines through the origin. So just by changing the value of w1 we get different models, or different functions.

Which of these functions is the most appropriate for our purpose? That is the central question in front of us. We have the training data, we have fixed our model, and now the task is to estimate the model parameters. One of the tools we use to estimate model parameters is called a loss function.

We normally denote the loss function by J. The loss function is a function of the parameter values: depending on what parameters we choose, we get a model, and because of that model we incur some loss. Let us see what loss means in the context of linear regression.

For a particular value of x1 there is a true value of y, but if we use the red line as our model, the model gives a predicted value. So we incur some error at that point: the error is the difference between the actual value and the predicted value. In the same manner we incur some error at every other point; at some points the error is negligible, at others it is larger.

That is what happens if we use the red line as the model. If we use some other line, say the orange one, as the model, there will be a different set of errors. How do we measure this loss cumulatively? We find the loss at every point and sum it across all the points. Let us write down mathematically how we do it for a single point.

For a single point, we take the prediction at that point minus the actual value of y, and since we do not care about the sign, we square the difference so that positive and negative errors are measured in the same manner. If the equation looks a bit scary, just read it as: take the actual value, take the predicted value, subtract one from the other, and square the result: (h_w(x^(i)) − y^(i))^2.

We then sum this loss across all n points and add a factor of one half as a mathematical convenience: J(w, b) = (1/2) Σ_{i=1}^{n} (h_{w,b}(x^(i)) − y^(i))^2. So what we are doing is calculating the loss at every individual point and then summing it across all the points; that is the total loss we incur. And we incur it because we chose the parameters w and b: the parameters give us a model, and the model incurs the loss. That is the relationship.
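To make this concrete, here is a minimal NumPy sketch for the single-variable, line-through-the-origin example; the data points are made up for illustration and lie roughly on the line y = 2·x1:

```python
import numpy as np

# toy data: n points roughly on the line y = 2 * x1 (made up for illustration)
x1 = np.array([1.0, 2.0, 3.0, 4.0])
y  = np.array([2.1, 3.9, 6.2, 7.8])

def squared_error_loss(w1):
    """J(w1) = 1/2 * sum over all points of (prediction - actual)^2."""
    predictions = w1 * x1  # model: a line through the origin
    return 0.5 * np.sum((predictions - y) ** 2)

# different values of w1 give different lines, and hence different losses
for w1 in [0.5, 1.0, 2.0, 3.0]:
    print(f"w1 = {w1}: J = {squared_error_loss(w1):.3f}")
# the loss is smallest near w1 = 2, the slope that best fits the data
```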

If we expand this in the case of single-variable linear regression, it looks like J(w1, b) = (1/2) Σ_{i=1}^{n} (b + w1 x1^(i) − y^(i))^2. It is now pretty obvious that J, the loss function, is really a function of the parameter values. The loss function is the central piece that helps us identify model parameters such that the loss is minimized.

We try to identify parameters in such a way that we minimize our loss, and we will see how to minimize the loss in the next session. That is the loss we compute for linear regression. Now let us see how to formulate a loss function for classification problems. In a binary classification problem we have two labels, 0 and 1.

Let us prepare a table of the actual label y against the predicted label ŷ:

y (actual) | ŷ (predicted) | outcome
1          | 0             | error
0          | 1             | error
1          | 1             | no error
0          | 0             | no error

If the actual value is 1 and we predict 0, that is an error, and vice versa: if the actual value is 0 and we predict 1, that is also an error. If the actual value is 1 and we predict 1, that is fine, and likewise for 0 and 0.

Let us develop some intuition for the loss in classification. If the actual value of y is 0 and we predict 1, we want to assign a very large penalty; and if the actual value of y is 1 and we predict 0, we want a very large penalty as well. The two cases give very similar, mirrored penalty curves.

We write this mathematically as follows: if y = 1 we use the term −log(p), the negative log of the predicted probability, and if y = 0 we use the loss −log(1 − p). Combining the two gives what is called the cross entropy loss, written as −y log(p) − (1 − y) log(1 − p). Let us check what happens when y = 1: the second term, −(1 − y) log(1 − p), becomes 0,

so for y = 1 we are left with only the first term, −log(p). For y = 0, the first term becomes 0 and we are left with only the second term, −log(1 − p). This is a clever way of packing the two losses into a single equation. It is this cross entropy loss that we try to minimize while solving classification problems; this particular form is used for binary classification.
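As a sanity check, here is a minimal NumPy sketch of the binary cross entropy loss; the predicted probabilities below are made up for illustration:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Cross entropy -y*log(p) - (1-y)*log(1-p), averaged over all points.
    eps guards against log(0) when p is exactly 0 or 1."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])               # actual labels
p_good = np.array([0.9, 0.1, 0.8, 0.2])  # confident and mostly correct
p_bad  = np.array([0.1, 0.9, 0.2, 0.8])  # confidently wrong

print(binary_cross_entropy(y, p_good))   # small loss
print(binary_cross_entropy(y, p_bad))    # large loss, the big penalty we wanted
```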

For a multiclass classification problem we use the categorical cross entropy loss, and if we represent our labels as integers rather than one-hot vectors, we use what is called the sparse categorical cross entropy loss. These are the loss functions used for classification tasks, whether with logistic regression or with neural network models.
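The two multiclass variants differ only in how the labels are encoded, not in the loss they compute. A minimal NumPy sketch, with class probabilities made up for illustration:

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],   # predicted class probabilities
                  [0.1, 0.8, 0.1]])  # for two examples, three classes

# categorical cross entropy: labels given as one-hot vectors
one_hot = np.array([[1, 0, 0],
                    [0, 1, 0]])
cce = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

# sparse categorical cross entropy: the same labels given as integers
labels = np.array([0, 1])
scce = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print(cce, scce)  # identical values: only the label encoding differs
```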

Having defined a loss function, we know how to measure the error once we fix the parameters of the model. Our job now is to find the optimal values of the parameters such that the loss function is minimized. How do we actually solve this problem? This is where optimization techniques, or optimization algorithms, help us, and that is what we will take up next.
