Backpropagation and Gradient Descent – In this article, we continue our project by explaining two important concepts in deep learning: backpropagation and gradient descent. The previous article showed the forward pass through our model. First, a linear sum computes the logit scores.
Next, the softmax function transforms the logit scores into probabilities. Finally, by comparing those probabilities with the one-hot encoded labels, we can calculate the cross-entropy loss. Now we can send the loss information backward to all the weights in the model and update them.
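As a recap, the forward pass described above can be sketched in a few lines of plain Python. This is only an illustration under assumed toy values: the input x, the weights W, the biases b and the function names are hypothetical and are not taken from the project's actual model.

```python
import math

def linear(x, W, b):
    """Logit scores: one weighted sum of the inputs per class."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def softmax(logits):
    """Turn logit scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot) if t)

# Hypothetical toy example: 2 input features, 3 classes.
x = [1.0, 2.0]
W = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]
b = [0.0, 0.0, 0.0]
label = [0, 0, 1]  # one-hot: the true class is index 2

logits = linear(x, W, b)
probs = softmax(logits)
loss = cross_entropy(probs, label)
print(logits, probs, loss)
```

With a one-hot label, the cross-entropy reduces to the negative log of the probability assigned to the true class, so the loss shrinks as the model becomes more confident in the correct answer.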
This backward pass, which works out how much each weight contributed to the loss, is called backpropagation. But here comes the question: how do we update the model weights using this backpropagated information? To answer it, let’s look at the relationship between the loss function and the model parameters in a simplified setting.
Let’s consider only two parameters here, W1 and W2. We can calculate the loss at each point (W1, W2) on this plot and then draw the loss function as a surface in a 3D plot. The shape of the surface depends on the loss function you choose, the model you use, and the type of problem you are working on. This loss function plot tells us how good our model is at making predictions with the current parameter settings.
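To make the idea of a loss surface concrete, here is a minimal sketch that evaluates a loss on a grid of (W1, W2) values. The toy dataset, the two-weight logistic model and the grid range are all assumptions chosen for illustration; the surface of the real model will look different.

```python
import math

# Hypothetical toy data: (x1, x2) features with a binary label.
DATA = [((1.0, 2.0), 1), ((2.0, 0.5), 0), ((0.5, 1.5), 1), ((1.5, 0.2), 0)]

def loss(w1, w2):
    """Mean cross-entropy of a two-weight logistic model on DATA."""
    total = 0.0
    for (x1, x2), y in DATA:
        z = w1 * x1 + w2 * x2           # linear sum (logit)
        p = 1.0 / (1.0 + math.exp(-z))  # probability of class 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(DATA)

# Grid of candidate values for W1 and W2, from -2.0 to 2.0 in steps of 0.5.
steps = [i * 0.5 - 2.0 for i in range(9)]

# surface[i][j] is the height of the loss surface at (steps[i], steps[j]).
surface = [[loss(w1, w2) for w2 in steps] for w1 in steps]

# The lowest grid point is the best parameter setting among those tried.
best = min((surface[i][j], steps[i], steps[j])
           for i in range(len(steps)) for j in range(len(steps)))
print("lowest loss on the grid, with its W1 and W2:", best)
```

Passing `surface` to a 3D plotting tool would give the kind of picture described above. Evaluating every grid point quickly becomes impractical as the number of parameters grows, which is why we need a smarter way to move toward low loss.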
More importantly, the loss surface tells us how to optimize the parameters from a given point: specifically, which direction to move in to decrease the loss most quickly. It helps to think of a ball rolling downhill from a given point. A complicated loss function like this one has many local minima, and which one the ball lands in depends on the random starting point.
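The rolling-ball picture can be illustrated with a small sketch. The one-dimensional function below is a made-up example with two valleys, not the loss of our model, and the "ball" simply takes many small steps downhill (the update rule itself is formalized in the following paragraphs). The two starting points are fixed rather than random so that the result is reproducible.

```python
def f(w):
    """Hypothetical non-convex 'loss' with two valleys."""
    return w**4 - 3 * w**2 + w

def slope(w):
    """Derivative of f, i.e. the slope at point w."""
    return 4 * w**3 - 6 * w + 1

def roll_downhill(start, step_size=0.01, n_steps=2000):
    """Repeatedly take a small step in the downhill direction."""
    w = start
    for _ in range(n_steps):
        w -= step_size * slope(w)
    return w

left = roll_downhill(-2.0)   # starts on the left side of the curve
right = roll_downhill(2.0)   # starts on the right side of the curve

print("from -2.0 the ball settles at w =", left, "with f(w) =", f(left))
print("from +2.0 the ball settles at w =", right, "with f(w) =", f(right))
```

The two runs end in different valleys, and only one of them is the lowest point of the curve. Nothing in the procedure itself tells the ball that a deeper valley exists elsewhere.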
How to find the global minimum is another topic that I won’t dive into today. Instead, let’s look at one of the minimum points and how to optimize the parameter settings to reach it. We can simplify the problem further by looking at only the W1 parameter and keeping W2 fixed. Mathematically, the update for W1 subtracts the learning rate times the slope at the current point from the current value of W1.
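Written out as a formula, the update looks like this. The symbols are a notational assumption of this write-up: L stands for the loss and the Greek letter eta for the learning rate.

```latex
W_1 \leftarrow W_1 - \eta \, \frac{\partial L}{\partial W_1}
```

The minus sign is what makes the step go downhill: where the slope is positive, increasing W1 would increase the loss, so W1 is decreased, and where the slope is negative, W1 is increased.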
The slope is given by the partial derivative of the loss function with respect to W1. If we apply this update repeatedly with a suitably small learning rate, we approach the minimum of the function after some iterations. This whole process is called gradient descent. We now have a lot of pieces, so let’s look at how the code integrates them in each iteration of the training process.
First, we obtain the model outputs using the model we defined earlier. Then we calculate the loss from the model outputs and the labels; the loss function we are using is cross-entropy. Finally, we backpropagate the loss and update the model parameters with gradient descent. That concludes this introduction to backpropagation and gradient descent.
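As a recap, the training iteration described above can be sketched in plain Python. This is not the project's actual code: the toy dataset, the learning rate and the number of epochs are assumptions, and the gradients are written out by hand for a softmax model with cross-entropy loss, where a deep learning framework would normally compute them automatically.

```python
import math

def softmax(logits):
    m = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, W, b):
    """Forward pass: linear sum followed by softmax."""
    logits = [sum(w * xj for w, xj in zip(row, x)) + bi
              for row, bi in zip(W, b)]
    return softmax(logits)

# Hypothetical toy dataset: 2 features, 2 classes, one-hot labels.
X = [[1.0, 2.0], [2.0, 0.5], [0.5, 1.5], [1.5, 0.2]]
Y = [[1, 0], [0, 1], [1, 0], [0, 1]]

W = [[0.0, 0.0], [0.0, 0.0]]  # one row of weights per class
b = [0.0, 0.0]
learning_rate = 0.5           # assumed value, not tuned
losses = []

for epoch in range(100):
    grad_W = [[0.0, 0.0], [0.0, 0.0]]
    grad_b = [0.0, 0.0]
    total_loss = 0.0
    for x, y in zip(X, Y):
        probs = forward(x, W, b)  # 1. obtain the model outputs
        # 2. cross-entropy loss between the outputs and the one-hot label
        total_loss += -sum(t * math.log(p) for p, t in zip(probs, y) if t)
        # 3. backpropagation: for softmax with cross-entropy, the gradient
        #    of the loss with respect to logit i is probs[i] - y[i]
        for i in range(2):
            delta = probs[i] - y[i]
            grad_b[i] += delta
            for j in range(2):
                grad_W[i][j] += delta * x[j]
    n = len(X)
    # 4. gradient descent: step each parameter against its average gradient
    for i in range(2):
        b[i] -= learning_rate * grad_b[i] / n
        for j in range(2):
            W[i][j] -= learning_rate * grad_W[i][j] / n
    losses.append(total_loss / n)

print("loss in the first epoch:", losses[0])
print("loss in the last epoch:", losses[-1])
```

The four numbered comments correspond to the steps in the text: model outputs, loss, backpropagation, and the gradient descent update.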
In the next article, we’re going to talk about how to evaluate our trained model.