# The Backpropagation Algorithm

Basically, when we want to backtrack from the loss to each individual unit in the different layers of our neural network, we use the backpropagation algorithm, and the weights themselves are updated by the gradient descent algorithm. The new weight is given by the old weight minus some parameter, say $\lambda$, times the derivative of the loss function with respect to that weight:

$$w_{\text{new}} = w_{\text{old}} - \lambda \, \frac{\partial L}{\partial w}$$

This is what gradient descent says, and this is the weight update that takes place at each unit.

This $\lambda$ is nothing but your learning rate, which controls the speed of convergence. Picture the loss as a valley: the derivative points uphill, so to go down we have to move against its direction, taking small baby steps as we descend. That is why the update has a negative sign. Now, our cost function $J$ is a function of both the weights and the biases, so the free parameters of the loss are these two: the weights and the biases.
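To make the update rule concrete, here is a minimal sketch of gradient descent on a one-dimensional problem. The loss $L(w) = (w - 3)^2$, the starting point, the learning rate, and the function name `gradient_descent` are all assumptions chosen for illustration; they are not from the lecture.

```python
def gradient_descent(grad, w0, lam=0.1, steps=100):
    """Repeatedly apply w_new = w_old - lam * dL/dw."""
    w = w0
    for _ in range(steps):
        w = w - lam * grad(w)  # the minus sign moves us downhill
    return w


# Hypothetical loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3).
# Its minimum sits at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)  # ends up very close to 3
```

With a larger `lam` the steps get bigger and convergence is faster, up to the point where the steps overshoot the bottom of the valley.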

We assigned these parameters randomly, so our major goal is to adjust them. How do we write the cost? With $m$ training examples and $i$ running from 1 to $m$, it is the average of the loss $L$ between the actual value $y$ and the prediction $\hat{y}$:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L\left(y^{(i)}, \hat{y}^{(i)}\right)$$

So the cost over the weights and the biases is expressed in terms of the loss function.

The loss, in turn, is written in terms of $y$ and the predicted $\hat{y}$. Let us expand it:

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$

This is how our loss function looks in terms of the actual and the predicted value. We know the chain rule from our previous article. What we are after is $\partial L / \partial w$, so that we can update the weights. But if you look at the cost function, it does not contain $w$ explicitly, so we cannot differentiate it with respect to $w$ directly. Instead we apply the chain rule and see how to represent it. Say our neural network has inputs $x_1, x_2, x_3$, some weights and connections, and finally gives $\hat{y}$. The last layer, which produces $\hat{y}$, is the output layer.
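Going back to the expanded cost for a moment, it can be sketched in a few lines of Python. The function name `binary_cross_entropy_cost` and the sample labels and predictions are made up for illustration, and the sketch assumes every prediction lies strictly between 0 and 1 so that the logarithms are defined.

```python
import math


def binary_cross_entropy_cost(y_true, y_pred):
    """J = -(1/m) * sum(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    m = len(y_true)
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return -total / m


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, 1]
y_pred = [0.9, 0.2, 0.6]
cost = binary_cross_entropy_cost(y_true, y_pred)
print(cost)
```

Confident, correct predictions contribute little to the cost; the third example, predicted at only 0.6, contributes the most here.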

Before the output layer comes the hidden layer, and before that the input layer. Starting from the loss function, you need to go back and adjust the weights. The weights sit on the connections: the weights of the first layer between the input and the hidden layer, and the weights of the second layer between the hidden and the output layer. So how do we represent this? First of all, the loss function depends on $\hat{y}$, the predicted value. So we start from the prediction, on the right-hand side, and backtrack to the left-hand side. Now $\hat{y}$ is a function of $z$, where

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$

If you use a sigmoid function at the output, then

$$\hat{y} = \frac{1}{1 + e^{-z}}$$

You can differentiate this with respect to $z$, because $z$ appears in the expression, and finally $z$ can be differentiated with respect to $w$.
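The forward computation just described, a weighted sum followed by a sigmoid, can be sketched as follows. The helper names `sigmoid` and `forward` and the numeric inputs, weights, and bias are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b):
    """Return (z, y_hat) for one unit: z = w . x + b, y_hat = sigmoid(z)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return z, sigmoid(z)


# Hypothetical inputs, weights and bias.
x = [1.0, 2.0, 3.0]
w = [0.1, -0.2, 0.05]
b = 0.3
z, y_hat = forward(x, w, b)
print(z, y_hat)
```

Everything backpropagation needs later (`x`, `z`, `y_hat`) is produced during this forward pass.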

That is, $z$ is a function of the weights and your inputs. So this is the chain rule we are going to use in the backpropagation algorithm:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

Let us compute each of the terms one by one. For this we will not use the cost, which is just the average of the loss over all examples. Instead we take the plain loss for a single example.

The first term is $\partial L / \partial \hat{y}$, where $\hat{y}$ is your final output. How do we differentiate this? Let me write the loss out so that it becomes easier:

$$L = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$

This is our function, and we need $\partial L / \partial \hat{y}$. Here $y$ is not a function of $\hat{y}$.

So we keep $y$ as it is. The derivative of $\log \hat{y}$ is $1/\hat{y}$. For the second term we keep $(1 - y)$, and the derivative of $\log(1 - \hat{y})$ is $1/(1 - \hat{y})$ times the derivative of $1 - \hat{y}$, which is $-1$. That completes the derivative using the chain rule:

$$\frac{\partial L}{\partial \hat{y}} = -\left[ \frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \cdot (-1) \right]$$

Now let us consolidate $\partial L / \partial \hat{y}$. When the outer negative sign goes inside, the first term becomes negative and the second becomes positive, so we can rearrange it as

$$\frac{\partial L}{\partial \hat{y}} = \frac{1 - y}{1 - \hat{y}} - \frac{y}{\hat{y}}$$
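A quick way to gain confidence in this first term is to compare it with a numerical derivative. The sketch below does that with a central finite difference; the values of `y`, `y_hat`, and `eps` are arbitrary choices for illustration.

```python
import math


def loss(y, y_hat):
    """Single-example loss L = -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def dloss_dyhat(y, y_hat):
    """Analytic derivative: (1-y)/(1-y_hat) - y/y_hat."""
    return (1 - y) / (1 - y_hat) - y / y_hat


# Arbitrary test point and step size (assumptions).
y, y_hat, eps = 1.0, 0.7, 1e-6
numeric = (loss(y, y_hat + eps) - loss(y, y_hat - eps)) / (2 * eps)
analytic = dloss_dyhat(y, y_hat)
print(numeric, analytic)  # the two values should agree closely
```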

So we have got the very first term of the chain rule. Now we need to differentiate $\hat{y}$, our output. For logistic regression the output is the sigmoid function, $\hat{y} = 1 / (1 + e^{-z})$, so it is easy to differentiate with respect to $z$. We want $\partial \hat{y} / \partial z$, and if you look at this expression there are two ways to do it.

You can use the $u/v$ (quotient) rule for differentiation, or you can treat the expression as $1/x$, whose derivative is $-1/x^2$. For simplicity we will treat it as $1/x$ with $x = 1 + e^{-z}$. That gives $-1/(1 + e^{-z})^2$, and then we need to multiply by the derivative of the denominator, $1 + e^{-z}$. The derivative of $e^{x}$ is $e^{x}$, but here the exponent is $-z$.

So we also need the derivative of $-z$, which is $-1$, and the derivative of the denominator is $-e^{-z}$. The two negative signs cancel, and what we get is

$$\frac{\partial \hat{y}}{\partial z} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^2}$$

Now we adjust the numerator so that the expression reduces. We add one and subtract one, which gives

$$\frac{\partial \hat{y}}{\partial z} = \frac{1 + e^{-z}}{\left(1 + e^{-z}\right)^2} - \frac{1}{\left(1 + e^{-z}\right)^2}$$

Now we rearrange and substitute. In the first fraction the numerator cancels one power of the denominator, so we are left with $1/(1 + e^{-z})$. The second fraction is $1/(1 + e^{-z})^2$. The first is nothing but $\hat{y}$ and the second is $\hat{y}^2$, so taking $\hat{y}$ common:

$$\frac{\partial \hat{y}}{\partial z} = \hat{y} - \hat{y}^2 = \hat{y}\,(1 - \hat{y})$$

You can also write this in terms of the output of the layer. If $O_L$ is the output of the output layer, the derivative is $O_L (1 - O_L)$, so whenever we use a sigmoid its derivative is simply output times one minus output. Now we have two terms of the chain rule, and we need one more, the final $\partial z / \partial w$. Here $z = wx + b$. When we differentiate with respect to $w$, $b$ is a constant, so it becomes 0, and for $w$ we are left with $x$:

$$\frac{\partial z}{\partial w} = x$$
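The "output times one minus output" result is easy to check numerically. This is a small sketch that compares it against a central finite difference; the test point `z = 0.5` and the step `eps` are arbitrary assumptions.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid written as output * (1 - output)."""
    out = sigmoid(z)
    return out * (1.0 - out)


# Arbitrary test point and step size (assumptions).
z, eps = 0.5, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_prime(z)
print(numeric, analytic)  # the two values should agree closely
```

A practical consequence is that the derivative can be computed from the output alone, without evaluating the exponential a second time.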

So our final $\partial L / \partial w$ is the product of three quantities. The first quantity is $\frac{1 - y}{1 - \hat{y}} - \frac{y}{\hat{y}}$, the second is $\hat{y}(1 - \hat{y})$, and the third is $x$. Let us see if we can simplify this. We cross multiply the first quantity:

$$\frac{\partial L}{\partial w} = \frac{\hat{y}(1 - y) - y(1 - \hat{y})}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) \cdot x$$

The two $\hat{y}(1 - \hat{y})$ terms cancel, and what we are left with is the numerator times $x$. If we open the brackets we have $\hat{y} - y\hat{y} - y + y\hat{y}$, the two $y\hat{y}$ terms cancel, and we are left with

$$\frac{\partial L}{\partial w} = (\hat{y} - y)\, x$$

What we have done is a general representation of how to differentiate our loss function with respect to a weight. Now think about a network like the one we described, with three inputs and two hidden units.

There you have $2 \times 3$, that is 6, weights in the first layer. For each of those weights you have to do this much computation with the chain rule in order to make the gradient descent update. That was for an individual weight. For the biases we need $\partial L / \partial b$. Everything stays the same until the last term: when you differentiate $z = wx + b$ with respect to $b$, the $wx$ part is a constant, and for $b$ you are left with 1. So

$$\frac{\partial L}{\partial b} = (\hat{y} - y) \cdot 1 = \hat{y} - y$$

So you have these two equations, and that is how you differentiate the loss function with respect to both the weights and the biases.
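Putting the three terms together, the sketch below computes $\partial L/\partial w = (\hat{y} - y)x$ and $\partial L/\partial b = \hat{y} - y$ for a single neuron with one input, checks both against finite differences, and takes one gradient descent step. All numeric values and the function names are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(x, y, w, b):
    """Single-example cross-entropy loss for y_hat = sigmoid(w*x + b)."""
    y_hat = sigmoid(w * x + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def gradients(x, y, w, b):
    """Return (dL/dw, dL/db) using the simplified formulas."""
    y_hat = sigmoid(w * x + b)
    return (y_hat - y) * x, (y_hat - y)


# Hypothetical training example and parameters.
x, y, w, b = 2.0, 1.0, 0.5, -0.3
dw, db = gradients(x, y, w, b)

# Central finite differences as an independent check.
eps = 1e-6
dw_num = (loss(x, y, w + eps, b) - loss(x, y, w - eps, b)) / (2 * eps)
db_num = (loss(x, y, w, b + eps) - loss(x, y, w, b - eps)) / (2 * eps)

# One gradient descent step with learning rate lam.
lam = 0.1
w_new = w - lam * dw
b_new = b - lam * db
print(loss(x, y, w, b), loss(x, y, w_new, b_new))  # loss should go down
```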

Now we will take a setup with two neurons, one after the other, and one training example, and see how to use the backpropagation algorithm, that is the chain rule, as a step toward our vectorized computation. Say we have two units, and at the end we get the final prediction $\hat{y}$, which goes into the loss $L(y, \hat{y})$. We have one training example $x_1$ with a weight $w_1$ into the first unit, and a weight $w_2$ into the second unit. The first unit computes the linear summation $z_1$, and what comes out of it we call $a_1$. The second unit computes $z_2$. Where does $z_2$ come from? Simply, $a_1$ serves as its input: it gets multiplied with $w_2$ to give $z_2$, and the sigmoid then gives you $a_2$. Now we use the chain rule that we saw in the last article to get $\partial L / \partial w_1$.

So how do you take this derivative, or how do you backtrack using the chain rule? We start with $\partial L / \partial \hat{y}$. Then $\hat{y}$ is a function of $z_2$, $z_2$ is a function of $a_1$, $a_1$ is a function of $z_1$, and $z_1$ is a function of $w_1$, so we differentiate each with respect to the next:

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$

So you can see how the chain rule is applied across the different layers for this particular unit. This is just a small fraction of what a neural network is. We can imagine $n$ such connections inside a large network, such as a convolutional neural network or other neural networks. You can also replace $\hat{y}$ by $a_2$ here; that is not a big deal.
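The five-term chain can be written out literally in code. The sketch below assumes a sigmoid on both units and no biases, which the lecture does not state explicitly for this setup, and uses made-up values for `x`, `y`, `w1`, and `w2`. It multiplies the five factors and compares the product with a finite difference.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Two units in a row (assumed: sigmoid on both, no biases)."""
    z1 = w1 * x
    a1 = sigmoid(z1)
    z2 = w2 * a1
    a2 = sigmoid(z2)  # a2 is the prediction y_hat
    return z1, a1, z2, a2


def loss(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def dloss_dw1(x, y, w1, w2):
    """Multiply the five chain-rule factors from right to left."""
    _, a1, _, y_hat = forward(x, w1, w2)
    dL_dyhat = (1 - y) / (1 - y_hat) - y / y_hat
    dyhat_dz2 = y_hat * (1 - y_hat)
    dz2_da1 = w2
    da1_dz1 = a1 * (1 - a1)
    dz1_dw1 = x
    return dL_dyhat * dyhat_dz2 * dz2_da1 * da1_dz1 * dz1_dw1


# Hypothetical example and weights.
x, y, w1, w2 = 1.5, 1.0, 0.4, -0.6
analytic = dloss_dw1(x, y, w1, w2)

eps = 1e-6
numeric = (loss(y, forward(x, w1 + eps, w2)[3])
           - loss(y, forward(x, w1 - eps, w2)[3])) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```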

We started from $\hat{y}$ just for our understanding, but $a_2$ is the output of this last neuron, so you would come to it first and then backtrack. $\hat{y}$ is nothing but the net output that you get after all the processing in the different layers. Now let us move from this structure to a somewhat larger network. Say we have three inputs.

And we have two neurons in the hidden layer, with connections from every input to each of them, and then the final output $\hat{y}$, so there are two layers of weights here. Now let us label everything. In the very first layer, the first unit has the linear function $z_1^{[1]}$ and the output $a_1^{[1]}$, where the superscript is the layer and the subscript is the unit. Similarly, the second unit has $z_2^{[1]}$ and $a_2^{[1]}$ from the first layer. In the second layer we have $z_1^{[2]}$ and the output $a_1^{[2]}$, since only one neuron sits there. This is the output layer.

Now let us see how we take the derivative and which path is followed. Consider our loss function, and let me simply call the output of each layer $a$. We want $\partial L / \partial w_{11}$ for the first layer. Before that, if you remember, in our weight matrix article we said that we represent the weights as matrices so that our computations become easier. So what is the shape of the weight matrix of the first layer here? It is $2 \times 3$, that is two rows and three columns.

That is,

$$W^{[1]} = \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{pmatrix}$$

and similarly the second weight matrix of this network is $1 \times 2$:

$$W^{[2]} = \begin{pmatrix} w_{11} & w_{12} \end{pmatrix}$$

Our major goal now is to differentiate the loss with respect to these weight matrices using the chain rule, in a form that keeps our computations simple.
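To see why the matrix form helps, here is a minimal forward pass through the three-input, two-hidden-unit, one-output network using a $2 \times 3$ and a $1 \times 2$ weight matrix. Plain Python lists stand in for matrices, biases are left out, and the weight and input values are invented for illustration; `matvec` is a hypothetical helper, not a library function.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


# Hypothetical weights: W1 is 2 x 3 (two hidden units, three inputs),
# W2 is 1 x 2 (one output unit, two hidden units). Biases omitted.
W1 = [[0.1, 0.2, 0.3],
      [0.4, 0.5, 0.6]]
W2 = [[0.7, -0.8]]
x = [1.0, 0.5, -1.0]

z1 = matvec(W1, x)              # linear part of the first layer
a1 = [sigmoid(z) for z in z1]   # outputs of the first layer
z2 = matvec(W2, a1)             # linear part of the second layer
a2 = [sigmoid(z) for z in z2]   # output of the second layer
y_hat = a2[0]
print(z1, a1, y_hat)
```

Each row of a weight matrix holds the weights going into one unit, so one matrix-vector product computes every unit of a layer at once.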

We try to make this computation as efficient as we can. Now, how do we go back from the loss function to the individual weights? You can treat this as a more complicated network: look at how many paths there are along the connections drawn in blue.

Let us see how we reach $w_{11}$ of the first layer. We start from the output layer with $\partial L / \partial a_1^{[2]}$, where $a_1^{[2]}$ is the output of the single unit in the second layer. From $a_1^{[2]}$ we go to its linear function $z_1^{[2]}$ through the sigmoid, which is the nonlinear part. From $z_1^{[2]}$ we now have two ways to go back, one through each hidden unit, and we have to choose one path. Our $w_{11}$ belongs to the first hidden unit, so we treat $z_1^{[2]}$ as a function of that unit's output $a_1^{[1]}$. Then comes $z_1^{[1]}$, and finally from $z_1^{[1]}$ we reach $w_{11}$ of the first layer:

$$\frac{\partial L}{\partial w_{11}^{[1]}} = \frac{\partial L}{\partial a_1^{[2]}} \cdot \frac{\partial a_1^{[2]}}{\partial z_1^{[2]}} \cdot \frac{\partial z_1^{[2]}}{\partial a_1^{[1]}} \cdot \frac{\partial a_1^{[1]}}{\partial z_1^{[1]}} \cdot \frac{\partial z_1^{[1]}}{\partial w_{11}^{[1]}}$$

So we backtracked from the second layer to the first layer. Working out this routing, from where to where you have to go, becomes a difficult activity when done weight by weight. For $w_{12}$ and $w_{13}$ you would write the same kind of equations. So what we do instead is take a matrix of derivatives, a differentiated weight matrix.

For the very first layer, instead of the weights themselves we keep their differentiated form:

$$\frac{\partial L}{\partial W^{[1]}} = \begin{pmatrix} \dfrac{\partial L}{\partial w_{11}} & \dfrac{\partial L}{\partial w_{12}} & \dfrac{\partial L}{\partial w_{13}} \\[2ex] \dfrac{\partial L}{\partial w_{21}} & \dfrac{\partial L}{\partial w_{22}} & \dfrac{\partial L}{\partial w_{23}} \end{pmatrix}$$

Nothing has changed here; we have only written the derivatives in a compact representation. We keep the small change for each weight together so that the update becomes easier. We do the same with the second matrix:

$$\frac{\partial L}{\partial W^{[2]}} = \begin{pmatrix} \dfrac{\partial L}{\partial w_{11}} & \dfrac{\partial L}{\partial w_{12}} \end{pmatrix}$$

So the dimensions do not change.
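As a rough illustration of these gradient matrices, the sketch below backpropagates through the three-input, two-hidden-unit, one-output network and builds $\partial L/\partial W^{[1]}$ (shape $2 \times 3$) and $\partial L/\partial W^{[2]}$ (shape $1 \times 2$). It assumes sigmoid activations everywhere, no biases, and made-up weights, and it reuses the earlier result that $\partial L/\partial z = \hat{y} - y$ at a sigmoid output. One entry is checked against a finite difference.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W1, W2, x):
    """Return hidden outputs a1 and prediction y_hat (no biases)."""
    z1 = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    a1 = [sigmoid(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2[0], a1))
    return a1, sigmoid(z2)


def loss(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def backward(W1, W2, x, y):
    """Return gradient matrices with the same shapes as W1 and W2."""
    a1, y_hat = forward(W1, W2, x)
    dz2 = y_hat - y                               # dL/dz at the output
    dW2 = [[dz2 * a for a in a1]]                 # 1 x 2
    dz1 = [dz2 * W2[0][j] * a1[j] * (1 - a1[j])   # back through each
           for j in range(len(a1))]               # hidden unit
    dW1 = [[dz1[j] * xi for xi in x]              # 2 x 3
           for j in range(len(a1))]
    return dW1, dW2


# Hypothetical weights, input and label.
W1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
W2 = [[0.7, -0.8]]
x, y = [1.0, 0.5, -1.0], 1.0
dW1, dW2 = backward(W1, W2, x, y)

# Finite-difference check of the w11 entry of the first layer.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][0] += eps
W1_minus[0][0] -= eps
numeric_w11 = (loss(y, forward(W1_plus, W2, x)[1])
               - loss(y, forward(W1_minus, W2, x)[1])) / (2 * eps)
print(dW1[0][0], numeric_w11)  # the two values should agree closely
```

Because each gradient matrix has the same shape as its weight matrix, the update is just an element-wise `W - lam * dW`.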

Instead, we have only represented the derivatives in a simpler form that is easier to compute with. So that was all regarding the backpropagation algorithm in deep learning, which is the most important rule when you want to update the weights of a neural network with the help of the chain rule. If you know the chain rule, working through backpropagation becomes much easier.