The Delta Rule – The topic for today's article is the delta rule. In the last article, we saw what the backpropagation algorithm is, worked through the derivatives of the loss function that we come across along the way, and saw how to represent them in matrix form. So in today's article, let's see what the delta rule is and where it is mainly used.
The delta rule is mainly used for perceptron learning. In the very first article on deep learning, we saw that there are two kinds of neurons, or two kinds of neural network structure. One is the MCP (McCulloch-Pitts) neuron, which we set aside because it cannot represent the XOR gate, and the other is the perceptron.
That is the Rosenblatt perceptron, which is the one we mostly use. The delta rule is used for updating its weights, and it is essentially derived from gradient descent, which we already know from linear regression. So say we have our network: four inputs, $x_1$ to $x_4$, then some neurons in the next layer, and connections joining the inputs to these neurons.
At the end we have the final output, $\hat{y}$ (y cap). You know that there are weights on these connections: the first layer has a weight matrix of size 3 × 4, the second layer has one of size 2 × 3, and the third layer has one of size 1 × 2. Our major goal, whenever we implement a neural network or do deep learning, is to learn these weights. I have mentioned before that these weights are assigned randomly.
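To make the shapes concrete, here is a minimal sketch of that 4 → 3 → 2 → 1 network with randomly assigned weights. The sigmoid activation, the range of the random values, the missing biases and the sample input are all assumptions made for illustration; the article does not specify them.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def random_matrix(rows, cols):
    """Return a rows x cols matrix of small random weights (range is an assumption)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def sigmoid(z):
    # Assumed activation; the article does not name one.
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, inputs):
    """Apply one layer: each row of W produces the output of one neuron."""
    return [sigmoid(sum(w * v for w, v in zip(row, inputs))) for row in W]

# Shapes from the text: 4 inputs -> 3 neurons -> 2 neurons -> 1 output
W1 = random_matrix(3, 4)
W2 = random_matrix(2, 3)
W3 = random_matrix(1, 2)

x = [1.0, 0.5, -0.2, 0.3]  # one made-up input with 4 features
y_hat = layer(W3, layer(W2, layer(W1, x)))[0]

print(len(W1) * len(W1[0]), "weights in the first layer")
print("y_hat =", y_hat)
```

Each row of a weight matrix belongs to one neuron, which is why a layer that maps 4 inputs to 3 neurons needs a 3 × 4 matrix, or 12 weights.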
But we have not discussed why they are assigned randomly, or why we do not keep one uniform value across all of them. So today we start with this issue. Whenever we design a neural network, it runs into what is called the symmetry problem, and getting around it is called symmetry breaking. Say, for example, that we initially assign the value 10 to all 12 weights on the connections in the first layer.
Keeping them all at 10 means we are using one single value. These weights are updated depending on the error we get at the final output. So when we backpropagate across the layers, the neurons in a layer all receive identical updates, because their weights start out identical. As an analogy, think of these weights as the marks given to some students.
If the teacher gives everyone the same grade, nothing differentiates one student from another. If one person gets 10 out of 20 and I get the same, I have no reason to go to the teacher and ask why I got this mark and he got a different one. That differentiation never happens in this case. In the same way, neurons that start out identical stay identical, so effectively the layer learns nothing.
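The symmetry problem can be seen in a few lines of code. The sketch below is an illustration under assumptions that are not in the article: a tiny network with 2 inputs, 2 sigmoid hidden neurons and one linear output, a squared error, and a made-up training example. It compares the gradients that the two hidden neurons receive when all weights are equal and when they are random.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_gradients(W_hidden, w_out, x, t):
    """Gradient of 0.5 * (t - o)^2 with respect to each hidden neuron's weights."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in W_hidden]
    o = sum(w * hj for w, hj in zip(w_out, h))  # linear output neuron
    grads = []
    for j in range(len(W_hidden)):
        # Chain rule: dJ/do * do/dh_j * dh_j/dz_j, then dz_j/dw_ji = x_i
        delta = -(t - o) * w_out[j] * h[j] * (1.0 - h[j])
        grads.append([delta * v for v in x])
    return grads

x, t = [1.0, 2.0], 1.0  # made-up input and target

# Every weight set to the same value, in the spirit of the 10-10-10 example
same = hidden_gradients([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5], x, t)
print("equal weights, identical gradients:", same[0] == same[1])

# Random weights break the symmetry
random.seed(1)
W = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
v = [random.uniform(-1, 1) for _ in range(2)]
rand = hidden_gradients(W, v, x, t)
print("random weights, identical gradients:", rand[0] == rand[1])
```

With equal weights both hidden neurons compute the same output and receive the same gradient, so after any number of updates they are still copies of each other.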
So if you initially assign all the weights the same value, the neural network will end up learning nothing at all. But if you assign random values instead, say 15 for this one, 20 for this one and 11 for this one, then each weight has its own starting point and the weights can be told apart. So this problem is counteracted by assigning the weights randomly. To picture this geometrically, say we have two weights, one along each of two coordinate axes.
Plotted over these two weights, the error forms a surface, and you can see that the surface sketched here is saddle-shaped. The error in a particular layer is reduced by adjusting the weights, starting from their random values. That is the main reason we assign the weights randomly. In this picture, say this is the positive quadrant and this is the negative quadrant. Whichever portion of the surface we start in, we have to adjust the weights accordingly.
If at the beginning you assign the weights equally, say (0, 0), they would eventually not learn anything. That is why we assign random weights whenever we start training a neural network. Our major goal is to learn these weights, and each time we get the final output, the weights are updated with the help of some rule. We know that our error function J depends on the weights and the biases. To write it down, we sum over d, where d stands for each instance.
For each instance we calculate $(t_d - o_d)^2$, so that $J(\vec{w}) = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2$. This is not new. It is the same $y$ and $\hat{y}$ that we saw in gradient descent for linear regression, and only the terminology has changed: $t_d$ is the true value and $o_d$ is the output value. For this setting, say the output is $o_d = w_0 + w_1 x_1 + w_1 x_1^2 + w_2 x_2 + w_2 x_2^2 + \dots + w_n x_n + w_n x_n^2$, where $w_0$ is for the bias. What we need to do now is differentiate this error function.
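As a minimal sketch, the output and the error function described above can be written directly in Python. The function names, the weights and the two data points are made up for illustration.

```python
def output(w, x):
    """o = w0 + sum_i w_i * (x_i + x_i**2); w[0] is the bias weight."""
    return w[0] + sum(wi * (xi + xi ** 2) for wi, xi in zip(w[1:], x))

def error(w, data):
    """J(w) = 1/2 * sum over instances d of (t_d - o_d)**2."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in data)

w = [1.0, 2.0, -1.0]                            # w0, w1, w2 (made up)
data = [([1.0, 2.0], 0.0), ([0.0, 1.0], 1.0)]   # (input vector, true value) pairs

print(output(w, data[0][0]))  # 1 + 2*(1 + 1) - 1*(2 + 4) = -1.0
print(error(w, data))         # 0.5 * ((0 + 1)**2 + (1 + 1)**2) = 2.5
```

Note that each weight $w_i$ multiplies both $x_i$ and $x_i^2$, exactly as in the expression for the output above.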
The error is sometimes also written as E, but we will stick to one notation, J, and not change it every time. What we need to do is differentiate J with respect to each individual weight: with respect to $w_0$, $w_1$, $w_2$ and so on up to $w_n$. That gives $\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \dots, \frac{\partial J}{\partial w_n}$, which is $n + 1$ partial derivatives in all. We then update the weights across the different layers so that the error obtained at the final output is driven down.
So how do we do this? We need to take the derivative, so just remember the form of our output. It has linear terms and polynomial (squared) terms, so it is not linear in the input vector, but it is linear in the weights. So what I do now is differentiate this equation.
One important point first: we took the square of the error because it gives a convex function that is easy to differentiate. That is also why there is a 1/2 in front. Normally we would average with 1/n, but here we use 1/2 so that it cancels the 2 that appears when we differentiate. So how do we differentiate this? We write

$$\frac{\partial J}{\partial w_i} = \frac{1}{2}\sum_{d \in D} 2\,(t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - o_d).$$

The 1/2 is a constant, the square brings down a factor of 2, and by the chain rule we still have to differentiate the inner part, $t_d - o_d$. If you inspect this closely, the 1/2 and the 2 cancel out. We can equally think of this as differentiating with respect to the whole weight vector. Everything here is a vector, whether it is our weights or our inputs.
So what we are left with is

$$\frac{\partial J}{\partial w_i} = \sum_{d \in D}(t_d - o_d)\,\frac{\partial}{\partial w_i}(t_d - o_d),$$

where capital D represents the entire population of your data and d is one particular instance taken from it. Now I will rewrite $o_d$. We know it was $w_0 + w_1 x_1 + w_1 x_1^2 + \dots$, so I will write it in a generalized form. I use the weight vector $\vec{w}$, which stands for all the weights $w_0, w_1, w_2, \dots, w_n$, and the input vector $\vec{x}_d$ for instance d, plus a similar term in which the input is squared component by component:

$$\frac{\partial J}{\partial w_i} = \sum_{d \in D}(t_d - o_d)\,\frac{\partial}{\partial w_i}\left(t_d - \left(\vec{w}\cdot\vec{x}_d + \vec{w}\cdot\vec{x}_d^{\,2}\right)\right).$$

So I have just expressed the output function in terms of what we were given. Now we take the derivative of $t_d - o_d$. The true value $t_d$ does not depend on the weights, so it is a constant and its derivative is 0. If you open the brackets, what remains is the input terms with a minus sign in front. Just remember that the index i refers to the i-th component of the input vector.
Taking component i of instance d, the derivative of $o_d$ with respect to the weight $w_i$ is $x_{id} + x_{id}^2$, so the derivative of $t_d - o_d$ is $-x_{id} - x_{id}^2$. That is simple. We can take the negative sign outside and write

$$\frac{\partial J}{\partial w_i} = -\sum_{d \in D}(t_d - o_d)\left(x_{id} + x_{id}^2\right).$$

So with the minus sign pulled out in front, we have $t_d - o_d$ multiplied by $x_{id} + x_{id}^2$. This whole expression is what you get from differentiating the error function, that is, its gradient. Let us call it $\nabla J(\vec{w})$ (del J) with respect to the weight vector. Since we have differentiated with respect to the weights, we use this notation. Now, whenever you do gradient descent, you know that you have a learning rate, or learning parameter, $\eta$ (eta).
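One way to gain confidence in a derivative like this is to compare it with a numerical estimate. The sketch below is a check under stated assumptions rather than part of the original derivation: it codes the gradient as $-\sum_d (t_d - o_d)(x_{id} + x_{id}^2)$, treats the bias $w_0$ as multiplying a constant 1, and compares the result with central finite differences on made-up data.

```python
def output(w, x):
    """o = w0 + sum_i w_i * (x_i + x_i**2); w[0] is the bias weight."""
    return w[0] + sum(wi * (xi + xi ** 2) for wi, xi in zip(w[1:], x))

def error(w, data):
    """J(w) = 1/2 * sum over instances d of (t_d - o_d)**2."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in data)

def gradient(w, data):
    """dJ/dw_i = -sum_d (t_d - o_d) * (x_id + x_id**2); for the bias the factor is 1."""
    grad = [0.0] * len(w)
    for x, t in data:
        residual = t - output(w, x)
        grad[0] += -residual
        for i, xi in enumerate(x, start=1):
            grad[i] += -residual * (xi + xi ** 2)
    return grad

def numeric_gradient(w, data, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for i in range(len(w)):
        up, down = w[:], w[:]
        up[i] += eps
        down[i] -= eps
        grad.append((error(up, data) - error(down, data)) / (2 * eps))
    return grad

w = [1.0, 2.0, -1.0]                            # made-up weights
data = [([1.0, 2.0], 0.0), ([0.0, 1.0], 1.0)]   # made-up (input, true value) pairs

analytic = gradient(w, data)
numeric = numeric_gradient(w, data)
print(analytic)  # [-3.0, -2.0, -10.0]
print(numeric)
```

The two results should agree up to rounding error, since J is quadratic in the weights.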
Let us say we take a small baby step with a learning rate of 0.05. The update carries a negative sign, because we are going down the valley by taking small baby steps, followed by the learning rate $\eta$ and then the gradient of J with respect to the weight vector: $\Delta\vec{w} = -\eta\,\nabla J(\vec{w})$.
So this is our $\Delta\vec{w}$ for the weight vector. If you want to write it for a single weight $w_i$ instead, it is $\Delta w_i = -\eta\,\frac{\partial J}{\partial w_i}$. You can represent the delta rule for perceptron learning in either of these ways. Well, that was all regarding the delta rule for perceptron learning in deep learning.
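Putting the pieces together, here is a minimal sketch of the delta rule as a training loop. It assumes the output form used in this article, batch updates over all instances, a learning rate of 0.05, and a small made-up data set generated from the weights $w_0 = 1$ and $w_1 = 2$. The function names and the number of passes are illustrative choices.

```python
import random

def output(w, x):
    """o = w0 + sum_i w_i * (x_i + x_i**2); w[0] is the bias weight."""
    return w[0] + sum(wi * (xi + xi ** 2) for wi, xi in zip(w[1:], x))

def error(w, data):
    """J(w) = 1/2 * sum over instances d of (t_d - o_d)**2."""
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in data)

def delta_rule_step(w, data, eta):
    """One batch update: w_i <- w_i + eta * sum_d (t_d - o_d) * (x_id + x_id**2)."""
    delta = [0.0] * len(w)
    for x, t in data:
        residual = t - output(w, x)   # uses the weights from before this step
        delta[0] += eta * residual    # bias: the input factor is 1
        for i, xi in enumerate(x, start=1):
            delta[i] += eta * residual * (xi + xi ** 2)
    return [wi + dwi for wi, dwi in zip(w, delta)]

# Made-up data generated from t = 1 + 2 * (x + x**2)
data = [([-1.0], 1.0), ([0.0], 1.0), ([0.5], 2.5), ([1.0], 5.0)]

random.seed(0)
w = [random.uniform(-0.5, 0.5) for _ in range(2)]  # random initial weights
initial_error = error(w, data)

for _ in range(300):
    w = delta_rule_step(w, data, eta=0.05)

final_error = error(w, data)
print("weights:", w)
print("error before:", initial_error, "after:", final_error)
```

Because $\Delta w_i = -\eta\,\frac{\partial J}{\partial w_i}$ and the gradient itself carries a minus sign, the two signs cancel and the update adds $\eta \sum_d (t_d - o_d)(x_{id} + x_{id}^2)$ to each weight. On this data the weights should approach 1 and 2.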
reference – The Delta Rule