Basics of Neural Networks
Hi. Let’s continue our exploration of neural networks so that we can then move on to more complex neural networks, also known as deep learning. In the past few articles, we looked at a very simple toy example where we had features that describe a cat and a dog. We fed those features into a network, and we got a classification of cat or dog. We had numbers that were multiplied by weights, we passed them through an activation function, and we got either 1 or 0. Sometimes we made errors, and we used those errors to go back and readjust the weights until they worked for us. This is called back propagation.
Here we will look a little more in depth at the structure of a neural network. So again, if you didn’t read the last few articles, or if you have a lot of experience with machine learning, this is just a summary of what we did. We had two items, a cat and a dog. Each of them has a vector of features. The two features that we had were ‘has whiskers’, which for the cat is a 1 and for the dog is a 0, and ‘is a good boy’, which for the cat is a 0 (jokingly) and for the dog is a 1.
So we have a training set, which is the vectors of features, and then we have a label for each of them: the cat was a 1, the dog was a 0. We had an input layer with two neurons, one for each feature, ‘has whiskers’ and ‘is a good boy’. The values were multiplied by the weights, and the result was passed through an activation function, round, which just rounds to the nearest integer, and then we got an output value, 0 or 1.
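The forward pass just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weight values are made up for the example (the originals from the earlier articles are not repeated here), and the rounding is written as a 0.5 threshold because Python's built-in round rounds exact halves to the nearest even integer.

```python
def round_activation(z):
    """Round to the nearest integer, restricted to 0 or 1.

    Written as a threshold at 0.5 (an assumption for this sketch), since
    Python's built-in round() sends 0.5 to 0 rather than to 1.
    """
    return 1 if z >= 0.5 else 0


def forward(features, weights):
    """Multiply each feature by its weight, add up, and apply the activation."""
    z = sum(x * w for x, w in zip(features, weights))
    return round_activation(z)


# Feature order: [has whiskers, is a good boy]
cat = [1, 0]
dog = [0, 1]

# Hypothetical weights, chosen only so that the example classifies correctly.
weights = [0.9, 0.2]

print("cat ->", forward(cat, weights))  # weighted sum 0.9, so output 1
print("dog ->", forward(dog, weights))  # weighted sum 0.2, so output 0
```

With these particular weights the cat comes out as 1 and the dog as 0, which matches the labels.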
At the beginning, the network had the starting weights shown here at the lower left. Those weights did not give us good results, so we used back propagation. We got the errors after the activation function, we figured out how much error we had, and then we used that error to go back and correct the weights. As we corrected them, the weights changed little by little until we got them to where we needed them.
After three training epochs, the weights had values that correctly identified a cat and a dog based on these features. Again, as a summary: we had an input layer with two input neurons and an output layer with one output neuron. The neurons and the layers were connected by weights, and we used the training set to readjust those weights. We calculated our values and figured out what the error was using a kind of loss function, where the loss is the accumulation of error. Then, knowing what the errors were, we back propagated the changes to the weights, so that the weights themselves could change and we could eventually get correct calculations.
We also regulated the degree of change using something called a learning rate, where we multiplied the change that we had, the error, by 0.1, for example. We did this over several epochs until the network converged onto the values that we wanted. So when the weights started to give us the multiplications that we wanted, we said that we had converged onto our solution. And of course this is just arithmetic: we’re just looking for the right numbers for this kind of magic to happen.
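As a rough sketch of that training loop, the code below applies a simple single-layer update: each weight moves by learning rate times error times input. The starting weights are hypothetical, the loss is assumed to be the sum of absolute errors in an epoch, and the number of epochs it takes depends on those choices, so it will not necessarily match the three epochs from the original exercise.

```python
def predict(features, weights):
    """Weighted sum followed by a round-to-0-or-1 activation (0.5 threshold)."""
    z = sum(x * w for x, w in zip(features, weights))
    return 1 if z >= 0.5 else 0


def train(examples, weights, learning_rate=0.1, max_epochs=20):
    """Adjust the weights until an epoch finishes with zero loss.

    Returns the final weights and the number of epochs that were run.
    """
    weights = list(weights)  # work on a copy
    for epoch in range(1, max_epochs + 1):
        loss = 0  # assumed loss: accumulated absolute error over the epoch
        for features, label in examples:
            error = label - predict(features, weights)
            loss += abs(error)
            # Correct each weight in proportion to the error and its input.
            for i, x in enumerate(features):
                weights[i] += learning_rate * error * x
        if loss == 0:  # converged: every example was classified correctly
            break
    return weights, epoch


# ([has whiskers, is a good boy], label): cat is 1, dog is 0
examples = [([1, 0], 1), ([0, 1], 0)]

# Hypothetical starting weights that misclassify both animals.
weights, epochs = train(examples, [0.25, 0.75])
print("epochs run:", epochs)
print("cat ->", predict([1, 0], weights), "dog ->", predict([0, 1], weights))
```

Making the learning rate smaller makes each correction smaller, so the same starting weights need more epochs to converge.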
The idea of neural networks came along in the 1940s, when McCulloch and Pitts proposed that the biological neurons in our brains and computer logic had certain parallels. Biological neurons are cells, so they have a nucleus, but they also have extensions called dendrites. These receive inputs from neighboring neurons, electrical inputs, literally electricity. When they get these inputs, they transmit some of this electricity along their axons to terminals, which connect to other neurons, and there they transmit a different electrical signal as their output.
So they get some input as electricity, and then they transmit some output as electricity on to other neurons. The scientists proposed that this is very similar to the behavior of logic circuits. For example, if a biological neuron got two positive electric inputs and used them to give a positive electric output, you have two positives in and a positive out. This is similar to the behavior of the logic gate ‘and’, where yes and yes equals yes.
If, for example, your neuron got a positive electric input and a negative electric input, and out of that it produced a positive output, so positive and negative in and positive out, this is similar to the gate ‘or’, where one or zero equals one. So there are parallels between what the neurons in our brains are doing and what computer logic does.
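The parallel between neurons and logic gates can be sketched with a simple threshold unit: it adds up its weighted inputs and fires only if the total reaches a threshold. The weights and thresholds below are one common choice made for this illustration, not values taken from the original proposal.

```python
def threshold_neuron(inputs, weights, threshold):
    """Fire (return 1) only when the weighted sum reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


def and_gate(a, b):
    # Both inputs must be active to reach a threshold of 2.
    return threshold_neuron([a, b], [1, 1], threshold=2)


def or_gate(a, b):
    # A single active input is enough to reach a threshold of 1.
    return threshold_neuron([a, b], [1, 1], threshold=1)


for a in (0, 1):
    for b in (0, 1):
        print(f"a={a} b={b}  and={and_gate(a, b)}  or={or_gate(a, b)}")
```

The only difference between the two gates is the threshold, which is the same idea as a neuron needing more or less input before it fires.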
Their theory was that by putting many of these units together, we could have the equivalent of many logic gates, and eventually we could build software that resembled and followed the model of these biological neurons. There’s another way in which the neurons in our brain are similar to artificial neural networks.
It’s that they don’t just pass along the electricity that they got as a continuous function. They don’t transfer the information verbatim, they alter it. For example, if you get 0.3 activation and 0.1 activation, you don’t necessarily get 0.4 activation. What happens in neurons is that they need a certain amount of electricity to get started. They need a certain amount of depolarization to open their chemical gates, and when they reach a certain electrical level they fire: they change polarity, stay there briefly, and then suddenly repolarize and come back to their previous polarity.
So natural neurons don’t transmit the information, the electrical potential, exactly as they got it. They have to accumulate a certain potential, and only once enough potential has accumulated does the gate open, transmit, and then go back to its resting state. If it doesn’t get enough electrical activation, the neuron is not going to transfer information. So again, it needs a certain activation potential, and it switches between just an off state and an on state, something like all no or all yes. In general we call this categorical perception, or categorization, where you get an input but then you have to decide whether it’s one thing or another.
There are many domains of human cognition that have these kinds of phenomena, where you don’t just add up the activations of the inputs, but you have to make a decision with the input. You have to go one way or the other. This is very common in vision. For example, is this dress blue or white? None of you are thinking that it’s something in the middle; all of you are thinking that it’s either one or the other.
Likewise with the illusion at the bottom. Is this two faces or is this a glass? You can see one or the other, but none of you are seeing something in the middle. Your brain goes either one direction or the other. And this is something that our artificial networks also do. They have types of activation functions that manipulate the signal. A very common one is the sigmoid function, which can take continuous values but only between zero and one: it never goes below zero or above one, and it can take any decimal in between, 0.1, 0.2, 0.3, 0.4. We can use this to simulate a probability, for example, where you can have continuous values but only in this range.
There are other functions like ReLU, the rectified linear unit, where you get either zero or values larger than zero. ReLU is very popular, for example so that you do not clutter your field and so that you have a larger range of values to use. We like these functions because they are differentiable (ReLU everywhere except exactly at zero). They all have derivatives, and we’re going to use those derivatives in a minute or two to minimize our loss, to compensate for the error that we get.
One final note about perceptrons, and in this case multi-layer perceptrons. In our simple example, we had a single input layer and one output layer.
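To make the sigmoid and ReLU functions from a moment ago more concrete, here is a minimal sketch of both, together with their derivatives. The function names are just illustrative, and the value used for the ReLU derivative at exactly zero is a convention chosen for this sketch.

```python
import math


def sigmoid(z):
    """Squash any real number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Slope of the sigmoid: largest at z = 0, close to 0 far from it."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    """Rectified linear unit: zero for negative inputs, unchanged otherwise."""
    return max(0.0, z)


def relu_derivative(z):
    """Slope of ReLU: 0 on the negative side, 1 on the positive side.

    ReLU has no derivative at exactly z = 0; returning 0 there is a
    convention assumed for this sketch.
    """
    return 1.0 if z > 0 else 0.0


for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.4f}  "
          f"sigmoid'={sigmoid_derivative(z):.4f}  "
          f"relu={relu(z):.1f}  relu'={relu_derivative(z):.1f}")
```

Printing a few values shows the difference in range: the sigmoid stays between zero and one, while ReLU grows without an upper limit.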
Most neural networks that you will find in the world have more layers, hidden layers, in between the input and the output. These layers are going to help us add complexity to our calculation, and in particular they’re going to help us classify with nonlinear functions. In previous classifiers, we could run lines through things, or maybe wavy lines through things. With these kinds of networks, we can approximate nonlinear functions arbitrarily well, provided that we have a network of enough complexity.
For example, if we have something like this, where you have a blobular cluster of blue inside a ring of orange, there’s no line that could go in between those clusters and correctly separate the blue from the orange. You need some sort of nonlinear function, like a circle, to correctly separate the two. And in a neural network, you can do this.
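Here is a small sketch of why a hidden layer helps with the blue cluster inside the orange ring. The data is synthetic, and the weights of the tiny network are picked by hand rather than learned, purely to show that one hidden layer of ReLU neurons can draw a closed boundary that a single straight line cannot.

```python
import math


def make_ring_data(n=40):
    """Blue points (label 1) near the origin, orange points (label 0) on a ring."""
    data = []
    for k in range(n):
        angle = 2 * math.pi * k / n
        data.append(((0.5 * math.cos(angle), 0.5 * math.sin(angle)), 1))
        data.append(((2.0 * math.cos(angle), 2.0 * math.sin(angle)), 0))
    return data


def relu(z):
    return max(0.0, z)


def line_classifier(point):
    """One arbitrary straight line (x1 = 0): everything to the right is blue."""
    return 1 if point[0] > 0 else 0


# Hand-picked, untrained weights for four hidden neurons. They compute
# relu(x1), relu(-x1), relu(x2) and relu(-x2), which add up to |x1| + |x2|.
HIDDEN_WEIGHTS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def tiny_network(point):
    """One hidden ReLU layer, then an output neuron with a hand-set threshold."""
    x1, x2 = point
    hidden = [relu(w1 * x1 + w2 * x2) for w1, w2 in HIDDEN_WEIGHTS]
    # Small total activation means the point is close to the origin: blue.
    return 1 if sum(hidden) < 1.2 else 0


def accuracy(classifier, data):
    return sum(classifier(point) == label for point, label in data) / len(data)


data = make_ring_data()
print("straight line:", accuracy(line_classifier, data))
print("hidden layer: ", accuracy(tiny_network, data))
```

The straight line gets roughly half of the points right, while the hidden layer separates them all with a diamond-shaped boundary around the blue cluster. A trained network would find its own weights, and its boundary would usually look different.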
Recall the epochs from our cats and dogs: one epoch, two epochs. This is the way that you train a neural network. An epoch is one entire pass over all of your training set, and the training examples are sometimes sent in batches, because most training sets are not just two items. Sometimes there are millions or even billions of items, so you cannot send them all forward and then back propagate all of them at the same time. You need to send them in blocks that go forward, and on which you can do back propagation.
So you send batches, and once you have gone over your whole training set, that constitutes an epoch. When you go from the input through the hidden layers to the output layer, you get a certain error at the end, which we saw as the deltas. This is the loss: the summation of all of the errors that we made along the way, all the differences between what we got and what we wanted to get.
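The relation between batches, epochs, and loss can be sketched as follows. The predictions here are made-up numbers standing in for the outputs of some network, and the loss is assumed to be the plain sum of absolute differences between what we got and what we wanted.

```python
def batches(examples, batch_size):
    """Yield the training set in consecutive blocks of at most batch_size items."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


def batch_loss(predictions, targets):
    """Sum of the differences between what we got and what we wanted."""
    return sum(abs(target - pred) for pred, target in zip(predictions, targets))


targets = [1, 0, 1, 1, 0, 0, 1]
predictions = [1, 1, 1, 0, 0, 0, 1]  # hypothetical outputs of a network

# One epoch: go over the whole training set once, one batch at a time.
epoch_loss = 0
for batch in batches(list(zip(predictions, targets)), batch_size=3):
    preds, wanted = zip(*batch)
    epoch_loss += batch_loss(preds, wanted)
    # In real training, back propagation would update the weights here,
    # after each batch, before the next batch goes forward.

print("loss for this epoch:", epoch_loss)
```

With seven examples and a batch size of three, the epoch consists of three batches, the last one holding a single example.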
Once we have a function to calculate this loss, our mission is going to be to minimize that function, and in order to minimize it we need it to be differentiable. We need the derivative of that function so that, if it has this shape for example, we can find the minimum, which is going to happen here, where the derivative of the function is zero.
As you can see here, if the loss function had this kind of shape, for example, we would be looking for areas where the derivative of that function is zero, so we’re looking for the shallows in the function. We’re going to use a numerical method called gradient descent, where we walk across the space of the loss function in tiny steps, which are going to be our descent into, hopefully, the shallow at the minimum. The rate of descent is going to be the learning rate, what we had as alpha equals 0.1 in the cats and dogs exercise. We need to be smart about how we choose the size of our steps, because if we choose steps that are too small we may take forever to get to the shallow, or maybe even be trapped in a local minimum, as we’ll see in a moment.
So if you take small steps, you may never get there, but if you take steps that are too large, you may skip back and forth, never settling into the shallow. So the learning rate is one hyperparameter that we can set, and that we need to think about. Obviously these functions are not very simple. The ones shown here are even the simple ones, since they are represented in three dimensions. On the one on the right, for example, you can see that there are shallows all across the function. Some of the shallows are local minima, and then there’s one, marked in red, that is the absolute minimum.
We would of course like to get to the absolute minimum loss, but there is a possibility that we will get stuck in a local minimum and not be able to get out of there with the learning rate that we have. Again, if the learning rate is too small, we run the danger of getting stuck in a local minimum. If the learning rate is too large, we’re just going to skip around the space and never settle into the appropriate minimum. This is not easy.
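The effect of the learning rate can be tried out on toy loss functions of a single weight. Both functions below are invented for this sketch: a simple bowl with its minimum at w = 3, and a bumpy curve with one local and one absolute minimum. Real loss surfaces have many more dimensions, but the behavior of the steps is the same in spirit.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = start
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


# Toy loss 1: a bowl, loss(w) = (w - 3)^2, with its only minimum at w = 3.
def bowl_gradient(w):
    return 2 * (w - 3)


good = gradient_descent(bowl_gradient, 0.0, learning_rate=0.1, steps=100)
slow = gradient_descent(bowl_gradient, 0.0, learning_rate=0.0001, steps=100)
bouncing = gradient_descent(bowl_gradient, 0.0, learning_rate=1.0, steps=100)

print("learning rate 0.1:    w =", good)      # settles very close to 3
print("learning rate 0.0001: w =", slow)      # has barely left the start
print("learning rate 1.0:    w =", bouncing)  # keeps jumping between 0 and 6


# Toy loss 2: a bumpy curve, loss(w) = w^4 - 3w^2 + w, with a local minimum
# near w = 1.13 and the absolute minimum near w = -1.30.
def bumpy_loss(w):
    return w**4 - 3 * w**2 + w


def bumpy_gradient(w):
    return 4 * w**3 - 6 * w + 1


stuck = gradient_descent(bumpy_gradient, 2.0, learning_rate=0.01, steps=500)
print("bumpy curve: w =", stuck, "loss =", bumpy_loss(stuck))
print("loss at the absolute minimum is about", bumpy_loss(-1.30))
```

Starting from w = 2 with small steps, the descent on the bumpy curve settles into the nearby local minimum and never sees the deeper one on the other side.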
This is a visualization for a neural network that has a hundred and ten hidden layers, so the function that we want to minimize has a lot of features and a very high dimensionality. This is a reduction from all those dimensions down to three dimensions, and as you can see, it does have one absolute shallow in the blue part, but it also has many local minima. For example, around here there’s a local minimum, and around here there’s another local minimum, and there’s always the danger that we could get stuck in there, never able to reach the absolute minimum. So it’s difficult to train these algorithms.
In summary, neural networks have neurons that are arranged in layers. Each of them takes a feature and then the different layers, be they input, output, hidden, are connected through weights. We take the value of a feature, we multiply it by the weight, and transmit it to the next neuron, until we reach the output.
Once we reach the output, we calculate how much error we have, which is going to be our loss, and then we use that loss in back propagation to calculate how much correction we’re going to need for each of the weights. We adjust the weights using our data, our training set, and we go through several epochs trying to get less and less error as we go along, less and less loss. This method explores the loss function using something called gradient descent.
The speed of the descent is determined by our learning rate. If it’s too small, we’ll get stuck in local minima; if it’s too large, we’ll skip around erratically. As a practical exercise, please go to this webpage here. It’s a beautiful playground for neural networks, the TensorFlow Playground, and you have here several nonlinear functions. Well, this one can be divided linearly. Here are a few that cannot. You can use different learning rates, different activation functions, different numbers of hidden layers, and different numbers of neurons in the layers, and you can also use different types of features. You can see that X1 is here and X2 is here, and then there is the square of X1, the square of X2, the multiplication of X1 and X2, and so forth.
Let’s see. Also, here, let’s have no hidden layers. This is the output. Play. Ooh, as you can see, this is very easy to separate with no hidden layers. I just have an input and an output, with different weights, and here it converged a long time ago. Let me try that again. Within 44 epochs, I already have a loss of only 0.003, and as you can see I can very clearly separate the blue dots from the orange dots. Here are the weights from the input layer to the output layer. So this is something that can be very easily separated with just a line, but how about something like this?
We’re going to try again, and here you can see that without hidden layers, the loss is always going to be 0.5, and no matter how many epochs we run, this is not going to get any better. We’re going to have to add, for example, hidden layers; maybe they will help, maybe they won’t. Look at how many epochs it’s taking for us to go down, down, down. It’s finally getting there, but even then, the function that it has is not radial, it’s a line. And notice that it still has a lot of errors in there.
So I assume it will get to a plateau, and then it will not make a lot of progress from there. As you can see, here are the weights and the final classifications. So try to play with each of the four functions that we have here, and explore the effect of adding layers and adding neurons. This is one of the most difficult parts of neural networks, trying to play with the hyperparameters. Give this a try.