# Mathematical Foundations of Deep Learning

In the last article you studied the basics of tensors, and we also looked at tensors that we often encounter in practice. In this session we will focus on key operations on tensors in the context of deep learning. In a neural network we specify a layer with tf.keras.layers.Dense; here we use 128 units in a hidden layer and relu as the activation function.

The layer takes a 2D tensor as input and returns another 2D tensor as output. The core operation of the layer is as follows: it computes a dot product between the weight tensor and the input and adds a bias vector to it. This gives a linear combination, which is then passed through a non-linear activation such as relu or sigmoid. So there are three distinct operations here: the linear combination of the inputs and the corresponding weights, represented by dot; the addition of the bias; and finally the non-linear activation applied to the result.

Let us take a concrete example. We define a toy neural network with two inputs, say feature x1 and feature x2. Then we have a layer with two units. This layer is a dense layer, so each unit receives input from all the units of the previous layer.

Let us say w denotes the parameters of the layer. The first unit has weights 1 and 0.5, and each unit has a bias term: the bias is -2 for the first unit and 0.5 for the second unit. We can represent the weights of these units as a tensor: the first unit has weights 1 and 0.5 and the second unit has weights 0.5 and 1.

We put these in a tensor. This is a 2 by 2 matrix, or equivalently a 2D tensor with shape 2 by 2; this is the weight tensor. We also have the input tensor. What does the input tensor look like? A single input has two components, x1 and x2, so it is a vector, and when we combine multiple examples it becomes a 2D tensor.

For instance, to represent two examples we have a tensor which is again a 2D tensor with shape 2 by 2. In addition, we have a bias vector: there is one bias for each hidden unit in the layer. The bias is a scalar quantity for an individual unit, and when we combine all the biases we get a vector. In this case we have -2 and 0.5; this is the bias vector, which is a 1D tensor with shape 2.

So, let us say we have w, the 2 by 2 weight tensor described above, and the input, which is (1, 2) and (-1, 2). Let us look at the shape of each of these tensors and the number of dimensions. You can see that w is a 2D tensor with shape 2 by 2. The input is also a 2D tensor with shape 2 by 2; here there are two examples, each with two features. The bias is a 1D tensor with shape 2. We now apply a linear combination.

That means we perform a dot product between the weight tensor and the input and add the bias to it. Let us look at the shape of each of the tensors involved here, which are w, input and b, and also at the shape of the resulting tensor z; we will also print the value of z. You can see that the weight and input tensors are 2D tensors with shape 2 by 2 and the bias is a vector with shape 2. We get z as a 2D tensor with shape 2 by 2.

So we essentially get two outputs for each example: one is z coming from the first unit and the other is z coming from the second unit. For every example that we input we get z1 and z2. Here there are two examples, so we get a 2D tensor with shape 2 by 2.
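The linear combination described above can be sketched in NumPy. This is a minimal sketch rather than the lecture's own notebook: it assumes the weights (1, 0.5) and (0.5, 1) with one row per unit, the biases -2 and 0.5 given unit by unit, the two examples (1, 2) and (-1, 2) with one example per row, and the variable names w, inputs, b and z.

```python
import numpy as np

# Weight tensor of the toy layer: one row per hidden unit (assumed layout).
w = np.array([[1.0, 0.5],
              [0.5, 1.0]])

# Two examples, one per row, each with the features (x1, x2).
inputs = np.array([[1.0, 2.0],
                   [-1.0, 2.0]])

# One bias per hidden unit (assumed values: -2 and 0.5).
b = np.array([-2.0, 0.5])

# Linear combination: dot product with the weights, then add the bias.
# The bias vector is broadcast across the rows (examples).
z = np.dot(inputs, w.T) + b

print("w:", w.shape, "ndim =", w.ndim)
print("inputs:", inputs.shape, "ndim =", inputs.ndim)
print("b:", b.shape, "ndim =", b.ndim)
print("z:", z.shape, "ndim =", z.ndim)
print(z)
```

Each row of z holds z1 and z2 for one example, which is why z has the same 2 by 2 shape as the input here.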

Now let us apply relu to the output of the linear combination. Relu essentially takes the max of 0 and z. Let us print the shape of the output and the output itself. You can see that the output is also a 2D tensor with shape 2 by 2, and we get 0 wherever z is negative, because relu outputs 0 for negative inputs and returns positive numbers as they are. Let us look at what the relu activation function looks like.
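As a small illustration of relu as a max between 0 and z, the sketch below applies it to an example tensor. The values in z are assumed for illustration (they are what the toy layer would produce with weights (1, 0.5), (0.5, 1) and biases -2, 0.5), and the helper name relu is our own.

```python
import numpy as np

def relu(z):
    """Element-wise max between 0 and z."""
    return np.maximum(0.0, z)

# Illustrative pre-activation values: one row per example, one column per unit.
z = np.array([[0.0, 3.0],
              [-2.0, 2.0]])

a = relu(z)

print(a.shape)  # same shape as z
print(a)        # negative entries become 0, positive entries are unchanged
```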

Here we calculate the value of relu at a hundred sample points in the range -10 to +10 and plot it. We have the value x on the x axis and relu on the y axis. You can see that relu outputs 0 for negative numbers and outputs a positive number as it is, so we have a 45 degree line after 0.

In the same manner we visualize the sigmoid activation function by sampling a hundred points between -10 and +10 and calculating the sigmoid function at each of these points. Sigmoid squashes the input into the range 0 to 1: as we move away from zero in the positive direction sigmoid tends to 1, and as we move to the left of 0 in the negative direction sigmoid gets closer and closer to 0.

Now, note that relu and tensor addition are both element-wise operations and can therefore be parallelized. You will find vectorized implementations of these operations in deep neural networks, which exploit this parallelism to speed up the computations.
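To make the element-wise nature of these activations concrete, here is a hedged sketch that evaluates relu and sigmoid on a hundred points between -10 and +10, once with vectorized NumPy calls and once with an explicit Python loop. The function names and the loop-based comparison are illustrative choices, not part of the lecture's notebook.

```python
import math
import numpy as np

def relu(x):
    """Vectorized relu: element-wise max between 0 and x."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Vectorized sigmoid: squashes every element into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# A hundred sample points in the range -10 to +10.
x = np.linspace(-10, 10, 100)

# Vectorized versions: one call handles all elements.
relu_vec = relu(x)
sigmoid_vec = sigmoid(x)

# Loop versions: the same computation, one element at a time.
relu_loop = np.array([max(0.0, v) for v in x])
sigmoid_loop = np.array([1.0 / (1.0 + math.exp(-v)) for v in x])

print(np.allclose(relu_vec, relu_loop))        # both give the same values
print(np.allclose(sigmoid_vec, sigmoid_loop))
print(sigmoid_vec.min(), sigmoid_vec.max())    # stays between 0 and 1
```

Because each output element depends only on the matching input element, the work can be split across elements, which is what vectorized implementations take advantage of.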

Now imagine a situation where we have to add two tensors which are not compatible in terms of their shapes; this is where broadcasting helps us. In broadcasting we perform two steps so that the two tensors become compatible in their shapes. We first add axes to the smaller tensor to match the number of dimensions of the larger tensor; these axes are called broadcast axes.

The smaller tensor is then repeated along these new axes to match the full shape of the larger tensor. Let us take a concrete example: say we have two tensors x_1 and x_2. x_1 has a shape of 32 by 10, whereas x_2 has a shape of 10, so the first tensor is a matrix and the second tensor is a vector. We first expand the dimensions of the second tensor along the zeroth axis.

This makes sure that both tensors have the same number of axes. After that we repeat the x_2 tensor 32 times along the zeroth axis, so that the shape of x_2 becomes fully compatible with the shape of x_1. Let us look at the shapes of the tensors involved here: the shape of x_1 is 32 by 10 and the original shape of x_2 was 10.

So x_1 was a 2D tensor and x_2 was a 1D tensor. After adding the broadcast axis we get a 2D tensor with shape 1 by 10, and then we repeat x_2 along the broadcast axis to get a shape of 32 by 10. So essentially we first add a broadcast axis and then copy x_2 along that axis. In an actual implementation the entries are not physically repeated; the algorithms perform these operations virtually.
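The two broadcasting steps can be sketched by hand in NumPy and compared with NumPy's built-in broadcasting. The random contents of x_1 and x_2 and the variable names below are assumptions for illustration; only the shapes come from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
x_1 = rng.random((32, 10))   # 2D tensor with shape 32 by 10
x_2 = rng.random((10,))      # 1D tensor with shape 10

# Step 1: add a broadcast axis at position 0, giving shape 1 by 10.
x_2_expanded = np.expand_dims(x_2, axis=0)

# Step 2: repeat x_2 32 times along the broadcast axis, giving shape 32 by 10.
x_2_repeated = np.repeat(x_2_expanded, 32, axis=0)

manual_sum = x_1 + x_2_repeated

# NumPy performs the same two steps automatically when shapes are compatible.
automatic_sum = x_1 + x_2

# The repetition is virtual: broadcast_to returns a view whose stride
# along the broadcast axis is 0, so no entries are physically copied.
x_2_view = np.broadcast_to(x_2, x_1.shape)

print(x_2.shape, x_2_expanded.shape, x_2_repeated.shape)
print(automatic_sum.shape)
print(np.array_equal(manual_sum, automatic_sum))
print(x_2_view.strides)
```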

NumPy implements broadcasting, so if we try to add a 2D tensor and a 1D tensor NumPy automatically does the broadcasting, and we can see that the shape of the output is the same as the shape of the 2D tensor. Let us look at the next tensor operation, which is reshaping. Reshaping is used to rearrange the rows and columns of a tensor to match a target shape.

The reshaped tensor has the same number of elements as the initial tensor. Reshaping is mainly used in data preprocessing. Let us take a concrete example of a tensor. The shape of the tensor is 3 by 2, and we are going to reshape it into a tensor of shape 6 by 1.

Later we reshape this tensor to a target tensor of shape 2 by 3. Here the first three values go into the first row and the next three values go into the next row. This is how we get a (2, 3) 2D tensor. Transposition is often described as a special case of reshaping: in a transpose rows become columns and columns become rows. np.transpose computes the transpose of a matrix; the original shape of the matrix was 300 by 20, and after the transpose operation it became 20 by 300.
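The reshaping and transposition steps can be tried out as follows. The element values of the 3 by 2 tensor and the all-zero 300 by 20 matrix are assumed for illustration; only the shapes come from the example above.

```python
import numpy as np

# A 3 by 2 tensor with easily recognizable values.
x = np.array([[0, 1],
              [2, 3],
              [4, 5]])

# Reshape to 6 by 1: same six elements, read row by row.
column = x.reshape((6, 1))

# Reshape to 2 by 3: the first three values fill the first row,
# the next three values fill the second row.
y = column.reshape((2, 3))

print(column.shape, y.shape)
print(y)

# Transpose: rows become columns and columns become rows.
m = np.zeros((300, 20))
m_t = np.transpose(m)
print(m.shape, m_t.shape)

# Transposing the 3 by 2 tensor also gives a 2 by 3 tensor,
# but with the elements in a different order than reshape((2, 3)).
print(x.T)
```

Comparing y with x.T shows that a plain reshape keeps the row-by-row order of the elements, while a transpose swaps the roles of rows and columns.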

Having studied tensors and the key operations on them, let us move on to understand how neural network training is performed. We will first formulate the problem of training a neural network. Let us take a toy neural network with two inputs, one hidden layer with two units and an output layer with a single unit. The inputs are x1 and x2, and the two hidden units also have bias terms. The problem of training here is to estimate the weights of each of the units in the neural network. The first hidden unit has three weights: one for each of its two input connections and one for the bias. In the same manner there are three weights corresponding to the second hidden unit.

The output unit also has three weights: two corresponding to the inputs coming from the previous layer and one bias term. So the problem of training is: given the training data, estimate the parameters of this neural network model. How many parameters are there? Counting them, 1, 2, 3, 4, 5, 6, 7, 8 and 9, there are nine parameters in total.
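The parameter count can be checked with a few lines of Python. This is a small sketch with a hypothetical helper, count_dense_parameters, which assumes fully connected layers where every unit has one weight per incoming connection plus one bias.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, starting with the
    number of inputs, e.g. [2, 2, 1] for the toy network.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Each of the n_out units has n_in weights and one bias.
        total += (n_in + 1) * n_out
    return total

# Two inputs, one hidden layer with two units, one output unit.
n_params = count_dense_parameters([2, 2, 1])
print(n_params)  # 3 + 3 for the hidden units, 3 for the output unit
```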

In this particular network our job is to come up with values for these nine parameters such that the loss function is minimized. We studied in previous sessions that we define a loss function for each machine learning algorithm. In the case of a neural network, if you are solving a regression problem we use least squares as the loss function.

If you are solving a binary classification problem we use binary cross-entropy as the loss function, and we have to find the parameters such that the loss function is minimized. We use gradient descent and stochastic gradient descent as the basic techniques for solving this optimization problem, that is, for parameter estimation. There is a difference between the way we apply gradient descent in standard machine learning algorithms and in neural networks.
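The two loss functions named above can be written out in NumPy as a minimal sketch. The function names, the averaging over examples and the small clipping constant eps (used only to avoid taking the log of 0) are assumptions made for this illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Least squares loss for regression, averaged over the examples."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for predicted probabilities p_pred in (0, 1)."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Regression: only the last prediction is off, by 2.
mse = mean_squared_error(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))

# Classification: a model that always predicts probability 0.5.
bce = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5]))

print(mse)  # 4 / 3
print(bce)  # log(2)
```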

Let us try to understand that difference; it will help us appreciate the complexity of training neural networks. Alongside this neural network, let us define a regression problem involving two variables x1 and x2. In this regression problem we have two weights, one corresponding to x1 and the other corresponding to x2.

The weights are w1 and w2, and the problem is to identify these weights such that the least squares error between the prediction and the actual output is minimized. Remember that when we apply gradient descent we find the gradient of the loss function with respect to each of the parameters. In the case of linear regression we get the prediction directly at the output; say y hat is the prediction. If we know the actual y, we calculate the loss for that particular value of the parameters and compute the gradient of the loss with respect to both parameters.

In a neural network the situation is slightly more complicated. We get the value of y at the output layer, and that is where we can calculate the loss.
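For the linear regression case, where the loss can be differentiated directly with respect to w1 and w2, gradient descent can be sketched as below. The data, the true weights (1, 0.5), the learning rate and the number of steps are all made-up values for illustration.

```python
import numpy as np

# Four made-up examples with the two features x1 and x2.
X = np.array([[1.0, 2.0],
              [-1.0, 2.0],
              [0.5, -1.0],
              [2.0, 0.0]])

# Targets generated from assumed "true" weights, so we know the answer.
true_w = np.array([1.0, 0.5])
y = X @ true_w

w = np.zeros(2)        # start with w1 = w2 = 0
learning_rate = 0.1

for step in range(500):
    y_hat = X @ w                       # prediction
    error = y_hat - y
    loss = np.mean(error ** 2)          # least squares loss
    grad = 2.0 * X.T @ error / len(y)   # gradient w.r.t. w1 and w2
    w = w - learning_rate * grad        # gradient descent update

print(w)     # approaches the true weights
print(loss)  # approaches 0
```

Here the gradient with respect to every parameter is available in one step because the prediction depends directly on w1 and w2; there are no intermediate layers in between.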

Now we have to compute the derivative of this loss with respect to the parameters, but we do not know the loss at each layer or at each unit, and that makes it complicated. We get the loss at the final layer, and our challenge is to find out how much the individual units in the neural network contribute to that loss. In a neural network it is important to understand that the direction of prediction is called the forward pass.

In the forward pass we pass the values through the network, performing a linear combination followed by a non-linear activation repeatedly in each layer to get the prediction. Once we get a prediction we can compute the loss, and our job is to propagate this loss in the reverse direction. We calculate the gradient with respect to the connections that feed directly into the output layer and then apply the chain rule of derivatives successively to find the gradients at the intermediate layers. This is called back propagation.
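The forward pass and the chain-rule steps of back propagation can be sketched by hand for the toy network with two inputs, two hidden units and one output unit. Everything specific here is an assumption for illustration: the parameter values, the single training example, the relu hidden activation, the linear output unit and the squared error loss. One gradient is checked against a finite-difference estimate.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(params, x):
    """Forward pass: linear combination, relu, then a linear output unit."""
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1      # hidden pre-activations, shape (2,)
    a1 = relu(z1)         # hidden activations, shape (2,)
    y_hat = W2 @ a1 + b2  # prediction, shape (1,)
    return z1, a1, y_hat

def squared_error(params, x, y):
    _, _, y_hat = forward(params, x)
    return float(np.sum((y_hat - y) ** 2))

# Assumed parameters: 4 + 2 + 2 + 1 = 9 values in total.
W1 = np.array([[1.0, 0.5], [0.5, 1.0]])
b1 = np.array([-2.0, 0.5])
W2 = np.array([[1.0, -1.0]])
b2 = np.array([0.5])
params = (W1, b1, W2, b2)

x = np.array([2.0, 3.0])  # one assumed training example
y = 1.0                   # its assumed target

# Forward pass and loss at the output layer.
z1, a1, y_hat = forward(params, x)
loss = squared_error(params, x, y)

# Backward pass: start at the output and apply the chain rule layer by layer.
d_y_hat = 2.0 * (y_hat - y)      # dLoss/dy_hat
grad_W2 = np.outer(d_y_hat, a1)  # output-layer weights
grad_b2 = d_y_hat
d_a1 = W2.T @ d_y_hat            # gradient reaching the hidden activations
d_z1 = d_a1 * (z1 > 0)           # relu passes gradient only where z1 > 0
grad_W1 = np.outer(d_z1, x)      # hidden-layer weights
grad_b1 = d_z1

# Finite-difference check for one hidden-layer weight.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (squared_error((W1_plus, b1, W2, b2), x, y)
           - squared_error((W1_minus, b1, W2, b2), x, y)) / (2 * eps)

print(loss)
print(grad_W1)
print(numeric, grad_W1[0, 0])  # the two estimates should agree closely
```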

This is the direction in which gradients are propagated, and it is called the backward pass. We use back propagation to propagate the gradient of the loss function with respect to the parameters in the network. In modern deep learning packages like TensorFlow the gradient operation is already implemented using automatic differentiation (often referred to as symbolic differentiation).

So we do not really implement the back propagation algorithm by hand. Instead we call the gradient API, or gradient function, to calculate the gradient of the loss function with respect to each of the parameters. For that reason we do not focus a lot on back propagation in this course.

Neural networks use variations of stochastic gradient descent for faster optimization. One class of optimization methods focuses on applying an adaptive learning rate: instead of using the same learning rate throughout, as in classical gradient descent, they try to adapt the learning rate so that convergence can be attained faster. Adam and RMSProp are optimization algorithms in this class.

The other set of algorithms worth mentioning here are momentum-based algorithms. In deep learning the loss function is non-convex.

So there is a good chance that we will get stuck in one of the local minima and will not be able to come out of it if we apply classical algorithms. Here we use what are called momentum-based strategies, where we calculate the momentum at a point and use it to get out of the local minimum.

Think of a momentum-based algorithm in terms of a ball sliding down a slope. If the ball has enough velocity, or enough momentum, it will not get stuck at a local minimum; it will slide past that local minimum and move on towards the next minimum. So momentum-based strategies are used to tackle the problem of local minima that we often encounter in deep learning loss functions.
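The ball analogy corresponds to keeping a velocity term that accumulates past gradients. Below is a minimal sketch of one common form of the momentum update, applied to the simple function f(x) = x squared. The function, the starting point and the hyperparameter values are assumptions chosen only to show the update rule; this example does not by itself demonstrate escaping a local minimum.

```python
def grad(x):
    """Gradient of the illustrative function f(x) = x ** 2."""
    return 2.0 * x

x = 5.0              # assumed starting point
velocity = 0.0       # the "ball" starts at rest
learning_rate = 0.1  # assumed step size
momentum = 0.9       # fraction of the previous velocity that is kept

for step in range(500):
    # The velocity remembers earlier gradients instead of using only the current one.
    velocity = momentum * velocity - learning_rate * grad(x)
    x = x + velocity

print(x)  # ends up very close to the minimum at 0
```

With momentum set to 0 the update reduces to plain gradient descent, where each step depends only on the current gradient.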

We use some novel regularization strategies for neural networks apart from the L1 and L2 regularization that we apply in classical machine learning algorithms. One such technique is early stopping, where we apply the simple strategy of stopping the training early.

This helps us prevent overtraining of the model, which may lead to overfitting. We keep track of the training error and the validation error, and if the validation error has not improved after a few iterations we stop the training.
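The stopping rule can be sketched as a small function over a recorded list of validation errors. The helper name early_stopping_epoch, the patience parameter (how many non-improving epochs we tolerate) and the error values are hypothetical and chosen for illustration.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch index at which training would be stopped.

    Training stops once the validation error has failed to improve on
    its best value for `patience` consecutive epochs.
    """
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Never triggered: training runs to the last recorded epoch.
    return len(val_errors) - 1

# Made-up validation errors: they improve until epoch 3, then stall.
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56]
stop_epoch = early_stopping_epoch(val_errors, patience=3)
print(stop_epoch)  # 6: three epochs after the best error at epoch 3
```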

So this is possibly an automated way of deciding when to stop the training. Apart from early stopping there is another clever idea called dropout, which is popularly used as a regularization mechanism in neural networks. Let us understand the idea of dropout through a concrete example.

Let us say this is a toy neural network. In dropout we define a dropout rate, say 0.2 or 20 percent. Then 20 percent of the nodes in each layer where we define dropout are dropped during a training step, with a fresh random choice of nodes each time.

For example, let us define a dropout rate of 50 percent at this layer and see what happens. In one training step we look at each node and, say, flip a coin: if the coin turns up heads we retain the node, otherwise we drop it.

Let us say we decide to drop this node and this node. Let us say we also apply 50 percent dropout in this second layer. This could happen in one training step; in the next step we again randomly shut down 50 percent of the nodes in each of these layers.

In the second round it is possible that a different pair of nodes is dropped. You can see that every time we drop a node we are effectively cutting the connections to and from that node. This gives us a new neural network architecture each time, so we are training many different architectures in a single training run.

We apply dropout only during training. To compensate for it, at prediction time we multiply the activation of each unit where dropout is applied by the fraction of units that were kept, that is, one minus the dropout rate. In this case the activations coming out of this node will be scaled by 50 percent, because the dropout rate was 50 percent.
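Dropout as described here can be sketched in NumPy: a random mask during training and a scaling of the activations at prediction time. The activation values, the random seed and the variable names are assumptions. Many libraries use an equivalent variant, often called inverted dropout, that rescales the kept activations during training instead of scaling at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)

dropout_rate = 0.5
keep_prob = 1.0 - dropout_rate

# Made-up activations of a layer with four units for two examples.
activations = np.array([[0.2, 1.5, 0.7, 3.0],
                        [1.0, 0.4, 2.2, 0.9]])

# Training: the "coin flip" per unit; True means the unit is retained.
mask = rng.random(activations.shape) >= dropout_rate
train_output = activations * mask   # dropped units output 0

# Prediction: no units are dropped; activations are scaled by the
# fraction of units that were kept during training.
predict_output = activations * keep_prob

print(mask)
print(train_output)
print(predict_output)
```

A new mask is drawn every time the training code runs, so each step effectively trains a different thinned version of the network.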

This is how we apply dropout as a regularization strategy in neural networks. When we face the problem of overfitting in a neural network, we either try to get more examples or apply one of these regularization strategies. Whenever we are faced with an underfitting problem we can increase the complexity of the model, for example by adding more layers to the network.

In this session we looked at the mathematical foundations of deep learning. We studied the basics of tensors, the tensors that we encounter in practice, the key tensor operations in deep learning, and some of the basics of training and regularization in deep learning. With this session you are now equipped with a basic understanding of the machine learning flow. From the next session onwards we will start diving deeper into the practical, implementation aspects of the machine learning pipeline with TensorFlow.