In this practical, we see all the pieces from practicals 2.0 and 2.1 coming together and playing nicely in Torch, using the nn package. We start by reviewing the Jacobian and Hessian formulations of partial derivatives, which is essential to understand how partial derivatives are computed in Torch. We will then create a logistic unit with the nn package and perform the forward pass step by step, in order to apply the knowledge we acquired in the previous practicals. We will start with nn.Linear, which requires an input dimensionality and an output dimensionality. We will learn how to use the curly-brackets operator, see how to access the weight and the bias of a linear layer, and its own partial derivatives of the cost function with respect to its weights and bias, and see how to zero these partial derivatives. Then we will introduce the sigmoid non-linearity and see how to use the forward method of a module; Linear and Sigmoid are both modules, so by writing nn.Module I refer to both of them.

Then we will also implement the back-propagation pass step by step, again to verify our understanding of the underlying algorithm. We will start by defining a loss function with nn.MSECriterion, the mean squared error. We will look at the sizeAverage attribute of the criterion and how to change its value. Moreover, we will see that the forward method of a criterion expects an input and a target, whereas for a module it expects just a single input. We will see how to compute the gradient of the error with respect to the output of the network using updateGradInput, and then how to back-propagate this gradient through all the modules of our network, which also requires the current input to the specific module and the gradient at the output of that module. We will also see how to accumulate the parameter gradients, that is, how to compute the partial derivative of the error function with respect to the parameters of a given layer; in this specific case, the linear is the only module which has parameters.

Then we will introduce nn.Sequential, which allows us to perform the forward and backward propagation with just two simple commands. We will see how to add modules to our Sequential container and how to forward an input through all the modules within it. Then we will forward the output of the Sequential into our loss function, to which we provide the input, which is the prediction of the network, and the expected target. Then we perform the backward step, which for a criterion is most of the time the same as updateGradInput; in general the backward step performs both updateGradInput and accGradParameters. Usually criteria don't have trainable parameters, but if they do, backward handles that as well. Then we see how to get a specific layer out of a Sequential with the get method, how to zero the partial derivatives of the error with respect to the parameters of the model with zeroGradParameters, and how to back-propagate the gradient of the criterion through the network with the backward method, to which we have to provide the input to the network and the gradient from the criterion. Finally, we can update our parameters based on the learning rate, eta. And then we will train a generic neural network using stochastic gradient descent and mini-batch gradient descent.
We'll do so by using the forward, backward and accGradParameters methods together with the zeroGradParameters function we just illustrated. We can achieve the same by using nn.StochasticGradient instead, to which we simply provide the network and the loss function; then all we actually need is to ask the stochastic-gradient trainer to train our network with that criterion on the data set we provide. As a last point, we will see how to perform a regression task with a three-layer neural network: one input, a hidden layer of three neurons and just one output neuron. We will see this one on GitHub, under my username, in the Machine-Learning-with-Torch repository.

So far we have used the Jacobian formulation. This means that when we take the partial derivative of a scalar y with respect to a vector x, the result is a one-row matrix, where each component is simply ∂y/∂x_1 up to the last one, ∂y/∂x_n. If instead we take the partial derivative of a vector y with respect to a scalar x, so there is only a scalar in the denominator, the result is the column vector of the derivatives of the components: ∂y_1/∂x down to the last one, ∂y_m/∂x. Finally, if we have a combination of the two, the partial derivative of a vector y with respect to a vector x, we get the Jacobian: the first column goes from ∂y_1/∂x_1 down to ∂y_m/∂x_1, the first row ends with ∂y_1/∂x_n, and the bottom-right element is ∂y_m/∂x_n. This is the Jacobian of y.

The last case is a scalar differentiated with respect to a matrix X. Since we use the Jacobian formulation, if we say that X belongs to R^(m×n), then the partial derivative with respect to X has dimensionality n×m: the dimensionality is the transpose of what sits in the denominator. It is just like the first case, where a column vector x in the denominator gave us a row of derivatives; here we have a matrix of dimension m×n in the denominator, and therefore the partial derivative with respect to that matrix is the transpose in terms of dimensionality, n×m.

For this very same reason, let's now compute the dimensionality of the partial derivative of the error with respect to the parameter matrix Θ^(l) of layer l. The activation â^(l), with the bias unit, has dimension s_l + 1. On the other side, the error δ^(l+1) of layer l+1 has dimension s_(l+1), so its transpose is 1 × s_(l+1). Therefore the product â^(l) (δ^(l+1))^T has s_l + 1 rows and s_(l+1) columns. And we know that the matrix Θ^(l) maps from s_l + 1 inputs to s_(l+1) outputs, so it has dimension s_(l+1) × (s_l + 1). We can see that the partial derivative with respect to a matrix of size s_(l+1) × (s_l + bias) is indeed the transpose in terms of dimensionality: (s_l + bias) × s_(l+1).

In Torch, we use instead the Hessian formulation: in this case ∂y/∂X, where X belongs to R^(m×n), also belongs to R^(m×n).
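To keep the two conventions apart, here is a compact summary in LaTeX of the layouts just described; the notation is mine, as the lecture only states this verbally:

```latex
% Numerator (Jacobian) layout -- what we used so far:
% the result is shaped like the *transpose* of the denominator.
\frac{\partial y}{\partial \mathbf{x}} =
  \left[ \tfrac{\partial y}{\partial x_1} \;\cdots\; \tfrac{\partial y}{\partial x_n} \right]
  \in \mathbb{R}^{1 \times n},
\qquad
\frac{\partial E}{\partial X} \in \mathbb{R}^{n \times m}
  \quad\text{for } X \in \mathbb{R}^{m \times n}.

% Denominator (Hessian) layout -- what Torch uses:
% the result has the *same* shape as the denominator.
\frac{\partial E}{\partial X} \in \mathbb{R}^{m \times n}
  \quad\text{for } X \in \mathbb{R}^{m \times n}.
```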
In this case, when we have to do the weight update, we have our matrix Θ and we can write the update as Θ ← Θ − η ∂E/∂Θ, with a learning rate η. We just saw that both terms here have the same dimensionality. In the Jacobian formulation, we would have needed a transposition on the second term; but as we said, in Torch we use the Hessian formulation, where all the results of the partial derivatives are basically the transpose of the Jacobian notation. The Jacobian convention is also called numerator layout, and the Hessian one is also called denominator layout; you can read more about this on Wikipedia under matrix calculus.

Let's run Torch. We would like to create our sigmoid unit. Let's require the package nn. Say our input has n = 5 components, and the final output has dimensionality K = 3. We can create the linear module, which computes the weighted input: it is an nn.Linear going from n to K. If we print it, we see it goes from 5 elements of input, which is the dimensionality of our x vector without counting the bias element x_0, to our output. So we go from x, the input of size 5, to the output h_Θ(x), which has size 3.

Let's see what's inside this Linear. It has a bunch of things inside; let's start with the ones that make the most sense to begin with. We have lin.weight, a 3-by-5 matrix, meaning it maps to a dimensionality of 3, which is the output, and takes 5 elements, which is the size of the input x. And we have lin.bias, which is basically the first column of our Θ matrix. So we can build theta1 as torch.cat of the first term, the bias, and the second term, the weight, concatenated along the second dimension, so across the columns. This actually creates a new tensor. If I display theta1, it is simply the bias vector as the first column followed by the rest of the weights.

Let's look at lin again. What else is in here? There is something called gradWeight; let's see what that is. It is all zeros. And lin.gradBias is all zeros as well. The first one, gradWeight, is basically the partial derivative of E with respect to the weight of this module, and gradBias is the partial derivative of E with respect to the bias. So in our case, gradTheta1 is simply torch.cat of lin.gradBias stacked in front of lin.gradWeight along the second dimension. And there you go: this is the matrix of partial derivatives of the error with respect to the parameters Θ^(1).

In case those gradients are not already zero, we can zero them with lin:zeroGradParameters(). If they were not zero before, now they are zero for sure. We will see later why we need to zero these parameters; it is connected to the way we train a network: if we use batch or mini-batch gradient descent, we have to accumulate the partial derivatives with respect to the parameters over several iterations, and this is the place where those partial derivatives are accumulated. Therefore we have to start from a clean accumulation point, so we want to zero them before starting. So the first part is done.
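As a quick recap of this first part, here is a minimal sketch of what we typed so far; the variable names (`lin`, `theta1`, `gradTheta1`) are mine, and the actual numbers will differ since the weights are randomly initialised:

```lua
require 'nn'

n, K = 5, 3                          -- input and output dimensionality
lin = nn.Linear(n, K)                -- weighted-input module: weight is K x n, bias is K

-- Theta^(1) as used in the lectures: bias as first column, then the weight matrix
theta1     = torch.cat(lin.bias,     lin.weight,     2)   -- K x (n + 1)
gradTheta1 = torch.cat(lin.gradBias, lin.gradWeight, 2)   -- dE/dTheta^(1), K x (n + 1)

print(theta1)
print(gradTheta1)                    -- all zeros right after construction

lin:zeroGradParameters()             -- clears gradWeight and gradBias (the accumulation buffers)
```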
Back to the contents of lin: we also have an output, which is empty at the moment because we haven't fed anything into the module, and a gradInput, which is the derivative of the error with respect to the input of this block. We don't have anything at the output and nothing at the input, so both are empty for now. The last field we haven't covered is the type, which is torch.DoubleTensor; this shows the tensor type the module is using at the moment.

Let's create the second module. After the linear, we need a sigmoid function, so we have sig = nn.Sigmoid(). If I print it, it says nn.Sigmoid. Let's see what's inside: again we have a gradInput, which is empty, a type that is torch.DoubleTensor by default, and an output that is also empty. Let's see what this sigmoid looks like. We can require gnuplot, then do, for example, z = torch.linspace(−10, 10, 21), and plot z against the sigmoid to which we forward z. And here we have the sigmoid: it goes to 0 basically for numbers lower than −5, it hits one half at x = 0, and for x greater than 5 it approaches 1.

All right, let's go on and start now with the forward pass; we can clear the screen. Forward pass. Let's have our input vector x, just a draw from a normal distribution, torch.randn, of size n. This is our x, and a1 = x by definition. Then h_Θ, the final hypothesis, the output of our network, is the sigmoid to which we forward the linear to which we forward our input x. Let's have a look: it has three components as expected, and those are the values, 0.36, 0.55, 0.29.

Let's try to reproduce these values by ourselves, so we actually understand what's going on. We have z2 equal to theta1, the concatenation of the bias column with the weight matrix, as per the definition we saw in the previous lessons, multiplied by our input with a 1 prepended on top, our â_1. So that is torch.cat of torch.ones(1), just a single one, and a1, concatenated along the first dimension. z2 is the weighted input. Then we have to apply a sigmoid to z2; let's do this by hand. We take a2 equal to z2, which we clone (if I don't clone, I'm going to overwrite its values, and I just want to preserve whatever was there), and then apply a function of the scalar z which simply returns 1 divided by 1 plus math.exp of −z. We close the function and the parenthesis, and let's see our output a2. And there you go: 0.3613, 0.5510, and 0.2924. We have just seen that our network is actually computing what we expect, what we saw before in the theory.
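Before moving on to the loss, here is a minimal sketch of the forward pass we just did by hand, assuming the `lin`, `theta1` and `n` from the previous snippet; `sig`, `x`, `z2`, `a2` and `h_theta` are names I picked for this illustration:

```lua
sig = nn.Sigmoid()

-- forward pass with the modules
x       = torch.randn(n)                  -- a1 = x
h_theta = sig:forward(lin:forward(x))     -- network output, size K

-- the same computation by hand
a1_hat = torch.cat(torch.ones(1), x, 1)   -- prepend the bias unit: [1; x]
z2     = theta1 * a1_hat                  -- weighted input
a2     = z2:clone():apply(function(z) return 1 / (1 + math.exp(-z)) end)

print(h_theta)
print(a2)                                 -- same values as h_theta
```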
Let's now perform the backward pass; this is a bit more elaborate. So: backward pass, or back-propagation. We have to define a loss function first, which in this case is going to be the mean squared error criterion. If we print loss, it tells us nn.MSECriterion, and if we would like to know what it does, we can press the question mark. The description says it creates a criterion that measures the mean squared error between the n elements of the input x and the target y. So loss(x, y), where x is the input and y are the targets, or labels, is equal to 1/n, where n is the dimensionality of the vectors x and y, times the summation of the squared differences. We can also see that the division by n can be avoided if one sets the internal variable sizeAverage to false. That's what we are going to do, because in our previous lectures we haven't performed any size averaging. Let's print the content of loss: it contains a gradInput, a sizeAverage flag that is true by default, and an output that is 0 to begin with. So we set loss.sizeAverage = false, because we don't want that division, and we can check whether it worked. Yes, it did.

Let's compute now the error, based on our targets. We have to define our targets: y is going to be torch.rand, not randn, because the output of the network lies between 0 and 1 since we applied a sigmoid non-linearity; so rand of size K. This is our y. Let's look at the API of the loss function, the criterion: if we call forward, we need an input and then our target or label. So e, the error for this specific sample, the only sample we have here, is loss:forward, to which we send the output of the network, which we called h_Θ, and the target y. And here we have e, which is 0.35. Let's try to compute this by ourselves: as we saw in the description, the criterion takes the difference between h_Θ and the target y, raises it to the power of 2, and sums everything together. There is no one-half factor as in the lesson before, so we have to keep in mind this missing one half, which of course shows up again when we take the derivative of the square: we get a factor of 2. And here we go, we get exactly the same result. We are still performing a forward, but we are forwarding through the criterion to compute the error, which is required in order to perform the back-propagation steps.

Let's compute now the partial derivative of the error with respect to the output of our network. This is simply loss:updateGradInput, to which we send h_Θ, the input, and the target y; it has the same API as forward, input and target. If I print dE/dh then, as we said before, since we are using the Hessian notation, it has the same dimensionality as h_Θ. Let's compute this by hand so we can see whether it is correct: if we differentiate the expression above, we get two times the difference between h_Θ and y. And there we go, we get the same numbers: −0.37, 0.19, and −1.10. So far, so good: what we are seeing in Torch is exactly what we saw before in the theoretical slides.
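A minimal sketch of the criterion part, reusing `h_theta` and `K` from above; `y`, `E` and `dE_dh` are names I chose for this illustration:

```lua
loss = nn.MSECriterion()
loss.sizeAverage = false             -- plain sum of squared differences, no division by n

y = torch.rand(K)                    -- targets in (0, 1), matching the sigmoid output
E = loss:forward(h_theta, y)         -- criterion forward: takes input AND target
print(E, (h_theta - y):pow(2):sum()) -- same number, no 1/2 factor

-- dE/dh: gradient of the error w.r.t. the network output
dE_dh = loss:updateGradInput(h_theta, y)
print(dE_dh)
print((h_theta - y) * 2)             -- same values: the factor of 2 from the missing 1/2
```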
Let's compute now the error at the output. delta2 is the sigmoid, to which we say updateGradInput: we send in the input, which is z2 (we always have to pass the input), and the derivative found at the output of this block, which is actually the grad input of the criterion. And here we have delta2, which of course has 3 elements. Let's verify this is correct: we start from dE/dh, clone it, and multiply it element-wise by a2 and by (1 − a2), the derivative of the sigmoid. And there we go; we have just applied formula number 3.

Now we can compute the partial derivative of the error with respect to the parameters of the linear module. To do so, in Torch we call lin:accGradParameters, to which we provide the input to the module, which is x, and the grad input of the following module, or equivalently the grad output of the current module, which in this case is delta2. Then we can check what this has done; it looks like it hasn't returned anything, but inside lin we have gradBias and gradWeight. Recall that before they were zeros: if I scroll back up, they were zero. If we print gradTheta1 now, there we go: these are the partial derivatives of the error with respect to the matrix Θ, which is simply the concatenation of the first column, gradBias, and the rest of the matrix, gradWeight.

Let's verify that this computation is also correct. We can do so by performing a column vector times row vector multiplication, so that we get the partial derivative of the error with respect to the parameters of the linear module. Let's write it down so it's easier to explain: we take delta2, which I specify I'd like to view as a column vector, so as many rows as needed and just one column, and multiply it by the concatenation of a single 1 and our x along the only dimension, which I'd like to view instead as one row and as many columns as needed. There we go, we get the same result. Nice.

Let's go one step further and compute the partial derivative of the error function with respect to the input of this module, which is actually the global input. It's not useful for training, but it may have other uses; for example, if we had multiple layers, we would need that value in order to compute the deltas at the previous layers. Let's just call this linGradInput, and not delta1, because in delta1 we would also take into account the non-linearity, and in this case there is no non-linearity at all; it is simply the partial derivative of the error function with respect to the input of the linear module. This is equal to lin:updateGradInput, to which I provide the input and the partial derivative at the output, the grad output. Let's see: it has the same dimensionality as the input, which was 5; this is again because we are using the Hessian notation, or denominator layout. Let's verify these numbers are also correct: as we recall from equation 5, which we also saw earlier today, we can compute this as the matrix lin.weight transposed, multiplied by delta2. And there we go, we have just verified it: all the computations are correct and we obtain consistent results, which is pretty nice.

We have entered quite a lot of commands so far, and it may look rather daunting to follow everything that has been done. This is because we haven't actually used the nn package properly yet. The nn package also provides other amenities which make the whole training a breeze; in particular, the forward and back-propagation can be run in just two lines of code.
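To put the manual backward pass in one place, here is a minimal sketch under the same assumptions as the previous snippets (reusing `sig`, `lin`, `x`, `z2`, `a2` and `dE_dh`; the variable names are mine):

```lua
-- delta2: back-propagate dE/dh through the sigmoid
delta2 = sig:updateGradInput(z2, dE_dh)
sigPrime = a2:clone():apply(function(v) return v * (1 - v) end)   -- a2 .* (1 - a2)
print(torch.cmul(dE_dh, sigPrime))                                -- same values, formula (3)

-- dE/dTheta^(1): accumulate the parameter gradients of the linear module
lin:accGradParameters(x, delta2)
gradTheta1 = torch.cat(lin.gradBias, lin.gradWeight, 2)
print(gradTheta1)
print(delta2:view(-1, 1) * torch.cat(torch.ones(1), x, 1):view(1, -1))  -- outer product, same values

-- dE/dx: gradient w.r.t. the input of the linear module (equation (5), no non-linearity here)
linGradInput = lin:updateGradInput(x, delta2)
print(linGradInput)
print(lin.weight:t() * delta2)                                    -- same values
```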
Here we have typed out all the steps just in order to verify their correctness. Now we can see instead how to write all of this in a much more compact way, which of course hides some of the intuition and some of the gotchas that actually strengthen our understanding of the algorithm and of how the package works.

Let's define a network as a container called Sequential. A Sequential simply allows us to chain a series of blocks one after the other, where all the forward steps are performed automatically: the output of the first block is sent to the second block, then to the third, the fourth, and so on, and when we perform the grad-input steps, all the grad inputs flow from the last block back to the first. So we can net:add our linear module, net:add our sigmoid module, and then print the network: it shows us a simple Sequential with an input that is sent to the first module, which is sent to the second module, which is basically the output of the network.

To perform the whole forward pass we can simply write pred, the prediction of the network; this is basically the h_Θ we defined before, but we call it prediction now so we don't overwrite the values we computed earlier. It is simply net:forward of our input x. Now we can see that pred is this, and if we check h_Θ it's exactly the same, which is perfect: with one line we can compute the output of the network. We can compute the error: err equals loss, to which we forward our prediction and the correct label y; this error equals the previous e, so far so good. And now the cool part: we can compute gradCriterion = loss:backward, to which we send again pred and y, and compare gradCriterion with dE/dh. Before proceeding with the backward step of the network, we have to clear out the grad bias and grad weight. Now that we have a network, I can call net:get(1) and get the first module within the network. So if I'd like to see the current derivatives of the error with respect to the weights, I can print net:get(1).gradBias concatenated with net:get(1).gradWeight along the second dimension: these are the numbers we got before, and we have to zero them, otherwise the new gradients will be summed onto them.
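The container version of the forward part, as a sketch (again reusing `lin`, `sig`, `x`, `y` and `loss` from above; `net` and `pred` are my names):

```lua
net = nn.Sequential()
net:add(lin)                            -- reuse the modules defined above
net:add(sig)
print(net)

pred = net:forward(x)                   -- whole forward pass in one line; equals h_theta
err  = loss:forward(pred, y)            -- same error as before
gradCriterion = loss:backward(pred, y)  -- dE/dh, same as loss:updateGradInput(pred, y) here

-- parameter gradients of the first (linear) module, before zeroing them
print(torch.cat(net:get(1).gradBias, net:get(1).gradWeight, 2))
```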
We can do this simply by typing net:zeroGradParameters(); if we call the same function again, we get all zeros, which is lovely. Then we call net:backward, to which we pass our input x and the gradient from the criterion, and this performs a bunch of things. Let's look at the last thing it returned: we have −0.0958, and this is exactly the grad input we computed before; you can see −0.0958, 0.1339, 0.0826, and if we scroll down we have the same numbers here. So when we perform a backward step, it returns the input gradient of the current network module, whatever we are using. Let's print the parameter gradients again and compare the numbers: we have −0.0860, −0.0615 on the second line, 0.0461 on the third line, −0.02284... perfect. So we have computed the gradient of the error with respect to all the parameters of the linear module, which are the only parameters in this case, with just one instruction: net:backward, to which we send x, the input of the network, and the gradient of the error with respect to the output of the network, h_Θ.

Let's see now how the parameters are updated. To update the parameters, I can define a learning rate eta = 0.01, for example, and then tell the network to update its own parameters based on the learning rate eta. Bam: the network has updated its parameters based on the current parameters and the gradient of the error with respect to the parameters. Let's check that this is actually what happened. We have dE/dΘ^(1), which is the quantity just above, and then we compute theta1 minus eta times dE/dΘ^(1). Let's call the result thetaNew and print it: we get what we expected, the new parameters, where the first column is the one relative to the bias and the rest is the new matrix Θ^(1). This is exactly what Torch computed: if we now look inside the internals of lin, at lin.bias and lin.weight concatenated along the second dimension, we have the same thing. The new parameters have been computed as the previous full Θ matrix, the one which includes the bias vector, minus eta times dE/dΘ^(1), which is exactly the gradient descent step.
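And the backward pass plus the parameter update through the container, sketched with the same variables (`theta1` is the copy of the old parameters made earlier; `thetaNew` is my name):

```lua
net:zeroGradParameters()                   -- clear the accumulated dE/dTheta
gradInput = net:backward(x, gradCriterion) -- runs updateGradInput and accGradParameters on every module
print(gradInput)                           -- same values as lin's grad input computed by hand

eta = 0.01                                 -- learning rate
dE_dTheta1 = torch.cat(net:get(1).gradBias, net:get(1).gradWeight, 2)
thetaNew   = theta1 - dE_dTheta1 * eta     -- what the gradient descent step should produce

net:updateParameters(eta)                  -- Torch performs the same step internally
print(torch.cat(lin.bias, lin.weight, 2))  -- matches thetaNew
print(thetaNew)
```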
Finally, let's see how we can train a system with all we have learned so far. By now we are confident that Torch works and that the nn library works nicely: it really requires very few instructions to compute the forward and backward propagation, basically one instruction for forward and one for backward, plus zeroing the parameter gradients from time to time.

Let's look at an almost-working example of how to train the full system with a script. For training we have our X, which is going to be our design matrix, and our Y, which is going to be our labels or targets, a matrix or a vector as you like; the first one has size m × n and the second one has size m examples × K. (Control-U actually clears the line, which is handy.) We are going to have a for loop which goes from the first example, number 1, to the m-th example, and does the following: we compute our prediction, which is the network we have defined, forwarded with the i-th example of the design matrix; then a local error, which is loss:forward, to which we provide the prediction and the label of the i-th example; then the gradient of the loss, which is loss:backward of the same prediction and target. Then we have to zero the accumulated parameter gradients (we are talking here about stochastic gradient descent, so we zero them every time with zeroGradParameters); then net:backward, to which we provide the input, X[i], and the gradient of the loss; and the last one is net:updateParameters with the learning rate eta, because in purely stochastic gradient descent we perform an update every time we see a new example. This is the full working example for applying stochastic gradient descent, given that we have a design matrix X and a label matrix Y.

If instead we would like to perform mini-batch gradient descent, there is a little more code to write, but it performs better in terms of convergence and, if we use multi-dimensional input (which I'm not going to illustrate right now), in terms of speed as well; mini-batches provide specific advantages both computationally and optimisation-wise. Here i again goes from 1 to the number of examples we have, but jumps by batchSize: in this case, if we go from 1 to 1024 with a batch size of 128, we are going to have i equal to 1, 129, 257, and so on up to the last value. We start by zeroing the parameter gradients, and then, for j going from 0 to batchSize − 1, the index i + j goes from 1 to 128 the first time, then from 129 to 256, then from 257 onwards, and so on. If we exceed the index m, so if i + j is greater than m, we don't do anything; otherwise we do the same thing as before: prediction equals net:forward, then the local error as before, then the local gradient of the loss, then net:backward. We close the inner loop, and when we finish processing the batch we update the parameters based on the accumulated gradients and the learning rate, and then we are done.
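Here is a sketch of both hand-written training loops; `X` (an m × n design matrix) and `Y` (an m × K target matrix) are hypothetical tensors you would have to provide, and `net`, `loss` and `eta` are the objects defined above:

```lua
m = X:size(1)

-- pure stochastic gradient descent: one update per example
for i = 1, m do
   local pred     = net:forward(X[i])
   local err      = loss:forward(pred, Y[i])
   local gradLoss = loss:backward(pred, Y[i])
   net:zeroGradParameters()             -- start each example with a clean accumulator
   net:backward(X[i], gradLoss)
   net:updateParameters(eta)            -- update after every single example
end

-- mini-batch gradient descent: accumulate over batchSize examples, then update once
batchSize = 128
for i = 1, m, batchSize do
   net:zeroGradParameters()
   for j = 0, batchSize - 1 do
      if i + j <= m then                -- skip indices past the end of the data set
         local pred     = net:forward(X[i + j])
         local err      = loss:forward(pred, Y[i + j])
         local gradLoss = loss:backward(pred, Y[i + j])
         net:backward(X[i + j], gradLoss)   -- gradients accumulate across the batch
      end
   end
   net:updateParameters(eta)            -- one update per mini-batch
end
```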
The last part is that we actually don't have to write all this code either; Torch gives us some very nice treats here too. The treat is basically this: I have to provide a data set to start with, which can be an empty table at the beginning; this table has to provide a size() function, which simply returns m, the size of the data set, so far nothing too complicated. Then, for i from 1 to the size of the data set, we populate the data set: element i is going to be a table whose first element is the example, the input, and whose second element is the target. Once we have this kind of data set, which returns a size if requested and where each item is a table of {input, target}, we can simply create a trainer. So we have a local trainer (a local trainer, funny, right?), which is going to be an nn.StochasticGradient trainer, to which we send the model and the loss we decided to use; and all we really need to do is say: trainer, please train my network on the data set I'd like. So to train your network you really just need those two lines of code. Of course you have to specify a network, a loss function and a data set, otherwise you wouldn't be training anything: you need to define a data set, a network and a loss function which suit your needs, and then you create a trainer which will train your network in just one line.

We can see a full example of how to train a network with the nn.StochasticGradient trainer at the address mentioned before. Let's have a look. Here we have the Machine-Learning-with-Torch repository, and specifically the first topic, which is regression with an MLP, a multi-layer perceptron, which is simply a fancy way of calling a neural network. Usually MLPs are used for pattern-recognition and classification tasks in the fields of image and speech recognition; nevertheless, they can be effectively used for regression. Check out the MLP regression section to find out more about it. So you can go there and see what I'm trying to do in this tutorial: basically, I explain how to perform regression with a neural network. While I was reading Bishop's book, Pattern Recognition and Machine Learning, I got to the point where a three-layer perceptron with only three hidden neurons and one linear output neuron, so four neurons overall, was used to regress some continuous functions. And here is the image: an illustration of the capability of a multi-layer perceptron to approximate four different functions, namely f(x) = x², f(x) = sin(x), f(x) = |x| and, the last one, f(x) = H(x), where H is the step function, or Heaviside step function. In each case, N = 50 data points, whose x coordinates are shown as blue dots, have been sampled uniformly in x over the interval (−1, +1), and the corresponding values of f(x) are the targets, or labels; the x values are simply one-dimensional scalar inputs, and the output is also a scalar. So the size of the input layer is one, the size of the output layer is also one, and then we have three internal hidden neurons. These data points are then used to train a two-layer neural network (counting layers of weights, in Bishop's convention) having three hidden units with a tanh activation function, which is similar to the sigmoid but goes from −1 to +1, and linear output units, so there is no non-linearity after the last layer. The resulting network functions are shown by the red curves in the picture, and the outputs of the three hidden units are shown by the three dashed curves. So just go through the tutorial; there is the algorithm where I explain how I use the trainer for training on the data set, and I highly recommend clicking on the regression source and actually playing with the code. Read the description and use it interactively: change some values and play around to understand better how it works. There are different modes; go through it carefully and you will get a lot of intuition and understanding, I believe. And that's the end of the overview of the nn package; we will see more features in the next podcast.
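To close, here is a self-contained sketch of the data-set table plus trainer pattern, applied to a 1-3-1 regression network in the spirit of the tutorial; the toy target f(x) = x², the learning rate and the number of iterations are my own choices, not necessarily the tutorial's exact code:

```lua
require 'nn'

-- toy data set for regression: 50 points of f(x) = x^2 on [-1, 1]
m = 50
dataset = {}
function dataset:size() return m end               -- the trainer asks the data set for this
for i = 1, m do
   local x = torch.rand(1) * 2 - 1                 -- uniform in [-1, 1]
   dataset[i] = {x, x:clone():pow(2)}              -- {input, target} pairs
end

-- 1 input -> 3 hidden tanh units -> 1 linear output
mlp = nn.Sequential()
mlp:add(nn.Linear(1, 3))
mlp:add(nn.Tanh())
mlp:add(nn.Linear(3, 1))

criterion = nn.MSECriterion()

trainer = nn.StochasticGradient(mlp, criterion)
trainer.learningRate = 0.01
trainer.maxIteration = 100
trainer:train(dataset)                             -- the two lines that do all the work
```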
Stay tuned, have a good night, bye. And good night because, yes, it is night now; I'm working at night.