In this chapter we will see how to use the optim package for training any kind of neural network. We will start by discussing the non-convex optimization problem that neural network training is affected by. Then we will see the advanced optimizers that the optim package offers us for training complex networks. We will then see how to reformulate the training of a network as an objective-function optimization problem. We will also see some basic optimization equations, and at the end we will see the optimal training workflow and write a script which is going to train a network on a CPU and on a GPU, using any of optim's advanced optimizers.

The objective function used to train neural networks is highly non-convex. This means that finding the global minimum is not a trivial task. In this animation by Alec Radford we can see how different optimizers manage to escape a saddle point. Training such a network is not trivial because the loss function is populated by local minima; some also claim that the major obstacle to training is instead saddle points. You can find more details in the description below.

The optim package from Torch provides a series of different optimizers. For example, we have stochastic gradient descent, averaged stochastic gradient descent, L-BFGS, conjugate gradient, Adadelta, Adagrad, Adam, Adamax, FISTA, Nesterov's accelerated gradient method, RMSprop, Rprop, and CMA-ES. A description of these algorithms is provided below, or it can also be reached from the command line.

So far we have seen that the set capital theta is the collection of all the parameters at the different layers, like layer one, layer two, and so on until the last layer. In the case of fully connected layers, a classical neural network, we have seen that these elements are matrices; for a convolutional layer, instead, we have that this element over here
is the kernel of the convolution. So in the end, capital theta will be a collection of very diverse objects. In order to apply more powerful optimization techniques, we have to convert this collection of differently sized objects into a one-dimensional vector that we are going to call lowercase theta, which is a point in an R^D space, where D is the number of trainable parameters. R^D is also referred to as the parameter space. Therefore, when we now run back-propagation, we aim to compute the gradient of our loss function J(θ) with respect to its parameter theta.

Some of the most common optimization techniques are the following. We will start by revising gradient descent. In this case our new theta is going to be our previous theta, from which we subtract eta times the gradient, with respect to the parameters, of our loss function computed on the whole data set. I can write this as theta takes theta minus eta times the gradient with respect to the parameters of the cost function of theta given the whole design matrix X and the labels, capital Y.

Then we have stochastic gradient descent. In this case we can write that theta takes theta minus eta times the gradient, with respect to the parameters, of our cost function computed only on the single example x(i) with its label y(i).

Mini-batch gradient descent, instead, can be written this way: theta takes theta minus eta times the gradient, with respect to the parameters, of the cost function, and here we have a sub-matrix X(i) and the corresponding Y(i). If we think of X as the design matrix with m samples and n features, X(i) is a subset of this data set: it has height equal to the batch size, and still the n features here.

One more interesting method, which I think is pretty easy and I'd like to mention here, is momentum, whose update is as follows: the velocity becomes gamma times the previous velocity plus eta times the gradient with respect to the parameters of J.
Here I don't specify whether it is mini-batch, stochastic, or classic full-batch gradient descent; I just write a generic J(θ). And then we have that theta becomes theta minus the velocity. A link to a very nice blog post by Sebastian Ruder is included in the description below; I highly recommend going over it to get an overview of several optimization techniques that are currently used in deep learning.

So how are we going to train with optim? First we are going to build our data set: we are going to have our inputs contained in the design matrix X, and then our labels Y; this is my training data set. Then I'm going to build my network, layer after layer, which is fed with our examples. At the end, I have to define my loss function, to which I feed the output of the network and, of course, the labels. Finally, through back-propagation, I'm going to compute all the derivatives of the loss function with respect to each and every parameter in the network. As a final step, from every layer I'm going to get the parameters, which is my set of diverse objects, and from this one I'm going to make my parameter vector theta; moreover, I'm going to get the gradient of my cost function with respect to the parameter theta. Both of these elements are going to be one-dimensional vectors, which I will feed to my optimizer together with the evaluation of my cost function, so that the optimizer can find the best solution to minimize my loss function given the parameters and the derivatives of the loss function with respect to the parameters themselves.

So we are now going to write our training script, which is going to be a generic training script that can be used for training any kind of network. In this case, I will simply train a classical neural network.
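To summarize, the update rules described above can be written compactly as follows (reconstructed from the spoken description; η is the learning rate, γ the momentum coefficient, b the batch size):

```latex
\begin{aligned}
\text{batch GD:} \quad & \theta \leftarrow \theta - \eta\,\nabla_\theta J(\theta;\, X, Y) \\
\text{SGD:} \quad & \theta \leftarrow \theta - \eta\,\nabla_\theta J\big(\theta;\, x^{(i)}, y^{(i)}\big) \\
\text{mini-batch GD:} \quad & \theta \leftarrow \theta - \eta\,\nabla_\theta J\big(\theta;\, X^{(i:i+b)}, Y^{(i:i+b)}\big) \\
\text{momentum:} \quad & v \leftarrow \gamma v + \eta\,\nabla_\theta J(\theta), \qquad \theta \leftarrow \theta - v
\end{aligned}
```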
So let's open our editor, and we start with require 'nn'. Then I can set the manual seed to 1234, so that we have consistent results across multiple runs. Then I can define my model: model = nn.Sequential(). I will also define the number of input features; in this case I'm planning to have an XOR kind of problem, so I'm going to have just two input features, and therefore for the output I'm going to have just one output, a scalar. So the size of my network is going to be n, then, let's put 10 neurons in the hidden layer, and then our K: it's a three-layer neural network. Let's therefore add our first Linear module, going from the first dimensionality to the second dimensionality. Then model:add() a non-linear function; in this case I'd like to use a Tanh non-linearity. Then let's add another Linear projection, and that's it for my network. We can decide to visualize this network by typing print(model) and running th train.lua, and there we go, our neural network.

So let's keep going and define the loss function: local loss = nn.MSECriterion(). And now we can build our data set. Let's say that our data set has 128 instances; therefore I can build my design matrix, x = torch.DoubleTensor(m, n), with m rows and n columns. I would have written CudaTensor if I had wanted to run this on a GPU instead. Then I can build my labels.
So it's going to be torch.DoubleTensor with just m elements, and again, here I would have written CudaTensor if I had wanted to create a data set that stays in the device memory. So let's fill up this data set with some values. We can run a for loop with index i that goes from 1 to the number of examples. Let's create our local x, where lowercase x means a single example: it is going to be torch.randn(2), two elements drawn from a normal distribution. Then let's have our y, which is going to be an XOR output. So let's say: if x[1] times x[2] is greater than zero, meaning either both are positive or both are negative, then y is equal to minus one, because it's XOR; otherwise, if it's lower than zero, meaning they have opposite signs, we are going to have a plus one. Now we can copy these single samples into our main design matrix: I take the i-th row and copy over the example i, and this works perfectly fine for CUDA as well. And similarly we are going to have Y[i] equal to y, which in this case is a scalar, so I don't use copy. And that's it; here we have built our data set.

Now, before starting the training, I'd actually like to extract all the capital thetas relative to each layer, which in this case are matrices, but it doesn't matter: they are just objects with different sizes. I'd now like to have a one-dimensional tensor which contains all the parameters of my network. So let's have a global variable net, which is going to be equal to our local model, so that I can access it outside this script. Let's call th -i train.lua, Torch in interactive mode. If I print net, OK, we're going to have our model. If I do net:parameters(), going full screen, this returns me the thetas and the partial derivatives of the loss function with respect to each and every one of these objects. Let's see also the dimensionality.
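As a sketch of the difference just described (assuming the two-input, ten-hidden, one-output architecture from this chapter), parameters() returns the per-layer tensors while getParameters() flattens everything:

```lua
require 'nn'

-- assumed architecture from the lecture: 2 -> 10 -> 1
local model = nn.Sequential()
model:add(nn.Linear(2, 10))
model:add(nn.Tanh())
model:add(nn.Linear(10, 1))

-- parameters() returns the per-layer objects:
-- weight matrices and bias vectors, each with its own size
local params, gradParams = model:parameters()
for i, p in ipairs(params) do
   print(i, p:size())       -- 10x2, 10, 1x10, 1
end

-- getParameters() flattens everything into two 1-D vectors
local theta, gradTheta = model:getParameters()
print(theta:nElement())     -- 41 = 10*2 + 10 + 1*10 + 1
```

Note that getParameters() should be called only once on a given network, since it reallocates the storage that the flat vectors share with the layers.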
We said that in the first layer we were going from two elements to ten elements, so here we see that the first matrix, theta one, is 10 by 2; of course it would be 10 by 3 in our notation, where the bias sits in front of the other part, but here we have the 10-element bias as a separate vector as well. For the second layer we're going to have 1 by 10, plus one, so in our notation it would actually be 1 by 11. In the part below we can see that the gradients of the loss function with respect to these objects have the same dimensionality.

We would now like to have instead just two vectors: the parameters vector and the gradParameters vector, or theta and grad theta. We can do this by typing net:getParameters(). So if we call getParameters() instead, we can see that it outputs two vectors. The first one is a one-dimensional vector of 41 elements, which holds our current weights, randomly initialized. And then we have the gradParameters, which are initially all zero, because we haven't run back-propagation yet.

OK, so we can go back to our script. We can therefore have local theta and gradTheta; these should be lowercase thetas, but that doesn't really work in code.
So I will write model:getParameters(), and this gives me the one-dimensional theta and the gradient of the loss function with respect to theta. Then we also define an optimState, which is going to hold all the parameters that we are going to send to our optimizer. In this example I will use stochastic gradient descent, so we can check what the available options are by typing require 'optim', then optim.sgd; we press Enter and we can see here what we can specify: learning rate, learning rate decay, weight decay, momentum, dampening, nesterov. So we can specify multiple options. In our case we are going to specify that the learning rate is going to be, for example, 0.15.

And let's start the training. So we require 'optim', and then we are going to write our epoch loop; let's say, I don't know, from 1 to 1000 epochs. Then we write function feval(theta). Basically, feval evaluated at theta should return the scalar value of the objective function computed at theta, and the derivative of this cost function with respect to these parameters. Let's provide these elements. At the beginning we are going to zero the gradParameters, of course, so that we don't accumulate on top of a previous result; this is the same as calling model:zeroGradParameters(). OK, since we have already taken them out,
I can simply zero gradTheta directly this way; it would be the same. Then I can define the hypothesis of the network, the output of the network, which is equal to model:forward() of my matrix x. Then I will define local J, which is my loss, to which I forward the output of the network and the labels. Let's even print this value of J, just so we can see how the training performs; this is just for debugging purposes. Then we have our local dJ/dh: this is going to be our loss, on which I call backward() with the output of the network, h(x), and y. Then we have model:backward(), to which I send my input x and the dJ/dh. Finally, our function, as I said before, has to return the scalar J and gradTheta. After we define this function, I simply call optim.sgd, to which I send my feval, my parameters theta, and then the optimState I defined before. And that's pretty much it.

Let's actually again have a global net, equal to our model, so I can play with it afterwards. Let's go below, running th -i train.lua. So let's try our network: forward with torch.Tensor{-1, -1}, and this gives us a negative value, roughly minus one. Then let's try {-1, 1}, and this one gives us a positive result. Then let's try {1, -1}, and this also gives us a positive result. And then let's try both positive, and that's perfect: it gives us a negative result.

Something I'd like to show as well: in this case we trained the network and we got its final cost, this one, so let's save this value here; for example, I can print the previous J. And let's see how the cost function changes if, instead of having a three-layer neural network, I have a four-layer neural network with the same number of neurons.
So let's make these five neurons, also here five neurons, and then I simply duplicate those two lines. So, same number of neurons, simply one more layer. Let's go below and let's see what happens here. And there we go: from 0.18 we went down to 0.06. So we can see that stacking multiple layers improved the abstraction capabilities of the network, which can converge to a better minimum in the same number of epochs. Furthermore, if I had wanted to use another kind of network, I would have simply defined a different model here.

Let's now see what operations are required to have this script running on a GPU. So first of all we just go here, and we are going to have a section, running on GPU. First we have to say require 'cunn', the CUDA neural network package. Then we have model:cuda(), we have loss:cuda(); then we are going to have x = x:cuda() and y = y:cuda(). And that's it. Now, if we run this script, instead of running on my CPU it will be running on the GPU of my machine. So let's try it and see if it works. On this Mac I don't have a GPU, so I will go on my other machine; let's zoom, and here we have my file, the same one. So let's try to run th train.lua. Then we can also try to train without these instructions, and we can see it took a little bit more time.
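Putting the pieces together, a minimal sketch of the training script walked through in this chapter might look like the following. This is a reconstruction, not the original file: the variable names (optimState, feval, net) follow the spoken walkthrough, and the GPU lines are the ones to swap in as described above.

```lua
require 'nn'
require 'optim'

torch.manualSeed(1234)                 -- consistent results across runs

local n, K = 2, 1                      -- 2 input features, 1 scalar output
local model = nn.Sequential()
model:add(nn.Linear(n, 10))
model:add(nn.Tanh())
model:add(nn.Linear(10, K))

local loss = nn.MSECriterion()

-- build an XOR-like data set of 128 examples
-- (use torch.CudaTensor here to keep the data in device memory)
local m = 128
local X = torch.DoubleTensor(m, n)
local Y = torch.DoubleTensor(m)
for i = 1, m do
   local x = torch.randn(n)
   local y = (x[1] * x[2] > 0) and -1 or 1   -- same sign -> -1, else +1
   X[i]:copy(x)
   Y[i] = y
end

-- flatten all parameters into two 1-D vectors
local theta, gradTheta = model:getParameters()
local optimState = {learningRate = 0.15}

for epoch = 1, 1000 do
   local function feval(theta)
      gradTheta:zero()                 -- don't accumulate old gradients
      local h = model:forward(X)       -- network output
      local J = loss:forward(h, Y)     -- scalar cost
      local dJ_dh = loss:backward(h, Y)
      model:backward(X, dJ_dh)         -- fills gradTheta
      return J, gradTheta
   end
   optim.sgd(feval, theta, optimState)
end

net = model                            -- global, to play with interactively
```

For the GPU version, as in the chapter, one would add require 'cunn', call model:cuda() and loss:cuda(), and send X and Y to the device with X = X:cuda() and Y = Y:cuda().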