So today is the last lesson. Yeah, I'm smiling, but I'm sad. I wanted to talk about energy-based models and how to train them, but I think I need another month to prepare that properly. If you're still interested, this summer you'll be able to get a tutorial on energy-based models: Yann and I are writing a paper together. Part of it is going to be math, and part is going to be the actual implementation, so that you can execute the paper and get a better understanding of what's going on. That should come out in a month or so — I have to do a pretty good job there — and maybe we can even have an additional class later on if you're interested. I'm always up for teaching you. So again, if you're interested in energy-based models later on, outside the course, we can meet and record what would effectively be one more class. Anyway, I didn't manage to prepare it for today. Today we're going to cover, if I manage to finish, two topics we never talked about much before, because they're more classical machine learning — but they matter in deep learning too. The topic of the day is overfitting and regularization. Let me start sharing the screen. As usual, this is my own perspective on the topic. It's not necessarily the mainstream one, but it's what you get, since it's my view and I'm your instructor today. So, overfitting and regularization, and the connection between them: two different things, but of course connected. I start with this drawing here. Someone told me it's not intuitive, but for me it is, so there you go. The pink box shows the data complexity: these are samples from my training dataset. Then I try to fit three different models to them.
In the first case, the model complexity is smaller than the data complexity, and you get the phenomenon called underfitting: you're trying to fit what looks like a parabola with a straight line, so you're not doing a good job. Next we have the right fit, where the model complexity matches the data complexity. What's the difference from the previous case? Here you have zero error: the model passes exactly through the training points. Finally we have overfitting, where the model complexity is greater than the data complexity. In this case the model doesn't choose a parabola — why? Question for you, live audience: why is the model wiggly in this case? Why isn't it a parabola? You're supposed to type in the chat, because otherwise I don't know if you're following. So, my model complexity is larger than the data complexity, and although those points look like they belong to a parabola, my model produces that spiky peak on the left and some weird stuff. "The model doesn't learn but memorizes — overfitting." Sure, it's written there, overfitting. But why? If those points came from a parabola, I would expect even a very large model to make a very nice parabola. (You're writing to me privately — don't write privately.) So: if, and this is a big if, my training points came from an actual parabola, even the overfitting model would produce a perfect parabola. The point is that there is some noise — there's always some noise. Therefore, a model that goes perfectly through every training point has to go crazy, because those points don't exactly live on the parabola; they are slightly offset.
And in order to pass perfectly through them, the model has to come up with some funky function. Does it make sense? Without noise, this would just be a perfect parabola. So someone might say, OK, maybe we should aim for the right fit. In classical machine learning, maybe. But we're doing deep learning, and there it's definitely not the case. Our models are so powerful that they even manage to learn noise. There was a paper showing that if you label ImageNet with random labels, you can get a network to perfectly memorize every label for every sample. So you can clearly tell that the models we use are absolutely overparameterized: they have far more capacity than is necessary to learn the structure of the data. Nevertheless, we actually want that. Let's figure out why — actually, maybe you already know the answer. What's the point? Why do we want to go into a very high-dimensional parameter space? I've told you a few times. Who answers? Come on, it's the last class, answer me. ... Yes, optimization is easier — fantastic, that's the point. In an overparameterized space, everything is very easy to move around, and therefore we always want to put ourselves in the overfitting scenario with our networks, because training is going to be easier. But then what's the problem? Well, the model is going to wiggle like crazy. So that was point number one. Point number two: why would you actually want to overfit when writing your script? Second question — I know, interactive class today. "To show there is some trend you can model" — maybe that's in the right direction, but it's too complicated as an answer.
So, are you expert network trainers by now? You should be — you've been following these lessons for a while. Try to answer this question: why would you like to overfit? I'll even give you a hint: I always start by training my network on one batch, to check whether the model has the capacity to learn. This is the number one rule for debugging machine learning code. You want to see whether you messed up your model creation. So first, take one batch of the correct size — even pure random noise, torch.rand with random labels — and run a few epochs over that single batch, which could be the first batch of your dataset or whatever, just to prove that your model can learn. It's easy to make tiny mistakes, like I've done a few times, such as calling zero_grad after backward — it happens, and then nothing learns. So you always want to verify that your model can learn. If it can memorize, fantastic: we're now going to learn how to improve the performance of a model that memorizes its own data. So, two reasons. First, as we said, overparameterized models are easy to train because the loss landscape is much smoother. With an overparameterized model, you can start from different initializations — different initial points in the parameter space — and when you train these different models, they will each converge to a different position. Think about it this way: you can permute the weights of a model — if you permute the weights per layer appropriately, you still get the same model at the end. So in terms of the function approximator you are building, they are comparable: in function space, they are exactly equivalent models.
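The one-batch debugging trick just described can be sketched as follows. This is a minimal illustration, not the lecture's notebook: the model, batch size, learning rate, and step count are arbitrary choices; the point is only that the loss on a single fixed batch must collapse if the training code is correct.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))
x = torch.randn(32, 10)              # one batch of pure random noise
y = torch.randint(0, 3, (32,))       # random labels
optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

first_loss = criterion(model(x), y).item()
for _ in range(200):                 # loop over the SAME batch
    optimiser.zero_grad()            # zero the grads BEFORE backward, not after
    loss = criterion(model(x), y)
    loss.backward()
    optimiser.step()
# If the loss on this single batch does not collapse, the training code is broken.
```

If the loss stays flat here, look for exactly the kind of bug mentioned above, such as zeroing the gradients in the wrong place.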
In parameter space, they are absolutely different models. Nevertheless, they will converge to equivalent models, in the sense that they will perform equally well. Are you following? Am I talking about weird stuff today? This connects to Joan's class, where we talk about parameter space versus function space — that class is so cool; I think next year I'll try to put it online as well. OK, so: first point, overparameterization helps with training; second point, overparameterization helps with debugging your code. "Can you repeat the point about function and parameter space?" Sure. If you have a neural net and you permute the rows of one weight matrix, and then permute the columns of the next layer's matrix accordingly, you reorganize the weights but always get the same behavior. Say the hidden layer has size two, so the first matrix has two rows. Swap the rows, and you get a hidden layer that is flipped. Then in the next weight matrix you flip the columns, and you get exactly the same network — sorry, exactly the same function: it gives you exactly the same numbers as output, although the parameters are different, because you swapped them. What was W11 is now W21, so they are different. In parameter space these are two different models: one point is here, another point is there. Nevertheless, the mapping from parameter space to function space sends both configurations to the same function, because the function connects the input to the output, and the outputs are identical after this permutation of the rows and then of the columns.
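The row/column permutation argument can be checked directly in code. A minimal sketch, using a tiny two-unit hidden layer as in the example above (the sizes and input are arbitrary): swapping the rows of the first matrix and the columns of the second gives different parameters but the identical function.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1, fc2 = nn.Linear(3, 2), nn.Linear(2, 1)   # hidden layer of size two
x = torch.randn(5, 3)

perm = torch.tensor([1, 0])                    # swap the two hidden units
with torch.no_grad():
    y_before = fc2(torch.relu(fc1(x))).clone()
    fc1.weight.copy_(fc1.weight[perm])         # permute rows of the first matrix
    fc1.bias.copy_(fc1.bias[perm])
    fc2.weight.copy_(fc2.weight[:, perm])      # permute columns of the second
    y_after = fc2(torch.relu(fc1(x)))
# y_before == y_after: same function, different point in parameter space
```

So two points in parameter space map to one and the same point in function space.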
Makes sense? Question: "If the parameter space is very big for a given dataset, can we say the model is very uncertain about its predictions?" OK, we'll talk about uncertainty in a bit, so I'll address that then. All right. So we always start in the third column, with overfitting. I always want a model that is overparameterized, because it's easy to train, and also because it's going to be powerful, in the sense that it will learn more than what we expect. So how do we deal with this overfitting? How do we improve the validation and test performance? Here we see how to fight it. We start from the right-hand side, with a weak regularizer — effectively no regularization — so this last plot, the sixth one, is the same as my third plot above. Then I add a medium regularizer, which I like to think of as smoothing edges: my square gets rounded corners. You can tell that this second plot is different from the "just right" fit in my second panel — as you can see, there are still some corners here. Finally, if you crank up the medicine — it's like a drug, you're poisoning your model to restrict its power — you get a very strong regularizer, which gives you the circular one. That's my mental image. Anyhow, I've given you the big picture first; now let's go on to the actual definitions. There are a few definitions here; they are not quite equivalent, but in deep learning this is what we use. So here we go. First: regularization adds prior knowledge to a model; a prior distribution is specified for the parameters.
So we expect these parameters to come from a specific distribution, a specific generating process. Whenever we think about regularization this way, we are strongly assuming that the parameters come from that specific process. OK, so that's the parameter-space view. Then there's the function-space view: regularization can be seen as a restriction of the set of possible learnable functions. Again, two perspectives. One is on the weights: what kind of objects these weights are supposed to be — what shape, length, or structure I assume in advance. That's the prior; "prior" means "before" in Latin. In the other view, out of all possible functions, you restrict to a subset of functions that are not too crazy, not too extreme in the way they behave. There's a question: "But in that image, the square is still inside the circle." Oh — I see. Maybe the circle should have been smaller than the square. Right, good point. Finally, the last definition of regularization, which is the real deep learning one, and which is a bit of a stretch. (My Google thinks I'm speaking Italian, what the heck?) Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error. This is a stretch because it no longer talks about prior knowledge or function space, but about modifications to the learning algorithm.
So this moves toward programming: parameters, then functions, then algorithmic implementation — really three different perspectives on the same thing. Cool. So let's start with a few examples of regularizing techniques. First, Xavier initialization. I told you before that we can think of the parameters as coming from some generating process, so whenever you initialize a network, you are choosing a prior: you are defining where your weights come from. In this case we can choose Xavier normal, an initialization technique that assumes a Gaussian distribution over the weight values: most weights are peaked around zero, with a standard deviation based on the input and output sizes of that specific layer. From here we can introduce weight decay. Weight decay is the first regularization technique that became widespread in machine learning — maybe not specifically in neural nets, but still relevant. You can find it directly in the torch.optim package, as a flag on the various optimizers. It's also called L2 regularization, ridge regression, or a Gaussian prior, which tells you the weights are assumed to come from a Gaussian generating distribution. Nevertheless, we call it weight decay — that's what you'll call it if you train neural nets. So why? We start with the objective J_train, acting on the parameters, which equals the old training objective, the one without regularization, plus a penalty term: λ/2 times the squared two-norm of the parameters.
If you compute the gradient of the penalty, you get just λθ, because the 2 comes down and simplifies. Now look at the update equation: θ gets the previous θ, minus a step in the direction opposite to the gradient — so that you go down the hill of the training loss — minus ηλθ, a scalar multiplying θ. The first part says "go down the hill"; the second says "and also go toward..." — zero, right. So how does this look? Suppose we have already finished training and the training loss is zero; then only the second term remains: θ ← θ − ηλθ. What does that mean? At any point θ, you subtract a scaled copy of θ itself — I told you, a scalar is what scales — probably by a factor lower than one. So wherever you are, this term takes you down along the line connecting the head of θ to zero. If you perform a few steps of this update, the resulting vector field is one that attracts you toward zero. And that's why it's called weight decay: if you let it run, the weights decay to zero. Makes sense? These are very cute drawings, I think. Cool. So now you know about weight decay. We can also think of weight decay as adding a constraint on the length of the weight vector.
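The decay-to-zero behaviour can be seen directly with the `weight_decay` flag mentioned above. A minimal sketch, with illustrative values: the data loss is held at zero, so each step is exactly θ ← θ − ηλθ, i.e. multiplication by (1 − ηλ).

```python
import torch

theta = torch.ones(2, requires_grad=True)
# eta = 0.1, lambda = 0.5, so each step multiplies theta by (1 - 0.05*... ) = 0.95
opt = torch.optim.SGD([theta], lr=0.1, weight_decay=0.5)

for _ in range(10):
    opt.zero_grad()
    loss = (theta * 0).sum()   # pretend the training loss is already at its minimum
    loss.backward()            # gradient of the data loss is zero
    opt.step()                 # theta <- theta - eta*lambda*theta = 0.95 * theta
# after 10 steps: theta ≈ 0.95**10 ≈ 0.599 — decaying toward zero
```

With no data term pushing back, the weights shrink geometrically, which is exactly the attracting vector field in the drawing.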
The length of a vector is its Euclidean norm, and weight decay is a way of reducing that length. OK, so what is L1? L1 regularization — which in Torch you add to the loss by hand, as we'll see later — is also called lasso, for "least absolute shrinkage and selection operator" (statisticians, whatever). It's also called a Laplacian prior, because it comes from the Laplace probability distribution, and a sparsity prior. Why is that? This is pretty interesting. In the bottom diagram, the dashed line represents my Gaussian prior, and here I show you the Laplace. What's the difference? Laplace is like the Gaussian — an exponential — but instead of the squared norm in the exponent you have the one-norm. Whereas the quadratic is very shallow and flat near zero, the absolute value is spiky; if you take the exponential of minus the absolute value, you get a spike at zero for the Laplacian, while the Gaussian is smooth at the bottom, because the parabola is smooth there. The point is that there is much more probability mass near zero than before: it's much more likely that a weight ends up close to zero. Maybe that's not clear enough as an explanation, so let me show you the second diagram. In this case my training loss is the old training loss plus λ times the one-norm of θ. Now, if you compute the gradient of the L1 norm, what do you get? It's +1 where the component is positive and −1 where it's negative — it's the sign function, exactly. So the update subtracts ηλ·sign(θ). Let's now reason the same way about what happens.
If you've already finished training, the loss term is gone and you just have θ ← θ − ηλ·sign(θ). If you're on the x-axis, the y-component is already zero, so you get arrows pulling you in along the axis — exactly like L2, you go toward zero. Now, what happens in the first quadrant? You get the sign in both directions, scaled by that scalar factor, so the update points diagonally down-left. In the figure, the gray arrows show the L2 regularization, which takes you from the initial point toward zero along the direction of the vector itself, proportionally to it. The L1, in green, instead takes you down at 45 degrees — and then what happens? You kill the y-component. So the L1 vector field quickly kills components that are close to an axis: if you're near the axis, bam, it takes you down onto the axis in a few steps, and if you keep applying it, you slide down the axis toward zero. If you still keep applying it, you can shrink the length too, but the point is that you're not just shrinking the length as with L2. L2 just shrinks the length of the vector; L1 actively kills the components that are near the axes. I think you can now clearly see how this works. And this is quite relevant for training, say, regularized latent-variable models, because it gives you a very quick way to regularize the latent variable: you kill some of its components, so the information is restricted to just a few of those values. You like this stuff?
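The sign-based vector field can be simulated in a few lines. A sketch for the first quadrant only, with illustrative step sizes (the clamp is just a crude way of stopping at the axis once a component crosses it): the small component dies in a few steps, while the large one merely shrinks.

```python
import torch

theta = torch.tensor([1.0, 0.05])   # a point near the x-axis, first quadrant
eta_lambda = 0.02                    # the scalar factor eta * lambda

for _ in range(5):
    step = eta_lambda * torch.sign(theta)        # the L1 update direction
    theta = torch.clamp(theta - step, min=0.0)   # stop at the axis (first quadrant)

# the y-component (0.05) hits the axis after ~3 steps and stays dead;
# the x-component only shrank from 1.0 to 0.9
```

This is the sparsity prior in action: components near an axis are zeroed out quickly, unlike with L2, which only scales everything down.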
You like the drawings? They're cute, I think. OK — dropout. We've talked about dropout a few times, I think, but I never showed you the animation. So, arrow, boom. What does dropout do? I can show you my ninja PowerPoint skills: we have an infinite-loop animation. The input, in pink, is fed to the network, and the hidden neurons are sometimes set to zero. Here the dropping rate is 0.5, so half of the neurons are randomly turned to zero during training. What happens is that there is no longer any fixed path between input and output: the network cannot learn one particular route. If it tries to memorize one specific input, it can't, because every time it gets a different network. Before, with a fully connected network like this one, you could imagine memorizing a specific sample along a specific path — this neuron, then this one, and so on. But if the network keeps switching neurons off, sometimes that neuron on the left-hand side simply doesn't exist, and then you cannot memorize that specific path. Moreover, you can think of dropout as training an infinite number of different networks, because every time you drop some neurons, you effectively get a new network. They all share the same underlying starting weights. Then at the end, when you use the model at inference, you usually turn dropout off, and then you have to rescale the weights — otherwise the activations blow up.
This is because if half of the neurons are off during training, the remaining half are doing the whole job; if you then turn everyone on, you get roughly twice the activation values. So you can do one of two things. Either, while dropout is active, you multiply the surviving activations by one over the keep rate: with a dropping rate of 0.5 you multiply by two — one divided by (1 − 0.5) — so that the surviving neurons are twice as strong. With a dropping rate of 0.1, 90% of your neurons are present, so they should be 1/0.9 times stronger to deliver the same overall magnitude. Or, equivalently, you scale the weights down at inference instead. Anyhow, you can think of dropout as training multiple networks during training, and then, at inference, you turn the dropout module off and effectively average out the outputs of all these individual networks. This gives you a much better reduction of the noise that arises from the training procedure: if you have multiple experts and take their average, you get a better answer, because averaging removes the variability of any one specific answer. Although perhaps we should keep this variability of the answers in mind, because it can turn out to be quite interesting. Anyhow, dropout is an amazing way to get automatic model averaging — model ensembling — essentially for free. Cool. Question: "Is dropout a good technique only for classification tasks, or also for other tasks like metric learning?" I would say dropout gives you a more robust network, a more robust prediction, regardless of the task; it's not restricted to classification. You basically train multiple networks of reduced size and then average these reduced-size networks.
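The scaling bookkeeping just described is what PyTorch's `nn.Dropout` implements, using the first of the two options ("inverted dropout"): during training, survivors are already multiplied by 1/(1 − p), so at eval time the layer is simply the identity and no rescaling is needed. A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)        # dropping rate 0.5
x = torch.ones(1, 8)

drop.train()
y_train = drop(x)               # roughly half the entries become 0,
                                # the survivors become 1 / (1 - 0.5) = 2.0

drop.eval()
y_eval = drop(x)                # dropout is off at inference: identity
```

So the "twice as powerful" factor from the lecture shows up as the value 2.0 on the surviving units during training, and the eval-mode output is exactly the input.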
So although at the end you have one large network, it behaves like the average of many small networks. And if you think of it this way, the small networks can no longer overfit so easily, because they are not that overparameterized anymore. So dropout lets you fight overfitting through several different mechanisms. Finally, think about applying dropout to the input: that's sort of like a denoising autoencoder, no? You perturb the input, and you still force the output to be the same. The network then becomes insensitive to small variations of the input, which makes it more robust. Or, similarly, as I wrote in the midterm: how can you craft an annoying input? You can find some noise to add to the input that increases your loss — a kind of adversarial noise generation — and then train your network on these handcrafted samples, perturbed specifically to increase the training loss. OK, so I gave you about four different reasons to use dropout — and yet I don't use dropout that often. I actually do use it, but for a different reason, which I'll come to in a bit. OK, next: early stopping. This is one of the most basic techniques. While training your model, if the validation loss starts increasing, you stop right there, so that you keep the model at the lowest validation loss — the point where you are not yet overfitting. This also doesn't let your weights grow too much: instead of applying L2, which keeps the weights from getting too long, you simply stop before they get that long. Fighting overfitting. Now, there are also techniques that end up regularizing our models but are not, strictly speaking, regularizers.
This is important: the following are not regularizers, although they do regularize the network. As long as you keep that in mind, we can look at these other options. First: batch normalization. We've talked about this several times; we don't quite know how it works too well. There is a blog post explaining it — we put the link in the optimization lecture, check it out; I really can't remember which one. Anyhow, the point is that you reset μ, the mean, and σ², the variance, at each layer, and these are based on the specific batch you have: you compute the mean and the variance over that batch. But if you sample uniformly from your training dataset, you will essentially never get two identical batches: every batch has a different configuration of samples, so the computed mean and standard deviation will always be different. Therefore — I've said this five times — you apply a different correction per batch, and the model never sees exactly the same input twice, because each input is altered depending on where it happens to land in your training procedure. This is so cool, I really like it. And that's usually all you need to train your network, most of the time; you don't even want dropout. This technique also speeds up training like crazy. Before batch norm was introduced, training on ImageNet took me, I think, at least a week — if not a month, it was terrible; but that was some eight years ago. With batch normalization, I think you can train in about a day. That's ridiculous.
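The per-batch statistics argument can be seen directly. A small sketch (the feature size and batch contents are arbitrary): the very same sample comes out of a `BatchNorm1d` layer differently depending on which batch it happens to be packed into, because each batch contributes its own mean and standard deviation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4, affine=False)
bn.train()                                    # use per-batch statistics

sample = torch.randn(1, 4)                    # "input 42"
batch_a = torch.cat([sample, torch.randn(7, 4)])
batch_b = torch.cat([sample, torch.randn(7, 4)])

out_a = bn(batch_a)[0]   # input 42 normalised with batch A's mean and std
out_b = bn(batch_b)[0]   # input 42 normalised with batch B's mean and std
# out_a != out_b: the network never sees exactly the same input twice
```

At inference, `bn.eval()` switches to the accumulated running statistics, so the correction becomes deterministic.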
Questions: "Do you mean robust in terms of adversarial learning as well?" and "I don't understand why we don't see the same sample twice." By robust I mean that you are providing slightly different inputs every time, so the network gets better coverage of the training manifold. And why don't you see the same input twice? Because the same input is transformed depending on the batch it appears in. Say you have input 42, and it lands in a given batch: you subtract that batch's mean, divide by its standard deviation, and that's the value that goes through the network. If input 42 lands in a different batch, that batch's mean is different, so you get a slightly different value. You never actually observe the same input, because the statistics are specific to whatever batch it happened to be packed into, and they change every time. Same input, different correction, whenever it appears in a different batch — so you never see the same input twice. This technique is usually all I use for training my networks, and it works. Although, again, recently I've been using dropout for a different reason, which we'll see in a few minutes. More data, of course: just provide more data and you fight overfitting — if you can get it. Finally, data augmentation. Data augmentation is also a very valid technique: you provide deformed versions of the input. For images there are center crop, color jitter, different crops, random affine transformations, random rotation, horizontal flip, and so on. If you see my face and flip it horizontally, I'm still me, kind of — upside down, maybe not quite.
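A minimal augmentation sketch in plain torch (in practice you would reach for `torchvision.transforms` — `RandomHorizontalFlip`, `ColorJitter`, `RandomCrop`, etc.; the tiny fake image and the helper `random_hflip` here are illustrative stand-ins):

```python
import torch

torch.manual_seed(0)
img = torch.arange(12.0).reshape(1, 3, 4)   # fake 1-channel, 3x4 "image"

def random_hflip(x, p=0.5):
    """Flip the width axis with probability p; the label stays the same."""
    return torch.flip(x, dims=[-1]) if torch.rand(()) < p else x

augmented = random_hflip(img)               # either img or its mirror image
```

The key property is that the label is unchanged: the network is shown perturbations it should be insensitive to, which is exactly the point made above.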
Nevertheless, you can see that if you apply perturbations you'd like the network to be insensitive to, you improve its performance: it learns to be insensitive to that kind of variation. OK, quickly now: transfer learning. We already know about transfer learning, I think. You take a network already trained on a specific task, you throw away only the final classifier and keep everything else, and you plug in a new classifier. Then, if you have a small amount of data from a distribution similar to the training one, you do transfer learning, which again means training just the final classifier. If you have lots of data, you should fine-tune, because then you also want to tweak the feature extractor — the blue layers (the colors are flipped here, damn: the hidden layers should have been green and the output blue). With a small amount of data that is different from the training distribution, you want to do early transfer learning — actually, my bad: you want to remove a few of the final hidden layers, because they are already specialized, and retrain on top of the base feature extractor. And if you have lots of data that is different from the training distribution, just train from scratch. Also, you can use different learning rates for different layers to improve performance. Usually the final layers are the ones that change fastest, because they are close to the loss; although if you use batch norm, all the layers train at roughly the same speed.
Otherwise, again, you can decide whether you want to change the learning rate, maybe make these guys train slower or not. What's the difference between transfer learning and fine-tuning? In transfer learning, I just train the final classifier: if you have few data, you don't want to overfit, so you reuse the whole network from the previous task and you just train the final classifier. If you have lots of data, then you can actually afford some changes: you can use a lower learning rate, so the feature extractor also changes, but slowly. So for transfer learning, you freeze the base network? Yeah, I would say in transfer learning you just freeze the blue guy and you just train the orange one. In fine-tuning, you actually tune all the other parameters as well, maybe with a smaller learning rate. This is notebook number 12. Here I'm classifying the sentiment of reviews from the IMDB dataset. All right, and I'd like to compare different regularization techniques, so I'm skipping ahead because I'd just like to show you the final results. Let me see, where is the optimizer? You can toggle different things. At the beginning, we have no weight decay, nothing, right? So we train with no regularizer. Let's check what the model is. The model is just a feed-forward neural net: we have some embeddings, a linear, another linear. My forward is going to take my embeddings, send them through the first fully connected, the ReLU, and then you get the output from the second fully connected, and I'm outputting a sigmoid because I'm doing a two-class classification problem: we'd like to figure out if it's a positive review or a negative review.
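The model described above could look roughly like this; the layer sizes and the use of `EmbeddingBag` are assumptions for illustration, and the actual notebook may differ:

```python
import torch
import torch.nn as nn

class SentimentNet(nn.Module):
    """Feed-forward sentiment net: embeddings -> FC -> ReLU -> FC -> sigmoid."""
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64):
        super().__init__()
        # EmbeddingBag averages the word embeddings of each review
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, tokens):
        x = self.embedding(tokens)         # (batch, embed_dim)
        x = torch.relu(self.fc1(x))        # (batch, hidden_dim)
        return torch.sigmoid(self.fc2(x))  # probability of a positive review

model = SentimentNet()
reviews = torch.randint(0, 1000, (4, 20))  # 4 fake reviews, 20 token ids each
print(model(reviews).shape)  # torch.Size([4, 1])
```

Since the output is a single sigmoid probability, the training loss would be binary cross-entropy against 0/1 sentiment labels.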
And so this is the initial training: the validation curve climbs up like crazy, whereas the training curve goes down to zero. Here you can see the validation accuracy, which goes up to 64 or so. And here we just store the weights of the network for the case where there is no regularization, okay? The first thing I'd like to try is L1 regularization. So let's see how to do that. Toggle this one to do L1 regularization. Here I'm extracting the model parameters, and then I'm going to add a term to the loss: I'm going to add the one-norm of the FC1 weights to the loss, okay? Because there is no other way to do L1 in PyTorch for the moment; the optimizer's weight decay gives you L2 only. Okay, so let me re-initialize the network. I start here, I run this one, and then I start training here. So this guy is training. How many epochs? Let's check: one, two, three, four, five, six. All right. Before, the validation accuracy was around 64, and now it went to 66, right? So we've actually improved the performance. Oh, it's going down, down, oh, back up, 67. Looks good, 68, okay, it's finished. So I can show you what happened with L1. Oh, it's not finished yet; it's taking forever. Okay, while this is training, I'm going to show you the output of this guy, and then I'm going to briefly show the second usage of dropout. Should we stop this guy? 69? So you can see now we are at 69 in validation accuracy, right? Okay, cool. And here you can see both the training and the validation losses: they both go down, and here I show you the validation accuracy, which went up to 67 and 68, okay? And here I just store these weights for the L1 case, okay?
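Adding the L1 penalty by hand, as described above, could look like this; the lecture adds the one-norm of the FC1 weights, and the model, λ, and data below are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 1))
criterion = nn.BCEWithLogitsLoss()
lambda_l1 = 1e-4  # assumed penalty strength

x = torch.randn(8, 20)
y = torch.randint(0, 2, (8, 1)).float()
loss = criterion(model(x), y)

# PyTorch optimizers only offer L2 through weight_decay, so the L1 term
# is added to the loss manually: lambda * sum |w| over the chosen layer.
l1_penalty = model[0].weight.abs().sum()  # one-norm of the first FC layer
loss = loss + lambda_l1 * l1_penalty
loss.backward()  # gradients now include the sign(w) term from the L1 norm
```

The subgradient of |w| is sign(w), which is what pushes many weights all the way to zero rather than merely shrinking them.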
I'm going to go back here. We're going to undo this one, right, because we don't want L1. We're now going to choose an L2 regularizer. So I toggle this one and toggle this one, all right? Now we have weight decay with this value. Model: I execute this one and I execute these guys. All right, so while the L2 is training, I'll just show you a quick overview of Bayesian neural nets, that is, estimating a predictive distribution. Why care about uncertainty? Many reasons. If you have a cat–dog classifier and you show it a hippopotamus, the network is going to tell you, "oh, this is a dog." No. It doesn't know; it cannot tell you, "this is none of the above," right? You can think, "oh, let's make a third category," but then how can you show the network not-a-cat and not-a-dog? It doesn't quite work like that. A cat is an object, a dog is an object; not-a-cat-or-dog is not an object. So you can't really train your network to recognize "everything else." Reliability of steering control: say you're training your car to steer right and left, and your car says "steer to the right." Okay, hold on, how certain are you about this action? Is it going to kill me, right? Physics simulator predictions: if you know physicists, they always want to know how certain you are about your value, right? Measurements in physics always come as the value plus or minus the uncertainty. So your network should likewise be able to tell you how certain its numbers are, what the confidence interval for a specific prediction is. Moreover, you can think of using this for minimizing action randomness when connected to a reward. What the heck does this mean? If there is some uncertainty associated with some actions, you can actually exploit that and train your model to minimize that uncertainty. And this is so cool, because we use something similar in our project, right? So, dropout, which I told you about before.
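Unlike the manual L1 term, the L2 toggled above is built into the optimizer via `weight_decay` (model, rates, and data here are assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 1))

# weight_decay adds an L2 penalty through its gradient (lambda * w),
# so every step also shrinks the weights toward zero.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

x, y = torch.randn(8, 20), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()  # update includes the shrinkage term
```

This is why "toggle the weight decay" is all it takes in the notebook: no extra term has to be added to the loss for L2.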
So how does this neural net with dropout work? I'm just going to go through this quickly. I multiply my input and my hidden layers by these random zero–one masks, okay? And you can have the activation function be some nonlinearity. The masks come from a Bernoulli with probability one minus the dropping-out rate (so this is the dropping-out rate), and then you scale the result so that the expected amplitude of those activations stays the same. The training has just finished, so I'm going to switch back; sorry for the context switching. Oh, okay, cool. Calculate the variance, yes, someone was saying calculate the variance. I know I'm switching; I'm sorry, it's the last lesson, I'm making a mess. Okay, so this was trained and we got 64. These curves are also both going down; this is the L2 regularization. Before, we were getting to 68 with the L1; here we get something else. Maybe, oh, you can see it's still climbing, right? So maybe I just stopped too early: if you keep training, you're going to get better performance. It's monotonic, non-decreasing, right? Kind of. So I think you can squeeze out more. And here I'm going to save these weights as the L2 weights. Okay, so I saved that. And the last one is going to be exactly the dropout, right? Go back here, turn off the L2, so we turn off this guy and turn the simple one back on. But then we have to go back into this network definition: we'd like to turn on the dropout. True, there we go. Boom, boom, boom. Okay, is it training? Yeah, it's training. All right, cool. Back to the presentation. I know, I'm sorry, I'm going over time. What a bad teacher. Okay, so this is actually what we are doing, the dropout part, right? I am basically multiplying these inputs and hidden layers by masks.
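The masking and rescaling described above (inverted dropout) could be written by hand like this; `nn.Dropout` does exactly the same thing internally:

```python
import torch

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p,
    scale survivors by 1/(1-p) so the expected amplitude is unchanged."""
    if not training or p == 0.0:
        return x
    # 0/1 mask drawn from Bernoulli(1 - p), i.e. P(keep) = 1 - p
    mask = torch.bernoulli(torch.full_like(x, 1 - p))
    return x * mask / (1 - p)

x = torch.ones(1000)
y = dropout(x, p=0.5)
# Roughly half the units are zeroed and the rest doubled,
# so the mean stays close to 1 in expectation.
print(y.mean())
```

The `training` flag is the piece that matters later for uncertainty estimation: in normal evaluation you would turn it off, but keeping it on at test time is what makes the Monte Carlo trick below possible.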
Here you have a model which is trying to predict a time series, something like a CO2 concentration level. If you use a Gaussian process with a squared exponential kernel, then after the dashed line the model says: I have no clue, so I give you my prediction, which is zero, but this is my confidence level. Can we do something similar with neural nets? Yes, we can. So this is uncertainty estimation using the ReLU nonlinearity in the network, and this one instead uses tanh. Say I'd like to do a binary classification. In the first case, these are going to be my logits; minus 3 to 2.5 is the training interval. And then if I ask my network, what is the prediction for x star? If I don't use any uncertainty estimation, you're going to get a very high value, right, corresponding to "oh, this is one, this is my one class"; that's just the big, thick white line. Instead, if you use this uncertainty estimation, you get the network to surround those logits with a kind of blurry, foggy shadow. And therefore, if you apply the sigmoid, the hard flip from zero to one gets smoothed out, right? You no longer say "it's one"; you say "it's one with some specific probability." And here I'm showing you a network that was trained on MNIST, and then you provide a one that is tilting. You can see that it begins with a high logit value for the purple curve, for the one, and then as you tilt it, it becomes like a five and then becomes a seven, because a tilted one looks like part of a seven, right? And these are the outputs after the soft argmax. You see that after you tilt, they get very blurry and spread out. So how can we get something like that? And this is the other notebook; we are done here with the regularization.
Let me give you the final thing. Here you can see that with dropout, the validation and training curves lie one on top of the other. And then this was the L2 regularization. I can execute this other one, which shows you that this keeps increasing, right? So although the model is over-parameterized, we are not overfitting, which was the case at the beginning. Finally, let's store these weights as the dropout version, okay? So I've saved all of them, and I can start showing you a few things. For example, this one, let's see if it works. Boom. Here you can see the red histogram is the L1 case, and the red weights are basically all at the centre, all pushed toward zero, right? With L1, I'm just showing you the histogram of the weights: when I train the network with the L1 regularizer, almost all of them end up here at zero. In the purple case, the L2, it actually looks like the peak at zero is higher. I'm not entirely sure why you'd have a higher peak at zero with L2, but the purple one also has some values here in the tails. Whereas if there is no regularization, you get something resembling a much more spread-out Gaussian, right? You get values that are much, much larger, okay? The L1 ones instead should all be squeezed very, very short. Again, I'm not sure why this purple is taller than the red here; I think it's an issue. So I'm showing you the weights. We can show the individual ones: L1, basically all here; and these instead are the ones with nothing, without regularization. We can also use more bins to get a better understanding of what's going on. Okay, see? Boom, fantastic, right? I can show you also the L2 weights, L2 versus L1. Oh, can you tell the difference? But again, there are 100,000 weights in each. Not entirely sure.
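The comparison being shown could be reproduced in miniature: train the same layer under each penalty and look at the resulting weights (the data, layer size, λ, and step count below are all assumptions; the notebook trains a real sentiment model instead):

```python
import torch
import torch.nn as nn

def train(reg=None, lam=1e-2, steps=500):
    """Fit a linear layer on random data under no / L1 / L2 penalty;
    return the trained weights for histogram comparison."""
    torch.manual_seed(0)  # same data and init for every run
    x, y = torch.randn(256, 50), torch.randn(256, 1)
    fc = nn.Linear(50, 1)
    opt = torch.optim.SGD(fc.parameters(), lr=0.1,
                          weight_decay=lam if reg == "l2" else 0.0)
    for _ in range(steps):
        loss = nn.functional.mse_loss(fc(x), y)
        if reg == "l1":
            loss = loss + lam * fc.weight.abs().sum()  # manual L1 term
        opt.zero_grad()
        loss.backward()
        opt.step()
    return fc.weight.detach().flatten()

w_none, w_l1, w_l2 = train(None), train("l1"), train("l2")
# L1 drives most weights toward exactly zero; L2 shrinks all of them a bit.
print(w_none.abs().mean(), w_l1.abs().mean(), w_l2.abs().mean())
```

Plotting the three tensors with `plt.hist` would reproduce the picture discussed above: a spread-out Gaussian-looking histogram for no regularization, a spike at zero for L1, and a narrower bump for L2.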
But the point is that with L1 you have many more weights clustered at zero, yet a few larger weights survive. With L2, all the weights are pretty small; can you see? There are no large weights. So L1 doesn't shrink the large weights; it just pushes the rest toward zero, okay? That's why you had this big guy here. Boom. Okay. Finally, I know I'm over time: the last notebook, which is the one that computes uncertainty through the use of dropout, right? So, Kernel, Run All. Where is it? Run All. So what are we doing here? How do we compute the uncertainty from the slides I just showed you? Here we have some points, I try to fit them with my network, and you get something like this. Can you tell me what network I used? Where is the chat? Can you tell what nonlinearity I used? You should know, right? You don't answer? Sure. Okay, ReLU, yeah, really. And here I show you how this uncertainty looks. So what is this? I'm using the network with dropout, and I actually don't switch to evaluation mode: I keep it in training mode such that the dropout is still on. And then I compute the variance of the predictions of the network by sending the data through multiple times, okay? Here you have "for i in range(100)": I just feed my data through the network 100 times. Okay, so this is a network with a ReLU. Let me show you how a network with a hyperbolic tangent behaves. Oh yeah, let me kill this one. So here I create the network, and this is the network trained with the hyperbolic tangent, which is much smoother, right? And again, the network is in train mode, right, but I feed my data points through 100 times and then I evaluate the mean.
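The Monte Carlo dropout trick just described, keeping dropout active at test time, running many stochastic forward passes, and taking the mean and variance, could be sketched like this (the architecture, dropout rate, and pass count are assumptions in the spirit of the notebook):

```python
import torch
import torch.nn as nn

# Small regression net with dropout between hidden layers.
net = nn.Sequential(
    nn.Linear(1, 64), nn.Tanh(), nn.Dropout(p=0.2),
    nn.Linear(64, 64), nn.Tanh(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

x = torch.linspace(-3, 3, 50).unsqueeze(1)

net.train()  # deliberately NOT net.eval(): keep the dropout masks on
with torch.no_grad():
    preds = torch.stack([net(x) for _ in range(100)])  # 100 stochastic passes

mean = preds.mean(dim=0)  # the prediction
std = preds.std(dim=0)    # the uncertainty estimate
print(mean.shape, std.shape)  # torch.Size([50, 1]) torch.Size([50, 1])
```

Each pass samples a different dropout mask, so the spread of the 100 predictions at each input serves as the "foggy shadow" around the prediction; far from the training region, the spread behaves differently depending on the nonlinearity, which is exactly the ReLU-versus-tanh comparison being made here.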
You can see now that the network outputs an uncertainty which is constant even as you move outside the interval where the training data came from. So these uncertainty estimates are a bit funky, in that different activation functions give you different kinds of estimates; they're not even calibrated. Nevertheless, the uncertainty close to the data points is very, very tiny, right? So you can tell how far you are from the training region. And we use this trick here, this part, because this variance is a differentiable function: you can run gradient descent on it in order to minimize the variance, and this allows you to move towards the region where the data points were, basically the training region. This is what we use for our policy, right, in our driving scenario. So that was it, right? We've reached the end of the class, the end of the semester. It was such a great honour to be your teacher this semester. I screwed up a little bit, maybe halfway through; thank you for helping me get back on my feet. If you need anything, really anything, just let me know. I'm always open to discussing, helping out, and explaining. And as I told you before, we can even think about having one more extra lesson in a month's time, if you want, the same way over Zoom, about the energy-based models. Again, if you have any questions about any of the lessons, you can write in the comments below on YouTube and I will answer. And if you're interested in making drawings or visualizations, you should talk to me, because I'm creating a group for visualizing machine learning concepts. We have the website, and we have plenty of things to do: English has to be fixed in many of the contributions, some math is broken, and there are plenty of open-source things to do if you are so inclined.
And yeah, I think that's pretty much it. I'll see you next Monday, right? Again, you should submit the three-minute video presentation. I made a tutorial about how to make a presentation, if you like how I teach and you want to hear my opinion on how you should present your work; it's on YouTube as well. And yeah, I think that's it, all right? So again, thank you so much, and I can't wait to see all your results for the project. See you on Monday. Good luck, bye-bye. Questions about the class? Ah, damn, there was one more notebook. Okay, I can't go over it, I'm too late, right? In the extras there is one more notebook I wanted to talk about, which is the projection notebook. Damn, okay. So maybe we can do an extra lesson with the projection notebook and I can talk about it next week; up to you guys. More questions? I know it's late, and there was this notebook... okay, yeah, you know, I want to keep teaching. Okay, no questions. Oh, there is a question: Google uses Vizier to select hyperparameters for its networks; those tend to be either random search or Gaussian processes for hyperparameter optimization. Yeah, exactly. But I haven't tried them out, so I can't really give you an opinion. I know they exist, but I don't know everything about them yet. Okay, I think that's it, right? Okay, so see you Monday. No, no, thanks, yeah, of course. Why post a lasagne? Oh, I put up the lemon cake. Right, keep the teaching going, yeah, that's for sure. Yann is also teaching in the fall; actually, Yann and Kyunghyun are pairing up and teaching in the fall, and I will also be teaching in the labs, but we haven't yet discussed the content. I'm like, oh boy, more teaching, but it's fun. Okay. So I think that was it for today, unless there are some questions for me or for Yann. I know you send me emails.
I have a few, I think a few hundred, emails from you. I will answer, don't worry. We can figure out what happened, right? Don't freak out. As I told you before, we can have an extra lesson in one month on the energy-based models, whenever I'm done preparing it. Again, this is up to you; it's voluntary, completely off-class, right? Someone asked me to create a lab for the energy-based models and I said yes, and I always keep my word. I didn't manage to do it on time, but I will work on it. Questions? Nope. All right, so it has been an honour. Seriously, I loved teaching you this semester. You had so many questions, and especially when we switched to this online format, I personally loved it, right? At least in my opinion, before, we had Yann lecturing and maybe you were a bit shy. I'm not shy; I mean, I don't care. So I think this format, where you write questions and I just read out whatever is on your mind, really worked well in terms of figuring out which aspects are a little bit harder to grasp, right? Because we may not be able to tell which part is less clear, maybe because we've been talking about these things for a while now. So again, when you write those questions, I read them and we have some kind of conversational presentation; it's much more effective in terms of content delivery, right? Yeah, I want to echo what Alfredo said. It was a pleasure teaching the class as well, despite the circumstances. And I'm very thankful to Alfredo. He's putting his heart into this, as you can tell, and I'm really thankful to him for doing all this work, because I think it makes a huge difference in terms of the usefulness of the class. So thank you, Alfredo. Thank you. And Justin, right?
Justin made the whole challenge. He actually did a huge job, oh my God, this last month. Justin made this competition possible: putting together the data, the basic code, the data loader. He worked on this a lot over the last few months, and then gathered all the results. So thank you, Justin. Yeah, I think it's been two months now that he's been working on this stuff. All right guys, thank you. You can always reach me; just tweet at me, I answer every time. Anything you need, you can find me; my door is always open, in the office or here on Zoom, right? So, as Alfredo said, we have this autonomous driving project, and we need all the help we can get with it. So if you are on one of the top teams and you're interested in participating, get in touch with Alfredo, and you could work on this during the summer or perhaps beyond. All right. Bye-bye, guys. Well done to all the teams. All right, okay, bye-bye guys. Bye.