Welcome back to class — Deep Learning, 2022 fall semester, New York City, 5 p.m. live. Okay, almost 5 p.m., two minutes to 5 p.m. I don't have announcements; I have a review from yesterday, right? All right, so we were talking about supervised learning and how to do it with a neural net. I told you at the beginning, very quickly, that we had this issue of overlapping regions, so we decided to unwarp the data. And we looked at that from different perspectives: we talked about the nomenclature of the data, and then we talked about classification, how to organise this data. Today we're actually going to put this part together so that we understand practically how to do it with PyTorch. So today is going to be the practicum: we're going to see the practical aspect of these things. The other thing we covered was the architecture. We had some basic equations. The C is the spring that pulls the prediction close to the target; the F is the energy, the level of incompatibility between input and output; and the C is the cost you pay for making a bad prediction. Then we introduced the soft argmax and went through the whole understanding of that. Then we talked about the loss — in this case, the loss of the training set — which tells us how bad a parameterisation is for the entire data set. And we chose, for this specific case, the per-sample loss — how bad the parameters are for that specific input — to be equal to F, the energy. That's a choice, and it's called the energy loss. In that case the only square inside the dashed box was the C, so the energy was simply the cost. C was the cross entropy, the negative log probability, and we saw how it goes from zero to infinity. Then we talked about how to train using gradient descent, so that we can minimise the loss — the badness — by starting from a random initial location and eventually finding a good location, and how to find the gradients by backprop. We went through the whole backprop explanation with all the flows of the gradient, which I'm not going to repeat right now because we really covered that part yesterday. We also covered the notebook: yesterday's part was the notebook, which showed you the proof that things are actually done the way I explained, through backpropagation of — what is the subject of backpropagation? — the grad output, very good. Okay. How do we train neural networks? Neural networks are trained by using gradient descent — very good. No, no, no, not backprop: gradient descent. Is gradient descent used only for training models? No. Okay, what else do we use gradient descent for? Inference — yeah, okay, very good. But then, in order to follow the negative direction of the gradient, we need to compute the gradient, and to do that we basically just use the chain rule, for which we have a fancy name: backpropagation. I showed you yesterday: you take the grad output, you multiply it by the Jacobian, and you get the grad input; you just repeat that multiple times until you get all the grad inputs throughout the architecture. And I believe now it's very straightforward, right?
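To make that last point concrete, here is a minimal sketch — not the class notebook, just an assumed toy linear module — showing that one backprop step really is grad_input = Jacobianᵀ times grad_output, the operation you then repeat module by module.

```python
import torch

x = torch.randn(3, requires_grad=True)     # hypothetical input of a tiny linear module
W = torch.randn(2, 3)                      # hypothetical weight matrix
y = W @ x                                  # forward pass of the module: y = W x

grad_output = torch.tensor([1.0, -1.0])    # pretend this arrived from the module above
y.backward(grad_output)                    # autograd multiplies by the Jacobian for us

# For y = W x, the Jacobian with respect to x is W, so grad_input = Wᵀ · grad_output
print(torch.allclose(x.grad, W.T @ grad_output))   # True
```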
I mean, Yann went through it the first time, I showed you a second time, and we went through the code, which gave you tangible evidence of the fact that these are the things being done behind the curtain. I really encourage and recommend that you try to type out the small network I showed you yesterday in class and probe things as we did together, so that you get confident with it. There was a question — there was a question. Okay, let's actually go back to the notebook. So we cd'd into the working directory, then we did conda activate for the environment, and then we launched Jupyter Lab. What is this number, 1.42? How can you compute this number? I have my calculator here, and I'm going to compute that number. You should beat me to it and type in the chat the number I'm about to get on my calculator. Okay, the number I get is this one — you see? Can you get this number? 1.61, 1.6. Can you get 1.6? How do you get this number? What calculation did I do? Kilometres to miles — no. Maybe it's not clear what I'm asking: I'm asking how to get this number here on the screen before executing this cell. Every time you run neural networks — code, things with mathematics — you will use a computer, a graphics card, to speed up the computations, but you always need to know in advance the answer of your computation, more or less, right? Ballpark. So my ballpark is 1.61, or 1.6, and this 1.4 is reasonably close to 1.6. What is that 1.6? Yeah — calculate the loss by hand. So what did I do? There are five spirals, right? Okay: the natural logarithm of five, that is exactly 1.61. Why is that supposed to be the case? At the beginning the model is untrained — this is not trained, I didn't even talk about training, I just generated a random model. The linear output of a random model — let's call it a deep random model — is going to be, we covered this yesterday, zero. Roughly zero. Therefore, if you feed a zero vector into a soft argmax, you get one over K, and in this case K is five. So whenever I compute the cross entropy — what is the cross entropy? how do I compute it? — negative log, yes, of one over five. Or, since the negative flips the inside of the log — you put a minus or remove a minus, and you get to flip the inside of the log — you get log five, natural logarithm. In computer science, log, L-O-G, is the natural logarithm; I know that in engineering LOG stands for base 10, but we're in computer science, and in mathematics and computer science log is the natural one. Anyway, the natural log of five is 1.6, which is my expectation for the initial loss. This is just one sample — for one sample here it comes out around 1.4, 1.5; if I send in, say, 100 samples and average, the initial loss should still be ballpark 1.6. Okay, good. This is important, again, because you need to be able to debug your model — your mathematics, right?
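Here is a minimal sketch of that sanity check (the layer sizes and sample count are assumptions, not the notebook's exact values): an untrained model produces roughly-zero logits, the soft argmax turns them into roughly 1/K per class, and the cross entropy therefore lands around log 5 ≈ 1.61.

```python
import math
import torch
from torch import nn

K = 5                                                          # five spiral classes
model = nn.Sequential(nn.Linear(2, 100), nn.Linear(100, K))    # random, untrained

x = torch.randn(100, 2)                   # 100 hypothetical input points
y = torch.randint(0, K, (100,))           # their (random) labels

loss = nn.CrossEntropyLoss()(model(x), y)  # negative log probability
print(loss.item())                         # ballpark 1.6
print(math.log(K))                         # 1.6094..., the expected value
```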
The only way to debug mathematics on a computer — your computations, your computational code, let's call it that — is to know in advance the ballpark of the output, like a good physicist. Every physicist will always know the number they are after to within an order of magnitude; then they can use computers and calculators to get the precise number. Like Feynman: when he had to answer a question about some quantum electrodynamics calculation, it was "oh, how many digits do you need?" — he always knew in advance roughly what it was. Either you can get a rough estimate, or you can just have a good intuition about these things. Anyway, you have to build up this kind of intuition. Is this a Fermi approximation? It's not a Fermi approximation. Anyway, anyway, going back to the slides: today we're covering the PyTorch aspect — how to get these things trained with a computer. And I guess this is actually the first class where we train something. It took some time. We could have done this on the first day of class, but then you would lack the understanding of all the other things — inference, the machinery behind it — and now, eventually, how to put together these few instructions, which will afterwards seem obvious. I remember that at the beginning of my PhD there were no instructions for how to train a neural network; you just copied someone's code. And I thought: that's horrible, I don't want to copy code, I want to understand first and then write my own code from scratch. That way you can really reproduce something; if you copy someone else's, do you really understand? I don't think so. But anyway: PyTorch, setting up the environment. We start by importing torch, and then, from torch, the neural network package nn and the optim package. The nn package gives us a very convenient way to generate all these different modules, and optim is where we pick our standard Adam or stochastic gradient descent optimiser. Then we choose a device — I think this is actually going to be, or has been, automated, but until recently you had to specify the device if you wanted to train your model on a CPU, on a graphics card, on a tensor processing unit from Google, or on something else. There are also the new Macs, the M1 and so on, where you actually have shared memory between the graphics card and the CPU — is it called MPS? I don't know — and you can send the computations to that accelerator too, which is super cool. Then we have the model, which at the beginning is just a Sequential — just cascading several modules one after the other — and we send it to the correct device. Then I define my cost: since we are doing classification, it's going to be the cross entropy, the negative log of the probability. And then we have an optimiser, where you can specify the type of optimiser; we usually just go with SGD. We will talk about optimisers in the future.
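As a sketch of the setup just described (the layer sizes and learning rate here are my assumptions, not the exact values on the slide):

```python
import torch
from torch import nn, optim

# Pick the device: CPU, or a graphics card if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A Sequential model: modules cascaded one after the other, sent to the device
model = nn.Sequential(
    nn.Linear(2, 100),    # assumed sizes: 2-D input, 100 hidden units
    nn.Linear(100, 5),    # 5 output classes
).to(device)

criterion = nn.CrossEntropyLoss()                    # the cost C: negative log probability
optimiser = optim.SGD(model.parameters(), lr=1e-2)   # assumed learning rate
```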
So there are many degrees of freedom. You have the freedom to choose the type of model — you can design different types of model here — and you can choose different losses. But again, for classification I think it's pretty standard to go with this one, although it's not the only option. And then, for the optimiser, you have many possibilities, but again this is the standard one. So now we have the training loop: five steps. And these are our five steps — I think this is also in the homework, if I'm not mistaken; I can't remember, we'll see. Basically, for every x and y pair in the data set, do the following. First, compute a prediction y tilde from the model — that's step number one, the forward pass, so that you can compute the prediction. Number two is getting the loss through the computation of the energy, which in this case is given through the computation of the cost. All of these are simplified in this case: we just compute the loss given the cost — well, given the energy, to be honest; all of these things are synonyms right now, so it's kind of simplified. So, second point, we compute the loss. Third point — and this is very important and doesn't show up in the mathematics, it hasn't shown up in the mathematics so far — we zero the gradient. What does this mean? Every parameter keeps its grad attribute around unless you clean it up. Whenever you run backprop in line number four, you compute the new grad parameters and add them to whatever you already had before. So we always keep these grad parameters around, and whenever I compute new ones, I sum them to what I had before. In this case, step number three and step number four together are what we actually call backprop — the computation of the grad parameters. But in PyTorch we split these into two things. Well, actually, backward in PyTorch does two things: computation and accumulation. In order to just compute without accumulating, you need to clean up whatever you had before. Okay, this is important; I'll tell you why it's important in two seconds. Finally, once we have computed the gradient — steps three and four give me the gradient — I step in the opposite direction of the gradient. So, repeating: number one, forward propagation; number two, loss computation, meaning how bad my parameters are; numbers three and four are the computation of the gradients — or rather, number three cleans up the previous gradients and number four computes the new ones and sums them to the zeroed ones, basically; number five is stepping in the opposite direction of the gradient. Should we zero out the gradients before the forward computation of the autograd methods? No — you zero out the gradient right before calling backward, and this is basically a semantic reason. Backward in PyTorch does two things: it computes the grad parameters and it accumulates the new value onto the previous one. So there are two operations done by backward. If you precede the backward line with the zero_grad line, those two lines are basically one operation: you can think of the two lines together as simply computing the new gradient, the new grad parameters.
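A minimal sketch of those five steps, reusing the names from the setup sketch above (data_loader is a placeholder for whatever provides the (x, y) pairs):

```python
for x, y in data_loader:                 # placeholder: any iterable of (x, y) pairs
    x, y = x.to(device), y.to(device)

    y_tilde = model(x)                   # 1. forward pass: compute the prediction
    loss = criterion(y_tilde, y)         # 2. compute the loss: how bad the parameters are

    optimiser.zero_grad()                # 3. clear the grad parameters kept from before
    loss.backward()                      # 4. backprop: compute and accumulate new grads
    optimiser.step()                     # 5. step in the opposite direction of the gradient
```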
I've seen many times people putting zero_grad far away from backward, and this hurts me, because those two lines belong together: they have a meaning when they run one on top of the other, which is different if you pull them apart. They end up separated in space while they're acting on — belonging to — the same conceptual thing. Is there any case in which backward is used without zero_grad? Yes, absolutely, good question. But first, before answering that question, Caglar, I will ask you a question: why do we need to accumulate the gradient? Okay, someone mentioned momentum — that's a good call, but that's done in the optimiser. Multiple samples in a batch — that's totally right, Patrick, that's one option. Hongxian says for efficiency — that's the same correct option as Patrick's. Okay, actually, there you go: when we have reuse of the input, the input used multiple times, a convolutional neural network — exactly, you got the correct answer, right? So that's answer number one. Let me just repeat the question for the sake of clarity about what's going on. Question: why are we accumulating the gradients? Answer: because we might be reusing the same module multiple times. In this case, this is a slightly more elaborate diagram — we've seen several diagrams together so far, and this is a bit fancier. This circle and this circle should be green; they are white, which is bugging me, but okay. Anyway, in this case, what is the energy of the system? Do we know? Maybe you don't — I don't think I told you. How do I compute the energy of this specific model? Right now we have four of these boxes; therefore the energy in this case will simply be the summation of all these distinct items. So now you have the first case where the energy is not equal to the cost, but is the sum of all the costs inside the model. This is the first distinction, and it basically addresses the question "why is L equal to F equal to C?". Yesterday I addressed the question of whether L can be different from F, remember? What did I show you? Yesterday, if you consider the negative linear output of the model as being the energy, then the loss was — we don't remember? No one? — this was it, right? So in the case where we consider F to be the negative linear output of the last module, the loss was the difference between the correct energy — the energy at the correct site — and the minimum value, or soft minimum value, that the energy can take. I showed you this eventually when we were computing the gradient for the super-cold case. This was the original case: the loss tries to push down at the correct case. This is the force — I just compute the gradient — so the push down happens at the correct case, and the pull up happens at the lowest value. The fourth item was the lowest, so it gets this force pointing up, whereas the correct case gets the arrow pointing down. And if you keep doing that — okay, in this case here, if I apply the same gradient, what would happen? You're going to get one arrow pointing down of size one and an arrow pointing up of size one, and they cancel out; they basically reach an equilibrium, right?
That is, if you are in the super-cold regime. So this was the gradient — that was just the difference — whereas the loss, maybe I should write here both the loss and the gradient. Okay, I will update the slides with the loss and the gradient, because it will be convenient to look at both equations at the same moment. And so that was it: this is the loss, which is the energy of the correct guy minus the most offending energy. The most offending energy is the mistake — the worst answer the network comes up with, the worst class, the one the model thinks is the correct one. And so that was the outcome: you try to push down the correct one, you pull up the incorrect one, until you reach this value here and we are okay. We had just forgotten, but the understanding is there, right? Yes, okay. So let's go back to the PyTorch thing. We said that here we had a summation of all the items. So I was telling you that the energy in this case, which is called E, is the summation of all these different costs. And we saw yesterday that the loss doesn't necessarily have to be equal to the energy; in this case it was the correct energy minus the most offending energy — that's actually called the perceptron loss. Or, if you do the soft version, it's going to be maximum likelihood, which is the correct energy minus the soft min: minus one over beta times the log of the sum of the exponentials of minus beta times the energies. Okay, that was the recap. Here we said that all these encoders have the same weights — they share all the parameters. So let's say you run backprop: the gradient comes back here, and you compute the first grad parameters, for the first encoder here. Now let's say we did not accumulate by default. When the gradient flows in this direction, through here, and I compute the new grad parameters, I would overwrite what I had before. Then, finally, when the third path sends a gradient in this direction, I would compute the third grad parameters, which would override both of those previous computations — those two values. That's why there is automatic accumulation of the grad parameters: PyTorch doesn't know how many times you use the module, so every time it computes grad parameters, it just sums what it has computed to what had been computed before. Now, the need for clearing the gradient: every time we go around the for loop — the training loop — through this sequence of operations, well, we use the same model, so of course there will be gradients from the previous iteration; but we don't care to keep accumulating the gradients across iterations of this loop, because once I step, I no longer need that gradient. "Do you think that instead of summing the gradients we could average them? If so, how would this change the behaviour?" That would be mathematically incorrect. If you have this expression — let's call it s — equal to w1 times x1 plus w2 times x2, what is the partial derivative of s with respect to x1? Hello? w1, yes, okay, very good. Bam. Then what is the partial of s with respect to x2? w2, okay. So, no big deal. Now let's assume that x1 equals x2, which is going to be x.
And so you simply get that s is (w1 plus w2) times x. And here you can see this as parameter sharing, because you have the same expression as before. But if I now ask you for the partial of s with respect to x, it's going to be the sum of the two gradients. So if you have parameter sharing — meaning you reutilise the same parameter multiple times — then the gradients simply sum: you get this one, which is basically equal to the first one plus the other one. So that was the numerical explanation for the question. But maybe I tricked you, because, as you can see on the screen here, s was simply the sum of these two things; then I set x1 and x2 equal to the same thing, and the two w's summed automatically. So you might say: you're cheating, you showed us a case in which the two things sum; how about a different operation — do the gradients still sum? Okay, let's try, just because you got me curious: does this stuff really work? Let's give a different one a try, so that we are satisfied with our summation of the gradients. So instead of s — maybe a different letter, let's call it r — let r be w times x1 times x2. If you take the partial of r with respect to x1 you get — yes, thank you — w x2. Then the partial of r with respect to x2 gives the opposite: w x1. And now say both of them are the same: x1 equals x2 equals x. Then r is simply w x squared; therefore, if you take dr over dx, you get — what? — 2 w x, which is exactly the same as doing the summation dr/dx1 plus dr/dx2: those two terms give w x plus w x, which is 2 w x. Exactly the same thing. So it does work: we have two demonstrations, two computational — what's the word — arithmetical justifications. I showed you two cases, which doesn't prove it's always true — but it is always true: whenever you have parameter sharing, the gradients sum. So here we go, this is my definitive final answer for this question. Are you happy? Whoever asked this question — I don't know who it was — good question, you made me think.
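If you want to check this with autograd rather than by hand, here is a quick sketch (the values of w1, w2, w and x are arbitrary): the same tensor is reused twice, and the two gradient contributions sum.

```python
import torch

w1, w2 = torch.tensor(2.0), torch.tensor(3.0)
x = torch.tensor(5.0, requires_grad=True)

s = w1 * x + w2 * x          # same x used twice, i.e. x1 = x2 = x
s.backward()
print(x.grad)                # w1 + w2 = 5.0 — the two contributions sum

x.grad = None                # clear it, otherwise the next backward accumulates on top
w = torch.tensor(2.0)
r = w * x * x                # the multiplicative example: r = w·x1·x2 with x1 = x2 = x
r.backward()
print(x.grad)                # w·x2 + w·x1 = 2·w·x = 20.0
```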
Moving on. Question: "After we step towards the negative direction of the gradient, we don't need the gradients anymore, right? So why isn't step automatically zeroing out the gradients?" That's a question for you. Yes, I can repeat the question: after I take a step in the opposite direction of the gradient — after I perform my gradient descent step — I no longer need the gradient, right? Why would I keep around a gradient I have already followed, already stepped along? The gradient was computed at the old location; I followed its negative direction, and now it has no meaning anymore, because I changed my parameters — it was the gradient with respect to that old location. So why isn't the stepping function also deleting — zeroing out — the gradient? Yeah, that's a fun question. Yes, you got it right: it's because of the logging line here. I may want to check, I may want to log all these gradients, right? So if step cleaned things up, then I would have to undo the cleaning whenever I wanted to log things. That's basically the only reason we keep the gradients around. I asked the PyTorch developers why we do it this way, and that was the most agreed-upon answer — and it wasn't an easy question to answer when we asked it last year either. The other reason we accumulate the gradient is the one some of you pointed out: efficiency. Let's say I zero out the gradient now, at the beginning, so that I delete from memory all the previous values I computed. I send my first batch to the model, compute my first batch of predictions, compute my first loss, and backward that loss. Then I take my second batch, compute the new loss, and backward that loss. Then I step, and that's it. Okay, so in this case I backward twice before actually stepping. We will see an example of this, perhaps, in some more advanced code later in the semester. This is another way of breaking things up — and yes, also for distributed training, perhaps, as someone pointed out; that was also correct.
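A sketch of that pattern — two backwards before a single step — with placeholder names (batch_1 and batch_2 stand for any two (x, y) batches; model, criterion and optimiser are as in the earlier sketches):

```python
optimiser.zero_grad()                    # clear the old grad parameters once

x1, y1 = batch_1
loss_1 = criterion(model(x1), y1)
loss_1.backward()                        # grads of the first batch are stored

x2, y2 = batch_2
loss_2 = criterion(model(x2), y2)
loss_2.backward()                        # grads of the second batch are *added* on top

optimiser.step()                         # one step, using the accumulated gradient
```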
Okay, so moving on, we move to the notebooks, because otherwise why am I showing you all these things? So we go to this one here, and today I'll show you maybe two notebooks, depending on how much time I take: one notebook from spring '21 and, if there is time, one notebook from spring '20. The one I'm showing you is the 04 spiral classification one; if we have more time — and I think I will, because it's quick — I'll also show you the regression one. Okay: cd into the working directory, GitHub, NYU deep learning. Then I do conda activate for this environment, because it has a more recent version of PyTorch, and I run Jupyter Lab — this is an alias; the alias for Jupyter Lab is just "jupyter lab". So I import torch, and then I do this thing, which some people hate but I love: in notebooks I really like to have the same variables as in the math. So if I want to write pi — not 3.14 or whatever, the symbol — I just type backslash pi, press Tab, and I get π; I can do alpha, beta, gamma. I really like this because it makes my mind jump less. I import a few visualisation routines and set some defaults — this works in Jupyter Notebook and Jupyter Lab. Here I define the device; we already talked about this. I create my model: I have 1,000 samples, two input dimensions, five classes for the spiral, and 100 hidden units. This is going to be my data — I won't bother explaining these functions, because it's just a concatenation of sines and cosines and so on. I visualise the data, and these are the spirals you have already seen in class, right? Yes, okay. So this is the data we'd like to classify. The class is the colour — one out of five different colours — and the location is just the abscissa and the ordinate, I think they're called in English: the horizontal and vertical components. Okay, let me zoom in, otherwise I can't see anything anymore. Then I define some hyperparameters. I will start with a linear model that goes from two to 100 and from 100 to five. I send the model to the correct device, and I set C to be the cross entropy. I don't do any reduction, which would be taking the average or something like that. I create an optimiser — in this case it's Adam, but it doesn't matter. So here we have the five steps, right? I compute my — I should maybe change the name, because this semester I changed things — this is the linear output of the model; the output is linear, so this is my s, for linear sum. Then I compute the free energy, which is going to be equal to the cost — let me actually run it — I compute the free energy, which is equal to the loss, and then — oh, okay, that's why I didn't do the reduction; that's actually something, right? So here I send the full data set x into the model every single time — all the points. So this is actually full-batch gradient descent — gradient descent without the stochastic part. 2,000 steps, whatever: I send the full batch in 2,000 times, then I compute F and C, which give the energy of each pair. So F, remember, is the level of incompatibility between x and y: if I have capital-P items in x, then I will have capital-P F's — one F, one level of incompatibility, per x–y pair. Then, we said, remember, this F was going to be equal to the per-sample loss — maybe I should use the calligraphic L here — and the calligraphic L was the average of all these per-sample losses. Anyway, yes, I understand the notation is a bit funky here. I zero the grad; I do backpropagation, which is computation of the grad parameters plus accumulation — but since I zeroed out, there is no accumulation — and I do the step. And then we reach an accuracy of 0.5. What is 0.5? Tell me in the chat: is it chance, better than chance, worse than chance? Okay, now someone with the correct answer — yeah, how much is chance? 0.2 is random, right? Why 0.2? Because we have five classes. Okay, fine, I know 0.5 looks all random, but it's not — anyway, that's just me joking too much, sorry. So you can print the model — and let's forget about the warning, I did report the warning. And here you have — what's happening here? Anyone? Yeah, linear boundaries. Why are there linear boundaries? Why? Because there are no activations — that's a good answer, yes. So now I show you the cross-entropy energy. These are going to be the energies for my linear model — this is just the cross entropy of a trained linear model; this is how things look when you train them with a linear model. Now let me show you instead how the other one looks: instead of using the cross entropy, I show you the negative linear sum. Oh — what happened here? What am I showing you right now? This is the energy for the negative linear output. Okay, who can describe this? What's happening here, what am I showing you? Anyone? Can someone describe this picture, please? So these are different classes here — I'm picking different classes: this is the energy for the last class, for the second-to-last class, and so on — the energy associated with different colours. Yes — but "linear energy boundaries": it's not linear, and these aren't boundaries. What are these lines, these white lines? I'll show you. Do you know contour lines? Yes — so what do these contour lines look like? Describe them objectively. "They are positive and negative" — okay, that's correct.
But give me another adjective, please. "They are straight lines, and parallel" — there you go, okay, very good. This means that what I'm showing you right now is — what's the noun — a plane. Yeah, okay, good, you're following. Okay, now let's make things a bit fancier. Let's add a positive part for the 100 units in the hidden layer — and let me go back to the correct zoom factor, otherwise I can't see anything here. So here I just say that all the negative values in the hidden layer are going to be set to zero. This is very brutal, right? You'd think that on average 50% of the hidden units would be positive; I chop off all the negative ones and set them to zero. I think it's just harsh, very harsh — but let's do it. Let's train again, and let's plot the model. No, not the embedding, I can't plot that right now; let's plot — okay, we can start with the negative free energy, just to compare with the plots of the planes we just saw. Okay, this was the data; here it's training; okay, there we go. So this is going to be our model. What is the peculiarity of this model — what type of model is it? Piecewise linear — that's correct, Jack. Why is it piecewise linear? Because of the positive part? Yeah, okay. And here I show you instead how the negative linear sum — the negative linear-output energy — looks for a trained model. You can see now how these lines are no longer straight nor parallel, so this is no longer a plane, and they are basically following this line, somehow — the manifold shape; the energies are now shaped. So, question for you, just to make sure we are on the same page: how did we shape the energy function? This is a function, an energy function — how did we shape it? By — no, no, no, even before that: the energy function was linear before, right? I'm not asking how we made the energy function non-linear; I'm asking how we shaped it. Even the plane — okay, maybe the terminology is loose here — even the plane had been oriented correctly: each plane before was oriented towards the correct class, such that we would still get the 0.5 accuracy eventually in the final score. But how did we change the orientation — how did we change the weights? Right, gradient descent — of what? We are descending what? The loss, okay. So, if we put the full sentence together: we shape the energy function by minimisation of the loss functional. Okay, that's the full statement here. So whenever we minimise a loss functional — or a loss function: you call it a functional if you think of it as a function of the energy function, and a function if you think of it as a function of the parameters. We usually talk about a loss function when we think about changing, or finding better, parameters — minimising the badness of the parameters — but in this case we'd like to think about minimising a loss functional, such that we can shape the energy function so that it becomes well behaved. Okay, again, this is terminology.
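For reference, the only model change described above — inserting the positive part between the two linear layers — would look roughly like this (same assumed sizes as before); everything else, the five-step loop included, stays identical.

```python
model = nn.Sequential(
    nn.Linear(2, 100),
    nn.ReLU(),            # the positive part: every negative hidden value is set to zero
    nn.Linear(100, 5),
).to(device)
# ...then re-run exactly the same five-step training loop as before.
```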
Now let me show you the other one again, compared with the logit one: I go back to the cross-entropy one and comment this one out, and this is going to be the cross-entropy energy. Okay, so the cross entropy is flat here, and then it basically goes up linearly, outwards. And these are going to be the different classes. Again, all of these are scalar functions, so all these scalar functions can be considered as some sort of energy. This one looks nicer than the other one, but it's not necessarily more powerful, because here the loss is just the energy loss, whereas when we considered the other energy, the loss was contrastive and it had a nice gradient as well. "So the minimum of the functional is a function?" Yeah, correct. But again, eventually what you're changing are the parameters, because the energy function is a parametric function in terms of the weights of the model. So eventually you're just doing standard minimisation over vectors; it's just, again, a different way of thinking about things. Okay, last thing — I don't know if it's going to work, because this came from a pull request from someone on the internet. So let's uncomment this line and comment this one, and let's see whether it works; I don't know whether it will or not. If it works, I will give a shout-out to the author. So we train this model; the only thing I changed is that, instead of going directly to five classes, I go through two dimensions in the middle. And then — okay, there you go — that was basically the output of my video from last week, when I showed you how to look at the unwarped data points. So you can actually look into this code if you'd like to know more about how I made that animation. All right, we do have four minutes, so I will cover the other notebook, which is very simple; it won't require too much effort. It's going to be exactly the same, but for regression instead of classification. So we cd into the pDL directory, then I do — git pull, maybe? I don't know; yeah, git status — okay, it looks good. Then we do conda activate for the environment, and then Jupyter Lab. So here I go through the same things: defaults, blah blah blah, random seed, all the same; I'll just show you the differences so that you don't waste your time. These are my data points, which somehow follow this funky function; I will try to regress this function with a neural network. Okay, so we start with the initial model, which is going to be a linear model; then I will train a two-layer neural network; then — I don't know what this is, so I just execute everything, and then I check. All right. So at the beginning here I train a model which, as you can see from these lines of code over here, is just a linear model. Okay, so as I've done for classification, I do the same for regression: train a linear model with the same five steps we listed before — prediction, loss computation, zero gradient, backward, and then step in the opposite direction. Same thing. Eventually you learn a line that cuts through your data points, which is not that exciting: we just got linear regression with a neural network. And then we do exactly the same thing as before, but now I have this distinction here for the five models, or ten models, I don't remember: half of the time I will train the model with a positive part, and the other half of the time I will use a hyperbolic tangent as the activation function.
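The two variants just mentioned would look roughly like this (a sketch; the hidden size is an assumption, and the regression model maps one input dimension to one output):

```python
# Half of the models: two-layer network with the positive part (ReLU)
relu_model = nn.Sequential(nn.Linear(1, 100), nn.ReLU(), nn.Linear(100, 1))

# The other half: same architecture, but with a hyperbolic tangent activation
tanh_model = nn.Sequential(nn.Linear(1, 100), nn.Tanh(), nn.Linear(100, 1))
```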
As I think I already told you, the positive part is more powerful because it allows more flexibility; the tanh, the hyperbolic tangent, is a bit smoother and less powerful — it gives you less freedom. So, while this is training — it's still training, I still have 30 seconds — okay, there you go, finished. So, okay, very good. This is the variance across the untrained models: at the beginning the models were untrained, and therefore the predictions, when I test them, are going to be completely bad. Afterwards, when they are trained, you get the following. Okay. In this case, here, you have the ReLU network on the left-hand side; as you can tell, it's going to be this piecewise linear approximation of your data. And then you have the standard deviation — the variance: here I show you 30 times the variance, which stays stuck at zero throughout the whole domain, so over the training region the variance is zero. Then I show you 10 times the standard deviation, which is somewhat bumpy, but still very low wherever you have many data points. This is an important thing — you know why? Because whenever you have, say, classification, you always deal with these probabilities: you can take the amplitude of the probability to tell how likely the model thinks the given class is the correct one. But when you do regression, how do you know what the confidence of the model is? Now you have a new tool, okay? You train a bunch of models, and you compute the variance across the multiple models. On the right-hand side, instead, you have the variance and the standard deviation for a hyperbolic tangent model — pretty okay, right? But let's do one thing, and then I'll say goodbye. Let's zoom out: instead of a zoom factor of four, I zoom out four times. Say "oh" in the chat — now say "oh"! Okay, very good. What happened here? What? Okay, yeah, sure: what happened here? As we leave the minus-one-to-plus-one region of the training data, these models start to disagree on what the correct prediction is — because, guess what, there is no data outside the data regime, the data interval. Therefore, the level of disagreement between multiple models trained on the same data can be used as a proxy to estimate how certain or uncertain a given prediction is. Okay, this is a really good, strong result that you can use for your regression problems. We used this in a paper as a cost: this was my cost for steering away from dangerous regions in a self-driving car. Whenever we measured a high variance, it meant the model was basically just randomly guessing; and then, guess what, you can run backpropagation from the variance, and therefore you get a force telling you to move away from regions of high uncertainty, where your models really have no clue what's happening. And those might be dangerous regions, because the data collection — again, this was for autonomous driving — never happened in those locations, because, guess what, no expert driver has ever driven in those areas: that's the dangerous zone. But now you have an arrow telling you: ah, danger zone, go away, go away.
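A minimal sketch of this uncertainty trick (the data, sizes, and number of models are assumptions, not the notebook's values): train several identical models on the same data from different random initialisations, then use the variance of their predictions as a confidence proxy — small inside the training interval, growing where there is no data.

```python
import torch
from torch import nn, optim

def make_model():
    return nn.Sequential(nn.Linear(1, 100), nn.ReLU(), nn.Linear(100, 1))

x = torch.linspace(-1, 1, 100).unsqueeze(1)          # training interval [-1, 1]
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)     # a hypothetical noisy target

models = [make_model() for _ in range(10)]           # same data, different random inits
for m in models:
    opt = optim.Adam(m.parameters(), lr=1e-2)
    for _ in range(1_000):
        loss = nn.functional.mse_loss(m(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

x_test = torch.linspace(-4, 4, 200).unsqueeze(1)     # zoom out past the training data
with torch.no_grad():
    preds = torch.stack([m(x_test) for m in models])
variance = preds.var(dim=0)                          # ~0 inside [-1, 1], large outside
```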
Okay, that was it. Thank you for staying three minutes longer with me; I hope you had a nice, enjoyable afternoon with me today — and yesterday, of course. Please go over the slides I put on Google Drive, so that you have the time to digest these concepts. Check out the video if you want — I don't think you should, because it's a waste of time, I think. Anyway, thank you again so much; I really had fun seeing you today, and I'll see you next week. Okay, bye.