So we share the screen and I'm opening the chat. All right, I have the chat open so you can interact with me. A small recap from last time: last time we talked about energy, and actually we talked about inference — how to find z, how to find y check, how to compute ỹ, F and E, okay? So let me just start with the last slide from last time. We had computed this F infinity, which is called the zero-temperature-limit free energy, as a function of my y. And y is going to be a two-dimensional vector, right? So whenever I plot this F infinity of y, it's going to be a scalar field, meaning a height over a 2D region, okay? And since it has different heights, I'm going to represent height equal to zero with the colour purple, then green for one half, and then everything at or above free energy equal to one is going to be in yellow, okay? And this is how this stuff looks. I would like to remind you that this free energy was the squared Euclidean distance from the model manifold, right? So all points that are on the model manifold have zero cost — sorry, zero energy, free energy — because the distance between them and the manifold is zero, and zero squared is zero. And as you move away, it increases quadratically. So far everything should be known and understood, and you had one week to go over this stuff, so I assume everyone is quite familiar. Something you may notice right now is that inside this ellipse there is a region that is slightly lighter, right? You can see a lighter shade of purple. So what's going on over there? Let me show you this image here with the height proportional to the actual height of this free energy, okay? I'm going to change the colour map so that you can clearly see what's going on, and I'm going to be using this one, which is called coolwarm. So cool, meaning F infinity equals zero, is going to be in blue; for F infinity equals 0.5 I'm going to be using grey; and then everything at or above F infinity equal to one is going to be in red. So this is the image you saw before, simply seen from the top. Here I'm going to show you the contours: each line here shares the same value of the free energy, okay? Let me spin this little guy so that you can see all around. As you can tell, all the region around the ellipse — the model-manifold ellipse — has zero energy, and as you move away from that, you get a quadratic thing, right? You get a parabola. What you notice is that on the outside, of course, it's going to be a parabola, but in the centre those two branches go up to a peak, right? And this might or might not be wanted. So we are going to start today's lesson by learning how to relax this free energy, this zero-temperature-limit free energy, into a free energy without local minima, such that it's a bit smoother. Let me take a cross-section of this bathtub at y1 equals zero. So I'm going to chop it in correspondence of y1 equals zero, and what we get is the following. You can see now those two branches, which are my parabolic branches, right? So again, what is this free energy? The free energy was the squared distance of your given point to the closest point on the manifold, right?
So if you're on the manifold — which in this cross-section is at location 0.4, for example — then the distance between you and the manifold is zero, and therefore the square of zero is zero. As you move away, say to the right-hand side of this 0.4, as you move linearly, the free energy increases quadratically, right? That's why we observe this free energy going up quadratically. Similarly it happens on the other side. And of course the same happens as you move towards zero: as you move towards zero, you're trying to climb up that parabola, and we get this peak over here. So in the next slide we're going to learn how to smooth that peak. I'll tell you later why this is very useful, why we might want to do so, okay? So, the free energy, we know, is the minimum value of the energy E spanning across z for a given y, right? You have this energy; we saw that for a given y we have an energy over z, and the free energy was the value of the energy at the location where we have the minimum, right? So the minimum value of this E is my free energy. Now I'm going to introduce a relaxed version, which is this purple F. This purple F, parameterised by beta, is simply this expression. What is this beta? In physics it's called the inverse temperature, the thermodynamic beta, or the coldness, and it's simply one over k_B T — the Boltzmann constant multiplied by the temperature, okay? So if capital T, the temperature, is very, very high — it's very warm, you're on the sun — beta is going to be extremely small, basically zero. Instead, if the temperature is cold, like zero kelvin, then beta is plus infinity, right? And so now you can understand why I call my F infinity the zero-temperature-limit free energy: zero temperature, super cold, capital T is zero, meaning beta is plus infinity. So if you have this zero-temperature, so-called cold free energy, the free energy is exactly the minimum. Otherwise, if you relax this constraint, as you warm up this free energy a little bit, the free energy becomes a summation of multiple things, right? This S-shaped symbol here, the integral sign, is an S for sum: it's a summation of all these components here multiplied by the interval. And this symbol over here is simply the measure of the domain of Z: in our case z goes from zero to two pi, and therefore this item over here simply means two pi, okay? All right. Who remembers what this k_B T is? What is this k_B T? Why are we talking about energies? From physics 101 you might remember that the average translational kinetic energy was three-halves k_B T, no? And therefore k_B T, or three-halves k_B T, expresses the kinetic energy of, let's say, a gas with all those particles. So the temperature allows you to express the energy: temperature and energy are connected. So you can make a quick dimensional check here: beta, since it's the inverse of k_B T, is in one over joules, right? And here we have this one over beta in front, which means this stuff is in joules, therefore F is an energy. And then inside this exponential we have one over joules times E, which is in joules.
And then if you multiply the two, the units cancel out, so everything works just fine. Also, yes, the dimension of z cancels out with the dimension of dz, right? So everything is just a pure number. Okay, again, this is not machine learning, this is physics — just to give you a little overview about where this stuff comes from. This is just from our friends in the physics department. All right. So I want to compute this relaxed version of the free energy. Since I don't want to compute this integral — I may not even know how to do that — I simply use a simple discretisation, right? I replace this Latin S, the integral sign, with a Greek sigma, a summation, and I replace the Latin d with a Greek delta. Everything else stays the same. So I go from the continuous version to a discretisation — a very simple discretisation that works in our case because z is one-dimensional, so everything is pretty easy. Moreover — and pay attention, I am defining this right now for this class, okay? — I define this thing to be the softmin of E. So my free energy, the purple one, the relaxation of the zero-temperature limit, is simply this softmin. And the zero-temperature, super-cold one is simply the min, okay? M-I-N, min. Whereas if I relax it, if I turn on the temperature, like I turn up the thermostat, I get this softmin, which is this log of a summation of exponentials, okay? And I call this the actual softmin. Why do I call it the actual softmin? Because most people outside this class use the name softmin for something else. I'll tell you a bit more about this in a few slides, okay? Something that is super interesting is computing the limit of this free energy for beta that goes to zero. So whenever you increase the temperature — like the temperature on the sun, super warm — what is the most relaxed version of this min? If you do that, you'll see that this stuff ends up being the average. But again, the derivation is not too important; I just show it here so you have access to it later. The limit of this free energy for beta that goes to zero — very warm, super warm — ends up being simply the average of the energy across those z's, okay? Again, you don't have to get scared about the math. All right, so let's compute this free energy for the cases we saw before. We are still doing inference, as last time, but instead of using the cold free energy we're going to use this relaxed version, for y[23]. If you remember, y[23] was this x, the green x on the right-hand side. Back then, the free energy was the square of the distance between the blue x and the green x, right? The distance was 0.5, so the square was 0.25, and that would have been the free energy, the zero-temperature-limit free energy. But now we have to consider all these contributions, and so I'm going to show you how all those little z's contribute to this free energy. We choose beta equal to one, and we have this: given that y prime is this x on the right-hand side, my free energy now comes from the addition of all these terms here, the exponential of minus the energy for each of them — all the exponentials of the negative squared distances, right?
So as you can tell, those points that are close to the x have a smaller energy, and therefore the exponential is larger — that's why you can see them. But for energies that are further away, very high energies, you take the exponential of minus a large number and you get basically zero, so they don't count in this summation, in this integral. Okay, first question for people at home, just to check if you are following: where does 0.75 come from? Where does this value over here come from? You're supposed to type in the chat, so that I can read aloud what you're saying. So I'm asking once again, where does this value over here, 0.75, come from? Someone has to reply. "Contribution to the energy" — yes, yes, but no, the number 0.75. I need you to tell me how to compute 0.75. Where does that number come from? "You have the closest ỹ there." Yeah, tell me, how do I compute it? "One over two pi" — no. "Exp" — okay, exp of minus beta E, okay. So how much is E? E is the squared distance, right? So how much is it? Okay, E is 0.25, correct. And so e to the minus 0.25 is going to be roughly 0.75, correct. Okay, so Jessie got the right answer. Good job. Great, now we know where that number comes from. So every time you see this diagram — although it looks very sparse and pretty and whatever — you always have to pay attention to the numbers I put on the screen, right? Those numbers are not random numbers; they are computed by my computer. And you always, always, always have to check on a piece of paper that these numbers make sense, because if they don't make sense, then you're not understanding what's going on, okay? So you have to pay attention to the numbers. And, you know, I'm a physicist, right? So I always have in advance in my mind the answer that my program, my network, my whatever is supposed to give, right? If I make an electronic circuit, I must know in advance what the voltage is, here and there, before I actually measure it. Otherwise you don't go much ahead. All right. So let's move on and consider instead the case where my y prime is y[10], the tenth item, which is the element at the top there. In this case, all those points here contribute to the free energy, and we get the number 0.26. Okay, someone else who is not JC can write in the chat where that number comes from. So where does 0.26 come from? I think you must have understood by now. "e to the minus one" — kind of, yes. The distance here is 1.1, 1.1 squared is about 1.2, and then you take e to the minus 1.2, which is roughly 0.26. Yeah, that's correct. All right. Okay, so next question: what happens now if my y prime is the origin? For the zero temperature you get the squared distance, right? From either side. In this case, what is the main difference if you warm up the temperature? So it's not zero temperature, it's not freezing cold, we are increasing it a little bit — how does this free energy change from before? Anyone can type in the chat. "It's symmetric." Yeah, that's perfect. How do you know? Did you already see the slides, or did you actually get it right? Okay, I assume you got it right. All right, okay, that's perfect. Yes, it's symmetric, right? So a point now inside — oh, okay, yeah.
I don't know if it's a he or a she, but they studied physics in undergrad — okay, cool. All right. So in this case, again, all those points on the top and on the bottom contribute to the free energy, given that I choose y prime to be in the centre, okay? All right, so that's pretty much — oh, but why are we talking about this, right? We came here because we had that issue, no? With the peaky centre, right? I showed you before that spinning bathtub and then the cross-section, where we had this peaky thing, which was coming from the cold free energy. Let me show you what happens now if I choose the warm free energy. If I do that, I get — if I can scroll my screen... ta-da! You don't see anything. Okay, let me click. Click, okay. All right: the red one was the super-cold one. Beta is the coldness, again, so a large beta is cold. And then we reduce the coldness, so we increase the temperature, and as you can see the peaky part becomes smoother, smoother, smoother, until it becomes — oh — a parabola with a single global minimum. Remember what happens if beta goes to zero? You get the average, right? So you actually recover the MSE. Okay, I'm just giving small information bits, pills, whatever. But again, whenever we increase the temperature, we relax until we get just one single minimum, and then there is no more latent, because we just average everything out without those weights, right? Anyhow, I think if you need to implement this stuff in PyTorch now, you're going to get quite frustrated, because they use different names for the things I just defined. And someone will say, oh, you should have used their names. No, because those are wrong, right? So I use the correct names, the ones that make sense — or at least I'll try to sell them to you this way. So let me explain a little bit of the nomenclature I use, such that it makes sense, at least to me; otherwise things don't make sense to me. This is the actual softmax, right? Not the softmax that people talk about outside this class — this is the actual softmax, which is this: one over beta times the log of the sum of the exponentials. I just expanded the previous expression: I took the one over the measure of Z out of the logarithm, so I just split the two terms. So how do we implement this stuff in PyTorch? Well, you just use this function, which is called torch.logsumexp, which is this actual softmax, right? And then plus or minus that additional constant over there. This is how you want to implement it, because it's numerically stable. Moreover, this is the actual definition of the actual softmin, and you can see this is what I wrote before. You can think about it: it's very similar to the actual softmax, right — what's the only difference? There are two minuses, right? And you can get away with that: you put a minus in front, so you cancel the first minus, and you put a minus inside, so you cancel the other minus. And so the softmin can simply be implemented as a softmax with the two minuses, okay? Again, the actual softmax. And then someone, of course, is going to ask: but what is the softmax we use in class every time? Well, that one is actually the soft argmax, right? Why is that? Because an argmax is like a one-hot vector, and the one tells you the index of the element that has the maximum value, right?
So the max retrieves the maximum value, you know? And then the argmax tells you the index pointing to that maximum value, right? So the argmax is a one-hot vector, and the max is a scalar. Similarly, whenever I compute the softmax, the softer version of the max, this max is not just the max any more: it's the logarithm of the summation of the exponentials, right? And you can change the temperature: if you make the temperature super cold, you retrieve the max; if you warm up the temperature, you get something more like a weighted summation. And the same for the soft argmax — the argmax was the one-hot: if it's super cold, it's still one-hot, but if you warm up the temperature, you get a distribution, a probability distribution, right? So whenever someone says, oh, the softmax gives you a probability distribution — no, that's the soft argmax, okay? The argmax, the zero-temperature limit, gives you the one-hot; if you increase the temperature, you get a distribution. So finally, these are the correct names no one is using but me. I hope I didn't create confusion; if I did, sorry. But still, this is the correct way of seeing these things, okay? Because it makes sense, right? If you have a function and you want to find the max, it's there. If you have this function and you want to find the min, you can take the function, flip it, find the max, then flip it back again, and you get the min, right? That's what I show you here: the softmin is simply the flipped version, the negative, of the softmax with the flipped argument, okay? All right. Enough of me talking about mathematics and names. I hope it was fine. So this was the part that concludes the last lesson, right? This is the end of inference. And we figured out that there is the free energy — there is a very cold one, a warm version, and a very hot version. The hot version is the average. The warm version is something you may like: it's this marginalisation of the latent. And the super-cold version, the zero-temperature limit, is exactly the minimum value. What I showed you was that this model is a very poorly trained model, because those low-energy regions were not sitting around the training set, right? So let me show you once again the same diagram I showed you at the beginning of today's lesson, which is this one over here. Here I show you, with these white x's, a few samples on the model manifold. And then the y's, the blue points, are the training samples — but we never used the training samples, right? I only used them to compute the energy, the free energy; we never used them to learn, because we didn't talk about learning. We talked about inference so far, right? And so, guess what is going to be the next part of today's lesson? You guessed it right: training. So now we're going to start to learn how to train — learn how to train, train how to learn... no, learn how to train energy-based models, okay? Unless there are questions for me in the chat. No questions, everything clear? "Meta-learning?" — no, that one is a different subject, another time. All right, okay. So I think, yeah, there is no big deal, right? This is just inference; we didn't talk about any crazy stuff, and we talked about inference the whole of last lesson.
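Just so you can check the numbers yourself, here is a minimal PyTorch sketch of the two free energies for the toy ellipse model. Only the formulas come from the slides — the min over z and the softmin, implemented with torch.logsumexp for numerical stability; the decoder, the radii, the latent grid and the query point are my own assumptions for illustration.

```python
import torch

# Toy decoder: maps a latent angle z to a point on the model manifold (an ellipse).
# w1, w2 are the only two trainable parameters (x-radius and y-radius); values assumed.
w = torch.tensor([1.0, 0.5], requires_grad=True)

def decoder(z):
    # g(z) = (w1 cos z, w2 sin z), shape (len(z), 2)
    return torch.stack((w[0] * torch.cos(z), w[1] * torch.sin(z)), dim=1)

def energy(y, z):
    # E(y, z) = ||y - g(z)||^2, one value per latent sample z
    return ((y - decoder(z)) ** 2).sum(dim=1)

z_grid = torch.arange(0, 2 * torch.pi, torch.pi / 24)  # discretised latent, [0, 2π)

def free_energy_zero_T(y):
    # zero-temperature limit: F_inf(y) = min_z E(y, z)
    return energy(y, z_grid).min()

def free_energy_beta(y, beta=1.0):
    # warm free energy, the actual softmin of E:
    # F_beta(y) = -1/beta * log( 1/N * sum_z exp(-beta E(y, z)) )
    E = energy(y, z_grid)
    log_N = torch.log(torch.tensor(float(len(z_grid))))
    return -(torch.logsumexp(-beta * E, dim=0) - log_N) / beta

y_query = torch.tensor([1.5, 0.0])  # hypothetical query point
print(free_energy_zero_T(y_query), free_energy_beta(y_query, beta=1.0))
```

With a very large beta this softmin approaches the plain min, and with beta near zero it approaches the plain average, matching the two limits discussed above.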
So I guess we can move on and start the training: finding a well-behaved energy function, right? What does this mean? It means we have to introduce a loss functional. What's a loss functional? Well, it's a metric, a scalar-valued function that tells you how good your energy function is, right? We have an energy function, this free energy, and then we have a function of my function, which gives me a scalar telling me how good this energy function is. So a loss functional gives me a scalar given that I feed it a function. And here I just show you that if I have this curly L as the loss functional for the whole batch, my whole dataset, I can also express it as the average of these per-sample loss functionals, okay? So I just take the average of the per-sample loss functionals. Cool. So what the heck am I talking about, right? I'm making so much hype but I haven't told you anything so far, and we already know this stuff from machine learning and previous lessons. So here we go with the first loss functional, which is the energy loss functional. This energy loss functional is simply the free energy F evaluated at my y, where y is a data point from the dataset, right? Whenever we train these models, we are minimising the loss functional. And in this case the loss functional is just the free energy at the training point. Of course, right? I mean, what does this energy function have to do? The free energy should be small for data that comes from the training distribution and large elsewhere, right? And what is the easiest way to do that? Well, we just let the loss functional be the free energy evaluated at the training point. So if it's larger than zero, then training the network — changing the parameters so as to minimise the loss functional — is going to squeeze down the free energy at those points, right? So you have a point, you have a free energy, boom; point, free energy, bam, all right? We are reducing the free energy in correspondence with all these y's. And these y's carry a check: there is a check because I want to emphasise that we are trying to push the energy down at those locations, right? So I push down — the check is like an arrow pointing down — I push down. All right, okay, I might sound silly, but that doesn't matter, I like myself silly. Now instead we're going to introduce these contrastive methods. What is a contrastive method? In this case, the contrastive method has a y check, which is blue — y check is blue because it's cold, right? We want low energy there; again, energy and temperature are connected, so low energy is cold, blue. And then I have a y hat; y hat is red because we want to increase the energy there — that's why the hat is pointing upwards. And so in this case, given that m is a positive number, the network will try to make the difference F of y hat minus F of y check larger than m, right? For as long as the difference is smaller than m, this value over here has a positive value; whenever F of y hat minus F of y check becomes larger than m, the output of this stuff is zero, okay? Because of the positive part. So again, this hinge loss simply tries to make that difference larger than the margin m.
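As a hedged sketch of what these per-sample loss functionals might look like in PyTorch — the function names and the margin default are mine, and free_energy stands for any of the free energies defined above; the last function is the smooth variant discussed right after this:

```python
import torch
import torch.nn.functional as F_nn

def energy_loss(free_energy, y_check):
    # per-sample energy loss: the free energy at the training point y_check;
    # minimising it pushes the energy down at that location
    return free_energy(y_check)

def hinge_loss(free_energy, y_check, y_hat, m=1.0):
    # contrastive hinge loss: positive (keep pushing) while F(y_hat) - F(y_check) < m,
    # zero (stop pushing) once the margin is exceeded: [m - (F(y_hat) - F(y_check))]^+
    return torch.relu(m - (free_energy(y_hat) - free_energy(y_check)))

def log_loss(free_energy, y_check, y_hat):
    # smooth ("soft") margin, discussed next: log(1 + exp(F(y_check) - F(y_hat)))
    return F_nn.softplus(free_energy(y_check) - free_energy(y_hat))
```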
In order to have a smoother version of this margin — the hinge is very binary, right: below the margin you push, above the margin you stop pushing — you can use this other version, the log loss functional, which is a smooth margin. You can see it, right? Whenever what's inside these parentheses is a very negative number — say this term is very large and this one is zero — you get the exp of a very negative number, which is roughly zero, and then the log of one, which is zero: stop pushing, there is no more. Instead, if this term here is large and this one is maybe negative or zero, you get the exponential of a large number, the one gets neglected, you get the log of that exp, and the loss ends up basically proportional to the energy difference, right, when it's very large. Cool, cool, cool. But again, for our case we just have a very tiny one-dimensional latent, so we don't need to do this. This contrastive sampling, contrastive learning, is necessary whenever you have maybe a high-dimensional latent and so on. So let's just train this model, because I haven't trained it so far, with this energy loss functional, okay? And so I train this model. It takes one epoch to converge — it's ridiculously fast. Okay, but it's a toy example, so you understand that. I'm going to start by showing you the zero-temperature limit, the super-cold free energy, okay? On the left-hand side I show you the untrained version, the one we already saw before. In this case, for every training point — the blue points — I have a corresponding x, which is the location on the model manifold that is closest to that training point, okay? Whenever I train, I get a gradient. And that gradient — I just told you before: if you take the min and then the derivative, you get the argmin, which is the one-hot selecting the item with the lowest value. And so that one is represented here by this arrow over here, right? This arrow is the derivative of the energy, which is just the difference, y minus y check, multiplied by that one-hot at the location that is closest to our point. So what this means is that during training, whenever we use the ZTL, the zero-temperature limit, you take the location on the manifold that is closest to your training point, and that point gets moved there. You have this training point, you find the location on the manifold closest to it, and you get a gradient that makes it go up here. Same thing: you have a training point here, the closest point on the manifold here, and you get a gradient that pulls it down here, okay? So this is the training procedure when using the zero-temperature limit. One epoch later, on the right-hand side, the trained version — bam — all those x's automatically managed to arrive at their destination. So this is a well-trained model, where I show you the energy going to zero at all the locations corresponding to my training dataset — the training points, the blue points. What happens if you have two closest points on the manifold?
If, for example, y is at (0, 0)? Right — in the zero-temperature limit you get just one point, and it gets pulled there. And this is very prone to overfitting. Say our z is not just one-dimensional but larger, right? Then instead of an ellipse you have, like, a potato. If you have — hold on, let me finish the answer — if you have a potato, you get all these locations on the potato going to those training points. So if your z is a high-dimensional latent variable, you start with a potato and you end up with a porcupine, with all those spikes sticking out. And this is basically overfitting: you just memorise the training set. In our case this doesn't happen, because our latent is one-dimensional, so you can't really pull spikes out of that thing. But nevertheless, we may want to figure out how to deal with this overfitting by using the temperature as regularisation, right? Before, I showed you there is a peak in the zero-temperature limit, and if you increase the temperature you actually smooth out that peak. So here I'm going to show you — then I'll answer the other question. Actually, let me see what's here: "How do we update the energy function? How is it parametrised?" Oh, here — this is the definition from last time, right? My energy function is this one: it's the squared difference between the y components and the decoding of the latent, for the first component and for the second component. So this is how E is parametrised, right? "Does the learning interpolate between the points? Would this algorithm learn the whole ellipse or just the blue points?" Okay, I'm getting there. "Is there a visualisation for the spikes you talk about when overfitting?" Yeah, I'm getting there as well. All right, so we were talking about how we train this energy function, right? This energy function is this coloured thing I showed you over here, and in a different representation it's simply the location of that violet ellipse. Training with the zero-temperature limit means you take that point of the ellipse and you try to pull it out, right? How do you pull it out? The only two parameters we had in this model were w1 and w2, which control the x radius and the y radius, right? So we had two parameters, and with two parameters we try to fit all these y's. And so basically the training procedure, gradient descent, will eventually change the size of this ellipse such that it expands and matches all those blue dots, okay? The spiky thing I was mentioning: if you have a high-dimensional z — in this case z is one-dimensional, so you have one line like that; if z is two-dimensional, it's a whole surface, right? — then it's trivial to overfit: you can move anywhere in the plane, there is no more constraint of living on that line. So we have to see how we can avoid overfitting. In this case it doesn't happen, but we can see now that by increasing the temperature we no longer pick points individually: we are using this marginalisation, this Bayesian thingy. So the bottom part is the marginalisation. On the left-hand side I show you how the training works, right?
So all those locations contribute: the gradient is just the weighted average of those arrows here. Given that we pick one y — this green x over here — all these points on the manifold contribute and get attracted there. Before, only one point got pulled up; here all these points get pulled up, right? So it's much harder to overfit. Something you want to pay attention to here is how I compute the gradient. I'm computing the gradient of this softmin, and so automatically we get a soft argmin, right? If you have a max and you take the gradient, you get the argmax; if you have a min, the gradient gives you the argmin. Here we have a softmin, and therefore the gradient is the soft argmin, multiplied by the derivative of the energy, which is simply this vector, right? The energy is the squared distance; if you take its derivative, you get the vector, shown here in white. And then the weight is basically given by the soft argmin that multiplies each vector. Cool, cool, cool. Wow, that's a lot to take in, I think, but I think it's just great. Finally, I train this last one too, and I get something like this on the right-hand side, okay? Before, I showed you the cross-section of the untrained version on the left-hand side; now I'm going to show you the cross-section of this trained version. For the zero-temperature limit, the super-cold one, I get this red one with a spike, and then as you increase the temperature, as you reduce this beta, we move up until you get this average version, this parabolic blue one, right? Okay, okay, okay. And so all of this was about unsupervised learning, right? So far we have only seen y's — where are the x's? So yesterday night I was like, okay, maybe I don't talk about supervised learning; how long is it going to take me to train a model with the x's and everything, and I don't want to do it. But then I changed one line of code and everything just worked. Everything we have seen so far is exactly the same for the unconditional case, which is this unsupervised-learning way, and with a one-line change you get the supervised — the self-supervised, the conditional — case. So now, in the last five minutes, we're going to talk about self-supervised learning, the conditional case. What does this mean? Let's get back to the training data. This is my training data, right? We try to learn this horn that starts with a horizontal mouth — like a closed mouth, ah, like that — and then becomes very tall and narrow, and the profile, the envelope, is exponential, right? Here the radius goes from beta to alpha and it's multiplied by the exponential of two times x; in the other case it goes from alpha to beta and is also multiplied by this exponential. So let's see if we can learn this stuff. I didn't know if it was going to be easy or hard — I thought it was hard; it was very easy. So, the untrained model manifold: let's give it a look. How does my model look now? I have a z, and since I have control over z, I take zero to two pi, two pi excluded — that's why the bracket is flipped — with a step of pi over 24. So I get a line over there.
I feed this z into the decoder and I get my ỹ, which moves around ellipses, because that's how my network is wired inside the decoder, right? Moreover, we have our y's, our observed y. y is observed — you can see it's observed because that bubble there, the circle, is shaded. Now we have a predictor, right? The decoder takes not only my latent z, but also the predictor, and the predictor is fed my observed x. And since I also have control over x, I can simply say it goes from zero to one with a 0.02 interval. Let me show you how my untrained network manifold looks, right? All right, so how do I train this? Well, I just do the zero-temperature-limit free-energy training. So, given my horn as before, I take one y point, I find the closest point on my manifold, and then I try to pull it out there. I take this other point, I take the closest point, and I pull it out there. I take this point over here on the horn, I take the closest point on the manifold, and I pull it out. I do that for one epoch only — I told you it was very easy to train this model. And we get... actually, first I had to define what the energy function is, right? My energy function in this case is this E of x, y and z, where again it's the sum of the squared distances over the two components. But in this case I have f and g, right? We have a predictor f and a function g, both of them mapping R to R². f is a neural net, mapping my input x through a linear layer and a ReLU to an eight-dimensional hidden layer, then again through another linear layer and ReLU to another eight-dimensional hidden layer, and then a final linear layer to end up in two dimensions. So I have a network with an input, two hidden layers of size eight, and an output of size two. And then my g function is simply what allows me to get this z going around in loops, okay? And the point now is that these two components are scaled by the output of f. So this is my model — a very, very tiny model — and I train it. Then I show you the model manifold: I take the same discretisation for z and x, and this is how the trained model manifold looks. It's awesome, right? I think it's just great. All right, and this one took no time to train. So how can we move on? What do we do next? How do we move forward from here? There are a few ways to scale this up beyond a toy example. So far I've been kind of cheating, right? I've always been embedding into the decoder the fact that my z goes around circles — but we don't actually know that. So we may use something like this: in this case, my g function takes my f and z, and g can be a neural net as well. In this case I have to learn the fact that this stuff moves around circles; I should be learning the sine and the cosine. But then, how do I know that z is actually one-dimensional? Well, I know because I generated my data, right? I am the owner of my data-generation process, so I knew that theta was a one-dimensional item, so I can definitely use a latent that is one-dimensional. But no one can tell me that for, you know, natural images or whatever, right? So that's the other big issue.
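Here is a minimal sketch, under my own assumptions about names and wiring, of how that conditional energy could be put together in PyTorch: the predictor f is the small MLP just described (two hidden layers of size eight, output of size two), g makes z go around in circles, and the decoder output is f(x) scaling the two components of g(z) component-wise. The training pair and the single zero-temperature-limit step at the end are hypothetical, just to show the shape of one update.

```python
import torch
from torch import nn

# Predictor f: R -> R^2, a tiny MLP with two hidden layers of size 8 (as described above)
f = nn.Sequential(
    nn.Linear(1, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 2),
)

def g(z):
    # g: R -> R^2, the part that makes z go around in circles
    return torch.stack((torch.cos(z), torch.sin(z)), dim=-1)

def energy(x, y, z):
    # E(x, y, z) = || y - f(x) * g(z) ||^2, with * the component-wise product (assumed)
    y_tilde = f(x) * g(z)                     # (N, 2) by broadcasting over the z grid
    return ((y - y_tilde) ** 2).sum(dim=-1)   # one energy per latent sample

def free_energy_zero_T(x, y, z_grid):
    # zero-temperature limit used for training here: min over the latent grid
    return energy(x, y, z_grid).min()

z_grid = torch.arange(0, 2 * torch.pi, torch.pi / 24)
x, y = torch.tensor([0.5]), torch.tensor([0.3, 1.2])   # hypothetical training pair
loss = free_energy_zero_T(x, y, z_grid)                # energy loss at (x, y): push it down
loss.backward()                                        # gradients flow into f's parameters
```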
And so, how would we deal with the fact that we don't know what the correct size of my latent is? Because again, if you choose a large latent, you're going to very easily overfit everything. In this case, what changes from the previous slide — which is this one — is that now z is a vector, okay? z is a vector, no longer just a single scalar; it can have whatever size, whatever shape. And now my g goes from the dimension of f, Cartesian product with the dimension of z, into R². The issue now is that we need to regularise this loss functional, because otherwise you are going to be drastically overfitting, right? And this is current research with Yann — this is what my students and I are doing with Yann: we are trying to figure out ways to regularise the latent variable such that things don't simply overfit. And that was it. That was all I had to tell you about latent-variable energy-based models: inference, training, the zero-temperature limit, the bit-warmer free energy. Then we saw the unconditional case with unsupervised learning, and then we saw the conditional case with self-supervised learning, right, where we have access to these x's. And the code I use for training the conditional case is the same code as for the unsupervised case, with one line changed. So really, it doesn't take much effort to put this together. What took some effort was drawing the slides, but that's just because I like making things pretty. And that was it. Thank you for listening. Questions, please go on. It's done, right? Class is finished; you can ask anything you want. Are you still awake? Yes, okay, someone is awake. "Can you explain the input dimension of g again?" Yes, I can explain as much as you want — it's office hours now, right? You can ask anything you want. Hold on, first question: can you explain the input dimension of g? Let me go back to the first case. In the first case, the input dimension of g is one, because it was fed only with z, and the output was these g1 and g2, which were the cosine and the sine. In the second case, the input is this f — whose dimension we don't know exactly, it can be anything — together with z, given that I know z is one-dimensional. Finally, the most general case, the actual, more realistic case, is this one where we don't necessarily know the dimension of the latent, and therefore we use a latent variable of whatever dimension — but then it's necessary to regularise the loss functional, otherwise, as I was pointing out, you can easily overfit when using that zero-temperature limit. Nevertheless, you can warm up the temperature and use that as a regulariser, of course, right? Did you get it? Yeah. So, next question: how does this look without a latent variable? Okay, without a latent variable, it's exactly like sending beta to zero: beta to zero, you just average over all possible values. What happens? What are you going to end up with if you start here on the left side? Instead of having all these arrows shaped the way they are now, all these arrows will get the same weight — and actually the arrows for these points over here will be even longer, because they are further away. So the ellipse will be pulled in every direction.
And the way to minimise this energy is actually to make it collapse into a single point, centred at zero. And that's — it's a very good question, right? What is the classical failure mode in a neural network? Whenever you have multiple targets associated with the same input, you end up predicting the average of all the possible targets. In this case, the average of all possible targets — all those points on the ellipse — is just the point at the origin, which is the collapse of your model, right? So that's a very good question. And the point is that if you try to learn multimodal outputs, a multimodal dataset, with MSE — without a latent, with zero beta, infinite temperature — you're just collapsing to the mean, the average, right? M-E-A-N, mean, not M-I-N, min. All right, another question: "To be clear, at the zero-temperature limit, the loss is only considering the energy of the nearest point." Yeah. "And as we warm it up, the loss is using a weighted sum of all points." Yes — and the weights that you're using for the weighted sum are the weights coming from the soft argmin, right? You have the softmin of the energy, right? So F tilde is the softmin of the energy. You take the derivative of the softmin, and what do you get? You get the exponential divided by the sum of the exponentials — that's the soft argmin — multiplied by E prime. What is E prime? E was the squared distance, so if you take the derivative of the squared distance, you just get the vector, which is then multiplied by the soft argmin. So it's exactly what you said, which is a very good summary. I'm going to just read it again, and I'll show the other chart. To be clear: at the zero-temperature limit, the loss is only considering the energy of the nearest point — the squared distance to the closest point, yes. And as you warm it up, the loss is going to be the weighted sum — not of the points, right, but of all those contributions, the exponentials of minus beta E. That's what was written here at the top. So as you warm it up, you get this expression with the exponentials, which is the softmin. And then if you compute the derivative, you get the derivative of the energy — the arrows — multiplied by the soft argmin. Cool. "What happens if we allow z to move freely in the space?" You're basically going to get a collapsed network: this model can simply output zero everywhere. And that's where you may need to use the contrastive cases, right? In that case, a very easy way to get zero energy is just everything zero, right? But you can use the contrastive loss and say: no, in this case it should be larger than some margin. And so that's how you can deal with this z being free to move in the whole space, okay? "So taking beta to zero would defeat the purpose of having a latent variable at all." That's exactly it, yeah. And this is what I briefly showed you — I didn't go through it, but it's a quick derivation showing that if you take the limit for beta that tends to zero, you retrieve the average across all the latents, and you basically end up with MSE, right? You end up throwing away all those goodies, right?
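To connect that summary back to code: a short sketch (the values and names are mine) showing that if you implement the warm free energy with torch.logsumexp and let autograd differentiate it, the gradient with respect to the per-z energies is exactly the soft argmin weights we just discussed; the chain rule then multiplies these weights by the derivative of the energy, the arrows.

```python
import torch

beta = 1.0
E = torch.tensor([0.25, 1.2, 2.0], requires_grad=True)  # hypothetical per-z energies

# warm free energy: softmin_beta(E) = -1/beta * log( 1/N * sum exp(-beta E) )
log_N = torch.log(torch.tensor(float(len(E))))
F = -(torch.logsumexp(-beta * E, dim=0) - log_N) / beta
F.backward()

# autograd gradient dF/dE ...
print(E.grad)

# ... equals the soft argmin weights: exp(-beta E) / sum exp(-beta E)
weights = torch.softmax(-beta * E.detach(), dim=0)
print(weights)
```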
And that was pretty much it. How can you get more out of this lesson? Firstly, comprehension: if anything was not clear, ask me anything in the comment section below. If you would like to follow up with the latest news, follow me on Twitter under the handle @alfcnz. If you would like to be notified when I upload the latest video, don't forget to subscribe to the channel and turn on the notification bell. And if you liked this video, don't forget to put a thumbs up. This video has a transcript in English, and if you would like to contribute to the translation into your language, please let me know. So here, as you can see, we have the write-up, where you can see all these videos that have been transcribed here in plain English. And then, as I said before, if we go back to the homepage, we can click the English flag and select different languages. Right now we have Arabic, Spanish, Persian, French, Italian, Japanese, Korean, Russian, Turkish and Chinese — and your language is just waiting for you to translate it. Finally, do play with the notebook and PyTorch in order to get yourself more acquainted with all these new topics. And if you find any typos or mistakes or anything, please let me know directly on GitHub, or, if you feel brave enough, you can even send a pull request — it will be gladly appreciated. Thank you for listening, and don't forget to like, share and subscribe. Bye-bye.