Welcome back to class. Today we are covering the second part on energy-based models. I hope you reviewed the content from last week, because we are building directly on top of the last lesson, and there is a lot more coming. The third homework, coming out today, is about energy-based models, so you really have to pay attention to what we talk about today: these concepts are not straightforward, and they need some attention and care to assimilate. Maybe right now not everything I say will be fully clear, but with the homework you will get the chance to code these things up yourself and develop an understanding of these topics. If you have any questions about what we cover today, feel free to type them in the chat; I'm reading those questions so that we are all up to speed by the time you start the homework. So, free energy. What was this? It was defined in one of the last slides from last time: the zero-temperature-limit free energy, F infinity, defined as the minimum value that the energy E takes with respect to the latent variable z. If we call z-check the location in the latent space where that minimum is attained, then F infinity is simply E evaluated at z-check. Cool. Here I show you a diagram with a colour scale going from zero to above two: the zero-temperature-limit free energy is zero in purple, one in green, and above two in yellow. In the yellow areas, out in the corners, you have a larger zero-temperature-limit free energy. As you move toward the dark purple region, where this scalar field has height zero, you reach the region where the network believes the manifold is supposed to be. We said a few times last time that this is a badly trained network, because these purple regions don't match the blue points; so it's a badly trained model. OK. Moreover, I show you the same function, but this time I plot the height as well, not just the colour. I use the coolwarm colour map: free energy of zero in blue, 0.5 in white, one and above in red. Here you can see what this thing looks like in a 3D representation, and I spin it around so you can see how the behaviour in the centre does this peaked thing. So what happened there? Why is it peaked? Let's do a cross-section: take y1 = 0 and chop this bowl in half. What we see is the following. At location −0.5, well, −0.4 basically, the height of the free energy is zero, which means that location is on top of the manifold. As you move to the left-hand side, you go up quadratically from that location. As you move to the right-hand side, you also go up quadratically, until you reach the centre of the ellipse; then you go back down quadratically until you reach the other crossing of the ellipse.
So along the cross-section: we are down at zero, then we go up, then we come back down quadratically, and then we go up quadratically again, and then up, up, up on the outside. That's why we have this peak: we are right between those two locations. This might be wanted, might not be wanted; today we will learn how to push down that little peaky thing so that we get a smoother energy. So we are going to be learning about smoothness today. All right. That was my free energy, or rather the zero-temperature-limit free energy. Why have I been repeating "zero-temperature-limit free energy" so many times? Because now I'm introducing the non-zero-temperature version: enter the free energy, the more generic formulation. The free energy from before was written in blue, which means it's cold; it's the zero-temperature limit, very cold. Temperature is a positive number; the lowest value it can take is zero, where the atoms don't shake anymore, where they are still. When they start moving, the temperature increases: temperature is related to the average kinetic energy of the particles in a material. So now I define this purple F, parameterised by beta, which is no longer blue. Beta is called the coldness: how cold this particular free energy is. Specifically, beta is 1 over k_B times T, where k_B is the Boltzmann constant and T is the temperature; this comes from physics. If the temperature is super warm, say you are on the Sun and T is plus infinity, then beta becomes zero; so beta equal to zero means it's super, super hot, and you relax this thing completely. If instead the temperature is super cold and goes down to zero, where things don't move anymore, then beta equals plus infinity. That's why you see plus infinity on the cold version: it means it's super cold. Beta is also called the coldness, or the inverse temperature. A very, very large beta means it's very, very cold, and that's why that F is blue; for a generic beta it's purple. We will see what happens when beta is super warm in the next slide. Anyway, let's look at this equation. What is it? My free energy is minus 1 over beta, which will later cancel against the beta inside, times the log of: 1 over the length of the latent domain, times the integral — the Latin long S — of the exponential of minus beta times the energy, all multiplied by dz.
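Written out — this is just my rendering of the definitions described above, with |𝒵| denoting the length of the latent domain — the two free energies we are juggling are:

$$
F_\infty(y) \;=\; \min_{z \in \mathcal{Z}} E(y, z) \;=\; E(y, \check{z}),
\qquad
F_\beta(y) \;=\; -\frac{1}{\beta}\,\log\!\left(\frac{1}{|\mathcal{Z}|}\int_{\mathcal{Z}} \exp\!\bigl(-\beta\, E(y, z)\bigr)\,\mathrm{d}z\right),
\qquad
\beta = \frac{1}{k_B\,T}.
$$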
So what happens when beta goes to plus infinity? The only term that survives in this summation is the one with the lowest energy: whichever value of the energy is smallest gets the largest exponent once you multiply by minus beta, and the exponential magnifies that difference enormously. You sum them all, divide by the length of the domain — which cancels against the sum of all the little interval lengths — and you are basically left with the single surviving term. Take the log, the log cancels the exponential, and the minus 1 over beta cancels the minus beta. So F beta of y converges to the minimum value of the energy, the one that survives. Cool. That's why the one on top is the zero-temperature-limit free energy: it is what you get when the temperature is super cold, i.e. beta is super large. Why are we talking about energies at all? The average translational kinetic energy is (3/2) k_B T; that's how the temperature connects to an energy, and it's measured in joules. If that is in joules, then beta, its inverse, is in 1 over joules. So what happens in the formula? E is an energy, in joules; multiplied by beta, in inverse joules, it becomes a pure number. That number gets multiplied by dz, but the 1 over the domain length out front cancels those units, so that's also just a number. Finally, everything is multiplied by minus 1 over beta, which is joules again, so F is, again, an energy. This is just some physics; we keep a connection with physics here. Anyway, say I don't want to compute this integral — maybe it's too complicated, too complex, and I just want to make things easy. So let me take a simple discretisation, a simple grid approximation. My approximate free energy, F tilde, is the same thing: minus 1 over beta, log of 1 over the length of the domain, but instead of the Latin S, the integral sign, I have a Greek S, a capital sigma: I still sum the exponentials of minus beta times the energy, and I convert the Latin d, the dz, into a Greek capital delta, Δz. So we go from the Latin S to the Greek S, and from the Latin d to the Greek Δ, which simply means we go from a continuous domain to a discretisation of it. One dimension is just fine here, and this lets me easily compute the free energy for the problem at hand. So what is this thing? I'm going to define — and pay attention, I'm defining this for this lesson, so outside this lesson it might not be standard terminology, but for us it will be — that this expression is the soft minimum of my energy with respect to the latent z. So: what is the zero-temperature-limit free energy? Talk to me, tell me, are you listening, are you following? Type in the chat. How do we define the zero-temperature-limit free energy? Yes: it is defined as the minimum energy across all z. And if I don't want the zero-temperature limit, how do I compute the free energy? It is simply this soft minimum. So you want to think of beta as a relaxation coefficient.
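Just to make the grid approximation concrete, here is a minimal sketch (my own toy energy and grid, not the course notebook) of the discretised free energy using torch.logsumexp:

```python
import math
import torch

def free_energy(E, beta, dz, domain_length):
    """Discretised free energy for a vector of energies E(y, z) on a grid of z:
        F~_beta(y) = -1/beta * log( 1/|Z| * sum_z exp(-beta * E(y, z)) * dz )
    """
    log_const = math.log(dz / domain_length)
    return -(torch.logsumexp(-beta * E, dim=0) + log_const) / beta

# Toy setup: 48 latents on [0, 2*pi); energy = squared distance from a decoded
# point on the unit circle to one made-up observation y = (0.5, 0.3).
dz = math.pi / 24
z = torch.arange(0, 2 * math.pi, dz)
E = (torch.cos(z) - 0.5) ** 2 + (torch.sin(z) - 0.3) ** 2

print(free_energy(E, beta=1.0,  dz=dz, domain_length=2 * math.pi))  # soft minimum
print(free_energy(E, beta=1e4,  dz=dz, domain_length=2 * math.pi))  # ~ E.min(): zero-temperature limit
print(free_energy(E, beta=1e-4, dz=dz, domain_length=2 * math.pi))  # ~ E.mean(): infinite-temperature limit
print(E.min(), E.mean())
```

Cranking beta up recovers the plain minimum, and cranking it down toward zero recovers the plain average, which is exactly the behaviour discussed next.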
If beta is cranked all the way up — super, super cold — then you have this very strict, harsh, cold free energy, which is exactly: just look at the minimum, just one point. If instead you crank up the temperature, or equivalently reduce the coldness, so you start warming up the system, then the free energy is no longer just the minimum value: it becomes a combination of a multitude of values. And this combination, as you can see here, is the sum of the exponentials of minus beta times the energy, so the lowest energy is still the predominant one, because the exponential scales it up. You sum them all in this exponential space; the Δz can be pulled outside the sum and, together with the 1 over the domain length, it just gives you a 1 over N, so those constants disappear. You are basically left with the sum of all these exponentials of minus beta times the energy; after you compute the sum, you take the log, which brings you back to the energy space. OK? All right. We will see later what these things are called in programming terms, in PyTorch and so on — they use the wrong names; these are the correct names, the names that make sense to me. So: you have the min, which is the zero-temperature-limit free energy; and if it's not that cold, if it's a bit warm, you get a soft version of the min. Later on I will tell you what's going on with the naming on the programming side. Anyway, what happens if beta goes to zero? What happens if you go to the Sun, if it's super, super warm? If you take that limit — and you can work it out yourself if you want — you end up with this final expression, which is simply 1 over the length of the domain times the integral of E over z. And what is this? It is simply the average value. So if you warm the system up a lot, if beta goes all the way to zero, the free energy is the average of the energy across all latents, which means you no longer consider any latent as more important than another. Before, we said that in the zero-temperature limit you have a point — a white point here — and you do a minimisation in the latent space to find the closest point, the z-check, the latent that best approximates the sample I have; then I compute the squared distance, which is my energy. So if it's super cold, you have exactly one z-check corresponding to your sample y, and this might lead to overfitting, because you just have one point at one location. If you increase the temperature, you no longer consider a single z-check: you have a multitude of latents, and their contribution to the free energy again depends on the squared distance. How does it depend? Their contribution is proportional to the exponential of minus beta times the energy: the smaller the energy, the larger the contribution; the closer they are, the more important they are. Up to the limit where, if it's super cold, only one point counts, one-to-one; and if you warm up the system, you have multiple latents corresponding to this point, which may help fight overfitting, OK?
This is important: if you crank the temperature up too far, you warm up the whole system so much that it's no longer just this region of latents that corresponds to this one point — it's all of them. And then you have basically killed off all your latents: there are no latents anymore, you've killed your system, and you end up with a plain, boring MSE. So don't kill your latents; don't warm your system up too much — they cannot survive, it's too hot. Makes sense? Yes? No? We'll see in the next chart how this plays out. Someone asks: so if the temperature is high, beta becomes zero, and the expression is independent of z? Yes, you average everything out: the free energy is simply the average of the energies over every latent, and so there is effectively no latent anymore. If every latent is treated equally, that equality across latents buys you nothing: regardless of where you are, each latent makes the same contribution, so the latents no longer explain different phenomena. They all contribute to everything in the same manner, all shouting together, and the system loses the ability to pick different options. The colder the temperature, the fewer latents are involved; the warmer, the more you just take the average, which is not useful because you cannot find the actually good latent. That's exactly the point. The temperature lets you move from one latent per observation up to every latent per observation — and every latent per observation is like having no latents at all; they all try to contribute to the same thing, so it doesn't work. OK. Again, that derivation was not there to scare you; it was there to show you that whenever I tell you things in an intuitive manner, there is a mathematical derivation behind them. I don't do the maths here because this is not a maths class, and I don't write the code from scratch because this is not an introduction-to-programming class; I give you resources you can use to learn these topics. Anyway, if you remember, last time we had 24 energies, one per y in our set Y. Y was the collection of all those little y's, samples from the ellipse that was our data-generating process. We took the 23rd one, which gave this nicely behaved U shape: the green peak on the right-hand side, starting from the orange on the left. We ran a minimisation process to find the z-check that gives this blue point; the blue point is the decoded z-check, G applied to z-check, my best guess for the value corresponding to this sample. And the energy was the squared distance. But now we change this: instead of the zero-temperature limit, the one-to-one version, we introduce the warmer version, with multiple latents per given point. So let's fix beta equal to one. With beta equal to one we get the following: each point here, in purple, has a colour representing its contribution to the free energy of this location. This is still y 23.
But now the free energy is no longer just the squared distance: now it is the combination, the summation, of all of these contributions. So my question for you right now is: where on earth does the brightest value in this plot, roughly 0.78 over here, come from? Question for the people at home: are you following? Every time we work with computers we are doing mathematics, not just programming, and when you do mathematics, or physics, or engineering, you always want to know where your numbers come from. You want to think about the answer in advance, before the computer tells you "the maximum value is 0.78" — why? Because if the computer tells you something while you were expecting something else, it's likely something is wrong. Well, the computer is never wrong: you made a mistake in the computation. So, people at home, type in the chat: where does this value come from? "Exponential of minus 0.25" — yes, that's correct. And the exponential of minus 0.25 comes from the fact that the smallest energy is this one over here, 0.25, which is 0.5 squared; beta is set to one, so the contribution is the exponential of minus 1 times 0.25, which is about 0.78. Awesome. Very good, Zijou.
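A one-line sanity check of that number (beta = 1 and the 0.5 distance are taken from the example above; the rest is just arithmetic):

```python
import torch

beta = 1.0
distance = 0.5                 # distance from y_23 to the closest decoded latent
energy = distance ** 2         # E = 0.25
print(torch.exp(torch.tensor(-beta * energy)))  # tensor(0.7788): the brightest contribution in the plot
```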
All right, moving on: we also have the squiggle. Same exercise for you, right now — how does this look? Last time, my sample was the green cross up here on top; we started from an initialisation over here, ran gradient descent, arrived at this location, and that was the zero-temperature-limit free energy. Now instead we use the relaxed version, where beta equals one rather than plus infinity, and you can see there is a multitude of values, all of these, that contribute to the free energy of this location. I hope it makes sense: the colder the system, the fewer points are involved; the smaller the beta, i.e. the larger the temperature, the more of them join in, until you get beta equal to zero, super hot, where all of them are taken into consideration. So a question for you: we said we can evaluate this at any point in the space, so what happens if I take the point y′ = (0, 0)? What happens if you are at the centre of the ellipse — what will you observe? "Each will have a similar contribution" — yes: the closest points, at the top and the bottom, will have similar, mirrored contributions, and the ones at the ends will be darker because they are further away. All right, so let's look at what happens to the previous chart, the one with the peak in the centre, as we increase the temperature. That's what comes out: the red curve is the one we observed before, with the peak in the centre. As you increase the temperature, you push down on that small peak and relax the peaky thing; with a very tiny beta you basically get a parabola, and we recover, basically, an MSE, a mean squared error, where no latent contributes in any distinct way anymore. All right, cool — so that was this relaxation. Finally, I have to tell you how to program this stuff in PyTorch, and what names people actually use. Unfortunately, the nomenclature used in this field is wrong; I'm trying to fix it, but I'm just one person fighting — it's okay. So I'm showing you here the soft maximum, the actual softmax: it is simply one over beta times the log of the sum of these exponential terms. Moreover, I pull out this term over here, which used to be inside: remember, inside there was the Δz, and out front there was the 1 over the length of the domain. I take the Δz out and split the logarithm, so I get this extra term, the one with the domain length over Δz — with a minus sign, because I flipped the fraction. So the part without the red annotation, the one over beta times the log of the sum of exponentials, is what is called logsumexp. My soft maximum is this logsumexp — which is an awful name, because it just lists the operations instead of saying what it is: it is simply the soft version of a max. And if you want exactly the same quantity as the free energy, you also need to add that offset term, which logsumexp by itself does not have. So what was the soft minimum? The softmin instead is minus one over beta times the log of one over n times the sum of the exponentials of the negative values — I just simplified what's inside. And if you look at it, it is exactly: take the soft max of the flipped function, and then flip the result back. If you have a function and you know how to find its max, how do you find its min? You flip the function, so it goes down where it went up; you take the max of that; and then you flip the result back, because it was a minimum. That is exactly what is written here: flip the function, take the soft max, flip it back so you return to the negative value. Finally, what is the thing that people out there actually call "softmax"? It is simply the soft argmax. What is the soft argmax? The argmax, given a vector of n elements, is the operator that gives you back n elements, all zeros, with a one at the location of the maximum value. The soft argmax instead gives you a probability distribution over these values: where the maximum is, you get a larger value, and smaller values at the other locations, each proportional to its exponential divided by the sum of all the exponentials. The nice part is that if you have two equal maximum values, each of them gets roughly 0.5.
If there are three equal maximum values, each of them gets roughly a 0.33 contribution. So this is a probability distribution, and it is also the derivative of the soft maximum: the soft argmax is the derivative of the softmax, just like the argmax, as a one-hot vector, is the derivative of the max. Makes sense? So these are the actual, correct names, and then people out there will call the soft argmax "softmax", which is awful. All right, so this is how you can implement these things. Why do you need to know this? Why can't you simply implement it yourself from scratch? Because there are stability problems. In numerical mathematics you have to be careful: I told you last time that whenever you invert a matrix you want to keep its conditioning in check, so that things don't blow up. Similarly here: if you want to compute the soft argmax or the soft min, you cannot simply exponentiate and then take the log naively, because the exponentials can overflow or underflow and things don't end up well. You should simplify where possible, and use the pre-made versions, which take care of these potentially troublesome things.
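Here is a small sketch of the dictionary between the names used in this lesson and the PyTorch calls (the toy energy vector and beta are my own; torch.logsumexp and torch.softmax are the library's actual functions):

```python
import torch

beta = 1.0
E = torch.tensor([0.25, 1.0, 4.0, 9.0])     # a made-up vector of energies, one per latent

# Names used in this lesson              -> library spelling
soft_max    =  torch.logsumexp(beta * E, dim=0) / beta    # soft maximum (a scalar); "logsumexp"
soft_min    = -torch.logsumexp(-beta * E, dim=0) / beta   # soft minimum: flip, take soft max, flip back
soft_argmax =  torch.softmax(beta * E, dim=0)             # a distribution; PyTorch calls this "softmax"
soft_argmin =  torch.softmax(-beta * E, dim=0)            # the per-latent contributions from the plots
# (the lesson's soft minimum / free energy also carries the constant 1/n, i.e. dz/|Z|, offset inside the log)

# Why use the built-ins: the naive formula overflows, logsumexp does not.
big = torch.tensor([1000.0, 1001.0])
print(torch.log(torch.exp(big).sum()))   # inf: exp(1000) overflows in float32
print(torch.logsumexp(big, dim=0))       # tensor(1001.3133), stable
```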
Finally, we saw that this model was not trained well, so let's talk about training — that was the missing part from last week. So far we have concluded inference, inference for latent-variable energy-based models: we saw the version with the very cold minimum and the version with a warmer, softer minimum; one latent per sample, several latents per sample, and all latents per sample, which is useless — it doesn't work. And we saw that in this case the energy is simply the quadratic distance between my prediction and the actual observation. How do we train the system? We didn't talk about training; we talked about minimisation. Whenever we compute the minimum, we want to find the z that minimises the energy — but minimising the energy is not training, it is inference. It infers what value the latent has to take in order to give you the best approximation of your observation. All right, I hope that's clear. So, moving on, let's see how we train the system. What does training mean? Find the parameters. So let's see how we find the parameters of a well-behaved energy function, which is parameterised by the parameters of the model. Loss functional — oh, what is this word? What is a functional? A functional gives you a value, given a function as input. So a functional is a scoring mechanism, a scoring object for my energy function: we have an energy function, and I want to be able to say how good or bad an energy function is. An object that gives you a scalar given a function is called a functional; it's just a naming convention. So we have this curly L, which is a functional of the free-energy function F and of the y's that are my observations: it is simply the average, over all my observed y's, of these per-sample loss functionals, and it is a scalar value. Not a big deal — we already saw this in lesson number two, the second practical session. So, the easiest loss functional is the energy loss functional: given the energy function and an observation, my loss functional is simply the free energy at that location. Big deal, right? Why is that? We said we want the purple region, the ellipse coming from my model, to sit exactly underneath the observations. So it is certainly a good idea for the quadratic distance from that low-energy region to your points to be minimised: you want low energy in correspondence with your observations. A simple version is just to use the energy as your loss: if your energy is your loss, then by minimising the loss you minimise the energy in correspondence with your observations. Why is it drawn in blue? Because it's cold, like a thermometer: blue is cold, red is warm, and you want cold, low energies in correspondence with your observations. That's why it's blue. OK, I hope that's understandable — there were no questions. Another option is the hinge loss functional. What does the hinge loss functional do? Given your energy function F, given a blue (good, cold) y and a red (bad, hot) y, the hinge loss functional tries to make the free energy of the red guy exceed that of the blue guy by at least a margin m, a positive number greater than zero. As long as the gap is smaller than m, you have m minus something smaller than m, which is a positive number; you take the positive part and still get a positive number, so the system keeps pushing. As soon as the free energy of the bad guy exceeds that of the good guy by m, you have m minus something larger than m, a negative value; take the positive part and you get zero, so the system stops pushing. So: good sample, bad sample. The loss functional pushes the energy of the bad guy up until it is m units above the good guy; once that height difference reaches m, there is no more gradient and the pushing stops. If the gap is smaller, it keeps pushing; you push until, poof, the loss hits zero. And this is a contrastive method: you have two samples, a good boy in blue and a bad boy in red, and you try to push their energies m units apart. Now, what is the problem with the previous method, the energy loss? The point is that if all you ever do is push down the energy at the good points, a perfectly valid solution for the system, once you finish training, is an energy that is zero everywhere.
Then the loss has succeeded: it pushed the energy down — in this case the energy was non-negative, because we took the squared distance — but if the energy is zero everywhere, you can't discriminate between good and bad anymore. Everything is flat; the system has collapsed; the manifold is flat and useless, OK? Makes sense? If you just push down and end up with everything on the floor, there are no more mountains to go hike — it's boring, there's nothing you can do with that. There are many ways to avoid that: you need a mechanism that keeps the energy high for the things you did not push down, for the things that are not good. There are different options, and we covered them in class. There are architectural options: designs that simply don't let the system have low energy for too many values. For example, K-means: K-means can only have low energy at K locations, the centroids, which are clamped to zero, and everywhere else the energy grows quadratically. K-means by design cannot be flat, because you chose a specific number of locations. On the other hand, if your network can just output zero everywhere, then everything is zero and nothing can be done. So there are other options, and here we saw the contrastive method, which forcibly pushes the energy of the bad guys m units higher than the good ones. OK, I've talked too much — moving on. That one stops pushing once you reach m. This next one is a softer version. As you can tell, if the quantity inside the exponential is very, very large, you basically get the log of the exponential, the one goes away, and the log loss functional reduces to just the difference: it pushes down on this value and up on that value. But if the two are far enough apart — say the quantity inside is a very negative number — then the output of the exponential is a tiny number compared to the one, and the push becomes weak. It never stops pushing, it just pushes less: it pushes a lot when it's wrong, and only a little when it's not wrong. The other one just stops pushing once you reach the margin. So this is a soft margin, and the log loss functional can also be called a soft hinge loss functional. Why can't we simply use F(blue) minus F(red)? Because then the loss is unbounded: the good one bottoms out at zero while you keep pushing the red one up forever — boom, it goes to plus infinity. You cannot do that; you want a mechanism that at some point says "enough pushing". Makes sense? I hope so.
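For reference, here is how these three per-sample loss functionals might look in code (a sketch; the margin value, the placeholder free energies, and the function names are my own, not the course notebook):

```python
import torch

def energy_loss(F_good):
    """Energy loss: just the free energy at the observation (can collapse to a flat energy)."""
    return F_good

def hinge_loss(F_good, F_bad, m=1.0):
    """Contrastive hinge: push F_bad at least m above F_good, then stop pushing."""
    return torch.relu(m + F_good - F_bad)

def log_loss(F_good, F_bad):
    """Soft hinge: never stops pushing, but pushes less as the gap F_bad - F_good grows."""
    return torch.nn.functional.softplus(F_good - F_bad)

F_good, F_bad = torch.tensor(0.2), torch.tensor(1.5)   # made-up free-energy values
print(energy_loss(F_good), hinge_loss(F_good, F_bad), log_loss(F_good, F_bad))
```

A margin can also be placed inside the softplus if you want the soft version to aim for the same gap as the hinge.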
A student asks: when do we use the soft hinge versus the hinge loss? Exactly — good question. Empirical evidence. There is no definitive answer; I give you the tools, I give you the knowledge, but we don't really have a fool-proof recipe. So you can try both of them and see what works: there is some understanding of how the mechanism works inside, and then there is practical evidence of how these things behave in practice, and you may want to use one or the other depending on the performance you get. Cool. For the system I train in class, I just use this one over here, the energy loss functional. (Someone asks about the triplet loss — there are many contrastive losses; I'm just showing you one example.) So let me start by showing you what happens with the zero-temperature limit. On the left-hand side I show you the untrained version. We said that for each location over here there is one latent, obtained by minimising the energy, which gives this decoded point. Training means I pull this point up, in this direction, toward the observation. What are these arrows? They are the gradients: the energy was the squared distance, so its gradient is just the displacement. So I take one good sample, minimise the energy to get the closest possible decoded point, and pull it up. I take another point, go around the manifold, get this point here, boom; this point here, go around, boom. These are just pulling those decoded points toward the observations. What happens if you warm up the system? On the left-hand side, you can tell that the free energy of this point now comes from the contributions of all these values, and also some of those over there, and the contribution is proportional to the exponential of minus beta times the squared distance. So for one sample, all of the latents contribute to its free energy; you can see that these locations are darker — these points have higher energy — and so do those over there. All right, cool. So in this case, one point has multiple latents contributing to its free energy, and on the right-hand side you can see what we get if we pull every time. First of all, notice that in the centre-right part we no longer have the very sharp edge: we have a blurred version. And notice that before, the purple went all the way around the ellipse, whereas here the purple is stronger here and here, so these regions have higher energy, similar to this one over here. OK? All right. So what's next? If you take a cross-section of this, like so, you get the following: the cross-section for different values of beta.
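A tiny sketch of what "every latent contributes in proportion to exp(−βE)" means for training (the two-parameter decoder and the observation are made up; the point is only that the gradient reaching the parameters is a soft-argmin-weighted mixture over latents):

```python
import math
import torch

beta = 1.0
w = torch.tensor([1.0, 0.5], requires_grad=True)      # pretend decoder parameters (ellipse radii)
z = torch.arange(0, 2 * math.pi, math.pi / 24)        # 48 latents
y = torch.tensor([0.5, 0.3])                          # one made-up observation

y_tilde = torch.stack([w[0] * torch.cos(z), w[1] * torch.sin(z)], dim=1)  # decoded latents
E = ((y - y_tilde) ** 2).sum(dim=1)                   # one energy per latent

# Warm free energy: its gradient w.r.t. w is the soft-argmin-weighted average of the
# per-latent energy gradients, so every latent "pulls" in proportion to exp(-beta * E).
F = -(torch.logsumexp(-beta * E, dim=0) - math.log(len(z))) / beta
F.backward()
print(torch.softmax(-beta * E, dim=0))                # per-latent contributions (the plot colours)
print(w.grad)                                         # the resulting mixed gradient on the parameters
```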
So far, in the last lesson and today, we have talked about the unsupervised, self-supervised case: there was basically no x, we only had y's. Now we reintroduce an x as well. Remember, we were trying to learn this horn: this envelope, whose ρ1 and ρ2 grow exponentially with x, and whose ellipse axes change from horizontal to vertical. What we had done so far was like taking a cross-section at x = 0 — we chopped off the x. Now I bring the x back, and we switch to the conditional case. I thought, oh my God, I'm going to spend a week training this — no. I changed two lines of code, and everything I have shown you so far applies to this case, which I think is wonderful: there is no major difference, neither in the coding nor in the reasoning, between what we have covered for the last hour and a half and this conditional case. So let's see what's going on: the untrained model manifold. We have a z that goes from 0 to 2π, with 2π excluded, in intervals of π/24, so there are 48 samples in the latent space when I show you the chart: discrete values along this line, which I feed into a decoder, giving a y that varies along ellipses. Then we have an observation y, drawn shaded, and it's blue because the free energy should be low in correspondence with those observations. And now we have a new item: a predictor. The predictor is a new component that is fed with x, and x in this case is also an observation — it is also shaded — and x is my conditioning component. Before, it was unconditional: we had no input. My previous network had no forward function, no input; it only had outputs and an internal latent variable. Now, finally, we have an input. This is very important, very new, very non-trivial. All right. Now I let x vary from 0 to 1 in steps of 1/50, so I have 51 samples, and I show you my untrained model. My neural net, as you can tell, is basically deciding the size of the ellipse, whereas the z lets me travel around the ellipse — we saw that z enters through the cosine and sine. So z navigates the ellipse, and x is in charge of deciding its shape. And you can tell, as I make it spin, how this untrained model manifold looks. Awesome. So, the energy function: how do I define it? You already know. My energy function is this red box on the right-hand side, connecting my ỹ, my belief about what y should be, with the observed y. So finally, the full formulation: E(x, y, z) is the squared Euclidean distance between the observed y and the predicted y, that is, (y1 − f1(x)·g1(z))² + (y2 − f2(x)·g2(z))². What are f1, g1, f2, g2? The f's and g's are functions that map a scalar — x or z respectively — to the 2D space, R², the plane. The x is mapped through f, a neural network: a linear layer to eight units with a ReLU, another linear layer to eight units with a ReLU, and a final linear layer to two units with no non-linearity. So this simple, tiny architecture goes from x, a one-dimensional scalar, through two hidden layers of eight units each, to two output values — finally, we see an actual neural network.
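A minimal sketch of that predictor and energy (the layer sizes and the cos/sin decoder follow the description above; the class and variable names are my own):

```python
import torch
from torch import nn

class Predictor(nn.Module):
    """f: R -> R^2, decides the two ellipse radii as a function of x."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 8), nn.ReLU(),
            nn.Linear(8, 8), nn.ReLU(),
            nn.Linear(8, 2),             # no non-linearity on the output
        )
    def forward(self, x):
        return self.net(x)

def decoder(z):
    """g: R -> R^2, travels around the unit ellipse."""
    return torch.stack([torch.cos(z), torch.sin(z)], dim=-1)

def energy(f, x, y, z):
    """E(x, y, z) = || y - f(x) * g(z) ||^2  (elementwise product of radii and direction)."""
    y_tilde = f(x) * decoder(z)
    return ((y - y_tilde) ** 2).sum(dim=-1)

f = Predictor()
x = torch.tensor([[0.3]])                # one conditioning value
y = torch.tensor([[0.8, 0.1]])           # one observation
z = torch.tensor([1.2])                  # one latent value
print(energy(f, x, y, z))                # the energy for this (x, y, z) triple
```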
And z? The g is still the same one: z gets mapped through the cosine and sine as before. I actually removed the two w's — there are no more w1 and w2, the coefficients we had before for defining the ellipse; now a neural network decides what the radii are, given a particular x. So my x-branch has to learn that exponential profile, and the fact that the ellipse goes from a tiny horizontal one to a larger vertical one, OK? So I train this model. It takes one epoch; it's very straightforward. I changed two lines of code — I seriously thought I would have to write a new notebook, and I just changed two lines: x as input data, z as the generated latent. And this is how my final trained manifold looks: you can see it goes from a tiny horizontal ellipse to a big vertical red ellipse. As I showed you before, my training data was this: x sampled from a uniform distribution, θ sampled from a uniform distribution, plus ε noise — so you have discrete samples. How do I train it? I just use the zero-temperature-limit free energy. I take one point from my data distribution and find the closest point on the manifold I showed you before: I run a minimisation process, a gradient descent, to find the latent corresponding to that closest point. Then I minimise that distance — we said the free energy is the loss functional. So: I take an observation over here, and across all possible values of z I find the one that gives me the closest ỹ; then I run stochastic gradient descent so that I can minimise that distance. Again: observation here, minimisation to find the latent giving the closest sample, then train by pulling. So there are two minimisation processes. The first is inference in the latent space, which finds the prediction that best approximates the given observation. The second is stochastic gradient descent in the parameter space, which improves the network so that it approximates this horn, my observed horn. And as you can tell from the final part, I then run a neat, tiny sweep across x and z, and you get this fully, nicely smooth manifold.
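Here is a rough sketch of that two-level loop, using the zero-temperature-limit free energy as the loss (the hyper-parameters, the inner-loop optimiser, and the stand-in data generator are my own assumptions, not the course notebook):

```python
import math
import torch
from torch import nn

# Predictor f: x -> two radii; decoder: z -> point on the unit circle, as in the lesson.
f = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
param_opt = torch.optim.SGD(f.parameters(), lr=0.05)

def energy(x, y, z):
    y_tilde = f(x) * torch.stack([torch.cos(z), torch.sin(z)], dim=-1)
    return ((y - y_tilde) ** 2).sum()

for step in range(2000):
    # A stand-in data sample (no noise term): x uniform in [0, 1], a point on an
    # ellipse whose radii grow with x -- a made-up envelope, not the course's exact horn.
    x = torch.rand(1, 1)
    theta = 2 * math.pi * torch.rand(())
    radii = torch.cat([0.5 * torch.exp(x), 1.0 * torch.exp(x)], dim=-1)
    y = radii * torch.stack([torch.cos(theta), torch.sin(theta)], dim=-1)

    # 1) Inference: gradient descent in latent space to find z_check = argmin_z E(x, y, z).
    z = (2 * math.pi * torch.rand(())).requires_grad_()
    latent_opt = torch.optim.SGD([z], lr=0.1)
    for _ in range(50):
        latent_opt.zero_grad()
        energy(x, y, z).backward()
        latent_opt.step()

    # 2) Learning: one SGD step on the parameters at the inferred z_check (the free energy).
    param_opt.zero_grad()            # also clears gradients accumulated on f during inference
    energy(x, y, z.detach()).backward()
    param_opt.step()
```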
In the last few minutes, let me show you the conclusion of this lesson, and then we are done. What are the remaining big challenges? What's next? First of all: so far I knew in advance what was inside — that the thing is an ellipse. The next step is to have a decoder, g1 and g2, that now takes as input the output of this f (f being my predictor, mapping my input to that intermediate dimension), but where the decoder itself is a neural network. Perhaps I no longer know that it is a sine and cosine of z: it is a neural net that takes the encoded x and the latent, and we have to figure out, we have to train, an architecture that produces a decent y. So the first research question is: how do you find decoders that do the right thing — that learn, for instance, that there should be a cosine and a sine inside? Finally, the actual real-world scenario, which we don't yet know how to solve in general, is the following. What's the difference? The latent is no longer a scalar — and this is, ouch, a big deal. The only difference, apart from things being shifted around a little, is that we go from a latent that is one-dimensional (because I know my y varies along a one-dimensional domain for a given x) to a latent whose dimensionality I don't know. So that is another big research topic: how do you find the right size for z? How do you find the right latent? So: first question, how do you find the right decoder — the predictor, we just saw, can be a neural net, but how do you find a correct decoder? That was the first challenge. And this is the big one, the actually real challenge: how do you find the latent, and how do you find constraints over the latent? If the latent is too powerful, again, the model will simply have low energy for everything. Having a latent that is only one-dimensional forces the system to have zero energy only on a one-dimensional subspace of y; if z is two-dimensional, there is no architectural design constraint left that restricts the low energy to a specific, determined subset of values, and we end up with a flat, collapsed energy. And that was it, OK? I'm not sure whether there are more questions; I don't know if people are still alive and listening, but I hope you're still with me. The third homework is coming out tonight, I believe. It is going to be about structured prediction with latent-variable energy-based models, which is what Yann covered yesterday — he explained to you exactly how you should implement the algorithm. If you have any questions, or you find yourself stuck, reach out to us on Campuswire; we are always here to help you. It's a challenging class, I'm aware of that, and you are very brilliant students — it has been amazing to see that most of you are more advanced than I am in terms of programming; again, I'm not a programmer. So if you need any help, write on Campuswire and we will help you out. Again, no extension for the previous homework, because this homework is going to be even more challenging. And after this comes the final project, where you will have a real-world scenario with real-world challenges. It's going to be super exciting, and I wish you the best of luck. That's it. No more questions? Okay. Take care. Bye-bye.