All right, all right, all right. So today we're gonna be talking again about foundations of deep learning. That's me, Alfredo, and you can find me on Twitter under the handle @alfcnz. Actually, if you check Twitter, you can find some news about today's lesson, since I posted it online yesterday night. So the deal is always the same: as soon as you don't understand, as soon as I don't make sense — and since I didn't sleep and have been working on this stuff for the last 30 hours, it's very likely I won't be making much sense at times — every time something is not clear, just stop me and ask me anything. Because, again, if we keep going and you are not following, then we are not going anywhere, okay?

All right, so today we're gonna be talking about inference for latent-variable energy-based models (EBMs), using the ellipse toy example. Likewise as in our first lab, we'll cover only inference — only inference for energy-based models. I will not say the word "training" ever again, okay? I'll try, at least. So today we're gonna be talking about inference. What is this stuff, and where do we start? We're gonna be starting from our training examples — training samples. I said I wasn't going to say that word.

All right, so let's see what kind of data we're gonna be working on and why we need these energy-based models. We can think about our data y (bold y) as having two components, y₁ and y₂. y₁ is ρ₁(x) — a function of my input x — multiplied by the cosine of θ, some angle we have no access to, plus some noise ε. And y₂ is ρ₂(x), again a function of my input, multiplied by the sine of this θ, plus noise ε. ρ is a function that maps the one-dimensional input, ℝ, into ℝ²: it maps my x into αx + β(1 − x) for the first component and βx + α(1 − x) for the other, and then everything is multiplied by exp(x). α and β are simply 1.5 and 2. So this is simply the equation of an ellipse — but if x goes from zero to one, as I show you here, it draws a sort of horn with an exponential profile: it starts as a horizontal ellipse and eventually ends up as a vertical ellipse, okay? x here is sampled from the uniform distribution. Similarly, θ is sampled from the uniform distribution over [0, 2π). ε, instead, is sampled from a normal distribution with mean zero and standard deviation 1/20.

So again, as you might have seen on Twitter, this stuff looks pretty cool, and it looks like that. And since we have magic on this side, we can do this. You can see here this exponential envelope: we start with the horizontal ellipse and end up with the vertical one, okay? What we want to pay attention to here is that at a given location x there is not just one y, right? So we cannot really train a neural net as a vector-to-vector mapping, because there is no single vector to map to — there is a whole bunch of vectors. Given one single input x, there are many, many possible y's: a whole ellipse per given x.
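To make this concrete, here is a minimal sketch of the data-generating process just described, in PyTorch. All names here are mine, and the actual course notebook may differ:

```python
import torch

alpha, beta = 1.5, 2.0

def rho(x):
    # rho: R -> R^2, the exponentially growing "horn" envelope
    return torch.exp(x).unsqueeze(-1) * torch.stack((
        alpha * x + beta * (1 - x),   # rho_1(x)
        beta * x + alpha * (1 - x),   # rho_2(x)
    ), dim=-1)

def sample(n):
    x = torch.rand(n)                      # x ~ U[0, 1)
    theta = 2 * torch.pi * torch.rand(n)   # theta ~ U[0, 2 pi), never observed
    eps = torch.randn(n, 2) / 20           # eps ~ N(0, (1/20)^2), one per component
    r = rho(x)
    return torch.stack((
        r[:, 0] * torch.cos(theta),        # y_1 = rho_1(x) cos(theta) + eps_1
        r[:, 1] * torch.sin(theta),        # y_2 = rho_2(x) sin(theta) + eps_2
    ), dim=-1) + eps
```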
So we can't really use a normal feed-forward neural net to do this. Similarly, if we are just talking about the y's: given one value of y₁, I cannot even tell what the corresponding y₂ is, because there are almost always two values of y₂ for a given y₁, right? And so vector-to-vector mappings, as we've been learning so far, are not quite sufficient. So today we're gonna figure out how to use these latent-variable energy-based models to deal with this kind of multimodal outcome.

To make things simple and make my life easier, we're gonna do a few simplifications. The first one: I'm gonna remove the input. There will be no input data — my model will not have input data. And this is like, what?! Anyhow, I fix my x to zero. By fixing x to zero, the exponential becomes simply one, and ρ₁ becomes two, right? The αx term gets deleted because x is zero, and you're left with β multiplied by one. And ρ₂ automatically gets the amplitude 1.5. And so my data points y are simply points of the form (2 cos θ, 1.5 sin θ), with θ uniformly sampled, plus noise. The collection of all my y's gives me capital Y. So capital Y is the collection of all my samples, and here I decided to use just 24 samples — 24 different samples of θ from the uniform distribution, okay? And for each of these samples there will be one ε for the first component and one ε for the second component.

All right, so what we'll try to do today is going to be to learn — well, no, "learn" is wrong, we are not learning anything. We imagine that someone gave us an already-trained, already-learned network. We're gonna be learning how to perform inference: how we can use a model to figure out whether a point belongs or doesn't belong to the training manifold, okay? So this is my training data. These are my y's, which form, again, an ellipse. You can see here that the major radius is two: there are one, two, three, four boxes, each box is 0.5, so this radius here is two. And the minor radius has one, two, three boxes of 0.5, so it's 1.5.

"When you said there's no input — what is θ, then? Do you consider that an input?"

So θ is something we don't have access to, right? It's something we don't see. x could be the input we provide the model to figure out at which location of that horn we are — it tells us the dimensions of those ellipses. But θ is something we don't have access to. θ was simply a variable used for generating our data. It's a missing variable, a missing input, okay?

All right, so let's look at what the model manifold is. In this case, I'm gonna have a latent input — "latent" means it's missing; I don't have access to this input, but still there is some potential input. You'll notice it's drawn in the same color as that θ, right? Anyhow, I have my z, which I take from [0, 2π) — the flipped square bracket means I'm considering values from zero to 2π with 2π excluded — with step π/24. So this is basically a line. Well, it's a set of points: there are 48 of them, from zero to 2π excluded.
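Continuing the snippet above, a sketch of this simplified setup — the seed is an arbitrary choice of mine, since I don't know which one the lecture used:

```python
torch.manual_seed(0)                       # arbitrary seed; the lecture's is unknown

n = 24
theta = 2 * torch.pi * torch.rand(n)       # the hidden variable we never get to see
Y = torch.stack((
    2.0 * torch.cos(theta),                # rho_1(0) = beta  = 2    (major radius)
    1.5 * torch.sin(theta),                # rho_2(0) = alpha = 1.5  (minor radius)
), dim=-1) + torch.randn(n, 2) / 20        # one eps per component

# the latent grid: [0, 2 pi) with step pi / 24, i.e. 48 points
z = torch.arange(48) * (torch.pi / 24)
```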
So this latent input goes into a decoder, and the decoder gives me this ỹ — and y is bold because, again, it lives in two dimensions. More precisely, as z varies over a line, ỹ varies around an ellipse, okay? On the other side, instead, we have the bold y's, which are my observations. How do I know these are observations? Because this circle is shaded, whereas the other circles are simply transparent. The bottom one is a little bit gray, which means I have access to this data, okay? Cool.

So this is how these points look. The blue points are the ones sampled from my data-generating distribution — we already sampled them, we have 24. And here I decided to plot 48 of these reconstructions from the latent variable, so that I can clearly see what the network thinks the true manifold is, okay? In the second episode, when we do learning, we'll figure out how to match my internal belief — the violet one — with the actual data we have. We are not gonna see that this time. This time we already have this model, which is pretty bad, since it doesn't match the data; and still we're gonna see how to use it.

"So what determines the shape of the red, or orange, points? Is it the α and the β?"

α and β determine the shape of that blue thing, right? The overall thing was that horn I showed you before, the one that was spinning, and we decided to slice it at a specific value of x. So this is a cross-section, which gives us this potato, the blue potato. On the other side — I'm about to tell you what is inside the decoder — we have an internal belief about what the true data manifold is. That's the model's belief about how the data is supposed to look. Okay, let me show you a little more information on the next slide, so maybe we can get in sync.

So here we're gonna be looking at this energy function. What is this energy function? It's something that tells me the compatibility between this ỹ and y, the blue y. In this case, it measures the distance between a given training sample and my reconstruction — my best guess about what I think the real data point should be. Let's give more context here. My energy E, a function of my data point y and my latent variable z, is the sum of the squared Euclidean distances of the two components:

E(y, z) = (y₁ − g₁(z))² + (y₂ − g₂(z))²,

where g is our decoder function of z. And this, importantly, happens for every y we pick from capital Y. So in this case we have 24 different E's: we can index 24 different E's based on the specific y you pick. More about this on the next slide.

So what is this decoder? This decoder is a little bit cooked, as in: I know the data-generating process, so I can put inside G something quite aligned with what I think — a very good guess about how the output should look. My G, a two-component function (g₁, g₂), maps the real line into ℝ², and it maps my z into these two components:
G(z) = (w₁ cos z, w₂ sin z). Note that the only parameters available in this network, in this decoder, are w₁ and w₂, okay? The cos z and the sin z are a priori knowledge — I know them already, and I put them there as my best guess. So, again, this network has two parameters. Nevertheless, with two parameters we can still do many things.

So, to stress it once again: this E exists for every pick of y in the set of all y's. Let's put this E at the top here, so I can clear the screen below, and now I'll show you all 24 energies we have. How do I get this stuff? These energies come from the fact that I pick a specific y. For the first one, I pick y′ — my pick of y — to be the first of my training samples, and therefore I can call the first energy E₁. So I can index them now: since I have a discrete number of training samples, I have a discrete number of energies. This is my E₁, and the last one in the row is the one associated with the sixth sample of my training set, so that's E₆. If we go down to the last row, I'm picking the 19th sample from my training set, and I have E₁₉ over there. And finally, if I pick y′ to be the last, the 24th, example, I end up with E₂₄. On the x-axis of each of these little cells you have z. So each of these E's — E₁, E₂, ..., E₂₄ — is a function of my latent variable z, which spans, as we said before, zero to 2π. In these drawings I just have the values separated by π/12, so I get a nice spacing for drawing the function. Moreover, the range of this energy in this case goes from zero to 12, and we're gonna compute these values in just a moment, so that we can better understand what the heck I'm talking about, right? Again, until yesterday I had no clue what these were, okay? I am very new to this topic as well, and therefore we are exploring together this jungle of very funny, weird, wiggly functions, okay?

We're gonna start by cherry-picking two of them. For example, E₂₃: it looks pretty much okay — it's mostly smooth, and I think it even looks convex in the central part. And then, of course, if I pick the nice, smooth one, I'm also gonna pick some weird stuff, like the doubly wiggly one. But as I said, let's start easy, with the simple version, okay? So far, is everything all right? No one is writing anything in the chat, and only Sean has asked a few questions. Are we all on board, or have I lost some of you?

Yeah — so basically the square would be y₂₃, and the x-axis shows that as you vary z, you're evaluating this E of y₂₃ and z. Yeah, this is E₂₃, the one I'm showing you right now. "Great, the lecture's going great. I'm understanding this well." Okay, okay, that's fantastic.

So let's look at this first example, this kind of U shape. How does this U shape arise? This is the current configuration: y′ is the 23rd example from my training set, which is shown here by that green × on the right-hand side.
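Before we walk through these curves, here's a compact sketch of the decoder and the per-sample energy we've just defined, continuing the snippets above. The values of w₁ and w₂ are assumptions — they match the rough values read off the code later in the lecture — and which curves come out smooth or wiggly will depend on the seed:

```python
w1, w2 = -1.5, 0.4     # the decoder's only two parameters (values assumed, see below)

def decoder(z):
    # G: R -> R^2, G(z) = (w1 cos z, w2 sin z)
    return torch.stack((w1 * torch.cos(z), w2 * torch.sin(z)), dim=-1)

def energy(y, z):
    # E(y, z): squared Euclidean distance between y and the reconstruction G(z)
    return ((y - decoder(z)) ** 2).sum(dim=-1)

# one energy curve per training sample, indexed by the y you pick, e.g.:
E10 = energy(Y[9], z)    # the lecture's sample 10
E23 = energy(Y[22], z)   # the lecture's sample 23
```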
Okay, so over here: whenever I start my z at z = 0, it turns out that z = 0 corresponds to this location over here. If I send z = 0 through the decoder, I get a point over here. Why is that? Simply because the w₁ we randomly generated is a negative number. So this length here from zero — the segment from here to here — is my w₁; and w₂, instead, is a positive number over here. So whenever z = 0, cos(0) = 1, so the first component becomes one multiplied by a negative number and I go down here; and sin(0) = 0, so you're on the horizontal axis. So this over here is my initial point.

How far is this point from the green ×? Let's count: one, two boxes, three, four boxes, five, six boxes, and seven. Two boxes make one unit, so seven boxes means three and a half. So the distance between this point over here and the green guy is roughly 3.5. Now, if you take 3.5 and square it, you get — you guessed it — roughly 12. And that's why we get this point over here. If you don't trust me, take out a calculator and check how much 3.5 squared is, okay? Anyhow, that's why we start at this height here, 12. As we keep increasing z and go from zero to π/2, we end up at this location over here. As we keep going until π, you end up at this location over here — and at π, you're one box away from this green boy. One box is 0.5, and 0.5 squared is 0.25, so the height of this red curve at this location is 0.25, very close to zero, okay? Then we keep cranking up that z, we go to 3π/2, and we keep going up to 2π — and at 2π we basically get back to the same location where we started, okay? And if you keep going, it repeats: up and down, up and down.

All right, so this one looks pretty okay, I think — no crazy stuff. But we saw the other one was kind of wiggly, right? What happened there? So instead of using y₂₃, we're now using y₁₀, which looks like this — like a signature, like Yann's signature. All right, so what happened here? In this case, our y′, the pick from my possible y's, is this guy over here, the × at the top. And again, as I told you before, whenever z = 0 we start at this location over here. So, if you've understood what I'm talking about, now we're gonna do an exercise where you answer me: can you tell me the distance between this location over here and this point over here? Question for the people at home: can anyone tell me the length of this segment I just drew?

"1.5 times 1.41" — which is 1.5 times the square root of two. Yes, that's correct. And if you square it, you get what? It's gonna be 1.5 times 1.5 times two, right? You said 1.5 times the square root of two, and I'm just squaring everything, so we get 1.5 squared times two. 1.5 times two is three, and three times 1.5 is 4.5, right?
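Both counted-boxes computations are easy to check (each box is 0.5 units a side):

```python
print(3.5 ** 2)                 # 12.25 -- the "roughly 12" start of E_23 at z = 0
print(0.5 ** 2)                 # 0.25  -- the minimum of E_23 at z = pi
print((1.5 * 2 ** 0.5) ** 2)    # 4.5   -- the start of E_10 at z = 0
```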
And so we can determine that my initial energy, which is the squared length of this segment, is going to be 4.5 — which is exactly this initial value over here. This point over here is at 4.5. Cool.

"Can you just repeat why you know that z = 0 corresponds to the leftmost point?"

Yes — I know this because I checked the code. I know that my w₁ is equal to something like −1.5, okay? And then we have w₂ — and I'm drawing with the touchpad, so it's crazy — which is something like 0.4. That scribble looks like a one, but believe me, that's a four. Okay. When we go to π/2, we are roughly one unit away from this point, and one squared is roughly one, so this height over here is one. Then we climb up to this location over here, where we should get basically the same point as before, so a similar value, a little bit smaller — and then, oh, what happened here? When we go to 3π/2, we are actually at this location over here, and we have another minimum, right? What happened? Basically, this point is closer to my green guy than the point over here. And so in this case this energy function has a local minimum, which happens at 3π/2, at this location over here. All right, cool. Let's go back to the arrow. So now we've determined that this height was 4.5, this one was one, and this one — we can figure it's about two squared, so four, okay?

Okay, so what happens now? All the stuff is still here. Okay, clean. All right: free energy. What is this free energy? We're gonna figure that out right now. The free energy — actually, this is the zero-temperature limit of the free energy — is simply the minimum value of this function E with respect to z. We can define this ž ("z-check") as the argmin of the energy function with respect to z. Why the check? Because the check points downwards, right? Whenever I minimize my energy, I find the lowest location, so the check means ž is the location where the energy is lowest, okay? And how can we find that ž? If z is discrete — say we have K possible values — we can do exhaustive search: we check every z we have. Otherwise, we can use gradient-based techniques, such as gradient descent. And pay attention: I didn't say stochastic gradient descent, because here we are not doing anything stochastic, right? E is a function of quantities we know entirely. When we do stochastic gradient descent, we minimize a loss function expressed as an average of per-sample loss functions. Here, instead, we minimize the specific value of E — there is no average, so it's not stochastic. And therefore you can use algorithms such as conjugate gradient, line search, L-BFGS, and so on, okay? So let's figure out what this free energy is and how it works.
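Since our z is one-dimensional and lives on a cheap-to-enumerate grid, exhaustive search is the easiest route — a minimal sketch, building on the functions above:

```python
def free_energy(y, z_grid):
    # F_inf(y) = min_z E(y, z); also return z-check, the argmin
    e = energy(y, z_grid)
    i = e.argmin()
    return e[i].item(), z_grid[i].item()

F10, z_check = free_energy(Y[9], z)   # free energy of the lecture's sample 10
```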
So, given that we have defined this ž, the free energy — the zero-temperature limit of the free energy — is simply the energy E computed at the location of my ž: F∞(y) = E(y, ž) = min over z of E(y, z). So let's visualize this E₁₀, the energy when I pick sample 10. I initialize my latent variable z̃, the orange one, with some value, and then I run a gradient-based minimization method. I end up at the blue location, which is my ž — and it's blue because it's cold, meaning low. I usually think about this energy as being like a temperature: if you multiply a temperature by the Boltzmann constant — kT — you get an energy, right? So energy and temperature are very closely related, and I use blue to show that it's low and cold. At that location, ž, we reach the minimum of this energy, and that is my free energy — the zero-temperature limit of the free energy. Cool, cool, cool.

"So in practice, this could depend on the initialization, then?"

Oh yeah — well, your algorithm can screw up depending on the initialization, for sure. I can show you later that L-BFGS actually lands on the wrong minimum sometimes. Nevertheless, the free energy is the global minimum, right? I'm telling you that the minimum value of E is the free energy; if we don't get there because we don't know how to get there, that's a different issue. So the free energy itself does not depend on the initialization — the initialization just makes your algorithm more or less likely to converge to the actual correct solution.

All right, so what happens here? In this chart, the blue points are points from the training distribution, and the tilde ones are samples from my model. My y′ is the pick I have chosen — the 10th item in the training set. My z̃ is the value I initialize z with: if I send it through the decoder I showed you before, it generates this point over here. Then I run some minimization algorithm and end up at the blue location: the blue × is the decoded version of ž, which is the closest item to this green boy over here.

So why are we doing this? How can we use this model, and what for? Well, if someone has trained this model and given it to us, we can find, among all the values the model can possibly generate, the one that is closest to our sample. So we can use this for performing denoising, for example. If I have an image that is corrupted, and is therefore far from the model manifold, I can ask: hey, model, can you tell me which latent gives me the decoded item that is as close as possible to the image I'm looking at? And then we could just pick that decoded value as a cleaned-up version of my corrupted input. And what is the energy — the free energy — here? The free energy now is simply the squared distance between the green point and the blue ×. So if you take these two boxes, which make roughly one unit — and one squared is, again, one — that is going to be the free energy corresponding to this × over here, okay?
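And here is what the gradient-based alternative could look like — plain gradient descent on z with autograd, starting from an initial z̃. The learning rate, step count and starting point are arbitrary choices of mine:

```python
def infer(y, z_init, lr=0.1, steps=100):
    # descend E(y, .) from z-tilde to a (possibly only local) minimum z-check
    z_hat = torch.tensor(float(z_init), requires_grad=True)
    optimiser = torch.optim.SGD([z_hat], lr=lr)   # plain GD: nothing stochastic here
    for _ in range(steps):
        optimiser.zero_grad()
        energy(y, z_hat).backward()
        optimiser.step()
    return z_hat.detach(), energy(y, z_hat).item()

z_check, F10 = infer(Y[9], z_init=1.0)   # z-tilde = 1.0 is an arbitrary start
```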
So every ×, every location here around the training manifold, will have a free energy, which tells you the closest distance to the manifold, okay? So — let's say our model is well trained — we can tell that this location over here has much lower free energy than a point over here, and so this point is more likely to be compatible with what the model has been trained on. Or, as in the case I'm showing you, the model is definitely not well trained.

"What do you mean by well trained?"

In this case here, just for pedagogy's sake, I didn't fully train this model, so that there are errors. Ideally, those purple points should exactly match those blue points — that would be a well-trained model, one that captures all the dependencies between the y variables. And this is, again, one cross-section of that horn. This is a not-well-trained model, meaning I stopped training after a few epochs: the model tried to get there, but didn't quite manage yet. Alternatively, we can think of it as a well-trained model that actually learned properly, with these query points being much farther away: by computing the free energy of those points, you get a measure of how far they are from the learned distribution, okay?

All right, so let's move on and look now at the 23rd instead — the E₂₃ U shape. In this case, it's much easier: we just have a global minimum and a global maximum. But there is a question: "If this were for denoising, and the model were trained to the point where the tilde points sat right on top of the training points, wouldn't it then not do any denoising?" So I believe you're saying: if the tilde points were far away from the purple ones — that would mean the model is not well trained, right? If these blue points were up here and all of them were closer to some point over here, that would mean this model is badly trained. So again — today we don't talk about training. This is simply what has been given to us; we just play with what we have, and try to figure out what this energy and this free energy mean, okay? Today is about how to use this stuff, not how to learn it — learning is next time. It's enough to understand how to use it. Trust me, all right.

So let's figure out what's going on with this U shape. The U shape comes from this kind of example here. Again, we initialize at the orange location, and by running some gradient-based method — or whatever minimization process — we find this blue ×, which is my ž. We go from z̃, the initial value of my latent, to ž, the value at which I find the minimum of my energy. Since this is periodic, I'm showing it on the next repetition, so I don't clutter the chart too much. And this came from this configuration over here: we start with these training points — these points I sampled from my model — and my pick was this green one over here. And in this case, perhaps I can tell that we initialized the latent at the orange point and then the optimizer went a little too far: it didn't choose the exact best; it overshot a little, I think. Anyhow, the free energy of this location here is going to be 0.25, right? 0.5 squared.
Cool, cool, cool. So what's left to show you? Just a few more things — we're almost finished — and then I'm looking for all your questions, because I really know you have questions. I have questions. So let's now compute the free energy for every location I show you in this grid. What does computing the free energy for every location mean? Just for clarity's sake, I'm gonna repeat myself, because I like to listen to myself talk, right?

So let's select a point, in green here — say I'm picking this sample over here as my first location. Given that location, I pick... orange — do we have orange? There is no orange. Okay, I have to pick red, sorry. Say I initialize my latent variable such that the decoded version of z̃ is this point over here. Then we run our minimization process to perform inference — to find that ž. Whenever we find ž, that process is called inference. Given an energy, and given a y — not a training sample, just a location y — I do inference to figure out the most likely latent, the missing variable that would have generated that point over there. Inference, again, means we do minimization, and therefore we move around our model manifold until I get to this location over here. What is this location? It's the location that is closest to the sample y I picked. Therefore, what is my free energy? My free energy is simply the squared distance between this green guy and the red one: this segment over here, squared, is the free energy of this point over here.

So, question for you: how does the free energy of the point at the top left compare with the free energy of the point I circled in yellow over here? Which one is larger, which one is smaller, and where is my ž for the second example? "Green is larger." Yes, green is larger, because this distance here, squared, is much larger than the other distance. Similarly, if we initialize with luck and run a gradient-based method, we may end up at a location over here, and the free energy is the squared distance between that point and that point here. So the free energy of this point would definitely be much larger than that of this point. Another question someone could ask is: how far is the green point from my distribution — from my learned distribution? The learned distribution here is represented by those blue points. So you can tell that the point at the top left has a higher free energy: it's farther away, less compatible, than the other guy, right?

All right, we're almost done here. As a little exercise, pay attention to those five values: I'm picking that row there, just below the x-axis, and I'm picking the first, then the fourth, and so on. And I'm now plotting these energy functions — they look pretty much like this. For the blue one, as you can expect, we extend up to 20 and then go down to roughly 2.5: 20 is the squared distance between this point and the farthest point away over here, and 2.5 is this distance here, squared. Similarly, you have an energy function for the red one, and for the purple, green and orange.
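By the way, the comparison exercise above is easy to reproduce with the earlier sketches. The two query points below are made-up coordinates standing in for the two locations on the chart:

```python
y_far  = torch.tensor([-3.0,  2.0])   # standing in for the top-left grid point
y_near = torch.tensor([-1.0, -0.3])   # standing in for the yellow-circled point
F_far,  _ = free_energy(y_far,  z)
F_near, _ = free_energy(y_near, z)
print(F_far > F_near)                 # True: farther from the manifold, larger F
```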
Then, given that I compute all these energy values, I can now compute the free energy. The free energy, instead of being a function, is going to be a single value once I pick a specific location: it's no longer a function of the latent. Whenever we compute the free energy, the latent disappears, and I get that ž, the optimal latent — the latent most likely to give me that point. So here, the ž for the blue curve happens over here, and similarly the ž's for the orange, green, purple and red happen at these locations. Over there, for sure, we could have ended up caught in this local minimum — that is definitely a possible pitfall of using gradient-based methods.

So, question now for the audience — I'm removing everything — what is F∞? How many dimensions does this thing have? Can someone tell me the domain and the image of this function on the chart? Where does F∞ live? Anyone? Is anyone listening? Hello? Okay — "Y is in R, 24" — yeah, but there is just an ℝ there; I don't know what "just ℝ" means. Capital Y has 24 items, and each item in capital Y is two-dimensional, so Y is a matrix. But I'm not asking about capital Y, I'm asking about F∞. So F∞, as someone mentioned here, is definitely real-valued — in our case it's actually non-negative, because it's a sum of squares. And the domain of F? The domain is wherever the bold y lives, no? The bold y is a vector in two dimensions, so the domain is ℝ².

So again, these F's are scalar values, and I'm gonna represent the different intensities of this scalar with this color bar here. In violet, very dark — you may not even be able to see it — I represent free energy equal to zero; in aqua, the zero-temperature-limit free energy equal to one; and everything at or above the value two is yellow. And this is how that grid looks, okay? Each location in that grid — I showed you those green points before — has a free energy, represented here by this color. At this location on the bottom side, you can see it's yellow, which means it has a free energy equal to or larger than two. Moreover, those arrows are the gradient: they point in the direction of maximum ascent. As we move closer to this region here, you finally see some colors, and here you can tell the free energy is getting lower and lower, until we hit the region where the reconstructions live — the region where my free energy is zero. When we train this model, we try to get this zero-energy level to match the location of the blue points. Of course, as you can tell, this model is very poorly trained, and therefore this energy surface does not match my training points well. It's getting close, but it's not there yet. Next time we'll see how to stretch this energy so that it fits nicely on these blue points.
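Putting the sketches together, this is how such a free-energy surface could be computed. The grid bounds and resolution here are my guesses; the lecture's grid, as we count in a moment, is 25 × 17 = 425 points:

```python
y1_grid = torch.linspace(-3, 3, 25)    # 25 columns (bounds assumed)
y2_grid = torch.linspace(-2, 2, 17)    # 17 rows    (bounds assumed)
F = torch.empty(17, 25)
for i, y2_ in enumerate(y2_grid):
    for j, y1_ in enumerate(y1_grid):
        F[i, j], _ = free_energy(torch.stack((y1_, y2_)), z)
# F can now be drawn as the colour map: dark violet ~ 0, aqua ~ 1, yellow >= 2
```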
"Why is the energy surface single-valued?"

So, the energy surface is the value of F∞, and F∞ is the minimum of my energy. The energy, capital E, is a function over all possible latents; but then, given this function, we find the minimum value this energy can take. That minimum value is the zero-temperature limit of the free energy, which is this F∞, okay? So E(y, z) is a function of y and z, but whenever we take out the z through the minimization, we get this F, which is a function of y alone. So as I move across the y space — here we have y₁ and y₂, the two components — F takes values larger than two, larger than two, blah blah blah, then 1.75, 1.50, and so on, lower and lower, until F is roughly zero, and then it actually increases a little again. Maybe next time I'll also show you this chart in a rotating 3D version — I didn't have time to do that. Did I answer your question? Is it clear why this energy function is single-valued, as in scalar-valued? Am I understanding the question correctly?

"But we have 24 y's." So the capital Y's are these blue points. The y's I'm using right now are these ones — they're not 24. If you count — let me zoom in so I can see — here we have one, two, blah blah blah, 12 and 12, plus one: 25. And here we have 8 and 8, which is 16, plus one: 17. So right now we have 17 × 25 points. I don't know how much that is; someone can compute. Okay Google, how much is 17 times 25? Okay, she's not listening. It's 425 — there you go. So right now we have 425 points, and therefore 425 energy functions. Functions of what? Given that I pick a y, I have an energy function, and those are functions of z. Given that you take the minimum value of that energy function, that's your free energy for that specific y: you've removed the latent variable.

So we have an internal way of spanning our manifold, right? Think about it like this: your model has its potato — the model thinks the data is distributed over this kind of shape — and your latent variable lets you travel all around this potato. So if you ask me, "is this point here on your manifold or not?", I can answer by going around my manifold and finding out whether I get there. If the free energy of that point is zero, it means the point you're asking about lives on the manifold the model has learned. If the free energy is not zero, then it is simply the squared Euclidean distance from your point to my manifold, right? Did I answer that question? Yeah? Okay. More questions for me? Was everything clear? This stuff I really just digested in the past 30 hours, so again, I might not have done a very good job.

"How do we choose a function to represent the data manifold? In this case it seems like we chose an ellipse based on the data — but what about other scenarios?" Ha! Yeah, definitely — there is a lot of research going into architectures, network architectures. So, again: right now, yes, we chose that. Next time, I'm gonna try to learn the level of compatibility — to learn this energy for the (x, y, z) triple. And we're gonna just use neural nets: even the sine and the cosine you can somehow approximate with a few layers, right?
So instead of having this G function over here — this very simple thing — we can think about having a few layers of a neural net. You can still use a few layers of a neural net, but you're not going to use the neural net to do a vector-to-vector mapping; you're going to use it to map a bunch of vectors to a scalar. A bunch of vectors to a scalar — that's the energy-based way of thinking about things. Because, again — say you want to translate something from one language to another. I have one sentence, but I may translate that sentence into the other language in several different ways, right? So how do you train this? You can't really do a softmax, because, first of all, there is an infinite number of sentences; and moreover, there might be multiple sentences that are all correctly associated with your first sentence. This energy-based model gives you a scoring mechanism — this energy — which tells you how compatible points are. Here x, y and z are all interchangeable: given one, I can find the others. If my model has learned the energy, I can find x given y, y given z, z given x — all kinds of combinations of those x, y and z. I don't even have to write them as x, y and z; I can just write all the components — as long as my model learns all the, how do you call them, interactions that exist in my data. That's why Yann likes these models so much and why they are super powerful: they don't make too many assumptions, I think. Okay, I answered your question.

We are over time. So, I think this lesson was kind of fine — I don't know, you have to tell me, because I really don't know. I hope you liked it. "Yeah, that was great." Okay — because people are very quiet today. I wanted to make a notebook too, but the notebook is really ugly: I used it to make very pretty visualizations, but the code is really ugly. Maybe next time I can share with you a cleaned-up version of the notebook for pedagogical purposes. I'll especially show you this network, which doesn't have an input and doesn't have a forward function — which is so funny. And then we're gonna learn, perhaps, what the free energy is without this β that goes to infinity, and we're gonna learn how to do learning, okay?

So today, again — let me go back to the beginning so we can end here — today we talked about inference, okay? We do inference by minimizing an energy function. Learning is something we're gonna talk about next time; the two are different topics. So this was inference for latent-variable energy-based models, which allow you to capture this multimodality — this coexistence of things. You don't have simply vector-to-vector; you have one-to-many options, right? And then we talked about this stuff here, no? And how we can possibly try to learn this compatibility between x-y combinations, right?

So there's a question: "Minimizing the energy with respect to the trained manifold basically means denoising?" I think you can think about it that way. Well — it depends which one is the real manifold, right?
So if the model has learned the real manifold, then by minimizing the energy you can find the denoised version of your input. Another option you have for denoising: if you find yourself here, you can compute this energy and follow the gradient; then here you recompute the energy and follow the gradient again; and so you end up — boom — down on the manifold, right? So you can make little steps to find out where to go, or you can use the ž to find out the best approximation of your point over here, right? Okay, all right. So that was it. Thank you for listening, have a nice evening, and I'll see you on Friday — feel free to ask Yann questions about this practicum; he was, you know, helping a lot as well. Have a good night, bye-bye.

So, how can you get more out of this lesson today? Comprehension: if something was not yet clear, you should really ask me anything in the comment section below, okay? I will answer every comment you write down there. News: if you'd like to keep up with everything I post online, check my Twitter account, @alfcnz. If you'd also like YouTube to notify you about the latest videos I upload, press that subscribe button and turn on the notification bell, so you won't miss any content. If you liked this video, put a like on it — it means a lot to me. Searching: we have a companion website where you can find each and every video transcribed by students who volunteered. For example, here you can see this lesson transcribed; as you can tell, the titles are links that redirect you to the corresponding section of the video. So here we have this lesson transcribed for you in English. Moreover, English is not the only language available: as you can tell here, there is the English flag, and if you go up to the homepage, you can find that many languages are available — Arabic, Spanish, Persian, French, Italian, Japanese, Korean, Russian, Turkish and Chinese — and more are coming. If you would like to contribute and add your own language, don't hesitate to contact me on Twitter or by email. That was the language part. Moreover, it really, really helps if you implement the things we cover in class with PyTorch, a notebook perhaps, and some patience. Today's class didn't have a companion notebook, but I would nevertheless really recommend that you try to put together a few experiments yourself, so you can test your knowledge. Finally, if you find any bugs in the previous notebooks, on the website, anywhere — you're really encouraged to point them out on GitHub; or, if you feel so inclined, you can also send a pull request, so you can become an official contributor to this project. And don't forget to like, share and subscribe. Bye-bye.