All right, good morning, everyone. Welcome back to Deep Learning, Spring 2021. Before starting the class, I'd like to share something I happened to see a few days ago on Twitter, which was very nice. If you go on my profile, the last post is about affine transformations in 2D and 3D. The author made a video explanation, which you can find on YouTube, and I think it's really good: it lets you understand what's going on whenever you apply an affine transformation to points in a plane. Again, this is about building your own intuition and understanding of how things work. One fact I might not have remembered otherwise is that an affine transformation in 2D is simply a linear transformation in 3D. So everything can be thought of as a linear transformation: an affine layer in d dimensions can be written as a linear layer in d + 1 dimensions (a quick numerical check of this fact follows below). That was the shout-out I wanted to give to James.

So today, as we said, we're going to be talking about energy-based models. I'm also going to try a new teaching tool that likewise came from Twitter, so thank you for sending me teaching material and teaching tools; it's very much appreciated. I hope I'll be able to use it in class today.

As always, if you have no idea what's going on, just type in the chat "I don't know what's going on" and I'll try to explain better. These topics are not trivial, and they reconnect to what Yann was talking about yesterday, which is perhaps a little esoteric compared to commonly available knowledge.

Cool, let's start. Today we're talking about inference for latent-variable energy-based models, or EBMs. In this lesson we deal with inference only: exactly as in lab number two this year, someone gives us a model and we figure out how to use it and what we can get out of it. Next week, instead, we'll see how to train this model so it actually does what we want.

In this specific lecture we'll work with a very small toy example, the ellipse. This lets me explain things that might be confusing if we used proper, realistic examples; here we can reason, plot, and draw everything on a 2D screen, and it's already quite challenging even as a toy example.

So this first chapter is inference in the unsupervised setting. In energy-based-model language, unsupervised learning corresponds to the unconditional case: there are no labels, because we're not doing classification, and there is no input we condition on. More about this in a few seconds; it's a bit confusing, I'm aware, so just bear with me.
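Going back to the affine-transformation fact from the shout-out for a moment: here is a quick numerical check, not from the lecture slides, just a minimal PyTorch sketch, that a 2D affine map equals a single 3×3 linear map acting on homogeneous coordinates. The specific rotation and translation are arbitrary example choices.

```python
import torch

# a 2D affine transformation: p -> A @ p + b
A = torch.tensor([[0., -1.],
                  [1.,  0.]])          # a 90-degree rotation (example choice)
b = torch.tensor([2., -1.])            # an arbitrary translation
p = torch.tensor([3., 5.])             # some point in the plane

# the same map as ONE linear transformation in 3D, on homogeneous coordinates [p; 1]
M = torch.eye(3)
M[:2, :2] = A                          # top-left block: the linear part
M[:2, 2] = b                           # last column: the translation
p_h = torch.cat((p, torch.ones(1)))    # lift p to [p1, p2, 1]

assert torch.allclose(A @ p + b, (M @ p_h)[:2])  # identical results
```

So a linear layer in d + 1 dimensions, acting on inputs padded with a constant 1, can implement any affine layer in d dimensions.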
So we're going to start by understanding the training samples that would be used to train this model in the next lesson. Again, someone gave us this trained model, but we still have to understand the data, right? Just as in the actual first lab, where we started from the data, here I'm going to show you what type of data this model can be used on.

Our vector y has two components. The first component is ρ₁, a function of x (which is usually our input), multiplied by the cosine of an angle θ, to which we don't have access, plus some noise ε. Similarly, the second component is ρ₂, another function of x, multiplied by the sine of θ, plus ε. So x we may have access to; θ we don't:

y = [ ρ₁(x) cos(θ) + ε,  ρ₂(x) sin(θ) + ε ]

So what is this ρ? As you can tell, ρ is a function that maps the real numbers to R²: x is a scalar, and out come the two components ρ₁ and ρ₂. In the first component, x gets mapped to αx + β(1 − x), with α = 1.5 and β = 2, times an exponential envelope. So as x goes from 0 to 1, the convex combination αx + β(1 − x) goes from 2 down to 1.5; at x = 0 the envelope is exp(0) = 1, so ρ₁(0) is just 2, while at x = 1 the 1.5 gets multiplied by exp(2). The second component does the opposite: it goes from 1.5 to 2, with the same envelope.

What this means is that the y's are points living on an ellipse. At the start the ellipse has radius 2 on the horizontal axis and 1.5 on the vertical axis, and as x goes from 0 to 1 it turns from a horizontal potato into a vertical potato, and is also enlarged; the envelope of the resulting cone is an exponential, as you can see. So far it should make sense: there is no machine learning yet, it's just some functions.

What is θ? It's a variable we don't have access to, going from 0 to 2π. Cosine and sine alone would give circles, but since the two axes are scaled differently, we get an ellipse rather than a circle. And ε is simply some noise: zero mean, standard deviation 1/20. Why 1/20? Because it gives a little bit of noise, not too much; it's just an arbitrary number.

So this is how I'm going to build my dataset; a code sketch of this generative process follows, and then I'll show you how it looks.
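Here is a minimal sketch of the generative process in PyTorch. The exact form of ρ isn't fully spelled out in the transcript, so treat the formulas below, in particular the swap of α and β in ρ₂ and the exp(2x) envelope, as my reading of the description above, not the official course notebook:

```python
import torch

alpha, beta = 1.5, 2.0

def rho(x):
    # rho: R -> R^2. Convex combinations of beta and alpha (2 -> 1.5 for the
    # first component, 1.5 -> 2 for the second), scaled by an exponential envelope.
    envelope = torch.exp(2 * x)
    r1 = (alpha * x + beta * (1 - x)) * envelope
    r2 = (beta * x + alpha * (1 - x)) * envelope
    return torch.stack((r1, r2), dim=-1)

def sample_horn(n):
    # the full "horn" dataset: x possibly observed; theta and epsilon never observed
    x = torch.rand(n)                        # x ~ U[0, 1]
    theta = torch.rand(n) * 2 * torch.pi     # theta ~ U[0, 2*pi)
    eps = torch.randn(n, 2) / 20             # zero mean, std 1/20
    r = rho(x)
    return torch.stack((r[:, 0] * torch.cos(theta),
                        r[:, 1] * torch.sin(theta)), dim=-1) + eps

# today's training set: slice the horn at x = 0, so rho(0) = (2, 1.5)
theta = torch.rand(24) * 2 * torch.pi        # 24 hidden angles
Y = torch.stack((2.0 * torch.cos(theta),
                 1.5 * torch.sin(theta)), dim=-1) + torch.randn(24, 2) / 20
# Y has shape (24, 2): one row per training sample (the blue points)
```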
So, as I told you a second ago, we start on the left-hand side with a horizontal potato: 2 is the x-radius, so 4 is the x-diameter, and 1.5 is the y-radius, so 3 is the y-diameter. Then, as we move across x from left to right, this horizontal potato grows along an exponential envelope, as you can see from the picture. Let me spin the drawing around so you can see better: we start like that, and then we move like that, on this exponentially zooming thing. All right, this is the data we're going to play with.

Why do we need energy-based models? The point is what happens whenever you pick a given x here — let me stop the rotation so I can actually draw. Say we pick a given x, one longitudinal cut: given that single input x, you get an infinite number of y's. Given one x, there are infinitely many y's; these are specific samples, but there's a full ellipse at each given location. And so with a normal neural net, which maps vectors to vectors, we can't really figure out what the correct output is. What is the correct y that corresponds to my x? There is no correct y; there's a multitude of y's that correspond to my given x. So how do we come up with an architecture, a neural net, that allows me to get multiple y's, possibly infinitely many?

You might think that with a discrete number of y's you could create a network with a discrete number of outputs, but here there are infinitely many possible outputs for a given x. And even if you have x and also y₁, you still have two options for y₂: unless you're exactly at the edge of the ellipse, a given x gives you a slice, an ellipse, and choosing a value of y₁ on that slice leaves two possible values for y₂. So again, how do we find a network that lets me predict multiple values? Energy-based models are one option: they give you a score telling you how likely, how good, how compatible a given prediction is.

So far, are we all on board? You can react with a yes or no; I should see your reactions in the participants panel — and the chat is for typing stuff. All right, cool.

Let me make things easy in this first lesson: I'm going to remove x. Before, we had this whole horn, and it's too complicated; I don't want to pay attention to x. So I just set x to zero — boom, I slice the horn at that location. What do I end up with? An ellipse. That's why the lesson is called "the ellipse"; the full dataset isn't an ellipse, it's actually the horn, so maybe I should change the title. Setting x to zero means ρ₁ equals 2 and ρ₂ equals 1.5.

So what next? Now I can sample a discrete number of y's: I pick, say, 24 values of θ, uniformly sampled from 0 to 2π, and collect the resulting points in a matrix.
And so I have all my samples here, 24 of them, collected in this capital-Y matrix, and this is going to be my training set.

Question from the chat: how is ρ₁(0) equal to 2? If you substitute x = 0 in the expression over here, the αx term goes to zero, β(1 − x) becomes β · 1 = 2, and then you multiply everything by exp(0), which is 1. Got it? Yes, that line describes ρ, and I'm telling you that at this cut ρ₁ = 2. All right, moving on.

So this is my dataset: it has no x's — it's unconditional, or unsupervised — and it only has y's. Interesting.

Now let me show you my untrained model's manifold. How can we generate a full manifold from a model? Again, this is an untrained model; if it were trained, it would perform nicely. This is just an initialized model, so it's going to perform poorly, but it can still show you a manifold. Just as in the first class, where I showed you randomly initialized networks performing arbitrary transformations, here an untrained energy-based model shows you an arbitrary manifold.

We start with my z here. What is this z? z is my latent variable, and it has the same colour as θ in the previous slide. Whenever you have missing information, it's called latent: we don't have θ, so I introduce this z, which stands in for the missing variable. z goes from 0 to 2π with 2π excluded — that's why the interval's right bracket is flipped the other way, [0, 2π) — in steps of π/24, so there are 48 different values: 0, π/24, and so on, up to 2π − π/24. So z can be thought of as discrete points on a line going from 0 to 2π − π/24.

Then I feed my latent variable, these discrete points on a line, into a decoder, and the decoder converts them into my ỹ — the tilde means "approximation of y". As I move z along this line, ỹ moves around the ellipse. This is how this energy-based model generates output samples. On the other side we have the bold y's, the blue ones I showed you before: they are my dataset. How do I know those are values from the dataset? Because the circle for y is shaded, and shaded means it's an observation, part of my data points. The other two circles are not shaded — they have a black background — which means they are values our model either computes or receives as input, but that are not observed.

There's a question here: "I am confused about the graph of only y's. There are 24 y components, but only two axes." Let me press play and show the next chart, so we can finish the explanation and discuss both.
And so, as you move z from 0 to 2π in discrete jumps of π/24, you get points — the orange ones — along a line, and when they go through the decoder, they land on an ellipse. Our dataset here is made of 24 different values of y, and each y has two components, y₁ and y₂: one is 2 cos(θ) plus noise, and the other is 1.5 sin(θ) plus noise. So that's where "24 components with only two axes" comes from: 24 samples, each a 2D point. Cool.

Why did I say this model is untrained? Well, if this model were trained, this manifold would end up lying right under these blue points. The blue points are the real points, the ones making up our training manifold, whereas the purple ones here are where our network believes the manifold should be. And that's definitely not correct, because it doesn't match what the data is telling us. (There are three green check marks up; you may want to remove your check mark unless you're trying to tell me something.) Moving on.

Now we finally introduce something we haven't talked about yet: the energy function. Enter the energy function, which on the diagram is just a red square sitting between my y's and my ỹ's. The energy E, as a function of a single observation y and a given latent z, is the squared difference between the first component of y and the first component of ỹ (which is g₁(z)), plus the squared difference between the second component of y and the second component of ỹ (g₂(z)):

E(y, z) = (y₁ − g₁(z))² + (y₂ − g₂(z))² = ‖y − g(z)‖²

So it's a squared Euclidean distance, and this happens for each and every observation y in my capital-Y collection of observations.

Finally, to close this explanation, let me introduce the decoder. The decoder is something cooked up for this specific exercise. The decoder g is a vector-valued function with two components, g₁ and g₂: it goes from the scalar input z to two outputs, mapping the latent variable to

g(z) = [ w₁ cos(z),  w₂ sin(z) ]

Question for you at home: how many parameters does this decoder g have? Type it in the chat. Yes, the answer is two: w₁ and w₂. So it's not a deep model; it's a very simple model — well, "simple": the number of parameters doesn't define the complexity of a model. In this case the model has just two parameters, which are basically the x-radius and the y-radius of the ellipse we can generate by changing z; see the sketch below.
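In code, the decoder and the energy are a couple of lines. The value w₁ = −1.5 matches what the lecture later says about this particular random initialization; w₂ = 0.75 is a made-up stand-in, since its initial value is never stated:

```python
w1 = torch.tensor(-1.5)   # x-radius of the generated ellipse (stated in the lecture)
w2 = torch.tensor(0.75)   # y-radius: actual initial value not given, assumed here

def g(z):
    # decoder g: R -> R^2; w1 and w2 are the model's only two parameters
    return torch.stack((w1 * torch.cos(z), w2 * torch.sin(z)), dim=-1)

def energy(y, z):
    # E(y, z) = ||y - g(z)||^2, the squared Euclidean distance
    return ((y - g(z)) ** 2).sum(dim=-1)
```

Both functions broadcast, so the same `energy` works for a single (y, z) pair, for one y against a grid of z's, or for batches of each.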
I'm going to keep this E(y, z) written up there, so I can clear the lower part of the slide. And here we have many squiggly functions — what are these things? These are my energies. As I told you, we have as many energy curves as y's in the dataset: we said we have 24 y's, and as you can tell there are four rows and six columns, so you can count 24 different energies. The first one, top left, is the energy I get when I pick from the dataset the first y, y⁽¹⁾, where the superscript in brackets means the first sample. I mark the picked sample with a prime, y′, and it generates a function E₁: I'm now indexing E by which y I picked, and since the y's are discrete samples from capital Y, I can use the same index on the function. The last one in the first row corresponds to picking the sixth training sample, so it's E₆. Bottom left corner, we have the 19th sample, which gives us E₁₉, the 19th energy of this overall system. And the last one corresponds to picking the last sample in the training set, which gives E₂₄.

There's a question: "How do we know that we need to use a decoder function that is an ellipse?" Good question — because I cooked this example for you. This is a toy, pedagogical exercise for you to understand what these energy-based models can be used for and what it means to perform inference with them. In real cases you would replace the decoder g with a neural network, which can learn arbitrary functions: neural networks are function approximators. More specifically, you'd want to learn a function comprising both the decoder and this E — together they make up my energy-based model, and here we just gave it some structure. We split the big energy-based model into a decoder, which decodes my latent into a guess ỹ, followed by a squared Euclidean distance that produces the energy. But it doesn't have to be like that; it's one option, just a possibility. In general your energy-based model is a neural net outputting a scalar, and this scalar has to tell you how compatible a given latent z is with a given observation y. So the answer to Shan Yin is: you don't know — this example is cooked in advance. Otherwise you learn it, and you'll see next week how we learn stuff.

"Is the number of data points six or four?" The number of data points is 24: six times four. "Are we trying to approximate the decoder?" No, we are not approximating the decoder; we are trying to learn the parameters w₁ and w₂. "Is each Eᵢ graph over the z value?" I'm getting there, I haven't finished explaining. All right, cool — as Allen is saying, each square here is a function of z.
In each of these plots, z goes from 0 to 2π, and I show you discrete values in between — those little points — with Δz = π/12, so there are 24 points in each square. The vertical axis instead goes from 0 to 12. To be precise, the abscissa is the latent variable and the ordinate is the energy value; I can't call them x and y, because we already use x and y for other things, so the horizontal axis is called the abscissa and the vertical axis the ordinate. Allen, there are 24 data points for y: 24 lowercase bold y's, where capital Y is the collection of all of them and each y is a 2D vector.

These squiggles look very similar to each other, but there are two extremes. The one near the bottom right, the 23rd, has a U shape, which is fairly tame. The 10th energy, instead, has two local minima. So what's going on there? Let's start with the U shape, which is the easier one to understand, and then look at the other one.

So here we're talking about the 23rd energy, the one arising when we pick from the training set the y that is the 23rd item. Again, z goes from 0 to 2π and the vertical axis from 0 to 12. My y′, the picked sample, is the cross on the right-hand side of the screen. Now we're going to do some computations — let's see whether I can use this new toy or fail miserably. Pen, colour cyan.

Whenever z equals 0, I am at this location over here. How do I know? Because I know what's inside the decoder: g(z) = [w₁ cos(z), w₂ sin(z)]. So what is the energy at the beginning? The energy, we said, is the squared Euclidean distance between my y sample — this guy over here — and my ỹ, which, when I pick z = 0, sits over here. How do I know I'm starting there? Well, I coded this up, and in this randomly initialized model w₁ happens to equal −1.5 — count the grid lines: −0.5, −1, −1.5. And w₂ doesn't matter here, because it gets eaten by sin(0) = 0. So how far is this decoded point from the sample? We can count boxes, and two boxes means one unit.
So, counting down from zero, two boxes at a time: one, two, three... and a half — the distance is 3.5. The energy, we said, is the squared Euclidean distance, so how much is 3.5 squared? It's 12.25, roughly 12, and that's why the curve starts way up here at the top.

Then we increase z: from 0 we move to π, which means the decoded point walks around to this location over here. Let me change the colour — red. When we're here, this corresponds to z = π. How long is this distance? Anyone following, type it in the chat. 0.5, correct. So what is the energy associated with this decoded value? 0.5 squared is 0.25, and that's exactly the height of this point on the curve. Then, as you keep increasing z, the decoded point moves away again, the distance keeps growing, and by the end you're back up at 12.

So that's how I drew one function: the curve I'm showing here is simply the squared Euclidean distance between the decoded z and the target. z lives on a line from 0 to 2π, with discrete jumps because I just picked values; as I pick these discrete values I send them through the decoder, and the decoder converts each value on the line into a discrete point on an ellipse. Then we take the distance between that point and the actual target, square it, and we get the value of the energy. That's it — that's basically everything there is to energy-based models for today's lesson.

Let's move on with another example: I'm cleaning up the screen and looking at this other squiggly line. Are there questions so far? Is this clear? I'm summarizing everything again in the next slides; otherwise, are there specific doubts? I guess not — just write if you don't know what's going on.

So now let's look at how this other squiggly function comes to be. This is when I pick my y′ to be the 10th item in my dataset. It's basically the same setup; the only difference is that before, the target was that sample, and now the target is this point over here. All right, first question — and let me change the colour so we don't get bored: yellow. Whenever we're at z = 0, we start at the same location as before.
You send z = 0 through the decoder, and the decoder gives you the first ỹ, the first guess, corresponding to this specific value of the latent. As I told you before, given that w₁ = −1.5, the first value the network comes up with is this point over here. So, a question for people at home: can you compute the energy? (Wow, that was a bad line; there's no undo button, I should complain to Microsoft. Let's try again — see, I can do it.)

So how long is this segment? I'm computing the energy associated with this given sample from the training set, given that I provide this latent. By Pythagoras, the length is the square root of this leg squared plus this leg squared — but let's drop the square root, because we're going to take the square anyway. So it's 1.5² + 1.5², which is 2 × 1.5 × 1.5 = 4.5. "Shouldn't it be 1.75 rather than 1.5?" Looking at the squares again — yes, on the horizontal axis it's 1.75, so it's actually 1.75² + 1.5². If you have a calculator, tell me how much that is; otherwise we can ask Google. "OK Google, how much is 1.75 squared plus 1.5 squared?" — "1.75 squared plus 1.5 squared is approximately 5.31." 5.31, she said. So this initial value over here is 5.31.

Then we increase z: z moves along its line, and it keeps growing until we get to π/2, over here. At π/2 you take this distance — roughly 1.2 — and square it, and you get this first dip, a local minimum of the curve. Then we keep increasing z, going down and around this way, and you compute the squared distance again: you get something a bit shorter. What happens at 3π/2 is that the decoded point over here is closer to the target than the point at π/2 was, so you get another minimum, lower than the first one. And then you keep increasing z and get back up at 2π. Cool — and that's how we compute the energy for this specific case: two local minima, exactly the squiggle from the grid.

Moving on and cleaning up the screen — and I have no idea why this stray squiggle is here; go away, squiggle. Erase... oh, you cannot even erase, amazing. (A short code sketch reproducing all 24 energy curves follows.)
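Reusing `Y`, `g`, and `energy` from the sketches above, the whole 4-by-6 grid of energy curves is one broadcast away. Again, this is a sketch: with the assumed w₂, the exact curve shapes will differ from the slide.

```python
z = torch.arange(0, 2 * torch.pi, torch.pi / 24)   # 48 discrete latents in [0, 2*pi)
E = energy(Y[:, None, :], z)                       # shape (24, 48): row n is E_n(z)

# e.g. E[22] is the 23rd energy (the U-shaped one on the slide)
# and E[9] the 10th (the one with two local minima)
print(E[22].min().item(), E[9].min().item())
```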
So now we're going to define my free energy — actually, not just the free energy, but the zero-temperature limit of the free energy. It is the minimum value that the energy function takes with respect to the latent variable z:

F∞(y) = min over z of E(y, z)

We can also write down ž, the z at which my energy has its lowest value:

ž = argmin over z of E(y, z)

Why is it blue on the slide? Because it's cold: temperature is the average kinetic energy of the particles, so the lowest-energy point is the coldest region. And the check mark points down, like the energy going down. With this lowest z, called ž, we can also write that the zero-temperature limit of the free energy is simply the energy computed at ž: F∞(y) = E(y, ž).

How do we compute ž? There are many options. You can do exhaustive search if z is discrete — think of the K-means case we talked about yesterday. If z is continuous instead, you can apply gradient descent. And again, pay attention: I'm talking about gradient descent, not stochastic gradient descent, because you actually have access to your function. Whenever we train a neural net, the objective function is usually the average of all the per-sample losses, and whenever we take a subset of the entire dataset the gradient is called stochastic, because those gradients only approximately point in the direction of the true gradient — the one you would get if you used the whole batch. Here there is no averaging whatsoever: you have one energy, and you just perform gradient descent on it. Nothing stochastic. Other options available to you: conjugate gradient, line search, L-BFGS, anything you want.

Cool, so let's see how this looks and how it works. We start here with an initial value for my latent z; then I run gradient descent, and my final ž ends up at this location over here. The value of the energy function at that lowest point is my zero-temperature-limit free energy, F∞. What does this mean in the y-space? It means we start with an initial guess — the initial latent z sent through the decoder, so I can display it in the y-space — and then gradient descent simply makes the decoded point travel along the model's manifold until it gets to this location over here, closest to my cross. So again: what is the energy of this z and this y? Can someone tell me in the chat how to compute it? Yes, the distance is 1.2, and you square it — there are some drawing issues here, bear with me — so squaring 1.2 gives the energy.

Question from the chat: how do we prevent local minima in this case if we use gradient descent? Good question — in this case we do have local minima.
If you train better architectures, with many parameters and whatnot, you're much less likely to get stuck in local minima. As we know from deep learning, local minima are not a big problem, especially if you use over-parameterized networks, residual connections, batch normalization. But in this little case you may definitely fall into the wrong basin, so your algorithm doesn't converge to the correct solution. All right, cool.

So this value over here, squared, represents the zero-temperature-limit free energy of this point over here. The point's energy is the squared distance to each location along the whole manifold, and the minimum of those distances is called the free energy — the zero-temperature-limit free energy. Cool.

In the other case, we switch now to this U shape: before we were looking at the squiggly case, now we're on the U shape, which is easier to deal with. Here, maybe I initialize at this location over here; then we run gradient descent; we get down to this location over here; and the height of this point — which I show you on the other side — is my free energy. So again: free energy, the minimum height of this energy function. And what happens here? Similar: we initialize the latent at this location over here; this is my y′, the picked y; I perform gradient descent and end up at this location; and then this squared distance is this value over here, and that's my free energy — yes, 0.25. You've got it.

So let me clean up the screen a little, and I'm going to ask one final time, and then we're almost done. Now I'm going to do this free-energy computation for every possible point in the y-space. Before, I was telling you we have 24 energies, one energy per sample in my training set; now I'm going to compute the free energy for the whole grid I draw here. And similarly — let's change colour, why not: cyan — I start at this location here. Say we initialize with some latent that gets decoded over here; I run gradient descent, so I basically move down in this direction, and we reach a location that is basically over here; and then my free energy is the squared distance from that location. Similarly, I can take another point and get exactly the same thing: say a point over here — we end up at this location, and then my free energy is the squared distance from that point. I do this for one, two, three, four, five picks from the whole grid.

This is how they look: the free energies correspond to the minima of these plots — the crosses, the green one for this point here, the orange for this one, the green for this one, and the yellow for that one. So, final question for people at home: what kind of function is this free-energy map? What is its domain and what is its image? The domain is the 2D plane, so we move from R² to what? Anyone? Yes: from R² to R. (A code sketch of both inference procedures follows.)
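Here is a sketch of both inference procedures, reusing `Y`, `g`, and `energy` from above: plain gradient descent on E(y, z) for one observation (nothing stochastic — there is nothing to subsample), and the free-energy map F∞: R² → R over a grid of the plane, where a 1-D exhaustive search over z is cheap enough to replace gradient descent. The learning rate, step count, and grid bounds are arbitrary choices of mine:

```python
def z_check(y, z0=0.0, lr=0.05, steps=500):
    # plain gradient descent on the latent: z <- z - lr * dE/dz
    z = torch.tensor(z0, requires_grad=True)
    for _ in range(steps):
        E = energy(y, z)
        E.backward()
        with torch.no_grad():
            z -= lr * z.grad
        z.grad = None
    return z.detach()

y = Y[22]                          # pick the 23rd training sample
zc = z_check(y)                    # may land in a local minimum, as discussed
F_inf = energy(y, zc)              # zero-temperature free energy of this sample

# free-energy map over the y-plane: F(y) = min over z of E(y, z)
zs = torch.linspace(0, 2 * torch.pi, 720)
g1, g2 = torch.meshgrid(torch.linspace(-3, 3, 200),
                        torch.linspace(-3, 3, 200), indexing="ij")
grid = torch.stack((g1, g2), dim=-1)                    # (200, 200, 2) plane points
F = energy(grid[..., None, :], zs).min(dim=-1).values   # (200, 200) scalar field
```

Plotting `F` as an image gives the purple-to-yellow heat map from the slide; plotting it as a surface gives the bathtub shown next.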
So, for each location in the 2D plane I can come up with a scalar. How do we visualize this? Like this: a free energy of zero is represented by purple; in green I represent a zero-temperature-limit free energy equal to one; and everything at or beyond a value of two is yellow. And this is how the model's manifold looks. A location over here has a value beyond two, which is why it's bright yellow; a point over here instead looks like it has free energy equal to one; and points over here have free energy zero. If this model were properly trained, this purple region would sit exactly on top of these blue points — the blue points being my 24 points from the training set. And that's what we're going to do next week: we're going to train, that is, learn how to move this purple region underneath those points. That is learning: moving the energy is learning, while computing ž (and the corresponding decoded y̌) through gradient descent is inference. As I showed you in one of the first labs, where we used gradient descent to compute x given y, here we use gradient descent to compute ž and y̌ given a value from the actual observed dataset.

Finally, just to show you something pretty before we say goodbye, I plotted this in 3D, with a different colour scheme: blue for very cold, the zero-temperature-limit free energy at zero; white for the value 0.5; and everything beyond one shown in red. And here is how this energy looks over the y-space. On the vertical axis I show the zero-temperature-limit free energy, which, again, means the minimum value of the energy — in this case the minimum squared Euclidean distance, since that's what our energy was. As you can tell, points at zero energy lie on the floor here, and those are all the locations to which the model assigns a good score: zero is good. The higher the energy, the more uncomfortable the model is. If you move toward the edges of this bathtub, where the red lines are, the model is uncomfortable: it feels that that location is not good, that it doesn't belong to the training manifold.

And that was the lesson for today. I hope I was clear enough in explaining things; this tablet was lagging, so I couldn't show my artistic skills at my maximum abilities. Today we covered only inference for this energy-based model. Next week we'll cover the non-zero-temperature limit — another type of free energy — and then how to train this model. Questions about utilization and applications go on the lecture side.
And with this, I'll say goodbye — enjoy your Thursday. No questions, right? Done, finished. Okay, very good. All right, bye-bye. And... I cannot quit — oh my God, I'm trapped. Stop sharing. Okay, exit.