So today, we're going to talk about energy-based models, and you'll hear about them from me today. This is a very important part of the course, and tomorrow Alfredo will have a whole session on energy-based models and inference in energy-based models. All right, energy-based models. This is a different way of framing what machine learning systems can do, and it's much more general than the standard view of seeing neural nets and deep learning systems as a function that maps inputs to outputs. The issue with classical predictive models, like the one shown here on the top right, is that they can only make one prediction at a time. You give a neural net g of x an input x, and it produces one output. This output might represent a probability distribution, but it's a finite object, a vector, a tensor, something like that, and it's a single output. And then you can measure the distance or divergence between what came out of the function and the desired output, during training or to test the system. You can measure to what extent a particular output is compatible with the prediction by just measuring the distance between this output and the prediction. We never frame it this way for classification, because we don't need to, or because it's trivial to do. But in fact, what we're doing when we look at the multiple outputs of a neural net that does classification, and whose scores go through a softmax, is that we actually examine every output: each output is basically a hypothesis for a possible output. We look at a score, and the decision in the end is made by picking the output with the largest score. So in fact, we're feeding in all the possible values for y, measuring how well each possible value for y matches the output of the system, and then picking the one that matches best. That's how we do classification. It's implicit when you do softmax, but that's really what's happening. So we're going to generalize this idea. And the reason is that there are many cases where you don't want the system to have a single output. You want the system to tell you a range of possible outputs. I'll give you a very concrete example. If you're doing language translation, let's say you want to translate French into English, there are many ways to translate the same French sentence into an equivalent English sentence that has the same meaning. You can change the style, you can change the word order, you can change the words that are used. The English dictionary is enormous, so there are always many substitutes for the same word. So there are always several solutions, several possible outputs. Same when you do speech recognition. When people speak naturally, and that includes me, they say filler sounds in between words, right? Do you actually want to transcribe those in the recognition? There are words that may be ambiguous, and you may not be sure exactly which word was pronounced. So how do you represent the set of all possible transcriptions of a speech signal that could be useful? Those are two examples. Let me give you another one. You have an old movie or a picture which is low resolution, and you want to create a high-resolution version of it. There are many high-resolution versions of that movie or that picture that are compatible with the low-resolution version you have.
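To make the classification picture concrete, here is a minimal PyTorch sketch of classification viewed this way, with made-up scores: every candidate label y is treated as a hypothesis, scored against x, and inference just picks the y whose energy (negative score) is lowest.

```python
import torch

# Hypothetical class scores g(x) for 3 categories; in a real classifier these
# would come out of a neural net.
logits = torch.tensor([2.1, -0.3, 0.7])
energies = -logits                   # low energy = y is compatible with x

y_check = torch.argmin(energies)     # inference: examine every hypothesis, keep the best
print(y_check.item())                # prints 0, the highest-scoring category
```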
Or maybe the picture is not low resolution, maybe it's a noisy version and you want to denoise it, right? There are many ways to denoise an image, there are many ways to increase the resolution, all of which would probably be nice looking. So there is no single answer to that problem. So what do we do in that case? We have to be able to represent the uncertainty, basically allow the machine to represent multiple values for y, and we cannot do this if we use a function that just maps x to y, because we only get one y. So what do we do? One idea is this idea of energy-based models, and you can see probabilistic models as a special case. So in case you're wondering, it's a slightly more general concept. The idea is that your model is really an energy function. It's a model that can be very complicated inside, but in the end its output is a single scalar, and that scalar measures the incompatibility between x and y. So you give it an x, let's say a low-resolution image, and then you give it a proposal for y, and what the machine tells you is whether this y is a good match for x, whether it's a good high-resolution version of x, if you want. Or x could be a video clip, and you ask the system, can you continue the video clip, can you predict the next frames? You make a proposal for y, and the system tells you whether that's a good continuation or a bad one. Another example, of course, which is much more popular these days, is natural language understanding and processing. You show (I apologize for the background noise, it's a heavy helicopter flying over my house) a system a segment of text, you remove some of the words from that text, and you ask the system to fill in the blanks. This is very popular now for training natural language processing systems, the reason being that when learning this task, the system kind of learns to represent text. So there are many ways you can fill in the blanks in the text, and you'd like the system to be able to represent multiple options. The current systems that do this actually don't do this very well in terms of representing multiple options, but I'll come back to this. Okay, so this idea of having variable nodes, which are the circles, and then what's called factors, which are things that measure the compatibility or incompatibility between the values of variable nodes, is a classical way of representing intelligent systems, machine learning systems, called graphical models, and in this particular case, something called factor graphs. Factor graphs are bipartite graphs, which means they have two types of nodes: variable nodes and factor nodes. Factor nodes are the squares, and what a factor node does is give you a value that measures the compatibility between the variables that enter it. You can have multiple factors, multiple variable nodes, and that's the idea of a graphical model, the factor graph. Now, factor graphs are generally interpreted in the context of probabilistic models. I'm not going to do that. Okay, so our energy function f of x, y is a scalar-valued function. It produces a single output, and I'm not going to represent the output: when you see a red square, that implies there is a scalar output from it. It's like the cost function symbols we've been using for the last few lectures. And the idea is to make it take low values when y is compatible with x and higher values when y is less compatible with x.
So it's a measure of incompatibility, not a measure of compatibility. And it's called an energy because it's very similar to the physical concept of energy. If you have a rubber band, low energy is when the two ends are happy with each other, right? You satisfy the constraint. And when you start pulling on the rubber band, you have high potential energy, and the thing wants to come back to its rest position. So it's a similar concept to energy: you can think of y and x as being like physical properties, and for y to be compatible with x, it has to change so that the energy is small. Okay, so what is inference? Inference is not just taking x, running it through a neural net, and here is your output. It consists in trying to find a value of y, given a value of x, that minimizes the energy. So it's this formula here. I'm going to write it y-check; the check, this little v on top, indicates that it's the value of y that produces the minimum value of the energy. It's going to be the value of y that minimizes f of x, y over all values of y taken from the set. I'm not talking about learning yet; I'm not going to talk about learning for a while, actually. This is just inference. It's not forward propagation anymore: inference now involves an optimization with respect to the variable we're interested in predicting. Okay, so here's an example. Let's say x is a scalar variable. In general x would be an image and y another image, or something else, but here I'm going to use a very simple example where x is a scalar variable and y is also a scalar variable. And my model is supposed to give low energy to those blue dots because they represent a particular relationship between x and y; in this case it's y equals x squared, or something like that. And this is an energy function that actually captures this relationship between x and y. Why does it capture it? Because for a given value of x, if I look for the value of y that minimizes the energy, I find the manifold, the surface on which those blue dots reside. So that's an example of an energy surface which has the right shape, so that for a given x, if I minimize f of x, y with respect to y, I'm going to get something that says y equals x squared. But I don't have an explicit representation of y equals x squared. I just have an energy function that makes me pay a price for making y different from x squared. Now, there are several energy functions, like the one on the right here, that will give me exactly the same result. They can have different shapes. The only thing that is required is that they have a minimum at the good values, the ones that are compatible with x, and there could be multiple minima; here there is only one. And the other condition is that all the values of y that are not compatible with x have higher energy than the ones that are compatible. So basically, if you want to build or train an energy-based model, we have to find a way to shape the energy surface so that it has that property: it gives low energy to stuff we want and high energy to stuff we don't want. So you can think of this as an implicit function. It's very common in mathematics, right?
Particularly in physics, for example, you don't say: here is the equation of height as a function of time when a ball falls. What you write is an energy, a Hamiltonian in the case of physics, or a Lagrangian, that says: if the ball is not in the position I expect, then that value will be larger than it could possibly be. Physicists invented this in the 18th century; they call it the principle of least action, for example. That's an example of it. So it's not a new idea in any way, but we can use it in the context of machine learning. So here's a more complicated situation, where we again have two scalar variables, x and y: x that we observe and y that we need to predict. The data that we've observed are those black dots. And for any particular value of x, there are multiple values of y that seem compatible with it, right? Sometimes just an isolated value, sometimes a whole range of values. So it's clear that we can represent this dependency between x and y by an energy function, but we cannot represent it by a function that just outputs a single hypothesis about what the value of y is, at least not with a deterministic function like a neural net. So one thing we should strive for is energy functions of this type that are easy to minimize with respect to y. And there are two types of energy functions that are easy to minimize with respect to y. First, if y is discrete, there are functions for which, despite the fact that y is discrete, there is an efficient algorithm to find the value of y that minimizes the energy, using some combinatorial method. Something like dynamic programming, for example, so that even though y might be something like a transcription of a speech signal or a translation of a sentence, there is an efficient way to find the sentence that is compatible with the input without having to exhaustively explore all possible sentences and choose the one with the lowest energy. The second way to make it efficient is to make f a smooth function of y. And if it's a smooth function of y, then you can use something like gradient descent to do the inference. We're not talking about stochastic gradient descent here, because stochastic gradient descent is used when the cost function you're optimizing is a sum of many terms that are very similar. This is not the case here: the energy function is just a single term. It could have multiple terms inside, but it's not a lot of similar terms. So I'm talking about classical gradient-based optimization, like conjugate gradient or something like that. And if it's smooth, then you start from a hypothesis for y, which may be wrong, and then by gradient descent you can find a value of y that is close and has low energy. You may allow x to change or not, depending on the conditions. There are entire books to write about efficient inference; in fact, there are books that have been written for probabilistic models about how you do efficient inference in particular, special types of models. All right, so what I told you about so far is what's called conditional energy-based models: conditional energy-based models, where you have this observed variable x and you're trying to predict the variable y.
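As a minimal sketch of the smooth case, assuming a made-up energy f(x, y) = (y - x^2)^2 like the parabola example above: inference is plain gradient descent on y with x held fixed.

```python
import torch

# Hypothetical smooth energy: it pays a price whenever y differs from x squared,
# so minimizing it over y recovers y = x^2 implicitly.
def energy(x, y):
    return (y - x**2) ** 2

x = torch.tensor(1.5)                      # observed input
y = torch.tensor(0.0, requires_grad=True)  # initial (wrong) hypothesis for y

opt = torch.optim.SGD([y], lr=0.1)         # plain gradient descent, not SGD over data
for _ in range(200):                       # inference = optimization over y, x stays fixed
    opt.zero_grad()
    energy(x, y).backward()
    opt.step()

print(y.item())                            # converges towards x**2 = 2.25
```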
There, what you assume is that x is always observed, whether during training or test, and y is observed during training but not observed during test, or maybe it can be partially observed during test, but you never know which part of y is going to be observed and which part is not. And then there is the unconditional version, and the unconditional version is one in which there is no x. There is no set of variables that you know for sure is always going to be observed. What you have is a variable that can be partially observed, but you don't know which part is going to be observed. If you knew which part, that part would become an x: if you know you're going to observe something every time, that's x. But if you don't know whether a part of a variable will be observed or not during training or test, then that's part of y. Okay. So there are energy-based models, they may not have been presented to you in that way, but there are unconditional energy-based models that you've heard of, that you've studied. One example is principal component analysis. Another one is K-means clustering. And basically every algorithm you've heard about that is unsupervised can essentially be cast in the form of an unconditional energy-based model. So we're not talking specifically here about supervised learning, unsupervised learning, or something called structured prediction, which I'll be coming to, but about a framework that can represent all of those. Can you say something about... so the square here is actually representing the model itself, right? That's right, the square is the model. And it could be very complicated inside; it could have very large neural nets inside of it. Right, and x and y are both inputs to this model; it's not that y is the target. That's right. During training, y is a target, but it's not a target for a function; it's a target to which the machine should be trained to give low energy. We'll come to how you train those things in a minute. So yeah, that's the funny thing: despite the fact that y is the variable you want to predict, it's actually an input to the model. And the way you figure out the best value for y is that you search for a value of y that minimizes the energy computed by your model. You really need to get this. And remember, we've not talked about learning at all yet; we assume we're given an energy-based model. Okay. Since we've all been conditioned by our schooling to think in terms of probabilities, I'm going to make the quick connection between energy-based models and probabilistic models: particularly those factor graphs or graphical models, but also the type of models we've been playing with so far, with a softmax on top that produces a distribution over y. So, if you have an energy-based model that computes an energy F between the two variables x and y, you can turn this into a conditional probability distribution over y given x. And a very simple way of doing this is using something that physicists invented at the end of the 19th century, called the Gibbs-Boltzmann distribution. And that's basically a softmax; softmax is an instance of a Gibbs-Boltzmann distribution. So you define P of y given x as the exponential of minus some constant that you pick; it's a little arbitrary.
For physicists, this constant is akin to an inverse temperature. And then you plug in the energy here, the one that measures the compatibility between x and y. That makes all those numbers positive, because the exponential of anything is positive, right? And it gives you large values for low energies and low values for high energies, which is what you want: if you want to turn energies into probabilities, you want to give high probabilities to things that have low energy. So that does the right thing, but then you need this probability distribution over y to be normalized: you need the integral over y of P of y given x to be equal to one. So basically you divide by the integral over y prime of whatever is on top, e to the minus beta f of x, y prime, where I use y prime to distinguish the two variables. And now what you have is an expression that, when you integrate it with respect to y, will obviously give you one, because you get the same value on top and at the bottom. The bottom is a constant with respect to y; if you integrate the top with respect to y, you get the same expression above and below, which means the result is one. So you get something that has all the properties of a proper probability distribution: a bunch of positive numbers whose integral is one. If y is a discrete variable, this integral is turned into a discrete sum, and those numbers will all be between zero and one, and they will sum to one. And that's what softmax does. I'm answering the question; I think his computer just rebooted. So, why do we need y during inference? Like we saw in our first lab, we may find x given y, okay? And what I showed you in class, in that draw.io diagram, was that we were minimizing that square difference; we were minimizing the MSE in order to figure out what is the x that gives me a specific y value, right? And so the y might actually be used as an input to this energy function, which we were calling a cost before, in order to be able to infer the x value. So, let me turn on the camera. There we go. As I was showing you in the actual first lab, well, which is the second, we have this energy function, and this energy function has many inputs. In this case, so far, we have seen two inputs, x and y. And then we may be able to find x given y, or find y given x. As I showed you, to find x given y, you actually need to solve an optimization problem: we have to use gradient descent, not stochastic gradient descent. In order to find the value of x, you have to provide it to this square block, and the square block is the whole neural net, in order to be able to estimate the x. So, in this case, inference will be carried out as an optimization. It doesn't need to be an optimization, but it can be. Okay, so hopefully that clarified all the unclear things I said while I was away. Okay, so inference may be hard, as I mentioned earlier. It's easy if y is a discrete variable with only a few values: you can just exhaustively list them and compute the energy, which could be one output of a neural net, for example. So that's easy, we know how to do that.
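Here is a minimal sketch of that easy case, with made-up energies: for a discrete y with a few values we can enumerate the energies exhaustively, and the Gibbs-Boltzmann distribution turns them into a normalized P(y|x).

```python
import torch

beta = 1.0
F = torch.tensor([0.2, 1.5, 3.0])   # hypothetical F(x, y) for the 3 possible values of y

y_check = F.argmin()                 # inference by exhaustive search over y
P = torch.exp(-beta * F)             # positive, large for low energies
P = P / P.sum()                      # normalize: positive numbers that sum to one
print(y_check.item(), P)             # same P as torch.softmax(-beta * F, dim=0)
```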
But it could be more complicated. It could be that you're doing object detection, and the output is actually a list of things, and to decide which ones have low energy you have to do non-maximum suppression or something like that. It could be that the output itself is an image: the input is an image and the output is an image, and you're trying to do image segmentation, for example, or semantic segmentation, identifying different regions of an image as being in one category or another. So now the output is an image, and to produce an image that's consistent, you might need to do one of those optimization-type tricks. Handwriting recognition, speech recognition, translation, parsing, image denoising: I mentioned those examples, right? So anywhere where the output is from a high-dimensional continuous domain where there is uncertainty, or is compositional because it is, say, a piece of text, which is called structured prediction, or situations of this type, you might be better off using an energy-based model instead of directly computing the output. All right, so now we'll talk about architectures. What do we put in those square boxes? There are essentially two big families of architectures that are interesting: joint embedding architectures and latent variable models. We start with joint embedding because it's simple to understand and simple to explain. So we're going to use an architecture very much like the one depicted on the right here, where we have two neural nets. They may or may not be identical; if they are identical, we call these Siamese nets, but they may not be the same network, they could be different networks. And what those two networks compute are vectors that we're going to interpret as representations of the input. And the energy is going to compute a distance or divergence of some kind between those two vectors. So if the two networks output vectors that are nearby, the energy is low; if the two networks output vectors that are distant from each other, then the energy is higher. And this has been extremely popular over the last year, a little more than a year, maybe a year and a half, because it's basically the best way to train an image recognition system without having labeled data. I'll come to this later. But why can this represent uncertainty, multiple Ys for a given X? The neural net that looks at Y, which I call a predictor here, can be invariant to certain transformations of Y. So it could very well be that this network has been trained in such a way that when I transform Y in a particular way (it's an image and I change the illumination, or the scale, or the exact position), the output doesn't change very much. And if that's the case, what it means is that when I give the system a particular X, there are going to be multiple Ys that match this X, and those Ys are the pre-image of the neural net: the values that don't change the output and that make the output similar to the output of the network on the left. So that's how you get this multi-modality of the output, the fact that multiple outputs are compatible with an input: you use the invariance properties of the predictor that looks at Y. The multiple values of Y that give the same H are all compatible with X, as long as the H that's produced is similar to the H that the predictor from X produces.
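Here is a minimal sketch of such a joint embedding energy; the two encoders, their sizes, and the squared Euclidean distance between embeddings are all placeholder choices, not the specific architecture from the slide.

```python
import torch
import torch.nn as nn

class JointEmbeddingEBM(nn.Module):
    """Two encoders map x and y to vectors; the energy is the distance between them."""
    def __init__(self, dx=32, dy=32, d=16):
        super().__init__()
        self.enc_x = nn.Sequential(nn.Linear(dx, 64), nn.ReLU(), nn.Linear(64, d))
        self.enc_y = nn.Sequential(nn.Linear(dy, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, x, y):
        h_x, h_y = self.enc_x(x), self.enc_y(y)
        return ((h_x - h_y) ** 2).sum(dim=-1)   # low energy when the embeddings match

model = JointEmbeddingEBM()
x, y = torch.randn(4, 32), torch.randn(4, 32)
print(model(x, y))                              # one scalar energy per (x, y) pair
```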
There is a particular special case of this called Siamese nets, where those two networks are actually identical and share the same parameters. And the idea goes back a long time, to the early 90s: I had a paper on this in the early 90s, a couple more papers in the mid 2000s with some of my students at NYU, and then more recently there has been work on face recognition and on self-supervised pre-training for image recognition. It's a long list that I'm not going to go through. PIRL and MoCo are ideas that came out of Facebook research, SimCLR from Google, and there have been more. I'll come back later to how you train those systems, that's the whole thing, but that's basically how they handle multimodality. Okay, here is the second way to handle multimodality. I'm not saying those two architectures are the only ones, there are others, but those are two prototypical architectures for handling multimodal outputs. These are called latent variable models. A typical latent variable architecture would be something like this. You want the system to be able to produce multiple Y-bars, multiple predictions, for a given X. So what you're going to do is parameterize the set of possible predictions, which is symbolized by this S-shaped ribbon here. You're going to parameterize this surface, the set of plausible predictions that are compatible with X, by a latent variable. A latent variable is a variable that no one gives you the value of: you're not going to be given its value during training or during test. It's some variable internal to your model. And you allow this variable to vary within a set, in this case here a rectangle, so a 2D variable, but you can imagine it being very high dimensional. And as you vary Z within this rectangle, since Z goes through a few layers of a neural net, it gets transformed into a more complex surface on the output. So that's one way of representing a multi-modal, complex set of outputs with deterministic functions. The two functions I drew here, which I called predictor and decoder, are both deterministic functions: you give them an input, they produce one output, that's it. So you have to vary the input of one of them so that the output varies. Now, how do you do inference in a latent variable model? You give it an X and a proposal for Y, and then you have to compute the energy for this pair. And the best way, I mean, a way to compute the energy is to figure out the best value of the latent variable, the one that minimizes the energy. So in other words, even if you are given a proposal for Y and are not asked to predict a value for Y, you're going to have to run another optimization of the energy with respect to the latent variable: to measure the energy of X and Y, you have to minimize the energy function with respect to this latent variable, to find the point on this ribbon that is closest to the Y you are observing. And that's really the energy that your model gives to that pair X, Y. Obviously every point on this ribbon should have very low energy, let's say zero, if we design our energy to be lower bounded by zero.
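Here is a minimal sketch of that inference procedure, with a hypothetical little decoder standing in for the predictor-plus-decoder stack: to score a pair (x, y), we minimize the energy over the latent variable z by gradient descent.

```python
import torch
import torch.nn as nn

# Hypothetical decoder: maps (x, z) to a prediction y_bar on the "ribbon".
decoder = nn.Sequential(nn.Linear(8 + 4, 32), nn.ReLU(), nn.Linear(32, 8))

def energy(x, y, z):
    y_bar = decoder(torch.cat([x, z], dim=-1))
    return ((y - y_bar) ** 2).sum()            # distance between the proposal y and y_bar

x, y = torch.randn(8), torch.randn(8)          # an observed x and a proposed y
z = torch.zeros(4, requires_grad=True)         # latent variable: nobody gives us its value

opt = torch.optim.SGD([z], lr=0.05)
for _ in range(100):                           # inference: minimize E(x, y, z) over z only
    opt.zero_grad()
    energy(x, y, z).backward()
    opt.step()

print(energy(x, y, z).item())                  # F(x, y) = min over z of E(x, y, z)
```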
So to verify that a point on this ribbon has zero energy, you plug in a Y on this ribbon, and then you have to figure out which value of Z, among all the possible values, actually puts me there. And you have to do this by minimizing this energy function with respect to the latent variable. So inference in a latent variable energy-based model involves minimizing with respect to the latent variable, and that gives you a measure of the incompatibility, the energy, between X and Y. Now, of course, what you want to do is infer the proper value of Y, so if you want to do that, you have to simultaneously minimize over the latent variable and Y: given an X, what is the combination of Z and Y that gives me the lowest energy? That's how you do inference in a latent variable energy-based model. What you'd like is for those latent variables to be sort of explanatory variables of the image. Let me take an example. Let's say you're trying to reconstruct the 3D model of a face from a picture. So X is a picture of a face, and Y is a representation of a 3D model of a face. Actually, let me do this the other way around. Let's imagine that X is a 3D model of a face, so you have the 3D model of someone, and you're being asked the question: is the person I'm seeing a picture of, which is Y, the person I have a 3D model of? What you have to do is basically adjust the position of the 3D model in such a way that it matches the image you see. And this adjustment of the position, scale, et cetera, so as to align the rendering of the 3D model with the image you're observing, you can think of as a latent variable. And this process of aligning is a minimization of energy, where the energy is the dissimilarity between the rendered image from the 3D model and the image you observe. Okay, so let's say we want to do handwriting recognition; speech recognition would be similar. I draw a word like this, and if you know English, you might be able to figure out what this word is. If you don't know English, you probably have no idea. Of course, you all know English, but even if you do, it might be difficult, because it's slightly, purposely, badly written. But there is a trick here: if you know where the separations between the characters are, you know exactly what this word is. Now that I have indicated the boundaries between the characters, there's no ambiguity as to what this word is, right? It's "minimum". So you've gone from a very hard problem to an easier one. Now, think about a process by which you would recognize this handwritten word, where the locations at which you place the boundaries between characters are the latent variable. So the latent variable in this case is a set of variables that, if you knew their values, would make the problem easier. But of course, nobody tells you what those values are; you have to figure them out. So you basically decide that this is going to be a latent variable: my latent variable is where I put the boundaries between the characters. And my neural net here is going to give me a score for every region, as to whether it's a character of a particular class or not. And then my latent variable here decides where I put those boundaries.
And then my decoder is going to produce a list of scores for each of the characters there. And what I'm going to try to find is a combination of a set of categories and a set of boundaries that, overall, gives me a very low energy. So, for example, if I cut an "m" right in the middle, it gives me a shape that is not a character; it's a single hump and it doesn't correspond to anything, so presumably my recognizer will tell me this is garbage. And then, similarly, perhaps in the decoder there is a language model that tells me there is no such word as that sequence of letters. So if I segment the word in another way, I may get a sequence of characters that are high scoring for my recognizer, but my language model will tell me it's not an English word. Which is why I was telling you that you need to know English to be able to read this word. There are similar problems in speech recognition. In continuous speech recognition, if you know a language, you know where the words end and begin. But if you're not trained in that language, you actually can't figure out where the boundaries between the words are, in most languages. And it's true in written language as well, right? You can probably read this sentence because you understand English. You may not understand French, so you probably have no idea where the word boundaries are in the sentence at the bottom here. But if you knew where the boundaries were, it would make the problem easier: you need to know less about French to be able to understand the sentence if you know where the boundaries are. So that's an example of a latent variable model, and this type of prediction is called structured prediction. It's a situation where the output has some structure; it's a word from a language, for example, and you have to find an answer that is compatible with all the constraints the system has to satisfy: it needs to be a proper word from the language, it has to be made of high-scoring characters, or high-scoring sounds, et cetera. And there's going to be a homework about this, by the way. All right, so formally, without knowing the details of the energy-based model, we're going to do the following. We're going to say we have an energy, I'm going to call it E now, not F, which depends on X, Y, and Z, all three variables: the input, the variable to be predicted (or the proposed output), and the latent variable. And inference consists in minimizing this energy with respect to Y and Z simultaneously. So you get, simultaneously, a good answer and the value of the latent variables, which you can discard, because generally you don't care about them once they have helped you produce a good output; you have no particular interest in them in the end. Now I'm going to redefine F of X, Y. F of X, Y is a form of E of X, Y, Z where I have eliminated Z, and I can do it in two ways. The first is F infinity of X, Y, which is defined as the minimum over Z of E of X, Y, Z. So I give you an X and a Y, you minimize the energy with respect to Z, and that's a function of X and Y. I call this F of X, Y. So now what I've done is that, despite the fact that I have a latent variable model, I've defined a new energy-based model without latent variables, F of X, Y.
And what I've done is that, internally, I've minimized the more elementary energy E with respect to Z. I denote this F because physicists call this a free energy; you don't need to know why, but that's why it's called F. Now there is another way of computing the free energy. You could say: there might be multiple values of Z that give me a low energy for a particular pair X, Y. So for one pair X, Y, there could be multiple Zs that each give a very low energy, and perhaps what we should do is not just pick the smallest one, but give a lower energy to that pair X, Y if there are many different Zs that give it a low energy. So basically, have multiple values of Z conspire, or combine, to lower the overall energy given to a particular pair X, Y. And one particular way of doing this (there are several ways, but this one is derived from probabilistic models) is to do what's called marginalization, and I'm going to show you why this is a marginalization in a minute. So you compute F of X, Y as minus one over beta times the log of the integral over Z (or the discrete sum, if Z is a discrete variable) of e to the minus beta E of X, Y, Z. Physicists call this a free energy too, or they also call it the log partition function. It's the log of the kind of denominator you would have in a softmax, where Z is the variable you sum over in the softmax. So now you're back to the previous problem, right? You've abstracted away the fact that your model has a latent variable inside of it. You may care about what's inside, but from the outside it's now a regular energy-based model that doesn't have a dependency on Z, and you can do inference by just finding the minimum of F of X, Y with respect to Y. Now, why do I call these F infinity and F beta? Because the formula at the top here is the limit, when beta goes to infinity, of the bottom one. If you make beta go to infinity, the only term that counts in this integral is the term with the smallest energy, because beta being very large and all the other energies being slightly bigger, their exponentials of minus beta times the energy are going to be much, much smaller. The only one that survives is the minimum. So this integral reduces to just one term, the exponential of minus beta times the minimal energy. Then you take the log, which cancels the exponential, and divide by minus beta, which cancels the minus beta, and you're back to the minimum over Z of E of X, Y, Z. So F infinity really is the limit of F beta as beta goes to infinity. This is not the only way to combine multiple values of Z to conspire for the energy of X and Y, but it's the one that is derived from the probabilistic framework. Let me give you a complete example, and I'm not going to spend too much time on this because I think Alfredo is going to talk about it a lot more than me. Are you, Alfredo? Yeah, of course, I'm going to spend the whole of tomorrow and next week on this stuff, so maybe it's not necessary. This is going to be next week, no? Not the one tomorrow, right? The training, yeah. This is inference, right? This is inference, yeah. So this is tomorrow and next week. Oh, this is tomorrow, yeah. So it's just a kind of preview of what Alfredo is going to tell you tomorrow.
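Before that, here is a tiny numeric sketch of the two free energies just defined, for a discrete latent variable with made-up energies; it also shows F beta approaching F infinity as beta grows.

```python
import torch

E = torch.tensor([1.0, 0.2, 3.0])                     # hypothetical E(x, y, z) for 3 values of z

F_inf = E.min()                                       # F_infinity(x, y) = min over z of E(x, y, z)
for beta in (1.0, 10.0, 100.0):
    F_beta = -(1.0 / beta) * torch.logsumexp(-beta * E, dim=0)
    print(beta, F_beta.item())                        # tends to F_inf = 0.2 as beta grows

print(F_inf.item())
```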
Let's imagine that our model is unconditional, so we don't have an X, we only have two Ys, and our data lies on an ellipse. So the manifold of data, if you want, to which our system should give low energy, is really an ellipse, and we should have higher energy outside. We could have an energy-based model that tells us, essentially, what is the distance of any data point Y to this ellipse, and the way to compute the distance between the point and the closest point on the ellipse is to have a latent variable, which is basically the position of that closest point on the ellipse, parameterized by an angle that I can call Z. So if my energy-based model is parameterized this way, this is essentially the Euclidean distance of any point Y, with components Y1 and Y2, to a point on the ellipse, and the point on the ellipse is parameterized by Z. If I minimize this with respect to Z, what I'm going to find is an angle Z that gives me the point minimizing the square distance between my data point and the model, and that's going to be the energy my model gives to this point. And again, I can write the free energy as just the minimum over Z of the expression on top. If you didn't get it, you'll hear more about this from Alfredo tomorrow. Okay, a bit of a note on this idea of transforming an energy function into a probability distribution: this is not always possible, and it's not always desirable. And this is the whole motivation for talking about energy-based models: the fact that there are certain situations where probabilistic models are basically unusable. They're unusable when, for example, the denominator here that is required to normalize the probability distribution is intractable. It's very common for people to parameterize complex probability distributions through an energy function, taking the exponential of this energy function and normalizing using the Gibbs-Boltzmann distribution. The problem is that you cannot always normalize, because you don't always know how to compute this integral. If Y, for example, lives in the space of images, this is an integral over the space of all images, and unless F has a very simple form, like a Gaussian, you're not going to be able to compute it. So we're going to energy-based models because, in many interesting cases, probabilistic models are intractable. We're basically diminishing our ambitions here. We say, okay, we don't care so much about probabilities anymore. We care to model the dependencies between variables. We need to handle uncertainty, but insisting on using normalized distributions leads to intractability. So we're going to take a step back and say, okay, we're just going to use the energy as the fundamental underlying object. Okay, so where does this free energy formula come from? There's a little bit of math here. If I have an energy function E of X, Y, Z, and I want to compute a joint probability distribution over Y and Z, I can use the Gibbs-Boltzmann distribution, and it's a completely brainless application of it. You take e to the minus beta times the energy, and you have to normalize by the integral over whatever variables are on the left of this bar, because the joint distribution over Y and Z needs to normalize to one. So you have to divide by the integral over both variables, right? It's pretty logical. So now what you get is a distribution over Y and Z, because when you integrate over both Y and Z, you get the same thing at the top and the bottom, and that's one.
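Going back to the ellipse example for a moment, here is a minimal sketch of it; the radii and the data point are made up, and the minimization over the angle z is done by brute-force search over a grid.

```python
import math
import torch

a, b = 2.0, 1.0                                          # hypothetical ellipse radii

y = torch.tensor([1.7, 0.9])                             # a data point near the ellipse
z = torch.linspace(0, 2 * math.pi, 1000)                 # candidate angles for the latent variable
ellipse = torch.stack([a * torch.cos(z), b * torch.sin(z)], dim=1)

E = ((y - ellipse) ** 2).sum(dim=1)                      # E(y, z) for every candidate z
print(E.min().item(), z[E.argmin()].item())              # free energy min_z E(y, z) and the best angle
```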
Now, once you have this joint distribution, you can compute P of Y given X by just marginalizing with respect to Z. So P of Y given X is simply the integral over Z of P of Y and Z given X. That's basic marginalization, right? When you have a joint distribution, which is like a table if you want, a distribution over two variables Y and Z, and you just want the distribution over Y, you sum over Z, and you get what's called a marginal distribution. All right, so now, what if I do this calculation, where I take this integral over Z and push it inside? The bottom is already integrated over Z, so it's a constant with respect to Z, so I only need to integrate the top. If I take this and put it inside the integral, I get this: the integral over Z of e to the minus beta E of X, Y, Z, and at the bottom I integrate the same thing over both Y and Z. Okay, now I'm going to do something very weird: I'm going to rewrite the top as an exponential of minus beta times something. I take the log, divide by minus beta, and then multiply by minus beta and take the exponential; the whole thing cancels out. I haven't done anything; I've just rewritten this in a complicated manner by taking a log, dividing by beta, multiplying by beta, and taking the exponential. This is obviously equal to that. And I'm going to do the same thing at the bottom, but there I keep the integral over Y on the outside, which I can obviously do. So I'm taking the exponential of the log of this integral with respect to Z. Now, once I've done this, the thing at the top, this minus one over beta times the log of the integral over Z of e to the minus beta E of X, Y, Z, I'm going to define as F beta of X, Y. I'm just going to define F beta of X, Y as this entire expression in the bracket. So now what I have at the top is e to the minus beta times this free energy, defined this way, and I have the same free energy at the bottom, now integrated with respect to Y. What I get is the Gibbs-Boltzmann formula for P of Y given X in terms of this free energy. This is just the Gibbs-Boltzmann distribution for P of Y given X, where the energy is not the energy anymore, it's the free energy. So if I define the free energy by this expression, I get an energy function that, when I apply the Gibbs-Boltzmann distribution to it, actually gives me the conditional distribution of Y given X. So that's where this formula comes from. It's a probabilistic interpretation of energies, if you want. You can think of it this way: how do you define the free energy in such a way that it's compatible with the usual notions of probability? But there are other ways, which I'm not going to talk about. Okay, so let me give you a concrete example of a latent variable model that you're probably familiar with, hopefully familiar with, called K-means. So, K-means clustering. Sparse coding you may not be familiar with, so I'm not going to talk about it, but K-means you are almost certainly familiar with. In K-means, the energy function is between Y and Z; there is no X in K-means, it's an unsupervised, unconditional clustering method.
So you have a data vector Y and the latent variable Z, and the energy function is simply the square distance between Y and the vector Z multiplied by a matrix, where the vector Z is constrained to be a one-hot vector. So it's a vector of size K that has all zeros except for one component equal to one. When you multiply this matrix W by this vector Z, what you're doing is selecting one of the columns of W: the one whose index corresponds to the index of the component that is equal to one in Z. And what you're computing here is the square distance between the data vector Y and this particular column of W. I'm not talking about training K-means here; I'm assuming K-means has been trained, and that you have this matrix of prototypes where each column is a prototype. So here is a depiction of the energy surface of K-means, where the data points have been selected from this spiral. There are a lot of data points randomly selected from this spiral, and we've run the K-means algorithm with K equal to 20, and each of those dark spots is basically an energy well, a quadratic energy well. The energy is indicated by the brightness, and it grows as you move away from the manifold. If you're on the manifold, on the spiral, or rather if you are at the bottom of one of those wells, the energy is zero, because you are at a prototype: the distance is zero because Y is equal to one of the prototypes. And as you move away from that surface, the energy grows quadratically, because this is a squared Euclidean distance. Now, if you move along the manifold, it grows quadratically until you get closer to another prototype, and then it starts going down again, because now you've selected the other prototype as being closest to you. And this is what the minimization is doing: the minimization with respect to Z selects the closest prototype to the data point. So for any point on the plane here, the brightness essentially represents the square distance to the closest prototype, which is one of the columns of W. So that's an example of an unconditional latent variable energy-based model that you are familiar with, but recast in this vocabulary, if you want. Okay, there is an issue, and this is going to come up when we talk about training energy-based models, which we're going to do in just a minute. The issue is this: imagine I use one of those models I was telling you about before, a latent variable model, so that I parameterize the set of possible predictions by a latent variable that can vary, and I obtain the value of the latent variable by minimizing the energy between X and Y with respect to the latent variable, which is the classical thing I just talked about. There's an issue. The issue is: imagine that Z has the same dimension as Y. By the way, this is going to constitute an answer to the question that was asked before about the information capacity of the latent variable. So imagine I give you an X and a Y, and let's say Y is an image, either a segmentation or a denoised version of X, or a higher-resolution version of X, or something like that. And imagine that I make Z essentially the same dimension as Y.
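Going back to the K-means energy for a second, here is a minimal sketch of it, with a random stand-in for the trained prototype matrix W: minimizing over the one-hot z is just picking the closest column.

```python
import torch

K, d = 20, 2
W = torch.randn(d, K)                                # hypothetical, already-trained prototypes

def free_energy(y):
    # minimizing E(y, z) = ||y - W z||^2 over the K one-hot z's = picking the closest prototype
    dists = ((y.unsqueeze(1) - W) ** 2).sum(dim=0)   # squared distance to each column of W
    return dists.min()

y = torch.randn(d)
print(free_energy(y).item())                         # zero only when y sits exactly on a prototype
```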
Now imagine, further, that my decoding neural net here is non-degenerate, which means that for every input it's going to produce a different output, and it can produce all kinds of different outputs; let's imagine a very simple function, like the identity function. The problem is that if I give the system an X and a Y, there is always going to be a Z for which the energy is exactly zero, because there is always going to be a value of the latent variable such that, when I run it through the decoder, the prediction Y-bar I get is exactly equal to Y. So the problem I have here is that my energy surface, f of X, Y as a function of Y, is completely flat; it's zero everywhere, right? And this is not a good energy-based model, because a good energy-based model should give low energies for stuff you want, for good values of Y, and high energies for things you don't want, which are bad values of Y. The problem is that if my latent variable Z has too much capacity, too high a dimension, if the set in which I can choose it is too large, things like that, then there is always going to be a value of Z that gives me zero energy, which means my energy surface is going to be flat. How do I fix this? This is the entire topic of how you train energy-based models. One trick is to regularize the capacity of the latent variable, but I'll come back to this; I'm not going to go into the details just now. I'll just say: why is it that K-means works, for example? K-means solves that problem by making Z a discrete variable that can only take K values. So the information capacity of Z is log base 2 of K bits, because there are only K values it can take. And as a consequence, there are only K points in Y space that can have zero energy, which are the locations of those K values of Z once they run through the decoder. So that's a way of limiting the capacity of Z and, as a consequence, limiting the volume of the Y space that can have low energy, essentially. But I'll come back to this. Okay, so how do we train an energy-based model? We build this energy-based model, we give it some architecture; the one I showed would be an example, another one would be the joint embedding, but there is an infinite number of architectures you can choose from. And then the way you train the energy-based model is that whenever you have a data point X, Y, you tweak the parameters of the model so that the energy is as small as possible. It's good to have an energy that has a lower bound, something like a distance, something that is lower bounded, let's say by zero, arbitrarily, without loss of generality. So you're trying to make the energy of data points that you observe from your training set zero, or as small as possible. And then comes the complicated part, which is to make sure that the energy of everything else is higher. For a given X, you want the energy of good Ys to be low, but you want the energy of bad Ys to be large. And that's where it becomes complicated. There are basically two ways to go about this, two classes of learning methods for energy-based models. The first one is called contrastive methods, and the second one is called architectural methods, or regularized methods. And those largely apply to latent variable models, but not only.
And you can apply those methods to both types of architectures, the joint embedding or the latent variable predictors, and to just about anything. Okay, let's start with contrastive methods, because they are the easiest to understand. Unfortunately, they are also the least efficient, but in some cases they work. A contrastive method is very simple. You take a data point X, Y and you tweak the parameters of the energy function so that the energy goes down; very simple. And then you pick another point, X, Y prime (or Y-hat, which I think is actually the proper notation, but here I denoted it Y prime), or maybe a set of other points Y that you know are bad, and you push their energy up. And so the result is the following: imagine that you have this energy surface here, the X variable is here, the Y variable is here, and your data points are those blue dots. The way the system is being trained is that you take a blue dot and you push it down, so that its energy goes to zero or stays at zero in this case. And then you take a green dot, which is not represented, and you pull it up. The green dot is for the same X, but a different Y that you know is incorrect. So there are lots and lots of different contrastive methods, and they basically all differ in how you pick this Y prime (or Y-hat) that you're going to push up, and in the loss function that you plug those two energies into. And we're going to go through this. So let me start with that. A lot of methods you may or may not have heard of (it doesn't matter if you have) can be classified as either contrastive or regularized/architectural methods. So the general approach of maximum likelihood, in probabilistic models that need to be explicitly normalized, is a contrastive method. It basically says: push down on the energy of data points and push up everywhere else. That's what maximum likelihood wants you to do, and I'll come back to this to make it clear. A second set of contrastive methods is: push down on the energy of data points, and then push up on chosen locations outside that are different from the data points. Maximum likelihood when you use sampling methods, like Markov chain Monte Carlo or Hamiltonian Monte Carlo, is an example of such a method. Contrastive divergence, which you may have heard of, is another one; it's not used very much anymore, but it was used for Boltzmann machines and things like this. The type of contrastive methods that people use to train joint embedding architectures or Siamese nets is also of that kind. There are other things here that I'm not going to talk about. Generative adversarial networks are actually an example of contrastive methods for energy-based models; I'll explain that later as well. So if you know what a GAN is, it's actually a contrastive energy-based model, secretly. And then a technique called the denoising autoencoder is a special case of this, and it has become extremely popular in the context of natural language understanding, in a particular form called the masked autoencoder. You've probably heard of BERT and transformer networks and things like this. They're pre-trained in a self-supervised manner using a technique that is basically a contrastive method, for an unconditional model, to give low energies to data points and high energies to points that are just outside of them.
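Here is a minimal sketch of that push-down/push-up idea with a hypothetical margin loss; the two energies are stand-ins for what the model would compute on a training pair and on a contrastive pair.

```python
import torch

# F_good and F_bad stand in for F(x, y) on a training pair and F(x, y_hat) on a
# contrastive pair; in practice both come out of the energy model.
F_good = torch.tensor(0.4, requires_grad=True)
F_bad = torch.tensor(0.6, requires_grad=True)

def contrastive_loss(F_good, F_bad, m=1.0):
    # push the good energy down, and push the bad energy up while it is
    # still within a margin m of the good one
    return F_good + torch.relu(m + F_good - F_bad)

loss = contrastive_loss(F_good, F_bad)
loss.backward()
print(F_good.grad, F_bad.grad)   # gradient descent on this loss lowers F_good and raises F_bad
```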
Okay, I'll come back to this and explain it. Then there are the regularized or architectural methods; some of those you may have heard of, like PCA, which I mentioned, or K-means, which I mentioned. The way this works is that K-means cannot have a flat energy surface, because the number of points that can have zero energy is limited to K points, so that's a low volume. For Gaussian mixture models, the volume of space that has low energy is bounded by the fact that it's a normalized probabilistic model: it can't give high probability to everything, it has to give low probabilities to some things because it has to integrate to one. ICA (independent component analysis), normalizing flows, things like sparse autoencoders, sparse coding, variational autoencoders, if you know what those are, and various other models I'm not going to talk about: they are basically regularized or architectural methods. They don't require you to generate contrastive samples whose energy you're going to push up, and there's a big advantage to those methods, which is that, essentially, you don't need as many samples to train them. Okay, so we talked about the transformation of energy-based models into probabilistic models using the Gibbs-Boltzmann distribution. But as I said, don't compute probabilities unless you absolutely have to. Still, here is a depiction of why probabilistic models are secretly energy-based models of a particular type: they're basically energy-based models where the loss function has this particular form, which is the negative log-likelihood of y given x, or of y if it's an unconditional model. In the formula here, the probability of y, there is no x; I put the conditioning on W, which is a parameter I haven't used in the other notation, and I apologize for this, and this E should be an F. So if you have a data point y, you want to increase its probability, and as a consequence this will decrease the probabilities of everything else, because this distribution has to be normalized. So if you use the so-called negative log-likelihood loss as an objective function and you compute its gradient, it's going to have the effect of lowering the energy of the data point you just showed and pushing up on the energy of everything else. And here is why; this time with the proper notation. P of y given x, parameterized by W, which are the parameters of the energy model, is e to the minus beta F of x, y, divided by the sum over y prime of e to the minus beta F of x, y prime. That's the Gibbs-Boltzmann distribution. I'm being a probabilist here, so I'm going to use as the loss the negative log-likelihood of the data under the model. So I take the negative log of this for a particular data point x, y that I'm observing in my data set. The negative log of this expression, the numerator divided by the denominator, is the difference between the log of the denominator and the log of the numerator. The log of the numerator just cancels the exponential, and the log of the denominator is just the log of the denominator. Then I divide everything by beta, which doesn't make any essential difference to the loss function; it just multiplies it by a constant. And so for the first term, taking the log canceled the exponential, and dividing by beta canceled the beta.
So I just get F, and then minus the log of the thing at the bottom, but I'm also dividing by minus beta, so I get a plus. So I get this formula in the end. This is the negative log, divided by beta, of the conditional probability of y given x, okay? And that's a perfectly good loss function to minimize. If I minimize this with respect to my parameters W, averaged over a bunch of training pairs x and y, I'm going to maximize the conditional likelihood of all the y's in my training set given all the x's in my training set, okay? So that's the so-called negative log-likelihood loss. Let me compute the gradient of this with respect to w. So I differentiate the first term with respect to w, I get the gradient of F with respect to w, which presumably I can compute by just running back propagation, okay? F is some neural net that I built, so I know how to back propagate gradients through it. And then the other term, when I differentiate it with respect to w, I get this, okay? I very much encourage you to do this calculation yourself. I'm not gonna do it here in front of you, but I very much encourage you to do it. Take this expression here, differentiate it with respect to w, and verify that you get this, okay? So what is this? First of all, you get a minus sign in front of it. The beta disappears, which is interesting. And you get the integral over all y's, over the entire space of y, of the gradient of your energy for that particular y prime with respect to w, weighted by the probability that your model gives to that particular y. So this factor here is just this probability, okay? So this is the probability that your model gives to this particular y prime, and this is a weighted sum. It's actually a weighted average of the gradients of your energy function with respect to the weights, where the weights in the weighted average are the distribution that your model gives to each of those y's. What does that mean? What that means is that you get a negative term here, which means that the energy of a particular point y prime is gonna be pushed up, and it's gonna be pushed up very hard if the probability that your model gives to it is large, which means if its energy is low. So if you have a low energy y prime, this term is gonna be trying to push it up really hard, okay? If the energy is already high, it's not gonna be pushed up very much, because the probability here is gonna be low, so the contribution to this integral is not gonna be large. So there's an issue there, there are two issues. There are many issues, as a matter of fact, with this thing. So this is the special case of a probabilistic model trained with maximum likelihood: energy based models that are interpreted as probabilistic models that compute the conditional likelihood of y given x. You train it with maximum likelihood and that's what you get. Now, there are issues with this. The first issue is that you have to compute this integral, and most of the time that's completely intractable, okay? So you're not gonna be able to do this in most cases, unless F has a very simple form. The second thing is that... well, you could discretize, you could numerically estimate this integral. And the way you do this is that you replace the integral by a sum, but you still have to sum over the entire space of y's. If y is a discrete set of categories, it's easy, it's just a discrete sum, right? And basically what you get here is softmax.
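Since the lecture encourages you to redo the calculation, here is one worked-out version of it, written with the notation used above (β the constant, F_W the parameterized energy, the integral running over the whole space of y'):

```latex
\begin{aligned}
P_W(y \mid x) &= \frac{e^{-\beta F_W(x,y)}}{\int e^{-\beta F_W(x,y')}\,dy'} \\
\mathcal{L}(W) &= -\frac{1}{\beta}\,\log P_W(y \mid x)
  \;=\; F_W(x,y) \;+\; \frac{1}{\beta}\,\log \int e^{-\beta F_W(x,y')}\,dy' \\
\frac{\partial \mathcal{L}}{\partial W}
  &= \frac{\partial F_W(x,y)}{\partial W}
   \;-\; \int P_W(y' \mid x)\,\frac{\partial F_W(x,y')}{\partial W}\,dy'
\end{aligned}
```

The first term pushes the energy of the observed data point down; the second term is the weighted average described above, which pushes up hardest on the y' to which the model currently assigns high probability, and the β indeed cancels out of the gradient.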
In that discrete case, you get the cost function that we've used at the top of our neural nets, the cross-entropy, the negative log softmax, right? It's actually this, it's the same thing. So if you think of f of x, y as the score that a neural net gives to a particular category y for a given input image x, for example, okay? And you view the y's as discrete options and you plug that into this formula. What you get in the end is exactly the type of loss function that you've been playing with and minimizing to train a classifier. But here this is the more general case, where y may be a high-dimensional continuous space or something like that. So let's imagine that we discretize it, right? We discretize this high-dimensional continuous space and we replace this by a discrete sum. If y is high-dimensional, this is still going to be extremely expensive, basically intractable. So a common trick is to say: well, I'm actually not going to compute this sum at all. I'm going to just draw a single sample from that distribution. Let's assume I can do this. Let's assume I can ask the system to give me a sample of y according to the distribution it computes, according to this distribution. That turns out to be easy to do, because back during World War II, during the Manhattan Project, physicists actually invented a technique called the Monte Carlo method, which allows you to draw samples from a distribution that you only know through its energy, where you cannot actually compute the normalization here, okay? And that's called the Monte Carlo method. So it's not recent, and it wasn't invented in the context of machine learning. It was invented in the context of building atomic bombs at Los Alamos. So what you're going to do is draw a sample y hat from that distribution, using one of those techniques, which I'm not going to go into the details of, and then replace this entire integral by just that one sample. And on average, when you do this multiple times, the result is going to average out to the same thing as if you had computed the integral, provided you draw sufficiently many samples of y prime, okay? But now you have a much simpler formula. The gradient of your loss with respect to w is equal to the gradient of the energy that your model gives to your data point, okay? Which you can compute with backprop. And then you sample another data point from the distribution that your model gives to your y's, given the x, okay? Using Markov Chain Monte Carlo or one of those techniques. And then you back propagate the energy of that through your system to compute its gradient, compute the difference between those two gradients, and update your parameters with that, okay? That seems super simple. It is simple, but it's inefficient, because if you have a high dimensional space for y, you're going to have to draw a lot of samples of y before this converges to anything. But if you've heard of Boltzmann machines and restricted Boltzmann machines, this is basically the way they're trained. If you've heard of MCMC methods for training probabilistic graphical models, this is the way they're trained. So there's a question here: can we integrate out beta? How do we pick beta? Beta to some extent is a little arbitrary, because particularly if your energy function is parameterized to take whatever value it wants, then you can basically set beta to one and not worry about it, okay?
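Here is a quick numerical check of that point (purely illustrative, with made-up shapes): applying β to the logits is indistinguishable from scaling the weights of the last layer by β, so the network can absorb any β into its parameters.

```python
import torch

x = torch.randn(8)          # input to the last layer
W = torch.randn(5, 8)       # last-layer weights producing 5 scores
beta = 2.0

p_temperature = torch.softmax(beta * (W @ x), dim=0)   # beta applied to the logits
p_folded = torch.softmax((beta * W) @ x, dim=0)        # same beta folded into the weights
assert torch.allclose(p_temperature, p_folded)

# The distillation trick discussed next: train with the usual beta, then lower it
# afterwards to get softer (more graded) targets for a student network.
soft_targets = torch.softmax(0.5 * (W @ x), dim=0)
```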
You know, in softmax there is a beta parameter as well, right? If you make beta very large, the output of softmax, for kind of reasonable inputs, would essentially be binary, right? One output would be one and the others would be zero. For smaller values of beta, you get a smoother distribution on the output. But the thing is, you can change the weights of the layer that comes just before the softmax, scale them by a factor of two or whatever, and get the same effect. So in effect, beta is really not important, okay? I see. And otherwise, is it determined by cross-validation? You could do that, but generally you just set it to a value that seems reasonable for your gradient descent algorithm and be done with it. Again, it's kind of arbitrary, because you can multiply beta by two, but then the weights that feed into that layer, into your softmax or score computation, can be divided by two, and in the end you get the same result. So if you have some way to set the scale of the energy through learning inside of F, then beta doesn't matter. You can fold it inside F. I see. I believe that for distillation, we usually train a network with a given beta and then use a lower beta, in order to actually learn the output distribution better. Yeah, that's right. This is a way to get more graded outputs, right? So you train it with a regular beta, and you get outputs that are more or less binary: the weights in the last layer basically scale themselves so that the output is almost binary. Then you tune down beta, so now you get intermediate values, and you use those as targets to train a second network. That's the idea of distillation. So yeah, you can change beta after training and that will have an effect. And those soft labels are gonna be easier to learn for the learner network, right? No, they're gonna be harder to learn, but they're gonna regularize that network better. They give more information, right? Because now you have vectors that take intermediate values instead of binary vectors that are one-hot, right? So there's more information in them. That constrains the network more and, as a consequence, it generalizes better, essentially. Okay, okay, makes sense. Right, so your learning algorithm is gonna replace W by itself minus some learning rate times the gradient. And the gradient is gonna have two terms: one term that pushes down on the energy of your data point, another one that pushes up on the energy of some other point that you dreamed up. And if you wanna do maximum likelihood, the way you dream it up is that you sample it from that distribution. Okay, so the first problem is that this integral might be intractable, and so you might have to resort to things like sampling, and that doesn't work well in high dimension. But the second problem is that this criterion wants to push the energy of bad points to infinity, okay? It's gonna keep pushing on the energy of bad Ys until they go up to infinity, if you don't stop. And what it really wants, let's say the energy function is bounded below by zero, is to make the energy of the blue dots zero, as small as possible.
But what's more important is that it wants to make the difference between the energy of the blue dots and the energy of everything else, even epsilon outside of them, infinite, right? Because there's not gonna be a limit to how much the energy of the bad guys gets pushed up. And even if y hat is only slightly different from y, it's still gonna get pushed up, in fact really hard. I should have mentioned, by the way, that this thing balances, right? It kind of stops at the data points: when y hat is equal to y, the two terms cancel and you get zero. So the points that sit on data points basically don't get any gradient. What is being pushed up is everything else. And so what maximum likelihood wants, if your data manifold is a thin manifold, is basically to make your energy function a kind of deep canyon, okay? A very narrow and deep canyon. And that's really bad, because what's the use of an energy function that is essentially equal to infinity everywhere except at the places where you had data, where it's equal to zero or whatever minimum value your energy can take? That's not a very useful energy function for inference, for example, okay? So that's the problem with maximum likelihood in probabilistic models: they want to estimate the distribution, and estimating the distribution that way is bad for inference, okay? Now, if you say this to a statistician, they will murder you on the spot, right? Because that's anathema, right? I mean, you're telling people that probabilistic modeling is bad. And so statisticians, particularly Bayesian statisticians, have invented all kinds of ways to prevent this from happening, by essentially regularizing the distribution to make sure it's smooth, to make sure it's not zero anywhere, so that the energy doesn't go to infinity, things like that. But those are tricks, they're hacks. And if you're going to use hacks, you might as well use good hacks. So here's another set of hacks, and actually they are not hacks, they are attempts to liberate ourselves from the constraints of probabilistic modeling by allowing ourselves to use other types of loss functions. So instead of insisting that the loss function push down on the energy of good points and push up on the energy of other points in such a way that, in the end, the energies are negative log probabilities, we give up on this. We say we don't care whether the energies correspond to log probabilities, we just want the energy of good points to be lower than the energy of bad points, okay? And we're going to construct objective functions so that this happens. But we're not going to insist that the objective function be a negative log likelihood or anything like that. So let me give you an example. A simple example is the simple margin loss, okay? So, margin loss, let's look at this row here. The idea for this goes back a while. You take the energy of a good pair, so I give you a good pair X, Y from the training set, okay? And one term in your loss function says: minimize that energy, push down on it, okay? The other term in the loss would be: I give you another point, Y hat, another Y, okay? Which I know is bad. Now, I haven't told you how you select this Y hat, okay? The design of the loss function is independent of how you select Y hat. We might come up with different schemes for this, but let's assume we have a scheme for coming up with a Y hat that is bad, different from Y.
What we're gonna do here is push up on the energy F of X, Y hat, up to the point where it is equal to a margin, a value that may depend on the distance between Y and Y hat, or something like this, right? So this square bracket here with a plus means positive part, it's like a ReLU, okay? So we're gonna push down on F of X, Y so that it goes to zero, but we're not gonna push it down much more than that. And we're gonna push up on F of X, Y hat so that it goes above this margin parameter m of Y, Y hat, which depends on the pair Y and Y hat. It could be a constant, or it could be something that is large when Y and Y hat are very different and small when they're not that different, something like that, okay? So this is called, well, it has no name actually, but it's a kind of hinge, a double hinge loss if you want. And you can show that this will ensure, if your model is powerful enough and you train it properly with enough samples, that good samples take lower energy than bad samples, right? Because you keep pushing them down if they're good samples, and you keep pushing them up, up to a point, if they're bad samples, and so you're gonna get a properly shaped energy surface there, okay? Here is another one, below. It's called the ranking loss, or sometimes the triplet loss. Triplet because there is X, Y and Y hat, okay? And this says: well, I don't really care about the absolute value of the energy of the good pair, the absolute level. What I care about is that the energy of the bad guy be larger than the energy of the good guy, but they could both be large or both be small, I don't care. I just want the bad guy to be higher than the good guy. So again, you use a positive part and you put the difference of those two energies inside it, so that the loss only cares about the difference of those two energies. So this is gonna have the effect of pushing the energy of the bad guy up above the energy of the good guy, until the energy of the bad guy is larger than the energy of the good guy by at least m of Y, Y hat, okay? This has been used in a lot of contexts. Google has been using this for over 10 years now for things like image search. I mean, now they use other things, they use deep learning, but back in the late 2000s they were not using deep learning for this. But they were trying to find good representations for queries and images, in such a way that the representations of images that match a particular query would be close in some vector space. And they used this so-called triplet loss for that. There's a paper by Jason Weston and Samy Bengio from about 10 years ago. Here's another example, the square-square loss. It's very much like the one at the top here, except you square the two terms, so they are smooth as well as continuous. I mean, the other one is continuous too, but it's not smooth. So it looks like this: the first term is a square loss that says I want the energy of the good guy to be as close to zero as possible, and the other one pushes the energy of the bad guys up to a margin, and above that it doesn't care, because you have the positive part here, but there's a square so that both terms are convex. And the reason for making them convex is that there's going to be an equilibrium point between those two costs, and it's good to have convex losses.
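Here is a small sketch of the three loss shapes just described, written for scalar energy tensors, where `e_good` stands for F(x, y), `e_bad` for F(x, ŷ), and `m` for the margin m(y, ŷ) (taken as a constant here for simplicity):

```python
import torch

def hinge_pair_loss(e_good, e_bad, m=1.0):
    # "double hinge": push e_good down toward zero, push e_bad up to the margin
    return torch.relu(e_good) + torch.relu(m - e_bad)

def triplet_loss(e_good, e_bad, m=1.0):
    # ranking loss: only the gap matters; e_bad should exceed e_good by at least m
    return torch.relu(m + e_good - e_bad)

def square_square_loss(e_good, e_bad, m=1.0):
    # same shape as the hinge version, but with squared (convex, smooth) terms
    return e_good ** 2 + torch.relu(m - e_bad) ** 2
```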
Okay, so those are losses for energy-based models that are not derived from probabilistic arguments. And they don't involve integrals. They involve finding Y hats, okay? Which we haven't yet said how to do. A more general form for this type of loss is what I would call a generalized margin loss. So it's something where, for a particular x, y, we take a sum over all possible y's, or maybe not a sum, maybe some other combination over all possible y's, of some loss function that is an increasing function of the energy of the good guys and a decreasing function of the energy of the bad guys up to a point, with a margin that says: above this margin, I don't really care about pushing this guy up any more. With some extra parameters. So this is a more general form if you want, and the triplet loss and margin loss are good examples of it. One that people have been using a lot in joint embedding techniques is one in which the loss itself combines the different y hats in a single function in a nonlinear way. So it's not just a sum of energies over different y hats, it's some more complex function of them. A particular one is called neighborhood component analysis, or noise contrastive estimation. And it basically consists of plugging all those energies of all the y hats into a softmax, okay? A softmax-like function. In fact, as written here this is incorrect, there should be a log in front; it's really a log softmax. So this basically says: I want to make the energy of my good guy as low as possible, and I want to make the energies of all the bad guys relatively larger, essentially, I want to push them up. I plug them into the softmax, compute the negative log of this, and that's what I'm gonna minimize, right? That's NCE; there's a small sketch of this loss right after this paragraph. Okay, so now let's talk about this idea of contrastive joint embedding. I'm gonna place myself in the context of Siamese nets, in which those two networks are identical, but that's not a requirement, it's just a very popular approach at the moment. So let's say I want to train a system, and this is starting to get us into the idea of self-supervised learning. So, the idea that you can pre-train a neural net to do a task for which you don't need labeled samples, okay? And you can pre-train it to learn good representations of images through this method. This is a really hot topic at the moment. There are papers, I mean, people have been working on this for a while, but it's become really, really exciting over the last year and a half or so. And in the last few months there has been incredible progress on this. So this is really fresh. In fact, there's a new paper that I co-authored from Facebook that is going to come out tomorrow, which I might tell you about next week, really fresh from the press, right? It's not listed here. So you'll hear about it before everybody else. And by the way, I should mention that there's going to be a blog post published, a series of blog posts by Facebook tomorrow, about self-supervised learning and how it's going to revolutionize image recognition, speech recognition, and other things. So there are going to be various announcements by Facebook tomorrow about this. It's a really, really hot topic. A lot of people are really interested in this because it will take us to the next step in machine learning and AI. Okay, so, contrastive learning for joint embedding. So you have this joint embedding architecture.
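Here is a sketch of that softmax-style loss (an NCE / InfoNCE-like form, with illustrative names): the energy of the good pair competes against the energies of a set of bad ŷ's inside a log-softmax. This is also the kind of criterion used in the joint embedding training described next.

```python
import torch

def nce_style_loss(e_good, e_bads, beta=1.0):
    # e_good: scalar tensor F(x, y); e_bads: tensor of F(x, y_hat_i) for the bad y's
    energies = torch.cat([e_good.view(1), e_bads])
    log_probs = torch.log_softmax(-beta * energies, dim=0)
    return -log_probs[0]   # minimizing this lowers e_good and raises the e_bads
```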
And the way you're going to train this architecture to learn good representations of images is that you take an image and you distort it a little bit, by scaling it, shifting it, rotating it, changing the colors, blurring it, things like that. And you show these two different versions of the image to this joint embedding network. And you train the network to minimize the distance between H and H prime, so minimize the distance between the vectors that come out of those two networks. And the reason you want this is that you want those vectors to basically be independent of the particular details of that image, the particular viewpoint, the details of the colors, the scale, things like that. You want a representation of the content of the image that is independent of the particular instantiation parameters of the scene. So that's one way to do it. It's not entirely unsupervised, because you're cheating a bit: you're telling the system that those two images are identical, their content is identical, so make their representations identical. Now, if you just do this, if you're basically minimizing a loss which is just the average distance between H and H prime, which is the average energy the system produces, the system collapses. You'll get an energy function that is basically zero everywhere, okay? Because it's very easy for the system to just completely ignore X and Y, set the weights in those networks so that they just compute constant H and H prime that are equal. That will give you a constant energy equal to zero, okay? So just training the system to make those two views have similar representations causes a collapse. It causes the energy surface to be flat and equal to zero everywhere. And as I said, you need a contrastive phase that will push up the energy of stuff you don't want, right? So this is how you do it here. You pick a random image from your data set, which you may also transform in some way, just take a different image from your data set, run those two things through, and now push those two vectors away from each other, okay? Using one of the objective functions I mentioned earlier. It could be a hinge loss, it could be NCE, it could be whatever. Different papers here use different criteria: SimCLR and MoCo use NCE, others use a hinge, others use the square-square or square-exponential loss. So there are various loss functions you can use, but that's the basic idea. Generally it's done at the batch level, where within the batch you have a single positive pair and the rest are negative pairs. The difficulty with this is what's called hard negative mining. You need to select the pairs of negative images so that your system actually learns something from them. If you choose those pairs randomly, there's a good chance that your system will already give vectors that are very different from each other, and it basically won't learn anything; only once in a blue moon will the images be similar enough that their output representations are close enough to actually get pushed away from each other. So if you want the system to learn at a reasonable speed, you need to select good negative samples that will actually cause the system to push the representations apart: things that are easily confusable, and confused, essentially. And that's where things become complicated, and that's where those contrastive methods become incredibly expensive.
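As a rough sketch of one such training step in the Siamese setting (not the exact recipe of any of the papers just mentioned; `encoder` and `augment` are assumed to be provided, and a hinge-style loss is used where SimCLR would use NCE):

```python
import torch
import torch.nn.functional as F_nn

def joint_embedding_step(encoder, augment, optimizer, image, other_image, margin=1.0):
    optimizer.zero_grad()
    h = encoder(augment(image))             # two distorted views of the same image
    h_pos = encoder(augment(image))
    h_neg = encoder(augment(other_image))   # a different image: the negative sample
    e_pos = F_nn.pairwise_distance(h, h_pos)   # energy = distance between embeddings
    e_neg = F_nn.pairwise_distance(h, h_neg)
    loss = (e_pos + torch.relu(margin - e_neg)).mean()   # pull together, push apart
    loss.backward()
    optimizer.step()
    return loss.item()
```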
So SimCLR, for example: if you want to train the SimCLR system, which is two convolutional nets, to get good results on ImageNet, you have to pre-train for so long, with so many pairs, that if you were to do it on Amazon's cloud it would cost you something like five million dollars. So it's insanely expensive. So there's a big incentive to find alternative methods that are not contrastive, because contrastive methods are so expensive. MoCo is an attempt at doing this, and there's a whole bunch of them that I'll show you later. Here is another example of a contrastive method, and it's called the denoising autoencoder, or sometimes the masked autoencoder, which is a special case. So you take a data point Y and you corrupt it in some way, right? You generate another point X, or in fact we could call it Y hat; I call it X here, but we could call it Y hat. And it could be corrupted, if it's an image, by adding noise for example, or by masking some of the pixels, or by perturbing it in some way. If it's text, which is very popular and very common in BERT-like systems, you take a piece of text and you mask some of the words in that text, okay? You replace them with a blank marker, essentially. So now you have a partial view of Y, essentially, and the observed part of it is what we call X. You run this through a neural net, and this neural net is an autoencoder, so its output has the same dimension as Y, okay? Same dimension as X as well. And you compare what is produced by this neural net with the original Y, okay? So basically, if you minimize this energy during training with respect to the parameters of this neural net, you're training the system to denoise a corrupted input, to recover the original input without the noise, okay? Which is why it's called the denoising autoencoder. The idea actually goes back a very long time. I had things like this in my PhD thesis, back in 1987. But it was revived by Pascal Vincent, who is at Facebook now; at the time he was at the University of Montreal. And there's some theory around it coming out of Mila, the lab in Montreal. It was also used by Collobert and Weston in 2011 as a way to pre-train a natural language understanding system. And now it has become incredibly successful for things like BERT and RoBERTa. So this is the basic model: take a piece of text, mask some of the words, and then train a very large neural net to recover the missing words. That's the standard way of pre-training a natural language understanding system without requiring labeled data and without training it on a particular task, just training it to represent text, okay? So in the end, because it's an autoencoder, you can take a layer inside of this autoencoder and use it as a representation of the stuff you trained it on, a representation of text, for example, or of images if it's images. Now, how can this be interpreted as an energy-based model? How does it push down on the energy of certain things and push up on the energy of other things? Well, if you don't corrupt the input, if the corruption is very small or zero, X is equal to Y, and this autoencoder is basically trained to reproduce its input on its output, right? So if you use the reconstruction error as the energy, the energy of points you train the system on, where you don't put any noise, is gonna be zero, okay?
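A minimal sketch of that training loop, assuming continuous data and additive Gaussian noise as the corruption (`autoencoder` is any network whose output has the same shape as its input):

```python
import torch

def denoising_ae_step(autoencoder, optimizer, y, noise_std=0.1):
    optimizer.zero_grad()
    x = y + noise_std * torch.randn_like(y)    # corrupted version of the data point
    y_hat = autoencoder(x)                     # attempt to reconstruct the clean y
    energy = ((y_hat - y) ** 2).sum(dim=-1)    # reconstruction error as the energy
    loss = energy.mean()
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference time, the energy of any point y is just
# ((autoencoder(y) - y) ** 2).sum(): small near the data manifold, large away from it.
```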
Now, there are also going to be times when you show an example and you corrupt it, you add noise to it in some way, right? So you take a data point, one of those blue points, and corrupt it, you get one of those amber points, and now you're gonna train this neural net to map the amber point back to the blue point it originated from through the corruption. And what you're going to measure is the distance to the blue point that you just reconstructed, okay? So now, you train the system like this, and what is the energy that your system measures? You give your system an X here, a noisy point, a bad point, okay? It's gonna be reconstructed. So forget about the corruption now; we are in inference mode, right? We're using the system just to measure the energy of a point. So we're not corrupting it, we take a point, we just copy it here, we run it through the system, and what the system is gonna do is produce a denoised point, right? It's gonna produce a point on the data manifold that is kind of a clean version of that bad point Y we just fed in. And so the energy we're gonna get out of it is gonna be large, because the energy is the distance between the point we fed in, which was a noisy point, and the point the system produces, which is the clean point. So the energy is the distance between those two points, okay? So now, if you plot the energy, and this is a plot that Alfredo made and will probably show you again, if you make a plot of the energy surface and of the gradient of that energy surface for all locations in this space, you get this thing where you get low energy on the data manifold and high energy outside. Isn't there a slight issue here, which is that you also get low energy right in the middle? You don't get low energy, you get low gradient. That's bad. You actually get low energy as well, and that's bad. But around the data manifold you get the right shape, which is that your energy goes up as you move away from the manifold, and that's the desired property of an energy-based model. Now, how do we fix this? You can fix it with latent variable models, but I'm not gonna go into that right now. So, this works astonishingly well for pre-training natural language understanding systems, right? Pretty much every single modern, high-performance natural language understanding system is built on this idea: you take a transformer network, okay? You decide to split it at some point, so that one particular layer is gonna be used as the representation, but it doesn't matter much where. And you train it as a denoising autoencoder with this technique of taking a piece of text, either masking or replacing some of the words, and then training the system to recover the missing or replaced words. You do this with lots and lots of text. The typical length of a text here would be about a thousand words. And then what you have is a system that can represent the meaning of a text, because to be able to figure out the missing words in the text, you have to be able to basically understand the text a little bit, okay? So if I give you a sentence of the type "the blank chases the blank in the savannah", you probably know that the first blank might be lion, or cheetah, or something like this. And the second blank might be antelope, or wildebeest, or something like this, right?
And the system basically has to learn this, right? I mean, it can learn it because it's trained on lots of sentences that appear in the right context. And so it will know that lions chase antelopes, if it's in the savannah. Now, if I say the same sentence but with "in the kitchen", it might be a cat and a mouse or something. You know that through the context, so the system might be able to do that as well. So this works in the context of NLP. It works really well, because the way we deal with the uncertainty in the prediction is by basically having a big softmax on the output over the set of all possible words, right? So for every word that we need to predict, we produce, through a softmax, a list of scores, which we can transform into a probability distribution (though we don't have to), and which represents the score of every single word in the vocabulary, okay? So for discrete data we can do that. We can do a softmax over discrete data, even with a large cardinality like the vocabulary of all English words. But unfortunately, this doesn't work very well for images. People immediately tried to translate this success to the context of image recognition, and the idea that would seem natural is something like this. In fact, the first work I heard about on this goes back to something like 2009 at the NEC Research Institute, but it was never published because it wasn't that successful. But this is where it was tried; these were colleagues of Collobert and Weston, actually, David Grangier in particular. And so what they did, and what a lot of people have done since then, is you take an image and you corrupt it by, say, blanking out some parts of it. There are a couple of papers by Deepak Pathak on this, from when he was at Berkeley; he's now a professor at CMU. So it's the same process that we do for text. And then, again, you train a giant neural net to recover the missing parts. And if you pre-train a convolutional net to do this, and then you use the internal representation as input to a classifier for image recognition, it doesn't work very well. It works a bit, but not very well. So what has been an unbelievable success in the context of text has not translated to the context of image recognition. And the reason is that in text you can represent the distribution, the uncertainty over the prediction, easily with a softmax. But in images, how do you represent a probability distribution over all possible patches that could fill in this hole? You can't. And so those systems are trained with something like least squares. And as a consequence, they basically predict the average of all the things that could happen there, which is a blurry prediction. And as a result, the features that are learned are not very good. So this is an example showing that if you try to use a predictive model without latent variables on a high-dimensional continuous domain, it doesn't work. It works on discrete domains like text, but not on continuous ones. So you essentially have to use those joint embedding techniques I was telling you about earlier. Those work, and they work because in that context you learn representations, but you never reconstruct the prediction, you never actually produce a prediction in the input space. You just train the system to give you good matches between the representation of an image and the representation of a proposal. Okay, do you have no questions because you are flabbergasted, or because it's all clear?
I cannot believe that this is all completely clear. Again, we've been answering everything here in the chat. I know, I know. I'm seeing the chat with 99-plus items, so I know that for a fact. Yeah, the connection was a little unstable today for some reason, I'm not sure why; it went down a couple of times, and I really don't understand why. Okay, so let me tell you a little bit about this idea of self-supervised learning. So what I told you about here, energy-based models, joint embedding methods and predictive models with latent variables, those are things you can use in lots of different contexts, whether it's supervised learning, structured prediction, self-supervised learning, unsupervised learning, learning features, learning to predict with uncertainty, things like that. I'm not making an assumption here about what you're gonna use this for. I'm basically making the point that this is a very general way of framing how systems like this can be trained. You interpret the way they compute the output as the minimization of a free energy, essentially: a free energy if there are latent variables, just the energy F if there aren't. If there are latent variables, there are two simultaneous minimizations, with respect to the variable to be predicted Y and with respect to the latent variable. And then the training has to make sure the energy of observed samples is lower than the energy of bad values of Y. You can do this contrastively, by pushing down on the energy of good pairs X, Y coming from your training set and pushing up on the energy of X with bad Ys, which you have to generate in some way, or exhaustively if you can. And those contrastive techniques unfortunately are very inefficient in high-dimensional spaces, because there are many, many ways an image or an object can be different from another object if it's a high-dimensional object, and you basically have to have contrastive samples in all those directions if your model is powerful. So that leads to inefficiencies. If you insist that your model, despite being an energy-based model, correspond to the log of a normalized probabilistic model, then you run into intractability issues you have to deal with. You can think of that as a special case of energy-based models. But there is a whole world outside of probabilistic modeling, and you shouldn't feel like you necessarily have to force your model to estimate probabilities. It's not always useful, by the way, to estimate probabilities. For example, if you have a robot system that needs to take an action, it's completely useless for it to take an action with probability 0.6. It needs to take an action. At some point it needs to make a decision. So it needs to score multiple options, perhaps, and then take the action that has the best score. But the fact that those scores are calibrated probabilities is irrelevant. You just need to take the best action. Okay, so we talked about those contrastive methods where you push down and pull up. And then I only alluded to the fact that there are methods that don't require those contrastive phases. And I showed one example, which is K-means. K-means is a latent variable model, a latent variable energy-based model. And when you train it, you don't need to push up on anything, because the volume of Y space that can take low energy is limited to K discrete points. Okay, the same K as in K-means, right?
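To make that concrete, here is a tiny sketch (my own illustration, not from the slides) of K-means viewed as an energy-based model: the energy of a point is its squared distance to the nearest of the K centroids, so only K points can reach zero energy, and the low-energy region cannot spread everywhere. The choice of which centroid is closest plays the role of the latent variable.

```python
import torch

def kmeans_energy(y, centroids):
    # y: (d,) point; centroids: (K, d) learned prototypes
    sq_dists = ((centroids - y) ** 2).sum(dim=1)   # squared distance to each centroid
    return sq_dists.min()                          # F(y) = min_k ||y - c_k||^2  (min over the latent k)
```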
That's what's called an architectural method: the architecture itself limits the volume of space that can take low energy. So you don't need to push up on the energy of anything; that happens naturally. And what we'll talk about next week is a whole bunch of other techniques that are non-contrastive, that use other means to ensure that the energy of bad points is high. And they're mostly based on this idea of limiting the information capacity of the latent variable. Not all of them actually, but many of them. And in the joint embedding situation, they're based on other ideas which work really well, but are a little mysterious from the theoretical point of view, and we'll talk about that as well. Okay, thank you for your attention. And I hope you enjoy the practicum with Alfredo. I hope so, we worked on that together, right? Obviously. All right, everyone, see you tomorrow. Have a nice day. Okay, take care everyone. Bye-bye.