Right, so we're going to talk about energy-based models. It's basically a framework through which we can express a lot of different learning algorithms — not the simple ones we've seen in supervised learning, but things that are a little more sophisticated. It also encompasses a lot of probabilistic methods, but it's a little simpler to understand, I think, than probabilistic methods. And probabilistic methods really are kind of a special case, if you want, of energy-based models. I think it's a framework that's a little enlightening, in the sense that it explains a lot of things that seem very different when you don't have this sort of unifying view.

So what I'm going to talk about first applies equally well to supervised learning, to what some people call unsupervised learning, or what I would call self-supervised learning, which I'll talk about a little bit today. Basically, we're going to talk about models that observe a set of variables x, and we're asking the model to predict a set of variables y. I'm not specifying that x is an image, or whatever, and y is a discrete variable, like for classification. Y could be an entire video, and x could be an entire video. Or x could be an image and y a piece of text that describes it. Or x could be a sentence in one language, and y a sentence in another language. X could be an entire text, and y could be a simplified version of that text, or an abstract. So it could be anything, really; I'm not necessarily specifying here.

The motivation comes from the fact that there are two issues with feed-forward models — whether they're neural nets or something else doesn't matter. A classical model proceeds by doing a finite, fixed number of calculations to produce an output. If you have a multi-layer net, there is a fixed number of layers. Even if you have a recurrent net, there's some sort of limit to how many times you can unfold it. So it's basically a fixed amount of computation. But there are two situations for which this is not entirely appropriate. The first situation is when computing the output requires some more complex calculation than just a bunch of weighted sums and nonlinearities in finite number — when the inference is complex. And the second situation is when we're trying to train the machine to produce not a single output, but a possible set of outputs.

So in the case of classification, we're actually training a machine to produce multiple outputs. We are training it to produce a separate score for every possible category that we have in our system. Ideally, the system would produce the best score for the correct class and infinitesimal scores for the other ones. In practice, when we run the output of a neural net through a softmax, it produces scores, and we just pick the one that has the highest score. But basically, what we are telling the machine to do is produce a score for every category, and then we'll pick the best.

Now, this is not possible when the output is continuous and high-dimensional. Say the output is an image: we don't have softmaxes over images. We don't have a way of listing all possible images and then normalizing a distribution over them, because it's a high-dimensional and continuous space. Even if it were a low-dimensional continuous space, it would not be possible.
We would have to bin that continuous space into discrete bins and then do a softmax over that, but that doesn't work very well; it only works in low dimension. So when we have a high-dimensional continuous space, we can't do a softmax. We can't ask the system to give us a score for all possible outputs. Similarly, even if the output is discrete but potentially infinite — things like producing text. Text is compositional, and there is a very, very large number of possible texts of a given length. We can't just do a softmax over all possible texts; same problem. So how do we represent a distribution, or a bunch of scores, over all possible texts in a compact form? That's where energy-based models come in — or probabilistic models, for that matter, but energy-based models in particular.

The solution that energy-based models give us is the idea that we're going to use an implicit function. In other words, we're not going to ask our system to produce a Y. We're just going to ask it to tell us whether an X and a particular Y we show it are compatible with each other. So: is this text a good translation of that text? That sounds kind of weak, right? Because how are we going to come up with that text that our machine is comparing? But let's hold that thought for a bit.

So we're going to name this function f of X, Y. It's going to take an X and a Y, and it's going to tell us if those two values are compatible with each other or not. Is Y a good label for the image X? Is Y a good high-resolution version of this low-resolution image? Is Y a good translation of that sentence in German? Et cetera. And so the inference procedure now is going to be: given an X, find a Y for which f of X, Y produces a low value. In other words, find a Y that's compatible with X. So: search over possible Ys for a value of Y that produces a low value of f of X, Y.

This idea of inference by minimizing some function — pretty much every model, probabilistic, non-probabilistic, whatever, that people have thought about works this way. Even multi-class classification with neural nets implicitly works by energy minimization: by finding the class that has the best score, which you can think of as the lowest energy. So basically, we're going to try to find an output that satisfies a bunch of constraints, and those constraints are implemented by this function f of X, Y. And if you've heard of graphical models, Bayesian networks, all that stuff — or even classical AI or SAT problems — they can basically all be formulated in those terms: finding the value of a set of variables that will minimize some function that measures their compatibility.

We're not talking about learning right now; we're just talking about inference. We're assuming this function f of X, Y is given to you. We're going to talk about how we learn it a little later. OK, so this energy function is not what we minimize during learning; it's what we minimize during inference. Inference is computing Y from X. This energy function is scalar-valued. It takes low values when Y is compatible with X and higher values when Y is not compatible with X. So you'd like this function to have a shape such that, for a given X, all the values of Y that are compatible with this X have low energy, and all the values that are not compatible with that given X have higher energy.
And that's all you need, because then the inference procedure is going to find the ŷ written here, which is the value of Y that minimizes f of X, Y. It's not going to be *the* value; it's going to be *a* value, because there might be multiple values. And your inference algorithm might actually go through multiple values, or examine multiple values, before giving you one or several.

OK, let's take a very simple example in one dimension, with scalar variables. So X here is a real value, and Y is a real value, and the blue dots here are data points. If you want to capture the dependency between X and Y in the data, you would like an energy function that has either this shape or that shape, or some other shape, but whose shape is such that if you take a particular value of X, the values of Y with low energy are near the blue dots, which are the data points. A function like this captures the dependency between X and Y.

Now, to do the inference of what is the best Y for a given X, if you have a function like this, you can use gradient descent. So if I give you an X, to figure out the best value of Y that corresponds to this X, you can start from some random Y, and then by gradient descent find the minimum of the function, and you'll fall down to the blue dots here. It might be a little harder for this one, but from the point of view of characterizing the dependency between those two variables, those two energy functions are just about as good as each other. I'll come back to this. The discrete case, when Y is discrete, is the easy case. We've already talked about this, and I'll reformulate it in terms of energy in just a couple of minutes.

So a feed-forward model is an explicit function, in that it computes the prediction Y from X, but it can only make one prediction. We can cheat in the case of a discrete variable by putting out multiple outputs, which correspond to a score for every possible classification. But you can't use this trick for high-dimensional continuous values or compositional values, as I said earlier. An energy-based model is really an implicit function. Remember implicit functions from calculus: you want the equation of a circle in terms of X and Y. You can't write Y as a function of X, but you can write an equation that says X squared plus Y squared equals 1, and that gives you the unit circle. So X squared plus Y squared minus 1 is an implicit function, and when you solve it equal to 0, you get the circle.

Here's another example. Again, scalar values for X and Y, and the black dots here are data points. For the value of X indicated by the red bar, there are multiple values of Y that are compatible, and some of them actually form a continuum of values. So what we'd like our energy function to be is something that looks like this — here I'm drawing the level sets of that energy function. It takes low energy on the data points and higher energy outside. This is a slightly more complicated version of the little 3D models that I showed earlier. And the question is: how do we train a system so that the energy function it computes actually has the proper shape? It's nice, when Y is continuous, for F to be smooth and differentiable, so that we can use gradient-based inference algorithms.
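To make that inference procedure concrete, here is a minimal sketch in PyTorch of gradient-based inference, under the assumption that the energy function f is some already-trained network (the tiny architecture and the names here are illustrative, not from the lecture): the weights are left alone, and gradient descent is done on y itself.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a trained energy function f(x, y) -> scalar.
f = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def infer_y(x, steps=100, lr=0.1):
    y = torch.randn(1, requires_grad=True)         # start from a random y
    opt = torch.optim.SGD([y], lr=lr)              # only y gets updated
    for _ in range(steps):
        energy = f(torch.cat([x, y])).squeeze()    # scalar f(x, y)
        opt.zero_grad()
        energy.backward()                          # gradient of f w.r.t. y
        opt.step()                                 # fall down the energy
    return y.detach()                              # a low-energy y, not "the" y

y_hat = infer_y(torch.tensor([0.3]))
```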
So if we have a function like this and I give you a point X, Y, you can, through gradient descent, find the point on the data manifold that is closest to it, or something similar. If I give you a value for X, you can search by gradient descent along the Y direction for a value that minimizes the energy. So that's the inference algorithm — well, it's not an algorithm, it's really a prescription; the algorithm is how you do this minimization. And for that, there are all kinds of different methods. Gradient-based methods are one of them. But in the case where F is complicated, it may not be possible to rely on gradient-based search methods, so you may have to use other tricks. In most cases, though, it simplifies.

Just as an aside, for those of you who know what graphical models are: a graphical model is basically an energy-based model where the energy function decomposes as a sum of energy terms, and each energy term takes into account a subset of the variables that you're dealing with. So there will be a collection of Fs; some Fs would take a subset of the Ys, some Fs would take a subset of the Xs and Ys, et cetera. And if they organize in a particular form, then there are efficient inference algorithms to find the minimum of the sum of those terms with respect to the variable you're interested in inferring. This is what belief propagation and all those algorithms do in graphical models. This is an aside — if you don't know what I'm talking about, it doesn't matter.

So as I said, the situations where you might want to use this are when inference is more complex than just running through a few layers of a neural net: when the output is high-dimensional and has structure, like a sequence or an image or a sequence of images, which is a video; when the output has compositional structure, whether it's text, action sequences, things like that; or when the output should result from a long chain of reasoning. So it's not just "I can compute the output" — you need to solve a constraint satisfaction problem to produce the output, or do long chains of reasoning.

OK, there's a particular type of energy-based model — and this is really where they start becoming interesting — which is energy-based models that involve latent variables. A latent-variable EBM, in this case, would depend not just on the variable that you observe, X, and the variable you want to predict, Y, but also on some extra variable, Z, that nobody tells you the value of. And the way you use this latent variable is that you build your model in such a way that if you knew the value of this latent variable, the inference problem would become easier.

So let's say you want to do handwriting recognition. I like this example, which I told you about already. If you know where the characters are, reading this word becomes much easier. The main problem here in reading this word is not just to read the individual characters, but to actually figure out where the characters are — where one character ends and the next one begins. If I were to tell you that, it would be much easier for you to read the word. In fact, we're quite good at this: if you read this sequence of characters here in English, and you understand English, you can probably parse it.
You can probably figure out where the word boundaries are, because you have this high-level knowledge of what the words are in English. If I do the same thing to you in French, you have no idea where the word boundaries are — unless you speak French. So the word boundaries in this case, and the character boundaries on top, would be useful latent variables for solving the problem. They would allow you, for example, in the case of character recognition, to have individual character recognizers applied to each character — but you don't know where the characters are. So how do you solve that problem? That's where a latent variable would be useful.

For speech recognition, the problem is that you don't know where the boundaries between the words are. You don't know where the boundaries between the phonemes are either. Speech is very much like this: continuous. We can parse the words because we know where the words are, because we understand the language. But with someone speaking a language you don't understand, you have only a very faint idea of where the word boundaries are; most of the time you can't tell. In English it's kind of easy, because there's stress on the words: if you can figure out where the stress is, you can probably figure out more or less where the word boundaries are. In French, where there is no stress, you have no way of figuring it out. (In French:) I can say a long sentence in French, and you have no idea where the boundaries between the words are. It's a continuous string of phonemes, and it's very hard to tell where the word boundaries are unless you know the language. So that would be a useful latent variable to have, because if someone told you where those boundaries were, you would be able to do the task.

So that's how you would use latent variables. And this way of using latent variables has been used for decades in the context of speech recognition, natural language processing, character recognition (OCR, as I said), and in a number of other applications — particularly ones that involve sequences, but also in computer vision. Things like: you want to detect where a person is, but you don't know how that person is dressed, or what position that person is in, things like this. Those are variables that, if you knew them, would help you solve the task. Although nowadays vision just works.

OK. So if you have a latent-variable model, this is how you do inference. You have a new energy function now — it's called E, not F: E of x, y, z. And to do inference, you simultaneously minimize it with respect to z and y. You ask the system: give me the combination of values of y and z that minimizes this energy function. I actually don't care about the value of z; I only care about the value of y. But I have to do this simultaneous minimization. I'll give you some more concrete examples a little later.

In fact, that's equivalent to defining a new energy function F, which I call F infinity here, that only depends on x and y: F infinity of x, y is the min over z of E of x, y, z. You take a function of x, y, z; you find the minimum of this function over z; z gets eliminated; and you get a function of x and y. In practice, you never do this explicitly; in practice, you minimize with respect to z and y simultaneously, because we don't know how to represent the function.
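As a minimal sketch of what this looks like in code (again with a tiny illustrative network standing in for E, not anything from the lecture), inference in a latent-variable EBM just adds z to the list of things we do gradient descent on — F infinity is never constructed explicitly:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a trained energy function E(x, y, z) -> scalar.
E = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))

def infer_y_z(x, steps=100, lr=0.1):
    y = torch.randn(1, requires_grad=True)
    z = torch.randn(1, requires_grad=True)
    opt = torch.optim.SGD([y, z], lr=lr)           # descend on y AND z
    for _ in range(steps):
        energy = E(torch.cat([x, y, z])).squeeze()
        opt.zero_grad()
        energy.backward()
        opt.step()
    return y.detach()                              # we only care about y

y_hat = infer_y_z(torch.tensor([0.3]))
```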
But there is an alternative to this, which is to define F here — which I write F sub beta of x, y — as minus 1 over beta, log of the sum (or integral) over z of e to the minus beta E of x, y, z. Now, with a little bit of computation, you will see that if you make beta go to infinity, this F beta converges to F infinity, which is why I call it F infinity. I went through this exercise a little earlier in the class. In this integral over z, if beta is very large, the only term that is going to matter is the term E of x, y, z with the lowest value over all possible values of z, because all the other ones are going to be much bigger — beta is very, very large, so their contribution to the exponential is not really going to count. And if you have only one term in there, which is E of x, y, z for the value of z that produces the smallest value, then the log cancels the exponential, the minus 1 over beta cancels the minus beta, and you're left with just min over z of E of x, y, z. That's the limit you see above.

So if I define F of x, y in this way, then I'm back to the previous problem of just minimizing F of x, y with respect to y for doing inference. Having a latent-variable model doesn't make much of a difference: you have an extra minimization with respect to the latent variable to do, but other than that, it's fine.

There is also a big advantage to allowing latent variables, which is that by varying the latent variable over a set, I can make the output — the prediction of the system — vary over a set as well. So here is a particular architecture, where x goes into what I call a predictor, which is some sort of neural net that produces some feature representation of x. And then h and z, the latent variable, go into what I call here a decoder, which produces a prediction, y bar — a prediction for the variable y, the one we want to predict. And our energy function here just compares y bar and y; it's simply the distance between them. You're familiar with this kind of diagram; we talked about them earlier. So if I choose to vary z over a set — let's say a two-dimensional square, symbolized by this gray diagram — then the prediction y bar is going to vary over a set as well, in this case some sort of two-dimensional ribbon. And what that allows me to do is have a machine that can produce multiple outputs. By varying the latent variable, I can have this machine produce multiple outputs, not just one. And that's crucially important.

Right, so let's say you're trying to do video prediction. There are many reasons why you might want to do video prediction. One good reason is to build a very good video compression system, for example. Another good reason: the video you're trying to predict is the video you are looking at through your windshield when you're driving a car, and you'd like to be able to predict what the cars around you are going to do. This is what Alfredo was working on. So it's very useful to be able to predict what's going to happen before it happens. In fact, that's the essence of intelligence, really — the ability to predict. Now, you're looking at me right now as I'm talking. You have some idea of the word that is going to come out of my mouth in a few seconds.
You have some idea of what gesture I'm going to make, some idea of what direction I'm going to move in — but not a precise idea. So if you train a neural net to make a single prediction for what I'm going to look like two seconds from now, there's no way you can make an accurate prediction. If you train a convolutional net or something to predict the view of me here with least squares, the best the system can do is produce a blurry image of me, because it doesn't know if I'm going to move left or right, and it doesn't know if my hands are going to be like this or like that. So it's going to produce the average of all the possible outcomes, and that's going to be a blurry image.

OK. So it's very important that your predictor, whatever it is, be able to deal with uncertainty and be able to make multiple predictions. And the way to parameterize the set of predictions is through a latent variable. We're not yet talking about distributions or probabilistic modeling; this is way before that.

[A question from the audience.] Say again? Well, so Z is not a parameter. It's not a weight. It's a value that changes for every sample. So basically, during training — we haven't talked about training yet, but during training — I give you an X and a Y. You find a Z that minimizes the energy function with the current values of the parameters of those neural nets: your best guess for what the value of Z is. And then you feed that to some loss function that you're going to minimize with respect to the parameters of the network. The loss function is not necessarily the energy; it might be something else. In fact, most of the time it's something else. So in that sense, you don't learn Z — you infer Z. You don't want to use the term "learn", because learning means you have one value of the variable you learn for a whole training set. Here, Z takes a different value for every sample in your training set, or every sample in your test set, for that matter. So they're not learned in that sense; they're inferred.

Yeah, another example of this is translation. Language translation is a big problem, because there is no single correct translation of a piece of text from one language to another. Usually, there are a lot of different ways to express the same idea, and why would you pick one over the other? So it might be nice if there were some way of parameterizing all the possible translations that a system could produce for a given text. For a sentence in German that you want to translate into English, there could be multiple translations in English that are all correct. And by varying some latent variable, you might vary the translation that is produced.

OK, so now let's connect this with probabilistic modeling. There is a way of turning energies — which you can think of as kind of negative scores, if you want, because low energy is good and high energy is bad — into probabilities. And the way to turn energies into probabilities, we talked about this already a little bit, is to use what's called the Gibbs-Boltzmann distribution. The form of this goes back to classical physics in the 19th century. P of y given x is the exponential of minus beta — where beta is some constant — times the energy of x and y. That turns all those energies into positive numbers: you take the exponential of a number, it makes it positive.
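Written out, the distribution being described is:

```latex
P(y \mid x) \;=\; \frac{e^{-\beta F(x,y)}}{\int_{y'} e^{-\beta F(x,y')}\, dy'}
```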
And the minus sign is there to turn low energies into high probabilities and vice versa. I'm using this convention because it's what physicists have been using for the last century or more — a century and a half. So by taking exponentials, you turn the energies into positive numbers, and then you normalize. You normalize in such a way that P of y given x is a properly normalized distribution over y. To make it a properly normalized distribution over y, you divide by the integral over y (or the sum, if y is discrete) of e to the minus beta F of x, y — which is the same thing as the top, except you integrate over all possible values of y. Now, if you compute the integral of this over y, it's equal to 1, because you get the same integral on top as at the bottom, which is a constant, and you get 1. So that confirms that this satisfies the axioms of probability distributions: it has to be positive and integrate to 1. There are many ways to turn a function into a positive function that integrates to 1; this one has interesting properties, which I'm not going to go through, but it corresponds to the so-called maximum entropy distribution.

The beta parameter is kind of arbitrary. It's how you calibrate your probabilities as a function of your energy. The larger the beta, the more binary your probabilities will be for a given energy function: if beta is very, very large, basically only the y that produces the lowest energy will have high probability, and everything else will have very low probability. For small beta, you get a smoother distribution. Beta, in physics terms, is akin to an inverse temperature: beta going to infinity corresponds to zero temperature.

OK, a little bit of math — it's not that scary — to show you where the formula for F beta that I talked about earlier comes from. Let's go through this slowly. The joint probability P of y and z given x: I apply the same Gibbs-Boltzmann distribution formula as before, except now it's a joint distribution over y and z instead of just a distribution over y. This is for a latent-variable model. So it's e to the minus beta times the energy of x, y, z; and then to normalize, I have to integrate in the denominator with respect to y and z, so that I get a normalized distribution over the joint domain of y and z. That's the formula at the top left. Then I can marginalize over z: if I integrate P of y and z given x over z, I get just P of y given x. That's the marginalization formula; it's at the top right. And so P of y given x is simply the integral over z of the formula at the top left, which is written in the second line: at the top, we have the integral over z of e to the minus beta E of x, y, z; and at the bottom, the integral over y and z of e to the minus beta E of x, y, z.

OK, now I'm going to do something very sneaky and stupid, which is that I'm going to take the log of this formula, multiply by minus 1 over beta, then multiply by minus beta, and then take the exponential. All of those things cancel out: the log cancels the exponential, and the minus 1 over beta cancels the minus beta. So I haven't done anything by writing this e to the minus beta times minus 1 over beta log — I've done nothing, because everything cancels. And I do the same at the bottom.
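For reference, here is the derivation being described, written out from the spoken description (the bracketed quantity in the last expression is the free energy defined just below):

```latex
P(y, z \mid x) = \frac{e^{-\beta E(x,y,z)}}{\int_y \int_z e^{-\beta E(x,y,z)}}
\qquad\qquad
P(y \mid x) = \int_z P(y, z \mid x)
```

```latex
P(y \mid x)
= \frac{\int_z e^{-\beta E(x,y,z)}}{\int_y \int_z e^{-\beta E(x,y,z)}}
= \frac{e^{-\beta \left[ -\frac{1}{\beta} \log \int_z e^{-\beta E(x,y,z)} \right]}}
       {\int_y e^{-\beta \left[ -\frac{1}{\beta} \log \int_z e^{-\beta E(x,y,z)} \right]}}
```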
And what I see now is that the stuff in the bracket is the formula I wrote previously: F beta of x, y equals minus 1 over beta, log of the integral over z of e to the minus beta E of x, y, z. And so I can rewrite this horrible, complicated formula as e to the minus beta F beta of x, y, divided by the integral over y of e to the minus beta F beta of x, y. What does this all mean? It means that if you have a latent-variable model and you want to eliminate the z variable — the latent variable — in a probabilistically correct way, you just redefine the energy F this way, as a function of E of x, y, z, and you're done. "You're done" is a little bit of a shortcut, because actually computing this can be very hard. It can be intractable; in fact, in most cases it probably is intractable. [A student points out a sign error on the slide.] I am missing a minus in the denominator — you're correct.

OK, so the last few slides were to say: you can have a latent variable that you minimize over inside your model, or you can have a latent variable that you marginalize over, which you do by defining this new energy function F this way. And minimizing corresponds to the infinite-beta limit of this formula. Either can be done. I mean, just look at the substitution in the second line — the last two terms in the second line. The bracket I replaced by F beta of x, y, because I just defined F beta of x, y this way. And if I define F beta of x, y this way, then P of y given x is just an application of the Gibbs formula, and z has been marginalized implicitly inside of it. Physicists call this a free energy, by the way, which is why I call it F. So E is the energy, and F is a free energy.

[A student asks:] I'm sort of getting confused — even in probabilistic models, you can have latent variables in generation. So what exactly is the difference? So the difference is: in probabilistic models, you basically don't have the choice of the objective function you're going to minimize. You have to stay true to the probabilistic framework, in the sense that every object you manipulate has to be a normalized distribution — which you may approximate using variational methods or whatever. Here, I'm saying: ultimately, what you want to do with those models is make decisions. If you build a system that drives a car, and the system tells you "I need to turn left with probability 0.8 or turn right with probability 0.2," you're going to turn left. The fact that the probabilities are 0.2 and 0.8 doesn't matter; what you want is to make the decision that is the best, because you have to make a decision. So probabilities are completely useless if what you want is to make decisions.

If you want to combine the output of an automated system with another one — for example, a human, or some other system — and those systems haven't been trained together but separately, then what you want is calibrated scores, so that you can combine the scores of the two systems to make a good decision. And there is only one way to calibrate scores, which is to turn them into probabilities; all other ways are either inferior or equivalent. But if you're going to train the system end-to-end to make decisions, then no — whatever scoring function you use is fine, as long as it gives the best score to the best decision. That gives you way more choices in how you build the model, way more choices in how you train it, what objective function you use. Basically, if you insist that your model be probabilistic, you have to do maximum likelihood.
So basically, you have to train your model in such a way that the probability it gives to the data you observe is maximum. The problem is that this can only be proven to work in the case where your model is correct — and your model is never correct. There's this famous quip by the statistician Box: "All models are wrong, but some are useful." Probabilistic models — particularly probabilistic models in high-dimensional spaces, and probabilistic models in combinatorial situations like text — are all approximate models. They're all wrong, in a way. And if you try to normalize them, you make them more wrong. So you're better off not normalizing them.

There's another point that's actually more important. I come back to this little diagram — and this one. This is meant to be an energy function that captures the dependency between X and Y. It's like a mountain range, if you want: the valleys are where the black dots are — those are the data points — and there are mountains all around. Now, suppose you try to fit a probabilistic model to this, and imagine that the points are actually on an infinitely thin manifold. So the data distribution for the black dots is actually just a line — one line, two lines, three lines — but they're lines; they don't have any width, if you want. If you fit a probabilistic model on this, your density model should tell you that when you are on this manifold, the density is infinite, and just epsilon outside of it, it should be zero. That would be the correct model of this distribution, if it's a thin plate. And not only should the output be infinite, but the integral of it should be one. That's very difficult to implement on a computer. Not only that, it's basically impossible: say you want to compute this function through some sort of neural net; your neural net would have to have infinite weights, calibrated in such a way that the integral of the outputs of that system over the entire domain is one. It's basically impossible. You cannot have an accurate probabilistic model here — the accurate, correct probabilistic model for this particular data I just described is impossible. Yet this is what maximum likelihood will want you to produce, and there's no computer in the world that can compute it.

In fact, it's not even interesting. Imagine you had a perfect density model for the density I just mentioned, which is a thin plate in that x, y space. You couldn't do inference. If I give you a value of x and ask you what's the best value of y, you wouldn't be able to find it, because all values of y, except a set of measure zero, have probability zero — and there are just a few possible values. For example, for this value of x, there are three values that are possible, and they are infinitely narrow, so you wouldn't be able to find them. There's no inference algorithm that will allow you to find them, because they're just Dirac functions — how do you find them? The only way you can find them is if you make your energy function smooth and differentiable; then you can start from any point, and by gradient descent you can find a good value of y for any value of x. But this is not going to be a good probability model of the distribution, if the distribution is of the type I mentioned. So here is a case where insisting on having a good probability model is actually bad.
Maximum likelihood sucks. Now, if you are a Bayesian, you say: but you can correct this by having a strong prior — a prior that says your density function has to be smooth. And you can think of this as a prior. But take everything you do in Bayesian terms, take the logarithm of it, forget about normalization, and you get energy-based models. Energy-based models that have a regularizer, additive to the energy function, are completely equivalent to Bayesian models where the likelihood is the exponential of minus the energy and the prior is the exponential of minus the regularizer: you get the exponential of one term (the energy) times the exponential of the regularizer, which is equal to the exponential of (energy plus regularizer); and if you remove the exponential, you have an energy-based model with an additive regularizer. So there is a correspondence between probabilistic and Bayesian methods and energy-based methods. But insisting on maximum likelihood is sometimes bad for you, particularly in high-dimensional spaces or combinatorial spaces, where your probabilistic model is very wrong. For discrete distributions it's not very wrong — it's okay — but in the continuous case it's really wrong. And all models are wrong.

So, there is a form of learning — and I'll come back to this at length in future lectures — called self-supervised learning. It really encompasses self-supervised learning, but also what people used to call unsupervised learning, and a lot of other things. I think the future of machine learning is in self-supervised learning, and you're starting to see this: over the last year and a half, there's been enormous progress in NLP because of systems like BERT, and those systems are trained using a particular form of self-supervised learning, which we'll talk about. There's also been quite a bit of progress over the last three months in using self-supervised learning to train vision systems — to learn features using a self-supervised pretext task. The purpose of self-supervised learning is to train a system to learn good representations of the input, so that you can subsequently use those representations as input for a supervised task, or a reinforcement learning task, or whatever. The thing is, there is a lot more information that the system can use in the context of self-supervised learning.

So let me tell you what I mean by self-supervised learning. Self-supervised learning is this: someone gives you a chunk of data, and you're going to train a system to predict a piece of that data given another piece of that data. So for example, I give you a piece of video, and I ask you: use the first half of the video, and train a model to predict the second half of that video. Why would that be good? Why would that be good in the context of learning features for vision systems, for example?
If I train myself to predict what the world is going to look like — what my view of this room will look like if I shift my head a little bit to the left — what explains how the view changes is that every point in space has a depth, a distance from my eyes. If I somehow infer that every point has a distance from my eyes, then I can very simply explain how the world changes when I move, because things that are closer have more parallax motion than things that are far away, and you get this sort of perspective distortion. So there is this idea that if I train a system to predict what the world is going to look like when I move a camera, the system will implicitly learn about depth. You will not have to train it to predict depth in a supervised fashion; it will have to internally discover that there is such a thing as depth if it wants to do a good job at that prediction. Which means you don't have to hardwire into the system that the world is three-dimensional; it's going to learn this just by predicting how your view of the world changes when you move the camera.

Now, once the system has figured out that every point in the world has a depth, the notion that there are distinct objects in front of a background immediately pops up, because objects are things that move differently from the things behind them. Then there is another thing that pops up, which is the fact that objects that are not visible — hidden by another one — are still there; it's just that you don't see them, because they are behind. This concept that objects still exist when you don't see them is not completely obvious. Babies learn this really, really early — it's not known exactly when, because it's hard to measure when they are very little, but they probably learn it very quickly. Once you have identified this concept of objects, perhaps you'll figure out that a lot of objects in the world don't move spontaneously — so there are inanimate objects — and then there are objects whose trajectories are not entirely predictable, and those are animate objects; or other types of objects that move in not entirely predictable ways, like the waves on water or the leaves of a tree, but are not necessarily animate. And after a while, you also realize that objects that have predictable trajectories generally don't float in the air: if they are not supported, they fall. So you can start learning about physics — about gravity, about inertia. Babies learn this around the age of nine months, so this is not something you're born with. Around nine months, as a baby, you learn that gravity is a thing; before that, you don't know.

So the motivation for self-supervised learning — and this is one reason I think self-supervised learning is really the future of machine learning, certainly, and the future of AI — is that animals and humans seem to learn an enormous amount of background knowledge about the world just by observation, by basically training themselves to predict. So one big question in AI — in fact, the question I almost exclusively work on — is: how do we do this? We haven't found a complete answer yet.

Right, so I give you a piece of data — let's say a video — and the machine is going to pretend there is a piece of that data that it doesn't see and a piece that it does see, and it's going to try to predict the piece it doesn't see from the piece it sees. So: predict future frames in a video; predict missing words in a sentence. I give you a sentence, I block some of the words, and the system trains itself to predict the words that are missing.
Or I show you a video and I block a piece of the image for some of the frames. Or: predict the left half from the right half. Right now you only see my right side, but even if you've never seen my left side, you could more or less predict what I look like from the other side — most people are more or less symmetric, except scary Hollywood characters.

So, one instance where self-supervised learning has been unbelievably successful — and it only happened over the last year and a half — is text. Text uses a particular type of self-supervised learning called the denoising autoencoder. You take a piece of text, you remove some of the words — typically 10, 15, 20 percent of the words — so you replace the token that indicates a word by, basically, a blank. And then you train some giant neural net to predict the words that are missing. The system cannot make an exact prediction of which words are missing, so you train it as a classifier, producing a big softmax vector for each word, which corresponds to a probability distribution over words. And once you've trained this system, you chop off the last layer, and you use the second-to-last layer as a representation of any text you feed it. There is a particular architecture of this network that makes it work well — those transformer networks that we talked about last week a little bit — but it's irrelevant to the point we're making right now. It's a very simple completion task, filling in the blanks: take a sentence, remove some of the words, train the system to predict the words that are missing. And that works amazingly well. All the top NLP systems now, the ones that have the best performance on all the benchmarks, are basically pre-trained using a method like this. And the cool thing about it is that you have as much text as you want on the web to pre-train those systems: you don't need to label anything, so it's very cheap. It's very expensive in terms of computation, because those networks need to be enormous for this to work well, but it works really well.

So immediately, people tried to translate that success into a similar success for images. Take an image, block out some pieces of it, and train some convolutional net or something to predict the missing pieces of the image. And the results have been extremely disappointing. It doesn't really work. I mean, it works in the sense that the images get completed with things that make sense; but if you then use the internal representation learned this way as input to a computer vision system, you can't beat a computer vision system that has been pre-trained supervised on ImageNet. So what's the difference? Why does it work for NLP and not for images? The difference is that NLP is discrete, whereas images are continuous. People have also tried to do this for video — same idea as BERT, except you replace words with video frames. So you feed a big video to a transformer-like system or something similar, remove some of the frames or blocks of frames, and train the system to predict the missing frames. And the features you get are not so great. So that's the difference: things seem to work in the discrete world, and they don't seem to work in the continuous world. And the reason is that in the discrete world, we know how to represent uncertainty with a big softmax vector over words; in continuous spaces, we don't. If I want to train a system to do video prediction, I don't know how to represent a probability distribution over multiple video frames.
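Here is a minimal sketch of that fill-in-the-blanks (denoising autoencoder) objective. The tiny model, vocabulary size, and mask rate are illustrative stand-ins for a giant transformer like BERT, not the actual recipe:

```python
import torch
import torch.nn as nn

vocab_size, dim, mask_rate = 1000, 64, 0.15
MASK = 0                                          # reserved "blank" token

embed = nn.Embedding(vocab_size, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2)
to_vocab = nn.Linear(dim, vocab_size)             # big softmax over words

def masked_lm_loss(tokens):                       # tokens: (batch, seq)
    mask = torch.rand(tokens.shape) < mask_rate   # pick ~15% of the words
    corrupted = tokens.masked_fill(mask, MASK)    # replace them by a blank
    logits = to_vocab(encoder(embed(corrupted)))  # (batch, seq, vocab)
    # Train as a classifier: predict the missing words at masked positions.
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

loss = masked_lm_loss(torch.randint(1, vocab_size, (8, 32)))
loss.backward()
```

After training, you would chop off `to_vocab` and use the encoder's output as the text representation, as described above.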
So here is another reason why we might want to use self-supervised learning and deal with uncertainty — and again, this is what Alfredo is working on, among others. It's the fact that we'd like our machines to be able to reason about the world, to predict what's going to happen. I gave you an example before: to build a machine that drives a car, it's probably a good idea to be able to predict what the other cars are going to do, and to be able to predict what your own car is going to do. If you're driving near a cliff and you turn the wheel to the right, you want to predict in advance that your car is going to run off the cliff — and if you can predict that, you're not going to do it. So if you have a good predictive model of the world — a system that will predict the next state of the world given the current state of the world and the action you take — then you can act intelligently. Well, you need other components to act intelligently, but I'll come back to that. Again, this ability to predict is the essence of intelligence, really. The fact that some animals are intelligent is because they have a much better model of the world, and as a consequence they are better at acting on this world to get the result they want.

The problem with the world is that the world is not deterministic — or, I mean, maybe it is deterministic, but we can't predict exactly what's going to happen, so whether it is deterministic or not is irrelevant. Our brains have a limited capacity, our computers have a limited capacity, and we can't exactly predict what's going to happen. So we need to be able to train our systems — our brains, our AI systems — to predict in the presence of uncertainty. And that's the most difficult problem we need to solve today to make significant progress in AI: how to train a system to make high-dimensional predictions under uncertainty, and deal with that uncertainty. And as I said before, probabilistic models are basically hopeless there.

OK, so let's take an example with video prediction. Here are four frames — what's the continuation of those frames? It's a little hard to see, but the little girl is about to blow on her birthday cake. If you train a neural net with least squares to make predictions — you train it on thousands of videos of this type, if not millions — this is the kind of prediction you get: very blurry. It can't tell exactly what's going to happen, so it predicts the average of all the possible futures, which is the best way to minimize the squared error.

If you want a simplified version of this, let's say your entire training set consists of someone putting a pen on the table and letting it go. The person always puts the pen at exactly the same place, in the same way, but every time you do the experiment, the pen falls in a different direction. So basically, X is the same for every training sample, but Y is different, because the pen can fall in any direction, probably with a uniform distribution. If you train a neural net to predict with least squares, you'll get the average of all the possible predictions, which is a transparent pen smeared all around the circle — not a good prediction. That's why you need latent-variable models. So you make a prediction with the system, but you have latent variables which capture what you don't know about the world. X is what you know about the world — here, the initial segment of the video of someone putting down a pen. You know that when the person lifts their finger, the pen will fall, but you don't know in which direction.
So what you want from the predictor here, X to H, is that H should be a representation that tells you the pen is going to be on the table — but it can't tell you in which direction. And then Z will be the complementary variable: here, the direction in which the pen actually fell. The combination of those two pieces of information — the stuff you can extract from the observation and the stuff you cannot — gives you the prediction Y bar, which hopefully is close to what actually occurs.

So here's the way you use something like this. If you want to use it to rate a particular scenario, you give it X, you give it Y, and then you ask it: what's the value of the Z variable that minimizes the prediction error in my model? The resulting prediction error is the energy, and that's how your model rates the compatibility between X and Y. If you want to predict Ys, what you do is observe X, and then you dream up a value of Z within a certain domain, and that produces a Y bar; then you dream up another value of Z, and that produces another Y bar. You can produce a whole set of Y bars by drawing multiple values of Z within a set, or within a distribution. Yes?

[A student asks about looking further into the past.] Well, so if what you're predicting are the future frames and what you're observing are the past and current frames — you mean increasing the number of past frames you're looking at? That helps a little bit, but after a while, things are going to happen that really don't depend on the past. I mean, the information about what's going to happen in the future is simply not present in the past frames. So in this particular case, there would be variables that are necessary to make a good prediction, but the information is not present in X.

[Another question.] OK, so the question was: what is the role of Z, really? Does it implement a constraint between X and Y, or something else? In the examples I showed — I showed several — one example was character recognition: if you knew where the characters are, the task of recognizing the characters would be easier. So by making the inference about where the characters are, you sort of help your system; you build the system in such a way that it can use that information. In this particular case here, it's different: here, the role of the latent variable is basically to parameterize the set of possible outputs that can occur. In the end, what you want is for Z to contain the information about Y that is not present in X. So the information about where I'm going to move next — whether I'm going to move left or right — is not present in anything you can observe right now; it's inside my brain. And yes — right now I'm not assuming anything other than: Pred of X is a big neural net, and Dec of H and Z is a big neural net.

So this is a sort of example visualization of an energy landscape, where we've trained a system to compute an energy function — here it's not a neural net, it's a very simple thing, actually — to capture the dependency between two variables X and Y. The data points are along this little spiral here; they're sampled roughly uniformly along the spiral. And we train a system to give low energy to those points and high energy to everything else.

Now, there are two forms of energy-based models. This is sort of what you could call a conditional energy-based model, where there are two sets of variables, X and Y, and you're trying to predict Y from X (there's a small sketch of this below).
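Here is a minimal sketch of the conditional latent-variable architecture and the two uses just described — rating a pair (X, Y) by inferring Z, and predicting by dreaming up values of Z. The names Pred and Dec follow the diagram; the sizes and networks are illustrative assumptions:

```python
import torch
import torch.nn as nn

pred = nn.Sequential(nn.Linear(4, 32), nn.Tanh())              # x -> h
dec = nn.Sequential(nn.Linear(32 + 2, 32), nn.Tanh(), nn.Linear(32, 4))

def energy(x, y, z):
    y_bar = dec(torch.cat([pred(x), z]))          # prediction y_bar
    return ((y - y_bar) ** 2).sum()               # distance between y_bar and y

def rate(x, y, steps=100, lr=0.1):
    # Rating a scenario: find the z that minimizes the prediction error;
    # the resulting error is the energy of the pair (x, y).
    z = torch.zeros(2, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        e = energy(x, y, z)
        opt.zero_grad()
        e.backward()
        opt.step()
    return energy(x, y, z).item()

# Predicting: dream up several values of z; each one gives a different y_bar.
x = torch.randn(4)
predictions = [dec(torch.cat([pred(x), torch.randn(2)])) for _ in range(10)]
```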
But there's also another form of energy-based model, which is unconditional: there's only a Y, no X. You're trying to capture the mutual dependencies between the various components of Y — the distribution of Y, if you want — but there's no X. This is something you would want to use if, say, you want to do image generation unconditionally, or you just want to model the mutual dependencies between things, but you don't know ahead of time whether you're going to be able to observe Y1 or Y2 or neither of them. The math is the same, really.

OK, so how are we going to train those energy functions? This is really where things become interesting: the question of training. Training should do something like the little animation at the top here. It should shape the energy function — because our machine now computes an energy function of X and Y — in such a way that the data points have lower energy than everything else. That's the way inference is going to work: if the correct value of Y has lower energy than the incorrect values of Y, then our inference algorithm, which finds the value of Y that produces the lowest energy, is going to work. So we need to shape the energy function so that it gives low energy to the good Ys for a given X, and high energy to the bad Ys for that given X.

[Questions from the audience — the energy function goes from any domain you want to a scalar, to R; and since the solution that minimizes it is no longer unique, is Z then sort of shaping the space around it to make it more identifiable?] Not necessarily. So this model actually is a latent-variable model — in fact, most of you are probably very familiar with the model used here: it's k-means. How is this produced? Let me delay that for a bit; but this is the energy surface of k-means, which is a latent-variable model. Let's keep the latent-variable thing aside for a minute; just think of this as: you have an energy F of X, Y, and the fact that there may be an underlying latent variable is really irrelevant for now.

OK, so there are two classes of methods to train energy-based models — and again, probabilistic methods are all special cases within those. One class is called contrastive methods, and the idea is very natural: take a training sample (xi, yi), and change the parameters of the energy function so that its energy goes down. Easy enough. Conversely, take other points, outside of the manifold of data — some process by which, for a given X, you pick a bad Y — and push that guy up. If you keep doing this, with a loss function that takes into account those different energies, then the energy function is going to take a shape such that the correct Ys have lower energy than the bad Ys. Keep pushing down on the good values of Y, keep pushing up on the bad values of Y: contrastive methods. They all differ by how you pick the Ys that you push up, and they all differ by the loss function you use to do this pushing up and pushing down.

There's a second category of methods, which I call architectural methods. In that case, you build the energy function f of X, Y so that the volume of low-energy regions is limited, or is minimized through regularization. You build the model in such a way that whenever you push down on the energy of data points, the rest goes up more or less automatically, because the volume of stuff that can take low energy is limited, or minimized through some regularizer. Those are very broad concepts. [A question.] Yes, that's one set of techniques, but there are many.
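As a minimal sketch of a contrastive method (the specific margin loss and all names here are illustrative assumptions, not the lecture's particular recipe): push down on the energy of a training pair (x, y_good), and push up on the energy of a contrastive pair (x, y_bad) as long as its energy is below a margin m.

```python
import torch
import torch.nn as nn

energy_net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.SGD(energy_net.parameters(), lr=0.01)
m = 1.0                                           # margin

def contrastive_step(x, y_good, y_bad):
    e_good = energy_net(torch.cat([x, y_good])).squeeze()
    e_bad = energy_net(torch.cat([x, y_bad])).squeeze()
    # Push down on e_good; push up on e_bad while it is below the margin.
    loss = e_good + torch.relu(m - e_bad)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Scalar x and y, as in the simple 1-D examples earlier:
contrastive_step(torch.tensor([0.5]), torch.tensor([0.2]), torch.tensor([-0.7]))
```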
So there is a set of methods — score matching, for example — that says the gradient of the energy at the samples should be zero, and the second derivative should be as large as possible; the trace of the Hessian should be large. So basically, you're telling it: make every data point a minimum of the energy, by making sure the energy curls up around each training sample. It's very, very hard to apply in practice, because you have to compute the gradient, with respect to the weights, of the trace of the Hessian of the energy function with respect to the inputs. It's doable for simple models, for linear models, but otherwise it's complete hell. I'd stay away from it.

OK, so there are a number of different strategies here. The slide says seven strategies, but I've reorganized this into those two categories — contrastive methods and architectural methods — with three sub-categories in contrastive and four sub-categories in architectural. There are names of various algorithms here that you might recognize, and others that you may not recognize, which is okay. And I'm going to try to go through some of them. Now, where do I need to go — okay, bear with me for just one second.

OK, so: C1, contrastive sub-category one. Push down the energy of data points, push up everywhere else. This is what maximum likelihood does — and maximum likelihood pushes down the energy of data points towards minus infinity and pushes up the energy of other points towards plus infinity, which is the problem we were just talking about earlier. So here is what happens. You have this Boltzmann distribution that gives the likelihood of y given x: for a particular data point (xi, yi), it gives you the probability that your model assigns to this particular value yi for a given xi. And it's the exponential of minus beta times the energy, divided by the exponential of minus beta times the energy, integrated over all ys.

So let's say you have a bunch of data points, and you want to maximize the probability your model gives to them — here I'm not writing x, because it doesn't matter. To maximize the probability your model gives to a particular value y, you want to make the energy of this y small, which means you want to make e to the minus beta E of y big, and you want to make the stuff at the bottom as small as possible. Instead of maximizing P of y, we're going to minimize minus log P of y. So, minus log P of y: if I take the log of this ratio, I get the difference of the two logs — log of e to the minus beta E of y, minus log of the integral over y of e to the minus beta E of y. I take the negative of this, because I want to minimize — negative log-probability is what I want to minimize — and I get the loss function here at the bottom. I divided everything by beta, which makes no difference as far as the minimum is concerned. So to go from the top formula to the bottom formula, you take minus the log of the top formula, and you divide by beta.

So now we have a loss function, and minimizing it says: make the energy of the data point y as low as possible — E of y should be small — and make the second term as small as possible, which means making the energies inside the exponential (with its minus sign) as large as possible. So the second term is going to push up on the energy of every point, including the data point.
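In symbols, the loss being described, and the gradient derived from it just below, are:

```latex
\mathcal{L}(y) \;=\; E(y) \;+\; \frac{1}{\beta}\,\log \int_{y'} e^{-\beta E(y')}\, dy'
```

```latex
\frac{\partial \mathcal{L}(y)}{\partial w}
\;=\; \frac{\partial E(y)}{\partial w}
\;-\; \int_{y'} P(y')\,\frac{\partial E(y')}{\partial w}\, dy',
\qquad
P(y') \;=\; \frac{e^{-\beta E(y')}}{\int e^{-\beta E}}
```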
In that second term, the probability that my model gives to each Y, which is given by the Gibbs distribution, is used as a coefficient to weigh the gradient of the energy function at that location. So this integral, the second term, is basically an expected value of the gradient: I compute the gradient of the energy function at every point, I weigh every point by the probability that my model gives to that particular Y, and I compute that weighted sum. If Y is discrete, this is a discrete sum; if Y is continuous, this is an integral. So now, if I use this in a stochastic gradient algorithm, the first term is going to try to make the energy of my data point as small as possible, and the second term is going to push up the energy of every single point, every Y. And then the question is: can I compute this at all, can I compute this integral? An enormous chunk of the publications in probabilistic modeling have to do with how you either compute this, estimate this, or approximate this, because that integral, in interesting cases, is intractable. If Y is a space of images, I cannot compute an integral over all possible images, except if the energy function, or the gradient of the energy function, is very, very simple. Most of the time it's not that simple: if you want a complex model that captures the dependencies of the world, it's going to be some big neural net, so this integral is going to be completely intractable.

Now, there is a little bit of salvation in the fact that this is an expected value. To compute an approximation of an expected value, I draw samples of Y from this distribution, the distribution that my model gives to Y, and I compute the average of the gradient over those samples; I get a finite approximation of it. That's called a Monte Carlo approximation, Monte Carlo methods, invented by physicists when they were trying to build the atom bomb in the 40s. There are other methods that are based on variational approximations: I don't really know how to compute this expectation under P, so let's say I replace P by another distribution Q for which I can compute this average, say a Gaussian or something, and then I try to make Q as close to P as possible. That's called variational methods. You've probably heard that term many times, and that's basically what it is: you approximate an expectation over a distribution by replacing the distribution with something you can actually compute, and you try to make this computable distribution as close as possible to the real distribution, using some measure, the KL divergence.
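The Monte Carlo version of that maximum-likelihood gradient is easy to write down, even though getting the sampler right is the hard part. A minimal sketch, assuming some sampler `sample_from_model` already exists (that is exactly the piece that MCMC and friends provide):

```python
import torch

def ml_gradient_step(energy, y_data, sample_from_model, opt, n_samples=64):
    # Replace the intractable expectation over the model distribution by
    # an average over samples drawn (approximately) from it.
    y_model = sample_from_model(n_samples)        # y' ~ P(y'), approximately
    # Lowering this loss pushes E down at data points and up at model
    # samples -- the two terms of the gradient above.
    loss = energy(y_data).mean() - energy(y_model).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```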
Right, so here is K-means. You can think of K-means as an energy-based model; you can interpret K-means in terms of energy-based models. Is everyone okay with what K-means is? Or have you forgotten what K-means is?

Okay, so K-means is this very simple clustering algorithm where (if you've never heard of it, this is one way to explain K-means) the energy function, written at the top here, is E of Y, Z equals the squared distance between Y and WZ, where W is a matrix and Z is a one-hot vector. So Z is a discrete variable with k possible values: a k-dimensional vector with one component equal to one and all the others equal to zero. When you multiply Z by the matrix W, what you get from this product is one of the columns of W: the column of W that gets multiplied by the component of Z that's equal to one gets reproduced, and everything else is gone. So that product selects one column of W. The columns of W are called prototypes, and if I give you a Y, the way you do inference is that you figure out which of the k possible vectors Z minimizes the reconstruction error, the squared distance between the corresponding column of W and the data point you're looking at. And the energy is just the squared distance between the two.

Now, the energy function you see represented in this chart, right here: these black blobs correspond to quadratic wells around each of the prototypes, the columns of W. So the system here has been trained, and it has placed the columns of W along the manifold of training samples, which is this spiral; that's where all the training samples are. And the way this is trained is very simple: you just minimize the average energy over the training set. So I give you a Y, a training sample; you find the Z that minimizes the energy, so you find the prototype, the column of W, that is closest to Y, and then you do one step of gradient descent, so you move that column a little bit closer to Y. Then you take another Y, select which column of W is closest to it, move that column a little bit closer to it, and you keep doing this. That's not exactly the K-means algorithm; that's the stochastic gradient form of the K-means algorithm. The real K-means algorithm does a sort of coordinate descent, if you want: it first goes through the entire training set and figures out, for each data point, which column of W is closest to it, and then, after you've done this, you recompute every column of W as the average of all the data points to which it's associated. It goes a bit faster this way than with stochastic gradient, but the result is the same in the end: you minimize the average of the energy over the training set.

So that's an example (there was a question about latent variables earlier) of a latent variable model, a very simple one, where the decoder is linear and there's no dependency on X; what you're trying to do is model the distribution over Y. Here Y is two-dimensional, and you're just trying to say: if I know one component of Y, can you tell me anything about the other, Y2? Once you have this energy function, if I give you Y1, you can predict what the value of Y2 should be; if I give you a random point, you can tell me the closest point on the data manifold, by just searching for the closest prototype, basically.

So K-means, as I just explained it, belongs to the architectural methods. It's not a contrastive method: as you can observe, I did not push up on the energy of anything, I just pushed down on the energy of stuff. But K-means is built in such a way that there are only k points in the space that can have zero energy, and everything else will have higher energy.
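Here is that procedure as code: a plain NumPy sketch on toy two-dimensional data, with k, the data, and the number of iterations all made up for illustration:

```python
import numpy as np

# K-means as energy minimization: E(y, z) = ||y - W z||^2 with z one-hot,
# so inference just picks the nearest column (prototype) of W.
rng = np.random.default_rng(0)
Y = rng.normal(size=(500, 2))                          # training samples y
k = 8
W = Y[rng.choice(len(Y), size=k, replace=False)].T     # prototypes, shape (2, k)

for _ in range(20):
    # Inference: for each y, the z minimizing ||y - W z||^2 is just the
    # index of the closest prototype.
    d2 = ((Y[:, :, None] - W[None, :, :]) ** 2).sum(axis=1)   # (500, k)
    z = d2.argmin(axis=1)
    # Learning (the coordinate-descent, "real" k-means step): set each
    # column of W to the mean of the data points assigned to it.
    for j in range(k):
        if np.any(z == j):
            W[:, j] = Y[z == j].mean(axis=0)

avg_energy = d2[np.arange(len(Y)), z].mean()   # average energy over training set
```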
It's just designed this way, right? So it's architectural in that sense: once I've decided on k, I've limited the volume of space in Y that can take low energy, because there are only k points that can have zero energy, and the energy grows as I move away from them. Now let's talk about the bunch of other methods. The architectural ones are my favorite methods, and I think ultimately everybody will be using architectural methods, but right now the stuff that works, on images for example, is contrastive.

Okay, so contrastive methods. I have data points, and currently my model computes an energy function that, let's say, looks like this; I'm drawing the contours of equal cost, like a topographic map. Obviously that model is bad, because it gives low energy to those points here, which should not have low energy, and it gives high energy to those data points there, which should not have high energy. So what should I do? Obviously, if I take a training sample here and change the parameters of f of X, Y so that the energy goes down, it's probably going to move the function to have lower values in that region. But that may not be sufficient, because it could be that my energy function is parameterized in such a way that it ends up flat, zero everywhere. So I need to explicitly push up on other places, and a good location to push up would be those red locations here: locations that my model currently gives low energy to, but that should not have low energy.

So let's say this is my training sample right now, the big black dot here. One way I can train a contrastive system is by saying: I'm going to push down on the energy of that point, and I'm going to perturb the point a little bit, corrupting it in some way, adding noise to it, and push up on the energy of that nearby point. And I do this multiple times. If I do it sufficiently many times, eventually the energy function is going to curl up around every sample, because each time I corrupt a sample a little bit and push up on the energy of the corrupted sample, the contrastive sample. So eventually the energy takes the right shape.

Something a little smarter: instead of randomly perturbing the training sample, I use gradient descent to go down in the energy surface, and then I take the point I land on and push it up. That makes more sense, right? Because I'm going for the jugular here: the system finds a point that it currently gives low energy to, and pushes it up. So the procedure is: here is a training sample; move down in the energy surface, so find a value of Y that has lower energy than the one you started from; that's a contrastive sample; push down on the original sample, push up on this new sample you just got.

Now, this might be expensive, and your energy function may be complicated; it may have local minima. So here's another technique. Take a training sample and imagine that this surface is a smooth mountain range; then think of the sample as a marble, and give it a random kick in a random direction, simulating a marble rolling down this energy surface. So I kick it, it goes in this direction for a while, and then it goes down in the energy. After a while you cut it off, you take the point at the end of this trajectory, and you push it up.
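The gradient-descent way of finding the contrastive sample is simple to sketch, assuming a differentiable energy function of Y; the step size and step count here are arbitrary, and the marble-with-a-random-kick variant would add noise and momentum to the same loop:

```python
import torch

def find_contrastive_y(energy, y_init, n_steps=10, step_size=0.1):
    # Start at the training sample and slide downhill on the current
    # energy surface; wherever we land is a point the model currently
    # (and wrongly) likes, so it is a good candidate to push up on.
    y = y_init.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        e = energy(y).sum()
        (g,) = torch.autograd.grad(e, y)
        with torch.no_grad():
            y -= step_size * g           # move down the energy surface
    return y.detach()

# A training step would then push down on y_init and up on the returned y.
```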
So I'm describing this very informally; I'm explaining the principles of how those methods work. But in fact, if you're interested in probabilistic modeling, and what you're interested in is doing maximum likelihood, what you need to do is produce samples according to the probability your model gives them, and there are ways to run those algorithms such that the ratio of the probabilities with which you pick two samples corresponds to the ratio of the probabilities the model gives them, which is all you need. And that's essentially determined by the details of how you implement those trajectories, and the noise that you use.

Okay, so let me give you names for these. The random-corruption version corresponds to an algorithm called the denoising autoencoder, and Alfredo is going to tell you more about this. The gradient-descent and random-kick versions: those are forms of what's called contrastive divergence. If you search through the space by random perturbations with noise, trying to find low-energy places, it's a special case of Monte Carlo methods; if it's a trajectory (it can be discrete), it's called Markov chain Monte Carlo, or MCMC; and if it's in a continuous space where you use this rolling-marble-with-a-random-kick method, that's called Hamiltonian Monte Carlo, HMC. There's a question? Okay, well, in the time left, let me just talk about the denoising autoencoder.

So what's a denoising autoencoder? It's a type of energy-based model where you start with a Y; you only have Ys here. Here we go. So you start with a Y and you corrupt it (this is the little diagram that I showed earlier); you get another, corrupted sample. You pass this through an encoder, which is a neural net, then through a decoder, which is another neural net, and then you compare the output, which is a reconstruction Y-bar, with Y. In the classical form, the cost is the distance between Y and Y-bar, squared. The network is just some neural net that you train; the corruption is built by hand, it's not trained. So what does that do for you? This actually pushes up the energy of corrupted points.

So that's how you train the system, but the actual system doesn't have the corruption: you give it a Y, you run it through the encoder and the decoder, and you measure the reconstruction error. It's exactly the same diagram, except with no corruption. The corruption is for training; this is how you use it. So what does that do for you? You have the space of Ys, you have data points; take a point Y and corrupt it. Now you train this encoder-decoder neural net to reconstruct, from the corrupted point, the original point, the original training sample; so the neural net function is going to map the corrupted point back to that point. You do this for every training sample and a large number of corruptions. What that means is that when you plug a point Y in on the input and you measure its energy, which is the reconstruction error, then if the point is on the manifold of data, it's going to be reconstructed as itself, and therefore its energy, the reconstruction error, will be zero, if it's trained properly.
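Here is a minimal sketch of that training loop and the resulting energy, assuming two-dimensional Ys, additive Gaussian noise as the hand-built corruption, and made-up layer sizes:

```python
import torch
import torch.nn as nn

dim = 2
encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

def train_step(y, noise_std=0.3):
    y_corrupt = y + noise_std * torch.randn_like(y)  # corruption: training only
    y_bar = decoder(encoder(y_corrupt))              # reconstruction of y
    loss = ((y_bar - y) ** 2).sum(dim=-1).mean()     # ||y - y_bar||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def dae_energy(y):
    # At use time there is no corruption: the energy of a point is its
    # reconstruction error -- near zero on the data manifold, growing
    # as you move away from it.
    with torch.no_grad():
        return ((decoder(encoder(y)) - y) ** 2).sum(dim=-1)
```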
Whereas if you put on the input a point that is outside the manifold, it's going to get reconstructed as the closest point on the manifold, because that's what it's been trained to do, and therefore the reconstruction error will be the distance to the manifold. What that means is that the energy computed by this denoising autoencoder grows quadratically as you move away from the manifold of data, if the thing is properly trained. So that's an example of a contrastive method: you say that on the manifold the reconstruction energy should be zero, and outside the manifold the reconstruction energy should be the distance to the manifold.

This is BERT. BERT is trained this way, except the space is discrete, combinatorial really, because it's text, and the corruption technique consists of masking some of the words. The reconstruction then consists of predicting the words that are missing; you can always copy the words that are not missing, so you don't need to predict those. So it's a special case of the denoising autoencoder. It's actually called a masked autoencoder, because the type of corruption you do is masking pieces of the input.
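A simplified sketch of that corruption, leaving the model itself out (real BERT's masking recipe has extra details, and the mask id and vocabulary here are made up):

```python
import torch

MASK_ID = 0  # hypothetical id reserved for the mask token

def mask_corrupt(tokens, p=0.15):
    # BERT-style corruption: replace a random subset of tokens with the
    # mask token; training then predicts the tokens that were masked.
    mask = torch.rand(tokens.shape) < p
    corrupted = tokens.clone()
    corrupted[mask] = MASK_ID
    return corrupted, mask

tokens = torch.randint(1, 1000, (4, 32))   # a toy batch of token ids
corrupted, mask = mask_corrupt(tokens)
# A model would be trained with cross-entropy between its predictions at
# the masked positions and the original tokens there.
```

Alright, out of time; we'll talk about more of those techniques next time.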