Okay, so we talked about energy-based models last week. I'm going to recap some of that in the first few minutes and then expand on it, and then we'll talk about structure prediction in the second hour. What we talked about last week mostly was so-called contrastive energy-based learning. The idea is that you have an energy function, which in this notation is denoted E (it should really be called F, I apologize for that), and the training of this energy function should be such that the energy takes low values on training samples, symbolized here by (xi, yi): x1, y1, x2, y2, etc. So I put the collection of all the x's and all the y's together, and we're going to have an objective function whose effect is to push down on the energy of those individual data points. If our energy function is bounded below, say by zero, because it's some sort of distance, this will have the effect of pushing the energy of those good guys down to the floor. Contrastive methods work by explicitly pushing up on a set of other points: for the same x's but different, incorrect y's, we pick a bunch of points and push up on their energy. For this to work, this loss functional H has to be an increasing function of E(x1, y1), E(x2, y2), etc., that is, an increasing function of the energies of the correct points, because then minimizing the loss has the effect of pushing down on the energy of those points. But it should be a decreasing function of the energies of the bad guys, so that when we minimize the loss, the energy of the bad guys is pushed up.
I said decreasing and increasing, not strictly increasing and strictly decreasing, because there is a point where perhaps we don't care anymore about pushing down the energy of the good guys, and we don't care anymore about pushing up the energy of the bad guys once they are above a certain threshold. This threshold we called m. Now, m may be different for every pair of y and y-hat, and you have a lot of choices in how you pick it. Here m is actually a vector: a vector of margins, one for each pair involving a y-hat. So this is the general form of a contrastive loss, and I've just stated the properties it should respect: an increasing function of the energies of the training samples, a decreasing function of the energies of the bad guys (the well-chosen contrastive points), and then some margin. The margin may be explicit or implicit, and what you probably need is a separate margin for each pair (yi, y-hat). There are two cases. There is the conditional form, where you have an input variable x and a variable to be predicted, y. And there is the unconditional version, which you can think of as basically the same thing except that we don't assume one of the two variables is observed; it could be observed or not, and both variables are y, essentially. The formulas translate directly: just remove the x's from the formula above and you're okay. So now we're going to dig a little more into what particular forms this loss function can take. But before we do, remember I said earlier that there are two types of energy-based training: the contrastive forms that we are talking about right now,
and then the regularized and architectural forms, which we haven't talked about yet. I want you to keep in mind that there are two ways of training energy-based models, and machine learning models in general, actually: contrastive methods and non-contrastive methods, which can be called architectural or regularized depending on the situation. I put here a list of standard algorithms together with which category they belong to, and whether they are supervised, unsupervised, structured prediction, or whatever. I'll show this slide again multiple times as we talk about some of those examples. So last week I showed this slide about which particular forms of H are popular, simple, or appropriate. One particular specialized form is the one where you only have two samples in the loss function: one good guy and one bad guy, and the loss only depends on that pair. Of course, for every training sample you choose a different good guy and bad guy; the good guy comes from the training set, and the bad guy you pick in some smart way, which I'm not specifying here. And you may have a margin m, which probably depends on the difference between y and y-hat. So, particular instantiations of this: again, this needs to be an increasing function of F(x, y) and a decreasing function of F(x, y-hat). And for this to work, in every case, m needs to be strictly positive. This ensures that when you minimize the loss, the energy of the bad guy will end up larger than the energy of the good guy by at least m. It doesn't need to be a hinge loss; it can be anything satisfying these conditions, but the hinge loss does satisfy them. A simple example is the one here, the simplest form, where you say: I want F(x, y) to be zero or smaller.
So you put a hinge here that pushes the energy down towards zero, and if the energy cannot go below zero, then this just minimizes the energy. The other term pushes up on the energy of the bad guy until it's larger than m. This is a positive part, a ReLU essentially, and it's obviously a decreasing function of F(x, y-hat) until F is larger than m, at which point it stops. So this loss explicitly pushes down on the energy of the good guy and pushes up on the energy of the bad guy, so that the energy of the good guy goes to zero and the energy of the bad guy goes to m. Here's another example. This one only cares about the difference between the two energies. It's not going to try to make the energy of the good guy small; it's just going to try to make the gap between the bad guy's energy and the good guy's energy larger than a particular margin. This is the so-called ranking loss, or hinge loss, or sometimes triplet loss. It's been quite popular for some applications, but the fact that there is no pinning down of the absolute values of F may be an issue sometimes. Here's a slight modification of the one at the top, where we square the energy terms; it's depicted here on the right in the diagram. The first term is the energy of the good guy: you put it in a hinge and then square the hinge, so you get a cost function that looks like this curve here. It says: I want the energy to be as close to zero as possible, with a square penalty. The other term is a squared hinge in the other direction, with a margin: you pay a price for making this energy lower than the margin,
and you pay a quadratic loss for doing so, because of the square. If you didn't have the square, if you used the objective function above, those would be straight lines. With the square you may have better properties, since there is an equilibrium point between the two losses. So those are examples of very simple contrastive losses that take only one good guy / bad guy pair, with a margin. Now here is a long list of other loss functions that people have used over the years, in various contexts, that were not necessarily formulated as energy-based models; some go back to the 1980s, so this is not a new problem. We can actually interpret all of these loss functions, like the perceptron loss, in terms of an energy-based model. I'm not going to go into the details of what the perceptron loss is; you have probably heard what a perceptron is. The main issue with it is that its margin is zero, so it doesn't work in all conditions. The hinge we just talked about. Here is another objective function, very similar to the hinge except that it's a soft hinge, very similar to the kind of loss you use in logistic regression with a binary output. You take the energy of the good guy, compute the difference with the energy of some bad guy, which is called y-bar here, and put this into log(1 + exp(.)), which is basically a soft hinge if you want. It tries to make this energy much smaller than that energy, because when that's the case you get a negative exponent here and this whole thing approaches zero. For large values of the difference, this cost function is basically the identity; it's like the difference itself.
So it's very much like the hinge loss, except it's a soft version of it. I'm going to skip those other things; these date from the 1980s, and the main issue is that there is no explicit margin. MCE is very similar to what we looked at here, except we don't take a log: we compute the inverse and put a minus sign in front. This was used mostly in the context of speech recognition, actually. Square-square I just mentioned; square-exponential is a different play on it, where the way we push up on the energy of the bad guys is by plugging them into a negative exponential. So they are being pushed towards infinity, but with a quickly decreasing force: as the energy of the bad guy grows, the cost becomes smaller and flatter, so we don't push it very far. The margin here is actually infinite, because the system tries to push the energy of the bad guy all the way to infinity. The second-to-last line we've already seen: this is the negative log-likelihood, which in the context of speech recognition people used to call maximum mutual information; it's actually the same thing. There, the second term doesn't take into account a single bad guy; it takes all of them into account together. You integrate the second term over all values of y, good guys and bad guys alike, and plug this into a cost that pushes up the energy of all of them. But the first term pushes down on the good guys with a larger force than the force with which the second term pushes them up, so overall this shapes the energy function in the right way. We explained that last week, and also explained that this may not be appropriate for various reasons.
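As a concrete reference, here is a minimal plain-Python sketch of several of the losses just surveyed. The function names are mine, `e_good` and `e_bad` stand for F(x, y) and F(x, y-hat), and the NLL term is written over a finite set of candidate y's rather than an integral.

```python
import math

def simple_hinge(e_good, e_bad, m):
    # Push the good energy down to (at most) zero, the bad energy up to at least m.
    return max(0.0, e_good) + max(0.0, m - e_bad)

def ranking_loss(e_good, e_bad, m):
    # Only the gap matters: require e_bad >= e_good + m (hinge / triplet loss).
    return max(0.0, m + e_good - e_bad)

def square_square(e_good, e_bad, m):
    # Same two hinges, but with quadratic penalties on each term.
    return max(0.0, e_good) ** 2 + max(0.0, m - e_bad) ** 2

def log_loss(e_good, e_bad):
    # Soft hinge: log(1 + exp(e_good - e_bad)), written in a numerically stable form.
    x = e_good - e_bad
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def nll_loss(e_good, energies_all):
    # Negative log-likelihood / MMI over a finite set of candidate y's:
    # E(good) + log sum_y exp(-E(y)), with a stabilized log-sum-exp.
    mn = min(energies_all)
    lse = -mn + math.log(sum(math.exp(-(e - mn)) for e in energies_all))
    return e_good + lse
```

Note how `nll_loss` pushes up on every candidate through the log-sum-exp, while the first term pushes the good guy down harder than the second term pushes it up.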
So here is the advantage of energy-based models: you can come up with your own loss function. You're not bound to something like negative log-likelihood, which you pretty much are if you use probabilistic methods. This opens the door to a lot more flexibility in the type of cost functions you allow yourself to use. I mentioned this briefly last week: this is a sort of generalized additive margin loss, where you have a margin for each pair of samples, which sets up a competition between the good guy and the bad guy, if you want. But instead of having just one pair for every sample (x, y), you sum over a whole set of possible pairs with possible bad guys. So the cost function doesn't take a single y-hat (this should be y-hat, not y-check; I'm not sure why I put a y-check here) but a whole bunch of them, and you combine them additively. Now let's come back to this idea of combining the energies of all the good guys and all the bad guys together. This has become popular over the last three or four years for certain approaches, particularly to self-supervised learning. A particular example is called InfoNCE, or what some people call CPC, contrastive predictive coding. I don't like that term, because it's too inclusive for a method that is very specific and uses a particular objective. Most of those methods have only one good guy in the loss, but a lot of bad guys, and the bad guys come from a batch. So you train at the level of a mini-batch: you collect a bunch of samples, with one good sample and a bunch of negative, contrastive samples, bad y's that you come up with somehow.
And within this batch, you're going to push down on the energy of the good guy and push up on the energies of the bad guys, but they may compete with each other. A particular example: imagine that you put the scores of all the samples inside a softmax. So at the top here (and I'm not sure I got this sign right; no, I think it's correct) you put the exponentiated negative energy of the good guy, and you divide by the sum over the good guy and all the bad guys. If the bad guys have very high energy, their terms are negligible, so this ratio is one and your cost is zero, because you take the log. But if the energies of the bad guys are too low, the denominator matters, and their energies will get pushed up, because you want to make this whole term small, which means you want to make the minus log of this ratio small. So what advantage does that give you? It performs, within a batch, a technique called hard negative mining: within the batch, the negative samples that are going to get most of the gradient are the ones whose energies are significantly lower than all the others. If in this batch of bad guys all of them have high energy, you don't care about them; but if one of them has low energy, then because of the softmax, this guy is going to get all the gradient and be pushed up really hard, whereas the guys that already have high energy are not going to be pushed up very much. So this introduces a sort of competition between the energies of the various negative, contrastive samples within the batch. People now call this InfoNCE, and it's very popular for training joint embedding systems.
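Here is a rough sketch contrasting the additive multi-negative loss (a sum of per-pair hinges) with the InfoNCE-style softmax loss just described. The names and the sign convention (low energy is good, so scores are negated energies) are my assumptions, not from the slides.

```python
import numpy as np

def additive_margin_loss(e_good, e_bads, margins):
    # Generalized additive margin loss: one hinge per contrastive sample,
    # each with its own margin m(y, y_hat); the contributions just add up.
    return sum(max(0.0, m + e_good - e_b) for e_b, m in zip(e_bads, margins))

def info_nce(e_good, e_bads):
    # InfoNCE: softmax over negated energies within the batch. Taking -log of
    # the good guy's softmax probability makes low-energy ("hard") negatives
    # soak up most of the gradient: hard negative mining for free.
    e = np.concatenate(([e_good], np.asarray(e_bads, dtype=float)))
    logits = -e                 # low energy -> high score
    logits -= logits.max()      # numerical stability
    p_good = np.exp(logits[0]) / np.exp(logits).sum()
    return float(-np.log(p_good))
```

In the additive loss every violating negative contributes equally per unit of violation; in InfoNCE the softmax concentrates the gradient on the negatives whose energy is lowest.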
So let's talk about joint embedding architectures now. We talked about this last week, but let me come back to it. We have two networks. They can take inputs of the same nature, say two images; or, in other joint embedding methods, they can take inputs of completely different natures: one could be an image, the other a text query. You want to train the neural nets so that the vector representing an image and the vector representing a query that describes it (a question that people might type in a search engine, or a caption for that image) have nearby embeddings. But if you have an image together with a query that doesn't match it, you want the output representations to be far away from each other. Here, though, we're going to use the simple example where the two networks are identical: they share the same weights, one single set of weights shared across the two networks, and they produce embeddings for images. What we want is for the output vectors to be very close to each other, which means this distance measure should be low. That's the energy. You want the energy to be low whenever the two images are essentially distortions of each other: they have essentially the same content, perhaps different due to cropping or scale or noise or rotation or things like this. You can generate those samples artificially: you take an image and distort it a little bit, and that's your positive pair (x, y). You run them through the networks and tell those two guys: minimize the energy, which means bring h and h-prime as close to each other as possible. Now, if you only do this, your system will collapse. We talked about this last time.
What's going to happen if you only have positive samples is that your network will happily ignore the input and just output constant h and h-prime. That's a collapse, an example of collapse. So contrastive methods will push up on the energy of other samples: you pick another image that you know is different from the first one (or maybe you don't know, but you rely on the fact that your training set is very large, pick another sample, and assume it's different; this is what people do), you push it through the network, and you tell the network: I want those two representations to be at least some distance away from each other. If you have multiple negative samples within a batch, you can plug that into an InfoNCE-style loss, or you can use a ranking loss, or square-square, or square-exponential, one of the loss functions I showed you earlier. The different papers here, mostly from the last couple of years, use different tricks to make this efficient. SimCLR is one that used InfoNCE; it works okay, but it's very expensive, because you need a lot of negative samples to train the system properly. Others, like MoCo, also use negative samples, but with tricks to make the negative mining more efficient, which I'm not going to go through right now. And there's been recent success in speech recognition with what I just explained. This is very recent work from Facebook, from October 2020, I believe. They used a joint embedding method to pre-train a speech recognition system with unlabeled data, roughly a thousand hours of unlabeled speech. What they do is run a convolutional net on the speech signal, and the convolutional net outputs a sequence of vectors that represent the speech, if you want.
And they use a criterion that says: I want the representation of a particular segment of speech to be easily predictable from what surrounds it. So basically, make the energy, which is the difference between the prediction from the neighbors and the current vector, as small as possible. But they also use a contrastive phase, where they substitute the central sample by another one and make that energy large. It's a little bit like the masked autoencoder idea we mentioned last week, where you take an input, corrupt it, and train the system to recover the uncorrupted version, except here it's more like a joint embedding method, though it has a bit of the same flavor as the denoising autoencoder. They use masking, and they use a transformer on top of the convolutional net that extracts those acoustic vectors, to do this missing-vector prediction. It's contrastive, but it's more akin to a denoising autoencoder than to a pure joint embedding method. The result is that with the representation they learn through this process, they can train a speech recognition system with only 10 minutes of labeled data, which is astonishingly small for any language, and basically get the same word error rate as the previous state of the art from last year, which was obtained with 100 hours of labeled data. That's an incredible reduction. And it's open source. This is really important because, if you're a company like Facebook or Google, you want to be able to recognize speech in any language spoken in the world, and there are a lot of languages for which there is very little labeled training data. It's very expensive to label speech data.
You need to find competent speakers of that language and have them do the labeling. So if you can reduce the amount of data required for training a speech recognition system, that's super important. But there's a question here. Yes? "I wonder how wide the CNN needs to be for the masked prediction criterion to be useful. For example, if the window is too narrow, the prediction won't capture any of the semantics of the speech." Yeah. So this is actually a funny kind of speech system in that it does not use any pre-processing: it takes the raw waveform as input, just the samples. There is no pre-processing turning it into a time-frequency representation, as is more common in speech recognition systems; this one works directly from the raw input. So the first few layers of the convolutional net extract auditory features, if you want, from the raw signal, and then you get this acoustic representation vector Q. I don't remember the details, so I don't know the temporal separation between two of those output vectors. My guess is probably on the order of a few tens of milliseconds, maybe 20 milliseconds or something like this, but I'm not actually sure. It's enough to identify an elementary sound called a phone. That's the displacement between two Q's; the input window, of course, is larger, because you need some context. For example, if you look at the raw signal or the time-frequency representation of the sound P in "apa" or "ipi" or "opo", the P is actually completely different each time. We hear it as a P, but when you look at the spectral representation of the sound at that moment, it's completely different because of the context. So you need some context to interpret an elementary sound. The sound itself may last a fraction of a second, maybe 100 milliseconds or less,
but you need some context around it to be able to recognize it. That's why the convolutional net has a window, but I wouldn't be able to tell you what the architecture or the input window is; you'd have to look at the paper. Okay, so this is recent progress due to those contrastive methods, which really makes a big difference in the world. Even better, you can use these techniques to train a speech recognition system that is multilingual. Instead of training a system that recognizes only one language, you pre-train it in this self-supervised contrastive manner on lots of languages, on the order of 10 or 100, depending on the experiment. The system learns a good representation for speech regardless of the language, including tonal languages like Mandarin, stressed languages like English, and completely non-stressed languages like French. Then you take that representation and train a recognizer on any number of languages. And it's interesting that when you do some sort of low-dimensional visualization of the representation extracted by the system, those Q vectors, for different languages, you clearly see different clusters of languages. Mandarin and other tonal languages are in one cluster; then you have languages here that are Germanic, if you want, with a tinge of Romance here; Arabic and Kabyle are kind of different; and then you have the Latin-derived languages down here. Not Basque, though: Basque is not even Indo-European, so it's in a category by itself. Basque is the language spoken in southwestern France and northeastern Spain, and nobody knows how to classify it; it doesn't belong to any big family.
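Before moving on, the Siamese joint-embedding setup from a few slides back can be sketched very compactly: one shared weight matrix used by both branches, a squared-distance energy, and a square-square-style contrastive loss. The toy encoder and all the names here are my own, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)) / 4.0   # ONE weight matrix, shared by both branches

def encode(x):
    # Toy shared encoder: both branches run the exact same function.
    return np.tanh(W @ x)

def energy(x, y):
    # Energy = squared distance between the two embeddings h and h'.
    h, h_prime = encode(x), encode(y)
    return float(np.sum((h - h_prime) ** 2))

def contrastive_loss(x, y_pos, y_neg, m=1.0):
    # Pull the positive pair together; push the negative pair's energy
    # up to at least m, with a squared hinge on the negative term.
    return energy(x, y_pos) + max(0.0, m - energy(x, y_neg)) ** 2
```

Because `W` is shared, a gradient step on `contrastive_loss` moves a single set of weights, which is exactly the weight-sharing described for the two identical networks.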
Probably most of you have heard about generative adversarial networks. This is a technique that people use to generate images, and we may talk about it at some length in a future lecture. But in fact, a GAN is secretly a contrastive method for energy-based models. So what is a GAN? I'm not going to give the usual description of GANs; what I'm going to tell you now is an idea. We need a way of picking the negative samples whose energy we're going to push up. We could push up at random places, but in high-dimensional spaces, pushing up at random places is not going to get us anywhere; we'll have to be really smart about how we pick those negative samples. Perhaps a good idea would be to pick samples that are not training samples but that nevertheless have low energy. This is the hard negative mining issue I was telling you about earlier: we'd like the y's whose energy we push up to be the ones that are currently, wrongly, given low energy by our model. So we pick a sample that currently has low energy and we push it up. Again, the problem is that there may be an infinity of those in a high-dimensional space, and they may be hard to find. So here's an idea: we're going to train a neural net to tell us where they are, to generate those green points. In the context of a GAN, that's called a generator. So a GAN is a system that has two networks. One is the generator, a neural net that takes as input a vector of random variables drawn from a distribution, and this generator network produces a green point.
It's going to try to produce a point to which our energy-based model currently gives low energy, but which should be given high energy. So we take a y from our dataset and run it through our energy-based model, which could have any architecture here; I drew it as an autoencoder, but it could be anything that outputs a scalar. And we push down on its energy, because that's a data point, a blue dot. Then we generate one of those green dots and push up on the corresponding energy, so that we shape the energy function the right way. Now, how do we train this generator? The way we train it is by asking: how can I change the weights of the generator so that, next time around, the y it generates will be given low energy by my energy-based model? Because we want to generate green points that are actually close to the data manifold, so that when we push up on them, the energy takes the right shape. So we need green points that our model gives low energy to, and the way we get them is by training the weights of this network to produce points that are given low energy by our current energy-based model. We can do this easily. We have a y that our generator just produced from a random vector; an image, let's say. We run it through the energy-based model, it gives us an energy, and then we backpropagate the gradient of the energy through the energy-based model, and then again through the generator network. With that gradient, we update the weights of the generator so that the new y generated after this update has a lower energy than the old y had.
So basically, this generator trains itself to be adversarial, which is why it's called a generative adversarial network. It trains itself to produce not just bad guys, but the worst guys possible, the ones most annoying for the energy-based model: examples to which our energy-based model gives low energy but should not. That's the story, essentially. It's a contrastive method where we train a neural net to produce the negative samples; it's basically as simple as that. So there are two phases. You take a data point, run it through your energy-based model, then backpropagate through the energy-based model and take a gradient step with respect to the parameters of the energy-based model so that the energy goes down for that data point. Then you generate a random vector, run it through your generator network to produce a sample, and run that through your energy-based model (it used to be called the discriminator; people now prefer the word critic, but it's really an energy-based model). Then you backpropagate the gradient of this energy through the energy-based model and back through the weights of the generator, and change those weights so that this energy goes down, so that the y-hat generated from the same vector next time will be given a lower energy by the energy-based model. You basically move the green points so that they get closer and closer to the low-energy region, and you keep pushing up on them. Now, you could use this process to train an energy-based model, which you could then use for whatever application you want; you could think of this as self-supervised pre-training, although it doesn't work very well if you do this. What people actually do is drop the energy-based model, the critic, and use the generator as a way of generating images.
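The two-phase procedure just described can be sketched on a 1-D toy problem, with a quadratic "critic" E_w(y) = (y - w)^2 and a one-parameter generator. All gradients are written out by hand, and every numerical choice here is illustrative rather than from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=0.1, size=1000)  # toy training samples near y = 3

w = 0.0        # critic parameter: E_w(y) = (y - w)^2, low energy near w
theta = -5.0   # generator parameter: g(z) = theta + 0.1 * z
lr, lr_up = 0.05, 0.005   # push-up is weaker, so the critic still tracks the data

for y in data:
    # Phase 1a: push DOWN the energy of a data point (gradient descent on w).
    w -= lr * (-2.0 * (y - w))           # dE_w(y)/dw = -2 (y - w)

    # Phase 1b: generate a negative sample and push UP its energy.
    y_hat = theta + 0.1 * rng.normal()
    w += lr_up * (-2.0 * (y_hat - w))    # gradient ASCENT on E_w(y_hat)

    # Phase 2: backpropagate the energy through the generator and lower it,
    # so the next sample from the same z would be given lower energy.
    theta -= lr * (2.0 * (y_hat - w))    # dE_w(y_hat)/dtheta = 2 (y_hat - w)
```

Dropping the critic at the end and keeping only `theta` (the generator, whose samples now land near the data) is exactly the usage pattern described above.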
So now what you have is a model to which you give a random vector drawn from a particular distribution, and out comes an image that basically looks like the images you had in your training set. It doesn't need to be an image; it could be any piece of data. So it's a way of training a system to generate. You can interpret the energy-based model as a critic, which tells the generator whether its sample looks good or bad; it basically just rates the generated sample. Now, the original formulation of GANs was actually more probabilistic than this and didn't work very well. What people did over the years, like this paper by Arjovsky and Bottou and a few others (Arjovsky was actually a PhD student at NYU; Bottou is affiliated with NYU and Facebook Research in New York), is the idea of the Wasserstein GAN. The Wasserstein GAN is essentially a technique to make sure that whatever energy function is being computed is smooth, if you want. Because with this technique, if you're not careful, as the generator produces more and more realistic images, the green points get closer and closer to the blue points, and the energy function basically turns into a canyon: low energy just at your training samples and super high energy just outside of them. You want to prevent this from happening, so you regularize the slope, if you want, of the energy function. That's basically what the Wasserstein GAN is; the name comes from the Wasserstein distance. Now, to this question: I have also heard that GANs were secretly probabilistic models as well. One of the problems I've heard of with GANs is mode collapse: the generator keeps outputting the same green point.
A solution I heard is to have the generator make, like, five samples and have the critic judge the set: if they are all the same, it is obviously fake to the critic. The generator is creating samples from the domain PDF, and a mini-batch should have a similar PDF, which kind of makes sense. How do you view this from an energy point of view? And why would making more samples in a set be good from an energy-based-model viewpoint? Yeah, so there are a lot of problems with GANs. Okay, so first of all, because you're doing this dual optimization of two networks that are basically against each other, what the optimization actually does is not a minimization or a maximization, but it tries to find a kind of saddle point, or more precisely what's called a Nash equilibrium. And if you use standard optimization methods like stochastic gradient descent without being careful, you can actually prove that you may not be able to converge to a Nash equilibrium between those two networks. One is going to win and is going to kill the other one, basically. So that's what happens with mode collapse. Now what happens with normal GANs, in the original formulation, is that if you train them for a particular time, you're going to get some reasonable solution. If you train them for very long, you're going to observe this mode collapse. And the mode collapse essentially is one in which the discriminator basically doesn't give you any useful gradient, because it's got very large weights and basically its function is some sort of canyon. And the generator, because it doesn't have any useful gradient, basically keeps producing the same output. And so that's kind of a failure mode. And essentially if you train a GAN in its original formulation for long enough, you'll observe this. So you have to stop training before it happens. There are a lot of tricks for this, but the Wasserstein GAN is basically a way to prevent this from happening too early.
But you still have the problem that you're trying to find a Nash equilibrium between two functions. You're minimizing two criteria: you have two separate loss functions, one which is minimized by the weights of the critic, another one that's minimized by the weights of the generator. And those two different objective functions, which are the positive and negative terms in the hinge, if you want, are not compatible with each other. So it's always a trade-off. If you want green points that are really close to the blue points to have high energy, it means that your energy function has to be very steep. And this may not be good. And so this drives the system to a bad energy function, in fact, if you do it properly. So the criterion is bad. So you need to regularize the critic function, the energy function, so that it's smooth and doesn't go to hell, essentially. Or you need to use other tricks. And there are literally hundreds of papers on how to make this work, but it's very finicky. The applications of GANs: so here's something. The application of GANs to pre-training, in an unsupervised manner, an energy-based model in such a way as to generate good features for a subsequent supervised phase has been essentially a complete failure. There has been essentially no success in using GANs as a technique to pre-train a system in a self-supervised way. The only success of GANs has been in content generation: image generation, sound synthesis, things like that. For that, it works, but it's the generator you use for that. Okay. As a way to train an energy-based model, it doesn't really work. Okay, so now those non-contrastive methods. So there was an idea that came out of a paper called MoCo, which I mentioned earlier, by Kaiming He from Facebook and his collaborators. And the idea was to basically slow down the weights of one of the two networks in a Siamese joint embedding architecture. So this is a situation where you have two identical networks.
So they both take images, for example. And you make the weights of one of the networks slightly different from the weights of the other one. And the way you make it different is by essentially computing an average of the past weight vectors of that network. Okay. And the technique was called MoCo because it means momentum contrast: there's basically a momentum embedded in these weights, so that the weights of the two networks are slightly different. So there are various ways of doing this. In fact, I have a chart here. MoCo doesn't appear here because it's been largely superseded in the last year or two. But these are: so SimCLR is a contrastive method, which I described before. And those other methods are non-contrastive. Actually, SimCLR, to some extent, is. But BYOL and SwAV are essentially non-contrastive methods. So let me talk about BYOL. That means Bootstrap Your Own Latent. This came out of DeepMind, a long list of authors here, very recently. And when the paper appeared, nobody knew why it was working, including the authors themselves. Okay. So they used this trick of averaging the weights over time for one of the networks. But then the other trick is that they add a layer or a couple of layers on top of the encoder that is supposed to basically swallow the difference, eliminate the difference of representation between an image and the distorted version of it. And if you train the system with appropriate batch normalization at all the layers, particularly in the last layer here and within the encoder, it works without having to rely on negative samples. Why does it work? It's not clear. It's still a topic of research. But it's really exciting, because now you can train joint embedding methods without a contrastive phase, which is usually very expensive.
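The weight-averaging trick can be written in a couple of lines. This is a minimal sketch, assuming weights are plain lists of floats and using a made-up momentum value; real MoCo/BYOL implementations apply the same update per tensor to the target encoder, which is never updated by gradient descent.

```python
def ema_update(online, target, momentum=0.99):
    """New target weights: momentum * target + (1 - momentum) * online."""
    return [momentum * t + (1.0 - momentum) * o for o, t in zip(online, target)]

online = [1.0, -2.0, 0.5]   # weights of the gradient-trained network
target = [0.0, 0.0, 0.0]    # weights of the momentum (averaged) network

# If the online weights stayed fixed, the target would converge to them;
# during training they keep moving, so the target lags behind as a
# smoothed copy, making the two networks slightly different.
for _ in range(500):
    target = ema_update(online, target)
print([round(w, 3) for w in target])
```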
Now, simultaneously, people came up with, so these are people at Facebook Research, collaborating with Inria in France, at FAIR Paris. And SwAV can be seen as a successor of DeepCluster. And so this is another technique, which I will not go into the details of. But essentially, the idea of SwAV is you run two images through the encoder; those are identical encoders. And then you basically perform a clustering on the output vectors. And you normalize those vectors in a particular way called Sinkhorn-Knopp. So it's a matrix and you're kind of normalizing it in a funny way. And then you basically use the clusters learned this way from one encoder as targets for the other encoder. I realize my explanation is not very detailed; I encourage you to read the paper if you're interested. This is a very hot topic. There was actually a blog post made just last week by Facebook that describes a large system that was trained using SwAV, and that beats some records on ImageNet recognition. The details are important if you want to implement it. And it's open source, so you can download it, in a library called VISSL, V-I-S-S-L, that just came out. But the point is that it basically quantizes the set of output vectors into prototypes, and then it uses the prototypes as targets to train the network. But you take the prototypes of one network and the prototypes of the other network and you swap them; that's where the "Sw" in the name SwAV comes from. And then you use those as targets to train those two networks. So it's an exciting new area, because those systems, non-contrastive methods for training joint embeddings, actually are performing really well for learning representations of images. In fact, here is one that I worked on with some of my colleagues at Facebook that actually just appeared on arXiv last week. So this is very fresh. But this field is moving so fast, and it's so exciting, that a lot of people are jumping on it.
And it's the idea of using joint embedding with a very simple criterion. It's much simpler to understand than SwAV or BYOL. So here what you do is you run distorted versions of the same image through two identical neural nets; they're really identical. You get two representations. And what you do now is you compute this over a batch, let's say. And over this batch, you compute the cross-correlation matrix between those sets of vectors. So you take ZA, which is a vector, you take ZB, which is another vector, and you compute the outer product of those two vectors. So you get a matrix. You take the sum of that matrix over all samples in your batch, and you get a matrix like this. Okay. There's one little detail, which is that before you take those vectors, you subtract the mean of those vectors, and you normalize the components by their variance. So what you get when you do this outer product and sum is what's called a cross-correlation matrix between the two sets of vectors. And the criterion you use to train it is to make this cross-correlation matrix as close as possible to the identity. Okay. So what you want is for one variable in ZB and the same variable in ZA to be as correlated as possible, to basically have correlation one. Okay. So the correlation is the sum over the batch of the product of the values of the two components, and this is all divided by the product of the standard deviations of the two variables, for normalization. So it's a value between minus one and one: one if the two values are completely correlated, minus one if they are anti-correlated, and zero if they are uncorrelated. Okay. So you try to make the diagonal terms as correlated as possible, which is another way of saying I want ZA and ZB to basically be the same vector. Now there's an easy way for the system to cheat, though, which is to make every component of ZA equal, or essentially very dependent on each other, and every component of ZB also equal.
So basically there would be very little information in both ZA and ZB, because all the components of the vector would vary at the same time. It would satisfy the criterion for the diagonal but would not give you any particularly interesting features. So there is another term that says I also want one component from ZA to be decorrelated from a different component from ZB. Okay. So I want them to give me different information, essentially. Okay. And that's done by trying to set the values of the off-diagonal terms to zero. Again, they can vary between minus one and plus one, because each is a normalized correlation coefficient, but we try to make them zero. So that's easy to understand. It doesn't collapse. There are no negative samples, although you could think of the off-diagonal terms as some sort of weird way of doing negative samples, but it's over the dimensions of the representation, not over the samples of a batch or training set. And this works pretty well. It works basically just as well as SwAV, more or less. Does it work for medical images? We haven't tried, because the paper came out a few days ago. So there were only experiments with pre-training on ImageNet and then testing on things like Pascal VOC, COCO and so on. And finally, the batch size of 1024: is it too large for normal people to train, someone is asking, or is it actually doable? That's completely doable. Maybe you need multiple GPUs, but you don't need that many. And in fact, that's the optimum. So Jure and Li, the authors, tried several batch sizes, and below 1024 it's worse, and above it's also worse, which is not the case for things like SwAV and BYOL; they work better if you have bigger batch sizes. So this one goes through a maximum at 1024, which is kind of a good sweet spot. I see. Here's another example. So again, this was the topic of a blog post last week.
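The criterion just described (this is the Barlow Twins loss) is easy to write down with NumPy. A sketch, not the paper's reference code: the trade-off weight `lam` and the numerical epsilon are assumptions made for this example.

```python
import numpy as np

def barlow_twins_loss(za, zb, lam=5e-3):
    """za, zb: (batch, dim) embeddings of two distorted views."""
    n, d = za.shape
    # Standardize each component over the batch: subtract the mean,
    # divide by the standard deviation.
    za = (za - za.mean(0)) / (za.std(0) + 1e-9)
    zb = (zb - zb.mean(0)) / (zb.std(0) + 1e-9)
    # Cross-correlation matrix: sum of outer products over the batch.
    c = za.T @ zb / n                      # (dim, dim), entries in [-1, 1]
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()            # diagonal -> 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # the rest -> 0
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
za = rng.normal(size=(256, 8))
zb = za + 0.01 * rng.normal(size=(256, 8))   # two nearly identical "views"
# Loss is near zero when the two embeddings match component-wise and the
# components are decorrelated from each other.
print(float(barlow_twins_loss(za, zb)))
```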
I believe Thursday or Wednesday morning or something. It's a system called SEER, and it's basically a large-scale application of SwAV. So you take one billion images randomly selected from Instagram. Okay. This is really random, right? You just watch Instagram for a few hours and you get a billion samples. All right. And then you train a SwAV system using the RegNet architecture. So this is a particular family of convolutional-net, ResNet-like architectures with a particular set of parameters, where you can parameterize the sizes of the layers and everything. So they use this class of architectures and train RegNets of various sizes, with different numbers of parameters, pre-trained with self-supervised learning using SwAV on this. And then you take the resulting network, you take the representation in one of the layers, you stick a classifier on top, and you train that system supervised. There are two modes: one in which you just train the classifier, and one in which you fine-tune the entire network. I think those are numbers for fine-tuning the entire network. You can get 84% correct on ImageNet when you train on the full labeled ImageNet. This is a top-1 accuracy number. And it's still about 78% when you only use 10% of the ImageNet labeled samples, and still about 60% when you use only 1%. There are a thousand categories in ImageNet and 1.3 million training samples, so 1% is about 13 samples per category (it's actually not balanced between categories), so a very small number of samples per category. You still get 60% on the evaluation set, which is pretty amazing. And it actually beats the state of the art on some other datasets. But what's interesting is that if you train purely supervised from scratch on ImageNet, you don't get the same performance.
You get something like, in the same conditions, a little more than 81% correct. Okay. And this is with the pre-training. So here, the pre-training is basically just a way of initializing the weights of the RegNet network. And this curve is for different-size networks, right? So after a certain size, purely supervised learning kind of saturates, but self-supervised learning keeps going up. So this is this idea, which is very strange for classical statisticians, that in deep learning the bigger you make the networks, the better they work. This one has 1.3 billion parameters, which is quite large for a convnet. Okay. Now let's switch to the topic of using latent variable models in practice for structure prediction, which is really what you're going to need to hear about for the next homework in particular. Okay. So the general problem of structure prediction is that you have an input, let's say an image or a piece of text or a speech signal or whatever, and the output is a structured object. So for example, it wouldn't be just the category of the dominant object in the image. It would be, for example, a description of the image, or it would be a list of the objects that are in the image, which is somewhat structured: you have to decide which objects to include in that list. If it's a translation system, it's a particular translation of the sentence, and of course there are multiple possible translations. But it's a structured object, right? It has to satisfy the grammar of the target language, et cetera. And it could be another image, if you want to do image denoising, for example; or it could be an image if the input is a video clip and you're trying to do video prediction for things like, let's say, compression. So basically, it's when the output can be multimodal.
So there could be multiple outputs that are all compatible with the input, and at the same time the output is structured. Okay, so what you need for this is a latent-variable predictive model, which we talked about before, right? So basically an architecture in which the set of possible answers is parameterized by a latent variable that you can vary. Okay. And as you vary the latent variable, the prediction varies, hopefully, if the system is properly trained, over all the plausible outputs that correspond to the input. Okay. So the input is a sentence in Turkish and the output would be a translation in English. And when you vary the latent variable, you vary the style of the translation, basically, but you preserve the meaning. That would be an example. Okay. So what we've seen about latent variable models is that the way you do inference is that you are given an X and you are given a Y. Let's say during training, you know Y, so you're given Y; or you're given a proposal for Y at test time. And what you do is you find the Z that minimizes the energy function. Okay. That's the way you proceed. If you don't know Y, if you are at inference time, you jointly minimize the energy with respect to Z and Y. Okay. So you try to find the combination of Z and Y that minimizes the energy, and the corresponding Y is the best-scoring output, if you want. Here is a simple example. So this is a system by Nicolas Carion, who is actually a postdoc at NYU right now. But he did this work during his PhD in Paris; he was actually at Facebook during his PhD. And it's called the DETR system. It's probably one of the best-performing vision systems at the moment. It uses a combination of a convolutional neural net and a transformer, which is a particular architecture we haven't talked about yet. But you'll learn more about this later.
And this transformer basically outputs a set of predictions, boxes for where there might be objects, together with a score for different categories. And then what you need the transformer to do is to give a list of high-scoring, well-identified objects together with their energies or their scores. But here's the problem. When you train, you give a list of objects that are in the image. But then you need to map the list of objects that comes out of the neural net to the list of objects that is given to the system for supervision. You don't know which is which. So you have to basically find a good permutation of the list of objects produced by the system, one that best matches the labels given to the system during supervised learning. That permutation is a latent variable. It's a discrete latent variable, but it is a latent variable. So basically, by finding the permutation that best matches the objects that come out of the neural net with the list of objects that you give for supervision, you're performing a minimization with respect to a latent variable. Now in fact, the permutation is not actually a permutation, because there could be objects that your system proposes that are not in the list of desired objects, so you might allow the system to drop some of the objects. And vice versa, there could be missing objects that your system didn't pick out but that are in the list of desired objects. So it's not exactly a permutation you need; it's called bipartite matching. But there's an energy function you can come up with that will give you the best match with the minimum number of deletions and additions, essentially. So that's a very simple example of an energy-based latent-variable model, where the latent variable is really very close to the output. And that system works amazingly well. So you run an image through a convolutional net, and the output of it goes to a transformer.
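The matching step described above can be sketched as follows. DETR uses the Hungarian algorithm for this; here, for a handful of objects, a brute-force search over permutations (a hypothetical toy, not the real implementation) shows the idea of picking the assignment with the lowest total energy.

```python
import itertools

def best_matching(cost):
    """cost[i][j]: energy of matching prediction i to ground-truth object j.
    Returns the assignment of labels with the lowest total matching cost."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

# Hypothetical costs; rows are predictions, columns are ground-truth objects.
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.6, 0.3]]
print(best_matching(cost)[0])   # -> (0, 1, 2)
```

Handling extra or missing detections, the bipartite-matching case in the lecture, is usually done by padding the label list with a "no object" class so the cost matrix stays square.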
Each of those is basically a slot for an object. A property of the transformer is that it's equivariant to permutations of the input: if you permute the input, you will get an identical result, except the output will also be permuted. The same way that a convolutional net is equivariant to shifts (if you shift the input, the output shifts), here, if you permute the input, the output is also permuted, but otherwise unchanged. And I'm not going to explain the internal structure of it. And then you run through a second transformer that basically proposes candidates for output objects and classes, and that's where you do the matching with the categories. You can train the system to produce masks also, for whatever is important, and this works really well. You can do panoptic segmentation with it. I'll spare you the details, but this is one of the best-performing computer vision systems today. Okay, but let's talk about latent variable models a little more generally, or about structure prediction a little more generally. So here's an example where we don't actually have a latent variable, but we do have structure prediction. So let's say we have three energy terms. In this case we have an input, let's say it's a sentence from a text. That input goes into three different energy terms. And the output is a sequence of words. And let's say we want the sequence of words to satisfy a grammar, and it's a very simple grammar in this case, which would be: if Y1 is a particular word, then Y2 can be one of a list of possible words. And if you want to pick a word for Y2 that is not in this list of potential words that can follow Y1, then you need to pay a price for it. Okay, so this energy term would basically measure whether a word Y2 can follow a word Y1. Okay. For example, if I start the sentence "it is obvious" and I stop, you can predict that the next word is probably something like "that". Okay: "it is obvious that".
So in this case, this word would be "obvious", right? And there would be a low energy for following "obvious" with the word "that", but probably a high energy for the following word to be "lion". Okay, you never see "it is obvious lion", right? That's grammatically strange. So this term here would basically implement what's called a bigram language model, which tells you, for a pair of successive words, what the incompatibility between those two words is, in terms of an energy function. And you do this for every pair of successive words; assume the sequence here has four words. Okay, so now I give you an X. Inside those boxes there might be complicated neural nets or whatever. But my output here has to satisfy this constraint of the energy, that successive words need to be compatible with my language model. Okay, that's kind of the purest example of structure prediction. And you need the formalism of energy-based models for this, because the only way you can find a good combination of Y1, Y2, Y3, Y4 is by actually minimizing the overall energy with respect to the Ys. Okay, so you're going to find, through some search technique, a combination of words that minimizes the energy. Now, because words are discrete objects, there are ways to do this efficiently, which I'm going to get to in a second. So inference might be relatively simple and efficient because words are discrete objects. This is very much used in a lot of natural language processing, speech recognition, handwriting recognition (in speech recognition, the process that does this is called a decoder), and biological sequence analysis and things like this. So anything where you have strong dependencies between the variables you're trying to predict. So here is an example of how you would do efficient inference in such an energy-based model where, for example, you have constraints over pairs of successive variables.
So here I've drawn a very simple form of energy-based model where you have two latent variables, Z1, Z2, and two output variables, Y1, Y2. Okay, and you have a factor between successive variables, and X only influences the first factor. Okay. X is a continuous variable; it could be an image or an audio signal or whatever. Z1 is binary, Z2 is binary, Y1 is binary and Y2 is ternary, so it can take three values. Okay, so one thing I can do is exhaustive search. I can say, well, there are two values for this, two values for that, two values for that, so that's two, four, eight, and three values for this. So that's 24 combinations total. Okay, so I can just go through all 24 combinations of values of latent variables and outputs, and for each of those, I'm going to compute the energy. Okay. And what that means is that I'm going to have to run through this energy-based model 24 times, with 24 different combinations of inputs, which means, if you assume that each of those terms costs the same to compute, I'm going to have to compute 24 times four energy terms. Okay, that's 96. Now, that's kind of wasteful, because I'm going to compute the same value multiple times; there are multiple combinations of inputs for which both Z1 and Z2 are zero, and I can pre-compute the value of this energy for the combination Z1 and Z2 equal zero. Okay. And I can pre-compute the same for this term: I can pre-compute the four values that correspond to Z2 equals zero, Y1 equals zero; Z2 equals zero, Y1 equals one; Z2 equals one, Y1 equals zero; and Z2 and Y1 both equal to one. Okay. So that's what I'm going to do: I'm going to pre-compute. So this guy can take two values, because Z1 can take two values and X is fixed. So this guy can take two values, for Z1 equals zero or one. Okay. This guy can take four values, because Z1 and Z2 are binary. This one can take four values as well, because Z2 and Y1 are binary.
And this guy can take six values, because Y1 is binary and Y2 is ternary. So I pre-compute those values. Okay. And I'm going to put those values in a graph. Okay. So I start from the left here, and Z1 can be either zero or one. And if Z1 is zero, then I pay the price E_a of X and zero, which I'm going to attach to this branch. Okay. And if Z1 equals one, then I'm going to put here the value that I pre-computed for this term when Z1 is equal to one. Okay. Then Z2 can be zero or one, and those are two nodes in my graph. But I can go to Z2 equals zero or one from either Z1 equals zero or Z1 equals one. And I'm going to label each transition with the cost of that factor for the combination of Z1 and Z2. Right. So if Z1 and Z2 are both equal to zero, then I label this transition with the cost coming out of this energy for Z1 and Z2 equal to zero. Okay. And I can keep doing this. Now, Y2 being ternary, I have three values for it: zero, one, or two. And I have transitions from the two values of Y1 to the three values of Y2, and then a final node. How do I do the minimization of the energy with respect to the combination of Y1 and Y2? So basically, the computation is that you go through this graph, you follow a particular path, and you add up the energies along the path. Right. So I start with zero, I add up this cost, I get here; then I take this transition, add this cost, I get here; then add this cost, I get here; then add that cost, and I get there. And then I'm done. So what I've done is implicitly computed the energy for the combination one, zero, one, zero, by just summing up the costs along the path. So to do the inference, which is to minimize the energy both with respect to Y and Z, I just need to find the path in this graph that has the lowest cost. And of course, we know how to do this. This is a minimum-cost path search through a graph, which all of you have probably studied before, hopefully.
You solve this problem through dynamic programming. Okay. Some engineers call it the Viterbi algorithm, but it's the same thing. So you go through, and basically, for every node, you ask: what is the cost of getting to that node along any path? So here I only have one path, so the cost of that node is just that transition, and the cost of that node is just this transition. What about the cost of that node? To determine it, I'm going to ask which of the two incoming paths has the lowest cost. Is it the cost already attached to that node plus the cost of this transition? Or is it the cost attached to that other node plus the cost of that transition? I'm going to pick the smaller of the two, and I'm going to decide the cost of getting to that node is the smaller of those two paths. Okay. I'm going to write this cost here, and I'm going to remember where it came from; remember that it came from, let's say, here. Okay. I do the same for this, and then I go to the next step. Again, here, I have an accumulated cost for the best path to get to each of those nodes. So the cost to get to that node is the smallest of those two paths: the cost of this plus the cost of the transition, or the cost of this plus the cost of that transition. Do the same here, do the same here. And then once I get here, I get the cost of the shortest path, the path with the least cost. And by tracing back (for each node, I remember where I came from), I can figure out which path I went through. And that gives me the combination of Z and Y that gives me the lowest energy. So this algorithm is dynamic programming. Okay. It's a shortest path in a graph, which I'm sure many of you have studied before. And it's very simple. You go through it step by step.
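The procedure just described can be sketched directly. A minimal Viterbi-style pass over the trellis from the example (z1, z2, y1 binary, y2 ternary); all the energy values in the tables below are made up for illustration, and a real system would get them from trained networks.

```python
def viterbi(stages, start_cost, edge_cost):
    """stages: list of value sets, one per variable, left to right.
    start_cost[v]: energy of the first variable taking value v.
    edge_cost(t, u, v): energy of the factor linking value u at stage t
    to value v at stage t + 1. Returns (lowest total energy, best path)."""
    best = {v: (start_cost[v], [v]) for v in stages[0]}
    for t in range(len(stages) - 1):
        new_best = {}
        for v in stages[t + 1]:
            # Cheapest way to reach v: minimize over predecessors u,
            # keeping a backpointer (here, the whole path so far).
            u = min(stages[t], key=lambda u: best[u][0] + edge_cost(t, u, v))
            c, path = best[u]
            new_best[v] = (c + edge_cost(t, u, v), path + [v])
        best = new_best
    v = min(best, key=lambda v: best[v][0])
    return best[v]

stages = [[0, 1], [0, 1], [0, 1], [0, 1, 2]]   # z1, z2, y1, y2
start = {0: 0.4, 1: 0.2}                       # first factor, with x fixed
table = {   # hypothetical pairwise energies for the remaining factors
    (0, 0, 0): 0.1, (0, 0, 1): 0.5, (0, 1, 0): 0.6, (0, 1, 1): 0.2,
    (1, 0, 0): 0.3, (1, 0, 1): 0.4, (1, 1, 0): 0.9, (1, 1, 1): 0.1,
    (2, 0, 0): 0.2, (2, 0, 1): 0.8, (2, 0, 2): 0.7,
    (2, 1, 0): 0.6, (2, 1, 1): 0.3, (2, 1, 2): 0.5,
}
energy, path = viterbi(stages, start, lambda t, u, v: table[(t, u, v)])
print(energy, path)   # lowest-energy setting of (z1, z2, y1, y2)
```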
At each step, you compute the cost by adding the cost accumulated at the source nodes to the cost of the transition, and you write down the smallest of the two, or the three, or the four, or the N that you have. Okay. The complexity of this algorithm is essentially linear in the size of the graph, and the size of the graph depends on the number of combinations of values that enter a particular factor in this factor graph. Okay. This type of left-to-right graph is called a trellis, by the way. So this process of finding the shortest path in the graph, the one that gives you the lowest energy: in speech recognition systems and translation systems and various NLP systems that produce text, this is called a decoder. And it's used everywhere. Very often, those terms here actually involve some sort of neural net. Okay. So it's not necessarily a simple thing. Okay. Now, we talked a little bit about this problem, and this is going to be related to a homework. So let's say you want to do handwriting recognition. Okay. So I have a word here. It may be a little unreadable. I like this example because it is unreadable. It's more readable, perhaps, if I put the dots on the I's. I gave you this example before, and I purposely badly drew the second N, so it looks like an M, but it's actually the word "minimum", with a spelling error, essentially. So how can you do word recognition, which is an example of structure prediction? So this is an example for handwriting, but let's imagine this is a speech signal. Okay. So you have the sequence represented as a sequence of vectors that represents a word. Okay. And let's say I have a label here, which is the correct transcription of this. Actually, I'll draw this above a little bit. So the correct transcription is M, I, N and so on, but I'm actually going to write it in a slightly different manner: blank, M, blank, I, blank, N, blank, and so on. Okay.
So each of those is basically a target character, if you want, the category. And I have 26 categories, the 26 letters, plus a 27th category, which is "none of the above", or blank, if you want. Okay. Something that is not a character, something in between. Now I'm going to run a convolutional net on this. Okay. So I'm going to have this big convolutional net that is going to take the whole image, and every output vector is going to be a list of 26 scores, or 27 if I include blank: energies that indicate the score of each of the 27 categories for a particular window on the input. The next vector is going to give scores for a window that's slightly shifted, et cetera. So this is the usual conv-net trick, right? So each of those guys gives me a score, for a particular window, for what happens within that window, roughly in its center, but it looks at a little bit of context. Again, if this were a speech signal, those vectors would represent sound categories; it would tell you the sound, looking at some context window. This is relevant to the question that was asked earlier about speech recognition. So we get this sequence of vectors, and I'm going to draw all of them. Okay. And now it comes down to training the system. How are we going to train the system? We don't know where the characters are. Someone told us that the word written on the input is "minimum", so we have the label, but we don't have the location of each of the characters. Okay. So we're going to have a latent variable that's going to tell us where the characters are. This is very similar to the example I was talking about earlier with DETR, where we have the list of objects in the image but we don't know where they are, and the system gives us another list, which may be in a different order, may be missing some objects, or may have too many. So we need to find a way to match those two things. Same story here.
We need to find a way to match the label that we have with the output that our system produces. And that can be done through a latent variable. So again, we're going to appeal to finding the shortest path in a graph, to the Viterbi algorithm, or dynamic programming. So we're going to build a graph. And I'm not going to draw it in the form of a graph; I'm going to draw it in the form of a table. And in that table, we're going to fill each cell with a distance, a measure of divergence of some kind, an energy function between the corresponding vectors at the corresponding locations. So the cost for this cell is going to be the energy, or measure of distance, between those two things. Okay. And we're going to have another one here, which is the distance between those two things. Okay. So we just fill in this matrix with the cost of matching one symbol with another, or one vector with another. Those can be one-hot vectors, for example, right? So for example, if this guy is a one-hot vector for a category i, this number here could be the negative log of the score coming out of the corresponding category in this vector. Okay. Which basically would be the usual cross-entropy loss. Okay. But we could use Euclidean distance. We could use some other energy. I'm not specifying; something that measures the match between the two. And now, to find the best pairing between the labels and what the system outputs, what we're going to do is try to find a path. So we're going to view each of those elements in the table, in the array, as a node in a graph. And we're going to try to find a shortest path in this graph that goes from lower left to upper right and basically minimizes the sum of the costs along the path. Okay. Perhaps with some proper normalization for the length of the path. 
And in there, there are going to be three types of transition: either we go diagonally, or we go vertically, or we go horizontally. If we go diagonally, what we mean is that at one particular location, this guy is associated with this guy; and if we go diagonally again, this guy is associated with that guy. Okay. If we go horizontally, it would mean that the label corresponds to multiple vectors here. And that may happen. Let's look at this letter M here. Certainly, when the convnet looks at it in the center, this guy here is going to say M, of course, right? But perhaps when you shift the window a little bit, it's still going to say M on both sides. Okay. So those three guys, in fact, are going to be M. And this corresponds to this M here. So what we want here is horizontal transitions that say: this M here, this M label, is basically those three guys. Okay. And the fact that I have three M's here is more evidence for the fact that there is an M, so I don't want to pay a high price for that. Okay. Why do I have a blank marker? It's because I might want to train the system to tell me when the window is in between two characters. I want the system to tell me: you are in between two characters, and this is not a good character, so I'm not going to tell you what the character is. You may ask why this is important. It's because this could be the word minimum, or it could be a non-English word with two I's, like M, I, I, M. So is it a U, or is it two I's? You're going to have both alternatives. And what is going to tell you whether it's one or the other is the fact that you don't have M, I, I, M as a desired output, and it's not in your dictionary. So it can't be M, I, I, M. It's not the correct answer. First of all, during training, you're not going to have that as a target. 
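Here is a minimal sketch of that alignment table as a dynamic program with the three moves (diagonal, horizontal, vertical). The cost matrix layout (label symbols as rows, windows as columns) and the function name are my own illustration; in the lecture's setting each cell would hold, e.g., the negative log of the convnet's score for that symbol at that window.

```python
import numpy as np

def align(cost):
    """cost[i, t] = energy of matching label symbol i to frame/window t.
    Moves: diagonal (advance both), horizontal (same symbol, next frame),
    vertical (next symbol, same frame). Returns the minimum total cost of a
    path from cell (0, 0) to cell (L-1, T-1)."""
    L, T = cost.shape
    D = np.full((L, T), np.inf)
    D[0, 0] = cost[0, 0]
    for i in range(L):
        for t in range(T):
            if i == 0 and t == 0:
                continue
            best = np.inf
            if i > 0 and t > 0:
                best = min(best, D[i - 1, t - 1])  # diagonal move
            if t > 0:
                best = min(best, D[i, t - 1])      # horizontal: label spans frames
            if i > 0:
                best = min(best, D[i - 1, t])      # vertical move
            D[i, t] = cost[i, t] + best
    return D[-1, -1]
```

The horizontal move is what lets a single M label absorb the three consecutive windows that all say M, without paying a matching penalty three separate times beyond the per-cell costs.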
But then at inference time, what you need to do is figure out simultaneously the shortest path in this graph and the sequence of symbols that minimizes the energy. It's easier than you think. But what you have to take into account here is: is the word that I'm recognizing one of the words in my vocabulary? Okay. So again, you can be very inefficient about it and just go through the list of all words in your vocabulary. And for each word, do the shortest-path calculation that matches it against whatever comes out of your convnet. Then you get an energy out of it. You can go through every word in your vocabulary, you will get an energy for each one, and you output the one with the lowest energy, right? That would be a good way of doing it. Unfortunately, in English, you have something like 200,000 words, so it's not really practical. So the question is: can you have a more efficient way of doing this search? And the answer is yes. What you need to do is represent your vocabulary not as a list of words, but as what's called a trie: basically a tree that represents all the words in the dictionary, starting from a root node. So you start from the start node and you say: what can be the first letter of any word in my vocabulary? Okay, I have words that begin with A, I have words that begin with B; pretty much every one of the 26 letters can be the first letter of a word. Then, what can be the second letter? Well, if I started with A, there are a few words with a double A, but it's pretty rare, so I'm going to pay a price for putting another A here. If my first letter is a Q, and the word is not Arabic, the second letter is most likely U, because in English, which gets this from French, when you have a Q, you have a U afterwards. So you can build your vocabulary as a tree. 
So now, what you have to do to figure out the word that is coming out of your handwriting or speech recognition system is to figure out, at the same time, a good shortest path together with one of the branches in this dictionary. So let me write the tree for a very simple set of words, where the first letter can be C or B, the second letter can be A or U, and the third letter can be, let's say, B, T, or R, depending on the branch. Okay, I have a few characters. So any path here is an English word: cab, cat, cut, bat, bar, and but. And they're all represented by this tree. And so what I'm going to do is search simultaneously. I'm going to say: what can be the first character? It can be C or B. I can compute the energy for B and C according to the first character I have here, and I can put those two costs here. So that gives me two energy values. And then I can do this for the next one: am I moving on to the next character, or staying on the first one? So you can pretty clearly see that you can find a combination of a path here and a path in this graph that will overall minimize the energy, which is the sum of those scores over the path. You can of course have costs attached to those transitions as well that indicate how likely that word is. So perhaps in your language, cat is very frequent and cab is less frequent, so you'll have a higher energy here that will make you pay a price for recognizing cab versus cat, because cat is more frequent. Okay. Now, we can actually turn this into a slightly more general form. By the way, what I just explained is going to be pretty much directly the topic of homework three. So here is a more systematic way of viewing this. 
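A minimal sketch of that dictionary trie and the joint search, using the toy vocabulary from the lecture. For simplicity this version assumes one energy table per character position (no horizontal repeats) and does a plain recursive search rather than full dynamic programming; the function names and data layout are my own.

```python
# Build a trie (prefix tree) over the vocabulary, then find the complete
# word whose letters have the lowest total energy under the recognizer.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def best_word(trie, energies):
    """energies[t][ch] = energy of letter ch at position t (e.g. produced
    by a convnet). Returns (total_energy, word) for the cheapest word in
    the trie that exactly spans all positions."""
    def search(node, t, cost, prefix):
        results = []
        if "$" in node and t == len(energies):
            results.append((cost, prefix))
        if t < len(energies):
            for ch, child in node.items():
                if ch != "$" and ch in energies[t]:
                    results += search(child, t + 1, cost + energies[t][ch], prefix + ch)
        return results
    return min(search(trie, 0, 0.0, ""))
```

Because the search only ever follows trie branches, non-words like M, I, I, M are never scored at all, which is exactly the point of constraining the decoder with the vocabulary.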
There are a lot of situations where the hypotheses of a recognition system are best represented by a graph. Okay. So I give you some examples here. The dictionary, the vocabulary here, is represented by a tree, a graph. And you can think of this as a set of hypotheses together with energies on them, indicated by the costs you put on the transitions. You can think of this matrix here as a graph as well: the nodes are the elements of the table, and the transition from one element to the next is like an arc in a graph, right? So this is like a grid graph with transitions that can go sideways or diagonally. In fact, there is a general way of interpreting this, where you can think of a trainable system, a deep learning system, in which the values that are exchanged by the layers are not tensors, but graphs. Okay. This is not a graph neural net. There is a concept called graph neural net, and another concept called graph convolutional net; that's not what I'm talking about. What I'm talking about here is something called graph transformer networks, which is a different concept. And it's a very old one; it goes back to the mid-90s. But it is a concept where you represent the state of the system not by a tensor or a vector or anything like that inside a deep learning system, but by a graph, where the nodes and the transitions have values attached to them: energies, or images, or things like that, tables, whatever. And this is very much the way people working in speech recognition actually apprehend this problem. In fact, for PyTorch, there is a library called GTN which basically implements this idea. It's very simple. So the question is: how do you do backpropagation through a deep learning system where the states are graphs? 
And the answer is: it's relatively simple, because the values on those graphs are produced by neural nets. And so you can compute the gradient of whatever it is that you compute on the output with respect to the values that are on those graphs, and then backpropagate all the way through the neural nets that produced those values. So let me go through an example. It may be a little complicated, so bear with me for the next 10 minutes. This is in the context of handwriting recognition, and a different type of handwriting recognition system that I want to mention, where instead of having a sliding window over the input, there is some sort of heuristic way of deciding where the boundaries between characters might be. This is called a segmenter. So you're given a handwritten image here, and you run it through a segmenter. It's a program that you wrote which makes hypotheses about where to cut this into characters. So you can have a cut here, a cut here, a cut here. Then you build a graph with paths where each of the individual pieces is attached to a transition. So this path says the three is a piece, the left side of the four is a piece, and the right side of the four is also a piece. It's not a good segmentation for this sequence, by the way. But we can construct other paths. So this path will group the three and the left part of the four together. Okay. And then we'll jump over those two components, because we don't want to reuse a piece twice. And then the next piece is a piece by itself. And then the last path has the three, and then the four. And that's the correct path. That would be the correct segmentation. But we don't know that yet. Okay. So what do we need to do now? We need to give an energy to each of those paths. And together with that, we need to give an energy for the recognition that we produce. 
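To see why the segmenter's hypotheses naturally form a graph, here is a small sketch that enumerates every segmentation implied by a set of candidate cut positions: each segmentation is one path, and each piece (a span between two cuts) sits on one transition. The cut positions in the test are illustrative, not from the lecture's figure.

```python
# Enumerate candidate segmentations as paths over pieces delimited by
# candidate cut positions (including the start and end of the image).

def segmentations(cuts):
    """cuts: sorted candidate cut positions, including start and end.
    Yields every segmentation as a list of (left, right) pieces."""
    def walk(i):
        if i == len(cuts) - 1:
            yield []  # reached the right edge: path complete
            return
        for j in range(i + 1, len(cuts)):
            for rest in walk(j):
                yield [(cuts[i], cuts[j])] + rest
    yield from walk(0)
```

With n internal cut candidates there are 2^n segmentations, which is why in practice you never enumerate them: you score the pieces once and let a shortest-path algorithm pick the best path through the segmentation graph.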
So we're going to run each of those pieces through a neural net, a convolutional net, let's say. That convolutional net, run on each of those pieces, is going to give us a list of scores, or energies, for each of the categories: zero to nine, if we're doing digit recognition. So what this guy is going to produce is 10 scores. Okay, 10 energies, for zero to nine. And what we're going to build here is another graph, which basically mimics the structure of this segmentation graph. This is going to be the interpretation graph, where each path now corresponds to a particular interpretation of the input string for a particular segmentation. Right. So if I go through this interpretation, three, four, that's the correct interpretation with the correct segmentation. But here is another interpretation: three, two, one. Okay. So it identifies this as a three, this as kind of a two, and this as a one. Okay. The one has a really good energy, 0.1. The two has not such a good energy, because it's kind of truncated, right? 1.3, et cetera. So for each path in the input, I'm going to have a collection of paths on the output, each of which corresponds to a different interpretation with a different labeling. Okay. And I didn't draw all 10 of the transitions here, because that would be too messy, so I only drew the lowest-energy ones. Okay, the good candidates. So this could be a three or a five. This could be a three or a four. This could be a four or a two. This could be a four or a nine. And this could be a one or a four. Something like that. Okay. So now I could run the Viterbi algorithm on this graph to figure out what is the best answer that my system can produce. But I'm training this system, so I can tell it what the desired answer is. I'll tell it: the correct answer is three, four. So whatever path in this graph does not give you the sequence of labels three, four is obviously wrong. Okay. 
So out of this graph, I'm going to select the paths that actually say three, four. And for this example, that turns out to be only two paths: the correct one, and another one that happens to have the wrong segmentation, which possibly has a higher energy. So I'm finding the best of those two paths through the shortest-path algorithm, the Viterbi algorithm. And that gives me this path, which is the correct one. Okay. And I get an energy of 0.7. I have the correct path, because I was given the correct sequence, and this is the lowest-energy path with the correct label. Okay. So you can think of this as a multilayer network of some kind, a deep learning system, where each of those path selectors basically selects arcs in the graphs. And you can think of them as switches, right? We talked about how we backpropagate through a switch. So what we're going to have to do is make this energy as low as possible, because that's the energy of the correct answer. If we want to train our system using something like a contrastive loss, we want to make the energy of the correct answer as low as possible, and the energy of the incorrect answers larger. So here, the energy of the correct answer is 0.7. We can backpropagate gradient through this entire chain, back to the weights of our neural net, so that we change the weights of the neural net so that this number goes down. How do we do this? Well, this number is just the sum of this number and that number. So if we have a gradient of the loss, the gradient of the loss with respect to itself is 1. And so the gradient of this number with respect to that number is also 1, because this is just the sum of this and that. And with respect to that, also 1. Now here, we have this number that appears here, and that number that does not appear here. 
So when we backpropagate through this transformer here, through the shortest-path selection, the gradient is going to be 1 for this number, but 0 for that one, because it doesn't appear up there. Okay. Then again, here we have another path selector that selected the desired path. Many of those transitions did not appear anywhere, so their gradient would be 0. Okay. But some do appear, and so their gradient is going to be plus 1, like, for example, this guy. Okay, it appears in the correct path, and when I backpropagate, it will get plus 1. Now, some others are minus 1 and 0; ignore this for the time being. And so again, I can backpropagate the gradient. These are the outputs of the neural net, different instances of the same neural net. So I can backpropagate through the neural net and get a gradient with respect to the weights of this overall thing. So I've basically backpropagated through this structure. This is what the GTN library does for you, right, if you want to use it. There are a couple of questions. Yep. So the library is GTN, like this: Graph Transformer Networks; DGL is a different library. Okay, so there is a question: why are there two edges between two nodes in the interpretation graph? Well, there should be 10 of them, but I only drew two. There should be 10 of them because there are 10 categories, right? There are 10 different transitions to go from this node to that node. Each of the 10 categories, 0 to 9, has a score produced by this neural net that looks at this piece. Okay, this neural net has 10 outputs; it produces a vector of 10 outputs, and I represent them by 10 transitions with 10 different energies and 10 different labels from 0 to 9. But because I don't have space, I only drew two of them. And the other question is: when we do inference on the validation set, we take off the path selector and say our word is the one with the lowest-energy path, right? Yeah. So I only explained here how we do this when we know the desired answer. 
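Here is a minimal sketch of that switch view of a path selector: the selected path's energy is a plain sum of arc energies, so its gradient with respect to each arc energy is 1 if the arc is on the selected path and 0 otherwise. The representation of paths as lists of arc names is my own illustration.

```python
# Backprop through a shortest-path selector. The selector behaves like a
# switch: gradient 1 on arcs of the chosen (minimum-energy) path, 0 elsewhere.

def path_energy_and_grad(arc_energy, paths):
    """arc_energy: dict arc -> energy; paths: candidate paths, each a list
    of arcs. Selects the minimum-energy path and returns
    (selected_energy, gradient_dict)."""
    best = min(paths, key=lambda p: sum(arc_energy[a] for a in p))
    grad = {a: 0.0 for a in arc_energy}
    for a in best:
        grad[a] = 1.0  # each selected arc contributes once to the sum
    return sum(arc_energy[a] for a in best), grad
```

This is a subgradient: the argmin path is treated as fixed during the backward pass, exactly as with max-pooling or any other switch.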
So this is during training. I'm coming to the inference when we don't have the answer. Okay. So this is for inference, when we are not given the correct answer here. Okay. So there, I run through my neural net. I have this interpretation graph. It's the same as before. And then I just compute the shortest path in that graph. Okay. It happens to be this one, the one that is at the top here in this particular example. And I compute its energy, and my energy is 0.6 for this. This energy is necessarily no larger than the other one that I got when I gave the desired answer, because that one was constrained, right? The path with the lowest energy is somewhere in this graph, and there I selected only a subset of paths. The best path may or may not give the correct answer; in this case, it doesn't. So when I leave the system to its own devices and let it figure out the shortest path in that graph, it comes up with a path that is incorrect, with the labeling three, four, one. Okay. So it thinks this is a three, this is a four, this is a one. And it's the best overall answer, and it gets a better energy. Okay. Now, what I'm going to do is plug in the 0.7 I got here, which is the score of the good guy, which I want to be small, and this one, which is the score of the bad guy, okay, which I want to be large. Okay. I can force this answer to be wrong, by the way, right? So if I want to generate a bad guy for contrastive training, I can have a path selector here that only selects the paths that have the wrong answer. But I'm not doing this here; I'm just getting the best possible answer. Okay. So my loss function is going to be the difference between those two scores. In this case, it's a very simple loss function: the difference involving the score constrained to produce the correct answer, okay, the energy, which I want to make as low as possible. So this is the energy of the good guy, or the good Y. 
This is Y, right? So I give an X and a Y, I get an energy, and I want to make that small. And then I have an X, and I produce a Y. It could be a Y hat, a Y that I force to be incorrect; in this case, I don't force it to be incorrect. And this is the energy of the bad guy: it's 0.6. And what I want is this 0.6 to be larger than this 0.7. Okay. So I'm going to push down on this guy and push up on this guy. A simple way to do this: I'm going to compute the difference between them. That's going to be my loss function, the difference between those two scores. It could be a hinge loss that pushes the bad guy up only to a certain point; here, it's just a difference. It's within the linear part of the hinge, let's say. And then I'm going to have to backpropagate gradient through this entire structure to get the gradient of that loss with respect to the weights in the neural net. Okay. So we've already seen how we can backpropagate gradient through this half. Okay. Whatever number here contributes to the output will have a gradient of 1. So this path here appears here, and its cost contributes additively to this guy and additively to that guy. So the gradient of the loss with respect to that value here is plus 1, which is indicated in parentheses here. This one is 0, because it doesn't appear. All right. All the other ones are 0 also, because they don't appear. But I also have to backpropagate through the other branch, and that one goes through a minus sign. So where the gradient is 1 here on one side, I go through this minus sign, and my gradient is now minus 1. So the gradient of the loss with respect to whatever comes out of this sum is actually minus 1, because of this minus here. Those contribute additively, so each of those guys has a gradient of minus 1. The gradient of the loss with respect to each of those numbers is minus 1. Again, I can go back through the graph transformer. 
So the corresponding nodes here are going to have a minus 1 contributed to their gradient. But I have two gradients coming from the top: one gradient coming from here, and one gradient coming from here. And those gradients are either plus 1, minus 1, or 0. They're minus 1 if they come from here, plus 1 if they come from here, or 0 if the arc doesn't appear anywhere. Now, this arc, for example, appears on both sides. So the plus 1 that comes down here and the minus 1 that comes down from here are going to cancel. And so the gradient with respect to this guy here is going to be 0. This arc here is wrong. This transition is wrong: it appears in the wrong answer, but it does not appear in the correct answer. So it's only going to have a contribution from that side, and so it's going to get a gradient of minus 1. Same for this guy, because it's also part of the wrong path. This guy, on the other hand, is in the desired path and does not appear in the wrong path, so its gradient is going to be plus 1. What does that mean? This means that when I backpropagate through this, I'm going to try to make this energy smaller: give a higher score for 4, basically, at that location. And I'm going to try to make this energy larger. So this is a bad one; it looks like a good 1, but it's actually not a 1, it's the left part of a 4. So I'm going to try to make this energy higher. And I'm going to try to make this energy higher as well, because this is a terrible 4. And then this guy was part of the good answer, so I'm not going to do anything to it. So once you backpropagate through this, into the neural net, and then to the weights, the effect on this neural net is going to be to basically inhibit the score of this 1, to tell it: you're not a good 1. To tell this guy: you're not a good 4. And to tell this guy: you're a very good 4, your energy should be lower. That's going to be the effect. 
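The cancellation just described can be sketched in a few lines: the loss is the constrained-path energy minus the free-running-path energy, so each arc gets +1 for appearing in the correct path and -1 for appearing in the free path, and arcs shared by both cancel to 0. The arc names are illustrative placeholders, not the lecture's figure.

```python
# Gradient of the contrastive loss L = E(correct path) - E(free path)
# with respect to each arc energy. Shared arcs cancel; arcs only in the
# correct path are pushed down (+1); arcs only in the free path are
# pushed up (-1).

def contrastive_grad(correct_path, free_path, arcs):
    grad = {a: 0.0 for a in arcs}
    for a in correct_path:
        grad[a] += 1.0  # lower the energy of the correct answer's arcs
    for a in free_path:
        grad[a] -= 1.0  # raise the energy of the free-running answer's arcs
    return grad
```

The zero gradient on shared arcs is exactly the "don't do anything to it" behavior above: an arc the system already gets right in both passes receives no update pressure.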
You update the weights, and that's going to be the effect of it. So, overall, what this comes down to is that you've been telling the system: here is the correct answer, three, four. I don't know where the characters are, but I'm basically minimizing with respect to a latent variable, which is the association between the labels and the candidate proposals on the input. And I'm doing this by minimizing with respect to that latent variable, which turns into a path in a graph. So I'm trying to find the shortest path in the graph. So this is a general form of the special case I described earlier using this diagram, where this was a simple case. In fact, one of the very first speech recognition systems came from a famous Japanese paper by Sakoe and Chiba from the 1970s, where they used this technique, dynamic time warping, to actually do speech recognition. Okay. They didn't use backpropagation; they didn't train the system this way. But they used this sort of shortest path in a graph. And since then, it's been generalized, in speech recognition about 25 years ago or so, through work by Fernando Pereira and Mehryar Mohri, who is actually at NYU, on finite state transducers, which is a generalized form of this. And what we did in the mid-to-late 90s was to essentially realize that you could take those shortest-path decoder systems and backpropagate gradient through them, so you could do global training of a speech recognition or handwriting recognition system at the sentence level or the word level, without having to specify where the individual characters were. So you do simultaneous segmentation and recognition, without having to specify the segmentation during supervision. So I'll end here with a simple example that shows how you can use operations on graphs. 
So if you have an interpretation graph, where each path is a possible interpretation of an input sentence, and you have a grammar graph in the form of a tree, where each path represents a legal sequence of characters, you can combine those two graphs to find the paths that are common to this graph and that graph, and give them the corresponding costs. So you do the intersection; it's called a composition, but it's the intersection between those two graphs, where what you extract are the paths that exist in both graphs, and you just propagate the costs that are on the transitions to the result. So here, what you have is a new graph that is labeled by the characters. All the paths are legal sequences of characters that come out of the grammar, and the costs here are the energies that were extracted from this. So you do a shortest path in this graph, and you get the best answer that is simultaneously grammatically correct. And this is really how speech recognition works. This search, by the way, is exactly how the predictive spelling correction on the web or on your keyboard actually works: there are probabilities or scores attached to each of those transitions, and it tries to figure out what you're likely to type next. So you can put this all together into a giant system, and you can recognize checks. This was done in the late 90s, and it worked really well. It was probably the first large-scale commercial application of convolutional nets. Okay, and that would be for another time. Thank you very much. We'll see each other tomorrow for the training of EBMs in the lab, in the practical part. All right, see you everyone tomorrow. Bye-bye.