Okay, there are a number of topics I want to talk about today. This is our last lecture, and I want to keep some time at the end for random questions on random topics that you might want to ask. Maybe general questions about approaches to machine learning, AI, deep learning, etc., maybe questions that are a little more philosophical. But let me start with something more concrete. I want to talk about structure prediction. I alluded to this topic a number of times during the previous lectures, but I think not in enough depth for most people to understand, so I want to come back to it. Structure prediction is basically the problem of predicting a variable that itself is not just a single category or a single object, but a sort of combinatorial object. For example, things like a sentence: you're doing speech recognition, handwriting recognition, natural language generation or translation, and what you need to output is a grammatically correct, consistent sequence of symbols. And you can't say that there is a finite number of possibilities for the output, because the length of the output might be variable. But even if the length has a maximum and the number is in principle finite, because it's combinatorial, there's no way to enumerate all the possible outputs. Expressing the kind of constraints that the output has to satisfy, that's what's called structure prediction. And there's a lot of work on this going back to basically the early days of speech recognition, so this is not a recent problem. In fact, I'm going to start with a little bit of history. 
In my mind, the first model to do structure prediction combined with things like neural networks, trained discriminatively, was a speech recognition model for words by Xavier Driancourt and Léon Bottou back in the early 90s, 1991. It was kind of similar to work from about the same time by Yoshua Bengio, and from two or three years later by Patrick Haffner. So these are people who worked on discriminative training for systems that are supposed to produce a sequence of symbols from a signal, let's say speech or handwriting, where the first step basically is a neural net. Here I wrote TDNN; this means time-delay neural net, which is basically a temporal convolutional net. So this is the first model I can find of structure prediction hybridized with neural nets, if you want. The problem that Xavier Driancourt and Léon Bottou were trying to solve was recognizing words using a neural net, and to some extent the modern approaches are kind of similar to this. So the speech signal is represented as a sequence of acoustic vectors. You slice the signal into little chunks, and on each of the chunks you do a Fourier transform, which Alfredo has explained to you, and you turn it into basically a feature vector. Each of those vectors is typically 30 dimensions or so, maybe 40, and you get one of those vectors every 10 milliseconds, so about 100 times per second. So you have a sequence of 40-dimensional vectors, about 100 per second, and you run this through a temporal convolutional net, and at the output what you get is a sequence of feature vectors. In modern systems, those feature vectors are actually kind of softmax vectors that indicate a category, but in their case it wasn't. And those can be at the same rate or they can be slower. 
So if the convolutional net has temporal subsampling, you're not going to get 100 of those feature vectors per second. At the input there are 100 per second, so if you have subsampling by a factor of four, you will get 25 feature vectors per second. Now here is the problem. The problem is you want to recognize which word was just pronounced, and different people will pronounce the same word at different speeds. So what you need to do is what's called dynamic time warping, and I explained this already in the past. So let's imagine that you've recorded a particular person; you don't want to do speaker-independent speech recognition for now, just speaker-dependent. So you've recorded that person saying, let's say, the 10 spoken digits, 0, 1, 2, 3, 4, 5, et cetera, because you're only interested in recognizing isolated spoken digits. Maybe this is a system that is supposed to dial a number on your phone, so it just needs to recognize sequences of digits. Or perhaps it's a very simple speech recognition system that tries to spot the wake-up word for Amazon Alexa or something like this, so the only thing the system is supposed to recognize is "Alexa" or "Google" or some other wake-up word. So the system may have a bunch of pre-recorded templates that correspond to sequences of feature vectors produced by someone speaking each of the words. And now the way you want to train the system is that you would like to train the neural net at the same time as the templates, so that the overall system recognizes the words as well as possible. This is a classification problem, but there is a latent variable in it, and the latent variable is: how are you going to time-warp the sequence of feature vectors so that it matches the length of those templates? 
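To make the frame-rate arithmetic concrete, here is a minimal sketch in Python with NumPy. The function name, frame sizes, and the 16 kHz sample rate are my own illustrative choices, not from the lecture: it slices a signal into frames with a 10 ms hop, takes a Fourier transform per frame to get a feature vector, and then subsamples by a factor of four.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160, n_coeffs=40):
    """Slice a waveform into overlapping frames and take the magnitude of an
    FFT on each windowed frame, keeping the first n_coeffs bins as a feature
    vector. With a 10 ms hop (160 samples at 16 kHz), this yields roughly
    100 vectors per second of audio."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    spectra = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    return spectra[:, :n_coeffs]          # shape: (n_frames, n_coeffs)

one_second = np.random.randn(16000)       # 1 s of fake audio at 16 kHz
feats = frame_features(one_second)
print(feats.shape)        # (98, 40): roughly 100 vectors per second
subsampled = feats[::4]   # a conv net with subsampling by 4 -> about 25 per second
print(subsampled.shape)   # (25, 40)
```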
And again, I'm kind of repeating myself because I talked a little bit about this before. So you do this with dynamic time warping. What that consists of is that you line up all the feature vectors along the bottom here; think of this as a matrix. You line up the sequence of feature vectors from the input along this axis, and you put the sequence of template vectors, the feature vectors coming out of the template, on the other axis. Then each entry in the matrix is an indication of the distance between the feature vector here and the feature vector there. So you get this matrix populated with distances between feature vectors, essentially. And the best way to map one sequence of feature vectors onto another, to see if they fit, is to basically view this matrix as a set of nodes in a graph. What you want is to go from the lower left-hand corner of that graph to the upper right-hand corner through a path that minimizes the sum of the distances along the path. So obviously you're going to have to go horizontally more steps than you go vertically. On a few occasions you're going to go diagonally, on a few occasions you're going to go vertically up, but on many occasions you're probably going to go horizontally to the right. That would be the situation where you have multiple feature vectors here that are essentially identical and that match a single feature vector in the template. So for example, you pronounce the word "seven" very slowly: the initial vowel is going to be repeated multiple times because you stick on it for like a quarter of a second, so you're going to have 25 instances of it, and all of them would be mapped to maybe a single feature vector here that corresponds to that sound. So finding the path that best warps the sequence into the template sequence is like minimizing with respect to a latent variable z. 
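A minimal dynamic-programming sketch of that path search, assuming Euclidean distances between frames and the three allowed moves (right, up, diagonal); the function name and the toy sequences are hypothetical, not from the lecture:

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping: fill a (len_a x len_b) matrix of pairwise frame
    distances, then find the cheapest monotonic path from one corner of the
    matrix to the opposite corner (moves: right, up, diagonal)."""
    la, lb = len(seq_a), len(seq_b)
    d = np.linalg.norm(seq_a[:, None, :] - seq_b[None, :, :], axis=2)  # pairwise distances
    cost = np.full((la, lb), np.inf)
    cost[0, 0] = d[0, 0]
    for i in range(la):
        for j in range(lb):
            if i == 0 and j == 0:
                continue
            best_prev = min(cost[i - 1, j] if i > 0 else np.inf,           # vertical step
                            cost[i, j - 1] if j > 0 else np.inf,           # horizontal step
                            cost[i - 1, j - 1] if i and j else np.inf)     # diagonal step
            cost[i, j] = d[i, j] + best_prev
    return cost[-1, -1]   # energy of the best warping path

# A slow utterance (repeated frames) still matches its template closely:
template = np.array([[0.], [1.], [2.]])
slow = np.array([[0.], [0.], [0.], [1.], [2.], [2.]])
print(dtw_distance(slow, template))   # 0.0: perfect match after warping
```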
So it's like you have an energy function and you minimize this energy function with respect to the latent variable; the latent variable is the path in that graph. So now what you have is the best warping that matches the sequence of feature vectors to the first template. Now you keep doing this with all the templates, so for every word template from zero through nine, you have the best way to warp the feature vectors to that template. And now, if your system has been trained, you pick the category of the word template with the smallest distance. Okay? As simple as that. That's for classification. Now how about training? For training, this is a latent variable model essentially, and what you need to do is make the energy for the correct answer as small as possible and make sure the energies for the incorrect answers are larger, okay? So let's imagine the correct answer is this word here, the second to last one, the category three, for example. Okay, so we know the correct answer is three. So what we need to do now is basically change the word template here a little bit so that it gets closer to the feature vector sequence, and then change the feature vector sequence so that it gets closer to the template, right? You can think of this DTW distance, this dynamic time warping distance, as a kind of distance that involves minimization with respect to a path, but in the end it is some sort of distance or divergence, and that's basically your energy. So what you need to do is make that distance smaller for the correct answer, so the energy of the correct answer goes down. And simultaneously you need to make sure the energies of all the incorrect answers are larger, okay? So you might need to push them away. 
So you might need to have an objective function that is going to take the templates for the wrong words and somehow push them away from the current sequence of features, okay? So that's how you learn the templates. And then simultaneously you're going to have a combination of gradients that back-propagate through this DTW. One is going to try to make the sequence of feature vectors such that, once you do the time warping, it gets closer to the correct word template, but also change it so that it gets away from the other templates, the templates for the other categories, okay? So that is simply back-propagating through the dynamic time warping. And the dynamic time warping really is a switch, a giant switch. It basically tells you: those values here along the path matter, because they are the ones that indicate whether my input vector matches my template vector; all the other ones that my path is not taking are irrelevant, they don't matter. So the distance is just the sum of those values. So when I back-propagate, I get, for each vector here, a gradient that corresponds to the gradient of the distance to the corresponding vector in the template. For the correct template, this is going to cause all those vectors to get closer to the corresponding vectors in the template, and to move away from the corresponding vectors in the bad templates that you decide to push away. And then you can just back-propagate those gradients all the way through. Now, I'm kind of explaining the mechanics of it, but you don't actually have to think about it this way. Conceptually, it's just an energy-based model with a latent variable, and you compute the gradient of your energy with respect to everything in your network, for values of the latent variable that depend on the position of this switch here. 
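Here is a small sketch of that "switch" behavior, assuming squared Euclidean frame distances: once the minimizing path is found, the gradient of the DTW energy with respect to each input frame comes only from the template frame(s) it is matched to on the path. The helper names and toy sequences are my own, not from the lecture.

```python
import numpy as np

def dtw_path(seq, template):
    """Return the best warping path as a list of (input_frame, template_frame)
    pairs, using squared Euclidean frame distances and the usual DP recursion."""
    la, lb = len(seq), len(template)
    d = ((seq[:, None, :] - template[None, :, :]) ** 2).sum(axis=2)
    cost = np.full((la, lb), np.inf)
    cost[0, 0] = d[0, 0]
    for i in range(la):
        for j in range(lb):
            if i or j:
                prev = [cost[i - 1, j] if i else np.inf,
                        cost[i, j - 1] if j else np.inf,
                        cost[i - 1, j - 1] if i and j else np.inf]
                cost[i, j] = d[i, j] + min(prev)
    # Backtrack from the last corner, always taking the cheapest predecessor.
    path, i, j = [(la - 1, lb - 1)], la - 1, lb - 1
    while i or j:
        moves = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
        i, j = min((m for m in moves if m[0] >= 0 and m[1] >= 0),
                   key=lambda m: cost[m])
        path.append((i, j))
    return path[::-1]

def grad_wrt_inputs(seq, template):
    """The path acts as a switch: each input frame only receives gradient from
    the template frame(s) it is matched to on the minimizing path."""
    g = np.zeros_like(seq)
    for i, j in dtw_path(seq, template):
        g[i] += 2.0 * (seq[i] - template[j])   # d/da of ||a - b||^2
    return g

seq = np.array([[0.5], [0.5], [1.0]])
template = np.array([[0.0], [1.0]])
step = grad_wrt_inputs(seq, template)
print(seq - 0.1 * step)   # a small gradient step pulls each frame toward its matched template frame
```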
You can think of this switch as the one that tells you which of the answers is correct. And so it's nothing more than an energy-based model. OK, now there's a question: why am I introducing this before talking about structure prediction? Because this is a very simple form of structure prediction, particularly if now the problem is not to recognize a single word, but to recognize a sequence of words, right? A word is a sequence of sounds, but a sentence is a sequence of words, and so therefore also a sequence of sounds. So I could build a collection of possible sequences which are grammatically correct sequences of words, which correspond to grammatically correct sequences of sounds. And then this kind of dynamic time warping, if you want, will find, among all the possible sequences of symbols or sounds or words, the one that has the lowest energy, the one that this feature vector is closest to somehow. OK, so that's the general problem of sequence labeling, and it can be formulated at a general level in this way. Now, I've set the stage a little bit, and I'm going to talk about something that you're not going to see immediately as connected, but it's going to come up at the end. OK, so let's say you have a learning system that gets an input x, and it's an energy model in which the energy is a sum of three terms in this case. So those blue squares here are basically factors in a factor graph. They're energy terms, additive energy terms in your energy function. And your output is a sequence, in this case a sequence of four symbols. And those symbols do not all contribute to all the terms in the energy. So basically, the first term of your energy function takes into account the first two output symbols, the first two variables in your output sequence; the second one takes the second and third; the third one takes the third and fourth, OK? 
Now imagine that this were a sequence of words, and your system was supposed to do something like speech recognition. So x is the speech signal. In the blue boxes, you have neural nets and various other things. There might be another neural net that looks at x and then produces feature vectors that go into those boxes, but that's a detail for now. And what those blue boxes would have to implement is basically grammatical constraints. So in English, certain words can follow others, but not others, right? You rarely have two verbs that follow each other. And so you could implement, in this energy term, something that would make you pay a price for making a verb follow another verb, or for having, I don't know, two prepositions. You can have two adjectives that follow each other, things like that. So basically those would implement sort of basic grammatical rules. And you can think of this as kind of a language model: given the previous word, tell me what word can come after, and I can train this on a corpus of text to learn this energy function. So this type of model would implement a very crude language model, just taking the previous word and then telling you what next words are possible, making you pay a price for picking a word that is not correct. Okay, so how do you do inference? This is just basically an energy model here, which in this case doesn't actually have a latent variable. I give you an x and you have to find the sequence of y's that minimizes the energy. But in this case, because the energy is a sum of three terms, there are efficient ways to find the sequence of y's that minimizes the energy that may not require a completely exhaustive search, or gradient descent, or something like this, okay? And I'm going to place myself in a situation where the y's are actually discrete, okay? 
So these are things like words or sounds or categories of some kind. And this applies to the situation where the variables you need to infer are all outputs, which means they're going to be visible in the training set and you can train your system to infer them correctly. But it could also be another situation where some of the variables are observed, like x here on the left, and y is observed during training on the right, but all the intermediate variables are never observed. They are latent variables you need to minimize over as well. But again, here, this factor graph is factorized in the sense that the energy is a sum of different terms that only take subsets of the variables into account. All right, okay, so let's take a very concrete example now. Let's say the energy here in this case is a sum of four terms, four energy terms, okay? The first two depend on x, the observation. The last two depend on y, which is the variable you need to predict, which you're given during training but not during test. And two other nodes are latent variable nodes, okay? And let's say x is some high-dimensional variable; we don't care what it is because we just observe it. And z1 is binary, z2 is binary, y1 is binary, and y2 is ternary, so it can take three values, 0, 1, 2, okay? Now, if you count how many possible configurations of z1, z2, y1, y2 there are, there are basically 24, right? Two times two times two times three, that's 24 different possible configurations of values. So if you wanted to do exact inference, you might have to try all 24 of those configurations, compute the energy of all 24 of them, and then pick the one with the lowest energy, right? And in fact, those 24 configurations correspond to 24 times four evaluations of those energy terms, because you have four energy terms. So we'd have to compute 96 different energies to be able to do this, okay? 
And this is a small example where the sequence is short and the variables are binary, okay? This grows exponentially with the number of possible values of the z's and the length of the sequence. So if you have n possibilities for each of the variables and the length is l, it's n to the l, right? So it's exponential in the length, okay? But the thing is, there is a more efficient way of figuring out what is the configuration of lowest energy, and it's due to the fact that you have this local structure. So z1 can only take two values, okay? And z2 can also only take two values. So this energy term here can only take four values: it's only ever going to see four different configurations, because it can only see 0,0; 0,1; 1,0; 1,1, right? So you can imagine pre-computing those four values, okay? This one is also going to see only four values, because this variable is binary and that one is binary too, so you can pre-compute those four values as well. That's another four evaluations of an energy term, so we're up to eight. And this one has six different values, because this variable is binary and this one is ternary, so it's two times three: six different configurations. So by pre-computing the four here, the four here, and the six here, you have computed all the possible configurations, basically. And that's represented here at the bottom. This is called a trellis, and it's basically a graph that has a source node and a target node, and every path in the graph corresponds to a particular assignment of the variables, okay? So for example, if I go through this path, it means z1 equals 1, z2 equals 0, y1 equals 1 and y2 equals 2, let's say, okay? And if I add up the terms on each arc, I get the overall energy. Each arc is basically annotated by the value of the energy term that corresponds to this configuration. 
So for example, this arc here is the value of this energy term for y1 equals 1 and y2 equals 2, okay? Each of these arcs is a value of this energy term, each of these arcs is a value of that energy term, etc. And now finding the lowest-energy configuration of z1, z2, y1, y2 simply consists in finding the shortest path in this graph. Okay? And to do this, I only have to evaluate four energy terms here, four terms here, and six terms here, and that's it. So that's 14, plus the first two here that depend on x, so 16 values total. That's a lot less than 96, okay? And that's because the energy is a sum of terms, and you can use these kinds of efficient algorithms to do the inference. Okay, so this is a simple case where the output is a sequence, and when the output is a sequence, there is a simple algorithm: it's basically shortest path in a graph, in a trellis. So that's just dynamic programming, and it's very simple. It's efficient, it's nice. Now, to train a system like this, what you need to tell it is: here is the correct configuration of y1, y2. I don't know what z1, z2 is because it's a latent variable. So find me the path that goes to the correct combination of y1, y2, okay? Say we know that y1 equals 1 and y2 equals 2. So we know that the correct path has to include this link, right? And so there's only a subset of paths for the previous variables that are possible. You can't go to y1 equals 0, because that would be incorrect. So basically only this arc survives, and for the earlier part of the path you can take whatever you want, as long as it gets to that point. So you just find the one that minimizes the energy: you minimize the energy with respect to z1, z2 while y1, y2 are clamped to the correct values, okay? 
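Here is a sketch of that comparison in Python; the energy tables are random stand-ins, and since x is held fixed, the first factor depends only on z1. Exhaustive search touches 24 configurations times 4 terms, 96 lookups, while the trellis only needs the 2 + 4 + 4 + 6 = 16 precomputed entries plus a shortest-path sweep, and both give the same answer.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
# One random energy table per factor (x is fixed, so Ea depends only on z1):
Ea = rng.random(2)          # Ea[z1]
Eb = rng.random((2, 2))     # Eb[z1, z2]
Ec = rng.random((2, 2))     # Ec[z2, y1]
Ed = rng.random((2, 3))     # Ed[y1, y2]

def total_energy(z1, z2, y1, y2):
    return Ea[z1] + Eb[z1, z2] + Ec[z2, y1] + Ed[y1, y2]

# Exhaustive inference: 2*2*2*3 = 24 configurations, 4 terms each = 96 lookups.
best_exhaustive = min(itertools.product(range(2), range(2), range(2), range(3)),
                      key=lambda c: total_energy(*c))

# Dynamic programming over the trellis: only the 2+4+4+6 = 16 table entries.
# The forward pass keeps, for each value of the current variable, the cheapest
# prefix energy; the full assignment is recovered by backtracking the argmins.
best_z1 = {z2: min(range(2), key=lambda z1: Ea[z1] + Eb[z1, z2]) for z2 in range(2)}
cost_z2 = {z2: Ea[best_z1[z2]] + Eb[best_z1[z2], z2] for z2 in range(2)}
best_z2 = {y1: min(range(2), key=lambda z2: cost_z2[z2] + Ec[z2, y1]) for y1 in range(2)}
cost_y1 = {y1: cost_z2[best_z2[y1]] + Ec[best_z2[y1], y1] for y1 in range(2)}
y1_star, y2_star = min(itertools.product(range(2), range(3)),
                       key=lambda p: cost_y1[p[0]] + Ed[p[0], p[1]])
z2_star = best_z2[y1_star]
z1_star = best_z1[z2_star]

assert (z1_star, z2_star, y1_star, y2_star) == best_exhaustive
print(best_exhaustive, total_energy(*best_exhaustive))
```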
And the way you train the system now is that, by gradient descent, you back-propagate the gradient of the overall energy, okay? For this particular y and this particular x, and the z that you obtain by minimizing, you back-propagate the gradient of this energy with respect to the parameters of all those energy terms, and you try to make that energy smaller, right? You know you have the correct y, the correct x, and whatever value z must take; try to make that energy lower by tweaking the parameters. At the same time, you have to make sure the energy of incorrect answers, values of y1 and y2 that are incorrect, is higher, right? So you take other values of y1, y2, including y1 equals 0 and y2 equals whatever, okay? And for all of those other configurations of y1, y2, you want to make sure that whatever energy you get by minimizing over z is higher than what you got for the correct one, okay? So your loss function is going to be something where you take the energy of the correct answer and try to make it lower, and then you take the energies of incorrect answers and try to make them larger, okay? That's discriminative training for structure prediction. Structure prediction, because the structure here is represented by this sequence of costs, okay? But conceptually, at a high level, it's no different from everything we talked about before, when we have a latent variable and we train with a criterion that says: I want to make the energy of the correct answer small and the energy of all the other answers higher, okay? Any question at this point? I had a question. Based on this diagram, it seems like this network only really takes discrete values, and my understanding was that back-propagation isn't really effective if you're only working with discrete values. So I'm wondering if I'm missing something, or how you connect those things. Right, okay. So in this case, z1, z2, y1, y2 are not variables that you learn. 
Okay, they're labels, essentially. They're discrete: y1, y2 are discrete just like the category at the output of a classifier is discrete, except you have two of them. z1, z2 are basically of the same nature: they're discrete variables. They're not things you're going to learn by gradient descent; they're just latent variables you have to minimize over to do inference, right? Let's not talk about learning for now. Assume the system is trained, right? I give you an x, and by energy minimization you find the z1, z2, y1, y2 that minimize the energy, okay? And because you've trained the correct y1, y2 to have the lowest energy among all possible configurations of y1, y2, you're going to get the correct one, okay? Now, the training basically affects the parameters of each of those factors. Inside those factors, there are parameters, wa, wb, wc, wd, which I didn't represent here. And the way you train the system is you say: I take the gradient of the energy of the correct answer with respect to those parameters, and I'm going to tweak the parameters so that energy goes down. That's continuous, differentiable, okay? And then simultaneously, I have the energies of bad answers; I'm going to back-propagate gradients, and according to my loss function, I'm going to push up the energies of those so that my loss function goes down, okay? My training objective goes down, not my energy, right? Now, what I'm explaining down there with the trellis is the fact that because those variables are discrete, you can't use gradient descent to infer them, okay? And so you have to infer them by combinatorial search, essentially. And the first solution I mentioned, with the 96 factor evaluations, basically is exhaustive search: try every combination of z1, z2, y1, y2, and figure out which one has the lowest energy. 
But the whole point of this is that this is wasteful, in the sense that because the energy decomposes into terms that only take subsets of variables, you can reduce this to finding the shortest path in a graph, where the transitions in the graph are annotated by the energies that correspond to the values of the variables of the two corresponding nodes, okay? Now, this is a slightly more general form of what I told you about earlier. The model with the dynamic time warping is very much the same: the z1, z2 here are basically the path in the dynamic time warping module, and the y is which of the word templates matches, okay? And the training consists in just doing gradient descent to make the energy of the correct answer small and the energies of the incorrect answers larger, using some loss function, which I leave unspecified at the moment. Professor, when you say that you're finding the shortest path, are you saying that the distance between nodes is the energy between the nodes? The shortest path is the path that has the smallest sum of terms along the edges, right? So each edge here is marked by an energy. For example, this edge here is marked by the energy of the term b when z1 equals zero and z2 equals one, okay? So if I take this edge, I'm going to pay that energy, right? And if I take that edge, I'm going to pay that energy. And so finding the lowest-energy configuration of variables consists in finding the path with the smallest sum of values on the edges along that path, okay? So it's the shortest path in the graph. Is that clear? Yeah, that makes sense. Thank you. And then the zeros before the black node, those are zeros just because that part of the summation itself is zero energy, right? Is that what you're saying? Yeah, that's right. I don't care which of those paths it is, right? 
I don't have an energy term here for, like, what's the value of y2. If I had an extra factor here that only took y2, then that factor would basically put an energy on those edges, replacing those zeros, right? Okay, there's a question here coming from the students: are we pushing down on the energy, or are we actually doing a minimization for both training and inference? And then, when are we actually pushing up? Just during training? All right. So let me remind you how training energy-based models works, right? Particularly contrastive methods, when you have latent variables. So you have your energy function E of x, y, z. Sorry, the arguments are in the wrong order here; it doesn't matter. I give you an x, so in training mode, I give you an x and a y. I don't ever give you z, but I give you an x and a y: here is a training sample, it's an x and a y. The first thing you do is you find a z that minimizes the energy E of x, y, z, okay? And you call that F of x, y, right? But the way you compute it is just min over z of E of x, y, z. Now, for the correct y in your training set, you want that energy to be small, right? And for your inference algorithm to work at test time, I don't give you the y, I just give you the x, and what you have to find is the y that has the smallest energy. So for this to work, it has to be the case that the correct y has the lowest energy among all possible y's, right? So what I need to do during training is: I give you the correct y, and what you need to do is give a low energy to the correct y and a higher energy to every other possible configuration of y, right? And exactly how you do this, or how all those energies enter your objective function, depends on the objective that you choose. We're going to come to this in a minute, okay? 
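As a tiny numerical illustration of that recipe, with a made-up energy table for one fixed x, three possible y's and two possible z's:

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.random((3, 2))     # made-up energies E(x, y, z) for one fixed x:
                           # rows index y (3 values), columns index z (2 values)

F = E.min(axis=1)          # F(x, y) = min over z of E(x, y, z)
y_star = int(F.argmin())   # test-time inference: pick the y with the lowest F
z_star = int(E[y_star].argmin())
print(y_star, z_star, F[y_star])
```

Training would then tweak the parameters that produce E so that F(x, y) is lowest at the correct y.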
But almost certainly, you're going to have one term in your loss function that says: make the energy of the correct answer low, and another term that says: make the energy of all the other answers, or some of them, high. And we talked about this last time, actually three weeks ago, okay? But I'm going to come back to it. Is that clear, or do you need another clarification? I don't see any reply here. Okay, all right. Another one would be: what if the factor graph cannot be factorized? Do we have to search over all possible combinations of y's? Maybe this is the continuous case, I think. No, not necessarily. So this idea of decomposing into energies also gives you an advantage even in the case of continuous variables, right? Because you can do independent optimizations. The combination of values of z1 and z2 only affects Eb, even if z1 and z2 are continuous, and so you can do a little bit of the equivalent of dynamic programming there. It's more complicated in the continuous case, but it could be possible. Yeah, the worst situation is when all the z's and the y's enter a giant factor and there is no way of factorizing it. And then you have to do exhaustive search, or some approximate search heuristic as an inference algorithm that minimizes the energy. Yeah, that was actually the case the student was referring to. And the other student is also satisfied, so you answered both questions. Yeah, don't hesitate to ask if there's something that's not clear. Okay, so here is an instance of this, and if you encounter it in the literature, you'll know what it is: it's called a conditional random field. A conditional random field is a very special type of such structure prediction model. Here you have y's or z's, it doesn't matter; here they're only y's. The way those factors are parameterized is that there is a fixed feature extractor, f of x, y1, y2 in this case here. 
And then a weight vector that just computes the dot product of the feature vector with the weight vector, and that gives you a score here, just an energy, okay. The overall energy is just the sum of all those terms. So basically those are shallow neural nets, if you want: single-layer neural nets with a feature extractor at the input. And then we can think about what type of loss function we are going to minimize to train something like this. So one possibility is to use the negative log-likelihood loss. Basically you say: I want the energy of the correct answer to be low, and I want the log of the sum, over all configurations of the outputs, of the exponential of minus the energy of those configurations to be as small as possible, okay. So basically you want the combination of energies of all the answers to be as large as possible, okay. And we've encountered that loss function before: that's basically what's used in softmax. Softmax says: I want the negative log-probability of the correct answer to be as low as possible, or the probability of the correct answer to be as large as possible. That's like an energy, okay. But then simultaneously, I compute the log of the sum of the exponentials of minus the energies of all the answers, and I want that to be small: I want all those energies to be large, I want all those probabilities to be small, okay. The log-softmax criterion does that for you. When you back-propagate, it does classification, and it does exactly that: it pushes down the energy of the correct answer, and it pushes up the energies of all the other answers by computing the log of the sum, over all the answers, of the exponentials of minus the energies. So the conditional random field is basically an example of that, but you're not doing classification. 
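A small sketch of that loss for an enumerable output space, written as L = E(correct) + log Σ over y' of exp(-E(y')); the weight vector and the per-configuration feature table here are made-up stand-ins for the CRF's fixed feature extractor:

```python
import itertools
import numpy as np

def nll_loss(energy_fn, y_correct, all_configs):
    """Negative log-likelihood loss for an energy-based model:
    L = E(correct y) + log sum over y' of exp(-E(y')).
    Minimizing it pushes the correct energy down and, through the
    log-sum-exp term, pushes the energies of all configurations up."""
    energies = np.array([energy_fn(y) for y in all_configs])
    return energy_fn(y_correct) + np.log(np.exp(-energies).sum())

# Toy CRF-style energy over two binary outputs: a dot product between a
# weight vector and a fixed (hypothetical) feature vector per configuration.
w = np.array([1.0, -0.5])
features = {(0, 0): np.array([1., 0.]), (0, 1): np.array([0., 1.]),
            (1, 0): np.array([1., 1.]), (1, 1): np.array([0., 0.])}
energy = lambda y: float(w @ features[y])

configs = list(itertools.product(range(2), range(2)))
loss = nll_loss(energy, (1, 1), configs)
print(loss)
```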
You're doing structure prediction. In the positive case, you have the correct configuration of Y1, Y2, Y3, Y4. And the incorrect ones are not incorrect categories as in classification, but incorrect configurations of Y1, Y2, Y3, Y4. Other than that, it's just, you know, backprop. I mean, it's not even backprop here because it's a shallow network. If you put a whole neural net in there, parameterized by Ws, that would be perfectly kosher. And that would be a kind of deep conditional random field, if you want — which happened to actually exist before conditional random fields. Here's another idea: you can use a hinge loss. The hinge loss says, I want the energy of the correct answer to be low. And then among all incorrect configurations of the Ys, I'm going to look for the one that has the lowest energy, and that one I'm going to push up. I don't need to push the other ones because they are larger anyway. So I'm just going to figure out which configuration of Y1, Y2, Y3, Y4 is incorrect but, among all incorrect configurations, has the lowest energy, and then push that up. And the way I push up and push down is: I put the difference of those two energies in a hinge loss, so that the hinge will push the energy of the correct answer to be low and will push the energy of the most offending incorrect answer to be higher by at least some margin. That's called a maximum-margin Markov net, if you regularize the weights with an L2 penalty and have this kind of linear parameterization of the energy terms. You can also use the perceptron loss, and Michael Collins, who's a well-known professor at Columbia in NLP, actually built his career around this idea of using the perceptron loss for structure prediction.
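This hinge loss with a most offending incorrect answer can be sketched in a few lines; the candidate configurations and their energies below are hypothetical:

```python
def structured_hinge(energies, y_true, margin=1.0):
    """Hinge loss: push the correct answer's energy below the energy of the
    most offending incorrect answer by at least `margin`."""
    e_true = energies[y_true]
    # Most offending incorrect answer: lowest energy among the wrong ones.
    e_offend = min(e for y, e in energies.items() if y != y_true)
    return max(0.0, margin + e_true - e_offend)

# Hypothetical energies over four output configurations.
energies = {"0110": 0.4, "0111": 0.9, "1110": 1.6, "1010": 2.0}
assert abs(structured_hinge(energies, "0110") - 0.5) < 1e-9  # 1.0 + 0.4 - 0.9
assert structured_hinge(energies, "0110", margin=0.2) == 0.0  # margin satisfied
```

Once the margin is satisfied the loss (and hence the gradient) is exactly zero, unlike the negative log-likelihood, which keeps pushing forever.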
That perceptron loss only works if you have a linear parameterization of the factors. If you make them deep neural nets, you can't use the perceptron loss anymore, and that's because the margin is zero. We talked about this a little bit before, but I'll come back to it in a minute. Right. So those ideas have been around for a long time. Probably the first people to think about things like this were people who worked on what's called discriminative training for speech recognition, and that goes back to the late 80s, early 90s. So Juang and Rabiner, for example, at AT&T had something they called minimum empirical error loss. And this is a particular loss for speech recognition systems. They didn't have neural nets; they had some other way of turning speech signals into, you know, basically sound categories, if you want. But they had this way of training at the sequence level: not telling the system, here is this sound at that location, that sound at that location, but just telling it, here is an input sentence, here is its transcription in terms of words — figure it out by doing this time warping, in the context of hidden Markov models, which is very similar to the dynamic time warping I was talking about earlier. Then, as I said, in the early 90s, people started working on using neural nets to feed one of those structure prediction systems. The first one I know about is by Xavier Driancourt and Léon Bottou for speech recognition, and they had a time-delay neural net. Yoshua Bengio did his PhD on this and had some results around 1992, and Patrick Haffner the year after that. Léon Bottou, Yoshua Bengio, and Patrick Haffner are the co-authors of my paper from 1998 about handwriting recognition, because I hired all three of them at AT&T to work on this problem. They had basically figured out some way of doing this in their PhD theses.
And I knew that was the trick that needed to be worked on for things like handwriting recognition: structure prediction with neural nets. Right, let's see. Okay, so here is a way — and I only alluded to this really quickly in an earlier lecture — here is a way to put this in the context of deep learning. As I said before, one way to do this in the context of deep learning is to make those factors deep neural nets, basically. They just compute some energy, they are parameterized by a bunch of parameters, and nothing changes: we know how to do backprop, and we have PyTorch. But here is a slightly different idea, which draws on the same type of model. And this is for when the structure is more complex than just a bunch of fixed factors with a known structure, if you want. The example I'm going to use here is handwriting recognition, because there's a long history of it and I have drawings for it that have been around for a long time. So here the problem we have is that we have a sequence of digits at the input, and we don't know how to segment this sequence into individual digits, because we don't know what the parts are for each of the digits. The four here is kind of broken into two parts. So what we can do is build a graph in which each path is a possible way of breaking up this sequence of blobs into characters. I can make each of the separate pieces a separate character — that's the path at the top. I can group the first two pieces, three, four, the left part of the four, and then have the last two be separate. Or I can have the first be by itself, the following two be grouped, and then the last one be by itself. So what have I done here? I've basically said: okay, the way I did inference in the context of structure prediction was by having energy terms that tell me the cost of a particular combination of variables.
So this graph here is an explicit representation of that energy model, as long as I put on the arcs the energies that are computed by those modules for each value. But what if I just manipulate this graph? What if the state that I manipulate in a neural net is not a particular assignment of variables together with something to compute energies, but directly a graph like this? A graph like this, basically, you can think of as representing a list of energies for every possible configuration of the variables of interest. It's a compact way of representing a list of energies for all configurations of the sequences of symbols. So what if I build a neural net so that the internal states of that neural net are basically those graphs? And the graphs are just — again, I repeat — a compact way of representing a list of energies for every possible configuration of the variables of interest, nothing more. But I can use this for other things than energies. So here, a path in the graph corresponds to a particular way of breaking up these blobs of ink into characters; each path is a particular way of grouping those blobs into characters. I can run those images — so now this graph is annotated by images, not by energies — I can run those images through a convnet. The convnet is going to tell me, for each arc in this graph: well, this is very likely to be a three, and here is the energy for that three — low energy if it's a good three, high energy if it's a bad three. It could also tell me: well, this may be a two, but with a higher energy, or it could be a zero with an even higher energy. So it's going to build this graph, which you can call the interpretation graph. Each path in this graph is a possible labeling of a path in the segmentation graph, and the labels indicate the categories.
And the energies attached to the arcs are the energies produced by my convnet here for each of those answers. So this convnet is going to look at this little piece of a four, and it's going to tell me: well, that looks kind of like a two with a low energy, or that may look like a piece of a four with a higher energy. The one that looks at this piece, which is somewhere here, is going to tell me: well, this is a four, I'm quite sure of it, with low energy. And it's going to tell me, maybe it's something else, with higher energy. So each of those arcs here is going to be replaced by 10 arcs — I only represented two here, but essentially 10 arcs corresponding to the 10 possible categories, each of them with a different energy that is just the corresponding output of the convnet that I applied here. Now inference, again, is finding the shortest path in this graph: finding the path with minimum energy, basically. So you can think of it as a module that selects the shortest path. In fact, it's this one here — I call it a Viterbi transformer. So the word transformer in the context of neural nets was used in 1997, but it's been recycled for something else now. Okay, so here's a complete example of how this might work. Again, we have an input image. We run this through this kind of segmenter that proposes multiple alternative segmentations, which are ways to group those blobs of ink together. Each path in this graph corresponds to one particular way of grouping the blobs of ink. We run each of those through a neural net — identical copies of the same convnet that is trained to do character recognition. Each of those convnets produces a list of 10 scores. So this one tells me this is a one with energy 0.1, this is a four with energy 2.4, et cetera.
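Finding the minimum-energy path in such an interpretation graph is a textbook shortest-path computation on a DAG. Here is a minimal sketch, where the arc labels and energies are hypothetical stand-ins for the convnet outputs (two competing segmentations of a "34"):

```python
def shortest_path(arcs, source, sink):
    """Minimum-energy path in a DAG given as (u, v, label, energy) arcs.
    Nodes are assumed to be topologically ordered by their integer ids."""
    best = {source: (0.0, [])}
    for u, v, label, e in sorted(arcs):  # sorting by u respects topology here
        if u in best:
            cost = best[u][0] + e
            if v not in best or cost < best[v][0]:
                best[v] = (cost, best[u][1] + [label])
    return best[sink]

# Hypothetical interpretation graph: two candidate labelings per segment.
arcs = [(0, 1, "3", 0.1), (0, 1, "2", 3.4),
        (1, 2, "4", 0.6), (1, 2, "9", 2.4)]
cost, labels = shortest_path(arcs, 0, 2)
assert labels == ["3", "4"] and abs(cost - 0.7) < 1e-9
```

The energy of a path is just the sum of its arc energies, so one left-to-right sweep over the arcs suffices.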
This one tells me, well, this piece is a four with energy 0.6, or a nine with energy 1.2, or whatever, et cetera. This one is going to give me relatively high energy for everything, because it doesn't look good. Same for this one. So now I get a graph here — think of it as a weird form of tensor. It's a sparse tensor, really. It's something that, for each possible configuration of the variables, tells me the cost of that configuration. So it's more like a distribution over paths, if you want — or the log of a distribution that is not normalized, because we're talking about energies. Then I take this graph, and I want to compute the energy of the correct answer, because I might want to train the system. So I'm telling it: the correct answer is three, four. Select among those paths the ones that actually say three, four. And there are two of them: there's three, four with energy 0.1 plus 0.6, and then there's three, four with energy 3.4 plus 2.4, which is much higher. So I get those two paths, and among those two paths, I pick the best one. I told the system: here is the correct answer; give me the path that has the lowest energy but still gives me the correct answer. Finding that path is like minimizing over a latent variable, where the latent variable is which path you pick — because it really is an energy model where the latent variable is the path. Professor? Yes. Should the three and the four in the graph be labeled before training, or will the system figure out that latent variable? So here I'm putting myself in a situation where I'm going to train the system supervised. I know the correct answer. This is the desired answer — think of it as a target. So we know the target, but we don't know which part is the three and which part is the four. Well, that's right. So we know what the target is.
We don't know which path is the correct segmentation. It could be this path or it could be that path. And here what we do is just pick the one with the lowest energy, which happens to be the correct one here. Okay. So in this recognition transformer, each of these neural-net boxes — are those all shared? Yeah, this is multiple copies of the same neural net, right? It's just a character recognition neural net in this case. Okay. Now we have the energy of the correct answer: it's 0.7, the sum of 0.1 and 0.6. And what we need to do now is backpropagate gradient through this entire structure so that we can change the weights of that neural net in such a way that this energy goes down. And this looks daunting, but it's entirely possible, because this entire system here is built out of elements that we already know about. That's just a regular neural net. And those path selectors and Viterbi transformers are basically switches that pick a particular edge or not. So it's like a switch; it's like max pooling, except it's min pooling, if you want. So how do I backpropagate? Well, this 0.7 is just the sum of this 0.1 and this 0.6. So the gradient of this with respect to this 0.1 is just one, and the gradient of this output with respect to this value here, 0.6, is also one, because that is just the sum of those two things — you're just backpropagating a one through a sum. Now, backpropagating through the Viterbi transformer: this module just selected one path among two. So what it's going to do is take those gradients here and just copy them onto the corresponding edges of the input graph, and then set the gradients for the other path, the one that was not selected, to zero. It's exactly what happens in max pooling or min pooling.
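This min-pooling view of the Viterbi transformer can be made concrete with a small sketch. The paths, arc names, and energies below are hypothetical; the point is that the gradient of the selected path's energy is one on its arcs and zero everywhere else:

```python
def path_energy_and_grad(arc_energies, paths, target=None):
    """Energy of the best (optionally target-constrained) path, plus its
    gradient with respect to each arc energy: 1 on selected arcs, 0 elsewhere,
    exactly like backpropagating through min-pooling."""
    cand = paths if target is None else [p for p in paths if p[0] == target]
    label, arcs = min(cand, key=lambda p: sum(arc_energies[a] for a in p[1]))
    energy = sum(arc_energies[a] for a in arcs)
    grad = {a: (1.0 if a in arcs else 0.0) for a in arc_energies}
    return label, energy, grad

# Hypothetical arcs: "3a" means "label 3 on the first segment", and so on.
arc_energies = {"3a": 0.1, "2a": 3.4, "4b": 0.6, "9b": 2.4}
paths = [("34", ["3a", "4b"]), ("39", ["3a", "9b"]),
         ("24", ["2a", "4b"]), ("29", ["2a", "9b"])]
label, e, g = path_energy_and_grad(arc_energies, paths, target="34")
assert label == "34" and abs(e - 0.7) < 1e-9
assert g == {"3a": 1.0, "2a": 0.0, "4b": 1.0, "9b": 0.0}
```

Those per-arc gradients are exactly what gets handed back to the convnet outputs that produced the arc energies.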
You're propagating through the switch at the right position, and not propagating next to it. So it's nothing fancy. The path selector is the same: it's just a system that selects the paths that could produce the correct answer. So through it, I'm going to propagate the plus one to the arcs that appear here. So this arc is that one — you see a zero here, but I'm coming back to this in a minute; it should be a one for now — and this plus one here corresponds to that plus one there. And then you can propagate those gradients all the way through the neural net and adjust the weights so that this energy goes down. So that takes care of making the energy of the correct answer small, by backpropagating through this thing. Now, what's important here is that this structure is dynamic, in the sense that if I give you a new input, the number of instances of this neural net will change with the number of segments. The graph here will change; those graphs will change completely. So I need to be able to backpropagate through this kind of dynamic structure, if you want. And this is a situation where things like PyTorch are really important, because you want to be able to handle all those dynamic structures that change with every sample. Okay, so this backpropagation phase takes care of making the energy of the correct answer small. Now, how do we make the energy of incorrect answers large? Well, there's going to be a second phase where, in this case, we just let the system pick whatever answer it wants. And this is a simplified form of discriminative training for structure prediction that uses a form of the perceptron loss, if you want. So the first few stages are exactly identical to what I talked about earlier, but here the Viterbi transformer just picks the best path among all the paths.
You don't constrain it to pick the correct one; you just let it pick whatever it wants. So it's going to pick the path with the lowest energy, the one it thinks is the correct answer. Now, the energy you get out of this is necessarily smaller than or equal to the one you got previously, because this one is the smallest over all possible configurations, whereas the other one is only the smallest over the correct ones. Sorry, I lost you — do we take out the ones that are actually producing the correct sequence, or not? Okay, so you have two forms of it. In the form I'm explaining here, you're not taking out the correct one. In fact, in this particular example, it wouldn't make any difference. But if you want the system to work properly, what you should do here is have a path selector that takes out the correct answers. Yes, exactly. That's right. So you would want to tell the system: give me your best shot at a wrong answer — the lowest-energy wrong answer. Exactly. The Y bar in your paper, right? It would be the Y bar. Here I'm not doing this; I'm just asking it, what's your best shot, and I don't care if it's correct or incorrect. All right, we'll come back to this in a minute. Okay, putting this all together: my loss function is going to be the difference between the energy I get for the correct answer and the energy I get for whatever answer the system wants to produce. So I compute the difference between those two, and that's my loss function, and I can backpropagate through this entire thing. Earlier I told you I was backpropagating just to make the energy here small — I'm not actually doing that. I'm computing a loss function here, which in this case is just the difference between this and that, and I'm backpropagating gradient through this entire structure.
So whatever edge appears only on the left gets a plus one. This one gets a plus one because this edge only appears on this side. The edges that only appear on the right side, like this one, get a minus one — the gradient here gets a minus one because you have a minus sign here, so when the gradients backpropagate, they end up being minus ones. This one also gets a minus one. This one here appears on both sides, and so the minus one and the plus one cancel, and it gets zero gradient: it's in the correct path, but it's also in the path that the system produces, so you shouldn't change anything. Right. So the edges that have a minus one are the ones that are in the incorrect answer but not in the correct answer. The ones that have a plus one are the edges that are in the correct answer but not in the incorrect one. The ones that have zero are in neither, or in both. So now you get gradients here; those gradients are gradients for the outputs of all those neural nets. You backpropagate through the neural nets and update the weights. And if you do this, the system will eventually minimize its loss function, which is the difference between the energy of the correct answer and the energy of the best answer, whatever it is. That loss function is the perceptron loss, and we talked about it before. In fact, let me go to this now. Okay. So the loss function we just talked about is the second one in this list here: the energy of the correct answer minus the energy of whatever answer your system wants to produce. That's the perceptron loss, or the generalized perceptron loss if you want. And the bad news about this loss function is that it doesn't have a margin.
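The plus-one / minus-one bookkeeping above can be sketched directly. The arcs, paths, and energies are hypothetical; the sketch computes the generalized perceptron loss and the resulting per-arc gradients, with shared arcs canceling to zero:

```python
def perceptron_loss_grads(arc_energies, paths, target):
    """Generalized perceptron loss: energy of the best target-consistent path
    minus energy of the best unconstrained path. Per-arc gradient is +1 if the
    arc is only on the constrained path, -1 if only on the free path, 0 if on
    both or neither."""
    def best(cand):
        return min(cand, key=lambda p: sum(arc_energies[a] for a in p[1]))
    _, plus = best([p for p in paths if p[0] == target])
    _, minus = best(paths)
    loss = (sum(arc_energies[a] for a in plus)
            - sum(arc_energies[a] for a in minus))
    grads = {a: (a in plus) - (a in minus) for a in arc_energies}
    return loss, grads

# Hypothetical graph where the free Viterbi pass prefers a wrong first label.
arc_energies = {"3a": 0.1, "2a": 0.05, "4b": 0.6}
paths = [("34", ["3a", "4b"]), ("24", ["2a", "4b"])]
loss, g = perceptron_loss_grads(arc_energies, paths, "34")
assert abs(loss - 0.05) < 1e-9            # 0.7 - 0.65
assert g == {"3a": 1, "2a": -1, "4b": 0}  # shared arc cancels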
So it doesn't ensure that the energy of the incorrect answers is strictly larger than the energy of the good answer. It might just collapse; it might just make every energy zero, or the same. So it's not a good loss function to use. It just happens to work when your energy is linearly parameterized in W, but in the general case it doesn't work. You're much better off using something like this: a hinge. But in the case of a hinge, what you need here is this Y bar, the most offending incorrect answer. So basically, in the second phase, as Alfredo was pointing out, instead of picking the answer with the lowest energy overall, you constrain the system to pick a wrong answer, and then among all of those, pick the one with the lowest energy. Then you take the difference between those two energies — energy of the correct answer minus energy of the most offending incorrect answer — and plug it into a hinge, so that you want this energy to be lower than that energy by at least M. And this is the kind of objective that Driancourt and Bottou used. Then this one, which looks very similar, is something called MCE, minimum classification error, that people in speech recognition use. It's basically a sigmoid: you take the difference between the energy of the correct answer and the energy of the incorrect answers and plug it into a sigmoid, right, 1 over 1 plus the exponential of minus blah blah. So it wants to make that difference small, but it doesn't care once it's small enough, and if the difference is very large, it kind of gives up. And then this is the negative log-likelihood loss: make the energy of the correct answer small, and make the log of the sum over all answers of e to the minus the energy of those answers small as well.
Making that log of the sum of exponentials small means making those energies large. And then the Juang and Rabiner one I was telling you about is another form of objective function that pushes down and pushes up. Most of those are derived from sort of probabilistic principles, but many of them aren't — all the ones at the top aren't. Hey, Professor, I had a question about the margin of the losses. I think in a previous lecture, we discussed how the negative log-likelihood loss converges to the perceptron loss when beta goes to infinity. Correct. Or something like that. But how come the NLL loss has a positive margin and the perceptron loss doesn't? Well, just because the temperature — I mean the 1 over beta — because beta is not infinite, because 1 over beta is not zero. So yeah, if you take the limit of this for beta going to infinity, this 1 over beta log sum converges to the min over Y of the energy E of W, Y, X. And that's exactly what the perceptron does. So the perceptron loss is the zero-temperature limit, or infinite-beta limit, of the negative log-likelihood. Indeed. But the margin is essentially infinite in this case, whereas the margin here is zero, so there's a bit of a discontinuity. Admittedly, though, if you make beta very large here, numerically, the importance of the terms for anything but the lowest-energy value of Y in this sum will diminish, and so numerically it may actually start behaving very much like the perceptron loss. There's a problem with it also, which I mentioned before, which is that this wants to make the energy of incorrect answers infinite. It's never actually going to make them infinite, because as they get larger and larger, the gradient of this sum with respect to each of them gets very small — but they keep getting pushed toward infinity, and that's not necessarily a good thing.
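The zero-temperature limit discussed here is easy to verify numerically. Below is a small sketch with made-up energies, showing that the free energy, minus one over beta times the log-sum-exp, is a soft minimum that approaches the hard minimum as beta grows:

```python
import math

def free_energy(energies, beta):
    """F_beta = -(1/beta) * log sum_y exp(-beta * E(y)): a soft minimum."""
    return -(1.0 / beta) * math.log(sum(math.exp(-beta * e) for e in energies))

energies = [0.3, 1.1, 2.0]
# As beta grows, the free energy approaches min(energies): the perceptron
# loss is the zero-temperature (beta -> infinity) limit of the NLL loss.
for beta in (1.0, 10.0, 100.0):
    assert free_energy(energies, beta) <= min(energies)
assert abs(free_energy(energies, 100.0) - min(energies)) < 1e-3
```

The free energy always lower-bounds the minimum (the sum contains extra positive terms inside the log), and the gap shrinks like 1 over beta.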
The hinge is better in a way, because it just says: I want it to be larger by some value, and I don't care by how much. I gave you another form of the hinge loss in the past where you have a sum over Ys. So instead of just taking the most offending incorrect answer in the hinge, you take all answers and sum over all of them, and for each of them you have a different margin, which depends on the Y and the Y bar. That's a more general form. It might be more expensive, depending on how you compute it. There are a bunch of questions here. First, there are a few students asking about the segmenter: do we learn the segmenter? Do we backprop through it? Is it a latent variable, something? In this particular case, no, it's just handcrafted heuristics. But you could imagine building a differentiable segmenter and then backpropagating all the way through it. Yes. This was actually one of the original plans when we built this thing in the mid-90s. We never got to it. But the reason we never got to it is that there is another approach to character recognition, which is the kind of sliding-window approach I explained: you just take the input, you never segment it, you apply the neural net to every location on the input, you record the outputs, and then you do structure prediction on top of that. So now you have to have some sort of sequence model that tells you: if I observe 3, 3, 3, blank, blank, 2, 4, 4, 4, 4, it's actually 3, 4 — the blank and the 2 are spurious. So you would have a grammar that indicates what the correct combinations of characters on the output are, and you would do this by finding the shortest path in the graph. So the graph on the bottom is generated by the segmenter, is it? Correct. Yep. The one where the arcs span one blob or two blobs. Yep. Okay.
You could think of this as a simple form of graph neural net, or a specific form of graph neural net, where this entire deep learning architecture manipulates graphs instead of tensors as its way of representing inputs — or states. So I think of this as a multilayer architecture where the states are graphs, annotated graphs. And then you can have modules here that turn graphs into other graphs. We used to call these graph transformers. In fact, this is called a graph transformer network, right? So this is from 1997 — this is not recent. In fact, 1996; the first paper is from 1997. And as long as the way you compute those scores is with differentiable parameterized functions, you can backpropagate gradient through this entire thing, and I've demonstrated how you do this in this particular case. I see. Then there's another question, which I may not be able to understand, which is: what are the dimensions of the interpretation graphs? I don't know what dimensions. Dimensions. So basically, each arc — okay, you can view it in two ways. The way I've represented it here is that each arc has a label, three here for this particular edge, and an energy, 3.4. And the number in parentheses is the gradient that comes from the top. So here it's a scalar. But on the graph at the bottom, the annotation is an entire tensor — it's an image. So I don't specify what you can annotate the graphs with; as long as whatever you annotate them with is computed by some continuous function, you can propagate gradient through it. Now, another way of representing this graph is not by having a separate arc here for each category, but by having a vector, and the vector just contains the list of categories together with a list of scores — 0 through 9, and then the list of energies for each of them.
And that would be just one arc, but annotated by this vector. Yeah, I see. Okay. But because the Viterbi transformer and the path selector select individual paths, it's clearer if you write it this way. How you implement it is up to you. So those graph transformer networks — there are speech recognition systems today that use this. Basically, in a speech recognition system, this whole machinery for inferring the correct sequence, using for example a language model, is called a decoder. At the output of a neural net, you generally have a sequence of vectors that indicate the score — the energy, the probability, whatever you want — of individual sounds or phonemes, or sometimes words. And then you have to pass this to a language model that tells you this sequence is possible, that other sequence isn't, and it picks out the best possible interpretation according to the language model and according to the scores produced by your system. That's the decoder. And the big question is: how do you backpropagate gradient through the decoder? There's only a very small number of speech recognition systems today that actually do this. The latest one, I think, is by Ronan Collobert, the original author of Torch. And here is how this works. So this is a particular concept called graph composition, or graph transducers, which explains how you can combine graphs with each other — for example, together with a language model. You can think of a language model as a graph. You can represent it as a graph or as a neural net, it doesn't matter, but I'm going to represent it as a graph, right? So here, this is basically a lexicon that is represented as a trie, t-r-i-e, okay? And it can represent the words barn, butt, cute, cure, cap, cat, and card. So each terminal node corresponds to a path, and each path is a word, basically, right?
The sequence of symbols along the path is a word. Now, let's imagine our entire lexicon is this, and we have a neural net or something that produces a trellis of possible interpretations that corresponds to this graph. It says: the first character, I can't tell you exactly what it is, but I think it's C with energy 0.4, or O with energy 1, or D with energy 1.8. Let's say it's a character recognizer. The second one says it's X with energy 0.1, or A with 0.2, or U with 0.8, and the last one is P with 0.2 or T with 0.8. So what we need now is: what is the best path in this trellis that also happens to be present in our lexicon? And the operation you need for this is a concept that was invented by Fernando Pereira, who is now head of an NLP research group — more than that, actually — at Google Research, but that was back when he was at AT&T Bell Labs in the early 90s. It was implemented in an open-source library called the FSM Library, written by Mehryar Mohri, who is a professor at NYU — he did this when he was at AT&T, and then at Google. And the way you do this is this composition operation. You start from the initial node of both of those graphs, and you ask: is there a transition I can take in this graph that is also legal here? Here I can have B or C, and here I can have C, O, or D. Only C is common between the two. So I'm going to combine those two by saying: in my output graph, I'm going to have one of those transitions, which is the only transition that is common here and here. So now I'm in this node here, and in that node there. Here I can take X, A, or U, and here I can take U or A. So I have two possibilities, U or A — A with 0.2, U with 0.8 — and I add those two here.
So basically, whenever I arrive at a node and have to take a transition, I look at which nodes I can be in here, I look at the possible transitions, and if there is one that matches, I create an outgoing arc and annotate it with the energy of the corresponding arc here. If I also had an energy on this arc, I would just add the two terms, or combine them in some way. So now I have two nodes here, and the last character can be P or T, so I can start from those two nodes and take either P or T: I can go here with T, or go here with P, or here with T, and I end up with those three things. So now my interpretation is either cap or cat or cut. These are the three interpretations that are grammatically correct and, at the same time, are present as a possibility produced by my neural net. And now I just have to find the shortest path there, and that's my answer. So that operation is called graph composition, and it basically allows you to combine two graphs — or combine two knowledge bases that are conceptually graphs but could be represented by neural nets. So I could represent this whole language model by a neural net. When I'm at a particular location — when I'm here — it means I observed the sequence C, U, and then I can run C, U through my language model and ask my neural net to predict: what's the next letter? And my neural net will say, well, it's either a T or an R. In its softmax output over the 26 letters, it's going to tell me T and R have high probabilities and the other ones have low probabilities; or if it produces energies, it's going to say T and R have low energies and the other ones have higher energies. So it doesn't matter how you actually represent this. If it's represented as a neural net, then implicitly you can train this neural net.
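This trellis-with-trie composition can be sketched compactly. The lexicon below is a hypothetical three-word trie ({cap, cat, cut}, with "$" marking a word end) and the trellis reuses the energies from the example above; this is only an unweighted-lexicon special case of general transducer composition:

```python
def compose(trellis, trie):
    """Keep only the trellis paths that spell a word in the lexicon trie,
    returning (total energy, word) pairs sorted by energy."""
    frontier = [(0.0, "", trie)]          # (energy so far, prefix, trie node)
    for step in trellis:                  # one dict {symbol: energy} per step
        nxt = []
        for e, prefix, node in frontier:
            for sym, cost in step.items():
                if sym in node:           # transition legal in both graphs
                    nxt.append((e + cost, prefix + sym, node[sym]))
        frontier = nxt
    return sorted((e, w) for e, w, node in frontier if "$" in node)

# Hypothetical lexicon {cap, cat, cut} as a trie; "$" marks a word end.
trie = {"c": {"a": {"p": {"$": {}}, "t": {"$": {}}},
              "u": {"t": {"$": {}}}}}
trellis = [{"c": 0.4, "o": 1.0, "d": 1.8},
           {"x": 0.1, "a": 0.2, "u": 0.8},
           {"p": 0.2, "t": 0.8}]
results = compose(trellis, trie)
assert abs(results[0][0] - 0.8) < 1e-9 and results[0][1] == "cap"
assert [w for _, w in results] == ["cap", "cat", "cut"]
```

Only paths legal in both graphs survive (the X at step two is pruned immediately), and the shortest surviving path is the answer — cap, at energy 0.4 + 0.2 + 0.2.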
You can train the language model because you can backpropagate gradients through this entire thing. Okay, so that would be an example of what people call differentiable programming. Basically, the way to implement this is a really, really complicated program, and what you need to do is backpropagate gradients through this entire program — and this program has loops and ifs and recursions. Okay, so not trivial. I'm not telling you how we actually implemented this in 1994, 1995, but that's basically how our check-reading system back in those days was implemented. So the loss function we used in the end to train the system was actually the negative log-likelihood loss function. Negative log-likelihood says: you have an interpretation graph here where each path is a possible interpretation, and the sum of the energies along the path is the energy of that interpretation. You give it the correct answer; you select the paths that have the correct interpretation. Okay, same on the other side. So here you compose with a grammar. The grammar restricts the sequences to those that are syntactically correct. Okay, so if it's an amount on a check, for example, it's got a decimal dot, it might have a dollar sign in front, it might have stars — there's a grammar for it which you can build by hand. It's a finite-state grammar. You compose those two graphs and you get the set of paths in this graph that actually contain a grammatically correct interpretation. And now you don't do Viterbi, you do forward. Okay, what is forward? So Viterbi computes the path in a graph that has minimum energy. Basically, it minimizes with respect to the latent variable, where the latent variable is the path in the graph. Forward computes the log of the sum of the exponentials of minus the energies of all the paths. So basically it marginalizes over the latent variable, which is the path in the graph. Now it turns out that you can do this very easily and it's very cheap.
It doesn't cost more than doing Viterbi, and you can backpropagate through it. I don't have a slide for this, so I'm going to switch to drawing it. Okay, so you have one path, you have another path, and maybe another path that skips over here. And each of those arcs has an energy, right? E1, E2, E3, E4, E5, E6, let's say. Okay, so if you do Viterbi, shortest path in a graph, you're just going to find the path that has minimum energy. But what I'm going to talk about here is computing — so think of the path as a latent variable z. And remember, to compute F(x, y), you can do two things. You can do F(x, y) = min over z of E(x, y, z) — and remember, z is the path. Or, if you want to marginalize, you do F_beta(x, y) = -1/beta log sum over all z's of exp(-beta E(x, y, z)). And that's marginalizing. It's a discrete sum if z is a discrete variable, which is the case here because it's a discrete path. Okay, so this is F_beta(x, y), and you can think of the first one as F_infinity: it's the limit, as beta goes to infinity, of the one at the bottom. Professor, is the energy function some simple loss function, or a neural network to be trained in the model? I didn't understand the question, I'm sorry, can you repeat? Is the energy function here some simple function, like a loss function, or some neural network to be trained in the model? It doesn't matter, okay? This is the energy that you use to measure the score of an answer y. The observation is x; the answer you are supposed to predict, in this case a sequence of symbols, is y. So each of those paths here is annotated with a particular y, okay? Each of those arcs is annotated with a symbol. So let's say this is a, and this is b, and this is b and c, and this is x, and this is g, or something, right? So the possible interpretations for y — y would be a string of symbols — are: it can be ab, or it can be bcg, or it can be cg, okay? If this is c, okay? That's an x, that was an x, right?
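The two formulas above — the hard min (Viterbi) and the marginalization (forward) — can be checked numerically. This is a toy sketch with made-up path energies: the forward free energy is a soft minimum that is always below the true minimum and converges to it as beta grows.

```python
import math

def free_energy(energies, beta):
    """F_beta = -1/beta * log(sum_z exp(-beta * E_z)): a soft minimum."""
    return -1.0 / beta * math.log(sum(math.exp(-beta * e) for e in energies))

path_energies = [1.0, 2.5, 3.0]  # hypothetical energies E1..E3 of three paths

viterbi = min(path_energies)                    # beta -> infinity: hard minimum
forward = free_energy(path_energies, beta=1.0)  # marginalization over paths
sharper = free_energy(path_energies, beta=10.0)

# The soft minimum is always below the hard minimum, and approaches
# it as beta grows (the exponentials of the worse paths vanish).
print(viterbi, forward, sharper)
```

With beta = 1 the three paths all contribute; with beta = 10 the best path dominates and the soft minimum is already within one percent of the Viterbi answer.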
Sorry, you're right, it's an x. Thanks. Sorry about that, okay? Those are the only three possible interpretations that can come out of that graph, and z is which path you're taking, okay? So if z is the first path, then the output would be ab; if z is the second path, the output would be bcg, et cetera. Okay, thank you. Right, but this is not used for training; this is the energy function used for inference. Okay, so the log of the sum of the exponentials of minus the energies over all the paths — the sum here is over paths, okay? That's like marginalizing over z, and we saw that before, right? We explained that before. Now, how do I compute this? It turns out it's very simple; it's done using what's called the forward algorithm. I'm actually going to draw a different graph. The graph is going to look more like the one I had before, which was kind of like this, right? Okay, so y is a sequence of symbols — oh, I'm sorry, I'm using nodes here instead of arcs, that's a little confusing. Let me correct that. Okay, so each path in this graph is a possible interpretation, okay? For each edge I take, I emit a symbol, and I don't have skipping connections here, so all the paths have exactly four symbols coming out, because every path is of length four, okay? But how do I compute this sum? Basically, when I'm at a node — let's take this node right here, I'm going to call it the red one — the cost from the input node, the energy from the input node to that node, is the log of the sum of the exponentials of minus the energies along all the paths that go from the input node to that node, okay? And of course, I have an energy right here, which is just the energy of that branch, and I have an energy here, which is just the energy of that branch, okay?
I have an F here, and to compute the F for this node, I just compute the log-sum-exp of those two guys, okay? Right, so let me unwrap this. I've got an energy here, E1, I've got an energy here, E2, I've got one here, E3, and one here — this guy is E4, all right? So the value I should have here is: minus 1 over beta, times the log of [exp(-beta (E1 + E3)) + exp(-beta (E2 + E4))], okay? So, how is E1 calculated? Whatever comes out of your energy function, right? As I said, you represent the possible interpretations as a graph, each arc in the graph has an energy, and the complete energy — F(x, y) for a particular y and a particular z, following your path — is E(x, y, z). And now what you want to compute is minus 1 over beta times the log of the sum of exp(-beta E(x, y, z)), which is the marginalization over all the paths. So it's basically combining the costs of all the paths in a kind of soft-minimum way, right? But the algorithm is super simple, because you maintain a variable for each node: for a particular node i, you compute a variable alpha_i, and it's going to be equal to alpha_i = -1/beta log sum over k of exp(-beta (alpha_k + E_ki)), where the sum runs over all the parent nodes k of i, and E_ki is the energy of the link going from node k to node i, okay? That's called the forward algorithm. And if you've heard about this, it's actually a special case of the so-called belief propagation algorithm. Belief propagation is a general algorithm for graphical models, and the forward algorithm is the special case where your graph is basically a chain, okay? But I'm not going to go into this.
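The alpha recursion just described can be written down directly. This is a minimal sketch on a small made-up DAG (not a production implementation): each node's alpha is the log-sum-exp combination of "parent alpha plus edge energy", and the result equals the marginalization over all paths.

```python
import math

def logsumexp_min(values, beta):
    """-1/beta * log(sum(exp(-beta * v))): the soft-min combiner,
    stabilized by pulling out the minimum."""
    m = min(values)
    return m - (1.0 / beta) * math.log(sum(math.exp(-beta * (v - m)) for v in values))

def forward_algorithm(edges, n_nodes, beta=1.0):
    """edges: list of (src, dst, energy) on a DAG whose node indices are in
    topological order, with node 0 the source. Returns alpha[i], the free
    energy of all paths from the source to node i."""
    incoming = {i: [] for i in range(n_nodes)}
    for k, i, e in edges:
        incoming[i].append((k, e))
    alpha = [math.inf] * n_nodes
    alpha[0] = 0.0
    for i in range(1, n_nodes):
        vals = [alpha[k] + e for k, e in incoming[i] if alpha[k] < math.inf]
        if vals:
            alpha[i] = logsumexp_min(vals, beta)
    return alpha

# Toy diamond graph: two paths 0-1-3 (energy 1.0 + 0.5) and 0-2-3 (2.0 + 0.3).
edges = [(0, 1, 1.0), (0, 2, 2.0), (1, 3, 0.5), (2, 3, 0.3)]
alpha = forward_algorithm(edges, n_nodes=4)
print(alpha[3])  # soft min of the two path energies 1.5 and 2.3
```

The cost is one log-sum-exp per node, the same order of work as Viterbi's one min per node.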
You can take a course on Bayesian nets or graphical models for probabilistic methods; if you take a course with Rajesh, he will explain that to you. This would take us too far, but that would be the idea, okay? So now, this is just a feedforward neural net where the function at each node is a log-sum-exp plus the addition of some terms, right? This is a neural net where alpha_i is the activation of the neurons — the nodes, if you want — and the weights are those E_ki that link unit k to unit i, okay? And the operation you do is log-sum-exp. So instead of a neural net in which you multiply by a weight and then sum the products, here you add the weights and then you do a log-sum-exp. Algebraically, it's actually equivalent: it's like a weighted sum in the log domain, okay? But the point is, you can run this forward pass, this forward algorithm, and you can backpropagate gradients. So whatever F you get at the end — by the time you've run through this network, at the end here you basically get F(x, y); the value of that last node, the alpha here, is F(x, y) — and you've eliminated z by doing this log-sum-exp over all paths, right? Now, if you want to compute the gradient of F(x, y) with respect to each of the E_ki, which themselves probably are outputs of some neural net, you can do that. You can backpropagate through this network, okay? It's a neural net whose structure, again, is dynamic — it changes from example to example — but you can clearly backpropagate gradients through it. And that's basically what we do here in this system. We run the forward algorithm on this graph and we get a score, which is minus the log of the sum of the exponentials of minus the energies over all the paths, okay? We do the same here: we get another score, again minus the log of the sum of the exponentials of minus the energies.
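Backpropagating through this soft minimum has a clean closed form: the gradient of the free energy with respect to a path's energy is that path's posterior probability under the Gibbs distribution. Here is a toy check (made-up path energies) of the analytic gradient against finite differences:

```python
import math

def free_energy(path_energies, beta):
    """-1/beta * log(sum(exp(-beta * E_p))) over all paths."""
    m = min(path_energies)
    return m - (1.0 / beta) * math.log(sum(math.exp(-beta * (e - m)) for e in path_energies))

def free_energy_grad(path_energies, beta):
    """dF/dE_p: the posterior probability of path p, i.e. a softmin weighting."""
    m = min(path_energies)
    w = [math.exp(-beta * (e - m)) for e in path_energies]
    s = sum(w)
    return [wi / s for wi in w]

E = [1.0, 2.5, 3.0]  # hypothetical path energies
grad = free_energy_grad(E, beta=1.0)

# Finite-difference check: backprop through the soft min matches numerics.
eps = 1e-6
for p in range(len(E)):
    Ep = list(E)
    Ep[p] += eps
    numeric = (free_energy(Ep, 1.0) - free_energy(E, 1.0)) / eps
    assert abs(numeric - grad[p]) < 1e-4

print(grad)  # posterior path probabilities; they sum to 1
```

Because the gradient is a distribution over paths, training with the forward score pushes down the energies of likely paths in proportion to their posterior, instead of only the single Viterbi path.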
This guy necessarily is larger than this one. You compute the difference, and that's the negative log-likelihood loss: the difference between the log-sum-exp over the latent variable for the correct answer and the log-sum-exp over the latent variable for every answer — well, here, every grammatically correct answer, but it's the same idea. And then you just backpropagate gradients through this entire thing. You backpropagate gradients through this graph here, which you can think of as some sort of weird neural net whose node operation is this log-sum-exp, and you get gradients for each of the E's — and the E's are the values you get here, which are produced by the neural net. And so you get gradients with respect to the parameters of the neural net. Okay, so that's structured prediction for you. There are a couple more topics I wanted to talk about today: variational methods in Bayesian inference — because we talked about them in the context of VAEs but without really explaining what they were, or at least I didn't, maybe you did, Alfredo — like the general form of variational inference; or I can talk about the Lagrangian formulation of backprop. I can actually do both because it's kind of fast. It will take more than five minutes, but you can leave whenever you want. Okay, then let's go for both. The Lagrange thing is short, so I'm going to do that first. Okay. So you can formulate backprop as a minimization under constraints. You have an input variable X going through a first functional module — let's call it F0(X, W0) — and it produces something we're going to call Z1. And then the second one is going to be F1(Z1, W1), and that produces Z2, et cetera. And then at the end, we have the last module, and it goes into some sort of energy term. Let's say this is our output if we're doing supervised learning, but it doesn't matter; it's just a cost. Let's call these Zn and Y.
Okay, so the forward pass can be written as Z_{k+1} = F_k(Z_k, W_k). That's just the forward pass. And then you have a cost function C, which you want to minimize, which is C(Zn, Y) — whatever cost function you want to minimize. Now, you can write the entire backprop problem as a minimization under constraints. The statement is: minimize C such that the constraints above are verified. And when you have a minimization problem under constraints, the best thing to do is to write a Lagrangian, right? So you write a Lagrange function — and I'm going to tell you right away what it is a function of. For a single training sample (X, Y), it's going to be the cost C(Zn, Y). Well, we might also say there is another constraint, which is that Z0 = X. Plus some other terms: we're going to have an index k, a Lagrange multiplier, and a constraint which should be equal to 0, and that constraint is Z_{k+1} − F_k(Z_k, W_k). I'm going to call the multiplier lambda_{k+1}, and k is going to run from 0 up to n − 1. So: L = C(Zn, Y) + sum over k of lambda_{k+1} transpose times (Z_{k+1} − F_k(Z_k, W_k)). Okay, so this is the Lagrangian formulation of my backprop problem, where basically I have an overall cost function and a bunch of constraints, and the constraints are that the input to layer k + 1 is the output of layer k. So this Lagrange function is a function of X, Y, all the lambda_k's, all the Z's, and all the W's. Okay, so what I need to do now to solve this constrained minimization is set dL/d lambda_{k+1} = 0. And the gradient of L with respect to lambda_{k+1} is just this parenthesis, right? So I just get Z_{k+1} = F_k(Z_k, W_k), which is just the forward propagation formula. If I set dL/dZ_k = 0, it's a little more complicated, right?
So I get a first term, which is lambda_k, because there is a Z_k here, and that Z_k is multiplied by this lambda_k, right? So I get lambda_k transpose. And then I get a minus sign, and for this Z_k I have a lambda_{k+1} times the Jacobian of F with respect to Z: something like dF_k(Z_k, W_k)/dZ_k, transposed, times lambda_{k+1}, and that should be equal to 0. So I rearrange all that stuff, and what I get is: lambda_k = [dF_k(Z_k, W_k)/dZ_k] transpose times lambda_{k+1} — the Jacobian matrix of F_k, transposed, times lambda_{k+1}. And funnily enough, this is actually the backpropagation formula, right? This is the thing that gives you the gradients at level k given the gradients at level k + 1: you multiply by the transposed Jacobian of the block you backpropagate through, okay? So you don't have to think about it: you just write backprop as a constrained optimization problem, and backprop naturally comes out of it. Now, the first people to figure this out were people in control theory. In fact, the first people to figure this out were people like Lagrange and Euler, or people like Hamilton and Jacobi. That's the classical formulation of mechanics, if you want. And in mechanics, when you write something like this, C(Z, Y) would be the energy of the system, like a potential energy or something like that. And then the other term basically implements the dynamical constraints — the fact that you have a differential equation that tells you that the state at time t + 1 is a function of the state at time t, with some constraint, right? So that's the dynamical constraint.
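The multiplier recursion lambda_k = Jᵀ lambda_{k+1} can be verified numerically on a tiny network. This is a toy sketch (two tanh layers with made-up weights, not a real training setup): the backward pass is written exactly as the Lagrangian stationarity condition, and its result is checked against a finite-difference gradient of the cost with respect to the input.

```python
import math

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def matvec(W, v):
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]

def forward(z0, Ws):
    """Forward pass: z_{k+1} = F_k(z_k) = tanh(W_k z_k). Returns all states."""
    zs = [z0]
    for W in Ws:
        zs.append(tanh_vec(matvec(W, zs[-1])))
    return zs

def cost(zn, y):
    """C(z_n, y) = 0.5 * ||z_n - y||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(zn, y))

def backward(zs, Ws, y):
    """Lagrangian stationarity: lambda_n = dC/dz_n, then
    lambda_k = (dF_k/dz_k)^T lambda_{k+1} down to lambda_0 = dC/dz_0."""
    lam = [a - b for a, b in zip(zs[-1], y)]
    for W, z in zip(reversed(Ws), reversed(zs[:-1])):
        pre = matvec(W, z)
        d = [1 - math.tanh(p) ** 2 for p in pre]   # tanh' at the preactivation
        dl = [d[i] * lam[i] for i in range(len(lam))]
        # J = diag(d) W, so J^T lam = W^T (d * lam)
        lam = [sum(W[i][j] * dl[i] for i in range(len(dl))) for j in range(len(z))]
    return lam

Ws = [[[0.5, -0.2], [0.1, 0.4]], [[0.3, 0.7], [-0.6, 0.2]]]  # hypothetical weights
z0, y = [0.2, -0.1], [0.1, 0.3]
zs = forward(z0, Ws)
lam0 = backward(zs, Ws, y)

# Finite-difference check: the multiplier recursion IS backprop.
eps = 1e-6
for j in range(2):
    zp = list(z0)
    zp[j] += eps
    numeric = (cost(forward(zp, Ws)[-1], y) - cost(zs[-1], y)) / eps
    assert abs(numeric - lam0[j]) < 1e-4
print(lam0)
```

The same Jᵀ-times-multiplier step, applied to the weights instead of the states, gives the weight gradients.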
And then if you do this, you figure out that you can have an energy for every time step — so C_k is now indexed by the time step — and the forward propagation is a differential equation that governs the system. And then you could have a term here that is not just an energy term at the output, but an energy term for every time step, right? So the Lagrangian function would be: sum over time steps k of C_k(Z_k, Y_k) — there might be some external variable, let's call it Y_k — plus lambda_{k+1} transpose times (Z_{k+1} − F_k(Z_k, W_k)), and the sum takes place over all time steps. When you look at the Lagrangian formulation of classical mechanics, that's basically the way it's expressed: C is the energy, and the second term is the constraints. Now, in classical mechanics, the lambda variable is actually the momentum. So Z is the position variable, and lambda becomes the momentum, and the second term becomes basically the kinetic energy — the negative kinetic energy, more specifically. Anyway, this is just an aside. Okay, why am I telling you this? Because the mathematics of this is actually super simple, if you know Lagrangian minimization under constraints. And this is something you can use also in a new class of models called neural ODEs — neural ordinary differential equations. This is something Alfredo wanted me to talk about. Thank you. So, neural ODEs. This is a type of neural net which is basically a recurrent neural net where you say: my state at time t + delta t is equal to my state at time t, plus delta t times some function of z(t) and a bunch of parameters w which are fixed — they don't vary with time. So z(t + delta t) = z(t) + delta t · f(z(t), w). I can write it this way, or I can write it in differential equation form: dz/dt at time t equals f(z(t), w). Okay, so that's a differential equation, an ordinary differential equation.
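The update rule z(t + Δt) = z(t) + Δt · f(z(t), w) is just explicit Euler integration of the ODE. Here is a minimal sketch with toy scalar dynamics (a hypothetical f, not a trained network) whose fixed point y_star satisfies f(y_star, w) = 0, so the trajectory converges to it:

```python
# Toy dynamics: dz/dt = f(z, w) = -w * (z - y_star).
# Since f(y_star, w) = 0, y_star is a stable fixed point for w > 0.
def f(z, w, y_star=1.0):
    return -w * (z - y_star)

def integrate(z0, w, dt=0.01, steps=1000):
    """Explicit Euler discretization: z_{t+dt} = z_t + dt * f(z_t, w)."""
    z = z0
    for _ in range(steps):
        z = z + dt * f(z, w)
    return z

z_final = integrate(z0=-2.0, w=1.0)
print(z_final)  # the trajectory has converged close to the fixed point 1.0
```

In a real neural ODE, f would be a neural net and w its trained weights; the Euler loop above is what "unfolding the recurrent net in time" means.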
In this case it's first order — well, it depends what's in z, but I can express just about anything this way. And the question is: how do you train something like this? Basically, if you write the Lagrangian formulation of this, it's trivial. So there are two ways you might want to train something like this. You might want to train the system to map one point, z at time zero, to a particular point z at time big T, after some trajectory. You may not want to constrain the trajectory; you just want it to reach that point, and you don't care what it does afterwards — you just want it to reach that point. So you can have a cost function which is basically the distance of z(T) to that target point. So the target would be a point y, and your cost function would be the distance between z(T) and y, or something like that. Okay. Another thing you might want to do is train the system so that it has stable states at particular points y. Okay? So that for a particular point y that you pick from your training set, f(y, w) = 0, which means that state is going to be stable, right — the trajectory stops there. So you would have a point y in your space, and you might start from some point, and when you arrive at that point, the dynamics stop. Now, if you formulate this in terms of the Lagrangian, it becomes super simple, in the sense that — contrast this with backprop through time. If you were to unfold this network in time, consider it a recurrent net, and compute the gradient of the endpoint with respect to the parameters and with respect to the initial state, you would have to backpropagate through time, right? You would have to remember the entire trajectory and then do backprop through time. Okay.
But if what you're interested in is just learning a stable state like this, then you don't need to store the trajectory. You start from some point, you converge to some other point, and you want to make y a stable state. What you need to do is ensure that f(y, w) = 0 is true. And the way you can do this is basically by minimizing a cost, which would be something like the squared norm of f(y, w). But the point is that you don't need to remember the entire trajectory: the gradient with respect to the weights can be obtained by running a very similar type of differential equation backwards in time. And I'm sorry I'm not going to be able to go into the details of that; I can refer you to a paper. So there is the neural ODE paper, which doesn't really mention this, but there is an earlier paper of mine called "A Theoretical Framework for Back-Propagation", and basically it explains this Lagrangian formulation, as well as how you apply it to recurrent nets that might be continuous in time and that you want to train to go to particular fixed points. This is a paper from 1988 — it's not recent. You'll find it on my webpage, at the bottom of the publications page. But I don't want to go into the details of this. And there is the Bayesian stuff. Bayesian stuff, yes. People are still here? I don't know, they are enjoying it. Take a break — you don't have to stay if you don't want to. It's not really the Bayesian stuff, it's more the variational part. Oh, sorry. Yeah, you're right, I got confused. So, let's say I have some loss function, okay? And I'm going to talk about a loss, not an energy, but it's the same thing. And my loss function is a marginalized loss function over a latent variable. So remember, I talked about this before: you have an energy function F(x, y), let's say, and you want to derive it from a more elementary energy function E(x, y, z) by doing the equivalent operation of marginalizing over z. Yeah.
So the way you marginalize, right, is: you sum over all z's of e to the minus beta E, and you take minus 1 over beta times the log. So F(x, y) = -1/beta log sum over z of exp(-beta E(x, y, z)). Okay, that's the formula for marginalizing over a latent variable. And that also applies to loss functions: whatever function you want to marginalize over a latent variable, that's what you compute. So let's say you have a model with a latent variable, and you don't know what the value of the latent variable is, and you want to compute "what is my loss?", which would be minus 1 over beta times the log of the sum, over all values of the latent variable, of the exponentials of minus beta times the loss. So I'm kind of marginalizing over this latent variable. Let's say it's a variational autoencoder or something — I have a latent variable in the middle, and I want to compute L(x, y) = -1/beta log sum over z of exp(-beta L(x, y, z)). I'm using L, but I could use any symbol here; this is whatever function you need to compute, but it's useful for things you want to minimize, like energies or objectives. Okay, so this loss function here is no longer a function of z; it's only a function of x and y. Now, I can rewrite this as: -1/beta log sum over z of q(z) · exp(-beta L(x, y, z)) / q(z). I've just multiplied and divided by q(z), so I've done nothing. Now, q(z) here, I assume, is a probability distribution over z — a density function that integrates to one when I integrate over z. So you can interpret this sum as the expected value, with respect to that distribution, of exp(-beta L(x, y, z)) / q(z). Okay, now here's the trick. There's something called Jensen's inequality, and Jensen's inequality says something very interesting. It says: let's imagine I have a convex function, like, say, minus log. Okay, I'm not drawing minus log here very well, but it looks a bit like minus log.
Now, if I take a bunch of values over a range, and I compute the average of the values of the function minus log over that range — because the function is convex, I'm going to get a value that is larger than the function applied to the average. Okay, my diagram is not that great because the curvature is not high enough; let me draw it again. So here's a convex function. I'm going to vary a variable here over a range, okay? And I compute the average of that function over that range, so it's going to give me some value, probably around here. And then I take the average of all those values in the range — the midpoint of the range — and pass it through the function, okay? And I get something below this. So if I take the average of this, plus this, this, this, and this, I'm going to get something that's higher than that, because the function is convex. If the function were straight, then the average after going through the function would be the same as before going through the function, right? If I computed the average of the y values of those points, I would be at the same place as the function applied to the average, okay? So you can draw the chord of the convex function — the line that goes between those two extrema. That's right, yeah, there you go. So the mean of the function values would be something like a point on that chord; it wouldn't be exactly that, but it would be close to that, okay? Now, let's forget about a function like that. Actually, I should have explained this in a very simple way with just two values. Let's say it's just the average of two terms, okay? So I have a convex function, I have two values, and the average of those two values after I pass them through the function, okay?
So basically, let's say my function is minus log. The average of minus log — let's call the values x1 and x2: [minus log(x1) + minus log(x2)] / 2 — is this point, okay? And minus log of (x1 + x2)/2 is that point, and that's below, okay? And Jensen's inequality basically says: if you have a convex function like minus log — here I computed an average, but it's true for any expectation — then the convex function of the expectation, over any distribution q, of some function h(z), is less than or equal to the expectation of the convex function applied to h(z). Written properly: convex(sum over z of q(z) h(z)) ≤ sum over z of q(z) convex(h(z)). That's Jensen's inequality. So this works with minus log, which means I can write that my objective function here is less than or equal to — I take the minus 1 over beta and actually put it inside — sum over z of q(z) times minus 1 over beta log of [exp(-beta L(x, y, z)) / q(z)], okay? And you see, the minus 1 over beta log and the exponential of minus beta cancel, okay? So from the first part, what I get is sum over z of q(z) L(x, y, z) — that's just the expected value of L averaged over the distribution q(z). And then I get a second term, and the second term is minus 1 over beta times the log of 1 over q(z) — q(z) is in the denominator, so I bring it to the top, which cancels the minus sign — and so I get something like plus 1 over beta log q(z), right? So I can write the whole thing as: sum over z of q(z) L(x, y, z), plus 1 over beta times sum over z of q(z) log q(z). This first term is the average loss — energy, whatever it is, let's call it energy. And this second term is minus 1 over beta times the entropy of q, okay?
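Jensen's inequality for minus log is easy to check numerically. A minimal sketch with arbitrary made-up positive values: the convex function of the average is below the average of the convex function.

```python
import math

xs = [0.5, 1.0, 2.0, 4.0]  # arbitrary positive values

lhs = -math.log(sum(xs) / len(xs))              # -log applied to the average
rhs = sum(-math.log(x) for x in xs) / len(xs)   # average of -log values

# Jensen: for convex -log, lhs <= rhs (equality only if all xs are equal).
print(lhs, rhs)
```

Replacing the uniform average by any expectation over a distribution q gives exactly the step used above to turn the equality into the upper bound.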
The entropy of a distribution is minus the sum, over the values of the random variable, of the distribution times the log of the distribution, okay? So this term is minus 1 over beta times the entropy. So what does that mean? What that means is that I have an upper bound on the loss function that I want to minimize, L(x, y), okay? Or the energy that I want to minimize — whatever function it is that I want to minimize, I have an upper bound on it now, and this upper bound is the sum of two terms. One is the average of the energy I get by basically sampling the latent variable, okay? So I have a system with a latent variable; I sample some value of the latent variable according to some distribution q — and of course, I pick a q from which I can easily sample; I can choose whatever q I want, right? So I pick a q, a Gaussian, whatever, and I pick a z according to that distribution, and I compute the expected value of the function I want to minimize with respect to that q. And I can do this by just sampling z from the q distribution and then computing the average of the values of L that I obtain as a result, okay? So that's the first term. And then the second term is the entropy of q. So what I need to do is basically change my distribution q in such a way that the entropy is maximized. So if it's a Gaussian, for example, it means I need to make the variance of z as large as possible; but if I make it too large, then the average energy term is going to blow up. So I need to optimize this whole function overall. And I optimize this whole function with respect to q, and with respect to whatever parameters of L I want to minimize — because L is an objective function with respect to, I don't know, the weights of a neural net or something, right? So I can simultaneously minimize with respect to those parameters W, and with respect to the q distribution.
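The whole bound can be verified on a toy discrete latent variable (hypothetical energies, beta = 1): the variational free energy — average energy minus entropy over beta — is above the exact free energy for any q, and touches it when q is the Gibbs distribution.

```python
import math

beta = 1.0
E = {0: 1.0, 1: 2.0, 2: 0.5}  # toy energies over a discrete latent z

# Exact free energy: -1/beta * log sum_z exp(-beta * E(z)).
F = -1.0 / beta * math.log(sum(math.exp(-beta * e) for e in E.values()))

def variational_bound(q):
    """E_q[E(z)] - (1/beta) * H(q): the variational free energy."""
    avg_energy = sum(q[z] * E[z] for z in E)
    neg_entropy = sum(q[z] * math.log(q[z]) for z in E if q[z] > 0)
    return avg_energy + neg_entropy / beta

q_uniform = {z: 1.0 / 3.0 for z in E}
Z = sum(math.exp(-beta * e) for e in E.values())
q_gibbs = {z: math.exp(-beta * E[z]) / Z for z in E}

print(variational_bound(q_uniform), F)  # bound >= F for any q
print(variational_bound(q_gibbs), F)    # equality when q is the Gibbs distribution
```

With a richer family of q's the gap shrinks; a VAE's encoder is exactly a parameterized q trying to close this gap.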
And if the q distribution is in a family that's wide enough, then this upper bound will be fairly close to the actual loss that I want to minimize, which is the loss marginalized over the latent variable. But I never need to actually compute the marginalization over the latent variable explicitly. So this is a way of marginalizing over a latent variable without actually doing it, okay? By sampling the latent variable from a distribution you can sample from, like a Gaussian. But what you have to do is maximize its entropy. And when you think about variational autoencoders, that's just what they do, okay? They minimize the expected reconstruction error, which is L(x, y, z), with respect to the parameters, by sampling the latent variable z according to a Gaussian distribution, okay? But at the same time, there is what's called a KL term, which is the second term, that basically tries to make the distribution as high-entropy as possible. Now, this formula is exactly identical to a formula that people use in statistical physics. Physicists have a very famous formula, which is this: the free energy is equal to the average energy minus the temperature times the entropy, F = U − TS, okay? What they call the temperature is what I call 1 over beta, okay? And that's identical to this formula, because here this term is minus the entropy, okay? It's the same formula. So what we're minimizing now is a free energy. And if q(z) is sufficiently powerful to actually be the distribution that it needs to be, then the inequality becomes an equality. But that's the idea of variational methods. You basically use Jensen's inequality to turn the log of an average into the average of the log, okay? And now you get an upper bound, right? So it's this step right here: when I turned the equality that was here into an inequality by applying Jensen's inequality, what I did is that I put the log inside. There was a log outside, and I put it inside.
So now it's the expectation of a log instead of the log of an expectation, okay? And then, because this is a ratio, it's the difference of two logs. And because this is the exponential of an energy, and I take the log and divide by beta, I get this kind of nice formula. And this is called a variational free energy, okay? And you get the expected value of the energy minus the temperature — 1 over beta — times the entropy of the distribution. Now, how you minimize this is another story. But what this means is that you can use a surrogate distribution to sample your latent variable from. You don't have to sample from the real distribution, which here is really complicated. I should have written it: the real distribution of z is p(z) = exp(-beta' E(x, y, z)) divided by the integral over z of exp(-beta' E(x, y, z)) — this would actually be a different beta; it doesn't have to be the same. If you plug this p in here, the inequality becomes an equality, okay? You can show that the smallest value of this bound is attained when q equals p, okay? And then the two sides of the inequality are equal. Okay, so that's the sort of energy view, if you want, of variational inference. If you need to compute the log of a sum of exponentials, replace it by the average of your function plus an entropy term, and that will give you an upper bound. You minimize the upper bound, and because you push down on the upper bound, you also push down on the function you actually want to minimize. Good for you. This is, you know, the bare-bones, simplest formulation of variational inference in terms of energies. I mean, you can replace L by P, with some normalized stuff, right? But it makes it more complicated — it doesn't make any difference, really, but it makes it harder to interpret. Okay, I think we're done. This is a lot.
So, people stuck around for this sort of extracurricular session of more than half an hour — yeah, 40 people. It was a pleasure teaching this class, particularly given the circumstances. All right. See you tomorrow, guys, and stay safe. All right. Take care. Bye-bye.