So I guess we can get started. This is the third part of the lecture on energy-based models. We're going to continue a little bit of what we talked about last time on sparse coding, talk about GANs very briefly (you'll hear more about them tomorrow from Alfredo), then talk about learning world models and similar things, and also a little bit about exotic self-supervised and unsupervised learning algorithms that are active research topics at the moment. So one thing I talked about last time was sparse coding, and I'm going to mention a very simple idea, which consists in combining sparse coding, or the sparse autoencoder, with discriminative training. Imagine the architecture I'm showing you here. The encoder, the first part on the left, is mostly similar to the encoder I talked about for the LISTA method. You start with the X variable, run it through a matrix, then run that through a non-linearity; it could be a ReLU, for example, as is the case here. Then you take the result, multiply it by some matrix that we're going to learn, add to it the product of the input by the encoding matrix W_E, and pass that through a non-linearity. You can repeat this little green block multiple times. Each of those is a layer, basically, that consists of a matrix (or a bunch of convolutions), an addition with a pre-existing variable, and a non-linearity. So this is a funny kind of neural network with skip connections. And then we're going to train this neural network to do three different things, with three different criteria. One criterion is just to reconstruct X: there's going to be a decoding matrix that reproduces the input on the output, and we do this by minimizing squared error. This is what's indicated by the decoding filters here; again, this could be convolutional or not, depending on which version you like. There's going to be an L1 criterion on the feature vector that makes it sparse, so this is very much like a sparse autoencoder of the type we talked about last week. But then we also add a third term, which is basically a simple linear classifier that tries to predict a category. And we train the system to minimize all three criteria at the same time. So this is a sparse autoencoder that also tries to find codes that do a good job at prediction. You can see this in two different ways: as an autoencoder that is biased towards producing good labels, or as a multilayer classifier that is regularized by an autoencoder. What's the advantage of this? By forcing the system to find representations at the second-to-last layer that can reconstruct the input, you're basically biasing the system towards extracting features that contain as much information about the input as possible. That makes the features richer, if you want. It forces the system not to generate degenerate features, but to generate features that contain as much information as possible about the input. That works pretty well. I think it's an underexplored method for training neural nets, because very often we don't have enough labeled training data.
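As a concrete sketch of that three-term criterion (not from the lecture; the module names and the weightings are illustrative assumptions), in PyTorch it might look like this:

```python
import torch
import torch.nn.functional as F

def combined_loss(x, y, encoder, decoder, classifier,
                  lam_sparse=0.1, lam_cls=1.0):
    z = encoder(x)                            # code from the LISTA-like encoder
    # Criterion 1: reconstruct the input through the decoding matrix.
    recon = F.mse_loss(decoder(z), x)
    # Criterion 2: L1 penalty on the feature vector to make it sparse.
    sparsity = z.abs().sum(dim=1).mean()
    # Criterion 3: a simple linear classifier on the code predicts the category.
    cls = F.cross_entropy(classifier(z), y)
    return recon + lam_sparse * sparsity + lam_cls * cls
```

The relative weights lam_sparse and lam_cls are the knobs that decide whether this behaves more like an autoencoder biased towards good labels or like a classifier regularized by reconstruction.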
Another case is when the training data doesn't give you a lot of categories to work with (maybe it's a two-, three-, or ten-class problem), which tends to produce very degenerate features in a neural net, as we discussed last time. Forcing the system to reconstruct basically tells it: you can't generate features that are so degenerate that you can't reconstruct the input from them. So you can think of it as a good regularizer. OK, group sparsity and structured sparsity. There's some work going back about 10 years, maybe a little more (in fact, the first works on this are about 20 years old) on the idea of group sparsity. What does that mean? Here is the idea. The idea is to train a system to generate sparse features, but not just the usual features extracted by a bunch of convolutions and ReLUs: the point is to produce features that are sparse after pooling. So you essentially have a system that consists of convolutions, a non-linearity, and pooling, and you try to make the pooled features sparse. There's a number of different works on this. The idea goes back to Hyvärinen and Hoyer in 2001, in the context of ICA, independent component analysis. Then there were a few other papers: by Osindero in Geoff Hinton's group; then Koray Kavukcuoglu, who was a student of mine back in the late 2000s; Karol Gregor, who was a postdoc with me; Julien Mairal, who is in France; and a bunch of other people, on this idea of structured sparsity. Some of those models only have an encoder, some only have a decoder, and some are autoencoders. The one on the left, Osindero's model, is an encoder-only model. Julien Mairal's model is a decoder-only model, and Koray Kavukcuoglu's model is basically an autoencoder, a sparse autoencoder of the type we talked about last time. So how does that work? Take, say, an encoder-only model. You have a feature extractor, which consists of convolutions, or maybe just fully-connected matrices over a patch, an image patch, for example. And instead of forcing the output of this (after a non-linearity) to be sparse, you put a pooling layer on top and force the output of the pooling to be sparse. This applies to all three of those models. So here's a more specific example. This is the version Koray Kavukcuoglu did for his PhD thesis, where he had a sparse autoencoder. You have an encoding function g_e(W_e, x), which could have multiple layers; in this case, it was basically two layers with one non-linearity. You have a decoder, which in this case was linear: W_d times z. And you have a latent variable z. That latent variable, instead of going through an L1 penalty, goes through an L2, but an L2 over groups. You take a group of components of z and compute the L2 norm: not the square of the L2 norm, but the L2 norm itself. Square each component in the group, sum those squares, and take the square root of the sum; that's the L2 norm within the group. Then you do this for multiple groups, which can be overlapping or non-overlapping, and you compute the sum: that's your sparsity regularizer. So what does that tend to do? It tends to turn off the maximum number of groups. The system basically applies sparsity at the level of groups.
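In formulas, the regularizer is the sum over groups P_j of sqrt( sum over i in P_j of z_i^2 ). A minimal sketch, assuming equal-sized, non-overlapping groups (the overlapping, toroidal version comes a bit below):

```python
import torch

def group_sparsity(z, group_size, eps=1e-8):
    # Sum over groups P_j of the L2 norm (not squared) of the components
    # of z in each group: sum_j sqrt( sum_{i in P_j} z_i^2 ).
    # z: (batch, n_features), with n_features divisible by group_size.
    groups = z.view(z.shape[0], -1, group_size)   # (batch, n_groups, group_size)
    # eps keeps the square root differentiable when a whole group is zero.
    return groups.pow(2).sum(dim=2).add(eps).sqrt().sum(dim=1).mean()
```

Because the norm is not squared, its gradient keeps a constant magnitude near zero, which is what pushes whole groups to shut off exactly while leaving the surviving groups free.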
So it wants the smallest number of groups to be on at any one time. But within a group, because it's an L2 norm, it doesn't care how many units are on: many units can be on within a group. So what does that do? It forces the system to group, within a pool, features that turn on simultaneously. So if you have feature extractors that are very similar (filters that are very similar, in a convolutional net), then during training those features will tend to group themselves together, because they tend to be activated together, and that's the best way to minimize the number of groups that are activated at any one time. Now, how were those interesting pictures here obtained? What you're looking at are, I think, the columns of the decoding matrix W_d, from which we can reconstruct an image patch from the sparse code. But we group those features into blocks of 36. We arrange all the features in a 2D map that has nothing to do with the topology of the image; we can choose any topology we want. In fact, this is not really a flat 2D topology: it's a toroidal topology. The left side touches the right side and the top touches the bottom, so it's topologically identical to a torus. We group sets of 36 features (six by six) within a group, and those groups overlap by three rows and three columns. So we have multiple groups of 36 features, six by six, shifted by three. You can think of this as pooling over features, but not pooling over space; there's no space here, it's a fully-connected network. But it has a bit of the same flavor as pooling, except that here you pool over 36 features, not over space. Then you compute the sum of the L2 norms of the features within each group, and that's the regularizer you use when you train your sparse autoencoder. What the system wants to do is minimize the number of groups that are on at any one time, and so, as I said before, it regroups all the features that are similar and likely to fire simultaneously. And because the groups overlap, this creates those slowly evolving sets of features that seem to swirl around a point. The features you get as a result have some sort of invariance: not to shift, but to things like rotation and scale, or whatever the system decides. Here, the reason for choosing a 2D topology is basically just to make it look beautiful; you could choose any kind of topology you want. What is on the x-axis and the y-axis in this diagram? Those are arbitrary axes. I don't even remember how many features there are here; I think it's 16 by 16, so 256 hidden units. So imagine a network that takes a 12-by-12 input patch from an image, with 256 fully-connected hidden units, a non-linearity, and another layer on top: that's the encoder. Then you have this group sparsity, and the decoder is linear. What you're seeing here are the columns of the decoder, organized in a 2D topology; but the topology is arbitrary.
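For the overlapping six-by-six groups shifted by three on a 16-by-16 torus, circular padding gives the wrap-around. A sketch (the sizes come from the description above; everything else is an assumption):

```python
import torch
import torch.nn.functional as F

def toroidal_group_sparsity(z, side=16, win=6, stride=3, eps=1e-8):
    # z: (batch, side*side) code, laid out on a side x side torus.
    zmap = z.view(-1, 1, side, side)
    # Circular padding implements the toroidal topology: the left edge
    # touches the right edge and the top touches the bottom.
    zpad = F.pad(zmap, (0, win - 1, 0, win - 1), mode="circular")
    # Overlapping win x win windows shifted by `stride` are the groups.
    patches = F.unfold(zpad, kernel_size=win, stride=stride)  # (B, win*win, n_groups)
    # L2 norm within each group, summed over all groups.
    return patches.pow(2).sum(dim=1).add(eps).sqrt().sum(dim=1).mean()
```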
Each of these squares is a column of the decoder, and also corresponds to a component of z. They're organized in a 16-by-16 matrix, but it's kind of arbitrary: we just put them in a matrix and then train. And because the groups take six-by-six neighborhoods in this topology, the system naturally learns features that are similar when they are nearby within the topology. But again, I could have chosen any kind of topology: 1D, 2D, 3D, or even some graph neighborhood of some kind. As long as the pooling is between neighbors on the graph, it will work. What I've done here is repeat the little pattern to show, because it's toroidal, how those patterns repeat; they're periodic. And the reason for visualizing it this way is that this is the kind of thing neuroscientists observe when they poke electrodes into the primary visual cortex of mammals, of most animals that have good vision. They see those kinds of swirling patterns, where neighboring neurons detect similar features, which means similarly oriented edges: the neurons are sensitive to oriented edges, and neighboring neurons are sensitive to similar angles, or the same angle at a similar scale, things like that. So perhaps this is how the brain organizes its neurons: by having some sort of criterion on the complex cells, which are the equivalent of the pooling units we're seeing here. Here is another example. This one is not at the patch level; it uses local connections, but it's not convolutional in the sense that it doesn't use shared weights. The reason for doing this is to have a semi-realistic correspondence to biological learning, where of course neurons in the brain can't share weights. They end up being similar because they're trained with some sort of unsupervised learning, but there's no such thing as weight-sharing in the brain, as far as we know. I was asked whether a similar strategy (training the autoencoder with the classifier and the regularizer, from the first slide I showed) can be applied to a variational autoencoder, whether this has been explored, and whether it works as well. Basically, adding noise in a variational autoencoder and forcing sparsity are two ways to achieve the same purpose, which is to reduce the capacity of the latent variable, the capacity of the code extracted by the autoencoder. This is what prevents the system from learning a trivial identity function, which would not be useful. And what we talked about the last couple of times is the fact that if you reduce the information capacity of the latent variable, you also, as a consequence, minimize the volume of space that can take low energy, because you limit the number of configurations of the code. So this idea of regularizing with L1 sparsity, or of adding noise to a code while limiting its norm, achieves the same purpose: limiting the capacity of the code, in order to limit the volume of space that can take low energy.
And as a consequence, if you train part of the space to have low energy by minimizing the reconstruction error on your training samples, the rest of the space automatically gets higher energy, because the volume that can take low energy is limited. This is just to recap what we talked about last time and a couple of weeks ago. These architectural methods are the alternative to the contrastive methods, where you explicitly push up the energy of bad samples, which means you have to come up with a good way of generating bad samples. And remember those two types of methods. With contrastive methods, you push down the energy of the training samples and push up the energy of stuff outside, either by corrupting the original samples, by doing noisy gradient descent (contrastive divergence, things like that), or by generating contrastive points in some other way; we've seen a bunch of different contrastive methods. The alternative is limiting the capacity of the code, or more generally limiting the volume of stuff that can take low energy, in the context of an autoencoder or a predictor. There are many ways to do this: one is sparsity; another is adding noise while limiting the norm, as in VAEs; and there are other ways that we'll talk about in a minute. Whenever you were talking before about group sparsity, you were summing just a few indices within a small range; what is that P_j? P_j is a group; it's a pool. Imagine a pool like in a convolutional net, but instead of pooling just over space, it pools over features as well. For a fully-connected network, it just pools over components of the feature vector. So P_j is a set of indices, a subset of the indices of the components of z. Right. Here, P_j is a group of six-by-six components of z that happen to be neighbors in this topology. That's one P, and the next P is a similar six-by-six square, shifted by three pixels to the left or right, top or bottom. The overlap between the groups is what represents the topology, if you want. OK. So this experiment is very similar to the one we just talked about, except here we have local connections. We have a two-dimensional input (we only draw a 1D version of it here), and we have units, possibly multiple units at each location, looking at a local patch of the input. Those sets of units are replicated multiple times, but there are no shared weights: there are units everywhere on the input, but the weights are not shared; they're just locally connected. I guess I'm not quite understanding the overall concept of feature pooling; if I think about it in terms of the pooling we used in convolutional networks, it's straightforward, but I don't really understand how feature pooling works. OK, let me draw a picture; maybe that will make it clear. You start with an input vector and multiply it by a matrix, or pass it through some sort of encoder, which may have ReLUs, multiple matrices, maybe multiple layers inside. And you get a feature vector; let's call it z.
And now you do pooling, essentially. You divide this vector into groups, in this case non-overlapping, and within one of those groups you compute the square root of the sum of the squares of the z_i, for i belonging to the group, the pool. It's called P because it's a pool. And you do this for all the groups. What you get at the output looks very much like the output of a pooling layer in a convolutional net. This is not a convolutional net, it's a fully-connected network, but the result is the same; and that's your regularizer. Now, in the example I just showed, you take z, and that is what you send to the decoder matrix, from which you reconstruct the input; that gives you y-bar, the prediction for the reconstruction. This pooled layer is only used to compute the regularizer; it's not actually used for reconstruction. You reconstruct from the sparse code directly. But it looks very much like a pooling layer. Now, if this were a convolutional net, the feature dimension here (which I'm representing vertically) would correspond to multiple feature maps. The encoder would do multiple convolutions and generate multiple feature maps, perhaps a larger number. And the pooling we would do there takes a window over space as well as over features, computes the square root of the sum of squares within it, and that gives one output in the pooling output; and you have multiple groups of features like this going into different pools. So it doesn't matter whether this is convolutional or not: with convolutions, you pool over space as well as over feature type; if you don't have convolutions, you just pool over features. And that builds invariance to whatever it is the system thinks makes sense. Is that clear? Does that answer your question? Yeah, I think it's more clear, thank you. Professor, when you split into groups and do the pooling, do those groups overlap? Right. In the example I showed here they do not overlap, but you can make them overlap. Say we have a feature vector z: I can take a pool here, a pool here, and a pool here, and those groups overlap. If I do group sparsity where these are the groups, what's going to happen is that I get a continuously varying set of features that vary from one end of the vector to the other, because the system wants to group similar features within a pool; and because of the overlap, it varies them continuously, so that they change slowly along the vector. Now, in the pictures I showed in the slides, instead of organizing the z features in a 1D topology, I organized them in a 2D topology and made the groups two-dimensional. I take a six-by-six block; that's one group. The next group is another six-by-six block with some overlap, and the next is yet another six-by-six block; and because the topology wraps around, there are also groups that take these guys and these guys. And then there's a similar thing sliding vertically, et cetera. So the groups basically are those six-by-six windows, shifted by three and overlapping.
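Putting the pieces together, the full energy of this group-sparse autoencoder is E(x, z) = ||x - W_d z||^2 + lambda * sum_j ||z_{P_j}||_2, where the reconstruction uses z directly and the pooled layer only feeds the regularizer. A minimal sketch (W_d, lambda, and the group index lists are illustrative):

```python
import torch

def group_sparse_ae_energy(x, z, Wd, groups, lam=0.1, eps=1e-8):
    # E(x, z) = ||x - Wd z||^2 + lam * sum_j ||z_{P_j}||_2
    # The reconstruction is computed from the sparse code z directly;
    # the pooled (group-norm) values only enter through the regularizer.
    recon = (x - Wd @ z).pow(2).sum()
    pooled = torch.stack([z[idx].pow(2).sum().add(eps).sqrt() for idx in groups])
    return recon + lam * pooled.sum()
```

Here groups is a list of index tensors, one per pool P_j; the pools may overlap or not.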
And so that's how you get those continuously varying features along the two dimensions. I could equally well have chosen to organize this in a 3D topology, or in some sort of tree: take all the components of z and organize them in some sort of graph, perhaps a tree. This is called structured sparsity, not group sparsity anymore; well, it depends how you do it, I guess. The groups would then be things like this: this would be a group, and perhaps this would be a group as well, and I can organize the groups in nested, Russian-doll fashion. What happens then is that units that belong to many groups tend to be very sparse, whereas units that belong to few groups tend to be less sparse. So if you do something like this with a tree, the feature at the center, the root, tends to be not sparse at all: it detects very generic features. At the first level of the tree, the features are a little sparse, so they end up as smooth edge extractors or something like that. And the further you go down the tree, the more each feature enters into a large number of pools and therefore gets more pressure to be sparse; they end up much sparser, which means more selective for particular features. And what happens is that when you show an image, the system tends to favor activating features along one particular branch of the tree, because that's the best way to minimize the number of pools that are on at any one time. So that's called structured sparsity. There are a number of papers on this by Julien Mairal (this goes back about 10 years) and Rodolphe Jenatton; they co-authored this, and the senior author was Francis Bach. I put a reference in one of the slides. There's also a paper from my group, with Arthur Szlam, which I'll get to in a minute. Can I explain why group regularization actually helps in grouping similar features? Well, that's a good question. First of all, does it help? The answer is not clear. Those experiments were done quite a while ago, before the computation and the data were available for this to really work at a big scale. The people interested in this were interested in two things: either unsupervised learning for things like image restoration (that's what Julien Mairal was doing), or unsupervised or self-supervised pre-training, because at the time the datasets were too small for training convolutional nets, so there had to be some sort of pre-training procedure, which is what I was interested in. It's the same motivation we now have again for self-supervised learning, but a lot of those methods haven't been brought back to the fore. They tended to work very well when the dataset was small: they tended to improve the performance of, say, a convolutional net if you pre-trained it using a method very similar to the one I showed earlier. Something like this, but convolutional: make the encoder and the decoder convolutional and train with group sparsity on the complex cells. And then, after you're done pre-training the system, you get rid of the decoder and use only the encoder as a feature extractor, say the first layer of a convolutional net.
Then you stick a second layer on top of it. Let me go through this a little bit. You start with an image. You have an encoder which is basically convolution plus ReLU, not much more than that (there needs to be some sort of scaling layer afterwards in this particular case), and you train with group sparsity: you have a linear decoder, you reconstruct the input, and you have this group-sparsity criterion, the sum over groups P of the square root of the sum, for i in the group, of z_i squared. So you train this little sparse autoencoder with group sparsity. Then you take the group-sparsity layer you just used as a regularizer: you cut the decoder out of the network, take the group-sparsity computation, which is really a pooling layer, an L2 pooling layer, and stick it on top. It has the same architecture you used for the group sparsity, and you use that as a feature extractor. It's like the first pair of layers of a convolutional net: convolution, ReLU, pooling, but L2 pooling, not max pooling. And then you can repeat the process. You can train another instance of this network, with a couple of layers, a decoder, and the L2 pooling with the sparsity criterion; train it to reconstruct its input; then keep the pooling, eliminate the decoder, and now you have a pre-trained two-layer convolutional net. This is a procedure that some people call a stacked autoencoder: you train an autoencoder to extract features, generate features with part of that autoencoder, stick another layer on top, train that as an autoencoder, and keep going. The only special characteristic here is that each autoencoder is trained to produce invariant features, through group sparsity essentially.
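A minimal sketch of one round of this layer-wise procedure (assuming groups of consecutive feature maps; all names and sizes here are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Pool(nn.Module):
    # L2 pooling over groups of feature maps: the same computation as the
    # group-sparsity regularizer, reused as a pooling layer after training.
    def __init__(self, group_size):
        super().__init__()
        self.g = group_size

    def forward(self, z):                          # z: (B, C, H, W)
        b, c, h, w = z.shape
        return z.view(b, c // self.g, self.g, h, w).pow(2).sum(2).add(1e-8).sqrt()

def pretrain_layer(encoder, decoder, pool, data, lam=0.1, lr=1e-3, epochs=5):
    # Train a convolutional sparse autoencoder whose regularizer is the sum
    # of the L2-pooled features; then keep encoder + pool as the next stage.
    opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=lr)
    for _ in range(epochs):
        for x in data:
            z = encoder(x)                         # convolution + ReLU
            loss = F.mse_loss(decoder(z), x) + lam * pool(z).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return nn.Sequential(encoder, pool)            # the decoder is discarded

# Stacking: the output of one pretrain_layer(...) becomes the input to the
# next call, giving a pre-trained two-layer convolutional net.
```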
Do we use all possible subtrees as groups, as in the previous example? No, that's really up to you: what structure you use, whether you use multiple trees, whether you use multiple features to represent an input, even at low frequency. It's up to you; it could be whatever you can afford. What you can also do is train the system with a bigger tree than necessary and then prune the tree wherever there are branches that are not used, or used very rarely. OK, so the experiment I show here is similar, but there are only local connections and no weight sharing. And what you see, again, is this organization of the features into what neuroscientists call pinwheel patterns. Pinwheel patterns are those patterns where the orientation selectivity varies continuously as you go around one of those red dots: take one of the red dots and trace a little circle around it, and you notice that the orientation of the edge extractors varies continuously as you move around. Those are called pinwheel patterns, and they are observed in the brain. In fact, the pictures here on the right come from neuroscience papers that describe this, where the color encodes the orientation selectivity and the little stars indicate the singularities, the centers of the pinwheels. Is the group-sparsity term trained to have a small value? Well, it's a regularizer, right? It's a cost function during training, or during inference, depending on whether you use the predictive version with a latent variable or not, but it's basically just a term in the energy. The term itself is not trained; it's fixed. It's just the L2 norm over groups, and the groups are predetermined. But because it's a criterion, it determines what the encoder and the decoder do, what type of features get extracted. Here is another example of an exotic way of doing sparse coding: through lateral inhibition. There are a bunch of different ways to do this that people have proposed; this one came from Karol Gregor and Arthur Szlam in my lab, about 10 years ago. Here, again, there is a linear decoder with a squared reconstruction error, ||Wz - x||^2, where x is the input. And then there is a term in the energy which is the vector formed by the absolute values of z, transposed, times some matrix S, times that vector again. So it's a kind of quadratic form involving |z| and this matrix S, and the matrix S is either set by hand or learned, so as to maximize this term. If a particular term S_ij is positive and large, what it means is that the system does not want z_i and z_j to be on at the same time: if z_i is on and S_ij is large, then it wants z_j to be off. So it's a mutual inhibition; people call this lateral inhibition in neuroscience. Basically, every feature inhibits the other features through this matrix S. You can decide a priori that the matrix S is structured, so that only some terms are non-zero, and those terms can be fixed or trained. The way you train them is by maximizing; it's a little bit like adversarial training. You try to find the value of S that is as large as possible, within limits. Above a certain value of S_ij, one of z_i or z_j goes to zero and that term disappears; so the system increases the S_ij until it's large enough to produce the mutual inhibition between z_i and z_j, and it doesn't go any further, because it doesn't need to. And again, you can organize S in terms of a tree. Here, the lines represent the zero terms in the S matrix: wherever you don't have a line between two features, there's a non-zero term in S. So every feature inhibits all the other features except the ones that are up or down the tree from it. This is very much like group sparsity; it's kind of the converse of it, if you want. Instead of saying that features within a branch of the tree must be activated together (by minimizing the number of groups whose L2 norm is on), here we have an inhibition term by which every feature inhibits all the features in all the other branches of the tree.
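The energy this describes, as a minimal sketch (z is the code vector, W the linear decoder, S the inhibition matrix; all sizes illustrative):

```python
import torch

def lateral_inhibition_energy(x, z, W, S):
    # E(x, z) = ||W z - x||^2 + |z|^T S |z|
    # A large positive S_ij makes it expensive for z_i and z_j to be
    # active at the same time: mutual (lateral) inhibition.
    recon = (W @ z - x).pow(2).sum()
    a = z.abs()
    return recon + a @ S @ a
```

As described next, inference finds the z that minimizes this energy for a given x; W is then updated by gradient descent on the energy, and S, if it's trained, by gradient ascent.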
And what you see, again, is that the system organizes the features in a more or less continuous fashion, in such a way that features along a branch of the tree correspond to basically the same feature but with different levels of selectivity, while features along the periphery vary more or less continuously, because there is inhibition not just at the bottom level but also at the middle level. OK, so to go back to this: the way you train the system is that at each iteration you give it an x and you find the z that minimizes this energy function. So you find the z that reconstructs, but also minimizes the second term, which means that if you have a non-zero S_ij term, it wants either z_i or z_j to be zero, or at least very small. Then you do one step of gradient descent to update W so as to minimize the reconstruction error; and, if you want, you can also do one step of gradient ascent to make the terms in S larger, by computing the gradient of this energy with respect to S and going up the gradient. Again, if you use not a tree but some sort of 2D topology, you also get those kinds of patterns, and more complex ones if there are multiple scales for the features. OK, so much for sparse coding and structured sparse coding. The reason I'm telling you about this is that although these methods don't have a huge number of practical applications at the moment, in my opinion they will be the basis for the self-supervised learning methods of the next few years. As I told you, self-supervised learning is right now the hottest topic in NLP, and it's becoming a bit of a hot topic in computer vision as well. It's mostly dominated by contrastive methods at the moment, but I think the architectural methods are going to take over, because contrastive methods don't scale very well. So this is giving you weapons for the future, if you want: an understanding of what this is all about. OK, now for something completely different. This is something Alfredo will like, because he works on this project. One of the most important uses of self-supervised learning is the idea of learning world models, for control systems or for other purposes. When humans or animals learn a task, we quite obviously have a good internal model of how the world works: intuitive physics, the fact that when an object is not supported it falls. We learned gravity when we were babies, probably around the age of eight or nine months (that's when it pops up in babies), and we learned this mostly by observation. So how is it that we can learn how the world works, and all the concepts about the world, by observation? There are two reasons to care. One I already explained: the idea of self-supervised learning. If you can train yourself to predict, maybe you will spontaneously learn abstract concepts about the world that might be useful in preparation for learning a particular task or a set of tasks. But there's another reason, which is that you actually want to build models of the world if you want to be able to act on the world, right?
So I'm holding this pen, and I know that if I move my hand up, the pen will move with it, because it's between my fingers. I know that if I open my fingers, the pen will fall. I know about gravity, I know about grasping; I've learned all that stuff, mostly by observation. I've also learned by experimentation, but a lot of what I've learned, I've learned just by observation. So the big question is: can we use what we've learned about self-supervised learning to train a system to learn world models? And what is a world model? To give an idea of the architecture of an autonomous intelligent system: it would be a system composed of essentially four major blocks, represented here on the left. It's an intelligent agent, or maybe not so intelligent, we'll see. It has a perception module, which observes the world and computes a representation of the state of the world, called s_t, at time t. s_t is the idea the system has of the state of the world. This is necessarily an incomplete representation of the world, because we can't observe the entire universe at once: we only observe what's immediately around us, and even that we can't see through occlusions, and there's a lot of internal state of the world that we can't observe. And even what we can observe, our accuracy of observation may not be good enough. If I hold this pen in my hand so that it appears vertical and I let it go, it's going to fall, but you can't really predict in what direction. I've used that example before to describe the problem of aleatoric uncertainty: the world is non-deterministic, and you can't predict exactly what's going to happen, because you don't have a perfect reading of the state of the world. And maybe the world is intrinsically stochastic; we don't actually know. OK, so a forward model is a model that, given the current state of the world s_t (or your idea of the current state of the world), an action that you're taking or that someone else is taking (something you can choose, or at least observe), and perhaps an auxiliary latent variable z_t, which represents what you don't know about the world (the part of the state of the world that you don't know, or whatever is unpredictable about what's going to happen), predicts the next state of the world, s_{t+1}, with time discretized in some way. So if you have a model of the world of that type, you can simulate in your head what's going to happen as a consequence of your actions. You have this model in your head, you know the current state of the world (or some idea of it), you run your internal model of the world forward with a sequence of actions a_t that you imagine taking, and your model of the world predicts what's going to happen in the world. If you can do this, then you can plan a sequence of actions that will arrive at a particular goal. For example: what sequence of actions should I take to grab this pen? I should follow a trajectory, actuate my muscles in a particular way, so that I grab this pen. And the criterion, the cost function, I can measure: whether I've grabbed the pen, whether the pen is in my grasp; I could measure this with some function, perhaps. And the question is whether I can plan a sequence of actions given my model of the world, which in this case is the model of my hand and the model of where the pen is.
It's a little more complicated if I throw the pen and have to catch it in the air, because then I have to predict the trajectory of the pen; I have to have an intuitive model of physics to be able to grab that pen, which of course I've also learned through experience. People are surprised: you like reinforcement learning so much? This is not reinforcement learning. It has absolutely nothing to do with reinforcement learning. Let me be very clear: this has nothing to do with reinforcement learning. It may connect to it in the future, but right now it doesn't. Model-based reinforcement learning? No, it has nothing to do with reinforcement learning. OK, let me go through this a little bit. Can I explain the difference, then? Yes, I will in a minute. So on the left here you have this little agent. It has a model of the world that it can run forward; an actor, which you can think of as a policy, that produces a sequence of actions to feed to the model; and a critic, which predicts what the cost of the final state, or of the trajectory, is going to be, according to the criterion. So the critic here basically computes the cost of not fulfilling the goal I set myself. If my task is to reach for this pen and I miss it by a few centimeters, my cost is a few centimeters; if I grab it, the cost is zero; if I miss it by a lot, the cost is higher. That would be an example of a cost. Now, there are a number of different things you can do with this sort of basic model of an intelligent agent. The first one is: you start from an initial state that you observe in the world, you run your forward model with a proposed sequence of actions, and you measure the cost. Ignoring the pi here, which represents a policy (let's imagine it doesn't exist), by gradient descent, or some other optimization algorithm, you could try to find a sequence of actions that will minimize the overall cost over the trajectory. I start from a state s_1, I run my forward model with an action (call it a_1), and it gives me s_2, and I measure the cost of s_2 through some cost function c. At the next time step, I run my forward model again with an action proposal a_2, and so on. This is all simulated; it's all in my head. This forward model is my frontal cortex; I'm not actually doing this in the world. So I can unroll this for a few time steps. Those time steps can be milliseconds if I'm controlling muscles; they can be seconds if I'm controlling high-level actions; they can be hours. If I want to plan how to, I don't know, go to San Francisco, I need to get to the airport, then catch a plane, and when I arrive there, catch a taxi or something, et cetera. So this is independent of the level of description. Now, what I can do with this is a very classical method called model predictive control. It's a classical method of optimal control, which is a whole discipline that has been around since the '50s, if not earlier; and some of the model predictive control methods go back to the 1960s. There's something called the Kelley-Bryson algorithm (I think it's Kelley with an 'e', I'm not sure). This is a method very similar to the one I'm describing at the moment, and it was used primarily by NASA to compute trajectories for rockets. When NASA started having computers in the '60s, they started computing trajectories with them, basically using things like this.
Before that, they had to do it by hand; and if you haven't seen the movie Hidden Figures, it describes how people were computing this by hand. This was mostly done by Black women mathematicians, who also ended up programming those computers. It's a really great movie. OK, so here's the basic idea. This looks very much like a recurrent net, because your forward model is basically the same network replicated over time, like an unrolled recurrent network. And what you do here is backpropagate the value of the cost of this entire network all the way to the actions. You don't use this for training; you use this for inference. You think of the actions as latent variables, and basically, by gradient descent or some other optimization method, you find a sequence of actions that will minimize the sum of the costs over the trajectory. So basically, you have an overall cost, call it big C, which is the sum over time steps of the little costs: C = sum over t of c(s_t). And what you do is: big A, which is the sequence of a's, is replaced by its own value minus some step size times the gradient of big C with respect to A, that is, A <- A - eta * dC/dA. As long as you can compute the gradient of the sum of those costs over the trajectory with respect to all the components of A, which means the whole trajectory of actions, you can do this optimization. You don't have to do it through gradient descent, necessarily; in some cases there are more efficient ways to do this optimization, using dynamic programming, for example. If A is discrete, there might be more efficient methods; but if A is continuous and high-dimensional, you basically have no choice but to use gradient-based methods. So this is inference; there's no learning yet. Big A is the sequence a_1, a_2, a_3, et cetera. So you have a differentiable objective function, and you can minimize it with respect to the variables you're interested in. So what do you get out of this? There are no weights in A; A is a vector. Yeah, and that's the point: so far we've always been optimizing weights, so people are confused. But for latent variables, like the z variables, the latent variables of energy-based models, we do minimize the energy with respect to z; it's the same kind of problem we're solving here. I think not everyone understood that the latent variables are actually inferred; that was, I think, also the misunderstanding in the question we had on Piazza about "training" these latent-variable models. Yeah: you don't want to use the word "training" for latent variables, or for things like this. You want to use the word "infer", not "train"; inference, not training. What's the difference between inference and training? With training, you learn a parameter that is the same for a large number of samples. With inference, you find the value of some variable (a latent variable: A in this case, z in the case of a latent-variable energy-based model) that is specific to one sample. You change the sample, and the latent variable changes. So you don't learn it, because you don't remember it from one sample to the next; there's no memory of it. Conceptually, you're doing the same kind of operation when you do learning and inference, so at some level of abstraction they're the same; but inference you do per sample, and learning you do over a bunch of samples, and the parameter is shared across the samples.
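A minimal sketch of this kind of model predictive control by gradient descent, assuming a differentiable forward model f and a differentiable per-step cost c are given (all names here are illustrative):

```python
import torch

def plan_actions(f, c, s0, action_dim, T=20, n_iters=100, lr=0.1):
    # The action sequence A is a latent variable we infer, not a parameter
    # we train: it is re-optimized for every new initial state s0.
    A = torch.zeros(T, action_dim, requires_grad=True)
    opt = torch.optim.SGD([A], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        s, total_cost = s0, torch.zeros(())
        for t in range(T):                   # unroll the forward model in time
            s = f(s, A[t])                   # s_{t+1} = f(s_t, a_t)
            total_cost = total_cost + c(s)   # big C = sum_t c(s_t)
        total_cost.backward()                # backprop through the unrolled model
        opt.step()                           # A <- A - lr * dC/dA
    return A.detach()
```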
When we have an energy-based model and we'd like to do inference, we still have to do it every time we use the model, right? Right, that's the big difference: after you've trained the model, when you use it, you still have to do the minimization with respect to the latent variables. Same here. And here, there may or may not be any training: your forward model may be built by hand, or it may be trained, but by the time we're at this stage, it's trained. We're not training anything here; we're just doing inference. What is the optimal value of the sequence of a's that will minimize the overall cost? That's an inference problem, just like in energy-based models. For example, the forward model can be just one line of physics equations; it can be just a deterministic equation. Imagine the forward model is the few equations that describe the physics of a rocket, and A is the action on the steering: how you orient the nozzles, plus the thrust. A would be the collection of those variables. Then there is very simple physics, Newtonian physics basically: you can write the equations, and they give you the state of the rocket at the next time step as a function of the state of the rocket at the previous time step and the actions you're taking. That's how you do simulation; that's how every simulator works. And your cost function, if you want to fly a rocket, might be a combination of two things. One would be the energy spent during each time step, the amount of fuel, something like that. The second term might be the distance to a target you want to reach: maybe you want to rendezvous with the space station, and the second term in the cost would be the squared distance to the space station. If you measure the sum over the entire trajectory of the squared distance to the space station, the system will try to minimize the time it takes to get there, because it wants to minimize that sum; but at the same time, it wants to minimize fuel, so you have to balance those two terms. So that's a classical way of doing optimal control, and it's called model predictive control. Is Kalman filtering one type of model predictive control? No. A Kalman filter is a particular forward model, if you want; it's a way of estimating the state of the world. Basically, given your observation of the state of the world through a perception system, there's going to be some uncertainty about the state of the world, and the Kalman filter assumes a Gaussian distribution on that uncertainty. Now, when you run through your forward model, you're going to have a resulting uncertainty about the state of the world at the next time step, because it wasn't certain to start with: given the uncertainty you started from, what's the uncertainty after one step of physics, if you want? If you assume linearity of all those steps and Gaussianity of the uncertainty, that's what a Kalman filter is. So your forward model produces a prediction, and at the next time step you might get another reading of the state of the world, because your sensors are still working. Now you have two Gaussians: your new perception of the world tells you "here is where I think the state of the world is", and your forward model also predicted "here is where I think it is", and you have to combine those two. That's where the complexity of Kalman filtering comes in: I've got two Gaussian estimates, the resulting probability distribution is also a Gaussian, and I have to compute its covariance matrix, et cetera. That's where the formulas for Kalman filters come from. So a Kalman filter is a way to deal with the uncertainty in your reading, your perception, of the world, and with propagating that uncertainty through your forward model.
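A scalar caricature of that combination step (the real filter does the same thing with mean vectors and covariance matrices; this is just to make the Gaussian-merging idea concrete):

```python
def kalman_update(mu_pred, var_pred, y, var_obs):
    # Combine the forward model's Gaussian prediction (mu_pred, var_pred)
    # with a new Gaussian observation (y, var_obs). The product of the two
    # Gaussians is again Gaussian, with a precision-weighted mean.
    k = var_pred / (var_pred + var_obs)   # Kalman gain
    mu = mu_pred + k * (y - mu_pred)      # pulled toward whichever is more certain
    var = (1.0 - k) * var_pred            # the combined estimate is less uncertain
    return mu, var
```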
I think there was still a main difference; you wanted to address the point that this is different from RL. OK, so what is RL in that context? I need one more step before I talk about RL, and here is that step. What we had just a minute ago was a forward model unrolled in time: the system takes a sequence of actions a_1, a_2, a_3, producing states s_1, s_2, with the cost function computed along the way, and this could go on. Now, what we'd like to be able to do is not have to do this optimization with respect to a_1, a_2, a_3, a_4 every time we need to do planning. We don't want to have to go through this complex process of backpropagating a gradient through the entire system to do model predictive control. A simple way to get rid of that step is the same trick we used for autoencoders versus sparse coding. Remember: with sparse coding, we wanted to reconstruct, but then we had to do inference with respect to the latent variable, and that turned out to be expensive. So what we talked about last week was the idea of training an encoder to predict the optimal value directly; that resulted in the idea of the sparse autoencoder. We're going to do the same here: we're going to train a network that takes the state and directly predicts what the optimal value of the action is. This network, of course, we're going to apply at every time step, and it's going to be called a policy network. The policy network takes the state and produces a guess about the best action to take at this time, so as to minimize the overall cost; and it's going to be a trainable neural net, or whatever parameterized model we want. The way we're going to train this model is basically just backpropagation. Using our perception module (this is the world here, and we're looking at the world with a camera), we get a guess as to what the state of the world is: this is perception. Then our forward model is applied for multiple time steps, and this is our cost. So what we can do is run the system: we first run through perception; we compute an action; we run this action through the forward model, which gives us the next state we're going to be in; we compute the cost; and then we keep going. We just forward-propagate through this entire system, which is really kind of an unrolled recurrent net, if you want. And once we're done, we backpropagate gradients from all the terms in the cost function, through the network, all the way to the parameters of that policy network. So basically, we compute dC/dw (big C, remember, is the sum of all the little costs over time), and that's just the sum over time of dC/da_t times da_t/dw. I've just applied the chain rule, right? But I don't need to do it by hand: if I just define this function in PyTorch and call backward, it'll just do the right thing.
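The same rollout as before, but now the gradient updates the policy's weights rather than the actions: amortized inference. A minimal sketch, with the same assumed f and c as above:

```python
import torch

def train_policy(policy, f, c, initial_states, T=20, epochs=10, lr=1e-3):
    # Train a policy network a_t = policy(s_t) by backpropagating the total
    # cost through the unrolled forward model into the policy's weights.
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for s0 in initial_states:             # states from the perception module
            opt.zero_grad()
            s, total_cost = s0, torch.zeros(())
            for t in range(T):
                a = policy(s)                 # predict the action directly
                s = f(s, a)                   # s_{t+1} = f(s_t, a_t)
                total_cost = total_cost + c(s)
            total_cost.backward()             # dC/dw by the chain rule
            opt.step()                        # update weights, not actions
    return policy
```

Unlike plan_actions above, the optimization here is over parameters shared across all samples: learning, not inference.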
So I can compute the gradient of the overall cost with respect to the parameters of that policy network. And if I train this over sufficiently many samples, if my forward model is correct, and if my cost function does what I want, then my policy network is going to learn a good policy that, just by looking at the state, minimizes the expected cost over a trajectory, the average cost over a trajectory. There's no reinforcement learning here; this is all backprop. OK, now we can talk about the difference from reinforcement learning. The main difference with reinforcement learning here is twofold. The first is that in reinforcement learning, in most reinforcement learning scenarios at least, the cost function is a black box (on the diagram, it's literally a black box, not a red box). That's the first difference. The second difference is that this is not a forward model of the world: this is the real world, and your measure of the state of the world is imperfect. So inside the policy network, you might have your perception network that estimates the state of the world. You have no control over the real world, and your cost function is not known: you can only get the output of the cost function by trying something. You take an action, you see the effect on the world, and that gives you what reinforcement learning people call a reward; but it's just a negative cost, the negative value of your cost. The cost is not differentiable; you don't know the functional form of the cost; you have to go through the world to figure out the value of the cost. And that's the main issue with reinforcement learning: the cost function is not differentiable, it's unknown, and the only way to estimate it is by trying something and then observing the value. The negative of the reward is basically your cost. So in that situation, since you cannot evaluate gradients to minimize your cost, you have to try multiple things: you try an action, see the result, then try another action, see if the result is better. And if your cost function is very flat, you have to try many, many things before you get a non-zero reward, or a cost that isn't high. That's where the complexity goes. Then there is the additional problem of exploration: because you don't know the form of the cost, and because it's non-differentiable, you might need to try many actions, in kind of a smart way, to figure out which part of the space to go to, to be able to figure out how to improve your performance. That's the main issue of exploration. And then there is the issue of exploration versus exploitation: when you're in a situation, you don't want to take completely random actions, because they're likely not to result in anything interesting. You want to take actions that are kind of close to what you think might work, and occasionally try something else, learning your policy as you go. What I was describing just before is a situation where you can do all of this in your head, because you have a model of the world, and you can optimize your sequence of actions very efficiently, because you have a differentiable cost function. Your cost function is computed by your own brain, if you want, inside the agent: if you're grabbing the pen, you can tell the distance between your hand and the pen, so you can compute your own cost function. And inside your internal world model, it is differentiable.
In the real world, it's not: you don't know the derivative of the distance of your hand to the pen, unless you have some model of that in your head, and by default you don't. But because everything is in your head, implemented by a neural net, you can backpropagate gradients through everything. That's the big advantage of this kind of approach versus reinforcement learning: make everything differentiable. So there are several big advantages in this kind of scenario. One: you can run this faster than real time, because the forward model inside your agent can run as fast as you want; you don't need to run through the world. Second: the actions you're considering will not kill you. Maybe, using your forward model, you'll predict that an action would kill you, but then you're not going to take it in the real world, so it won't kill you, provided you have an accurate forward model. Third: because everything takes place in your head, everything is a neural net, everything is differentiable, you can use all kinds of efficient optimization or inference algorithms to figure out a good course of action. So that's the difference with reinforcement learning. In reinforcement learning, you're telling yourself: I have to go through the real world, I don't have a model of the real world, and I don't know how to compute the cost function in a differentiable way. That said, a lot of reinforcement learning methods actually work by training a model of the cost function. In actor-critic methods, the role of the critic is basically to learn to predict the value of the overall objective function, the expected value of the objective function; and because the critic is a neural net that you train, you can backpropagate gradients through it. So it's basically learning an approximation of the cost function of the real world, using a neural net; that's the role of the critic. OK, why is it so good to have models when you're learning a skill, like learning to drive, for example? It's basically what allows you to learn quickly, and to learn without killing yourself. If you don't have a good model of the world (you don't know about gravity, you don't know about the dynamics of objects, you don't know anything) and you put an agent at the wheel of a car, the agent has no idea what the physics of a car is. You put the car next to a cliff, driving at 30 miles an hour next to the cliff: the agent doesn't have a model of the world, so it has no idea that by turning the wheel to the right, the car will run off the cliff and fall into the ravine. It has to actually try it to figure it out. It has to fall into the ravine to figure out that this is a bad idea, and maybe it's not going to be able to learn that from just one sample, so it's going to have to run into the ravine thousands of times before it figures out the model of the world: first, that turning the wheel to the right makes the car go to the right; and second, that when the car goes over a ravine, it falls into the ravine and destroys itself. If you have a model of the world that understands gravity and things like this, then you know that turning the wheel toward the ravine is going to kill you, and you don't do it. So what allows humans and animals to learn quickly, much, much more quickly than any model-free reinforcement learning method that has ever been devised, is the fact that we have very, very good world models in our heads.
OK, so why is it so good to have models when you're learning a skill, like learning to drive, for example? It's basically what allows you to learn quickly, and to learn without killing yourself. If you don't have a good model of the world, you don't know about gravity, you don't know about the dynamics of objects, you don't know anything, and you put an agent at the wheel of a car, the agent has no idea what the physics of a car is. You put the car next to a cliff, driving at 30 miles an hour, and the agent, having no model of the world, has no idea that by turning the wheel to the right the car will run off the cliff and fall into the ravine. It has to actually try it to figure it out; it has to fall into the ravine to learn that this is a bad idea, and maybe from just one sample it's not going to be able to learn it. So it will have to run into the ravine thousands of times before it figures out the model of the world: first, that turning the wheel to the right makes the car go to the right, and second, that when the car goes over a ravine, it falls in and destroys itself. If you have a model of the world that understands gravity and things like this, then you know that turning the wheel to the right sends you into the ravine, and you don't do it, because you know it's going to kill you. So what allows humans and animals to learn quickly, much, much quicker than any model-free reinforcement learning method that has ever been devised, is the fact that we have very, very good world models in our heads. Now, what does that tell us? Here is the problem with the world: the world is not deterministic, or if it is deterministic, it's so complex that it might as well be non-deterministic; it makes no difference for us. The world is not entirely predictable, and it can be unpredictable for two reasons, called aleatoric uncertainty and epistemic uncertainty. Aleatoric uncertainty is due to the fact that the world is intrinsically unpredictable, or to the fact that we don't have full information about the state of the world, so we cannot predict exactly what's going to happen next. You're looking at me right now, and you have a pretty good model of my immediate environment, but you cannot exactly predict which way I'm going to move my head next, because you don't have an accurate model of what's inside my skull. Your perceptual system does not give you a full model of how my brain functions, unfortunately, so you cannot exactly predict what I'm going to do next, what I'm going to say, how I'm going to move my head, and so on. That's aleatoric uncertainty. There is also epistemic uncertainty, which is the fact that you can't completely predict the next state of the world because the amount of training data you've had was not enough; your model hasn't been trained enough to really figure it out. That's a different type of uncertainty. So the big question now is: how do we train models of the world under uncertainty? I give you s_t; can you predict s_t+1? It's the same problem we encountered before with self-supervised learning: I give you an X, can you predict Y? But the problem is that there are now multiple Y's that are compatible with X, multiple next states compatible with s_t, even for a given action. That means our forward model may take the state of the world and an action, but it will also have to take a latent variable, whose value we don't know, in order to predict the next state. This is very much like what we talked about earlier; I'm going to draw it in a different topology, but it's the same idea. We had X going through a predictor computing H, and that was going through a decoder that takes into account a latent variable to predict Y bar. Here this is a prediction for the next state, and while we are training the model we might actually take the action and observe the true next state of the world at time t+1. So to train a forward model, we take the state s_t, we take an action if we have one, we have a latent variable, and our prediction goes into a cost function. That diagram is exactly identical to the one on the right; it's the same diagram, except I split the forward model into two modules, I've given it a particular architecture. In fact, I could make this more explicit. (Alfredo: I think you have the super thick marker selected. I do, yes.) So this is what's inside the forward model box here. And I renamed things: s_t is now called X, and s_t+1 is now called Y, but it's the same thing otherwise. So it's the same scenario we talked about before, latent-variable models essentially, but now we're going to use it to train a forward model to predict what's going to happen in the world.
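In code, such a latent-variable forward model might look like this minimal sketch; all names and sizes are invented:

```python
import torch

# Sketch of a latent-variable forward model:
#   s_{t+1} ~ dec(pred(s_t, a_t), z_t)
# where z_t carries what cannot be known from the state and action alone.
pred = torch.nn.Linear(8 + 2, 16)   # predictor: (state s_t, action a_t) -> h
dec = torch.nn.Linear(16 + 4, 8)    # decoder: (h, latent z_t) -> predicted s_t+1

def forward_model(s, a, z):
    h = pred(torch.cat([s, a]))     # representation of the state and action
    return dec(torch.cat([h, z]))   # z supplies the unpredictable part
```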
So we may have to play the same tricks we talked about last week. The way I drew it then was slightly different, but what I explained was this: while we are training a forward model, we have a pair X and Y, and the way we find the value of Z is by minimizing the energy with respect to Z. We find Z star, the argmin over Z of C(Y, Y bar), where Y bar is the output of our predictor, of our system. And then we do one step of gradient descent: we change the parameters of the entire system according to the gradient of that cost. But for this to work, we had to regularize Z to limit its information content, and we have to do the same here. Why is that? Well, here we're trying to solve a prediction problem, but imagine, and we talked about this a couple of weeks ago, that I give you an X and a Y, you find a Z that minimizes the overall energy, and that Z is not regularized. If Z has the same dimension as Y, there's probably going to be a Z for any Y that makes the cost function zero; there's enough capacity in Z that there will always be such a value. And that's bad, because it means my energy function is going to be completely flat, zero everywhere, whereas I need it to be low on the training samples and high outside the region of high data density. What we saw in the last couple of weeks is that by regularizing Z, by limiting its capacity, either by making it sparse, for example, or by making it discrete, or by making it noisy, we can limit its information content. Student: why do we need z_t if we already have a_t? Well, a_t is the action you take. Say I'm going to let this pen go, but you don't know in which direction it's going to fall; say it goes this way, but you would have had to predict that in advance. Here's a better situation: you are a goalie playing soccer, and it's a penalty kick, so the kicker is in front of you, he's going to kick the ball, and you're going to have to jump one way or the other. You have to make a choice, am I jumping left or right, and you have to make that decision based on what you observe from the person. What you do, A, is which direction you jump in, basically how you jump. Z is what you don't know about what the player in front of you is doing: you don't know the state of the world, you don't know the state of this guy's brain, so you don't know if he's going to shoot left or right or up or down. That's the difference: Z is what you cannot know about the world that is necessary to make the prediction, and A is the action you take, which in this case has very little influence on the immediate state of the world. It seems to be clear now, right? So you need to regularize Z.
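As a concrete picture, here is a hedged sketch of that inference-by-minimization, with an L1 penalty standing in for the capacity limit on Z; all names are invented:

```python
import torch

# Sketch: find z* = argmin_z E(x, y, z), where the energy is prediction
# error plus a regularizer that limits the information content of z
# (here an L1 penalty making z sparse).
def infer_z(x, y, model, dim_z=4, lam=0.1, steps=50, lr=0.1):
    z = torch.zeros(dim_z, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        y_bar = model(x, z)                           # prediction given this z
        energy = ((y - y_bar) ** 2).sum() + lam * z.abs().sum()
        opt.zero_grad()
        energy.backward()
        opt.step()
    return z.detach()
```

Without the `lam * z.abs().sum()` term, a large enough z could drive the energy to zero for any Y, which is exactly the flat-energy failure described above.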
One of the tricks we described to regularize Z was sparsity; another one was adding noise. But another trick we described is the idea of having an encoder. So you have X, or s_t, running through the predictor; the predictor feeds the decoder, which makes a prediction for Y, call it Y bar; you compare Y bar to Y, and here you have Z. The idea is to use an encoder to predict the optimal value of Z, and then to have a term in the energy that measures the discrepancy between the value of Z you actually use and the value of Z predicted by the encoder, perhaps regularized in some way. It's pretty clear that you need an information bottleneck on Z here, otherwise the system will cheat: it will completely ignore X, and it will predict Y exactly by just looking at the value of Y, running it through the encoder, and then reproducing Y. That's just a very simple autoencoder. So unless you restrict the capacity of Z, the system will cheat and not actually train itself to predict; you have to push down on the information content of Z so as to force the system to use the information from X to make the best prediction. Now we can use that trick to train our forward model, because the forward model is basically just an instance of this. This is a project on autonomous driving that a former student of mine, Mikael Henaff, worked on; Alfredo has worked on it too and is still working on it. Here you're trying to train a car to drive itself, and what's difficult to predict is what the cars around you are going to do. So you place a camera above a highway and you watch the cars go by; you can track every car and extract the immediate neighborhood of each one, basically a little rectangle around every car that indicates where the other cars are relative to it. This is what's represented at the bottom: a little rectangle centered on a given car, with that car in a standardized location in the middle, and all the surrounding cars inside. You do this for every car, and what it gives you, for every car, is a sequence of what the cars around it do over time. We can use this to train a forward model that will predict what the cars around us are going to do. Student: is this forward model predicting all possible futures, irrespective of the action taken? We predict a set of futures: given one action, one initial state, and one particular value of the latent variable, it will make a single prediction; then you can vary the latent variable and it will make multiple predictions, and you can change the action, of course. So I've redrawn the little diagram I drew previously. The state is basically a short sequence of frames from this video; there's no abstract state here, it's just the pictures themselves. The blue car is our car and the green cars are the other cars. You take a few frames from the past and run them through a neural net which attempts to predict the next frame, using basically a big convolutional net as a predictor and a big convolutional net as a decoder. There's a latent variable here, and there's also an action, which is not drawn, that gets into this. The system also has an encoder, so it looks more like this; again, the action is not represented, but imagine there is one. X is the past frames; it goes through a predictor that computes a representation of the input; that representation is combined additively with the latent variable, and the result goes into a decoder, a convolutional net, that makes the prediction for the next state. The latent variable itself is predicted by an encoder, which is also a convolutional net: it takes the past and the future and tries to predict the ideal value of the latent variable.
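A rough sketch of that architecture, with invented sizes and layer choices, might look like this:

```python
import torch
import torch.nn as nn

# Sketch: a predictor encodes the past frames, the latent is ADDED to that
# representation, a decoder predicts the next frame, and an encoder guesses
# the latent from past and future. 3 past frames of 3 channels each.
predictor = nn.Sequential(nn.Conv2d(3 * 3, 64, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(64, 3, 3, padding=1)
encoder = nn.Conv2d(3 * 3 + 3, 64, 3, padding=1)   # sees past AND future

def predict_next(past, z):
    h = predictor(past)              # representation of the past frames
    return decoder(h + z)            # latent combined additively, then decoded

def guess_z(past, future):
    return encoder(torch.cat([past, future], dim=1))
```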
Of course, you have to restrict the information content of Z, and in this particular project that is done using a VAE-like approach; it's basically a VAE with a few tricks. Z is sampled from a distribution obtained from the output of the encoder: the encoder outputs a prediction Z bar as well as a prediction for the variances, and Z is sampled from that distribution. So it's not optimized, it's sampled. There's also a term that tries to minimize the sum of the squares of the Z's over time, which is the standard technique for VAEs. That goes into the decoder, and the whole thing is trained as a conditional autoencoder, basically. There's another trick added on top of this: half the time, Z is simply set to zero. So half the time the system is told: you're not allowed to use Z, just make your best guess at the prediction without one. That drives the system to really use the past, much more than if you just have a noisy Z; if you use only the standard VAE-type training, the system basically ignores the past, it just cheats, it looks at the answer.
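Something like this hedged sketch captures both tricks, reparameterized sampling and zeroing Z half the time; the names are assumptions:

```python
import torch

# Sketch: sample z from the encoder's predicted distribution (VAE-style),
# and half the time force z = 0 so the model must rely on the past.
def sample_z(z_mean, z_logvar, p_drop=0.5):
    if torch.rand(1).item() < p_drop:
        return torch.zeros_like(z_mean)    # best guess without any latent
    eps = torch.randn_like(z_mean)         # reparameterization trick
    return z_mean + torch.exp(0.5 * z_logvar) * eps
```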
Alfredo: I will cover the rest in greater detail in a future lab; perhaps you want to say something about GANs, because I will actually be going over this whole presentation as well. OK, so GANs are a particular form of contrastive learning. Remember that when we talked about energy-based learning, we have data points and a model, which I'm going to draw like this, with a cost function. It could have any kind of structure, but I'm just going to draw it like this; this would be a reconstruction-type model, so imagine the model here is an autoencoder or something like this, but you can imagine just about anything. A simplified version would be just: Y goes into a cost function, without specifying what the cost function looks like. What the cost function computes, in the space of Y, let's say Y is two-dimensional, is an energy that we want to be low on the data and high outside the data. Here I deliberately drew a bad energy function: it's bad because it should be low in this region where we have data and higher outside, but right now it's also pretty low in this other region over here. So we talked about contrastive methods, and contrastive methods consist in taking a sample and pushing down on its energy, and then taking a contrastive sample, which I'm going to draw in purple. A contrastive sample is a sample to which the model currently assigns low energy but should not; we're going to push its energy up: push up on the energy of this one, push down on the energy of that one. If you keep picking such samples and contrastive samples, and minimize some objective function that wants to make the energy of the blue points small and the energy of the purple points high, then the system will learn properly. We've seen several ways of generating contrastive samples: the idea of the denoising autoencoder, which is to take a sample and corrupt it in some way; the idea of contrastive divergence, where you take a sample and go down the energy landscape with some noise, and that gives you a contrastive sample to push up; and a number of other methods based on prior knowledge about similarity between samples. But here is another idea: train a neural net to produce those contrastive samples intelligently. That's the basic idea of GANs, or at least one form of GANs, because there are several formulations; in fact, there is an entire laundry list of various types of GANs. The basic idea is that you train your energy model, which in the context of GANs is called a discriminator, or sometimes a critic, but it's basically just an energy model, to take low energy on the data points; and then you train another neural net to generate contrastive data points, whose energy you push up. The overall diagram is something like this. You have a discriminator, which really could be a large neural net, but in the end it's just a cost function: it takes a variable Y and tells you if it's good or bad, low energy if it's good, high energy if it's bad. In one phase, you collect a piece of data from your dataset and give it to your discriminator; this is a real Y coming from the data, a training sample, and you say the output should go down. I should really write this as F, because after all it's an energy function: make F(Y) go down, of course by changing the parameters, so W is replaced by W minus eta times dF/dW. F is a neural net, some parameterized function, probably a pretty complicated neural net. That's the first thing, and it will make the energy of that point small. There's also a conditional form of this, where you have an extra input, an observation; you can have it or not, and it doesn't matter for what follows. In the second phase, for contrastive samples, you have a latent variable Z that you sample from some distribution that's easy to sample from, say a multivariate Gaussian or a uniform distribution. You run it through what's called a generator, a neural net that produces something similar to Y; in our case it just produces an image. Again you run this through your discriminator, but now you want to make the output large. In fact, what I told you before was a small lie: you don't do the update exactly like that. What you want is to make F_W of this Y bar high, and to do that you train the discriminator and the generator simultaneously. First, you have to come up with a loss function: a sum of per-sample loss functions, each of which is a function of F(Y) and F(Y bar), where Y bar is generated from the sampled latent variable Z. This loss function needs to be a decreasing function of F(Y) and an increasing function of F(Y bar). You can use just about any loss you want, as long as it makes F(Y) decrease and F(Y bar) increase, or as long as it makes the difference F(Y) minus F(Y bar) decrease. One example is a hinge loss: the per-sample loss is the positive part of F(Y) plus some margin minus F(Y bar), that is, [F(Y) + m - F(Y bar)]^+. This is a hinge: it wants to make F(Y bar) larger than F(Y) plus the margin m, and beyond that it doesn't care. That's one example.
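A minimal sketch of the discriminator update under that hinge loss might look like this; F, G, and the optimizer choice are assumptions:

```python
import torch

# Sketch: one discriminator update with the hinge loss.
# F is the discriminator (an energy function), G the generator.
def discriminator_step(F, G, y, z, opt_F, m=1.0):
    y_bar = G(z).detach()            # contrastive sample; G is frozen here
    # [F(y) + m - F(y_bar)]^+ pushes F(y) down and F(y_bar) up past the margin
    loss = torch.relu(F(y) + m - F(y_bar)).mean()
    opt_F.zero_grad()
    loss.backward()
    opt_F.step()
    return loss.item()
```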
The actual loss that the original formulation of GANs used basically plugs each of those terms into a sigmoid: it tries to make the sigmoid applied to minus F(Y) as close to 1 as possible and the sigmoid applied to minus F(Y bar) as close to 0 as possible, and you take logs. It's basically a cross-entropy loss where the data samples are the positive class and the generated samples are the negative class. In logistic form, the per-sample loss for the discriminator comes out as log(1 + exp(F(Y))) + log(1 + exp(-F(Y bar))). But you could imagine a large number of objective functions of this type. So that is the loss you use to train the discriminator. The generator, though, gets a different loss function, and you optimize the two losses simultaneously. The one for the generator basically wants to make the generator produce outputs that the discriminator thinks are good, even though they're not: the generator wants to adapt its weights so that the output it produces, Y bar, gets a low energy F(Y bar). So you sample a random variable Z, run it through the generator to produce Y bar, run that through the discriminator to get F(Y bar), and then you backpropagate that value through the generator and adapt the generator's weights so that this energy goes down. The generator trains itself to produce Y's that have low energy. Again, for conditional GANs there's an X variable that enters both modules, but that makes no difference in the end. So L_G may simply be an increasing function of F(Y bar). (Alfredo: I think we are kind of running out of time. We are; we have run out of time.) So this would be some objective function of F(G(Z)), where Z is sampled randomly, and you just do backprop through it and change the parameters of G, let's call them U, so that this goes down.
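In code, those per-sample losses and the generator update might look like this hedged sketch, using the identity softplus(x) = log(1 + e^x); the names are assumptions:

```python
import torch
import torch.nn.functional as fn

# Sketch of the logistic losses above.
def loss_discriminator(F_y, F_y_bar):
    # decreasing pressure on F(y), increasing pressure on F(y_bar)
    return fn.softplus(F_y).mean() + fn.softplus(-F_y_bar).mean()

def loss_generator(F_y_bar):
    # an increasing function of F(y_bar): the generator seeks low energy
    return fn.softplus(F_y_bar).mean()

def generator_step(F, G, z, opt_G):
    loss = loss_generator(F(G(z)))   # backprop through F into G's parameters
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
```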
Now, this is called a game, in the sense that you have two objective functions that you need to minimize simultaneously and that are incompatible with each other. So it's not a gradient descent problem: you have to find what's called a Nash equilibrium between those two functions, and gradient descent will not do it by default. That leads to instabilities, and there are tons of papers on how to make GANs actually work; that's the complicated part, but Alfredo will tell you all about it tomorrow. Alfredo: maybe you also want to mention the issue with the sigmoid, which creates problems when we have samples close to the true data manifold, and then I think we can close the session. OK, so let me mention that. Imagine, again in the energy-based framework, that your data lies around a manifold, but a thin manifold, so it's an infinitely thin distribution. In the original formulation of GANs, the discriminator would need to produce zero probability outside of this manifold and, on the manifold, infinite probability density, in such a way that, if this is really density estimation, the integral of the density over the entire space is one. And this is of course very hard. So GANs basically abandon the idea of actually learning a distribution. What they want, in the original formulation, is to produce zero outside the manifold of data and one on it. That one is the output of a sigmoid, and for a sigmoid to output exactly one, its input, a weighted sum, needs to be essentially infinite. And the problem is that if you train the system successfully and get that energy function, zero outside the data manifold and one on it, your energy function is completely useless, because it's a golf course: it's flat. The corresponding energy is the negative log of the density, so it would be infinite outside the manifold, while on the manifold it takes the minimum value of your cost function, which could be zero if it's an autoencoder; everywhere else the energy is infinitely higher, a golf course with walls of infinite altitude, which is really not useful. For an energy-based model to be useful, you want the energy function to be smooth: you don't want it to jump to infinity in one tiny step, you want it smooth so that you can do inference, so that if you start from a point out here, it's easy to find a nearby point on the manifold, using gradient descent for example. So the original formulation leads, first of all, to infinite weights in the discriminator, to instabilities, to something called mode collapse, which Alfredo will tell you about, and in the end to an energy function that's essentially useless. It's not ideally formulated, and people have proposed ways to fix it by regularizing the energy function, basically forcing it to be smooth. One good example is something called Wasserstein GANs, proposed by Martin Arjovsky, who just graduated from NYU, and a few other people. The idea is basically to limit the size of the weights of the discriminator so that the function is smooth; there are various mathematical arguments for it in a probabilistic framework, but that's the basic idea, and there are lots of variations of it. Alfredo: questions about today's class? It was dense, but at least we were answering every question as it came through, so I think we followed along today. Student: I wasn't sure, maybe you explained it in a different form and I didn't realize it's the same thing, but I was a little lost on what the policy network is and what it does. So, the policy network takes the estimate of the state of the world and produces an action, and it's trained to minimize the expected cost over the trajectory; but it takes just one action at a time. Student: there was a part towards the end where you drew a new connection from S, through some module, to A; what is happening there? So, the policy network is the one indicated by pi here on the screen: it takes S, the state, and produces an action. That's what a policy is: you observe the state of the world and you take an action. In fact, with a probabilistic policy you don't output an action directly, you output a distribution over actions, and then you pick an action from that distribution in some way; but here you just have to take an action. If the set of actions is discrete, then this pi network is basically a classifier: it produces a score for each possible action, and then you take one of the actions deterministically or probabilistically. Deterministically, you just take the action with the highest score; probabilistically, you can sample according to the scores.
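For the discrete-action case just described, a policy network could be sketched like this; the sizes are invented:

```python
import torch

# Sketch: a policy over discrete actions is just a classifier
# producing one score per action.
pi = torch.nn.Linear(8, 5)                     # state -> 5 action scores

def act(s, deterministic=True):
    scores = pi(s)
    if deterministic:
        return scores.argmax().item()          # take the highest-scoring action
    probs = torch.softmax(scores, dim=-1)
    return torch.multinomial(probs, 1).item()  # or sample according to the scores
```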
Then you run the chosen action through your forward model, and you keep going. Without the policy connection, the action is a latent variable, so you have to optimize with respect to it to find its optimal value. You have this kind of diagram where the actions are not produced by a neural net; they are latent variables you have to figure out every time you run your model: what is the best sequence of actions to minimize my cost? You do this, for example by gradient descent, finding the sequence of A's that minimizes the sum of the C's over the trajectory. That's called model predictive control, and the version with the policy network is called direct control, essentially. Student: Professor, you said that during inference we need to minimize the energy to get the final value. Two questions: first, won't that take too much time during inference, and would it be usable for real-time systems? And second, since it's unrolled and you have to backpropagate all the way back to the beginning, wouldn't it have all the problems we face with recurrent neural networks? Exactly. Though presumably you're not going to get quite the same problems as with recurrent nets, because your forward model presumably implements the dynamics of some real system, so it might not have the non-invertibility issues; if it's a physical system, it's probably going to be reversible. So you may not face the same issues as with regular recurrent nets, but you are facing the same kind of computation. Now, in real-time situations you use a form of this called receding-horizon planning. Receding-horizon planning means that, in a real-time situation, your system runs its forward model only for a few steps into the future, say enough steps to predict a few seconds ahead; that's your horizon. Then you do model predictive control over that horizon, finding by optimization the sequence of actions that minimizes your estimated cost according to your model; you haven't taken any real action yet, you've only run your internal model. Then you take the first action of that optimized sequence, and you do it all again: with the action taken, you observe the new state of the world from your sensors, optimize the action sequence for a few more steps into the future, take the first action, and repeat. This can be expensive if your horizon is long or your forward model is complicated, and that's when you need a policy network: the policy network basically compiles this whole process into a neural net that directly produces the action, which may or may not be possible, but it gives you a good guess.
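Here is a hedged sketch of that receding-horizon loop, assuming a differentiable forward model and cost; all names are invented:

```python
import torch

# Sketch of receding-horizon planning: optimize a short action sequence
# through the forward model, execute only the first action, then replan.
def plan_first_action(s0, f, cost, horizon=10, steps=50, lr=0.1, dim_a=2):
    A = torch.zeros(horizon, dim_a, requires_grad=True)
    opt = torch.optim.SGD([A], lr=lr)
    for _ in range(steps):             # inner optimization over the actions
        s, total = s0, 0.0
        for t in range(horizon):
            s = f(s, A[t])             # imagine the trajectory in the model
            total = total + cost(s)
        opt.zero_grad()
        total.backward()
        opt.step()
    return A[0].detach()               # execute this action, then replan
```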
Now, to give you a concrete example, there are some interesting books by a Nobel-prize-winning economist, Daniel Kahneman, who talks about two systems in the human mind, called System 1 and System 2. System 1 is the process by which you take an action without thinking: you're a very experienced driver, and you can drive your car without even paying attention, while talking to someone next to you. System 2 is more deliberate planning: you use your internal model of the world to predict in advance what's going to happen, to foresee it, and then take a deliberate action according to your model. So it's more like reasoning; you can think of this optimization with respect to actions, to minimize an objective, as a form of reasoning, and we talked about this before. So basically, model predictive control is what you do when you don't have a policy, when you haven't learned the skill: you know what your cost function is, you have a pretty good model of the world, but you don't know how to react. It's like a chess game: you have to think about all the possibilities before you play, because you don't know where to play. If you're an expert player playing against a beginner, you know immediately where to play; you don't have to think about it. I don't know if you've ever seen a grandmaster play simultaneous games against many people and beat them all in a few minutes: the player can go from one opponent to the next and just immediately play. It's completely reactive; they don't need to think, because they have compiled, if you want, their knowledge of chess, so that they don't need to think when they see this kind of familiar situation. That's going from System 2 to System 1. And when you learn a skill, at first you're hesitant and you have to think about it: when you learn to drive, you drive slowly, you look at everything, you pay attention. Then, when you're experienced, you can just react really quickly; you've basically gone from model predictive control to having trained your own policy network, if you will. In the process, you go from a deliberate, planned, conscious decision mechanism to a subconscious, automatic one. That's what acquiring expertise does, and that's how you go from this diagram to that diagram, where you have a policy that directly predicts the action without having to plan. Student: OK, got it, thanks.