All right, good morning, everyone. We're going to use a lot of the stuff we talked about in previous weeks, revisit some of the methods we covered (Alfredo will talk to you about this again tomorrow), and try to apply it to self-supervised learning. Self-supervised learning, as you've heard from what I've said in previous lectures, is one of the hottest topics today in machine learning and deep learning, for two reasons: first, for learning representations of images and text and things like that, but also for learning predictive models, which are useful for control applications like robotics, self-driving cars, things like that. Before we talk about this, I want to talk a little bit about GANs, generative adversarial networks, because I went really fast over them last time, so let me revisit them a little. You've probably heard about GANs, and when you hear about them, it's generally for image generation. There are two parts to a GAN: a GAN has two neural nets, a generator and what's called a discriminator, or sometimes a critic. And you can interpret this critic in terms of energy-based models. Basically, the role of the critic, when you give it an input, is to tell you whether this input looks good or not: whether it looks like something in the training set or whether it doesn't. So say you want to generate, I don't know, images of airplanes. You show it lots of examples of airplanes, and you train this energy-based model to give you a low energy when the input is an image of an airplane and a high energy when it's anything else, maybe not even a natural image. Obviously, that's a task for energy-based model training, where you push down on the energy of the stuff you want, images of airplanes. So you train this energy-based model. It doesn't have to be an encoder-decoder; it can be any kind of neural net. You train this neural net to give you a low energy when you show it the image of an airplane. Now, GANs are contrastive methods: GANs intelligently pick other points whose energy is going to get pushed up. This is symbolized by this little diagram here, this little animation you've seen before, where the blue dots are the data points, images of airplanes, and the green dots are intelligently selected points outside the manifold of data, whose energy we're going to push up through some loss function. The original GAN has a particular type of loss function, but you don't have to use that one; you can use any of the loss functions we've talked about for energy-based learning. For example, a hinge loss that takes the difference between the energy of a good guy and a bad guy. Or the square loss that says: push down on the square of the energy of the good guy, push up on the square of the energy of the bad guy with a hinge, with a threshold. Or it could be something else. What's special about GANs is the way they generate those green points: the green points are generated by a neural net. So you're going to train a neural net to generate those green points. The input to this neural net is going to be some random vector, drawn from a distribution you can easily sample from, say a uniform over a hypercube or a Gaussian, something like that. Generally, people use a Gaussian.
You decide on the dimension of that vector in advance: 100, 1000, whatever. Then you run that random vector through a neural net that is going to generate an image, if images are what you want to generate. So it's going to look like a convolutional net, except backwards, right? One of those networks that starts from the vector, turns it into a low-resolution image with lots of feature maps, and then, as you go towards the output, the spatial resolution increases through upsampling, which is the inverse operation of pooling and subsampling, while the number of planes is reduced progressively, so that you get to three planes at the output: RGB, or luminance-chrominance, or however you want to encode your image. That's going to be the structure of this network. There's been a lot of work on precisely what structure to give it, and GANs that work really well have pretty complicated structures, which I'm not going to go into. You're welcome to look them up; there was a bit of a watershed paper from NVIDIA a few years ago that was able to generate very high-resolution, very realistic-looking faces. If you look closely, you can tell that some details are missing, but they look amazingly good, and a lot of people are using this now for a lot of purposes, some of them good, some of them bad. So the big question now is: we know how to train the critic, the energy-based model that tells us whether an image is good or not. How do we train the generator? The strategy is to train the generator so that it generates points that are maximally confusing for the critic. As the energy surface takes its shape, the more green points you add, the more the energy at those green points is going to increase, and what you have to do is bring the green points closer to the data manifold so that locally the energy takes the right shape. That's how we're going to train this network: it will be trained so as to generate samples that the critic thinks are good images, that the critic gives low energy to. So basically, you draw a random sample, run it through the generator, it produces an image, you run that image through the critic, and the critic gives you an energy of some kind. You backpropagate the gradient of this energy all the way down through the critic, then through the generator, and with the gradients you get with respect to the parameters of the generator, you update the weights in such a way that the energy goes down. So the critic produces an energy, and you change the parameters of the generator so that the next time around, if it generates an image from the same random vector (which of course will never happen, but that's okay, let's say you keep the same random vector), the generator will produce an image that has lower energy according to the critic. Which means the green point is going to get closer to the manifold of data. So basically the generator chases the space of images for areas that the critic gives low energy to: it trains its parameters so that the green points are produced in areas where the energy is low. A minimal sketch of this two-network training loop follows.
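Here is a minimal sketch of that loop, assuming a `critic` network that maps an image to a scalar energy and a `generator` that maps a random vector to an image; the hinge-style pairing of good-guy and bad-guy energies is one of the loss choices mentioned above, and all names and hyperparameters here are hypothetical:

```python
import torch

# A minimal sketch of the two-network loop described above. `critic`
# maps an image to a scalar energy, `generator` maps a random vector z
# to an image; both are assumed to be ordinary nn.Module networks.

def gan_step(critic, generator, y_data, z_dim, opt_c, opt_g, margin=1.0):
    z = torch.randn(y_data.shape[0], z_dim)   # sample z from a Gaussian

    # Critic update: push down the energy of data ("good guys"), push up
    # the energy of generated samples ("bad guys"), hinge-style.
    y_fake = generator(z).detach()            # green points; no generator grad
    e_good = critic(y_data).mean()
    e_bad = critic(y_fake).mean()
    loss_c = torch.relu(margin + e_good - e_bad)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # Generator update: backpropagate the critic's energy through the
    # critic into the generator, so generated points chase low-energy areas.
    loss_g = critic(generator(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_c.item(), loss_g.item()
```

Note that the critic step treats the generated samples as constants (the `detach`), while the generator step backpropagates through the critic without updating it.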
So then, when we update the parameters of the critic, those updates are going to push that energy back up. So you really have two objective functions in this system, and that's what makes it interesting and complicated, and sometimes very frustrating. There's one loss function, minimized with respect to the parameters of the critic, which tends to push down on the energy of the good guys and push up on the energy of the bad guys. There's a second objective function, minimized with respect to the parameters of the generator, and that one wants the energy the critic assigns to the samples produced by the generator to be low. So you adapt the parameters of the generator so that whatever it produces gets a low energy. Those two things are incompatible with each other, right? They're competing with each other, which is why this is called an adversarial network, and generative because you generate samples. On the one hand, the critic is trying to make its output high for generated samples, and on the other, the generator is trying to fool it: it's trying to generate samples that it knows the critic will give low energy to. It's basically going for the jugular: it's trying to find areas of Y space that the critic gives low energy to, so that the energy there will get pushed up by the adaptation of the parameters of the critic. So there are two loss functions, and those two loss functions are competing with each other: if you minimize one, you push the other one up. What you have to find is what's called a Nash equilibrium. When you are optimizing two different objective functions with competing interests, if you want, that's called a game in the sense of game theory, and the solution to the problem is a Nash equilibrium, named after John Nash, who derived the concept. The idea is that, by trying to minimize the two functions, you can reach an equilibrium between the two, such that any change you make to the parameters will make both functions go up, or make one go down but the other go up more. And it turns out you can't really use plain gradient descent to find the Nash equilibrium between two functions: there's no proof that it will converge, and you can get into limit cycles, you can get into all kinds of trouble. There are a lot of technical papers, which I'm not going to go into, on what you do to prevent that from happening, to get some reasonable convergence, and in practice the algorithms people use are not always convergent. Sometimes they look like they converge and then they diverge if you train them for too long, so early stopping is really important for GANs sometimes. GANs are sometimes victims of mode collapse: if the green points get too close to the blue points, which happens when you train for a very long time, the energy function will want to have vertical walls as soon as you get outside the manifold of data. This is the same problem we've seen with maximum likelihood training: the energy function will want to be very, very steep, and because of the limitations of the neural net inside, it can't be infinitely steep, because it can't have infinite weights, and so it will basically collapse.
It will say: I can't do it, I'm going to give high energy to everything, or low energy to everything, or low energy to just a single point. When the system does that, it's called mode collapse; it basically causes the generator to always generate the same image and the critic to essentially give up. So there are complicated dynamics there, and there's a lot of work on this. Probably one of the most interesting pieces of work over the last few years is this idea called the Wasserstein GAN, from Arjovsky, Bottou and co-authors; this came out of NYU and Facebook, actually. Their idea involves complicated math, but what it comes down to is that you want to prevent the critic from having those vertical walls. So you limit the size of the weights inside the critic: if a weight goes above a certain size, you just clip it, bring it back to some reasonable value. That limits the steepness, the slope if you want, of the resulting energy function, and that regularizes the system in a good way. Since then there have been lots of papers on similar ideas, which I'm not going to go into. So this gives you the basic concepts of GANs. A lot of what makes GANs work is in the practice: there are a lot of tricks to make them work, and they're really finicky. But when they work, they're pretty amazing. They're pretty amazing for producing images, if you choose the architecture well. However, as a way of training an energy-based model to learn representations of data, they have not been very successful. A lot of people have tried to use GANs to learn representations of images, for example, and it basically doesn't work very well, so people have mostly given up on that, which is a bit disappointing. So on this slide, the big pink box is an energy-based model, which, in the context of GANs, is called either a discriminator or a critic. Here I've drawn it as an autoencoder, but it doesn't have to be; it can be anything that produces a single scalar value. Its role is to tell the difference between good samples coming from the data and bad samples produced by the generator, so it will be trained to give low energy to samples coming from the data and high energy to samples coming from the generator. The generator is a neural net, a kind of reverse convolutional net if you want to generate images, which takes a random vector z, sampled from a distribution whose negative log is something like R(z) here; let's say a Gaussian, so R(z) is quadratic. You sample z from a Gaussian, feed it to this neural net, the neural net produces an image, you run it through the critic, and the critic gives it an energy. You backpropagate the gradient through the critic all the way down to the generator, and you update the weights of the generator so that the energy goes down. So basically you're training the generator to produce those y-hats that the critic gives low energy to, and as the generator trains, those green points get closer and closer to the manifold of data, the blue points, essentially. [In response to a question:] Yes, you're right, y-hat is not observed, so it shouldn't be shaded in the diagram. It's generated by a deterministic function, but in a way it's not deterministic, because the input to this network is itself random. That's the basic idea.
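As an aside, the Wasserstein-style weight clipping mentioned above is a one-liner in practice; a minimal sketch (the `critic` name is an assumption, and 0.01 is the clip value used in the original paper):

```python
import torch

# After each critic update, clip every weight of the critic into a fixed
# box, so the energy surface it implements cannot develop arbitrarily
# steep (vertical-wall) slopes.
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-0.01, 0.01)
```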
Now, the thing is, as I said, there are a lot of details to this. If you really want to play with GANs, I recommend that you download some code someone has already figured out; a lot of the tricks will be implemented in it. But this is the basic principle, and you can do pretty amazing stuff with it if you are interested in generating images. As a pre-training method for learning representations of images, though, it's not a good technique. Any other burning questions? All right, so let's talk about self-supervised learning. We've talked about self-supervised learning a bit, right? In the context of joint embedding systems for learning representations of images, or in the context of denoising autoencoders for learning representations of text, those BERT-like systems. But there's a broader purpose to self-supervised learning. There's a main problem in AI, in machine learning in particular and deep learning particularly, which is that we don't want to have to rely on so many labeled samples to learn a particular task. This is important for a lot of applications: medical imaging, where you never have enough samples; training a system to generate speech in a rare language, or to translate a rare language into another rare language, or to do speech recognition with rare languages; recognizing rare objects in images, things like that. So that's one problem that needs to be solved, so that AI can be applied to the many applications where it's not economically feasible, or just not feasible, to collect a lot of labeled data. The second application is getting systems to learn models of the world by observation. There's a big mystery, which is: how is it that humans and animals learn so quickly, so efficiently? Most of us can learn to, say, drive a car with about 10 or 20 hours of practice, and after a few dozen more hours we can drive basically without even thinking about it; we can talk at the same time and listen to the radio or whatever. How is it that we can learn so quickly? We don't have any machine learning technique today that would allow a system to learn to drive a car in 20 hours of practice without causing any accidents. Even people who have attempted to use imitation learning, for example, to get a system to drive a car by observing how a human drives, find that even with tens of thousands of hours of training data you don't get to the level of reliability you would like, because there are a lot of rare cases you're simply never going to observe. So what is it that makes humans and animals so quick? A house cat can learn amazing feats of jumping on tables and things like that very quickly, in just a few weeks, and humans can learn these kinds of skills too. So why can we learn them so quickly? The answer, probably, is that we have very good models of the world. We know how the world works. We know that if we're driving a car next to a cliff and we turn the wheel to the right, the car will veer off towards the cliff, and if we don't correct it, it will run off the cliff, fall down to the bottom, and nothing good will come of it, certainly not for us. So we don't do it, because we have this predictive model that allows us to tell in advance what's going to happen in the world, either as a consequence of our actions or just because the world is being the world.
So we drive on the highway, and we see the car in front of us zigzagging, and we can infer that the driver of that car is fiddling with a smartphone or something like that, not paying attention, or drunk, so we can stay away from it. Being able to predict the trajectories of the cars around you is crucial to driving safely. But this applies to every situation in life. When you try to figure out how to build a widget, it doesn't matter what it is, something out of wood, say, you have to plan. You have to imagine, in your internal model of the world, what the object will look like once you cut the pieces of wood and assemble them in a particular way, how rigid it's going to be, whether it will fulfill its purpose, how good it's going to look, things like that. So you have this incredibly complex model in your head of how the world evolves. And that model also covers your model of other humans, which is probably the most complex thing to model, because humans are somewhat unpredictable. So how do we learn those models? Now, babies. This is a chart that was put together by a cognitive psychologist in Paris called Emmanuel Dupoux, who tried to figure out at what age babies learn basic concepts about the world, about intuitive physics, for example. He has a similar chart for linguistic abilities and communication, but this one is just about learning how the world works. Babies up to a few months old basically cannot act: they can't really affect the world in any way, other than getting their parents to feed them and take care of them. They can't grab things, they can't move things, other than a little with their arms. So they basically learn a lot of really basic concepts by observation, and those concepts are, in my opinion, the basis of common sense. Very early, they learn to track faces. In fact, some people say there's probably some hard-wired mechanism that causes babies to pay attention to moving objects, particularly if they look a bit like faces. Very early, they learn about object permanence. Perhaps that's hard-wired; we don't know, probably not. It has been shown to be hard-wired in certain animals. Object permanence is the idea that if an object is hidden behind another one, it still exists; it's not because it disappears from view that it's no longer there. Babies learn this; in fact, that's probably why things like peek-a-boo are funny to babies: you hide yourself and you disappear, and then you go, boo! And that's really funny, because all of a sudden you reappeared. And then there are basic notions like solidity and rigidity; biological motion, so distinguishing animate from inanimate objects; coming up with natural categories of objects without being told the name of anything; and stability and support, so basically determining whether an object is going to stay up or fall. Babies see a lot of examples of this. They don't do it themselves, because they can't grab things, but their parents can put things down, and sometimes the things fall or they settle into a stable state. And you can determine whether babies have integrated a concept by showing them an impossible scenario with a trick.
So, you would take an object like this, put it on the table, and the thing would stay up, in a situation where, by the age of six months or so, babies have learned that objects like that are supposed to fall. If the object doesn't fall and the baby looks at it very surprised, then you know that some event has violated the internal world model of that baby. So they measure the surprise, the level of surprise if you want, by how long the baby stares at the object, because babies will stare at anything that surprises them. With that kind of technique, you can determine that babies learn about gravity and inertia and things like that at around nine months. It takes a long time: before you learn, as a baby, the concept that an unsupported object is going to fall, it takes about nine months. And all of this is essentially learned just by observation. So the big question in my mind, probably the most important question for making real progress in AI, is: how do we get machines to do the same thing? Essentially, learn how the world works by watching video. And perhaps the amount of background knowledge we accumulate by learning to predict, or learning to discriminate what is possible from what is impossible, what is plausible from what is not plausible, is the underlying substrate of common sense. So common sense may be the accumulation of all this background knowledge in the form of models of the world. That's my own definition; there are a lot of people, including people at NYU, who will disagree with it. If you talk to Ernie Davis, for example, he doesn't think of common sense as being this; he actually works on intuitive physics with completely different approaches, and he's been working on it for many, many decades. So people disagree on this; you may believe me or not. But that's the kind of approach that a lot of people in deep-learning-based modern AI are following: this idea that we need to get machines to learn world models, and that this is how we will get machines that have some level of common sense. We'll get machines that know enough about intuitive physics and how the world works so that when we teach them to drive cars, they won't drive off a cliff, and they won't drive on the wrong side of the road when there are cars coming the other way, because bad things would happen; you don't need to be particularly perceptive to realize that. So that's a bit of the motivation for self-supervised learning. My friend Jitendra Malik, a professor at UC Berkeley, dug up this book by Kenneth Craik, a psychologist from the middle of the 20th century, who said, essentially, that common sense is not a collection of facts; it's basically a collection of models. And I really subscribe to this. So let's talk about self-supervised learning. With energy-based models, we've really already done the groundwork for self-supervised learning, because we said that energy-based models can tell the difference between what's plausible and what's not plausible. You have an X and you have a Y; say X is a video clip and Y is another video clip. Is Y a plausible continuation of X? You can train an energy-based model to tell you that, okay?
And internally, that model will have to capture the underlying nature of what a video can or cannot contain. We've gone through this idea, right? The general framework is: you're given a piece of data, and the machine pretends that a piece of that data is not known. Let's say part of the video clip is known and the rest is not observed yet. You run the machine to predict the part that is unknown, then you reveal that part, and you train the machine to refine itself so that it does a better job at prediction. But the main issue there is how to handle multi-modality in the prediction. In this little video clip of a girl with a birthday cake, who is probably about to blow out the candles, what is going to happen next? Is she going to move forward to blow out the candles? Is she going to move backwards because she doesn't know what to do with the candles, or maybe she's scared of the fire? Is she going to turn her head towards a parent nearby? You can't really predict exactly what's going to happen. There are a number of different scenarios, and for each of those scenarios there are many ways it could play out. So you cannot ask the system to make a single prediction; you have to make sure the system can make multiple predictions. And that's where energy-based models come in, because they can represent the constraints that the future should satisfy to be compatible with the present and the past, without necessarily making a single prediction. So, as I said before, there are going to be two uses for self-supervised learning and for the energy-based approach to it. The first is learning hierarchical representations of the world: self-supervised pre-training that precedes a supervised or reinforcement learning phase for a particular task. The self-supervised learning is there for the system to learn a good representation of the world before actually using that representation to learn a task. The second is learning predictive models of the world, so that you can predict in advance what's going to happen, perhaps as a consequence of your actions, so that you can plan. There's something in optimal control called model predictive control, where you have a model of the system you're trying to control, and you imagine in your head: if I take this action, the system will evolve in that particular way; and you plan, in your head, a sequence of actions that will drive the system towards a particular final state that minimizes some objective. This is what you do all the time: when you plan an action, you decompose a complex action into a sequence of simpler ones, and for each of those you have some specification of what to do at every time step. But the big problem we have to deal with is representing uncertainty, and we're going to use energy-based models for this. So let me go back to something I talked about last week, but not in enough detail for you to really understand it; again, Alfredo will cover some version of this tomorrow. It's sparse coding, or more exactly sparse modeling: essentially, using regularized latent-variable energy-based models, with a way to limit the capacity of the latent variable, as we talked about last week, using a sparsity penalty.
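As a reminder of the setup, in my transcription of the notation from the earlier energy-based lectures (Dec is the decoder, λ the regularization coefficient), the energy of a regularized latent-variable model is minimized over the latent variable, and sparse coding is the special case of a linear decoder with an L1 regularizer:

```latex
F(y) \;=\; \min_z \big[\, \lVert y - \mathrm{Dec}(z) \rVert^2 + \lambda R(z) \,\big],
\qquad \text{sparse coding: } \mathrm{Dec}(z) = Wz,\quad
R(z) = \lVert z \rVert_1 = \sum_i |z_i|
```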
Okay, so we saw last week that if our latent variable has too much information capacity, for example if it has the same dimension as y, then, taking the example on the right here, there's always going to be a value of z, when you minimize the energy with respect to z, that produces a y-bar exactly equal to y. That's assuming the decoder is not degenerate and that z can take at least as many different values as you have samples in your training set, let's say. And that's bad, because it means the system will eventually learn a flat energy surface: for any y, there will be a z that produces that y, so the energy surface will be flat over every value of y. That's not a good energy-based model, because it doesn't tell you the difference between good stuff and bad stuff. So we've talked about contrastive methods; the regularized latent-variable methods, instead, consist in limiting the information capacity of z, so that the volume of Y space that can take low energy is limited. We talked about this in the context of K-means, for example, which is a specific case where we discretize z: we force z to be discrete and take only k values, which means there are only k possible values for y-bar, which means there are only k points in the space of y that can have zero energy, and everything else has higher energy. We've talked about making z low-dimensional: if z has a much lower dimension than y, then there is only a subset of the space of y for which there will be a corresponding y-bar, and that also limits the volume of stuff that can take zero energy. What we're going to talk about today are regularized methods, where you put a regularizer on z, so that z pays a price for going outside of a particular, restricted domain. The particular case of sparse coding is when this regularizer is the L1 norm of z: the sum of the absolute values of the components of z. Let me explain to some extent how this works. What the L1 regularization wants to do is make the z vector sparse, which means having a small number of non-zero coefficients: it tries to make all the components of z zero except for a small number. And because the decoder is linear, let me use this: a particular data point y is reconstructed as the product of a matrix by a sparse vector. This sparse vector basically selects a few columns of the matrix when you multiply the matrix by the vector. If the vector has only, let's say, four non-zero components, then the four corresponding columns of W are selected, and the product is a linear combination of those four columns, where the coefficients are the values of the components of z. So when you train this system to minimize the squared reconstruction loss under the constraint that z should have a small number of non-zero values, what you get is a sparse representation of any training sample, which consists in reconstructing every training sample as a linear combination of a small number of columns of W. Here, the system has been trained on natural image patches, and each of those squares represents a column of W, rendered as an image. I can't remember the resolution of this; it's maybe 12 by 12, I think.
So the dimension of y is 144, and the dimension of z, I think, is 256, so z is higher-dimensional. These are the columns of W after training. It turns out that when you train the system on natural image patches, it learns that it can reconstruct any natural image patch as a linear combination of a small number of oriented edges, plus some gradients. The gradients capture the low frequencies in the patch, and the edges represent where things are happening in the patch. And this animation shows the learning taking place. So what is this learning algorithm? I talked about it too briefly last week. Our energy function is this: E(y, z) is the squared reconstruction error, where the decoder is linear, just a multiplication by a matrix, plus a regularizer, and that regularizer is some constant times the sum of the absolute values of the components of z, the L1 norm of z. But there is a very important detail here, which is that the columns of W have to be normalized: they're all normalized so that their squared norm is one. The sum of the squares of all the entries of a column of W is constrained to be one, or whatever constant, it doesn't matter which. So here's how you train the system. You pick a training sample y, and you perform inference, which consists in finding the optimal value of z, which we call z-check (check because we minimized the energy with respect to z): the value of z that minimizes E(y, z) for the y we've been given. Now, there is an algorithm for this that was proposed a decade and a half ago, called ISTA, the iterative shrinkage and thresholding algorithm, which we talked about briefly last time. ISTA is an iterative algorithm that starts with z equal to zero and repeatedly runs an update. The update rule has some sort of step size; and note that this is an update rule for inference, not learning. We're not learning parameters; we're just updating z to find the z that minimizes the energy. But we need a step size, because this is a gradient step on the quadratic term: if you differentiate the squared error with respect to z, because it's a square, you get the thing in the parentheses multiplied by W transpose, and that's this term here, the gradient of the quadratic term of the energy with respect to z. You give it a step size, update z with it, and you've taken a gradient step on that term of the energy. The next step is to apply the shrinkage function. The shrinkage function has a threshold, and you can think of it basically as a gradient step on the L1 term. So those are two sequential steps: take a gradient step with respect to z on the quadratic term, then take a gradient step with respect to z on the L1 term. The second one is really a sub-gradient step, because that function is not differentiable at zero. The shrinkage function looks like this, and the threshold in this case is alpha times eta: this guy here is the threshold. So you shrink all the components of z; this is applied to every component of z, like an activation function if you want. It shrinks every component of z towards zero by a constant, which is alpha eta, and if the magnitude of a component is already smaller than alpha eta, you just set it to zero, essentially. All right, that's how you find z-check.
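Putting those two steps together, here is a minimal NumPy sketch of ISTA under the assumptions above (linear decoder W, step size eta, sparsity coefficient alpha; the function names are mine, and the factor of 2 from the squared error is absorbed into eta):

```python
import numpy as np

def shrink(v, thresh):
    """Soft-thresholding: shrink every component toward zero by `thresh`;
    components whose magnitude is already below `thresh` become zero."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def ista(y, W, alpha, eta, n_steps=100):
    """Inference: find z_check minimizing ||y - W z||^2 + alpha * ||z||_1."""
    z = np.zeros(W.shape[1])
    for _ in range(n_steps):
        z = z + eta * W.T @ (y - W @ z)   # gradient step on the quadratic term
        z = shrink(z, alpha * eta)        # (sub)gradient step on the L1 term
    return z
```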
Once this algorithm has converged, you're left with z-check. Then what you do is update the decoder weights with a gradient step of the energy, at z-check, with respect to the weights. Computing the gradient of the energy with respect to the weights is easy, because it's just the gradient of the quadratic term. So you get this learning rule: take the W matrix and update it with some step size (I used the same symbol, but it should probably be a different step size from the inference one) times the outer product of the reconstruction error vector with the transpose of the z-check vector. That product obviously has the same dimensions as the W matrix, and it is simply the gradient of the squared error with respect to W, nothing more. Now, if you just do these steps and repeat, the system will collapse. What will happen is that the W matrix will grow indefinitely, so as to allow the z vector to shrink indefinitely to satisfy the L1 norm penalty. You have to prevent that degenerate solution, where z basically collapses to zero, from happening, and the way you do that is by normalizing the columns of W to a constant. That's where the normalization constraint comes from: its reason for being is to prevent this degenerate solution. This works really well in terms of learning those features. But one thing we talked about last week, and this is to introduce amortized inference and autoencoders, is that inference is expensive: running this ISTA algorithm on one image patch is fine, but if you need to do it every time you extract features from a large image, it may be expensive. So one idea, which we touched on briefly last week, is to train a network to predict what the optimal value of z will be once you run the optimization algorithm.
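A sketch of the resulting training loop, reusing the `ista` function above (again, all names and step sizes are hypothetical): alternate inference with one gradient step on W, then renormalize the columns of W:

```python
def dictionary_learning_step(y, W, alpha, eta_z, eta_w):
    """One training step: ISTA inference, decoder update, renormalization."""
    z = ista(y, W, alpha, eta_z)          # 1) find z_check for this sample
    # 2) gradient step on ||y - W z||^2 w.r.t. W (the 2 is absorbed into eta_w)
    W += eta_w * np.outer(y - W @ z, z)
    # 3) renormalize each column of W to unit norm; otherwise W grows without
    #    bound while z shrinks to zero to satisfy the L1 penalty (collapse)
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    return W
```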
So a very simple way of thinking about this is that you have this architecture, which you can think of as an energy-based model, where the energy function now has an extra term: this D of z and an encoding of y, where the encoder is just a neural net that you apply to y, and you compute the distance between the value of the latent variable and the output of the encoder. With this new energy, which has the extra term, you give the system a y and you find the z that minimizes it. So it's going to find a z that not only reconstructs the original image, and not only is sparse, satisfying the regularizer, but also is not too far from whatever the encoder predicted, which is what this extra term enforces. (It may have a coefficient in front, but I've wrapped that into the D function.) So you find this optimal z, z-check, and once you've found it, you take a gradient step to minimize the energy your system gives to the training sample with respect to the parameters of the decoder. That turns out to be very simple, because the gradient step with respect to the parameters of the decoder only depends on the value of z that you computed, z-check, and the only gradient you need is the gradient of the reconstruction cost, backpropagated through the decoder. So you compute the gradient of that cost with respect to the parameters of the decoder, you take one step, and that's how you update the decoder. To update the parameters of the encoder, the only thing that matters is the value of z-check and the value of the prediction cost D: if you vary the parameters of the encoder while holding z-check constant, the other terms aren't affected, so you can ignore them. So you compute the gradient of the D term with respect to the parameters of the encoder, and you update with that. Some people call this form of learning target prop, because by minimizing the overall energy with respect to the latent variable, you've computed a virtual target for the encoder: when you adapt the weights of the encoder so as to make this energy small, you're using z as a target. You're training the encoder to make its output the same as z; the z-check you obtained through inference has become a target value for your encoder. Instead of this, you could remove all of it and just connect the output of the encoder directly to the input of the decoder, and that would be a plain autoencoder; then backpropagation would just consist in backpropagating the gradient of the reconstruction cost C all the way through, with respect to the parameters of the encoder and the decoder. But that's not what we do: we compute the optimal value of this internal variable, and that gives us a target for the encoder, hence the name target prop. Now, you can prove, which I'm not going to do, that when z-check is very, very close to the output of the encoder, so close that the difference between them makes it possible to treat the decoder as locally linear, this target prop is equivalent to backprop. It's more expensive, so why would we want to do it? Because computing z-check may involve minimizing something complicated, as we've seen with the ISTA algorithm, and just doing backprop may not do the right thing here.
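In symbols (this is my reading of the slide, with C the reconstruction cost, R the sparsity regularizer, and D the prediction cost), the energy minimized during inference is something like:

```latex
E(y,z) \;=\; C\big(y,\,\mathrm{Dec}(z)\big) \;+\; \lambda R(z) \;+\; D\big(z,\,\mathrm{Enc}(y)\big),
\qquad \check{z} \;=\; \arg\min_z E(y,z)
```

The decoder is then updated through the C term evaluated at ž, and the encoder through the D term, with ž acting as its target.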
You can think of ISTA as some sort of recurrent net with a funny architecture: you take y, you multiply it by W transpose, and you can pre-compute that because it's constant; then the iteration is that you take z, multiply it by W and then W transpose, compute the difference between that and W-transpose-y, multiply by some constant, update z with the result, and pass it through a shrinkage function. That's a type of recurrent architecture, right? With slightly different notation, here is ISTA rewritten: slightly different symbols, but it's really the same. By defining We as (1/L) times W transpose, where 1/L plays the same role as eta in my previous formula, and S as the identity minus eta W-transpose-W, you can rewrite the ISTA algorithm this way, and now it looks very much like a recurrent net with this kind of architecture: you take y, multiply it by We, then you shrink; you multiply by this S matrix, add the previous value, shrink again, et cetera. And this S matrix is this guy here. So now, instead, we learn the We and S matrices of this recurrent net, so that it gives us a good approximate solution, predicting z-check as well as possible. It's called LISTA, which stands for learned ISTA, essentially. So what does it give us? It gives us what essentially amounts to a trainable sparse encoder: we have an encoder consisting of a few unrolled iterations of this recurrent architecture, and we still have a way of computing the optimal z using ISTA. But we can play a trick, and that's the trick of amortized inference: once the encoder is well trained, its output will be very close to the optimal value z-check. So before we run ISTA to compute z-check, we take the output of the encoder, z-bar, and initialize z with it; that gives us a pretty good guess as to the optimal value of z, and then we only need to run a few iterations of ISTA to get z-check. You can view this encoder as a way of accelerating inference. It's called amortized inference because you pay a price in training the encoder, but once you've paid that price, inference is fast, because you mostly just run through the encoder. Here's a little sketch of what such a learned encoder might look like.
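A minimal PyTorch sketch of a LISTA-style encoder, under the assumptions above (a few unrolled ISTA-like iterations with learned matrices We and S; all sizes, names, and hyperparameters here are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LISTA(nn.Module):
    """Sketch of a learned-ISTA encoder: unrolled ISTA-like iterations
    with learned matrices, trained to predict the optimal code z_check."""
    def __init__(self, y_dim, z_dim, n_iter=3, thresh=0.1):
        super().__init__()
        self.We = nn.Linear(y_dim, z_dim, bias=False)  # plays eta * W^T
        self.S = nn.Linear(z_dim, z_dim, bias=False)   # plays I - eta W^T W
        self.n_iter = n_iter
        self.thresh = thresh                           # shrinkage threshold

    def forward(self, y):
        b = self.We(y)                    # constant across iterations
        z = F.softshrink(b, self.thresh)  # softshrink = the shrinkage function
        for _ in range(self.n_iter):
            z = F.softshrink(b + self.S(z), self.thresh)
        return z  # approximate z_check; can be refined with a few ISTA steps
```

Training it amounts to regressing its output to the ž found by full ISTA, for example with `F.mse_loss(lista(y), z_check)`.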
The reason why I'm going through this again is because we can use it to pre-train the features of a convolutional net: we can build a sparse autoencoder, make it convolutional, and train it to learn features that may, hopefully, be good for what we want to do. The convolutional version of sparse coding, which I also mentioned last week, is the same as the regular version, except we replace the product of a column of W by a component of z with the convolution of a convolution kernel with a feature map. So z is now a 3D tensor instead of a vector, W is a stack of convolution kernels, and we convolve each convolution kernel with each feature map and sum up the results, and that gives us our reconstruction. In the regular version, each of these guys is a component of z, a scalar, and each of these guys is a column of W with the same dimension as y. And we still have the L1 regularization. If we use amortized inference for this, so that we don't just have a decoder but also an encoder to predict the optimal value of z, the encoder can be a convolutional layer with some nonlinearity, followed by a very simple layer; so it's much simpler, in fact, than the recurrent architecture I talked about earlier, basically just a two-layer network with two layers of convolution. And you get things like this. These are the decoder filters, the convolution kernels used to reconstruct a patch from the z: this is with 32 kernels, this with 64 kernels, this with 16 kernels, 8 kernels, 4 kernels, 2 kernels, 1 kernel. So you get really nice-looking features, very similar to what you would get from, say, training a convolutional net on lots of images, like ImageNet; this is grayscale, no color here. And these are the filters you get in the first layer if you use a single convolutional layer with a very simple scaling layer after it; the encoder filters end up looking very much like the decoder filters. So you can think of this as a feature extractor, as pre-training the convolutional layer of a convolutional net as an autoencoder. So the model we have now is this thing: we have a Y, we run it through an encoder, which produces our prediction for z-check; we have our z variable; the z variable goes into the decoder; and we measure the reconstruction error, but we also measure the prediction error, the distance from z to whatever the encoder predicts; and then z goes into a regularizer. The regularizer here is a little different from the one I showed you earlier; I'll come back to that in a second. So basically, that's an autoencoder. We can train this autoencoder on a bunch of images, and if we make everything convolutional, on a bunch of large images, and what you get in the end is a feature extractor that we can use as the first, pre-trained layer of a convolutional net. Now, here's a second way of doing this, which does not use target prop, just regular backprop. We start with an input; I'm calling it x here because this is going to be partly supervised, semi-supervised let's say. I run it through an encoding matrix, the first layer, which could be a convolution, and then through a few steps of this LISTA idea: a recurrent net that shrinks (or applies a ReLU, in this case, but it doesn't matter), then multiplies by a matrix, then adds the input multiplied by the encoding matrix, and then shrinks again; and you iterate this multiple times, say three or four times. Then we're going to do three things simultaneously. First, we're going to say that whatever comes out of this should be sparse, so we measure its L1 norm and penalize it. Second, we're going to have a decoding matrix, and we want to be able to reconstruct x from the code; this forces the code to contain all the necessary information about x, so it will be an encoding of x, a feature representation. So this path here is an autoencoder, and if you join this path with that one, you have a regularizer that makes the autoencoder's representation sparse. And the last piece here is a classifier: a linear layer that goes into a softmax, where y is a label for classifying the image. So imagine that you actually have labels for some of the images, maybe not all of them, maybe just a subset: you can backpropagate all of those losses through this system. You have to normalize Wd, because otherwise you get the collapse problem I was telling you about earlier, so you normalize the columns of Wd; but you backpropagate through the whole thing to train S, the encoding matrix, Wd, and Wc, and in the end what you get is basically a sparse autoencoder that learns a representation that actually kind of works for your classification problem. A sketch of the combined objective follows.
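A sketch of the three simultaneous objectives just described: sparsity on the code, reconstruction of x, and classification on the labeled subset. The `encoder` could be, for example, the LISTA module above; all names, shapes, and coefficients here are assumptions:

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(x, y_label, encoder, Wd, classifier,
                         sparsity=0.1, recon_w=1.0):
    """x: (batch, x_dim); y_label: (batch,) integer labels or None;
    Wd: (x_dim, z_dim) decoding matrix; classifier: linear layer."""
    z = encoder(x)                                   # sparse code for x
    loss = sparsity * z.abs().mean()                 # 1) L1 sparsity penalty
    loss = loss + recon_w * F.mse_loss(z @ Wd.T, x)  # 2) reconstruct x
    if y_label is not None:                          # 3) labels, subset only
        loss = loss + F.cross_entropy(classifier(z), y_label)
    return loss

# After each update, renormalize the columns of Wd to prevent collapse:
# with torch.no_grad():
#     Wd /= Wd.norm(dim=0, keepdim=True)
```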
So, you're going to be given a project to work on, and the project is going to be about, basically, self-supervised learning: you'll be given a fairly large number of example images without labels and a small number of images with labels, and you can imagine using something like this. That's just a suggestion; there are many techniques you could use, but that would be one possibility. Another possibility would be what I talked about just before, where you train each layer as a sparse autoencoder and stack layers on top of each other: train the first layer as an autoencoder, then generate outputs from the encoder and use those as a training set for a second sparse autoencoder layer, then fix that and train a third layer. So you could pre-train a convolutional net architecture layer by layer this way. Instead of sparse autoencoders, you could use, say, denoising autoencoders, training each layer as a denoising autoencoder, layer by layer; or you could use variational autoencoders, and we'll talk about those in a moment. Now we're going to talk about what's called group sparsity. Again, this is a refinement of this idea of a sparse autoencoder with amortized inference or target prop, and the idea of group sparsity is really interesting. These ideas were proposed about 20 years ago by Hyvärinen and Hoyer in the context of ICA, independent component analysis; by Simon Osindero when he was a student with Geoff Hinton, about 15 years ago; and in some more recent work about 10 years ago by Koray Kavukcuoglu, who was a student of mine at the time, Karol Gregor, who was a postdoc with me, and Julien Mairal, who was at the time a PhD student in France with Jean Ponce, who is also at NYU, and Francis Bach, I believe. So the idea of group sparsity is this. Instead of having a sparsity criterion that is just the L1 norm of z, the sum of the absolute values of the components of z, what you do is compute the sum of the squares of the components of z within a group: you take a subset of the z vector, compute the sum of the squares of those components, and then take the square root. That is the L2 norm (not the squared L2 norm) of the sub-vector composed of that group of components of z. Then you take another subset of components of z and again compute its L2 norm, and so on, one norm per group, as written out below.
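In symbols (my notation: the groups, or pools, are index sets P_k, possibly overlapping), the group-sparsity regularizer is:

```latex
R(z) \;=\; \sum_k \sqrt{\sum_{i \in P_k} z_i^{\,2}}
\;=\; \sum_k \big\lVert z_{P_k} \big\rVert_2
```

So it's effectively an L1 penalty across pools of L2 norms within pools.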
So what you get is a bunch of numbers, one per group, and the regularizer is simply the sum of those terms. A group is actually called a pool, because, as we're going to see, this is very much like pooling: you pool the activities of a subset of the components of the z vector with L2 pooling, the square root of the sum of squares; you do this for all the pools; and you sum up the pool outputs, each multiplied by some coefficient, and that's our regularizer. This is called group sparsity, because what this regularizer wants to do is turn off the maximum number of pools, but within a pool it doesn't care how many components are on. So what does that buy us? First of all, imagine that this is a convolutional layer with a nonlinearity; so z, because this is convolutional, is now a three-dimensional tensor where each slice is a feature map. And imagine that the pools are local blocks within the feature maps. Then the operation that computes those pool outputs is exactly L2 pooling; it's completely equivalent to using an L2 pooling layer in a convolutional net. And that's interesting, because now, when you train a system like this completely unsupervised, as an autoencoder, what you're training is an encoder that is a convolutional layer with a nonlinearity, feeding into an L2 pooling layer, and what you get out is a sparse representation of the input, and it's convolutional. So this is basically how to train the first stage of a convolutional net completely unsupervised: by forcing the representation to be sparse, you avoid catastrophic collapse of the energy function, and you train it to produce good features, pooling included, since you force the output of the pooling to be sparse. But there's another way you can use this kind of sparsity. Let's say we don't work with a convolutional layer; we work with just image patches, so this is not convolutional, it's fully connected, and now z is a vector, and I just arbitrarily divide z into pools. But what I'm going to do is make those pools overlap, and in fact I'm going to play a trick: I'm going to interpret z as a 2D array. There's no inherent structure to z, it's a vector, but I'm going to pretend it's a 2D array, so I rearrange z into a 2D array, make the pools local in 2D but overlapping, and train the system as a sparse autoencoder with those overlapping pools grouping the units. Now, because of the sparsity penalty, what the system will want to do is have the smallest number of pools activated at any one time, so it's going to group features that are similar, that are likely to be simultaneously activated, within a pool. But since each pool overlaps with its neighboring pools, the features of those neighboring pools will be similar too; so if I organize the pools in a 2D manner, I'm going to get continuously evolving features, and this explains the old diagram I was showing you earlier. Now, these are different ways of doing this; this one actually doesn't have a decoder, it's an encoder-only system which prevents collapse in other ways. This one is a sparse autoencoder, non-convolutional again, where I pretended the z's were organized in 2D, and the pools are 6 by 6 and they overlap by 3.
And the result is that, because the smallest number of pools have to be on at any one time, the system organizes the features into this nice, continuously evolving set of features, where you have high-frequency edges here and low-frequency edges there, and the orientation changes continuously as you rotate around this point here. This one is Julien Mairal's version, which is decoder-only: there is no amortized inference, no encoder, it's just sparse coding, and you get a similar type of pattern. So that would be a way of pre-training a layer of a convolutional net completely unsupervised, in an energy-based manner, with a regularized latent-variable model where the latent variable is sparse; the regularization is group L1, or I should say group L2, actually. This is an idea you can use for your project. Now, people got really interested in this, and the reason we worked on it 10 years ago, is that it looks a lot like what you observe in biology. If you poke electrodes into the visual cortex of many animals, not all of them, but many, you'll see this kind of pattern: the type of stimulus to which a particular neuron responds is generally very similar to what the neighboring neurons respond to, and as you move in a given direction, the angle to which a neuron is sensitive slowly rotates; they're organized in these sort of weird pinwheel patterns in the brain. Those patterns naturally pop up when you train these group-sparsity systems, which is kind of cool. This is another example, a different model whose details I'm not going to go into: this one is locally connected but not convolutional, so no shared weights, and you get those same patterns that you find in biology. This is a color coding of the orientation selectivity of neurons in the brain, where the stars represent the centers of the pinwheels, so the orientation continuously evolves as you rotate around one of them; and this is what the algorithm I just talked about produces, which is very similar. So I've given you a few ideas here for how to pre-train the features of a feature extractor completely unsupervised, with regularized latent-variable models, in this case regularized autoencoders. Here is another idea, one that lets you train from video, or from multiple instances of the same image if you want, which you can generate through distortions, for example. It's called regularization through temporal consistency, and there are lots of ideas going back decades along those lines, using linear models; the one I'm going to describe is a particular special case, a particular way of doing it. Some of those papers are relatively recent, five years old or so; this one came out of my lab, actually. This is Olivier Hénaff, he's at DeepMind now; he did this for his PhD work at NYU, in the Center for Neural Science. Here's the idea. Imagine that you have frames from a video, or different distortions of the same image: for example, you take an image, you translate and rotate it a little bit, then you translate it a little more and rotate it a little more. You run this image through an autoencoder, an encoder and a decoder, and you might make the internal representation the same size as the input, or bigger, or smaller; at this point it doesn't really matter. You do the same for the next image, and for the various frames in your data. Because this is an autoencoder that has to reconstruct its input, you're pretty much guaranteed that H is going to represent everything there is to represent about Y; it will contain a complete representation of Y, if you want. Then what you're going to do is train a function which, from the representations of the first two images, predicts the representation of the third image. And that's a regularization. It says: I don't care what representation you use, but whatever representation you use, it should be predictable; I should be able to predict the representation of the transformed image from the representations of the previous two. You can think of this as a form of regularization that will tend to identify what can change in an image and what cannot. In particular, that prediction function G leaves a piece of H unchanged: you tell the system that G is going to take maybe half of the H vector and allow it to change over time, but the other half is forced to stay the same as whatever was there for the previous frames, maybe the average of the two, or something like that. Then, when you train the system, what it will try to do is separate H into two parts: one part that doesn't change when you modify the image, and one part that does. The part that doesn't change is the content of the image, what objects are in it; the part that changes is where they are exactly, what the instantiation parameters are. So it would be a way of factorizing the internal representation into a piece that is invariant and a piece that is equivariant to changes of the input. Here's a little sketch of what that training criterion might look like.
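A sketch of the regularization-through-temporal-consistency idea, under stated assumptions: three frames (or three distortions of an image), hypothetical `encoder`, `decoder`, and `predictor` networks, and an even split of H into invariant and equivariant halves. The exact criterion here is my reconstruction, not the paper's:

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(y1, y2, y3, encoder, decoder, predictor):
    """y1..y3: (batch, y_dim) successive frames or distortions."""
    h1, h2, h3 = encoder(y1), encoder(y2), encoder(y3)

    # Every representation must reconstruct its input (prevents collapse).
    recon = sum(F.mse_loss(decoder(h), y)
                for h, y in ((h1, y1), (h2, y2), (h3, y3)))

    # First half of H is forced to stay the same across frames (invariant).
    d = h1.shape[1] // 2
    invariance = F.mse_loss(h3[:, :d], (h1[:, :d] + h2[:, :d]) / 2)

    # The predictor must predict the third representation from the first two.
    pred = F.mse_loss(predictor(torch.cat([h1, h2], dim=1)), h3)

    return recon + invariance + pred
```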
Because this is an autoencoder that needs to reconstruct the input, you're pretty much guaranteed that H is going to represent everything there is to represent about Y; it's going to contain a complete representation of Y, if you want. So you do this for the various frames in your data, and then you're going to train a function which, from the representations of those two images, is going to predict the representation of the third image. And that's a regularization. It says: I don't care what representation you use, but whatever representation you use, it should be predictable. I should be able to predict the representation of the transformed image from the representations of the previous two images. You can think of this as a form of regularization, in particular one that will tend to identify what can change in an image and what cannot change. In particular, that function leaves a piece of H unchanged: you tell the system that the G function is going to take maybe half of the H vector and allow it to change over time, but the other half is forced to be the same. So I'm going to force the second half of that H to be the same as whatever was here, maybe the average of the two, or something. Then, when you train the system, what it will try to do is separate H into two parts: one part that doesn't change when you modify the image, and one part that does. The part that doesn't change is the content of the image, what objects are in it; the part that changes is where they are exactly, what the instantiation parameters are. So this is a way of factorizing the internal representation into a piece that is invariant and a piece that is equivariant to changes of the input. That's another idea, another trick you might want to use to train features. You can think of this as an energy-based model, but there's no collapse, because you force all the representations to reconstruct the input, so they can't throw away information. These are figures from the paper that show that it works.
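Here is a minimal sketch of this temporal-consistency setup, under simplifying assumptions of my own (three frames, the invariant half matched to the average of the previous codes, squared-error losses throughout; none of these choices are claimed to match Hénaff's actual model):

```python
import torch
import torch.nn as nn

class TemporalConsistencyAE(nn.Module):
    def __init__(self, dim=784, hdim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hdim), nn.ReLU(), nn.Linear(hdim, hdim))
        self.dec = nn.Sequential(nn.Linear(hdim, hdim), nn.ReLU(), nn.Linear(hdim, dim))
        # G predicts the *changing* half of h3 from the two previous codes.
        self.g = nn.Linear(2 * hdim, hdim // 2)

    def forward(self, y1, y2, y3):
        h1, h2, h3 = self.enc(y1), self.enc(y2), self.enc(y3)
        # Reconstruction terms guarantee each h keeps full information about its y.
        rec = sum(((self.dec(h) - y) ** 2).mean() for h, y in [(h1, y1), (h2, y2), (h3, y3)])
        inv3, eqv3 = h3.chunk(2, dim=-1)
        # Invariant half: forced to match (here, the average of) the previous frames.
        inv_target = (h1.chunk(2, -1)[0] + h2.chunk(2, -1)[0]) / 2
        inv_loss = ((inv3 - inv_target) ** 2).mean()
        # Equivariant half: must be predictable from the previous two codes.
        pred_loss = ((self.g(torch.cat([h1, h2], -1)) - eqv3) ** 2).mean()
        return rec + inv_loss + pred_loss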
Okay, now let's talk about something a little tougher: variational autoencoders. We'll come back to this tomorrow. There are three ways to explain variational autoencoders. One is completely intuitive, and with the concepts we've seen in energy-based models you should be able to understand it, but it doesn't really tell you how to translate this into formulas; it does, but not completely. The second is the original description of variational autoencoders, based on probability theory and variational approximation; it's really tough to understand, or at least to build an intuition for what it actually does, just from looking at the math. And there's a third way, which I'll explain today, which is an energy-based view of essentially amortized inference for a decoder-only system. The math is not that easy, but there is a very interesting trick in it, the variational approximation trick, which is why I really wanted to tell you about it. The slides are in a different order, but I'm going to start with the intuitive explanation, and I'll try to connect it with the less intuitive one.

So a variational autoencoder is very much like the type of target-prop-style regularized autoencoder I told you about before, with a twist. And the twist is: we're not going to regularize the latent variable. Well, we are, but it's just an L2, so it doesn't do much for you. What I'm going to do is use marginalization over Z instead of minimization over Z. Remember, when we have an energy-based model, we can do inference by finding the Z-check, or one of the Z-checks, that minimizes the energy for a given Y, or we can choose to marginalize over Z: basically compute the free energy, which is minus one over beta, log, sum over all Z's, of e to the minus beta times the energy for that Z. That's the marginalization formula. What it tells us is that the energy of a particular Y should combine evidence from multiple values of Z: not just the value of Z that has the minimum energy, but other values of Z that have similarly small energy, and they should all contribute to lowering the overall free energy of that data point. If, according to the model, there are lots of different values of Z that could produce the same Y, you want that Y to have a lower free energy than if only one value of Z could produce it. And the way we make that free energy small is that we compute the log of the sum of the exponentials of the negative energies for all the Z's that can contribute to that Y.

The nice thing about doing marginalization is that it alleviates the need for regularization: if you have a latent-variable model and you marginalize over the latent variable, you don't need a regularizer on it to make the thing work. You always need some regularizer so that the Z distribution is well normalized, something like an L2, which basically says: use whatever Z's you want, just don't make them too long. You can think of this as a Gaussian prior on the distribution of Z.
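To make the contrast concrete, here is a tiny numerical sketch of minimization versus marginalization over z (entirely illustrative: a made-up one-dimensional quadratic energy, with a brute-force grid standing in for the integral):

```python
import torch

def energy(y, z):
    # Toy energy: reconstruction cost for a made-up "decoder" dec(z) = 2*z.
    return (y - 2 * z) ** 2

y = torch.tensor(1.0)
z_grid = torch.linspace(-3, 3, 1001)   # brute-force grid over a 1-D latent
E = energy(y, z_grid)
beta = 1.0

# Inference by minimization: energy of the single best z.
E_min = E.min()

# Inference by marginalization: free energy, a soft minimum over all z,
# F = -(1/beta) * log sum_z exp(-beta * E(y, z))  (up to a grid-spacing constant).
F = -(1.0 / beta) * torch.logsumexp(-beta * E, dim=0)

# F comes out below E_min: many z's with similarly low energy combine their evidence.
print(E_min.item(), F.item())
```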
So here's the intuitive idea of the variational autoencoder. You take a Y, run it through an encoder, and you get a prediction for Z. Then, instead of optimizing Z, instead of finding the Z that minimizes the energy, which is what we would do in the regular amortized-inference setup, we're going to sample a random value of Z according to the distribution given to us by this energy term. An energy can be turned into a probability distribution through the Gibbs formula, right? E to the minus the energy, divided by the integral of that. So when you have a quadratic term here, quadratic as a function of Z, the corresponding distribution is a Gaussian, whose mean is Z bar, the output of the encoder. So: run Y through the encoder, you get Z bar; you have a quadratic cost; and you sample Z from the distribution that corresponds to the normalized negative exponential of this term, which is a Gaussian centered at Z bar, whatever the output of the encoder is. Basically, you compute some random Gaussian noise and add it to that mean, and that's your Z. So on average, the code for Y is a fuzzy ball corresponding to all the different noisy versions of that Z, centered around Z bar. You can take this other term into account as well, which says: I don't want Z to be too long; it limits the amplitude. So this would be a sum of two Gaussian energy terms, and when you take the exponential you get a product of Gaussians, which is a Gaussian; essentially, the effect of this term is to shrink the mean of the Gaussian towards zero a little bit.

Now take that Z, run it through the decoder, you get a reconstruction; backpropagate the gradient through the decoder to get the gradient of the reconstruction error with respect to the weights of the decoder, and update the weights, no problem. Then backpropagate the gradient to the encoder, compute the gradient of the cost with respect to the parameters of the encoder, and make an update. So in the backprop phase, this looks just like an autoencoder.

So M in this formula would be the equivalent of K in that formula; I didn't want to write the entire expression here. Here I wrote a simple form of Gaussian, basically a circular Gaussian, where you compute the squared Euclidean distance between Z bar and Z and multiply it by a coefficient. When you normalize this Gaussian, when you take the exponential of minus this term and normalize, this coefficient goes into the normalization constant, which would be something like the square root of 2 pi over K, or something like that. A more general form for a Gaussian is something like this; actually, you don't need K here, you just need M, because you can fold the K into the M. It's basically Z minus Z bar, transposed, times a matrix M, times Z minus Z bar, where Z bar is the encoder applied to Y. Here M plays the role of the inverse covariance matrix of the Gaussian, and the normalization term that pops up when you normalize is one over the square root of the product of 2 pi and the determinant of the covariance; the square root includes the determinant. That's how the multivariate Gaussian formula works. This is the general form; here I just wrote a simple form where the variance is fixed, one over K, and I'll stick to an even simpler version where K equals one in what follows.
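A cleaned-up reconstruction of the Gaussian formulas being described, taking beta equal to 1 and writing d for the dimension of z (my notation, not the slide's):

```latex
% Simple "circular" form: quadratic energy in z around the encoder output
E_q(z) = k\,\lVert z - \bar z \rVert^2,\quad \bar z = \mathrm{Enc}(y)
\;\Longrightarrow\;
q(z\mid y) = \left(\tfrac{k}{\pi}\right)^{d/2} e^{-k\lVert z-\bar z\rVert^2},
\;\text{a Gaussian with mean } \bar z \text{ and variance } \tfrac{1}{2k}.

% General form: a full matrix M in place of the scalar k
E_q(z) = (z-\bar z)^\top M\, (z-\bar z)
\;\Longrightarrow\;
q(z\mid y) = \sqrt{\tfrac{\det M}{\pi^{d}}}\; e^{-(z-\bar z)^\top M (z-\bar z)},
\;\text{with covariance } \tfrac{1}{2}M^{-1}.
```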
So we have a bunch of training samples; we run them through our encoder, and it produces points, the Z bars. Then, by sampling from the Gaussian distribution centered on each Z bar, we turn each of those points into a fuzzy ball: our Z's can be any point inside those fuzzy balls, with a probability that depends on the distance to the center. Now here is the problem: those two spheres intersect a lot, right? So it's possible that we run this training sample through the autoencoder and, because of the noise, it produces something close to the other sample's code; and similarly, when we add noise to that one, we may get a vector close to this one. So when we run those Z's through the decoder, the decoder can't tell whether we came from this sample or that sample, and the reconstruction error will be large. So if the system figures out the right thing to do to minimize the reconstruction error, the effect of adding noise to the Z's is that all of those spheres fly away from each other, because that's the best way to minimize the confusion, to minimize the effect of the noise. You make the weights of the encoder very large, so the Z bar vectors are very long, and whatever noise you add to the Z bars doesn't matter, because Z bar is already so large. Then when you run the Z through the decoder, it can properly reconstruct the training sample. But that's not a good solution: it just makes the weights of the encoder large; it doesn't find the structure of the manifold of data.

So what are we going to do? We're going to use this L2 term, remember, this green L2 term. It says: these spheres can't really go too far from the center; they're held with a spring. Intuitively, a spring is the physical, mechanical equivalent of an L2 regularizer: the force a spring exerts is proportional to the difference between its length and its rest length, which in this case we assume to be zero, and in the same way the gradient of an L2 term is proportional to the distance. So it's completely equivalent; the potential energy of a spring is an L2. So this term pulls the spheres towards the center and forces them to be as close to each other as possible. But if two training samples that are very different get too close to each other, they start being confused by the decoder, which bumps up the reconstruction error, which the system is trying to minimize. So there is this force that repels the spheres from each other, and at the same time a force that pushes them to be as close to each other as possible, and possibly interpenetrate. What the system will end up doing is let the spheres interpenetrate when the two samples are very similar to each other, because if the samples are similar, the reconstruction error produced by confusing the two will not be that large. And that will cause the system to find the underlying manifold of the data, if you want.

So that's the intuitive explanation. Now, there is another ingredient in the full variational autoencoder, and that ingredient is the fact that the size of the spheres in each dimension can change. There is basically a set of parameters, one per dimension of the sphere, which are like the inverse variances of the corresponding Gaussian distribution in each dimension. Those are allowed to change, and there is a cost that tries to make them as close to one as possible. So the spheres really want to be spheres of radius one, but they are allowed to get a little bigger or smaller in certain dimensions; they are a little elastic, if you want, through a particular term in the cost function. This is usually derived from probabilistic arguments, but I'm going to derive it from energy-based arguments.
So here is the basic idea, and this explains the trick of the variational approximation, I hope. Our system, ignore the Q for now, is a decoder-only system with a latent variable Z, and what we are going to do is marginalized inference: the energy for a given point Y is not going to be the minimum of the energy with respect to Z, but the log-sum-exp of the energy over Z. So it's minus 1 over beta, log, sum over all values of Z, of e to the minus beta E of (y, z), where E of (y, z) is just the output of this energy term: the cost function C applied to Y and to the output of the decoder, which takes Z as input. So that's the marginalization.

Now here is the problem: we can't compute this. What we would want to do is use this free energy, averaged over a training set, as our objective function for learning. That's the classic thing to do, right? Minimize the average energy of your training samples; in this case, minimize the average free energy of the training samples. As we've seen, if we do this by minimizing over the latent variable without a regularizer on it, it leads to a collapse; when we marginalize, it doesn't necessarily lead to a collapse. So this would be a perfectly good objective function to minimize, if we're lucky. But here is the problem: this integral is intractable. What does intractable mean? It means we cannot efficiently compute it. Certainly, if E is some complex function of Z, which it will be because we run Z through a neural net, we will not have an analytical solution for this integral; we can't write a formula for it. We could numerically integrate it: make a grid of all the values of Z, go through every grid point, compute the exponential of minus the energy at each point, sum them up, and multiply by the cell size of the grid. That's a numerical way of computing an integral. But imagine Z has, I don't know, 100 dimensions, and the number of grid points along each dimension is 10: the total number of grid points is 10 to the 100. It's completely impractical; we could not evaluate this integral anywhere close to anything useful.

Entire books, and generations of applied mathematicians and physicists, have been devoted to this problem, because it pops up everywhere in statistical physics, nuclear physics, all kinds of fields: this is the problem of computing the log partition function, and people have invented a lot of really smart techniques for it, including something called variational approximation. So here is the idea of the variational approximation. You take what's inside the integral, and you pre-multiply it by a probability distribution over Z that you're just going to propose; it doesn't matter what it is at the moment. And you also divide by it. So going from here to here, I haven't done anything: I've multiplied and divided by Q of Z given Y. It's important that Q of Z given Y be a normalized distribution over Z, and it may depend on Y. So I haven't changed anything in the formula. And I'm going to express Q of Z given Y as a Gibbs distribution of some energy term Q of (y, z): I'm just going to declare that Q of Z given Y equals e to the minus beta Q of (y, z), divided by a normalization term, so that this is a normalized density, or a discrete distribution if Z is discrete, but then it's easy.
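Written out, the identity being described is the following (my transcription into formulas; q is any normalized distribution over z, possibly depending on y, and Q is its energy):

```latex
F(y) = -\tfrac{1}{\beta}\log \int_z e^{-\beta E(y,z)}\,dz
     = -\tfrac{1}{\beta}\log \int_z q(z\mid y)\,\frac{e^{-\beta E(y,z)}}{q(z\mid y)}\,dz,
\qquad
q(z\mid y) = \frac{e^{-\beta Q(y,z)}}{\int_z e^{-\beta Q(y,z)}\,dz}.
```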
Next step: I have this formula, and I want to be able to estimate it, or approximate it, maybe. Here's what I'm going to do: I'm going to bound it, replace it by an upper bound, and for this I'm going to use something called Jensen's inequality, which I'll explain in a second. If you stare at this, what you observe is that this is actually some sort of expected value of this ratio, where the expectation is taken over that distribution: the integral of a distribution times some function is the expected value of that function over that distribution. So it's kind of like an average, if you want; this is the log of the average of this ratio over that distribution. Now, Jensen's inequality tells you that the minus log of the average of something is less than the average of the minus log. So I'm going to apply Jensen's inequality to this formula and say: this is less than the expected value, over Q, of the minus log of that ratio, which is what I've written here. I put the minus log inside and compute the average of that.

Why can I do this? Because of Jensen's inequality. I have a convex function, like minus log. So imagine this is minus log of x. I have some value here, and this value is the mean of some distribution, say a Gaussian; it doesn't matter what it is, actually, so let me just take a uniform distribution, it'll be simpler to explain. So I'm drawing samples from a uniform distribution, and I compute the average of that distribution, right here; running that through the minus log gives me a value. This is minus log of the average of the random variable x. Now take this value of x, run it through the function, I get a value; take this one, I get a value; and this one, and this one, and this one, etc., and now compute the average of those values. Because the function curls up (if the function were linear, the two things would be equal), those values are going to be biased towards higher values. Therefore the average of the blue dots is slightly higher than the red dot. So the average of minus log of x, or of any convex function of x for that matter, is larger than minus log of the average. That's Jensen's inequality: for any convex function, the average of the function over a distribution is larger than the function applied to the average. And that's what I've used here. So now I've bounded my free energy above with an approximate free energy, called the variational free energy.
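A quick numerical check of Jensen's inequality as stated (illustrative only; any positive random variable works here, with minus log as the convex function):

```python
import torch

torch.manual_seed(0)
x = torch.rand(100_000) + 0.5          # uniform samples on [0.5, 1.5]

lhs = -torch.log(x.mean())             # minus log of the average
rhs = (-torch.log(x)).mean()           # average of the minus log

# Jensen's inequality for the convex function -log: E[-log x] >= -log E[x]
print(f"-log E[x] = {lhs.item():.4f}, E[-log x] = {rhs.item():.4f}, "
      f"bound holds: {(rhs >= lhs).item()}")
```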
Now, the log of a ratio is the difference of the logs, right? So when I take minus one over beta, log of e to the minus beta E, the log cancels the exponential and the minus beta cancels the minus one over beta, and I'm left with just E of (y, z). So the first term is the integral over Z of Q of Z given Y, times E of (y, z); that's this term. Then the second term, the minus one over beta log of one over Q of Z given Y, is just plus one over beta log Q of Z given Y, because the log of one over something is the minus log of that something, so the minus goes away when I bring Q to the top. So I'm left with the sum over Z of Q of Z given Y, log Q of Z given Y, which is this.

Now, how do we interpret this? The first term is the average energy, where the expectation is taken over the latent variable Z. Computing this is really easy: I get a whole bunch of samples of Z from this distribution; this is a Monte Carlo approximation of the integral. I sample a bunch of Z's according to this distribution, and I can make that easy by choosing a Q that is easy to sample from, like a Gaussian, and I compute the average of the energy over all of those samples. It may not be efficient, but it's simple, so I can compute this term approximately. The second term is the negative entropy of the Q distribution: the integral over Z of Q of Z, log Q of Z, is the definition of the negative entropy of Q (there would be a minus sign for the entropy; since it's a plus, it's the negative entropy), and I divide by beta.

This formula is well known to physicists, in thermodynamics in particular. In thermodynamics there is a quantity called the free energy: the Helmholtz free energy is equal to the average energy minus the temperature times the entropy, and that's exactly the formula I'm writing here. This is for when a physical system is not in a definite state, but in a distribution over states: you don't know what state it's in, you just know the distribution, and you know the energy of each individual state it can be in. So there is the average energy, where you take the energy of each state and compute the average according to that distribution; and the free energy is the difference between that and the temperature, which is one over beta, times the entropy of the distribution. And what thermodynamics tells you is that a system tending towards equilibrium tends to minimize this free energy. That's exactly what we're going to do: we're going to find the parameters of E and Q such that this variational free energy is minimized. By minimizing this variational free energy, which is an upper bound on the real free energy, we hope to push down on the free energy we actually want to minimize, which is this one. And this will work to the extent that Q is a good approximation of the real distribution over Z.

Now, if you compute the gradient of F with respect to the parameters of the decoder, which enter into this formula, you get something easy. You say: okay, I've got a bunch of samples of Z; I may use one or multiple samples per sample Y, but since I'm doing stochastic gradient descent, if I don't get it this time around, I'll get a new Z next time around for the same sample. So the story is: you draw one sample from this distribution Q (say you make Q a simple distribution, a conditional Gaussian with a particular mean), and the gradient of this term is just the gradient of the energy with respect to the parameters of the decoder, for that particular value of Z. So you have a value of Z, drawn from that distribution Q; you run it through the decoder, get an error, backpropagate, and update the parameters of the decoder with that gradient. Very simple. And if you do this with multiple samples of Z for the same Y, on average you minimize this average, because the samples are drawn from that distribution: when you compute the average of all those terms, which is what your loss function is doing, you get an approximation of that term. So that's super simple.
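Here is a sketch of that Monte Carlo estimate under the Gaussian-q assumption (module names, shapes, and the fixed variance are my own choices; with a fixed variance the entropy term is a constant, so it is included only to match the formula):

```python
import torch

def variational_free_energy(y, encoder, decoder, n_samples=8, beta=1.0, sigma=1.0):
    """F_tilde(y) ~= E_q[E(y, z)] - (1/beta) * H[q], with q = N(enc(y), sigma^2 I)."""
    z_bar = encoder(y)                               # mean of q(z|y)
    d = z_bar.shape[-1]
    avg_energy = 0.0
    for _ in range(n_samples):
        z = z_bar + sigma * torch.randn_like(z_bar)  # sample z ~ q(z|y)
        y_hat = decoder(z)
        avg_energy = avg_energy + ((y_hat - y) ** 2).sum(-1).mean()
    avg_energy = avg_energy / n_samples              # Monte Carlo average energy
    # Entropy of a fixed-variance d-dimensional Gaussian: constant w.r.t. the
    # parameters, so it only shifts the loss; it never affects the gradients.
    entropy = 0.5 * d * (1 + torch.log(torch.tensor(2 * torch.pi * sigma**2)))
    return avg_energy - entropy / beta
```

Backpropagating through this loss gives exactly the decoder update he describes: the gradient of the energy, averaged over the sampled Z's.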
Now, the more complicated thing is the gradient of F tilde of Y with respect to the parameters of Q, and you can't do this without making some hypothesis about the form of Q. I mean, you can, but it becomes hairy. So we're going to parameterize Q as a Gibbs distribution of some energy term Q: e to the minus beta Q of (y, z), divided by a normalization constant. I'm not sure why the symbols changed here, but they changed. So I'm going to decide a priori, I'm going to write down my Q as this, and this is where variational autoencoders make their assumption: Q of (y, z) is the squared distance between Z and the output of the encoder, plus the square of Z multiplied by a coefficient gamma. If I plug this into that, what I get is a Gaussian distribution, because this is a quadratic function of Z, and taking the exponential of minus a quadratic gives you a Gaussian. What's more, the mean of that Gaussian is the encoder of Y, the encoder function applied to Y.

Now here's the thing: the denominator does not depend on the parameters of the encoder, because the only thing the parameters of the encoder do is shift the mean; they don't change the variance. The integral depends on the volume under the Gaussian, so it will not change if I change the parameters of the encoder: changing them moves the Gaussian, but doesn't change its integral. That means this normalization constant is constant with respect to the encoder parameters. Here I'm assuming a very simple conditional Gaussian whose variance is 1; if I put a K coefficient here, the variance would be 1 over K. So that's a simple thing, and you can represent it this way: run Y through the encoder, it produces a mean, then sample Z from a Gaussian whose mean is the output of the encoder, and you have this regularization term, the gamma times the square of Z.

That makes everything simple, because now I can rewrite my variational free energy. This term, the entropy, is constant, because my Gaussian has a constant, unit variance, and the entropy of a Gaussian is something like the log of the determinant of the covariance matrix, so it's constant; just ignore it, since for a cost function a constant term doesn't matter. So I'm left only with this term; that's what I need to minimize, what I need the gradient of. We know how to compute the gradient of this with respect to the parameters of the decoder, which are inside E. What we need is the gradient with respect to the parameters of Q, and the parameters of Q are just the parameters of the encoder that produce the mean of Q. And I'm missing a parenthesis here, but anyway. I'm not going to have time to go through the derivation; I should, but I won't. The bottom line is that the gradient, or at least an approximation of the gradient, of the variational free energy with respect to the parameters of the encoder is simply the gradient of the overall energy with respect to the parameters of the encoder, for the value of Z equal to the output of the encoder. So this is basically backprop through an autoencoder: take Y, run it through the encoder, you get a Z bar; run this Z bar through the decoder, you get an energy; backpropagate all the way through to the encoder, and that's the gradient.
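In symbols, the assumption and the resulting gradient are roughly the following (my reconstruction of what the slide presumably shows, including the shrinkage factor on the mean, which is my own computation; Enc and Dec are the encoder and decoder, theta their parameters):

```latex
Q(y,z) = \lVert z - \mathrm{Enc}(y)\rVert^2 + \gamma\,\lVert z\rVert^2
\;\Longrightarrow\;
q(z\mid y) \propto e^{-\beta Q(y,z)}
\ \text{is a Gaussian with mean } \tfrac{1}{1+\gamma}\,\mathrm{Enc}(y),

\nabla_{\theta_{\mathrm{enc}}}\,\tilde F(y) \;\approx\;
\nabla_{\theta_{\mathrm{enc}}}\, E\bigl(y,\bar z\bigr)\Big|_{\bar z = \mathrm{Enc}(y)},
\qquad E(y,z) = C\bigl(y,\ \mathrm{Dec}(z)\bigr).
```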
Now, this trick of parameterizing this function Q, this probability distribution, as something like this, where the parameters you can tweak are the parameters of some neural net or some other function: in the context of those models, it's called the reparameterization trick. It's not really a trick; it's just that you pick a particular form for your distribution, which happens to be the Gibbs distribution of an energy, and that energy happens to result from a latent variable and a deterministic function whose parameters you can compute. We can have another function, an encoder-prime if you want, that computes the covariance matrix, so that the calculation can be done. It's a little hairy, but it results in something, and what it results in is that cost on the radii of those spheres I was telling you about earlier. So if you have a non-constant covariance matrix, this term now matters, and you have to compute the gradient of that term with respect to the parameters. But you know what that term is: it's the log of the determinant of the covariance matrix, and computing the gradient of that with respect to the covariance coefficients turns out to be really easy. So things are pretty peachy, actually: this term wants to make the variance large, and that one wants to make it small, so the two terms counteract each other.

So there's complex math in the derivation of variational autoencoders, but one thing you should remember is this: whenever you have to compute something of this type, the log of a sum of exponentials, and you can't, because your E is something really complicated as a function of Z, the thing you integrate over, the way to make it tractable is this trick of multiplying and dividing by a Q distribution that you choose, presumably something simple, and then writing a bound on the free energy, the variational free energy. The bound is obtained by putting the minus log inside: instead of the minus log of an average, you have the average of a minus log, and that's allowed because of Jensen's inequality. When you work it out, you get a difference of two terms: the average energy according to that distribution, minus the entropy of the distribution divided by beta. The first term wants to make the average energy small, but the second wants to make the entropy large, because it's the minus entropy; the two terms balance each other. If you make the entropy zero, which basically says Q of Z given Y is a single point, then the best way to minimize the energy term is to put Q at the minimum of E: the only value of Z that has high probability, basically infinite density, is the Z value that minimizes E. That would be a good way to minimize the first term, but of course the second term is not happy, because it wants the entropy to be large and you just made it zero. By making beta very large, you make that term irrelevant: you let the system say, I don't care about maximizing the entropy of Q, I'd be very happy to make Q very, very narrow and center it around the minimum value of E, so that the overall thing is minimized. Then I'm left with just E of (y, z-check), where Z-check is the Z that minimizes E. So the infinite-beta limit of this entire formula gives you an F hat, or F tilde, that is actually equal to E of (y, z-check): you're back to the idea of just minimizing with respect to Z, and not marginalizing.

But at finite beta, this acts as a regularizer. It says: you're not allowed to make this distribution very, very narrow, because that makes this term unhappy; this term wants the distribution wide. But making the distribution wide means the average energy goes up: it's larger than if all the mass were at the minimum, because for other values of Z the energy is larger. So the system finds a tradeoff between having a wide distribution, so as to make the entropy large and that term small, and having a narrow distribution, so as to concentrate Q around the minimum of E. This acts on the capacity of Z: making the entropy of Z large means that Z carries a small amount of information; any particular value of Z, I mean the distribution of Z, carries a small amount of information. So that's a way of regularizing: the variational approximation has the effect of reducing the capacity of the latent variable, if Q has the right properties.

So you're left with this architecture in the end, and the algorithm is simple. Take a Y, run it through the encoder, and sample Z from a Gaussian distribution whose mean is the encoder output; in the complete VAE version, there is a second output of the encoder, the covariance coefficients of that Gaussian, and although I didn't describe them in this example, they come out of the math as well. Sample from that distribution, run the sample through the decoder, and you get an error; backpropagate that error and update the parameters of the decoder; then backpropagate all the way through the encoder and update the parameters of the encoder with that gradient. And that's the VAE.
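Putting the recipe together, here is a minimal sketch of one VAE training step in the energy-based spirit of this lecture, assuming a fixed unit variance, a reconstruction cost as the energy, and an L2 "spring" on the mean (all names and coefficients are mine, not from the slides):

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, dim=784, zdim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, zdim))
        self.dec = nn.Sequential(nn.Linear(zdim, 256), nn.ReLU(), nn.Linear(256, dim))

def train_step(model, y, opt, gamma=1e-2):
    z_bar = model.enc(y)                          # encoder predicts the mean of q(z|y)
    z = z_bar + torch.randn_like(z_bar)           # sample z ~ N(z_bar, I): the fuzzy ball
    y_hat = model.dec(z)
    recon = ((y_hat - y) ** 2).sum(-1).mean()     # energy: cost C(y, dec(z))
    spring = gamma * (z_bar ** 2).sum(-1).mean()  # L2 "spring" pulling codes inward
    loss = recon + spring
    opt.zero_grad()
    loss.backward()   # gradients flow through the decoder, then through the encoder
    opt.step()
    return loss.item()
```

In the complete version, the encoder would have a second head producing per-dimension variances, with the log-determinant term he mentions keeping them near one.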
The way it limits the information content of the latent variable is that it makes the codes noisy and keeps them contained within a particular volume; the amount of information in the latent variable is small because there are only so many noisy spheres you can pack into a bigger sphere of limited radius. Another interpretation is that you're also maximizing the entropy of that Q distribution, and that, again, limits its information capacity.

And I think we're around the time... Yes? No, we're done, so it's good. Alright, so we'll see each other tomorrow for target prop and for the variational autoencoder again, the more intuitive explanation, and then some PyTorch implementations in the notebooks. So see you tomorrow. I also encourage you to read the original paper on the variational autoencoder. It's written in a completely different language, the probabilistic framework, but your understanding of what I just explained will help you interpret the derivation; it will make it look a lot simpler than it otherwise would. Alright, bye-bye, everyone. Take care.