All right, you guys see the slides, I assume. Alfredo, I can see you. I can't see anyone else, so we should be good. Yes. All right. Yeah, that's all good. You can also make signs, I can see you. Okay, so we're going to continue talking about energy-based models, mostly in the context of self-supervised learning or unsupervised learning, continuing where we left off last time. So let me start with a little bit of a reminder of where we left off. We talked about self-supervised learning as the idea of basically trying to predict everything from everything else: pretending that a part of the input is not visible to the system while another part is visible, and training the system to predict the non-visible part from the visible part. And of course, it could be anything; it could be part of a video or it could be something else. There is a special case where we don't assume that anything is visible at any time, and so we're just asking the system to predict out of the blue, without any input. So we talked about the approach of energy-based models, which consists essentially in having an implicit function that captures the dependency between X and Y, or in the case where you don't have an X, the dependency between the various components of Y. The reason why we need an implicit function is that for a particular value of X there could be multiple values of Y that are possible. If we had a direct prediction from X to Y, we could only make one prediction; using an implicit function, we can make multiple predictions implicitly, by basically having a function that gives low energy to multiple values of Y for a given value of X. That's a little bit what's represented on the left: you can think of this as some sort of mountainous landscape where the data points are in the valleys, and everything else outside the manifold of data has higher energy.
So inference in this context proceeds by basically finding a Y, or a set of Ys, that minimize F of X, Y for a given X. So this is not learning yet. Learning consists in shaping F; here we're just talking about inference. So it's very important to be able to make the difference between the inference process, which is minimizing the energy function to find Y, and the learning process, which is minimizing a loss function — not the energy function — with respect to the parameters of the energy function. Okay, those are two different things. And in the unconditional case, you don't have an X, and so you're only capturing the mutual dependencies between the components of Y. We talked about latent variable models, and the reason is that they are a particular way of building the architecture of the energy function in such a way that it can have multiple Ys for a given X. So essentially a latent variable is an extra variable Z that nobody gives you the value of, but the first thing you do is minimize your energy function with respect to that Z, and that gives you an energy function that does not depend on Z anymore. Or, if you want to do inference with a latent variable model: I give you an X, and you find the combination of Y and Z that minimizes the energy, and then you give me Y. That's the inference process. There are two ways to do inference with respect to a variable that you don't observe. One is to just minimize over it, as I just indicated, and the other one is to marginalize over it if you're a probabilist. And there's a simple formula to go from one to the other, which is basically a log-sum-exponential over all possible values of Z — but this may be intractable, so we don't do this very often.
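These two ways of getting rid of Z — minimizing over it versus marginalizing with a log-sum-exp — can be sketched in a few lines. This is a minimal sketch, assuming a small discrete set of Z values so that the log-sum-exp is tractable; the temperature parameter `beta` is my notation, not something from the slides:

```python
import numpy as np

def free_energy(E_yz, beta=None):
    """Collapse a latent variable z out of an energy table.

    E_yz: energies E(y, z) for one fixed y, indexed by z.
    beta=None  -> minimize over z (the first option above).
    beta > 0   -> marginalize: -1/beta * log sum_z exp(-beta * E(y, z)),
                  which approaches the hard minimum as beta grows.
    """
    E_yz = np.asarray(E_yz, dtype=float)
    if beta is None:
        return E_yz.min()
    # log-sum-exp computed stably by shifting by the minimum energy
    m = E_yz.min()
    return m - np.log(np.exp(-beta * (E_yz - m)).sum()) / beta

energies = [3.0, 0.5, 2.0]           # E(y, z) for three possible values of z
print(free_energy(energies))          # 0.5: the hard minimum over z
print(free_energy(energies, 100.0))   # soft version, very close to 0.5
```

The soft version is always at or below the hard minimum, and the two coincide in the limit of large beta — which is why minimizing over Z can be seen as the zero-temperature limit of marginalizing.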
Okay, so training an energy-based model consists in parametrizing the energy function and collecting, of course, a bunch of training samples — a bunch of (X, Y) pairs in the conditional case, or just a bunch of Ys in the unconditional case — and then it consists in shaping the energy function so that you give low energy to good combinations of X and Y and high energy to bad combinations. So for a given observed X, you would try to make F of X, Y, for the Y that corresponds to that X, as low as possible, but then you also need to make the energy F of X, Y larger for all other possible values of Y. And it's probably a good idea to keep this energy function smooth if you are in a continuous space, if Y is a continuous variable, because it will make inference easier: subsequently, you'll be able to use gradient-descent-based methods to do inference, or maybe other methods. So there are two classes of learning algorithms, as we talked about last time. The first class is contrastive methods, which consist in basically pushing down on the energy of training samples. So you get a training sample (Xi, Yi), you plug it into the energy function, and then you tune the parameters of the energy function so that that energy goes down — and you can do this with backprop if your energy function is some sort of neural net; as long as it's a differentiable function, you can do that. But then what you have to do as well is pick other points that are outside the manifold of data and push their energy up, so that the energy takes the right shape. Okay, so those are contrastive methods. And then there are architectural methods, which basically consist in building F of X, Y in such a way that the volume of space that can take low energy is limited, perhaps minimized in some way.
And so if you push down on the energy of certain points, automatically the rest will go up, because the volume of stuff that can take low energy is limited. And I've made a list here. So this is an important slide, which you saw last time. There's a list of various methods that you may have heard of, some of which are contrastive, some of which are architectural. I must say that those two classes of methods are not mutually exclusive — you can very well use a combination of the two — but most methods only use one. So things like maximum likelihood, if you're a probabilist, consist in pushing down on the energy of data points and then pushing up everywhere else, for every other value of Y, in proportion to how low the energy is. So you push up harder if the energy is lower, so that in the end you get kind of the right shape. Maximum likelihood, incidentally, only cares about differences of energies; it doesn't care about absolute values of energies, which is an important point. And then there are other contrastive methods — contrastive divergence, metric learning, ratio matching, noise contrastive estimation, minimum probability flow, things like that — and generative adversarial networks, which are also based on the idea of pushing up on the energy of points outside the data manifold. Then there are similar methods like denoising autoencoders, which we will talk about in just a minute. And as we saw last time, they've been extremely successful in the context of natural language processing. Systems like BERT, for example, basically are denoising autoencoders of a particular kind. And then there are architectural methods, and last time we talked a little bit about PCA and K-Means, but we're going to talk about a few more today, particularly sparse coding and something called LISTA. And I'm not going to talk about the remaining ones.
So this is a rehash of something that we talked about last week, which is a very simple latent variable model for unsupervised learning: K-Means, which I'm sure you've all heard about. There, the energy function is simply the squared reconstruction error between the data vector and the product of a prototype matrix times a latent variable vector, and that latent variable vector is constrained to be a one-hot vector. So in other words, it selects one of the columns of W when you multiply W by it. And so what you get in the end, once you do the minimization with respect to Z — which means looking for which column of W is closest to Y — is the squared distance between the data vector and the column of W that is closest to it. So that's the energy function, and that's the inference algorithm: looking for the closest prototype. And the energy function, of course, is zero wherever there is a prototype and grows quadratically as you move away from the prototype, until you get closer to another prototype, in which case the energy again goes down as you approach the second prototype. So if you train K-Means, with K = 20 in that case, on a dataset where the training samples are picked around this little spiral shown at the bottom, you get those little dark areas, which indicate the minima of the energy function. There is a ridge in the middle, where the energy goes down on both sides. But here is a method that's become very popular over the last few months, and it's very recent — although the first papers on this actually go back a long time; there are some of my papers from the early 90s and from the mid 2000s. They were called Siamese networks, or metric learning, at the time. And the idea is to build a sort of energy-based model, if you want, by having two copies of the same network, or two different networks — but very often it's two copies of the same network.
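As a quick aside, the K-Means energy and inference procedure just described fit in a few lines. A minimal sketch, assuming a 2-D toy example with two prototypes (the matrix `W` and the test point here are made up for illustration):

```python
import numpy as np

def kmeans_energy(y, W):
    """Energy of the K-Means latent-variable model.

    E(y, z) = ||y - W z||^2 with z constrained to be one-hot, so
    minimizing over z just picks the nearest column of W.
    Returns (energy, index of the selected prototype).
    """
    dists = ((W - y[:, None]) ** 2).sum(axis=0)  # squared distance to each column
    k = int(dists.argmin())
    return float(dists[k]), k

W = np.array([[0.0, 1.0],
              [0.0, 1.0]])          # two prototypes: (0, 0) and (1, 1)
y = np.array([0.25, 0.25])
E, k = kmeans_energy(y, W)
print(k)   # 0: the (0, 0) prototype is closest
print(E)   # 0.125: the energy grows quadratically away from the prototype
```

Note that inference here is a discrete search over the columns of W, not a gradient descent — which is exactly why the one-hot constraint on Z limits the number of zero-energy points to K.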
And you feed X to the first network and Y to the second network. You have them compute some feature vectors on the output, H and H prime, and then you compare those two feature vectors with some way of computing a similarity or dissimilarity between vectors — it could be a dot product, it could be a cosine similarity, it could be something of that type. And the way you train the system is that a data point is basically a pair of X and Y. So you indicate the location of the data manifold to the system by basically telling it: here is a sample, call it X, and here is another sample Y that has the same content as X but is different. And of course you're not going to ask the system to give you that sample; you're going to generate those samples and train the system with them. So these are positive pairs — pairs that are compatible with each other, which is the whole idea of energy-based models. And a compatible pair, or positive pair if you want, consists of X being an image and Y being a transformation of this image that basically does not change its content. So it's still the same content of the image, if you want. So you want the representations extracted by those two networks to be very similar, because those images are similar. And that's exactly what you're going to do: you're going to feed those two images to those two networks, and you're going to have a loss function that says minimize the energy, which means minimize the distance, or dissimilarity measure, between H and H prime, between the outputs of the two networks. So that's the positive part — that's the way to lower the energy for training samples. Okay, and then you have to generate random negative samples. And the way you generate them is by basically picking again a sample for X and then picking another image that you know is different, that has nothing to do with X. It's incompatible with it, if you want. It's very different.
And now what you do is you feed those two images to those two networks and you try to push H and H prime away from each other. So basically you're trying to make the energy C of H and H prime large for those two samples. Okay, and the objective function here is going to take into account the energy for similar pairs and the energy for dissimilar pairs: it's going to push down on the energy for similar pairs and push up on the energy for dissimilar pairs. Okay, so people have used metric learning for various things for a long time — for image search, for example, for pitch recognition, things like that. But it's only in the last few months that there have been a couple of works showing that you can use those methods to learn good features for object recognition. And those are really the first papers that produce features, in an unsupervised or self-supervised way, that can rival the features obtained through supervised learning. So the three papers in question are PIRL, which means Pretext-Invariant Representation Learning, by Ishan Misra and Laurens van der Maaten at Facebook in New York; another one called MoCo, by Kaiming He and his collaborators at Facebook in Menlo Park; and the third one, which appeared more recently, is called SimCLR, by a group from Google — Chen et al., the last author being Geoffrey Hinton. There's been other work using those kinds of methods. I think there was a question, perhaps? I heard something. No, it wasn't a question. It was actually my phone waking up because I said Google and... Oh, I see. Okay. And slow features, something we'll talk about later, which is a little similar. Okay, so these are examples of results obtained with MoCo, and they essentially show that even with a very large model — basically a version of ResNet-50 — trained using this contrastive method...
You get sort of decent performance. This is, I believe, top-five accuracy on ImageNet. PIRL actually works quite a bit better than MoCo — this is top-one accuracy this time, with networks of various sizes. So here, there are several scenarios. The main scenario is: you take all of ImageNet, you take a sample from ImageNet, distort it, and that gives you a positive pair. Run it through your two networks and train the networks to produce similar outputs. Basically, the two networks are identical, actually, for both MoCo and PIRL — it's the same ResNet. And then take dissimilar pairs and push the outputs away from each other, using a particular cost function that we'll see in a minute. And then you have to do this many, many times, and you have to be smart about how you cache the negative samples, because most samples are already very different by the time they get to the output of the network — so you basically have to be smart about how you pick the good negatives. So the type of objective function that is used by PIRL is called noise contrastive estimation, and that goes back to previous papers; it's not their invention. The similarity metric is the cosine similarity between the outputs of the convolutional nets. And then what you compute is basically a softmax-like function: the exponential of the similarity metric for the similar pair, divided by the sum of the exponentiated similarity metrics for the similar pair and all the dissimilar pairs. So you have a batch with one similar pair and a bunch of dissimilar pairs, and you compute this kind of softmax thing. And if you minimize this cost function, it's going to push the similarity metric of the similar pair to be as large as possible and the cosine similarity of the dissimilar pairs to be basically as small as possible.
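The softmax-like objective just described can be sketched as follows. This is a minimal sketch of an NCE/InfoNCE-style loss, not the exact implementation from the PIRL or MoCo papers; the temperature `tau` and the random feature vectors are my assumptions:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def nce_loss(h, h_pos, h_negs, tau=0.1):
    """Softmax-style contrastive loss over one positive and several negatives.

    Negative log of: exp(sim(h, h_pos)/tau) divided by the sum of
    exponentiated similarities over the positive and all negatives.
    Minimizing it pushes the positive pair's cosine similarity up and
    the negative pairs' similarities down.
    """
    sims = np.array([cosine(h, h_pos)] + [cosine(h, n) for n in h_negs]) / tau
    sims -= sims.max()                       # shift for numerical stability
    return -sims[0] + np.log(np.exp(sims).sum())

rng = np.random.default_rng(0)
h = rng.normal(size=8)
# Well-aligned positive + random negatives -> low loss.
good = nce_loss(h, h + 0.01 * rng.normal(size=8),
                [rng.normal(size=8) for _ in range(5)])
# Random "positive" + near-copies as negatives -> high loss.
bad = nce_loss(h, rng.normal(size=8),
               [h + 0.01 * rng.normal(size=8) for _ in range(5)])
print(good < bad)  # True
```

Because the loss only involves cosine similarities, rescaling any of the vectors leaves it unchanged — which is exactly the normalization point discussed below.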
I had this question: why are we separately using an NCE loss function, whereas we could have probably directly computed the loss by taking the negative log of the probability that we get from H? Like, what benefit would NCE provide over directly taking the negative log of that probability? — Well, that's a good question. It's not entirely clear to me why. I think what happened there is that people tried lots and lots of different things, and this is what ended up working best. In the Hinton paper, there's kind of a similar thing, where they tried different types of objective functions and found that something like NCE actually works quite well. So it's an empirical question, and I don't have a good intuition for why you need this term in addition to the denominator. I hope this answers your question, although, sorry, I don't have a real answer for it. — Why do you use cosine similarity instead of L2 norm? — Instead of L2 norm? Okay, it's because you want to normalize. It's very easy to make two vectors similar by making them very short, or to make two vectors very dissimilar by making them very long, okay? So by doing cosine similarity, you're basically normalizing, right? You're computing a dot product, but you're normalizing this dot product, and so you make the measure independent of the length of the vectors. And so it forces the system to find a good solution to the problem without cheating, by just making the vectors either short or long. It also removes an instability that could be in the system. The design of those contrastive functions is actually quite a bit of a black art. Okay, so what they actually do in PIRL is that they don't use the output of the convnet directly for the objective function. They have different heads: the convnet has a set of heads, F and G, which are different for the two networks.
And that's what they use in the context of this contrastive learning. And then there is another head that they use for the ultimate task of classification. So those F and G functions — you can think of them as sort of extra layers on top of the network that are different for the two networks. All right, so these are the results produced by PIRL. This particular experiment is one in which you pre-train the system using PIRL on the ImageNet training set, and then what you do is you fine-tune the system using either 1% of the labeled samples or 10% of the labeled samples, and you measure the top-five or top-one accuracy. So this paper appeared in January on arXiv, and then just a few weeks ago this paper appeared, called SimCLR, by Chen et al., a team from Google. They have a very sophisticated corruption, or data augmentation, method to generate similar pairs, and they train for a very, very long time on a lot of TPUs, and they get really interestingly good results — much better than either PIRL or MoCo, using very large models. And they can reach more than 75% correct top-one on ImageNet by just pre-training in self-supervised fashion and then fine-tuning with only 1% of the samples. Yeah — so in fact the previous slide is a different scenario, where you only train a linear classifier on top of the network. This is the scenario where you train with either 1% or 10% of labeled samples, and you get 85% top-five with 1% of the labels, which is a pretty amazing result. To some extent, I think this shows the limits of contrastive methods, because the amount of computation and training that is required for this is absolutely gigantic. It's really enormous.
So here is a scenario where you just train a linear classifier on top: you freeze the features produced by the system that has been pre-trained using self-supervised learning, then you just train a linear classifier on top, and you measure the performance, either top-one or top-five, on the full ImageNet, compared to having trained supervised on the full ImageNet. And again, the numbers are really impressive — but again, I think it shows the limits of contrastive methods. Here is the main issue with contrastive methods: there are many, many, many locations in a high-dimensional space where you need to push up the energy, to make sure that it's actually higher everywhere than on the data manifold. And so as you increase the dimension of the representation, you need more and more negative samples to make sure that the energy is higher where it needs to be higher. Okay, so let's talk about another crop of contrastive methods, called denoising autoencoders, which have become really important over the last year and a half or so for natural language processing. So the idea of a denoising autoencoder is that you take a Y, and the way you generate X is by corrupting Y. This sounds a little bit like the opposite of what we were just doing with contrastive methods. But basically you take a clean image Y and you corrupt it in some way, by removing a piece of it, for example; or you take a piece of text and you remove some of the words, or you mask a piece of it. So a special case is the masked autoencoder, where the corruption consists in masking a subset of the input. And then you run this through an autoencoder, which is essentially an encoder — called a predictor here — a decoder, and perhaps a final layer that may have a softmax in the context of text, or nothing if it's images. And then you compare the predicted output Y bar with the observed data Y. So what's the principle of this? The principle is the following — and you can thank Alfredo for those beautiful pictures.
Basically — Yeah, we saw this in class last Tuesday. — That's right, so this is basically just a reminder. You take a data point, which is one of those pink points, right? And you corrupt it, so you get one of those brown points, and then you train the autoencoder to produce, from the brown point, the original pink point. What does that mean? That means that now the energy function, which is the reconstruction error, is going to be equal to the distance between the corrupted point and the original pink point — the squared Euclidean distance. So C of Y, Y bar, if your system is properly trained, is going to be the distance between the corrupted point X, the brown point, and the pink point Y you started from. So basically it trains the system to produce an energy function that grows quadratically as you move away from the data manifold, okay? And so it's an example of a contrastive method, because you push up on the energy of points that are outside the data manifold: essentially, you tell them, your energy should be the squared distance to the data manifold, or at least to the point you were corrupted from. But the problem with it is that, again, in a high-dimensional continuous space, there are many, many, many ways you can corrupt a piece of data, and it's not entirely clear that you're going to be able to shape the energy function the proper way by just pushing up on lots of different locations. It works in text, because text is discrete. It doesn't work so well in images. People have used this in the context of image inpainting, for example. So the corruption consists in masking a piece of the image, and then training a system to reconstruct it. And the reason why it doesn't work is because people tend to train the system without latent variables.
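The pink-point/brown-point picture above can be sketched end to end on a toy 2-D "manifold". This is a minimal sketch with a purely linear autoencoder and Gaussian corruption — my toy setup, not anything from the papers discussed; the point is only that the learned reconstruction energy is low on the manifold and grows as you move off it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data manifold: points on the line y2 = y1 in 2-D (the "pink points").
t = rng.uniform(-1, 1, 500)
Y = np.stack([t, t], axis=1)

W = rng.normal(scale=0.1, size=(2, 2))  # a linear "autoencoder", just a matrix
lr = 0.05
for _ in range(2000):
    X = Y + rng.normal(scale=0.3, size=Y.shape)  # corrupt: the "brown points"
    Y_bar = X @ W                                # reconstruct
    W -= lr * 2 * X.T @ (Y_bar - Y) / len(Y)     # gradient step on mean squared error

def energy(y):
    """Reconstruction energy ||y - reconstruction(y)||^2 after training."""
    return float(((y - y @ W) ** 2).sum())

e_on = energy(np.array([0.5, 0.5]))    # a point on the manifold
e_off = energy(np.array([0.5, -0.5]))  # same norm, but off the manifold
print(e_on < e_off)  # True: the energy grows away from the data manifold
```

A real denoising autoencoder would of course use a deep encoder/decoder rather than a single matrix, but the training signal — map corrupted points back to clean ones — is the same.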
And in my diagram here, there is a latent variable, but in fact, in versions of this that have been used in the context of images, there is no real latent variable, and it's very difficult for the system to just dream up a single solution to the inpainting problem here. It's a multimodal problem — the set of solutions is a manifold, probably not just a single point. There are many ways to complete the image here by filling in the masked part. And so without that latent variable, the system produces blurry predictions and doesn't learn particularly good features. — That multimodality is also the reason why we had that internal purple area in the spiral, because each of those points has two possible predictions, right? Between the two branches of the spiral. — Right, and this is the additional problem: if you're not careful, the points in the middle, which could be the result of corrupting a pink point on one side of the manifold or a pink point on the other side, those points right in the middle don't know where to go, because half the time they're trained to go to one part of the manifold and the other half of the time they're trained to go to the other part. And so that might create kind of flat spots in the energy function that are not good. There are ways to alleviate this, but they're not completely worked out, unless you use latent variable models. Okay, other contrastive methods — this is just in passing, for your own interest. There are things like contrastive divergence and others which I'm not going to talk about in depth. Contrastive divergence is a very simple idea: you pick a training sample, you lower the energy at that point, of course, and then, from that sample, using some sort of gradient-based process, you move down the energy surface with noise. So you start from the sample and figure out: how do I change my sample? How do I change my Y?
In such a way that my current energy-based model produces a lower energy than the one I just measured for that sample. Okay, so basically you're trying to find another point in input space that has lower energy than the training point you just fed it, okay? So you can think of this as kind of a smart way of corrupting a training sample — smart because you don't randomly corrupt it; you corrupt it by modifying it to find a point in space that your model already gives low energy to. So it would be a point that you would want to push up, because your model gives low energy to it and you don't want it to have low energy, so you push it up. And I'm going to... — Professor, have people tried contrastive methods with this image inpainting method, and how would one do that? Does that really work if you do that together? — So inpainting is a contrastive method, right? You take an image, you corrupt it by blocking some piece of it, and then you train a neural net, basically an autoencoder, to generate the full image, and then you compare this reconstruction of the full image with the original uncorrupted image, and that's your energy function, okay? So it is a contrastive method. — Right, so if we use the NCE loss with this inpainting loss, is that useful? — You can't really use the NCE loss, because NCE relies on the fact that you have a finite number of negative samples, okay? Here you artificially generate negative samples, and so it's really a completely different scenario. I don't think you could use something similar to NCE, or at least not in a meaningful way. Okay, so this is Y space, okay? Y1, Y2. And let's say your data manifold is something like this, but let's say your energy function currently is something like this. So here I'm drawing the region of low energy, and I'm drawing the contour lines of equal energy, okay? So the energy looks nice at the bottom left, right?
You have data points here that your model gives low energy to, but then your model is not good, because at the bottom right it gives low energy to regions that have no data, and at the top you have data points that your model gives high energy to, okay? So here is how contrastive divergence would work. You take a training sample, let's say this guy, and by gradient descent you go down the energy surface to a point that has low energy, okay? Now, this was a training sample Y; the one you obtain now is a contrastive sample Y bar. And what you do now is you change the parameters of your energy function so that you make the energy of Y smaller and the energy of Y bar larger, okay? Using some kind of loss function that pushes down on one and pushes up on the other. Which loss function you use is somewhat immaterial; you just need one that will do the right thing. Okay, so what I've described here is a deterministic version of contrastive divergence, but in fact contrastive divergence is a probabilistic version of this. What you do is this sort of gradient-based descent — this sort of search for a low-energy point — but you do it with some level of randomness, some noise in it. So one way to do this in a continuous space like this one is that you give a random kick: you think of your data point here as a sort of marble that is going to roll down the energy surface, you give it a random kick in some random direction, say this one, and then you let the system follow the gradient, and you stop when you're tired. You don't wait for it to go down all the way; you just stop when you're tired, and then there is a rule to select whether you keep the point or not. And that's your Y bar. — Why is the kick necessary? — Okay, so the kick is necessary so that you can go over energy barriers that would be between you and the lowest-energy areas, okay? That's why you need that kick.
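The noisy descent just described — start at a training point, kick it, roll downhill — can be sketched like this. A minimal sketch of the deterministic/Langevin-flavored procedure described above, not Hinton's exact Gibbs-sampling contrastive divergence for RBMs; the toy quadratic energy with a spurious minimum at `c` is made up:

```python
import numpy as np

def negative_sample(y, energy_grad, steps=20, step_size=0.1, noise=0.05, rng=None):
    """Contrastive-divergence-style negative sample: start at a training point
    and walk down the current energy surface with random kicks.

    energy_grad(y) must return dE/dy for the current model.
    """
    rng = rng or np.random.default_rng(0)
    y_bar = np.array(y, dtype=float)
    for _ in range(steps):
        # gradient step downhill plus a random kick to cross energy barriers
        y_bar = y_bar - step_size * energy_grad(y_bar) \
                + noise * rng.normal(size=y_bar.shape)
    return y_bar

# Toy energy E(y) = ||y - c||^2 with a spurious minimum at c = (2, 2).
c = np.array([2.0, 2.0])
grad = lambda y: 2 * (y - c)
y_train = np.array([0.0, 0.0])
y_neg = negative_sample(y_train, grad)
# The negative sample drifts toward the low-energy region the model
# (wrongly) likes; a training step would now push the energy up there.
print(np.linalg.norm(y_neg - c) < np.linalg.norm(y_train - c))  # True
```

In a real model, `energy_grad` would be computed by backprop through the energy network, and the parameter update would then lower the energy at `y_train` and raise it at `y_neg`.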
Now, if you have a Y space that is not continuous but discrete, you can still do this energy minimization by doing something called simulated annealing. So essentially, if Y is a discrete variable, you perturb it randomly. If the energy you get from this perturbation is lower, you keep it; if it's higher, you keep it with some probability. And then you keep doing this, and eventually the energy will go down. So this is a non-gradient-based optimization algorithm — a gradient-free optimization algorithm, if you want — which you're going to have to resort to when the space is discrete and you can't use gradient information. The technique I described earlier, of kicking a marble and simulating it rolling down the energy surface, is called Hamiltonian Monte Carlo, HMC, and you might see this in other contexts. So that's another way of generating negative samples. Yes, Hamiltonian Monte Carlo — some people call it hybrid Monte Carlo sometimes. So some of you may have heard of something called restricted Boltzmann machines. A restricted Boltzmann machine is an energy-based model in which the energy is very simple. It's written at the bottom here: the energy of Y and Z, where Y is basically an input data vector and Z is a sort of latent variable. The energy function is minus Z transpose W Y, where W is a matrix, not necessarily square, because Z and Y may have different dimensions. And generally Z and Y are both binary vectors — vectors whose components are binary variables. They were somewhat popular in the mid 2000s, but I'm not spending much time on them here because they've fallen out of favor a little bit; they're not that popular anymore. But just so that gives you some reference of what this means. There are refinements of contrastive divergence. One of them is called persistent contrastive divergence, and it consists in using a bunch of particles whose positions you remember.
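Backing up to simulated annealing for a moment, the accept/reject rule just described fits in a few lines. A minimal sketch on a made-up discrete problem — a binary vector whose energy is the number of bits set — with a cooling schedule of my choosing:

```python
import math
import random

def simulated_annealing(y, energy, perturb, steps=2000, t0=2.0):
    """Gradient-free energy minimization for a discrete variable y.

    perturb(y) proposes a random change. A move that lowers the energy is
    always kept; one that raises it by dE is kept with probability
    exp(-dE / T), and the temperature T is annealed toward zero.
    """
    e = energy(y)
    for i in range(steps):
        T = t0 * (1 - i / steps) + 1e-3  # linear cooling, never exactly zero
        y_new = perturb(y)
        e_new = energy(y_new)
        if e_new <= e or random.random() < math.exp(-(e_new - e) / T):
            y, e = y_new, e_new
    return y, e

random.seed(0)

def flip_one_bit(y):
    i = random.randrange(len(y))
    return y[:i] + (1 - y[i],) + y[i + 1:]

y0 = (1,) * 10  # start with all ten bits set: energy 10
y_min, e_min = simulated_annealing(y0, sum, flip_one_bit)
print(e_min)  # at or near 0, the global minimum
```

Early on, the high temperature lets the search accept uphill moves and escape local minima; by the end, it behaves almost greedily — the discrete analogue of the random kick in the continuous case.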
So they have sort of permanent, persistent positions, if you want. You throw a bunch of marbles into your energy landscape and you keep making them roll down, maybe with a little bit of noise, or kicks, and then you keep their positions. So you don't reset the positions of the marbles for new training samples; you just keep the marbles where they are. Eventually they'll find low-energy places in your energy surface, and those places will get pushed up, because that's what happens during training. But this doesn't scale very well, so things like RBMs become very, very expensive to train in high dimension. Okay. So now for regularized latent variable energy-based models, which are my current favorite type of model. So we talked about the idea of building a predictive model by having a latent variable, right? So you have the observed variable X; you run it through a predictor that extracts some representation of the observed variable, and that goes into a decoder that produces the prediction. But if you want your decoder to be able to make multiple predictions, then you also feed it a latent variable. And as you vary the value of this latent variable, the prediction will vary over a set — hopefully over the manifold, in the space of Y, of the Ys that are compatible with X. So for this architecture, the formula for the energy can be written as on the left here: C of Y and Y bar, where C is a cost function that compares its two arguments. So you compare Y, the data vector, with the result of applying the decoder to the output of the predictor that takes X into account — and the decoder also takes Z into account. So here's the problem with this. If Z is too powerful — in other words, if Z has too much capacity — then there is always going to be a Z that produces a Y bar exactly equal to Y. Remember, the inference algorithm here is that you are given an X and a Y, and then you find a Z that minimizes C of Y, Y bar, right?
That's how you do inference over the latent variable in an energy-based model, right? Given an X and a Y, I find a Z that minimizes the energy. So if Z, for example, has the same dimension as Y and the decoder is powerful enough to represent the identity function, then for any Y there's always going to be a Z that produces a Y bar exactly equal to Y, okay? And if the decoder is the identity function from Z to Y bar, ignoring H, then you just set Z equal to Y and the energy is zero. And that would be a terrible energy-based model, because it would not give high energy to stuff outside the manifold of data; it gives low energy to everything, okay? It gives zero energy to everything. So the way to prevent the system from giving low energy to points outside the manifold of data is to limit the information capacity of the latent variable Z. To be more precise, suppose Z can only take, let's say, 10 different values; you constrain Z to only take 10 possible values. Let's say you make Z a one-hot vector of dimension 10, like in K-means, okay? Then there are only going to be 10 points in Y space that have zero energy, because either Y is equal to one of the Y bars produced from one of those 10 Zs, or it's not. If it is, the energy is zero. If it's not, the energy is going to have to be larger than zero. In fact, it's going to grow quadratically as you move away from the nearest Y bar. And that's exactly the idea of K-means, okay? But what if you find other ways to limit the information content of Z? This seems like a small technical sub-problem, but in my opinion, the question of how you limit the information content of a latent variable in a model of this type is the most important question in AI today, okay? And I'm not kidding. I think the main problem we're facing is how to do self-supervised learning properly. And contrastive methods have shown their limits.
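To make the 10-value example concrete, here is a tiny sketch (with made-up numbers) of the free energy F(Y) = min over the 10 one-hot values of Z of the reconstruction energy, as in K-means:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 10))   # columns of W are 10 prototypes in a 2-D Y space

def free_energy(y, W):
    """F(y) = min_z ||y - W z||^2, where z ranges over the 10 one-hot vectors."""
    # with a one-hot z, the decoder output W z is just one column of W
    return min(np.sum((y - W[:, k]) ** 2) for k in range(W.shape[1]))

# exactly 10 points in Y space get zero energy: the prototypes themselves
energies_at_prototypes = [free_energy(W[:, k], W) for k in range(10)]
# moving a little away from a prototype, the energy becomes positive
e_near = free_energy(W[:, 0] + np.array([0.1, 0.0]), W)
```

Near a prototype the energy grows quadratically with the displacement, which is exactly the K-means picture.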
And so we have to find alternatives, and the alternatives are regularized latent-variable models. There might be other ideas that nobody has had so far, but these are the only two that I know of. And the main technical issue we need to solve is how to limit the information content of the latent variable, so that we limit the volume of Y space that can take low energy. And therefore we automatically make the energy higher outside the manifold of data, where we train the system to have low energy. So I'm gonna go through a few examples of systems that actually work like that, things that people have done for, you know, 20 years in some cases. And so that's the idea here, or one of the ideas: you add a regularizer to the energy, and this regularizer takes low values on a small part of the space of Z. And so the system will preferentially choose values of Z that are within this restricted set where R takes a small value. And if Z needs to go outside of that set to do a good reconstruction, you're paying a price for it in terms of energy, okay? So the volume of Z space that is determined by R basically limits the volume of Y space that can take low energy. And the trade-off is controlled by a coefficient lambda that you can adjust to make the volume of Y space that takes low energy as small as possible, or not that small. So here are a few examples of R of Z. Some of them are useful because they're differentiable with respect to Z, and some of them are not so useful because they're not differentiable, so you have to do a discrete search. One is the effective dimension of Z. So what you can do is decide a priori that Z has three dimensions, four dimensions, five dimensions, six dimensions.
You train your model for various dimensions of Z, and there is one dimension for which the prediction will be good while at the same time the dimension of Z is minimized, and what you will have found is basically the lowest embedding dimension of your data. So imagine, for example, that your dataset consists of lots and lots of pictures of someone making faces in front of a camera. We know that the effective dimension of the manifold of all the faces of a person is something like 60, at least less than a hundred, because it's bounded above by the number of muscles in your face. And so there has to be a Z of dimension 50 or 60 or something like that, such that when you run it through a convolutional net, you can generate all possible instances of the face of that person, okay? That's the face manifold for that person, if you want. So what you can do is this really super expensive method of trying all different dimensions of Z. One way to formulate this mathematically is to minimize the L0 norm of Z. It's actually a slightly different thing: you choose a Z that's relatively high-dimensional, but for any given sample you minimize the number of components of Z that are non-zero. Okay, that's called the L0 norm. It's just the count of the number of components that are non-zero. And it's very difficult to minimize that norm because it's not differentiable, it's very discrete. So what people use is a convex relaxation of that norm, the L1 norm. The L1 norm is the sum of the absolute values of the components of Z. And that's what you use for R of Z: the sum of the absolute values of the components of Z. When you add this to your energy function, what the system tries to do is find a Z that reconstructs the Y, because it needs to minimize C of Y and Y bar.
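As a tiny concrete illustration (the toy vector is my own): the L0 norm just counts non-zero components, while the L1 norm sums absolute values:

```python
import numpy as np

z = np.array([0.0, 0.7, 0.0, -0.3, 0.0])
l0 = np.count_nonzero(z)      # counts non-zeros -- not differentiable, hard to minimize
l1 = np.sum(np.abs(z))        # sum of absolute values -- the convex relaxation
```

Minimizing the L1 norm tends to push individual components exactly to zero, which is why it works as a surrogate for the L0 count.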
It also tries to minimize the number of its components that are non-zero, because that's the best way to minimize the L1 norm. Okay, and that's called sparse coding, and it works really well, and I'm gonna show you some examples of this. Before I go there, I just wanna mention, and we'll talk about this a little more, the idea that adding noise to Z will also limit the information content of Z. I'll come back to this in a minute. Okay, so here is the idea of sparse coding. Sparse coding is an unconditional version of energy-based models. So there's no X, there's only a Y and a Z. And the energy function is Y minus WZ, where W is a so-called dictionary matrix, very similar to the prototype matrix in K-means. Z is a vector; generally, the dimension of Z is larger than that of Y. And so you measure the squared Euclidean distance between Y and WZ. So basically your decoder here is linear, it's just a matrix. And then you add a term, lambda times the L1 norm of Z, which is represented by those two bars. And that's the energy function for sparse coding. Okay, and you can think of it as a special case of the architecture I showed previously, except it's not conditional, there's no X. Now what does this do? So our friend will tell you that the picture I'm showing here on the left is inappropriate, because it's actually generated with a slightly different model. But it's a good pictorial representation of what sparse coding attempts to do, which is to approximate the manifold of data by a piecewise linear approximation, essentially. So imagine that you have this W matrix, okay? And someone has given it to you or you've learned it in some way. Now, suppose you decide a priori that a certain number of components of Z are non-zero: most of the components of Z are zero, just a small number are non-zero. And you vary the values of those components within some range.
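The sparse-coding energy just described can be written down directly; the dictionary and the sample below are random placeholders of my own:

```python
import numpy as np

def energy(y, W, z, lam=0.1):
    """Sparse-coding energy: squared reconstruction error plus lam * L1 norm of z."""
    return np.sum((y - W @ z) ** 2) + lam * np.sum(np.abs(z))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))    # overcomplete dictionary: dim(z) > dim(y)
y = 0.5 * W[:, 3]               # a y that a single dictionary column explains
z = np.zeros(16)
z[3] = 0.5                      # sparse code: one non-zero component
```

For this y, the sparse code reconstructs perfectly, so the only energy left is the L1 penalty lam * 0.5.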
The set of vectors you're going to generate, the set of Y bars, are going to be the ones in the linear subspace spanned by the corresponding columns of the W matrix, okay? For every value of the Z coefficients that are non-zero, you basically compute a linear combination of the corresponding columns of W. And so you're basically moving along a low-dimensional linear subspace of Y space. So Y bar is going to be along a low-dimensional linear subspace, and the dimension of that subspace will be the number of non-zero components of Z, okay? So for one particular Y, when you find the Z that minimizes the energy, a number of components are going to be non-zero. And as you move Y slowly, those non-zero components are going to change value, but you're going to stay on the same linear subspace until Y changes too much, and then all of a sudden you need a different set of non-zero components of Z to do the best reconstruction. And now you're switching to a different plane, okay? Because a different set of Z components becomes non-zero. And so now you move Y again, and again the coefficients in Z keep changing values, except for the ones that are zero, which stay zero, and all of a sudden it switches again. It goes to another one. So it's well symbolized by the picture on the left, where you see that the manifold of data is approximated by a bunch of linear subspaces, in this case lines. The reason it's difficult to represent actual sparse coding in 2D is that it degenerates in 2D. So one question is how do we train a system like this? To train a system like this, our loss function is just going to be the average energy that our model gives to our training samples. So the loss function is just the average energy, basically the average F. And remember, F of Y is equal to the minimum over Z of E of Y and Z, okay?
So we're going to take the average of F over all our training samples and minimize that average with respect to the parameters of the model. And those parameters are the coefficients of the W matrix; again, it's called the dictionary matrix. So how do we do this? We take a sample Y. We find the Z that minimizes the energy, okay? The sum of the two terms that you see here. And then we take one step of gradient descent on W. So we compute the gradient of the energy with respect to W, which is very simple because it's a quadratic function of W, and we take one step of stochastic gradient, basically, right? And then we take the next Y and do it again: minimize with respect to Z, and then for that value of Z, compute the gradient with respect to W and take one step in the negative gradient direction. And you keep doing this. Now, if you just do this, it doesn't work. It doesn't work because W will keep getting bigger and bigger and Z will keep getting smaller and smaller, but the system will not actually solve the problem. So what you need to do is normalize the W matrix so that it cannot grow indefinitely while allowing Z to shrink correspondingly. And the way to do this is that after every update of the W matrix, you normalize the sum of the squares of the terms in each column of W, right? So normalize the columns of W after each update. That will prevent the terms in W from blowing up and the terms in Z from shrinking, and it will force the system to actually find a reasonable matrix W and not get away with just making Z smaller. Okay, so that's sparse coding. The learning algorithm for this was invented by two computational neuroscientists, Bruno Olshausen and David Field, in 1997. So that goes back a long time. Okay, so here's the problem with sparse coding: the inference algorithm is kind of expensive.
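The training step just described — take one gradient step on W for a sample, then renormalize W's columns — might look like this sketch. The Z here is assumed to have come from a separate inference step (e.g. energy minimization), which is omitted:

```python
import numpy as np

def normalize_columns(W):
    """Rescale each column of W to unit L2 norm, so W can't grow to let z shrink."""
    return W / np.linalg.norm(W, axis=0, keepdims=True)

def train_step(W, y, z, lr=0.01):
    """One SGD step on the dictionary for a single sample (z assumed pre-computed)."""
    # gradient of ||y - W z||^2 with respect to W is -2 (y - W z) z^T
    grad_W = -2.0 * np.outer(y - W @ z, z)
    W = W - lr * grad_W
    return normalize_columns(W)       # renormalize after every update

rng = np.random.default_rng(0)
W = normalize_columns(rng.normal(size=(8, 16)))
y, z = rng.normal(size=8), rng.normal(size=16)
W = train_step(W, y, z)
```

Without the renormalization, the same loop would let W blow up while Z shrinks, exactly the failure mode described above.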
What you have to do, for a given Y, is minimize the sum of those two terms, one of which is L2, the other L1. There's a very large number of papers in applied mathematics that explain how to do this efficiently. In particular, one algorithm to do so is called ISTA, which means iterative shrinkage-thresholding algorithm. And I'm gonna tell you what ISTA is in just a minute. It basically consists in alternating a minimization with respect to Z of the first term and then the second term, alternately. So here's an abstract form of the ISTA algorithm. There's a fast version of it called FISTA, and here it is at the bottom. Actually, I'm realizing that I'm missing the reference for the ISTA algorithm; it's not any of the references I'm showing here. Sure, so there is Teboulle, T-E-B-O-U-L-L-E. Anyway, so here is the algorithm. You start with Z equals zero and then you apply this iteration here, the second-to-last formula. The thing in the bracket is basically a gradient step on the squared reconstruction error. So if you compute the gradient of the squared reconstruction error and you do a gradient step, you basically get this formula, where one over L is the gradient step size, okay? So you basically update Z with the negative gradient of the squared reconstruction error. And then the next operation you do is a shrinkage operation. So you take every component of the resulting Z vector and you shrink all of them towards zero. If a component of Z is positive, you subtract a constant from it; if it's negative, you add the same constant to it. But if you get too close to zero, you just clip at zero, okay? So basically, it's a function that is flat around zero and then grows like the identity function above a certain threshold, and below a certain negative threshold, okay? It shrinks towards zero.
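Putting the two steps together — a gradient step on the squared reconstruction error, then the shrinkage — gives the following ISTA sketch. The step size and threshold follow the standard derivation for this form of the objective; the variable names and the toy problem are mine:

```python
import numpy as np

def shrink(v, thresh):
    """Soft-thresholding: flat around zero, identity-like beyond the threshold."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def ista(y, W, lam=0.1, n_iter=200):
    """ISTA for min_z ||y - W z||^2 + lam * ||z||_1."""
    # Lipschitz constant of the gradient of the quadratic term; 1/L sets the step
    L = 2.0 * np.linalg.norm(W, 2) ** 2
    z = np.zeros(W.shape[1])
    for _ in range(n_iter):
        # gradient step on the squared reconstruction error ...
        z = z - (2.0 / L) * (W.T @ (W @ z - y))
        # ... then shrink every component toward zero
        z = shrink(z, lam / L)
    return z

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
W /= np.linalg.norm(W, axis=0)       # unit-norm dictionary columns
y = 0.7 * W[:, 2]                    # y generated from one dictionary atom
z = ista(y, W)
```

Each iteration can only lower the energy (with this choice of step size), so the final code is at least as good as the all-zero initialization.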
If you keep iterating this algorithm for proper values of L and lambda, the Z vector will converge to the solution of the energy minimization problem, which is the minimum of this energy here, E of Y, Z, with respect to Z. Okay, keep this in mind. Now here is an issue: this algorithm is kind of expensive. If you want to run it over an image, or over all patches of an image or something like that, you're not gonna be able to do it in real time on large images. And so here is an idea, and the idea is to basically train a neural net to predict what the solution of the energy minimization problem is. Okay, so you see the diagram here on the right, where we train an encoder that takes the Y value. For now, you can ignore the piece that depends on X, right? You have X going through a predictor predicting H, and then H feeds into the encoder and the decoder. You can ignore this part for now, in the unconditional version. You just have Y that goes through an encoder, which produces a prediction for the optimal value of the Z variable, okay? Called Z bar. And then the Z variable itself goes into the decoder — it's being regularized as well — and produces a reconstruction Y bar. And what you do here is, again, find the Z value that minimizes the energy, and the energy is still the sum of those two terms, C of Y, Y bar, and R of Z. But then what we're gonna do is train the encoder to predict this optimal value of Z obtained through minimization. And this encoder is gonna be trained by minimizing this term, D of Z and Z bar. So basically it views Z as a target value, and you train it by backprop, by gradient descent, to make a prediction that's as close to Z as possible. Okay, that's one form of this idea.
Another form of this idea, slightly more sophisticated, is that when you're doing the minimization of the energy with respect to Z, you take into account the fact that you don't want Z to get too far away from Z bar. So your energy function now has three terms. It has the reconstruction error, it has the regularization, but it also has the difference between Z bar, the prediction from the encoder, and the current value of the Z variable. So the energy function now is written here: E of X, Y, Z is equal to the C function that compares Y and the output of the decoder applied to Z — this is the unconditional version here. Then you have a second term, this D function, that measures the distance between Z and the encoder applied to Y (there shouldn't be an X there). And then you also regularize Z, okay? So basically you're telling the system: find a value for the latent variable that reconstructs Y, that is sparse if R is an L1 norm, or doesn't have too much information, but also is not too far away from whatever the encoder predicted. And a specific idea there is called LISTA, which means learned ISTA, and it's to shape the architecture of the encoder so that it looks very much like the ISTA algorithm. So if we go back to the ISTA algorithm, the formula — the second-to-last formula here — looks like some vector update with some matrix. So it's like a linear stage of a neural net, if you want, followed by a nonlinearity that happens to be a shrinkage, which is sort of a double ramp if you want: one ramp going up, paired with another ramp going down. And so if you look at the diagram of this whole ISTA algorithm, it looks like the block diagram I've drawn on top. You start with Y, multiply it by some matrix, and then shrink the result.
That gives you the next Z; apply some other matrix to it, add it to the previous value of Z you had, shrink again, then multiply by the matrix again, add to the previous value, shrink again, et cetera. And so you have two matrices here, WE and S. And at the bottom, if you define WE as one over L times WD transpose, and you define S as the identity minus one over L times WD transpose WD, where WD is the decoding matrix, then this diagram basically implements ISTA, okay? So the idea that one of my former postdocs, Karol Gregor, had was to say: well, why don't we treat this as a recurrent neural net? And why don't we train those matrices WE and S so as to give us a good approximation of the optimal sparse code as quickly as possible, okay? So we're basically gonna build our encoder network with this architecture that is copied from ISTA, and we know for a fact that there is a solution where the system learns the values of WE and S that correspond to what they should be. But in fact, the system learns something else, okay? So this is another representation of it here at the bottom left. We have the shrinkage function and this S matrix, and you add Y multiplied by WE to the output of the S matrix, shrink again, et cetera, et cetera. So this is the recurrent net we're gonna train, with WE and S. The objective of this LISTA — can you repeat what the objective is? I think I missed the point. So, the objective of training this encoder, okay? The encoder in this diagram on the right here — its architecture is the one you see at the bottom left, okay? And the objective you're training it with is the average of D of Z and Z bar, okay? So the procedure is as indicated — there's no X here, but if there is an X, it doesn't make much difference. Take a Y; for this particular Y, find the value of Z that minimizes the energy, and the energy is the sum of three terms: C of Y, Y bar, R of Z, and D of Z, Z bar, okay?
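Here is a sketch of the encoder architecture just described. With WE and S set to the ISTA-derived values, the recurrent net reproduces ISTA exactly; in LISTA these two matrices would instead be learned by backprop. The factor of 2 below comes from writing the reconstruction term as a plain squared norm rather than half of it:

```python
import numpy as np

def shrink(v, thresh):
    # soft-thresholding nonlinearity: flat around zero, identity-like beyond
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def lista_encode(y, We, S, thresh, n_iter=3):
    """Recurrent encoder with the ISTA block structure: z <- shrink(We y + S z)."""
    z = shrink(We @ y, thresh)
    for _ in range(n_iter - 1):
        z = shrink(We @ y + S @ z, thresh)
    return z

rng = np.random.default_rng(0)
Wd = rng.normal(size=(8, 16))
Wd /= np.linalg.norm(Wd, axis=0)           # unit-norm dictionary columns
L = 2.0 * np.linalg.norm(Wd, 2) ** 2       # Lipschitz constant of the gradient
We = (2.0 / L) * Wd.T                      # the ISTA-derived encoder matrix
S = np.eye(16) - (2.0 / L) * Wd.T @ Wd     # identity minus the scaled Gram matrix
y = 0.7 * Wd[:, 2]
z3 = lista_encode(y, We, S, thresh=0.1 / L)  # code after only 3 iterations
```

Training We and S freely (rather than fixing them as above) is what lets LISTA beat plain ISTA at a fixed, small iteration budget.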
So find the Z that reconstructs, has minimal capacity, but also is not too far away from the output of the encoder, okay? Once you have the Z, compute the gradient of the energy with respect to the weights of the decoder, the encoder, and the predictor if you have one, by backprop. The interesting thing is that the only gradient you're gonna get for the encoder is the gradient of D of Z and Z bar, okay? So the encoder is just gonna train itself to minimize D of Z and Z bar. In other words, it's gonna train itself to predict, as well as possible, the optimal Z that you obtain through minimization. The decoder is gonna train itself to, of course, reconstruct Y as well as it can with the Z it is given. And then if you have a predictor, you're gonna get gradients for the predictor, and it's gonna try to produce an H that helps as much as possible. Is that clear? Yeah, thanks. Okay, so that's the architecture. It's basically just a pretty garden-variety recurrent net. And this works really well, in the sense that you go through the iterations of this ISTA algorithm, or through this trained neural net that is designed to approximate its solution. What you can do is train the system, for example, to produce the best possible solution after three iterations only, right? It knows the optimal value because it's been computed with ISTA, but when you train it, it trains itself to produce the best approximation of that value with only three iterations. And what we see is that after three iterations, it produces a much, much better approximation than ISTA would produce in three iterations. And so what you see here, as a function of the number of iterations of either ISTA or this LISTA algorithm, is the reconstruction error, right? So by training an encoder to predict the result of the optimization, you actually get better results than if you actually run the optimization for the same number of iterations.
So it accelerates inference a lot. Okay, so this is what sparse coding gives you, with or without an encoder actually — you get pretty much the same results — when you train on MNIST. So basically it's a linear decoder. The code space here, the Z vector, has size 256. And so you take this 256-dimensional Z vector, multiply it by the matrix, and you reconstruct a digit. And what you see here are the columns of this matrix represented as images, okay? Each column of W has the same dimension as an MNIST digit, so you can represent each of them as an image. And these are the 256 columns of W. And what you see is that they basically represent parts of characters, like little pieces of strokes. And the reason for this is that you can basically reconstruct any MNIST digit by a linear combination of a small number of those strokes, okay? And so that's kind of beautiful, because this system basically finds constitutive parts of objects in a completely unsupervised way. And that's kind of what you want out of unsupervised learning: what are the components or the parts that can explain what my data looks like? So this works really beautifully for MNIST. It works quite nicely as well for natural image patches. There's supposed to be an animation here, but you're not seeing it, obviously, because it's a PDF. But the result is this. The animation shows the learning taking place. So here, again, these are the columns of the decoding matrix of a sparse coding system with L1 regularization that has been trained on natural image patches. I must say that those natural image patches have been whitened, which means they've been normalized in some way: the mean is canceled and the variance is normalized. And you get nice little Gabor filters — basically, small edge detectors at various orientations, locations and sizes.
The reason this was invented by neuroscientists is that it looks very much like what you observe in the primary area of the visual cortex, when you poke electrodes into the visual cortex of most animals and figure out what patterns they maximally respond to: they maximally respond to oriented edges. This is also what you observe when you train a convolutional net on ImageNet. The first layer of features looks very much like this as well, except those are convolutional; these ones are not — it's trained on image patches, but there are no convolutions here. So this is nice because it tells you that with a very simple unsupervised learning algorithm, we get essentially qualitatively the same features that we would get by training a large convolutional net supervised. So that gives you a hint. So here is the convolutional version. This is more recent, by the way. The convolutional version basically says: you have an image, and what you're gonna do is take feature maps — let's say four here, but it could be more — and convolve each of those feature maps with a kernel. Okay. So a feature map, let's call it ZI, because I'm gonna use K for the kernel, KI. And this is gonna give a reconstruction of Y. Our reconstruction is simply gonna be the sum over I of ZI convolved with KI, okay? So this is different from the original sparse coding, where Y bar was equal to a sum over columns of the W matrix, each multiplied by a coefficient ZI, which is a scalar, right? So in regular sparse coding, you have a weighted sum of columns where the weights are the scalar coefficients ZI. In convolutional sparse coding, it's again a linear operation, but now the dictionary matrix is a bunch of convolution kernels and the latent variable is a bunch of feature maps.
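A 1-D sketch of this reconstruction (2-D images work the same way, just with 2-D convolutions); the kernels and feature maps below are placeholders:

```python
import numpy as np

def reconstruct(zs, ks):
    """Convolutional sparse coding decoder: y_bar = sum_i (z_i convolved with k_i)."""
    n = len(zs[0]) + len(ks[0]) - 1
    y_bar = np.zeros(n)
    for z, k in zip(zs, ks):
        y_bar += np.convolve(z, k)    # one feature map per kernel, results summed
    return y_bar

rng = np.random.default_rng(0)
ks = [rng.normal(size=5) for _ in range(4)]   # 4 kernels of size 5 (the dictionary)
zs = [np.zeros(20) for _ in range(4)]         # 4 sparse feature maps (the latent Z)
zs[1][7] = 1.0                                # a single active component
y_bar = reconstruct(zs, ks)
```

With a single unit impulse in one feature map, the reconstruction is just that map's kernel copied at the impulse position, which is the convolutional analogue of picking one dictionary column.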
And you convolve each feature map with its kernel and sum up the results. This is what you get. So here, it's one of those systems that has a decoder and an encoder. The encoder is very simple: essentially a single-layer network with a nonlinearity, and then a simple layer after that, basically a diagonal layer to change the gains. It's very, very simple. And the filters in the encoder and the decoder look very similar. So the encoder is just a convolution, then some nonlinearity — I think it was a hyperbolic tangent in that case — and then what amounts to a diagonal layer that just changes the scale. Then there is a sparsity constraint on the code, and the decoder is just a convolutional linear decoder, and the reconstruction cost is just a squared distance. So if you impose that there is only one filter, the filter looks like the one at the top left — it's just a center-surround type filter. If you allow two filters, you get two weirdly shaped filters. If you allow four filters, which is the third row, you get oriented edges, horizontal and vertical, with two polarities for each of the filters. For eight filters, you get oriented edges at eight different orientations. For 16 filters, you get more orientations and you also get center-surround filters. And then as you increase the number of filters, you get more diverse filters: not just edge detectors, but also grating detectors at various orientations, center-surround, et cetera. And that's very interesting, because this is the kind of stuff you see in the visual cortex. So again, this is an indication that you can learn really good features in a completely unsupervised way. Now, here is the sad news.
If you take those features, plug them into a convolutional net, and train it on some task, you don't necessarily get better results than if you train on ImageNet from scratch. But there are a few instances where this has helped boost performance, particularly in cases where the number of labeled samples is not that great, or the number of categories is small — because by training purely supervised in those cases, you get degenerate features, basically. Here's another example, same thing. Again, it's convolutional sparse coding, here on color images. The decoding kernels are nine by nine, applied convolutionally over an image. And what you see on the left here are the sparse codes. Here you have, I don't know, 64 feature maps, and you can see that the Z vector is extremely sparse, right? There are only a few components here that are either white or black — non-gray, if you want. And this is because of the sparsity penalty. Okay, in the last few minutes, we'll talk about variational autoencoders. And I guess you've heard a bit about this from... But we are gonna be covering this tomorrow, with the bubbles and the code and everything, right? So tomorrow's gonna be one hour of just this. Right, so here is a preview of how variational autoencoders work. Okay, so variational autoencoders have basically the same architecture as the one I showed previously. So, an autoencoder — ignore the conditional part, the part that's conditioned on X, for now. That could be a conditional variational autoencoder, but for now we're just gonna have a regular variational autoencoder. So it's an autoencoder where you take a variable Y and run it through an encoder, which could be a multilayer neural net, convolutional net, whatever you want. It produces a prediction for the code, Z bar. There is a term in the energy function that measures the squared Euclidean distance between Z, the latent variable, and Z bar.
And there's also another cost function here, which is the L2 norm of Z bar — in fact, generally it's more accurate to say the L2 norm of Z, but it doesn't make much difference. And then Z goes through a decoder, which reconstructs Y, and that's your reconstruction error. Okay, now, the difference with the previous models — so this looks very, very similar to the type of autoencoder we just talked about, except there is no sparsity here. And the reason there is no sparsity is that variational autoencoders use another way of limiting the information content of the code: making the code noisy. Okay, so here is the idea. The way you compute Z is not by minimizing the energy function with respect to Z, but by sampling Z randomly according to a distribution whose negative logarithm is the cost that links it to Z bar. Okay, so basically the encoder produces a Z bar, and then there is an energy function that measures the distance, if you want, between Z and Z bar. You think of this as the negative logarithm of a probability distribution. So if this distance is a squared Euclidean distance, what that means is that the distribution of Z is gonna be a conditional Gaussian whose mean is Z bar, okay? So what we're gonna do is sample a random value of Z according to that distribution — basically a Gaussian whose mean is Z bar, okay? And that just means adding a bit of Gaussian noise to Z bar; that's what our Z is going to be. And you run this through the decoder. So when you train a system like this, what the system wants to do is make the Z bar vectors as large as possible, so that the effect of the Gaussian noise on Z is as small as possible, relatively speaking. If the variance of the noise on Z is one and you make the Z bar vector very, very long — with a norm of a thousand, say — then the relative importance of the noise would be 0.1% with respect to Z, okay?
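The sampling step can be sketched as follows — instead of minimizing over Z, add Gaussian noise to the encoder's prediction Z bar (the numbers here are made up):

```python
import numpy as np

def sample_z(z_bar, sigma=1.0, rng=None):
    """Sample z from a Gaussian centered on the encoder's prediction z_bar."""
    if rng is None:
        rng = np.random.default_rng()
    return z_bar + sigma * rng.normal(size=z_bar.shape)   # z_bar plus Gaussian noise

rng = np.random.default_rng(0)
z_bar = np.array([3.0, -1.0, 0.5])
samples = np.stack([sample_z(z_bar, rng=rng) for _ in range(10_000)])
mean_est = samples.mean(axis=0)   # the samples scatter in a "fuzzy ball" around z_bar
```

Each code thus becomes a fuzzy ball of radius roughly sigma around Z bar, which is exactly what limits the information the code can carry.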
So if you train an autoencoder like this, ignoring the fact that you added noise, by just backpropagation, what you'll get is Z bar vectors that get bigger and bigger: the weights of the encoder will get bigger and bigger, and the Z bar vectors will get bigger and bigger. So what's the trick in a variational autoencoder? Hold on, quick question: where does the Z come from? Is Z a latent variable that's never observed? It's a latent variable that we are sampling, and we're not minimizing with respect to it. So in previous cases, we were minimizing the energy with respect to the Z variable, right? Finding the Z that minimizes the sum of C, D and R. Here we're not minimizing, we're just sampling. We're viewing the energy as the log of a distribution, and we're sampling Z from that distribution. All right, so imagine our encoder produces the following points for the training samples. So these are the Z bar vectors produced by the encoder at some point in training. What the sampling of Z is going to do is basically turn every single one of those training samples into a fuzzy ball, okay? Because we take a sample, we add noise to it, and so basically we've turned a single code vector into a kind of fuzzy ball. Now, the decoder needs to be able to reconstruct the input from whatever code it is fed. And so if two of those fuzzy balls intersect, then there is some probability for the decoder to make a mistake and confuse one sample for the other. Okay, so the effect of training the system, if you make every one of your codes a fuzzy ball, is that those fuzzy balls are gonna fly away from each other.
And as I said before, this is the same thing I was saying before in a different way: it's gonna make the weights of the encoder very large so that the code vectors get very long, and basically they get away from each other and the noise on those fuzzy balls doesn't matter anymore. Okay, so here, if the fuzzy balls don't intersect, the system will be able to perfectly reconstruct every sample you throw at it. My question, it was a couple of slides ago, but again on the same topic. What exactly do you mean by degenerate features here, when you were comparing self-supervision and normal, fully supervised training? I see, okay, that's a good question. What I was saying is something I said before in different terms. It's the fact that if you train a classifier, let's say a convolutional net, on a problem that has very few categories, let's say face detection, you only have two categories. The representations of faces you get out of the convolutional net are very degenerate, in the sense that they don't represent every image properly, right? They're going to collapse a lot of different images into identical representations, because the only thing the system needs to do is discriminate faces from non-faces. And so it doesn't need to produce good representations of the entire space. It just needs to tell you if it's a face or not a face. So for example, the features you will get for two different faces will probably be nearly identical. That's what I mean by degenerate features. What you want are feature vectors that are different for different objects, regardless of whether you trained them to be different or not. So if you train on ImageNet, for example, you have 1,000 categories. And because you have a lot of categories, you get features that are fairly diverse and cover a lot of the space of possible images.
I mean, they're still fairly specialized, but they're not completely degenerate, because you have many categories and a lot of samples. The more samples and the more categories you have, the better your features are. In fact, if you think about it, an autoencoder is a neural net in which every training sample is its own category, right? Because you're basically training the system to produce a different output for every sample you show it. So you're training the system to represent every object in a different way. But it can be degenerate in another way, because the system can learn the identity function and encode anything you want. If you think about the Siamese nets, the metric learning systems, the contrastive methods, MoCo, PIRL, and SimCLR I was telling you about, it's a little bit the same thing. They try to learn non-degenerate features by telling the system: here are two objects that I know are the same, here are two objects that I know are different. So make sure you produce different feature vectors for objects that I know are semantically different. Okay, that's a way of making sure you get representations that are different for things that are actually different. But you don't get this by training a convnet on a two-class problem or a 10-class problem. You need as many classes as you can afford. So pre-training using self-supervised learning basically helps make the features more generic and less degenerate for the problem. Okay, so let's get back to variational autoencoders. So again, if you train your encoder with those fuzzy balls, they're gonna fly away from each other. And what you want, really, is for those fuzzy balls to cluster around some sort of data manifold, right? So you want to actually keep them as close to each other as possible. And how can you do this? You can do this by essentially linking all of them to the origin with a spring, okay?
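Going back to the metric-learning methods mentioned a moment ago: the same/different idea can be sketched as a simple margin-based pair loss. The function name and margin value are illustrative, and MoCo, PIRL, and SimCLR use more elaborate contrastive objectives than this, but the principle is the same:

```python
import numpy as np

def pair_loss(f1, f2, same, margin=1.0):
    # Pull feature vectors together for pairs known to be the same,
    # push them at least `margin` apart for pairs known to be different.
    d = np.linalg.norm(f1 - f2)
    if same:
        return 0.5 * d**2                    # attract similar pairs
    return 0.5 * max(0.0, margin - d)**2     # repel dissimilar pairs

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])
c = np.array([5.0, 0.0])
loss_same = pair_loss(a, b, same=True)   # identical features, same pair: no loss
loss_diff = pair_loss(a, c, same=False)  # already far apart: no loss
loss_bad = pair_loss(a, b, same=False)   # identical features for a "different" pair: penalized
```

The last case is the degenerate-feature failure mode: two semantically different inputs mapped to the same feature vector get pushed apart by the loss.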
So the spring wants to bring all those points as close to the origin as possible. And in doing so, the system is gonna try to pack those fuzzy spheres as close to the origin as possible, and it's gonna make them overlap, interpenetrate. But of course, if the two spheres for two very different samples interpenetrate too much, then those two samples are gonna be confused by the decoder and the reconstruction energy is gonna get large. And so what the system ends up doing is only letting two spheres overlap if the two samples are very similar. And so basically, by doing this, the system finds some sort of representation of the manifold; it puts those code vectors along a manifold if there is one. And that's the basic idea of a variational autoencoder. Now, you can derive this with math, and it doesn't make anything much easier to understand. In fact, it's much more abstract, but that's basically what it does in the end. So there are a couple more tricks in that variational autoencoder idea, and you'll get the details with Alfredo tomorrow. You can adapt the size of those fuzzy balls. So basically, you can have the encoder compute the optimal size of the balls in each direction. And what you have to do is make sure the balls don't get too small. So you put a penalty function that tries to make the variance of those balls, the size in each dimension if you want, as close to one as possible. They can get a bit smaller, they can get a bit larger if they want, but there's a cost for making them different from one. So now the problem you have with this is to adjust the relative importance of this spring strength. If the spring is too strong, then the fuzzy balls are all going to collapse at the center, and the system is not going to be able to reconstruct properly.
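The spring toward the origin and the variance-near-one penalty have a standard closed form: the KL divergence between the encoder's diagonal Gaussian and a unit Gaussian. A minimal numpy sketch of the tradeoff; the `beta` weight is my notation for the spring strength, not the lecture's:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).
    # The mu^2 term is the spring pulling codes toward the origin;
    # exp(log_var) - 1 - log_var pulls each variance toward one.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def total_loss(reconstruction_error, mu, log_var, beta=1.0):
    # Reconstruction term plus beta-weighted spring term. Too large a
    # beta collapses the fuzzy balls onto the origin; too small a beta
    # lets them fly apart.
    return reconstruction_error + beta * kl_to_standard_normal(mu, log_var)

mu = np.zeros(4)
log_var = np.zeros(4)                 # zero mean, unit variance: KL is exactly 0
baseline = total_loss(1.0, mu, log_var)
pushed = total_loss(1.0, mu + 3.0, log_var)   # codes far from the origin: KL grows
```

The spring cost is zero exactly when the codes sit at the origin with unit variance, and grows as the encoder tries to push the fuzzy balls away.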
If it's too weak, then the fuzzy balls are going to fly away from each other, and the system is going to be able to reconstruct everything and anything. So you have to strike a balance between the two, and that's kind of the difficulty with variational autoencoders. That spring term is called a KL divergence; it's the KL divergence between the Gaussian produced by the encoder and a standard Gaussian. If you make its strength a little too large, it collapses: all the fuzzy balls basically go to the center, and the system does not actually model the data properly. I have a question about one of the previous lectures actually, is that all right? Sure. Yeah, so when you were talking about linear layers, you were saying that stacking linear layers one after the other without non-linearities is basically redundant, because we can have one linear layer do it. But I remember you also mentioned there is one particular reason why you might want to stack linear layers anyway, and you didn't go into that reason. So I was wondering if there is anything significant behind that. So the situation I was describing is: imagine you have some big neural net that produces a feature vector of a certain size, and then your output is extremely large, because maybe you have many, many categories. Maybe you're doing phoneme classification for a speech recognition system, so the number of categories here is 10,000 or something like that. Okay, I have to draw slowly. So if your feature vector here is itself something like 10,000, the matrix to go from here to here will have 100 million parameters, right? And that's probably a bit too much. So what people do is factorize that matrix into the product of two skinny matrices, where the middle dimension here is maybe, I don't know, a thousand. You have 10K on the input, 10K on the output, and then the middle one is 1K, right?
So if you don't have the middle one, the number of parameters you have is 10 to the 8. If you do have the middle one, it's 2 times 10 to the 7, okay? So you get a factor of 5. If you make the middle dimension 100, then it's 2 times 10 to the 6. So it becomes more manageable. So basically you get a low-rank factorization. The overall matrix, which you can call W, is now going to be the product of two smaller matrices, U and V. And because the middle dimension, if you want, of U and V is smaller, say 100, the rank of the corresponding matrix W will be smaller. There are people who do this without actually specifying the dimension of the middle layer, by doing what's called minimization of the nuclear norm, which is equivalent. I don't want to go into this, but that would be a situation where you might want to decompose a matrix into a product of two matrices to save parameters, and essentially save computation. There's also another interesting phenomenon, which is that it turns out both learning and generalization are actually better when you do this kind of factorization. Even though the optimization with respect to this matrix becomes non-convex, it actually converges faster using stochastic gradient. There's a series of papers on this by Nadav Cohen, I think from 2018, co-authored with Sanjeev Arora from Princeton; Nadav was a postdoc with Sanjeev Arora. They had a series of papers explaining why even linear networks actually converge faster, and they use it also to study the dynamics of learning in non-convex optimization, as well as the generalization properties. How important is it to have a kind of matching architecture of encoder and decoder in a variational autoencoder? There's no reason for the two architectures to match.
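Back to the factorization for a moment: the parameter counts above are easy to check, and a small numerical example shows the rank bound. The toy 50-by-5 sizes in the rank check are mine, chosen purely for speed:

```python
import numpy as np

# Parameter counts for a 10k -> 10k classifier matrix, full vs. factored.
d_in, d_out, d_mid = 10_000, 10_000, 1_000
full_params = d_in * d_out                       # 10^8
factored_params = d_in * d_mid + d_mid * d_out   # 2 * 10^7, a 5x saving

# The product of two skinny matrices has rank at most the middle
# dimension, so W = U @ V is a low-rank version of the big matrix.
rng = np.random.default_rng(0)
U = rng.standard_normal((50, 5))     # tall and skinny
V = rng.standard_normal((5, 50))     # short and wide
rank = np.linalg.matrix_rank(U @ V)  # at most 5, far below min(50, 50)
```

With generic random entries the product hits the bound exactly, so the 50-by-50 matrix `U @ V` has rank 5.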
It's very often the case that decoding is much easier than encoding. If you take the example of sparse coding, which I talked about, the encoder is one of those LISTA-type encoders, which is actually quite complicated, whereas the decoder is linear. This is kind of a special case, because the code is high-dimensional and sparse, and almost any function of a high-dimensional sparse code is quasi-linear. One way to make a function linear is to represent its input variable in a high-dimensional space using a non-linear transformation. We've talked about this when we discussed what good features are. And good features generally consist in expanding the dimension of the representation in a non-linear way and making this representation sparse. And the reason for this is that it makes your function linear. So you could very well have a very complex encoder and a very simple decoder, possibly a linear one, as long as your code is high-dimensional. Now, if you have a low-dimensional code, an autoencoder where the middle layer, the code layer, is very narrow, then the decoding could become very complicated. It may be a very highly non-linear function, and you may need multiple layers. But there's no reason, again, to think that the architecture of the decoder should be similar to the architecture of the encoder. That said, there might actually be a good reason for it, okay? And in fact, there are models that I haven't talked about because they're not really proven, which are called stacked autoencoders, where you basically have this idea. So you essentially have an autoencoder where you have a reconstruction error.
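Before going on, a tiny numpy illustration of the linear-decoder point above. The dimensions and the random dictionary are made up; the point is that decoding a high-dimensional sparse code is just a weighted sum of a few dictionary columns:

```python
import numpy as np

rng = np.random.default_rng(0)

d_input, d_code = 16, 64                     # code is overcomplete: 64 > 16
D = rng.standard_normal((d_input, d_code))   # linear decoder (the dictionary)

z = np.zeros(d_code)
z[[3, 17, 42]] = [1.5, -0.7, 2.0]            # sparse: 3 of 64 units active

# Linear decoding: the reconstruction combines only the dictionary
# columns selected by the active code units.
y_recon = D @ z
same = D[:, 3] * 1.5 + D[:, 17] * -0.7 + D[:, 42] * 2.0
```

All the non-linearity lives in the encoder that produces the sparse `z`; the decoder itself is a single matrix multiply.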
I'm actually going to erase this and make it look like the autoencoder we talked about, where there is a cost for making the latent variable different from the output of the encoder. So this is a Z bar and this is a Z, if you want, okay? Now, that's an autoencoder drawn in a funny way. So this is Y and this is Y bar at the bottom. Now I can stack another one of those guys on top. Okay, now I'm gonna have to call this Z1 and I'm gonna call this Z2, et cetera. This I'm gonna call Y bar and this I'm gonna call Y. Okay, I'm gonna call the bottom one X now, change of name. Now, if you ignore the right part of the system and look at the left part, you go up to Y, and that looks very much like a classical recognizer, where X is the input, Y bar is a prediction for the output, Y is the desired output, and there's a cost function that measures the difference between the two, okay? The other branch that goes from Y to X, that's kind of like a decoder, where Y is the code. But then you have codes all the way in the middle, because it's kind of a stacked autoencoder, right? So every pair of layers, every encoder-decoder pair, is kind of a little autoencoder, and you stack them on top of each other. And what you'd like is to find a way of training the system such that if you don't have a label for a sample, so you don't know Y, you just train this as an autoencoder. But if you do have a Y, then you clamp Y to its desired value, and the system becomes a combination of a predictor, or recognizer, and an autoencoder. Now, there is a slight problem with this picture; there are a number of different problems.
The first problem is that if Z1, for example, has enough capacity and you only train on unlabeled samples, the system is only going to carry the information through Z1 and is going to completely ignore the top layers, because it has enough capacity in Z1 to do a perfect reconstruction. So it's going to put all the information through Z1, and then all the other variables, Z2 and Y, will be constant because the system won't need them. So again, you will need to regularize Z1 to prevent it from capturing all the information, and the same for the other layers, perhaps. Now, the other thing is: do these layers need to be linear or non-linear? That depends on the relative sizes of the various Zs. If you go from low dimension to high dimension, you need something non-linear. But if you go from high dimension to low dimension, you can probably do it with a linear one. Kind of like sparse coding. And so you will see that the system may have an alternation of linear and non-linear stages, in opposite phase if you want, because you need linear to go from high dimension to low dimension, then non-linear to go from low dimension to high dimension, and then again linear to go back to low dimension. And it's the opposite the other way around. People have been proposing things like this, but not really trying them on a large scale, so there are a lot of open questions around those things. If you're curious, one paper that I worked on with a former student called Jake Zhao is a system called the Stacked What-Where Autoencoder. It's a system a bit of this type, but there are extra variables going this way, which are basically the positions of the switches in pooling.
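The stacking idea described above can be sketched as a forward pass in numpy: a two-level stack where the top code is either predicted (the unlabeled case) or clamped to a label (the supervised case). The weights here are random and untrained, and all shapes and names are mine, purely to show the information flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

d_x, d_z1, d_y = 8, 4, 2                         # X -> Z1 -> Y dimensions
W_enc1 = rng.standard_normal((d_z1, d_x)) * 0.1  # encoder: X -> Z1
W_enc2 = rng.standard_normal((d_y, d_z1)) * 0.1  # encoder: Z1 -> Y_bar
W_dec2 = rng.standard_normal((d_z1, d_y)) * 0.1  # decoder: Y -> Z1_bar
W_dec1 = rng.standard_normal((d_x, d_z1)) * 0.1  # decoder: Z1 -> X_bar

def forward(x, y=None):
    # If a label y is given, clamp the top of the stack to it;
    # otherwise run purely as an autoencoder on the prediction.
    z1 = relu(W_enc1 @ x)
    y_bar = W_enc2 @ z1                 # top-level prediction
    y_top = y_bar if y is None else y   # clamp when a label exists
    z1_bar = relu(W_dec2 @ y_top)       # reconstruct the middle code
    x_bar = W_dec1 @ z1_bar             # reconstruct the input
    return y_bar, x_bar

x = rng.standard_normal(d_x)
y_pred, x_recon = forward(x)                    # unlabeled: pure autoencoder
y_pred2, x_recon2 = forward(x, y=np.ones(d_y))  # labeled: Y clamped to the target
```

Training such a stack, and regularizing the intermediate codes so Z1 cannot carry everything, is exactly the open problem the lecture mentions; this sketch only shows the two inference modes.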
I don't want to go into details, but if you look for a paper about Stacked What-Where Autoencoders, you'll find two papers: one by Jake and myself, and a follow-up paper by a group from the University of Michigan that basically enhanced it, trained it on ImageNet, and got some decent results. So those are the kinds of architectures you can use to do self-supervised learning. Just to clarify, the parameter for the spring is the KL divergence term in the loss? Right, the KL divergence term in the loss. We're gonna see this tomorrow, guys. We're gonna go through the equations and all these details, so I'll cover this tomorrow. So I'll see you tomorrow, hopefully with the video as well. If the bandwidth supports it, I will put the recording of this class online as soon as it's actually available to me. I will add it to the NYU streaming platform, and then I will try to clean it up as best I can and upload it to YouTube later on. All right, so thank you again, stay home, stay warm, and I'll see you tomorrow. Stay safe. Bye-bye.