So we're going to continue talking about energy-based models. We've talked about contrastive methods and non-contrastive methods in the context of joint embedding, and today we're going to talk about architectural methods and regularized latent variable models. This is more for self-supervised learning than supervised learning, really. Last time we also talked about structure prediction, and you're going to have homework on that topic that Vlad is preparing for you, which is pretty cool, I have to say. So, part three of energy-based models.

We've talked about contrastive learning, which covers methods where you explicitly push down on points from the training set, whether it's the conditional version where you have an x variable and a y variable (they may of course be multi-dimensional, discrete or continuous), or the unconditional version where you only have y's, so you never know which components will be known or not known; maybe none of them will be known. The name of the game in contrastive methods is that whenever you have a point from the training set you push down on its energy, and then you have to have some way of producing points outside of the manifold of data, or the set of training samples if you want, and you push their energy up. We've seen various ways of doing this: either using the negative log-likelihood loss function, which basically pulls up the energy everywhere, approximated with Monte Carlo or sampling methods where you replace the negative log-likelihood by a finite set of points that you choose by sampling from the distribution your model assigns to points in the space. And we've seen other contrastive methods where you plug a pair of energies, a good energy and a bad energy, into a loss function. The loss function has to be an increasing function of the energy of the good point and a decreasing function of the energy of the bad point, in such a way that one gets pushed down and the other pushed up, perhaps until the difference between them is large enough that you don't need to push anymore. You can do this with a hinge loss, or you can have a generalized loss, which is the one written here, where you take a whole bunch of positive samples and a whole bunch of negative samples and you have some more elaborate way of pushing down and pulling up on those various terms. We've seen a very popular example of this for joint embedding methods called NCE or InfoNCE, noise contrastive estimation (that's what NCE stands for), but they're basically the same thing.
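To make the InfoNCE idea concrete, here is a minimal sketch of an InfoNCE-style loss for a joint-embedding setup, written in PyTorch. The batch layout, the temperature value, and the assumption that the i-th row of each view forms the only positive pair are illustrative choices on my part, not something prescribed by the lecture.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style contrastive loss for a joint-embedding model.

    z_a, z_b: (batch, dim) embeddings of two views of the same samples.
    Row i of z_a and row i of z_b are the positive pair; every other
    row of z_b acts as a negative for z_a[i].
    """
    z_a = F.normalize(z_a, dim=1)          # cosine similarity via unit-norm embeddings
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))    # positives are on the diagonal
    # Cross-entropy pushes down the energy (up the similarity) of positive
    # pairs and pushes up the energy of all the negative pairs in the batch.
    return F.cross_entropy(logits, targets)

# usage: embeddings of two augmented views of the same batch of images
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce_loss(z1, z2)
```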
Okay, so that was for contrastive methods, and the problem with contrastive methods that we discussed last time is that they are very expensive, because there are many ways for points to be outside of the manifold of data. Generally, in an ambient space of high dimension, the space occupied by the data is actually a very small volume within that space; it could be a low-dimensional manifold within a high-dimensional space. We've seen examples of this at the beginning of the class, where I asked: what is the intrinsic dimension of the manifold that is the collection of all possible pictures of someone's face, turning the head around, making faces, etc., assuming the hair doesn't move? We've seen that the dimension of that manifold would be something like less than 100, probably around 50 or so, because it's limited by the number of degrees of freedom in your face, which is around 50. The image itself is a point in a one-million-dimensional space if it's a grayscale 1000-by-1000 image, and the space occupied by the data is a tiny sliver in that space; very complicated, but a tiny sliver.

So here is the problem: there are many, many dimensions along which you can move that take you outside of that manifold of data, and if you have to pull up on essentially every such place to make sure the energy takes the right shape, it might become very expensive; and indeed it is very expensive. The situations where it works are contrastive methods for joint embedding, things like SimCLR for example, and Siamese nets in general, where you also push down and pull up: you generate negative samples by picking pairs that are dissimilar from your training set, and the similar pairs are generated by distorting an image into a different version of itself, so you can have essentially as much data as you want. This works okay, but it's very expensive, and it only works if the dimension of the embedding space in which you measure the distance is relatively small. Why small? Because the smaller it is, the fewer dimensions you're going to have to explore to pull up on stuff.

We've also seen that in the context of joint embedding there are methods that are not really contrastive: they don't explicitly push up on the energy of data points, they make sure the energy is higher through other means. Some of those means are understood, as in the Barlow Twins method that I presented last week, which is very fresh from the press; it just came out on arXiv last week. Others work through mechanisms that are not fully understood, like BYOL, Bootstrap Your Own Latent, where there seems to be an implicit contrastive effect going on without explicitly pushing on points, but it's not fully understood why that happens; it's connected with normalization of various kinds. There's a lot of research in this area, it's very hot, and in fact the project you'll be given will address this. You'll be asked to do self-supervised learning on a relatively large collection of unlabeled images, then you'll be given a small number of labeled samples on which you can fine-tune in a supervised way, and you'll be given the choice of what method to implement, including things you can pull out of the literature: joint embedding, stacked autoencoders, or whatever method you want to use. I would discourage you from using GANs, because GANs don't seem to work very well for learning features and they're also a little finicky.

I gave you this big eye chart in the past, and I'm going to show you versions of it multiple times; it tries to classify some classical machine learning methods, supervised, unsupervised or otherwise, into contrastive methods and regularized or architectural methods. We're going to concentrate on the list at the bottom here. I'm not sure I'm going to go through all of it, but at least the first two or so; I might talk about the last one, but if I don't, it doesn't matter too much, because I think Alfredo is going to tell you about that, and I may mention the very last one if I have time.

So let's focus on the regularized and architectural methods, and to get an intuitive understanding of what this does, I'll recast some classical algorithms that I'm sure you've heard of in the context of energy-based models. I already alluded to this in previous lectures, but I think it's worth revisiting. Let's start with architectural methods. Architectural methods consist in constructing the machine so that the volume of low-energy space is fixed or limited by construction. For latent variable models (they're not all latent variable models), these are different ways of limiting the information capacity of the latent representation, or whatever representation the prediction is produced from. So the first set of techniques builds the machine in such a way that the volume of low-energy space is bounded. Here's a bunch of techniques you may have heard of. Principal component analysis: the volume is the dimension of the principal subspace, so it's limited because that dimension is limited; I'll show you an example to visualize what happens. K-means: the volume is limited by the number of prototypes. This is the volume of low-energy space, right: you figure out how much of the ambient space of y, the variable you're trying to model, can take low energy because of the way you constructed the machine. Then there are analytically normalized probabilistic models, things where you write a probability distribution and you know it's normalized because it has all the terms needed to make it normalized. Take a Gaussian distribution: a Gaussian has a fixed volume of high probability because the probability has to integrate to one, so if you make the variance large, the top of the distribution goes down, because the total volume has to be one. Gaussians are an example, and there are other distributions that you can explicitly normalize, where you can guarantee the volume is one because the expression of the energy is simple enough that when you take e to the minus energy you can compute the constant you need to normalize with. In the case of Gaussian distributions this constant is proportional to the square root of the determinant of the covariance matrix, with a bunch of pi's and square roots in there; you know the formula.
A Gaussian mixture is basically just like a Gaussian except you have multiple of them: the Gaussian mixture model is a linear combination of Gaussian components where the coefficients are all between 0 and 1 and sum to 1, so in the end it's still a normalized model, and it's easy to normalize. So density models in probabilistic modeling that are explicitly normalized are architectural models, if you want. Where it becomes a non-architectural model is when the normalization constant is difficult to compute or intractable, and there you basically have to just use the energy and push down and pull up, because you don't know how to explicitly compute the normalization term. Something I'm not going to talk about much is independent component analysis; there are forms of it that are essentially architectural methods and others that are more regularized methods, but I'm not going to go into this very much. And then there are latent variable models with a fixed latent distribution, where the volume of low-energy stuff is basically limited by the volume of the latent variable's prior distribution. An example of this is normalizing flows. I'm not going to talk about these today; I may talk about them in a future lecture, I'm not sure. Normalizing flows are fairly recent models that have some applications in things like physics, but they're still very much at the research level and fairly complicated; I encourage you to read about them if you're interested.

Okay, so let's talk about principal component analysis in the context of energy-based models. Let's assume we have a set of training samples drawn around this little spiral here in 2D: our y vector is a point in the 2D plane, and our training set is drawn from that spiral, a few thousand points or whatever. Now let's ask: what is PCA? PCA is basically an autoencoder model. We've alluded to this idea of an autoencoder: you take a y, run it through an encoder, run the result through a decoder, you get a reconstruction of y, and then you measure the distance between the reconstructed y and the original y, and that is the energy. Generally speaking the energy is something like this; in fact it's a free energy, because there's no latent variable: F_W(y) is the distance between y and the decoder applied to the encoder applied to y, where both the encoder and the decoder are parameterized by a parameter vector w, which I didn't put explicitly here. You can measure the squared Euclidean distance or whatever discrepancy measure you like; it doesn't need to be a distance, actually, it needs to be a divergence of some kind, so it's good if it's zero when the two things are equal, but it doesn't need to be symmetric or satisfy the triangle inequality or anything like that. There's a question: isn't there a y missing in the equation for the energy? It should be y minus W-transpose-W times y. Yes, you're right, sorry about that, good point: there is a missing y there, the energy is F_W(y) = ||y − WᵀW y||². So essentially in PCA you constrain the encoder and the decoder to be linear, and what's more, you constrain them to be transposes of each other: you see W as a linear projection into a low-dimensional space, and W transpose as the projection back into the bigger space.

So that's the architecture, and the energy function is just that. Now, how do you train PCA? You can just solve the system: it's a squared-error minimization with respect to the parameters, and you can solve it as an eigenvector/eigenvalue problem, or you can do it through gradient descent on the squared error; that works too, using shared weights, since it's the same matrix on both sides. The difference you'll get if you solve this with gradient descent is that the W matrix you obtain will be equal to the one from regular PCA only up to a rotation, because regular PCA gives you the eigendecomposition, the projection onto the eigenspace of the covariance matrix of the y's of your training set, whereas here, if you just do the least-squares minimization, you get the projection onto the right subspace, but the basis within that subspace is not determined; so up to a rotation in that space you get the same result.

If you train PCA on this spiral, the principal axis is essentially the linear subspace that best approximates your training samples, in other words the one that minimizes the squared distance between the training samples and their projections onto it. You take a point on the manifold of training data, that's one of the y's, you project it onto the linear subspace, that's z bar, and then you multiply by W transpose, and now you have the coordinates of that projected point in the original two-dimensional space. Here W, the projection, is a 1-by-n matrix (here n = 2), so it projects into a one-dimensional subspace: it gives you a scalar, which is basically the position along the subspace, and when you multiply by W transpose you get the two-dimensional coordinates of that point in the original space. When you compute the squared distance between those two things, what you get is the distance between the data point and its projection onto the subspace, and that's your energy. So if a data point is on the principal subspace its energy is zero, and as you move away from that linear subspace the energy grows quadratically, which is indicated by the gray scale. This linear subspace is simply the one that minimizes the sum of squared distances between all the training samples and their projections onto it. Now, this is obviously not a very good approximation of this nonlinear manifold, but it's a cooked-up example to show what happens. The bottom line is that when you train PCA, you can train it by just minimizing the average energy of your training samples without having to pull up on the energy of anything, because by construction, since the linear subspace you project onto has a limited dimension, the volume of stuff that can take low energy is limited: it has to be a linear subspace of whatever dimension you decided z bar should be. So it's the idea of using a bottleneck to limit the information capacity of the representation from which your system reconstructs the prediction.
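As a concrete illustration, here is a minimal sketch, under my own assumptions about data scale, step size, and iteration count, of PCA viewed as an energy-based autoencoder with tied weights, F_W(y) = ||y − WᵀW y||², trained by gradient descent on the average energy of the training points only (no pushing up anywhere); the spiral data is just a stand-in for the figure in the slides.

```python
import numpy as np

# Toy 2D spiral data, a stand-in for the dataset in the slides.
rng = np.random.default_rng(0)
t = rng.uniform(0, 3 * np.pi, size=2000)
Y = np.stack([t * np.cos(t), t * np.sin(t)], axis=1) / 10
Y += 0.05 * rng.standard_normal(Y.shape)

d, k = 2, 1                            # ambient dimension 2, principal subspace dimension 1
W = 0.1 * rng.standard_normal((k, d))  # linear encoder; the decoder is its transpose

lr = 0.01
for _ in range(500):
    R = Y - (Y @ W.T) @ W              # per-sample reconstruction error y - W^T W y
    # gradient of the average energy ||y - W^T W y||^2 with respect to W (tied weights)
    grad = -2.0 / len(Y) * W @ (Y.T @ R + R.T @ Y)
    W -= lr * grad                     # push down the energy of the training points only

free_energy = ((Y - (Y @ W.T) @ W) ** 2).sum(axis=1).mean()
print("average energy on the data:", free_energy)
```

As the lecture notes, the subspace found this way matches regular PCA only up to a rotation of the basis within that subspace.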
There's some confusion here about what y is: is y the input here, or is it a sample of the output? Okay, so this is a two-dimensional space, y1 horizontally, y2 vertically, and what the model is supposed to do is: you give it a point y and it tells you whether it's a good point or a bad point, in other words whether it looks like something in the training set or not. The question is actually about the choice of the letter y rather than x; that's the confusion. I see, okay. So the reason I use y here is that y is the variable we are trying to model. As I said, in the energy-based framework the x variable is the one you always observe, on the training set and on the test set; y is the thing you observe on the training set but not on the test set (you're not given it, or maybe you observe only part of it), and it's the variable you're supposed to predict, to model. You write all the equations and you realize that to go from conditional to unconditional, the only thing you have to remove is x; you're left with the same formulas. All the formulas for energy-based models, the marginalization, the minimization with respect to latent variables, all the training procedures, everything carries over; you just remove x when you only have a y. In unsupervised situations like PCA you only have a y, you don't have an x; you're modeling y. What you're modeling here is the dependency between y1 and y2: what you're interested in is saying, suppose I give you y1, tell me what y2 could be. With a good model, if y1 were this value, it would have to tell you that y2 could be this or it could be that. PCA only tells you "it's here"; of course this is in two dimensions with a principal subspace of one dimension, so it's simple, and PCA can actually give you more than one point, but they are all in a linear subspace. Similarly, you could say: here is y2, give me the possible values for y1, and this value of y1 is possible, and the other intersection is this one, so again you have two possibilities. So I would not be able to build a deterministic function here that predicts y2 from y1 or y1 from y2, because there are multiple y1's compatible with a given y2 and vice versa. I have to use an energy-based model, where I just have an energy function that gives me the compatibility between y1 and y2, but there is no variable that I always observe: I don't know whether I'm going to know y1, y2, or neither of them. I hope that clears it up a bit; I know it's difficult to grok, but once you get it, it clicks.

There's a follow-up question: isn't unsupervised learning learning without labels, so shouldn't it be without y and only with x? Isn't y the label? Well, no, y is whatever you want to predict. The difference between supervised and unsupervised has nothing to do with whether you call the variable you want to predict y or not; it has to do with whether the variable you need to predict has been, basically, human-provided. You can think of it this way; it's not a mathematical definition, but supervised learning is: you know what the answer is, you give the answer to the system, and that answer is a definite answer, there is one correct answer, and it has been provided by a human or by some intelligent process. In self-supervised learning, the y is not provided by a human; it's part of the input. Say you do video prediction: y is whatever you don't see of the clip. Or you train a BERT system for natural language processing: you take a piece of text and remove some of the words, so your y is text. Technically it's human-produced, but it's abundant data; it doesn't need to be manually labeled, even though people wrote those texts. So it's a bit of an ambiguous definition, really, what supervised and unsupervised are. If y is discrete you can call it supervised, but you need an x; if you have only a y, you can't really call it supervised, you have to call it unsupervised. So unsupervised is: you have a y variable and you capture the mutual dependencies between the components of y. You don't know in which direction you're going to need to make the prediction, so you have to capture all the dependencies between the components of y, and the best way to do this in an abstract way is to have a function that gives you the constraints those components have to satisfy, so that when you are given some of the y components you can predict the values of the others by minimizing this energy. That's unsupervised. Supervised is when you have an x, an input, you're trying to predict the y, and there is basically only one y that works, which is given to you during training. And self-supervised is: you're given a y during training, but that y may be one of many possible outcomes; you're only given one, so the prediction has to deal with uncertainty, has to be able to represent uncertainty more or less explicitly. What's more, the y is not something you had to spend a lot of money getting: it may be part of the input, or you may be able to obtain it automatically. (I summarized this in the chat as well, and tomorrow we're going to go over these slides.) Right, yes, this will be gone over again if it's not clear, but if you only want to remember one thing: when it's conditional you have an x, and the x is the thing you observe all the time; in the unconditional, unsupervised case, you only have a y, you don't have an x. Classical unsupervised learning algorithms basically only have a y, and the reason we use the letter y is that all the formulas from the conditional case carry over; there's just no x in them. So y is the stuff you're trying to model. Okay, I think we're done with this slide.

Now, what if we decide to allow our encoder and decoder to be more complex than just a linear projection? That takes us to autoencoders with a bottleneck. What happens there is that you let the encoder and decoder functions be neural nets, multilayer perceptrons say, but you still insist that somewhere in the network there is a representation that is lower-dimensional than the input, and you decide on that dimension.
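Here is a minimal sketch of such a bottleneck autoencoder in PyTorch; the layer widths, the one-dimensional code, and the training loop are my own illustrative choices, not the exact architecture used for the figure in the slides.

```python
import torch
import torch.nn as nn

# Encoder and decoder are small MLPs; the 1-dimensional code is the bottleneck
# that limits the volume of points that can get low (reconstruction) energy.
encoder = nn.Sequential(nn.Linear(2, 100), nn.ReLU(), nn.Linear(100, 1))
decoder = nn.Sequential(nn.Linear(1, 100), nn.ReLU(), nn.Linear(100, 2))

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def energy(y):
    """Free energy F(y) = ||y - Dec(Enc(y))||^2 -- no latent variable to minimize over."""
    return ((y - decoder(encoder(y))) ** 2).sum(dim=1)

# training: just push down the average energy of the training points
Y = torch.randn(2000, 2)   # placeholder for the 2D spiral dataset
for _ in range(200):
    opt.zero_grad()
    loss = energy(Y).mean()
    loss.backward()
    opt.step()
```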
Here's an example of what this does on this particular training set. I can't exactly remember what the architecture of the encoder and decoder was here; I think it was a couple of layers, pretty wide, something like a hundred units in the first hidden layer, and the bottleneck is still one-dimensional. It has to be, because the space is two-dimensional, so if you want to reduce the dimension you have to go down to one. The system is trying to compress to one dimension the information about what the training sample is, and it finds a solution that is not perfect: it's a little folded, so it has a low-energy region that roughly follows the spiral, and here it loses its marbles a little bit and doesn't model this part very well, then sticks to this part, but not very well, and then it gives low energy to this part, which it shouldn't. So it's not perfect; certainly with one dimension it's not going to work very well. This tends to work okay in high-dimensional spaces, not great, but it kind of works. And again, because the dimension of the bottleneck is lower, which limits the volume of stuff that can take low energy (as you can see, it's basically a one-dimensional surface; even though it might be very complicated, it's still one-dimensional, so it can't occupy the entire space), you can use just the average energy as the loss: push down on the data points, and you don't need to push up on anything, because automatically the energy elsewhere will be higher.

How about k-means now? K-means is a relatively simple algorithm which I hope all of you remember; if you don't, let me reformulate it in energy-based terms. It's a latent variable model: you have the elementary energy E, and the free energy F which you obtain by minimizing the elementary energy with respect to the latent variable z, taken over a particular set, and we'll see what that set is. The energy function is the reconstruction error, in this case the squared Euclidean distance (but again, it could be anything) between the data point and the decoding function applied to the latent variable. So again it's an unsupervised model with a latent variable; you have to do inference, which means you have to compute the latent variable that best matches your data point, using this minimization. In the case of k-means, the decoder is just linear, just a matrix, so the energy is just the squared distance between the data point and the matrix multiplied by the z vector: E(y, z) = ||y − Wz||². And here is the trick: in k-means you constrain the z vector to be a one-hot vector, a vector with all zeros and a single one at one location. The effect of constraining z to be one-hot is that z selects one of the columns of W: when you multiply a matrix by a vector that has only a single one, what you get is the corresponding column of that weight matrix. So when you minimize the energy with respect to z, what happens is that you try to pick the column of W that minimizes the distance to y. Imagine that the columns of W represent points in the space: what you're trying to do is find the point in that space, which is one of the columns of W, that is closest to the data point you're considering. So what energy surface does that give us?

Let's talk about training now. Training k-means again uses the energy loss: you compute F(y) for a particular y, which means you take a y and find the prototype that is closest to it; you have to do this minimization of E over z to get F(y), which is the inference process. Then you do this for all the training samples, and you minimize the average. You can do this with stochastic gradient descent if you want, and it actually works pretty well if you have a large training set, but you can also do it directly: figure out which prototype every point is closest to, and then recompute every prototype as the average of all the points assigned to it. Of course, that's going to change the assignments, so you do it again, and you keep doing it until the process converges. That's k-means. But essentially what it is, is an energy function of this type with a latent variable constrained to be one-hot, an inference process by which you compute the latent variable that minimizes this energy (this is the formula we've seen in the last two weeks), and then a training procedure that just minimizes the average energy of the training samples. You don't need to push up on anything, you don't need anything contrastive, because the volume of stuff that can take low energy is limited by the number of prototypes you have.

This is visualized here: this is k-means after training on this dataset of samples drawn from the spiral in 2D. What happens is that the prototypes spread themselves around the spiral, more or less equidistant from each other, and what you see is that around each prototype there is a little dip in the energy, because at a prototype the energy is equal to zero: when y is equal to one of the prototypes, there is obviously one z that selects the correct column so that the distance is zero. As you move y away from a prototype the energy grows quadratically, as if it were a quadratic bowl, and then if you get closer to another prototype it starts going down again. So the energy in the end is the minimum of a bunch of quadratic bowls centered on the prototypes, and that's exactly what you're seeing here: a little dip in the energy, equal to zero, at every location where there's a prototype, and in between two prototypes you have a ridge, which is where those two quadratic bowls intersect and one becomes smaller than the other, because the energy is the minimum of a bunch of quadratic energies. You also have a ridge here, because this is where you start getting closer to the top of the spiral rather than the bottom, so the center of the spiral becomes a hard ridge. So this is why k-means works, this is why k-means doesn't need a contrastive phase: because the volume of stuff that can take low energy is limited by the number of prototypes k, the dimension of z. Any questions at this point?
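Here is a minimal sketch of k-means written exactly in these energy-based terms: inference picks the one-hot z, i.e. the closest column of W, and training pushes down the average free energy by recomputing prototypes as means of their assigned points. The number of prototypes, iteration count, and spiral data are illustrative assumptions.

```python
import numpy as np

def kmeans_energy(Y, W):
    """E(y, z) = ||y - W z||^2 with z one-hot; inference = pick the closest column of W."""
    d2 = ((Y[:, None, :] - W.T[None, :, :]) ** 2).sum(-1)   # (n, k) squared distances
    assign = d2.argmin(axis=1)                               # index of the chosen column of W
    return d2[np.arange(len(Y)), assign], assign             # free energy F(y) per sample

def kmeans_train(Y, k=20, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    W = Y[rng.choice(len(Y), k, replace=False)].T             # columns of W are the prototypes
    for _ in range(iters):
        _, assign = kmeans_energy(Y, W)                       # inference: minimize E over z
        for j in range(k):                                    # training: minimize the average F(y)
            pts = Y[assign == j]
            if len(pts):                                      # recompute prototype as the mean
                W[:, j] = pts.mean(axis=0)
    return W

# usage with 2D points sampled around a spiral
t = np.random.rand(3000) * 3 * np.pi
Y = np.stack([t * np.cos(t), t * np.sin(t)], axis=1) / 10
W = kmeans_train(Y)
F, _ = kmeans_energy(Y, W)
print("average free energy:", F.mean())
```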
Okay, so one point I want to make here is that if z is discrete, that is a good way of limiting the information content of the latent variable, and therefore the volume of stuff that can take low energy, and it doesn't matter whether your decoder is linear or not. You can choose a complicated decoder, a multilayer neural net of some kind; if you discretize the latent variable, necessarily the volume of stuff that can take low energy is limited to k points, where k is the number of different values the latent variable can take. So that's a good way of avoiding contrastive learning. The problem is that you have to decide what k is, and that may require actually trying multiple values.

Gaussian mixture models. This may be a little difficult to follow for those of you who haven't seen, or don't remember, what a Gaussian mixture model is, because it's expressed here in a slightly complicated way; maybe I'll draw a picture separately. The Gaussian mixture model is very similar to k-means, except that instead of having quadratic bowls that all have the same shape, you allow those bowls to take whatever shape they want, as long as it's a quadratic form: basically you allow the bowls to be elongated in certain directions and not in others. In a situation like this, that would actually be advantageous, because the Gaussians can be elongated along the direction of the data manifold, and you could probably get away with modeling this data with fewer prototypes, while sticking more closely to the manifold. I didn't actually have an energy profile for the mixture of Gaussians, so I reproduced the k-means one here; this is not the mixture of Gaussians, this is k-means. The mixture of Gaussians would actually look nicer.

So what is the energy for a mixture of Gaussians? Basically, you take a data point and compute the distance of that data point to a mean. The mean is one column of W, and again you have one of those latent variables z that is a one-hot vector, which selects which mean, which component of this mixture of Gaussians, y is assumed to come from. So imagine z is a one-hot vector: z selects one column of W, and that column of W is interpreted as the center of a Gaussian. We compute the difference between the data vector and the center of that Gaussian, and then we compute a squared distance, if you want, but a distance that is warped in some directions through a matrix. Think of the product of M and z as a matrix (M is actually a tensor, and M times z is a matrix): a symmetric positive semi-definite matrix which is basically the inverse covariance matrix of the Gaussian, and the quadratic form computes the warped distance between y and the mean you chose. The entry ij of that matrix is a sum over the dimensions of z of a tensor: [Mz]_ij = Σ_k M_ijk z_k. Since z is a one-hot vector, you can think of z_k as selecting a slice of that tensor (to be more efficient I should have put the index k first, but that's okay). So think of M as a three-dimensional tensor that is a stack of inverse covariance matrices, if you want, and the z vector just selects one of them. So z selects the mean, and it also selects the inverse covariance matrix, and then you compute the energy. That's E(y, z), and the overall energy of a mixture of Gaussians is a marginalization over the z variable: we're not minimizing anymore, we're marginalizing. We compute F(y) = −(1/β) log Σ_z e^{−β E(y,z)}, where the sum runs over all the one-hot vectors z (there's a missing minus sign on the slide). You can set β to one if you want, or something else; it doesn't really matter, it's a little arbitrary, but I'm using this formula because it's the one we've used in the past. That's the energy of a Gaussian mixture model. This may be a very different formulation from the one you've seen, but it's equivalent.

Now, how do you train a Gaussian mixture model? It's trained with all kinds of tricks using what's called EM, expectation maximization. You could train it with gradient descent; it doesn't work very well, I mean it works, it gets you to a local minimum, but it gets stuck in local minima and it's relatively slow, so people prefer EM, which I'm not going to go into here. But essentially the loss function you're minimizing is just the average energy of the points, which is basically the negative log-likelihood that your Gaussian mixture model gives to every data point, on average. That's equivalent to maximizing the likelihood the model gives to the data points, the product of those over data points assuming the data points are independent, so it's doing maximum likelihood.

Now, why is it an architectural model? Because you constrain this M matrix so that the probability distribution for each Gaussian is normalized, and so you don't have a problem. Actually, I realize I did forget something: there's a set of coefficients in the weighted sum that I left out, which also have to be normalized. You have to guarantee that this covariance matrix really is a covariance matrix for a Gaussian, which essentially means its determinant is fixed: if you change the covariance matrix, you have to rescale it so that its determinant equals a constant. It doesn't matter what the constant is; if you're a probabilist you insist that it be a particular value, but for an energy-based model you don't care, as long as it's constant. So you have to maintain this normalization constraint: the determinant of M has to be constant regardless of how you change it. And I lied a little bit, as I said: I forgot the mixture coefficients in this formula, so this is kind of a weird mixture model where all the components have the same weight; I'll correct this in the slides later. You would have coefficients here that compute a weighted sum of the exponentiated energies, and there is a minus sign here as well that I forgot.
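To make this formulation concrete, here is a minimal sketch of the free energy F(y) = −(1/β) log Σ_z exp(−β E(y, z)) for a mixture of Gaussians, where z ranges over the one-hot vectors, E(y, z) is the quadratic form with the selected mean and inverse covariance, and, as in the simplified formula on the slide, all mixture weights are taken equal.

```python
import numpy as np
from scipy.special import logsumexp

def gmm_free_energy(y, means, inv_covs, beta=1.0):
    """F(y) = -(1/beta) * log sum_z exp(-beta * E(y, z)).

    means:    (k, d) array, one mean (column of W) per component.
    inv_covs: (k, d, d) array, one inverse covariance (slice of the tensor M)
              per component; z, being one-hot, selects one slice.
    Mixture weights are taken equal, as in the lecture's simplified formula.
    """
    diffs = y - means                                   # (k, d) differences y - mu_j
    # E(y, z_j) = (y - mu_j)^T M_j (y - mu_j) for each one-hot z_j
    energies = np.einsum('kd,kde,ke->k', diffs, inv_covs, diffs)
    return -logsumexp(-beta * energies) / beta

# usage: two elongated Gaussian components in 2D
means = np.array([[0.0, 0.0], [3.0, 1.0]])
inv_covs = np.array([np.diag([1.0, 10.0]), np.diag([10.0, 1.0])])
print(gmm_free_energy(np.array([0.5, 0.1]), means, inv_covs))
```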
Okay, so that's it for architectural methods; now let's talk about regularized energy-based models. The idea of regularized energy-based models is that instead of constraining the volume of low-energy space to be less than something, you have a regularization term in your energy that makes you pay for making this volume large. In that sense it's regularized. The question is how we do this; why we do it should be pretty clear by now, but I'm going to go over it again. Imagine we have this model where we're trying to make a prediction: we observe an x and we're trying to predict y. x goes through some neural net which produces a representation of x in some form, and then we combine this representation of the observation with a latent variable to make a prediction. As we vary the latent variable, the prediction varies over a set. So if z varies over, say, a uniform distribution over some low-dimensional space, then, even though the decoder here is a neural net and could be very complicated, we're going to get a surface of y-bar that has at most the same intrinsic dimension as the space we vary z over. If z varies over a two-dimensional space, y-bar can at best vary over a two-dimensional space, and maybe less, because the decoding function could be degenerate, not injective, and conflate multiple values of z into the same point. So clearly, if you limit the dimension of z, that limits the dimension of the space that can take low energy. If I give you a y, I find the z that minimizes the distance between y-bar and y, by minimization; if z has a limited range, it could very well be that there is no z that produces a y-bar equal to y. But now imagine for a second that z has the same dimension as y. If that's the case, and if the decoder is not degenerate, then for any y I feed the system there is going to be a z that produces a y-bar exactly equal to y. What that means is that my energy surface is now completely flat: it's zero everywhere, because whatever y I feed my system, there is a z, obtained by minimization, by inference, that produces zero energy. So F(y) is always zero in that case.

So the regularized latent variable model idea says: I'm going to put a regularizer on z in such a way that the system will want to use only a subset of the possible values of z. I'm not going to decide a priori that z is two- or three-dimensional or whatever, I'm not going to decide it's discrete; I'm just going to come up with some regularization function R(z) that makes me pay a price, in terms of energy, for choosing a z outside of a limited range, if you want. So you replace a constraint by a penalty, essentially; that's what it means. This is a pretty generic architecture for a conditional energy-based model, and the name of the game here is: how do we limit the information capacity of the latent variable, so that the volume of stuff that can take low energy is automatically limited, so that when you push down on the energy of good points, the energy of other points necessarily stays high, because the volume is limited or minimized. The idea is that by having this regularizer you minimize the volume of space that z can occupy, and therefore you minimize the volume of space that can take low energy. Of course, when you train the system to give low energy to your data points, at least those points will have low energy, but everything else will have higher energy; so basically you're shrink-wrapping the low-energy region around the manifold of data, if you will. That's the idea: add a regularizer to the z variable that makes you pay for going outside of a particular domain. Then the question is what we put there, and depending on what we put there, how is it going to work, and is it going to be compatible with the rest of what we're doing?

So, I talked about effective dimension: one thing we could do is have multiple z's of increasing dimension and then an R(z) that selects the one with the smallest dimension, but that's a bit like searching over hyperparameters. Quantization or discretization: we talked about this, but the problem there is that you have to choose how many different values the system can take. The Bayesians handle this with things like Dirichlet allocation, where you have a potentially infinite number of different values for z, but they become increasingly costly as you add them, so it has a tendency to minimize the number of components that are required; in the context of Bayesian models it's called LDA, latent Dirichlet allocation (Dirichlet, if you pronounce it the French way, although he was actually German).

Here is an interesting one that we're going to talk about a little: the L0 norm. What is the L0 norm of a vector? It's the number of components of that vector that are non-zero. So what you do here is count how many components of z are non-zero, and in your energy minimization you figure out: what z can I use that has the minimum number of non-zero components while minimizing my reconstruction error? That would be the inference process: given a y, find a z, which can be any vector, but you pay a price proportional to how many of its components are non-zero. The problem with this is that it's not a differentiable criterion, so it's hard to optimize; there are approximate methods for this kind of L0 minimization, that is, minimizing a squared reconstruction error while minimizing the number of non-zero components. One of them is called projection pursuit; it's basically a greedy algorithm: you figure out the best first component to make non-zero, by going through every single component and asking what value you should give it so that the reconstruction error is minimized; you pick that one, then you select a second component that further reduces the reconstruction error, and you keep going. That's called projection pursuit when your decoder is linear; if your decoder is nonlinear it still works, but it's not called projection pursuit anymore; basis pursuit it would be called, I guess.

Now, the regularizer that is particularly interesting is the L1 norm of the z vector: you compute the sum of the absolute values of the components of z. Take your vector z, compute the absolute value of every component, sum them, and minimize that; R(z) would be that. This is called sparse coding,
and I'll come back to it in a second. The nice thing about this, contrary to L0, is that it's convex; it's an almost-everywhere differentiable function (not at zero, but differentiable otherwise). The second advantage is that it's an approximation of L0, in the sense that when you minimize the L1 norm of a vector, the minimization wants to make as many components of z equal to zero as possible. So it produces a sparse vector, a vector where a number of the components, perhaps most of them, are zero. Linear reconstruction with L1 regularization on the latent variable is called sparse coding, and learning the linear decoder is called sparse modeling.

Before I explain this in more detail, there's another technique, which consists in limiting the information content of z by adding noise to it. Basically you say: variable z, you are limited to a certain volume; I forbid you to go outside of a given sphere, for example. So you pick z within a sphere, but then whenever you choose a z vector you add some random noise to it, so a z vector is not really a point, it's a fuzzy little sphere. The effect of this is that, because every z vector is a fuzzy sphere, the number of those fuzzy spheres you can pack into the big sphere you constrain z to lie in basically determines the information capacity of z. Adding noise to a vector limits the information that the vector carries; that's kind of intuitive. If I speak while covering the microphone, you can barely hear what I'm saying, because I'm adding noise, which is very distracting, to my voice, so you have a hard time understanding me, and there is less information carried in my voice when I add noise to it. It's the same idea here: you limit the information carried by the z vector by adding random noise to it, turning z into a little sphere. This is the idea used by variational autoencoders.

One of the most interesting research avenues today, in my opinion, which is something I work on, and a lot of people working with me at NYU and at Facebook work on, is: what is the best way of limiting the information content of a latent variable in a predictive model of this type? I don't think this is a solved problem; I don't think we have a wonderful solution for all situations. We have solutions that work in special cases, but it's one of the problems we really need to solve if we want to make progress in things like self-supervised learning and training predictive models of video, and there are a lot of papers on topics of this type. There is a question: I thought we were modeling the latent distribution and its posterior; how is this related to putting a norm on the latent variable? In the VAE, aren't we restricting and modeling the latent distribution? I'll come to this; that's the next hour, so hold your breath for a bit. I'll explain it when we get to two things: amortized inference for z, and variational autoencoders. (If you hold your breath for one hour it's not going to end well, I think; your virtual breath.) Okay, let's keep going.

So this is an unconditional regularized latent variable energy-based model: there's no x, again we're only modeling y, there's no conditioning. The energy function looks very much like k-means: it's the squared Euclidean distance between the data point and a matrix multiplied by the latent vector, except the latent vector is not constrained to be one-hot; it's a free vector. But there is an additional term, and this additional term is a coefficient times the L1 norm of z, which I write with just single bars: the sum of the absolute values of the components of z. Architecturally it looks like this: you have a y vector, you compute the squared distance between this y vector and whatever your model reconstructs; the decoder happens to be linear, just a matrix multiplied by the latent vector; and the latent vector also carries an energy term, which is its L1 norm. That's your energy model. The generic model is just a decoding function (it could be nonlinear, a neural net with multiple layers or whatever), a regularizer on the latent variable, and the reconstruction error; but in classical sparse coding the decoder is linear, the regularizer is the L1 norm, and the reconstruction error is the squared Euclidean distance: E(y, z) = ||y − Wz||² + λ|z|₁.

How does that work? This limits the volume of stuff that can take low energy, but you still need to do inference: for any particular data point y, you need to find the z that minimizes the sum of those two terms. This is generally done by an algorithm that is gradient-based but is not gradient descent, and definitely not stochastic gradient descent, because the objective is not an average over lots of terms, it's just one term. There's an algorithm for this called ISTA, the iterative shrinkage-thresholding algorithm. It's a general method for minimizing functions like this one, which are the sum of a differentiable quadratic term and another term that is non-smooth. What you do is alternate: take a gradient step on the quadratic term (that's a simple step; you can compute its derivative with respect to z very easily), and then take the other term into account by shrinking all the components of z toward zero by a constant; components that are already closer to zero than that constant get set to zero. You can prove that by alternating those two steps, a gradient step on the quadratic term followed by a shrinkage of the components of z toward zero, the system eventually converges to a minimum of the energy with respect to z for a given y. So that's how you compute F: you minimize E with respect to z, as we always do, but with this particular procedure, which is efficient but still quite slow if z (or y) is large.

What this allows you to do is use a z that has a larger dimension than y, because in the end, thanks to the L1 term, the number of non-zero components of z that remain after the minimization can be relatively small. As I explained earlier, when you minimize the L1 norm it wants to make a lot of components zero; that's a good way of minimizing the L1 norm.
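Here is a minimal sketch of ISTA-style inference for this sparse coding energy E(y, z) = ||y − Wz||² + λ|z|₁; the step size, the number of iterations, and the soft-thresholding form of the shrinkage are standard choices I'm assuming rather than details given in the lecture.

```python
import numpy as np

def ista_infer(y, W, lam=0.1, n_steps=200):
    """Find z minimizing ||y - W z||^2 + lam * ||z||_1 (inference for one sample y)."""
    # step size from the Lipschitz constant of the quadratic term's gradient
    step = 1.0 / (2 * np.linalg.norm(W, 2) ** 2)
    z = np.zeros(W.shape[1])
    for _ in range(n_steps):
        grad = -2 * W.T @ (y - W @ z)                  # gradient of the quadratic term
        z = z - step * grad                            # gradient step
        # shrinkage: move every component toward zero by lam*step, clamping at zero
        z = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)
    return z

def free_energy(y, W, lam=0.1):
    z = ista_infer(y, W, lam)
    return ((y - W @ z) ** 2).sum() + lam * np.abs(z).sum()

# usage: overcomplete dictionary, 2D data, 8 dictionary elements
W = np.random.randn(2, 8)
print(free_energy(np.array([0.3, -0.7]), W))
```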
So this is what some form of sparse coding will give you. The energy function you see in this little example I showed earlier, if you use sparse coding, looks kind of funny: there are straight lines that roughly follow the manifold here, and maybe a smaller one here, and that happens for the following reason. Imagine you only allow one component of z to be non-zero: lambda is cranked up so that only one component remains non-zero, which you have to do in the case of a 2D problem, because if you let z be a free two-dimensional vector then everything will have low energy, and that's not going to work. So you crank up lambda so that in the end only one component of z stays on. It starts looking a bit like k-means, but there is a difference: it's not a one-hot vector, so the component of z that is non-zero can vary. When you vary that non-zero component of z, you're varying the reconstruction along the direction of the corresponding column of W. (You have to play a few tricks for this figure to come out, which I'm not going to go into.) So that's what happens here: you have a low-energy line which corresponds to all the values that that component of z can take, and the direction of that line is given by the column of W. The detail I swept under the rug is that I also need a constant, another component of z that is non-zero and another component of y that I predict, so that I can represent lines that don't go through the origin and that conform to the data; but let's ignore this for a moment. The point remains that as you vary the non-zero component of z, you sweep out points along a line in the direction of a column of W, which means there is going to be a whole line here that can take low energy: if I take a point along the line corresponding to one column of W, it's going to have low energy. Now, if I pick another component, I get another line, and another component gives another line, and so on. What's cool is that I can learn different columns of W; they're called dictionary elements, and this is called dictionary learning in some communities. What happens then is that the overall energy can have regions, basically low-dimensional linear subspaces, that have low energy, and the dimension of those subspaces is the number of components of z that are non-zero. If three components of z are non-zero, I can vary those three components however I want and span a three-dimensional linear subspace within the ambient space, and all of those points will have low energy; then I can choose another set of three components and get another linear subspace; or I can choose only two components, and now the subspace that has low energy is two-dimensional. My overall energy is basically the minimum over all of those. So that's sparse coding: whereas k-means approximates a data point by a single prototype and tells you the energy is the distance to that prototype, sparse coding says: I'm approximating any point by a linear subspace, and my energy is the squared distance of the point to its projection onto that linear subspace, but I have a bunch of linear subspaces to choose from. In that sense it's very different from PCA, and you see them here.

There is one very important detail that I need to mention, which is actually noted here: decoder normalization; and this is how we train the model. We train it with a non-contrastive method, but we have a constraint to satisfy. Non-contrastive means we take a data point y, we find the z that minimizes the energy (that's inference), and then we take a gradient step with respect to W to make the reconstruction term smaller. So we change W to make that term smaller: basically we move the plane, the linear subspace corresponding to the non-zero components of z, a little closer to our data point. If I had a data point here, I would pick this plane as the closest plane and move it a little bit toward the point, by taking a gradient step of this term with respect to W. That sounds very simple. It doesn't work. It doesn't work because the system cheats: it makes z very, very short, because that's a good way of minimizing the L1 term, and to compensate it makes W very, very large. If I multiply W by two and divide z by two, I get the same reconstruction error, right? So the system can cheat by making W very large, which lets it make the L1 norm of z very small by shrinking all the components of z. That's not a good solution; it's a degenerate solution where all the z's go to zero. What do we do to prevent this? We could put constraints on z to keep it large, but that's complicated and difficult to make work because it leads to non-convex loss functions. So what people do is limit the norm of W: you say, W, your columns, which are basically those prototypes, are constrained to have norm less than a constant, say one.

So the way you train this is: take a data point, find the z that minimizes the energy, the squared reconstruction error plus the L1 norm, using the ISTA algorithm: alternate multiple times between a gradient step on the quadratic term with respect to z and shrinking the components of z, repeating until z stabilizes. Now you have the optimal z for your y, so you have F(y), essentially. Then you minimize F(y) with respect to W with a gradient step, which means you compute the gradient of the reconstruction term, for the z you just computed, with respect to W, and take a stochastic gradient step; it's very simple because it's a quadratic term. The last step is to go over every column of W, and if a column has a norm larger than one, you normalize it back to one. The columns that are shorter than one you don't need to worry about, although some methods actually normalize everything to one; sometimes it's more stable if you do that.
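Putting the pieces together, here is a minimal sketch of the full sparse coding / dictionary learning loop as just described: ISTA inference, a gradient step on the reconstruction term with respect to W, then renormalizing the columns of W. It reuses the ista_infer function from the earlier sketch, and the learning rate, number of atoms, and iteration counts are illustrative assumptions.

```python
import numpy as np
# assumes ista_infer from the earlier sketch is in scope

def train_sparse_coding(Y, n_atoms=64, lam=0.1, lr=0.05, epochs=10):
    """Dictionary learning: minimize the average of ||y - W z*||^2 + lam*||z*||_1
    over W, where z* is obtained by ISTA inference for each sample."""
    d = Y.shape[1]
    W = np.random.randn(d, n_atoms)
    W /= np.linalg.norm(W, axis=0, keepdims=True)      # start with unit-norm columns
    for _ in range(epochs):
        for y in Y:
            z = ista_infer(y, W, lam)                  # inference: find the optimal sparse code
            r = y - W @ z                              # reconstruction error for that code
            W += lr * np.outer(r, z)                   # gradient step on ||y - Wz||^2 w.r.t. W
            # decoder normalization: columns with norm > 1 are scaled back to 1,
            # which prevents the degenerate "huge W, tiny z" solution
            norms = np.linalg.norm(W, axis=0)
            W[:, norms > 1.0] /= norms[norms > 1.0]
    return W

# usage on random data standing in for image patches
Y = np.random.randn(500, 16)
W = train_sparse_coding(Y)
```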
So how does this work? If you train a system like this on MNIST, the y's are the MNIST digits, w is a large matrix, z is a large vector, and you have the L1 norm on z. Every square in this figure is a column of w, so the number of such squares is the dimension of z; I'm embarrassed to say I can't remember exactly what it was, but I think it's 200: you have 10 rows and I think 20 columns, so 200. So z is a vector with 200 components, which is smaller than the dimension of y, which is 784, but in fact the intrinsic dimension of MNIST is much smaller than 784, because digits are fairly simple, essentially binary shapes. You train the system and then you look at the columns of w, and what you get are these: white, a bright color, indicates a large positive value, gray indicates zero, and dark indicates a large negative value. What you see is that the columns of w are basically little pieces of stroke. Any digit can be reconstructed as a weighted sum of those vectors, the columns of w, where the weighted sum has a small number of non-zero components, and those are the components of z. So z is sparse, and any digit can be reconstructed as a linear combination of a small number of those columns. Naturally, the system learns strokes, because strokes are the elementary objects from which you can build a character: you need strokes of different widths at different locations, but it's approximate enough that it works. That was MNIST; what if you apply this to natural image patches? You take ImageNet or something like it, you take little patches of, say, 12 by 12 pixels, you apply this sparse coding to them, and you watch how the columns of w change as the learning algorithm runs. You get something like this: 256 basis functions, so the dimension of z is 256, and the dimension of y is 144, which is 12 by 12. Here we run sparse coding; it's actually a slightly different version that uses something called amortized inference, but that doesn't matter for what we're talking about here. You start from a random initialization of the matrix, and as learning takes place you get better and better basis functions. What you realize is that the best way to reconstruct a natural image patch as a combination of a small number of vectors, or templates, is to make those templates look like strokes, like edges. Basically any image patch can be reconstructed as a linear combination of a small number of those guys: these ones give you the overall illumination or gradients, and if an edge goes through the patch, those ones give you the edge. You get edges of various lengths and spatial frequencies at every location and orientation, if you have enough basis functions. This is cool, because it's pretty much what you would want if you want the system to learn good features for images.
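As a hypothetical usage of the sketch above, in the spirit of the patch experiment just described (12 by 12 patches, 256 basis functions): here `images` is assumed to be a list or tensor of grayscale images, and `dictionary_step` is the function from the previous sketch; both are assumptions, not artifacts from the lecture.

```python
import torch

patch, n_atoms = 12, 256
W = torch.randn(patch * patch, n_atoms)
W = W / W.norm(dim=0)                                     # start from random unit-norm columns

for _ in range(100_000):
    img = images[torch.randint(len(images), (1,)).item()]
    i = torch.randint(img.shape[0] - patch, (1,)).item()
    j = torch.randint(img.shape[1] - patch, (1,)).item()
    y = img[i:i + patch, j:j + patch].reshape(-1)
    y = y - y.mean()                                      # remove the mean of the patch
    W = dictionary_step(y, W)                             # columns drift toward oriented, edge-like filters
```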
This original idea, the sparse coding learning algorithm with the L2 reconstruction term, the L1 regularizer, and the constraint on the norm of w, actually came out of theoretical neuroscience. It was proposed in the late 90s by two neuroscientists, Bruno Olshausen and David Field; the paper is from 1997 if I remember correctly, but they had a series of papers, one at NIPS and one in Nature or something like that. It changed a lot of people's minds about whether we can learn low-level features for images in an unsupervised fashion: this is an unsupervised learning algorithm, a very simple one, and what it appears to learn are image features very similar to what we see in the brain, in V1 for example. Now you might object: are these really features? They're not features in the sense of something computed from an input; they are, instead, basis functions through which you reconstruct the input from a representation. That doesn't sound like something the brain would do; the brain doesn't reconstruct, it extracts features, going from low layers to high layers, not the other way around. But in fact, if you add an encoder to that system, something that predicts z from y, that encoder would be a feature extractor, and we're going to come to this in just a second; in fact, that is amortized inference. So what is amortized inference? Amortized inference says: I don't want to run this optimization algorithm every time to compute the optimal z that minimizes the reconstruction error plus the regularizer; why don't I train a neural net to predict this optimal z from y directly? It could be that my decoder is linear and my regularizer is L1, so I'm doing sparse coding, and of course the function that maps a y to the optimal z, the one that minimizes the energy with respect to z, is a very complicated nonlinear function, but why not learn that function with a neural net? If I can train this neural net successfully, I can just take a y, run it through the neural net, and get the z that would be roughly the best z to linearly reconstruct my input. This is called amortized inference: amortized because you amortize the cost of running an inference algorithm by training a neural net to do most of the job for you; there are several names for it, but that's the idea. Now you look at this architecture and you say, well, that looks very much like an autoencoder. It is an autoencoder: you take a y, run it through the encoder, get some representation, run that representation through the decoder, and compute the reconstruction error. But it's a regularized autoencoder, and a funny kind of autoencoder in the form I've drawn here, because it has an extra cost function that measures the difference between what comes out of the encoder and the value of the latent variable. So let's think of this as an energy-based model where the energy is the sum of three terms. The first is the reconstruction error, a cost function or divergence of some kind that measures the discrepancy between the data point and the output of the decoder applied to the latent variable; we've seen this before. The second term is the prediction error, a distance of some kind between the output of the encoder and the value of the latent variable. And the last term is the regularizer.
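Written out, the three-term energy just described looks something like this (my notation, not verbatim from the slide): Dec is the decoder, Enc the encoder, C and D are divergences of some kind, and R is the regularizer.

```latex
E(y, z) = \underbrace{C\big(y,\ \mathrm{Dec}(z)\big)}_{\text{reconstruction error}}
        + \underbrace{D\big(z,\ \mathrm{Enc}(y)\big)}_{\text{prediction error}}
        + \underbrace{\lambda\, R(z)}_{\text{regularizer}}
```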
So I have an energy model with those three terms. I compute the minimum of the energy with respect to z, and I get an optimal z, the one that minimizes the sum of those three terms. What do I do with that z? Now that I have it, I can run it through the decoder, get a y-bar, get a reconstruction error, backpropagate that error through the decoder, and change the parameters of the decoder so the error is minimized. If the decoder is linear, as we saw, I need to normalize it so the system doesn't cheat by making the decoder weights very large. If the decoder is a neural net, I need to do a similar kind of normalization, but I don't know how to do it; many people have tried things like this and it doesn't seem to have been very successful so far, so it's not entirely clear yet how to do this with a nonlinear decoder. Then, once you have z, you can use it as a target to train the encoder, by minimizing the prediction term in the energy: you take a step in the negative gradient of that term with respect to the weights of the encoder, and the encoder has to be multilayer and nonlinear, otherwise it doesn't work very well. So again, the loss function is basically just the energy, the average energy over training samples, and you can afford to do this because you limit the information content of the latent representation through the regularizer. Because of that, you're guaranteed that your system will not give low energy to the entire space; you don't need a contrastive phase, because the regularizer shrink-wraps your data in a region of low energy. That's if you minimize with respect to z; you can also marginalize with respect to z, with the formula we've been carrying around, the negative log of the integral of the exponential of minus the energy, and that's the basis for variational autoencoders, but I'm not going to talk about that yet. So what architecture should we give this encoder? We know it has to be nonlinear, but what architecture? Let me talk about the ISTA algorithm, which is written here. If you have a sparse coding system whose dictionary matrix is Wd, this entire term here is the gradient of the squared reconstruction error with respect to z: the reconstruction error is the squared error between y and the Wd matrix applied to z, you differentiate that squared error with respect to z, you get an error term multiplied by the transpose of Wd that pops out of the equation, and the whole thing works out. So this is just the gradient of the squared reconstruction error with respect to z, and this here is a constant; think of it as a step size. So this is basically a gradient step, where you modify z with a learning rate equal to one over L; L is a constant that in some versions of ISTA is computed in a particular way, but think of one over L as a learning rate, a step size for your gradient algorithm. You take that gradient step, and then you shrink all the components of the resulting vector, the new version of z, towards zero.
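In symbols, one ISTA step as just described is, with Wd the dictionary, 1/L the step size, and lambda the L1 coefficient (this is the standard form, reconstructed from the description above):

```latex
z_{t+1} = \mathrm{shrink}_{\lambda/L}\!\Big( z_t - \tfrac{1}{L}\, W_d^\top (W_d z_t - y) \Big),
\qquad \mathrm{shrink}_{\eta}(u)_i = \mathrm{sign}(u_i)\,\max\big(|u_i| - \eta,\ 0\big)
```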
If a component is already closer to zero than the amount by which you would shrink it, you just set it to zero; you don't overshoot. So you can think of this as a sort of recurrent network. You take y and multiply it by a matrix We, equal to one over L times Wd transpose; you start from z equal to zero; and because Wd transpose applied to y is always multiplied by the same matrix, you can pre-multiply it once. Then you shrink, you get a new z, you multiply that z by a square matrix that I'm going to call S, you add the result to the pre-multiplied input (or subtract; you can just fold the sign into the definition of S), and you iterate. So it's basically this algorithm: z at time t plus one equals the shrinkage of We times y plus S times z at time t, with the definitions We equal to one over L times Wd transpose, and S equal to the identity minus one over L times Wd transpose Wd. That's this recurrent net, and if I set those two matrices to those values, the algorithm I run by going around this recurrent net is the ISTA algorithm, a known algorithm that converges to the optimal sparse z for a given y. So what am I going to do? I'm going to declare that this is actually a recurrent net and that those two matrices are just parameters, but the trick is that I'm only going to let this neural net run for a fixed, small number of iterations around the loop, say four. With a fixed number of iterations I get some approximate solution of the sparse coding problem, and then I cheat: I train the system, adjusting those two matrices, so that the solution I get is a better approximation than if I had just run ISTA for four iterations. And it actually works. This idea goes back about ten years, to a postdoc of mine, Karol Gregor, and it was picked up by a bunch of other people for various applications. So you take this recurrent net, unroll it a few times (here it's only unrolled twice, but you have to unroll it multiple times), and train it with backpropagation through time; it's as simple as that. What you get is an encoder that predicts a z pretty close to the solution you would get by running to convergence. In fact, this chart, from that paper from ten years ago, plots the reconstruction error as a function of the number of steps you run through the recurrent net. The method is called LISTA, for Learned ISTA, learned iterative shrinkage and thresholding algorithm, because we learn those matrices instead of plugging in the predetermined ones. If you train this neural net with, say, five iterations to minimize the reconstruction error while being sparse, you get an estimate of the sparse vector with a reconstruction error that's pretty small; if you run ISTA, or the fast version, FISTA, for the same number of iterations, you get a reconstruction error that's much, much higher. So there seems to be some magic in learning this encoder: you get a better approximation of the sparse vector from the learned recurrent net than from the best known fast algorithm for solving the problem.
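Here is a minimal sketch of that idea, a LISTA-style encoder: unroll the recurrence for a fixed number of steps and let backpropagation through time adjust We, S, and the shrinkage threshold. The initialization follows the ISTA correspondence above; the training target z_star (a code obtained by running ISTA to convergence) and the hyperparameters are assumptions for the example, not details from the lecture.

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    def __init__(self, W_d, lam=0.1, n_steps=4):
        super().__init__()
        L = torch.linalg.norm(W_d, 2) ** 2                                 # same step-size constant as ISTA
        self.W_e = nn.Parameter(W_d.T / L)                                 # W_e = (1/L) W_d^T
        self.S = nn.Parameter(torch.eye(W_d.shape[1]) - W_d.T @ W_d / L)   # S = I - (1/L) W_d^T W_d
        self.thresh = nn.Parameter(torch.tensor(float(lam / L)))           # learned shrinkage threshold
        self.n_steps = n_steps

    def forward(self, y):
        b = y @ self.W_e.T                                                 # precompute W_e y once
        z = torch.zeros_like(b)
        for _ in range(self.n_steps):                                      # only a few unrolled iterations
            u = b + z @ self.S.T
            z = torch.sign(u) * torch.relu(u.abs() - self.thresh)          # soft shrinkage toward zero
        return z

# Training sketch: push the truncated net toward codes z_star obtained by running
# ISTA to convergence (or directly minimize the sparse-coding energy of its output):
#   loss = ((lista(y) - z_star) ** 2).sum(); loss.backward(); optimizer.step()
```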
That sounds like impossible magic: how can a trained neural net approximately solve an optimization problem better than the best known algorithm designed to solve that problem? The answer is that when you train the system, you train it to solve the problem for a particular type of data. We trained this on natural image patches, to find sparse vectors that represent natural image patches; we don't train it to work for any random vector, only for the kind of data we're interested in: natural image patches, handwritten digits, audio signals, whatever it is. For that particular type of data, it's more efficient than the general algorithm, FISTA, which is designed to work in all cases and will work for random vectors, for random data. So we exploit the fact that we know the type of data we're going to work with. That's the beauty of amortized inference: you train a neural net to solve an optimization problem approximately, and very often, because you're interested in solving that problem in particular situations, the result you get out of this neural net is faster and of better quality than running the best known optimization algorithm, because the net gets specialized to the situations you care about. It's a very general concept, this idea of amortized inference, and you want to keep it in mind, because it's starting to be used in all kinds of situations where you would normally run an optimization algorithm but instead you train a neural net to give you an approximate solution. More generally it could be called amortized optimization rather than amortized inference. A question from the audience: can you go over the shrinkage once more, how do we perform it? So, the shrinkage function has this shape: you take one component zi and pass it through this function, which subtracts a constant from positive values, adds a constant to negative values, and if the argument falls within those two shrinkage values, sets it to zero. This is the shrinkage value right here, which for LISTA is lambda over L, where lambda is the constant multiplying the L1 norm of z and L is the inverse learning rate used in the gradient step. You can think of it as a gradient step on the L1 term with respect to z, except that you don't overshoot: if the value is already smaller than lambda over L, you set it to zero rather than crossing to the other side. So it's a funny kind of clipped gradient, but essentially you're alternating gradient steps on the L2 reconstruction term and on the L1 regularization term. Is that clear? That function exists in PyTorch, by the way, the shrinkage function.
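For reference, the PyTorch version of that shrinkage is the soft-shrinkage function: values inside the band from minus lambd to plus lambd are set to zero, and values outside it are moved toward zero by lambd.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.3, 0.0, 0.3, 2.0])
print(F.softshrink(x, lambd=0.5))
# tensor([-1.5000,  0.0000,  0.0000,  0.0000,  1.5000])
```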
Now here's something really cool: the convolutional version of this. We're still going to use a linear decoder, but the linear decoder is going to be a convolutional layer. Basically, we think of y as an image, and we reconstruct this image as a sum of feature maps zk, each convolved with its own convolution kernel and then summed up; that's the reconstruction. So you take a bunch of feature maps and run them through a convolutional layer with a single output feature map, which is your y essentially, and it's linear; this is nothing more than a convolutional layer. Our energy is going to be the squared distance between the y image and this weighted sum, basically the output of the convolutional layer, and again we're going to have an L1 norm on z. This at the bottom is the convolutional operation, where you convolve the convolution kernel with each of the feature maps of z, and you can see the form is really the same as sparse coding: in sparse coding the sum runs over the columns of w and the components of z, and here the sum runs over the slices of the convolution kernel tensor and the feature maps of the z tensor, where k is the index of the feature map. It's really the same thing; a convolution is a linear operation, like a big, mostly empty matrix with the same coefficients repeated everywhere. And this works really beautifully. You apply it to an image and run the learning algorithm, where you normalize the kernels so they don't grow too large, the same trick as in the sparse coding model: you show an image y; you find the z, which is a whole set of feature maps the same size as the image, that minimizes the reconstruction error under the L1 term; you take one step of gradient descent on the convolution kernels so that the reconstruction term goes down; and you normalize the convolution kernels so their norm doesn't grow, because otherwise the system could cheat by making z shorter.
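Putting the convolutional version in a formula (again my notation, not the slide's): the w_k are the convolution kernels, the z_k are the feature maps, and * denotes convolution.

```latex
E(y, z) = \Big\| \, y - \sum_{k} w_k * z_k \, \Big\|^2 \;+\; \lambda \sum_{k} \| z_k \|_1
```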
Then you look at the convolution kernels, and here's what happens. If you have only one convolution kernel, it basically learns to reproduce the image pixel by pixel; there's not much else it can do. These filters are bigger than five by five; they're something like eight by eight. If you allow the system two kernels, you get two types of contrast detectors, if you want. If you let it have four, you get very simple little vertical and horizontal edge detectors with two polarities. And the system doesn't need to learn edge detectors at every location, because it already has them at every location thanks to the convolution; what it learns are convolution kernels that it knows will be used everywhere, so it doesn't need to replicate multiple instances at multiple locations. With eight filters you start getting oriented edge detectors; with 16 filters you get oriented edge detectors but also center-surround filters, those things here, which basically detect local contrast. They're like what are called Laplacian filters: your retina does that, your lateral geniculate nucleus, which is where the optic nerve projects before reaching the cortex, does that too, and your primary visual cortex, area V1 in the back of your brain, does all of this, all these oriented edges. Increase this to 32 filters and you start getting a little more diversity in the filters, more of those center-surround filters with different sizes of center and surround. If you go to 64 filters you start seeing things like double edges: here you had single edges, one side black and the other white, whereas a double edge detects a narrow line, if you want. With 64 filters you also start getting things like end-point detectors and corner detectors (this one is a corner detector); you get more diverse filters as you scale up. So this is wonderful, because it means you can learn features completely unsupervised just by having a sparse decoder. One thing I didn't tell you at the start, which I elided, is that I'm using amortized inference for this. This is from a paper also from about ten years ago, with my former students Koray Kavukcuoglu and Marc'Aurelio Ranzato; Kavukcuoglu, by the way, is now the head of research at DeepMind. I apologize for the pronunciation, I never figured out how to pronounce it properly, but I'm sure there are Turkish speakers in the audience who can tell me a better way. So in fact I'm using amortized inference, but I'm not using the complicated LISTA encoder; I'm using what amounts to a two-layer encoder: a single layer of convolutions, followed by a nonlinearity, which is actually a hyperbolic tangent, followed by what is basically a diagonal matrix that just sets the gains and doesn't do anything else. What happens is kind of interesting: the filters in the encoder, which is a single convolution, end up being very similar to the filters in the decoder. And this is a feature extractor: you apply it to an image, run the convolution, pass it through the nonlinearity, and those are your features. So this is a completely unsupervised way of training a layer of a convolutional net, using what amounts to a sparse autoencoder, or sparse coding with amortized inference, which is essentially the same thing. These are the kernels learned by that system; they look really clean, and they look very similar to the ones in the decoder. So it's a very simple algorithm, a very simple architecture: take an image, run it through a convolution, pass it through a nonlinearity (a hyperbolic tangent in this case), pass it through a linear layer that is really just a scaling, a diagonal matrix rather than a full one; then you have a latent variable z with an L1 sparsity penalty; then run z through a linear decoder, which is a bank of convolutions, reconstruct the image, and compute the reconstruction error.
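A minimal sketch of that architecture, under my own naming and sizing assumptions (n_maps feature maps, k-by-k kernels): a one-layer convolutional encoder (convolution, tanh, then a per-feature-map gain, i.e. a diagonal scaling) predicting the code, and a linear convolutional decoder reconstructing the image from the code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSparseAE(nn.Module):
    def __init__(self, n_maps=64, k=9):
        super().__init__()
        self.enc = nn.Conv2d(1, n_maps, k, padding=k // 2)                 # single encoder convolution
        self.gain = nn.Parameter(torch.ones(n_maps, 1, 1))                 # diagonal scaling of each map
        self.dec = nn.Conv2d(n_maps, 1, k, padding=k // 2, bias=False)     # linear convolutional decoder

    def encode(self, y):
        return self.gain * torch.tanh(self.enc(y))                         # predicted code z_bar

    def decode(self, z):
        return self.dec(z)                                                 # sum_k w_k * z_k

    def energy(self, y, z, lam=0.1, alpha=1.0):
        rec = F.mse_loss(self.decode(z), y, reduction="sum")               # reconstruction error
        pred = alpha * F.mse_loss(z, self.encode(y), reduction="sum")      # prediction error
        return rec + pred + lam * z.abs().sum()                            # plus L1 sparsity on z
```

Training then proceeds as described next: initialize z with encode(y), minimize energy(y, z) over z, then take gradient steps on the encoder and decoder weights for that z, renormalizing the decoder kernels after each step.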
The training process is exactly what I was describing a few minutes ago: you run the encoder to make a prediction for z, you copy that prediction into z as an initialization for sparse coding, and then you minimize the sum of the three terms with respect to z, finding the z that minimizes the reconstruction error, the sparsity term, and the prediction error. Now that you have that z, you use it as a target for the encoder: you backpropagate the gradient of the prediction cost with respect to the weights of the encoder, and you take a gradient step on the decoder so that the reconstruction term gets smaller. And don't forget to normalize the convolution kernels of the decoder, otherwise they'll explode and z will shrink. Even if your encoder is basically a single-layer neural net, this works pretty well and gives you these results, so you can use it to pre-train a neural net. There was a time when datasets were not as large as the ones we have today, and these techniques would actually improve the performance of things like pedestrian detection and other applications. Let me tell you about other approaches to learning features using this sort of amortized-inference, prediction-type idea. This is a bunch of papers that are about five years old, although there have been more recent ones, but here's the idea: you want to train a system to learn visual features from video by doing video prediction. What you can do is take two frames of a video, look at the third frame, and train a system to learn representations of each of the frames such that, through a learned neural net, you can predict the representation of the next frame. This is not a latent variable model, actually, although in principle it should be. You take two frames, run them through an autoencoder, and train that autoencoder, by backpropagation, so that each frame is reconstructed; that guarantees that this h is a good representation of y, regardless of what it is. You do the same for multiple frames; now you have the h's, you run them through a neural net, and you run the output of that neural net through a decoder that predicts the next frame. And you can do amortized inference here too: you run the next frame through an encoder and train that encoder to predict what the representation needs to be, although in fact you don't strictly need that part. So this is an example of doing video prediction, and you can train a system like this to learn features; you can look at the filters learned at the encoder level, for example, and there are various criteria you can impose on h, or on g, in that process, and you can learn pretty good features this way. This is still work in progress along these directions, it's not completely worked out, but again it's a non-contrastive way of training feature extractors: you don't train the system to just reconstruct, you train it to predict, and because you're training it to predict, it can't fall back on trivial solutions. I didn't tell you anything about the details of how this works, but some of these methods come up with pretty cool features; we can talk more about it later if you're interested in the details. The point is just to suggest that if what you train the system on is not reconstruction but prediction, and you don't have latent variables, then you don't have the problem of limiting the information content of a latent variable, because you're doing prediction. Okay, let me say a few words, in the fifteen minutes we have left, about variational autoencoders.
Alfredo will come back to this in more detail, I guess next week, right? Yes: tomorrow we're covering the basic autoencoder, and next week the variational one. So I may talk a bit more about this next week as well, but let me give you the gist of it now; this is a very informal presentation of what variational autoencoders are really about. There is also a more mathematical, probabilistic version of it, but today we're only going to see the intuitive version, and maybe next week, if I have time, we'll go through the more formal version, still in the energy-based framework, which is actually applicable not just to variational autoencoders but to variational methods in general in the context of energy-based models. So first, why is it called a variational autoencoder? It's called variational because you're approximating a complicated distribution by a simpler one, and in statistical physics that's what variational approximations are; that's where the name comes from. But I'm not going to talk about it that way yet; we're just going to think about it in terms of energy-based models. The variational autoencoder is basically a model of the type we just talked about. It can be conditional or unconditional: the grayed-out part here is the conditional part. If it's conditional, you have an x variable that runs through a feature extractor, and the resulting representation influences the decoder and the encoder; in the unconditional version you just don't have x. So it's basically an autoencoder, and in fact it's called an autoencoder, and in terms of energies it's a regularized autoencoder with a funny form. You start with y and run it through an encoder, which can have multiple layers, and it makes a prediction z-bar for the latent variable; the latent variable itself is a free variable; and there are three terms in the energy. One term is the squared distance between z and the output of the encoder, multiplied by some constant; that constant can actually be a matrix, in which case this is a quadratic form with some sort of covariance matrix in the middle, but for now let's just say it's a constant, or maybe a parameter we learn. Then there's another term which is the L2 norm of z (I wrote z-bar here, but this is really z). And then we run z through the decoder and measure the reconstruction error. So the energy is the reconstruction error from running z through the decoder, plus the L2 norm of z, plus the squared distance between z and z-bar, where z-bar is the output of the encoder applied to the input. Simple enough.
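Collecting the three terms just listed, with alpha standing for the constant (possibly a matrix) multiplying the prediction term and z-bar the encoder output, the energy is roughly:

```latex
E(y, z) = \big\| y - \mathrm{Dec}(z) \big\|^2 \;+\; \|z\|^2 \;+\; \alpha\, \big\| z - \bar{z} \big\|^2,
\qquad \bar{z} = \mathrm{Enc}(y)
```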
Now here's what we're going to do: instead of minimizing the energy with respect to z, we're going to marginalize over z. So our free energy is the negative log of the integral of the exponential of minus the energy, over all possible values of z, taken over the whole space, and that is completely intractable. What does that mean? The integral of e to the minus the quadratic term in z and z-bar is simple: it's a Gaussian integral, so when I integrate it over the entire space I get a Gaussian and I know what the integral is; it involves the determinant of this, which is one here. Same for the L2 term: if I integrate e to the minus that term over the entire space, I also get a Gaussian integral I know how to compute. The problem is the reconstruction term: I have no idea what its integral is, because z runs through a really complicated neural net, so when I want to compute the integral of e to the minus the energy, the contribution of that term is too complicated; it's intractable. And here is where the variational approximation comes in: you just drop it. You're going to marginalize over a distribution which is not the real one, the one proportional to e to the minus the complete energy; instead it's e to the minus the energy with that term dropped, because it's too complicated. So you replace the distribution over which you marginalize by a simpler one that you can actually integrate, because it's the product of two Gaussians: the simplified energy is the sum of two quadratic terms, its exponential is the product of the two exponentials, and the product of two Gaussians is a Gaussian. Now, why do we need that distribution at all? Because what we need to compute is the gradient of the energy with respect to the parameters of the network, averaged over all possible values of the latent variable, where each value of the latent variable is weighted by its probability under the distribution proportional to e to the minus the energy. That's what we need to do. The marginalization is: F of y equals minus log of the sum over z of e to the minus beta E of y and z (I'll assume beta equals one, and there's an x in there if it's a conditional model). The reason for using this form of the free energy is so that we can compute the probability of y as the sum over z of the joint probability of y and z, and we also want it to equal e to the minus the free energy of y divided by the sum over y of e to the minus the free energy of y; the proper definition of F, if you want those two things to be equivalent, is the one I showed last week, and it corresponds to marginalizing over the distribution of z given y. That distribution, the so-called posterior over z, p of z given y, is e to the minus E of y and z divided by the sum over z prime of e to the minus E of y and z prime (I should say z prime so we don't get confused). That's the posterior over z. But here is the problem: we cannot compute it, because the energy contains that reconstruction term, which is quadratic in y-bar but non-quadratic in z, since z goes through the complicated neural net. The energy has three terms, the reconstruction error, the prediction error, and the squared norm of z with its coefficient, and it's the reconstruction term I can't deal with: when I plug this energy into the equation, I cannot compute the denominator, because that term is too complicated, whereas the other two terms are quadratic in z, so they're just Gaussian integrals I can compute; in this case it's basically lambda times some constant k, something like that.
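In symbols, the quantities being juggled here are, with beta set to one (my transcription of the formulas described above): the free energy obtained by marginalizing over z, the true posterior over z, and the variational proxy q built from the simplified energy E-tilde that drops the intractable reconstruction term.

```latex
F(y) = -\log \int e^{-E(y,z)}\, dz, \qquad
p(z \mid y) = \frac{e^{-E(y,z)}}{\int e^{-E(y,z')}\, dz'}, \qquad
q(z \mid y) = \frac{e^{-\tilde{E}(y,z)}}{\int e^{-\tilde{E}(y,z')}\, dz'},
\quad \tilde{E}(y,z) = \|z\|^2 + \alpha\,\|z - \bar{z}\|^2
```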
So that's the approximation: I'm going to use a proxy for the energy, or rather a proxy for p of z given y, which is going to be a q of z given y, and that q is e to the minus E-tilde of y and z, divided by the integral of e to the minus E-tilde of y and z, where E-tilde is the simplified energy. I'm making this horrible approximation, and that's what's called a variational approximation: I'm replacing the posterior over z by another one which is simpler and which I can actually compute. Now, what I need to do is say: for each possible value of z, compute the gradient of my energy E of y and z with respect to the weights, weight it by p of z given y, and sum over z; that really is the gradient of the marginalized free energy with respect to the parameters, the gradient of the energy averaged over all values of z, weighted by the posterior of z given y; I had this formula just a minute ago. I can't do this exactly, so I substitute the other distribution, the q, and I approximate the sum by a discrete sum over samples drawn from q. So in practice it's very simple: you take a y, you run it through the encoder, you get a z-bar, and now you sample a particular value of z from the distribution whose energy is the sum of those two quadratic terms; that's sampling from a Gaussian, which is super simple. The mean of this Gaussian is basically z-bar (actually, because of the L2 term on z, it may be shifted a little from z-bar). Then you run that z through the decoder, get a reconstruction, backpropagate, compute the gradients with respect to the encoder and decoder weights, and make an update. Pretty simple, right? And what that gives you is an approximation of the gradient you want, the gradient of the free energy once you've marginalized over the latent variable. Another, more intuitive interpretation is this: each training sample gives you a point in z space, a vector in z space, represented on the plane here; each of those blue dots is the code corresponding to a training sample, produced by the encoder. When you add noise to each of those points, you basically turn them into fuzzy balls, and you run a random point within each fuzzy ball through the decoder. Now, if you have overlapping fuzzy balls, the decoder is not going to be able to reconstruct those points very well, because once in a while a sample falls in the overlapping region, and the reconstruction will be bad. So this will cause the system to push all the points away from each other; the points will fly away from each other to avoid the confusion. But that's not a good solution either, so what you do is add a term that acts like a spring attaching each of those fuzzy spheres to the center, so they can't just fly away; they have to stay as close as possible to the origin. That regularizes the whole thing, and that's the L2 term: it prevents the fuzzy spheres from flying away. In fact, I should probably put that term on z-bar rather than on z, so it's really a term that intervenes in the loss but not in the energy.
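A minimal sketch of that sampling-based update, assuming a Gaussian q centered at the encoder output with a fixed noise scale sigma (a real variational autoencoder also learns a per-sample variance, which is not shown here); encoder, decoder, and optimizer are assumed to be ordinary PyTorch modules and an optimizer over their parameters.

```python
import torch

def vae_style_step(y, encoder, decoder, optimizer, sigma=1.0, reg=1e-2):
    z_bar = encoder(y)                              # encoder prediction for the latent
    z = z_bar + sigma * torch.randn_like(z_bar)     # sample a point inside the "fuzzy ball" around z_bar
    y_hat = decoder(z)                              # decode the noisy code
    rec = ((y_hat - y) ** 2).sum()                  # reconstruction error
    spring = reg * (z_bar ** 2).sum()               # L2 "spring" pulling the fuzzy balls toward the origin
                                                    # (placed on z_bar, i.e. in the loss, as noted above)
    loss = rec + spring
    optimizer.zero_grad()
    loss.backward()                                 # gradients reach both encoder and decoder weights
    optimizer.step()
    return loss.item()
```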
We're four minutes over time, so I'm going to stop here. Thank you for your attention; Alfredo will explain this in more detail. All right, take care.