Welcome to class, to this Deep Learning Fall 2022 edition, almost 5 PM, 14th episode. What are we talking about today? First of all, actually, before we start talking about the class, so I can spend a few minutes waiting for your classmates who are a bit late today. I just found out, well, actually I knew about this before, but I never actually used it, since I'm talking about tools sometimes. To make my slides, I use something called LaTeXiT, from a friend of mine whose name I forgot, such a bad friend I am. But it's for Mac, at least as far as I know. You can drag and drop the formulas you made in LaTeX into a PowerPoint, where you can animate everything. But it's always dragging back and forth, which is maybe fine since you can animate things, as you can see in my slides; I think things are pretty cute. But unfortunately, what happens if you have to update your formulas, say because you change notation? Every semester I update my notation, and the big pain in the neck is that I cannot update all the formulas based on my new notation, right? If you just use LaTeX, you have one preamble. You can define symbols in the preamble, and then everything cascades through the compilation wherever you include that preamble. Just update the preamble, and everything is going to be updated. That's why someone could say Beamer is a good thing, although it's a horrible thing, I believe, but that's my opinion. So I just figured out there is this thing that works wonderfully, IguanaTex. So this thing here. This is actually a plugin for PowerPoint that works for both Windows and Mac, right? And you can just write LaTeX, you generate, and so on. That label is wrong, by the way: "bitmap" here means PDF, Portable Document Format, not a raster. And the other thing that I just figured out today is that here you can just do \include, well actually \input, and then you can add your preamble.
And now all your formulas are going to depend on this preamble; you can update it just once, and all the formulas will follow that master template, right? It's like HTML and the CSS file, right? The style files. So I think this is just great. And it works for both Windows and Mac, but the major point is that everything can be defined in this external file you can use as a template, right? Also, there is an extra button here, I think now, such that you can edit your source file in an external editor, which could also have autocompletion based on the inputs. So if you have a preamble, then this external editor, if it actually reads the LaTeX and imports the preamble, can also autocomplete your things. And if it's also able to render LaTeX, you can do some preview, right? I think this is just great, because afterwards you can apply all your animations and things, and then you just have to do it once, right? You don't have to do things too many times. Anyway, this is enough; we waited enough for people to come to class. We start with the lesson today, okay? So that was a small bonus at the beginning of class. So we are in the middle of understanding these energy-based models, right? We spent the last two classes, I think, if not more, on that kind of ellipse thing, right? So we are gonna be starting off from there, such that we can make connections to the new topic, right? The following topic is gonna finally give you the whole view about these models, right? Since we built all those little tiny building blocks, right? We started first with the energy interpretation of classification, right? That was easy-peasy, because you already know how to perform classification. Then we moved into this latent-variable energy-based model inference, and then we spent some time on the training, for learning the parameters.
And we understood that these latent variables allow us to produce infinitely many predictions, right? So whenever we have to learn a one-to-many mapping, I can use this latent variable in order to be able to predict all possible values, right? If you can only predict one value and there are multiple targets which are far apart, then the best guess is gonna be just predicting their average, right? Which is not always the best option, right? As we have seen with the ellipse, if you predict the average of an ellipse, you're gonna get the point in the center, which is not any possible point, right? There is no probability there; the points are around it, right? Not in the center. The same way, if you're driving your car, sometimes you turn to the right, sometimes you turn to the left when you have an intersection. And if you train a model which doesn't have latent variables, and you don't tell it in advance in what direction you're gonna be turning, it's gonna learn to go where? Just straight ahead, which is not the best thing you want to do, right? Because you might run into some problems: if the road splits in two, you may want to take one or the other option. I spent the day making new slides yesterday and today, right? So you're gonna get something that hasn't been taught before, which means new material, right? Primizie, I don't know what's the word in English. Help me out, come on. Yeah, yes, you are my guinea pigs, that we established already. So you're gonna have exclusive, oh, there you go, exclusive content, right? Just because you are in class this semester. You have exclusive content which we don't necessarily know if it's gonna work, right? Because all the animations, you will figure out today if they work. All right, so unconditional learning, generative models, right?
So remember what was the unconditional versus conditional thing. Let's remind ourselves a little bit what we are talking about, right? Because the notation in this course is slightly different from other courses, right? We try to fix broken things around. So remember, we had the data, which is this kind of pill we divided in three circles: the pink, the blue and the orange, right? This data is cut in two parts. The left side is always visible, always observed. And the right-hand side is never observed, right? Then we also use this kind of shading, this filling, in order to distinguish when you have or don't have the blue guy, right? So the pink is gonna be always shaded. The orange is gonna be always unshaded, because it's never seen. But then the y, depending on the case, is gonna be one or the other, right? Which are the two cases? When are you observing the targets? We observe the targets during, type, type, type, come on. During training, yes, of course, right? So we have these x, which are the observations. We have the y; the y's are the targets, the things that we want to learn, right? And so the y's are always there, because that's what we care about, right? And then this z over there, which we have seen is hidden; it can be inferred, we have no access to it, but it allows us to have this kind of parameterization of the output, right? I can change my output by changing this z. Cool. Those are optional, and the x is optional too, right? So today we're gonna be talking about unconditional learning. Guess what it is gonna be about? It's going to be about learning, finish my sentence. Learning unconditionally, of course, without the x, right? That's the point. The only thing we learn is gonna be the y, right? You can learn the y given that you observe an x, or you just learn the y without observing anything, okay?
And y can be anything, right? Today it's gonna be points, I believe. Tomorrow it's gonna be images, but whatever, okay? We start with points because it's easier to draw on the screen. Once more, what are these names, right? So we have the inputs x, which are observed during training and testing. The y is observed only during training, as you said correctly, and then the z is never observed. Finally, on the other side, we have the outputs. There you have the green ball h, which is the internal or hidden representation. This is not the latent, right? Latent means missing, like the z. The z is the missing variable. Then we have the y-tilde. Again, this is the approximation of the target, okay? And the tilde means more or less, right? Circa, like circa 1500, right? More or less. Okay, so last week we covered this guy over here, right? It's dim, yes. I know that your screen is not broken. The brightness is lower because it's already-seen material, right? So you don't have to pay too much attention. And this is our latent-variable energy-based model, which allows us to generate that cone in the output, okay? And this was the conditional predictor, right? The conditional prediction is obtained by using the predictor, which tells us the curvature and the extension of the ellipse, right? And then the decoder basically just makes things go around ellipses, in this case, okay? Again, this was the conditional, which means we observe the x location. But then we started the lesson with the unconditional, right? So we started the lesson without that first thing, and we defined this energy, right? This was the first time we introduced the energy as the major architectural component, right? So here we define the model by defining the energy, whereas when we were talking about the classifier, we were trying to see how the energy could be somehow defined after the model was already put in place, right?
Instead here, we just define the energy model starting from the energy itself, okay? All good? All right, moving on, starting today's lesson by revising the technique we use for training any energy-based model, okay? So we start with this item over here, but remember what was the last slide in the last lecture, right? We were talking about the fact that we don't necessarily know what the size of z is. More precisely, z doesn't have to be a scalar value, okay? z was one-dimensional in that case because my y's were two-dimensional points, right? And anything that is larger than one dimension gives me too much freedom. What happens if my z is, let's say, two-dimensional? I think I asked you the same question at the end of the class, but perhaps someone left. Do you remember? If you have a z that is one-dimensional, how do we perform inference in this thing? If I have a point, right? Let's say I have this point over here, where my hand is, right? I'd like to estimate its free energy. What was the free energy? Let's revise in geometric terms, just intuitively speaking, right? The free energy was the dot, dot, dot to the closest point of the dot, dot, dot, right? So that is the minimum distance to the manifold. There you go, that's correct, okay? So given that this is my y here, my target, I check the distance to all points on the manifold. I look for the closest one, and then I take the square of that closest distance; that's my free energy, right? So that's the closest distance to the manifold, right? The closer you are to the manifold, the lower your energy is gonna be, meaning that the point is likely to be coming from the data distribution, the data manifold, right? If instead the energy is very large, that means your y is very, very far away; too bad, it's too far, it's probably not coming from the data manifold, right? Here we are gonna be generalizing this, right?
So in one of the last slides we covered last time, we were mentioning the fact that z can be a vector, okay? So in the case of the ellipse, what happens if z is two-dimensional now? What are we finding? The collapse, right? What does collapse mean? Energy is zero everywhere, right? So what does it mean that the energy is zero everywhere? By changing the latent variable, we were going around a one-dimensional sub-manifold, right? A one-dimensional manifold embedded in the ambient space, which is this two-dimensional space, okay? Now the problem is that if your z lives in a two-dimensional space, then you can reach everywhere, right? I mean, you're covering the whole thing, right? In the case we saw last week, the fact that our predictions were allowed to move only around a one-dimensional manifold was enough to avoid having the ability to reach every location in the two-dimensional space. Otherwise, you just start covering the whole space, right? Okay, so how can we avoid collapse, right? How can we avoid having zero everywhere? Or, in other words, how can we avoid y-tilde being too free, right? Meaning, how do we fight every possible motion of y-tilde, right? We have to add, okay, add some constraints or some regularization. Where are we adding the regularization? We're gonna be adding this regularization to the z, right? And so this R, the regularizer, basically makes us start paying a price if z changes outside a specific range of values, right? Or it's proportional to, I don't know, the length of the vector itself, something like that, okay? Okay, okay, we are on the same page. So, training recap. Given an observation, right? Given a blue ball y, and given an energy function E of y and z, in this case, remember the energy is given to us by the sum of all the boxes, right?
So in this case, it's gonna be C, the cost, plus R, the regularizer, where y-tilde is gonna be the decoded latent variable. We were computing the free energy F as the soft minimum, the softmin, of the energy itself, right? Remember we were taking that kind of integral, or the summation if z is discrete. Paying attention to, remember, which were the more important terms in that softmin? Which of the values, larger or smaller values of E, contribute the most to the final softmin? The smallest, right? So again, the softmin gives you some sort of summary of the lowest values of the energy. Or, if you make it hard, it just takes the smallest value itself, right? And then we were minimizing that loss functional in order to have a well-behaved energy function, which gives low energy to points that are coming from the training set. Okay, okay. Then let's clear up the screen, and we're gonna go to the zero-temperature limit. The zero-temperature limit is: just increase the coldness, right? You make it super cold, you stick it in the freezer. And in this case, we are computing the argmin, the ž (z-check), which is the latent variable that gives me the prediction closest to the target, right? Then I can compute the free energy, which was the minimum distance, and then we were minimizing this loss functional for the zero-temperature-limit free energy. Okay. So let's have a final concrete example of an energy-based model that is a little bit more serious than the toy example we saw last time, okay? And so we're gonna be talking about K-means. Whoa. So K-means, yes, is a generative model. Mm, okay, this is gonna be interesting, right? Perhaps you haven't heard about K-means in these terms. So actually there, instead of having the regularization term, we're gonna have a constraint, right?
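To make the softmin concrete, here is a minimal sketch in PyTorch, assuming a discretized z; the function name is mine, and I use the mean-normalized convention for the softmin, which is just one possible convention:

```python
import torch

def free_energy(E, beta):
    """Softmin of the energies over the latent variable: a soft version of min(E).

    E: tensor of energies E(y, z) over a grid of z values, shape (num_z,).
    beta is the inverse temperature, the "coldness": as beta -> infinity
    (the zero-temperature limit) this converges to E.min().
    """
    num_z = torch.tensor(float(E.shape[-1]))
    # F = -(1 / beta) * log( mean_z exp(-beta * E) )
    return -(torch.logsumexp(-beta * E, dim=-1) - torch.log(num_z)) / beta
```

With a large beta, the smallest energy dominates the sum, which is exactly why the softmin is a summary of the lowest values of the energy.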
So it's gonna be a hard constraint rather than a soft constraint. In this case, z is one of the columns of the K-dimensional identity matrix, okay? So z is gonna be a one-hot vector of size capital K. That's where the K comes in K-means, right? So it's capital-K means, not lowercase k-means, right? Anyway, y-tilde is the output of the decoder, which is going to be simply a linear transformation of my latent variable. And then, if you multiply a matrix by a one-hot vector, what do we get? Remember how matrix-vector multiplication works? Okay. So whenever you have a matrix-vector multiplication, you take the first column of the matrix and scale it by the first component of the vector, plus the second column of the matrix scaled by the second scalar in the vector, plus the third column of the matrix scaled by the third component, and so on, right? So you need to have as many scaling factors in your vector as the number of columns, right? Now, if those scaling factors are all zero, the sum of all of these things is gonna be zero. If there is just one non-zero element, which is equal to one, you basically extract that specific column, okay? I hope it makes sense, right? I like to see how every matrix-vector multiplication can be seen from different angles. This summation of scaled columns is gonna be the one that works very well to quickly think about these things, right? Again, if this is not new to you and I'm boring you because you're already knowledgeable, I'm sorry, but I think it's important that everyone has the ability to imagine these things on the fly. All right, and then the final thing is gonna be that the energy, which is going to be just the cost because there is only a single box there, is going to be the squared Euclidean distance, so the same thing as last week, okay?
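The scaled-columns view of matrix-vector multiplication can be checked in two lines of PyTorch (the numbers here are just made up for illustration):

```python
import torch

# Matrix-vector multiplication as a scaled sum of columns:
# W @ z = z[0] * W[:, 0] + z[1] * W[:, 1] + z[2] * W[:, 2]
W = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
z = torch.tensor([0., 1., 0.])   # one-hot vector: only the second scalar is non-zero
y_tilde = W @ z                  # so the product extracts the second column of W
```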
We choose the loss L to be the energy loss, right? The energy loss means that the per-sample loss functional is going to be the value of the free energy itself, okay? All right, cool. So in this case, I have 50 training samples in a 2D space, which are going to be the following. So these are going to be my training samples, okay? I have 50 points in a 2D space. No big deal, right? Okay, so then let's figure out how to perform inference, okay? So for E, I just compute the pairwise squared distances, right, between all my points y, which are 50, and all the columns of this matrix W, okay? How many columns do we have? In this case, I have 15 columns. So capital K, I should have written it perhaps somewhere; I've chosen capital K equal to 15. So this instruction here, bam, one line, computes the E. Then I can compute F and ž in one line of code, right? I didn't tell you what W was, okay? I'll tell you in a second. Y, over on the right-hand side, is 50 by two, okay? Okay, no, no, don't apologize. You can ask anything. Anyway, second line of code, and we are almost done with K-means. There are three lines of code. I mean, this is very simple to write down in PyTorch. So if you type E.min over the rightmost dimension, the one of size 15, you end up with 50 values for F. So F is a vector of 50 components, and ž is also gonna be a vector of 50 components, with the index corresponding to the closest column vector, right? So W is also called a dictionary. Okay, maybe a dumb question, but can you use another loss function rather than the square loss for K-means? So it's not the loss, right? I think you're asking whether you can change the cost. Is that what you're saying? Although you wrote loss, I understand you mean the cost, right? They are two different things, right? The choice of loss is the rightmost one: I choose the loss to be equal to the energy itself.
That's one of the choices to implement K-means. The other choice is to use the squared Euclidean distance for the cost, the penalty, okay? Yeah, yeah, sure. You can swap in here whatever you want, right? We are just checking how to cast and see K-means from the energy perspective, okay? So again, as with the classifier, you're already aware of how this stuff works. I'm not teaching you K-means. I'm just showing you how to see K-means as a generative, latent-variable energy-based model, okay? Cool, makes sense? Yes, I hope so. I mean, this is not a lesson on K-means. I'm just showing you how K-means is seen from this perspective, okay? Anyway, good questions. Don't stop the questions. On this line here, I compute the free energy and the ž in one line, right? By just doing E.min, E being this tensor, this torch tensor. And this min returns two items, right? And again, F and ž are going to be 50 items each, right? Because what is this instruction doing? Well, this instruction is taking the minimum of these 15 distances, right? So how to read this line here? Per each y in this capital Y, there are 15 w's, right? And so each row of this E matrix basically tells you the 15 squared distances of one y towards all the w's, right? It's the same thing we were talking about with the manifold: you have one y, but instead of having infinitely many points on the ellipse, now you have just 15 points, right? Pam, pam, pam, pam. You can think about discretizing your ellipse into just 15 points, which is, again, what we were doing; we were doing a different discretization, but eventually I did discretize the thing, right? Although you could do the continuous version. All right, so this F is gonna be a vector of 50 elements, one per each y, right?
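The inference step described above can be sketched like this (the data here are random stand-ins, not the actual spiral from the slides):

```python
import torch

torch.manual_seed(0)
Y = torch.randn(50, 2)                 # 50 training points in 2D (random stand-ins)
K = 15
W = Y[torch.randperm(len(Y))[:K]].T    # 2 x K dictionary, initialized from K random samples

# E[n, k] is the squared Euclidean distance between point Y[n] and column W[:, k]
E = (Y[:, :, None] - W[None, :, :]).pow(2).sum(dim=1)   # shape: 50 x 15
F, z_check = E.min(dim=1)   # free energy and closest-column index, one per training point
```

`E.min(dim=1)` is the one instruction that returns the two items at once: the minimum values (the free energies F) and their indices (the ž's).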
So each y in your training set has a free energy, and each y has a ž, which tells you which of the centroids was the closest. Finally, for k that goes from one to capital K, you simply assign that each column is going to be the mean of the training samples whose index corresponds to the given column, okay? And again, you have one, two, three, four lines, and you have K-means, okay? There is one more line: if you repeat this a bunch of times, through a few epochs, then you basically train your system until it converges, right? You're basically doing, again, the same thing here. And how to initialize this W? W can be initialized by picking 15 random training samples, for example. Again. So ž has the indexes of the closest means, well, the closest centroids, but what does F contain again? Well, if you type E.min in torch, min gives you the minimum value, right? So E was the set of 15 squared distances per each training point. If you take the min, M-I-N, you're gonna get the shortest distance. So the shortest distance, F, is telling you the closest distance to the manifold. But now the manifold is defined by discrete points, right? Whereas before it was a continuous subspace, right? Now the manifold is described as single discrete points in the ambient space. Got it. So why do you need F for this computation? F is the output of the model. F tells me how far my y is, well, what is the closest distance to the manifold, right? So the inference here, the thing we care about, is gonna be determining the free energy for a sample. You would like to know whether this sample is reasonable or not. This is a generative model because my model spits out the closest centroid, okay? So if you return W of ž, then that's actually my output, right? So maybe I should have written one extra line here.
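Putting the assignment step and the update step together, the whole training loop is only a handful of lines; again the data are random stand-ins, and the number of epochs is an arbitrary choice of mine:

```python
import torch

torch.manual_seed(0)
Y = torch.randn(50, 2)                        # toy training set (random stand-ins)
K = 15
W = Y[torch.randperm(len(Y))[:K]].T.clone()   # init: K random training samples as columns

for epoch in range(10):
    # inference: free energy and closest-centroid index for every training point
    E = (Y[:, :, None] - W[None, :, :]).pow(2).sum(dim=1)   # 50 x K squared distances
    F, z_check = E.min(dim=1)
    # learning: move each column to the mean of the points assigned to it
    for k in range(K):
        members = Y[z_check == k]
        if len(members) > 0:                  # leave empty clusters where they are
            W[:, k] = members.mean(dim=0)
```

Each full pass cannot increase the average free energy, which is why repeating it a bunch of times converges.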
Does it make sense? So let's say, after I train the system, I have one new y, right? I have the new y, then I can compute the free energy for that new y, right? And also I can tell what is going to be the prediction, and the prediction is gonna be the closest centroid to my new target. Are we on the same page? Is ž zero or one? ž is going to be a one-hot vector, right? Or, in this case, it's gonna be the index of the non-zero element, right? In the first, mathematical representation, ž was the one-hot vector. In the code here, in the PyTorch version, ž is actually the index of the non-zero component, okay? How does the energy landscape look? Well, let's have a look, let's check. So these were my training points, right? I showed you before. There are 50 points in a two-dimensional space. And this is going to be the free energy, the zero-temperature-limit free energy. If your y location happens to be at any of the coordinates you see here, at any of the centroids, then the energy is going to be zero, right? As you move away from the centroids, the energy increases quadratically, right? You can see this, right? Maybe you don't see it here very well, so let me show you the next slide. Instead of showing you all values from zero to 1.6, I show you only values very close to zero, up to 0.01 or 0.1, okay? So this is very, very close to the bottom of that thing. And I show you here some level curves, such that you can see that around each of these ones you have these little rings, and then the stuff grows up parabolically, okay? Questions about these things so far? Okay, so far I think I'm doing well, right? Let me know what you think afterwards. All right, so moving on, second, oh, okay, sorry. Could you say again what the model consists of here? This is the model. You have a latent variable. You have a decoder. The decoder is linear.
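A landscape plot like the one on the slide just evaluates the free energy on a grid of y locations; here is a sketch with random stand-in centroids and an arbitrary grid resolution of my choosing:

```python
import torch

torch.manual_seed(0)
W = torch.randn(2, 15)    # stand-in for the trained centroids

# evaluate the zero-temperature free energy on a grid of candidate y locations
xs = torch.linspace(-3, 3, 101)
grid = torch.stack(torch.meshgrid(xs, xs, indexing='ij'), dim=-1).reshape(-1, 2)
E = (grid[:, :, None] - W[None, :, :]).pow(2).sum(dim=1)          # distances to centroids
F = E.min(dim=1).values.reshape(101, 101)   # zero at each centroid, quadratic growth away
```

Feeding `F` to a contour plot gives the little rings, the level curves, around each centroid.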
There is only a W matrix. And there is a squared Euclidean distance used for the cost. That's the model itself, right? And the energy-based model, well, the energy thing is gonna be a box here around everything, which has y and z sticking out. Are we good? Okay. All right, so back to the zero-temperature limit we covered before, the recap. The first example I showed you was this K-means, right? Let's see another example that uses this exact same recipe, okay? In this case, we're gonna be talking about sparse coding. Any difference? Just the title, okay? So, one slide for a title change. But of course, there are gonna be some details, right? The main structure is gonna be exactly the same; that's why you have two identical slides. There are gonna be some additional contributions: we have to introduce the sparsity, right? What does sparse mean? We already talked about sparsity when we were talking about convolutional nets. But let's now have another instance of this utilization of sparsity in deep learning, okay? So here we have that the decoder, first of all, is gonna be linear, okay? As for K-means, also in sparse coding we have this linear decoder. The cost is gonna be, as for K-means, the squared Euclidean distance, right? We also have this R: in K-means, R was not there and we had this kind of hard constraint. Now we have a soft constraint. So R is gonna be this softer term, and R is going to be alpha, a scalar coefficient, times the one-norm of z, okay? I compute everything that is written here and I try to minimize it, right? Let's have a look at the energy I learned by performing this, okay? For the same training points I showed you before. So if you do that, you're gonna get this. How would you describe this picture? It's a cone, it's a cone with some linear parts. I don't know if you can see, right?
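Written out, the sparse-coding energy is just the K-means cost plus the l1 term; a minimal sketch, with the function name being mine:

```python
import torch

def energy(y, z, W, alpha):
    """Sparse-coding energy: squared reconstruction error plus the l1 penalty on z."""
    y_tilde = W @ z                       # linear decoder, exactly as in K-means
    cost = (y - y_tilde).pow(2).sum()     # squared Euclidean distance C(y, y_tilde)
    reg = alpha * z.abs().sum()           # R(z) = alpha * ||z||_1, the soft constraint
    return cost + reg
```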
So there's like a horizontal part here, there's an oblique part, there's a vertical part, there are sharp edges, yes. There you go, thank you. It doesn't capture the spiral very well. I think it doesn't capture the spiral at all, okay? But you're very kind to say just "not very well". See, this thing doesn't really work. I see some large-energy locations here in this kind of little slice, more yellow things here, but basically you just have a darker region near where the majority of the points are, and then the stuff just grows as you go away from this location. Actually, this model had a bias as well in the decoder, such that you can have the darker region associated with having z equal to zero over here, okay? So what you're seeing here is that whenever you have z equal to zero, then you're at this location over here, which is going to be equal to the bias, right? So, on this slide, I should tell you there was a bias as well, which I didn't mention for that visualization. If you have z equal to zero, you end up having y-tilde being the bias term, right? And so you end up over here. And then, as you start increasing z, you basically measure, on the right-hand side, what is going to be the length of z, okay? That's pretty much all that happens, all there is there. So how can we fix that? Let's try to approximate this spiral by using consecutive points here, two by two, right? So let's try to approximate this continuous manifold with some piecewise linear approximation. That's going to be our goal, still with this sparse coding thing, okay? So from here, we can basically, again, we were cleaning up the part below, and then we define this t as the top two components of this first ž, okay? The ž was the minimizer of the energy, okay? And the energy is going to be the summation of the reconstruction error and this regularization term, okay? So in order to minimize this term, you have to basically change this z, right?
But if you change z too much, this term becomes large, okay? And so this term starts getting annoyed. So when you minimize this overall energy, which is the sum of these two contributions, there's gonna be a trade-off between getting a good prediction and having a z that has not grown too much, okay? What happens for this ž, potentially? Potentially, ž could be very, very tiny, which is gonna make this term happy. But then the first reconstruction error might be very large, and one way the model can cheat is to make W grow a lot, right? If you try to minimize this objective function, we would like to have a trade-off between these two things. But this doesn't happen if W is allowed to grow indiscriminately, right? Because this one tries to shrink down the z, the latent variable, but if there is no constraint over this guy, this W can just grow arbitrarily large, and then the cost overall is gonna be unaffected. So the cost will be very good and you don't have any sparsity overall, okay? So one thing that we have to do is also put a constraint over this W. Namely, we have to avoid this value growing too much. One way to do that is gonna be to make sure that the columns have unitary norm. So every time we update this W during learning, we also have to make sure that the norm of each of the columns is unitary, such that they cannot change their size independently of what happens with z and the cost. How can we get to approximate that spiral by doing this kind of piecewise linear approximation? So, after we find this ž, which is a compromise between lowering its length by reducing this R, but also compromising on this other side, where we try to get the prediction to be close to my target, right? I get this ž, which is gonna be a compromise of a choice for the latent. t is going to be the two top components of this latent.
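Here is one possible sketch of both pieces: finding ž and renormalizing the columns of W. I use plain gradient descent on the energy as a stand-in solver, which is not what the slides prescribe; dedicated sparse-coding algorithms such as ISTA do this inference faster and more exactly:

```python
import torch

def normalize_columns(W):
    """Constrain each decoder column to unit norm, so W cannot grow to defeat the l1 term."""
    return W / W.norm(dim=0, keepdim=True)

def infer_z(y, W, alpha, steps=500, lr=0.05):
    """Find z-check by plain gradient descent on the energy (a simple stand-in solver)."""
    z = torch.zeros(W.shape[1], requires_grad=True)
    for _ in range(steps):
        energy = (y - W @ z).pow(2).sum() + alpha * z.abs().sum()
        energy.backward()
        with torch.no_grad():
            z -= lr * z.grad             # descend on the reconstruction/sparsity trade-off
            z.grad.zero_()
    return z.detach()
```

`normalize_columns` would be applied to W after every learning update, which is exactly the unit-norm constraint described above.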
So you can think about having the ž, and I mask everything but the two largest components, right? So I check my vector, which is gonna be likely a sparse vector, and I just take the two largest components and I zero out everything else. Next, I'm going to compute this secondary ž minimizer, which is going to come out from minimizing just the reconstruction error, given that I provide just those two components inside the decoder, okay? So here, instead of having my full y-tilde that is coming from that orange ball z, now I use this minimization here. I minimize this cost between the target and the chopped version of the ž, okay? Now I get the second ž, which is going to give me the optimal quantities, right? The optimal values that the two components need to have in order to give me a perfect reconstruction, okay? I now choose my loss, my per-sample loss functional, to be the classical energy loss, right? So, no big deal: it is going to be the energy evaluated at the second ž, the one with only two components, the two optimal components, right? Okay, okay. So if you do this, let's say you learn these kernels here, right? These are going to be the columns of my decoder, okay? These values here, the green dots, are going to be the columns of this matrix that we saw before. So this matrix here, we're gonna multiply it by z. z is gonna be a vector that has everything zero but two components. Before, we said, if z was a one-hot, you extract the single column, remember? In the K-means case, if the z is a one-hot, you extract the corresponding column, you just extract the corresponding centroid. Now z is going to have two non-zero components, okay? So far we understand, yes? I hope so, right?
Okay, so this is gonna give me the top two components, and then I find the optimal values for those top two components. So what are the optimal two components for this location over here where I have the mouse, okay? Top-two is going to give me the top two components of the z check, okay? So z here is the original latent variable, the random one. Then I find z check, which is the one that gives me the minimum value, a compromise between minimizing the reconstruction and minimizing the regularization. T masks everything but the two largest components of this vector. So I enforce that this T outputs a vector that has only two non-zero components, okay? So it's like a hard sparsification, okay? Then the top two optimal values are gonna be computed through this part over here, okay? So in this case here, top-two zeroes out everything but the largest values. So z check is not one-hot. No one said that, right? That was in the K-means case, right? z is gonna be a vector in R to the d, okay? So I guess next time I will start right here: z in R, whatever, right? Okay. So the point is that this sparsification tries to set components of the z to zero, right? Many of them. That's what a sparse vector means: it's a vector with many, many zeros, right? And the reason that we would like to have this R here is that we'd like to have as many zeros as possible in this latent variable, such that we avoid having Y tilde have too much freedom, right? Such that we avoid basically a collapsed model, okay? So let's figure out now what are the two optimal components of this z check two, okay? The two means also there are only two non-zero components, right? So again, if you multiply w by a vector, we said we're gonna get the scaled summation of the columns of the matrix. Now we only have two values of z that are non-zero.
So the Y tilde is going to be the sum of two columns of the matrix, appropriately scaled. So let's figure out now what are going to be these two values, right? For this reconstruction over here. Let's say this is my first column of the matrix, and this one on top right here is gonna be my second column of the matrix. So what are the coefficients that you have to multiply these two columns by in order to get this point over here? If I want just this point over here, what are gonna be the two coefficients? I want to reproduce this value over here. Can anyone tell me? One, zero. Exactly, okay? So basically you're gonna get the one-hot vector and we're basically recovering K-means. Instead, how about this point here, which is 70% close to this location and 30% close to this location, right? So 0.7 and 0.3, right? So 0.7 times this one plus 0.3 times this one, okay? And so you're gonna get this location over here. Okay, I think we understood this, right? So question for you all now. How about a point that is moving along a parallel line that is just inside here? What is gonna be the component for this location? How about this thing here, right? So I'm talking about this parallel line that is closer now to the origin, right? So for this point here on the top, we had 0.7 and 0.3, right? And now, simply, if you want to have it shrunk, you can just scale down both values, right? So let's say 0.9 multiplied by 0.7 and then 0.9 multiplied by 0.3, right? And so in this way, you can move down all the way and you're gonna have a lower and lower energy as you move away from this side, which is not good. You understand what's happening? So if you do that, you're gonna end up basically with something like this. As you move downwards, right? If you move down in this direction, you just need to scale down those coefficients and you're gonna get a lower and lower energy.
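The arithmetic of this failure mode is easy to check numerically; here is a tiny worked example with two made-up column vectors (the specific numbers are illustrative, not from the lecture's figure):

```python
import torch

# Two hypothetical decoder columns, and a point at 0.7 / 0.3 between them
w1 = torch.tensor([1.0, 0.0])
w2 = torch.tensor([0.8, 0.6])
y_outer = 0.7 * w1 + 0.3 * w2   # point on the segment between the columns

# A parallel point closer to the origin is reached by scaling both
# coefficients by 0.9, and the L1 penalty (their sum) shrinks with them.
y_inner = (0.9 * 0.7) * w1 + (0.9 * 0.3) * w2
r_outer = 0.7 + 0.3             # L1 norm of the outer point's coefficients
r_inner = 0.9 * 0.7 + 0.9 * 0.3 # strictly smaller: lower energy inside
```

So every point closer to the origin gets reconstructed exactly with a *smaller* regularization cost, which is exactly why the energy keeps dropping toward the center.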
This is also not good, right? You understand what's happening? So now all points that are parallel to that one can be reconstructed by changing those two coefficients. Yes, we understand? No, we don't understand? I hope we understand, okay? So how can we fix that, right? There is one more trick to get this to actually work very nicely. And the final trick is gonna be the following. But isn't there a penalty for distance from the spiral? That's a good question, right? So the spiral is not there when you try to do inference, right? When you try to do inference, you try to reconstruct, let's say, this point over here where the tip of my arrow is, right? So the model will try to minimize the distance to this location, right? There is no spiral information in the model, okay? The model will try to minimize the C and the R for this location over here. And so the model will just scale down the coefficients inside the Z in order to minimize the R, while the Y tilde stays close to this target, right? The location that you try to infer, right? Okay? So the model will try to give you the energy associated to your specific location, but the model doesn't have a clue about where the spiral is, right? The model has this kind of cone. And as you go closer to the origin, you're gonna be simply lowering these two coefficients, and the sum of the two coefficients is gonna be actually the R, right? So every point, more or less, will have whatever fixed C cost, right? And then you just reduce the Z as you move towards the center of the diagram. C would be zero, actually, right? So if you go down here, you manage to hit any location, right? So given two vectors, you can always hit any 2D point on the screen, right? Given that I have two vectors, unless they are aligned, right? If they are not aligned, you can move them by scaling them.
You're gonna be hitting any point in the 2D space, right? So C is gonna be fixed and equal to zero everywhere. The only thing you are seeing here is going to be the R term, which differs by how far you are from the origin. You understand, Joby? So it's actually a broken model, okay, very good. What are we predicting? So here we are just estimating the free energy. And the free energy, we wish, would reflect what is the closest distance to the manifold, but here it looks like the model has no clue where the manifold is, okay? So we would like it to definitely tell how far you are from the spiral, but it's not working as we expected, right? So this took a few months of my time to put together, and nothing was working. And well, now I understand why, and I'm trying to explain to you why it was not working, okay? Are we on the same page? Require the weights to be zero-one to fix the model? So yeah, we have an issue with that. The columns now also need to have unit norm, right? I didn't show you that. So a lot of things are in there to balance out. So let me show you how we fix this problem, okay? So to fix this problem, instead of having these two-dimensional points, I will pretend they are two-dimensional, but I add an extra dimension. What do I mean? I add a fixed value on top of each location. This means that I'm gonna define this y dot, which is this augmented y, which has an extra one on top of each 2D coordinate, okay? And so now these points are basically on a plane that is at height equal to one, okay? And so this gives me an additional degree of freedom, because now the columns of this matrix have three components. Now I have these 50 points that have size three. What happens now, if I do exactly all the same, is that points need to be at that height, right? So we cannot just intersect every point in the 2D space.
Now you're gonna have to intersect the 2D points that are living at that specific height. And so this allows me to end up with this energy surface, okay? Only points that are connecting this first element of the dictionary and the other dictionary element over here will lie at height one, right? So there we have this zero energy level, okay? Drawn over here in dark, dark purple is zero. And otherwise, as you move away from this region, this energy increases. The energy increases quadratically, right? Because the cost we set is the squared Euclidean distance. But since all the values are very, very tiny, you wouldn't be able to see anything, right? It's very flat. So instead, in order to make it less flat, instead of having this very smooth parabola, I use a root such that it pops out, you can see it better on the screen, okay? That's why there is this cube root value. Okay, we understand what's going on? Yes? No? I hope so, right? Okay, very good. So finally we have an energy function here, which is just the actual reconstruction term, right? And the reconstruction term is going to be zero only along this location over here. So on the outer rim, the approximations are going to be tangential, right? They are like tangential approximations of the spiral, of the manifold. Whereas, for whatever numerical reason, symmetry reason, the approximation in the central part instead is going to be radial, right? So points that happen to be here are going to be approximated along this direction. Points that are located here are going to be oriented in this orientation, and so on, okay? So for each point here, this picture took like 20 or 30 minutes to actually generate with a computer, because every location, every pixel you can see, every pixel is a full minimization process, right?
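The augmentation trick just described, appending a constant 1 to each 2-D point so the data lives on the plane at height one, is one line of tensor code; a minimal sketch (the data here is random, standing in for the 50 spiral points):

```python
import torch

# 50 stand-in training points in 2-D (the lecture uses spiral points)
Y = torch.randn(50, 2)

# Augment each point with a constant 1, lifting it onto the plane at
# height one in 3-D; the decoder must now reproduce that height as well,
# which rules out shrinking z toward the origin for free.
Y_dot = torch.cat((Y, torch.ones(50, 1)), dim=1)
```

After this, the decoder matrix has three rows instead of two, and scaling both coefficients down no longer reconstructs the point, because the third coordinate would drop below one.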
So I trained the system with those 50 points that were augmented with the extra one on top. And then here I evaluate what is the reconstruction error for every location. I have 301 points times 301 points, about 90,000 values in total. Questions so far? So this was the new part, a new, new lesson. As you can tell, I was a bit choppy, but I think we managed to get to the end of this new content, exclusive content for you only. And everyone that is gonna watch the recording. Questions? Before I move on, right? What was the goal of sparsity in coding? So sparse coding allows me to have a non-flat manifold, okay? If you don't have that additional term, like in this case over here, if you don't measure the length of the Z in terms of energy, then every location on this map, on this picture, is gonna have exactly zero energy, right? So if your model can reach every location of the space, then whenever you measure the distance between a specific location and whatever the model can reach, if the model can reach it, you have zero energy, right? You try a different location, the model can just reach it there. And then you have zero energy everywhere. So you no longer know whether a new point is on the manifold or off the manifold, if you always have a zero energy associated to it, right? Okay, so the objective, the target of training these energy-based models, right, the way we train them, is gonna be minimizing the loss in order to have a well-behaved energy function. An energy function is well-behaved if it assigns low energy to points coming from the data manifold, and high values otherwise, right? If it assigns low values for every point, then it's gonna be a useless model, right? If it always tells you, oh, the energy is zero, then you are okay anywhere.
And now, if you see colors that are not just purple, that means that this model somehow tells you that this location over here has a very high energy, right? So this is gonna be a bad location. Although, again, the model is not well-trained, right? So this model is not well-behaved, right? Because it doesn't necessarily assign a low energy to the points on the manifold, okay? So sparsity was one way to restrict the degrees of freedom that this Z has. So we basically limit the volume to which the model can assign low energy, okay? One way to do that is adding this penalty, a regularization term, which is penalizing large Zs in the one norm, right? Okay, are we understanding? Okay, other questions before I move on? Okay, here's a question. A lot of this model is not differentiable, how do you minimize it? Everything was differentiable, right? I just minimized with respect to the first argument here by gradient descent. Norm one is not differentiable? It's going to be just a constant gradient, right? What do you mean it's not differentiable? It has a kink, for sure. So there are different ways, right, to compute the optimal Z. I just used standard gradient descent, like I've done for the other notebook. Another, better way to actually find the Z check would be to use ISTA, the iterative shrinkage-thresholding algorithm, okay? Which is basically applying subgradients, yes: it applies a gradient step for the minimization of the reconstruction error and then a shrinkage step for minimizing the one norm of the latent, okay? This becomes a little bit too technical, I think we don't care. So you can either use ISTA or you just apply gradient descent; everything works, right? Gradient descent just allows you to minimize the objective function by following a subgradient, right? Anyway, moving forward, because it looks like no one is complaining.
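For the curious, ISTA as mentioned above fits in a few lines: a gradient step on the squared reconstruction error followed by soft-thresholding, which is the proximal step for the L1 norm. This is a generic sketch, not the notebook's implementation, and W, y, and the step sizes are toy values:

```python
import torch

def ista(W, y, lam=0.1, lr=0.01, steps=300):
    """Iterative shrinkage-thresholding: minimize ||y - W z||^2 + lam ||z||_1
    by alternating a gradient step on the reconstruction term with a
    soft-threshold (shrinkage) step on z."""
    z = torch.zeros(W.shape[1])
    for _ in range(steps):
        z = z + 2 * lr * W.t() @ (y - W @ z)                     # gradient step on C
        z = torch.sign(z) * (z.abs() - lr * lam).clamp(min=0.0)  # shrinkage step on R
    return z

torch.manual_seed(0)
W = torch.randn(2, 10)
y = torch.randn(2)
z_check = ista(W, y)
```

The shrinkage step pushes small components to exactly zero, which is why ISTA produces genuinely sparse latents rather than merely small ones.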
So from sparse coding, we're gonna be looking at something else, right? Same diagram here. But we're gonna be talking about this target prop. What is target prop? Many new things that you might not have heard before. Target prop, or target propagation, is the following, okay? So we compute a Z tilde, right? I have a green bold Z tilde. What is green? Remember, we started the lesson today by having definitions. What does green represent in our drawings? It is not a latent state. It is a hidden state, right? And then we made very clear that hidden and latent are very different words in our field, right? Latent is orange. Hidden is green. Hidden means internal, right? That's another word for hidden, whereas latent means missing, okay? So we may want to just talk about internal and missing variables, right? But the actual terminology is hidden and latent, and they are two different things. Anyway, I compute this Z tilde. What does tilde mean? Circa, right? More or less: a prediction, an estimation, yes. So this is coming from the encoder of Y. Oh, what is this enc? It's the first time we see an encoder, right? We haven't seen an encoder so far. With me? Is it correct? Yes. Right, we never talked about encoders, remember? So far, all we have seen was a decoder, as you can see here, or, what was the other block? A predictor, right? So we either had a predictor, when we go from X to Y, right? Or we had a decoder, right, to go from the internal representation, or from the latent representation, down to the actual ambient space. Okay, and today we're gonna be introducing this new guy here, this encoder. Interesting, what does it do? So the Y stays in the same space? No, it goes to the internal space, right? So it's the opposite of the decoder. The decoder goes from this hidden, internal representation down to the ambient space.
This encoder goes from the ambient space up to the internal representation. That's why it also is green. All right, so there we go. So then we use this value I computed by feeding the encoder with my target. I use it for initializing the Z. Why do I do that, right? Because before, how was I initializing the Z? Randomly, right? I pick a random Z, and then we were doing the minimization to compute the free energy, right? Now, instead, I'm gonna be using this first module in order to initialize the latent variable that I start my search from, eventually, okay? Why is Z check different from a hidden representation H? Hold on, Z check is blue, right? Z check is the optimal latent. Okay, we are gonna be making a lot of confusion here. Z tilde, Z tilde is green. So what is the question? How is it different, right? Okay, I guess: how is it different from a hidden representation? That's the question. So if you have a tilde somewhere and you don't have the tilde on the other side, guess what's missing in the center? That's my question for you. Then you're gonna get the answer. A spring, right? I'm just gonna be adding this spring here. I call it D. All right, it's gonna be my cost D. I cannot call it C because we already used that term, but okay, it's still a cost, right? Anyway, so now we're gonna see that this spring doesn't let my Z fly too far from my Z tilde, right? Because this is actually a free variable we're gonna be minimizing over. But on the other side, once I have the Z check here, I will actually try to get my Z tilde to be close to my Z check, right? Anyway, let's figure out what's going on. So I compute my Z check by minimization of the energy, right? The sum. The energy is gonna be what? What is the energy in this system here? Tell me. Yeah, there you go. So the energy is gonna be the sum of all the boxes, right? As we have said plenty of times. So it's gonna be the C plus the R plus the D, okay?
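The sum-of-all-the-boxes energy can be written down directly; a minimal sketch, where the decoder, the L1 weight lam, and the tensor shapes are all assumptions for illustration:

```python
import torch
import torch.nn as nn

def energy(y, z, z_tilde, dec, lam=0.1):
    """Sum of all the boxes: C (reconstruction) + R (sparsity penalty)
    + D (the spring keeping z close to the encoder's guess z_tilde)."""
    C = (y - dec(z)).pow(2).sum()    # squared reconstruction error
    R = lam * z.abs().sum()          # L1 norm of the latent
    D = (z - z_tilde).pow(2).sum()   # spring between z and z_tilde
    return C + R + D

dec = nn.Linear(20, 2, bias=False)   # hypothetical decoder, 20-D latent -> 2-D
y, z, z_tilde = torch.randn(2), torch.randn(20), torch.randn(20)
E = energy(y, z, z_tilde, dec)
```

Inference then minimizes this E over z alone, starting from z = z_tilde rather than from a random point.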
And so we're gonna be trying to minimize the summation of all these items here, starting from the initial location given to us by Z tilde for my orange Z, okay? Then I minimize this free energy, okay? The free energy, which is going to be the energy I get when I replace the Z with the Z check, right? I skipped a step, you understand, right? So the free energy is gonna be E evaluated when you replace the orange Z with the Z check, okay? So we're gonna have the C at Z check, the R at Z check, the D at Z check. And so what does this do? Well, how do we minimize this thing, right? So how do we minimize, let's say, the C? The C term can be minimized by moving in the opposite direction of the gradient of the C with respect to the parameters of the decoder, right? So I update the weights of the decoder by moving towards the negative gradient of the final cost, the reconstruction cost, with respect to the weights of the model, right? And then how do we minimize this distance here? So this Y tilde is gonna be the output of feeding Z check to the decoder. How do I minimize this D term? Same, with D and the parameters of the encoder. Okay, very good. So I will update the weights of the encoder by moving a little bit in the opposite direction of the gradient of this D term, which is gonna be a spring between my Z check and the Z tilde. The gradient of this thing with respect to the weights of the encoder. Interesting, okay? So you can tell now that this blue bold Y allowed me to compute Z check, which is also blue, right? And now my blue bold Z check here acts as a target for my Z tilde, okay? So we have back-propagated the target through the model, inside the model, okay? So now this missing variable that we had before is converted into a new target. Before, we had to compute a minimization every time we were performing inference, remember? Every time I had to compute that F, I needed to perform a minimization.
And every time we were starting from a random point. Now I can just have the benefit of starting from a nice initial guess, right? What is this nice initial guess? Well, the initial guess is gonna be provided to us by this Z tilde. And Z tilde is going to be an estimate for... finish my sentence, if you are following. Okay, let's try to put together the sentence again, okay? So before, when we were computing a Y tilde prediction, I had to minimize the cost with respect to the latent, in order to find a Z check. This minimization takes time, right? Because we had to run gradient descent. Now, well, it's not gonna be that expensive anymore, because the initial value for my orange bold Z is gonna be provided to me by Z tilde. Z tilde acts as an approximation for Z check, yeah? And how did we end up there? Well, we end up there because we are training the encoder to do that, okay? So, answering the question above: not really following the difference between Z tilde and Z check, okay? So the encoder output, Z tilde, is green. It means it's a hidden or internal representation, which is something that you compute yourself. I feed the blue bold Y inside the encoder. The encoder spits out a green bold Z tilde, which is my internal representation for something, okay? We use this internal guess to initialize my latent variable. Remember, this latent variable is a missing variable. How do we find Z check? Z check is gonna be the optimal value that allows me to minimize my energy overall, right? So whenever I have a target Y, I will minimize the energy with respect to the latent variable, to perform inference, right? The latent variable has to be initialized to something, right? You can choose a random value, you can choose zero, you can choose whatever you want, but then it's gonna take many steps of gradient descent to minimize the energy to perform inference, right? Then we go down the hill, right?
We perform the minimization, we find Z check. Z check is the optimal latent. And now the interesting thing is that you train the encoder through the minimization of this spring, in order to try to get Z tilde, the initial guess, to be as close as possible to my optimal value. You understand here? My Z tilde is my initial guess, right? The thing I used to initialize my latent. And then Z check, it's blue, means it's super cold: it is the lowest-energy latent variable associated to that specific blue bold Y, to that specific target, okay? So given a target, I have an energy function, an energy that is changing across all the Zs. Remember, I showed you 24 boxes in the last lesson, right? Where Z was going from zero to two pi, and then we had the U shape, the squiggly line, and so on, right? So we have an energy function per each target I have, right? And then we also defined that Z check is where the energy takes its lowest value, right? So I can just take the lowest value of this E here. Now, that Z check is the outcome of a minimization process. This minimization takes time, takes computation. In order to speed up computation, well, in order to save time, we can start from a better initial location, right? How to find a better initial location? So this was the initial location, which at first was maybe not very good. Then we train the encoder to come up with initializations which are very close to that Z check, right? Z check was the optimal latent for that specific target. Now this encoder is fed with this Y, and the encoder basically learns to predict what is the optimal latent, okay? So the optimal latent is Z check. We train the encoder to predict the Z check. Here you can see this, right? The encoder is trained to minimize the distance between its prediction and the optimal latent for that specific blue bold Y. Are we okay? Other questions?
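One full target-prop step as described above can be sketched end to end. This is a minimal, hypothetical implementation, not the course notebook: the shapes, learning rates, number of inference steps, and the L1 weight are all assumed values:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n = 20, 2
enc = nn.Linear(n, d)                 # encoder: ambient -> latent
dec = nn.Linear(d, n, bias=False)     # decoder: latent -> ambient
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
y = torch.randn(n)                    # one training target

# 1) Warm start: the encoder proposes z_tilde; detach it so the
#    inference loop does not back-propagate into the encoder.
z_tilde = enc(y)
z = z_tilde.detach().clone().requires_grad_()

# 2) Inference: a few gradient steps on E = C + R + D w.r.t. z only.
for _ in range(20):
    E = (y - dec(z)).pow(2).sum() \
        + 0.1 * z.abs().sum() \
        + (z - z_tilde.detach()).pow(2).sum()
    g, = torch.autograd.grad(E, z)
    with torch.no_grad():
        z -= 0.1 * g
z_check = z.detach()                  # the (approximately) optimal latent

# 3) Learning: the decoder chases y through C, while the encoder chases
#    the fixed z_check through the spring D -- the back-propagated target.
opt.zero_grad()
C = (y - dec(z_check)).pow(2).sum()
D = (enc(y) - z_check).pow(2).sum()
(C + D).backward()
opt.step()
```

Note that z_check enters the learning step as a constant target: only the spring D trains the encoder, and only the reconstruction C trains the decoder.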
Do you only minimize Z to find Z check during training, or would you still use that to evaluate the model? I showed you last Wednesday how to perform inference, right? Whatever I was showing you, the ellipse. How did we perform inference? Inference was estimating the free energy of a blue bold Y, right? The energy of the blue bold Y in that ellipse case is gonna be given to you by the shortest distance. To find the shortest distance, I do need to find Z check, right? So you definitely need to find the latent, the Z check, to do inference, right? And the issue with the collapsed model is the fact that you can find a Z check which is giving you zero distance all the time, right? Now I'm asking if the point of the encoder is to supplant... yeah, well, you see that, right? So in this case, you're using the encoder to speed up computation, or to save time, right? Because the second time around, the second epoch, you're gonna already have very good guesses, right? So you don't have to spend forever minimizing this thing, because you're gonna maybe require just a few steps of gradient descent to hit the Z check, right? But you still want to use the minimization, right? Because that's the actual Z check. Z check is the actual minimizer. The encoder gives you an approximation for that. Maybe it's gonna be a good encoder and maybe you just don't need to do any step, you already reach the minimum, right? You don't know. Just to make sure I understand: Z check is sort of a target for a good initial guess of the latent. Yeah, yeah, definitely, right? So Z check is what we train the encoder to be able to predict, right? So this is the distance, the D, right? So Z check is fixed, right? The Z tilde is gonna be a function of the weights of the encoder. We manipulate the weights, right? We are adapting the weights in order to try to hit the target, right? So Z check is gonna be my target for my Z tilde, right?
And that's why it's called target propagation: because we back-propagated this target through the architecture, okay? All right, so let's figure out when this is used, right? So let's talk about, oh, hold on, one more question. Will you give an example of how to build this in PyTorch later? I don't have it ready, but which part, okay? The minimization of the latent, or what? Like, which part is hard to understand? Which part are you referring to? Target propagation is just adding one additional module, right? So you have two neural nets: you have a decoder and an encoder. And there is gonna be one minimization. So you have the first minimizer here, and then you use this value. Well, let me think. I have this in the notebook, so I can actually show you. Okay, maybe let's do that, right? Sure, okay, I can do that. I didn't plan to, but I should be able to pull it up. Okay, good question. Okay, it would be very helpful to see code for this. Yeah, okay, sure, I understand. You don't have to repeat yourself. So, going towards the code, I guess, at this point, let's have a more precise implementation of this thing, right? So this was the generic case. Let's have a look at the specific case, okay? So we're gonna be talking about still sparse coding, but done in a different manner, right? So instead of using the regularizer, the R term, which had the one norm of the latent, we're gonna use something called, okay, sorry, the animations are broken, but okay. We're gonna use something called a nonlinear activation, right? So instead of using this R term over here, we're gonna be using a sparsifier, okay? So instead of just having a soft constraint, I use some sort of harder constraint. We don't really care what's inside here, okay? I just used a nonlinearity that allows me to get this sparse latent variable. At a higher level, can you describe a concrete example of where we would use a system like this? That's what I'm doing right now, right?
Whenever you are using a latent-variable energy-based model, to perform inference you need to perform a minimization, right? A minimization with respect to the latent in order to find this energy. This minimization takes time. If you use gradient descent, it takes several steps, right? Target propagation allows you to remove this kind of limitation, right? So far, what we figured was that every time, you had to run gradient descent, full gradient descent, right? You find the free energy, that closest distance, and then you try to minimize that one a little bit with one step of stochastic gradient descent, remember? So every training sample needs to undergo a full minimization process, right? Each training sample, right, to find the corresponding latent variable, the optimal latent Z check, right? Given Z check, I can compute now the free energy, and now I can just do one step of stochastic gradient descent in the parameter space, okay? This, as you understand, may take a lot of time, right? So it's a really painful operation, because you have to run the optimization process at every point in the training set, right? This is insane. So target prop comes to the rescue from this big issue, and after a few epochs, whenever the encoder starts catching up and giving you very good estimates for that initial value of the Z, the computation of Z check no longer requires all those many operations, okay? That's the example at a higher level that you can understand in terms of concepts that we have covered so far. Then there is another explanation, which I can give you in the future, which is seen from another side, right? But I cannot tell you things from the future, because that's not how you are supposed to learn, right? You're supposed to learn one way, okay?
Again, if you already know these topics, then, yes, I could say more, but the other people in class might not know, so it's not fair to explain things from knowledge that you have not been given in this class, okay? All right, so one example that was going to make these things more concrete was to use this target propagation to train this sparse model, right? Where instead of using this soft constraint, we use this sparsifying constraint, okay? So the sparsifying constraint basically gives me a headache here, because it's gonna give me very, very little gradient coming back to modify this Z, okay? Anyway, we can really speed up computation if we come up with a very good initial value, okay? And I'm just repeating myself, basically, okay? So, sparsification means many zeros. Many zeros means there is no gradient coming back. A very painful operation, because, again, this Z is not changing much at all, right? Given that very few items are non-zero, it's very convenient to start from a location that is advantageous, right? An advantageous location, okay, good. So let me show you how the outputs of applying this model look, okay? So I trained the system on the same points. These dots here are going to be the columns of my decoder, okay? In this case, they are not necessarily of unit norm, because there is no constraint that makes my Z small and tries to push those up, right? This is a nonlinear function that is just dropping many of the components of the Z to zero. And this is how this overall spiral gets reconstructed, right? So the region inside these kinds of cones looks darker, towards zero. Things that are outside these kinds of cones, I don't know if I want to call them cones or whatever, it looks like cones, right? So this looks like the intersection of many cones that are facing basically the origin, or the center of the spiral, okay? But then, is this all dark? Let me zoom in and let me change the scale of the energy, okay?
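The sparsifying nonlinearity discussed above can take many forms; the lecture does not pin down the exact function, so here is one plausible choice consistent with the description later on (outputs squashed into [0, 1], most components exactly zero). The threshold and test values are made up:

```python
import torch

def sparsifier(z, threshold=0.5):
    """A hard sparsifying nonlinearity (one possible choice): shift by a
    threshold and clamp into [0, 1], so anything below the threshold
    comes out exactly zero and the rest is squashed between 0 and 1."""
    return (z - threshold).clamp(min=0.0, max=1.0)

z = torch.tensor([0.1, 1.9, -0.3, 0.8, 0.2])
z_sparse = sparsifier(z)   # only the components above 0.5 survive
```

Because most components come out exactly zero, almost no gradient flows back through them, which is precisely the headache that makes a good warm start from the encoder so valuable here.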
And so here you can see that the inside of the spiral is actually non-zero, right? It's around one. And the only zero locations are going to be those around the outer rim, more or less, okay? So only those points here along the outer rim are given a low energy, whereas everything else inside this part has, more or less, 1.5 units of energy, I believe. And then everything outside this spiral gets a very high energy very quickly, okay? And then you wanted to check the notebook, so let's check the notebook again. This is gonna be, I don't know if it's gonna work, right, this code, because I didn't even try. But why not? Let's see whether we can improvise, right? So we go: work, GitHub book, GitHub restore, backprop, okay? GitHub pull from the book, activate, Jupyter Lab. So let's have a look at the target propagation, okay? So here I have my model, which is a sparse coder: I can sample my latent, I can update some sort of internal sparsifier or something. Then there is the sparsification item, which turns my Z into zeros and ones, and most of the time it's gonna be zeros, okay? So it makes it between zero and one, and then most of the items are going to be shrunk down to zero. Then there is this decoder, which is simply, maybe I should zoom a little bit, right? I have this decoder, which is fed with the latent variable, as we said before. So the decoder is defined here. The decoder goes from D, which is 20 dimensions, so I have 20 columns in my matrix, and then I have three rows, right? Again, we are using this kind of additional one on top. And the encoder instead goes the other way around: it goes from three, which are these Y locations, into this 20-dimensional hidden representation, okay? Oh, yes, I use Greek letters in the code, right? So if you want to write, let's say, sigma, you do backslash, sigma, and then press tab, right?
Or if you want beta: beta, tab. Or if you want theta: theta, tab, right? This is very convenient whenever I write mathematics, right? If I had to spell out the symbols in English, I'd go a little crazy. This is not quite code, right? These are notebooks, and I wouldn't call notebooks code; notebooks are some sort of hybrid, right? So yes, I use Greek in the notebooks. No, I don't use Greek in code, right? Anyway, so far, all good, right? We don't have anything crazy. Let's move on. So we are interested in training this thing, right? So we define a few things. We have that the energy is going to be, remember, a sum. The energy in this case no longer has the regulariser R, right? The energy is going to be the summation of C and D, because we swapped the regulariser R for this sparsifying non-linear function, okay? So we have this y minus the decoded, sparsified latent, right? And then I square this difference, and I take the sum. As you can tell, which term is this one I'm showing you right now? Can you tell me? This is the C, right? The y minus y-tilde, correct. And then the other term here, z minus z-zero squared, summed: that's going to be what? The D term, right? Okay. And z-zero is going to be z-tilde in this case, right? The initial value. There was another option too, but I don't remember any more why I didn't use that term. Okay, so let's see: we just use an optimizer, Adam, for training the dictionaries. I have some buffers for logging the training loss, the batch loss, and so on. I also have some initial value for the best loss, so that I can do early stopping and so on. Here you have this dummy data, right? The dummy data has this bunch of ones on top of the y, right? So I augment my ys, those points on the ellipse, with a bunch of ones. And so the training part is going to be the following, okay? So here we have the full thing.
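In code, the energy just described could be sketched like this (a minimal reconstruction, assuming a generic linear decoder and a ReLU sparsifier; the notebook's actual names may differ):

```python
import torch

def energy(y, z, z0, decoder, sparsify):
    # C term: squared reconstruction error, ||y - Dec(sparsify(z))||^2
    C = (y - decoder(sparsify(z))).pow(2).sum()
    # D term: squared distance to the initial value, ||z - z0||^2
    D = (z - z0).pow(2).sum()
    return C + D

# Toy usage: a linear decoder from d = 20 latent components down to 3
dec = torch.nn.Linear(20, 3, bias=False)
y, z, z0 = torch.randn(3), torch.randn(20), torch.zeros(20)
E = energy(y, z, z0, dec, torch.relu)
```

Both terms are sums of squares, so the energy is a non-negative scalar, as the plots in the demo suggest.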
First, I'm going to get my z-tilde by sampling a latent variable, right? Oh, hold on: z-tilde equals sample_latent(y). What is this doing? I see. Ah, okay, because I should have changed the name. Okay, see, I was not prepared for teaching this. So what is the sampling function doing? Where is it? So sample_latent: if I provide a y, so if y is not None, then simply encode the y, right? If y is None, so if I don't provide the actual target, then I just return a random z, right? You see this? So this allows me to do two different things: in one case, I sample a latent given that I provide a y; in the other case, I get a latent when I don't provide a y. So one is conditional sampling, and the other one is unconditional sampling, right? But here I could have simply used self.encode, right? So if y is not None, then z, which is going to be the z-tilde, is the encoded version of the y. You see this? So here we have the following: z-tilde is basically encode, so we can just do encode(y). Then I say z-zero, which is going to be my initialization, is this thing detached, right? Why do I detach? Can anyone tell me? What would happen if I don't detach the z-tilde? So later on we are going to compute some backward passes, right? Whenever you compute backward, what happens? Backward goes in the opposite direction of the forward pass, right? Now, whenever we do back-propagation with respect to the latent variable, you want to change the latent variable; you don't want to change the encoder, right? And if you run back-propagation twice, you're going to get into trouble, because you only went forward once through the encoder. As you run back-propagation multiple times, you haven't re-run forward propagation for the encoder, right?
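The conditional/unconditional sampling and the detach just described might look like this (a hypothetical sketch reconstructed from the narration, not copied from the notebook):

```python
import torch

class SparseCoder(torch.nn.Module):
    def __init__(self, n=3, d=20):
        super().__init__()
        self.encoder = torch.nn.Linear(n, d)

    def sample_latent(self, y=None, d=20):
        if y is not None:
            return self.encoder(y)   # conditional: encode the target
        return torch.randn(d)        # unconditional: random latent

model = SparseCoder()
z_tilde = model.sample_latent(torch.randn(3))  # conditional sample
# Detached copy: backward passes on z0 can never reach the encoder.
z0 = z_tilde.detach()
```

The detached z0 is what the inner minimisation will repeatedly back-propagate through, which is safe precisely because its graph no longer contains the encoder.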
Again, maybe this is too advanced, I don't know. Anyway, we detach the z-tilde so that autograd doesn't go back inside the encoder when we minimize the energy with respect to the latent variable, right? Then I compute z-check. How do we compute z-check? Well, there's this compute_z_check, which is simply gradient descent. Let me check; let's go see what compute_z_check is. So compute_z_check, right, it's here: it's running LBFGS over this energy, okay? And LBFGS is simply this function over here, where I use this gradient-based algorithm, with strong Wolfe line search, to minimize the energy by changing the latent variable, right? So this is a full optimization loop: whenever you run this thing here, you're going to minimize the whole thing, right? So whenever you do opt.step, this goes down the gradient to minimize this E function, given the input, which was that z-tilde initialization, okay? All right, let's go down here. So here we have this compute_z_check, which is a full gradient-descent minimization of the energy E, starting at z-zero, right? And z-zero is my detached z-tilde. So z-tilde is still attached; z-zero is the detached version of z-tilde. If you don't have the detach, when you run this computation, it's going to break, because this back-propagation here, you saw it, right, will try to go inside the encoder. The first time is fine, but the second time it goes through the encoder, it's going to break, because we haven't run forward multiple times, right? Okay? I hope you're following. So here we compute z-check by minimizing the energy, with the initial value set to that z-tilde, right? Then we say that z-zero is z-tilde, so I reconnect the computational graph.
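A sketch of what such a compute_z_check could look like (assuming the energy is passed in as a closure over z; the real notebook may structure things differently):

```python
import torch

def compute_z_check(E, z0, steps=10):
    # Full inner minimisation of the energy E w.r.t. the latent variable,
    # starting from the (detached) initial value z0.
    z = z0.clone().requires_grad_()
    opt = torch.optim.LBFGS([z], line_search_fn='strong_wolfe')

    def closure():
        opt.zero_grad()
        e = E(z)
        e.backward()   # repeated backward: safe, z's graph has no encoder
        return e

    for _ in range(steps):
        opt.step(closure)
    return z.detach()

# Toy check: the minimiser of ||z - 3||^2 should land at 3
z_check = compute_z_check(lambda z: (z - 3.0).pow(2).sum(), torch.zeros(5))
```

Note that LBFGS re-evaluates the closure many times per step, which is exactly why the starting point has to be detached from the encoder's graph.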
And now I can compute the free energy, being the energy E, the summation of those two terms, evaluated at the z-check location, which is what we were saying before, right? Then I use the energy as the loss: my loss is going to be equal to the energy. I zero the previous gradients and I run back-propagation. Now we're running back-propagation given that we have found the optimal z-check, right? And so my back-propagation flows backward. Let's go back to the slides, right? So I found my z-check: I compute the z-tilde, and I call my initial z, z-zero, the detached version of z-tilde, such that when I perform back-propagation here, when I minimize this thing over here, I go here and do this minimization. I minimize this cost by changing z, and I don't have any arrow going down here, right? That's why there is the detach: I block this path over here. Finally, when I have found the z-check, I reconnect the path here, right? So I use again the z-tilde over there, such that I can minimize this loss, the free energy, the energy loss, by gradient descent, using back-prop to compute these things. And then we perform the gradient-descent step, right? You see? So once again, let's repeat what's going on here. I have my initial z-tilde, which is the encoded target, okay? Then I have z-zero, which is the detached version of z-tilde, such that I don't have gradients flowing backward inside the encoder. Then I compute the optimal z-check by running a full gradient-descent minimization; in this case I use LBFGS with strong Wolfe line search, but we don't care about the details. Then I reconnect my z-tilde, so z-zero is going to be z-tilde. And then I finally compute this free energy, right?
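Putting the whole step together, here is a hypothetical end-to-end sketch (with a plain SGD inner loop standing in for the lecture's LBFGS, and toy linear modules in place of the notebook's):

```python
import torch

enc = torch.nn.Linear(3, 20)                     # amortiser / encoder
dec = torch.nn.Linear(20, 3)                     # decoder (the "dictionary")
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))

def E(y, z, z0):                                 # energy = C + D
    return (y - dec(torch.relu(z))).pow(2).sum() + (z - z0).pow(2).sum()

y = torch.randn(3)                               # one training target

z_tilde = enc(y)                                 # 1. amortised initial guess
z0 = z_tilde.detach()                            # 2. block gradients into enc

z = z0.clone().requires_grad_()                  # 3. inner minimisation for z-check
inner = torch.optim.SGD([z], lr=0.01)
for _ in range(50):
    inner.zero_grad()
    E(y, z, z0).backward()
    inner.step()
z_check = z.detach()

z0 = z_tilde                                     # 4. reconnect the graph
F = E(y, z_check, z0)                            # 5. free energy at the optimum
opt.zero_grad()
F.backward()                                     # 6. gradients now reach enc and dec
opt.step()                                       # 7. update the parameters
```

Step 4 is the "reconnection": by putting z-tilde back in place of the detached z-zero, the D term of the free energy sends gradients into the encoder when we call backward.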
Which is going to be feeding the z-check inside this energy. And then I have the loss, by choice, being the energy itself. And then finally the usual steps, right? Zero the gradients, back-propagation to compute the partial derivatives, and then follow the negative direction, right? That's it. The other stuff is for logging, okay? Questions? "Why is reconnecting... why?" Reconnecting, yeah, yeah, I know. Yes, a good question. Because... because here: the E has this z-zero, right? So z-zero is what goes inside this D expression, right? Initially, z-zero is the detached version. And now, since I still want to have the gradients, I need to put my z-tilde back in place here. So I had to call z-zero z-tilde again, such that when I minimize this D term, it sends gradients back into the encoder, right? I know, I didn't tell you, but you found out, and now I've told you. More questions? Otherwise I start talking about more things. Are we good, right? I showed you the example here, and if we train this stuff, you're going to get the exact same drawings I showed you before, okay? Good question, Patrick, though. Are we good? Yes, no? Is everyone still alive? Okay, someone is, okay, okay. This is going to be the introduction for tomorrow's class, okay? So this is the generic version of target propagation, where we were speeding up the overall process by using these... The notebook is going to be available in the book, okay? It's not yet ready for people; in fact, you saw it, there were mistakes, right? So again, you can watch the video, but the notebook is not runnable yet. It's work in progress. Just wait one month and you're going to have the notebook, okay? It's not necessary for solving your homework right now. The point here was to use this encoder to speed up the minimization process, right?
To compute the z-check, right, by having this good initial guess. But then, can we simplify things? Can we do something else, right? So this is just the beginning of the new chapter, which is going to be reserved for tomorrow, but it starts this way, okay? So we first clean up the screen. We clear up the initialization. We remove the spring. We remove also the latent variable. And then we remove the prediction. And someone mentioned before, right, how is that z-tilde different from my internal representation? That is actually what we are going to do here: I'm just going to have the encoder feed the internal hidden representation to the decoder, okay? So, one important definition that I didn't tell you. The task we were asking of the encoder before, in this case here, is called performing amortized inference, okay? So how do we perform inference in a latent variable energy-based model? Answer in the chat. Type in the chat, type in the chat. "Minimize the energy as a function of the latent." Perfect, okay. Now we can actually bypass this minimization by using an approximate solution, right? That's going to be approximate inference, right? And it's called amortized inference when you use a neural network, in this case the encoder, to predict what is going to be the output of a minimization process, right? Similarly, I would say the decoder also does that: the decoder is trained to provide the output of a minimization process, namely the minimization we apply to reduce the loss, right? So I would even say, and I should be correct, that the decoder is also somehow performing amortized inference, where the prediction stands in for the outcome of the actual training process. But then again, that gets a little too meta.
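The contrast between exact inference and amortized inference can be sketched like this (names, sizes, and the toy energy are illustrative assumptions, not the course's code):

```python
import torch

enc = torch.nn.Linear(3, 20)     # the amortiser: one forward pass, no loop

def infer_exact(E, d=20, steps=200, lr=0.1):
    # Inference proper: minimise the energy over the latent variable.
    z = torch.zeros(d, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        E(z).backward()
        opt.step()
    return z.detach()

def infer_amortized(y):
    # Amortised inference: the network predicts the minimiser directly.
    return enc(y)

y = torch.randn(3)
z_fast = infer_amortized(y)      # cheap: one forward pass
# Toy energy whose minimiser happens to be the amortised prediction,
# so the slow loop and the fast pass should agree here.
z_slow = infer_exact(lambda z: (z - z_fast.detach()).pow(2).sum())
```

In practice the encoder is only trained to approximate the argmin, which is why the lecture still runs the full minimisation during training and uses the encoder's output as the starting point.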
In this case it's more concrete, right? So this encoder is performing this amortized inference; that's the important term there. But then we said we can skip a few steps: remove the latent, remove the spring, and then pop that z-tilde up there, right? So we replace the spring now with a wire, okay? The spring is like, how do you call it, a giving constraint, right? It's a soft constraint: you try to get your z-tilde to become close to the z-check. Now instead you put a wire, boom, okay? So here the h is going to be, as you can tell from the picture, the output of the encoder, which is fed with a y. And the y-tilde is going to be, as before, the decoder of h, okay? So what do you call an architecture that is encoding its own input? Oh, yes! Patrick and Sumanu are correct, right? This is the beginning of the autoencoder chapter of our course, okay? So that's pretty much it, right? We came to what used to be the beginning of the lesson, but we talked about many other things today. This is the lesson about autoencoders, which are simply one step further from target propagation. And target propagation was... tell me, what is target propagation? What are the logical steps? First we start with... what is the basic component we started the class with? Type it in the chat, so we can finish the lesson. The first thing we talked about, its simplest representative, was the latent variable energy-based model, okay? That was the minimal example of a generative model, okay? Then from this latent variable energy-based model, we went a step forward and we learned that target propagation allows us to spare computation and speed up training, right? Or even inference, if you want; actually, you speed up inference and therefore training, okay?
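The wiring just described, h = Enc(y) fed straight into the decoder, is the classic autoencoder. A minimal sketch (the dimensions are assumptions carried over from the earlier toy model):

```python
import torch

class AutoEncoder(torch.nn.Module):
    # Spring replaced by a wire: the encoder output h goes straight
    # into the decoder, with no latent minimisation in between.
    def __init__(self, n=3, d=20):
        super().__init__()
        self.enc = torch.nn.Linear(n, d)
        self.dec = torch.nn.Linear(d, n)

    def forward(self, y):
        h = self.enc(y)        # hidden representation, h = Enc(y)
        return self.dec(h)     # reconstruction, y~ = Dec(h)

ae = AutoEncoder()
y = torch.randn(8, 3)
loss = (y - ae(y)).pow(2).mean()   # reconstruction energy C(y, y~)
```

With the wire in place, inference is a single forward pass, and only the reconstruction term of the energy remains to be trained.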
Third point: we saw how replacing a soft constraint, that kind of soft spring D, with a wire brings to the table the autoencoders, right? So everything is just an energy-based model, right? That's why this energy stuff is so pretty: everything is just an energy-based model; you just have to look at it from the right angle, in the right light. Anyway, thank you again for being with me. We went through the whole content. It was a bit choppy, but that's what you get when you listen for the first time to content I haven't really tried out before; apologies, but also thank you for being my guinea pigs, right? I'll see you tomorrow for the next part of the class, with autoencoders, okay? Bye-bye, see you tomorrow.