All right, welcome back to class. I've been working on these slides for the past 48 hours, basically nonstop, so I hope it's gonna be better than last week. The topics are very similar. Today we're gonna be talking about the variational autoencoder, but first we restart, going very quickly through everything we saw last week, with correct and consistent notation and working animations, okay? I changed everything, so I don't even know which things are repetitions and which are new. Let's see. Tell me if I'm too slow, too fast, or whatever. Good luck, I mean, good luck to me. And we've already seen this stuff, right? This one we saw two weeks ago at this point: the conditional predictive latent variable model. Conditional because we go from an x to a y: from the x we go through the predictor to get the hidden representation, then we need the latent variable z in order to capture the multiple possible values of y that can be associated with a given x, okay? We also saw that this stuff can rotate, so it was very pretty, but let's move on. These slides are shaded because they are from previous lessons; that's why they are not bright. Then we saw the other version, the unconditional one, where we don't have an x, we only have a y. I'll tell you in a sec what x and y are. The only difference from the diagram before is that when I move here, we no longer have the x. So what is x, what is y? y is not necessarily a label, though it can be one. Here's the definition: x is always available to you. It's available during training, it's available during testing, always available.
If you don't have an observed conditioning variable like x, then you just end up with y alone; x might not be there at all. y is only available during training; during testing you don't have it, okay? That's the definition of y. z you never have: it's latent. Latent means it's missing. You don't have it, you just infer it; you do the best you can to figure out what this missing variable is supposed to be. It's supposed to be an input, but it's missing, so it's called latent: you can't find it. Although in English, hidden and latent mean perhaps the same thing, in this course latent means missing, while hidden means it's hidden somewhere inside the network, an internal representation, okay? So far, no fancy stuff, right? Then we move on from the previous lesson to the recap: how do we train this stuff? Let me actually start from the most generic version, the last slide we covered two weeks ago. It was something like this: in the orange circle, which I use for representing the latent variable, I'm gonna put this vector z, okay? In the ellipse case, the case study we covered in class two weeks ago, how many dimensions did z have? Type in the chat if you're following. What is the nature of z? z is a continuous variable, not discrete, of how many dimensions? In how many directions can y tilde move? Okay, I'm reading your answers, but I'm not pleased. I'm trying to see whether someone gets the right thing, so maybe I'm not clear. Do you remember y? Do you remember in which space y was living? New question: don't just type a number, tell me in which space y is living. A two-dimensional space, okay, yeah. So y was living in a two-dimensional space.
Do you remember where z was living? One-dimensional, yes. z can only go from, for example, zero to two pi, and past that it just wraps around, because the final y tilde was using cosine and sine, right? So where is the final y tilde constrained? It can only live on a one-dimensional manifold in a two-dimensional space: it can only go around circles, ellipses. It only has one degree of freedom; it cannot move in any other way. This was clear, right? I mean, it should be clear by now. So this was the picture: in this figure, on top of this z, there is just one line. z is actually a scalar and can go up and down a line. Similarly, y tilde is also restricted to a one-dimensional subspace of the 2D plane in which it is embedded, okay? All right, now I'm doing a different thing here. Now z is bold, so it's no longer a scalar, and it doesn't necessarily live in one dimension. So now we have a problem: if z is bold, it can take many, many values, and therefore, if you can always find the best z that reconstructs your input, you're gonna end up with a flat, sorry, not manifold, a flat energy surface. We cannot have a collapsed model: we need low energy for good samples and high energy otherwise. If z is too powerful, you end up with a low energy value for everything, okay? Then it's useless. So we have to introduce what Yann calls a regularization factor. There are two options: there are regularized latent variable models, or there are architectural methods, okay? In this case, I'm showing you that I'm using the regularized one, okay?
In the other case, when we choose z to be one-dimensional, you can think of the regularization as just the choice of dimensionality: setting the size of z equal to one is a very strong, brutal regularization, okay? Otherwise, you can use the different types we have seen. We can use sparsity. An ℓ2 penalty doesn't work, because if you make z small, the decoder can simply increase its weights to compensate, right? So ℓ2 doesn't work as a regularizer in latent space: again, if you shrink z, the decoder just scales it back up by increasing its weights, so it cannot be used. We'll see other techniques today. Anyway, training recap, okay? How did we train this regularized latent variable energy-based model? Mouthful of words. We are given an observation y, and we are given this energy E(y, z), which is the sum of all the squares in this diagram: the reconstruction error C between y, the target, and y tilde, my prediction, plus the regularization term for the latent variable, such that the latent cannot assume every possible value. We had to constrain the freedom, let's say, of z, okay? All right. y tilde is the prediction, the decoded version of my latent. And how were we training this thing? First we compute the free energy, which is simply the soft minimum over z of this energy, a softer version of the minimum: we take the contributions of all possible z's, weighted according to their energy, so the smaller the energy, the more they contribute to the final free energy for that given point. And then, finally, we were minimizing this loss functional, which gives us a good energy function, right?
A good energy function is an energy which is low on the samples we observed and high otherwise. Okay. So that's with a warm temperature, right? If instead I go cold, I put on the AC and it becomes super cold: beta is the coldness, so if beta goes to plus infinity, if it's super cold, what happens to that soft minimum? It's a soft minimum because we warmed up the ice cream: at a high temperature, it becomes soft, melty, okay? And if you make the ice cream cold, how does it become? If you have a soft ice cream and you put it in the freezer, it becomes solid: the soft minimum becomes a regular minimum. Yes, exactly. All right, so that's the zero-temperature limit: I increase the coldness, I put it in the freezer, and what happens? I simply compute ž, remember? ž, the z check, was the value of the latent that minimizes my energy, such that I can compute F infinity, the zero-temperature-limit free energy, which is the minimum value of E, or just E evaluated at ž, right? That's no news, we know this stuff. Finally, how do we train? We minimize the loss functional. All right, cool. So let me clear the screen. What do I talk about today? That was the recap, kind of. Today I'm gonna start with the connection between this latent variable energy-based model and autoencoders. Last time I skipped this step and I felt very bad: I was almost insulting you by skipping steps. I cannot skip steps in my explanations; they should make sense. If you don't understand what's going on, it's my fault, complain, okay? All right, anyway, here's the missing step from last week.
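The free energy recap above can be sketched numerically. Below is a minimal sketch, assuming the toy ellipse decoder from the case study (y tilde = (a cos z, b sin z)) and a discretized latent z; the function names and constants are mine, not from the slides. It shows that the warm free energy is a soft minimum of E over z, and that a large beta (the coldness) recovers the zero-temperature limit, i.e. essentially min over z of E.

```python
import numpy as np

def energy(y, z, a=2.0, b=1.0):
    # Decoder maps scalar z onto an ellipse; energy is the reconstruction cost
    # C(y, y_tilde) = squared Euclidean distance.
    y_tilde = np.stack([a * np.cos(z), b * np.sin(z)])
    return np.sum((y[:, None] - y_tilde) ** 2, axis=0)

def free_energy(y, beta, n=1000):
    # Soft minimum over a discretized z in [0, 2*pi):
    # F_beta(y) = -1/beta * log mean_z exp(-beta * E(y, z))
    z = np.linspace(0.0, 2 * np.pi, n)
    E = energy(y, z)
    return -np.log(np.mean(np.exp(-beta * E))) / beta

y = np.array([2.0, 0.0])           # a point exactly on the ellipse, so min_z E = 0
warm = free_energy(y, beta=1)      # warm temperature: soft minimum, above the min
cold = free_energy(y, beta=1000)   # zero-temperature limit: approaches min_z E = 0
```

As beta grows, only the lowest-energy z's contribute, so `cold` sits just above zero while `warm` averages in higher-energy latents.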
Given an observation y, and this is still the same drawing we've seen so far, I'm gonna give you the following information. This is gonna be target propagation, target prop. We compute z tilde. What is z tilde? z tilde is the encoded version of my y, okay? So here we go: I feed this y into the encoder and I get z tilde, my initial guess for what the latent should be, okay? But it's not a latent, it's green. Green means it's a hidden representation, okay? It's not an input, it's an output; you can see it's the output of the encoder. Whereas the z there on top is orange: it's an input, there are no arrows going into it, okay? So the green guy here is the output of a module, a hidden representation, although it's called z tilde; the orange one on top is an input. It's orange, it doesn't have arrows pointing in. I hope that's clear, okay? Then I initialize my latent, the orange one, with this green value: my guess for what the latent should be, my z tilde, my prediction. Cool. Then I don't want this orange bubble to go too far from the green one, right? So I put a spring there, a factor such that z stays attracted toward z tilde: don't go too far from my initial value. You can go somewhat far, but if you go too far, this stuff gets pulled back. Okay. So what do we do? We compute ž, exactly as in the slide before, which is the argmin of this energy. What is this energy? The energy is simply the sum of all the boxes, right? We know: it's the reconstruction term C between the target y and y tilde, my guess for what y should be, plus the regularization term R, plus this new D, a distance in latent space between the latent and the hidden. Okay. All right. So what next? Next we minimize, again, this loss functional. How? Two steps, right?
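The energy E = C + R + D and the inner minimization that produces ž can be sketched as below. I'm assuming hypothetical linear encoder/decoder matrices and a squared-norm regularizer R(z) = lambda * ||z||^2, so the gradient of E with respect to z can be written by hand; none of these particular choices come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(3, 2))    # hypothetical encoder: y (2-d) -> z_tilde (3-d)
W_dec = rng.normal(size=(2, 3))    # hypothetical decoder: z (3-d) -> y_tilde (2-d)
lam = 0.1                          # weight of the regularizer R(z)

def E(y, z, z_tilde):
    C = np.sum((y - W_dec @ z) ** 2)   # reconstruction cost C(y, y_tilde)
    R = lam * np.sum(z ** 2)           # regularizer constraining the latent
    D = np.sum((z - z_tilde) ** 2)     # "spring" pulling z toward z_tilde
    return C + R + D

y = np.array([1.0, -0.5])
z_tilde = W_enc @ y                # initial guess for the latent, from the encoder
z = z_tilde.copy()
for _ in range(200):               # gradient descent on z to get z_check
    grad = -2 * W_dec.T @ (y - W_dec @ z) + 2 * lam * z + 2 * (z - z_tilde)
    z -= 0.05 * grad
z_check = z                        # (approximate) argmin of E over z
```

Alternating with this inner loop, one would then take gradient steps on the decoder parameters (through C) and on the encoder parameters (through D), which is exactly the two-step minimization described next.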
The first step is: I move the parameters of the decoder such that the final cost C is minimized, right? So I take a step in the opposite direction of the gradient of C with respect to the parameters of the decoder, given that I have the best z, because I actually computed ž. ž is the latent that gives me the best y tilde. Let's say I have my initial guess y tilde here; then I optimize z and get ž, so my y tilde basically becomes y check, pointing toward my target y. Let me repeat: this is my y, the blue one I drew on the bottom right of the diagram. I have an initial guess for my latent, my z tilde, which gives a y tilde that starts over here. Then I minimize the energy over z, such that I get ž, and by getting ž I get y check, which is the reconstruction closest to this y here, right? And then I do gradient descent in the decoder's parameter space, such that I actually land on the y. Makes sense or not? Talk to me. Well, type to me. This is latent variable model training, yeah? We covered this two weeks ago. Okay, all right, you're following. Cool. So what's next? Oh, we also have to refine z tilde, right? How do we improve this D, how do we lower this free energy F? We can now also take a step in the encoder parameters, because now I do have a target for my z tilde: my target for z tilde is my ž, right? Given that I observed a y, I did a minimization over z to get ž, the minimizer of the energy.
Now I know what the optimal latent is, given that input, and I'm gonna try to shoot my hidden representation toward that optimal latent, okay? So ž is gonna be my target, my target latent, and I propagate the target back through the network, right? That's why it's called target prop, I believe. y was my original target; then I minimized the energy in order to find ž; now ž is the target, so I can minimize this D energy in order to train the encoder, right? So you basically train the decoder and the encoder in two separate steps. How does R(z) affect the encoder and decoder? R(z) basically prevents z from taking extreme values; R stops z from taking too many values, right? We need to constrain the power, the information capacity that can be stored in the latent. Because if the latent can always explain everything there is in the input, regardless of the input, then you're gonna get zero energy everywhere. Instead, you need this regularization term so that z is constrained: it can only take a subset of possible values, okay? All right, so that was the missing step from last week's explanation. We can move on and delete things from the screen. Let me delete the initialization, let me delete that D factor, and let me also remove that latent variable. What I do now is move the green bubble from the bottom to the top, right? Oof. I also change the name: it's called h now. Same color, same stuff. So what happened here? I just skipped a step, right? Instead of finding the optimal latent, I remove the latent and I just put a wire where that D was. There was a D before, a box between z tilde and z; now I just put a wire.
I just put a wire between z tilde on the bottom and z on the top, so they are the same thing. z is no longer an input, it's no longer orange; it's the output of my encoder. I just skipped a step, right? So my encoder is now gonna try to learn how to predict the z in one step, and the minimization process we were doing before to find ž is now performed by the encoder. This is called amortized inference: we learn how to perform an optimization. Cool, so I just moved the bubble up. Oof, there you go, finished. This is autoencoders; we already know everything now. Autoencoder, equations, blah. The energy is gonna be just the sum of the reconstruction and the regularization. For the reconstruction cost, we already saw that there is the real-valued one, for example this quadratic distance between my target and my prediction, and there is the other option for binary inputs. The loss functional, we said, is the average of these per-sample loss functionals, and for example we can take the energy loss functional, which is the free energy computed for that given sample. We already covered last week what undercomplete and overcomplete mean. And I think there was a question I maybe forgot to ask: why on earth? My question last week was: what does an autoencoder do? Everyone answers: oh, it compresses information. Well, not necessarily. What does compression mean? Oh, you get a smaller representation, because people think about PCA, which is like a linear autoencoder, where you try to get a dimensionality reduction. That was the answer last week, which is not what autoencoders necessarily have to be used for. Actually, I'd guess it might even be one of the less common uses. So what are we doing here? Why am I trying to learn a hidden representation that is larger than the input?
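As a contrast with the target-prop inner loop, here is a minimal sketch of the amortized version: a single encoder pass replaces the minimization over z entirely. The tiny tanh network is a hypothetical stand-in of my own, not the architecture from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
W_enc = rng.normal(scale=0.5, size=(4, 2))  # hypothetical encoder weights
W_dec = rng.normal(scale=0.5, size=(2, 4))  # hypothetical decoder weights

def autoencoder_energy(y):
    h = np.tanh(W_enc @ y)              # one forward pass = amortized inference
    y_tilde = W_dec @ h                 # decode the hidden representation
    return np.sum((y - y_tilde) ** 2)   # E(y) = C(y, y_tilde); no inner loop over z

energy_value = autoencoder_energy(np.array([1.0, -0.5]))
```

Training then backpropagates this single energy through both modules at once, instead of alternating decoder and encoder steps around an inner optimization.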
We covered this in the second lesson, I believe, or the fourth at this point, sorry about that. Why do we like high-dimensional intermediate spaces? Do you remember? "Grab more features." "Less chance of local minima," perhaps? Yeah, I mean, yeah, but the aim is, exactly, okay: we can make things linearly separable in a high-dimensional space. If you go to a high-dimensional space, there's much more freedom: you can move things further apart and get nicer representations than in smaller spaces where everything is cramped. So here we basically perform representation learning, okay? And for that you actually want a larger intermediate representation. But then there is the problem that this stuff is gonna collapse. What is collapse? What do we mean by collapse in terms of energy-based models? Remind me, I've forgotten. Yeah, zero everywhere. It doesn't have to be zero, just flat everywhere, but yes, your answer is correct. Right, so the issue here is that if you have a very large intermediate representation, you can simply copy things through. Maybe I can try to draw it, I don't know. You copy this value over here, then this value over here, then this one back up here. I'm using my mouse, of course. Oh, okay, cool. And you end up with something that is able to reconstruct everything. Okay, it's missing a part, fine, whatever. If you can reconstruct everything, then you're useless, right? You only want to be able to selectively reconstruct the things you have observed during training: that's the whole point. The point is that we have low energy for the things that actually belong to the training manifold, not for everything. Otherwise this stuff is useless. I hope it's clear.
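The copy-through collapse can be made concrete. With an overcomplete linear autoencoder (hidden dimension larger than the input) and no regularizer, weights that simply copy the input into the code and back give zero reconstruction energy for every point, seen or unseen. This toy construction is mine, assuming a 2-d input and an 8-d code:

```python
import numpy as np

d, D = 2, 8                                           # input dim, (larger) hidden dim
W_enc = np.vstack([np.eye(d), np.zeros((D - d, d))])  # copy y into the first d units of h
W_dec = W_enc.T                                       # copy those units back out

def collapsed_energy(y):
    # Reconstruction energy C(y, y_tilde) of the copy-through autoencoder.
    return np.sum((y - W_dec @ (W_enc @ y)) ** 2)

# Zero energy everywhere, even for inputs never seen in training:
# the energy surface is completely flat, i.e. the model has collapsed.
```

This is exactly why the hidden layer needs a bottleneck or a regularizer, which is what the next techniques provide.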
Okay, moving on. We introduced the first technique to avoid this dead, flat manifold, which was the denoising autoencoder. So what does the denoising autoencoder do? Here's the same diagram we've been seeing a thousand times now. Why do we have a high-dimensional h? I think I already answered. Someone answered here, Arthur, right? Arthur said we can make things linearly separable in a high-dimensional space. It's a nice way to put it, although it depends on what you're asking. In energy-based models, the high-dimensional space was for the hidden representation. We don't go to a high-dimensional space for the latent variable, because then we'd have to regularize that latent variable somehow, right? Latent variables are just missing information we provide to the system; that was the other topic. In this case, autoencoders, we're trying to learn representations, and we like larger-dimensional representations: they are easier to learn, right? But then we have to find ways of constraining the information that goes through this hidden layer, and there are several ways of constraining the possible values h can take. h is also called, in this case, the code, C-O-D-E. In autoencoders, you want to limit the possible codes that can be generated, that can be associated with the input, such that, given a scarcity of codes, you have to distribute them in a sparing manner, right? Basically, there is a need for an information bottleneck, and there are different techniques: there is the denoising autoencoder; there is the contractive autoencoder, which we're gonna talk about in a second; there is the variational autoencoder, which we'll also get to in a second; and there is sparsity over the hidden layer. So there are different ways of regularizing the hidden layer.
Different ways of constraining the information that is allowed to pass through the hidden layer, okay? Hope it's clear. All right, the denoising autoencoder is a contrastive technique, okay? So although I was just saying it's a regularization technique, no it's not: this is actually a contrastive technique. What do contrastive techniques do? Remind me, type. "Push down on good guys, push up on bad guys." Perfect. The denoising autoencoder does exactly what you just said, Arthur, perfect. So we have to come up with bad guys. How do we come up with bad guys? We take the good guy, the blue one. Blue means low energy, right? Cold, blue, low energy. We take that one and we corrupt it: we sample from some noise distribution P and we get a sample, y hat. What is y hat? The hat is the little thing that is pointing upwards, right? y hat has a high energy; y hat is hot, and that's why it's red. Like on a thermometer: blue is the cold region, below zero; red is when you have boiling water. "The corrupted value, learn from positive and negative samples." Yes, the corrupted value, I'm talking about that right now. So we have this y hat, which is hot, red, the corrupted version of the input. Then I encode it into h, and I force the decoder to produce a y tilde which is close to my input y, the target, right? So now, regardless of the corruption that y undergoes, the denoising autoencoder is forced to retrieve the original value of y. The autoencoder has to learn how to undo this insertion of noise.
The corruption could be, for example, additive Gaussian noise, or some sort of dropout: you can drop some pixels, so you have some black pixels around, and the encoder-decoder is gonna undo that kind of corruption, right? There is a big assumption here: at inference time, when you actually use this stuff, you are expected to find the same type of corruption. When you learn to undo a corruption, you get good at undoing that specific corruption; you don't necessarily generalize to any type of corruption, right? Okay, so how does it work? We already covered this one. We had the manifold, right? I have my y, I move this y away, for example with some random vector, and let's say I get my y hat. Now I decode it and force the decoded version to be back at the original location, right? So you can automatically tell me: what is the implicit energy assigned to y hat? Type in the chat. The denoising autoencoder is a contrastive technique that gives what energy to this bad guy, y hat? How bad is it? Yeah, high energy, but you have to tell me exactly: what is this amount of energy? How much is this value in the picture? How do we usually measure it? What's inside the C? C can be two things, as we've seen today, but usually what is our energy? We said it is the Euclidean, almost: it's not quite the Euclidean distance. No, it's not the Manhattan distance, come on, we saw it two slides ago. Squared, yes, thank you. It's the squared Euclidean distance, right? So the energy associated with a given y hat is the squared Euclidean distance from its original value, okay? This is the denoising autoencoder: a contrastive technique which assigns this energy to this point.
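A single training step of a denoising autoencoder can be sketched as below. I'm assuming a hypothetical linear encoder/decoder, additive Gaussian corruption, and hand-written gradients of C = ||y_tilde - y||^2, so this illustrates the push-the-corrupted-point-back idea rather than the lab code:

```python
import numpy as np

rng = np.random.default_rng(2)
W_enc = rng.normal(scale=0.1, size=(4, 2))  # hypothetical encoder weights
W_dec = rng.normal(scale=0.1, size=(2, 4))  # hypothetical decoder weights
lr, sigma = 0.1, 0.3                        # step size, corruption strength

def train_step(y):
    global W_enc, W_dec
    y_hat = y + sigma * rng.normal(size=y.shape)  # corrupted "hot" sample
    h = W_enc @ y_hat                             # encode the corrupted input
    y_tilde = W_dec @ h
    err = y_tilde - y                 # reconstruct the CLEAN y, not y_hat
    g_h = 2 * W_dec.T @ err           # gradient of C with respect to h
    W_dec -= lr * 2 * np.outer(err, h)
    W_enc -= lr * np.outer(g_h, y_hat)
    return np.sum(err ** 2)           # C = squared Euclidean distance

y = np.array([1.0, 0.5])
losses = [train_step(y) for _ in range(200)]
```

Over the steps, the model learns to map corrupted versions of y back toward the clean y, which implicitly assigns corrupted points an energy equal to their squared distance from the manifold.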
Then we saw a lot of cute things; we already saw this stuff last week, that was the sparse autoencoder. Then we move on to the contractive autoencoder: this is new material we haven't seen yet. After that, we cover the variational autoencoder and the notebooks, otherwise you're gonna complain that we don't cover code. It depends if you want me to keep explaining this stuff with pretty pictures, or you want me to go more practical; let me know which one you prefer. Anyway, the contractive autoencoder: how does it work, what is it? It's another technique for limiting the amount of information that the hidden layer can carry, okay? Same stuff, right? We have the manifold, we have our y, and I have my free energy, which is exactly the same thing we've been seeing all along: the C reconstruction plus the R regularization. Okay, what is the difference? Oh, okay, one second, bear with me. A very famous example of a denoising autoencoder is BERT, which is trained on something called masked language modeling, MLM. What is masked language modeling? You have your input, your y, and you perform some corruption: basically you delete some words in a sentence, and then you force the model to predict the missing words, okay? It's exactly what I just showed you. Okay, I cannot scroll. I don't know how to use my computer, sorry guys. Okay, so we saw this one, where is it, the denoising autoencoder: y on the bottom right is my full sentence; there is a corruption here, which is like a dropout mask, basically dropping a few words; then you have the encoder and decoder modules. Well, there's just one big module, but we can always think about it as an encoder-decoder. And then the output is gonna be a reconstruction, y tilde, which we try to get as close as possible to the original y, right?
So we try to fill in the missing words that were removed by the sampler here, okay? The model was BERT, a transformer model, which operates on sets, but we don't care: it's just a denoising autoencoder where the sampling, as I told you before, is a dropout. We drop a few of the words in a sentence, okay? That's it. So now you know what masked language modeling is: a denoising autoencoder technique. Let me go back here. All right, so far I haven't told you anything new, we'd already seen everything; there was no big jump. Jump now, okay? Just one question from here. The C means that we penalize insensitivity of the reconstruction along the manifold, right? If, as you move along the manifold, your y tilde doesn't follow your y, then C is gonna get large, right? So in order to keep C small, to keep the reconstruction error small, your y tilde should follow y as you move along the manifold, for example along this line. But we said that we don't want y tilde to follow every possible value that we input, because then we'd end up with this flat thing; that's why there is the R. What is this R? The R here, I'm gonna show you, is gonna be the following: some lambda, a hyperparameter, times the squared norm of the gradient of the hidden with respect to the input y. This is the sensitivity: how much does h change when I change y? It's dh over dy, yeah. So this term starts screaming, starts getting annoying, whenever h changes given that you change y: the more h changes when you change y, the more this stuff blows up. And so, if you wiggle y a lot, the model is gonna try to wiggle h less, right?
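The R term can be written out explicitly. A minimal sketch, assuming a hypothetical one-layer tanh encoder h = tanh(W y), for which the Jacobian dh/dy has a closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(4, 2))        # hypothetical encoder weights
lam = 0.5                          # the lambda hyperparameter

def contractive_penalty(y):
    # R(y) = lambda * || dh/dy ||_F^2 for h = tanh(W y),
    # where dh/dy = diag(1 - h^2) @ W.
    h = np.tanh(W @ y)
    J = (1 - h ** 2)[:, None] * W
    return lam * np.sum(J ** 2)    # squared Frobenius norm of the Jacobian

# Where tanh saturates, wiggling y barely changes h, so the penalty is smaller
# than in the linear regime around y = 0.
penalty_saturated = contractive_penalty(np.array([5.0, 5.0]))
penalty_linear = contractive_penalty(np.array([0.0, 0.0]))
```

In a real contractive autoencoder this penalty is added to the reconstruction cost C, so the only directions in which h stays sensitive are those the reconstruction actually needs.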
So you penalize how much the wiggling of h comes from wiggling the y on the bottom. What does it do? This thing basically says: I don't want h to change; I penalize any change of h for whatever change of y, okay? So the slanted direction there is gonna be both penalized and incentivized: penalized by this R factor, but incentivized by the other one, the C, whereas all the other directions are only penalized, right? You see now? Everything is basically penalized, but only one direction is incentivized, and in this manner you get to push up the energies only in the directions that are not used for the reconstruction, right? I hope it's clear. And this is called a contractive autoencoder, all right? Summary: how do these autoencoders work? You have an input manifold and some observations, right? These dots represent what I see. I call them y; this is my y and that's the whole manifold, right? On the right-hand side, I have this hidden representation where I map each dot: those are my hidden representations living on this h manifold, some subset of R^d. And then you have another dot over there. This is the basic version of the autoencoder. We saw this before, right? I am not cheating: I'm just showing the same diagram I showed you before. So I'm here, poof, okay? All right, nothing changed. This is what we know already: this encoder acting as an amortized inferrer, performing amortized inference. Cool. On the left-hand side, instead, we're gonna talk about the variational autoencoder. So what's the difference? I'm telling you first what's not different. We still start from an observation y, right?
So no big change there: we start with an observation. And then, guess what, I have an encoder, which produces a hidden representation that now looks a little bit funny. Instead of having one vector, I have a longer vector and I split it into two parts: h is now gonna be called mu concatenated with v, where mu and v are vectors. Same stuff, right? Nothing changed. On the top right, we had an R, a regularization factor attached to h, and here I'm gonna have exactly the same: I'm gonna have a U, U because it looks similar to mu, more or less, and then for the other one, v, I'm gonna have a capital V, right? So it's actually the same, nothing changed. Now, big difference: I have a sampler, okay? What does this sampler do? This sampler samples a latent variable z, okay? How? Let's say with a Gaussian distribution, okay? With what parameters? Well, the parameters are given to you by the encoder: the encoder now encodes the parameters that we are gonna use for sampling z, okay? So z, again, is orange. Why is it orange? What does orange mean? We've said it 2000 times now. Type. No, it's not hidden. It's a latent, yes: z is orange because it's a latent input. Why are mu and v green? What does that mean? Hidden, yeah: they are the output of something else, hidden inside the network, so you don't see them. Okay, cool. So the encoder encodes my observation y into these hidden variables mu and v, and then I have these U and V energy terms, like the R we had on the right-hand side, same stuff. Then we have a sampler, so now we sample a latent. We were doing something very similar before, right, when I was showing you target prop, kind of: we were getting a hidden, right?
From the hidden, we were sampling z, and then we were minimizing over z, right? And in order to keep z from going too far from z̃, I had this elastic in the middle. Now we don't do that anymore. Now I sample this z directly from these parameters, right? So it's basically the same: before I was copying, now I sample, and the sample comes from this normal, this Gaussian, with these parameters, and I get the z. But there's no box between z and these guys, right? These are parameters, so there is no such thing as a guess for my z. Cool. What's next? Next, just copy and paste what you have on the right: I have a decoder, I get my ỹ, and then I have this, you know, come-back-to-the-origin term, okay? So what does it mean? We take a y, we encode some parameters for a distribution, I sample my latent, and then I decode this new, possibly different input, such that the reconstruction comes back to the original input: ỹ is forced to be close to y, okay? There shouldn't be any major jump in the, what's it called, thought process here. I basically copied the right side, except I sample — there's one additional module, right? (Question from the chat: what are U and V? Yeah, I cover that in the next slide.) So, what is that h on the right-hand side? Can anyone guess? How can you bring the left diagram to the right-hand side, right? How can the right-hand-side diagram represent a variational autoencoder — what's the only difference? How do you get a deterministic sampler out of a Gaussian? Yeah, leave the sampler where it is — how do you transform that sampler into a deterministic one? Exactly: you set the variance to zero, right? So then, what is h in this case? I guess when you typed u you meant μ — yes, h is basically the μ, right?
So the right-hand side is a variational autoencoder whose variance has been set to zero, okay? There's no more noise in the latent — so there's no latent anymore, it's just a hidden. Okay, I think at least someone is following. Now, we've already seen another diagram that had a sampling module, right? Which one was that? When did we see another sampling module? The denoising autoencoder, right? So here is how this variational autoencoder compares with the denoising autoencoder. Before, the sampling was happening between y and ŷ: we perturbed the input and then processed everything. In this case, we encode the input and we add the noise in the hidden, right? We basically swapped the positions of the encoder and the sampler, more or less, right? Okay. Finally, let's learn how this variational autoencoder actually works. So yes, U and V — what are these? We saw this diagram before. We start from a y over there. We encode this y into, basically, a z, by adding some noise — this variance. So my z is no longer a point. Before, z was just the μ, right, the hidden. Now z is actually sampled, so z can take up a volume, a region of this hidden space, right? It's no longer a point — that's a big difference, right? Then we decode one of the possible points in this region, we get a reconstruction, and we try to minimize the reconstruction error. But if you only do that, you still end up with issues and a flat manifold, because the z's can simply go everywhere, very far apart. You're going to learn means that are very, very far apart, and then you can't possibly know how to sample later on, if you want to draw new z's at inference time.
So instead we're going to enforce that the distribution of z doesn't go too far from a normal distribution, N(0, I) — zero mean and identity covariance matrix, okay? And the one on the left-hand side, N(μ, V), has mean μ and the v's on the diagonal, okay? And so this D is a distance — well, a divergence, since these are probability distributions — between my own distribution for z and this N(0, I), the classical bubble at zero. How do we add this Gaussian noise? We use something called the reparameterization trick. What is it? Well, the issue is that we don't know how to backpropagate through a sampling module — but we don't care. We can simply say that the latent z is going to be my μ, the guess for the mean, plus an ε sampled from a normal, whose amplitude is scaled by the square root of v, the variance: z = μ + √v · ε. This allows gradients to flow back into the encoder — again, not too important here. All right, finally, the variational autoencoder. What are these U's and these V's? I said that given a y, you encode the y and you add some noise, right? So one point in the input space gets mapped into a region of the hidden space — that's why I draw bubbles here, and they're orange because we're in the latent space. Yes, we are in the latent space. Cool. So I'm introducing here my free energy, F̃. This is an approximation, an upper bound, right? We don't care. Our energy has two terms: it has a reconstruction term, and then it has another term with a hyperparameter β in front, which is this distance between distributions — actually a divergence, but whatever — another cost which makes you pay for getting a distribution for z that is too far from the classical normal Gaussian. Why is that necessary? Because otherwise it's too powerful, right?
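The reparameterization trick from the slide, as a few lines of code (toy values of mine for μ and v): the randomness is moved into ε, so z is plain arithmetic on μ and v and gradients can flow back through it.

```python
import numpy as np

# Reparameterization trick: z = mu + sqrt(v) * eps, with eps ~ N(0, I).
# The sampling noise lives in eps, so mu and v enter through ordinary
# arithmetic that gradients can flow through.
rng = np.random.default_rng(0)

def sample_z(mu, v):
    eps = rng.standard_normal(mu.shape)
    return mu + np.sqrt(v) * eps

mu = np.array([0.5, -1.0])
v = np.array([0.1, 0.2])
z = sample_z(mu, v)

# Set the variance to zero and the sampler becomes deterministic: z == mu,
# i.e. the VAE degenerates into the plain autoencoder on the right-hand side.
z0 = sample_z(mu, np.zeros_like(v))
print(z0)                          # identical to mu
```

This also answers the earlier chat question directly: with v = 0, the sampler copies μ through, and h on the right-hand side is exactly μ.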
So we want to constrain the expressivity of this latent, okay? Now, if the latent could only be sampled from a normal distribution regardless of the input, that would be completely uninformative, right? If my latent only follows a standard normal, there's no information there. So instead of forcing that, we have, basically, again an elastic, no? — between the distribution of the latent and this reference. One is called the posterior, the other is called the prior, but I don't care. So we have a distribution for the latent and then this attractor, okay? (Question from the chat: "Is that a mean KL loss or just a KL loss? Since KL isn't symmetric — or does that not matter?") So this is properly called the relative entropy; sometimes people call it KL, which stands for the initials of the authors. And no, it's not symmetric — that's why it's not a distance, right? Nevertheless, let me keep going, otherwise I won't finish; if I didn't answer your question, I'll answer afterwards. So the first term wants you to be able to reconstruct these bubbles to the correct locations. What happens if these bubbles overlap? If you have two bubbles, you can reconstruct the first one to its original sample and the other one to its original sample. But what if the two bubbles overlap — where does the overlapped region get reconstructed to? So what is the effect of C? What is C trying to do here? Any guess? C is going to be very angry if any of these bubbles overlap, because then the model has no clue how to go back, right? And so C is going to push all these bubbles away from each other, such that there is no overlap. Okay. Another option for C to avoid overlap is to kill the variances: you can make the variances very tiny, and again the reconstruction is very happy, because there is no overlap if there are no more bubbles, right? If you have points, there are no overlapping points, right?
But if you have volumes, then of course you can have overlapping volumes. All right, so this C does two things: it tries to push these bubbles far apart, and it also tries to kill the variances. Now, if you compute that relative entropy between a Gaussian and a standard normal, you end up with the expression over there, where v stands for the variance — v is the diagonal of the covariance, right? Okay, so what does that expression do? It's not too complicated, and I'm going to help you figure it out, don't worry. So let's call this first term V, okay? How does V change when you move v_i? In this manner, okay? You can see there's a linear term, v_i. Then there is a minus log: the log goes from minus infinity up, crossing zero at one, and flipping it, the minus log comes down from plus infinity and crosses zero at one, right? So the linear term plus this flipped log comes down from plus infinity, turns around, has its minimum at the point (1, 1), and then goes back up — and then there's also a minus one, so I shift everything down by one unit, right? And so you end up with a curve that comes down from plus infinity, touches the horizontal axis at the coordinate (1, 0), right? and then keeps going up in a linear fashion. If I minimize this expression, these little orange bubbles are going to strive to reach a size of one, right? So there is an internal push that will fight anyone trying to collapse these bubbles, right? We said before that the C term, the reconstruction term, does two things: one was sending all these bubbles apart, and the other was trying to collapse them, to squeeze these bubbles. Well, if C tries to squeeze a bubble, this V is going to start screaming like crazy, right? Because you can see it goes to plus infinity. Wait, which direction, right?
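You can verify the shape of that curve numerically — a quick check of the per-dimension variance cost f(v) = v − log(v) − 1 described above:

```python
import numpy as np

# Per-dimension variance cost from the slide: f(v) = v - log(v) - 1.
# It blows up as v -> 0 (the "screaming" when a bubble gets squeezed),
# touches zero at v = 1, and grows linearly for large v.
def f(v):
    return v - np.log(v) - 1

vs = np.array([0.01, 0.5, 1.0, 2.0, 10.0])
print(f(vs))

# The minimum is exactly at v = 1, where the cost is 0:
print(f(1.0))                      # 0.0
```

So squeezing a bubble (v → 0) costs roughly −log v, which diverges, while the unit-size bubble is free.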
No — here, it goes to plus infinity, right? So if you try to squeeze this one, it's going to go: aah! Okay, I'm too excited, I didn't sleep. So you cannot squeeze them too much, right? There is a compromise, a trade-off, between reconstructing and squeezing bubbles. All right, what's next? Let me clean up — we have one more term, right? Okay, this one is trivial, it's simple: it's a quadratic term. U is going to be the sum of the squares of the individual means, right? So how does it work? Same stuff: this is going to be a parabola, and if I try to minimize it, I get the minimum at zero, right? And so this term basically says: look, I'm going to try to push all these things inside this bigger bubble, okay? So in the end, if you put everything together, you end up with a collection of little bubbles inside a bigger bubble, right? Is everything clear so far? There's a caveat here, though — one thing in my explanation that is not quite correct. The size of the little bubbles is arguably not the same for everyone, right? They are larger for samples that are more similar, I believe. So the bubbles are going to come in different sizes. You also want to pay attention to, and think about, what the size of the purple bubble is, okay? and what the possible sizes of the orange ones are. The point here is that if you want to add more bubbles, they will incur a high cost coming from this U part, right? The more bubbles you add, the larger the arrangement becomes, right? Because more bubbles — what happens? If you want to add more capacity to the network, meaning you want the network to be able to learn more possible codes, you try to add more bubbles. But the more bubbles you add, the more they start spreading out and getting further away from the center. So the U term is going to go up. The other option to lower the energy, right?
— is to try to squeeze them more. But if you try to squeeze the bubbles, this V term starts screaming, right? So: squeeze the bubbles, and V starts complaining. Add too many bubbles, and U starts complaining. Make them overlap, and C complains, okay? So you're in a trade-off here, right? How many bubbles can you squeeze together? How do you fit more bubbles in a given volume? Well, the answer is very straightforward: you increase d. The only knob you have here to allow more capacity in the hidden layer is to increase the dimensionality d. With a larger space, you can pack more little bubbles, right? And so you can tell now that, although these bubbles will adjust and take whatever space they fit into — the loss, the energy, will find the best trade-off — you have control over how many of these bubbles you can pack together by controlling this d factor, right? This is the size of the hidden layer. So here you've just learned another way to limit the energy, this F̃, over the possible codes, right? The nice part now is that you can actually sample after you train this model. You can simply sample from the prior, from this N(0, I), and decode the sampled items, right? And so, given that these z's on the right-hand side have been packed into this unitary bubble — the normal — you can simply sample from this distribution and generate arbitrary points along the training manifold, right? Even things you haven't seen before. It's super cool, right? So a variational autoencoder can be used as a generative model, although the way we saw it now, we introduced it as a regularization technique, right?
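Generation after training can be sketched in two lines — the decoder here is a made-up linear map of mine, standing in for the trained network: sample z from the prior N(0, I) and decode it into a brand-new y.

```python
import numpy as np

# Sketch of generation from a trained VAE. W_dec is a hypothetical
# stand-in for the trained decoder; only the recipe matters:
# sample z from the prior N(0, I), then decode.
rng = np.random.default_rng(0)
d, n = 2, 4                            # latent size, output size (toy values)
W_dec = rng.standard_normal((n, d))    # pretend-trained decoder weights

def generate():
    z = rng.standard_normal(d)         # sample from the prior N(0, I)
    return W_dec @ z                   # decode: a new point near the manifold

y_new = generate()
print(y_new.shape)                     # (4,)
```

Because training packed all the little bubbles inside the unit bubble, a z drawn from N(0, I) almost surely lands in a region the decoder knows how to map back to the data manifold.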
It's a technique that limits the possible values the code can take, through the fact that you can only pack a limited number of little bubbles into the big bubble, right? So this is the bubbles-of-bubbles explanation of the variational autoencoder. Thank you for being with me and with us here today. About the notebooks I wanted to cover: let me know on the class forum whether you want me to go over them next time, or whether you'll go through them yourselves. And that's pretty much it. Thank you for listening, and I'll see you next time. Bye.