All right, so let's get back to the autoencoders that we had started. Well, not autoencoders: generative models, right? So let's restart with a quick review of autoencoders. Again, we have an input at the bottom, in pink, now that you can see the colors. Then you have the rotation, the affine transformation, and you get the hidden layer; then another rotation, and you get the final output, which we are going to enforce to be close to, to be similar to, the input. Again, you have a parallel kind of diagram where each transformation is represented with a box, right? In this case people call this network a two-layer neural net, because there are two transformations, but what I actually advocate is that this is a three-layer neural net, because for me the layers are the activations; that's usually the definition. And Yann now uses these new kinds of symbols that look like a box with a round top, okay?

All right, so we have two different diagrams here, because we can switch back and forth between the representations. Sometimes it's easier to use the left one, when we want to talk about the single neurons, but sometimes we prefer the other one, which can also account for multiple layers: each block here, the encoder and the decoder, can be several layers as well. So again, these are two macro-modules, I guess. The input goes inside an encoder, which gives us a code. So h, which before was the hidden representation of a neural net, is called the code when we talk about autoencoders. Therefore we have an encoder, which is encoding the input into this code, and then we have a decoder, which is decoding the code into whatever representation; in this case it's the same representation as the input. Okay, so on the right-hand side you have an autoencoder, and on the left-hand side you're going to see what a variational autoencoder is.

All right, so there you go: a variational autoencoder. Okay, it looks the same. So what's the difference? Nothing? It's missing something. The first difference is here: instead of having the hidden layer h, the code is actually made of two things. It's made of this E(z) and this V(z), and these are going to represent, as we'll see soon, the mean and the variance of this latent variable z. Then we are going to sample from this distribution that has been parameterized by the encoder, and we get z. z is my latent variable, my latent representation, and this latent representation goes inside the decoder. So I have a normal distribution with some parameters E and V; E and V are deterministically determined by the input x. But then z is not deterministic: z is a random variable, which you sample from a distribution that is parameterized by the encoder. Okay. So let's say h was of size d. Now the code, the code here on the left-hand side, is going to be of size two times d,
because we have to represent all the means and then all the variances. In this case we assume that we have d means and d variances, and that each of those components is independent. Okay. We can also think of the classic autoencoder as just encoding the means: if you encode the mean and you have basically zero variance, you get back a deterministic autoencoder. So h might be of size d in that case, and therefore on the left-hand side E and V together will total 2d, since we have d means and d variances.

So what does it mean that we're sampling the distribution? It's going to be one multivariate Gaussian whose components are all independent from each other. Therefore z is going to be a d-dimensional vector, but to sample a d-dimensional vector from this Gaussian you need d means and, in this case, d variances, because we assume that all the other components of the covariance matrix are zero; you only have the diagonal, where you have all the variances. Okay.

So here, just to make a recap: the encoder maps the input distribution, the input set of samples, into R^{2d}, and we can think of this as mapping from x to the hidden representation. The decoder instead maps the z-space into R^n, which is back to the original space of the x, and therefore we go from lowercase z to x-hat.

Someone asked if E(z) and V(z) are the output of the encoder. Yes: E(z) and V(z) are just parameters that are deterministically output by the encoder. The encoder is deterministic; it's just the classical rotation and squashing, and then another affine transformation. It's just a piece of a neural network which outputs some parameters. Okay, so this is the encoder, which gives me these parameters E and V given my input x; this is the deterministic part. Then, given these parameters, we have a Gaussian distribution with specific means and specific variances, and from this Gaussian distribution we sample one sample, and then we decode.

We're going to see what this means in a second, but basically you're encoding the mean and then adding some additional noise to that encoding. In the denoising autoencoder, we took our input, added noise to the input, and then tried to reconstruct the input without noise. Here, the only thing that has changed is that the noise is added to the hidden representation rather than to the input. Does it make sense? Yeah, that makes a lot more sense, thank you.

I noticed that the notation itself kind of looks like an expected value. Are we generating just a normal mean from z, or are we actually computing some kind of weighted average? No, no, there is no averaging. So, given my x, say d is going to be 10, that is the size of the hidden representation: now, instead of having 10 values representing the mean, we're going to have 20 values. Ten values represent the means and ten values represent the variances, okay? So we just output a vector h here, given my x; the first half of the vector represents the means of a Gaussian distribution, and the other half of the vector represents the variances of the same Gaussian distribution, okay?
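In symbols, the setup just described is (a quick formalization, using the lecture's E(z), V(z) notation for the mean and variance vectors of the code):

$$ \text{Enc}: \mathbb{R}^n \to \mathbb{R}^{2d}, \quad x \mapsto \big(\mathbb{E}(z),\, \mathbb{V}(z)\big), \qquad z \sim \mathcal{N}\big(\mathbb{E}(z),\, \operatorname{diag}\mathbb{V}(z)\big), \qquad \text{Dec}: \mathbb{R}^d \to \mathbb{R}^n, \quad z \mapsto \hat{x} $$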
So the first component, h1, is going to be the mean of the first Gaussian; then, let's say, h2 is going to be the variance; then h3 is going to be another mean, h4 another variance, and so on, okay? So would that make z a ten-dimensional vector that's sampled from those? Yeah, yeah: z here is going to be half of this size. The encoder gives me twice the dimension of z, and then, because half of these dimensions are for the means and half are for the variances, we sample from a Gaussian that has those values. So the network gives me not just the means, as for the classical autoencoder, but also the range that I can pick things from, right? Before, when we were using the classical autoencoder, we only had the means, and you simply decoded the means. In this case you not only have the means, but also some variance, some variation across those means. Okay?

So: a normal autoencoder is deterministic; the output is a deterministic function of the input. With a variational autoencoder, the output is no longer a deterministic function of the input. It's going to be a distribution given the input; a conditional distribution given the input, right?

So in this case, we saw a similar diagram last time, where we were going from a specific point on the left-hand side to the right-hand side. Here, we start from a point, and going through the encoder you get some position here; but then there is an addition of noise, right? If you only had the mean, you would get just one z; but given that there is some additional noise, due to the fact that we don't have zero variance, that final point, that final z, is not going to be just one point. It's going to be a fuzzy point. Okay? So instead of having one point, one x is now going to be mapped into one region of points. It's actually going to take up some space.

And then, how do we train the system? We train the system by sending this latent variable z through the decoder in order to get this x-hat, and of course it's not going to land exactly on the original point, because perhaps we haven't trained yet. So we have to reconstruct the original input, and to do that we try to minimize the squared distance between the reconstruction and the original input.

And then we had the problem from before: to go from the latent space to the input space, we need to either know the latent distribution or enforce some distribution. Last time we saw something similar when using the classical, standard autoencoder, but there we were going from one point x to one point z, and then back to x. Right now, instead, we're going to enforce a distribution over these points in the latent space. Before, we were going point to point to point, and then you don't know what's happening if you move around in the latent space, remember?
So if you have ten samples on the left-hand side, you automatically have ten latent variables on the other side. But then you don't know how to travel between these points; you don't know how to move in this latent space, because we don't know how the space behaves, okay? Variational autoencoders enforce some structure, and they do this by adding a penalty for being different, or far, from a normal distribution. So if you have a latent distribution which is not really resembling a Gaussian, then this term here will be very strong, very high; and when we train a variational autoencoder, we train it by minimizing both this term over here and this term over here. The term on the left-hand side makes sure that we can get back to the original position. The term on the right-hand side enforces some structure in the latent space, because otherwise we wouldn't be able to sample from there when we'd like to use this decoder as a generative model. Okay, this is maybe not too clear yet, but let me give you a little bit more to think about.

So, how do we actually create this latent variable z? My z is simply going to be my mean, E(z), plus some noise epsilon, which is a sample from a normal distribution (a multivariate Gaussian with zero mean and the identity matrix as covariance matrix), with each component multiplied by the standard deviation, right? You should be familiar with this; see the question here on the top right. This is how you rescale a random variable epsilon, which again is standard normal: you have to use this kind of reparameterization in order to get a Gaussian that has a specific mean and a specific variance, okay?

Is the noise in the latent variables just an encoded version of noise introduced in the input? No, there is no noise in the input. You put the input inside the encoder, and the encoder gives you two parameters, E and the variance. When you sample from this distribution, you basically get z; and you can write the sampling part as this expression here.

So the problem with sampling is that we don't know how to perform backpropagation through a sampling module. Actually, there is no way to backpropagate through sampling, because it just generates a new z. So how do we get gradients through this module in order to train the encoder? This can be done with a trick called the reparameterization trick. The reparameterization trick allows you to express the sampling in terms of additions and multiplications, which we can differentiate through, right? The epsilon is simply an additional input that comes from wherever; we have no need to send gradients through this input. The gradients go through the multiplication and through the addition, okay? So whenever you have gradients for training the system, the gradient comes down, and here we can replace the sampling module with an addition: E plus the epsilon multiplied by the square root of the variance, okay? Such that now you have an addition, and you know how to backprop through an addition; therefore you get gradients for the encoder here, at its output, and then you can compute the partial derivatives of the final cost with respect to the parameters in this module. Okay.
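Written out, the reparameterization trick just described is:

$$ z = \mathbb{E}(z) + \sqrt{\mathbb{V}(z)} \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I_d) $$

No gradient ever needs to flow into epsilon: it is just an extra input, and the element-wise multiplication and the addition are both differentiable, so the encoder receives gradients through E(z) and V(z).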
So, just as an intuition: this KL term here allows me to enforce a structure in the latent space. That's how I'd like you to think about this KL term. And so let's actually figure out how this stuff works, okay?

So we have two terms in my per-sample loss. We have the first one, which is the reconstruction loss, and then there is a second term, which is going to be this KL, this relative entropy term. Okay. So we have some z's in this case, which are spheres, bubbles. Why are they bubbles in this case? Because we add some additional noise, right? We had the means, and the means are basically the centers of these points: you have one mean here, one mean over here, one over here, one over here.

And then what the reconstruction term is going to do is the following. If these bubbles overlap, what happens? Say you have one bubble here and another bubble there, and they overlap, so there is a region of intersection. How can you reconstruct those two points later on? You can't, right? Are you following so far? If you have a bubble here, and then another bubble there, all points of this bubble here will be reconstructed to the original input here. You start from an original point, you go to the latent space over here, and you add some noise, so you actually have a volume here. Then you take another point, and that other point gets reconstructed there. Now, if these two guys overlap, how can you reconstruct the points in the overlap? If a point is in this bubble, I'd like to go back to the original point here; if it's in that bubble, I'd like to go to the other point. But if the bubbles overlap, then you can't really figure out where to go back, right? So the reconstruction term will do just this: it will try to push all those bubbles as far apart as possible, such that they don't overlap. Because if they overlap, then the reconstruction is not going to be good.

And so now we have to fix this. There are a few ways to fix this, right? So you tell me right now: how can we fix this overlapping issue? Why didn't we have this overlapping issue with the normal autoencoder? Because there is no variance. Aha, and so what does that mean? Can you translate what not having a variance means? The spheres are not spheres; they're points. Correct, right? So if you have just points, points will never overlap. Well, they'd have to be the exact same point, and you get the exact same point only if the encoder is dead, right? Or if you have the same inputs. I think, well, it's unlikely that two points overlap. If now, instead of having points, you actually have volumes, well, volumes can overlap, because there are infinitely many points in a volume, right? Okay.

So one option is going to be: kill the variance. And then you have points. But now this defeats the whole variational thing, this taking-up-space thing, right? By killing the variance, you no longer know what's happening between the points. Because if they take up volume, if they take up space, you can walk around in the latent space and always figure out where to go back. If these are points, as soon as you leave this position here, you have no idea whatsoever where to go. Okay. Anyhow: first option, we can kill the variance. The other option, well, the one I show you here, is going to be to push these bubbles as far apart as possible, right?
So if they go as far apart as possible, what's going to happen in your Python script? These means go very, very far; they will increase a lot, a lot, a lot, right? And then the problem is that you're going to get infinities: this stuff is going to explode, because all these values are trying to go as far as possible such that they don't overlap. And that's not good. Okay. All right, so let's figure out how the variational autoencoder fixes this problem.

Alfredo, could you just clarify what you mean by pushing the points apart? Are you putting them in a higher-dimensional space to push them apart? No, no, no. So, as they are here: if you don't have the variance, all those circles here, all those bubbles, are just points. Given that we have some variance, they will take up some space. Now, if the space taken by two bubbles overlaps, the reconstruction error will increase, because you have no idea how to go back to the original point that generated that sphere. And so the network, the encoder, has two options to reduce this reconstruction error. One option is to kill the variance, such that you get points. The other option is to send all those points off in any direction, such that they don't overlap. Okay. Okay, yeah, that makes sense.

Cool. So the reconstruction error gets this stuff to fly around. But then let's introduce the second term. I would really recommend you compute this relative entropy between a Gaussian and a normal distribution yourself, such that you can practice, maybe for next week. If you compute that relative entropy, you get this expression: basically four terms. And everyone should understand how this looks... no, okay, I'm just joking; I'm actually going to explain it. Okay.

So we have this expression; let's try to analyze in a bit more detail what these terms represent. In the first term you have the variance, minus the log of the variance, minus one. If we graph it, it looks like this. You have a linear function, right, after roughly two on the x-axis; and then, for the other term, you subtract a logarithm: if you add a minus logarithm, it goes to plus infinity at zero, and otherwise it just decays. So if you sum the two and subtract one, you get this kind of cute function. And if you minimize this function, you get exactly one. This therefore shows how this term enforces those spheres to have a radius of one in each direction: if it tries to be smaller than one, this stuff goes up like crazy, and if it increases beyond one, it doesn't go up as crazily, but it still gets penalized. So they all stay at roughly one, or, you know, a half, but they won't be much smaller, because this stuff increases a lot. So in this case we have forced the network not to collapse these bubbles, nor to grow them too much, because otherwise they still get penalized here.

Then we have another term here, this E(z), everything squared. That's a classical parabola, which has its minimum over there, at zero. So this term here basically says that all the means should be condensed towards zero. And so you get this additional force here, shown by the purple part, and all those bubbles get squashed together into this bigger bubble. So here you get the bubble-of-bubbles representation of a variational autoencoder.
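For reference, carrying out the suggested exercise (the relative entropy between a diagonal Gaussian and the standard normal) gives, per sample:

$$ D_{\mathrm{KL}}\Big(\mathcal{N}\big(\mathbb{E}(z),\, \operatorname{diag}\mathbb{V}(z)\big) \,\Big\|\, \mathcal{N}(0, I_d)\Big) = \frac{1}{2} \sum_{i=1}^{d} \Big( \mathbb{V}(z)_i - \log \mathbb{V}(z)_i - 1 + \mathbb{E}(z)_i^2 \Big) $$

The first three terms per component are exactly the radius-pinning curve graphed above (its minimum is at V(z)_i = 1), and the last term is the quadratic spring pulling the means towards zero.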
Okay. How cute is this? Very cute, right? How many bubbles can you pack? So, the only parameter here which tells you the strength of your variational autoencoder is simply the dimension d. Because, given a dimension, you always know how many bubbles you can pack inside a larger bubble, right? So it's just a function of the dimension you pick and choose for your hidden layer.

So is the reconstruction loss, the first term, the yellow term, the one that actually pushes the bubbles further apart, and then the rest of it is what kind of keeps them from doing that? Right. So the reconstruction would push things around, because we have this additional taking-up-volume thing, right? If we weren't taking up volume, the reconstruction term wouldn't push anything away, because points don't overlap. Given that we actually have some variance, the variance makes those points take up volume, and therefore this reconstruction term will try to push those points apart. So if you check those few animations I showed you again: at the beginning we had the points with the additional noise; then you get the reconstruction, which pushes everything apart; then you get the variance term, which ensures that those little bubbles don't collapse; and then you have the final term, which is the spring term, because it's the quadratic term in the loss, and it adds this additional pressure such that all the little guys get packed towards zero. But they don't overlap, because there is the reconstruction term. So: no overlap, due to the reconstruction; size not going smaller than one, because of the first part of the relative entropy; and then all these guys are packed together by the quadratic part, which is the spring force, right?

Is the beta term something that needs to be tuned, like a hyperparameter kind of thing? So, in the original version of the variational autoencoder there was no beta; and then there is a paper, the beta variational autoencoder, which just says you can use a hyperparameter to change how much these two terms contribute to the final loss.

This second loss term, with the beta: that's the KL divergence? Yeah, between the z, which is coming from a Gaussian with mean E and variance V, and the second term, which is going to be this normal distribution. And so this term tries to get z to be as close as possible to a normal distribution in the d-dimensional space. Okay.

And this formula we're looking at, is it generic? So, I would recommend you take paper and pen and try to write out the relative entropy between a Gaussian and a normal distribution; then you should get all these terms. So yes, this l_KL is the relative entropy. Just look up the formula for the relative entropy, which basically tells you how far apart two distributions are; the first distribution is going to be a multivariate Gaussian, and the second one is going to be a normal distribution, right?

Aren't the normal distribution and the Gaussian the same thing? The Gaussian has a mean vector and a covariance matrix; the normal has zero mean and the identity matrix as its covariance matrix. We said earlier, though, that the z should not have covariance; it should be diagonal, right? Yeah, yeah: the covariance is going to be diagonal, but the values on the diagonal are those v's, the variances, okay. So it's an off-center big Gaussian versus a centered small normal. It's off-center, and each direction is scaled by the standard deviation of that dimension. So if you have a large standard deviation in one dimension, it means that in that direction the distribution is very, very spread out, right? Makes sense; and it's aligned with the axes, right? Because, again, all the components are independent. Yeah.
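Putting the pieces from this exchange together, the per-sample loss being discussed is:

$$ \ell(x, \hat{x}) = \ell_{\text{reconstruction}}(x, \hat{x}) + \beta \, D_{\mathrm{KL}}\Big(\mathcal{N}\big(\mathbb{E}(z),\, \operatorname{diag}\mathbb{V}(z)\big) \,\Big\|\, \mathcal{N}(0, I_d)\Big) $$

with beta = 1 recovering the original VAE, and other values giving the beta-VAE weighting just mentioned.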
Is the reconstruction loss the pixel-wise distance between the final output and the original image? We saw that last week, and we have two options for the reconstruction loss. One was for binary data, where we have the binary cross-entropy, and the other one is instead the real-valued one, where you can use the (half) MSE, right? So these are the reconstruction losses we can use, for example.

You talk more with me than with Yann. Good... no, well, not good: you should talk with Yann as well. So, we should be going over the notebook, such that we can see how to code this stuff up and also play with the distributions. Because, again, the main point was that before, we were mapping points to points, and back to points, right? Right now, instead, you're going to map points to a region of space, and then the space back to points. But then all of the space is now going to be covered by these bubbles, because of the several factors we discussed, right? If you had some gap between these bubbles, then you'd have no idea how to go from that region back to the input space, right? The variational autoencoder instead gets you this very well-behaved coverage, this nice coverage, of the latent space. Okay.

Good. I can't see you; I miss you guys. Okay, so: code, or other questions so far? I hope you can see my screen; just give feedback. Can you see it? Yeah, yes, yep. All right: conda activate pdl, jupyter notebook. Boom. Okay, so I'm going to be covering the VAE now. I'm going to just execute everything, such that this stuff starts training, and then I'm going to explain things, okay?

All right. So at the beginning I'm just importing our random stuff as usual. Then I have a display routine; we don't care, don't add it to the notes. I have some default values for the random seeds, such that you're going to get the same numbers I get. Then here I just use the MNIST dataset, the Modified NIST, from Yann. For the device I set CPU or GPU; in theory I could have used the GPU, because my Mac here actually has a GPU. And then I have my variational autoencoder, okay?

So my variational autoencoder has two parts. It has an encoder here; let me turn on the line numbers. My encoder goes from 784, which is the size of the input, to d squared, and d in this case is 20, so 400. And then from d squared I go to two times d, of which half is going to be for my means and half for my sigma squareds, my variances. The decoder instead takes only d; you can see d right here. We go from d to d squared, and then from d squared to 784, such that we match the input dimensionality. And then, finally, I have a sigmoid. Why do I have a sigmoid? Because my input is going to be limited between zero and one; these are images with values from zero to one. Then there is a method here which is called reparameterise, and if we are training, we use this reparameterization operator.
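Here is a sketch reconstructing the model just described (assuming PyTorch; the dimensions follow the lecture, with d = 20, and the method and variable names are only meant to mirror the walkthrough):

```python
import torch
import torch.nn as nn

d = 20  # latent dimensionality used in the lecture

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, d ** 2),    # 784 -> 400
            nn.ReLU(),
            nn.Linear(d ** 2, 2 * d),  # 400 -> 40: half means, half log-variances
        )
        self.decoder = nn.Sequential(
            nn.Linear(d, d ** 2),      # 20 -> 400
            nn.ReLU(),
            nn.Linear(d ** 2, 784),    # 400 -> 784, back to the input dimensionality
            nn.Sigmoid(),              # squashes the output into (0, 1), like the pixels
        )

    def reparameterise(self, mu, logvar):
        if self.training:
            std = logvar.mul(0.5).exp()   # sigma = exp(logvar / 2)
            eps = torch.randn_like(std)   # one sample from N(0, I)
            return eps * std + mu         # z = mu + sigma * eps
        return mu                         # deterministic at test time

    def forward(self, x):
        mu_logvar = self.encoder(x.view(-1, 784)).view(-1, 2, d)
        mu, logvar = mu_logvar[:, 0, :], mu_logvar[:, 1, :]
        z = self.reparameterise(mu, logvar)
        return self.decoder(z), mu, logvar
```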
Sorry, could you just say again why you used the sigmoid in the decoder? Yeah: because my data lives between zero and one. So I have those digits from MNIST, and their values are going to be from zero to one. This module here outputs things that go from minus infinity to plus infinity; if I send that through a sigmoid, it sends things to the range zero to one, okay?

When you say the values of the digits, you mean the pixel values, right? So, I use the MNIST dataset, and the images are going to be both my input and also my targets, right? And the values of these images range between zero and one, like a real value; each pixel can be between zero and one. Yeah, well, actually I think the inputs are binary, so the inputs are all zeros or ones, but my network will be outputting a real range between zero and one.

Okay, the reparameterisation: what do we do here? So, the reparameterisation, given a mu and a log variance (I'll explain later why we use the log variance): if we are in training, we compute the standard deviation, which is going to be the log variance multiplied by one half, and then I take the exponential. So I get the standard deviation from the log variance. Then I get my epsilon, which is simply sampled from a normal distribution, with whatever size I have here: from the standard deviation I get the size, I create a new tensor, and I fill it with normally distributed data. Then I return the epsilon multiplied by this standard deviation, and I add the mu, which is what I showed you before. If I am not training, I don't have to add noise, right? So I can simply return my mu, and I use this network in a deterministic way.

The forward method is the following. Here, the encoder gets the input, which is reshaped such that I basically unroll the images into a vector. Then the encoder is going to output something, and I reshape that one such that I have the batch size, then two, and then d, the dimension of the means and the dimension of the variances. Then I have mu, which is the mean, simply the first part of these guys, and the log variance is going to be the other part. And then I have my z, which is going to be my latent variable: it's this reparameterisation, given my mu and my logvar.

Why do I use a logvar? You tell me: why do I use a logvar? Because the output of the network can be negative? Right, right, right. So, given that the variances are only positive, computing the log allows the encoder to output the full real range, right? So it can use the whole real range.

And then I define my model as this VAE, and I send it to the device. Here I define the optimizer, and then I define my loss function, which is the sum of two parts. The binary cross-entropy between the input and the reconstruction, which is here: I have the x-hat and then the x, and I sum over all of them. And then the KL divergence, where we have the var, which is the linear part; then you have the minus logvar, which is the flipped-down logarithm; then minus one; and then we have the mu squared. And then we try to minimize this stuff, right?
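And a sketch of the loss function just described (again assuming PyTorch; x serves as both input and target):

```python
import torch
import torch.nn.functional as F

def loss_function(x_hat, x, mu, logvar):
    # reconstruction term: binary cross-entropy, summed over pixels and batch
    BCE = F.binary_cross_entropy(x_hat, x.view(-1, 784), reduction='sum')
    # relative entropy term: 0.5 * sum(var - log var - 1 + mu^2)
    KLD = 0.5 * torch.sum(logvar.exp() - logvar - 1 + mu.pow(2))
    return BCE + KLD
```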
All right, so the training script is very simple, right? You have the model, which outputs the prediction x-hat. Let's see here: forward outputs the output of the decoder, the mu, and the logvar. So here you run the model, you feed the input, and you get x-hat, mu, and logvar. You can compute the loss using the x-hat, x, mu, and logvar, with x being the input but also the target. And then, yeah, we add the item to the running loss, we clean up the gradients from the previous step, perform the backward computation to get the partial derivatives, and then you step. And then here I just do the testing, and I do some caching for later on. So we started with an initial error of roughly 540; this is before training. Then it goes immediately, immediately down to 200, and then down to 100, okay?
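The loop just walked through, as a sketch (assumed names: model, optimizer, train_loader, device, and the loss_function above):

```python
epochs = 10  # assumed

for epoch in range(epochs):
    model.train()
    train_loss = 0
    for x, _ in train_loader:          # the labels are unused
        x = x.to(device)
        x_hat, mu, logvar = model(x)   # forward pass
        loss = loss_function(x_hat, x, mu, logvar)
        train_loss += loss.item()      # bookkeeping
        optimizer.zero_grad()          # clean up gradients from the previous step
        loss.backward()                # compute the partial derivatives
        optimizer.step()               # update the parameters
    print(f'epoch {epoch}: train loss {train_loss / len(train_loader.dataset):.1f}')
```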
And so now I'm going to show you a few of the results. This is the input I feed to the network, and the untrained network's reconstructions of course look like, well, you know. But okay, that's fine. So we can keep going, and that's going to be the first epoch, right? Cool. Second epoch, third, fourth, and so on, and they look better and better, of course.

So what can we do right now? A few things. For example, now we can simply sample z from a normal distribution, and then I decode this random stuff. So this doesn't come from an encoder; I'm showing you now what the decoder does whenever you sample from the distribution that the latent variable should have been following. And so these are a few examples of how samples from the latent distribution get decoded into something. We got a nine here, we got a zero, we got some fives. So some of the regions are very well defined: a nine, a two. But then other regions, like this thing here, or this thing here, or number 14 here, don't really look like digits. Why? What's the problem here? We haven't really covered the whole space. I just trained for one minute; if I train for ten minutes, it's going to work just fine, okay? So here those bubbles don't yet fill the whole space, right? And that's the same problem you would have with a normal autoencoder, without this variational thing: with normal autoencoders you don't have any kind of structure, any kind of defined behavior, in the regions between different points. With a variational autoencoder, we actually take up the space and enforce that the reconstruction of this whole region actually makes sense. Okay.

So let's do some cute stuff, and then I am done. Here I just show you a few digits, and let's pick two of them: say, number three and number eight here, which turn out to be a five and a four. Let me show you. So we'd like to find an interpolation now between a five and a four; this is my five reconstructed, and my four reconstructed. If I perform a linear interpolation in the latent space, and then send it through the decoder, we get this one: the five gets morphed into a four, you can see, slowly, but it looks like crap. Let's try to get something that stays on the manifold. So let's pick, for example, these two: it's going to be sample number one, and then, let's say, maybe this number 14 here. So I do the interpolation of these guys, and you can see my autoencoder actually fixed those kinds of issues here. And you can see now how the three gets those little edges closed up to look like an eight, right? And so all of them look kind of legit: this is kind of a three, kind of a three, a three that became an eight, right? And so you can see how now, by walking in the latent space, we get to reconstruct things that look legit in the input space, right? This would have never worked with a normal autoencoder.
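A sketch of the two experiments above, sampling from the prior and interpolating between two codes (assumed: the trained model and device from before; A and B are the d-dimensional latent codes of two test digits, e.g. the means the encoder produces for them):

```python
import torch

# 1) sample from the prior and decode
with torch.no_grad():
    z = torch.randn(16, d, device=device)            # 16 random latent codes
    samples = model.decoder(z).view(-1, 28, 28)      # decoded into images

# 2) walk along the segment between codes A and B and decode each point
N = 16
alpha = torch.linspace(0, 1, N, device=device).unsqueeze(1)
codes = (1 - alpha) * A + alpha * B                  # (N, d) interpolated codes
with torch.no_grad():
    morphs = model.decoder(codes).view(-1, 28, 28)   # e.g. the five morphing into the four
```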
Finally, I'm going to show you a few nice representations of the embeddings, of the means, for this trained autoencoder. So here I show you a collection of the embeddings of the test dataset; then I perform a dimensionality reduction, and I show you how the encoder clusters all the means in different regions of the latent space. And so here is what you get when you train this variational autoencoder. This is at the beginning, when the network is not trained; you can still see, you know, clusters of digits. But then, as you keep training, well, at least after five epochs, you get these groups to be separated, and I think if you keep training more, you should get even more separation. Okay.

So here, basically, during the testing part, I collect all the means: my model outputs x-hat, mu, and logvar, right? And so I append all my mus to this means list, I append all the logvars to this logvars list, and I append all the y's to this labels list, during the testing part, right? So I have a list here of codes: the mu, the logvar, and then the y's. Later on I put those lists inside a dictionary here, and then, here below, I compute a dimensionality reduction for epoch zero, epoch five, and epoch ten. I use this t-SNE, which is a technique for reducing the dimension of the codes, which is 20: right now the dimensionality is 20. So I fit it on, let's say, the first thousand samples of the means, and I get these E's, which are basically a 2D projection, somehow, of these 20-dimensional mus, okay? And then I show you in this chart how these 2D projections look: at epoch zero, before training the network, because this one is before the first training epoch, and then at epoch five. And you can see how the network gets all this mess here to be, you know, more nicely arranged. Here I didn't visualize the variances; I'm thinking about whether I can do that as well, I'm not sure. So each of these points represents the location of the mean after training the variational autoencoder; I haven't represented the area that these means actually take up, okay?

Aren't the means supposed to be random at epoch zero? The randomness is in the encoder, right? But then you still feed those input digits to the encoder, and the input digits, say all the ones, are kind of similar to each other, right? So if you perform a random transformation of those similarly looking initial vectors, you're going to get similarly looking transformed versions. But then they are not necessarily grouped all together; most of them are, though. For example, let me turn on the color bar, so we can see what this stuff is. So let's say these over here are the zeros. All zeros look similar, therefore even a random projection of those zeros will all be kind of together. What you can see instead is that this purple is all spread around, right? Those are the fours; there are very many ways of drawing a four, you know: someone closes the top, someone doesn't. If you look at the right-hand side instead, all the fours are almost all here, right? There's just a little cluster here next to the nines, because, you can think about it, if you write a four like that, it's very similar to writing a nine, right? And so you have these fours here that are very close to the nines, just because of how people drew those specific fours, okay? Nevertheless, they are still clustered over here; on the left, instead, all these things are spread around, which is very bad. Also, this diagram here shows you that there is very little variance across the drawings of a zero, okay? It shows you that, somehow, there is a specific mode; it's very concentrated here, but it's really not concentrated for these guys.
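A sketch of the projection step described above (assuming scikit-learn; means is the list of saved mu batches from the test loop, and labels the matching digit labels):

```python
import torch
from sklearn.manifold import TSNE

X = torch.cat(means).cpu().numpy()[:1000]   # first thousand 20-dimensional means
E = TSNE(n_components=2).fit_transform(X)   # 2-D projection of the codes
# scatter E[:, 0] against E[:, 1], coloured by labels, one chart per saved epoch
```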
So, I'm just curious: what are some other motivations or usages of variational autoencoders? Like, why are they useful? So, the main point was what I showed you in class two weeks ago: a generative model. You cannot have a generative model with a classical autoencoder. In this case here, again, I didn't train this stuff for long; if you train it longer, you can have better performance here. And the point is that my input z comes from just this random distribution, okay? And then, by sending this random number, a number coming from a normal distribution, inside this decoder: if this decoder is actually a powerful decoder, then this stuff will actually draw very nice shapes, or numbers. Like, for example, those two images I showed you, the two faces, in the first part of the class last time: those are simply, you take a number from a random distribution, you feed it to a decoder, and the decoder is going to draw you this very beautiful picture of whatever you trained the decoder on, okay? And you cannot use a standard autoencoder to get this kind of property, because here, again, we enforce the decoder to produce meaningful, good-looking reconstructions when the codes are sampled from this normal distribution. Therefore, later on, we can sample from this normal distribution, feed things to the decoder, and the decoder will generate stuff that looks legit, right? If you didn't train the decoder to perform a good reconstruction when you sample from this normal distribution, you wouldn't be able to get anything meaningful. Okay, that's the big takeaway here. Next time we're going to be seeing generative adversarial networks, and how they are very similar to this stuff we have seen today.

Hi Alfredo, I have a question about the yellow bubbles. Each yellow bubble comes from one input example? Yeah. So if we had 1000, I don't know, images, 1000 inputs, that means we have exactly 1000 yellow bubbles? Yeah. And each yellow bubble comes from the E, V distribution, together with the noise added to the latent variable? So, where the bubble comes from: let me show you in the notebook.

Okay. So here you get this x, and this x goes inside the model, right? Whenever you send this x through the model, it goes inside forward; so x goes inside here, and then it goes inside the encoder, right? Okay, and then the encoder gives me this mu and logvar, from which I just extract the mu and the logvar, okay? So, so far, everything is like a normal autoencoder. Okay, the bubble comes here. My z now comes out from this self.reparameterise, and this self.reparameterise is going to work in a different way depending on whether we are in the training loop or not. If we are not in the training loop, I just return the mean; so there is no bubble when I'm testing, okay? I get the best value the encoder can give me. If I am training, instead, this is what happens. I compute the standard deviation from this logvar: I take the logvar, divide it by two, and then take the exponential, right? So I have e to the one-half logvar, such that you get, you know, the standard deviation. And then the epsilon is going to be simply a d-dimensional vector sampled from a normal distribution. So this one is one sample coming from this normal distribution, and the normal distribution is, you know, like a sphere in d dimensions, a sphere with a radius which is going to be roughly the square root of d. And then here, at the end, you simply rescale and shift that thing.

The point is that every time you call this reparameterise function, you're going to get a different epsilon, because epsilon is sampled from a normal distribution, right? So, given a mu and given a logvar, you're going to be getting different epsilons every time, and therefore this line here, if you call it a hundred times, is going to give you 100 different points, all of them clustered around mu with a radius of, you know, roughly the standard deviation. And so this is the line which returns you just one sample each time; but if you call it in a for loop, you're going to get, you know, a cloud of points, all of them centered at mu, with a specific radius, okay? And so this is where the bubbles come from: from the sampling. Yes, if you want 100 samples, you have to run it 100 times; this reparameterization gives you a different point every time, which is, you know, parameterized by this location and this volume, right? Yeah. And the mu and the log variance come from one sample, one input example? Yeah, yeah: one input x here gives me one mu and gives me one logvar, and this one mu and this one logvar give me a z, which is one sample from the whole distribution. If you run this function here 1000 times, you're going to get 1000 z's, all of which will take up this volume, right? Okay. Got it, got it. Thank you. Of course.
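The bubble in code, as just explained: calling reparameterise repeatedly with the same mu and logvar yields a different z each time (a sketch; model is the VAE sketched earlier, put in training mode so the noise is actually added, and x is assumed to be one input image):

```python
import torch

model.train()                           # so that reparameterise actually samples
mu_logvar = model.encoder(x.view(-1, 784)).view(-1, 2, d)
mu, logvar = mu_logvar[:, 0, :], mu_logvar[:, 1, :]

# 100 calls -> 100 different points, all centred at mu with radius ~ the std
cloud = torch.stack([model.reparameterise(mu, logvar) for _ in range(100)])
```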
I had a question about autoencoders, or, sorry, encoders and decoders in general. It looks like this implementation is fairly straightforward, in that it just has a couple of linear layers with a ReLU and a sigmoid. Previously we've seen encoders where they're using attention and all this stuff. It seems like something as basic as this is pretty satisfactory; I don't know, are they usually this basic, or more complex?

Okay, okay, that was a softball for me, I think. So, everything we see in class is something I've tried, that works, and it's fairly representative of what is sufficient to get this stuff to run. You know, I'm running on my laptop, on the MNIST dataset; you can run several of these kinds of tests and play around. And so today we have seen how you can code up a variational autoencoder, and all you need is like three, four lines of code, which are the differences from the plain autoencoder, right? So the difference is: you have the reparameterise method here, and then, you know, just these three lines over here, right? So you have like six lines, plus the relative entropy.

The architecture, that's completely different; it's completely orthogonal, right? One thing is the architecture, which is based on the kind of input you have: you can use a convolutional net, you can use a recurrent net, you can use anything you want. The other thing is the fact that you convert a deterministic network into a network that allows you to sample, and then generate samples from a distribution. Okay? So we never had to talk about distributions before; we didn't know how to generate distributions. Now, with a generative model, you can actually generate data, which is basically, how do you say, a bending, a rotation, a transformation of whatever the original Gaussian is, right? So we have this multivariate Gaussian, and then the decoder takes this ball and shapes it to make it look like the input; the input may be something curved, say. You have this big bubble of bubbles, and then the decoder gets it back to whatever the input looks like. So, what you need depends on the specific data you are using. For MNIST, this is sufficient; if you use a convolutional version, it's going to work much better. The point is that this class was about the variational autoencoder, not about how to get crazy stuff: all the crazy stuff is simply, you know, adding several of these things I've been teaching you so far. But the bit about the variational autoencoder, I think, was covered mostly here. Okay.

Okay, thanks. Other questions? No? Okay, that was it, then. Thank you so much for joining us. Okay, everyone, almost 70 percent have left already. See you next week. Bye. All right, thanks y'all. Bye-bye.