All right, so a quick reminder: on Zoom, you can react on the bar so that you can communicate your feelings to me, right? And I understand how you're feeling about the class and everything, okay? So I've been working all overnight, I didn't sleep, to prepare this class, and again, all the diagrams and everything are new. And I might be a bit crazy because, again, I slept very little. So don't joke around too much, okay? If I go a little crazy, don't push too hard on me, okay? Okay, so what's going on here? Oh, okay, Vincenzo, what? Oh, Vincenzo. No, okay. Lol, what's going on here? This is not okay, right? I mean, it's normal, right? Okay, today we talk about GANs, right? So GANs are gonna be generating more copies of your data. So you have some data, you want to make more data. And so you're gonna be using these GANs to make copies of the original data which look like the original, okay? And so you can make more and more copies, even copies that you cannot tell apart from the original, okay? And so today the main part is gonna be how to train your GAN, your generative adversarial network. But of course, yeah, you have to stop this kind of reproduction of clones at some point, otherwise you might get into trouble, okay? So, okay, I'm just delusional, okay? I think we are all, okay, we don't know what's going on. All right, but then you told me on Campuswire that you want me to talk about the videos, right? Sorry, about the notebooks. So we start from the notebooks, okay? Because again, I haven't explained them too much, all right? Okay, so let's get started: it's the 1st of April, 2021, 9:33 in the morning, New York City; me, Alf, Vincenzo, and all the team here behind me helping out for this class, which are the TAs and other people that have shown up during this course. All right, so, sharing the screen. All right, let me open the chat because I have no idea what you're writing.
Okay, let's read the chat. Wow, lol, so many Alfs, okay? All right, I hope you enjoyed it. If you missed the beginning of the class, you can watch the recording. Okay, okay, okay, it worked. I didn't sleep, I slept one hour tonight, okay? But that's because I'm crazy and it's okay. All right, so we are starting with the notebooks, right? The stuff from last week, which were the autoencoders. So you should be able to see something, I think. Full screen. So, autoencoders, right? So we figured that we're gonna be using the autoencoder to generate a representation of our data, which we call Y, because there is no condition, right? X means you have something during testing. When you use unsupervised learning, there is nothing you get during testing, right? So you just have your data, and your data is gonna be the Y. Y is gonna be the data that appears only during training, okay? So I changed the names of the variables in this notebook to reflect this new, how do you call it? New notation, okay? But I haven't pushed it yet. Anyway, so here we import a bunch of libraries, we don't care. This is to denormalize an image, we don't care. This is to display images, we don't care. I defined the loading step, perhaps we care. So I have a batch size of 256 images. I have a transformation which subtracts 0.5 from the images, which go from zero to one, and then divides by 0.5, which is gonna be scaling these things to minus one to plus one. And why do we do this? So that we have roughly zero-mean, unit-scale data, right? So the networks are happy. Then we load the MNIST data and we have a data loader, right? Then we set the device such that we can use the GPU, in case we have a GPU, or otherwise it goes for the CPU. So we start with the default autoencoder, okay? So we have an input n, which is the size of your Y, right?
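Going back for a second to the transform: it can be written out in plain Python (a minimal stand-in for the normalization step in the notebook; the function name here is mine):

```python
def normalize(pixel):
    """Map a pixel value from [0, 1] to [-1, 1]: subtract 0.5, then divide by 0.5."""
    return (pixel - 0.5) / 0.5

# The endpoints and midpoint of [0, 1] map to -1, 0, and +1.
print(normalize(0.0), normalize(0.5), normalize(1.0))  # → -1.0 0.0 1.0
```

Note this rescales the *range* to [-1, 1]; it only gives exactly zero-mean, unit-variance data if the pixels happened to have mean 0.5 and standard deviation 0.5 to begin with.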
Which is 28 by 28 pixels, because we are using MNIST, which is modified NIST. Modified by whom? By Yann, right? Like 30-something years ago. He took the NIST data set and he modified it into MNIST. Apparently NIST itself then created a modified modified-NIST, I think five years ago, but okay, we didn't know about that, interesting. All right, anyway, so in this case we choose an autoencoder whose internal hidden representation has a size of 13, okay? Which is much smaller than 784, right? So we can assume that this is gonna be an undercomplete autoencoder. So I have my autoencoder; I define it in the more flexible way, where I have the init, like I subclass the nn.Module, then I have the super init, and then I have an encoder and a decoder, right? We've seen this in the slides last week. We're gonna be covering this again when we talk about the GANs, so if you don't remember, it's okay. In this case, we have a rotation and a squashing. How does the forward work? So in this case, we do have a forward. The point here: I didn't show you any code for the latent-variable energy-based models. That model didn't have a forward, it just had a decode, because we didn't have an encoder, we didn't have X, okay? We just had a Y, and the Y was generated by sampling a Z and then doing minimization in the latent space. In this case, we actually have an encoder, because we are encoding this same Y into h, which is a hidden representation. Then we get this Y tilde. How do you type Y tilde? You type Y, then backslash tilde, then you press Tab, boom. You can also do Y backslash dot, boom. Or you can do Y backslash bar. You can do many things, okay? You can do sigma. You can do capital Sigma, no, okay? This is how I use a Jupyter notebook, and Python 3 supports Unicode. So this is not coding, these are notebooks, this is math. So I like to do it this way.
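The shapes involved can be sketched with NumPy standing in for the PyTorch modules (the weights, names, and the tanh squashing here are all illustrative, and biases are omitted):

```python
import numpy as np

n, d = 28 * 28, 13                           # input size 784, undercomplete hidden size 13
rng = np.random.default_rng(0)
W_enc = 0.01 * rng.standard_normal((d, n))   # encoder weights: a linear map ("rotation")
W_dec = 0.01 * rng.standard_normal((n, d))   # decoder weights: maps back to input size

def encoder(y):
    return np.tanh(W_enc @ y)                # rotation followed by squashing

def decoder(h):
    return np.tanh(W_dec @ h)

y = rng.standard_normal(n)                   # a fake flattened 28x28 image
h = encoder(y)                               # hidden representation h
y_tilde = decoder(h)                         # reconstruction, Y tilde
print(h.shape, y_tilde.shape)                # → (13,) (784,)
```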
So here I have my autoencoder, which I create, and then I have my criterion, the squared Euclidean distance. Learning rate 10 to the minus three, optimizer Adam, and so on, okay. So for 20 epochs, I do the following, right? These are the classical five steps for training something. First of all, I get my images from the data, right? Our Ys. Then I send the Y to the correct device. Then I change the size such that these are reshaped into big long vectors, right? We are not using convolutions. So this autoencoder lesson is, you know, orthogonal to the convolutional lesson. In fact, if you look inside here, the encoder has a linear module, so it expects a vector, right? Of size 28 squared, which is 784. Anyway, type in the chat if I'm too slow, okay, or if you need any clarification. There are also, you know, go-slow and go-fast buttons on the Zoom. All right, so here, we said, we get the data, we send it to the correct device, we reshape it into vectors. This one we don't care about, because it's for the second part. Then we have the output, which is simply sending the image through the autoencoder, which is basically going through the forward, which means it goes inside the encoder to give you the hidden representation and then through the decoder to give you the Y tilde. The Y tilde and the Y feed a cost C, right? To make the Y tilde close to the original Y. And this cost C is gonna be this criterion here, okay? All right, so we have the output, which is gonna be my Y tilde, right? Then I compute the energy, actually, and the loss functional is gonna be the energy function. Okay, we can call it loss, whatever, which is gonna be my quadratic distance. And then I have optimizer.zero_grad() to clean up the trash. Then you compute the partial derivatives of the loss with respect to the parameters, and then you step in the opposite direction of the gradient. Right, so, so far, everything is fine. Then I print the performance at every epoch.
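Those steps boil down to one small loop. Here is a pure-Python toy with one parameter and one data point, keeping the same structure (forward, energy, gradient, step); nothing here is the notebook's actual code:

```python
# Toy "training": minimize the energy (w*x - y)^2 by gradient descent.
x, y = 2.0, 6.0          # one data point; the optimum is w = 3
w, lr = 0.0, 0.1         # parameter and learning rate

for _ in range(100):
    y_tilde = w * x                  # forward pass: the model's output
    energy = (y_tilde - y) ** 2      # compute the loss / energy
    grad = 2 * (y_tilde - y) * x     # backward: dE/dw (zero_grad is implicit here)
    w -= lr * grad                   # step in the opposite direction of the gradient

print(round(w, 4))  # → 3.0
```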
Here, this stuff, okay: we start from an initial loss of 0.2, basically, and then we go down to 0.06. And so this was the first reconstruction of my input, right? You can't even tell what's going on. But then, as you can see, through the epochs the reconstructions are getting better and better and better. And these actually look reasonably good, okay? Like, they are digits that you can recognize. No questions, right? So there is no fancy thing here, right? Let me know if anything was unclear so far; if not, I keep going. So then I display, what do I display here? Here I display the weights, right? But I actually reshape the weights such that they look like images, right? These actually were vectors. The kernels are the row vectors of the matrix, right, and I can reshape them in the form of images just to see what kind of pattern they've been extracting. You know, we perform a scalar product, which is, you know, measuring the alignment between the kernel and the different images. And so I reshape the kernels into actual images, right? To get something out of this. And so here you can see how these kernels have some kind of pattern within the central region. So here you can see there is black, which is minus one, and there is yellow, which is plus one. So there are these minus-one and plus-one, or purple and yellow, regions, right? In the center, which are basically edge detectors, right? My question to you is: why do we observe all this salt-and-pepper noise outside the digit region? Okay, so, question for people at home, if you are listening and you're awake: why do you see salt and pepper? Salt and pepper is a type of noise, which is this kind of pixelation you can see on the edges, right, of these kernels, these filters.
Question for people at home: why do we observe this kind of pixelation, this kind of salt-and-pepper noise? Type in the chat, so I can understand whether you're following, or we can find out the correct answer together if we don't know. Anyone? "Low frequency in the picture getting picked up." That was my first guess four years ago. Then I actually figured out that was not the case, but it was definitely a possible guess, right? What are other options? So, what happened to this guy here, this kernel? And why do we observe a pattern in the center? Can you remember what kind of data we are dealing with? "It is because you never put any character except in the center" — and that's correct, Jeffrey. And so what does it mean? What kind of frequency do you have here? Oh, sorry, my bad: actually, the previous student's answer was correct too, right? This region here is low frequency, right? Low spatial frequency means it's uniform, right? The point is that, given that all this contour is uniform, if you sum the plus ones and minus ones, they will cancel out, right? All this purple region is gonna be minus one; all the yellow regions here are plus ones, okay? So if all the salt and pepper values, which are plus ones and minus ones, are summed together, on average they produce zero, right? So whenever you perform the scalar product, they cancel out; their contribution to the final score will be zero, because on average they are zero mean, and so they contribute zero for most of the pictures, right? The pixels on the outside border are always multiplied by a constant, so those zero-mean weights never matter, okay?
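You can check this argument numerically: untrained, roughly zero-mean "salt and pepper" weights, multiplied by a constant border pixel, contribute essentially nothing to the scalar product. A sketch with made-up numbers:

```python
import random
random.seed(0)

# Untrained border weights: random plus-or-minus ones, zero mean on average.
border_w = [random.choice([-1.0, 1.0]) for _ in range(10_000)]

background = -1.0  # the MNIST border is a constant after normalization

# Contribution of the border to the score: dot product with a constant input,
# averaged per pixel.
per_pixel = sum(w * background for w in border_w) / len(border_w)
print(per_pixel)  # close to zero
```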
So that's what happens here. What happens to this kernel? This kernel simply died, okay? Same for this one; maybe for this one there is some low-frequency something here, okay? This is high frequency on the right-hand side here; you can see there are many clear lumps, okay? All right, cool. Second part: we switch now to a denoising autoencoder. So instead of having 13 dimensions, we switch to 500 dimensions. So, is this an over-complete hidden layer or not? Well, someone could argue: 784 is the size of the input, 500 is the dimension of the hidden layer, so the hidden layer still looks smaller than the input. Well, yes, but the pixels actually utilized in the input are not 784; they are many fewer than 500. So, yeah, exactly: 500, we assume, is over-complete. I didn't go larger because it takes forever to train. So I have the exact same architecture, but now we're gonna train it as a denoising autoencoder. Do you remember what the denoising autoencoder does? It's a contrastive technique. You take a sample from the data set; this one should have low energy. You pull it away by some distance, and the squared distance by which you pulled it away will be the energy you assign to this pulled-off-the-manifold sample, okay? How do you do that? By enforcing the model to reproduce the original location, right? And so at the end, when you compute the squared Euclidean distance, if you always output the original point from which you moved, you're gonna get the squared distance of how far you dragged this away from the original point, right? And so we are basically learning a vector field which brings you back to the original location. We have to move on to the variational autoencoder, because otherwise we'll be late, right? So here is actually the same architecture, I just changed the dimension of the hidden size, right? Here is actually the same cell. Here we had to change a few things, right?
So first of all, I create this module, which is called a dropout module, which sets some pixels to zero. So our images were minus one to plus one, right? Remember, we had that normalization. Now we choose to set some random pixels to zero, right in between. So we are gonna have basically three values, minus one, zero, and plus one — kind of, because I don't think the images are binary; I think they are grayscale, but anyway. So we have that one, everything else the same. In this case, I create here a noise mask, okay? Which is my dropout applied to a vector of all ones, right? And then my bad image is gonna be simply my original image multiplied by this noise, and the noise is this mask of ones and zeros, right? And then into the model I will send not the image but the bad image, okay? So I provide the model with a bad image, but then the loss, the distance, is gonna be computed between the Y tilde, the output of the model, and the original image, not the bad one, right? So I enforce the model to output the original location, the original point, given that I provide a bad one as input. One last thing: instead of sending the clean images here, I'm gonna actually send the corrupted, bad images; it's called image_bad, right? Image bad, right. So we train this one for 20 epochs, and it's already run; I ran this before you woke up. We start from around zero, more or less like before, and we go to an actually slightly larger value, right? So here we can see what's going on, right? So this is my input. So I took the image, which was minus one to plus one, more or less, and then I added a set of zeros, and zero now is gonna be green, right? Because that's the colour scale I'm using here. So what I'm sending to the network is this garbage, okay? And then what comes out of the network is garbage. Why? Because the network is not trained.
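The corruption step just described — multiply the image by a mask of zeros and ones, feed the bad image to the model, but compute the loss against the clean image — fits in a few lines. This is a sketch: the name `image_bad` mirrors the notebook, everything else is mine, and note that a real PyTorch `Dropout` also rescales the surviving values by 1/(1−p), which is ignored here:

```python
import random
random.seed(1)

def dropout_mask(n, p=0.5):
    """Return a list of n values: 0 with probability p, else 1 (a noise mask)."""
    return [0.0 if random.random() < p else 1.0 for _ in range(n)]

image = [-1.0, 1.0, 1.0, -1.0, 1.0, -1.0]          # a tiny fake normalized image
mask = dropout_mask(len(image))
image_bad = [y * m for y, m in zip(image, mask)]   # corrupted input: -1, 0, +1 values

# The model receives image_bad, but the reconstruction loss compares the model
# output y_tilde against the CLEAN image:
y_tilde = image_bad                                # stand-in for model(image_bad)
loss = sum((a - b) ** 2 for a, b in zip(y_tilde, image)) / len(image)
print(set(image_bad) <= {-1.0, 0.0, 1.0})  # → True
```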
Then you keep doing this for several iterations: you still provide garbage, but then, oh, okay, things start looking more reasonable — still garbage. Keep training, you know; after 20 epochs you scroll, scroll, scroll, and you can see that things start getting reasonable — still awful, but okay, better, okay? Maybe I didn't train it enough. Okay, here we go. You get a nicer nine, six, zero, four, I guess. You can see, you know, it's really few pixels — like, I don't know, 20 pixels I send inside — and this thing actually reconstructs the whole shape, right? And you can see, no? How bad, how good? And this is basically because we got to instill some, what's it called, some bias into the model, right? The model knows what type of things it should reproduce. All right, so, interestingly, now, if I show you the weights of this denoising autoencoder, what happens here? Oh, can you see? You're not reacting — oh, maybe I don't see your reactions, but it's okay. What's the main difference — okay, thank you for the reaction — what's the main difference between these and the kernels we saw before? It's what I was just asking, right: the salt-and-pepper noise just disappeared. Why is that? Because now those weights are no longer multiplied by constants; they will be multiplied by stuff that is changing, right? All these pixels will have changing values, and so the kernel has to learn to ignore whatever happens outside the region of interest. That's why all these pixels are now set to zero — yes, the noise was useless there, I agree — whereas before, we were just learning to capture some relationships in the center and completely ignoring what was happening outside, therefore leaving those weights unchanged, given that their contribution to the final outcome was zero, right?
Now that the input pixels are constantly changing, we can no longer ignore the values of the kernel in the surrounding area, okay? That's so cool, I think. Anyway, a few more interesting things. Okay, what happened here? We know, right? Can someone remind me what happened to this kernel right here? It died, yeah. Do we know why? No, we don't know. I mean, I don't know. So, what do we do here? One more thing before we switch to the variational autoencoder, because I want to talk about GANs today as well. I import here state-of-the-art inpainting algorithms: one is the Navier-Stokes one and the other one is Telea's, right? These are classical state-of-the-art algorithms for inpainting. Inpainting is figuring out what colour those missing pixels should have, okay? So here I'm showing you the noise mask, okay? The one that I used for corrupting my data. Here I show you my corrupted data. Below you have the original data — you can tell now it's grayscale, it's not binary, it's not a bitmap, because you can see some intermediate values in the center, right? But most pixels are either minus one or plus one. And then I show you here the reconstruction from our denoising autoencoder. You can say, okay, fair, there are some edges that are not that nice. How does this compare to state-of-the-art algorithms that perform inpainting only, right? And there we go. These were the state-of-the-art algorithms in computer vision, but now computer vision is basically deep learning, right? Until a decade ago, there was no deep learning in computer vision, but okay, we are in a different age, I'm old. All right, so this is the denoising autoencoder. Very well, it works. Ah, one more thing. So we learned that the denoising autoencoder is a contrastive technique; it's an energy-based model, which should assign — remind me, type in the chat — what do energy-based models do?
What does an energy-based model do? Tell me, type. It's gonna give a, mm, energy to, mm, and mm, mm. Push down? No — you push down to train, right? But once it's trained, it doesn't push; pushing is the training part, okay? It assigns — yeah, there you go — it assigns low energy to the good guys and high energy otherwise, right? So now I take two of these digits and I overlap them, right? I merge them, like, I alpha-composite them. Then I send them through the autoencoder. Is the autoencoder going to be able to reconstruct this overlapped thing? You can see it reconstructs a three, seven, nine, and two. But what happens if I overlap the seven with a three or something like that? I just do that because I'm curious. So I take two of these things, I put a five and a seven together, and I try to send them through the autoencoder. You can tell it cannot reconstruct this, right? It can only reconstruct things that have been observed during training. This is super cool. Therefore, the energy of this garbage on the left-hand side will be high. Now, that's awesome, right? Now you have a technique that gives you a score; it tells you how much garbage your input is. Anyway, moving on: variational autoencoder, right? We are almost on time — no, we are not, I'm late, but okay, I will try to speed up, maybe. This is pretty much the same notebook, so maybe we can fly faster. I hope it was clear so far. So, variational autoencoder: again, I changed all the variables' names according to this year's notation. I will post this eventually on the website. I believe we need to rewrite, or provide both versions of, this generative-models lecture; so if someone wants to join the team for these things, they are welcome. All right, we import libraries, we display stuff, same stuff as before, we set the random seed. We load MNIST; in this case, we no longer do the transformation, okay? So we don't subtract 0.5 and we don't divide by 0.5.
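A quick aside on that energy-as-garbage-score idea: it can be sketched with a deliberately crude toy "autoencoder" that can only reproduce its training samples, so the reconstruction error flags anything off-manifold (everything here is illustrative):

```python
def energy(y, y_tilde):
    """Squared Euclidean distance between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(y, y_tilde))

# A toy "autoencoder" that snaps its input to the nearest training sample.
training_set = [[1.0, 0.0], [0.0, 1.0]]
def autoencoder(y):
    return min(training_set, key=lambda t: energy(y, t))

clean = [1.0, 0.0]    # a sample it has seen: reconstructed exactly, zero energy
overlay = [1.0, 1.0]  # an overlap of two "digits": garbage, high energy
print(energy(clean, autoencoder(clean)))          # → 0.0
print(energy(overlay, autoencoder(overlay)) > 0)  # → True
```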
So the images are zero to one; the input statistics are different from before. Device, again, to send to the CPU or GPU. And in this case, I train a model whose hidden representation is 20-dimensional, right? We covered the variational autoencoder last week, right? Remember? So I have two layers, right? I go from 784 to 400, and then I go from 400 to two times 20, right? So if 20 is the dimension of a single output of the encoder, I have two outputs for this encoder: I had the mu and I had the log-variance, right? Both vectors, each of them of dimension d. So the encoder provides me two times my hidden dimension, right? My decoder, instead, gets one d-dimensional vector in input and converts it back to the original size, right? And I have a sigmoid, because now we have zero-to-one output. You can mess with these things, right? Otherwise you should have used a hyperbolic tangent. All right. So, there is the reparameterization trick — okay, we skip the details for the moment, but we multiply by the standard deviation and then we add the average, the mu, right? So here, the last line is the one actually used: I take my epsilon, which is sampled from a normal distribution, I multiply it by the standard deviation, which is coming somehow from the encoder, and then I offset by mu, which is the mean, okay? And also here, I could have used the backslash notation for mu; I think maybe I should change this stuff. Anyway, this is done during training. In inference, I just return the means. What does the forward do? The following: the encoder returns me this mu and the log-var. Why do we return the log-var? Why don't I return the var? Why don't I return the — sorry — the standard deviation? Can anyone guess? Question for people at home: why do I return the means and then the logarithm of the variance?
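The trick just described fits in a few lines (a sketch with assumed names; the standard deviation is recovered from the log-variance as exp(logvar / 2), which is where the next question is heading):

```python
import math, random
random.seed(0)

def reparameterize(mu, logvar, training=True):
    """z = mu + std * eps, with std = exp(logvar / 2); at inference, return mu."""
    if not training:
        return mu
    std = math.exp(0.5 * logvar)   # logvar spans all reals; std is always positive
    eps = random.gauss(0.0, 1.0)   # epsilon sampled from N(0, 1)
    return mu + std * eps

# logvar = 0 means variance 1; at inference we just get the mean back.
print(reparameterize(2.0, 0.0, training=False))  # → 2.0
```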
Why don't I return just the variance? Or, since the variance is the squared standard deviation, why don't I just return the standard deviation? Can you guess? Just guess, right? You don't have to know. If you make a prediction, you're gonna get a better understanding, a better learning experience, right? Do we know why we do that? So, what is the range of the variance? What we have here is like the V term, right? From last week's class, we saw that the V term was coming from the relative entropy, the KL term, and we're using the variance inside that, right? So we need the variance, but the variance is only positive, right? Whereas if I compute the log-variance, I have the possibility to span the whole real range, okay? And I also have a finer grain for smaller variances. It's just a change of scale, okay? Basically. All right, so I have the mu, which is gonna be the first of these outputs, item zero, and then I have the log-variance, which is gonna be the second item, item one, right? Then Z is gonna be my reparameterization using mu and log-var. So Z is gonna be my latent variable, which is basically sampled from this normal, then rescaled by the standard deviation and displaced by the average. And then I return this stuff if in training; otherwise, I just return the mu, okay? Cool. Then I have the definition of my model. So that was pretty much it regarding the code. The learning rate and so on are the same as before. Here we have the training, right? And then, sorry, the energy terms, right? So we have the first one, which is the reconstruction error; we call it C, and it's not the squared distance, but in this case the binary cross-entropy,
since we are using binary-like data as input. And the other term we have is the KLD, the KL divergence, or the relative entropy, which was, if you remember, one half the summation of the var minus the log of the var minus one. That was the curve that did like this, right? It has its minimum at one, goes to plus infinity as you go close to zero, and goes up linearly as you go towards large values. So you have a very strong penalty for very small variances; it has a minimum at variance one. And then you had this bowl, no? This quadratic well for the means, such that all the means are drawn towards the center. Plus the reconstruction. So there is a balance between these three sources. In this case, I set the beta, the weighting coefficient, to one. So I have basically the reconstruction plus the V term for the variances and the U term for the means, right? So this would be the C plus U plus V we saw in class. All right, cool. How do we train this stuff? Well, it's actually the same code you saw just before, so I'm almost not even going to read it. So, 20 epochs; set the model to train. Why do we do that? Okay, this is actually relevant — we didn't have it before. We want to distinguish between training and validation, because during training we actually want to sample this random thing and scale it up; but if we are not in training, then we just want to return the best guess, right? So we need to say, oh, we are in training, such that you can sample the latent, right? Then, all the same: we send it to the device, compute Y tilde, mu, and log-var from the model. The energy here is gonna be the summation of the two terms we saw, the BCE and the KLD, given Y tilde, Y, mu, and log-var. I accumulate this loss to take into consideration later for plotting, but okay. I zero out the garbage.
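As a recap, the KLD term just recalled can be written down directly: per latent component, 0.5·(exp(logvar) − logvar − 1 + μ²), summed over components. This is a sketch; whether the μ² well sits inside the same one-half factor depends on the convention:

```python
import math

def kld(mu, logvar):
    """Relative entropy between N(mu, var) and N(0, 1), summed over components."""
    return sum(
        0.5 * (math.exp(lv) - lv - 1.0 + m * m)   # var - logvar - 1, plus the mu^2 well
        for m, lv in zip(mu, logvar)
    )

# Minimum: mean zero and variance one (logvar = 0) gives zero penalty.
print(kld([0.0, 0.0], [0.0, 0.0]))  # → 0.0
# A tiny variance (very negative logvar) is penalised hard, as described.
print(kld([0.0], [-10.0]) > kld([0.0], [0.0]))  # → True
```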
I compute the partial derivatives of this energy with respect to the parameters, and then I step in the opposite direction of the gradient. Cool. Testing: same stuff as above, but I set the model to evaluation, such that I turn off the source of noise when we get the latent, okay? Okay, same stuff, same stuff; the training loss is very high. We changed the loss, right? So it's definitely a different value from before; we cannot compare them. Matplotlib complains, and here we go. So these are my input data, and these are the reconstructions. This variational autoencoder didn't train for long, and actually, from the first epoch, things start looking very reasonable, I think. So, you remember, right? Why do we use a variational autoencoder? The variational autoencoder basically tries to take up all the space with these bubbles, these volumes, such that you fill up the larger bubble, the Gaussian, right? The normal bubble, right? With all these little things which are trying to fill up this larger space. Why do we do that? Such that later we can sample randomly from the big ball, right? From the Gaussian we can sample, and then hopefully we get something that looks decent. Although this is not quite the case in this notebook, I think. So, these are just the training samples, and these are generated samples, right? My Z is sampled from the normal distribution, right? Zero mean, unit variance, right? And then I generate a sample through my decoder, which was fed with the latent Z sampled from the normal distribution. And then you get something: you get a six, a seven, a nine. You also get garbage, right? So this means that there was an area that was not yet well covered. Possibly the reconstruction error was too predominant and there was not enough pull towards the zero, right? If you have more pull towards the zero, then you actually fill all the gaps, right? So maybe the U term should be higher, right?
So maybe, whenever you read this notebook, you may want to print not only the total sum of the energies, as I wrote here, but to plot all three components; and then it's gonna be like a hyperparameter search in order to figure out the best balance between all these things. All right, so what do we do here? Here I'm showing you a few samples, and then I'm gonna show you something interesting. Let's say I have a five and then I have a zero. In this case, if I perform my interpolation between these two in pixel space, you're gonna get something like that, right? It looks like a five or an eight, no? Because it's an overlay between a five and a zero, no? And this is how an overlay looks. But what happens if, instead of doing this, I actually do the interpolation of the code, the latent, the hidden representation, the mu generated by the encoder? If I instead interpolate the mus and then decode the interpolated mu, you get the following: you get a five that becomes a zero by closing these gaps. So what we have done here is a linear interpolation in the latent space, projected back down into the input space. You can tell how my linear interpolation morphed this five into a zero, whereas before, when I was just blending the inputs, you were getting this overlay crap. Then, what happens if I send the overlay crap into the network? Well, the network is gonna tell me, oh, this is an eight, and it's gonna actually bump up all those regions that were basically incorrect values. So you get this stuff over here. Cool. Okay, so there's more stuff. So, we said this stuff should take up the volume, right? Towards zero. I think this is clear from the last lab. Here I just plot some embeddings. This is before training, where everything is just scattered around.
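The two kinds of interpolation just contrasted share one helper; only where you apply it differs — on pixels, or on latent codes before decoding. A sketch with made-up values:

```python
def lerp(a, b, t):
    """Linear interpolation between two equal-length vectors, t in [0, 1]."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

# Pixel-space interpolation: blending images gives a transparent overlay.
five_img, zero_img = [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]   # fake tiny images
print(lerp(five_img, zero_img, 0.5))  # → [0.5, 0.5, 1.0]

# Latent-space interpolation: blend the mus, THEN decode —
# decoder(lerp(mu_five, mu_zero, t)) is what morphs one digit into the other.
mu_five, mu_zero = [2.0, -1.0], [-2.0, 1.0]             # fake latent codes
print(lerp(mu_five, mu_zero, 0.5))  # → [0.0, 0.0]
```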
And then, as you can tell, as we move through the epochs, things get more and more clustered together. Although on the right-hand side things seem to be more packed, right? On the left maybe it's more separated; on the right, less. You have to take into account that this is a 20-dimensional space and this is just a 2D projection. So this red cluster, although it looks exactly on top of the green and the yellow and the brown, is actually off along a third dimension. So all these clouds of points are basically taking up the space. Why are there black regions between these clusters? Question for people at home. So, how do you debug this stuff, right? How do you debug math? How do you debug deep learning? You have to plot everything. Why do we plot? Because then you can see what's going on, and if you can see what's going on, then you can patch and fix, right? Math is not debuggable; you don't get an error. Things will just give you clues about what might be happening, and you get clues by plotting the heck out of it, okay? Just because I don't want to use bad words. So, question — you have to answer my questions, then we can move on to the generative adversarial networks. Why is there a black region around these clouds of points? Why is the orange surrounded by a black region? Why the purple here? Why the green? Answer me, in the chat, someone, guess, please. How many people are you? I don't know; you're 54 people or something like that. There is a black background. Why are these classes not touching each other — that's my question, right? Because otherwise there would be ambiguous regions, and therefore the loss would be higher, right? So here, the reconstruction term basically forces things to be separated, such that the model can correctly reconstruct the different digits. What happened — I didn't tell you.
I wrote, on this bar on the right-hand side, the actual digit, which would be the x in the conditional case, right, the condition that tells you what it is, but here it is not used as a label. Things have been grouped by class without ever having used the class. So with an unsupervised learning technique, like the variational autoencoder, I managed to cluster my data per class while not having actually used the labels, right? I have not used the labels, and someone is, I don't know, showing a camera going around. I don't know. So here, again, we managed to perform clustering into the classes of our dataset without actually having this information; I just plot it. All right, in the last 15 minutes: generative adversarial networks. So, click. This one is the diagram we saw last week, right? So we have a denoising autoencoder. We started from Y, we pulled this Y away and got a Y hat, we encode and decode this Y hat into a Y tilde, and we try to get Y tilde back to exactly the location where we started from. What are we talking about today? Today we're talking about generative adversarial networks. So why am I talking about the denoising autoencoder? Because we have the same modules. So you're gonna learn, maybe in this episode, given that I made these new diagrams for you, that everything is just the same thing, just reorganized blocks, okay, more or less. So we have a sampler and we have a Y. What's missing? The arrow. Okay, that's the difference. So there is no connection between the data point and the sampler. All right, first difference. Boom, we sample this Z, it goes through a decoder, and we get a Y hat. Okay, fantastic. So sampler plus decoder gives me a Y hat. So on the right-hand side we have the sampler and the Y hat; on the left-hand side we just have Y on its own. Separately, we have a sampler which, through a decoder, gives me Y hat. Now what? The decoder, it's not called a decoder, because there is no encoder, right?
So if there is no encoder, we cannot decode what an encoder has encoded. So instead of calling it a decoder, we call it a generator, and it's exactly the same thing. Just a change of words, okay? Change of variables, change of words. All right, cool. Finally, we provide Y hat or Y to the cost function C separately, not together, one or the other. So there's no connection between this Y and this Y hat; they are just two types of input, right? So as you can tell, since there is a Y and there is a Y hat, what is a generative adversarial network? A generative adversarial network is a contrastive energy-based model, right? The sampler on the right-hand side was sampling in the input space. On the left-hand side, the sampler is sampling in the hidden space, or the latent space, okay? So what other architecture did we encounter which samples in the latent space? The variational autoencoder, no? Its sampler samples a variable in the latent space, and that sampler is conditioned, right, by mu and v, no? Which are provided by something, but we don't care. In this case instead, as I already told you, in the generative adversarial network we just have a sampler which provides this Z; there is no conditioning. Then there is this decoder, but again, it's not a decoder, it's called a generator. So what does the generator do? The generator maps this Z space, the latent space, back into our home space, the input space, right? So it maps this latent variable Z into this Y hat, this contrastive sample. On the other side, we have Y, the observation. The observation and the Y hat go in, not together, but through a switch, no? What is this cost? Well, the cost simply maps my Y hat, or Y v Y hat, where the 'v' means versus, in Latin, right, to this C, this cost, this scalar, right? That's it. So how does it work? Does it work? Yes, well, kind of. So how does it actually work, right? So let me try to give you one possible implementation of this.
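The sampler-plus-generator piece just described can be sketched like this; the layer sizes and the two-layer MLP are assumptions for illustration, not an architecture from the lecture.

```python
import torch
from torch import nn

torch.manual_seed(0)
d, n = 100, 28 * 28          # latent and output sizes (assumed for illustration)

# The "generator": the same kind of module as a decoder, but since nothing was
# ever encoded, we call it a generator instead.
generator = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, n), nn.Tanh())

z = torch.randn(8, d)        # the sampler: draw z with no conditioning on the data
y_hat = generator(z)         # contrastive samples y_hat, living in the same space as y
print(y_hat.shape)           # torch.Size([8, 784])
```

The `Tanh` at the end matches the data normalisation from the start of the lecture, where images were scaled to the range minus one to plus one.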
So training, how do we train this thing? The first equation is gonna be my loss for my C box. In this case, the C box is a neural network. So far, my Cs have been simply a quadratic distance or something like that. Right now, my C is a neural net. My neural net, as you can tell here, is gonna be pushing down the energy of the blue samples if I minimize this loss, right? So we push down the energy of the blue guys here, and then we're gonna be pushing up the energy of the Y hats, up to M, no? If you push harder than that, it stops, there is no more gradient, right? So if you are above M, the term here becomes negative, the positive part of a negative number is zero, so you get no more gradient. So what is this Y hat? We just said that Y hat was this decoded Z, a generated sample coming from Z, right? So here I write generation, generated sample coming from Z. And so what does this mean? We basically start from this diagram, and we said we start from the right-hand side. We take a Z, we send it through the generator, and we produce Y hat. And then we enforce that Y hat has a high cost, right? So let's say I have my Z, I have this Y hat here, and I push up the energy of this Y hat up to M. So this is the water level. I took my Y hat, the red guy, and I pushed it up here, to height M. Then what do I do? Then I take this other point, from the manifold, which was here, and I push it down, basically to zero if that's the water level, right, if this stuff is bounded below by zero. So again, repeating: I have some red points here, generated by my generative network, right? And I push them up to height M; the margin was M. These points here that are on the manifold, instead, I push down to whatever, no? And if this is bounded by the water level, they're gonna go down to the water level. So we started like this, and we do this by training the cost network.
So the cost network pushed the energy of this one high and pushed the energy of that one down, okay? The Y hats high, to M; the blue Ys down, to zero. How do we train the other network now? Well, it's quite easy actually, so we can finish this up quickly. The other network, the generator, simply tries to minimize the cost, okay? What does that mean? Let's say my generator produced these Ys, these ones over here, and these are my Ys from the dataset. These have been pushed up here, and these have been pushed down here. Now my generator will try to go down, right? So there is a slope, a gradient, over here, right? The cost function, the C network, created a slope, no? If the energy were flat, this stuff wouldn't know in which direction to go. But since I'm training the C to have a sloped surface, the energy is sloped. These have high energy, these have low energy. Well, the generator can simply follow the slope and go down, right? Well, the opposite direction of the slope. So we follow the negative gradient of the cost function and we come down, okay? And this is how these energy-based GANs work. For example, one possible implementation: if you read this paper from my friend Jake, whom I called just six hours ago in China, and he said, oh, why are you awake? Oh, because I'm making new lessons, and maybe pranks, for my students. But this is a possible option for the cost, right? My cost can be an autoencoder, as we have seen so far, which tells me how far you are from the manifold, okay? So if I draw the whole picture and go a little bit further: here we have our blue samples, right? And here we have some red samples, and here some more red samples. We said the cost network would push these down to zero, the water level, right? So push these down here, push these red guys over there, boom, up to M. Now you can see, what is the shape of this cost?
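The two updates just described, training C with the margin loss and training G by following the slope of C downward, can be sketched as follows. Everything here is illustrative: both networks are tiny MLPs, the "real" data is random, and the margin M is an arbitrary choice.

```python
import torch
from torch import nn

torch.manual_seed(0)
d, n, m = 16, 2, 1.0                      # latent size, data size, margin M (assumed)

G = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, n))   # generator
C = nn.Sequential(nn.Linear(n, 32), nn.ReLU(), nn.Linear(32, 1))   # cost network

opt_C = torch.optim.Adam(C.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)

y = torch.randn(64, n)                    # stand-in for real samples on the manifold

# 1) Train C: push C(y) down, push C(y_hat) up, but only until it reaches M;
#    above M the positive part is zero and the gradient vanishes.
y_hat = G(torch.randn(64, d)).detach()    # detach: this step trains C only
loss_C = C(y).mean() + torch.relu(m - C(y_hat)).mean()
opt_C.zero_grad(); loss_C.backward(); opt_C.step()

# 2) Train G: simply follow the slope of C downward (minimize the cost)
y_hat = G(torch.randn(64, d))
loss_G = C(y_hat).mean()
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```

Alternating these two steps is the down-and-up dance described in the lecture: C keeps re-sloping the energy surface, and G keeps sliding down it toward the manifold.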
The cost here is gonna be some quadratic well, right? A quadratic well, which is basically the squared distance, right, from the output of the autoencoder, which is possibly a point on the manifold, to your original point. So if you have your Y hat, you put it inside the network, the network returns something on the manifold, and then you square the distance. So you're gonna have the squared distance, which is gonna be, whoop, up here. Same for this guy over here: you put this value inside the autoencoder, it returns you something here, you measure the distance, you square it, and it's gonna be up here. So this is the energy level here, right? The vertical axis tells you the energy level, and on the x axis here you have the distance between your Y samples, the blue ones, which are here, and these other Y hats. So you have this quadratic well. For example, you have your samples here, and then when training the generator, the generator just tries to go down this well. But then the cost network is gonna push up, then you go down, and up, and down, and basically, boom, you arrive at the destination. So it's like these blue Ys have some gravity, right? And these red Ys are basically attracted, right? They were here, and while they are here, they have a very high energy level, and these are the blue ones, and they go, zoom, right? And the zoom comes from the fact that there is this energy, the C, which is telling you how to minimize the energy, right? So you minimize the energy by going closer to the manifold. By doing this, you're gonna have two things, right? You're gonna have a generator that gives you samples that are exactly here. And on the other side, we were training this cost network, which gives high energy to things that are further away and small energy for the points that are nearby, okay? Finally, I can also give you a final, funny way of thinking about these networks, okay?
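The autoencoder-as-cost idea just described can be written as a minimal sketch, where the energy is the squared reconstruction distance from the autoencoder's output. The sizes and the one-hidden-layer architecture are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
n, h = 784, 64                         # input and hidden sizes (assumed)

# The cost "network" is an autoencoder; the cost is the squared reconstruction
# distance, i.e. how far a point sits from the learned manifold.
autoencoder = nn.Sequential(nn.Linear(n, h), nn.ReLU(), nn.Linear(h, n))

def cost(y):
    y_tilde = autoencoder(y)                   # projection toward the manifold
    return ((y_tilde - y) ** 2).sum(dim=1)     # quadratic well, one scalar per sample

y = torch.rand(5, n)
print(cost(y).shape)                           # torch.Size([5])
```

Points near the manifold reconstruct well and get low energy; points far from it reconstruct poorly and get the high energy the lecture draws at height M.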
So, a funny way. Sometimes people don't use a cost, but they use something else, unfortunately, like a discriminator. But the point is that you cannot really discriminate between this one and the other, right? Because then, where do you draw the line? A cost simply tells you that this stuff should have a higher cost than the things that are closer, nearby, right? A discriminator would be like, oh, these are one class, those are the other class. But okay. Still, there is this other interpretation. So we are in Sicily and we are making fake money, because why not? So we are making money, you know, in the south of Italy, and we go to Germany to buy something, okay? All right, so we go to Germany, and the German people, or maybe the Swiss, let's say the Swiss, are very careful to check whether this money is fake, right? We don't take money from Italy, right? Okay, okay. So we go back to Sicily, and then, why not, where's Sicily? Okay, somewhere down there, I don't know. Anyway, so we go back to Italy and we are like, we need to do better. How do we do better? Well, we call the spy we have in Switzerland, right? The spy is gonna tell me, oh, you have to fix that part of the euro, no, in order to get it through security, right? So we try to fix that, right? We are the generator, in Italy, making better money, no? So, okay, we are gonna try to fool the Swiss people again. We go back up, and the Swiss people are like, huh, this looks better, but still fake, no? So back to Italy, we call the spy, and the spy tells us, oh, you have to fix this, this and the other, right? And you do this iteratively, right? And so eventually we get very good at making money in Italy, and the others are like, okay, this looks basically real, I trust you, right? And they take your money. Easy money.
Anyway, the point is that the Italian people are the generator, whereas the Swiss people are the discriminator, or the cost function, assessing how bad or how good that money is. Now, what is the spy in this case, no? If you follow the analogy, what is the spy in this context? How does the generator know that the money was wrong, no? Now, this is actually a Russian accent, I don't know, I don't know how to speak Italian. Oh, like mamma mia, that is better. All right, so I like the accent, fantastic. Okay, so what is the spy? How do I know that the money was wrong, right? The cost function was the Swiss people, right? So the cost function is the judgment, but the judgment doesn't tell us how to fix the money. How can we fix the money? Can you tell me, please? The gradients! The gradients, congratulations, bravissimo. The gradients are gonna give me the direction in which I need to change the money, well, the opposite direction, right? That is, again, I'm speaking Russian. I don't know, I'm making a mess with accents today. Okay, maybe I should sleep, actually, right? That would be a good idea. Summarizing, so I let you go: the gradient, well, the opposite direction of the gradient, is gonna be the call to the spy, which tells us in which direction to change the money in order to be able to fool our Swiss friends. The Italian people are, of course, the makers; we are the makers, that's why we are the generator. And the Swiss people are the precise people who are checking what we are doing, okay? And so with that, we're gonna say mamma mia. Thank you, grazie. And I'll see you next time, okay? I'll see you next Thursday for a new lesson. Hopefully I will sleep a bit before next time. I hope you enjoyed the class. We are right on time, right? 10:30, is it? I don't have a clock anymore. Yes, I'm on time, see? I didn't even screw up.
All right, code. How do you read the code? You can decide to drop off now and leave the class; I'm gonna show, for the ones that stay around, and you can have the recording anyway, where to find some code to read. The point is that you are supposed to read code to learn how to code. I didn't know this; someone told me. So if you go on GitHub, and then you go on pytorch/examples, you're gonna see a lot of nice stuff. You can read the DCGAN, okay? There is just one file, the main, and here there are many instructions, but there are a few we are interested in. We have a generator, right, which is generating images. In this case, it's convolutional, and it's called netG, the generating net, right? And there is a discriminator, and it's called netD. I have a criterion, which is gonna be this binary cross-entropy, because they use a discriminator. And we have two optimizers, because we're gonna be optimizing the two things separately, right? In one case, we try to push up the energy for the bad ones and push down the energy for the good ones; in the other case, we just follow the gradient, right? And the training code is over here, which goes from line 218 down to line 273. I can just read it through, and if you want to listen later to the recording, you can; you don't have to. It's just here such that you can read this along with me, okay? So first of all, we clear the gradients of the discriminator. Then we're gonna have our labels, which are going to be just a tensor full of real labels, okay? So I zeroed the discriminator's gradients and I have the correct labels here. Then I produce the output from the discriminator. Here I have the criterion, which takes the output of the discriminator and the correct, real labels, right? All right, so we send the real data, the real images, with the real labels, and then we back-propagate, so we compute the partial derivatives.
Then here I generate noise; here I generate those Zs. I generate fake data. How? By sending this noise through the generator, as we have seen before. Now I fill my labels with the fake label. Well, it is not "fake"; it's the label for the generated data, right? And I have the output of the discriminator, to which I provide the fake data, right? And now I have, again, the criterion on the output for the generated images, with the label saying it is generated, and I do a backward, right? So first I did a backward when sending true data, authentic data, through the discriminator, and the labels were saying: true data, authentic data. And then I computed the backward, so I computed the partial derivatives. And then I did just the other part: here I created the generated data, the generated sample, the Y hat. So I create Y hat, I say the labels are "Y hat", and then I compute the output, I compute the error, and then I compute the partial derivatives, right? Cool. So now these partial derivatives let me train this cost network. Finally, I step with the optimizer. This was for training the cost network. On the other side, I want to train the generator. How do we train the generator? We just follow the cost, right? But in this case, you have to do something a little bit different. Here I fill the labels saying, oh, these are the true data, okay? But then I actually input to the discriminator the fake data. So here I have Y hat, but I actually say these are Ys. And so my error, in this case, is gonna be the distance between my actual outputs and the wrong classes. And in this case, I compute the backward, the partial derivatives, for my generator, right? So in this case, I'm tricking the discriminator. Before, I trained the discriminator: we generated the Y hats, the red ones, and had the blue ones, the good ones, right?
So I trained with the contrastive samples and the non-contrastive ones; we train the cost network with this contrastive loss. And now instead we are getting the gradients for when I do the opposite, when I provide the red ones, the Y hats, to the cost network. In this case, again, it's a discriminator, so I also have to provide the wrong labels. So now I can see in which direction I should move my samples in order to get more "wrong" labels out of the network, okay? And then here I just step the optimizer for the generator. Again, this is, I guess, less intuitive, right? It's like one network trying to fool the other, whereas the interpretation with the energy-based model seems much more linear: you just push up the energy here, keep low energy here, and the other one goes, ah, down the slope, okay? So that was the final bit, okay? And we just read how many lines? It starts here and goes down to here, right? From line 220 to line 255, so it's like 35 lines of code to train these two networks using this adversarial framework, using this kind of discriminative loss, which is, again, wrong. The cost version would simply have been easier, right? Just minimize the output of the cost network, right? All right, so that was it, okay? We went a little over, but again, this is not going to be helpful, at least for the competition, where you have to actually train on unlabelled data. We discourage you from using generative adversarial networks because they don't seem to learn very good representations, at least if you don't have enough time to play with them, okay? So given the time constraints, generative adversarial networks are not the best option for this project, okay? I'm just telling you in advance. We had a few teams last year, or two years ago, that tried to use them, and they spent all their time trying to get them to work, because the main difference is that whenever you train a normal network, you just minimize a cost.
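The training steps we just read from pytorch/examples can be condensed into a toy, runnable sketch. The tiny MLPs and the random "real" batch are stand-ins for illustration, not the DCGAN convolutional nets from the repository, but the three updates follow the same pattern: real data with real labels, fake data with fake labels, then fake data with real labels to train the generator.

```python
import torch
from torch import nn

torch.manual_seed(0)
nz, n = 16, 32                                    # latent and data sizes (toy)

netG = nn.Sequential(nn.Linear(nz, 64), nn.ReLU(), nn.Linear(64, n))
netD = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
criterion = nn.BCELoss()                          # binary cross-entropy, as in DCGAN
optD = torch.optim.Adam(netD.parameters(), lr=2e-4)
optG = torch.optim.Adam(netG.parameters(), lr=2e-4)

real = torch.randn(8, n)                          # stand-in for a real batch
ones, zeros = torch.ones(8, 1), torch.zeros(8, 1)

# --- Train the discriminator ---
optD.zero_grad()
errD_real = criterion(netD(real), ones)           # real data, real labels
errD_real.backward()
noise = torch.randn(8, nz)                        # the Zs
fake = netG(noise)                                # the Y hats
errD_fake = criterion(netD(fake.detach()), zeros) # fake data, fake labels
errD_fake.backward()
optD.step()

# --- Train the generator: fake data, but labelled as real ---
optG.zero_grad()
errG = criterion(netD(fake), ones)                # trick the discriminator
errG.backward()                                   # gradients flow into netG only
optG.step()
```

The `detach()` in the discriminator step is what keeps the two optimizations separate: it stops the discriminator's loss from back-propagating into the generator.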
In this case, there is no minimization of a cost. There is a balance between the two, so it's an equilibrium, right? It's called a Nash equilibrium between these two different players in a game, a min-max game. So this is, for now, not a stable procedure and technique, so we wouldn't recommend that you use it. Done, finished. Thank you. See you next week. Happy Easter. I don't know if you're cooking; I'm gonna be cooking like crazy, many things. Maybe I'll post something later. And again, Susan, thank you for being with us today. Bye. See you next time. Yup.