Today we're going to be talking about generative adversarial networks, and how to actually build them properly. All right, so: generative adversarial networks, unsupervised learning, generative models. Generative models, again, are models that produce something that lives in the input space. Most of the time, what happens in this field is that we assume there is a probability distribution over these samples, but that doesn't have to be the case. For example, the decoder of a classic autoencoder can be thought of as a generative model, in my opinion, and also in Yann's. Many will disagree, and say a generative model has to have an input which follows a specific distribution. We are in the realm of unsupervised learning, where we don't have labels. So let's get started with generative adversarial networks. What is this stuff? You should know this one, right? This is the variational autoencoder. The variational autoencoder is basically like a normal autoencoder, except that the encoder, in this case, provides us the parameters of a distribution from which we sample our latent variable z. So the only difference from the normal one is, again, the sampler, which is going to pick a random sample. So instead of having one single code, which is one point (one input here, one code here), you now have some volume, and each point within this volume will be mapped back to the original point. That's a very important property of the variational autoencoder. So let's see what this generative adversarial network looks like. We have this diagram, which is, huh, actually the same, right? So what's going on here? We have the same generator and the same sampler. And then what else do we have? Okay, we have another input there. The input, which before was on the left-hand side at the bottom, is now halfway through, and the output is also halfway through. Finally, we get that kind of switch.
And then on top of that switch, we're going to have a cost network. In the classical formulation of a GAN, we have a discriminator there instead. A discriminator is just the wrong choice, at least following Yann's suggestion, which I agree with; we'll see why in a bit. Right now, let's focus on the fact that we have this cost network. So we have basically similar models, right? There's a sampler on the right-hand side and a sampler on the left-hand side. We have a decoder on the left-hand side, which is generating something: since z is considered a code there, we have a decoding step. Whereas on the right-hand side, since z is not a code but simply an input, we have a generator. And that z is simply sampled from, for example, a Gaussian, a normal distribution. Then that x hat will be generated by this initially untrained network. The cost network, instead, has to figure things out: it has to output a high cost if we feed it that x hat, the blue one, because we want to say, oh, this is a bad sample. Or instead, if the switch selects the pink one, we should have a low cost, because that tells us we actually have a true sample, a good sample. So, summarizing the sequence of operations: the generator maps my latent input z into R^n, which is the input space. So we have the latent input, the orange z, mapped to x hat in blue. The top module, instead, is a cost network, which maps its input, which can be the pink x or the blue x hat, to my cost. So this cost module really outputs a cost. In Yann's diagrams it's going to be a square, which outputs a scalar. This scalar will be a high value, a large positive number, if the input is a fake input.
And it should be a low number, close to zero, if the input comes from the pink side, the real side. And then how do we train this system? This system will be trained with different gradients. The cost network will be trained to output a low cost for inputs that are pink and a high cost for inputs that are blue. For example, if you had a discriminator here, you could think of this as a two-class classification problem: you try to output a zero for the pink x and a one for the blue x hat. We'll talk about why that zero-one output is bad in a second. Otherwise, we just want this network to learn this cost. So let's figure out how this works in the diagram. Do you remember how we started with the variational autoencoder? With the variational autoencoder, we started from the left-hand side: we picked an input, we moved to the latent space, we moved that point around because we add some noise, and then we got back to the original point. We tried to pull those points close together using the reconstruction loss, and then we imposed some structure on the latent space using that relative-entropy term. Instead, for the GAN, the generative adversarial network, we start from the right-hand side. We pick a sample, a random number, let's say 42. We feed that through the generator, and we get that blue x hat over there. Then we train another network to come up with a high value for that blue sample. Then we pick a pink x, in this case on the bottom right of the spiral, which is now forced to have a low cost. So this is pretty much the initial big picture of how this system works. Let me try to give you a couple more interpretations.
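To make that big picture concrete, here is a minimal PyTorch sketch of the alternating training loop just described. Everything in it is a hypothetical stand-in: toy 2-D data, tiny networks, and a simple "push real cost down, push fake cost up" objective for the cost network, not the exact losses the lecture uses later.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical toy setup: 1-D latent z, 2-D "inputs".
generator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 2))
cost_net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt_c = torch.optim.SGD(cost_net.parameters(), lr=1e-2)
opt_g = torch.optim.SGD(generator.parameters(), lr=1e-2)

x = torch.randn(8, 2) + 3.0    # stand-in for real (pink) samples
z = torch.randn(8, 1)          # latent input, e.g. sampled from a Gaussian

# Step 1: train the cost network -- low cost on pink x, high cost on blue x_hat.
x_hat = generator(z).detach()  # detach: don't touch the generator in this step
loss_c = cost_net(x).mean() - cost_net(x_hat).mean()
opt_c.zero_grad(); loss_c.backward(); opt_c.step()

# Step 2: train the generator -- minimize the cost of its own samples,
# i.e. try to fool the cost network.
x_hat = generator(z)
loss_g = cost_net(x_hat).mean()
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

In a real run the two steps alternate for many iterations; the `detach()` in step 1 is what keeps the generator frozen while the cost network updates.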
That was the formal side: the mathematical definition and then the visual one. Now I'm going to try to give you a few interpretations, which I quite like, and which are going to make me sound like a fool. But I am a fool, so I'll just go for it. You can think of the generator as being Italian, and therefore I will be using a proper Italian accent. So I'm a proper Italian now, in the south of Italy, and I'm going to try to make some fake money, because we are usually very good at that. So we make fake money, and then we go to Germany to purchase something. We go to Germany with this fake money, and these German people look at us and go, oh, fucking Italians, this is fake money. And so we can't really manage to buy anything. But since we are Italian, we have spies. We have spies in the... okay, there are questions, hold on. Maybe I'm offending people now. Chat, what's going on? Oh, okay, you're enjoying the thing, cool. Okay, so I was not offending anyone, fantastic. So we have a spy back in Germany, and the spy is calling back home: hey, mamma mia, you gave us the wrong money, it was so messed up, it was just not right. Okay, okay, so yeah, chill, chill down, right? We are back home again. What movie is this? It's just my own movie. So we are back in Italy. We are able to make such nice art and everything, so we must be able to make better money, right? So we now try to fix the things our spy told us about. We make better money, we go back to Germany and try to buy other things. And the Germans are like, huh, it's better? No, it's fake. Oh, okay. Then again, the spy calls back down to Italy: oh, what are you doing? And we're like, ah, I understand, eh, capisci. And we fix the money. We make several iterations of that, trying to make better and better versions of the money. Finally, we go back to Germany.
Why Germany? Because they have money, right? They have things we can buy. So we go back there, and they're like, huh, it looks very good. No, I don't know how to do a German accent, I'm sorry. And so they accept the money, right? Okay, and this is pretty much how these generative adversarial networks work. We have a generator, which is the Italian dudes in the south making fake money, and we're trying to purchase something in Germany. Germany is the discriminator, and they are very strict, very, you know, German. Okay, politically correct I am not, so whatever. But then we do have a spy, right? And what is this spy? Can anyone figure out what the spy is in this analogy? We haven't mentioned it so far. The loss function? Backprop? The discriminator? Okay, some feedback. Okay, it's feedback. And where does the feedback come from? Whenever we train the discriminator, or the cost network, we have some gradients. Those gradients allow me to do two things. I can lower the final value, and so I can tune the parameters of the cost network. Let me go back to the cost network. We have the gradients of the final loss with respect to the parameters of the network. Usually, when I train the cost network, I tune the parameters so that I end up with a lower final loss. Note that this is a cost network, and there is a loss on top of the cost network; it's a bit confusing. So we're going to optimize the parameters of the cost network so that it performs well, and therefore has a very low loss. In the same way, we can use those same gradients, computed with respect to the input of this network. You see my mouse: I have my final loss on top here, we come down with the gradients, and then you have here some gradients at the input.
And these gradients tell you how the final loss will change if you change this x hat, right? Therefore, you can now train the generator with this gradient in order to increase this final loss, okay? So when we train the cost network, we'd like to minimize the final loss, given these two different inputs. But we'd also like to increase this final loss: we'd like to make the cost network perform worse by improving the generator, okay? And so this information that comes down here, and down this way, which is the backward pass, the input gradient, will be used for tuning the parameters of the generator such that it manages to fool the cost network, okay? And this is the analogy with the spy in Germany, okay? Is the distribution of z fixed? So yes, z comes from, let's say, a normal distribution. I actually don't have anything in particular to say about this distribution. Whatever distribution you pick, the generator will map that distribution into some x hat distribution, which will hopefully match the pink distribution of the x's, okay? So even though the distribution of z is fixed, we can still change the generator in such a way that we minimize the cost function. Right, so although the z distribution is fixed, the generator will, how do you say, ply (P-L-Y, I think) this distribution, reshaping it into something that looks like the pink x's. Hopefully, okay? I haven't told you about the pitfalls of this system yet, but hopefully we manage to get a distribution over those blue x hats such that they resemble the original pink distribution on the left-hand side, okay? Did I answer your question? Yeah, that makes sense. Okay. Wouldn't the x produced by the generator be the new improved money? The blue one, okay, yeah, thank you.
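The "spy" can be shown in a few lines of PyTorch. This is a minimal sketch with hypothetical toy networks: the same backward pass that produces parameter gradients also produces the gradient of the cost with respect to x hat, and the chain rule carries that same signal into the generator's weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
generator = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 2))
cost_net = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))

z = torch.randn(4, 1)
x_hat = generator(z)          # fake samples
c = cost_net(x_hat).mean()    # scalar cost

# The "spy": dc/dx_hat, the gradient of the cost w.r.t. the generated samples.
(grad_x_hat,) = torch.autograd.grad(c, x_hat, retain_graph=True)

# The same signal, chained through the generator, reaches its parameters.
c.backward()
grad_w = generator[0].weight.grad
```

`grad_x_hat` tells each generated sample which way to move to change the cost; `grad_w` is that information pushed one step further back, into the generator's weights.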
I actually didn't finish that one. So the pink x's are the true euros we use in Europe, and the blue x hats are the money we make in Italy, okay? Oh, mamma mia. Okay, other questions. I thought the generator was supposed to give negative samples? So, negative samples: okay, there are two steps here. We provide negative samples, these x hats, to the cost network. The cost network is trained to output low values on the pink inputs and higher values on the blue inputs, okay? And so if the cost network performs well, then the final loss here on top will be very low, okay? Nevertheless, the generator will be trained to increase that loss, because we'd like to fool these Germans. Does it make sense? Can you just clarify what the spy is in this analogy? Yeah, the spy is the input gradient. So whenever I have my cost network here, to train it I'm going to have a final layer here on top, right? Let's say this is an MSE, for example: an MSE against zero whenever the input is a pink x, and against some value, let's say plus 10 (an arbitrary number for the moment), for the blue guys, right? So my cost network is a regression network. You can think of it as just one single linear layer, an affine transformation of the input, producing a final value. I set the target to zero for the pink input: I take an MSE between the output of the network and zero whenever I feed in the pink input. And instead, let's say I choose an arbitrary value of 10 as the target reflecting that the input is the blue one, right? So we have a cost network, a network that outputs a single scalar value, and this scalar value will go inside the MSE module here on top. Let me write it out, so we can all see what's going on. I have here my MSE; this is my loss function, right?
So don't get confused between loss and cost; they are two different things. I have my MSE here. If I feed in this guy here, my target is going to be zero; that's my y for this one, okay? And instead, if I feed this guy here to the cost network, I expect to get, let's say, an arbitrary plus 10. So my MSE in one case is the mean squared error between the output of the cost network and zero; in the other case, it's the MSE between the output of the network and 10. So, let's forget about all the rest for a moment and assume the generator is not improving. We have several pink samples and several blue samples. You train a network such that if you feed in a pink input, you get a zero at the output, and if you feed in a blue one, you force the network to learn the number 10, okay? You do some gradient-descent steps in parameter space such that in one case you get zero and in the other case you get 10, over the several samples you provide, right? Now that we have this cost network, you can think of the cost network itself as being the loss for the generator, okay? And so if my generator outputs something and this cost network says, oh, this has a very high cost, then by trying to minimize this cost, you will try to generate something that makes the cost network give you a low value, okay? Is it making sense? Could you just quickly clarify the difference between cost and loss? The loss is what we use in order to train something, okay? My loss, in this case, is the MSE loss. In order to train my cost network, I have a loss function, which is the MSE loss. By minimizing the MSE loss, I train the cost network.
Now the tricky part comes: for my generator, the loss function I want to minimize is the cost network itself. So for the generator, the loss is the cost, and I try to minimize that output, okay? This also relates to what Yann teaches with energy-based models: you have energies, and you try to get low energies through the minimization of a loss function. The loss function is what you use in order to train the parameters of a network, okay? That's the difference. Another additional point: a cost is an evaluation of some network's performance. If my generator outputs a bad x, one that doesn't look good, then you get a high cost, okay? It's like a high energy. But in order to minimize this energy, usually you have to minimize these losses, okay? So, again, the definition we like to use is that the loss is what you minimize in order to train the parameters of a network. A cost, instead, can be thought of as: I take an action, and then there is a cost for taking that specific action, okay? You take an action, like writing an email about changing things, and the cost is everyone being pissed at you. Makes sense, right? You always learn something new. Okay, other questions so far? Sorry, Alf, I'm still confused about the cost and the generator. For the generator that generates the blue x, we want to increase the cost. But you just mentioned that the cost is like the loss function for the generator, and we want to minimize the loss. So do we want to increase or decrease the cost for the generator? For the generator, you want to minimize the cost. We train the generator through minimization of the cost network's output value, okay? So there are two parts to this thing. Let me change color. The first part is going to be the training of this guy here.
And the training of the cost network is done through the minimization of the MSE on top here. This is the loss for the cost network. So the MSE is taken against zero whenever I feed in a pink input and, for the sake of this example, against 10 whenever I feed in a blue sample, okay? Now we perform several steps of gradient descent in the parameter space of the cost network such that we minimize this loss, okay? We end up with a network that outputs zero if I feed in a pink input and 10 if I feed in a blue input. So far, are you all with me? Yeah; so the cost network will generate a high value for the blue x, right? Yeah, that's what we train this cost network to do, okay? This cost network has to generate some large value, in this case 10, if I feed in a blue guy, and a small output, zero, if I feed in a pink input. And we do that by minimization of an MSE loss, okay? This is the first part. So far, you're with me, right? Yeah. Okay, fantastic. Now we have the second part, which is the cute version, the version that Yann likes, the different version that you don't find online, which is the following. This cost network will now give you values close to zero whenever you feed in something that looks proper, okay? Otherwise it will give a high output, let's say a value around 10, if you feed in a crappy input. So now, finally, how do we train this generator? Well, the generator is trained through the minimization of the cost network's output, right? The cost network says 10 here; this blue output is a bad guy, right? So if the generator now shifts this x slightly to make something that looks like this guy over here, then the cost goes from 10 down to zero, right? And therefore you've managed to minimize the cost network's output value.
And so we are using the cost network as the loss for training the generator. Okay, what do you mean by getting the blue x closer to the pink x? Right, so right now my generator outputs this blue x hat, okay? And this is some image that looks bad, or money that really looks fake. Now, how do you make better money? Well, the cost network gives you a scalar value for each output your generator makes. Therefore, you can compute the gradient of that cost value: the partial derivative of this lowercase c with respect to x hat, dc/dx_hat (sorry for the awful writing). So now I have a gradient. This gradient lets me move around, and I can figure out whether the cost is going to increase or decrease, right? This is maybe a little non-standard; Yann was also talking about this yesterday. You have some inputs to your network, and you can decide to do gradient descent in the input space. For example, there are architectures which don't have a generator at all: you start with a sample here, and then you perform gradient descent in this sample space. You move these samples such that you get a lower and lower value from the cost network. In this way, you can get an input that ends up resembling a good input, right? The pink one. Did I explain myself? Or is it still weird? Oh, it's much clearer, thank you. You sure? Yeah, yeah. It's like taking gradients in the input space and moving the sample to decrease the cost. So the input actually gets better: better money, or a better image. Right, right, right.
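Gradient descent in the input space can be sketched directly in PyTorch. Here the "trained" cost network is a hypothetical frozen random network, just to show the mechanics: the optimizer updates the sample x itself, not any weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Pretend this frozen network is an already-trained cost network.
cost_net = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 1))
for p in cost_net.parameters():
    p.requires_grad_(False)

x = torch.randn(1, 2, requires_grad=True)   # start from a random sample
opt = torch.optim.SGD([x], lr=0.1)          # optimize the input, not weights

c0 = cost_net(x).item()
for _ in range(50):
    c = cost_net(x).sum()
    opt.zero_grad()
    c.backward()                             # gradient dc/dx, the "spy" signal
    opt.step()                               # x moves downhill in cost
c_final = cost_net(x).item()
```

After the loop, `c_final` is no larger than `c0`: the sample has slid down the cost surface toward whatever this network considers a "good" input.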
Then you can also use this same gradient, coming down here, right? And so, with the chain rule, you can also compute the partial derivative of this lowercase c with respect to the parameters W of the generator, okay? In this case, I can then train the generator, right? I have the partial of the cost over the parameters, and therefore I can change the values of the generator's parameters in order to improve the network. Okay, got it. It totally makes sense. Thank you. Of course. Is that so? Yeah. Are they trained simultaneously, or is one trained first, the cost network or the generator network? Right. People try both. They say it's sometimes better to keep one fixed while you change the other, because otherwise you always have a moving target. Then there is contradictory evidence. We are actually going to read some source code after we cover the major pitfalls, so I'll get back to your question in a few minutes. We don't need a regularization like the KL divergence on z in a GAN, because we sample directly from a normal distribution, yeah. You sample the orange guy here from a normal distribution, so that's it, right? You have a sample, a random number, and you send this random number through the generator. That's it. And my Google Home just came back to life. Okay, I think I answered your question. More questions? Then we have pitfalls, and then we're actually going to look at source code, yeah? So it seems like we are replacing the reconstruction loss with the discriminator network. How does that help, exactly? Why is it bad to just use the reconstruction loss? Okay, okay, okay, this is a very, very good question, and something I forgot completely to say. In the variational autoencoder, we were always starting from some point, then getting back to this space.
We moved that point a little such that we could cover some area, then went back to the other side, and then tried to make those two close, right? But in this generative adversarial net, we actually start from the right-hand side. So in the generative adversarial net, you start from the right. There is no connection whatsoever between this guy here and this guy here. All you have is a cost network, which tells you whether you are on this kind of manifold here, right? I can't draw; it's going to be ugly, but okay. There's a cost network, and it's going to output, in this case, plus 10 here, and then, let's say, zero here, okay? On the other side, you have a generator network here, which maps this input here down to here, right? So one is trained to have low values around the manifold and larger values outside. And you would like some level curves, right? Like that, such that as you move further away, the cost keeps increasing. If you have a discriminator instead, it will be forced to output zero here and one outside, exactly at this manifold, very close by, right? And that creates many problems. So, okay, let me try another analogy. Hold on, there are questions, more questions. Let me go with the analogy, and then let's see whether this makes more sense. Let me actually see myself, such that I can... okay, I can see myself now. All right: you have some true data points here, okay? And then you have some generated data points over here, which have been produced by the generator, right? Points here, points down there. Let's assume now we're talking about a discriminator, okay, so I can illustrate what the problems are. You have a discriminator with these two kinds of data: true data down here, fake data over here. So what does the discriminator do?
The discriminator's decision boundary is going to be just a line here, right? Cutting this stuff in half. So far? Yeah, right? Yes. Okay, cool. Now comes the second step: you turn on gravity on this decision boundary. The points that are here will fall, boom, down onto here, okay? The points get attracted by the decision boundary. So we first train the discriminator, we get this decision boundary; then we train the generator, and these guys collapse down here. Then you have a new situation: true data here, fake data here. You train the discriminator again; you get a decision boundary halfway here, right? Then you turn on gravity, so these points collapse here, right? And you keep iterating this, and the fake data gets closer and closer and closer to the true data, right? So these points are approaching and arriving at the real data's location. Now, let's say you're using your discriminator with the binary cross-entropy loss to train it. What is the main issue? Let me do a shift; I bring my true data here so we can see better what happens. You have true data here. You have generated data here, right? They are overlapping. And now you have a discriminator cutting here. You're going to have overlap between these samples, and this discriminator has no idea what to do, right? First of all, you're going to get misclassifications, just when you thought you converged. And we did actually converge, right? If you think about it: my true data is here, my generated data is here, they overlap. So I actually managed to reach convergence, and now my discriminator has no clue whatsoever how to split these things apart. So, huh: either we don't converge, or when we converge, we get issues, right?
Huh, the discriminator, I think, just tells apart two classes? Well, the discriminator cannot tell apart the two classes, because these inputs are no longer separated, right? If you actually manage to get the generator to produce very, very good samples, then you cannot tell these good samples apart from the actual real samples, right? Now the discriminator has no clue whatsoever how to tell them apart. So whenever the generator works, the discriminator will not work. How nice is that? Okay, one other problem. Let's say, again, you have the fake data here, true data over here. And now you have a perfect, amazing, awesome discriminator, such that here it is exactly zero and here it is exactly one, okay? So you basically have a step function; you don't have a sigmoid. What's the gradient now? It's saturated, right? It's either zero or one; there is no more gradient. These points will never move, right? The gravity I was showing you before, which was attracting the generated data onto the decision boundary, was exactly the gradient of the output of the discriminator, or the cost network, with respect to the samples produced by the generator, right? But now, if this discriminator is perfect, zero here and one here, well, it's completely flat, right? If it's like that, there is no gradient whatsoever here, right? So if you're over here, say we have data in one dimension, x: you have zero, zero, zero, then you have one, one, one, one, one. If there is no gradient, these points will never know they have to go in that direction. They will see: oh, we are bad guys, we have a bad value. But then we don't know in which direction to move, because there is no direction at all.
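The saturation problem is easy to demonstrate numerically. A minimal sketch, using a hypothetical "near-perfect" 1-D discriminator, a very steep sigmoid step at x = 0:

```python
import torch

# A hypothetical "perfect" 1-D discriminator: a very steep sigmoid step
# at x = 0 (real class on one side, fake class on the other).
def discriminator(x, steepness=100.0):
    return torch.sigmoid(steepness * x)

x_fake = torch.tensor([5.0], requires_grad=True)   # deep inside the fake region
out = discriminator(x_fake)
out.sum().backward()

# The output is saturated at 1, so the gradient w.r.t. the sample vanishes:
# there is no signal telling x_fake which way to move.
saturated_grad = x_fake.grad
```

Here `out` is numerically 1 and `saturated_grad` is numerically 0: a generated sample sitting in this flat region receives no "gravity" pulling it toward the decision boundary, which is exactly the vanishing-gradient failure described above.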
The gradient is zero; it's a flat region, right? So this is a very big issue. Whenever we train this generative adversarial network, you want to make sure that the cost gradually increases as you move away from the region of the true data, okay? So that it's smooth, like a convex bowl, right? If it keeps going up, up, up, up, you always know in which direction to fall down in order to arrive at the location where your true data is, okay? And my Google Home keeps rebooting; I'm turning this thing off. There you go. Is it clear so far? Yeah? RIP, Google. Yeah, one final issue: if we get a generator which maps every point here into this single point over here (all weights are zero, the final bias is exactly this value over here), then it's finished, because the discriminator, or the cost network, will say you've done a very good job, and the generator says, yay. And then the generator just outputs one image, right? This is called mode collapse, meaning that all points are mapped into just one point, and you can't do anything about it. The actual full story is that if every point here gets mapped to this point here, then the discriminator will say, oh, this is a fake point, right? And therefore the generator will switch and produce what it now claims is the real output, right? Then you train the discriminator, and the discriminator says, oh, this one is fake. So the generator switches again, and so on, right? You basically have a network that is just jumping between samples, and you can't fix that unless you introduce some penalty for not having some kind of diversity in the generator's output. Vanishing gradients: whenever you have saturated discriminators. And we don't like discriminators; we prefer to learn this kind of smooth cost, right? A cost network. Mode collapse: the thing I just described, where we fall onto one specific point.
Unstable convergence, yeah. The point is that whenever you get a very good generator, the discriminator has no idea what's going on. You may get a very big loss, because a point that should be classified as one class is instead completely classified as the other. You get some very, very large gradients, the discriminator jumps away, then the generator jumps away, and the decision boundary goes bonkers. And then you have the generator trying to chase this runaway decision boundary, okay? So there is no convergence. There is an equilibrium, but it's an unstable equilibrium point, which is very, very tricky to reach, yeah? So I understand we have some sort of minimax problem here with our generator and our cost network, but in general, when you optimize this, are there really any straightforward ways to make sure you converge to the right point? Right, I am not sure how you figure out whether you converged to a good point, except through visual inspection of the outputs of the generator. Or you can train several GANs, and then you train a classifier on some image dataset and use it to evaluate the quality of the images, right? This is a metric we don't really like, but it's what has been done: it's called the Inception score. You train a network, the Inception network (which is why it's called the Inception score), on an image dataset, and then you check whether these generators are giving you images that look like something from that training dataset. Again, it's not really a good metric, but people have tried to use it as a way to evaluate generative models, yeah. Before going to the notebooks, let's have a look at an actual practical example of training losses for these two networks we have just seen, okay?
So the loss function for my cost network, given the input x and the latent input z in orange, can be the following: the cost C of my pink input x, plus the positive part of a margin m minus the cost assigned to a generated input. That is, L_C = C(x) + [m − C(G(z))]⁺, where the generator G is fed the latent input, a random number. So G(z) gives me a fake input, and C has to assign it a cost; as long as this cost is lower than m, the bracketed term is positive. As soon as the cost network gives this generated input a cost larger than m, then m minus some number larger than m is negative, and since I take the positive part, the term goes to zero. So this part of the loss vanishes whenever the cost network outputs something larger than m for an input provided by the generator. The other term is simply the cost associated to the correct input; to squish it down to zero, the cost network just has to output zero whenever the input is a good one. In the example I was making before, m was 10, so the network is encouraged to output a scalar of at least 10 for inputs coming from the generator, while a cost of zero for real inputs is promoted by the first term. So this is one possible loss for training the cost network; it's used in this paper by Jake, Michael, and Yann from 2016. Then how do we train the generator? That's quite straightforward: the loss for the generator is simply L_G = C(G(z)), the cost that the cost network gives a generated sample. The generator will simply try to obtain a low cost, and that's it. [Student] Can we be more specific?
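In code, the pair of losses just described might look like the following minimal PyTorch sketch; the function names, the toy `C` and `G`, and the margin value are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def cost_net_loss(C, G, x, z, margin=10.0):
    # L_C = C(x) + [margin - C(G(z))]^+ : push the cost of real samples x
    # toward zero and the cost of generated samples up to at least `margin`.
    fake = G(z).detach()  # no gradient flows into the generator here
    return C(x).mean() + F.relu(margin - C(fake)).mean()

def generator_loss(C, G, z):
    # L_G = C(G(z)) : the generator simply tries to obtain a low cost.
    return C(G(z)).mean()
```

With a toy cost `C(t) = ‖t‖²` and real input of all ones, a zero latent gives `C(G(z)) = 0`, so the hinge contributes the full margin.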
Now, what is this cost network? I haven't told you yet a specific choice for a network that gives you this scalar based on the input, but you may already have some ideas of how it can be made. A possible choice is the following: the cost is the MSE, the squared difference between the reconstruction of an autoencoder and the input itself, C(x) = ‖Dec(Enc(x)) − x‖². How does this work? Well, if the autoencoder has been trained only on pink samples, it will only be able to reconstruct pink samples; therefore the distance between my pink input and the autoencoder's reconstruction of it will be very small, hopefully, if we train this nicely. Instead, what happens if I feed an input that is far from anything on the data manifold? My autoencoder has been trained to output things that stay on the data manifold, so there will be a substantial difference between my actual input and what the autoencoder can give back. The nice part of this particular choice of cost network is that you can train the autoencoder without the generator: you can simply train an autoencoder, with an under-complete hidden layer, or over-complete with some kind of regularization, an information-restricting bottleneck. Either way, you can train this guy without having a generator, and it will simply learn the training-data manifold. Then you can use it as a proxy for the distance between your current input and what the network thinks the closest point on the training manifold could be. All right, let's move on. In the last five minutes, if there are no questions, we're going to read the source code from the PyTorch examples together.
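A minimal sketch of this reconstruction-error cost, assuming `enc` and `dec` are any pretrained encoder and decoder callables (the function name is hypothetical):

```python
import torch

def reconstruction_cost(enc, dec, x):
    # C(x) = ||dec(enc(x)) - x||^2 : small for inputs on the training
    # manifold (which the autoencoder reconstructs well), large off it.
    x_rec = dec(enc(x))
    return (x_rec - x).pow(2).flatten(1).sum(dim=1)
```

A perfect autoencoder (identity) gives cost zero; an autoencoder that maps everything to zero charges each input its full squared norm.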
And I think this is the first time we're actually reading developer code; I'm not a programmer. Whatever you've been consuming so far were my notebooks, pedagogical, educational content massaged so that it looks nice and pretty and has nice-looking output. Now you're going to read code written by people who do this as their job. So we go to GitHub; not pytorch/DeepLearning, we go to pytorch/examples. Okay, let's zoom in a little. Here we have the DCGAN and its script. We can just go through the main things in this code. We start by importing a bunch of things. As usual, there's an argument parser, so you can pass specific parameters on the command line. This prints out all the options for the current setup. This one tries to make a directory. This one, if you choose a manual seed, actually sets it so that you get reproducible results. cudnn.benchmark = True speeds things up; it allows faster GPU routines, kernels. If you don't have CUDA, it's going to take forever to train this stuff. Data root, dataset: here you load the ImageNet folders, or LFW, or another dataset; all things we already know. ngpu is the number of GPUs, nz is the size of the latent variable, and ngf and ndf are, let's see, I think the number of generator features and the number of discriminator features. Then we have a specific weight initialization, which really helps get training started properly. And now let's actually have a look at this generator: a classical nn.Module subclass.
You don't need this stuff if you're using Python 3. So let's see: we have an nn.Sequential. The generator will be up-sampling; as you've seen from the last homework, to go from a small dimension to a larger dimension you use this module: transposed convolution, batch norm, ReLU, and so on, repeated. Finally, we have a tanh, because the output in this case is going to lie within minus one to plus one. forward simply sends the input through main, the model above; this part is for using DataParallel if you want several GPUs. And here is how to initialize with the specific initialization defined above. In short, what does this thing do? You input something of size nz, and nz is 100, so you input a one-dimensional tensor of size 100. Whenever you input this 100-dimensional vector, the output is going to be something like 64 by 64 times the number of channels, nc being the number of channels of the output image, depending on whether you have colour or not. It should be clear so far; no crazy things going on. Let's see the last part and then how they train. The discriminator is the same kind of thing: a sequential. In this case, we feed it nc channels of 64 by 64, and then go down with LeakyReLU. Oh, this is important: the LeakyReLU in the discriminator makes sure you're not killing the gradient when you are in the negative region. This is really, really important: if you don't have gradients here, you can't train the generator. And you keep going down like that.
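Condensed, the generator follows this transposed-convolution / batch-norm / ReLU stack ending in a tanh; the sketch below follows the same pattern and naming (nz, ngf, nc) as the example, though the exact hyper-parameters are as I recall them, so treat it as a sketch rather than the repository's verbatim code:

```python
import torch
import torch.nn as nn

nz, ngf, nc = 100, 64, 3  # latent size, generator feature maps, output channels

netG = nn.Sequential(
    # z: (N, nz, 1, 1) -> (N, ngf*8, 4, 4)
    nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
    nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
    # -> (N, ngf*4, 8, 8)
    nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
    # -> (N, ngf*2, 16, 16)
    nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
    # -> (N, ngf, 32, 32)
    nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
    nn.BatchNorm2d(ngf), nn.ReLU(True),
    # -> (N, nc, 64, 64), squashed into (-1, 1) by the tanh
    nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
    nn.Tanh(),
)

out = netG(torch.randn(2, nz, 1, 1))  # a 100-dim latent becomes a 3x64x64 image
```

Each stride-2 transposed convolution doubles the spatial size, so 1 → 4 → 8 → 16 → 32 → 64.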
And then finally, they use a sigmoid, because they train this as a discriminator, a classifier between two classes. The forward simply sends stuff through the main branch, and they initialize this network. So we have netD and netG. [Student] So this implementation is slightly different from what we were going over before, right? Because the discriminator just outputs the sigmoid. [Alf] The only difference is this line here. In the lecture just before, we don't have the sigmoid; we just have this final convolutional layer. The second difference is that we would not be using a binary cross-entropy loss. This is the source of all evils: BCE plus this sigmoid is the wrong way of training a generative adversarial network. Nevertheless, here we go with the standard formulation, so let's see how it works. fixed_noise: you just create some random stuff with the batch size and the correct latent size. We have two optimizers, one for the discriminator and one for the generator. And now let's find the five steps that you should all know. First of all, we zero the gradients of the discriminator. The real data is going to be what comes from the data loader, and then we have a set of labels, filled with the real label. The discriminator is fed the real input, giving some real output, and then we compute the first part: the criterion, which is the binary cross-entropy, between the output for the real input and the real label. And then we perform the first backward step.
Here we perform backward on this criterion, which computes the partial derivatives of the binary cross-entropy with respect to the weights of the discriminator, for the case where we fed real data to the discriminator and tried to match its output to the real labels. This is point number one; keep it in mind. Second part: you get noise, you feed the noise into your generator, and you get some fake output. My labels are now filled with the fake label. You feed this stuff into the discriminator; we feed the fake data, but we detach it. This is the important part: we feed the fake data, but detached from the generator. Then we compute the criterion, the loss between the output of the discriminator and the labels for the fake class, and we perform another backward. So now we have two backwards, one here and one here, and we have computed the partial derivatives of the criterion both for the case where we input real data and for the case where we input fake data. There is no zeroing of the gradient in between; this is important. We only zeroed the gradients at the beginning, then computed first the gradients for the real data, then the gradients for the fake data. Now we can step the optimizer: we computed one set of partial derivatives, we computed the other set, and now we step. Finally, we train the generator, and then we are done. So how do we train the generator? Now you fill the labels with the real labels, but you still feed the discriminator the fake data, the data generated by my generator. The discriminator should say, oh, this is fake data.
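The discriminator side of the loop can be sketched as follows; this mirrors the structure just described (zero once, two backwards, one step) but with hypothetical function and variable names, not the repository's verbatim code:

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()

def update_discriminator(netD, netG, optimizerD, real, noise):
    optimizerD.zero_grad()                # zero D's gradients once
    # (1) real data, real labels
    out_real = netD(real)
    errD_real = criterion(out_real, torch.ones_like(out_real))
    errD_real.backward()
    # (2) fake data, detached so no gradient reaches the generator
    fake = netG(noise)
    out_fake = netD(fake.detach())
    errD_fake = criterion(out_fake, torch.zeros_like(out_fake))
    errD_fake.backward()                  # accumulates with the gradients from (1)
    optimizerD.step()                     # one step on the summed gradients
    return fake, (errD_real + errD_fake).item()
```

The `detach()` is what stops the first two backwards from touching the generator's parameters.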
But we say, no, no, this is real data; so you basically swap the labels. Now, when we compute this backward propagation, the gradients go in the opposite direction; they are trying to make the discriminator perform worse. But then we only step with the generator's optimizer. This backward computes the partial derivatives of the criterion with respect to everything, the weights of the discriminator and the weights of the generator, but we step only with the generator. So the generator will try to lower the criterion, and the criterion has the labels swapped: real labels for when we feed the discriminator fake data. So this one is actually working against the discriminator, and that's it: one backward here, another backward here, and another backward here. Any other questions? [Student] Wait, what's the difference between the first two backwards, since they're both on the same objective? [Alf] Right, okay. The first backward is computed when the discriminator, the cost network, has been fed the real data, and the labels are filled with the real label. So that's the first part: the true class. Then you have the fake class: I generate my fake data through the generator, which was fed noise, and I feed my discriminator the fake data, but I stop the gradients from going back into the generator. The criterion still tries to make the output of the discriminator close to the labels, and the labels in this case are the fake labels, the ones associated with the noise. Maybe we should call them noise labels; fake labels is fine too. Fake is the data, the blue x-hat generated by my generator network.
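The generator side, with the labels swapped, can be sketched like this; again the function name is hypothetical, a minimal version of the pattern rather than the example's exact code:

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()

def update_generator(netD, netG, optimizerG, noise):
    optimizerG.zero_grad()
    out = netD(netG(noise))                      # no detach: gradients can reach G
    errG = criterion(out, torch.ones_like(out))  # fake data, *real* labels
    errG.backward()     # fills gradients of both D and G...
    optimizerG.step()   # ...but only G's parameters are updated
    return errG.item()
```

Stepping only `optimizerG` is what keeps this backward from un-training the discriminator, even though D's gradients get filled too.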
And when I put this x-hat inside the discriminator, I tell the discriminator: hey, this one should be labelled with the fake label. So you have this criterion here, and in this backward you get the partial derivatives of the loss function with respect to the parameters, for the case where we fed fake data and we're trying to label it with fake targets, fake labels. In the other part, we were actually feeding the discriminator real data, and we tell the network: you have a loss between your output and labels that are supposed to be the real label. So in the first part, you get the partial derivatives corresponding to the loss computed when real data was fed to the discriminator; in the second part, you have the loss of the network's output when we fed fake data. And here we simply do another backward. So this line here and this line here will accumulate, because PyTorch by default accumulates every time you perform backward. First you accumulate for the first half of the batch, then for the second half: the first half of the batch is the real data, the second half is the fake data. Overall, you get the partial derivatives for both the real data and the fake data, and then we use this gradient to change the parameters of the discriminator. Does it make sense so far? [Student] Yeah, that makes sense. But one of them is increasing it and the other one's decreasing. [Alf] No, these two so far are both trying to decrease the criterion.
You can see here that this criterion takes the output of the discriminator when it was fed the real data; so you have real data and real labels, and the criterion is trying to pair real data with real labels. So far so good? Second part: you try to have the network match fake data with fake labels, because the output comes from the discriminator when it was fed fake data, and you force the network to say: oh, these are fake. So first you had this criterion acting on true data, with labels telling you these are true data, and then you have the loss saying that this other output should instead be labelled as fake data. Both are still trying to minimize the criterion, so whenever you perform the optimizer step, the step will try to lower both this one and this one. Another way to do this would be to take the sum of the two and perform only one gradient-descent step. Let me open the file: this line here at 226, and the other one down at 235. We performed this one dot backward and that one dot backward; alternatively, we could have summed line 226 plus line 235 and performed backward once on the sum. This alternative is exactly the same as what the code does now: performing backward twice on the two different criteria is the same as summing the two criteria and performing backward only once. And then below, whenever we train the generator, we swap the labels.
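This equivalence is just gradient accumulation at work; a tiny self-contained check (toy losses 3w and 5w, nothing from the example itself):

```python
import torch

# backward() adds into .grad, so two separate backwards on two losses
# give the same gradient as one backward on their sum.
w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
(5 * w).backward()
two_calls = w.grad.clone()       # 3 + 5 = 8

w.grad = None
(3 * w + 5 * w).backward()
one_call = w.grad.clone()        # d(8w)/dw = 8

assert torch.equal(two_calls, one_call)
```

This is exactly why the DCGAN loop can call backward on the real-batch loss and the fake-batch loss separately before a single optimizer step.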
In this case, we are stepping with the generator's optimizer, so we try to induce the discriminator to output real labels when we provide fake data. This step will not try to un-train the discriminator; it will train the generator so that it makes the discriminator perform poorly. [Student] So if our generator is generating fake data, don't we want to be able to tell that apart? Don't we want to take a step in the other direction for that? [Alf] You want to take a step in the other direction for the generator, you said? [Student] No, for the fake data: we want to be able to tell it's fake. [Alf] Yes, and that's what you do here: when you put fake data inside the discriminator, you also say that these labels are the fake labels. [Student] Oh, okay, so fake label doesn't mean the labels are fake. [Alf] Right, they are the labels for the fake data. Maybe this naming is weird: they are not fake labels, they are the true labels for the fake data. See, that's what I dislike about other people's code. In the earlier part, for the discriminator, we try to lower this criterion: these two lines try to match the true data with the true label, and these other ones try to match the generated data with the generated-data label. Both of these parts train the discriminator so that it can tell the two apart. [Student] Wait, so just to clarify: if we're trying to produce cat images, then the generator would produce, say, an image with a label saying it should be a cat, versus an image where we didn't try to make a cat, so the label is zero? [Alf] Okay, let me go with the cat; I guess it's going to be easier. Where is it?
So here we have real data: very nice, cute pictures of cats. And we say, this output should be labelled as cat, because it's very nice and looks cute. Then I feed some garbage, some noise, to the generator, and it outputs something that looks like a monster, an ugly cat. We provide these monster-looking images to the discriminator, and we feed the loss with the discriminator's verdict, whatever it says, and with the label that says these are monsters. Here you perform backward again, and then step, so that you train the discriminator to tell apart cats from monsters. That's the first part. In the second part below, we feed the monsters, and this time we keep the gradients. Pay attention: in the first part, we cut off the gradients, so gradients don't go down into the generator. In the second part, we actually input the fake data, the monster-looking images, into the discriminator; the discriminator says, monsters, monsters! But we say, no, these are cute cat pictures. So we perform backward, which computes the partial derivatives with respect to everything, and then we step the generator, so that the monsters the generator was making now look cuter. I can't be cuter than this, sorry. [Student] Why don't we send the gradient of the fake data to the discriminator? [Alf] We do, in the second case. So let me answer: in this second case, when we send the gradients back to the generator, we actually swap the correct labels with the incorrect labels. We input monsters, the discriminator says these are monsters, and we say, no, these are good-looking cats; then we train the generator so that these monsters will look more like nice cats.
In the first case, you don't want to send the gradients through, because there you are trying to minimize the correct-classification loss; if you sent those gradients backwards, you would basically get a worse-performing generator, since with respect to the generator you don't want to minimize that criterion, you want to maximize it. That's why we have no gradients to the generator in the first case, but we do in the second case, where we absolutely want to compute the gradients of this criterion with respect to the generator. [Student] Is the combination of BCE loss and sigmoid a problem because of underflow and overflow? [Alf] No. The problem with the BCE setup here is the probabilistic approach. If you train this network very well, the sigmoid saturates and gives you zero gradients: unless you are exactly halfway, the moment you're just away from the decision boundary, you're basically at one on this side or at zero on the other side, and in both regions there is no gradient. So if you're over here, you don't know where to go, how to go down the hill, because there is no hill, just a plateau. That's the first problem. The second problem is that if you want a very vertical edge, you need very, very large weights: the larger the weights, the larger the final value going into the sigmoid, and to get a saturated sigmoid you need pretty large weights leading into that module. This makes your weights, and everything else, kind of explode. That's why people do several things to patch this: limiting the norm of the weights, limiting the norm of the gradients.
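The saturation problem is easy to see numerically; a small self-contained check of the sigmoid's gradient at increasing distances from the decision boundary:

```python
import torch

# sigma'(s) = sigma(s) * (1 - sigma(s)) collapses toward zero as soon
# as the input moves away from zero: a saturated sigmoid is a plateau.
s = torch.tensor([0.0, 5.0, 20.0], requires_grad=True)
torch.sigmoid(s).sum().backward()
print(s.grad)   # roughly [0.25, 0.0066, 0.0000]
```

At the boundary the gradient is 0.25; five units away it is already two orders of magnitude smaller, and at twenty units it is numerically zero, so nothing can be trained through it.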
There are many, many ways to patch this architecture, but that's patching; we don't want patching, we'd like to know what is proper. And what is proper is basically using an autoencoder, for example, as your cost network. If you consider the reconstruction error of an autoencoder, it will be zero or small if you provide data coming from the training distribution. If you provide a sample that is far from the training distribution, remember the manifold from last time, then the autoencoder will do a poor job at the reconstruction, and therefore the reconstruction error will be larger. So instead of using a discriminator, you can use an autoencoder's reconstruction error. How can you get more out of this course overall? Let me give you a few suggestions. First, comprehension: if something was still not clear, just ask me in the question section below the video. I will answer every question, so you will get it eventually. If you'd like more news about the field, the educational content I make, and things I find interesting, you can follow me on Twitter; there you have my handle, alfcnz. If you'd like updates about newer videos, don't forget to subscribe to the channel and activate the notification bell. If you actually liked this video, don't forget to put a thumb up; recommending this video to other people helps as well. If you'd like to search the content of this lesson, we have an English transcription connected directly to this video: every title in the transcription is clickable, and clicking a title takes you to the corresponding location in the video. In the same way, each section of the video has the same title as in the transcription, so you can go back and forth. Maybe English is not your first language.
Do you speak Italian? Spanish? Mandarin? Korean, I have no idea how to say that in Korean. Well, we have several translations of this material available on the website, and we are also looking for more translations, if you can help with that as well. It's really important that you actually try to do some of the exercises and play around with the notebooks and the source code we provided, in order to internalize and better understand the concepts explained during the lessons. Contribute: this really gives you the opportunity to be part of this whole project. For example, if you find some typos in the write-ups, or some bugs in the notebooks, you can fix those and be part of the project by sending me a pull request on GitHub, or letting me know otherwise. And that was it. See you next time. Bye-bye.