So we're going to be talking about GANs today. Who has heard of GANs? Yeah, most of you. It's a very hot technology, but it definitely deserves to be in the cutting-edge deep learning part of the course, because GANs aren't quite proven to be necessarily useful for anything yet, but they're nearly there. They're definitely going to get there, and we're going to focus on the areas where they're definitely going to be useful in practice; there are a number of other areas where they may turn out to be useful, but we don't know yet. The area where I think they're definitely going to be useful in practice is the kind of thing you see on the left here, which is, for example, turning drawings into rendered pictures. This comes from a paper that came out just two days ago, so there's very active research going on right now.

Before we get there, though, let's talk about some interesting stuff from the last class. This is an interesting thing that one of our diversity fellows, Christine Payne, did. Christine has a master's in medicine from Stanford, so she obviously had an interest in thinking about what it would look like if we built a language model of medicine. One of the things we briefly touched on back in lesson four, but didn't really talk much about last time, is the idea that you can seed a generative language model. That basically means you've trained a language model on some corpus, and then you're going to generate some text from it. You can start off by feeding it a few words, to basically say "here are the first few words to create the hidden state in the language model, now please generate from there." Christine did something clever, which was to seed it with a question and then repeat the question three times, and then let it generate from there.

So she fed a language model lots of different medical texts, and then fed it this question: "What is the prevalence of malaria?" And the model said: "In the US, about 10% of the population has the virus, but only about 1% is infected with the virus, about 50 to 80 million infected." She asked "What's the treatment for ectopic pregnancy?" and it said: "It's a safe and safe treatment for women with a history or symptoms. It may have a significant impact in clinical response. Most important factor is development of management of ectopic pregnancy," etc.
And what I find interesting about this is that, to me as somebody who doesn't have a master's in medicine from Stanford, these are pretty close to being believable answers to the questions, but they have no bearing on reality whatsoever. I think it's an interesting ethical and user-experience quandary. I'm also involved in a company called doc.ai that's trying to do a number of things, but in the end provide an app for doctors and patients which can help create a conversational user interface around their medical issues, and I've been continually saying to the software engineers on that team: please don't try to create a generative model using an LSTM or something, because they're going to be really good at creating bad advice that sounds impressive - kind of like political pundits or tenured professors, people who can say bullshit with great authority. So I thought it was a really interesting experiment, and it's great to see what our diversity fellows are doing; this is why we have this program.

I shouldn't just say "master's in medicine" - Christine is actually also a Juilliard-trained classical musician, a Princeton valedictorian in physics, and a high-performance computing expert. She does a bit of everything. So, a really impressive group of people, and it's great to see such exciting ideas coming out. And if you're wondering "I've done some interesting experiments, should I let people know about it?" - well, Christine mentioned this on the forum, and I went on to mention it on Twitter, to which I got this response asking whether she's looking for a job. You may be wondering who Xavier Amatriain is: he's the founder of a hot new medical AI startup, he was previously the head of engineering at Quora, and before that he ran the data science team at Netflix and built their recommender systems. So this is what happens if you do something cool and let people know about it - you get noticed by awesome people like Xavier.
So let's talk about CIFAR-10, and the reason I'm going to talk about CIFAR-10 is that we're going to be looking at some more bare-bones PyTorch stuff today to build these generative adversarial models. There's no real fast.ai support to speak of for GANs at the moment - I'm sure there will be soon enough, but currently there isn't - so we're going to be building a lot of models from scratch. It's been a while since we've done much serious model building; there was a little bit for our bounding box stuff, but really all the interesting stuff there was in the loss function.

We looked at CIFAR-10 in part one of the course and built something which was getting about 85% accuracy and took, if I remember, a couple of hours to train. Interestingly, there's a competition going on now to see who can train CIFAR-10 the fastest - the Stanford DAWNBench - and the goal is to get it to train to 94% accuracy. So it'll be interesting to see if we can build an architecture that can get to 94%, because that's a lot better than our previous attempt, and hopefully in doing so we'll learn something about creating good architectures that will be useful for looking at these GANs today. I think it's also useful because I've been looking much more deeply into the last few years of papers about different kinds of CNN architectures, and I've realized that a lot of the insights in those papers are not being widely leveraged, and clearly not widely understood. So I want to show you what happens if we leverage some of that understanding.

I've got this notebook called cifar10-darknet, because the particular architecture we're going to look at is really very close to the Darknet architecture. You'll see in the process, though, that the Darknet architecture - not the whole YOLO v3 end-to-end thing, but just the part of it that they pre-train on ImageNet to do classification - is almost the most generic, simple architecture you could come up with, so it's a really great starting point for experiments. We're going to call it darknet, but it's not quite Darknet, and you can fiddle around with it to create things that definitely aren't Darknet. It's really just the basis of nearly any modern ResNet-based architecture.

CIFAR-10, remember, is a fairly small dataset, and the images are only 32 by 32 in size. I think it's a really great dataset to work with because you can train it relatively quickly, unlike ImageNet; it's a relatively small amount of data, unlike ImageNet; and it's actually quite hard to recognize the images, because 32 by 32 is kind of too small to easily see what's going on, so it's somewhat challenging. I think it's a really underappreciated dataset because it's old, and who at DeepMind or OpenAI wants to work with a small old dataset when they could use their entire server room to process something much bigger? But to me, this is a really great dataset to focus on.

So we'll go ahead and import our usual stuff, and we're going to try to build a network from scratch to train this with. One thing that I think is a really good exercise for anybody who's not a hundred percent confident with their broadcasting and PyTorch basic skills is to figure out how I came up with these numbers: they're the per-channel averages and per-channel standard deviations of CIFAR-10. So that's a bit of homework: make sure you can recreate those numbers, and see if you can do it in no more than a couple of lines of code - no loops; ideally you want to do it in one go.
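If you want something to check your answer against, here's a minimal sketch of one way to do it, assuming you've already got the raw training images as a single array of shape (N, 32, 32, 3) - the variable and file names here are just placeholders for illustration:

```python
import numpy as np

# imgs: uint8 array of shape (N, 32, 32, 3) holding the CIFAR-10 training images
imgs = np.load('cifar10_train.npy')    # hypothetical file, just for illustration

x = imgs / 255.0                       # scale to [0, 1] first
mean = x.mean(axis=(0, 1, 2))          # average over samples, height and width -> 3 numbers
std  = x.std(axis=(0, 1, 2))           # same reduction for the standard deviation
print(mean, std)                       # should roughly match the stats hard-coded in the notebook
```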
Because these images are fairly small, we can use a larger batch size than usual - 256 - and the size of the images is 32.

Transformations: normally we have a standard set of side-on transformations that we use for photos of normal objects, but we're not going to use them here, because these images are so small that trying to rotate a 32 by 32 image even a little introduces a lot of blocky distortions. So the standard transforms people tend to use are a random horizontal flip, and then we add size divided by eight - so four pixels - of padding on each side. One thing which I find works really well is that, by default, fast.ai doesn't add black padding, which basically every other library does; we actually take the last four pixels of the existing photo, flip it and reflect it, and we find we get much better results by using this reflection padding by default. So now that we've got a 36 by 36 image, this set of transforms will, during training, randomly pick a 32 by 32 crop, so we get a little bit of variation, but not heaps. Then we can use the normal from_paths to grab our data.

So now we need an architecture, and what we're going to do is create an architecture which fits on one screen. This is from scratch: I'm using the predefined Conv2d, BatchNorm2d and LeakyReLU modules, but I'm not using any pre-built blocks - they're all being defined here, so the entire thing is on one screen. So if you're ever wondering "can I understand a modern, good-quality architecture?" - absolutely, let's study this one.

My basic starting point with an architecture is to say: it's a stacked bunch of layers, and generally speaking there's going to be some kind of hierarchy of layers. At the very bottom level there are things like a convolutional layer and a batch norm layer, but any time you have a convolution you're probably going to have some standard sequence, and normally it's going to be conv, batch norm, then a non-linear activation like ReLU. So I try to start right from the top by saying, okay, what are my basic units going to be? By defining it that way, I don't have to worry about trying to keep everything consistent, and it's going to make everything a lot simpler.

So here's my conv_layer, and any time I say conv_layer I mean conv, batch norm, ReLU - although not quite ReLU; I'm actually using leaky ReLU. I think we've briefly mentioned it before, but the basic idea is that a normal ReLU looks like that - hopefully you all know that by now - and a leaky ReLU looks like that: this part, as before, has a gradient of one, and this part has a gradient that can vary, but something around 0.1 or 0.01 is common. The idea behind it is that when you're in this negative zone you don't end up with a zero gradient, which would make it very hard to update the weights. In practice, people have found leaky ReLU more useful on smaller datasets and less useful on big datasets.
But it's interesting that the YOLO v3 paper did use leaky ReLU and got great performance from it. It rarely makes things worse and it often makes things better, so if you need to create your own architecture, it's probably not a bad default to go with leaky ReLU.

You'll notice I don't define a PyTorch module here; I just use Sequential. This is something that's really underutilized - if you read other people's PyTorch code, they tend to write everything as a PyTorch module with an __init__ and a forward, but if the thing you want is just a sequence of things one after the other, it's much more concise and easier to understand to just make it a Sequential. So I've just got a simple plain function that returns a sequential model.

I mentioned that there's generally a hierarchy of units in most modern networks, and I think we know now that the next level in this hierarchy, for ResNets - and this is a type of ResNet - is the residual block, or res block, which I've called ResLayer here. Back when we last did CIFAR-10, I oversimplified this - I cheated a little bit. We had x coming in, we put it through a conv, and then added it back up to x to go out. So in general, your output is equal to your input plus some function of your input, and the thing we did last year was to make that function a single 2D conv. But in the real res block there are actually two of them: it's conv(conv(x)), where I'm using "conv" as a shortcut for our conv_layer, in other words conv, batch norm, ReLU. So you can see here I've created two convs, and in the forward I take my x, put it through the first conv, put it through the second conv, and add it back to my input again to get my basic res block.

One interesting insight here is: what are the numbers of channels in those convolutions? We've got some number of input channels, or input filters, coming in. The way the Darknet folks set things up is they said: we're going to make every one of these res layers spit out the same number of channels that came in, and I kind of like that - that's why I used it here - because it makes life simpler. And what they did is to make the first conv halve the number of channels, and then the second conv double it again: so ni goes to ni over 2, and then ni over 2 goes back to ni. You've got this funnelling thing, where if you've got 64 channels coming in, they get squished down by the first conv to 32 channels, and then taken back up to 64 channels coming out.
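To make that concrete, here's a minimal sketch of the two building blocks just described - a conv/batch-norm/leaky-ReLU layer built as a plain Sequential, and a residual layer with the ni → ni/2 → ni funnel. The names and defaults are my own and won't match the notebook exactly:

```python
import torch
import torch.nn as nn

def conv_layer(ni, nf, ks=3, stride=1):
    # conv -> batch norm -> leaky ReLU, expressed as a plain Sequential
    return nn.Sequential(
        nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2, bias=False),
        nn.BatchNorm2d(nf),
        nn.LeakyReLU(0.1, inplace=True))

class ResLayer(nn.Module):
    # residual block: squish the channels down with a 1x1 conv, bring them back with a 3x3,
    # then add the result to the input
    def __init__(self, ni):
        super().__init__()
        self.conv1 = conv_layer(ni, ni // 2, ks=1)
        self.conv2 = conv_layer(ni // 2, ni, ks=3)
    def forward(self, x):
        return x + self.conv2(self.conv1(x))

x = torch.randn(2, 64, 8, 8)
print(ResLayer(64)(x).shape)   # torch.Size([2, 64, 8, 8]) - same shape out as in
```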
Yes, Rachel? "Why is inplace=True in the leaky ReLU?" Oh, thanks for asking - a lot of people forget this or don't know about it, but it's a really important memory technique. If you think about it, this conv_layer is the lowest-level thing, so pretty much everything in our ResNet, once it's all put together, is going to be conv_layers, conv_layers, conv_layers. If you don't have inplace=True, it's going to allocate a whole separate piece of memory for the output of the ReLU, which is totally unnecessary. And actually, since I wrote this, I came up with another idea the other day, which I'll now implement: you can do the same thing for the res layer. Rather than saying x plus the result, hopefully some of you remember that in PyTorch pretty much every function has an underscore-suffix version which means "do that in place" - so add becomes add_, which is add in place - and that suddenly reduces my memory there as well. These are really handy little tricks. I actually forgot the inplace=True at first here, and I was literally having to decrease my batch size to much lower amounts than I knew should be possible, and it was driving me crazy - then I realized that was what was missing. You can also do that with dropout, by the way, if you have dropout; so dropout and all the activation functions can be done in place, and then generally any arithmetic operation can be done in place as well.

"Why is bias usually set to False in the conv layers in ResNet?" Yeah, so if you're watching the video, pause now and see if you can figure this out, because this is a really interesting question: why don't we need bias? Okay, welcome back. Here's the thing: immediately after the conv is a batch norm, and remember batch norm has two learnable parameters for each activation - the thing you multiply by and the thing you add. If we had a bias in the conv to add something, and then we add another thing in the batch norm, we're adding two things, which is totally pointless - that's two weights where one would do. So if you have a batch norm after a conv, you can either tell the batch norm not to include the add bit, or, more easily, just don't include the bias in the conv. There's no particular harm either way, but it's going to take more memory, because that's more gradients it has to keep track of, so best to avoid it.

Another little trick: most people's conv layers have padding as a parameter, but generally speaking you should be able to calculate the padding easily enough. I see people trying to implement special "same padding" modules and all kinds of stuff like that, but if you've got a kernel size of three - at pretty much any stride - then that kernel is going to overlap by one unit on each side, so we want a padding of one; whereas if the kernel size is one, we don't need any padding. So in general, padding of kernel size integer-divided by two is what you need. There are some tweaks sometimes, but in this case it works perfectly well. So again, I'm trying to simplify my code by having the computer calculate stuff for me, rather than doing it myself.
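As a quick sanity check of that rule - a purely illustrative snippet, not something from the notebook:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)
for ks in (1, 3, 5):
    conv = nn.Conv2d(16, 16, kernel_size=ks, stride=1, padding=ks // 2, bias=False)
    print(ks, conv(x).shape)   # the 32x32 grid size is preserved for every odd kernel size
```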
Another thing to notice with the two conv layers in the res layer: as well as this idea of a bottleneck - reducing the number of channels and then increasing them again - there's the question of what kernel sizes we use. The first one is a one-by-one conv. So what actually happens in a one-by-one conv? You might want to pause the video now and think about that.

If we've got a little four-by-four grid, there's of course a filters (or channels) axis as well - let's say that's 32 deep. For a one-by-one conv, the kernel is going to be one by one by 32. Remember, when we talk about the kernel size we never mention that last piece: we don't say it's a one by one by 32 kernel, because that part comes from the filters in and filters out. So what happens is that this kernel gets placed first on the first cell, and we basically get a dot product of that 32-deep kernel with the 32-deep bit of the input at that grid cell, and that gives us our first output. Then we take the kernel to the second cell to get the second output. So it's basically a bunch of little dot products, one for each point in the grid.

What a one-by-one conv really is, then, is something that lets us change the dimensionality in whatever way we want along the channel dimension. Each of our ni-over-two filters is basically a different weighted average of the input channels. So with very little computation it lets us add this additional step of calculation and non-linearity. That's a cool trick: take advantage of these one-by-one convs, create the bottleneck, and then pull it back out again with three-by-three convs - which do take proper advantage of the 2D nature of the input, where the one-by-one conv doesn't at all.

So these two lines of code - there's not much in them, but they're a really great test of your understanding and your intuition about what's going on. Why is it that a one-by-one conv going from ni to ni-over-two channels, followed by a three-by-three conv going from ni-over-two back to ni channels, works? Why do the tensor ranks line up? Why do the dimensions all line up nicely? Why is it a good idea? What's it really doing? It's a really good thing to fiddle around with - maybe create some small ones in a Jupyter notebook, run them yourself, see what inputs and outputs come in and out, and really get a feel for it.
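For example, here's the kind of tiny check I mean, confirming that a one-by-one convolution really is just a per-cell dot product across the channel dimension (illustrative only):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 4, 4)                        # a 4x4 grid with 32 channels
conv = nn.Conv2d(32, 1, kernel_size=1, bias=False)  # one 1x1 filter: really just 32 weights

w = conv.weight.view(32)                  # the 1x1x32 kernel flattened to a plain vector
manual = (x[0, :, 0, 0] * w).sum()        # dot product with the 32-deep column at cell (0, 0)
print(conv(x)[0, 0, 0, 0], manual)        # the two numbers match (up to float error)
```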
Once you've done that, you can play around with different things. One of the really underappreciated papers is this one: Wide Residual Networks. It's really quite a simple paper, but what they do is basically fiddle around with these two lines of code. They say: well, what if this wasn't divided by two, but was times two? That would be totally allowable - it's going to line up nicely. Or what if we had another conv3 after this, so this was ni-over-two to ni-over-two and then ni-over-two to ni? That's going to work too: kernel sizes one, three, one; halve the number of channels, leave it at half, and then double it again at the end. So they come up with a simple notation for defining what this can look like, and they show lots of experiments. Basically what they show is that this approach of bottlenecking - of decreasing the number of channels, which is almost universal in ResNets - is probably not a good idea; in fact, from the experiments, definitely not a good idea.

The reason it's popular is that it lets you create really deep networks - the folks who created ResNets got particularly famous for creating a 1001-layer network. But the thing about 1001 layers is that you can't calculate layer two until you've finished layer one, and you can't calculate layer three until you've finished layer two. It's sequential, and GPUs don't like sequential. So what they showed is that if you have fewer layers, but with more calculations per layer - and one easy way to do that would be to remove the divided-by-two, no other changes; try this at home, try running it and see what happens - that basically lets your GPU do more work.

And it's very interesting, because the vast majority of papers that talk about the performance of different architectures never actually time how long it takes to run a batch through them. They literally say "this one requires x floating point operations per batch", but they never actually bother to run the damn thing like a proper experimentalist and find out whether it's faster or slower. So a lot of the architectures that are really famous now turn out to be slow as molasses, take crap-loads of memory, and are just totally useless, because the researchers never actually bothered to see whether they're fast and whether they fit in RAM with normal batch sizes. The Wide ResNet paper is unusual in that it actually times how long things take, as does the YOLO v3 paper, which made some of the same insights. I'm not sure whether they'd seen the Wide ResNet paper - the YOLO v3 paper came to a lot of the same conclusions, but I'm not even sure they cited it, so they may not be aware that all that work has been done - but it's great to see people actually timing things and noticing what actually makes sense.

Yes, Rich? "SELU looked really hot when the paper came out, but I notice you don't use it - what's your opinion on SELU?" So SELU is something largely for fully connected layers which allows you to get rid of batch norm, and the basic idea is that if you use this different activation function, it's self-normalizing - that's what the "Self-Normalizing Neural Networks" paper it comes from is all about. Self-normalizing means it will always remain at unit standard deviation and zero mean, and therefore you don't need batch norm. It hasn't really gone anywhere, and the reason is that it's incredibly finicky: you have to use a very specific initialization, otherwise
it doesn't start with exactly the right standard deviation and mean. It's very hard to use with things like embeddings - if you do, you have to use a particular kind of embedding initialization which doesn't necessarily make sense for embeddings. You do all this work, it's very hard to get right, and if you do finally get it right, what's the point? You've managed to get rid of some batch norm layers which weren't really hurting you anyway. It's interesting, because I think one of the reasons people noticed that paper - in my experience the main reason - was that it was created by the inventor of LSTMs, and it had a huge mathematical appendix, and people thought "lots of maths from a famous guy, this must be great." But in practice I don't see anybody using it to get state-of-the-art results or win competitions or anything like that.

Okay, so this is some of the tiniest bits of code we've seen, but there's so much here, and it's fascinating to play with. So now we've got this block, which is built on that block, and then we're going to create another block on top of that, which I've called a group layer. A group layer is going to contain a bunch of res layers. It's got some number of channels, or filters, coming in; what we do is double the number of channels by just using a standard conv layer, and optionally we'll halve the grid size by using a stride of two. Then we do a whole bunch of res layers - we can pick how many, so that could be two or three or eight - because remember, these res layers don't change the grid size and they don't change the number of channels, so you can add as many as you like, anywhere you like, without causing any problems. It's just going to use more computation and more RAM, but there's no reason other than that you can't add as many as you like. So a group layer is going to end up doubling the number of channels, because of that initial convolution, and, depending on what we pass in as the stride, it may also halve the grid size.

Then, to define our darknet, or whatever we want to call this thing, we just pass in something that looks like this, and what it says is: create five group layers. The first one will contain one of these extra res layers, the second will contain two, then four, then six, then three, and I want you to start with 32 filters. So the first group layer will be based on 32 filters, and there'll just be one extra res layer; the second one is going to double the number of filters - because that's what we do each time we have a new group layer - so the second will have 64, then 128, then 256, then 512, and that'll be it. That's going to be nearly all of the network - those bunches of layers - and remember, every one of those group layers also has one convolution at the start. So besides all that, there's one convolutional layer at the very start of the whole thing, and at the very end we do our standard adaptive average pooling, flatten, and a linear layer to create the number of classes out at the end. So: one convolution at one end; adaptive pooling, flatten and one linear layer at the other end; and in the middle these group layers, each one consisting of a convolutional layer followed by n res layers. And that's it.
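Putting all those pieces together, here's a minimal sketch of the whole thing - my own simplified re-creation of what the notebook describes, so details like names, weight initialization, and the exact stride logic may differ from the real cifar10-darknet code:

```python
import torch
import torch.nn as nn

def conv_layer(ni, nf, ks=3, stride=1):
    return nn.Sequential(
        nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=False),
        nn.BatchNorm2d(nf),
        nn.LeakyReLU(0.1, inplace=True))

class ResLayer(nn.Module):
    def __init__(self, ni):
        super().__init__()
        self.conv1 = conv_layer(ni, ni // 2, ks=1)
        self.conv2 = conv_layer(ni // 2, ni, ks=3)
    def forward(self, x):
        return x + self.conv2(self.conv1(x))

class Darknet(nn.Module):
    def make_group_layer(self, ni, n_blocks, stride=1):
        # one conv that doubles the channels (and maybe halves the grid), then n res layers
        return [conv_layer(ni, ni * 2, stride=stride)] + [ResLayer(ni * 2) for _ in range(n_blocks)]

    def __init__(self, num_blocks, num_classes, nf=32):
        super().__init__()
        layers = [conv_layer(3, nf, ks=3, stride=1)]          # one conv at the very start
        for i, nb in enumerate(num_blocks):
            layers += self.make_group_layer(nf, nb, stride=1 if i == 0 else 2)
            nf *= 2                                           # channels double with each group
        layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),     # size-independent head
                   nn.Linear(nf, num_classes)]
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

m = Darknet([1, 2, 4, 6, 3], num_classes=10, nf=32)
print(m(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 10])
```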
Again - I think we've mentioned this a few times, but I'm yet to see any code out there, any examples, anything anywhere, that uses adaptive average pooling. Everyone I've seen writes it with a fixed size here, which means it's now tied to a particular image size, which definitely isn't what you want. Most people, even the top researchers I speak to, are still under the impression that a specific architecture is tied to a specific input size, and that's a huge problem, because it really limits their ability to use smaller sizes to kickstart their modeling, or to use smaller sizes for doing experiments and so forth.

You'll also notice I'm using Sequential here, and a nice way to create an architecture is to start out by creating a list - in this case a list with just one conv layer in it - and then my make_group_layer function just returns another list, so I can just go "plus equals" to append that list to the previous one, and "plus equals" again to append the final bunch of things, and then finally make a Sequential of all those layers. So now my forward is just self.layers(x). That's a nice picture of how to make your architectures as simple as possible. So you can now go ahead and create this, and as I say, you can fiddle around with it. You could even parameterize the divided-by-two to be a number you pass in, so maybe it's times two instead; you could pass in things that change the kernel size or the number of convolutional layers. I've actually got a version of this, which I'm about to run for you, that implements all of the different parameters that are in that Wide ResNet paper, so I could fiddle around to see what worked well.

Once we've got that, we can use ConvLearner.from_model_data to take our PyTorch module and our model data object and turn them into a learner, give it a criterion, add some metrics if we like, and then call fit and away we go.

"Could you please explain adaptive average pooling? How does setting it to one work?" Sure - but before I do, since we've only got a certain amount of time in this class, I want to see how this simple network goes against those state-of-the-art results, so let's start it running now and see how it looks later. I've got the command ready to go. We've basically taken all that stuff and put it into a simple little Python script, and I've modified some of those parameters I mentioned to create something I've called a WRN-22 network, which doesn't officially exist, but it's got a bunch of changes to the parameters based on my experiments. We're going to use the new Leslie Smith one-cycle thing, so there's quite a bunch of cool stuff here.
The one-cycle implementation was done by one of our students - I hope I pronounce this name right - Sylvain Gugger; the CIFAR-10 training experiments were largely done by Brett Koonce; and getting the half-precision floating point implementation integrated into fast.ai was done by Andrew Shaw. So it's been a cool bunch of different student projects coming together to allow us to run this.

This is going to run on an AWS P3, which has eight GPUs. The P3 has the newer Volta architecture GPUs, which actually have special support for half-precision floating point. fast.ai is the first library I know of to actually integrate the Volta-optimized half-precision floating point into the library, so we can just go learn.half() and get that support automatically, and it's also the first to integrate one cycle - these are the parameters for the one cycle. So we can go ahead and get this running. What this actually does is use PyTorch's multi-GPU support: since there are eight GPUs, it's going to fire off eight separate Python processes, each one's going to train on a bit, and at the end they pass their gradient updates back to the master process, which integrates them all together. So you'll see lots of progress bars pop up together, and you can see it's training in three or four seconds per epoch when you do it this way - whereas when I was training earlier, thirty epochs in, I was getting about thirty seconds per epoch. So doing it this way we can train things roughly ten times faster, which is pretty cool. Okay, we'll leave that running.

So, you were asking about adaptive average pooling, and specifically what the 1 is doing. Normally when we're doing average pooling, let's say we've got a 4 by 4 grid and we do average pooling with a 2 by 2 window: that takes a 2 by 2 area and averages those four cells, and then we can pass in a stride - if we said stride one, then the next step looks at the next 2 by 2 block across and takes its average, and so forth. That's what a normal 2 by 2 average pooling is, and in that case, if we didn't have any padding, it would spit out a 3 by 3 - the window fits in three places across and three places down - and if we added padding we could make the output 4 by 4 as well. But what if we don't want 3 by 3 - what if we want 1 by 1? Then we could say average pool with a 4 by 4 window, and that would average the whole lot and spit out 1 by 1. But that's just one way to do it. Rather than specifying the size of the pooling window, why don't we instead say: I don't care what the size of the input grid is, I always want 1 by 1? That's where you say adaptive average pool - now you don't say the size of the pooling window, you say the size of the output you want, and if you only pass a single int it assumes you mean a square. So in this case, adaptive average pooling with a 1, given a 4 by 4 grid coming in, is the same as average pooling 4 by 4; if a 7 by 7 grid came in, it would be the same as average pooling 7 by 7. It's the same operation; it's just expressing it in a way that says "regardless of the input, I want an output of that size, please."
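A quick way to see the difference between the two modules (just an illustration):

```python
import torch
import torch.nn as nn

x4 = torch.randn(1, 1, 4, 4)
x7 = torch.randn(1, 1, 7, 7)

print(nn.AvgPool2d(4)(x4).shape)           # torch.Size([1, 1, 1, 1]) - but only works for 4x4 inputs
print(nn.AdaptiveAvgPool2d(1)(x4).shape)   # torch.Size([1, 1, 1, 1])
print(nn.AdaptiveAvgPool2d(1)(x7).shape)   # torch.Size([1, 1, 1, 1]) - same layer, any input size
```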
Okay, how's our little thing going along? Oh, okay - well, we got to 94%, and it took three minutes and eleven seconds, and the previous state of the art was one hour and seven minutes. So was it worth fiddling around with those parameters and learning a little bit about how these architectures actually work, rather than just using what came out of the box? Well, holy shit - we just used a publicly available instance, a spot instance, so at about eight dollars per hour, three minutes cost us a few cents to train this from scratch, about twenty times faster than anybody has done it before. That's the most crazy state-of-the-art result I think we've ever seen - we've seen many, but this one just blew it out of the water.

This is partly thanks to fiddling around with those parameters of the architecture, but mainly, frankly, about using Leslie Smith's one-cycle approach and fast.ai's implementation of it. Just to remind you what that's doing: this axis is batches, and this is learning rate. It creates an upward path that's equally long as the downward path, so it's a true CLR triangular cyclical learning rate. As usual, you can pick the ratio between those two numbers - x divided by y - and in this case we picked 50, so we started out with a much smaller learning rate. Then it's got this cool idea, which is that you get to say what percentage of your batches is spent going from the bottom of the triangle all the way down, pretty much to zero - that's what the second number is - so 15% of the batches are spent going from the bottom of our triangle even further down. And importantly, that's not the only thing one cycle does: we also have momentum, and momentum goes from 0.95 to 0.85 like this. In other words, when the learning rate is really low, we use a lot of momentum, and when the learning rate is really high, we use very little momentum, which makes a lot of sense - but until Leslie Smith showed this in that paper, I'd never seen anybody do it before. It's a really, really cool trick. You can now use it via the use_clr_beta parameter in fast.ai, and you should be able to basically replicate this state-of-the-art result.
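I won't try to reproduce the exact fastai 0.7 call from memory here, but the same idea - learning rate going up and then down while momentum moves the opposite way between 0.95 and 0.85 - is available in plain PyTorch these days as OneCycleLR, so a rough sketch of the schedule looks like this (the numbers are placeholders, not the ones we used for DAWNBench):

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(10, 2)                       # stand-in model, just to have parameters
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.95)

sched = OneCycleLR(opt, max_lr=0.1, total_steps=1000,
                   pct_start=0.5,                    # spend half the steps going up, half coming down
                   base_momentum=0.85, max_momentum=0.95,   # momentum annealed the opposite way
                   div_factor=50)                    # start at max_lr / 50

for step in range(1000):
    # ... forward pass and loss.backward() would go here ...
    opt.step()
    sched.step()
```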
You can use this on your own computer or on Paperspace; the only thing you won't get is the multi-GPU piece, but that makes it a bit easier to train anyway, and on a single GPU you should be able to beat the previous record.

"make_group_layer contains stride=2, so this means stride is 1 for the first layer and 2 for everything else. What's the logic behind it? Usually the strides I have seen are odd." Strides are pretty much always one or two - I think you're thinking of kernel sizes. A stride of two means I jump two across, so a stride of two halves your grid size; with a stride of one the grid size doesn't change. So I think you might have got confused between stride and kernel size there. In this case, because CIFAR-10 is 32 by 32, which is small, we don't get to halve the grid size very often, because pretty quickly we'd run out of cells. That's why the first group layer has a stride of one: so we don't decrease the grid size straight away. It's also a nice way of doing it because we have a low number of filters at the start, so we can start out without too much computation on the big grid, and then gradually do more and more computation as the grids get smaller and smaller - the smaller the grid, the less time the computation takes.

Okay - so that we can do all of our GAN-ing in one go, let's take a slightly early break and come back at 7:30.

So, we're going to talk about generative adversarial networks, also known as GANs, and specifically we're going to focus on the Wasserstein GAN paper, whose authors included some guy called Soumith Chintala, who went on to create some piece of software called PyTorch. The Wasserstein GAN - I'm just going to call it WGAN - was heavily influenced by the DCGAN paper, the deep convolutional generative adversarial networks paper, which Soumith was also involved with. It's a really interesting paper to read. A lot of it looks like this, and the good news is you can skip those bits, because there's also a bit that looks like this, which says "do these things". Now, I will say that a lot of papers have a theoretical section which seems to be there entirely to get past the reviewers' need for theory; that's not true of the WGAN paper - the theory bit is actually really interesting. You don't need to know it to use the technique, but if you want to learn about some cool ideas and see the thinking behind why this particular algorithm looks the way it does, it's absolutely fascinating. Before this paper came out, I knew literally nobody who had studied the math it's based on, so everybody had to learn it, and the paper does a pretty good job of laying out all the pieces - you'll have to do a bunch of reading yourself. So if you're interested in digging into the deeper math behind some paper, to see what it's like to study one, I would pick this one, because at the end of the theory section you'll come away saying "okay, I can see now why they made this algorithm the way it is." And often these theoretical sections are very clearly added after the authors came up with the algorithm - they come up with it based on intuition and experiments and then post-hoc justify it - whereas with this one you can clearly see they actually thought about what's going on in GANs and what they need to do, and then came up with the algorithm.

So the basic idea of a GAN is that it's a generative model: it's something that's going to create sentences, or create images - it's going to generate stuff - and it's going to try to create stuff which is very hard to tell apart from real stuff. A generative model could be used to face-swap a video - there's a very well-known, controversial thing with deepfakes and fake pornography happening at the moment - or it could be used to fake somebody's voice, or to fake the answer to a medical question (although in that case it's not really a fake - it could be a generative answer to a medical question that's actually a good answer).
So you might be generating language, or you could generate a caption for an image, for example. Generative models have lots of interesting applications, but generally speaking they need to be good enough that, for example, if you're using one to automatically create a new scene for Carrie Fisher in the next Star Wars movie, and she's not around to play that part anymore, and you want to generate an image of her that looks the same, it has to fool the Star Wars audience into thinking "that doesn't look like some weird Carrie Fisher - that looks like the real Carrie Fisher." Or if you're trying to generate an answer to a medical question, you want to generate English that reads nicely and clearly and sounds authoritative and meaningful.

So the idea of a generative adversarial network is that we're going to create not just a generative model to create, say, the generated image, but a second model that's going to try to pick which ones are real and which ones are generated - and we'll call the generated ones "fake". So we've got a generator that's going to create our fake content, and a discriminator that's going to try to get good at recognizing which ones are real and which ones are fake. There are going to be two models, and they're going to be adversarial, meaning the generator is going to keep trying to get better at fooling the discriminator into thinking that fake is real, and the discriminator is going to keep trying to get better at discriminating between the real and the fake, and they're going to go head-to-head like that. It really is basically as easy as I've just described: we're going to build two models in PyTorch, and we're going to create a training loop which, first of all, says the loss function for the discriminator is "can you tell the difference between real and fake?" and updates its weights; and then creates a loss function for the generator which says "can you generate something which fools the discriminator?" and updates the generator's weights from that loss. And we're going to loop through that a few times and see what happens.
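Written out as code, that loop looks roughly like this. This is a schematic sketch using toy stand-in models and the classic binary-cross-entropy GAN objective, just to show the shape of the loop - it is not the Wasserstein loss or the actual models we're about to build:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins so the loop runs: a generator mapping 16-d noise to a 64-d "image",
# and a discriminator scoring 64-d vectors.
generator = nn.Sequential(nn.Linear(16, 64), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(64, 1))
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real_data = torch.randn(1000, 64)          # pretend this is the real dataset

for i in range(0, len(real_data), 64):
    real = real_data[i:i + 64]

    # 1) discriminator step: real images should score high, generated ("fake") ones low
    fake = generator(torch.randn(len(real), 16)).detach()   # detach: no generator gradient here
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real), torch.ones(len(real), 1))
              + F.binary_cross_entropy_with_logits(discriminator(fake), torch.zeros(len(fake), 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) generator step: try to make the discriminator label fresh fakes as real
    fake = generator(torch.randn(len(real), 16))
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake), torch.ones(len(real), 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```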
So let's come back to the pseudocode of the algorithm, but let's read the real code first. There are lots of different things you can do with GANs, and we're going to do something that's kind of boring but easy to understand - and it's kind of cool that it's even possible - which is to generate some pictures from nothing. Specifically, we're going to get it to draw pictures of bedrooms. If you get a chance to play around with this during the week with your own datasets, you'll find that if you pick a dataset that's very varied, like ImageNet, and get a GAN to try to create ImageNet pictures, it tends not to do so well, because it's not clear enough what you want a picture of. It's better to give it, for example, the dataset called CelebA, which is pictures of celebrities' faces - that works great with GANs, and you create really clear celebrity faces that don't actually exist. The bedroom dataset is also a good one: lots of pictures of the same kind of thing. So that's just a suggestion.

There's something called the LSUN scene classification dataset, which you can download using these steps. It's pretty huge, so I've actually created a Kaggle dataset of a 20% sample; unless you're really excited about generating bedroom images, you might prefer to grab the 20% sample. Then we do the normal steps of creating some paths, and in this case, as before, I find it much easier to go the CSV route when it comes to handling our data, so I just generate a CSV with the list of files that we want and a fake label of zero, because we don't really have labels for these at all. I actually create two CSV files: one that contains everything in the bedroom dataset, and one that just contains a random 10%. It's nice to do that because then I can use the sample most of the time when I'm experimenting, because there are well over a million files, and even just reading in the list takes a while.

This will look pretty familiar: here's a conv_block. This is from before I realized that sequential models are much better, so if you compare it to my previous conv layer with a sequential model, there are just a lot more lines of code here, but it does the same thing: conv, ReLU, batch norm. We calculate our padding, and here bias is False, so this is basically the same as before, just with a bit more code.

The first thing we're going to do is build a discriminator. A discriminator is going to receive an image as input, and it's going to spit out a number, and the number is meant to be lower if it thinks this image is real. Of course, the "lower if it's real" part doesn't appear in the architecture - that will be in the loss function - so all we have to do is create something that takes an image and spits out a number. A lot of this code is borrowed from the original authors of the paper, so some of the naming scheme and so forth is different to what we're used to - sorry about that - but I've tried to make it look at least somewhat familiar; I probably should have renamed things a little. It looks very similar to what we had before: we start out with a convolution - remember, conv_block is conv, ReLU, batch norm - and then we have a bunch of extra conv layers. This is not going to use residuals, so it looks very similar to before - a bunch of extra layers - but these are conv layers rather than res layers. And then at the end we need to append enough stride-2 conv layers that we decrease the grid size down to no bigger than 4 by 4: it's going to keep using stride 2 to divide the grid size by two until the grid is no bigger than 4. This is quite a nice way of creating as many layers as you need in a network to handle arbitrarily sized images and turn them into a fixed, known grid size.

Yes, Rachel? "Does a GAN need a lot more data than, say, dogs versus cats or NLP, or is it comparable?"
Honestly, I'm kind of embarrassed to say I am not an expert practitioner in GANs. The stuff I teach in part one is stuff I'm happy to say I know pretty close to the best way to do, so I can show you state-of-the-art results, like I just did with CIFAR-10 with the help of some of my students. I'm not there at all with GANs, so I'm not quite sure how much data you need. In general it seems you need quite a lot, but remember, the only reason we didn't need too much for dogs and cats was that we had a pre-trained model. Could we leverage pre-trained GAN models and fine-tune them? Probably - but I don't think anybody's done it, as far as I know. That could be a really interesting thing for people to think about and experiment with. Maybe people have done it and there's some literature I haven't come across - I'm somewhat familiar with the main pieces of literature in GANs, but I don't know all of it - so maybe I've missed something about transfer learning in GANs, but that would be the trick to not needing too much data.

"On the huge speed-up - the combination of the one-cycle learning rate and momentum schedule, plus the eight-GPU parallel training and the half precision - is it only possible to do the half-precision calculation with non-consumer GPUs? And why is the calculation eight times faster from single to half precision, while from double to single it's only two times faster?" Okay, so on the CIFAR-10 result: it's not eight times faster from single to half, it's about two or three times as fast. NVIDIA's claims about the flops performance of the tensor cores are academically correct but practically meaningless, because it really depends on what cores you need for which pieces - so it's about a two to three times improvement for half precision. So the half precision helps a bit, the extra GPUs help a bit, and the one cycle helps an enormous amount. Then another key piece was the playing around with the parameters I told you about: reading the Wide ResNet paper carefully, identifying the kinds of things they found there, and then writing a version of the architecture you just saw that made it really easy to fiddle around with the parameters - well, not for me, for Brett to fiddle around with, staying up all night trying every possible combination of different kernel sizes and numbers of kernels and numbers of layer groups and sizes of layer groups. And remember, we did a bottleneck, but actually we tended to focus not on bottlenecks but on widening - things that increase the size and then decrease it - because that takes better advantage of the GPU. So all those things combined together; I'd say the one cycle was perhaps the most critical, but every one of them contributed a big speed-up, and that's why we were able to get this roughly 30x improvement over the state-of-the-art CIFAR-10 time. We've got some ideas for other things, too - after this DAWNBench competition finishes, maybe we'll try to go even further; if we can get under a minute one day, that would be fun.

Okay, so here's our discriminator. The important thing to remember about an architecture is that it doesn't do anything other than have some input tensor size and rank and some output tensor size and rank. You'll see the last conv here has one channel. This is a bit different to what we're used to, because normally our last thing is a linear block, but our last thing here is a conv block, and it's only got one channel, with a grid size of something around 4 by 4 - no more than 4 by 4. So we're going to spit out - let's say it's 4 by 4 - a 4 by 4 by 1 tensor. What we then do is take the mean of that, so it goes from 4 by 4 by 1 down to a scalar. This is kind of like the ultimate adaptive average pooling, because we've got something with just one channel and we take the mean. It's a bit different from normal: usually we first do average pooling and then put it through a fully connected layer to get our one thing out; in this case, though, we get one channel out and then take the mean of that. I haven't fiddled around with why they did it that way, or what would happen instead if we did the usual average pooling followed by a fully connected layer - would it work better or not? I don't know. I rather suspect it would work better the normal way, but I haven't tried it, and I don't have a good enough intuition to know whether I'm missing something. It would be an interesting experiment to try: if somebody wants to stick an adaptive average pooling layer here and a fully connected layer afterwards with a single output, it should keep working - the loss will go down - and you can see whether it helps. Okay, so that's the discriminator.
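Shape-wise, a minimal sketch of a discriminator like that might look as follows. This is my own simplified DCGAN-style version rather than the authors' exact code, and it assumes a fixed 64x64 input just to keep the example short:

```python
import torch
import torch.nn as nn

def conv_block(ni, nf, ks=4, stride=2, pad=1):
    # conv -> leaky ReLU -> batch norm (the order the borrowed code uses), halving the grid at stride 2
    return nn.Sequential(
        nn.Conv2d(ni, nf, ks, stride=stride, padding=pad, bias=False),
        nn.LeakyReLU(0.2, inplace=True),
        nn.BatchNorm2d(nf))

class Discriminator(nn.Module):
    def __init__(self, in_size=64, nf=64):
        super().__init__()
        layers, size = [conv_block(3, nf)], in_size // 2
        while size > 4:                              # keep striding by 2 until the grid is <= 4x4
            layers.append(conv_block(nf, nf * 2))
            nf, size = nf * 2, size // 2
        layers.append(nn.Conv2d(nf, 1, kernel_size=4, bias=False))   # final conv: one channel out
        self.features = nn.Sequential(*layers)

    def forward(self, x):
        out = self.features(x)            # shape (batch, 1, h, w) with a tiny h and w
        return out.mean(dim=(1, 2, 3))    # collapse to a single score per image

d = Discriminator()
print(d(torch.randn(2, 3, 64, 64)).shape)   # torch.Size([2])
```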
So there's going to be a training loop, and let's assume we've already got a generator - somebody says, "okay Jeremy, here's a generator, it generates bedrooms; I want you to build a model that can figure out which ones are real and which ones aren't." So I'd take the dataset and basically label a bunch of images which are fake bedrooms from the generator and a bunch of images of real bedrooms from my LSUN dataset, stick a one or a zero on each one, and then try to get the discriminator to tell the difference. That's going to be simple enough. But I haven't been given a generator - I need to build one. And we haven't talked about the loss function yet; we're just going to assume there's some loss function that does this thing.

So, a generator is also an architecture, which doesn't do anything by itself until we have a loss function and data - but what are the ranks and sizes of its tensors? The input to the generator is going to be a vector of random numbers; in the paper they call it the "prior". How big? I don't know - some length like 64 or 128. And the idea is that a different bunch of random numbers will generate a different bedroom. So our generator has to take a vector as input, and it's going to stick it through, in this case, a sequential model, which is going to turn that vector into a rank-4 tensor - or, if you take off the batch dimension, a rank-3 tensor: height by width by channels. You can see at the end here that our final step has a number of output channels of 3, because we're going to create a three-channel image of some size.

Yes, Rachel? "In conv_block's forward, is there a reason why batch norm comes after ReLU?"
No, there's not - it's just what they had in the code I borrowed from; I think in ResNet the order is reversed. So, unless my intuition about GANs is all wrong and they for some reason need to be different to what I'm used to, I would normally expect to go ReLU then batch norm - that's actually the order that makes more sense to me - but I think the order I had in the darknet notebook was what they used in the Darknet paper. I don't know; everybody seems to have a different order for these things, and in fact most people for CIFAR-10 use a different order again, which is batch norm, then ReLU, then conv. That's a kind of quirky way of thinking about it, but it turns out that for residual blocks it often works better - it's called a pre-activation ResNet, so if you Google for "pre-activation ResNet" you can read about it. There are a few - not so much papers, but blog posts - out there where people have experimented with different orders of these things, and it seems to depend a lot on the specific dataset and what you're doing with it, although in general the difference in performance is small enough that you won't care unless it's for a competition.

Okay, so the generator needs to start with a vector and end up with a rank-3 tensor. We don't really know how to do that yet, so how do we do it? We need to use something called a deconvolution, or, as they call it in PyTorch, a transposed convolution - same thing, different name. And a deconvolution is something which, rather than decreasing the grid size, increases the grid size. As with all things, it's easiest to see in an Excel spreadsheet.

Here's a convolution. Let's say we start with a 4 by 4 grid, with a single channel, and put it through a 3 by 3 kernel, again with a single output filter - so a single channel in, a single-filter kernel. If we don't add any padding, we're going to end up with 2 by 2, because that 3 by 3 kernel can only sit in one of two places across and one of two places down. So there's our convolution - remember, the convolution is just the sum of the products of the kernel and the appropriate grid cells - a standard 3 by 3 conv, one channel, one filter. The idea now is that I want to go in the opposite direction: I want to start with my 2 by 2 and create a 4 by 4 - specifically, the same 4 by 4 that I started with - and I want to do that by using a convolution. So how would I do that?
Well, if I have a three by three convolution, then if I want to create a four by four output I'm going to need this much padding, because with this much padding that filter can go in any one of four places across and four places up and down, so I end up with four by four. So let's say my convolutional filter starts as just a bunch of zeros; then I can calculate my error for each cell just by taking the subtraction, and I can get the sum of absolute values — the L1 loss — by summing up the absolute values of those errors. So now I can use optimization — in Excel that's called Solver — to do a gradient descent: set that cell to be minimized by changing my filter, and go solve. And you can see it's come up with a filter such that 15.7 versus 16, 17 is about right. So it's not perfect, and in general you can't assume that a deconvolution can exactly recreate the thing you want, because there are only nine numbers in the filter and sixteen things you're trying to create — but it's made a pretty good attempt. So this is what a deconvolution looks like: a stride one, three by three deconvolution on a two by two grid input.

Did you have a question? "How difficult is it to create a discriminator to identify fake news versus real news?" You don't need anything special — that's just a classifier. You would use the NLP classifier from lesson four; there's no generative piece in that case. You just need a dataset that says these are the things we believe are fake news and these are the things we consider to be real news, and it should actually work very well — to the best of our knowledge, if you try it you should get as good a result as anybody else has got. Whether it's good enough to be useful in practice, I don't know. You were going to say it's very hard using the technique I've described — well, there isn't a good solution that does that well yet, but I don't think anybody in our course has tried it, and nobody else outside our course knows about this technique. As we've learned, we've just had a very significant jump in NLP classification capabilities. I think the best you could do at this stage would be to generate a kind of triage that says these things look pretty sketchy based on how they're written, and then some human could go and fact-check them. An NLP classifier — an RNN — can't fact-check things, but it could recognize that something is written in that kind of highly sensationalized style which fake news is often written in, so maybe those ones are worth paying attention to. I think that's probably the best you could hope for without drawing on some kind of external data source. But it's important to remember: a discriminator is basically just a classifier, and you don't need any special techniques beyond what we've already learned for NLP classification.
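Coming back to deconvolutions: here's roughly what that looks like in PyTorch, just to show the shapes — a sketch with a random kernel, not the solved one from the spreadsheet:

```python
import torch
import torch.nn as nn

# A stride-1, 3x3 transposed convolution takes a 2x2 single-channel input
# back up to 4x4, just like the spreadsheet example.
x = torch.randn(1, 1, 2, 2)                                  # (batch, ch, h, w)
deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, padding=0)
print(deconv(x).shape)                                       # torch.Size([1, 1, 4, 4])

# With stride 2 the grid size doubles relative to the input instead.
deconv2 = nn.ConvTranspose2d(1, 1, kernel_size=4, stride=2, padding=1)
print(deconv2(torch.randn(1, 1, 4, 4)).shape)                # torch.Size([1, 1, 8, 8])
```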
OK, so to do that kind of deconvolution in PyTorch you say nn.ConvTranspose2d, and in the normal way you give it the number of input channels, the number of output channels, the kernel size, the stride, the padding, the bias — those parameters are all the same. The reason it's called a conv transpose is because it turns out this is the same as the calculation of the gradient of a convolution — that's basically why they call it that.

There's a really nice example back on the old Theano website that comes from a really nice paper which shows you some visualizations. This is actually the one we just saw — a 2×2 deconvolution — and if there's a stride of 2, then you don't just have padding around the outside, you actually put padding in the middle as well. They're not actually implemented quite this way, because it's slow to do in practice; they implement them a different way, but it all happens behind the scenes and we don't have to worry about it. We've talked about this convolution arithmetic tutorial before, and if you're still not comfortable with convolutions, and in order to get comfortable with deconvolutions, this is a great site to go to. If you want the paper, just Google "convolution arithmetic" — it'll be the first thing that comes up. Here it is — the Theano tutorial actually comes from this paper, but the paper doesn't have the animated GIFs.

It's interesting, then, that a deconv block looks identical to a conv block except it's got the word "transpose" written in it. We just go conv, ReLU, batch norm as before; it's got input filters and output filters; the only difference is that stride 2 means the grid size will double rather than halve.

"Both nn.ConvTranspose2d and nn.Upsample seem to do the same thing, i.e. expand the grid size (height and width) from the previous layer. Can we say that ConvTranspose2d is always better than Upsample, since Upsample is merely resizing and filling unknowns with zeros or interpolation?" No, you can't. There's a fantastic interactive article on distill.pub called "Deconvolution and Checkerboard Artifacts" which points out that what we're doing right now is extremely suboptimal — but the good news is that everybody else does it too. If you have a look here, can you see these checkerboard artifacts?
It's all dark blue, light blue, dark blue, light blue — and these are all from actual papers. Basically they noticed that every one of these papers with generative models has these checkerboard artifacts, and what they realized is that it's because when you have a stride 2 convolution with a size 3 kernel, they overlap, so some grid cells get twice as much activation — and so even if you start with random weights, you end up with a checkerboard artifact. You can kind of see it here, and the deeper you get, the worse it gets. Their advice is actually less direct than it ought to be; I've found that for most generative models, upsampling is better. If you do nn.Upsample, all it does is basically the opposite of pooling: it replaces this one grid cell with four, two by two. There are a number of ways to upsample — one is just to copy the value across to those four, another is to use bilinear or bicubic interpolation, various techniques to try to create a smooth upsampled version — and you can pretty much choose any of them in PyTorch. So if you do a two by two upsample and then a regular stride one, three by three conv, that's another way of doing the same kind of thing as a conv transpose: it's doubling the grid size and doing some convolutional arithmetic on it. I've found that for generative models it pretty much always works better, and in that distill.pub publication they kind of indicate that maybe that's a good approach, but they don't just come out and say "just do this" — whereas I would just say: just do this. Having said that, for GANs I haven't had that much success with it yet, and I think it probably requires some tweaking to get it to work — I'm sure some people have got it to work. The issue, I think, is that in the early stages it doesn't create enough noise: I had a version where I tried to do it with an upsample, and you could kind of see that the noise didn't look very noisy. Anyway, it's an interesting question, but next week when we look at style transfer and super resolution and so on, I think you'll see that nn.Upsample really comes into its own.
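Here's a tiny sketch of the two alternatives side by side — the parameter choices are illustrative, not copied from the notebook:

```python
import torch.nn as nn

# The transposed-convolution block (roughly the notebook's DeconvBlock idea):
# stride 2 means the grid size doubles.
def deconv_block(ni, nf):
    return nn.Sequential(
        nn.ConvTranspose2d(ni, nf, kernel_size=4, stride=2, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(nf),
    )

# The alternative the distill.pub article points towards:
# upsample by 2, then an ordinary stride-1 3x3 conv.
def upsample_block(ni, nf):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),   # or mode='bilinear'
        nn.Conv2d(ni, nf, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(nf),
    )
```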
OK, so for the generator we can now basically start with the vector — we can decide to think of it not as a vector but as a 1 by 1 grid cell — and then turn it into a 4 by 4, an 8 by 8, and so forth. That's why we have to make sure the final size is a suitable multiple, so that we can actually create something of the right size. And you can see it's doing the exact opposite of before: it's making the grid size bigger and bigger, by two at a time, as long as it can, until it gets to half the size that we want; then we add one more conv at the end with no stride, and then one more conv transpose to finally get to the size that we wanted, and we're done. Finally we put that through a tanh, and that's going to force us into the zero to one range — because of course we don't want to spit out arbitrarily sized pixel values. So we've got a generator architecture which spits out an image of some given size, with the correct number of channels, and with values between zero and one.

At this point we can create our model data object. These things take a while to train, so I just made it 128 by 128 — this is just a convenient way to make it a bit faster — and that's going to be the size of the input, but then we're going to use transformations to turn it into 64 by 64. There have been more recent advances which attempt to really scale this up to high-resolution sizes, but they still tend to require either a batch size of one or lots and lots of GPUs, and we're trying to do things we can do on single consumer GPUs here. So here's an example of one of the 64 by 64 bedrooms.

We're going to do pretty much everything manually, so let's go ahead and create our two models: our generator and our discriminator. As you can see, they're DCGAN — in other words, they're the same modules that appeared in that paper. So if you're interested in reading the papers, it's well worth going back and looking at the DCGAN paper to see what these architectures are, because it's assumed, when you read the Wasserstein GAN paper, that you already know that.

Yes? "Shouldn't we use a sigmoid if we want values between zero and one?" I always forget which one's which — sigmoid is zero to one, tanh is minus one to one. I vaguely remember thinking about this when I was writing this notebook and realizing that minus one to one made sense for some reason, but I can't remember what that reason was now, so let me check that during the week and remind me if I forget. Good question, thank you.

OK, so we've got our generator and our discriminator, so we need a function that returns a prior vector — a bunch of noise. We do that by creating a bunch of zeros; nz is the size of z — very often in our code, if you see a mysterious letter it's because that's the letter they used in the paper, and z is their name for the noise vector. So there's the size of our noise vector, and then we use a normal distribution to generate random numbers inside it. That needs to be a Variable, because it's going to be participating in the gradient updates. So here's an example of creating some noise, and here are four different pieces of noise.

We need an optimizer in order to update our weights. In the Wasserstein GAN paper they told us to use RMSprop — so when you see a paper say "do an RMSprop update", that's nice: we can just do an RMSprop update with PyTorch. They suggested a learning rate of 5e-5; I think I found 1e-4 seemed to work, so I just made it a bit bigger.
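As a sketch, the noise function and the optimizers look something like this — netD and netG are assumed to be the discriminator and generator modules we just built:

```python
import torch
from torch import optim
from torch.autograd import Variable

nz = 64   # size of the noise vector z -- the paper's "prior" (64 or 128, say)

def create_noise(bs):
    # a batch of bs random vectors drawn from a normal distribution, shaped
    # (bs, nz, 1, 1) so the generator can treat each one as a 1x1 grid cell
    return Variable(torch.zeros(bs, nz, 1, 1).normal_(0, 1))

# RMSprop, as the Wasserstein GAN paper prescribes; 5e-5 is their suggestion,
# 1e-4 the slightly larger rate used here.
optimizerD = optim.RMSprop(netD.parameters(), lr=1e-4)
optimizerG = optim.RMSprop(netG.parameters(), lr=1e-4)
```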
So now we need a training loop, and this is the thing that's going to implement this algorithm. A training loop is going to go through some number of epochs that we get to pick, so that's going to be a parameter. And remember, when you do everything manually, you've got to remember all the manual steps: one is that you have to set your modules into training mode when you're training them and into evaluation mode when you're evaluating them, because in training mode the batch norm updates happen and dropout happens, and in evaluation mode those two things get turned off — that's basically the difference. So we put it into training mode, we grab an iterator from our training data loader, we see how many steps we have to go through, and then we use tqdm to give us a progress bar and go through that many steps.

The first step of this algorithm is to update the discriminator — in this paper they don't call it a discriminator, they call it a critic, and w are the weights of the critic. So the first step is to train our critic a little bit, then we're going to train our generator a little bit, and then we go back to the top of the loop. So there's a while loop on the outside, and inside that there's another loop for the critic — here's our little loop inside for the critic, which the code calls a discriminator.

So here's what we're going to do. We've got a generator, and at the moment it's random, so it's going to generate stuff that looks something like this, and we need to first of all teach our discriminator to tell the difference between that and a bedroom. It shouldn't be too hard, you would hope. We do it in basically the usual way, but with a few little tweaks. First of all we grab a mini-batch of real bedroom photos — we just grab the next batch from our iterator and turn it into a Variable — and we calculate the loss for that: how fake does the discriminator think these real ones look? Then we create some fake images: we create some random noise and stick it through our generator, which at this stage is just a bunch of random weights, and that gives us a mini-batch of fake images. Then we put that through the same discriminator module as before to get the loss for that: how fake do the fake ones look? And remember, when you do everything manually, you have to zero the gradients in your loop — if you've forgotten about that, go back to the part one lesson where we do everything from scratch. So finally, the total discriminator loss is equal to the real loss minus the fake loss. You can see that here in the paper they don't talk about the loss directly — they actually just write down one of the gradient updates; this ∇w here is the symbol for "get the gradients".
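Condensed into a sketch, one critic step looks something like this — not line-for-line the notebook's code; it assumes netD, netG, create_noise and optimizerD from above, and that netD returns one score per image:

```python
def step_D(real_batch):
    # one critic update: (real loss) - (fake loss), then an RMSprop step
    netD.zero_grad()
    real = Variable(real_batch)
    lossD_real = netD(real).mean()            # critic score on real bedrooms

    noise = create_noise(real.size(0))
    fake  = netG(noise)                       # generated bedrooms
    lossD_fake = netD(fake.detach()).mean()   # detach: don't backprop into G here

    lossD = lossD_real - lossD_fake           # "real loss minus fake loss"
    lossD.backward()
    optimizerD.step()
    return lossD
```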
Inside that gradient is the loss. And try to learn to throw away, in your head, all of the boring stuff: when you see a sum over m divided by m, that means take the average, so just replace it with np.mean in your head — and there's another np.mean right there. You want to get quick at recognizing these common idioms: any time you see "one over m, sum over m", you go, OK, np.mean.

So we're taking the mean of this and the mean of that. What's x^(i)? It looks like x to the power of i, but it's not — math notation is very overloaded. They tell us here what x^(i) is: a set of m samples from a batch of the real data — in other words, a mini-batch. When you see something saying "sample", it means just grab a row, grab a row, grab a row, m times, and we call the first row x^(1), the second row x^(2). One of the annoying things about math notation is that the way we index into arrays is different everywhere — subscripts, superscripts, things in brackets, commas, square brackets, whatever — so you've just got to look in the paper: at some point they'll say "take the ith row of this matrix" or "the ith image in this batch", and in this case it's a superscript in parentheses. So that's all "sample" means, and curly brackets mean it's just a set of them. This little squiggle (∼) followed by something means "according to some probability distribution", and very often in papers it simply means: hey, you've got a bunch of data — grab a bit of it at random. The probability distribution of the data you have is the data you have. So this says grab m things at random from your real data, and this says grab m things at random from your prior samples — in other words, call create_noise to create m random vectors.

So now we've got m real images, and each one gets put through our discriminator. We've got m bits of noise, and each one gets put through our generator to create m generated images; each one of those gets put through f_w — that's the same thing, our discriminator — to try to figure out whether they're fake or not. Then it's this minus this, and the mean of that, and then finally get the gradient of that in order to use RMSprop to update our weights with some learning rate. In PyTorch we don't have to worry about computing the gradients ourselves: we just specify the loss and say loss.backward(), then the discriminator optimizer's .step().

There is one key step, though, which is that we have to keep all of our weights — the parameters in the PyTorch module — in this small range between -0.01 and 0.01. Why? Because the mathematical assumptions that make this algorithm work only apply in a small ball. I'm not going to go through it — I think it's kind of interesting to understand the math of why that's the case, but it's very specific to this one paper, and understanding it won't help you understand any other paper, so only study it if you're interested. I think it's nicely explained and I think it's fun, but it won't be information you'll reuse elsewhere unless you get super into GANs.
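In code, that constraint is just a couple of lines (a sketch, using the netD name from above):

```python
# the one extra constraint that makes it a Wasserstein GAN: clamp the
# critic's weights into a small ball so the paper's assumptions hold
for p in netD.parameters():
    p.data.clamp_(-0.01, 0.01)
```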
I'll also mention that after the paper came out, the Improved Wasserstein GAN came out, which said, hey, there are better ways to ensure that your weight space stays in this tight ball — basically by penalizing gradients that are too high — so nowadays there are slightly different ways to do this. Anyway, that's why that line of code is there. It's kind of the key contribution: this one line of code is the one line you add to make it a Wasserstein GAN, basically, but the work was all in knowing that that's the thing you can do that makes everything work.

OK, so at the end of this we've got a discriminator that can tell the difference between real bedrooms and our totally random, crappy generated images. So let's now try to create some better images. Now set trainable to false for the discriminator, set trainable to true for the generator, zero out the gradients of the generator, and now our loss is f_w — remember, that's the discriminator — of the generator applied to some more random noise. Here's our random noise, here's our generator, here's our discriminator — I think I can remove that now because I think I've put it inside the discriminator, but I won't change it now because it'll confuse me. So it's exactly the same as before, where we ran the generator on the noise and then passed that to the discriminator, but this time the thing that's trainable is the generator, not the discriminator. In other words, in this pseudo-code the thing they update is θ, the generator's parameters, rather than w, the discriminator's parameters. Hopefully you can see now that this w down here is telling you these are the parameters of the discriminator, and this θ here is telling you these are the parameters of the generator. Again, it's not universal mathematical notation — it's a thing they're doing in this particular paper — but when you see a suffix like that, try to think about what it's telling you. So: take some noise, generate some images, try to figure out whether they're fake or real, and use that to get gradients with respect to the generator (as opposed to earlier, when we got them with respect to the discriminator), and use that to update our weights with RMSprop with an α learning rate.

You'll notice that it's kind of unfair: the discriminator gets trained n_critic times — which they set to five — for every time we train the generator once. The paper talks a bit about this, but the basic idea is that there's no point making the generator better if the discriminator doesn't know how to discriminate yet. That's why we've got this while loop, and here's that five. And something which was added, I think in a later paper or maybe in the supplementary material, is the idea that from time to time — and a bunch of times at the start — you should do more steps of the discriminator, to make sure the discriminator stays pretty capable.
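Putting the pieces together, the alternating loop looks roughly like this — a sketch only, assuming netD, netG, create_noise, step_D, the optimizers, a batch size bs and a model data object md from above; the 100/25/500 schedule for extra critic steps follows the WGAN reference code rather than anything specific to this notebook:

```python
from tqdm import tqdm

def train(niter, first=True):
    gen_iterations = 0
    for epoch in range(niter):
        netD.train(); netG.train()
        data_iter = iter(md.trn_dl)              # md: the model data object
        i, n = 0, len(md.trn_dl)
        with tqdm(total=n) as pbar:
            while i < n:
                # train the critic harder at the start, and every so often
                d_iters = (100 if (first and gen_iterations < 25)
                                  or gen_iterations % 500 == 0 else 5)
                j = 0
                while j < d_iters and i < n:
                    j += 1; i += 1
                    for p in netD.parameters():
                        p.data.clamp_(-0.01, 0.01)   # weight clipping
                    real, _ = next(data_iter)        # loader yields (x, y)
                    step_D(real)                     # critic step sketched earlier
                    pbar.update()
                # one generator step (the critic's weights are held fixed here)
                netG.zero_grad()
                lossG = netD(netG(create_noise(bs))).mean()
                lossG.backward()
                optimizerG.step()
                gen_iterations += 1
```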
OK, so: do a bunch of steps of training the discriminator, so it gets better at telling the difference between real and fake, then do one step of making the generator better at generating — and that's an epoch. So let's train that for one epoch. Then let's create some fixed noise so we can generate some examples — actually, we're going to do that later. First let's decrease the learning rate by ten and do one more pass, so we've now done two epochs. And now let's take our noise, pass it to our generator, put it through our denormalization to turn it back into something we can see, and plot it — and we have some bedrooms. They're not real bedrooms, and some of them don't look particularly like bedrooms, but some of them look a lot like bedrooms. So that's the idea — that's a GAN. I think the best way to think about a GAN is that it's an underlying technology that you'll probably never use like this, but that you'll use in lots of interesting ways. For example, we're going to use it to create a CycleGAN, and we're going to use the CycleGAN to turn horses into zebras. You could also use it to turn Monet paintings into photos, or to turn photos of Yosemite in summer into winter.

Yes, Rachel — two questions. "Is there any reason for using RMSprop specifically as the optimizer, as opposed to Adam?" I don't remember it being explicitly discussed in the paper; I don't know if it's just experimental or whether there's a theoretical reason. Have a look in the paper and see what it says — I don't recall. "What would be a reasonable way of detecting overfitting while training, or of evaluating the performance of one of these GAN models once we're done training? In other words, how does the notion of train/validation/test sets translate to GANs?" That's an awesome question, and a lot of people make jokes about how GANs are the one field where you don't need a test set — and people take advantage of that by making stuff up and saying it looks great. There are some pretty famous problems with GANs. One of them is called mode collapse, and mode collapse happens where you look at your bedrooms and it turns out there are basically only three kinds of bedroom that every possible noise vector maps to — you look at your gallery and it turns out they're all just the same thing, or just three different things. Mode collapse is easy to see if you collapse down to a small number of modes, like three or four. But what if you have a mode collapse down to 10,000 modes?
So there are only 10,000 possible bedrooms that all of your noise vectors collapse to — you wouldn't be able to see that here, because it's pretty unlikely you'd get two identical bedrooms out of 10,000. Or what if every one of these bedrooms is basically a direct copy of one of the inputs — it's basically memorized some input? Could that be happening? And the truth is, most papers don't do a good job — or sometimes any job — of checking those things. So the question of how we evaluate GANs — and even the point that, hey, maybe we should actually evaluate GANs properly — is something that's not widely enough understood, even now. Some people are really trying to push this. Ian Goodfellow — who a lot of you will know because he came and spoke at a lot of the book club meetings last year, and of course was the first author on the most famous deep learning book, and is the inventor of GANs — has been sending a continuous stream of tweets reminding people about the importance of testing GANs properly. So if you see a paper that claims exceptional GAN results, this is definitely something to look at: have they talked about mode collapse, have they talked about memorization, and so forth.

OK, so this next bit is going to be really straightforward, because it's just a neural net, right? All we're going to do is create a dataset containing lots of zebra photos, and with each one we'll pair an equivalent horse photo, and we'll just train a neural net that goes from one to the other. Or you could do the same thing for every Monet painting: create a dataset containing the photo of the place — oh wait, that's not possible, because the places Monet painted aren't there anymore, and there aren't exact zebra versions of horses, and... oh wait, how on earth is this going to work? This seems to break everything we know about what neural nets can do and how they do it.

All right, Rachel, you're going to ask me a question to spoil our whole train of thought — come on. "Can GANs be used for data augmentation?" Yeah, absolutely, you can use a GAN for data augmentation. Should you? I don't know. There are some papers that try to do semi-supervised learning with GANs; I haven't found any that are particularly compelling, showing state-of-the-art results on really interesting, widely studied datasets. I'm a little skeptical, and the reason is that in my experience, if you train a model with synthetic data, the neural net becomes fantastically good at recognizing the specific problems of your synthetic data, and that ends up being what it learns from. There are lots of other ways of doing semi-supervised models which do work well. There are some places it can work — for example, you might remember Otavio Good, who created that fantastic visualization in part one of the zooming convnet, where he showed a letter going through MNIST. He, at least at that time, was the number one guy in autonomous remote-controlled car competitions, and he trained his model on synthetically augmented data, where he basically took real videos of a car driving around a circuit and added fake people and fake other cars and stuff like that. I think that worked well — A, because he's kind of a genius, and B, because I think he had a well-defined little subset that he had to work in. But in general, it's really hard.
It's really, really hard to use synthetic data — I've tried using synthetic data in models for decades now (obviously not with GANs, because they're pretty new), and in general it's very hard to do. It's a very interesting research question.

All right, so somehow these folks at Berkeley created a model that can turn a horse into a zebra, despite not having any paired photos — unless they went out there and painted horses and took before-and-after shots, but I don't believe they did. So how on earth did they do this? It's kind of genius. I will say, the person I know who's doing the most interesting practical work with CycleGAN right now is one of our students, Helena Sarin — she's the only artist I know of who is a CycleGAN artist. Here's an example I love: she created this little doodle in the top left and then trained a CycleGAN to turn it into this beautiful painting in the bottom right. Here are some more of her amazing works. I think it's really interesting — I mentioned at the start of this class that GANs are in the category of stuff that's not there yet but is nearly there, and in this case there's at least one person in the world now who's creating beautiful and extraordinary artworks using GANs. There are at least maybe a dozen people I know of who are doing interesting creative work with neural nets more generally, and the field of creative AI is going to expand dramatically. I think it's interesting with Helena — I don't know her personally, but from what I understand of her background she's a software developer as her full-time job and an artist as her hobby, and she's started combining the two by saying, gosh, I wonder what this particular tool could bring to my art. If you follow her Twitter account — we'll make sure we add it to the wiki so somebody can find it — she basically posts a new work almost every day, and they're always pretty amazing.

So here's the basic trick, and this is from the CycleGAN paper. We're going to have two collections of images — assuming we're doing this with images — and the key thing is that they're not paired images. We don't have a dataset of horses and the equivalent zebras; we've got a bunch of horses and a bunch of zebras. Grab one horse, grab one zebra: we've now got an X and a Y — let's say X is horse and Y is zebra. We're going to train a generator — what they call here a mapping function — that turns a horse into a zebra; we'll call that mapping function G. And we'll create another mapping function — a generator — that turns a zebra into a horse, and we'll call that F. We'll create a discriminator, just like we did before, which is going to get as good as possible at recognizing real from fake horses — that'll be D_X — and then another discriminator which is going to get as good as possible at recognizing real from fake zebras — we'll call that D_Y. So that's our starting point.

But then the key thing to making this work — we're kind of building up a loss function here: here's one bit of the loss function, here's a second bit — is that we're going to create something called cycle-consistency loss, which says: after you turn your horse into a zebra with your G generator, and check whether or not I can recognize that it's real... I keep forgetting which one's the horse and which one's the zebra.
I apologize if I get my X's and Y's backwards. So: I turn my horse into a zebra, and then I'm going to try to turn that zebra back into the same horse that I started with, and I'm going to have another function that checks whether this horse — which I've generated knowing nothing about X, generated entirely from the zebra — is similar to the original horse or not. The idea is that if your generated zebra doesn't look anything like your original horse, you've got no chance of turning it back into the original horse. So a loss which compares x̂ to x is going to be really bad unless you can go into Y and back out again, and you're probably only going to be able to do that if you're able to create a zebra that looks enough like the original horse that you know what the original horse looked like. And vice versa: take your zebra, turn it into a fake horse, check that you can recognize it, then try to turn it back into the original zebra and check that it looks like the original. So notice that this F is our zebra-to-horse mapping and this G is our horse-to-zebra mapping — the G and the F are each doing two things: they're both turning the original horse into the zebra and turning the zebra back into the original horse. Notice also that there are only two generators — there isn't a separate generator for the reverse mapping; you have to use the same generator that was used for the original mapping. So this is the cycle-consistency loss, and I just think it's genius. The idea that this is a thing that could even be possible — honestly, when this came out it had never occurred to me as a thing I could even try to solve; it seemed so obviously impossible, and then the idea that you can solve it like this — I just think it's so damn smart.

It's good to look at the equations in this paper, because they're a nice example: they're written pretty simply. It's not like some of the stuff in the Wasserstein GAN paper, with lots of theoretical proofs and whatever else; in this case they're just equations that lay out what's going on, and you really want to get to a point where you can read them and understand them. So let's start talking through them. We've got a horse and a zebra. For some mapping function G — our horse-to-zebra mapping function — there's a GAN loss, which is the bit we're already familiar with. It says: we've got a horse, a zebra, a fake-zebra recognizer, and a horse-to-zebra generator. The loss is exactly what we saw before: it's our ability to draw one zebra out of our zebras and recognize whether it's real or fake, plus taking a horse, turning it into a zebra, and recognizing whether that's real or fake — and then you do one minus the other. In this case they've got a log in there; the log's not terribly important. So this is the thing we just saw — that's why we did Wasserstein GAN first — this is just a standard GAN loss in math form.

Did you have a question, Rachel? "All of this sounds awfully like translating into another language and then back to the original. Have GANs, or anything equivalent, been tried in translation?" Not that I know of — I wonder, actually; I know someone's been working on this.
Someone has posted a link to the unsupervised machine translation work — yeah, there is unsupervised machine translation, which does do something like this, but I haven't looked at it closely enough to know whether it's nearly identical or just vaguely similar. To back up to what I do know: normally with translation you require paired input — parallel texts, "this is the French translation of this English sentence". I do know there have been a couple of recent papers showing the ability to create good-quality translation models without paired data; I haven't implemented them, and I don't understand anything I haven't implemented, but they may well be using the same basic idea. We'll look at it during the week and get back to you.

All right, so we've got our GAN loss, and the next piece is the cycle-consistency loss. The basic idea here is: we start with our horse, use our zebra generator on it to create a zebra, use our horse generator on that to create a horse, and then compare that to the original horse. This double-lines-with-a-one notation we've seen before — it's the L1 loss, the sum of the absolute values of the differences. (If it were a two it would be the L2 loss, the 2-norm — the square root of the sum of squared differences.) And again, we now know this squiggle idea: from our horses, grab a horse — that's what we mean by sampling from a distribution. There are all kinds of distributions, but most commonly in these papers we're using an empirical distribution: we've got some rows of data, grab a row. So when you see "something ∼ p_data", it means grab something from the data, and we're going to call that thing x. So: from our horse pictures, grab a horse, turn it into a zebra, turn it back into a horse, compare it to the original, and sum up the absolute values of the differences. Do that for horse-to-zebra, do it for zebra-to-horse as well, add the two together, and that's our cycle-consistency loss.

So now we get our full loss function, and the whole loss function depends on our horse generator, our zebra generator, our horse recognizer, and our zebra recognizer (discriminator). We're going to add up the GAN loss for recognizing horses, the GAN loss for recognizing zebras, and the cycle-consistency loss for our two generators. And then we've got a lambda here — hopefully we're used to this idea by now: when you've got two different kinds of loss, you chuck in a parameter you can multiply one of them by so they're at about the same scale. We did a similar thing with our bounding-box loss compared to our classifier loss when we did the localization stuff. Then, with this loss function, we're going to try to maximize the capability of the discriminators at discriminating whilst minimizing it for the generators — so the generators and the discriminators are facing off against each other. When you see this "min max" thing in papers — and you'll see it a lot — it basically means that in your training loop, one thing is trying to make something better while another thing is trying to make it worse.
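Written out the way the CycleGAN paper writes them, the pieces we just walked through are:

```latex
\mathcal{L}_{\text{GAN}}(G, D_Y, X, Y) =
  \mathbb{E}_{y \sim p_{\text{data}}(y)}[\log D_Y(y)]
  + \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log(1 - D_Y(G(x)))]

\mathcal{L}_{\text{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big]

\mathcal{L}(G, F, D_X, D_Y) =
  \mathcal{L}_{\text{GAN}}(G, D_Y, X, Y)
  + \mathcal{L}_{\text{GAN}}(F, D_X, Y, X)
  + \lambda\, \mathcal{L}_{\text{cyc}}(G, F)

G^*, F^* = \arg\min_{G,F}\ \max_{D_X, D_Y}\ \mathcal{L}(G, F, D_X, D_Y)
```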
There are lots of ways to do that, but most commonly you alternate between the two, and you'll often see this referred to in math papers just as "min max". So when you see min max, you should immediately think: adversarial training.

So let's look at the code — we probably won't be able to finish this today. We're going to do something almost unheard of, which is that I started looking at somebody else's code and I was not so disgusted that I threw the whole thing away and did it myself. I actually said: I quite like this — I like it enough that I'm going to show it to my students. This is where the code comes from: it's from one of the people that created the original CycleGAN code, and they've created a PyTorch version. I had to clean it up a little bit, but it's actually pretty damn good — I think it's the first time I've found code that I didn't feel the need to rewrite from scratch before showing it to you. And the cool thing about this — one of the reasons I liked doing it this way, finally finding something that's not awful — is that you're now going to get to see almost all the bits of fastai, or all the relevant bits, written in a different way by somebody else: how they do datasets, data loaders, models, training loops, and so forth.

You'll find there's a cgan directory, which is basically nearly this repo with some cleanups, which I hope to submit as a PR at some point. It was written in a way that unfortunately made it a bit over-connected to how they were using it as a script, so I cleaned it up a little so I could use it as a module; other than that it's pretty similar. So cgan is basically their code, copied from their GitHub repo, with some minor changes. The way this mini-library has been set up is that the configuration options are assumed to be passed in to a script, so they've got this TrainOptions parser, and you can see I'm basically passing in an array of script-style options: where's my data, how many threads do I want, dropout, how many iterations, what am I going to call this model, which GPU do I want to run it on. That gives me an opt object, and when you look at what it contains you'll see some things I didn't mention — that's because it's got defaults for everything else.

So, rather than using fastai stuff, we're going to largely use cgan stuff. The first thing we need is a data loader, and this is also a great opportunity for you to practice your ability to navigate through code with your editor or IDE of choice. We're going to start with create_data_loader: you should be able to go Find Symbol — or in vim, use tags — to jump straight to create_data_loader. We can see that's creating a custom dataset data loader, and then we can see the custom dataset data loader is a BaseDataLoader. OK, so that doesn't really do much: basically we can see that it's going to use a standard PyTorch DataLoader, and that's good, because we know that if you're going to use a standard PyTorch DataLoader you have to pass it a dataset, and we know that a dataset is something that has a length and an indexer. So presumably, when we look at create_dataset, it's going to do that. Here is create_dataset.
Okay So this library actually does more than just cycle GAN It handles both aligned and unaligned image pairs We know that our image pairs are unaligned So we've got an unaligned data set Okay, here it is and as expected it has a get item And a length good And so the main obviously the main the length is just Whatever of our so a and b is our horses and zebras that we've got two sets So whichever one is longer is the length of the data loader and so get item is just going to go ahead and Randomly grab something from each of our two Horses and zebras open them up with pillow or p.i.l run them through some transformations And then we could either be turning horses into zebras or zebras into horses So there's some direction and then it will just go ahead and then return our horse and our zebra And our path to the horse and the path to zebra. So Yeah, hopefully you can kind of see that this is looking pretty similar to the kind of stuff that fast ai does fast ai obviously does Quite a lot more when it comes to transforms and performance and stuff like this But you know remember this is like Research code for this one thing like it's pretty cool that they Did all this work So we've got a data loader So we can go and load our data into it And so that'll tell us how many mini batches are in it. That's the length of a data loader in pi torch Next step we've got a data loader is to create a model Um So you can go go tag for create model There it is Okay, same idea. We've got different kinds of models. So we're going to be doing a cycle GAN So here's our cycle GAN model Okay, so there's quite a lot of stuff in a cycle GAN model. So let's go through and find out What's going to be used? but basically at this stage We've just called initializer And so when we initialize it You can see it's going to go through and it's going to define two generators Which is not surprising a generator for our horses and a generator for zebras Let's see what else we've got here Okay, there's some way for it to generate a pool of fake data And then here we're going to grab our Our GAN loss and as we talked about our cycle consistency loss is an L1 loss That's interesting. They're going to use atom. So Obviously for cycle GANs and they found atom works pretty well And so then we're going to have an optimizer for our horse discriminator an optimizer for our zebra discriminator and an optimizer for our generator Okay The optimizer for the generator is going to contain the parameters both for the horse generator and the zebra generator all in one place So okay, so the initializer is going to set up all of the different networks and loss functions we need and they're all going to be stored Inside this model And so then it prints out and shows us Exactly the PyTorch bottles we have and so it's interesting to see that they're using resnets And so you can see the resnets look pretty familiar we've got Con batch norm value con batch norm So instance norm is just the same as batch norm basically, but it applies it to one image at a time The difference isn't particularly important Okay, and you can see they're doing reflection padding Just like we are you can kind of see like when you When you try to build everything from scratch like this, it is a lot of work and you know, you can kind of to get the little You know the nice little things that Fastai does automatically for you. 
So over time, hopefully soon, we'll get all of this GAN stuff into fastai and it'll be nice and easy.

OK, so we've got our model, and remember the model contains the loss functions, the generators, and the discriminators, all in one convenient place. I've gone ahead and copied and pasted and slightly refactored the training loop from their code so that we can run it inside the notebook. This should look pretty familiar: a loop to go through each epoch and a loop to go through the data. Before we did this, we set up our — this is actually not a PyTorch Dataset; I think it's what they use, slightly confusingly, to refer to their combined thing, what we would call a model data object, I guess: all the data they need. We loop through that with tqdm to get a progress bar, and then we can go through and see what happens in the model.

So: set_input. This is kind of a different approach to what we do in fastai — it's quite neat, and quite specific to CycleGANs. Internally, inside this model, the idea is that we go into our data and grab the appropriate pieces: depending on which direction we're going — horse to zebra or zebra to horse — A is either the horse or the zebra, and vice versa; if necessary it puts things on the appropriate GPU, and then it grabs the appropriate paths. So the model now has a mini-batch of horses and a mini-batch of zebras.

Now we optimize the parameters. It's kind of nice to see it laid out like this — you can see each step: first try to optimize the generators, then try to optimize the horse discriminator, then try to optimize the zebra discriminator. zero_grad() is part of PyTorch, step() is part of PyTorch, so the interesting bit is the actual thing that does the backpropagation on the generators.

Here it is — let's jump to the key pieces. It's all the formulas we basically just saw in the paper. Take a horse and generate a zebra — so we've now got a fake zebra. Now use the discriminator to see whether we can tell if it's fake or not — that's pred_fake — and then pop that into the loss function we set up earlier to get a GAN loss based on that prediction. Then do the same thing in the opposite direction, using the opposite discriminator, and put that through the loss function again. And then do the cycle-consistency loss: take the fake we created up there, try to turn it back into the original, and use the cycle-consistency loss function we created earlier to compare it to the real original. And here's that lambda — the weight we use, which was set up earlier; actually we just used the default they suggested in their options.
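Condensed down, that generator step looks roughly like this — a sketch with names loosely following the CycleGAN repo, where criterionGAN, criterionCycle and the lambdas are assumed to be what the initializer set up (none of them are defined in this snippet):

```python
# G_A turns horses into zebras, G_B turns zebras into horses;
# D_A judges zebras, D_B judges horses.
def backward_G(real_A, real_B):
    fake_B = G_A(real_A)                          # horse -> (fake) zebra
    loss_G_A = criterionGAN(D_A(fake_B), True)    # can it fool the zebra critic?

    fake_A = G_B(real_B)                          # zebra -> (fake) horse
    loss_G_B = criterionGAN(D_B(fake_A), True)

    rec_A = G_B(fake_B)                           # ...and back into a horse
    loss_cycle_A = criterionCycle(rec_A, real_A) * lambda_A

    rec_B = G_A(fake_A)                           # zebra -> horse -> zebra
    loss_cycle_B = criterionCycle(rec_B, real_B) * lambda_B

    loss_G = loss_G_A + loss_G_B + loss_cycle_A + loss_cycle_B
    loss_G.backward()
    return loss_G
```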
Then do the same for the opposite direction, add them all together, do the backward step — and that's it. Then we can do the same thing for the first discriminator, and since basically all the work has been done now, there's much less to do there. I won't step through all of it, but it's basically the same stuff we've already seen. So optimize_parameters basically means calculating the losses and doing the optimizer steps. From time to time it saves and prints out some results, and from time to time it updates the learning rate — they've got some learning-rate annealing built in here as well. It isn't very exciting, but you can take a look at it: a bit like fastai, they've got this idea of schedulers which you can use to update your learning rates. For those of you who are interested in better understanding deep learning APIs, or in contributing more to fastai, or in creating your own version of some of this stuff on some different backend, it's cool to look at a second API that covers a subset of similar things, to get a sense of how they're solving some of these problems and what the similarities and differences are.

So we train that for a little while, and then we can just grab a few examples, and here we have them. Here are our horses, here they are as zebras, and here they are back as horses again. Here's a zebra turned into a horse and back into a zebra — it's kind of thrown away its head for some reason, but not so much that it couldn't get it back again. This one is really interesting: it's obviously not what zebras look like, but if there's going to be a zebra version of that horse, that's it. It's also interesting to see the failure cases — I guess it doesn't very often see basically just an eyeball, so it has no idea what to do with that one. Some of them don't work very well. This one's done a pretty good job. This one's interesting: it's done a good job with that one and that one, but for some reason the one in the middle didn't get done. This one's a really weird shape, but it's done a reasonable job of it; this one looks good; this one's pretty sloppy; that one's not bad. It took me quite a while — about 24 hours — to train it even that far, so it's kind of slow, and I know Helena is constantly complaining on Twitter about how long these things take; I don't know how she's so productive with them.

I'll mention one more thing that just came out yesterday, which is that there's now multimodal unsupervised image-to-image translation, so you can basically now create different cats, for instance, from this dog. It's not just creating one example of the output that you want but creating multiple ones: here's a house cat to big cat, here's a big cat to house cat, and a cat and a dog. This is the paper — it came out yesterday or the day before, I think, and I think it's pretty amazing. So you can see how this technology is developing, and I think there are so many opportunities to maybe do this with music, or speech, or writing, or to create tools for artists, or whatever.

All right, thanks everybody, and I'll see you next week.