 Welcome to lesson 13, where we're going to be talking about image enhancement. And image enhancement would cover things like this painting that you might be familiar with. However, you might not have noticed before that this painting actually has a picture of an eagle in it. The reason you might not have noticed that before is this painting actually didn't used to have an eagle in it. By the same token actually on that first page, this painting did not used to have Captain America's shield on it either. And this painting did not used to have a clock in it either. This is a cool new paper actually that just came out a couple of days ago called Deep Painterly Harmonization. And it uses almost exactly the technique we're going to learn in this lesson with some minor tweaks. But you can see the basic idea is take one picture, paste it on top of another picture, and then use some kind of approach to combine the two. And the basic approach is something called a style transfer. Before we talk about that though, I wanted to mention this really cool contribution by William Horton who added this stochastic weight averaging technique to the Fast AI Library. That is now all merged and ready to go. And he's written a whole post about that which I strongly recommend you check out. Not just because stochastic weight averaging actually lets you get higher performance from your existing neural networks with basically no extra work. It's as simple as adding these two parameters to your fit function. But also he's described his process of building this and how he tested it and how he contributed to the library. I think it's interesting, you know, if you're interested in doing something like this, I think William had not built this kind of library before, so he describes how he did it. Another very cool contribution to the Fast AI Library is a new train phase API. And I'm going to do something I've never done before, which I'm actually going to present somebody else's notebook. 
And the reason I haven't done it before is because I haven't liked any notebooks enough to think they're worth presenting, but Sylvain's done a fantastic job here of not just creating this new API, but also creating a beautiful notebook describing what it is and how it works and so forth. And the background here is, as you guys know, we've been trying to train networks faster, partly as part of this research competition, and also for a reason that you'll learn about next week. And I mentioned on the forums last week it would be really handy for our experiments if we had an easier way to try out different learning rate schedules and stuff. And I basically laid out an API that I had in mind. I said, it'd be really cool if somebody could write this, because I'm going to bed now and I kind of need it by tomorrow. And Sylvain replied on the forum, well, that sounds like a good challenge. And 24 hours later it was done. And it's been super cool. I want to take you through it because it's going to allow you to do research into things that nobody's tried before.
So it's called the train phase API. And the easiest way to show it is to show an example of what it does, which is here. Here is an iteration against learning rate chart, as you're familiar with seeing. And this is one where we train for a while at a learning rate of 0.01, and then we train for a while at a learning rate of 0.001. I actually wanted to create something very much like that learning rate chart because most people that train ImageNet use this stepwise approach. And it's actually not something that's built into fastai because it's not generally something we recommend. But in order to replicate existing papers, I wanted to do it the same way. And so rather than writing a number of fit, fit, fit calls with different learning rates, it would be nice to be able to basically say train for n epochs at this learning rate and then m epochs at that learning rate. And so here's how you do that. You can say phases.
So a phase is a period of training with particular optimizer parameters. And phases consists of a number of training phase objects. And a training phase object says how many epochs to train for, what optimization function to use, and what learning rate, amongst other things that we'll see. And so here you'll see the two training phases that you just saw on that graph. So now, rather than calling learn.fit, you say learn.fit with an optimizer scheduler with these phases: fit_opt_sched. And then from there, most of the things you pass in can just get sent across to the fit function as per usual. So most of the usual parameters will work fine. But in this case, generally speaking, we can just use these training phases, and you'll see it fits in the usual way. And then when you say plot LR, there it is. And not only does it plot the learning rate, it also plots momentum. And for each phase, it tells you what optimizer it used. You can turn off the printing of the optimizers. You can turn off the printing of the momentums.
And you can do other little things, like a training phase could have an LR decay parameter. So here's a fixed learning rate and then a linear decay learning rate and then a fixed learning rate, which gives us that picture. And this might be quite a good way to train, actually, because we know at high learning rates you get to explore better. And at low learning rates, you get to fine-tune better. And it's probably better to gradually slide between the two. So this actually isn't a bad approach, I suspect. You can use other decay types, not just linear, so cosine. And this probably makes even more sense as a genuinely potentially useful learning rate annealing shape. Exponential, which is a super popular approach. Polynomial, which isn't terribly popular, but actually in the literature works better than just about anything else, but seems to have been largely ignored. So polynomial is good to be aware of.
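To make the decay shapes described above concrete, here is a minimal sketch in plain Python. It is not the fastai API: the function names `decayed_lr` and `schedule`, the tuple layout of a phase, and the exact formulas are assumptions chosen for illustration (they are common textbook forms, not necessarily the ones in Sylvain's notebook). The default polynomial power of 0.9 is the value mentioned in the lecture.

```python
import math


def decayed_lr(start_lr, end_lr, frac, kind="linear", power=0.9):
    """Learning rate at fraction `frac` (0..1) of the way through one phase.

    Hypothetical helper for illustration; not part of fastai.
    """
    if kind == "none":          # fixed learning rate for the whole phase
        return start_lr
    if kind == "linear":        # straight line from start_lr to end_lr
        return start_lr + (end_lr - start_lr) * frac
    if kind == "cosine":        # half a cosine wave from start_lr down to end_lr
        return end_lr + (start_lr - end_lr) * (1 + math.cos(math.pi * frac)) / 2
    if kind == "exponential":   # geometric interpolation; needs end_lr > 0
        return start_lr * (end_lr / start_lr) ** frac
    if kind == "polynomial":    # (1 - frac) ** power, scaled to the lr range
        return end_lr + (start_lr - end_lr) * (1 - frac) ** power
    raise ValueError(f"unknown decay kind: {kind}")


def schedule(phases, iters_per_epoch):
    """Flatten a list of (epochs, start_lr, end_lr, kind) phases into per-iteration rates."""
    lrs = []
    for epochs, start_lr, end_lr, kind in phases:
        n = epochs * iters_per_epoch
        for i in range(n):
            lrs.append(decayed_lr(start_lr, end_lr, i / n, kind))
    return lrs


# Fixed rate, then a linear slide down, then a fixed lower rate,
# i.e. the "fixed, linear decay, fixed" picture described in the lecture.
phases = [
    (1, 1e-2, 1e-2, "none"),
    (2, 1e-2, 1e-3, "linear"),
    (1, 1e-3, 1e-3, "none"),
]
lrs = schedule(phases, iters_per_epoch=100)
```

Swapping `"linear"` for `"cosine"`, `"exponential"` or `"polynomial"` in the middle phase gives the other shapes; plotting `lrs` against the iteration number reproduces the kind of chart shown in the notebook.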
What Sylvain's done is he's given us the formula for each of these curves. And so with a polynomial, you get to pick what polynomial to use. So here it is with a different size. And I believe a p of 0.9 is the one that I've seen really good results for, FYI. If you don't give a tuple of learning rates when there's an LR decay, then it will decay all the way down to 0. And as you can see, you can happily start the next cycle at a different point.
So the cool thing is now we can replicate all of our existing schedules using nothing but these training phases. So here's a function called phases SGDR, using the new training phase API. And so you can see he runs this schedule, and here's what it looks like. But he's even done the little trick I have where you train at a really low learning rate just for a little bit and then pop up and do a few cycles, and the cycles are increasing in length, and that's all done in a single function. So the new one cycle we can now implement with, again, a single little function. And so if we fit with that, we get this triangle followed by a little flatter bit, and the momentum is a cool thing. The momentum has a momentum decay. And then here we've got a fixed momentum at the end. So it's doing the momentum and the learning rate at the same time.
So something that I haven't tried yet that I think would be really interesting is to use, he's calling it differential learning rates. We've changed the name now to discriminative learning rates. Oops, we'll fix that. Discriminative learning rates. So a combination of discriminative learning rates and one cycle, no one's tried yet. So that would be really interesting. The only paper I've come across which has discriminative learning rates uses something called LARS, L-A-R-S.
And it was used to train ImageNet with very, very large batch sizes by basically looking at the ratio between the gradient and the mean at each layer and using that to change the learning rate of each layer automatically, and they found that they could use much larger batch sizes. That's the only other place I've seen this kind of approach used, but there's lots of interesting things you could try with combining discriminative learning rates and different interesting schedules.
So you can now write your own LR finder of different types, specifically because there's now this stop-div parameter, which basically means that it'll use whatever schedule you asked for, but when the loss gets too bad, it'll stop training. So here's one with, you know, learning rate versus loss, and you can see it stops itself automatically. One useful thing that's been added is the linear parameter to the plot function. If you use a linear schedule rather than an exponential schedule in your learning rate finder, which is a good idea if you've kind of tuned in to roughly the right area, because then you can use linear to find exactly the right area, then you probably want to plot it with a linear scale. So that's why you can also pass linear to plot now as well.
You can change the optimizer each phase, and that's more important than you might imagine, because actually the current state of the art for training on really large batch sizes really quickly for ImageNet starts with RMSprop for the first bit and then they switch to SGD for the second bit. And so that could be something interesting to experiment more with, because at least one paper has now shown that that can work well. And again, it's something that isn't well appreciated as yet.
And then the bit I find most interesting is you can change your data. And why would we want to change our data? Because you remember from Lessons 1 and 2 you could use smaller images at the start and bigger images later.
And the theory is that you could use that to kind of train the first bit more quickly with smaller images. And remember, if you have half the height and half the width, then you've got a quarter of the activations at basically every layer. So it can be a lot faster. And it might even generalize better. So you can now create a couple of different sizes; for example, in this case he's got 28 and then 32 sized images. This is just CIFAR-10, so there's only so much you can do. And then if you pass in an array of data in this data list parameter when you call fit_opt_sched, it'll use a different data set for each phase. So that's really cool because we can use that now, like we could use that in our DAWNBench entries and see what happens when we actually increase the size with very little code.
So what happens when we do that? Well, the answer is here in DAWNBench, training on ImageNet. And you can see here that Google has won this with half an hour on a cluster of TPUs. The best non-cluster-of-TPUs result is fast.ai plus students, under three hours, beating out Intel on 128 computers, whereas we ran on a single computer. We also beat Google running on a TPU. So using this approach we've shown the fastest GPU result, the fastest single machine result, and the fastest publicly available infrastructure result; these TPU pods you can't use unless you're Google. And the cost is tiny, like this Intel one cost them $1,200 worth of compute. They haven't even written it here. But that's what you get if you use it. So 128 computers in parallel, each one with 36 cores, each one with 140 gig, compared to our single AWS instance. So this is, you know, kind of a breakthrough in what we can do, like the idea that we can train ImageNet on a single publicly available machine. And this $72, by the way, it was actually $25 because we used a spot instance.
So one of our students, Andrew Shaw, built a whole system to allow us to throw a whole bunch of spot instance experiments up and run them simultaneously, and pretty much automatically. But DAWNBench doesn't quote the actual number we used. So it's actually $25, not $72. So this data list idea is super important and helpful.
And so our CIFAR-10 results are also now up there officially. And you might remember the previous best was a bit over an hour. And the trick here was using one cycle, basically. So all this stuff that's in Sylvain's training phase API is really all the stuff that we used to get these top results. And really cool, another fast.ai student who goes by the name here, BKJ, has taken that and done his own version. He took ResNet 18 and added the concat pooling that you might remember that we learned about on top, and used Leslie Smith's one cycle. And so he's got on the leaderboard. So the top three are all fast.ai students, which is wonderful. And same for cost, the top three. And you can see Paperspace. So Brett ran this on Paperspace and got the cheapest result, just ahead of BKJ. Ben, his name is, I believe.
Okay, so I think you can see a lot of the kind of interesting opportunities at the moment for training stuff more quickly and cheaply are all about the learning rate annealing and size annealing and training with different parameters at different times. And I still think everybody is scratching the surface. I think we can go a lot faster and a lot cheaper. And that's really helpful for people, you know, in resource-constrained environments, which is basically everybody except Google, maybe Facebook.
Architecture is interesting as well, though. And one of the things we looked at last week was just like creating a simpler architecture, which is basically state-of-the-art, you know, like the really basic kind of Darknet architecture. But there's a piece of architecture we haven't talked about, which is necessary to understand the inception network.
And the inception network is actually pretty interesting because they use some tricks to actually make things more efficient. And we're not currently using these tricks, and I kind of feel like maybe we should try it. And so the most interesting, most successful inception network is their Inception-ResNet-v2 network. And most of the blocks in that look something like this. And it looks a lot like a standard ResNet block in that there's an identity connection here, and then there's a conv-conv path here, and then we add them up together, right? But it's not quite that, right? The first is that this path is a one-by-one conv, not just any old conv, but a one-by-one conv.
And so it's worth thinking about what a one-by-one conv actually is. So a one-by-one conv is simply saying, for each grid cell in your input, you've got, basically it's a vector, right? A one-by-one by number of filters tensor is basically a vector, right? So for each grid cell in your input, you're just doing a dot product with that tensor, right? And then, of course, there's going to be one of those vectors for each of the 192 activations we're creating. So basically do 192 dot products with grid cell 1,1, and then 192 with grid cell 1,2 and 1,3 and so forth, and so you'll end up with something which has got the same grid size as the input and 192 channels in the output. So that's a really good way to, you know, either reduce the dimensionality or increase the dimensionality of an input without changing the grid size. That's normally what we use one-by-one convs for.
So here we've got a one-by-one conv and then we've got another one-by-one conv and then they're added together. And then there's a third path, and this third path is not added. This third path, it's not actually explicitly mentioned, but it's concatenated, right? And so actually there is a form of ResNet, which is basically identical to ResNet, but we don't do plus, we do concat, right? And that's called a dense net, right?
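As a rough illustration of the two ideas just described, here is a small sketch in plain Python using nested lists. It shows a one-by-one conv as one dot product per grid cell per output filter, and the difference between merging two branches by adding them (ResNet style) and by concatenating them (dense net style). The helper names and the tiny tensor sizes are made up for the example; the block in the lecture uses 192 output channels rather than 4.

```python
def conv1x1(x, weights):
    """1x1 convolution on x[channel][row][col] with weights[out_channel][in_channel].

    Each output value is a dot product between one weight vector and the
    vector of input channels at a single grid cell, so the grid size is unchanged.
    """
    c_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    return [
        [[sum(filt[c] * x[c][i][j] for c in range(c_in)) for j in range(width)]
         for i in range(height)]
        for filt in weights
    ]


def add_branches(a, b):
    """ResNet-style merge: element-wise sum, so both branches need the same shape."""
    return [[[va + vb for va, vb in zip(ra, rb)] for ra, rb in zip(pa, pb)]
            for pa, pb in zip(a, b)]


def concat_branches(a, b):
    """Dense-net-style merge: stack along the channel axis, keeping both untouched."""
    return a + b


def shape(t):
    return (len(t), len(t[0]), len(t[0][0]))


# A 2-channel input on a 2x3 grid.
x = [[[1, 2, 3], [4, 5, 6]],
     [[10, 20, 30], [40, 50, 60]]]
# Four 1x1 filters, each a vector over the 2 input channels.
weights = [[1, 0], [0, 1], [1, 1], [2, -1]]

y = conv1x1(x, weights)          # channels go from 2 to 4, grid stays 2x3
summed = add_branches(y, y)      # still 4 channels
stacked = concat_branches(x, y)  # 2 + 4 = 6 channels, first 2 are x unchanged
```

With concatenation the first channels of `stacked` are literally the input `x`, whereas with addition the original values are mixed into the sum.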
So it's just a ResNet where we do concat instead of plus. And that's an interesting approach because then the identity path is literally being copied, right? So you kind of get that flow through all the way through. And so as we'll see next week, that tends to be good for like segmentation and stuff like that, where you really want to kind of keep the original pixels and the first layer of pixels and the second layer of pixels untouched. So concatenating rather than adding branches is a very useful thing to do. And so here we're concatenating this branch.
And this branch is doing something interesting, which is it's doing first of all the one by one conv and then a one by seven and then a seven by one. So what's going on there? So what's going on there is basically what we really want to do is do a seven by seven conv. The reason we want to do a seven by seven conv is that if you've got multiple paths, each of which has different kernel sizes, then it's able to look at, you know, different amounts of the image. And so like the original inception network had like a one by one, a three by three, a five by five, a seven by seven, kind of getting concatenated in together, something like that. And so if we can have a seven by seven filter, then we get to kind of look at a lot of the image at once and create a really rich representation. And so actually the stem of the inception network, that is the first few layers of the inception network, actually also uses, you know, this kind of seven by seven conv, because you start out with this 224 by 224 by 3 and you want to turn it into something that's like 112 by 112 by 64. And so by using a seven by seven conv, you can get a lot of information in each one of those outputs to get those 64 filters. But the problem is that a seven by seven conv is a lot of work. You've got 49 kernel values to multiply by 49 inputs for every input pixel across every channel. So the compute is crazy, you know.
You can kind of get away with it maybe for the very first layer, and in fact the very first conv of ResNet is a seven by seven conv. But not so for inception. For inception they don't do a seven by seven conv. Instead they do a one by seven followed by a seven by one. And so to explain, the basic idea of the inception networks, all of the different versions of it, is that you have a number of separate paths which have different convolution widths. In this case, conceptually, the idea is this is a one by one convolution width and this is going to be a seven convolution width. And so they're looking at different amounts of data and then we combine them together. But we don't want to have a seven by seven conv throughout the network because it's just too computationally expensive.
But if you think about it, if we've got some input coming in and we have some big filter that we want and it's too big to deal with, what could we do? So let's just make it a little bit less to draw, let's do five by five. What we can do is to create two filters. One which is one by five, one which is five by one, or seven or whatever or nine. So we take our activations from the previous layer and we put them through the one by five. We take the activations out of that and put them through the five by one, and something comes out the other end. What comes out the other end? Well, rather than thinking of it as first of all, we take the activations, then we put them through the one by five, then we put them through the five by one, what if instead we think of these two operations together and say, what do a five by one dot product and a one by five dot product do together? And effectively, you could take a one by five and a five by one and the outer product of that is going to give you a five by five. Now, you can't create any possible five by five matrix by taking that product, right?
But there's a lot of five by five matrices that you can create. And so the basic idea here is, when you think about the order of operations, and I'm not going to go into the detail of this. If you're interested in more of the theory here, you can check out Rachel's numerical linear algebra course, which is basically a whole course about this stuff. But conceptually, the idea is that very often, the computation you want to do is actually more simple than an entire five by five convolution. Very often, the term we use in linear algebra is that there's some lower rank approximation. In other words, that the one by five and the five by one combine together, that five by five matrix is nearly as good as the five by five matrix you really ideally would have computed if you were able to. And so this is very often the case in practice, right? Just because the nature of kind of the real world is that the real world tends to have more structure, than kind of randomness. So the cool thing is, if we replace our seven by, if we replace our seven by seven conv with a one by seven and a seven by one, right? Then this has basically, for each cell, it's got 14 by input channel by output channel dot products to do, whereas this one has 49 to do. So it's just going to be a lot faster and we have to hope that it's going to be nearly as good. It's certainly capturing as much width of information by definition. So if you're interested in learning more about this specifically in the deep learning area, you can Google for factored convolutions. The idea was come up with three or four years ago now. It's probably been around for longer, but that was when I first saw it. And yeah, it turned out to work really well and the inception network uses it quite widely. They actually use it in their stem. 
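To check the factoring argument on a small case, here is a sketch in plain Python (single channel, integer values, no padding). It builds a five by five kernel as the outer product of a five by one and a one by five, and confirms that running the two small filters one after the other gives the same output as running the full kernel once. The helper names and the example numbers are made up for illustration, and the multiply counts are per output cell for a single input-channel and output-channel pair, matching the 14 versus 49 comparison above.

```python
def correlate2d(image, kernel):
    """'Valid' 2D cross-correlation, which is what deep learning libraries call a conv."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(kernel[a][b] * image[i + a][j + b]
             for a in range(kh) for b in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]


def multiplies_per_output(k, factored):
    """Multiplications per output cell for one input/output channel pair."""
    return 2 * k if factored else k * k


row = [[1, 2, 0, -1, 3]]            # a 1x5 filter
col = [[2], [0], [1], [-1], [1]]    # a 5x1 filter

# Outer product: a 5x5 kernel in which every row is a multiple of `row` (rank 1).
full = [[c[0] * r for r in row[0]] for c in col]

# An arbitrary 8x8 single-channel input with small integer values.
image = [[(3 * i + 5 * j) % 7 for j in range(8)] for i in range(8)]

two_step = correlate2d(correlate2d(image, row), col)   # 1x5 then 5x1
one_step = correlate2d(image, full)                    # full 5x5 in one go

assert two_step == one_step
print(multiplies_per_output(7, factored=False), multiplies_per_output(7, factored=True))
```

The equality only runs in one direction: any pair of one by five and five by one filters corresponds to some five by five kernel, but a general five by five kernel has 25 free values while the factored pair has only 10, which is why the factored version is a low rank approximation rather than an exact replacement.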
It's interesting actually, we've talked before about how we tend to say there's this main backbone, like when we have ResNet 34, for example, we kind of say, oh, there's this main backbone, which is all of the convolutions. And then we've talked about how we can add on to it a custom head, right? And that tends to be like a max pooling layer and a fully connected layer or something like that. It's actually kind of better to talk about the backbone as containing kind of two pieces. One is the stem, and then the other is kind of the main backbone. And the reason is that the thing that's coming in, remember, has only got three channels. And so we want some sequence of operations that's going to expand that out into something richer, generally something like 64 channels. And so in ResNet, the stem is just super simple. It's a 7 by 7 stride 2 conv, followed by a stride 2 max pool. I think that's it, if memory serves correctly. In Inception, they have a much more complex stem with multiple paths getting combined, concatenated, including factored convs, 1 by 7 and 7 by 1.
And yeah, I'm kind of interested in what would happen if you stuck like a standard ResNet on top of an Inception stem, for instance. Like I think that would be a really interesting thing to try, because an Inception stem is kind of quite a carefully engineered thing. And this thing of like, how do you take your three-channel input and turn it into something richer, seems really important. And all of that work seems to have got thrown away for ResNet. We like ResNet. It works really well. But what if we put the dense net backbone on top of an Inception stem? Or what if we replaced the 7 by 7 conv with a 1 by 7, 7 by 1 factored conv in a standard ResNet? I don't know. There's lots of things we could try, and I think it would be really interesting. So there's some more thoughts about potential research directions.
Okay, so that was kind of my little bunch of random stuff section. Moving a little bit closer to the actual main topic of this, which is... What was the word I used? Image enhancement. I'm going to talk about a new paper briefly because it really connects what I just discussed with what we're going to discuss next. And the new paper... Well, it's not that new, is it? Maybe it's a year old. It's a paper on progressive GANS which came from NVIDIA. And the progressive GANS paper is really neat. It basically... Sorry, Rachel, yes. We have a question. 1 by 1 conv is usually called a network within a network in the literature. What is the intuition of such a name? No, network in network is more than just a 1 by 1 conv. So it's part of NIN. And we don't... I don't think there's any particular reason to look at that that I'm aware of. Okay. So the progressive GAN basically takes this idea of actually gradually increasing the image size. It's the only other direction I'm aware of where people have actually gradually increased the image size. And it kind of surprises me because this paper is actually very popular and very well known and very well liked. And yet people haven't taken the basic idea of gradually increasing the image size and use it anywhere else, which shows you the general level of creativity you can expect to find in the deep learning research community, perhaps. So they start with a 4 by 4... Like, they really go back. Start with a 4 by 4 GAN. Like, literally, they're trying to create, like, replicate 4 by 4 pixel. And then 8 by 8. And so here's the 8 by 8 pixels. This is the CelebA data set. So we're trying to recreate pictures of celebrities. And then they go 16 by 16 and then 32. And then 64. And then 128. And then 256. And one of the really nifty things they do is that as they increase size, they also add more layers to the network. Right? Which kind of makes sense, right? 
Because if you're doing more of a resnetty type thing, you know, then you're spitting out something which hopefully makes sense at each grid cell size. And so you should be able to kind of layer stuff on top. And they do another nifty thing where they kind of add a skip connection when they do that. And they gradually change the linear interpolation parameter that moves it more and more away from the old 4 by 4 network and towards the new 8 by 8 network. And then once it's totally moved across, they throw away that extra connection. So the details don't matter too much, but it uses the basic ideas we've talked about, gradually increasing the image size, kind of skip connections and stuff.
But it's a great paper to study because, you know, it's like one of these rare things where good engineers actually built something that just works in a really sensible way. And it's not surprising, this actually comes from NVIDIA themselves, right? So NVIDIA don't do a lot of papers, but it's interesting that when they do, they build something that's so thoroughly practical and sensible. And so I think it's a great paper to study, you know, if you want to kind of like put together lots of the different things we've learned, you know? And there aren't many re-implementations of this. So it's an interesting thing, you know, as a project. And maybe you could build on it and find something else.
So here's what happens next. We eventually go up to 1024 by 1024, and you'll see that the images are not only getting higher resolution, but they're getting better. And so at 1024 by 1024, I'm going to see if you can guess which one on the next page is fake. They're all fake. That's the next stage, right? You go up, up, up, up, up, up, up, up, and then boom, okay? So like, GANs and stuff are getting crazy. And some of you may have seen this during the week. Yeah, so this video just came out and it's a speech by Barack Obama. And let's check it out. So, my Jordan Peele.
This is a dangerous time. Moving forward, we need to be more vigilant with what we trust from the internet. It's a time when we need to rely on trusted news sources. It may sound basic, but how do we move forward?
So as you can see, they've used this kind of technology to literally move Obama's face in the way that Jordan Peele's face was moving. And like, you basically have all the techniques you need now to do that. So is that a good idea? So this is the bit where we talk about what's most important, which is like, now that we can do all this stuff, what should we be doing? And how do we think about that? And the TL;DR version is, I actually don't know. Recently a lot of you saw the founders of spaCy and Prodigy, the folks who founded Explosion AI, did a talk, Matthew and Ines. I went to dinner with them afterwards and we basically spent the entire evening talking, debating, arguing about what it means that companies like ours are building tools that are democratising access to tools that can be used in harmful ways. And they're incredibly thoughtful people and I wouldn't say we didn't agree. We just couldn't come to a conclusion ourselves. So I'm just going to lay out some of the questions and point to some of the research. And when I say research, most of the actual literature review and putting this together was done by Rachel. So thanks, Rachel.
Let me start by saying the models we build are often pretty shitty in ways which are not immediately apparent. And you won't know how shitty they are unless the people that are building them with you are a range of people and the people that are using them with you are a range of people. So for example, a couple of wonderful researchers, Timnit at Stanford. Where's Joy? She's at Microsoft now. Joy just finished her PhD at MIT. Joy is from MIT.
So Joy and Timnit did this really interesting research where they looked at some basically off-the-shelf face recognisers, one from Face++, which is a huge Chinese company, IBM's and Microsoft's, and they looked at a range of different face types. And generally speaking, the Microsoft one in particular was incredibly accurate unless the face type happened to be dark skinned, when suddenly it went 25 times worse, got it wrong nearly half the time. And for a big company like this to release a product that for a very, very large percentage of the world basically doesn't work is more than a technical failure. It's a really deep failure of understanding what kind of team needs to be used to create such a technology and to test such a technology, or even an understanding of who your customers are. Yeah, some of your customers have dark skin. Yes, Rachel?
I was also going to add that the classifiers all did worse on women than on men.
Shocking. Yeah. But funny, actually, Rachel tweeted about something like this the other day and some guy was like, what's this all about? What are you saying that we don't know about? People made cars for a long time. You're saying you need women to make cars too? And Rachel pointed out, well, actually, yes. For most of the history of car safety, women in cars have been far, far more at risk of death than men in cars, because the men created male-looking, male-feeling, male-sized crash test dummies, and so car safety was literally not tested on women-sized bodies. So, you know, shitty product management with a total failure of diversity and understanding is not new to our field.
And I was just going to say that was comparing impacts of similar strength for men and women.
Yeah, I don't know why. Rachel has to say this because anytime she says something like this on Twitter, there's like 10 people who will be like, oh, you have to compare all these other things, as if we didn't know that. So, yeah. I mean, yeah.
Other things, you know, our very best, most famous systems do, like Microsoft's face recognizer or Google's language translator. You turn "she is a doctor, he is a nurse" into Turkish and both the pronouns become "o", because there's no gendered pronouns in Turkish. So go the other direction. "O bir doktor." I don't know how to say that. And the equivalent for a Turkish nurse. And what does it get turned into? "He is a doctor, she is a nurse." So, like, we've got these kind of biases built into tools that we're all using every day. And again, people are like, oh, it's just showing us what's in the world, and there's lots of problems with that basic assertion.
But as you know, machine learning algorithms love to generalize, right? And this is one of the cool things about you guys knowing the technical details now. Because they love to generalize, when you see something like 67% of people cooking are women in the pictures they use to build this model, and then you actually run the model on a separate set of pictures, then 84% of the people they choose as cooking are women, rather than the correct 67%, right? Which is like a really understandable thing for an algorithm to do: it took a biased input and created a more biased output. Because, you know, for this particular loss function, that's kind of where it ended up. And this is a really common kind of model amplification, right?
So this stuff matters, right? It matters in ways more than just, you know, awkward translations, or like, you know, black people's photos not being classified correctly. Or, you know, maybe there's some wins as well, like, you know, horrifying surveillance everywhere maybe won't work on black people, I don't know. Or it'll be even worse because it's horrifying surveillance and it's flat out racist and wrong. Could be that too. But let's go deeper, right?
Like, for all we say about human failings, there's a long history of civilization and societies creating kind of layers of human judgment which avoid, hopefully, the most horrible things happening. And sometimes companies which love technology think, let's throw away the humans and replace them with technology, like Facebook did, right? So a couple of years ago, Facebook literally got rid of their human editors, this was in the news at the time, and they were replaced with algorithms. And so now there's algorithms that put all the stuff on your news feed and human editors are out of the loop. But what happened next? Many things happened next. One of which was a massive, horrifying genocide in Myanmar. Babies getting torn out of their mothers' arms and thrown onto fires. Mass rape, murder, and an entire people exiled from their homeland. Okay, I'm not going to say that was because Facebook did this, but what I will say is that when the leaders of this horrifying project are interviewed, they regularly talk about how everything they learnt about the disgusting animal behaviors of Rohingyas that need to be thrown off the earth, they learnt from Facebook, right? Because the algorithms just want to feed you more stuff that gets you clicking. And so if you get told these people that don't look like you and you don't know are bad people, and here are lots of stories about the bad people, and then you start clicking on them, and then they feed you more of those things, next thing you know you have this extraordinary cycle. And people have been studying this, right? So for example, we've been told a few times that people click on our fast.ai videos and then the next thing recommended to them is like conspiracy theory videos from Alex Jones, and then that continues from there. Because, you know, humans click on things that shock us and surprise us and horrify us, right?
And so at so many levels, you know, this decision has had extraordinary consequences which we're only beginning to understand. And again, this is not to say this particular consequence is because of this one thing, but to say it's entirely unrelated would be clearly ignoring all of the evidence and information that we have, right? So this is really kind of the key takeaway is to think like what are you building and how could it be used, right? So lots and lots of effort now being put into face detection including in our course, right? We've been spending a lot of time thinking about how to recognize stuff and where it is and there's lots of good reasons to want to be good at that, you know, for improving crop yields in agriculture for improving diagnostic and treatment planning in medicine for improving your Lego sorting robot system, whatever, right? But it's also being widely used in surveillance and propaganda and disinformation and, you know, again, it's like the question is like, well, what do I do about that? I don't exactly know, right? But it's definitely at least important to be thinking about it, talking about it and sometimes you can do really good things. For example, meetup.com did something which I would put in the category of really good thing which is they recognized early a potential problem which is that more men were tending to go to their meetups and that was causing their collaborative filtering systems which you're all familiar with building now to recommend more technical content to men and that was causing more men to go to more technical content which was causing the recommendation systems to suggest more technical content to men, right? And this kind of runaway feedback loop is extremely common when we interface the algorithm and the human together. So what did meetup do? They intentionally made the decision to recommend more technical content to women, right? 
Not because of some, you know, highfalutin idea about how the world should be, but just because that makes sense, right? The runaway feedback loop was a bug, right? There are women that want to go to tech meetups, but when you turn up to a tech meetup and it's all men, then you don't go, and then it recommends more to men, and so on and so forth, right? So Meetup made a really strong product management decision here, which was to not do what the algorithm said to do. Unfortunately this is rare. Most of these runaway feedback loops, for example in predictive policing, where algorithms tell policemen where to go, which very often is more black neighborhoods, which end up crawling with more policemen, which leads to more arrests, which causes the system to tell more policemen to go to more black neighborhoods, and so forth. So this problem of algorithmic bias is now very widespread, and as algorithms become more and more widely used for specific policy decisions, judicial decisions, day-to-day decisions about just who to give what offer to, this just keeps becoming a bigger problem, right? And some of them are really things that the people involved in the product management decision should have seen at the very start didn't make sense and were unreasonable under any definition of the term. For example, this stuff that Abe Gong pointed out: these were questions that were used to decide, which was it, sentencing guidelines? This software is used for both pre-trial, so who was required to post bail, so these are people that haven't even been convicted, as well as for sentencing and for who gets parole, and this was upheld by the Wisconsin Supreme Court last year despite all the flaws pointed out. Okay, so whether you have to stay in jail because you can't pay the bail, and how long your sentence is for, and how long you stay in jail for, depends on what your father did, whether your parents stayed married, who your friends are and where you live.
Now, it turns out these algorithms are actually terribly, terribly bad, so some recent analysis showed that they're basically worse than chance. But even if the companies building them were competent and these were statistically accurate correlations, does anybody imagine there's a world where it makes sense to decide what happens to you based on what your dad did? So a lot of this stuff at the basic level is obviously unreasonable, and a lot of it just fails in these ways, where you can see empirically that these kind of runaway feedback loops must have happened and these over-generalizations must have happened. For example, these are the kind of cross tabs that anybody working in these fields, in any field that's using algorithms, should be preparing. So prediction of likelihood of reoffending for black versus white defendants, like we can just calculate this very simply. Of the people that were labeled high-risk but didn't reoffend, it was 23.5% for white defendants but about twice that for African-American defendants, whereas of those that were labeled lower-risk but did reoffend, it was like half the white people and only 20% of the African-Americans. So this is the kind of stuff where, if you're taking the technologies we've been talking about and putting them in production in any way, or building an API for other people, or providing training for people, or whatever, then at least make sure that what you're doing can be tracked in a way that people know what's going on. So at least they're informed. I think it's a mistake, in my opinion, to assume that people are evil and trying to break society. I prefer to start with an assumption of, okay, if people are doing dumb stuff, it's because they don't know better. So at least make sure that they have this information. And I find very few ML practitioners thinking about what is the information they should be presenting in their interface.
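As a rough illustration of the kind of cross tab described above, here is a minimal sketch in plain Python. The records are entirely made up for the example (they are not the figures quoted in the lecture), and the names `records` and `error_rates` are hypothetical. It computes, per group, the share of people labelled high-risk who did not reoffend and the share labelled low-risk who did.

```python
from collections import Counter

# Hypothetical records: (group, predicted_high_risk, reoffended).
# These counts are invented purely to show the calculation.
records = (
    [("A", True, False)] * 20 + [("A", False, False)] * 60
    + [("A", True, True)] * 10 + [("A", False, True)] * 10
    + [("B", True, False)] * 40 + [("B", False, False)] * 40
    + [("B", True, True)] * 16 + [("B", False, True)] * 4
)

def error_rates(rows):
    """Return {group: (false_positive_rate, false_negative_rate)}."""
    counts = Counter(rows)
    out = {}
    for g in sorted({r[0] for r in rows}):
        fp = counts[(g, True, False)]   # labelled high-risk, did not reoffend
        tn = counts[(g, False, False)]  # labelled low-risk, did not reoffend
        fn = counts[(g, False, True)]   # labelled low-risk, did reoffend
        tp = counts[(g, True, True)]    # labelled high-risk, did reoffend
        out[g] = (fp / (fp + tn), fn / (fn + tp))
    return out

rates = error_rates(records)
for g, (fpr, fnr) in rates.items():
    print(f"group {g}: high-risk label but no reoffence {fpr:.1%}, "
          f"low-risk label but reoffended {fnr:.1%}")
```

If the two groups show very different error rates, as they do in this invented data, that is exactly the kind of thing worth surfacing to whoever is using the model.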
And then often I'll talk to data scientists who will kind of say like, oh, the stuff I'm working on doesn't have a societal impact. It's like, really? Like a number of people who think that what they're doing is entirely pointless, come on. Otherwise people are paying you to do it for a reason. It's going to impact people in some way. So think about what that is. The other thing I know is a lot of people involved here are hiring people. And so if you're hiring people, I guess you're all very familiar with the fast AI philosophy now, which is the basic premise that, and I think it comes back to this idea that I don't think people on the whole are evil. I think they need to be informed and have tools. So we're trying to give as many people the tools as possible that they need. And particularly we're trying to put those tools in the hands of a more diverse range of people. So if you're involved in hiring decisions, perhaps you can keep this kind of philosophy in mind as well. If you're not just hiring a wider range of people, but also promoting a wider range of people and providing really appropriate career management for a wider range of people. Well, apart from anything else, your company will do better. It actually turns out that more diverse teams are more creative and tend to solve problems more quickly and better than less diverse teams. But also you might avoid these kind of awful screw-ups, which at one level are bad for the world and at another level if you ever get found out they can also destroy your company. Also they can destroy you or at least make you look pretty bad in history. A couple of examples. One is going right back to the Second World War, IBM basically provided all of the infrastructure necessary to track the Holocaust. So these are the forms that they used. And so they had different code for, you know, Jews were eight and Gypsies were 12. Death in the gas chambers was six. And they all went on these punch cards. 
You can go and look at these punch cards in museums now. And this has actually been reviewed by a Swiss judge who said that IBM's technical assistance facilitated the task of the Nazis and the commission of the crimes against humanity. And it's interesting to read back the history from these times to see what was going through the minds of people at IBM at that time. And what was clearly going through their minds was the opportunity to show technical superiority, the opportunity to test out their new systems. And of course the extraordinary amount of money that they were making. And when you do something which at some point down the line turns out to be a problem, even if you were told to do it, that can turn out to be a problem for you personally. For example, you'll remember the diesel emissions scandal at VW. Who was the one guy that went to jail? He was the engineer. Just doing his job. So if all of this stuff about actually not fucking up the world isn't enough to convince you, it can fuck up your life too. So if you do something that turns out to cause problems, even though somebody told you to do it, you can absolutely be held criminally responsible. And you'll certainly look at the, what's his name, Kogan. I think a lot of people now know the name Aleksandr Kogan. He was the guy that handed over the Cambridge Analytica data. He's a Cambridge academic. Now a very famous Cambridge academic the world over for doing his part to destroy the foundations of democracy. So this is probably not how we want to go down in history. All right, so let's have a break. Before we do, Rachel? I have a question on a different topic. Yes. In one of your tweets, you said dropout is patented. I think this is about the WaveNet patent from Google. What does it mean? Can you please share more insight on this subject? Does it mean that we'll have to pay to use dropout in the future? Yeah. Okay. Good question. Let's talk about that after the break. And so let's come back at 7:40.
The question before the break was about patents. What does it mean? So I guess the reason it's coming up was because I wrote a tweet this week, which I think was like three words and said dropout is patented. One of the patent holders is Geoffrey Hinton. So what? Isn't that great? Invention's all about patents, blah, blah, blah, right? And so, you know, my answer is no, patents have gone wildly crazy. The amount of things that are patentable that we talk about every week would be dozens. Like it's so easy to come up with a little tweak and then, you know, you turn that into a patent to stop everybody from using that little tweak for 14 years, and you end up with the situation we have now where everything is patented in 50 different ways, and so then you get these patent trolls who have made a very, very good business out of basically buying lots of shitty little patents and then suing anybody who accidentally, it turns out, did that thing, you know, like putting rounded corners on buttons. Who was it? Apple sued Samsung or something? I don't remember. So, yeah. So what does it mean for us that a lot of stuff is patented in deep learning? I don't know. One of the main people doing this is Google. And people from Google who replied to this tweet tend to assume that, oh, Google's doing it because they want to have it defensively. So if somebody sues them, they'll be like, don't sue us, we'll sue you back because we have all these patents. The problem is that as far as I know they haven't signed what's called a defensive patent pledge. So basically you can sign a legally binding document that says our patent portfolio will only be used in defense and not offense. And even if you believe all the management of Google would never turn into a patent troll, you've got to remember that, you know, management changes, right?
And like to give a specific example, I know the, you know, the somewhat recent CFO of Google has a much more, you know, kind of aggressive stance towards the P&L. And I don't know, maybe she might decide that they should start monetizing their patents. Or maybe the group that made that patent might get spun off and then sold to another company that might end up in private equity hands and decide to monetize the patents or whatever. So I think it's a problem. There has been a big shift legally recently away from software patents actually having any legal standing. So it's possible that these will all end up thrown out of court. But, you know, the reality is that anything but a big company is unlikely to have the financial ability to defend themselves against one of these huge patent trolls. So I think it's a problem. I don't know. Like, you can't avoid using patented stuff if you write code. I wouldn't be surprised if most lines of code you write have patents on them. So actually, funnily enough, the best thing to do is not to study the patents, because if you do and you infringe knowingly, then the penalties are worse. So the best thing to do is to, like, put your hands in your ears, sing a song, you know, and get back to work. So that thing I said about dropout being patented? Forget I said that. You don't know that. You skipped that bit. Okay. This is super fun. Artistic style. We're going to kind of go a bit retro here because this is actually the kind of original artistic style paper. And there's been a lot of updates to it. A lot of different approaches. And I actually think kind of in many ways the original is the best. We're going to look at some of the newer approaches as well. But I actually think the original is a terrific way to do it, even with everything that's come since. Let's just jump to the code. So this is the style transfer notebook. So the idea here is that we want to take a photo.
We're going to take a photo of this bird. And we want to create a painting that looks like Van Gogh painted the picture of the bird. Van Gogh, Van Gogh. I don't know. Quite a bit of the stuff that I'm doing, by the way, uses ImageNet. You don't have to download the whole of ImageNet for any of the things I'm doing. There's an ImageNet sample on files.fast.ai. Which has like, I don't know, a couple of gig. It should be plenty good enough for everything we're doing. If you want to get really great results, you can grab ImageNet. You can download it from Kaggle. On Kaggle, the localization competition actually contains all of the classification data as well. All right. So if you've got room, it's good to have a copy of ImageNet because it comes in handy all the time. So I just grabbed a bird out of my ImageNet folder and there is my bird. And so what I'm going to do is I'm going to start with this picture and I'm going to try and make it more and more like a picture of this bird painted by Van Gogh. And the way I do that is actually very simple. You're all familiar with it. We will create a loss function, which we'll call F. And the loss function is going to take as input a picture and spit out as output a value. And the value will be lower if the image looks more like a bird photo painted by Van Gogh. Having written that loss function, we will then use the PyTorch gradient and optimizers. Gradient times the learning rate. And we're not going to update any weights. We're going to update the pixels of the input image to make it a little bit more like a picture which would be a bird painted by Van Gogh. And we'll stick it through the loss function again to get more gradients and do it again and again. And that's it. So it's identical to how we solve every problem. You know I'm a one-trick pony, right? This is my only trick. Create a loss function, use it to get some gradients, multiply it by learning rates to update something. 
Always before we've updated weights in a model. But today we're not going to do that. We're going to update the pixels in the input. But it's no different at all. We're just taking the gradient with respect to the input rather than respect to the weights. That's it. So we're nearly done. Let's do a couple more things. Let's mention here that there's going to be two more inputs to our loss function. One is the picture of the bird. Birds look like this, okay? And the second is an artwork by Van Gogh. They look like this. Oh, and of course. There we go. And by having those as inputs as well, that means we'll be able to rerun the function later to make it look like a bird painted by Monet or a jumbo jet painted by Van Gogh or whatever. So those are going to be the three inputs. And so initially, as we discussed, our input here, this is going to be the first time we've ever found the rainbow pen useful. That is awesome. Okay, some random noise. Okay, so we start with some random noise, use the loss function, get the gradients, make it a little bit more like a bird painted by Van Gogh, and so forth. Okay, so the only outstanding question, which I guess we can talk about briefly, is how we calculate how much our image looks like a bird, this bird, painted by Van Gogh. Okay, so let's split it into two parts. Let's split it into a part called the content loss, and that's going to return a function, a value that's lower if it looks more like the bird. Not just any bird, the specific bird that we had coming in. Okay, and then let's also create something called the style loss, and that's going to be a lower number if the image is more like Van Gogh's style. Okay, so there's one way to do the content loss, which is very simple. We could look at the pixels of the output, compare them to the pixels of the bird, and do a mean-squared error, add them up. So if we did that, I ran this for a while, eventually our image would turn into an image of the bird. You should try it. 
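To make the "update the pixels, not the weights" idea concrete, here is a minimal sketch of the pixel-wise mean-squared-error version in plain Python, with the gradient written out by hand so the arithmetic is visible. It is an illustration under assumptions, not the notebook's code: the 64-value lists stand in for images, and the learning rate of 5.0 is an arbitrary choice that happens to be stable for this toy loss. In PyTorch you would let autograd and an optimizer do the same job.

```python
import random

random.seed(0)
target = [random.random() for _ in range(64)]  # stand-in for the bird's pixels
image = [random.random() for _ in range(64)]   # random-noise starting image

def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

lr = 5.0  # arbitrary learning rate for this toy problem
history = [mse(image, target)]
for step in range(200):
    n = len(image)
    # Gradient of the loss with respect to each *pixel* (there are no weights).
    grad = [2 * (x - t) / n for x, t in zip(image, target)]
    # Plain gradient descent step on the pixels.
    image = [x - lr * g for x, g in zip(image, grad)]
    history.append(mse(image, target))

print(history[0], history[-1])
```

The loss shrinks every step and the noise image converges to the target, which is all a pixel loss can ever give you: an exact copy.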
You should try this as an exercise. Try to use the optimizer in PyTorch to start with a random image and turn it into another image by using mean-squared error pixel loss. Okay, not terribly exciting, but that would be step one. The problem is, even if we already had our style loss function working beautifully, and then presumably what we're going to do is we're going to add these two together, right? And then one of them we'll multiply by some lambda, some number we'll pick to adjust how much style versus how much content. So assuming we had a style loss and we had picked some sensible lambda, if we used a pixel-wise content loss, then anything that makes it look more like Van Gogh and less like the exact photo, the exact background, the exact contrast, lighting, everything, will increase the content loss, which is not what we want, right? We want it to look like the bird, but not in the same way, right? It's still going to have the same two eyes in the same place and be the same kind of shape and so forth, but not the same representation. So what we're going to do is, this is going to shock you. We're going to use a neural network, all right? We're going to use a neural network. I totally meant that to be black and it came out green. It's always a black box, never mind. And we're going to use the VGG neural network because that's what I used last year and I didn't have time to see if other things worked, so you can try that yourself during the week. And the VGG network is something which takes in an input and sticks it through a number of layers. And I'm just going to treat these as just the convolutional layers. There's obviously ReLU there, and if it's a VGG with batch norm, which most are today, then it's also got batch norm. And there's some max pooling and so forth, but that's fine.
What we could do is we could take one of these convolutional activations and then, rather than comparing the pixels of this bird, we could instead compare the VGG layer 5 activations of this to the VGG layer 5 activations of our original bird, or layer 6 or layer 7 or whatever. So why might that be more interesting? Well, for one thing, it wouldn't be the same bird. It wouldn't be exactly the same because we're not checking the pixels. We're checking some later set of activations. And so what do those later sets of activations contain? Well, assuming it's after some max pooling, they contain a smaller grid, so it's less specific about where things are, and rather than containing pixel color values, they're more like semantic things, like is this kind of like an eyeball, or is this kind of furry, or is this kind of bright, or is this kind of reflective, or is this lying flat, whatever. So we would hope that there's some level of semantic features through those layers where, if we get a picture that matches those activations, then any picture that matches those activations looks like the bird, but it's not the same representation of the bird. So that's what we're going to do. That's what our content loss is going to be. And people generally call this a perceptual loss, because it's really important in deep learning that you always create a new name for every obvious thing you do. So if you compare two activations together, you're doing a perceptual loss. So that's it. Our content loss is going to be a perceptual loss, and then we'll do the style loss later. So let's start by trying to create a bird that initially is random noise, and we're going to use perceptual loss to create something that is bird-like, but it's not this bird. So let's start by saying we're going to do 288 by 288. Because we're only going to do one bird, there's going to be no GPU memory problems. I was actually disappointed when I realized that I picked a rather small input image.
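Here is a toy illustration of why comparing activations after pooling is "less specific about where things are" than comparing pixels. It is a sketch under heavy assumptions: a single 2x2 max pool stands in for the VGG layers, and the two 4x4 grids are invented so that the bright value in each block sits in a different position. The pixel loss sees two different images; the pooled "activation" loss sees the same content.

```python
def max_pool_2x2(img):
    """2x2 max pooling with stride 2 on a list-of-lists image."""
    h, w = len(img), len(img[0])
    return [[max(img[r][c], img[r][c + 1], img[r + 1][c], img[r + 1][c + 1])
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

def mse(a, b):
    """Mean squared error between two equally sized grids."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)

original = [
    [9, 1, 0, 2],
    [1, 1, 2, 8],
    [3, 7, 5, 0],
    [0, 3, 0, 0],
]
# Same "content", but the brightest pixel of each 2x2 block has moved
# to a different position inside its block.
shifted = [
    [1, 9, 8, 2],
    [1, 1, 2, 0],
    [7, 3, 0, 0],
    [0, 3, 0, 5],
]

pixel_loss = mse(original, shifted)
feature_loss = mse(max_pool_2x2(original), max_pool_2x2(shifted))
print(pixel_loss, feature_loss)
```

A real network's activations are far richer than a max pool, but the same effect is what lets a perceptual loss accept "the bird, drawn differently".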
It'd be fun to try this on something much bigger to create a really grand-scale piece. The other thing to remember is if you were productionizing this, you could do a whole batch at a time. People sometimes complain about this approach, Gatys is the lead author, the Gatys style transfer approach, as being slow. I don't agree it's slow. It takes a few seconds, and you can do a whole batch in a few seconds. Anyway, we're going to stick it through some transforms as per usual, transforms for a VGG16 model. So remember, the transform class has a dunder call method (__call__), so we can treat it as if it's a function. So if you pass an image into that, then we get the transformed image. So try not to treat the fast.ai and PyTorch infrastructure as a black box, because it's all designed to be really easy to use in a decoupled way. So this idea that transforms are just callables, i.e. things that you can do with parentheses, comes from PyTorch, and we totally plagiarized the idea. So with torchvision or with fast.ai, basically, your transforms are just callables. And the whole pipeline of transforms is just a callable. So now we have something of 3 by 288 by 288, because PyTorch likes the channel to be first. And as you can see, it's been turned into a square for us. It's been normalized to (0, 1), all that normal stuff. Okay. Now we'll create a random image. Okay. And here's something I discovered. Trying to turn this into a picture of anything is actually really hard. I found it very difficult to actually get an optimizer to get reasonable gradients that went anywhere. And just as I thought I was going to run out of time for this class and really embarrass myself, I realized the key issue is that pictures don't look like this. They have more smoothness. So I turned this into this by just kind of blurring it a little bit. I used a median filter. Basically, it's like a median pooling, effectively.
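A median filter replaces each pixel with the median of its neighbourhood, which is why it behaves like a "median pooling" that keeps the image the same size. The following is a minimal pure-Python sketch of the idea, not the call used in the notebook; the function name `median_filter`, the 16x16 grid and the 3x3 window are all assumptions chosen for the example.

```python
import random
from statistics import median, pvariance

def median_filter(img, radius=1):
    """Replace every pixel by the median of its (2*radius+1)^2 neighbourhood.

    Windows are clipped at the image border rather than padded.
    """
    h, w = len(img), len(img[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(0, r - radius), min(h, r + radius + 1))
                      for cc in range(max(0, c - radius), min(w, c + radius + 1))]
            row.append(median(window))
        out.append(row)
    return out

def flatten(img):
    return [v for row in img for v in row]

random.seed(0)
noise = [[random.random() for _ in range(16)] for _ in range(16)]
smooth = median_filter(noise)

# Smoothed noise varies much less from pixel to pixel than raw noise.
print(pvariance(flatten(noise)), pvariance(flatten(smooth)))
```

The smoothed noise has visibly lower variance, which is the "pictures have more smoothness" property the optimizer seemed to need as a starting point.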
And as soon as I changed it from this to this, it immediately started training really well. So the number of little tweaks you have to do to get these things to work is kind of insane. But here was a little tweak. All right. So we start with a random image, which is at least somewhat smooth. Okay. And I found that my bird image had a standard deviation of pixels that was about half of this, sorry, mean, about half of this mean. So I divided it by two, just trying to make it a little bit easier for it to match. I don't know if it matters. Turn that into a variable because this image, remember, we're going to be modifying those pixels with an optimization algorithm. So anything that's involved in the loss function needs to be a variable. And specifically, it requires a gradient because we're actually updating the image. Okay. All right. So we now have a mini-batch of one, three channels, 288 by 288, random noise. We're going to use, for no particular reason, the 37th layer of VGG. If you print out the VGG network, you can just type in m_vgg and print it out. You'll see that this is a, you know, kind of mid to late stage layer. So we can just grab the first 37 layers and turn it into a sequential model. And so now we've got a subset of VGG that will spit out some mid-layer activations. And so that's what the model is going to be. So we can take our actual bird image, right? And we want to create a mini-batch of one. So remember, if you slice in numpy with None, also known as np.newaxis, it introduces a new unit axis at that point. So here I want to create an axis of size one to say this is a mini-batch of size one. So slicing with None, just like I did here, I sliced with None to get this one unit axis at the front. So then we turn that into a variable. And this one doesn't need to be updated, so we use VV to say you don't need gradients for this guy. And so that's going to give us our target activations.
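The two mechanical steps described above, keeping only the first N layers of a network and adding a unit batch axis by indexing with None, can be sketched as follows. This is an illustration in current PyTorch rather than the notebook's code: the small randomly initialised `full_net` is a hypothetical stand-in for a pretrained VGG (so nothing is downloaded), and it keeps 5 layers where the lecture keeps 37. `torch.no_grad()` plays the role the lecture gives to VV, i.e. "no gradients needed for this one".

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained VGG feature stack.
full_net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)

# Grab the first 5 layers and wrap them in a new sequential model.
truncated = nn.Sequential(*list(full_net.children())[:5]).eval()

img = torch.rand(3, 32, 32)   # one image, channels first
batch = img[None]             # indexing with None adds a unit axis at the front

with torch.no_grad():         # the target never needs gradients
    targ = truncated(batch)

print(batch.shape, targ.shape)
```

Indexing with None works the same way on numpy arrays, where None is an alias for np.newaxis.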
So we've basically taken our bird image and turned it into a variable, stuck it through our model to grab the 37th layer activations. And that's our target, right? We want our content loss to be measured against this set of activations here. So now we're going to create an optimizer. We'll go back to the details of this in a moment, but we're going to create an optimizer and we're going to step a bunch of times: zero the gradients, call some loss function, loss.backward, blah, blah, blah. So that's the high-level version, and I'm going to come back to the details in a moment. But the key thing is that to the loss function we're passing in that randomly generated image, the optimization image, or actually the variable of it. So we pass that to our loss function. And so it's going to update this using the loss function, and the loss function is the mean-squared-error loss comparing our current optimization image passed through our VGG to get the intermediate activations, and comparing it to our target activations. Just like we discussed. And we'll run that a bunch of times and we'll print it out, and we have our bird, but not the same representation of the bird. So there it is. So a couple of new details here. One is a weird optimizer, L-BFGS. Anybody who's done... I don't know exactly what courses it's in, but certain parts of math and computer science courses, comes into deep learning, discovers we use all this stuff like Adam and SGD, and always assumes that nobody in the field knows the first thing about computer science, and immediately says, oh, have any of you guys tried using BFGS? There's basically a long history of a totally different kind of algorithm for optimization that we don't use to train neural networks. And of course the answer is, actually the people who have spent decades studying neural networks do know a thing or two about computer science, and it turns out these techniques on the whole don't work very well.
But it's actually going to work well for this, and it's a good opportunity to talk about an interesting algorithm for those of you that haven't studied this type of optimization algorithm at school. So BFGS is... What are the names? Broyden, Fletcher, Goldfarb... I can't remember. Anyway, the initials are four different people (the fourth is Shanno). The L stands for limited memory. So it's really just called BFGS. Limited-memory BFGS. And it's an optimizer. So as an optimizer, it means that there's some loss function and it's going to use some gradients to... I mean, not all optimizers use gradients, but all the ones we use do. Use gradients to find a direction to go and try to make the loss function go lower and lower by adjusting some parameters. It's just an optimizer. But it's an interesting kind of optimizer because it does a bit more work than the ones we're used to on each step. And so specifically... Okay. There's a better place for it. So the way it works is it starts the same way that we're used to, which is we just kind of pick somewhere to get started. And in this case, we've picked like a random image, as you saw. And as per usual, we calculate the gradient. Okay? But we then don't just take a step. What we actually do is, as well as finding the gradient, we also try to find the second derivative. So the second derivative says how fast does the gradient change? So the gradient is how fast does the function change? The second derivative is how fast does the gradient change? In other words, how curvy is it? All right? And the basic idea is that if you know that it's like not very curvy, then you can probably jump further. But if it is very curvy, then you probably don't want to jump as far. And so in higher dimensions, the gradient's called the Jacobian, and the second derivative's called the Hessian. You'll see those words all the time, but that's all they mean. Okay? Again, mathematicians have to invent new words for everything as well.
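The "not very curvy, jump further" idea has a simple one-dimensional form: divide the gradient by the second derivative (a Newton step). The sketch below is an illustration only, with an invented family of quadratics f(x) = 0.5 * curvature * x**2 and a hypothetical helper `newton_step`. Each case starts where the gradient equals 1, so the only thing that differs is the curvature.

```python
def newton_step(grad, second_derivative):
    """Step that would land on the minimum of a locally quadratic function."""
    return -grad / second_derivative

results = []
for curvature in (0.1, 1.0, 10.0):
    # f(x) = 0.5 * curvature * x**2, so f'(x) = curvature * x and f''(x) = curvature.
    x = 1.0 / curvature            # the point where the gradient is exactly 1
    grad = curvature * x
    step = newton_step(grad, curvature)
    results.append((curvature, x, step))
    print(f"curvature {curvature}: gradient {grad:.1f}, step {step:.2f}")
```

With the same gradient, the flattest function gets the longest jump and the curviest gets the shortest, and in each case the step lands on the minimum at zero.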
They're just like deep learning researchers. Except maybe a bit more snooty. So with BFGS, we're going to try and calculate the second derivative, and then we're going to use that to figure out kind of what direction to go and how far to go. So it's less of a kind of a wild jump into the unknown. Now the problem is that actually calculating the Hessian, the second derivative, is almost certainly not a good idea, because for each direction that you're measuring the gradient in, you also have to calculate the Hessian in every direction. It gets ridiculously big. So rather than actually calculating it, we take a few steps and we basically look at how much the gradient's changing as we do each step, and we approximate the Hessian using that. And again, this seems like a really obvious thing to do, but nobody thought of it until a surprisingly long time later. Keeping track of every single step you take takes a lot of memory. So don't keep track of every step you take. Just keep the last 10 or 20. And the second bit there, that's the L in L-BFGS. So limited-memory BFGS means keep the last 10 or 20 gradients, use that to approximate the amount of curvature, and then use the curvature and gradient to estimate what direction to travel and how far. And so that's normally not a good idea in deep learning for a number of reasons. You know, it's obviously more work to do than an Adam or an SGD update. And it's obviously more memory. Memory is much more of a big issue when you've got a GPU to store it on and hundreds of millions of weights. But more importantly, the mini-batches are super bumpy, so figuring out curvature to decide exactly how far to travel is kind of polishing turds, as we say. Is that an American expression or just an Australian thing? I bet it's English too. Is that English too? Oh yeah, polishing turds. You get the idea.
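For readers who want to see the "keep the last few steps and use them to approximate curvature" idea in code, here is a compact sketch of limited-memory BFGS in plain Python on an invented two-variable quadratic. It uses the standard two-loop recursion plus a simple backtracking line search; the function names, the test function and the constants are all assumptions for illustration, and real implementations (including PyTorch's) differ in detail.

```python
from collections import deque

def f(p):
    x, y = p
    return (x - 3.0) ** 2 + 10.0 * (y + 1.0) ** 2   # minimum at (3, -1)

def grad_f(p):
    x, y = p
    return [2.0 * (x - 3.0), 20.0 * (y + 1.0)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def lbfgs_direction(g, pairs):
    """Two-loop recursion: approximate -(inverse Hessian) @ g from stored
    (step, gradient-change) pairs, without ever forming the Hessian."""
    q = list(g)
    alphas = []
    for s, y in reversed(pairs):                      # newest pair first
        alpha = dot(s, q) / dot(y, s)
        alphas.append(alpha)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    if pairs:                                         # scale by latest curvature
        s, y = pairs[-1]
        gamma = dot(s, y) / dot(y, y)
        q = [gamma * qi for qi in q]
    for (s, y), alpha in zip(pairs, reversed(alphas)):  # oldest pair first
        beta = dot(y, q) / dot(y, s)
        q = [qi + (alpha - beta) * si for qi, si in zip(q, s)]
    return [-qi for qi in q]

def minimise(p, history_size=10, max_steps=100):
    pairs = deque(maxlen=history_size)   # the "limited memory": old pairs fall off
    g = grad_f(p)
    for _ in range(max_steps):
        if dot(g, g) < 1e-20:
            break
        d = lbfgs_direction(g, pairs)
        t = 1.0
        for _ in range(60):              # backtracking (Armijo) line search
            trial = [pi + t * di for pi, di in zip(p, d)]
            if f(trial) <= f(p) + 1e-4 * t * dot(g, d):
                break
            t *= 0.5
        new_g = grad_f(trial)
        s = [a - b for a, b in zip(trial, p)]     # how far we moved
        y = [a - b for a, b in zip(new_g, g)]     # how much the gradient changed
        if dot(y, s) > 1e-12:                     # keep only useful curvature info
            pairs.append((s, y))
        p, g = trial, new_g
    return p

solution = minimise([0.0, 0.0])
print(solution)
```

The deque with a maximum length is the whole "L": once more than `history_size` pairs are stored, the oldest is dropped.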
And also, interestingly, actually using the second derivative information, it turns out, is like a magnet for saddle points. So there's some interesting theoretical results that basically say it actually sends you towards nasty flat areas of the function if you use second derivative information. So normally not a good idea. But in this case, we're not optimizing weights. We're optimizing pixels, so all the rules change. And actually it turns out L-BFGS does make sense. And because it does more work each time, it's a different kind of optimizer, and the API is a little bit different in PyTorch. As you can see here, when you say optimizer.step, you actually pass in the loss function. And so what I pass in is my step function, called with a particular loss function, which is my activation loss. And as you can see, inside the loop, you don't say step, step, step, but rather it looks like this. So it's a little bit different. And you're welcome to try and rewrite this to use SGD. It'll still work. It'll just take a bit longer. I haven't tried it with SGD, so I don't know how much longer it takes. Okay, so you can see the loss function going down: the mean squared error between the activations at layer 37 of our VGG model for our optimized image versus the target activations. And remember, the target activations were the VGG applied to our bird. Does that make sense? Okay, so we've now got a content loss. Now, one thing I'll say about this content loss is we don't know which layer is going to work best. So it'd be nice if we were able to experiment a little bit more, and the way it is here is annoying. Maybe we even want to use multiple layers. So rather than lopping off all of the layers after the one we want, wouldn't it be nice if we could somehow grab the activations of a few layers as it calculates? Now, we already know one way to do that. Back when we did SSD, we actually wrote our own network, which had a number of outputs.
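Going back to the `optimizer.step` API described above, here is a small sketch of how PyTorch's `torch.optim.LBFGS` expects to be called: `step` takes a closure that zeroes the gradients, recomputes the loss, calls `backward`, and returns the loss. The quadratic `loss_fn`, the four-element `opt_img` tensor, and the number of outer steps are stand-ins assumed for illustration, not the lecture's activation loss or image.

```python
import torch

target = torch.full((4,), 3.0)                 # stand-in for the target activations
opt_img = torch.zeros(4, requires_grad=True)   # stand-in for the pixels being optimized
optimizer = torch.optim.LBFGS([opt_img], lr=1.0)

def loss_fn(x):
    # Toy loss: squared distance to the target
    return ((x - target) ** 2).sum()

def closure():
    # L-BFGS may evaluate the loss several times per step, so it needs a function
    # it can call again, rather than a single precomputed loss value
    optimizer.zero_grad()
    loss = loss_fn(opt_img)
    loss.backward()
    return loss

for _ in range(5):
    optimizer.step(closure)

final_loss = loss_fn(opt_img).item()
print(final_loss)
```

Swapping in `torch.optim.SGD` would still work with the same closure, since PyTorch optimizers accept an optional closure in `step`.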
Remember, the different convolutional layers? We spat out a different oconv thing for each. But I don't really want to go and add that to the torchvision ResNet model, especially not if later on I want to try the torchvision VGG model, and then I want to try the NASNet-A model. I don't want to go into all of them and change their outputs, right? Besides which, I'd like to easily be able to turn certain activations on and off on demand. So we've briefly touched before on this idea that PyTorch has these fantastic things called hooks. You can have forward hooks that let you plug anything you like into the forward pass of a calculation, or a backward hook that lets you plug anything you like into the backward pass. So we're going to create the world's simplest forward hook. And this is one of these things that almost nobody knows about. So almost any code you find on the Internet that implements style transfer will have all kinds of horrible hacks rather than using forward hooks. But with forward hooks, it's really easy. So to create a forward hook, you just create a class, right? And the class has to have something called hook_fn. And your hook function is going to receive the module that you've hooked, it's going to receive the input for the forward pass, and it's going to receive the output. And then you do whatever the hell you like. So what I'm going to do is I'm just going to store the output of this module in some attribute. That's it, right? So this can actually be called anything you like, but hook_fn seems to be the standard, because you can see what happens here in the constructor is I store inside some attribute the result of this: m is going to be the layer that I'm going to hook, and you go module.register_forward_hook and pass in the function that you want to be called when this module's forward method is called.
So when its forward method is called, it will call self.hook_fn, which will store the output in an attribute called features, okay? So now what we can do is we can create our VGG as before, right? And let's set it to not trainable so we don't waste time and memory calculating gradients for it. And let's go through and find all of the max pool layers, right? So let's go through all of the children of this module, and if it's a max pool layer, let's spit out index minus one. So that's going to give me the layer before the max pool. And in general, the layer before a max pool or the layer before a stride 2 conv is a very interesting layer, right? Because it's the most complete representation we have at that grid cell size, because the very next layer is changing the grid, okay? So that seems to me like a good place to grab the content loss from: the best, most semantic, most interesting content we have at that grid size. So that's why I'm going to pick those indexes. So here they are. Those are the indexes of the last layer before each max pool in VGG. So I'm going to grab this one here, 22, just for no particular reason, just to try something else. Sorry, this one here, 32. So I'm going to say block_ends[3], that's going to be 32. So children(m_vgg)[block_ends[3]] will give me the 32nd layer of VGG as a module, right? And then if I call the SaveFeatures constructor, it's going to go self.hook equals 32nd layer of VGG .register_forward_hook(self.hook_fn), okay? So now every time I do a forward pass on this VGG model, it's going to store the 32nd layer's output inside sf.features. So you can see here, I'm calling my VGG network, but I'm not storing the result anywhere. I'm not saying activations equals VGG of my image. I'm calling it, throwing away the answer, and then grabbing the features that we stored in our sf, our SaveFeatures object, right?
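The hook mechanics described above can be sketched as follows. `register_forward_hook` and the returned handle's `remove` are real PyTorch APIs; the tiny `nn.Sequential` model is an assumption that stands in for VGG so the example runs without downloading weights, and the `SaveFeatures` class is written from the lecture's description rather than copied from the notebook.

```python
import torch
import torch.nn as nn

class SaveFeatures:
    """Stores the output of one module every time a forward pass runs."""
    features = None

    def __init__(self, module):
        # Keep the handle so the hook can be removed later
        self.hook = module.register_forward_hook(self.hook_fn)

    def hook_fn(self, module, input, output):
        # Called by PyTorch with the hooked module, its input, and its output
        self.features = output

    def close(self):
        self.hook.remove()

# Small stand-in for VGG: two conv blocks, each ending in a max pool
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
).eval()
for p in model.parameters():
    p.requires_grad_(False)   # not trainable: no gradients needed for the weights

layers = list(model.children())
# Index of the layer just before each max pool
block_ends = [i - 1 for i, layer in enumerate(layers) if isinstance(layer, nn.MaxPool2d)]

sf = SaveFeatures(layers[block_ends[1]])
model(torch.randn(1, 3, 16, 16))          # call the model and throw away its output
feature_shape = tuple(sf.features.shape)  # the hooked layer's activations
sf.close()
print(block_ends, feature_shape)
```

Hooking a different layer only means passing a different index into `layers`; the model itself is never modified.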
So that way, this is now going to contain the activations once I've done a forward pass. And that's how you do a forward pass in PyTorch. You don't say .forward, you just use it as a callable. And using it as a callable on an nn.Module automatically calls forward. That's how PyTorch modules work, okay? So we call it as a callable. That ends up calling our forward hook. That forward hook stores the activations in sf.features. And so now we have our target variable, just like before, but in a much more flexible way. These are the same four lines of code we had earlier. I've just stuck them into a function, okay? And so it's just giving me my random image to optimize and an optimizer to optimize that image. This is exactly the same code as before. So that gives me these. And so now I can go ahead and do exactly the same thing, right? But now I'm going to use a different loss function, activation loss number two, which doesn't say out equals m_vgg. Again, it calls m_vgg to do a forward pass, throws away the results, and grabs sf.features, okay? And so that's now my 32nd-layer activations, which I can then do my MSE loss on. You might have noticed that the last loss function and this one are both multiplied by 1,000. Why are they multiplied by 1,000? Again, this was like all the things that were trying to get this lesson to not work correctly. I didn't used to have the 1,000, and it wasn't training. Lunchtime today, nothing was working. After days of trying to get this thing to work, I finally just randomly noticed, gosh, the loss functions, the numbers are really low, like 1e-7. And I just thought, well, what if they weren't so low? So I multiplied them by 1,000, and it started working. So why did it not work? Because we're doing single precision floating point, and single precision floating point ain't that precise.
And particularly once you're getting gradients that are kind of small, and then you're multiplying by the learning rate, which can be kind of small, you end up with a small number. And if it's so small, it can get rounded to zero, and that's what was happening, and my model wasn't training. So I'm sure there are better ways than multiplying it by 1,000, but whatever, it works fine. It doesn't matter what you multiply a loss function by, because all you care about is its direction and its relative size. And interestingly, this is actually something similar to when we were training ImageNet. We were using half precision floating point because the Volta tensor cores require that. And it's actually standard practice: if you want to get half precision floating point to train, you actually have to multiply the loss function by a scaling factor. And we were using 1,024 or 512. And I think FastAI is now the first library that has all of the tricks necessary to train in half precision floating point built in. So if you're lucky enough to have a Volta, or you can pay for a P3, and you've got a learner object, you can just say learn.half(), and it will now automatically train correctly in half precision floating point. It's built into the model data objects as well, it's all automatic, and I'm pretty sure no other library does that. Okay, so this is just doing the same thing on a slightly earlier layer. And you can see that with the later layer it doesn't look very bird-like at all, but you can kind of tell it's a bird. Slightly earlier layer, more bird-like, right? And hopefully that makes sense to you, that earlier layers are getting closer to the pixels. There are more grid cells, and each cell is smaller. Smaller receptive field, less complex semantic features. So the earlier we get, the more it's going to look like a bird.
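The rounding problem behind the "multiply by 1,000" fix can be seen with a few NumPy scalars. This is only an illustration of floating point behaviour, not a reproduction of what happened inside the lecture's optimizer; the specific values (a weight of 1.0, an update of 1e-8, a scale of 1,024) are assumptions chosen to make the effect visible.

```python
import numpy as np

# Single precision: a tiny update is absorbed when added to a much larger value
weight = np.float32(1.0)
tiny_update = np.float32(1e-8)
absorbed = (weight + tiny_update) == weight                     # the update vanishes
survives = (weight + tiny_update * np.float32(1000)) != weight  # scaled update registers

# Half precision: a tiny gradient underflows to zero unless the loss is scaled up first
tiny_grad = 1e-8
underflows = np.float16(tiny_grad) == 0
scaled_ok = np.float16(tiny_grad * 1024) > 0

print(absorbed, survives, underflows, scaled_ok)
```

Scaling the loss scales every gradient by the same factor, which is why it changes whether the numbers survive rounding without changing the direction of the update.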
And in fact, the paper has a nice picture of that, showing various different layers and zooming into this house. They're trying to make this house look like this picture, and you can see that later on it's pretty messy, and earlier on it looks like this. Okay, so this is just doing what we just did. And I will say, one of the things I've noticed in our study group is, anytime I say to somebody, to answer a question, "read the paper, there's a thing in the paper that tells you the answer to that question," there's always this look, like, read the paper? Me? The paper? But seriously, the papers have done these experiments and drawn the pictures. There's all this stuff in the papers. It doesn't mean you have to read every part of the paper, right? But at least look at the pictures. So check out the Gatys paper. It's got nice pictures. Okay. So they've done the experiment for us. They basically did this experiment. But it looks like they didn't go as deep. They just got some earlier ones. Okay, the next thing we need to do is to create the style loss, right? So we've already got the content loss, which is how much like the bird is it. Now we need how much like this painting style is it. And we're going to do nearly the same thing. We're going to grab the activations of some layer. Now the problem is that the activations of some layer, let's say it was a 5 by 5 layer. I mean, of course, there are no 5 by 5 layers at 224 by 224, but we'll pretend. 5 by 5 by whatever. 19, say. Totally unrealistic sizes, but never mind. So here's some activations, and we could get these activations both for the image we're optimizing and for our Van Gogh painting. And let's look at our Van Gogh painting. There it is. Okay, sorry, no. I downloaded this from Wikipedia and I was wondering what was taking so long to load. It turns out that the Wikipedia version I downloaded was 30,000 by 30,000 pixels. It's pretty cool.
They've got this serious gallery-quality archive stuff there. I didn't know it existed. So don't try and run a neural net on that. Totally killed my Jupyter notebook. Okay. So yeah, we can do that for our Van Gogh image and we can do that for our optimized image. And then we could compare the two, and we would end up creating an image that has content like the painting. But that's not what we want. We want something with the same style, but it's not the painting. It doesn't have the content. So we actually want to throw away all of the spatial information. We're not trying to create something that has a moon here and stars here and a church here and whatever. We don't want any of that. So how do we throw away all the spatial information? What we do is, in this case, there are 19 faces on this, right? 19 slices. So let's grab this top slice. That's going to be a five by five matrix. And now let's flatten it. So now we've got a 25-long vector. In one stroke, we've thrown away the bulk of the spatial information by flattening it, right? Now let's grab a second slice, so another channel, and do the same thing. Okay, so here's channel one flattened. Here's channel two flattened. And they've both got 25 elements. And now let's take the dot product, which we can do with @ in NumPy. And so the dot product's going to give us one number. Right, and what's that number? What is it telling us? Well, assuming these activations are somewhere around the middle layer of the VGG network, we might expect some of these activations to be like, how textured is the brush stroke? And some of them to be like, how bright is this area? And some of them to be like, is this part of a house or part of a circular thing? Or others to be, how dark is this part of the painting?
And so a dot product, remember, is basically a correlation, right? If this element and this element are both highly positive or both highly negative, it gives us a big result, right? Or else if they're the opposite, it gives a small result. If they're both close to zero, it gives no result. So basically a dot product is a measure of how similar these two things are. Right, and so if the activations of channel one and channel two are similar, then it basically says... let's give an example. Let's say this first one was like, how textured are the brush strokes? And this one here, let's say, was like, how diagonally oriented are the brush strokes, right? And if both of these were high together and both of these were high together, then it's basically saying, oh, anywhere that there's more textured brush strokes, they tend to be diagonal, right? Another interesting one is, what would be the dot product of C1 with C1? So that would be basically the 2-norm, the sum of the squares of that channel, which in other words is basically just, on average, how... Sorry, let's go back, I screwed this up. Channel one might be texture and channel two might be diagonal, and this one here would be cell (1, 1) and this cell here would be, say, cell (4, 2). And so, sorry, what I should have been saying is, if these are both high at the same time and these are both high at the same time, then it's saying that grid cells that have texture tend to also have diagonal. So sorry, I drew that all wrong. The idea was right, I just drew it all wrong. So this number is going to be high when grid cells that have texture also have diagonal, and when they don't, they don't. So that's C1 dot product C2, whereas C1 dot product C1 is basically, as we said, the 2-norm, effectively, squared, or the sum of the squares of C1: sum over i of C1_i squared.
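The flatten-then-dot-product step can be sketched with three hand-made 5 by 5 channels. The channel meanings (`texture`, `diagonal`, `bright`) and the active cells, including (1, 1) and (4, 2) from the description above, are made up for illustration; real activations would be continuous values rather than zeros and ones.

```python
import numpy as np

texture = np.zeros((5, 5))
diagonal = np.zeros((5, 5))
bright = np.zeros((5, 5))

# Texture and diagonal are active in the same two grid cells
texture[1, 1] = texture[4, 2] = 1.0
diagonal[1, 1] = diagonal[4, 2] = 1.0
# Brightness is active somewhere else entirely
bright[0, 3] = 1.0

# Flatten each 5x5 channel to a 25-long vector, discarding the grid layout
c1, c2, c3 = (ch.reshape(-1) for ch in (texture, diagonal, bright))

together = c1 @ c2   # high: textured cells also tend to be diagonal
apart = c1 @ c3      # zero: textured cells are never the bright ones
how_much = c1 @ c1   # sum of squares: how much texture there is overall
print(together, apart, how_much)
```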
And this is basically saying, in how many grid cells is the textured channel active, and how active is it? So in other words, C1 dot product C1 tells us how much textured painting is going on, and C2 dot product C2 tells us how much diagonal paint stroke is going on, and maybe C3 is, is it bright colors? So C3 dot product C3 would be, how often do we have bright colored cells? So what we could do then is we could create a 25 by 25 matrix containing every one: channel 1, channel 2, channel 3, channel 1, channel 2, channel 3... sorry, not 25, man, it's been a long day. 19, there are 19 channels, so 19 by 19. Okay, channel 1, channel 2, channel 3, up to channel 19, and channel 1, channel 2, channel 3, up to channel 19. And so this would be the dot product of channel 1 with channel 1, this would be the dot product of channel 2 with channel 2, and so forth, after flattening. And like we've discussed, mathematicians have to give everything a name. So this particular matrix, where you flatten something out and then do all the dot products, is called a Gram matrix. And I'll tell you a secret: most deep learning practitioners either don't know or don't remember all these things, like what is a Gram matrix? If they ever did study it at university, they probably forgot it because they had a big night afterwards. And the way it works in practice is you realize, oh, I could create a kind of non-spatial representation of how the channels correlate with each other. And then when I write up the paper, I have to go and ask around and say, does this thing have a name? And somebody would be like, oh, isn't that the Gram matrix? And you go and look it up, and it is. So don't think you have to go and study all of math first. First, you use your intuition and common sense, and then you worry about what the math is called later. Normally. Sometimes it works the other way. Not with me, because I can't do math. Okay.
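Here is a small NumPy sketch of that 19 by 19 matrix, using the same pretend sizes (19 channels on a 5 by 5 grid) and random numbers in place of real activations. The last two lines check the claim that flattening throws away the spatial layout: shuffling the grid positions, the same way for every channel, leaves the matrix unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.standard_normal((19, 5, 5))   # pretend activations: 19 channels, 5x5 grid

flat = acts.reshape(19, -1)              # each channel flattened to 25 numbers
gram = flat @ flat.T                     # every channel dotted with every channel

# Diagonal entries are each channel's sum of squares; off-diagonal entries
# say which channels tend to be active in the same grid cells
channel_energy = np.diag(gram)

# Shuffle the 25 positions identically for all channels
perm = rng.permutation(25)
gram_shuffled = flat[:, perm] @ flat[:, perm].T
print(gram.shape, np.allclose(gram, gram_shuffled))
```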
So this is called the Gram matrix. And of course, if you're a real mathematician, it's very important that you say this as if you always knew it was a Gram matrix, and you just go, oh, yes, we just calculate the Gram matrix. That's really important. So in the Gram matrix, then, the diagonal is perhaps the most interesting. The diagonal is which channels are the most active. And then the off-diagonal is which channels tend to appear together. And overall, if two pictures have the same style, then we're expecting that at some layer of activations, they will have similar Gram matrices. Because if we found the level of activations that capture a lot of stuff about paint strokes and colors and so on, then the diagonal alone might even be enough. And that's another interesting homework assignment, if somebody wants to take it: try doing Gatys' style transfer not using the Gram matrix, but just using the diagonal of the Gram matrix. That would be like a single line of code to change, but I haven't seen it tried. I don't know if it would work at all, but it might work fine. Christine, I'll pass this to Christine. Okay, yes, Christine, you've tried it. "I was going to say, I have tried that, and it works most of the time, except when you have funny pictures where you need two styles to appear in the same spot. So if you have grass in one half and a crowd in one half and you need the two styles." Cool. You've still got to do your homework. Okay, Christine says she'll do it for you. Okay, so let's do that. So here's our painting. I've tried to resize the painting so it's the same size as my bird picture. So that's all this is doing. So I make the... yeah, there it is. It doesn't matter too much which bit I use, as long as it's got lots of the nice style in it. I grab my optimizer and my random image just like before.
And this time I call SaveFeatures for all of my block_ends, and that's going to give me an array of SaveFeatures objects, one for each module that appears in the layer before a max pool. Because this time I want to play around with different activation layer styles, or more specifically, I want to let you play around with it. So now I've got a whole array of them. So now I call my VGG module on my image again. Yeah, I'm not going to use that yet. Okay, ignore that line. The style image is my Van Gogh painting. So I take my style image, put it through my transformations to create my transformed style image. I turn that into a variable, put it through the forward pass of my VGG module, and now I can go through all of my SaveFeatures objects and grab each set of features. And notice I call clone, right? Because later on, if I call my VGG object again, it's going to replace those contents. I haven't quite thought about whether this is necessary. If you take it away, it's fine, but I was just being careful. So here's now an array of the activations at every block_end layer. So here you can see all of those shapes. And you can see that being able to whip up a list comprehension really quickly is really important when you're fiddling around, because you really want to be able to immediately see, here's my channels, 64, 128, and so on, and you can see here the grid size halving, as we would expect, because all of these appear just before a max pool. So to do a Gram MSE loss, it's going to be the MSE loss on the Gram matrix of the input versus the Gram matrix of the target. And the Gram matrix is just the matrix multiply of x with x transpose, where x is simply equal to my input where I've flattened the batch and channel axes all down together. And I've only got one image, so you can kind of ignore the batch part, right?
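Pulling that description together with the normalization and scaling that the lecture goes on to describe (divide by the number of elements, then multiply by a big number), a Gram MSE loss might look like the sketch below. The function names, the `1e6` scale factor, and the random stand-in tensors are assumptions based on the transcript's description, so treat this as an approximation of the notebook code rather than a copy of it.

```python
import torch
import torch.nn.functional as F

def gram(input):
    # input has shape (batch, channels, height, width)
    b, c, h, w = input.size()
    x = input.view(b * c, -1)    # (batch*channels) by (height*width)
    # Matrix multiply with the transpose, normalized by the element count,
    # then scaled up so the numbers are not tiny
    return torch.mm(x, x.t()) / input.numel() * 1e6

def gram_mse_loss(input, target):
    return F.mse_loss(gram(input), gram(target))

# Random stand-ins for one layer's activations of two different images
a = torch.randn(1, 64, 8, 8)
b = torch.randn(1, 64, 8, 8)
same = gram_mse_loss(a, a).item()
different = gram_mse_loss(a, b).item()
print(gram(a).shape, same, different)
```

With a batch of one, the result is a channel by channel matrix per layer, regardless of the grid size at that layer.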
It's basically channel by something else, which in this case is height times width, so it's now going to be channel by (height times width). And then, as we discussed, we can just do the matrix multiply of that by its transpose, and just to normalize it, we'll divide that by the number of elements. It would actually be more elegant if I had said divided by input.numel(); that would be the same thing. Okay, and then again, this gives me tiny numbers, so I multiply it by a big number to make it something more sensible. Okay, so that's basically my loss. So now my style loss is to take my image to optimize, throw it through the VGG forward pass, grab an array of the features in all of the SaveFeatures objects, and then call my Gram MSE loss on every one of those layers. Okay, and that's going to give me an array, and then I just add them up. Now, you could add them up with different weightings, you could add up a subset, whatever, right? In this case, I'm just grabbing all of them. Pass that into my optimizer as before, and here we have a random image in the style of Van Gogh, which I think is kind of cool. Okay, and again, Gatys has done it for us. Here are different layers of a random image in the style of Van Gogh. And so for the first one, as you can see, the activations are simple geometric things, not very interesting at all. These are much more interesting. So we kind of have a suspicion that we probably want to use later layers, largely, for our style loss if we want it to look good. All right, I added this, where was it, this SaveFeatures.close, which just calls... remember, I stored the hook here, and so hook.remove gets rid of it. It's a good idea to get rid of it, because otherwise you can potentially just keep using memory. So at the end, I go through each of my SaveFeatures objects and close it. So style transfer is adding the two together with some weight. So there's not much to show. Grab my optimizer, grab my image, and now my combined loss: the content loss at one particular layer, my
style loss at all of my layers; sum up the style losses and add them to the content loss. The content loss I'm scaling. Actually, the style loss I scaled already by 1e6, and this one is 1, 2, 3, 4, 5, 6 zeros, so actually they're both scaled exactly the same. Add them together, and again, you could try weighting the style losses, or you could maybe remove some of them, whatever. So this is the simplest possible version. Train that, and, like, holy shit, it actually looks good. So I think that's pretty awesome. Again, the main takeaway here is, if you want to solve something with a neural network, all you've got to do is set up a loss function and then optimize something. And the loss function is something where a lower number is something that you're happier with, because then when you optimize it, it's going to make that number as low as it can, and it'll do what you want it to do. Here we came up with a loss function that does a good job of being a smaller number when it looks like the thing we want it to look like, and it looks like the style of the thing we want it to be the style of. And that's all we had to do. When it actually comes to it, apart from implementing Gram MSE loss, which was like six lines of code, if that, that's our loss function. Pass it to our optimizer, wait five seconds. And remember, we could do a batch of these at a time, so we could wait five seconds and 64 of these will be done. So I think that's really interesting, and since this paper came out, it's really inspired a lot of interesting work. To me, though, most of the interesting work hasn't happened yet, because to me, the interesting work is the work where you combine human creativity with these kinds of tools. And I haven't seen much in the way of tools that you can download or use where the artist is in control and can do things interactively. It's interesting talking to the guys at the Google Magenta project, which is kind of their creative AI project. All of the stuff they're doing with music is
specifically about this. It's building tools that musicians can use to perform in real time. And so you'll see much more of that in the music space, thanks to Magenta. If you go to their website, there's all kinds of things where you can press the buttons to actually change the drum beats or melodies or keys or whatever. And you can definitely see Adobe and NVIDIA starting to release little prototypes that are starting to do this. But this kind of creative AI explosion hasn't happened yet. I think we have pretty much all the technology we need, but no one's put it together into a thing and said, look at the thing I built, and look at the stuff that people built with my thing. So that's just a huge area of opportunity. So the paper that I mentioned at the start of class in passing, the one where we can add Captain America's shield to arbitrary paintings, basically used this technique. The trick, though, was some minor tweaks to make the pasted Captain America shield blend in nicely, right? But that paper's only a couple of days old, so that would be a really interesting project to try, because you can use all this code. It really does leverage this approach. And then you could start by making the content image be the painting with the shield, and then the style image could be the painting without the shield, and that would be a good start. And then you could see what specific problems they tried to solve in this paper. But you could have a start on it right now. Okay, so let's make a quick start on the next bit, which is... yes, Rachel? "Two questions. Earlier, there were a number of people that expressed interest in your thoughts on Pyro and probabilistic programming." So yeah, TensorFlow's now got this, they call it TensorFlow Probability or something. There's a bunch of probabilistic programming frameworks out there. I think they're intriguing, but as yet unproven in
the sense that I haven't seen anything done with any probabilistic programming system which hasn't been done better without them. The basic premise is that it allows you to create more of a model of how you think the world works and then plug in the parameters. So back when I used to work in management consulting 20 years ago, we used to do a lot of stuff where we would use a spreadsheet, and then we would have these Monte Carlo simulation plugins. There's one called @RISK and one called Crystal Ball. I don't know if they still exist decades later. But basically they would let you change a spreadsheet cell to say, this is not a specific value, but it actually represents a distribution of values with this mean and this standard deviation, or it's got this distribution. And then you would hit a button, and the spreadsheet would recalculate a thousand times, pulling random numbers from the distributions, and show you the distribution of your outcome, which might be profit or market share or whatever. And we used them all the time back then. I partly think that a spreadsheet is a more obvious place to do that kind of work, because you can see it all much more naturally. But I don't know. We'll see. At this stage, I hope it turns out to be useful, because I find it very appealing, and it appeals to, as I say, the kind of work I used to do a lot of. There's actually whole practices around this stuff. They used to call it system dynamics, which really was built on top of this kind of stuff. But I don't know, it's not quite gone anywhere. "Okay, then there was a question about pre-training for generic style transfer." Yes, I don't think you can pre-train for a generic style, but you can pre-train for a generic photo for a particular style, which is where we're going to get to. Although it may end up being homework, I haven't decided. But I'm going to do all the pieces. "And one more question is, please ask him to talk about multi-GPU." Oh yeah, I even have a slide
about that. We're about to hit it. Okay, so before we do, just another interesting picture from the Gatys paper. They've got a few more that just didn't fit in my slide here, but these are different convolutional layers for the style, different style-to-content ratios, and here's the different images. Obviously this isn't Van Gogh anymore, this is a different combination. So you can see, if you just do all style, you don't see any image. If you do lots of content, but you use a low enough convolutional layer, it looks okay, but the background is kind of dumb. So you kind of want somewhere around here or here, I guess. Anyway, so you can play around and experiment, but also use the paper to help guide you. Actually, I think I might work on the math now, and we'll talk about multi-GPU and super resolution next week. This is from the paper, and one of the things I really do want you to do after we talk about a paper is to read the paper and then ask questions on the forum about anything that's not clear. But there's a key part of this paper which I wanted to talk about and discuss how to interpret. So the paper says we're going to be given an input image x, and this little thing on top normally means it's a vector, Rachel, but this one's a matrix? I guess it can mean either. Yeah, I don't know. Anyway, so normally a small bold letter means vector, or a small letter with a doobie on top means vector; they can both mean vector. And normally a big letter means matrix, or a small letter with two doobies on top means matrix. In this case, our image is a matrix, and we are going to basically treat it as a vector, so maybe we're just getting ahead of ourselves. So we've got an input image x, and it can be encoded in a particular layer of the CNN by the filter responses. Filter responses are activations. So hopefully that's something you all understand. That's basically what a CNN does: it produces layers of activations. A
layer has a bunch of filters, which produce a number of channels. And so this here says that layer number l has capital N_l filters. And again, this capital does not mean matrix, so I don't know, math notation is so inconsistent. So capital N_l distinct filters at layer l, which means it has also that many feature maps. So make sure you can see that this letter is the same as this letter. You've got to be very careful to read the letters and recognize, it's like snap: that's the same letter as that. So N_l filters and N_l feature maps, or channels, each one of size M. So I can see this is where the unrolling is happening. Each map is of size M little l, so this is like M[l] in NumPy notation. It's the lth layer, so M for the lth layer, and the size is height times width. So we've flattened it out. So the responses at that layer l can be stored in a matrix F, and now the l goes at the top for some reason. So this is not F to the power of l, this is just another indexing. We're just moving it around for fun. And this thing here, where we say it's an element of R, this is a special R meaning the real numbers, N times M, this is saying that the dimensions of this are N by M. So this is really important. It's just like with PyTorch, making sure that you understand the rank and size of your dimensions first. Same with math. These are the bits where you stop and think, why is it N by M? So N is the number of filters, M is height by width. So do you remember that thing where we did view batch times channel, comma, minus 1? Here that is. So try to map the code to the math. So F is x. If I was nicer to you, I would have used the same letters as the paper, but I was too busy getting this damn thing working to do that carefully. So you can go back and rename it as capital F. Okay, and the reason we moved the l to the top is because we're now going to have some more indexing. Whereas in NumPy or PyTorch we index things by square brackets and then lots of things with commas between, the approach in math is to like
surround your letter by little letters all around it, okay, and just throw them up there everywhere. So here F^l is the lth layer of F, and then ij is the activation of the ith filter at position j of layer l. So position j goes up to size M, which is up to size height by width. This is the kind of thing that would be easy to get confused by. Often you'd see an ij and assume that's indexing into a position of an image, like height by width, but it's totally not, is it? It's indexing into channel by flattened image. And it even tells you: it's the ith filter, the ith channel, in the jth position of that flattened image in layer l. So you're not going to be able to get any further in the paper unless you understand what F is. Okay, so that's why these are the bits where you stop and make sure you're comfortable. So now the content loss I'm not going to spend much time on, but basically we're just going to check the values of the activations versus the predictions, squared. So there's our content loss. And the style loss will be much the same thing, but using the Gram matrix G. And I really wanted to show you this one, because sometimes I really like things you can do in math notation, and they're things that you can also generally do in J and APL, which is that there's kind of this implicit loop going on here. What this is saying is there's a whole bunch of values of i and a whole bunch of values of j, and I've got to define G for all of them. And there's a whole bunch of values of l as well, and I'm going to define G for all of those as well. And so for all of my G, at every l, at every i, at every j, it's going to be equal to something. And you can see the something has an i and a j, so it's matching these, and it also has a k, and that's part of the sum. So what's going on here? Well, it's saying that my Gram matrix in layer l, for the ith channel (well, these aren't channels anymore), in the ith position in one axis and the jth position in another
axis is equal to my F matrix, so my flattened-out matrix, for the ith channel in that layer versus the jth channel in the same layer. And then I'm going to sum over k. So you see this k and this k, they're the same letter, right? So we're going to take the kth position of each, multiply them together, and then add them all up. So that's exactly what we just did before when we calculated our Gram matrix. So there's a lot going on because of some, to me, very neat notation, which is that there are three implicit loops all going on at the same time, plus one explicit loop in the sum, and then they all work together to create this Gram matrix for every layer. So let's go back and see if you can match this. Yeah. So all that's kind of happening all at once, which I think is pretty great. Okay, so that's it. So next week we're going to be looking at a very similar approach, basically doing style transfer all over again, but in a way where we're actually going to train a neural network to do it for us rather than having to do the optimization. And we'll also see that you can do the same thing to do super resolution. And we're also going to go back and revisit some of that SSD stuff, as well as doing some segmentation. So if you've forgotten SSD, it might be worth doing a little bit of revision this week. All right, thanks everybody. See you next week.
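As a companion to the Gram matrix discussion above, here is a minimal sketch of how the notation might map to code. It is written in NumPy rather than PyTorch, and it is not the lesson's own implementation: the function names (`flatten_features`, `gram_loops`, `gram_matmul`) and the toy shapes are made up for illustration. It assumes a single layer with no batch dimension, and it leaves out any normalization or scaling. The formula it follows is the one read out above, G[i, j] = sum over k of F[i, k] * F[j, k], where F has one row per filter and one column per flattened position.

```python
import numpy as np

def flatten_features(acts):
    """Reshape (channels, height, width) activations into F of shape (N, M),
    where N is the number of filters and M = height * width."""
    n, h, w = acts.shape
    return acts.reshape(n, h * w)

def gram_loops(F):
    """G[i, j] = sum_k F[i, k] * F[j, k], with every loop written out.
    The i and j loops are implicit in the math; the k loop is the sum."""
    n, m = F.shape
    G = np.zeros((n, n))
    for i in range(n):          # position on one axis of G (a filter index)
        for j in range(n):      # position on the other axis of G
            for k in range(m):  # flattened spatial position, summed over
                G[i, j] += F[i, k] * F[j, k]
    return G

def gram_matmul(F):
    """The same quantity as one matrix product: F times its transpose."""
    return F @ F.T

# Toy activations (hypothetical sizes): 3 filters on a 4x5 feature map.
rng = np.random.default_rng(0)
acts = rng.standard_normal((3, 4, 5))
F = flatten_features(acts)

print(F.shape)                                     # (3, 20)
print(np.allclose(gram_loops(F), gram_matmul(F)))  # True
```

The loop version is only there to make the indices visible. In practice the single matrix product is what you would use, and the math notation's implicit loops over i and j are exactly what that product does for you. The loop over layers l is not shown; it would simply call the same function once per layer.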