Okay, welcome again everybody. Some really fun stuff appeared on the forums this week. One of the really great projects, created by I believe our sole Bulgarian participant in the course, Slav Ivanov, was a great post about picking an optimizer for style transfer. This post came from a forum discussion in which I made an offhand remark that, in theory, L-BFGS is a deterministic optimizer, it uses a line search, it approximates the Hessian, so it ought to work better on this kind of deterministic problem; but I hadn't tried it myself and I hadn't seen anybody try it, so maybe somebody should. I don't know if you've noticed, but pretty much every week I say something like that a number of times, and every time I do, I'm hoping that somebody might go, hmm, I wonder about that as well. And Slav did wonder, and he posted a really interesting blog post about that exact question. I was thrilled to see the post got a lot of pickup on the machine learning subreddit: it got 55 upvotes, which for that subreddit put it in second place on the front page. It also got picked up by the WildML mailing list's weekly summary of interesting things in AI, as the second post listed. So that was great. For those of you who have looked at it and wondered, okay, what is it about this post that causes it to get noticed whereas other ones don't: I'm not sure I know the secret, but as soon as I read it I thought, okay, I think a lot of people are going to read this. It gives some background; it assumes an intelligent reader, but an intelligent reader who doesn't necessarily know all about this topic, you know, somebody like you guys six months ago. It describes what style transfer is, where this kind of thing is used, and gives some examples. Then it sets up the question of different optimization algorithms, and shows lots of examples of both learning curves and the pictures that come out of these different experiments. And I think it's hopefully been a great experience for Slav as well, because in the Reddit thread there are all kinds of folks pointing out other things he could try and questions that weren't quite clear. So now there's, summarized in that thread, a list of things that could perhaps be done next, which opens up a whole interesting line of research. Another post, which I'm not even sure is officially published yet, I got the early-bird version from Brad, is this crazy thing. Here is Kanye, drawn using a brush of Captain Jean-Luc Picard. In case you're wondering, is that really him? I will show you a zoomed-in version: it really is Jean-Luc Picard. And this is a really interesting idea, because he points out that, generally speaking, when you try to use a non-artwork as your style image, it doesn't actually give very good results. Here's another example where a non-artwork doesn't give good results: it's kind of interesting, but not quite what he was looking for. But if you tile the style image, you totally get it right; so here's Kanye painted with a Nintendo controller as the style. Then he tried out this Jean-Luc Picard tiling, got okay results, and realized that actually the size of the texture is pretty critical. I've never seen anybody do this before, so I think when this image gets shared on Twitter, it's going to go everywhere, because it's just the freakiest thing.
Freaky is good. So, you know, I think I warned you guys about your projects when I first mentioned them: it's very easy to overshoot a little bit and spend weeks and weeks talking about what you're eventually going to do. You've had a couple of weeks now, and really, it would have been nice to have something done by now, rather than spending a couple of weeks wondering what to do. So if your team is being a bit slow agreeing on something, start working on something yourself, or, as a team, just pick something you can do by next Monday, and write up something brief about it. For example, if you're thinking, okay, we might do the $1 million Data Science Bowl, that's fine; you're not going to finish it by Monday, but maybe by Monday you could have written a blog post introducing what you can learn in a week about medical imaging. Oh, it turns out it uses something called DICOM; here are the Python DICOM libraries and we tried to use them; these are the things that got us confused and these are the ways we solved them; and here's a Python notebook which shows some of the main ways you can look at these DICOM files, for instance. So split your project up into little pieces. It's like when you enter a Kaggle competition: I always tell people, submit every single day, and try to put in at least half an hour a day to make it slightly better than yesterday. So how do you put in the first day's submission? What I always do on the first day is submit the benchmark script, which is generally something like all zeros. Then the next day I try to improve on it, so I'll put in all 0.5s. The next day I'll improve that: okay, what's the average for cats, what's the average for dogs? I'll submit that. If you do that every day for 90 days, you'll be amazed at how much you can achieve. Whereas if you wait two months, spending all that time reading papers and theorizing about the best possible approach, you'll discover that you don't get any submissions in, or you finally get your perfect submission in, it goes terribly, and now you don't have time to make it better. So I think those tips are equally useful for Kaggle competitions and for making sure that, at the end of this part of the course, you have something you're proud of, something you feel you did a good job on in a small amount of time. If you try to publish something every week on the same kind of topic, you'll be able to keep going further and further on that thing. I don't know what Slav's plans are, but maybe next week he'll follow up on some of the interesting research angles that came up on Reddit, or maybe Brad will follow up on some of the additional ideas from his post. Okay, there's a Lesson 10 wiki up already which has the notebooks, and just do a git pull on the GitHub repo to get the most up-to-date Python files. Another thing I wanted to point out is something from study group; we've been having study groups each Friday here, and I know some of you have had study groups elsewhere around the Bay Area. One of you asked me: I don't understand this Gram matrix stuff. I don't get it. What's going on? I understand the symbols, I understand the math, but what's really going on? And I said, maybe if you had a spreadsheet, it would all make sense. And he's like, I'm doing it in Python, you know, the Python is nice. And I was like, maybe if you had a spreadsheet, it would all make sense.
So he's like, maybe I'll create a spreadsheet. And I'm like, yes, do that. Twenty minutes later, I turned to him and said, so how do you feel about Gram matrices now? And he goes, I totally understand them. I looked over, and this was the spreadsheet he had created. It's a very simple spreadsheet: here's an image where the pixels are just 1, minus 1 or 0; it has two filters that are either 1 or minus 1; he has the flattened convolutions next to each other; and then he's created the little product matrix. Now, I haven't been doing so much Excel stuff myself lately, but I think you learn a lot more by trying it yourself, and particularly if you try it yourself and can't figure out how to do it in Excel, then ask; I love Excel, so if you ask me questions about Excel, I will have a great time. I'm not going to put that spreadsheet on the forum for now, because it's so easy to create, and you get so much more out of building it yourself, for anybody who's still not quite understanding what the Gram matrix is doing. All right. So last week, we talked about the idea of learning with larger datasets, and our goal was to try to replicate the DeViSE paper. To remind you, the DeViSE paper is the one where we train a regular CNN, but the thing we're trying to predict is not a one-hot encoding of the category; it's the word vector for the category. So it's an interesting problem, and one of the interesting things about it is simply that we have to use all of ImageNet, which has its own challenges. Last week, we got to the point where we had created the word vectors, and remember, we then had to map them to ImageNet categories. There are a thousand ImageNet categories, so we had to find the word vector for each one. We didn't quite get all of them to match, but something like two-thirds of them matched, so we're working with about two-thirds of ImageNet. We got as far as reading all the file names for ImageNet, and we were then going to resize our images, all of them to 224 by 224. I think it's a good idea to do some of this pre-processing up front. Something that TensorFlow and PyTorch both do, and that Keras recently started doing, is that if you use a generator, it actually does the image pre-processing in a number of separate threads, in parallel, behind the scenes. So some of this is a little less important than it was six months ago, when Keras didn't do that; it used to be that we had to spend a long time waiting for our data to be processed before it could get into the CNN. Having said that, particularly for image resizing, when you've got large JPEGs, just reading them off the hard disk and resizing them can take quite a long time. So I always like to do all that resizing up front and end up with something in a nice convenient bcolz array. Amongst other things, it means that unless you have enough money for a huge NVMe or SSD drive on which you can put the entirety of ImageNet, you probably have your big datasets on some kind of pretty slow spinning disk or slow RAID array; one of the nice things about doing the resizing first is that it makes the data a lot smaller, so you probably can then fit it on your SSD. There are lots of reasons I think this is good. So I'm going to resize all of the ImageNet images and put them in a big bcolz array on my SSD. Here's the path; dpath is the path to my fast SSD mount point.
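To make that concrete, here's a minimal sketch of creating that on-disk array; the file name is hypothetical, but the chunklen of 32 will matter later when we iterate over the array:

```python
import bcolz
import numpy as np

# Sketch: an empty, compressed, chunked, on-disk array to append resized
# images to. chunklen=32 means each chunk on disk holds 32 images.
arr = bcolz.carray(np.empty((0, 224, 224, 3), dtype='float32'),
                   chunklen=32, mode='w',
                   rootdir=dpath + 'trn_resized_224.bc')
```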
We talked briefly about how there are various ways of doing the resizing, and we're going to do a different kind of resizing this time. In the past, we've done the same kind of resizing Keras does, which is to add a black border: if you start with something that's not square and you make it square, you resize the largest axis to be the size of your square, which means you're left with a black border. I was concerned that any model trained on that is going to have to (a) learn to model the black border, and (b) you're throwing away information, because you're not using the full size of the image. And indeed, pretty much every other library or paper I've seen uses a different approach, which is to resize the smaller side of the image to the target size; the larger side is now too big for your square, so you crop off the top and bottom, or crop off the left and right. This is called the center cropping approach. Someone asked about the trade-off: with the black border, it seems like you're not throwing away data. Okay, that's true; what you're throwing away instead is compute. With center crop, you have a complete 224 by 224 full of meaningful pixels, whereas with a black border, you have something like a 180 by 224 region of meaningful pixels and a whole bunch of black pixels. And yes, cropping can be a problem. It works well for ImageNet because ImageNet subjects are generally somewhat centered; for other data you might need some initial step, like the heat map we did in lesson 7, to figure out roughly where the thing is before you decide where to center the crop. So these things are all compromises. But I've got to say, since I switched to this approach, my models have trained a lot faster and given better results; certainly for super resolution. Okay, now, I said last week that we were going to start looking at parallel processing, and if you're wondering about last week's homework, we're going to get there; some of the techniques we're about to learn we're going to use to do last week's homework even better, so don't worry. So here's what I want to do. I've got a CPU with something like 10 cores on it, and each of those cores has hyper-threading, so each core can do roughly two things at once. So I really want to have a couple of dozen processes going, each one resizing an image. That's called parallel processing, and just to remind you, this is as opposed to vectorization, or SIMD, which is where a single thread operates on a bunch of things at a time. We learnt that to get SIMD working, you just have to install Pillow-SIMD, and you get a 600% speed-up; I tried it, it works. Now, as well as that 600% speed-up, we're going to get another 10 or 20x speed-up by doing parallel processing. The basic approach to parallel processing in Python 3 is to set up something called either a process pool or a thread pool. The idea is that we've got a number of little programs running, threads or processes, and when we set up that pool, we say how many of those little workers we want to fire up. Then we say: okay, now I want you to use all of those workers to do some thing. And the easiest way to do a thing to every item in a collection in Python 3 is to use map.
How many of you have used map before? Okay, for those of you who haven't: map is a very common functional programming construct that has found its way into lots of other languages. It simply says: loop through a collection, call a function on every element in that collection, and return a new collection which is the result of calling that function on each element. In our case, the function is our resize function, and the collection is the ImageNet images; well, in fact, the collection is a bunch of indices, 0, 1, 2, 3, 4 and so forth, and what the resize function does is open the image with that index off disk, so it turns the number three into the third image, resized to 224 by 224, and returns that. So the general approach here, this is basically what it looks like to do parallel processing in Python: we go result equals exc.map, here's the function I want, here's the thing to map over; and then I say, for each thing in that result, do something. Now, this might make you think: wait, does that mean this list has to have enough memory for every single resized image? The answer is no, it doesn't. One of the things Python 3 uses a lot more is generators, which are basically things that look like a list but are lazy: they only create each element when you ask for it. So as I loop through and append each image, it gives me that image, and if the mapping has not yet finished creating it, it waits. This approach looks like it's going to use heaps of memory, but it doesn't; it uses only the minimum amount of memory necessary, and it does everything in parallel (the whole pattern is sketched below). So the resize function opens the image, turns it into a NumPy array, and resizes it, doing the center cropping we just mentioned; and after it's resized, it gets appended. What does appending the image involve? This is where it gets a bit weird. What it does is stick the image into what we call a pre-allocated array. We're learning a lot of computer science concepts here; anybody who's done computer science before will be familiar with all of this already, and if you haven't, you probably won't. But it's important to know that one of the slowest things your computer does, generally speaking, is allocating memory: finding some memory, reading from it, writing to it, unless of course it's in cache or something. Generally speaking, if you create lots and lots of arrays and then throw them away again, that's likely to be really, really slow. So what I wanted to do was create a single 224 by 224 array which will contain the current resized image, and then append that to my big bcolz tensor. The way you do that in Python is wonderfully easy: you create a variable from this thing called threading.local. This tl variable is basically something that looks a bit like a dictionary, but it's a very special kind of dictionary: it keeps a separate copy of its contents for every thread or process. Normally, when you've got lots of things happening at once, shared state is a real pain, because if two things try to use it at the same time, you get bad results or even crashes. But if you allocate a variable like this, it automatically creates a separate copy in every thread, and you don't have to worry about locks or race conditions or any of that.
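Putting the pieces together, here's a minimal sketch of the whole pattern, with hypothetical names (fnames for the file list, arr for the bcolz array created earlier). One small tweak from the description: an attribute set on a threading.local in the main thread isn't visible inside worker threads, so each worker lazily allocates its own placeholder on first use:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

import numpy as np
from PIL import Image

tl = threading.local()

def center_crop_resize(fname, side=224):
    # Resize the shorter side to `side`, then crop the longer side centrally
    im = Image.open(fname).convert('RGB')
    w, h = im.size
    ratio = side / min(w, h)
    im = im.resize((max(side, int(w * ratio)), max(side, int(h * ratio))))
    px = np.asarray(im)
    r0 = (px.shape[0] - side) // 2
    c0 = (px.shape[1] - side) // 2
    return px[r0:r0 + side, c0:c0 + side]

def get_img(i):
    # Each thread lazily allocates its own pre-allocated placeholder,
    # so we aren't creating and throwing away an array per image
    if not hasattr(tl, 'place'):
        tl.place = np.zeros((1, 224, 224, 3), dtype='float32')
    tl.place[0] = center_crop_resize(fnames[i])
    return tl.place

with ThreadPoolExecutor(max_workers=16) as exc:
    for img in exc.map(get_img, range(len(fnames))):
        arr.append(img)   # append in the main thread; bcolz copies the data
arr.flush()
```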
So once I've created this special threading.local variable, I create a placeholder inside it, which is just an array of zeros of size 224 by 224 by 3. Then later on, I create my big bcolz array, which is where everything is eventually going to go, and to append an image, I grab the bit of the image I want, put it into that pre-allocated thread-local variable, and then append that to my big bcolz array. So there's lots of detail here in terms of using parallel processing effectively. I wanted to briefly mention it, not because I think somebody who hasn't studied computer science is now going to go, okay, I totally understood all that, but to give you some things to search for and learn about over the next week if you haven't done any parallel programming before: you're going to need to understand thread-local storage, race conditions, and so on. I'm sorry, I don't have the green box today; this next bit is for somebody who knows their Python. In Python, there's something called the global interpreter lock, which is one of the many awful things about Python, possibly the most awful, which means that, in theory, two things can't happen at the same time, because Python wasn't really written in a thread-safe way. The good news is that lots of libraries are written in a thread-safe way, so if you're using a library where most of its work is being done in C, as is the case with Pillow-SIMD, you actually don't have to worry about that. And I can prove it to you, because I drew a little picture. Here is the result of serial versus parallel. The serial-without-SIMD version is about six times bigger than this chart shows: the default Python code you might have written before today's course would have taken about 120 seconds to process 2,000 images. With SIMD, it's 25 seconds. With a process pool, it's eight seconds for three workers, five seconds for six workers, and so on and so forth. The thread pool is even better: 3.6 seconds for 12 workers, 3.2 seconds for 16 workers. Now, your mileage will vary depending on what CPU you have. Given that quite a lot of you are probably still using the P2, unless you've got your deep learning box up and running, you'll have the same performance as other people using the P2. You should try something like this: try different numbers of workers and see what's optimal for that particular CPU. In my case, once I went beyond 16 workers, I didn't really get improvements, so I know that on that computer a thread pool of size 16 is a pretty good choice. And as you can see, once you get into the right general vicinity, it doesn't vary too much, so as long as you're roughly right, you're okay. Just behind you, Rachel. So that's the general approach here: run through everything in parallel, appending each result to my bcolz array, and at the end I've got a bcolz array which I can use again and again. I don't rerun that code very often anymore; I've got all of ImageNet resized to each of 72 by 72, 224 by 224 and 288 by 288, I give them different names, and I just use them. A follow-up question on image resizing: what if you constrain the size to some number and just squish the rectangle into a square? It'll be a little bit distorted, but how will that affect things? Right. In fact, I think that's what Keras does, now I think about it. I think it squishes.
Okay, so here's one of those things I'm not quite sure about. My guess is that squishing isn't a good idea, because you're now going to have dogs of various different squish levels, and your CNN is going to have to learn that; it's got another type of invariance to learn, the level of squishiness. Whereas if we keep everything at the same aspect ratio, I think it's going to be easier to learn, so we'll get better results with fewer epochs of training. That's my theory, and I'd be fascinated for somebody to do a really in-depth analysis of black borders versus center cropping versus squishing with ImageNet. Okay, so from now on, we can just open the bcolz array, and there we go. We're now ready to create our model, and I'll run through this pretty quickly because most of it is pretty boring. The basic idea is that we need to create an array of labels, which I've called vecs, which for every image in my bcolz array contains the target word vector for that image. And just to remind you, last week we randomly ordered the file names, so this bcolz array is in random order. Okay, so we've got our labels, which are the word vectors for every image. We need to do our normal pre-processing, and this is a handy way to pre-process in the new version of Keras. We're using the normal Keras ResNet model, the one that comes in keras.applications. It doesn't do the pre-processing for you, but if you create a Lambda layer that does the pre-processing, you can use that Lambda layer's output as the input tensor, and this whole thing will now do the pre-processing automatically, without you having to worry about it (sketched below). That's a good little trick. I'm not sure it's quite as neat as what we did in Part 1, where we put it in the model itself, but at least this way we don't have to maintain a whole separate version of all of the models, so that's what I'm doing nowadays. All right. When you're working on really big datasets, you don't want to process things any more than necessary, or any more times than necessary. I know ahead of time that I'm going to want to do some fine-tuning, and I decided that this particular layer is where I'm going to fine-tune from. So I decided to first create a model which starts at the input and goes as far as that layer, and save the results of that; the next step will be to take that intermediate output, run it through to the next stage I want to fine-tune to, and save that too. It's a little shortcut. There are a couple of really important intricacies to be aware of here, though. The first is that you'll notice ResNet and Inception are not used very often for transfer learning. Again, this is something I've not seen studied, and I actually think it's a really important thing to study: which of these architectures works best for transfer learning. But I think one of the difficulties is just that ResNet and Inception are harder to work with, and the reason they're harder is that ResNet has lots and lots of layers which make no sense on their own; ditto for Inception, because they keep splitting into two branches and then merging again, splitting and merging.
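Going back to that pre-processing trick for a second, here's a minimal sketch of the idea, assuming channels-last images; the mean values and variable names are assumptions, not pulled from the notebook:

```python
import numpy as np
from keras.layers import Input, Lambda
from keras.applications.resnet50 import ResNet50

# ImageNet channel means; the Lambda subtracts them and flips RGB -> BGR,
# so the network pre-processes its own input
rn_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)
preproc = lambda x: (x - rn_mean)[:, :, :, ::-1]

inp = Input((224, 224, 3))
model = ResNet50(include_top=False, input_tensor=Lambda(preproc)(inp))
```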
So I looked at the Keras source code to find out: okay, how is each block named? What I wanted to do was say: we've got a ResNet block; we've just had a merge, then it goes off and does a couple of convolutions, and then it comes back and does an addition. Basically, I want to grab the output of one of these merges. Unfortunately, for some reason, Keras does not name these merge layers, so what I had to do was get the next layer and then go back by one. It kind of shows you how little people have been doing transfer learning with ResNet: literally the only bits of it that make sense to transfer-learn from are nameless in what I suspect is the most popular library for transfer learning, Keras. Okay, there's a second complexity to working with ResNet. We haven't discussed this much, but ResNet actually has two kinds of blocks. One kind is the identity block, and the second kind is the ResNet convolution block, which they also call a bottleneck block. It's pretty similar: one path goes up through a couple of convolutions and then gets added back in, but the other path is not an identity; the other path is a single convolution. In ResNet they throw in one of these every half a dozen blocks or so. Why is that? The reason is that if you only have identity blocks, then all the network can really do is continually fine-tune where it's at so far. We've learned quite a few times now that these identity blocks basically learn the residual: they keep fine-tuning the types of features we have so far. Whereas these bottleneck blocks actually force the network, from time to time, to create a whole different type of features, because there is no identity path through them; the shortest path is through a single convolution. So when you think about transfer learning from ResNet, you need to think about: should I transfer-learn from an identity block, before or after it, or from a bottleneck block, before or after it? Again, I don't think anybody has studied this, or at least I haven't seen anybody write it down. I've played around with it a bit, and I'm not sure I have a totally decisive suggestion for you, but my guess is that the best point to grab in ResNet is the end of the block immediately before a bottleneck block. The reason is that at that level of receptive field (each bottleneck block changes the receptive field) and at that level of semantic complexity, that spot is the most sophisticated version of those features, because the network has been through a whole bunch of identity blocks to get there: gradually fine-tune, fine-tune, fine-tune, bottleneck; fine-tune, fine-tune, fine-tune, bottleneck. My belief is that just before a bottleneck is the best place to transfer-learn from, and that's what this is: the spot just before the last bottleneck block in ResNet. So it's pretty late, and as we know very well from Part 1, with transfer learning, when you're doing something which is not too different, and in this case we're switching from one-hot encoding to word vectors, which is not too different, you probably don't want to transfer-learn from too early. That's why I picked this fairly late stage, just before the final bottleneck block. Okay, the second complexity here is that this block's output has dimensions 14 by 14 by 1024, and we have about a million images.
A million by 14 by 14 by 1024 is more than I wanted to deal with. So I did something very simple: I popped in one more layer after this, an average pooling layer with a 7 by 7 pool size, which takes my 14 by 14 output and turns it into a 2 by 2 output. Let's say one of those activations was looking for bird's eyeballs: before, it was saying, in each of the 14 by 14 spots, how likely is it that this is a bird's eyeball; after, it's saying, in each of these four spots, on average how much did those cells look like bird's eyeballs. So this is losing information; if I had a bigger SSD and more time, I wouldn't have done it, but it's a good trick. When you're working with these fully convolutional architectures, you can pop an average pooling layer in anywhere and decrease the resolution to something you feel you can deal with. In this case, my decision was to go to 2 by 2 by 1024. We had a question: have we talked about why we do the merge operation in some of these more complex models? We have, quite a few times: the merge operation is the thing which does the plus here; that addition of the identity to the result of a couple of convolutions is the trick that makes it a ResNet block. Another question: you just talked about downsizing the spatial geometry; if I want to go from many filters, say 512, down to fewer, is there a good best practice, or is it as simple as doing a convolution with fewer filters? There's not exactly a best practice for that, but in a sense, every single successful architecture gives you some insight about it, because every one of them eventually has to end up with a thousand categories if it's ImageNet, or three channels of continuous 0-to-255 values if it's generative. So the best thing you can really do is look at the successful architectures. Also, although this week is the last week we're mainly going to be looking at images, next week I'm going to briefly open with a quick run through some things you could look at to learn more, and among them are two different papers with best practices, really nice descriptions along the lines of: we tried these hundred different things, and here are the hundred different results. But all this stuff is still pretty artisanal, you know. Another question: so we initially resized the images to 224 by 224, and it ended up in a bcolz array; that's something like 50 gig of data. Do you just read all the images in? Yes, and that's compressed; uncompressed it's like a couple of hundred gig. But I'm not going to load it into memory, you'll see: you place the load where it's needed, as the data streams through. And that's exactly the segue I was looking for, so thank you. What we're going to do now is run the model we just built, basically call .predict on it, and save the predictions. The problem is that the size of those predictions is going to be bigger than the amount of RAM I have, so I need to do it a batch at a time, and save it a batch at a time: I've got a million things, each one with this many activations.
And this is going to happen quite often: you're either working on a smaller computer, or with a bigger dataset, or with a dataset where you're using a larger number of activations. It's actually very easy to handle. You just create the bcolz array where you're going to store the results, and then all I do is go from zero to the length of my source array, a batch at a time; so this creates the numbers 0, 0 plus 128, 128 plus 128, and so on, and then I take the slice of my source array from 0 to 128, then from 128 to 256, and so forth. So this now contains a slice of my source bcolz array; well, actually, this creates a generator which will have all of those slices, and of course, being a generator, it's lazy, so I can then enumerate through each of those slices and append to my destination bcolz array the result of predicting on just that one batch (sketched below). So you've seen predict and evaluate and fit and so forth, and the generator versions; in Keras there's generally also an on-batch version, so there's a train_on_batch and a predict_on_batch, and these basically have no smarts to them at all; they're the most basic thing. predict_on_batch just takes whatever you give it and calls predict on it: it won't shuffle it, it won't batch it, it throws it directly into the computation graph and calls predict on just that batch of data. From time to time I print out how far I've gotten, just so I know how it's going, and also from time to time I call .flush: that's the thing in bcolz that actually writes to disk, so this makes sure the results are continuously written to disk. This whole thing doesn't actually take very long to run. And one of the nice things I can do here is some data augmentation as well: I've added a direction parameter, and what I'm going to do is have a second copy of all of my images which is flipped horizontally. To flip things horizontally... that's interesting, I think I've screwed this up. Okay, I have screwed this up: the minus one should be here, because the axes are batch, then rows, then this one is columns, so if we pass in a minus one step here, it flips horizontally. That explains why some of my results haven't been quite as good as I hoped. Okay, so when you run this, we end up with a big bcolz array containing, for two copies of every resized ImageNet image, the activations at the layer one before this one; I call it once with direction forwards and once with direction backwards, and at the end of that I've got nearly two million images' worth of activations of 2 by 2 by 1024. Okay, so that's pretty close to the end of ResNet. I've then just copied and pasted, from the Keras code, the last few steps of ResNet, so this is the last two blocks. I added in one extra identity block, just because I had a feeling it might help things along a little bit; again, people haven't really studied this yet, so I haven't had a chance to properly experiment, but it seemed to work quite well. So this is basically copied and pasted from Keras's code; I then need to copy the weights from Keras for those last few layers of ResNet, and now I'm going to repeat the same process, which is to call predict on these last few layers, where the input will be the output from the previous stage.
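Here's a minimal sketch of that batch-at-a-time pattern, with hypothetical names (src is the source array, dst the destination bcolz carray); the direction argument gives the horizontally-flipped second pass:

```python
def save_features(model, src, dst, bs=128, direction=1):
    # Predict a batch at a time and append to an on-disk bcolz array
    for i in range(0, len(src), bs):
        batch = src[i:i + bs][:, :, ::direction]   # direction=-1 flips columns
        dst.append(model.predict_on_batch(batch))
        if i % (bs * 100) == 0:
            print(i)
            dst.flush()   # keep writing results through to disk
    dst.flush()
```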
So we went about two-thirds of the way into ResNet, got those activations, and fed those activations into the last few stages of ResNet to get the next set of activations. Now, the outputs from this are actually just a vector of length 2048, which does fit in my RAM, so I didn't bother with predict_on_batch; I can just call .predict. If you try this at home and don't have enough memory, you can use the predict_on_batch trick again: any time you run out of memory when calling predict, you can always use that pattern. So at the end of all that, I've got the activations from the very end of ResNet, and I can do our usual transfer learning trick of creating a linear model. My linear model is going to use the number of dimensions in my word vectors as its output size, and you'll see it doesn't have any activation function; that's because I'm not doing one-hot encoding, and my word vectors can be numbers of any size, so I just leave it linear. Then I compile it, and I fit it. This linear model is my very first step; it's almost the same as what we did in lesson 1, you know, dogs versus cats: we're fine-tuning a model to a slightly different objective function, to a slightly different target from what it was originally trained with. It's just that we're doing it with a lot more data, so we have to be a bit more thoughtful. There's one other difference here, which is that I'm using a custom loss function, and the loss function I'm using is cosine distance (sketched below). You can look that up at home if you're not familiar with it, but basically, cosine distance says: for these two points in space, what's the angle between them, rather than how far apart they are. The reason we're doing that is because we're about to start using k nearest neighbors: with k nearest neighbors we're basically going to say, here's a predicted word vector; which actual word vector is closest to it? It turns out that in really, really high dimensional space, the concept of how far away something is is nearly meaningless. The reason is that in really high dimensional space, everything sits on the edge of that space. You can imagine that as you add each additional dimension, the probability that a point is on the edge in at least one dimension goes up multiplicatively: say the probability that it's right on the edge in any one dimension is one-tenth; with one dimension, one-tenth of points are on the edge; with two dimensions, it compounds; and so in a few-hundred-dimensional space, essentially everything is on the edge. When everything's on the edge, everything is roughly an equal distance away from everything else, more or less, so distances aren't very helpful; but the angle between things varies. So when you're doing anything where you're trying to find nearest neighbours, it's a really good idea to train things using cosine distance, and this is the formula for cosine distance. Again, this is one of those things where I'm skipping over something you'd probably spend a week on in undergrad; there's plenty of information about cosine distance on the web, so for those of you already familiar with it, I won't waste your time, and for those of you who aren't, it's a very, very good idea to become familiar with it, and feel free to ask on the forums if you can't find any material that makes sense. Okay, so we've fitted our linear model; as per usual, we save our weights, and we can see how we're going.
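Here's a minimal sketch of that linear head and a cosine-distance loss, using Keras 1-style arguments; the feature and label arrays (features, vecs) and the word-vector dimensionality n_wv are hypothetical names:

```python
from keras import backend as K
from keras.layers import Dense
from keras.models import Sequential

def cos_distance(y_true, y_pred):
    # 1 - cosine similarity of L2-normalized vectors
    y_true = K.l2_normalize(y_true, axis=-1)
    y_pred = K.l2_normalize(y_pred, axis=-1)
    return K.mean(1 - K.sum(y_true * y_pred, axis=-1))

# A single linear layer, no activation: word vectors aren't one-hot targets
model = Sequential([Dense(n_wv, input_shape=(2048,))])
model.compile(optimizer='adam', loss=cos_distance)
model.fit(features, vecs, batch_size=128, nb_epoch=3)
```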
What we've got now is something where we feed in an image and it spits out something that looks like a word vector: it has the same dimensionality as a word vector, but it's very unlikely to be exactly equal to one of our thousand target word vectors. If the word vector for a pug is some particular list of a few hundred floats, then even if we have a perfectly puggy pug, we're not going to get that exact list of floats; we'll get something that is similar, and by similar we probably mean that the cosine distance between the perfect platonic pug and our pug is pretty small. So that's why, after we get our predictions, we have to use nearest neighbors as a second step: basically to say, for each of those predictions, what are the three word vectors closest to it? We can then take those nearest neighbors and find out, for a bunch of our images, what are the three things the model thinks each might be. For example, for this image here, its best guess was trombone, next was flute, and third was cello. So this gives us some hope that the approach is working: it's not great yet, but it's recognized that these are musical instruments, and its third guess was in fact the correct instrument. So we know what to do next: fine-tune more layers, and because we have already saved the intermediate results from an earlier layer, that fine-tuning is going to be much faster to do. Two more things to briefly mention. One is that there are a couple of different ways to do nearest neighbors. One is the brute force approach, which is literally: to find the nearest word vector to this word vector, go through every one and see how far away it is. The other approach is approximate nearest neighbors. When you've got lots and lots of things you're looking up, the brute force approach is going to be O(n²) time, whereas approximate nearest neighbors are generally O(n log n), so orders of magnitude faster if you've got a large dataset. The particular approach I'm using here is something called locality sensitive hashing. It's a fascinating and wonderful algorithm; anybody who's interested in algorithms, I strongly recommend you go read about it, and let me know if you need a hand with it. My favorite kind of algorithms are these approximate algorithms, because in data science you almost never need to know something exactly, yet nearly every algorithm people learn at university, and certainly at high school, is exact: we learn exact nearest neighbor algorithms, exact indexing algorithms, exact median algorithms. For pretty much every algorithm out there, there's an approximate version that runs a factor of n, or n over log n, faster. One of the cool things is that once you start realizing this, you discover that lots of the libraries you've been using for ages were written by people who didn't know it, and that in every sub-algorithm they've written, they could have used an approximate version; next thing you know, you've got something that runs a thousand times faster. The other cool thing about approximate algorithms is that they're generally written to be provably accurate to within some bound, and they can tell you, given your parameters, how tight that bound is, which means that if you want more accuracy, you run the algorithm more times with different random seeds. This thing called LSHForest is a locality sensitive hashing forest, which means it creates a bunch of these locality sensitive hashes.
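As a sketch of the usage, this is scikit-learn's LSHForest, which is what's being described here (it has since been deprecated in newer scikit-learn versions); the array names are hypothetical:

```python
from sklearn.neighbors import LSHForest

# Fit an approximate nearest-neighbour index over the target word vectors,
# then look up the 3 closest words for each predicted vector
nn = LSHForest(n_estimators=20)
nn.fit(word_vectors)                        # shape: (n_words, n_dims)
dists, idxs = nn.kneighbors(preds, n_neighbors=3)
```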
And the amazingly great thing about approximate algorithms is that each time you create another version of them, you're multiplicatively increasing the accuracy, but only linearly increasing the time: if the error of one call of LSH is E, then the error of two calls is E squared and of three calls E cubed, while the time goes from N to 2N to 3N. When you've got something where you can make it as accurate as you like with only a linear increase in time, that's incredibly powerful. So this is a great approximation algorithm; I wish we had more time, because I'd love to tell you all about it. I generally use LSHForest when I'm doing nearest neighbors, because it's arbitrarily close and much faster when you've got lots of word vectors. The reason the time becomes important is when I move beyond ImageNet, which I'm going to do now. Let's say I've got a picture, and I don't just want to say which one of the 1000 ImageNet categories it is, but which one of the 100,000 WordNet nouns it is. That's a much harder thing to do, and it's something that no previous model could do: when you trained an ImageNet model, the only thing you could do was recognize pictures of things that were in ImageNet. But now we've got a word vector model, so we can put in an image, it spits out a word vector, and that word vector could be close to things that are not in ImageNet at all, or to some higher level of the hierarchy: we could look for a dog rather than a pug, or a plane rather than a 747. So here we bring in the entire set of word vectors, and I have to remember to share these with you, because they're actually quite hard to create; and this is where I definitely want LSHForest, because this lookup would otherwise be pretty slow. We can now do the same thing, and not surprisingly, it's got worse: the thing that was actually a cello, well, cello is not even in the top 3 anymore. This is a harder problem, so let's try fine-tuning; fine-tuning is the final trick I'm going to show you. Just behind you, Rachel. A question, just to clarify: when you take all the word vectors, is it every noun in the English language? Yeah; you might remember, last week we looked at creating our word vectors, and what I did was go to WordNet, download the whole of WordNet, figure out which things were nouns, use a regex to parse those out, and save that. So yes, we actually have the entirety of WordNet's nouns. Why was cello not close? Because it's not a good enough model yet, and now that there are 80,000-odd nouns, there are a lot more ways to be wrong: when it only has to say which of a thousand things an image is, that's pretty easy; which of 80,000 things it is, that's pretty hard. Okay, so to fine-tune it, it looks very similar to our usual way of fine-tuning: we take our two models, stick them back to back, and now train the whole thing, rather than just the linear model. Now, the problem is that the input to this model is too big to fit in RAM. So how are we going to call fit or fit_generator when we have an array that's too big to fit in RAM? Well, one obvious thing to do would be to pass in the bcolz array, because to most things in Python, a bcolz array looks just like a regular array; it doesn't really look any different.
But the way a bcolz array is actually stored, as I'm sure you've noticed, is in a directory, and it has something called a chunklen; I set it to 32 when I created these bcolz arrays. What that does is take every 32 images and put them into a separate file: each of these files has 32 images in it, 32 along the leading axis of the array. Now, if you try to take this whole array and pass it to .fit in Keras with shuffling, it's going to try to grab one item from here and one item from here, and here's the bad news about bcolz: to get one item out of a chunk, it has to read and decompress the whole chunk. It has to read and decompress 32 images in order to give you the one image you asked for. That would be a disaster: ridiculously, horribly slow. We didn't have to worry about that when we called predict_on_batch, because we weren't shuffling; we were going in order, so it was never grabbing a single image out of the middle of a chunk. But now that we want to shuffle, it would. So somebody very helpfully provided something called a BcolzArrayIterator, which was kindly discovered on the forums by somebody named mpjensen, having originally been written by this fellow. What it does is provide a Keras-compatible generator which grabs an entire chunk at a time. It's a little bit less random, but given that this array has 2 million images and the chunk length is 32, it basically creates a batch of chunks rather than a batch of images, and that means we have none of the performance problems; and particularly because, remember, we originally randomly shuffled our file names, this whole thing is randomly shuffled anyway. So this is a good trick. You'll find the BcolzArrayIterator on GitHub; feel free to take a look at the code, it's pretty straightforward. There were a few issues with the original version, so mpjensen and I have tried to fix it up; I've written some tests for it and he's written some documentation for it. But if you just want to use it, it's as simple as writing: such-and-such equals BcolzArrayIterator, here's your data, here are your labels, shuffle equals true, batch size equals whatever; and then you can just call fit_generator as per usual, passing in that iterator and that iterator's number of items. So to all of you who have been asking how to deal with data that's bigger than memory: this is how you do it.
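A minimal sketch of the usage, assuming Keras 1-style fit_generator arguments; the module path, and the iterator exposing its total item count as N, are assumptions based on the description:

```python
from bcolz_array_iterator import BcolzArrayIterator  # module path assumed

# Chunk-at-a-time batches over on-disk bcolz arrays of data and labels
it = BcolzArrayIterator(features, labels, shuffle=True, batch_size=batch_size)
model.fit_generator(it, it.N, nb_epoch=1)  # the iterator and its item count
```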
So hopefully that will make life easier for a lot of people. We fine-tune it for a while, we do some learning rate annealing for a while, and this basically runs overnight for me; it takes about six hours. I come back the next morning and just copy and paste my k nearest neighbours code: I call predict, get my predicted word vectors, and pass each one into nearest neighbours, first against just the thousand ImageNet categories, and lo and behold, we now have cello in the top spot, as we hoped. How did it go on the harder problem of looking at the hundred thousand or so nouns in English? Pretty good: it got this one right. And just to pick another one at random, let's take the first one: it said throne, and that sure looks like a throne. So it's looking pretty good. Here's something interesting: now that we have brought images and words into the same space, let's play with that some more. A question: so the predictions are compared to the word2vec vectors which Google created? Yes, to the subset of those which are nouns according to WordNet, mapped to their synset IDs. The word vectors are just the word2vec vectors we can download off the internet, so yes, they were pre-trained by Google; they're embeddings, but they're not really weights. So we're comparing proximity of word to image? Exactly: we're saying, here is an image, it spits out a vector from the thing we just trained; we have 100,000 word2vec vectors for all the nouns in English; which of those is closest to the thing that came out of our model? Another question, about whether this is like language translation using a single shared word space: hold that thought, we'll be doing language translation starting next week; we don't quite do it that way, but you can think of it like that. So let's do something interesting: let's create a nearest neighbours index, not over all of the word2vec vectors, but over all of our image-predicted vectors, and now we can do the opposite. Let's take a word, picked at random, look it up in our word2vec dictionary, and find the nearest neighbours for that among our images. There it is. This is pretty interesting: there's a top three, so you can now find the images that are most like whatever word you come up with. Okay, that's crazy, but we can do crazier. Here is a random thing I picked, and notice I picked it from the validation set of ImageNet, so we've never seen this image before. Honestly, when I opened it up, my heart sank, because I don't know what it is; this is a problem: what is that? Somebody said manta ray; okay, well, I didn't pick it for that. So what we can do is call .predict on that image, and then do a nearest neighbours search over all of our other images: there's the first, there's the second, there's even somebody putting their hand on one, which is slightly crazy, but that's what the original one looked like. In fact, I ran it again on a different image; I actually looked around for something weird, and this one is pretty weird, right? It's a net... with a fish? So when we then ask for its nearest neighbours, we get fish in nets. It's like, I don't know, sometimes deep learning is so magic, you just kind of go, how can that possibly work? Just behind you, Rachel. A question about Dask: only a little bit, and maybe in a future course; I think maybe even in your numerical linear algebra course you might be looking at Dask, so I don't think we'll cover it this course, but do look at Dask, it's super cool. Another question: the three images you showed are from ImageNet, and I guess they had to be labeled by hand? No, not at all; these were actually labeled as whatever particular category each image was. In fact, that's the other thing: it's not only found fish in nets, it's actually found more or less the same breed of fish in nets. When we called .predict on those, it created a word vector which was probably halfway between that kind of fish and a net, because it doesn't know; sometimes, when ImageNet saw things like that, they would have been marked as a net, and sometimes as a fish, so the best way to minimize the loss function would have been to hedge. So it hedged, and as a result, the images that were closest were the ones which actually were halfway between the two themselves. It's kind of a convenient accident. Yeah, absolutely. There's a question at the back; Rachel's going to walk the mic over.
The question was whether you can use other distance metrics for the nearest neighbours: you absolutely can, and I have, but really, for nearest neighbours, I haven't found anything nearly as good as cosine, and that's been true in all of the things I've looked at as well. By the way, I should mention: when you use locality sensitive hashing in Python, by default it uses something equivalent to the cosine metric, so that's why the nearest neighbours work. Yes, exactly: starting next week, we're going to be learning about sequence-to-sequence models and memory and attention methods; they're going to show us how we can take an input such as a sentence in English and spit out an output such as a sentence in French, which is the particular case study we're going to spend two or three weeks on. When you combine that with this, you get image captioning. I'm not sure we'll have time to do it ourselves, but it will literally be trivial for you guys to take the two things, combine them, and do image captioning; it's just those two techniques together. Okay, so before we take a break, I want to show you the homework. Hopefully you guys noticed I gave you some tips, because it was a really challenging one, even though in a sense it was kind of straightforward: take everything we've already learned about super resolution and slightly change the loss function so that it does perceptual losses for style transfer instead. The details were tricky, and I'm going to quickly show you two things. First, how I did the homework, because I actually hadn't done it last week. Luckily, I have enough RAM that I could read the two bcolz arrays entirely into memory; don't forget you can just do that with a bcolz array: indexing it with [:] turns it into a NumPy array in memory. One thing I did was create my upsampling block to get rid of the checkerboard patterns; that was literally as simple as saying UpSampling2D and then a one-by-one convolution, and that got rid of my checkerboard patterns. The next thing I did was change my loss function, and I decided that before trying to do style transfer with perceptual losses, let's try to do super resolution with multiple content-loss layers, because one thing I'm going to have to do for style transfer is use multiple layers. I always like to start with something that works and make small little changes, so it keeps working at every point. So in this case, I first slightly changed the loss function for super resolution so that it uses multiple layers, and here's how I did that: I changed my vgg_content so it created a list of outputs, one from each of the first, second and third blocks, and then I changed my loss function so it went through and added up the mean squared difference for each of those three layers. I also decided to add weights, just for fun, and I went 0.1, 0.8, 0.1, because the middle one is the layer they used in the paper, but let's have a little bit of more precise super resolution and a little bit of more semantic super resolution, and see how it goes (there's a sketch below). I created a function to do a more general mean squared error, and that was basically it: other than that line, everything else was the same. So that gave me super resolution working on multiple layers.
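Here's a minimal sketch of that multi-layer weighted content loss; the helper name and the exact layer tensors are assumptions based on the description above:

```python
from keras import backend as K

w = [0.1, 0.8, 0.1]   # a bit precise, mostly the paper's layer, a bit semantic

def multi_layer_content_loss(outs, targs):
    # Weighted sum of per-layer mean squared differences, one value per sample
    loss = 0
    for o, t, wi in zip(outs, targs, w):
        loss += wi * K.mean(K.square(o - t), axis=[1, 2, 3])
    return loss
```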
One of the things I found fascinating is that, comparing against the original low-res, it's done a good job of upscaling, but it's also fixed up the weird white balance, which really surprised me. It's taken this obviously over-yellow shot, and this is what ceramic should look like: it should be white. Somehow it's adjusted everything, so the background has gone from a yellowy brown to a nice white, as with these cups here, and it's figured out that these slightly pixelated things are actually meant to be upside-down handles. This is on only 20,000 images, so I'm very surprised it's fixing the color, because we never asked it to; but I guess it knows what a cup is meant to look like, and so what it's decided to do is make the image up the way it thinks it's meant to look. So that was pretty cool. Then, to go from there to style transfer was pretty straightforward. I had to read in my style image, as before. This is the code to do their special kind of ResNet block where we use valid convolutions, which means we lose two pixels each time, and therefore we have to do a center crop; but don't forget, Lambda layers are great for this kind of thing: whatever code you can write, chuck it in a Lambda layer and suddenly it's a Keras layer. So I do my center crop, and this is now a ResNet block which does valid convolutions. The rest is basically all exactly the same: we do a few downsamplings, then the computation, then our upsamplings, just like the supplementary materials of the paper describe. The loss function looks a lot like the loss function did before, but we've got two extra things. First, here is a version of the Gram matrix which works a batch at a time (sketched below, along with the second trick); if any of you tried to do this a single image at a time, you would have gone crazy with how slow it was, and I saw a few of you trying to do that, so here's the batch-wise version of the Gram matrix. The second thing I needed to do was somehow feed in my style target. Another thing I saw some of you do was feed in the style target every time, feeding that array into your loss function. Now, you can obviously calculate your style target by calling .predict on the thing which gives you your style target layers, but the problem is that .predict returns a NumPy array, and it's a pretty big NumPy array, which means that when you then want to use it as a style target in training, it has to be copied back to the GPU; copying to the GPU is very, very, very slow, and this is a really big thing to copy to the GPU. So if you do try this, and I saw some of you try it, it takes forever. Here's the trick: call variable on it. Turning something into a variable pins it on the GPU for you, so once you've done that, you can treat it as a list of symbolic entities, the GPU-resident versions of those arrays, and use them inside your GPU code. So here are my style targets, and I can use them inside my loss function without any copying backwards and forwards. It's a subtlety, but if you don't get it right, you're going to be waiting a week or so for your code to finish. So those were the little subtleties necessary to get this to work, and once you get it to work, it does basically exactly the same thing as before. So, where this gets combined with DeViSE: I wanted to try something interesting. In the original perceptual losses paper, they trained on the COCO dataset, which has 80,000 images, which didn't seem like many; I wanted to know what would happen if we trained on all of ImageNet. So I did: I decided to train a super resolution network on all of ImageNet, and the code's all identical, so I'm not going to explain it.
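Backing up to those two loss-function tricks for a moment, here's a minimal sketch of the batch-wise Gram matrix plus the variable trick, written against the Keras backend and assuming channels-last tensors; the model and array names are hypothetical:

```python
from keras import backend as K

def gram_matrix_b(x):
    # x: (batch, height, width, channels) -> one Gram matrix per image
    x = K.permute_dimensions(x, (0, 3, 1, 2))        # to (batch, ch, h, w)
    s = K.shape(x)
    feat = K.reshape(x, (s[0], s[1], s[2] * s[3]))   # flatten spatial dims
    gram = K.batch_dot(feat, K.permute_dimensions(feat, (0, 2, 1)))
    return gram / K.cast(s[1] * s[2] * s[3], K.floatx())

# Pin the style targets on the GPU as variables, instead of shipping big
# NumPy arrays to the GPU on every batch
style_targs = [K.variable(o) for o in style_model.predict(style_arr)]
```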
So then I wanted to try something more ambitious. In the original perceptual-losses paper, they trained on the COCO dataset, which has 80,000 images — that didn't seem like many to me, and I wanted to know what would happen if we trained on all of ImageNet. So I did: I decided to train a super-resolution network on all of ImageNet. The code is almost all identical, so I'm not going to explain it again, other than to note the differences. You'll notice we don't have the `[:]` here anymore, because we don't want to try to read the entirety of ImageNet into RAM — these are still bcolz arrays. All the other code is identical until we get to here, where I use a bcolz array iterator. I can't just call .fit, because .fit and .fit_generator assume that your iterator returns your data and your labels; in our case we don't have data and labels — we have two things that both get fed in as inputs, and our labels are just a list of zeros. So here's a good trick, which also answers the earlier question about how to do multi-input models on large datasets: create your own training loop. Write a loop over a bunch of iterations; inside it you can grab as many batches from as many different iterators as you like, and then call train_on_batch. In my case, the bcolz array iterator returns a batch of high-resolution and low-resolution images, so each iteration I grab one batch of high-res and low-res images and pass them as my two inputs to train_on_batch. That's the only code I changed, other than replacing .fit_generator with my own loop.
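A minimal sketch of that loop, assuming a hypothetical bcolz-backed iterator `bc_it` that yields a (high-res, low-res) pair of batches, and a model whose loss is computed internally (so the Keras-level targets are just zeros):

```python
import numpy as np

def train(model, bc_it, niter, bs=16):
    # dummy targets: the perceptual loss lives inside the model itself,
    # so we just drive the model's scalar output toward zero
    targ = np.zeros((bs, 1))
    for i in range(niter):
        hr, lr = next(bc_it)                  # one batch from each array
        model.train_on_batch([lr, hr], targ)  # two inputs, zero targets
```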
As you can see, this took four and a half hours to train; then I decreased the learning rate and trained for another four and a half hours. I actually did it overnight last night, and I only had time to get through about half of ImageNet, so this isn't even the whole thing — but check this out. We take that model and call .predict: here's the original high-res image, here's the low-res version, and here's the version we've created. It's done a pretty extraordinarily good job. On the original ball there was a vague yellow patch, and it's turned it into a nice little inscription; her eyes were two gray blobs, and it's turned them into proper eyes; you could just about tell that shape was an 'A' if you looked carefully, and now it's very clearly an 'A'. So it does an amazing job of upscaling.

Cooler still, this is a fully convolutional net, so it's not specific to any particular input resolution. That means I can create another version of the model that uses the high-res images as the input, call .predict on a high-res image, and this is what we get back. Look at that: we can now see all of this detail on the basketball, detail which you really couldn't make out before — it was there, but it was hard to see what it was. And look at her hair: this gray blob here — the model knows it's little bits of pulled-back hair. So we can take any sized image and make it bigger. This, to me, is one of the most amazing results I've seen in deep learning: we trained something on nearly all of ImageNet — a single epoch, so there's definitely no overfitting — and it's able to recognize what hair is meant to look like when it's pulled back into a bun. That's a pretty extraordinary result, I think. Something else I only realized later: the background is all a bit fuzzy, and there's this arm in the background that's a bit fuzzy — and the model knows it's meant to stay fuzzy. It knows what out-of-focus things look like. So what's equally cool is not just that it's incredibly precise and accurate, but that it knows blurry things need to stay blurry.

I don't know if you're as amazed by this as I am, but I thought it was a pretty cool result, and if we ran it over a 24-hour period — maybe two epochs of all of ImageNet — presumably it would get even better still. OK, let's take a seven-minute break, and I'll see you back here just after eight.

OK, thanks everybody — that was fun. We're going to do something else fun now, but before I continue I wanted to mention one thing I changed in the homework. I realized that in my manually created loss function I was already computing a mean squared error; but then, when I told Keras to make that output as close to zero as possible, I had to give it a loss function too, and I was giving it MSE — effectively squaring my already-squared errors. That seemed wrong, so I changed it to MAE, mean absolute error. When you look back over the notebooks, that's why you'll see that minor change: the outer loss is just there to say "get this as close to zero as possible", and re-squaring it didn't make any sense. The other thing to mention: I noticed that when I retrained my super-resolution model on my new images — the ones without the black border — it gave good results much, much faster. I really think that learning to put the black border back in was taking quite a lot of effort, so hopefully some of you will look into that in more detail.

So now we're going to learn about generative adversarial networks — GANs — which will close off our deep dive into generative models as applied to images. Just to remind you, the purpose of all this has been to learn about generative models, not specifically about super resolution or artistic style; these things can be used to create all kinds of images. One of the groups is interested in taking a 2D photo and turning it into something you can rotate in 3D, or at least show from a different angle, and that's a great example of something this should totally work for — it's just a mapping from one image to a different image, namely "what does this scene look like from above rather than from the front". So keep in mind: just as in Part 1 we learned about classification, which you can use for a thousand things, we're now learning about generative models, which you can use for a different thousand things.

Now, any generative model you build, you can make better by adding a GAN — a generative adversarial network — on top of it. I don't feel this has been fully appreciated: people generally treat GANs as a different way of creating a generative model, but I think of them more as "why not create your generative model using the techniques we've been talking about, and then improve it". Think of it this way. Remember all the artistic style stuff we were doing, and my terrible attempt at a Simpsons-cartoon version of a picture — it looked nothing like the Simpsons, right? What would be one way to improve that? One way would be to create two networks: one network takes our picture, which is actually not a Simpsons image, and another picture that actually is a Simpsons image, and we train a neural network that looks at those images and spits out an answer: is this a real Simpsons image or not? This thing we'll call the discriminator.
We could easily train a discriminator right now: it's just a classification network, using the same techniques we used in Part 1. We feed it images, and it spits out a one if it's a real Simpsons cartoon and a zero if it's Jeremy's crappy generated Simpsons. That's easy — we know how to do that right now, so go and build another model.

There was a question: does it take two images as inputs? No — you could feed it one thing that's a real Simpsons and one thing that's a generated output, and it would be up to you to feed it one of each; but alternatively, and probably easier, you just feed it one image at a time and it spits out "is this the Simpsons or isn't it", and you mix the two kinds up. It's the latter that we're going to do. (My arrows aren't working, but:) we're going to have one input, which either is or is not a real Simpsons image — a 50-50 mix of the two — and one output: is it real or not? From now on we'll generally call this discriminator D, and we can think of it as a function: D takes some input x, which is an image, and spits out a one or a zero — or, more precisely, a probability.

What we can now do is create another neural network which takes as input some random noise, just like all of our generators so far, and spits out an image; and its loss function is: if you take that image and stick it through D, did you manage to fool it? Can you create something where D says "oh yeah, totally, that's a real Simpsons"? We'll call this the generator, G. It's just like our perceptual-losses style-transfer model — it could even be exactly the same model — but the loss function is now: take the output, stick it through D, and try to trick it. The generator is doing well if the discriminator is getting it wrong.

One way to run this would be: take the discriminator and train it as well as we can to recognize the difference between our crappy Simpsons and real Simpsons, then take a generator and train it to trick that discriminator. But at the end of that, the discriminator is probably still not very good, because it never had to be — my Simpsons generator was so bad. So I can now go back and retrain the discriminator on my better generated images, then go back and retrain the generator, and back and forth I go. That is the general approach of a GAN: alternate between training a discriminator and training a generator that uses the discriminator as its loss function. So we have one term which is the discriminator applied to a real image, and another which is the discriminator applied to the generator applied to noise. (And yes — in practice these things spit out probabilities.)

That's the general idea. In practice, people found it very difficult to train things that way — discriminator as far as it will go, stop, then generator as far as it will go. So let's see what the original GAN paper did. It's called "Generative Adversarial Nets", and here you can see they've specified the loss function in notation: they describe it as minimizing over the generator whilst maximizing over the discriminator, which is what the "min max" refers to.
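For reference, this is the value function from that paper — the discriminator D is trained to maximize it while the generator G is trained to minimize it:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```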
What they do in practice is work a batch at a time: there's a loop, and on each pass they take a single batch and put it through the discriminator, then take a batch and put it through the generator, and repeat. So let's look at that. Here's the original GAN from that paper, done on MNIST: we're going to see whether we can, starting from scratch, create something that generates images the discriminator cannot tell apart from real ones — where the discriminator has itself learned to be good at telling real MNIST images from fakes.

We load in MNIST, and the first thing they do in the paper is use a standard multi-layer perceptron, so we will too. Here's our generator, a standard multi-layer perceptron, and here's our discriminator, also a standard multi-layer perceptron. The generator has a sigmoid activation, so it spits out an image where all the pixels are between 0 and 1 — if you want to print it out, just multiply by 255. So there's our generator, there's our discriminator, and then there's the combination of the two: take the generator and stick it into the discriminator, which we can do with Sequential. That combined model is effectively the loss function I want for my generator: generate something, then see whether it fools the discriminator. So that's all my architectures set up.

The next thing is to set up the function called train, which does the adversarial training — let's go and look at it. Train goes through a bunch of epochs, and notice I wrap the loop in tqdm, the thing that creates a nice little progress bar; it does nothing else, as we learned last week. The first thing it does each step is generate some data to feed the discriminator, using a little function I've created (sketched below): a batch that's part real and part fake. The real part is some randomly selected MNIST digits from the actual training set. For the fake part, noise is a function I've defined up here which creates 100 random numbers per item, and I call G.predict on that noise. Then I concatenate the two, so I have some real data and some fake data, and the discriminator will try to predict which is which — so I return my data along with labels: a bunch of zeros to say "these are real", and a bunch of ones to say "these are fake". That's the discriminator's data: create a batch of it, and do one batch of training.
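A minimal sketch of those helpers, assuming hypothetical names `X_train` (the MNIST images, flattened) and `G` (the generator) — an illustration of the idea, not the notebook verbatim:

```python
import numpy as np

def noise(bs):
    # a batch of 100-dimensional random vectors to feed the generator
    return np.random.rand(bs, 100)

def data_D(bs, G):
    # half a batch of real digits, half a batch of generated fakes
    real = X_train[np.random.randint(0, X_train.shape[0], size=bs)]
    fake = G.predict(noise(bs))
    X = np.concatenate([real, fake])
    # labels: 0 = real, 1 = fake
    return X, np.concatenate([np.zeros(bs), np.ones(bs)])
```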
Then I do the same kind of thing for the generator — but when I train the generator, I don't want to change the discriminator's weights. make_trainable simply goes through each layer and marks it as not trainable. So I make my discriminator non-trainable and do one batch of training where the inputs are noise and the goal is for the discriminator to declare them real — which is why I pass in a bunch of zeros: remember, zero means real. And that's it; then I make the discriminator trainable again. So we keep looping through this: train the discriminator on a batch that's half real and half fake, then train the generator to try to trick the discriminator using all fakes, and repeat. That's the training loop of a basic GAN (sketched in full below). Because we use tqdm we get a nice progress bar, and we can plot the loss we kept track of at each step: there's the loss for the discriminator, and there's the loss for the generator.

So the question is: what do these loss curves mean? Are they good or bad — how do we know? The answer is that for this kind of GAN they mean nothing at all. The generator's loss could look fantastic simply because the discriminator is terrible; neither curve tells you whether either network is actually good. Even the order of magnitude of each is meaningless: the curves mean nothing, the direction of the curves means nothing, and this is one of the real difficulties of training GANs. And here's what happens when I take 12 randomly selected noise vectors and put them through the generator: we have not got things that look terribly much like digits, and they don't show much variety either. This is called mode collapse, a very common problem when training GANs, and what it means is that the generator and the discriminator have reached a kind of stalemate where neither knows how to improve from here — in optimization terms, we've found a local minimum. So that was not very successful. Can we do better?

The next major paper that came along was this one — let's go to the top so you can see its name: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks". It introduced what they called DCGANs, and the main page to look at is page 3, where they say that core to their approach is doing three things. Basically they do exactly the same thing as the original GAN, but applying the tricks we've been learning for generative models: use an all-convolutional net, getting rid of max pooling in favor of strided convolutions; get rid of fully connected layers in favor of lots of convolutional features; and add batch norm. In short: use a CNN rather than an MLP.

So here it is, and it will look very familiar — it looks just like last lesson's material. The generator takes in some random grid of inputs, then does batch norm, upsample, 1×1 conv; batch norm, upsample, 1×1 conv; batch norm; and a final conv layer. (You'll notice I'm doing something even newer than the paper: I'm using the upsampling approach, because we know that's better.) The discriminator basically does the opposite — stride-2 convolutions, so downsampling. Another trick, which I think is mentioned in the paper, is that before starting the back-and-forth of a batch for the discriminator and a batch for the generator, you train the discriminator on its own for a fraction of an epoch — a few batches — so it at least knows how to recognize the difference from a random image. You can see here that I start by calling discriminator.fit with just a very small amount of data; this is bootstrapping the discriminator. Then I go ahead and call the same train function as before with my better architectures.
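For completeness, here's the basic alternating loop described above, in sketch form — assuming Keras, the `noise` and `data_D` helpers from the previous sketch, `D` (the discriminator), and `GAN` (the generator stacked into the discriminator). Note that in some Keras versions trainability changes only take effect after a re-compile, so treat this as the idea rather than drop-in code:

```python
import numpy as np
from tqdm import tqdm

def make_trainable(net, val):
    # freeze or unfreeze every layer of a Keras model
    net.trainable = val
    for layer in net.layers:
        layer.trainable = val

def train(D, G, GAN, n_steps, bs):
    for i in tqdm(range(n_steps)):
        # 1) one batch for the discriminator: half real, half fake
        X, y = data_D(bs, G)
        D.train_on_batch(X, y)
        # 2) one batch for the generator: freeze D, feed pure noise, and
        #    ask for the "real" label (0), i.e. try to trick D
        make_trainable(D, False)
        GAN.train_on_batch(noise(bs), np.zeros(bs))
        make_trainable(D, True)
```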
Again, these curves are totally meaningless — but we now have something where, if you squint, you could almost convince yourself that that's a 5. And until a week or two before this course started, this was about as good as it got. People were much better at the artisanal details of this than I was, and indeed there's a whole page called "GAN hacks" with lots of tips. But then, a couple of weeks before this class started, as I mentioned in the first lesson, along came the Wasserstein GAN, which got rid of all of these problems.

Here is the Wasserstein GAN paper, and it's quite an extraordinary paper — particularly because, as I think I mentioned in the first class of this part, most papers tend to be either math theory that goes nowhere, or nice experiments and engineering with a theory bit hacked on at the end, essentially meaningless. This paper is entirely driven by theory, and then they go on to show what the theory means and what to do about it — and suddenly all the problems go away. The loss curves are going to actually mean something, and we'll be able to do what I said we wanted to do right at the start of this GAN section: train the discriminator a whole bunch of steps, then the generator, then the discriminator a whole bunch of steps again, and have all of that suddenly start working.

So how do we get it to work? Despite the paper being long and full of equations, theorems, and proofs — with a whole bunch of appendices at the back containing more theorems and proofs — there are actually only two things we need to do. One: remove the log from the loss function; rather than cross-entropy loss, we'll just use mean squared error. Two: constrain the weights so that they lie between −0.01 and +0.01. (Both changes are sketched below.) Now, saying "that's all we're going to do" rather fails to give credit to this paper, because the whole contribution of the paper is figuring out that that is what we need to do. Some of you have been reading through it on the forums, and I've already given you some tips; there's also a really great walkthrough, which I've put on our wiki, that explains all the math from scratch.
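To make the two changes concrete, here's a hedged Keras-style sketch. The MSE loss and the ±0.01 range come from the lecture; the mechanics of the clipping (looping over the layers after each discriminator update) are my assumption, not the notebook verbatim:

```python
import numpy as np

# change 1: no log / cross-entropy -- compile the critic with MSE instead
D.compile(optimizer='rmsprop', loss='mse')

# change 2: after every discriminator update, clip all weights into
# [-0.01, 0.01], the range the Wasserstein GAN paper prescribes
def clip_weights(net, c=0.01):
    for layer in net.layers:
        layer.set_weights([np.clip(w, -c, c) for w in layer.get_weights()])
```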
Basically, what the math says is this. The loss function for a GAN is not really the loss function you type into Keras, as we'd thought. What we really care about is the difference between two distributions — the real data distribution and the generator's distribution — and a difference of two loss functions has a very different shape from a loss function on its own. It turns out that the quantity implied by the two cross-entropy losses is something called the Jensen-Shannon distance, and this paper shows that, as a loss, it's hideous: it isn't usefully differentiable, and it doesn't have a nice smooth shape at all. That goes a long way toward explaining why we kept getting mode collapse and failing to find good minima — mathematically, this loss just doesn't behave the way a good loss function should. We'd never hit anything like this before, because we'd only ever trained a single function at a time, with loss functions we really understand — mean squared error, cross-entropy. Even if we haven't always derived the math in detail ourselves, plenty of people have, and we know those losses are nice and smooth, with shapes that do what we want them to do. Here, by training two things adversarially against each other, we're doing something quite different, and this paper absolutely fantastically shows, with both examples and theory, why that was never going to work well.

(A question about the cosine distance.) Yes — but the cosine distance is a distance between two individual things, whereas the distances we're talking about here are distances between two distributions, which is a much trickier problem to deal with. The cosine distance, as you'll see if you look at the notebook during the week, is basically the same as the Euclidean distance except that you normalize the data first, so it has all the same nice properties the Euclidean distance has.

One fun thing is that the authors of this paper released their code in PyTorch. The first PyTorch pre-release came out in mid-January, and you won't be surprised to hear that one of the paper's authors is also a main author of PyTorch — he was writing this before the library was even released. There are lots of reasons to learn PyTorch anyway, so here's a good excuse: let's look at the Wasserstein GAN in PyTorch. Pretty much all the other code I'm showing you in this part of the course is very loosely based on bits of other people's code that I had to massively rewrite because it was wrong and hideous; this code, though, I only refactored slightly to simplify one thing, so it's very close to theirs. A very nice paper with very nice code — that's a great thing.

Before we look at the Wasserstein GAN in PyTorch, though, let's look briefly at PyTorch itself. What you're going to find is that PyTorch looks a lot like NumPy, which is nice: we don't have to create a computational graph using variables and placeholders, then later on a session, and so on. I'm sure you've seen by now, with Keras on TensorFlow, that if you try to print some intermediate output, it just prints "tensor" and tells you how many dimensions it has — because that thing is only a symbolic node in a computational graph. PyTorch doesn't work that way. PyTorch is what's called a define-by-run framework: it's designed to be so fast at taking your code and executing it that you don't have to build the graph in advance. Every time you run a piece of code, it puts it on the GPU, runs it, and sends the result back, all in one go, which makes everything look very simple.

What follows is a slightly cut-down version of the tutorial PyTorch provides on their website, so you can grab the full thing from there. Rather than creating an np.array, you create a torch Tensor, but other than that it's identical — here's a random torch Tensor. The API is a little bit different — rather than .shape it's .size() — but you can see it looks very similar. And unlike in TensorFlow or Theano, we can just say x + y, and there's the answer; we don't have to say "z = x + y; f = a function with x and y as inputs and z as output; f.eval()". We just write x + y and it runs — you can see why it's called define-by-run: we provide the code, and it runs it. Generally speaking, most operations in Torch have a prefix version as well as this infix version — exactly the same operation, written differently.
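A few of those basics in code — a small sketch along the lines of the PyTorch tutorial:

```python
import torch

x = torch.rand(5, 3)     # like np.random.rand(5, 3)
y = torch.rand(5, 3)
print(x.size())          # .size() where NumPy would say .shape

z = x + y                # infix: runs immediately, no graph or session
z2 = torch.add(x, y)     # prefix version of exactly the same operation

a = x.numpy()            # Torch tensor -> NumPy array (memory is shared!)
b = torch.from_numpy(a)  # NumPy array -> Torch tensor (also shared)
```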
In fact, you can nearly always add an out= argument, which puts the result into preallocated memory. We've already talked about why preallocating memory matters, and it matters especially on GPUs, so if you write your own algorithms in PyTorch you'll need to be careful about this. Perhaps the best trick is that you can stick an underscore on the end of most operations to make them happen in place: y.add_(x) is basically y += x — that's what the trailing underscore means. So there are some good little tricks. Slicing works just like NumPy. You can turn Torch tensors into NumPy arrays and vice versa by simply calling .numpy() — but one thing to be very aware of is that the two then refer to the same memory, so if I do an in-place a.add_(1), b changes too. The same goes in the other direction: torch.from_numpy turns a NumPy array into a Torch tensor, and if you change the NumPy array, the Torch tensor changes.

All of that so far has run on the CPU. To make anything run on the GPU, you put .cuda() on the end of it — so this x + y just ran on the GPU.

Where things get cool is that PyTorch knows not just how to do the arithmetic, but how to take its gradient. To get something that calculates gradients, you take your Torch tensor and wrap it in a Variable, passing requires_grad=True; from then on, anything I do to x is remembered, so that the gradient can be taken later. For example, x + 2 gives back 3, just like a normal tensor — a Variable and a tensor have the same API — except that I can keep applying operations (squared, times three, .mean()), then later call .backward() and read .grad to get the gradient. That's the critical difference between a tensor and a Variable: identical APIs, except a Variable also has .backward(), which gets you the gradient. The reason the gradient here is d(out)/dx is that I called backward on out — it's the derivative with respect to whatever you call backward on. And here's the kind-of-crazy part: you can do things like for loops and take gradients through them. That sort of thing is pretty tricky with TensorFlow or Theano or any of these computation-graph approaches, so this gives you a whole lot of flexibility to define things in much more natural ways — you can really write PyTorch just as if you were writing regular old NumPy code.
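A minimal autograd sketch using the Variable API of that era (later PyTorch folds this into plain tensors, but the idea is the same):

```python
import torch
from torch.autograd import Variable

x = Variable(torch.ones(1), requires_grad=True)  # track operations on x
y = x + 2                     # y is 3, and remembers how it was computed
out = (y * y * 3).mean()      # out = 3 * (x + 2)^2
out.backward()                # compute d(out)/dx
print(x.grad)                 # 6 * (x + 2) = 18 at x = 1
```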
PyTorch has plenty of libraries too, so if you want to create a neural network, here's how you do a CNN. I warned you early on that if you don't know about object orientation in Python, you need to learn it — and here's why: in PyTorch, everything is done using OO. I really like this, because in TensorFlow they kind of invent their own weird way of programming rather than using Python's OO, whereas PyTorch just says: we already have these features in the language, let's use them. It's way easier, in my opinion. To create a neural net, you create a new class deriving from Module, and in the constructor you create all the things that have weights: conv1 is now something with weights (it's a 2D conv), conv2 is something with weights, fully-connected layer 1 is something with weights — there are all of your layers. Then you define exactly what happens in your forward pass. Because max pooling has no weights and ReLU has no weights, there's no need to define those in the initializer — you just call them as functions in the forward pass; the things with weights need to be stateful and persistent, which is why they live in the constructor. So in the forward pass you literally just write down what happens; .view is the same as reshape. The whole API has different names for everything, which is mildly annoying for the first week, but you get used to it. If during the week you try PyTorch and find yourself asking "how do I say blah in PyTorch" and can't find it, feel free to post on the forum. Having said that, PyTorch has its own Discourse-based forums, and as you can see they're just as busy and friendly as ours — people post there all the time, and I find them really helpful — so feel free to ask over there or over here.

You can then put all of that computation onto the GPU by calling .cuda(), and put some input onto the GPU with .cuda() as well; then you can calculate your loss, calculate your derivatives, and take an optimizer step — this is just one step of the optimizer, so it has to go inside a loop. Those are the basic pieces. At the end of the tutorial there's a complete training process, but I think it will be more fun to see the process in the Wasserstein GAN, so here it is.

I've got this torch_utils module — you'll find it on GitHub — which has the basic stuff you'll want for Torch, so you can just import it and get the Wasserstein GAN working. We set up the batch size, the size of each image, and the size of our noise vector. And look how cool this is — I really like it: this is how you import datasets. The torchvision library has a datasets module containing the CIFAR-10 dataset; it will automatically download it to the path you give if you say download=True, and rather than figuring out the pre-processing yourself, you create a list of transforms. I think it's a really lovely API, and the reason something so new has such a nice API is that it comes from an older library called Torch that's been around for many years — these folks basically started by copying what already existed and worked well. Very elegant.

There are two datasets you can look at here, both from the paper. One is CIFAR-10 — those tiny little images. The other is something we haven't seen before called LSUN, which is a really nice, huge dataset with millions of images — three million bedroom images, for example. We can use either. This is pretty cool too: we create a DataLoader and say how many workers to use — we already know what workers are, and it's all built into the framework, so once you know how many workers your CPU likes, you just put that number in and your CPU loads the data in parallel in the background. We're going to start with CIFAR-10, so we've got 47,000 of those images.

I've put the definitions of the discriminator and generator architectures into a separate Python file, dcgan.py, which we'll skip over quickly because it's really straightforward. Here's a conv block, which consists of a Conv2d, a BatchNorm2d, and a LeakyReLU. In the initializer I can then say: start with a conv block, optionally add a few extra conv blocks, and — this is really nice — here's a while loop that keeps adding more downsampling blocks until you've got as many as you need. That's a really nice use of a while loop to simplify building an architecture, with a final conv block at the end to produce the thing we want (sketched below).
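In sketch form, loosely in the spirit of that dcgan.py, with names of my own choosing — treat it as an illustration of the pattern rather than their exact code:

```python
import torch.nn as nn

def conv_block(ni, nf):
    # Conv2d -> BatchNorm2d -> LeakyReLU; stride 2 halves the feature map
    return nn.Sequential(
        nn.Conv2d(ni, nf, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(nf),
        nn.LeakyReLU(0.2, inplace=True))

def make_discriminator(isize, nc, ndf):
    layers = [nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
              nn.LeakyReLU(0.2, inplace=True)]
    csize, cndf = isize // 2, ndf
    # keep adding downsampling blocks until the feature map is 4x4
    while csize > 4:
        layers.append(conv_block(cndf, cndf * 2))
        cndf, csize = cndf * 2, csize // 2
    # final conv collapses the 4x4 map to a single score per image
    layers.append(nn.Conv2d(cndf, 1, 4, 1, 0, bias=False))
    return nn.Sequential(*layers)
```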
And this is pretty nifty: if you pass in ngpu greater than one, it calls nn.parallel.data_parallel, passing in those GPU IDs, and you get automatic multi-GPU training — by far the easiest multi-GPU training I've ever seen. So that's the forward pass. (There's a question — behind you, Rachel.) We'll learn more about this over the next couple of weeks; in fact, given we're a little short of time, let's discuss it next week — let me know if you don't think we've covered it.

Here's the generator, which looks very similar; again there's a while loop to make sure we go through the right number of deconv blocks. This is actually interesting: it would probably be better off as an upsampling block followed by a convolution, because as it stands it probably has the checkerboard-pattern problem — so maybe at home you could try that and see if you get better results. So there are our generator and our discriminator: only 75 lines of code, nice and easy.

Everything's a little bit different in PyTorch. If you want to say which initializer to use, it's again more decoupled — maybe a little more complex at first, but there are fewer things you have to learn. You can call something called apply, which takes a function and applies it to everything in your architecture; that function says "is this a Conv2d or a ConvTranspose2d? if so, use this initialization; or if it's a batch norm layer, use this other initialization" (sketched below). So again, everything's a little different — there isn't a separate initializer parameter — but in my opinion this is much more flexible, and I really like it.

As before, we need something that creates noise, so we create some fixed noise, and we have an optimizer for the discriminator and an optimizer for the generator. Here's something that does one step of the discriminator: call the forward pass, then the backward pass, then return the error. Just like before there's a make_trainable — this is how you make something trainable or not in PyTorch — and just like before there's a train loop. The train loop has a bit more going on, partly because of the Wasserstein GAN and partly because of PyTorch, but the basic idea is the same: for each epoch, for each batch, make the discriminator trainable, then set the number of iterations to train the discriminator for. Remember I told you that one of the nice things about the Wasserstein GAN is that we don't have to alternate one batch of discriminator with one batch of generator — we can train the discriminator properly for a bunch of batches. In the paper they suggest five iterations — five batches of discriminator training — each time through the loop, unless you're still in the first 25 generator iterations, in which case train it for 100 batches; and they also say to do 100 batches from time to time thereafter. By having the flexibility of a hand-written loop, we can do exactly what the paper wants.
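Here's that apply-based initialization trick — this snippet follows the released WGAN code quite closely, with `netD` as a hypothetical model variable:

```python
def weights_init(m):
    # called on every module in the network via model.apply(weights_init)
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:        # Conv2d and ConvTranspose2d
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)

netD.apply(weights_init)
```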
Basically, at first we train the discriminator carefully, and from time to time thereafter we again train it very carefully; otherwise we just do five batches. So this is where we train the discriminator, and you'll see that we clamp — clamp is the same as clip — the weights of the discriminator so they fall in this range. If you're interested in reading the paper, it explains that the reason for this is that their assumptions only hold within this small region, which is why we have to make sure the weights stay small. Then we do a single step with the discriminator on real data; we create some noise, get our fake data, and take a step on that too, so we can subtract the fake result from the real result to get our error for the discriminator. That's one step of the discriminator, and we do it either five or a hundred times. Then we make the discriminator non-trainable and do one step of the generator: you can see we call the generator with some noise and pass the result into the discriminator to see whether we tricked it or not. So during the week you can look at these two versions — the PyTorch and the Keras — and you'll see they're basically the same thing; the only differences are two: the presence of this clamping, and the fact that the loss function is mean squared error rather than cross-entropy.
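Putting the inner loop together — a sketch loosely following the structure of the released WGAN code, written in today's PyTorch style (the lecture-era code used Variables, but the shape is the same). `netD`, `netG`, `optimizerD`, `optimizerG`, `dataloader`, `n_epochs`, `nz` (noise size), and `bs` are assumed to exist:

```python
import torch

def make_trainable(net, val):
    # freeze/unfreeze all parameters (used while stepping the generator)
    for p in net.parameters():
        p.requires_grad = val

gen_iters = 0
for epoch in range(n_epochs):
    data_iter, i = iter(dataloader), 0
    while i < len(dataloader):
        # train the critic: 100 batches early on (and occasionally
        # thereafter), otherwise 5, as the paper prescribes
        make_trainable(netD, True)
        d_iters = 100 if gen_iters < 25 or gen_iters % 500 == 0 else 5
        for _ in range(d_iters):
            if i >= len(dataloader):
                break
            for p in netD.parameters():       # the WGAN weight clipping
                p.data.clamp_(-0.01, 0.01)
            real, _ = next(data_iter); i += 1
            netD.zero_grad()
            err_real = netD(real).mean()
            fake = netG(torch.randn(real.size(0), nz, 1, 1))
            err_fake = netD(fake.detach()).mean()
            (err_real - err_fake).backward()  # critic's Wasserstein estimate
            optimizerD.step()
        # one step of the generator: try to fool the frozen critic
        make_trainable(netD, False)
        netG.zero_grad()
        fake = netG(torch.randn(bs, nz, 1, 1))
        netD(fake).mean().backward()
        optimizerG.step()
        gen_iters += 1
```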
So let's see what happens. Here are some examples generated from CIFAR-10, and they're certainly a lot better than our crappy DCGAN MNIST examples, but they're not great. Why not? Probably because CIFAR-10 contains quite a few categories of quite different kinds of things, so the model doesn't really know what it's supposed to be drawing a picture of. Sometimes it seems to figure it out — this must be a plane, I think — but a lot of the time it hedges and draws something that looks like it might be a reasonable picture, without being a picture of anything in particular. (Here's the real CIFAR-10, by the way.) The LSUN dataset, on the other hand, has three million bedrooms, so we'd hope that training the Wasserstein GAN on LSUN bedrooms would give better results — and here are our fake bedrooms, and they are pretty freaking awesome. These literally started out as random noise, and every one has been turned into something that's definitely a bedroom — I mean, they're all definitely bedrooms — and here are the real bedrooms to compare. You can see that if you took this and stuck it on the end of any kind of generator, you could really use it to make that generator much more believable: any time you look at your output and think "that doesn't look like a real x", maybe you could try using a WGAN to make it look more like a real x. And here's the other thing: the loss functions for these actually make sense. The discriminator's and generator's losses actually decrease as they get better, so you can tell whether your model is training properly. I'm not sure you can exactly compare two different architectures against each other yet, but you can certainly see that the training curves are working.

So now that we have — in my opinion — a GAN that actually works reliably for the first time ever, I feel this changes the whole equation for what generators can and can't do, and it hasn't been applied to anything yet. You could take any old paper that produces 3D outputs or augmentations or depth outputs or colorization or whatever, add this on, and it would be great to see what happens — none of that has been done before, because we haven't had a good way to train GANs before. So for anyone interested in a project, this would be a great one, and maybe something you could do reasonably quickly. Another project idea is to convert this into Keras: take the Keras DCGAN notebook we've already got, change the loss function, add the weight clipping, try training on the LSUN bedrooms dataset, and you should get the same results — and then you can add it on top of any of your Keras work.

There's so much you could do this week that I don't feel like giving you an assignment per se — there are a thousand assignments you could do. As usual, go back and look at the papers. The original GAN paper is a fairly easy read. It has a section called "Theoretical Results" — the kind of math section that can seem pointless — and it's actually interesting to read it now: they prove various nice things about their GAN, like the generative model perfectly replicating the data-generating process, and yet it turned out not to matter — the thing still didn't really work. Which is not to say it isn't a good paper; it is. But it's interesting to see when the theoretical material turns out to be useful and when it doesn't. Then look at the Wasserstein GAN's theoretical sections, which spend a lot of time explaining why their theory actually matters. They have this really cool example where they ask: what if you want to learn something really simple, just parallel lines? And they show why the old way of doing GANs can't learn parallel lines, and then how their different objective function can. So for anybody interested in getting into the theory a little, it's very instructive to compare the two: in the first paper, a proof of convergence showed something that turned out not to matter in practice, whereas in this paper the theory was super important, and basically created something that let GANs work properly for the first time. There's a lot to get out of these papers if you're interested.

As for the notation, we might look at it a bit more next week, but for now let's find the algorithm sections. The bit I find most useful — not being much of a math guy — is where they actually write out the pseudocode, and even there it's useful to pick up some nomenclature.
In the original GAN paper's algorithm it says "for each iteration" and then "for each of k steps" — so what do the steps mean? "Sample a minibatch of noise samples from the noise prior": a prior simply means np.random.something — in this case np.random.normal — in other words, some random number generator that you get to pick. "Sample a minibatch from the data generating distribution": that means randomly pick some examples from your array. So those are the two steps: generate some random numbers, and randomly select some things from your array. As for the part about the gradient, you can largely ignore it except for the bit in the middle, which is your loss function. You can see these z's here — that's your noise — so we have the generator applied to the noise, and then the discriminator applied to the generator applied to the noise: that's the part where we're trying to fool the discriminator, and wanting to trick it is why there's the "one minus". And here is the part that makes the discriminator accurate, because these x's are the real data. So that's the math version of what we just learned. The Wasserstein GAN paper also has an algorithm section, and it's interesting to compare the two: here's its algorithm, and it says basically the same thing as the last one, though I actually find it a bit clearer — sample from the real data, sample from your prior. OK — hopefully that's enough to get you going. I look forward to talking on the forums and seeing how everybody gets along. Thanks, everybody!