So welcome back everybody, thanks for coming. I hope you had a good week and had fun playing around with artistic style. I know I did, so I thought I'd show you a couple of simple little changes I tried over the week which you might be interested in.

Actually, before I talk about the artistic style, I wanted to point out some of the really cool stuff people have been contributing over the week. If you haven't come across it yet, be sure to check out the wiki. There's a nice feature in Discourse where you can set any post to be a wiki, which means that anybody can edit it. So I created this wiki post early on, and by the end of the week it had all kinds of stuff: links to the material from the class, a summary of the paper, examples, a list of all the links, code snippets, a handy list of the steps you need when you're doing style transfer, lots of stuff about the TensorFlow Dev Summit, and so forth. There will be wiki threads every week, so look out for them. There are lots of other threads too. One I saw pop up this afternoon, which is great, is from Chin Chin, who tried to summarize what they'd learned from lots of other threads across the forum. This is a great thing we can all do: when you look at lots of different things and take notes, putting them on the forum for everybody else is super handy. So if you haven't quite caught up on everything going on in the forum, this curating lesson experiments thread would probably be a good place to start.

So, a couple of little changes I made in my experiments. Quite a few of you pointed out that depending on what your optimizer's starting point is, you get to a very different place. So clearly our optimization is not finding a global minimum; it's getting stuck in local minima, or at least in saddle points it can't get out of. So I tried something, which was to take the random starting image and add a Gaussian blur to it. A Gaussian filter is just a blur, and it turns the random image into this kind of thing. I found that even the plain style transfer then looked a lot smoother, so that's one change I made which I thought worked quite well. Another change, just to play around, was to add a different weight to each of the style layers. So my zip now has a third thing in it, which is the weights, and I just multiply by the weight. I thought those two things made my little bird look significantly better than before, so I was happy with that. You could do a similar thing for content loss: you could also add several layers of content loss and give them different weights as well. I'm not sure if anybody has tried that yet.
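For concreteness, here is a rough sketch of those two tweaks, assuming a NumPy/Keras setup like last week's notebook. `style_loss`, `layer_acts` and `layer_targs` stand in for whatever your own notebook calls them; only the blur and the per-layer weights are the point.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

shp = (1, 288, 288, 3)                        # shape of the generated image

# 1. Start the optimizer from blurred noise rather than raw noise.
rand_img = np.random.uniform(-2.5, 2.5, shp)
rand_img = gaussian_filter(rand_img, sigma=[0, 2, 2, 0])   # blur height/width only

# 2. Give each style layer its own weight when summing the style losses.
style_wgts = [0.05, 0.2, 0.2, 0.25, 0.3]
# loss = sum(style_loss(a, t) * w
#            for a, t, w in zip(layer_acts, layer_targs, style_wgts))
```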
A question: "I have a question regarding style transfer for cartoons. With cartoons, when we think of transferring the style, what we really mean is transferring the contours of the cartoon, redrawing the content in that style. That's not what style transfer is doing here. How might one implement this?" Yeah, I don't know that anybody's quite figured that out, but I'll show you a couple of directions that may be useful. "I've tried selecting activations that correspond with edges and such, as indicated by one of the conv visualization papers, and comparing outputs from specifically those activations." Yeah, so I'll show you some things you could try. I haven't seen anybody do a great job of this yet, but here's one example from the forum. Somebody pointed out that this cartoon approach didn't work very well with Dr. Seuss, but when they changed their initial image to be the picture of the dog rather than random noise, it actually looked quite a lot better. So there's one thing you could try.

There are some very helpful diagrams somebody posted, which is fantastic; they summarize what we learned. I like this summary of what happens as you add versus remove each layer. So this is what happens if you remove block 0, block 1, block 2, block 3, or block 4, to get a sense of how they impact things. You can see, for the style, that the last layer is really important to making it look good, at least with this image. One of you had some particularly nice examples; I love these. It seems like there's a certain taste to figuring out which photos go with which style images. I thought this Einstein was terrific, and this one as well. Brad came up with a really interesting insight: starting with this picture and adding a style to it creates this extraordinary shape here where, as he points out, you can tell it's a man sitting in the corner, but there are fewer than 10 brush strokes. So sometimes style transfer does things which are surprisingly fantastic. I have no idea what this one is, it isn't even in the photos, so I can't tell; I guess I don't know that kind of music well enough.

So there are lots of interesting ideas you can try, and I've got a link here, which you might have seen in the PowerPoint, to a Keras implementation that has a whole list of things you can try, with some particular examples; you can get the details for all of these from that link. There's something called chain blurring, and this might work well for cartoons. Notice how the Matrix style doesn't do a good job with the cat when you use the classic approach from the paper we implemented, but if you use this chain blurring approach, it does a fantastic job. So I wonder if that might be one secret to the cartoons. Some of you, I saw on the forum, have already tried the next one, which is using color preservation and luminance matching. That basically means you're still taking the style, but you're not taking the color. On particular examples this gives really great results; I think it depends a lot on what images you try it with. You can go a lot further. For example, you can add a mask, and then say: only do color preservation for one part of the photo. So here the top part of the photo has color preservation and the bottom hasn't. They even show in that code how you can use a mask to say that one part of my image should not be stylized at all. Again, good results. And this one is really crazy: use masks to decide which of two style images to use, and then you can generate some really creative stuff. So there's a lot you can play with, and you can go beyond this to come up with your own ideas. Now, for some of the best stuff, you're going to learn a bit more today about how to do some of these things better.
But just to give you an idea, if you go to likemo.net, you can literally draw something using four colors and then choose a style image, and it will turn your drawing into an image. The idea is basically that blue is going to be water, green is going to be foliage, and red is going to be foreground. There are a lot of good examples of this kind of thing, they call it neural doodle, online. Something else we'll learn more about how to do better today: if you go to affinelayer.com, there's a very recent paper called Pix2Pix. We're going to be learning quite a bit in this class about how to do segmentation, which is where you take a photo and turn it into a colored image, basically saying the horse is here, the bicycle is here, the person is here. Pix2Pix is basically doing the opposite: by drawing something, you're saying, I want you to create something that has a window here, a windowsill here, a door here and a column there, and it generates a photo, which is fairly remarkable. So the stuff we've learned so far won't quite let you do these two things, but by the end of today it should. Here's a nice example that I think some folks at Adobe built, showing that you could basically draw something and it would try to generate an image close to your drawing, where you only needed a small number of lines. Again, we'll link to this paper from the resources. This one actually shows it to you in real time. You can see that there's a new way of doing art starting to appear, where you don't necessarily need a whole lot of technique. I'm not promising it's going to turn you into a Van Gogh, but you can at least generate images that maybe are in your head, in some style that's somewhat similar to somebody else's. I think it's really interesting.

Okay. One thing I was thrilled to see is that at least two of you have already written blog posts on Medium. That was fantastic to see, so I hope more of you might try to do that this week. It definitely doesn't need to be something that takes a long time. Some of you are also planning on turning your forum posts into blog posts, so hopefully we'll see a lot more blog posts popping up this week. I know the people who have done it have found it a useful experience as well.

So one of the things I suggested doing, pretty high on the list of priorities for this week's assignment, was to go back through the paper knowing what it's going to say. I think this is really helpful: when you already know how to do something, go back over the paper. It's a great way to learn how to read papers, because you already know what it's telling you; this is totally the way I learned to read papers. So I've gone through and highlighted a few key things which, as I went through, I thought were important, starting with the abstract of the paper. But let me ask: how many people went back and re-read this paper? A few of you, that's great. So in the abstract, they basically say what it is they're introducing: a system based on a deep neural network that creates artistic images of high perceptual quality. Okay, so we're going to read this paper, and hopefully at the end of it we'll know how to do that. Then in the first section, they tell us about the basic ideas: when CNNs are trained on object recognition, they develop a representation of the image along the processing hierarchy of the network.
The image is transformed into representations that increasingly care about the actual content compared to the pixel values. So that describes the basic idea of content loss. Then they describe the basic idea of style loss, which is looking at the correlations between the different filter responses over the spatial extent of the feature maps. This is one of those sentences that, read on its own, doesn't mean very much, but now that you know how to do it, you can read it and think: okay, I see what that means. Then, when you get to the methods section, we learn more. The idea here, and this answers one of the questions that one of you had on the forum, is that by including the feature correlations of multiple layers, we obtain a multi-scale representation of the input image. This idea of a multi-scale representation is something we're going to come across a lot, because a lot of this class, as we discussed last week, is about generative models, and one of the tricky things with generative models is to get both the general idea of the thing you're trying to generate correct and all the details correct. Getting the details right generally requires you to zoom in to a small scale, and getting the big picture right is about zooming out to a large scale. So one of the key things they did in this paper was show how to create a style representation that includes multiple resolutions, and we now know that the way they did that was to use multiple style layers: as we go through the layers of VGG, they gradually become lower and lower resolution, with larger and larger receptive fields.

It's always great to look at the figures and make sure you can see exactly what each figure is showing, and I was thrilled to see that some of you were trying to recreate these figures, which actually turned out to be slightly non-trivial. If you haven't tried it yourself yet, you might want to see whether you can recreate this figure. It's also good to try to find the key finding in a paper, the key thing they're showing. In this case, they found that the representations of content and style in a CNN are separable, and you can manipulate both to create new images. So again, hopefully we can now look at that and say: oh yeah, that makes sense. You'll find with papers, certainly with this one, that there's often quite a lot of introduction that says the same thing a bunch of different ways. So it's often worth going through the introductory remarks two or three times: the first time you read them, one paragraph might not make sense, but later on they say it a different way and it starts to make more sense. Then you can see them talking again about the different layers and how they behave, and again showing the results of some experiments; again, see whether you can recreate those experiments and make sure you understand how to do it. And then there's a lot of stuff I didn't find that interesting, until we get to the section called methods. The methods section is the one where, hopefully, you'll learn the most about reading papers, once you've implemented something yourself. Now I want to show you a few little tricks of notation. You do need to be careful of little details that fly by, like here: they used average pooling. That's a sentence which, if you weren't reading carefully, you could easily skip over.
Now we know: okay, we need to use average pooling, not max pooling. Papers will often have a section which explicitly says, now I'm going to introduce the notation. This paper doesn't; it just introduces the notation as part of the discussion. But at some point you'll start getting Greek letters, or things with subscripts, or whatever, and notation starts appearing, and at that point you need to start looking very carefully. At least for me, I find I have to go back and read something many times to remember what's L, what's M, what's N. That's the annoying thing with math notation: the single letters generally don't have any kind of mnemonic. Often, though, you'll find that across papers in a particular field, they tend to reuse the same English and Greek letters for the same kinds of things. So capital N will generally be the number of rows, capital M will often be the number of columns, K will often be the index you're summing over, and so on.

So here, the first thing introduced is x with an arrow on top. x with an arrow on top means it's a vector. It's actually an input image, but they're going to turn it into a vector by flattening it out. So, okay, our image is called x. Then the CNN has a whole bunch of layers, and every time you see something with a subscript or a superscript like this, you need to look at both of the two bits, because they've both got a meaning. The big letter is the main object, and the subscript or superscript is like the thing in square brackets in an array or a tensor in Python. So in this case, capital N is the number of filters, and the little l tells you which layer it belongs to. Often, as I read a paper, I'll actually try to write code as I go, and put in little comments, so I'll write layer, square bracket, layer number, close square bracket, and then have a comment after it, like N_l, just to remind myself; I'm creating the code and mapping it to the letters. So there are N_l filters, and we know from a CNN that each filter creates a feature map, so that's why there are N_l feature maps. Remember, any time you see the same letter, it means the same thing, within a paper at least, not necessarily across papers. Each feature map is of size M_l, and as I mentioned, N tends to be rows and M tends to be columns. Here it says M_l is the height times the width of the feature map, so you can see they've basically flattened each feature map out into one row.

Okay, now here's another piece of notation you'll see all the time. A layer l can be stored in a matrix called F, and now the l has moved to the top. It doesn't matter, it's the same basic idea, it's just an index. So the matrix F is going to contain our activations. And this thing here, where it says R with a little superscript, has a very special meaning: it's telling you the shape of the thing. The R means they're floats, real numbers, and the superscript with the × in it means it's a matrix, rows by columns. So there are N_l rows and M_l columns in this matrix, there's one such matrix for each layer, and there's a different number of rows and columns for each layer. So you can basically go through and map it to the code you've already written. I'm not going to read through the whole thing, but there's not very much here, and it would be good to make sure you understand all of it.
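As a concrete version of that notation, here is a tiny sketch of building the matrix F^l (each feature map flattened into a row) and, from it, the Gram matrix we computed last week. The array shape is just an example, channels last, and purely illustrative.

```python
import numpy as np

# Pretend these are the activations of one layer l: height x width x N_l filters.
acts = np.random.randn(56, 56, 128)

h, w, n_l = acts.shape
m_l = h * w                               # M_l = height * width of each feature map
F = acts.reshape(m_l, n_l).T              # F^l, an N_l x M_l matrix of floats

G = F @ F.T                               # Gram matrix: inner products between the
print(F.shape, G.shape)                   # flattened feature maps -> (128, 3136) (128, 128)
```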
The one part you can skip is the derivative, because we don't care about derivatives; they get done for us, thanks to Theano and TensorFlow. So you can always skip the bits about derivatives. Okay, so then they do the same thing to describe the Gram matrix. They show here that the basic idea of the Gram matrix is to take an inner product between the vectorized feature maps i and j. Vectorized here means turned into a vector, and the way you turn a matrix into a vector is to flatten it. So it's an inner product between the flattened feature maps, from those matrices we just saw. So hopefully you'll find this helpful. You'll see there are small differences: rather than taking the mean, they tend to use the sum here, and then they divide back out by the number of rows and columns to create the mean that way. In our code, we actually put the division inside the sum. So you see these little differences in how we implement things, and sometimes you may see actual meaningful differences, which is often a suggestion of something you could try.

Okay, so that describes the notation and the method, and that's it. But then, very importantly, throughout all of this, any time you come across some concept which you're not familiar with, it will pretty much always have a reference, a citation. You'll see there are little numbers all over the place; there are lots of different conventions for these references. Any time you come across something which has a citation, like a new piece of notation or a new concept you don't know, generally the first time I see it in a paper I ignore it. But if I keep reading and it turns out to be something that actually is important, and I can't understand the basic idea at all, I generally put the paper aside, put it in my to-read file, and make the paper it's citing the new paper I'm reading. Because very often a paper is entirely meaningless until you've read one or two of the key papers it's based on. Sometimes this can be like reading the dictionary in a language you don't know: layer upon layer of citations, and at some point you have to stop. I think you'll find that the basic set of papers these things refer to is pretty much all stuff you know at this point, so I don't think you're going to get stuck in an infinite loop. But if you ever do, let us know on the forum and we'll try to help you get unstuck. Or if there's any notation you don't understand, let us know. Another of the horrible things about math is that it's very hard to search for. It's not like code, where you can take a function name and search for it; instead you've got some weird squiggly shape. So again, feel free to ask if you're not sure. There is a Wikipedia page which lists pretty much every piece of math notation, and there are various other places you can look up notation as well. Okay, so that's the paper.

So let's move to the next step. I think what I might do is try to draw the basic idea of what we did before, so that I can then draw the idea of what we're going to do differently this time. So previously, and now this thing is actually calibrated, you'll be pleased to hear, we had a random image and we had a loss function. It doesn't matter exactly what the loss function was; we know that it happened to be a combination of style loss plus content loss.
And what we did was we took our image and our random image and we put them through this loss function, and we got out two things: one was the loss and the other was the gradients. Then we used the gradients with respect to the pixels to change the pixels of the random image, and we repeated that loop again and again, and the pixels gradually changed to make the loss go down. So that's the basic approach we just used, and it's a perfectly fine approach for what it is. In fact, if you want to do lots of different photos with lots of different styles, like if you created a web app where you said, please upload any style image and any content image and here's your artistic style version, this is probably still the best approach, particularly with some of those tweaks I talked about.

But what if you wanted to create a web app that was a Van Gogh irises generator? Upload any image, and I will give you that image in the style of Van Gogh's irises. Then you can do better than this approach. And the reason you can do better is that you don't have to do a whole optimization run to create each output. Instead, we can train a CNN to learn to output photos in the style of Van Gogh's irises. The basic idea is very similar. What we're going to do this time is have lots of images, and we're going to take each image and feed it into the exact same loss function we used before, with the style loss plus the content loss. But for the style loss we're going to use Van Gogh's irises, and for the content loss we're going to use the image we're currently looking at, out of the lots of images we're going to go through. And rather than changing the pixels of the original photo, what we're going to do instead is train a CNN, a whole bunch of layers of a CNN. I think the best way to show you this, let's move this out of the way, that's better, is to put a CNN in the middle. These are the layers of the CNN, and we're going to try to get that CNN to spit out a new image. So there's an input image and there's an output image, and this new CNN we've created is going to spit out an output image that, when you put it through this loss function, hopefully gives a small number. That means the content of the output still looks like the original photo's content, and the style of this new image looks like the style of Van Gogh's irises.

If you think about it, when you have a CNN, you can really pick any loss function you like. We've tended to use some pretty simple loss functions so far, like mean squared error or cross entropy. In this case, we're going to use a very different loss function, which is style plus content loss, using the same approach we used just before. And because that loss is computed by a neural net, we know it's differentiable, and you can optimize any loss function as long as it's differentiable. So if we now take the gradients of this output, not with respect to the input image, but with respect to the CNN weights, then we can use those gradients to update the weights of the CNN, so that on the next iteration the CNN will be slightly better at turning that image into a picture that has a good style match with Van Gogh's irises. Does that make sense? At the end of this, we've run through lots of images, and we're just training a regular CNN.
And the only thing we've done differently is to replace the loss function with the style loss plus content loss we just used. So at the end of it, we're going to have a CNN that has learnt to take any photo and spit out that photo in the style of Van Gogh's irises. And this is a win, because it means that now, in your web app, which is your Van Gogh irises generator, you don't have to run an optimization loop on each new photo. You just do a single forward pass through a CNN, which is basically instant.

Yes, green box over there. "So this is going to limit the styles you can use, right? Significantly. Say you have Photoshop and you want to apply multiple styles; each neural network is only going to learn one type of style. Is there a way of combining multiple styles, or would it just be a combination of all the styles?" You can combine multiple styles by having multiple bits of style loss for multiple style images, but you're still going to have the problem that that network has only learned to create one kind of image. Now, it may be possible to train it so that it takes both a style image and a content image, but I don't think I've seen that done yet, as far as I know. Okay, thanks, that was a great question.

Having said that, there is something simpler and, in my opinion, more useful we can do, which is, rather than doing style loss plus content loss, let's think of another interesting problem to solve, which is called super resolution. Super resolution is something which, honestly, when Rachel and I started playing around with it a while ago, nobody was that interested in. But in the last year or so, it's become really hot. Yes, Rachel? "It was less than a year ago that it..." Yeah, so we were playing around with it quite a lot; we thought it was really interesting, but suddenly it's got hot. The basic idea of super resolution is that you start off with a low res photo. The reason I started getting interested in this is that I wanted to help my mum take her family photos, which were often pretty low quality, and blow them up into something big and high quality that she could print out. So what you do is take something which starts as a small, low res photo and turn it into a big, high res photo. LR is low res, HR is high res. Now, perhaps you can see that we can use a very similar technique for this. What we could do is, between the low res photo and the high res photo, introduce a CNN. That CNN could look a lot like the CNN from our last idea, but it takes in a low res image as input and then sticks the result into a loss function, and the loss function is only going to calculate content loss. The content loss it calculates is between the activations of what the CNN produced from the low res input and the activations from the high res image. In other words, has this CNN successfully created a bigger photo that has the same activations as the high res photo does? And if we pick the right layer for the high res photo, that ought to mean we've constructed a new image. What's that, Rachel? There's a question: "Can we just stick a CNN between any two things and it will learn the relationship?" Yes, absolutely. And this is one of the things I wanted to talk about today; in fact, I think at the start of the next paper we're going to look at, they even talk about this. So this is the paper we're going to look at today.
Perceptual Losses for Real-Time Style Transfer and Super-Resolution. This is from 2016, so it took about a year, or maybe half a year, to go from the thing we just saw to this next stage. What they point out in the abstract is that people had done super resolution with CNNs before, but previously the loss function they used was simply the mean squared error between the pixels output by the upscaling network and the pixels of the actual high res image. The problem is that that tends to create blurry images, and it tends to create blurry images because the CNN has no reason not to create blurry images. Blurry images actually tend to look pretty good to that loss function: as long as you get the general idea that, oh, this is probably somebody's face, I'll put a face color here, then it scores fine. Whereas if you take the second or third conv block of VGG, then it needs to know that this is an eyeball, or it won't look right; it needs to know that this is a nose, or it won't look right. So if you do it not with pixel loss but with the content loss we just learned about, you're probably going to get better results. Like many papers in deep learning, this paper introduces its own language, and in the language of this paper, perceptual loss is what they call the mean squared error between the activations of a network for two images. So the thing we've been calling content loss, they call perceptual loss.

One of the nice things they do at the start, and I really like it when papers do this, is to say: okay, why is this paper important? Well, this paper is important because many problems can be framed as image transformation tasks, where a system receives some input image and chucks out some other output image. For example, denoising: learn to take an input image that's full of noise and spit out a beautifully clean image. Super resolution: take an input image which is low res and spit out a high res one. Colourisation: take an input image which is black and white and spit out something which is colour. Now, one of the interesting things here is that for all of these examples, you can generate as much input data as you like, by taking lots of images, either from your camera, or downloaded off the internet, or from ImageNet, and making them lower res, or making them black and white. So you can generate as much labelled data as you like. That's one of the really cool things about this whole topic of generative models.
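Here's a tiny sketch of that "free labels" point: you can manufacture low res / high res training pairs just by downscaling images you already have. PIL is only one way to do the resizing, and the sizes just happen to match the dataset used later.

```python
from PIL import Image

def make_pair(path, hr_size=288, lr_size=72):
    # The resized original is the target; the downscaled copy is the input.
    hr = Image.open(path).convert('RGB').resize((hr_size, hr_size), Image.BICUBIC)
    lr = hr.resize((lr_size, lr_size), Image.BICUBIC)
    return lr, hr
```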
Yes, a question over on the left. "With that example, going to lower res imagery, it's done algorithmically. So aren't we only going to learn how to undo something that's algorithmically done, versus actual low res imagery that wasn't?" Yeah, so one thing I'll mention is that the way you would create your labelled data is not to take low res photos with the camera; you would grab the images you've already taken and make them low res just by doing filtering in OpenCV or whatever. And yes, that is algorithmic, and it may not be perfect, but there are lots of ways of generating that low res image. So part of it is about how you do that creation of the low res image, and how well you match the real low res data you're actually going to get. But in the end, for things like low resolution images or black and white images, it's so hard to start with something tiny, I've seen versions that start with just an 8x8 picture and turn it into a photo, that often the details of how the low res image was created don't matter too much.

There are some other examples they mention. One is turning an image into an image which includes segmentation. We'll learn more about this in coming lessons, but segmentation refers to taking a photo of something and creating a new image that basically has a different colour for each object: horses are green, cars are blue, buildings are red, that kind of thing. As you know from things like the fisheries competition, segmentation can be really important as a part of solving other, bigger problems. Another example they mention here is depth estimation. There are lots of important reasons you might want depth estimation. For example, maybe you want to create some fancy video effects where you start with a flat photo and you want some cool new Apple TV thing that moves around the photo with a parallax effect, as if it were 3D. If you were able to use a CNN to figure out how far away every object was, you could turn a 2D photo into a 3D image automatically. So taking an image in and spitting an image out is, in computer vision at least, the idea of generative networks, or generative models, and this is why I wanted to talk a lot about generative models during this class. It's not just about artistic style; artistic style was just my sneaky way of introducing you to the world of generative models.

Okay, so let's look at how to create this super resolution idea. Your homework, or part of your homework this week, will be to create the new approach to style transfer. So I'm going to build the super resolution version, which is the slightly simpler version, and then you're going to try to build on top of that to create the style transfer version. Make sure you let me know if you're not sure at any point. So I've already created a folder with a sample of 20,000 images, and I've created two sizes, one 288x288 and one 72x72, and they're available as bcolz arrays. I actually posted the link to these last week, and it's on platform.fast.ai. So we'll open up those bcolz arrays, and one trick you hopefully learned in part one is that you can turn a bcolz array into a NumPy array by slicing it with everything: any time you slice a bcolz array you get back a NumPy array, so if your slice is everything, that turns the whole thing into a NumPy array. This is just a convenient way of sharing NumPy arrays. So we've now got an array of low resolution images and an array of high resolution images.
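A minimal sketch of that loading step; treat the file names as placeholders for whatever you saved the downloaded arrays as.

```python
import bcolz

# Slicing a bcolz carray with [:] materializes it as a plain NumPy array.
arr_lr = bcolz.open('trn_resized_72.bc')[:]     # e.g. (20000, 72, 72, 3)
arr_hr = bcolz.open('trn_resized_288.bc')[:]    # e.g. (20000, 288, 288, 3)
```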
So let me start by showing you the final network. We start off by taking in a batch of low res images, and the very first thing we do is stick them through a convolutional block with a stride of one. This is not going to change the size at all, but this convolutional block has a filter size of nine, and it generates 64 filters. That's a very large filter size; nowadays filter sizes tend to be three. But in a lot of modern networks, the very first layer very often has a large filter size, just that one very first layer. And the reason is that it basically allows us to immediately increase the receptive field of all of the layers from then on. With nine by nine we don't lose any information, because we've gone from three channels to 64 filters, so each of these nine by nine convolutions can actually carry quite a lot of information. So you'll be seeing this quite a lot in modern CNN architectures, a single large-filter conv layer at the start; it won't be unusual to you in future.

Now, the next thing... oh, do you want to give the green box to the person behind you? Just a moment, sorry. Yeah. "Is the stride of one also pretty popular these days?" Well, the stride of one is important for this first layer, because you don't want to throw away any information yet. In the very first layer we want to keep the full image size, and with stride one it doesn't downsample at all. "But there's also a lot of duplication, right?" Yeah, the windows overlap a lot, absolutely, but that's okay. A good implementation of a convolution is hopefully going to reuse some of that work, or at least keep it in cache, so hopefully it won't slow things down too much.

One of the discussions I was having during the break was about how practical the things we're learning at the moment are, compared to part one, where everything was designed entirely around: here are the most practical things, for which we have best practices. And the answer is that for a lot of the stuff we're going to be learning, no one quite knows how practical it is, because a lot of it just hasn't been around that long, isn't really that well understood, and maybe there aren't really great libraries for it yet. So one of the things I'm actually hoping from this part two, by learning the edge-of-research stuff, or beyond, amongst a diverse group, is that some of you will look at it, think about whatever you do from nine to five, or eight to six, or whatever, and think: oh, I wonder if I could use that for this. If that ever pops into your head, please tell us; please talk about it on the forum, because that's what we're most interested in. It's like: oh, you could use super resolution for such-and-such, or depth finding for this, or generative models in general for this thing I do in pathology, or satellite imagery, or whatever.
So yeah, it's going to require some imagination on your part sometimes, and that's why I do want to spend some time looking at stuff like this, asking: okay, what are the kinds of things this can be used for? I'm sure you know that in your own field, one of the differences between an expert and a beginner is the way an expert can look at something from first principles and say: okay, I could use that for this totally different thing, which has got nothing to do with the example that was originally given to me, because I know the basic steps are the same. That's what I'm hoping you'll be able to do: not just say, okay, now I know how to do artistic style, but ask whether there are things in your field which have some similarities to artistic style.

So, we were going to talk about the super resolution network, and we talked about the idea of the initial conv block. After the initial conv block we have the computation. When I say "the computation", in any kind of generative network there's the key work it has to do, which in this case is: starting with a low res image, figure out what that black dot might be. Is it an eyeball, or is it a wheel? Basically, if you want to do really good upscaling, you actually have to figure out what the objects are, so that you know what to draw. So that's the key computation this CNN is going to have to learn to do. In generative models, we generally like to do that computation at a low resolution. There are a couple of reasons why. The first is that at a low resolution there's less work to do, so the computation is faster. But more importantly, at higher resolutions the receptive field generally covers a smaller part of the image, which means we have less ability to capture large amounts of the image at once. And if you want to do really great computations, where you recognize that this blob here is a face and therefore the dot inside it is an eyeball, then you're going to need a receptive field big enough to cover that whole area.

Now, a couple of you asked for information about receptive fields on the forum thread. There's quite a lot of information about this online, so Google is your friend here, but the basic idea is: if you have a single 3x3 convolutional filter, the receptive field is 3x3; it's how much space that convolutional filter can see. So here's a 3x3 filter. On the other hand, what if you had a 3x3 filter whose input was itself the output of a 3x3 filter? The centre one took all of this, but this one would have taken, depending on the stride, probably these ones here, and this one over here would have taken these ones here. So in other words, in the second layer, assuming a stride of 1, the receptive field is now 5x5, not 3x3. So the receptive field depends on two things: one is how many layers deep you are, and the second is how much the previous layers either had a non-unit stride or had max pooling, some way in which they were downsampled. Those two things increase the receptive field. And the reason it's great to be doing the key computations with a large receptive field is that it allows you to look at the big picture and look at the context: it's not just edges anymore, but eyeballs and noses. So in this case, we have four blocks of computation, where each block is a ResNet block.
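On the receptive field point, here's a toy calculation of how it grows: each extra layer adds (kernel - 1) times the product of all the earlier strides. This is just an illustration, not something from the notebook.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump      # each layer widens the field by (k-1) * jump
        jump *= s                 # stride multiplies the spacing of later taps
    return rf

print(receptive_field([(3, 1)]))            # 3: one 3x3 conv sees 3x3
print(receptive_field([(3, 1), (3, 1)]))    # 5: two stride-1 3x3 convs see 5x5
print(receptive_field([(3, 2), (3, 1)]))    # 7: a stride-2 layer grows it faster
```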
For those of you who don't recall how ResNet works, it would be a good idea to go back to part 1 and review it, but to remind ourselves, let's look at the code. Here's a ResNet block. All a ResNet block does is take some input, put it through two convolutional blocks, and then add the result of those convolutions back to the original input. You might remember from part 1, we said there's some input, it goes through two convolutional blocks, and then it gets added back to the original. And if you remember, we basically said that in that case we've got y equals x plus some function of x, which means that the function equals y minus x, and that thing is the residual. So a whole stack of residual blocks, ResNet blocks, on top of each other can learn to gradually improve whatever it's trying to do. In this case, what it's trying to do is gather the information it's going to need to upscale this image in a smart way. We're going to be making a lot more use of this idea of taking blocks that we know work well for something and just reusing them.

So then, what's a conv block? All the conv block is, in this case, is a convolution followed by a batch norm, optionally followed by an activation. One of the things we now know about ResNet blocks, which a more recent paper discovered, is that we generally don't want an activation at the end, so you can see that for my second conv block there's no activation. I'm sure you've noticed throughout this course that I refactor my network architectures a lot. My network architectures don't generally list every single layer; they're generally functions which call a bunch of functions which contain a bunch of layers. A lot of people don't do this; a lot of the architectures you find online are hundreds of lines of layer definitions. I think that's crazy: it's so easy to make mistakes when you do it that way, and so hard to really see what's going on. In general, I would strongly recommend you try to refactor your architectures so that by the time you write the final thing, it's half a page. You'll see plenty of examples of that, so hopefully that'll be helpful.
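Here's a minimal sketch of those two building blocks as just described, written with Keras 2 layer names (the original notebook uses the Keras 1 equivalents); the filter counts and padding are illustrative.

```python
from keras.layers import Conv2D, BatchNormalization, Activation, add

def conv_block(x, filters, size=3, stride=1, act=True):
    # Convolution, then batch norm, then (optionally) an activation.
    x = Conv2D(filters, size, strides=stride, padding='same')(x)
    x = BatchNormalization()(x)
    return Activation('relu')(x) if act else x

def res_block(ip, filters=64):
    # Two conv blocks, no activation on the second, added back to the input.
    x = conv_block(ip, filters)
    x = conv_block(x, filters, act=False)
    return add([x, ip])
```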
All right, so we've increased the receptive field and done a bunch of computation, but we still haven't actually changed the size of the image, which is not very helpful. So the next thing we do is change the size of the image, and the first way we're going to learn to do that is with something that goes by many names. One is deconvolution; it's also known as a transposed convolution, and also as a fractionally strided convolution. In Keras they call them deconvolutions. The basic idea is something I've actually got a spreadsheet to show you. Okay, here's the spreadsheet. The basic idea is that you've got some kind of image, here's a 4x4 image, some 4x4 data, and you put it through a 3x3 convolutional filter. If you're doing valid convolutions, that's going to leave you with a 2x2 output, because here's one 3x3, here's another 3x3, and there are four of them in total. Each one is taking the whole filter against the appropriate part of the data; it's just a standard 2D convolution. So we've done that. Now let's say we want to undo it; we want something which can take this result and recreate the input. How would you do that? One way would be to take this result, so let's copy it over here, and put back that implicit padding, so let's surround it with all these zeros. Now, let's have some convolutional filter; we've just started it at zero. We're going to put it over this entire matrix, a bunch of zeros with our result matrix in the middle, and calculate our result in exactly the same way, just a normal convolutional filter.

If we now use gradient descent, we can look and see: okay, what is the error, how much does this pixel differ from this pixel, and this pixel from this one, and then add them all up to get our mean squared error. So we can now use gradient descent, which, as you hopefully remember from part one, in Excel is called Solver, and we can say: set this cell to a minimum by changing these cells. This is basically the simplest possible optimization. Solve that, and here's what it's come up with: a convolutional filter. You'll see that the result is not exactly the same as the original data, and of course, how could it be? We don't have enough information; we only have four things from which to try to regenerate sixteen things. But it's not terrible. And in general, this is the challenge with upscaling: when you've got something that's been blurred and downsampled, you've thrown away information, so the only way to get information back is to guess what was there. The important thing is that, by using a convolution like this, we can learn those filters; we can learn how to upsample in a way that gives us the loss we want. So that's what a deconvolution is: it's just a convolution on a padded input.

Now, in this case I've assumed that my convolutions had a unit stride, just one pixel between each convolution. If your convolutions have a stride of 2, then it looks like this picture, and you can see that as well as putting two pixels of padding around the outside, we've also put a zero pixel in between; these four cells are now our data cells, and you can see it calculating the convolution through here. I strongly suggest looking at this link, which is where this picture comes from, and in turn the link comes from a fantastic paper, the convolution arithmetic guide, which is a really great paper. So if you want to know more about both convolutions and deconvolutions, you can look at this page, and it's got lots of beautiful animations, including animations of what they call transposed convolutions. There we go, this is the one I just showed you, the one we just saw in Excel. So that's a really great site.
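Here's the same spreadsheet demo in NumPy, just as an illustration: a "deconvolution" is an ordinary convolution applied to a zero-padded version of the small output. The filter here is random rather than learned with Solver, so only the shapes are the point.

```python
import numpy as np
from scipy.signal import correlate2d

data = np.arange(16, dtype=float).reshape(4, 4)    # the original 4x4 "image"
filt = np.random.randn(3, 3)

small = correlate2d(data, filt, mode='valid')       # normal valid conv -> 2x2
padded = np.pad(small, 2, mode='constant')          # surround the 2x2 with zeros
recon = correlate2d(padded, filt, mode='valid')     # conv again -> back to 4x4

print(small.shape, recon.shape)                     # (2, 2) (4, 4)
```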
Okay, so that's what we're going to do first: deconvolutions. In Keras, a deconvolution is exactly the same as a convolution except with "De" on the front. We've got all the same stuff: how many filters do you want, what's the size of your filter, what's your stride, or subsample as they call it, border mode, and so forth. We have a question: "If TensorFlow is the backend, shouldn't the batch normalization axis be minus one?" And there was a link to a GitHub conversation where François said that axis equals one is for Theano, and that minus one is in fact the default. Yes, thank you, well spotted; thank you, David Gutman, who is also responsible for some of the beautiful pictures we saw earlier, so double thank you to David Gutman. Let's remove the axis argument; that will make things look a bit better, and go faster as well. Just in case you weren't clear on that: you might remember from part one that the reason we had axis equals one is that in Theano that was the channel axis, and we want to batch normalize across channels rather than throw away the x, y information. In TensorFlow the channel is the last axis, and since minus one is the default, we actually don't need it.

Okay, so those are our deconvolution blocks, and we're using a stride of 2x2, so each time we go through one of these deconvolutions it's going to double the size of the image. For some reason I don't fully understand and haven't looked into, in Keras you actually have to tell it the shape of the output, so you can see it's gone from 72 by 72, to 144 by 144, to 288 by 288. Because these are convolutional filters, it's learning to upscale, but it's not upscaling with just three channels, it's upscaling with 64 filters, so that's how it's able to do more sophisticated stuff. And then finally we reverse things: another 9 by 9 convolution in order to get back our three channels. The idea is that we previously had something with 64 channels, and we now want to turn it into something with just three channels, the three colors, and to do that we want to use quite a bit of context, so we have a single 9 by 9 filter at the end to get our three channels. So at the end we have a 288 by 288 by 3 tensor, in other words, an image.

If we go ahead now and train this, it's going to do basically what we want, but the thing we still have to do is create our loss function, and creating our loss function is a little bit messy, but I'll take you through it slowly and hopefully it'll all make sense. Let's remember some of the names here: the input is the original low resolution input, and the output of this network is called the output, and let's call this whole network the upsampling network; it's the thing that's actually responsible for doing the upsampling. We're going to take the upsampling network and attach it to VGG, and the VGG is going to be used only as a loss function, to get the content loss. Before we can take this output and stick it into VGG, we need to put it through our standard mean subtraction preprocessing; this is just the same thing we did over and over again in part one. So let's define this output as being this lambda function applied to the output of our upsampling network; that's all this is, our preprocessed upsampling network output. We can now create the VGG network, and let's go through every layer and make it not trainable. You can't ever let your loss function be trainable; the loss function is the fixed-in-stone thing that tells you how well you're doing, so clearly you have to make sure VGG is not trainable. Now, which bit of the VGG network do we want? We can try a few things; I'm using block 2, conv 2, so relatively early. The reason for that is, if you remember when we did the content reconstruction last week, we found that you can basically totally reconstruct the original image from early layer activations, whereas by the time we got to block 4 we got pretty horrendous results. So we're going to use a somewhat early block as our content loss, or, as the paper calls it, the perceptual loss, and you can play around with this and see how it goes.
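Here's a rough sketch of that wiring, reusing conv_block and res_block from the earlier sketch and written with Keras 2 names (the notebook itself uses the Keras 1 API, where Deconvolution2D also needs an explicit output shape). The tanh output and the (x + 1) * 127.5 rescaling are explained just below; the mean values and layer choice are the usual VGG ones, but treat the details as illustrative rather than the exact notebook code.

```python
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Conv2D, Conv2DTranspose, Lambda
from keras.models import Model
import numpy as np

# The upsampling network: 9x9 conv, four res blocks, two x2 upsamplings, 9x9 conv out.
inp = Input((72, 72, 3))
x = conv_block(inp, 64, size=9)
for _ in range(4):
    x = res_block(x)
x = Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)  # 72 -> 144
x = Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)  # 144 -> 288
x = Conv2D(3, 9, padding='same', activation='tanh')(x)
outp = Lambda(lambda im: (im + 1) * 127.5)(x)          # tanh range -> 0..255

# VGG as a frozen loss network, cut off at an early block.
rn_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)
preproc = Lambda(lambda im: im - rn_mean)              # mean subtraction preprocessing
outp_pre = preproc(outp)

vgg = VGG16(include_top=False, input_shape=(288, 288, 3))
for l in vgg.layers:
    l.trainable = False                                # the loss function must stay fixed
vgg_content = Model(vgg.input, vgg.get_layer('block2_conv2').output)
```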
All right, so now we're going to create two versions of this VGG output, and this relies on something which I think is very poorly understood, or at least underappreciated, about the Keras functional API, which is that any kind of layer, and a model counts as a layer as far as Keras is concerned, can be treated as if it were a function. So we can take this model, pretend it's a function, and pass it any tensor we like, and what that does is create a new model where those two pieces are joined together. So VGG2 is now equal to this model on top applied to this model on the bottom, and remember, this model was the result of our upsampling network followed by the preprocessing. "And the upsampling network ends with the lambda function that rescales the output image?" Yeah, that's a good point. We use a tanh activation, which goes from minus 1 to 1, so if you then take that, plus 1, times 127.5, that gives you something between 0 and 255, which is the range we want. Interestingly, this was suggested in the original paper's supplementary materials, but more recently, on Reddit I think it was, the authors said they tried it without the tanh activation, and therefore without the final deprocessing, and it worked just as well. You can try that: if you wanted to, you would just remove the activation and remove this last thing entirely. But obviously, if you do have a tanh, then you need the rescaling. And this is actually something I've been playing with in a lot of different models: any time I have some particular range I want to enforce, one way to do it is with a tanh or a sigmoid followed by something that turns that into the range you want, and not just for images.

Okay, so we've got two versions of our VGG layer output, one which is based on the output of the upscaling network, and the other which is based on just a plain input, and that plain input uses the high resolution shape. That makes sense, because this VGG network is something we're going to be using at the high resolution scale: we're going to be taking the high resolution target image and the high resolution upsampling result and comparing them. So now that we've done all that, we're nearly there: we've got the high res perceptual activations, and we've got the upsampled low res perceptual activations, and we now just need to take the mean sum of squares between them, and here it is. In Keras, any time you put something into a network, it has to be a layer, so if you want to take a plain old function and turn it into a layer, you just chuck it inside a capital-L Lambda. So our final model is going to take our low res input and our high res input as its two inputs, and return this loss function's output. One last trick: when you fit things in Keras, it assumes that you're trying to take some output and make it close to some target. In this case, our output is the actual loss function we want; there isn't some separate target, we just want to make it as low as possible. Since it's a sum of squared errors, or a mean squared error, it can't go below zero, so what we can do is basically trick Keras and say that our target is zero. And you can't just use the scalar zero; remember, every time you have a target set of labels in Keras, you need one for every row, one for every input, so we create an array of zeros. That's just so we can fit it into what Keras expects. I find that increasingly, as you start to move away from the well-trodden path of deep learning, particularly if you want to use Keras, you have to do weird little hacks like this. So there's a weird little hack; there are probably more elegant ways of doing this, but this works. So we've got our loss function, and we're trying to get every row as close to zero as possible.
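Continuing the sketch above with the same illustrative names: the loss itself becomes the model's output via a Lambda layer, and we fit against an array of zeros. In the notebook the loss output is kept per filter (which is why the target there is sized by 128, as comes up in a moment); here it's collapsed to one number per image just to keep the sketch short.

```python
from keras.layers import Input, Lambda
from keras.models import Model
import keras.backend as K
import numpy as np

hr_inp = Input((288, 288, 3))
feat_out = vgg_content(outp_pre)                 # activations of the upsampled image
feat_hr = vgg_content(preproc(hr_inp))           # activations of the real high res image

# Mean squared error between the two sets of activations, wrapped as a layer.
loss_out = Lambda(lambda f: K.expand_dims(
    K.mean(K.square(f[0] - f[1]), axis=[1, 2, 3])))([feat_out, feat_hr])

m = Model([inp, hr_inp], loss_out)
m.compile('adam', 'mse')                         # the real loss is already the output
targ = np.zeros((arr_lr.shape[0], 1))            # fake target: push the loss towards zero
m.fit([arr_lr, arr_hr], targ, batch_size=8, epochs=2)
```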
We have a question: "If we're only using up to block 2, conv 2, could we pop off all the layers afterwards to save some computation?" Sure, wouldn't be a bad idea at all. Okay, so we compile it and we fit it. One thing you'll notice I've started doing is using this callback here, the TQDM notebook callback. tqdm is a really terrific library. It does something very simple, which is to add a progress meter to your loops. You can use it in a console, as you can see, and basically anywhere you've got a loop, you can wrap tqdm around it, and that loop does exactly what it did before, but it shows its progress, and it even guesses how much time is left and so forth. You can also use it inside a notebook, and it creates a neat little graph that gradually goes up and shows you how long is left. So that's just a nice little trick. We use some learning rate annealing, and after training it for a few epochs, we can try out the model. Now, the model we're interested in is just the upsampling model: we're going to be feeding the upsampling model low res inputs and getting out the high res outputs; we don't actually care about the value of the loss. So I'll now define a model which takes as input the low res input and spits out our high res output, and with that model we can call predict. So here is our original low resolution mashed potato, and here is the high resolution mashed potato, and it's amazing what it's done. You can see in the original that the shadow of the leaf was very unclear, and the bits in the mashed potato were just big blobs; in this version we have clear shadows, hard edges, and so forth.

A question: "Can you explain the size of the target? It's the first dimension of the high res array times 128. Why?" Okay, so obviously this part is basically the number of images we have, and then it's 128 because that layer has 128 filters, so this ends up giving you the mean squared error as 128 per-filter losses. And then there was another question: "Would popping the unused layers really save anything? Aren't you only getting the layers you want when you do vgg.get_layer('block2_conv2')?" Yeah, I'm not sure, I can't quite think quickly enough; you could try it, it might not help. "And intuitively, what features is this model learning?" Well, what it's learning is this: it's looking at 20,000 very, very low resolution images like this one, and it's learning that when there's a kind of soft gray bit next to a hard bit, in certain situations that's probably a shadow, and when there's a shadow, this is what a shadow looks like. For example, it's learning that when there's a curve, it isn't actually meant to look like a jagged edge, it's meant to look like something smooth. It's really learning what the world looks like, and then, when you take that world and blur it and make it small, what it looks like then. It's just like when you look at a picture like this and defocus your eyes, you can often see what it originally looked like, because your brain is basically doing the same thing. It's like when you read really blurry text, you can still read it, because your brain knows: oh, that must have been an E, that must have been an F.
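Two small side notes in code, carrying on the illustrative names from the sketches above: tqdm wrapping an ordinary loop, and the prediction-only model that we actually keep at the end (the notebook callback itself comes from the separate keras-tqdm package).

```python
from tqdm import tqdm
from keras.models import Model

# tqdm around any loop adds a progress bar with a time estimate.
total = 0
for i in tqdm(range(10 ** 7)):
    total += i

# The model we care about afterwards: low res in, high res out, no loss plumbing.
top_model = Model(inp, outp)
big_img = top_model.predict(arr_lr[:1])     # one low res image -> one 288x288x3 image
```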
So are you suggesting there's a similar universality the other way around? Like, with VGG we say the first layer is learning a line, then a square, then a nose or an eye; are you saying the same thing is true in this case? Yeah, yeah, absolutely, it has to be. There's an infinite number of ways you could upsample, because there's lost information, so in order to do it in a way that decreases this loss function it actually has to figure out what's probably there based on context. But don't you agree, just thinking about it intuitively, with the example you suggested of the album of pictures for your mom, that it would be a bit easier if we were just feeding it pictures of humans, because of the interaction of the circle of the eye and the nose? Yeah, it's going to be a lot better. So the three versions of super resolution that take 8 by 8 inputs pretty much all use the same dataset, which is a dataset of pictures of celebrity faces, and all celebrity faces are pretty similar. They show these fantastic, and they are fantastic and amazing, results, where they've taken an 8 by 8 input and produced something that looks pretty close to the true face, and that's because they've taken advantage of exactly this. In our case we've got 20,000 images from 1,000 categories; it's not going to do nearly as well. If we wanted to do as well as those celebrity-face versions, we would need hundreds of millions of images across the 1,000 categories.

It's just hard for me to imagine mashed potato and a face being treated the same way; that's my biggest issue here. The key thing to realize is that there's nothing qualitatively different about what mashed potato looks like: a network can learn to recognize the unique features of mashed potato, and a big enough network trained on enough data will have seen not just examples of mashed potato but writing and faces and whatever else. For your own projects you're most likely to be doing stuff which is more domain specific, and so you should use more domain-specific data to take advantage of exactly this kind of issue. That's a good question, thank you.

So one thing I'll mention here is that I haven't used a test set. Another piece of the homework is to add in a test set and tell us: is this mashed potato overfit? Is it actually just matching the particular training-set version of this mashed potato, or not? And if it is overfitting, can you create something that doesn't have that problem? So there's another piece of homework.

It's very simple now to take this and turn it into our fast style transfer. Fast style transfer is going to do exactly the same thing, but rather than turning something low-res into something high-res, it's going to take a photo and turn it into Van Gogh's irises. We're going to do that in just the same way: rather than going from low-res through a CNN and computing a content loss against the high-res, we're going to take a photo, go through a CNN, and compute both a style loss and a content loss against a single fixed style image. I've given you links here. I have not implemented this for you; this is for you to implement. But I have given you links to the original paper and, very importantly, also to the supplementary material, which is a little hard to find because there are two different versions and only one of them is correct, and of course I won't tell you which one. The supplementary material goes through all of the exact details of what their loss function was, what their preprocessing was, what their exact architecture was, and so on. So while I wait for that to load, you had a question? Like we did the doodle recreation using the photograph's weights, could we take a regular image and see how you would look if you were a model?
I don't know; you'd have to come up with a loss function for how much somebody looks like a model. You could, but you'd have to come up with a loss function, and it would have to be something where you can generate labelled data.

One of the things they mention in the paper is that they found it very important to add quite a lot of padding, and specifically they didn't add zero padding. Normally we just add a black border, but they added reflection padding. Reflection padding literally means: take the edge and reflect it into your padding. I've written that layer for you, because Keras doesn't have one, and you may find it interesting to look at, because it's one of the simplest examples of a custom layer. We're going to be using custom layers more and more, so I don't want you to be afraid of them. A custom layer in Keras is a Python class, so if you haven't done OO programming in Python, as I mentioned before the class started, now's a good time to go and look at some tutorials, because we're going to be doing quite a lot of it; PyTorch in particular absolutely relies on it. So we create a class, and it has to inherit from Layer. This is how you create a constructor in Python. Python's OO syntax is really gross: you have to use a weirdly named special method, which happens to be the constructor, and for every single method inside a class you have to manually type out self as the first parameter; if you forget, you'll get stupid errors. Sorry, it's not my fault. The constructor for a layer is basically where you save away any of the information you were given; in this case you said "I want this much padding", so you just save that somewhere.

Then you need to do two things in every Keras custom layer. One is you have to define get_output_shape_for: it gets passed the shape of an input, and you have to return the shape of the output that input would create. So in this case, if s is the shape of the input, then the output is going to have the same batch size and the same number of channels, and we add twice the amount of padding to both the rows and the columns. This is what tells Keras how big everything is; remember, one of the cool things about Keras is that you just chuck the layers on top of each other and it magically knows how big all the intermediate things are, and it magically knows because every layer has this thing defined. That's how it works. The second thing you have to define is something called call, and call is the thing that gets your layer's input data; you have to return whatever your layer does to it. In our case we want it to add reflection padding, and it so happens that TensorFlow has something built in for that, called tf.pad. Generally it's nice to create Keras layers that work with both the Theano and TensorFlow backends by using that capital-K dot notation, but in this case Theano didn't have anything obvious that did this easily, so for this class I just decided to make it TensorFlow-only. So here is a complete layer. I can now use that layer in a network definition like this, and I can call .predict, which takes an input and transforms it; you can see that the bird's left and right sides here have been reflected.
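For reference, a custom reflection-padding layer along the lines described above looks roughly like this. It's a minimal sketch assuming the Keras 1 API of the time (in Keras 2 the shape method is called compute_output_shape), a TensorFlow backend, and channels-last image ordering:

```python
import tensorflow as tf
from keras.engine.topology import Layer

class ReflectionPadding2D(Layer):
    def __init__(self, padding=(1, 1), **kwargs):
        # The constructor just saves the amount of padding requested
        self.padding = tuple(padding)
        super(ReflectionPadding2D, self).__init__(**kwargs)

    def get_output_shape_for(self, s):
        # Same batch size and channels; rows and cols grow by 2 * padding
        # (assumes channels-last, i.e. TensorFlow image ordering)
        return (s[0], s[1] + 2 * self.padding[0],
                s[2] + 2 * self.padding[1], s[3])

    def call(self, x, mask=None):
        # tf.pad with mode 'REFLECT' mirrors the edges into the padded area
        h_pad, w_pad = self.padding
        return tf.pad(x, [[0, 0], [h_pad, h_pad], [w_pad, w_pad], [0, 0]],
                      mode='REFLECT')
```

Used in a model definition it behaves like any other layer, for example `x = ReflectionPadding2D((40, 40))(inp)`, which is how a 40-pixel border like the one discussed next would be added.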
That layer is there for you to use, because in the supplementary material for the paper they say they add spatial reflection padding at the beginning of the network, and they add a lot of it: 40 by 40. The reason they add a lot is that they mention in the supplementary material that they don't want to use 'same' convolutions; they want to use 'valid' convolutions in their computation, because if you add any black borders during those computation steps it creates weird artifacts at the edges of the images. So you'll see that through this computation, for their residual blocks, the size gets smaller by 4 each time, and that's because these are valid convolutions. That's why they have to add padding at the start, so that these steps don't cause the image to become too small.

This section here should look very familiar, because it's the same as our upsampling network: a bunch of residual blocks, two deconvolutions, and one 9 by 9 convolution. So this part is identical; you can copy it. This is the new bit. We've already talked about why we have the 9 by 9 conv, but why do we have these downsampling convolutions at the start? We start with an image up here of 336 by 336, then we halve its size, and then we halve its size again. Why do we do that? The reason, as I mentioned earlier, is that we want to do our computation at a lower resolution, because that gives us a larger receptive field and it means less computation. This pattern, where it's reflective, so the last thing is the same as the first thing and the second-last thing is the same as the second thing, a sort of reflection symmetry, is really, really common in generative models. First you take your object and downsample it, increasing the number of channels at the same time, so you're increasing the receptive field and creating more and more complex representations; then you do a bunch of computation on those representations; and then at the end you upsample again. You're going to see this pattern all the time, and that's why I wanted you guys to implement it yourselves. Okay, so that's the last major piece of your homework.

There were questions about what stride = 1/2 means. That's exactly the same as a deconvolution with stride 2; remember I mentioned earlier that another name for deconvolution is fractionally strided convolution. You can recall that little picture we saw, the idea where you put little columns of zeros in between each row and column, so you can think of it as taking half a stride at a time. So this is exactly what we already have; I don't think you need to change it at all, except you'll need to change my 'same' convolutions to 'valid' convolutions. But it's well worth reading the whole supplementary material, because it really has the details. It's so great when a paper has supplementary material like this. You'll often find that the majority of papers don't actually tell you the details of how to do what they did, and many don't even have code; these guys have both code and supplementary material, which makes this an absolute A-plus paper. Plus, it works great.
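To make that downsample, residual blocks, upsample shape concrete, here's a rough generic sketch of the pattern. It is not the paper's exact architecture (no reflection padding, 'same' rather than 'valid' convolutions, made-up filter counts), and it uses Keras 2 syntax purely for illustration:

```python
from keras.layers import Input, Conv2D, Conv2DTranspose, Activation, add
from keras.models import Model

def conv(x, filters, size, stride=2):
    x = Conv2D(filters, (size, size), strides=(stride, stride), padding='same')(x)
    return Activation('relu')(x)

def res_block(x, filters=128):
    # Two 3x3 convs plus a skip connection, computed at the low resolution
    y = conv(x, filters, 3, stride=1)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    return add([x, y])

inp = Input((336, 336, 3))
x = conv(inp, 32, 9, stride=1)       # big receptive field up front
x = conv(x, 64, 3)                   # 336 -> 168: downsample, more channels
x = conv(x, 128, 3)                  # 168 -> 84
for _ in range(5):
    x = res_block(x)                 # bulk of the computation at low resolution
x = Conv2DTranspose(64, (3, 3), strides=(2, 2), padding='same',
                    activation='relu')(x)   # 84 -> 168 ("stride 1/2" deconvolution)
x = Conv2DTranspose(32, (3, 3), strides=(2, 2), padding='same',
                    activation='relu')(x)   # 168 -> 336
outp = Conv2D(3, (9, 9), padding='same', activation='tanh')(x)  # back to an RGB image
transformer = Model(inp, outp)
```

The Conv2DTranspose layers here are the "stride 1/2" deconvolutions from the supplementary material, and they are also the layers that produce the checkerboard artifacts discussed next.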
Okay, so that is super resolution, perceptual losses, and so on and so forth; I'm glad we got there. Let me make sure I don't have any more slides. Oh, there is one other thing I want to show you, which is that these deconvolutions can create some very ugly artifacts, and I can show you some very ugly artifacts because I have some right here. You see these? This is called a checkerboard pattern. The checkerboard pattern happens for a very specific reason, and I've provided a link to this article; it's an online paper. You might remember Chris Olah; he produced a lot of the best learning materials we looked at in part one. He now has this cool thing called distill.pub, and with some of his colleagues at Google he wrote this piece working out why it is that everybody gets these goddamn checkerboard patterns. What he shows is that it happens because you have stride-2 convolutions, which means that every pair of convolutions sees one pixel twice, so a checkerboard is just the natural thing that comes out. They talk about this in some detail, and all the things you can do about it, but in the end they point out two things. The first is that you can avoid it by making your stride divide evenly into your kernel size; so if I change the kernel size to 4, they're gone. So one thing you can try if you're getting checkerboard patterns, which you will, is to make your size-3 convolutions into size-4 convolutions. The second thing he suggests is not to use deconvolutions at all: instead of a deconvolution, do an upsampling first. Upsampling is basically the opposite of max pooling: you take every pixel and turn it into a 2x2 grid of that exact pixel. If you do an upsampling followed by a regular convolution, that also gets rid of the checkerboard pattern. And as it happens, Keras has something for that, called UpSampling2D. All it does is, roughly, the opposite of max pooling: it doubles the size of your image, at which point you can use a standard unit-stride convolution and avoid the artifacts. So extra credit, after you get your network working, is to change it to an upsampling plus unit-stride convolution network and see if the checkerboard artifacts go away.

So that is that. At the very end here I've got some suggestions for more things you can look at, although most of those were already in the PowerPoint, so I don't think there's anything else there.
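As a minimal sketch of that extra-credit change (filter counts and names are illustrative assumptions), each stride-2 deconvolution gets replaced by an upsample followed by a unit-stride convolution, roughly like this:

```python
from keras.layers import UpSampling2D, Conv2D

def up_block(x, filters):
    # Nearest-neighbour upsample: each pixel becomes a 2x2 block of itself...
    x = UpSampling2D()(x)
    # ...followed by an ordinary stride-1 convolution, instead of a stride-2
    # deconvolution; this combination avoids the checkerboard artifacts.
    return Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
```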
So let's move on. I want to talk about going big. Going big can mean two things. Of course it does mean we get to say "big data", which is important, you have to do that. I'm very proud that even during the whole big-data craze I never said "big data" without saying rude things about what a stupid idea it is; who cares how big it is? But in deep learning, sometimes we do need to work with either large objects (if you're doing diabetic retinopathy you have 4,000 by 4,000 pixel pictures of eyeballs) or lots of objects, like if you're working with ImageNet, and to handle data that doesn't fit in RAM we need some tricks. So I thought we would try an interesting project that involves looking at the whole ImageNet competition dataset. The ImageNet competition dataset is about 1.5 million images in a thousand categories. As I mentioned, I think in the last class, if you try to download it, it gives you a little form saying you have to use it for research purposes and that they're going to check; in practice, if you fill out the form, you'll get back an answer seconds later. So anybody who's got a terabyte of space, and because you're building your own boxes you now have a terabyte of space, can go ahead and download ImageNet and start working through this project.

This project is about implementing a paper called DeVISE, and DeVISE is a really, really interesting paper. I actually chatted to the author about it quite recently, an amazing researcher named Andrea Frome, who's now at Clarifai, which is a computer vision startup. What she did with DeVISE was create a really interesting multi-modal architecture. Multi-modal means we're combining different types of object, and in her case she was combining language with images; it's quite an early paper to look at this idea. And she did something really interesting. She said: normally when we build an ImageNet network, our final layer is a one-hot encoding of a category, and that means that a pug and a golden retriever are no more similar or different, in terms of that encoding, than a pug and a jumbo jet. That seems kind of weird, right? If you had an encoding where similar things were similar in the encoding, you could do some pretty cool stuff, and in particular, one of the key things she was trying to do was create something that went beyond the thousand ImageNet categories, so that you could work with types of images that were not in ImageNet at all. The way she did that was to say: all right, let's throw away the one-hot-encoded category and replace it with a word embedding of the thing. So pug is no longer 0 0 0 0 1 0 0; it's now the word2vec vector for "pug". And that's it, that's the entirety of the idea: train that and see what happens. I'll provide a link to the paper, and one of the things I love about it is that she shows quite an interesting range of the cool results and cool things you can do when you replace a one-hot-encoded output with a word-embedding output.

Just to clarify, so every one-hot-encoded pixel suddenly becomes a vector? No, pixels are not one-hot encoded; pixels are encoded by their channel values. Sorry, I mean the output: what are the targets? Okay, so let's say this is an image of a pug, a type of dog, and say pug is the 300th class in ImageNet. Normally it gets turned into a 1,000-long vector with 999 zeros and a 1 in position 300, and that's what we use as our target when we're doing image classification. We're going to throw that 1,000-long thing away and replace it with a 300-long thing, and the 300-long thing will be the word vector for "pug" that we downloaded from word2vec. So normally our input image comes in, goes through some kind of computation in our CNN, and has to predict something; normally the thing it has to predict is a whole bunch of zeros and a 1 here, and the way we do that is that the last layer is a softmax layer, which encourages one of the outputs to be much higher than the others. All we do is throw that away and replace the target with the word vector for that thing, pug or jumbo jet or whatever. And since the word vector is, say, 300 dimensions, and it's dense, not lots of zeros, we can't use a softmax layer at the end any more; we probably now just use a regular linear layer. So the hard part about doing this is really processing the images; there's nothing weird or interesting or tricky about the architecture. All we do is replace the last layer.
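A minimal sketch of that head swap, purely illustrative: the tiny stand-in trunk and the loss choice are assumptions (the DeVISE paper itself uses a hinge ranking loss rather than plain MSE), but it shows the shape of the change from a softmax one-hot head to a linear word-vector head:

```python
from keras.layers import Input, Conv2D, GlobalAveragePooling2D, Dense
from keras.models import Model

inp = Input((224, 224, 3))
x = Conv2D(64, (3, 3), activation='relu', padding='same')(inp)  # stand-in for a real CNN trunk
body = GlobalAveragePooling2D()(x)

# Ordinary ImageNet head: 1000-way softmax against a one-hot target
clf_model = Model(inp, Dense(1000, activation='softmax')(body))
clf_model.compile('adam', 'categorical_crossentropy')

# DeVISE-style head: predict the 300-d word vector of the label directly,
# so the final layer is a plain linear Dense layer (no softmax), and the loss
# compares dense vectors (MSE here only for illustration; the paper uses a
# hinge ranking loss).
vec_model = Model(inp, Dense(300)(body))
vec_model.compile('adam', 'mse')
```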
We're going to leverage bcolz quite a lot. So we start off by importing our initial stuff, and don't forget, with TensorFlow, to call this limit_mem thing I created, so that you don't use up all of your GPU memory. One thing which can be very helpful is to define two paths. Once you've got your own box, you've got a bunch of spinning hard disks that are big and slow and cheap, and maybe a couple of fast, expensive, small SSDs or NVMe drives. So I generally think it's a good idea to define one path, which here happens to be a mount point on my big, slow, cheap spinning disks, and another path, which happens to live on my fast SSDs. That way, throughout my code, any time I've got something I'm going to be accessing a lot, particularly in a random order, I make sure that thing, as long as it's not too big, sits on the fast path; and any time I'm accessing something sequentially, or something really big, I put it on the slow path. This is one of the good reasons, another good reason, to have your own box: you get this kind of flexibility.

Okay, so the first thing we need is some word vectors. Interestingly, the paper actually built their own Wikipedia word vectors, but I think the word2vec vectors you can download from Google are maybe a better choice here, so I've just gone ahead and shown how you can load them in. One of the very nice things about Google's word2vec vectors: do you remember, back in part one, when we used word vectors we tended to use GloVe? GloVe would not have a word vector for "golden retriever"; it would have one for "golden", but it doesn't have phrases, whereas Google's word vectors do have phrases like "golden retriever". For our purposes we really want Google's word2vec vectors, or anything like that which has multi-word concepts as things we can look up. You can download word2vec; I'll make the vectors available on our platform.ai site, because the only way to get them otherwise is from the author's Google Drive directory, and trying to get at a Google Drive directory from Linux is an absolute nightmare, so I'll save them for you. Once you've got them you can load them in. They're in a weird proprietary binary format; it's like, if you're going to share data, why put it in a weird proprietary binary format in a Google Drive folder that you can't access from Linux? Anyway, this guy did, so I then save it as text to make it a bit easier to work with. The word vectors themselves are in a very simple format: each line is just the word, a space, and then the vector itself, space-separated. I'm going to save them in a simple dictionary format, so what I'll share with you guys will be that dictionary: a dictionary from word or phrase to a NumPy array.
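A rough sketch of that loading-and-converting step, assuming the gensim library and the standard GoogleNews vector file name (both are assumptions; the lesson notebook goes via an intermediate text file instead, and gensim 4+ renames .vocab to .key_to_index):

```python
import pickle
import numpy as np
from gensim.models import KeyedVectors

# Load Google's pretrained word2vec vectors from their binary format
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                         binary=True)

# Convert to a plain {word_or_phrase: np.array} dictionary and pickle it,
# so later code doesn't need gensim or the proprietary binary at all
w2v_dict = {word: np.array(w2v[word]) for word in w2v.vocab}
with open('w2v_dict.pkl', 'wb') as f:
    pickle.dump(w2v_dict, f)
```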
I'm not sure I've used this idea of "zip star" before, so I should talk about it a little. If I've got a dictionary which maps from word to vector, how do I get out of that a list of the words and a list of the vectors? The short answer is: like this. But let's think about what that's doing. We've used zip quite a bit, right? Normally with zip you go zip(list1, list2, ...), and what that returns is an iterator which first gives you element 1 of list 1, element 1 of list 2, element 1 of list 3, and then element 2 of list 1, and so forth; that's what zip normally does. There's a nice idea in Python that you can put a star before any argument, and if that argument is an iterator, something you can loop through, it acts as if you had taken its contents and written them directly inside the brackets. So let's say w2v_list contained ('fox', some array), then ('pug', some array), and so forth. When you go zip(*that), it's the same as actually taking the contents of that list and putting them inside the call. You would want star-star if it was a dictionary, and star for a list? Not quite; a single star just means you're treating it as an iterator, but you're right that in this case we are using a list, so let's not worry about it, and we'll talk about star-star another time. So we have a list which is in this format, ('fox', array), ('pug', array), and lots more, and what zip star does is take all of the first elements and make one list from those, which becomes words, and all of the second elements and make one list from those, which becomes vectors. This idea of zip star is something we're going to use quite a lot. Honestly, I don't normally think about what it's doing; I just know that any time I've got a list of tuples and I want to turn it into a couple of lists, you just do zip star. So that's all that is, just a little Python thing. And it gives us a list of words and a list of vectors.

Any time I start looking at some new data, I always want to test it, so I wanted to make sure this worked the way I thought it ought to. One thing I tried was looking at the correlation coefficient between lowercase "jeremy" and capitalized "Jeremy", and indeed there's some correlation, which you would expect. Then the correlation between "jeremy" and "banana": I hate bananas, so I was hoping this would be massively negative; unfortunately it's not, but it is at least lower than the correlation between "jeremy" and "Jeremy". So okay, it's not always easy to exactly test data, but try to come up with things that ought to be true and make sure they are true, and in this case this gave me some comfort that these word vectors behave the way I expect them to. Now, I don't really care about capitalization, so I'll just go ahead and create a lowercase word2vec dictionary where I take the lowercase version of everything. One trick here is that I go through the words in reverse, because word2vec is ordered with the most common words first; by going in reverse, if there's both a capitalized "Jeremy" and a lowercase "jeremy", the one that ends up in my dictionary will be the more common one.
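In code, those little tricks look roughly like this (the word list and vectors are made-up stand-ins for the real data):

```python
import numpy as np

# Stand-ins: words in descending frequency order, plus random vectors
words = ['Jeremy', 'jeremy', 'banana']
w2v_dict = {w: np.random.randn(300) for w in words}

# zip-star: turn a list of (word, vector) pairs into a sequence of words
# and a sequence of vectors
ws, vecs = zip(*w2v_dict.items())

# Sanity check in the spirit described above (only meaningful with real vectors):
# related words should correlate more than unrelated ones
print(np.corrcoef(w2v_dict['Jeremy'], w2v_dict['jeremy'])[0, 1])

# Lowercase dictionary: iterate in reverse frequency order so that when both
# 'Jeremy' and 'jeremy' exist, the more common one is written last and wins
lc_w2v = {w.lower(): w2v_dict[w] for w in reversed(words)}
```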
So what I want for DeVISE is to get this word vector for each one of our 1,000 ImageNet categories, and then I'm going to go even further than that, because I want to go beyond ImageNet. I actually downloaded the original WordNet categories and filtered them down to find all the nouns, and I discovered that there are 82,000 nouns in WordNet, which is quite a few; it's quite fun looking through them. So I'm going to create a map of word vectors for every ImageNet category, that will be one set, and for every WordNet noun, that will be another set, and my goal in this project is to do useful things with the full set of WordNet nouns. We're going to go beyond ImageNet; we've already used the 1,000 ImageNet categories plenty of times before. So grab those, load them in, and then do the same thing for the full set of WordNet IDs, which I will share with you. Now I can go ahead and create a dictionary which goes through every one of my ImageNet 1,000 categories and converts it into a word vector. Notice I have a filter here, and that's because some of the ImageNet categories won't be in word2vec; sometimes the ImageNet categories say things like "pug (dog)", so they won't be in exactly the same format. If you wanted to, you could probably get a better match than this, but I found that even with a simple approach I managed to match 51,600 out of the 82,000 WordNet nouns, which I thought was pretty good. What I did then was create a list of the categories which didn't match, and this commented-out bit, as you can see, literally just moved those folders out of the way so that they're not in my ImageNet path any more. So the details aren't very important, but hopefully you can see that at the end of this process I've got something that maps every ImageNet category to a word vector, at least where I could find one, and I've modified my ImageNet data so that the categories I couldn't find have had their folders moved out of the way. Nothing particularly interesting there; WordNet's not that big, so that part is pretty straightforward.

The images are a bit harder, because we've got a million or so of them, so we're going to try everything we can to make this run as quickly as possible. To start with, even the very process of getting a list of the filenames of everything in ImageNet takes a non-trivial amount of time, and everything that takes a non-trivial amount of time is going to save its output. So the first thing I do is use glob. I can't remember if we used glob in part 1, I think we did; it's just the thing that's like "ls *.*", and that's called globbing. We use glob to grab all of the ImageNet training set filenames, and then I just go ahead and pickle.dump that, so later I can pickle.load it. For various reasons we'll see shortly, it's actually a very good idea at this point to put that list of filenames into a random order. The basic idea is that later on, when we work through chunks of filenames that are next to each other, we don't want them to all be the same type of thing, so randomizing the filenames now will save us a bit of time. Then I save that randomized list too; I've given it a different name so I can always come back to the original.
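The filename caching and shuffling just described might look something like this (paths and filenames are assumptions):

```python
import glob
import pickle
import random

PATH = '/data/imagenet/train/'           # illustrative path

# Getting the list of ~1M filenames is slow, so do it once and cache the result
fnames = glob.glob(PATH + '*/*.JPEG')
pickle.dump(fnames, open(PATH + 'fnames.pkl', 'wb'))

# Shuffle once up front so that later sequential chunks mix categories,
# and save under a different name so the original order is still available
shuffled = random.sample(fnames, len(fnames))
pickle.dump(shuffled, open(PATH + 'fnames_shuffled.pkl', 'wb'))

# Later runs can just reload:
fnames = pickle.load(open(PATH + 'fnames_shuffled.pkl', 'rb'))
```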
So I want to resize all of my images to a constant size. I'm being a bit lazy here and resizing them to 224 by 224; that's the input size for a lot of models, including the one we're going to use. It would probably be better to resize to something bigger and then randomly zoom and crop, and maybe if we have time we'll try that later, but for now we're just going to resize everything to 224 by 224. Okay, so we have nearly a million images, and it turns out that resizing them all to 224 by 224 can be pretty slow, so I've got some handy tricks to make it much faster. Generally speaking, there are three ways to make an algorithm significantly faster, and I'm going to explain each of them in a moment. The first is memory locality. The second is SIMD, also known as vectorization. The third is parallel processing. Rachel is very familiar with these, because she's currently creating a course for the masters students here on numerical linear algebra, which is very heavily about these things. So those are the three ways you can make data processing faster.

Memory locality simply means that in your computer you have lots of different kinds of memory, for example level 1 cache, level 2 cache, RAM, solid state disk, regular hard drives, whatever, and the difference in speed as you go from one to the next is generally 10, 100, or 1,000 times. You really, really don't want to go to the next level of the memory hierarchy if you can avoid it. Unfortunately, level 1 cache might be more like 16K, level 2 cache a few meg, RAM a few gig, solid state drives a few hundred gig, and your hard drives a few terabytes, so in reality you've got to be careful about how you manage these things. You want to make sure you're putting stuff in the right place, that you're not filling up the smaller resources unnecessarily, and that if you're going to use a piece of data multiple times, you use it again immediately, so it's still in your cache.

The second thing, which is what we're about to look at, is SIMD, which stands for single instruction, multiple data. Something that a shockingly large number of people, even people who claim to be professional programmers, don't know is that every modern CPU is capable, in a single operation on a single thread, of calculating multiple things at the same time. The way it does that is that you create a little vector of generally about 8 things. Say you want to take square roots: you put 8 numbers into this little vector and then call a particular CPU instruction which is basically "take the square root of the 8 floating point numbers in this register", and it does it in a single clock cycle. So when we say clock cycle: your CPU might be 2 or 3 gigahertz, so it's doing 2 or 3 billion things per second, but with SIMD it's actually doing 2 or 3 billion times 8 things per second. Because so few people are aware of SIMD, and because a lot of programming environments don't make it easy to use, a lot of software isn't written to take advantage of it, including, for example, pretty much all of the image processing in Python. However, you can do something about this: you can go "pip install pillow-simd", and that will replace your Pillow (remember, Pillow is the main Python imaging library) with a version that does use SIMD for at least some of its operations. SIMD only works on certain CPUs; any vaguely recent CPU works, but because it's only some, you have to pass special directives to the compiler saying "I have this kind of CPU, so please use these kinds of instructions". And because pillow-simd literally replaces your existing Pillow, you have to force the reinstall, otherwise pip will say "you already have Pillow", and no, I want the SIMD build. If you try this, the speed of your resize goes up by around 600%, and you don't have to change any code. I'm a huge fan of SIMD in general; it's one of the reasons I'm not particularly fond of Python, because it doesn't make it at all easy to use SIMD, but luckily some people have written things in C which do use SIMD and then provided Python interfaces. So this is something to try to get working when you go home. Before you do it, write a little benchmark that resizes a thousand images and times it, then run the install command and make sure it gets around 600% faster; that way you know it's actually working.
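A minimal before/after benchmark along those lines might look like this (the file list and target size are assumptions); run it once, then swap in pillow-simd, for example with pip uninstall pillow followed by pip install pillow-simd, and run it again:

```python
import time
from PIL import Image

def benchmark_resize(fnames, size=(224, 224)):
    """Time how long it takes to open and resize a batch of images."""
    start = time.time()
    for f in fnames:
        Image.open(f).resize(size, Image.BILINEAR)
    return time.time() - start

# fnames = pickle.load(open(PATH + 'fnames_shuffled.pkl', 'rb'))[:1000]
# print(benchmark_resize(fnames))   # compare before and after installing pillow-simd
```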
We have two questions; I don't know if you want to finish the three ways of making things faster first? Go ahead. One is: how could you get the relation between a pug and a dog, so a photo of a pug in relation to the bigger category of dog? Sure, we'll come back to that. Okay, and the other: why do we want to randomize the file names; can't we use shuffle=True on the Keras flow_from_directory? You'll see, but the short answer is that it's to do with locality. If you say shuffle=True, you're jumping from here on the hard disk to here to here, and take that literally: there's a spinning disk with a little needle, and the thing is moving all over the place, so you want to be reading things that are all in a row. That's basically the reason. And as you'll see, this approach is going to basically work for the concept of dog versus pug, because the word vector for "dog" is very similar to the word vector for "pug"; at the end we'll try it, we'll see if we can find the dogs, and I'm sure it will work.

Finally, there's parallel processing. Parallel processing refers to the fact that, as hopefully you all know, any modern CPU has multiple cores, which literally means multiple CPUs inside your CPU, and often the boxes you buy for home even have multiple physical CPUs in them. Again, Python is not great for parallel processing; Python 3 is certainly a lot better, but a lot of stuff in Python doesn't use parallel processing very effectively. A lot of modern CPUs have 10 cores or more, even consumer CPUs, so if you're not using parallel processing you're missing out on a 10x speedup, and if you're not using SIMD you're missing out on a 6 to 8x speedup. If you can do both of these things you can get a 50-plus-times speedup, and I mean you will get 50-plus, assuming your CPU has enough cores. So we're going to do both: to get SIMD we just install pillow-simd, and for parallel processing, we're probably not going to see all of it today, but we will be using it.

So I define a few things to do my resizing. One thing is that I've actually recently changed how I do resizing. As I'm sure you've noticed, in the past when I resized things to a square I tended to add a black border to the bottom or the right, because that's what Keras did. Now that I've looked into it, no best-practice papers or Kaggle results do it that way, and it makes perfect sense: a CNN has to learn to deal with the black border, and you're throwing away information. What pretty much all the best-practice approaches do is, rather than rescale the longest side to the size of your square and fill the rest with black, take the smallest side and make that the size of your square; the other side is now too big, so just chop off the top and bottom, or the left and right. That's called center cropping. So: resizing and center cropping. What I've done here is I've got something that resizes to the size of the shortest side, and then over here somewhere I've got something that does the center cropping. You can look at the details when you get home if you like; it's not particularly exciting, and there's a rough sketch at the end of these notes as well. So I've got something that does the resizing. This is something you could improve: currently I'm just making sure that it's a three-channel image, so anything black-and-white or with an alpha channel I just ignore. Okay, so before I finish up: I think I'm out of time. What we're going to learn about next time, when we start, is parallel processing, so anybody who's interested in pre-reading, feel free to start reading about and playing around with parallel processing. All right, thanks everybody, see you next week. I hope your assignments go really well, and let me know if I can help you out on the forum.
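For reference, a minimal sketch of the resize-shortest-side-then-center-crop approach described above, using PIL; the function name and the skip-non-RGB behaviour are assumptions about the idea, not the lesson's exact code:

```python
from PIL import Image

def resize_and_center_crop(fname, size=224):
    """Resize so the shortest side equals `size`, then center-crop to size x size."""
    img = Image.open(fname)
    if img.mode != 'RGB':                          # skip greyscale / alpha-channel images
        return None
    w, h = img.size
    scale = size / min(w, h)                       # the shortest side becomes `size`
    img = img.resize((int(round(w * scale)), int(round(h * scale))), Image.BILINEAR)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2   # chop equally off both ends of the long side
    return img.crop((left, top, left + size, top + size))
```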