So I wanted to start, let's start actually on the wiki on the lesson three section of the wiki. Because Rachel added something which I think is super helpful to the wiki this week, which is in this section about the assignments, you'll see where it talks about going through the notebooks as a section called how to use the provided notebooks. And I think the feedback I get is each time I talk about the kind of teaching approach in this class, people get a lot out of it. So I thought I wanted to keep kind of talking a little bit about that. As we've discussed before, in the two hours that we spend together each week, that's not nearly enough time for me to teach you deep learning. I can show you what kinds of things you need to learn about and I can show you where to look and try to give you a sense of some of the key topics. But then the idea is that you're going to learn about deep learning during the week by doing a whole lot of experimenting. And one of the places that you can do that experimenting is with the help of the notebooks that we provide. Having said that, if you do that by loading up a notebook and hitting shift enter a bunch of times to go through each cell until you get an error message and then you go, oh shit, I got an error message, you're not going to learn anything about deep learning. I was almost tempted to not put the notebooks online until a week after each class because it's just so much better when you can build it yourself. But the notebooks are very useful if you use them really rigorously and thoughtfully, which is, as Rachel's described here, read through it and then put it aside, minimize it or close it or whatever. And now try and replicate what you just read from scratch. And any time you get stuck, you can go back and open back up the notebook, find the solution to your problem, but don't copy and paste it. Put the notebook aside again. Go and read the documentation about what it turns out the solution was, try and understand why is this a solution, and type in that solution yourself from scratch. And so if you can do that, it means you really understand now the solution to this thing you were previously stuck on and you've now learned something you didn't know before. You might still be stuck and that's fine. So if you're still stuck, you can refer back to the notebook again, still don't copy and paste the code, but whilst having both open on the screen at the same time, type in the code. Now that might seem pretty weird, like why would you type in code you can copy and paste, but just the very kinesthetic process of typing it in forces you to think about like where are the parentheses, where are the dots and what's going on. And then once you've done that, you can try changing the inputs to that function and see what happens and see how it affects the outputs and really experiment. So it's through this process of trying to come up with and think about what step do I take next that means that you're thinking about the concepts you've learned. And then how do you do that step? It means that you're having to recall how the actual libraries are working. And then most importantly, through experimenting with the inputs and outputs, you get this really intuitive understanding of what's going on. So one of the questions I was thrilled to see over the weekend, because it led to a discussion, is exactly the kind of thing I think is super helpful. How do I pronounce your name, sorry, Sravya, is it? Sravya? Sravya. Asked, OK, I'm trying to understand correlate.
So I sent it two vectors with two things each and was happy with the result. And then I sent it two vectors of three things and I don't get it. And so that's great. This is like taking it down to make sure, OK, I really understand this. And so I typed something in and the output was not what I expected, what's going on. And so then I tried it by creating a little spreadsheet and showed here are the three numbers and here is how it was calculated. And then it's like, OK, I kind of get that, not fully. And then I finally described it. Does it make sense in the end? OK, so you now understand correlation and convolution. You know you do because you put it in there, you figured out what the answer ought to be. And eventually the answer was what you thought. So this is exactly the kind of, and I hope you don't mind thinking like this, I just thought that's great. This is exactly the kind of experimentation. I find a lot of people try to jump straight to full-scale image recognition before they've got to the kind of like one plus one stage. And so you'll see I do a lot of stuff in Excel. And this is why, in Excel or with simple little things in Python, I think that's where you get the most experimental benefit. So that's what we're talking about when we talk about experiments. OK, I want to show you something pretty interesting. And remember last week, we looked at this paper from Matt Zeiler where we saw what the different layers of a convolutional neural network look like. Right, so you guys remember this. Yes, Rachel? I'd like to ask some questions. Oh yeah, yeah, thank you so much, Rachel. That's a great point, and we even wrote it down. One of the steps in the how to use the provided notebooks is if you don't know why a step's being done or how it works or why the inputs and outputs are what you observe, please ask. Anytime you're stuck for half an hour, please ask. So far, I believe that there has been a 100% success rate in answering questions on the forums. So when people ask, they get an answer. So part of the homework this week in the assignments is ask a question on the forum. OK, yes, Rachel? It's OK if that question is about lesson one; the question could be about anything at all. Setting up AWS, don't be embarrassed if you still have questions there. No, absolutely. I know a lot of people are still working through cats and dogs, or cats and dogs Redux, and that makes perfect sense. Different people here have different backgrounds. There are plenty of people here who have never used Python before. Python was not a prerequisite. The goal is that for those of you that don't know Python, that we give you the resources to learn it and learn it well enough to be effective in doing deep learning in it. But that does mean that you guys are going to have to ask more questions. There are no dumb questions. And so if you see somebody asking on the forum about how do I analyze functional brain MRIs with 3D convolutional neural networks, OK, that's fine. That's where they are at. It's OK if you then ask, what does this Python function do? Or vice versa, if you see somebody ask, what does this Python function do? And you want to talk about 3D brain MRIs, do that too. The nice thing about the forum is that, as you can see, it really is buzzing now. And the nice thing is that the different threads allow people to dig into the stuff that interests them. And I'll tell you from personal experience, the thing that I learn the most from is answering the simplest questions.
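If you want to poke at the same thing yourself, here's a minimal sketch using SciPy's 1D correlation; the vectors are made up, and note that the boundary handling defaults to 'reflect', which is exactly the detail that comes up next.

```python
import numpy as np
from scipy.ndimage import correlate1d

x = np.array([1., 2., 3.])   # a made-up input vector
w = np.array([1., 1., 1.])   # a made-up 3-element filter

# correlate1d slides the filter over the input; by default it extends
# the input by reflecting it at the edges (mode='reflect'), which is
# what can make the first and last outputs look surprising at first.
print(correlate1d(x, w))                   # default 'reflect' boundary
print(correlate1d(x, w, mode='constant'))  # pad with zeros instead
```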
So actually answering that question about a 1D convolution, I found very interesting. Because I actually didn't realize that the reflect parameter was a default parameter. And I didn't quite understand how it worked. And so answering that question, I found very interesting. And even sometimes if you know the answer, figuring out how to express it teaches you a lot. So asking questions of any level is always helpful to you and to the rest of the community. So please, if everybody only does one part of the assignments this week, do that one, which is to ask a question. And here are some ideas about questions you could ask if you're not sure. OK, thank you, Rachel. All right, so I was saying last week, we kind of looked later in the class at this amazing visualization of what goes on in a convolutional neural network. I want to show you something even cooler, which is the same thing in video. This is by an amazing guy called Jason Yosinski and his supervisor, Hod Lipson, and some other guys. And it's doing the same thing but in video. And so I'm going to show you what's going on here. And you can download this. It's called the Deep Visualization Toolbox. So if you go to Google and search for the Deep Visualization Toolbox, you can do this. You can grab pictures. You can click on any one of the layers of a convolutional neural network. Here's the first convolutional layer. And it will visualize every one of the filters, the outputs of the filters in that convolutional layer. So you can see here with this dog, it looks like there's a filter here, which is kind of finding edges. And you can even give it a video stream. So if you give it a video stream of your own webcam, you can see the video stream popping up here. So this is a great tool. And looking at this tool now, I hope it will give us a better intuition about what's going on in a convolutional neural network. Look at this one here he selected. It's clearly an edge detector. As he slides a piece of paper over it, you get this very strong edge. And clearly it's specifically a horizontal edge detector. And here is actually a visualization of the pixels of the filter itself. And that's exactly what you'd expect. Remember from our initial lesson zero, an edge detector has black on one side and white on the other. So you can scroll through all the different layers of this neural network. And different layers do different things. And the deeper the layer, the larger the area it covers. And therefore, the smaller the actual filter is. And the more complex the objects that it can recognize. So here's an interesting example of a layer 5 thing, which it looks like is a face detector. So you can see that as he moves his face around, this is moving around as well. So one of the cool things you can do with this is you can say, show me all the images from ImageNet that match this filter as much as possible. And you can see that it's showing us faces. So this is a really cool way to understand what your neural network is doing. Or what, in this case, the ImageNet network is doing. You can see another guy's come along, and here we are. And so here you can see the actual result in real time of the filter deconvolution. And here's the actual recognition that it's doing. So clearly it's a face detector, which also detects cat faces. So the interesting thing about these types of neural net filters is that they're often pretty subtle as to how they work. They're not looking for just some fixed set of pixels, but they really understand concepts.
So here's a really interesting example. Here's one of the filters in the fifth layer, which seems to be like an armpit detector. So why would you have an armpit detector? Well, interestingly, what he shows here is that actually it's not an armpit detector. Because look what happens. If he smooths out his fabric, this disappears. So what this actually is, is a texture detector. It's something that detects some kind of regular texture. Here's an interesting example of one, which clearly is a text detector. Now interestingly, ImageNet did not have a category called text. One of the 1,000 categories is not text. But one of the 1,000 categories is bookshelf. And so you can't find a bookshelf if you don't know how to find a book. And you can't find a book if you don't know how to recognize a spine. And the way to recognize a spine is by finding text. So this is the cool thing about these neural networks is that you don't have to tell them what to find. They decide what they want to find in order to solve your problem. So I wanted to start at this end of like, oh my god, deep learning is really cool. And then jump back to the other end of, oh my god, deep learning is really simple. So everything we just saw works because of the things that we've learned about so far. And I've got a section here called CNN Review in Lesson 3. And Rachel and I have started to add some of our favorite readings about each of these pieces. But everything you just saw in that video consists of the following pieces. Matrix products, convolutions, just like we saw in Excel and in Python, activations such as ReLUs and softmax, and stochastic gradient descent, which is based on back propagation. We'll learn more about that today. And that's basically it. One of the, I think, challenging things is even if you feel comfortable with each of these one, two, three, four, five pieces that make up a convolutional neural network, the challenge is really understanding how all those pieces fit together to actually do deep learning. So we've got two really good resources here on putting it all together. So I'm going to go through each of these things today as revision. But what I suggest you do, if there is any piece where you feel like, OK, I'm not quite confident I really know what a convolution is, or I really know what an activation function is, see if this information is helpful, and maybe ask a question on the forum. So let's go through each of these. I think a particularly good place to start, maybe, is with convolutions. And a good reason to start with convolutions is because we haven't really looked at them since lesson 0. And that was quite a while ago. So let's remind ourselves about lesson 0. So in lesson 0, we learned about what a convolution is. And we learned about what a convolution is by actually trying, running a convolution against an image. So we used the MNIST data set. The MNIST data set, remember, consists of 55,000 28 by 28 grayscale images of handwritten digits. So each one of these has some known label. And so here's five examples with a known label. So in order to understand what a convolution is, we tried creating a simple little 3 by 3 matrix. And so the 3 by 3 matrix we started with had negative ones at the top, ones in the middle, and zeros at the bottom. So we could kind of visualize that. So what would happen if we took this 3 by 3 matrix and we slid it over every 3 by 3 part of this image and we multiplied negative 1 by the first pixel, negative 1 by the second pixel, negative 1 by the third pixel.
And then moved to the next row and multiplied by 1, 1, 1, 0, 0, 0, and add them all together. And so we could do that for every 3 by 3 area. That's what a convolution does. So you might remember from lesson zero, we looked at a little area to actually see what this looks like. So we could zoom in. OK, so here's a little small little bit of the 7. And so one thing I think is helpful is just to look at what is that little bit? Let's make it a bit smaller so it fits on our screen, shall we? There we go. So you can see that an image just is a bunch of numbers. And the blacks are zeros. And the things in between are bigger and bigger numbers until eventually the whites are very close to 1. So what would happen if we took this little 3 by 3 area? 0, 0, 0, 0, 0.35, 0.5, 0.9, 0.98, 0.9, 0.9. And we multiplied each of those nine things by each of these nine things. So clearly, anywhere where the first row is zeros and the second row is ones, this is going to be very high when we multiply it all together and add the nine things up. And so given that white means high, you can see then that we end up with something when we do this convolution. We end up with something where the top edges become bright, because we went negative 1, negative 1, negative 1, times 1, 1, 1, times 0, 0, 0, and added them all together. So one of the things we looked at in lesson 0 and we have a link to here is this cool little image kernels explained visually site where you can actually create any 3 by 3 matrix yourself and go through any 3 by 3 part of this picture and see the actual arithmetic and see the result. So if you're not comfortable with convolutions, this would be a great place to go next. That's an excellent question. How did you decide on the values of the top matrix? So in order to demonstrate an edge filter, I picked values based on some well-known edge filter matrices. So you can see here's a bunch of different matrices that this guy has. So for example, top sobel I could select. And you can see that does a top edge filter. Or I could say emboss. And you can see it creates this embossing effect. Here's a better example because it's nice and big here. So these types of filters have been created over many decades. And there's lots and lots of filters designed to do interesting things. So I just picked a simple filter which I knew from experience and from common sense would create a top edge filter. And so by the same kind of idea, if I rotate that by 90 degrees, so negative ones down the side, that's going to create a left-hand edge filter. So if I create the four different types of filter here, and I could also create four different diagonal filters like these, that would allow me to create top edge, left edge, bottom edge, right edge, and then each diagonal edge filters here. So I created these filters just by hand through kind of a combination of common sense and having read about filters. Because people spend time designing filters. The more interesting question then really is what would be the optimal way to design filters? Because it's definitely not the case that these eight filters are the best way of figuring out what's a 7 and what's an 8 and what's a 1. So this is what deep learning does. What deep learning does is it says, let's start with random filters. So let's not design them, but we'll start with totally random numbers for each of our filters. So we might start with eight random filters, each 3 by 3.
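Here's a minimal sketch of that hand-designed top-edge filter being slid over an image with SciPy; the tiny image is made up just to show the effect. With deep learning, those nine numbers would instead start out random and get learned, which is where we pick up next.

```python
import numpy as np
from scipy.ndimage import correlate

# A made-up 6x6 "image": dark (0) on top, bright (1) below, so there is
# a horizontal edge in the middle.
img = np.zeros((6, 6))
img[3:, :] = 1.0

# The top-edge filter described above: negative ones, ones, zeros.
top_edge = np.array([[-1., -1., -1.],
                     [ 1.,  1.,  1.],
                     [ 0.,  0.,  0.]])

# Slide the 3x3 filter over every 3x3 part of the image, multiply the
# nine pairs of numbers together, and add them up at each position.
result = correlate(img, top_edge)
print(result)   # big values along the row where dark meets bright
```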
And we then use stochastic gradient descent to find out what are the optimal values of each of those sets of nine numbers. And that's what happens in order to create that cool video we just saw and that cool paper that we saw. That's how those different kinds of edge detectors and gradient detectors and so forth were created. When you use stochastic gradient descent to optimize these kinds of values, when they start out random, it figures out that the best way to recognize images is by creating these kinds of different detectors, different filters. Where it gets interesting is when you start building convolutions on top of convolutions. So we saw last week that we can put, yeah, I got a question. How do you decide on the size of the filter? OK, I definitely want to talk about that today. So if I don't, please remind me. So we saw last week how if you've got three inputs, you can create a bunch of weight matrices. So we could create one weight matrix. So if we've got three inputs, we saw last week how you could create a random matrix and then do a matrix multiply of the inputs times a random matrix. We could then put it through an activation function such as max 0 comma x. This is called the rectified linear unit, or ReLU. And we could then take that and multiply it by another weight matrix to create another output. And then we could put that through max 0 comma x. And we can keep doing that to create arbitrarily complex functions. And we looked at this really great neural networks and deep learning chapter where we saw visually how that kind of bunch of matrix products followed by activation functions can approximate any given function. Where it, I'll come back to it in one moment, Rachel. So where it gets interesting then is instead of just having a bunch of weight matrices and matrix products, what if sometimes we had convolutions and activations? Because a convolution is just a subset of a matrix product. So if you think about it, a matrix product says here's 10 activations and then a weight matrix going down to 10 activations. The weight matrix goes from every single element of the first layer to every single element of the next layer. So if this goes from 10 to 10, there are 100 weights. Whereas a convolution is just creating a subset of those weights. So I'll let you think about this during the week. Because it's a really interesting insight to kind of think about, oh, that a convolution is identical to a fully connected layer, but with just a subset of the weights. And so therefore everything we learned about stacking linear and non-linear layers together applies also to convolutions. But we also know that convolutions are particularly well suited to identifying interesting features of images. So by using convolutions, it allows us to more conveniently and quickly find powerful deep learning networks. Yes, Rachel? Two questions. One is your spreadsheet available for download? And secondly, are the filters the layers in the neural network? So the spreadsheet will be available for download. What do you think, Rachel? Tomorrow? Hopefully. We're trying to get to the point that we can actually get the derivatives to work and we're still slightly stuck with some of the details. But we'll make something available tomorrow. There'll be a spreadsheet. Are the filters the layers? Yes, they are. So this is something where spending a lot of time looking at simple little convolution examples is really helpful. Because for a fully connected layer, it's pretty easy.
You can see if I have three inputs, then my matrix product will have to have three rows, otherwise they won't match. And then I could create as many columns as I like. And the number of columns I create tells me how many activations I create, because that's what matrix products do. So it's very easy to see how with what Keras calls dense layers, I can decide how big I want each activation layer to be. If you think about it, you can do exactly the same thing with convolutions. You can decide how many sets of 3x3 matrices you want to create at random. And each one will generate a different output when applied to the image. So the way that VGG works, for example, okay, so the VGG network, which we learned about in lesson one, contains a bunch of layers. It contains a bunch of convolutional layers followed by a flatten. And all flatten does is it's just a Keras thing that says, okay, don't think of the layers anymore as being x by y by channel matrices. Think of them as being a single vector. So it just concatenates all the dimensions together. And then it contains a bunch of fully connected blocks. And so for each of the convolutional blocks, you can kind of ignore the zero padding. That just adds zeros around the outside so that your convolutions end up with the same number of outputs as inputs. It contains a 2D convolution followed by, and we'll review this in a moment, a max pooling layer. And you can see that it starts off with two convolutional layers with 64 filters and then two convolutional layers with 128 filters and then three convolutional layers with 256 filters. And so you can see what it's doing is it's gradually creating more and more filters in each layer. And these definitions of block are specific to VGG, right? Yeah, these definitions of block are specific to VGG. So I just created, this is just me refactoring the model so there wasn't lots and lots of lines of code. Okay, so I just didn't want to retype lots of code. So I kind of found that these lines of code were being repeated, so I turned it into a function. So why would we be having the number of filters increasing? Well, the best way to understand a model is to use the summary command. So let's go back to lesson one. There was a request to dim the lights in front of the screen. I don't think we can; actually, one of them we can. Yes, thanks, Terry. So let's go right back to our first thing we learned, which was the seven lines of code that you can run in order to create and train a network. Okay, I won't wait for it to actually finish training. But what I do want to do now is go vgg.model.summary. So anytime you're creating models, it's a really good idea to use the summary command to look inside them, and it tells you all about it. So here we can see that the input to our model has three channels, red, green, and blue, and they are 224 by 224 images. After I do my first 2D convolution, I now have 64 channels of 224 by 224. Okay, so I've replaced my three channels with the 64, just like here I've got eight different layers, not layers, eight different filters. Here I've got 64 different filters because that's what I asked for, okay. So again, we have a second convolution set with 224 by 224 of 64. And then we do max pooling. So max pooling, remember from lesson zero, was this thing where we simplified things. So we started out with these 28 by 28 images, and we said, let's take each seven by seven block and replace that entire seven by seven block with a single pixel, which contains the maximum pixel value.
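That operation is simple enough to try directly in NumPy; here's a minimal sketch on a made-up 28 by 28 array, pooling over non-overlapping 7 by 7 blocks.

```python
import numpy as np

img = np.random.rand(28, 28)   # stand-in for a 28x28 image

# Split the array into a 4x4 grid of 7x7 blocks, then replace each
# block with its maximum value.
pooled = img.reshape(4, 7, 4, 7).max(axis=(1, 3))
print(pooled.shape)            # (4, 4)
```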
So here is this seven by seven block, which is basically all gray, so we end up with a very low number here. And so instead of being 28 by 28, it becomes four by four because we are replacing every seven by seven block with a single pixel. That's all max pooling does. So the reason we have max pooling is it allows us to gradually simplify our image so that we get larger and larger areas and smaller and smaller images. So if we look at VGG, after our max pooling layer, we no longer have 224 by 224, we now have 112 by 112. And then I'll come back to that in a moment, Rachel. Later on, we do another max pooling, we end up with 56 by 56. Later on, we do another max pooling, and we end up with 30 by 30. Now, sorry, 28 by 28. So each time we do a max pooling, we're reducing the resolution, not really of our image, since it's not an image anymore, but of our filter activations. And so as we're reducing the resolution, we need to hike up the number of filters, otherwise we're losing information. So that's really why each time we have a max pooling, we then double the number of filters, because it means that at every layer, we're keeping the same amount of information content. Yes, Rachel? The maps are position invariant. They detect visual patterns, regardless of where they occur in the image. Why isn't this more of a problem? Intuitively, it just matters to me whether I see a pattern in one part of the picture, for example, top right, or another part. But I guess it must not be, or CNNs would not work so well. Okay, so that starts out with a very, very important insight, which is that a convolution is position invariant. So in other words, this thing we created, which is a top edge detector, let's go all the way back to it. Okay, this matrix here, which is a top edge detector, we can apply that to any part of the image and get top edges from every part of the image. And like earlier on, when we looked at that Jason Yosinski video, it showed that there was a face detector, which could find a face in any part of the image. So this is fundamental to how a convolution works. A convolution is position invariant. It finds a pattern regardless of whereabouts in the image it is. Now that is a very powerful idea because when we want to, say, find a face, we want to be able to find eyes. And we want to be able to find eyes regardless of whether the face is in the top left or the bottom right. So position invariance is important, but also we need to be able to identify position to some extent, because if there are four eyes in the picture, or if there's an eye in the top corner and one in the bottom corner, then something weird is going on, or if the eyes and the nose aren't in the right positions. So how does a convolutional neural network both have this location invariant filter but also handle location? And the trick is that every one of the three by three filters cares deeply about where each of these three by three things is. And so as we go down through the layers of our model, from 224 to 112, to 56, to 28, to 14, to 7, at each one of these stages. So think about this stage which goes from 14 by 14 to seven by seven. These filters are now looking at large parts of the image and so it's now at a point where it can actually, in that three by three, it can say there needs to be an eye here and an eye here and a nose here. So this is one of the cool things about convolutional neural networks. They can find features everywhere but they can also build things which care about how features relate to each other positionally.
So you get to do both. Yes, Rachel? There are three more questions. Three more questions, all right. So do we need to do zero padding in each of these layers? Okay, do we need to do zero padding? How do you deal with filters that specialize in identifying the same objects? I don't know what that means so maybe you can add more in the slide. And CNN fails when it comes to cartoons. CNN fails when it comes to cartoons. Okay, so do we need zero padding? So zero padding is literally something that sticks zeros around the outside of an image. If you think about what a convolution does, it's taking a three by three and moving it over an image. If you do that, when you get to the edge, what do you do? Because at the very edge, you can't move your three by three any further. Which means if you only do what's called a valid convolution, which means you always make sure your three by three filter fits entirely within your image, you end up losing a couple of pixels from the sides and a couple of pixels from the top and bottom each time. There's actually nothing wrong with that. But it's a little inelegant. It's kind of nice to be able to like halve the size each time and be able to see exactly what's going on. So people tend to often like doing what's called same convolutions. So if you add a black border around the outside, then the result of your convolution is exactly the same size as your input. So that is literally the only reason to do it. In fact, this is a rather inelegant way of going zero padding and then convolution. In fact, there's a parameter to nearly every library's convolution function where you can say I want valid or same or full, which basically means do you add no black pixels, one black pixel, or two black pixels, assuming it's three by three. And so I don't quite know why this one does it this way. It's really doing two functions where one would have done. But it does the job. Yeah, so there's no right answer to that question. Convolutional neural networks work fine for cartoons. The question was, do they work for cartoons? However, fine tuning, which has been fundamental to everything we've learned so far, it's gonna be difficult to fine-tune from an ImageNet model to a cartoon, because an ImageNet model was built, do you remember, on all those pictures of corn we looked at and all those pictures of dogs we looked at. So an ImageNet model has learned to find the kinds of features that are in photos of objects out there in the world. And those are very different kinds of photos to what you see in a cartoon. So if you want to be able to build a cartoon neural network, you'll need to either find somebody else who has already trained a neural network on cartoons and fine-tune that, or you're gonna have to create a really big corpus of cartoons and create your own ImageNet equivalent. Yes? But if you have the fundamentals, why doesn't it translate across? So why doesn't an ImageNet network translate to cartoons, given that an eye is a circle? Because the nuance level of a CNN is very high. It doesn't think of an eye as being just a circle. It knows that an eye, very specifically, has particular gradients and particular shapes and particular ways that the light reflects off it and so forth. So when it sees a round blob there, it has no ability to abstract that out and say, oh, I guess they mean an eye. One of the big shortcomings of CNNs is that they can only learn to recognize things that you specifically give them to recognize.
If you feed a neural net with a wide range of photos and drawings, maybe it would learn about that kind of abstraction. To my knowledge, that's never been done. It would be a very interesting thing to try, and it must be possible. I'm just not sure how many examples you would need and what kind of architecture you would need. Yes, Rachel? I don't think we've got to the correlate versus convolution. Correlate versus convolution? So in this particular example, I used correlate, not convolution. So one of the things we briefly mentioned in lesson one is that convolve and correlate are exactly the same thing, except convolve is equal to correlate of an image with a filter that has been rotated by 180 degrees. So you can see convolving the images with the filter rotated by 180 degrees looks exactly the same, and np.allclose is true. So convolve and correlate are identical, except that correlate is more intuitive. Correlate goes through each one the same way, along the rows and then down the columns. Whereas with convolve, one goes along one way and the other one goes the opposite way. So I tend to prefer to think about correlate because it's just more intuitive. Convolve originally came really from physics, I think. And it's also kind of a basic math operation. There are various reasons that people sometimes find it more intuitive to think about a convolution. But in terms of everything that they can do in a neural net, it doesn't matter which one you're using. And in fact, many libraries let you set a parameter to true or false to decide whether or not internally it uses convolution or correlation. And of course, the results are going to be identical. Thank you, Rachel. So let's go back to our CNN review. So our network architecture is a bunch of matrix products, or more generally, linear layers. And remember, a convolution is just a subset of a matrix product. So it's also a linear layer. A bunch of matrix products or convolutions stacked with alternating nonlinear activation functions. And specifically, we looked at the activation function which was the rectified linear unit, which is just max of 0 comma x. So that's an incredibly simple activation function. But it's by far the most common. It works really, really well for the internal parts of a neural network. I want to introduce one more activation function today. And you can read more about it in lesson two. Let's go down here where it says about activation functions. And you can see I've got all the derivations of the, or kind of the details of, these activation functions here. I want to talk about one called the softmax function. And softmax is defined as follows: e to the xi, divided by the sum over all the outputs of e to the xj. What is this all about? Softmax is used not for the middle layers of a deep learning network, but for the last layer. The last layer of a neural network, if you think about what it's trying to do for classification, it's trying to match a one-hot encoded output. Remember, a one-hot encoded output is a vector with all 0s and just a 1 in one spot. And the spots are, like we had for cats and dogs, two spots. The first one was a one if it was a cat, and the second one was a one if it was a dog. So in general, if we're doing classification, we want our output to have one high number and all the other ones be low. That's going to make it easier to create this one-hot encoded output. Furthermore, we would like to be able to interpret these as probabilities, which means all of the outputs have to add to one. So we've got these two requirements here.
Our final layer's activations should add to one, and one of them should be higher than all the rest. This particular function does exactly that, and we will look at that by looking, as usual, at a spreadsheet. So here is an example of what an output layer might contain. Here is e to the power of each of those things to the left. Here is the sum of e to the power of those things. And then here is the thing to the left divided by the sum of them, in other words, softmax. And you can see that we start with a bunch of numbers that are all of a similar kind of scale, and we end up with a bunch of numbers that sum to one, and one of them is much higher than the others. So in general, when we design neural networks, we want to come up with architectures, by which I mean convolutions, fully connected layers, activation functions. We want to come up with architectures where replicating the outcome we want is as convenient as possible. So in this case, our activation function for the last layer makes it pretty convenient, pretty easy, to come up with something that looks a lot like a one hot encoded output. And so the easier it is for our neural net to create the thing we want, the faster it's going to get there, and the more likely it is to get there in a way that's quite accurate. So we've learned that any big enough, deep enough neural network, because of the universal approximation theorem, can approximate any function at all. And we know that stochastic gradient descent can find the parameters for any of these, which kind of leaves you thinking, well, why do we need seven weeks of neural network training? Like, any architecture ought to work. And indeed, that's true. If you have long enough, any architecture will work. Any architecture can translate Hungarian to English. Any architecture can recognize cats versus dogs. Any architecture can analyze Hillary Clinton's emails, as long as it's big enough. However, some of them do it much faster than others. They train much faster than others. A bad architecture could take so long to train that it doesn't train in the amount of years you have left in your lifetime. And that's why we care about things like convolutional neural networks instead of just fully connected layers all the way through. That's where we care about having a softmax at the last layer, rather than just a linear last layer. So we try to make it as convenient as possible for our network to create the thing that we want to create. Yes, Rachel? So there are several questions. OK. One is, what is the theoretical justification for doing softmax as opposed to just taking the item divided by the sum of the items? One is followed up on their earlier question. It was, what if a network learns identical filters? So what's theoretical justification for a softmax? The other was, what if a network learns identical filters? One is about how Keras internally handles kind of these matrices of data. Any more information about that one? It's a different one. And then, yeah, that's it. OK. So the first one was softmax justification. Honestly, I don't do theoretical justifications. I do intuitive justifications. There is a great book for theoretical justifications, and it's available for free. If you just Google for a deep learning book, or indeed, go to deeplearningbook.org. It actually does have a fantastic theoretical justification of why we use softmax. The short version, basically, is as follows. Softmax contains an e to the in it. Our log loss layer contains a log in it. 
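Here's a tiny sketch of that spreadsheet calculation in Python; the final-layer output values are made up.

```python
import numpy as np

outputs = np.array([1.5, 0.3, 2.8, 0.1, 1.0])  # made-up final-layer outputs

exps = np.exp(outputs)       # e to the power of each output
softmax = exps / exps.sum()  # each one divided by the sum of all of them

print(softmax)        # all between 0 and 1, with one clearly biggest
print(softmax.sum())  # 1.0
```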
The e to the power in the softmax and the log in the loss nicely mesh up against each other. And in fact, the derivative of the two together is just a minus b, the predictions minus the targets. So that's kind of the short version. But I will refer you to the deep learning book for more information about the theoretical justification. The intuitive justification is that because we have an e to the power here, it makes a big number really, really big. And therefore, once we take one divided by the sum of the others, we end up with one number that tends to be bigger than all the rest. And that is very close to the one hot encoded output that we're trying to match. So what is the other question, Rachel? The other question was, could a network learn identical filters? Oh, could a network learn identical filters? A network absolutely could learn identical filters, but it won't. The reason it won't is because it's not optimal to. Stochastic gradient descent is an optimization procedure. It will come up with, if you train it for long enough, with an appropriate learning rate, the optimal set of filters. Having the same filter twice is never optimal. That's redundant. So as long as you start off with random weights, then it can learn to find the optimal set of filters, which will not include duplicate filters. OK, these are all fantastic questions. All right, so in this review, we've done our different layers, and then these different layers get optimized with SGD. And so last week, we learned about SGD by using this extremely simple example where we said, let's define a function which is a line, ax plus b. Let's create some data that matches a line. So there's our x's and y's. Here's a scatter plot. Let's define a loss function, which is the sum of squared errors. And let's say, OK, we now no longer know what a and b are. So let's start with some guess. And obviously, the loss for our guess is pretty high. And let's now kind of try and come up with a procedure where each step makes the loss a little bit better by making a and b a little bit better. And the way we did that was very simple. We calculated the derivative of the loss with respect to each of a and b. And that means that the derivative of the loss with respect to b is, if I increase b by a bit, how does the loss change? And the derivative of the loss with respect to a means, as I change a a bit, how does the loss change? If I know those two things, then I know that I should subtract the derivative times some learning rate. We chose 0.01. And as long as our learning rate is low enough, we know that this is going to make our a guess a little bit better. And we do the same for our b guess. It gets a little bit better. And so we learned that that is the entirety of SGD. Run that again and again and again. And indeed, we set up something that would run it again and again and again in an animation loop. And we saw that indeed it does optimize our line. The tricky thing for me with deep learning is jumping from this kind of easy to visualize intuition that, yeah, OK, if I run this little derivative on these two things a bunch of times it optimizes this line, to the idea that I can then create a set of layers with hundreds of millions of parameters that in theory can match any possible function, and it's going to do exactly the same thing. So this is where our intuition breaks down, which is that this incredibly simple thing called SGD is capable of creating these incredibly sophisticated deep learning models. We really have to just respect our understanding of the basics of what's going on.
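To make that concrete, here's a minimal sketch of that line-fitting loop in NumPy. The data, the true line, the learning rate of 0.01 and the number of steps are all just illustrative choices, and for simplicity it uses all the points at each step.

```python
import numpy as np

# Made-up data from the line y = 3x + 8, plus a little noise
x = np.random.rand(30)
y = 3 * x + 8 + 0.1 * np.random.randn(30)

a, b = 0., 0.   # start with a (bad) guess
lr = 0.01       # learning rate

for step in range(10000):
    y_pred = a * x + b
    # Derivatives of the mean squared error with respect to a and b
    dloss_da = (2 * (y_pred - y) * x).mean()
    dloss_db = (2 * (y_pred - y)).mean()
    # Make each guess a little bit better
    a -= lr * dloss_da
    b -= lr * dloss_db

print(a, b)     # should end up close to 3 and 8
```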
We know it's going to work, and we can see that it does work. But even when you've trained dozens of deep learning models, it's still surprising that it does work. It's always a bit shocking when you start without any ability to analyze some problem. You start with some random weights, you start with a general architecture, you throw some data in with SGD, and you end up with something that works. Hopefully now it makes sense. You can see why that happens, but it takes doing it a few times to really get that intuitive understanding that, OK, it really does work. So one question about softmax: would you use it for multi-class, multi-label classification, where there are multiple correct answers? Yeah, have you used softmax for multi-class classification? And the answer is absolutely yes. In fact, the example I showed here was such an example. So imagine that these outputs were for cat, dog, plane, fish, and building. So these might be what these five things represent. And so this is exactly showing a softmax for a multi-class output. You just have to make sure that your neural net has as many outputs as you want. And to do that, you just need to make sure that the last weight layer in your neural net has as many columns as you want. The number of columns in your final weight matrix tells you how many outputs. You have a dog and a plane in the picture, and you want both of those to light up. Oh, OK, that is multi-label, not just multi-class classification. So if you want to create something that is going to find more than one thing, then no, softmax would not be the best way to do that. I'm not sure if we're going to cover that in this set of classes; if we don't, we will when we're doing it next year. And then I thought this was a good time to ask two big picture questions that came up earlier. Great. One is, just at some point, can you talk about different neural networks and when to apply them? OK. And the other is, how do you design your architecture and is there a limit on the number of layers? So let's come back to the question about 3 by 3 filters and more generally, how do we pick an architecture? So the question of like, OK, the VGG authors used 3 by 3 filters. The 2012 ImageNet winners used a combination of some 7 by 7 and 11 by 11 filters. What has happened over the last few years since then is people have realized that 3 by 3 filters are just better. The original insight for this was actually that Matt Zeiler visualization paper I showed you. It's really worth reading that paper because he really shows, by looking at lots of pictures of all the stuff going on inside a CNN, that it clearly works better when you have smaller filters and more layers. I'm not going to go into the theoretical justification as to why. For the sake of applying CNNs, all you need to know is there's really no reason to use anything but 3 by 3 filters. So that's a nice simple rule of thumb which always works, 3 by 3 filters. OK, how many layers of 3 by 3 filters? This is where there is not any standard agreed upon technique. Reading lots of papers, looking at lots of Kaggle winners, you will, over time, get a sense of, for a problem of this level of complexity, you tend to need this many layers. There have been various people that have tried to simplify this, but we're really still at a point where the answer is try a few different architectures and see what works. And the same applies to this question of how many filters per layer.
So in general, this idea of having 3 by 3 filters with max pooling and doubling the number of filters each time you do max pooling is a pretty good rule of thumb. How many do you start with? You've kind of got to experiment. And actually, we're going to see today an example of how that works. OK. And then if you had a much larger image, would you still want 3 by 3 filters? Yeah, so if you had a much larger image, what would you do? So, for example, on Kaggle, there's a diabetic retinopathy competition which has some pictures of eyeballs that are at quite a high resolution. I think they're like a couple of thousand by a couple of thousand. The question of how to deal with large images is as yet unsolved in the literature. So if you actually look at the winners of that Kaggle competition, all of the winners resampled that image down to 512 by 512. So I find that quite depressing. It's clearly not the right approach. I'm pretty sure I know what the right approach is. I'm pretty sure that the right approach is to do what the eye does. Our eye does something called foveation, which means that when I look directly at something, the thing in the middle is very high res, very clear, and the stuff on the outside is not. I think a lot of people are now generally in agreement with the idea that we need to come up with an architecture which kind of has this concept of foveation, and then secondly, we need something, and there are some good techniques for this already, called attentional models. An attentional model is something that says, OK, the thing I'm looking for is not in the middle of my view, but my low res peripheral vision thinks that it might be over there, so let's focus my attention over there. And we're going to start looking at recurrent neural networks next week, and we can use recurrent neural networks to build attentional models that allow us to search through a big image to find areas of interest. So that is a very active area of research, but as yet is not really finalized. By the time this turns into a MOOC and a video, I wouldn't be surprised if that has been much better solved. It's moving very quickly. The Matt Zeiler paper showed larger filters because he was showing what AlexNet, the 2012 winner, looked like. And then later on in the paper, he then said, based on what it looks like, here are some suggestions about how to build better models. OK. So let us now finalize our review by looking at fine-tuning. So we learned how to do fine-tuning using the little VGG class that I built, which is like one line of code, vgg.finetune. We also learned about, kind of conceptually, how could you take 1,000 predictions of all the 1,000 ImageNet categories and turn them into two predictions, which is just a cat or a dog, by building a simple linear model. Let's go back to it. This is from last week's lesson. Here it is. It took as input the 1,000 ImageNet category predictions and took the true cat and dog labels as output. And we just created a linear model of that. And so that was this. Let's go down to it. Here it is. So here is that linear model. So it's got 1,000 inputs and two outputs. And so we trained that linear model. It took less than a second to train. And we got 97.7% accuracy. So this was actually pretty effective. And so why was it pretty effective to take 1,000 predictions of is it a cat, is it a fish, is it a bird, is it a poodle, is it a pug, is it a plane, and turn it into is it a cat or is it a dog?
The reason that worked so well is because the original architecture, that ImageNet architecture, was already trained to do something very similar to what we wanted our model to do. We wanted our model to separate cats from dogs. And the ImageNet model already separated lots of different types of cats from lots of different types of dogs from lots of other things as well. So the thing we were trying to do was really just a subset of what ImageNet already does. So that was why starting with 1,000 predictions and building this simple linear model worked so well. This week, you're going to be looking at the State Farm competition. And in the State Farm competition, you're going to be looking at pictures like this one and this one and this one. And your job will not be to decide whether or not it's a person or a dog or a cat. Your job will be to decide, is this person driving in a distracted way or not? That is not something that the original ImageNet categories included. And therefore, this same technique is not going to work this week. So what do you do if you need to go further? What do you do if you need to predict something which is very different to what the original model did? And the answer is to throw away some of the later layers in the model and retrain them from scratch. And that's called fine-tuning. And so that is pretty simple to do. So if we just want to fine-tune the last layer, we can just go model.pop, which removes the last layer. We can then say, make all of the other layers non-trainable. So that means it won't update those weights. And then add a new fully connected layer, a dense layer, to the end with just our dog and cat, our two activations, and then go ahead and fit that model. So that is the simplest kind of fine-tuning: remove the last layer, which previously was trying to predict 1,000 possible categories, and replace it with a new last layer, which we train. In this case, I've only run it for two epochs, so I'm not getting a great result. But if we ran it for a few more, we would get a bit better than the 97.7 we had last time. When we look at State Farm, it's going to be critical to do something like this. So how many layers would you remove? Because you don't just have to remove one. So in fact, if you go back through your Lesson 2 notebook, you'll see after this, I've got a section called Retraining More Layers. And in it, we see that we can take any model and we can say, OK, let's grab all the layers up to the n-th layer. So in this case, we said all the layers after the first fully connected layer and set them all to trainable. And then what would happen if we tried running that model? So with Keras, we can tell Keras which layers do we want to freeze and leave them at their ImageNet decided weights, and which layers do we want to retrain based on the things that we're interested in. And so in general, the more different your problem is to the original ImageNet 1,000 categories, the more layers you're going to have to retrain. Yes, Rachel? I just want to give you a heads up that it's 10 to 8, so it's really 5 to 5, right? Yeah. Yes? What would you say to that? So how do you decide how far to go back in the layers? Two ways. Way number one, intuition. So have a look at something like those Matt Zeiler visualizations to kind of get a sense of what semantic level each of those layers is operating at. And go back to the point where you feel like that level of meaning is going to be relevant to your model. Method number two, experiment.
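To make the experiment concrete, here's a minimal Keras sketch of the two approaches just described: replacing only the last layer, and then making everything from some chosen layer onwards trainable. It assumes an existing Sequential `model` (like the VGG one) and labelled `batches` and `val_batches`; the layer index is illustrative, not the exact notebook code.

```python
from keras.layers.core import Dense

# 1. Fine-tune just the last layer: pop off the old 1,000-way output,
#    freeze everything else, and add a new 2-way softmax output.
model.pop()
for layer in model.layers:
    layer.trainable = False
model.add(Dense(2, activation='softmax'))

# 2. Retrain more layers: make everything from some chosen layer onwards
#    trainable again (their weights stay at the ImageNet values).
first_trainable_idx = 33   # illustrative index, chosen by experiment
for layer in model.layers[first_trainable_idx:]:
    layer.trainable = True

model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=2,
                    validation_data=val_batches,
                    nb_val_samples=val_batches.nb_sample)
```

Because the unfrozen layers keep their ImageNet weights rather than being reset, retraining a few more layers than you strictly need mostly just costs time rather than accuracy.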
It doesn't take that long to train another model starting at a different point. And so I generally do a bit of both, right? So when I know dogs and cats are subsets of the ImageNet categories, I'm not going to bother generally training more than one new layer, one replacement layer. For State Farm, I really had no idea. I was pretty sure that I wouldn't have to retrain any of the convolutional layers because the convolutional layers are all about spatial relationships. And therefore a convolutional layer is all about recognizing how things in space relate to each other. I was pretty confident that figuring out whether somebody is looking at a mobile phone or playing with their radio is not going to use different spatial features. So for State Farm, I've really only looked at retraining the dense layers. And in VGG, there are actually only, let's have a look at the FC block here, there are actually only three dense layers, the two intermediate layers and the output layer. So I just trained all three. But generally speaking, the answer to this is try a few things and see what works the best. Are you setting the? When we retrain the layers, we do not set the weights randomly. We start the weights at their optimal ImageNet levels. And so that means that if you retrain more layers than you really need to, it's not a big problem because the weights are already at the right point. If you randomized the weights of the layers that you're retraining, that would actually kill the earlier layers as well potentially if you made them trainable. There's no point really setting them to random most of the time. We'll be learning a bit more about some of that after the break. But for so far, we have not reset the weights. When we say layer.trainable equals true, we're just telling Keras that when you say fit, I want you to actually use SGD to update the weights in that layer, please. OK, great. So that's been a lot of review, but I think it's very useful. When we come back, we're going to be talking about how to go beyond these basic five pieces to create models which are more accurate. And specifically, we're going to look at avoiding underfitting and avoiding overfitting. Next week, we're going to be doing half a class on review of convolutional neural networks and then half a class of an introduction to recurrent neural networks, which we'll be using for language. So hopefully by the end of this class, you'll be feeling ready to really dig deep into CNNs during the week. And this is really the right time this week to make sure that you're asking questions you have about CNNs because next week, we'll be wrapping up this topic. Let's come back in at 5 past 8. So we have a lot to cover in our next 55 minutes. And I think, actually, this approach of doing the new material quickly, and then you can review it in the lesson notebook on the video and by experimenting during the week, and then reviewing the next week is fine. I think that's a good approach. But I just want to make you aware that the new material over the next 55 minutes will move pretty quickly. And so don't worry too much if not everything sinks in straight away. If you have any questions, of course, please do ask. But also recognize that it's really going to sink in as you study it and play with it during the week. And then next week, we're going to review all of this. So if it's still not making sense, and of course, you've asked your questions on the forum, it's still not making sense. We'll be reviewing it next week. So yes, Richard. 
There are quite a few questions. Let's take a look. I can go to the in class channel. So where do we start from? Up here, right? OK, so if you don't retrain a layer, does that mean the layer remembers what gets saved? So yes, if you don't retrain a layer, then when you save the weights, it's going to contain the weights that it originally had. So that's a really important question. Why would we want to start out by overfitting? We're going to talk about that next. The last conv layer in VGG is a 7 by 7 output. So there are 7 by 7, so 49, boxes, and each one has 512 different things. That's kind of right. But it's not that it recognizes 512 different things. When you have a convolution on a convolution on a convolution on a convolution on a convolution, you have a very rich function with hundreds of thousands of parameters. So it's not that it's recognizing 512 things. It's that there are 512 rich complex functions. And so those rich complex functions can recognize rich complex concepts. So for example, we saw in the video that even in layer 6 there's a face detector which can recognize cat faces as well as human faces. So the later on we get in these neural networks, the harder it is to even say what it is that's being found, because they get more and more sophisticated and complex. So what those 512 things do in the last layer of VGG, I'm not sure that anybody's really got to a point that they could tell you that. I also just wanted to say that deeper layers are getting to see more of the original image. Right. So a deeper layer, because it's a 7 by 7 output, is seeing one seventh of the x and one seventh of the y. So it can see more and more. Okay, I'm going to move on. And the next section is all about making our model better. So at this point we have a model with an accuracy of 97.7%. So how do we make it better? Now, because we have started with an existing model, a VGG model, there are two things, two reasons. Well, in fact, there's always two reasons that you could be less good than you want to be. Either you're underfitting or you're overfitting. Underfitting means that, for example, you're using a linear model to try to do image recognition. You're using a model that is not complex and powerful enough for the thing you're doing. Or it doesn't have enough parameters for the thing you're doing. That's what underfitting is. Overfitting means that you're using a model with too many parameters that you've trained for too long, without using any of the techniques, or without correctly using the techniques, that you've learned about, such that you've ended up learning what your specific training pictures look like rather than what the general patterns in them look like. You will recognize overfitting if your training set has a much higher accuracy than your test set or your validation set. So that means you've learned how to recognize the contents of your training set too well. And so then when you look at your validation set, you get a less good result. So that's overfitting. I'm not going to go into detail on this because any of you who have done any machine learning have seen this before. So any of you who haven't, please look up overfitting on the internet, learn about it, ask questions about it. It is perhaps the most important single concept in machine learning. So it's not that we're not covering it because it's not interesting. It's just that many of you are probably already familiar with it. Underfitting we can see in the same way, but it's the opposite.
If our training accuracy is much lower than our validation accuracy, then we're underfitting. I'm going to look at this now, because in fact you might have noticed that in all of our models so far, the training accuracy has been lower than the validation accuracy, which means we are underfitting. How is this possible? The answer is that the VGG network includes something called dropout, and specifically dropout with a p of 0.5. What does dropout with a p of 0.5 mean? It means that at this layer, which happens at the end of every fully connected block, it deletes 50% of the activations at random; it sets them to 0. That's what a dropout layer does: it sets half of the activations to 0 at random. Why would it do that? Because when you randomly throw away bits of the network, the network can't learn to overfit. It can't learn a network that just memorizes your images, because as soon as it does, you throw away half of it and suddenly it's not working anymore. Dropout is a fairly recent development, I think it's about three years old, Rachel, and it's perhaps the most important development of the last few years, because it's the thing that now means we can train big, complex models for long periods of time without overfitting. That's incredibly important.

But in this case, it seems that we are using too much dropout. The VGG authors decided they needed a dropout of 0.5 in order to avoid overfitting ImageNet, but it seems that for our cats and dogs it's underfitting. So what do we do? The answer is: let's try removing dropout. How do we remove dropout? This is where it gets fun. We can start with our VGG fine-tuned model. I've actually created a little function called VGG fine-tune which creates a fine-tuned VGG model with two outputs. It looks exactly like you'd expect: it creates a VGG model, it fine-tunes it, it returns it. What does fine-tune do? Exactly what we've learnt: it pops off the last layer, sets all the rest of the layers to non-trainable, and adds a new dense layer. I just created a little function that does all that; every time I start writing the same code more than once, I stick it into a function and use it again in the future. It's good practice. I then load the weights that I saved from my last model, so I don't have to retrain it. Saving and loading weights is a really helpful way of avoiding having to refit things. So I now have a model that fits cats and dogs with 97.7% accuracy and underfits.

We can grab all of the layers of the model, and we can then enumerate through them and find the last one which is a convolution. Let's remind ourselves with model.summary(). So that's going to enumerate through all the layers and find the last one that is a convolution. At this point we now have the index of the last convolutional layer; it turns out to be 30. So we can now grab that last convolutional layer, and what we want to try doing is removing dropout from all the layers after it. After the convolutional layers come the dense layers, so let's grab those. Here we are: after the last convolutional layer, we have the dense layers. This is a really important concept in the Keras library, playing around with layers, so spend some time looking at this code and really look at the inputs and the outputs and get a sense of it.
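Here is a minimal sketch of that layer bookkeeping (Keras 1.x-style API; `model` is assumed to be the fine-tuned VGG Sequential model from above):

```python
# Split the fine-tuned VGG-style model at the last convolutional layer.
from keras.layers import Convolution2D

layers = model.layers
# Index of the last convolutional layer (it comes out as 30 for VGG16 here).
last_conv_idx = [i for i, layer in enumerate(layers)
                 if type(layer) is Convolution2D][-1]

conv_layers = layers[:last_conv_idx + 1]   # everything up to and including it
fc_layers   = layers[last_conv_idx + 1:]   # the dense (fully connected) part
```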
So you can see here: these are all the layers up to the last convolutional layer, and these are all of the layers from the last convolutional layer onwards, so all the fully connected layers. I can create a whole new model that contains just the convolutional layers. Why would I do that? Because if I'm going to remove dropout, then clearly I'm going to want to fine-tune all of the layers that involve dropout, that is, all of the dense layers. I don't need to fine-tune any of the convolutional layers, because none of the convolutional layers have dropout; look at the conv block here, it does not have dropout. So I'm going to save myself some time: I am going to pre-calculate the output of the last convolutional layer. You see this model I've built here, the model that contains all the convolutional layers; if I pre-calculate its output, then that's the input to the dense layers that I want to train. So what I do here is say model.predict with my validation batches and model.predict with my training batches, and that gives me the output of the convolutional layers for my training set and for my validation set. And because that's something I don't want to have to do again and again, I save it. So here I just go load_array, and that loads that saved output back in. Then I look at train_features.shape, and this is always the first thing you want to do when you've built something: look at its shape. And indeed, it's what we'd expect: 23,000 images, each one 512 filters by 14 by 14, because I didn't include the final max pooling layer. And indeed, if we go model.summary(), we should find that the last convolutional layer, here it is, has 512 filters and 14 by 14 dimensions. So we have basically built a model that is just a subset of VGG containing all of these earlier layers, we've run our training set and our validation set through it, and we've got the outputs. That's the stuff we want to keep fixed, so we don't want to recalculate it every time.

So now we create a new model which is exactly the same as the dense part of VGG, but we replace the dropout p with zero. And here's something pretty interesting, and I'm going to let you guys think about this during the week: how do you take the previous weights from VGG and put them into this model where dropout is zero? If you think about it, before, we had dropout of 0.5, so half the activations were being deleted at random. Now that I've removed dropout, I effectively have twice as many activations coming through, so I need to take my ImageNet weights and divide them by two. By taking my previous weights, copying them across to my new model, and dividing them by two each time, this new model is going to be exactly as accurate as my old model before I start training, but it has no dropout.

Yes, Rachel? If cats versus dogs only uses a subset of the ImageNet weights, are we wasting computational power? Is it wasteful to have filters that look for things like bookshelves? So, is it wasteful to have, in the cats and dogs model, filters that have been learnt to find things like bookshelves? Potentially it is, but it's OK to be wasteful. The only place it's a problem is if we are overfitting, and if we're overfitting, then we can easily fix that by adding more dropout. So let's try this.
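Here is a rough sketch of that whole sequence, pre-computing the conv features and building the dropout-free dense block with halved weights (Keras 1.x-style API, Theano channel ordering). `conv_layers`, `fc_layers`, `batches` and `val_batches` are assumed from the steps above, `save_array` is the course's bcolz helper, and the exact layer sizes are from VGG16; treat it as a sketch rather than the notebook's exact code.

```python
from keras.models import Sequential
from keras.layers import MaxPooling2D, Flatten, Dense, Dropout

conv_model = Sequential(conv_layers)

# Expensive, so do it once and save the result. The batches here should be
# created with shuffle=False so the features stay aligned with the labels.
trn_features = conv_model.predict_generator(batches, batches.nb_sample)
val_features = conv_model.predict_generator(val_batches, val_batches.nb_sample)
save_array('train_convlayer_features.bc', trn_features)
save_array('valid_convlayer_features.bc', val_features)

def proc_wgts(layer):
    # Dropout of 0.5 was removed, so halve the copied weights to compensate.
    return [w / 2 for w in layer.get_weights()]

fc_model = Sequential([
    MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
    Flatten(),
    Dense(4096, activation='relu'), Dropout(0.),
    Dense(4096, activation='relu'), Dropout(0.),
    Dense(2, activation='softmax'),
])
# Copy the original dense weights across, scaled for the missing dropout.
for new_layer, old_layer in zip(fc_model.layers, fc_layers):
    new_layer.set_weights(proc_wgts(old_layer))

# Train only the dense block on the pre-computed features.
# trn_labels / val_labels are the one-hot labels in the same (unshuffled) order.
fc_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                 metrics=['accuracy'])
fc_model.fit(trn_features, trn_labels, nb_epoch=8, batch_size=64,
             validation_data=(val_features, val_labels))
```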
So we now have a model which takes the output of the convolutional layers as input, gives us cats versus dogs as output, and has no dropout. Now we can just go ahead and fit it. Notice that the input to this is my 512 by 14 by 14 features, my outputs are my cats and dogs as usual, and I train it for a few epochs. And here's something really interesting: dense layers take very little time to compute, whereas a convolutional layer takes a long time to compute. Think about it, you're computing 512 filters, each 3 by 3 by 512, for each of 14 by 14 positions; that is a lot of computation. So in a deep learning network, your convolutional layers are where all of your computation is being taken up. Look: when I train just my dense layers, it's only taking 17 seconds. Super fast. On the other hand, the dense layers are where all of your memory is taken up, because between this 4,096 layer and this 4,096 layer there are 4,096 by 4,096, roughly 16 million, weights, and between the previous layer, which was 512 by 7 by 7 after max pooling, that's 25,088, there are 25,088 times 4,096 weights. So this is a really important rule of thumb: your dense layers are where your memory is taken up, and your convolutional layers are where your computation time is taken up.

So it took me a minute or so to run 8 epochs. That's pretty fast. And holy shit, look at that: 98.5%. You can see now I am overfitting, but even though I'm overfitting, I am doing pretty damn well. Overfitting is only bad if you're doing it so much that your accuracy is bad, and in this case it looks like this amount of overfitting is actually fine. For cats and dogs this is about as good as I've gotten, and in fact if I'd stopped it a little earlier, you can see it was really good. The winner of the competition got 98.8, and here I've got 98.75, and there are some tricks I'll show you later that always give you a little bit extra. So this would definitely have won cats and dogs if we had used this model.

Yes, Rachel? You can absolutely perform dropout on a convolutional layer, and indeed nowadays people normally do. I don't quite remember the VGG days, I guess that was two years ago, but in those days they didn't. Nowadays the general approach would be to have something like dropout of 0.1 before your first layer, dropout of 0.2 before the next, then 0.3, 0.4, and finally dropout of 0.5 before your fully connected layers. That's kind of the standard. If you then find that you're underfitting or overfitting, you can modify all of those probabilities by the same amount. But generally speaking, if you drop out in an early layer, you're losing that information for all of the future layers, so you don't want to drop out too much in the early layers; you can feel better about dropping out more in the later layers. Of course I did, I knew it was your question. Yes sir. Tuning your dropout is very much black magic. Yes. How is that different? Well, this is how you manually tune whether you're overfitting or underfitting; another way to do it would be to modify the architecture to have fewer or more filters, but that's actually pretty difficult to do. Is the point that we didn't need dropout anyway? Perhaps it was, but VGG comes with dropout, and when you're fine-tuning, you start with what you start with. And dropout is super useful in general. We are overfitting here, so my hypothesis is that we should maybe try a little less dropout. But before we do, I'm going to show you some better tricks. The first trick I'm going to show you is one that lets you avoid overfitting without deleting information.
Dropout deletes information, so we don't want to do it unless we have to. So instead of dropout, here is a list; you should refer to this every time you're building a model that is overfitting. Five steps. Step one, add more data. This is a Kaggle competition, so we can't do that. Step two, use data augmentation, which we're about to learn. Step three, use more generalizable architectures. We're going to learn about that after this. Step four, add regularization. That generally means dropout. There's another type of regularization where you basically add up the values of all of your weights, multiply that sum by some small number, and add it to the loss function; basically you're saying that having higher weights is bad. That's called L2 regularization if you add up the squares of your weights, or L1 regularization if you add up their absolute values. Keras supports that as well, and it's also popular. I don't think anybody has a great sense of when you use L1 or L2 regularization and when you use dropout; I use dropout pretty much all the time, and I don't particularly see why you would need both, but I just wanted to let you know that that other type of regularization exists. And then lastly, if you really have to, reduce architecture complexity, so remove some filters. But that's pretty hard to do if you're fine-tuning, because how do you know which filters to remove? So really, now that we have dropout, concentrate on the first four.

Yes? So random forests randomly take subsets of the features; is dropout avoiding the problem in a similar way? Yeah, so in random forests we randomly select subsets of variables at each point, and that's kind of what dropout is doing: dropout is randomly throwing away half the activations. There's actually a fantastic analogy with random forests here. Just like going from decision trees to random forests was this huge step, which was basically to create lots of decision trees with some random differences, dropout is effectively creating lots of different neural networks, automatically, with different subsets of features that have been randomly selected.

OK, data augmentation is very, very simple. Data augmentation is something which takes a cat and turns it into lots of cats. That's it. Actually, it does it for dogs as well. You can rotate, you can flip, you can move up and down and left and right, you can zoom in and out. And in Keras you do it by, rather than what we've always done before, which was image.ImageDataGenerator() with nothing inside the parentheses, now passing all these other things: flip it horizontally at random, zoom in a bit at random, shear at random, rotate at random, move it left and right at random, and move it up and down at random. Once you've done that, when you create your batches, rather than doing it the way we did it before, you simply pass in that data generator, the augmenting data generator. It's very important to notice that the validation set does not use it, because the validation set is the thing we want to check against, so we shouldn't be fiddling with it at all; the validation set has no data augmentation and no shuffling, it's constant and fixed. The training set, on the other hand, we vary as much as we can.
So shuffle its order and add all these different types of augmentation. How much augmentation do you use? This is one of the things that Rachel and I would love to automate. For now, there are two methods. Method one: use your intuition. The best way to use your intuition is to take one of your images, add some augmentation, and check whether the results still look like cats. If it's so warped that it no longer looks like a photo of a cat, you've done it wrong. So what's shown here is a fairly small amount of data augmentation. Method two: experiment. Try a range of different augmentations and see which one gives you the best results. So here, if we add some augmentation, everything else is exactly the same, except we can't pre-compute anything anymore. Earlier on we pre-computed the output of the last convolutional layer, but we can't do that now, because every time this cat approaches our neural network it's a little bit different: it's rotated a bit, or it's flipped, or it's moved around, or it's zoomed in or out. So unfortunately, when we use data augmentation we can't pre-compute anything, and things take longer. Everything else is the same, though. We grab our fully connected model, we add it to the end of our convolutional model, and this is the one without dropout, compile it, fit it, and now, rather than taking 9 seconds per epoch, it takes 273 seconds per epoch, because it has to calculate through all the convolutional layers because of the data augmentation.

In terms of the results, here we have not managed to get back up to that 98.7 accuracy. Actually, no, I have; I've run a few more. If I keep running them, I start overfitting, so it's a little hard to tell: my validation accuracy is moving around quite a lot because my validation set is a little bit on the small side, so it's hard to tell whether this data augmentation is helping or hindering. I suspect what we're finding here is that maybe we're doing a little bit too much augmentation, so if I went back and reduced my various ranges by, say, half, I might get a better result than this. But really, this is something to experiment with, and I had better things to do than experiment with this. You get the idea, though. Data augmentation is something you should always do; there's never a reason not to use data augmentation, it's just a question of what kind and how much. So, for example, what kind: should you flip in x and y? Clearly for dogs and cats, no; you pretty much never see a picture of an upside-down dog. So would you do vertical flipping in this particular problem? No, you wouldn't. Would you do rotations? Yeah, you very often see cats and dogs that are kind of up on their hind legs, or the photo is taken a little bit unevenly. And you would have zooming, because sometimes you're close to the dog and sometimes further away. So use your intuition to think about what kind of augmentation makes sense. Data augmentation for color? That's an excellent point. Something I didn't add to this, but probably should have, is that there is a channel shift parameter for the data generator in Keras, and that will slightly change the colors. And indeed, that's a great idea for images like these, because you have different white balance, different lighting and so forth; I think that would be a great idea. So I hope during the week people will take this notebook, and somebody will tell me what is the best result they've got, and I bet that that data augmentation will include some fiddling around with the colors.
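Here is a sketch of that augmentation setup and of stitching the dense model back onto the convolutional layers (Keras 1.x-style API; the specific ranges, paths and batch sizes are illustrative rather than the lesson's tuned values, and `conv_model` / `fc_model` are assumed from the earlier steps):

```python
from keras.preprocessing import image

# Augmenting generator for the training set only.
gen = image.ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
                               height_shift_range=0.1, shear_range=0.05,
                               zoom_range=0.1, horizontal_flip=True)
batches = gen.flow_from_directory('data/dogscats/train', target_size=(224, 224),
                                  class_mode='categorical', shuffle=True,
                                  batch_size=64)
# No augmentation and no shuffling for the validation set.
val_batches = image.ImageDataGenerator().flow_from_directory(
    'data/dogscats/valid', target_size=(224, 224),
    class_mode='categorical', shuffle=False, batch_size=64)

# With augmentation we can no longer pre-compute conv features, so attach the
# dense model to the conv layers and train end to end (conv layers frozen).
for layer in conv_model.layers:
    layer.trainable = False
conv_model.add(fc_model)
conv_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                   metrics=['accuracy'])
conv_model.fit_generator(batches, samples_per_epoch=batches.nb_sample,
                         nb_epoch=2, validation_data=val_batches,
                         nb_val_samples=val_batches.nb_sample)
```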
If you changed all the images to monochrome, could you still use them against the test set? No, because the Kaggle competition test set is in color. If you're throwing away color, you're throwing away information, and the Kaggle competition is asking: is this a cat or is this a dog? Part of seeing whether something is a cat or a dog is looking at what color it is, so you're making that harder. You could run it on the test set and get answers, but they're going to be less accurate, because you've thrown away information. What happened to the flatten layer? The answer is that it was there. Where was it? Oh gosh, I forgot to add it back to this one. I actually changed my mind about whether to include the flatten layer, and where to put it, and where to put max pooling. It'll come back later. So this is a slightly old version; thank you for picking that up. Can you do dropout on the raw images? The simple answer is yes, you could; there's no reason I can't put a dropout layer right here, and that's going to drop out raw pixels. It turns out that's not a good idea. Throwing away input information is very different to throwing away modelled information: throwing away modelled information effectively lets you avoid overfitting the model, but throwing away input information just loses data. So you probably don't want to do that. Yes. So to clarify, the augmentation is at random; I just showed you eight examples of the augmentation. What the augmentation does is say: at random, rotate by up to 20 degrees, move by up to 10% in each direction, shear by up to 5%, zoom by up to 10%, and flip at random half the time. So I just said, OK, here are eight cats. But what happens is that every single time an image goes into a batch, it gets randomized, so it's effectively an infinite number of augmented images. And would it make a difference if, instead of using vtt.me, you computed it and treated it as...? That doesn't have anything to do with data augmentation, so maybe we'll discuss that one on the forum.

OK. The final concept to learn about today is batch normalization. Batch normalization, like data augmentation, is something you should always do. Why didn't VGG do it? Because it didn't exist then; batch norm is about a year old, maybe 18 months. Here's the basic idea. Anybody who's done any machine learning probably knows that one of the first things you want to do is take your input data, subtract its mean, and divide by its standard deviation. Why is that? Well, imagine that our inputs were something like 40 and minus 30 and 1. You can see that the outputs are all over the place; of the intermediate values, some are really big and some are really small. So if we changed a weight which impacted x1, it's going to change the loss function by a lot, whereas if we change a weight which impacts x3, it'll change the loss function by very little. The different weights have very different gradients, basically very different amounts by which they affect the outcome. And as you go further down through the model, that effect multiplies, and particularly when we're using something like softmax, which has an e to the power of something in it, you end up with these crazy big numbers. So when you have inputs that are of very different scales, it makes the whole model very fragile, which means it's harder to learn the best set of weights and you have to use smaller learning rates.
This is not just true for deep learning; it's true of pretty much every kind of machine learning model, which is why everybody who's been through the MSAN program here hopefully learned to normalize their inputs. If you haven't done any machine learning before, no problem, just take my word for it: you always want to normalize your inputs. It's so common that pretty much all of the deep learning libraries will normalize your inputs for you with a single parameter, and we're doing it in ours. Because image pixel values only range from 0 to 255, you don't generally worry about dividing by the standard deviation with images, but you do generally worry about subtracting the mean. So you'll see that the first thing our model does is this thing called preprocess, which subtracts the mean, and the mean is something you can basically look up on the internet: it's the mean of the ImageNet data, these three fixed values.

Now what's that got to do with batch norm? Well, imagine that somewhere along the line in our training we ended up with one really big weight. Then suddenly one of our layers is going to have one really big number, and now we have exactly the same problem as we had before: the whole model becomes very unresilient, very fragile, very hard to train, things are going to be all over the place, and some numbers could even get slightly out of control. So what do we do? Really, what we want is to normalize not just our inputs but our activations as well. You may think, OK, no problem, let's just subtract the mean and divide by the standard deviation for each of our activation layers. Unfortunately that doesn't work. SGD is very bloody-minded: if it wants to increase one of the weights and you try to undo that by subtracting the mean and dividing by the standard deviation, the next iteration it's going to try to make it higher again. If SGD decides that it wants to make your weights bigger, it will do so. So just normalizing the activation layers doesn't work, and batch norm is a really neat trick for avoiding that problem. Before I tell you the trick, I'll just tell you why you want to use it: A, it's about 10 times faster than not using it, particularly because it often lets you use a 10 times higher learning rate, and B, it reduces overfitting without removing any information from the model. And those are the two things you want: less overfitting and faster models. I'm not going to go into detail on how it works; you can read about it during the week if you're interested. But in brief outline: the first step normalizes the intermediate layers in just the same way as the input layers can be normalized, the thing I just told you wouldn't work on its own. But then it does something else which is critical: it adds two more trainable parameters, one which multiplies all the activations and one which is added to all the activations, so effectively those can undo the normalization. Both of those are incorporated into the calculation of the gradients, so the model now knows that it can rescale all of the weights if it wants to, without pushing one of the weights way off into the distance. And it turns out that this does actually control the weights in a really effective way. So that's what batch normalization is. The good news is that, to use it, you just add a BatchNormalization layer.
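Here is a minimal sketch of the two kinds of normalization being discussed: subtracting the published ImageNet channel means from the inputs (that's what VGG's preprocessing does; the course's version also reverses RGB to BGR), and adding BatchNormalization layers between fully connected layers. Keras 1.x-style API, Theano channel ordering; the layer sizes are just for illustration.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Lambda, Dense, BatchNormalization

# Input normalization: subtract the fixed ImageNet per-channel mean.
vgg_mean = np.array([123.68, 116.779, 103.939],
                    dtype=np.float32).reshape((3, 1, 1))
def vgg_preprocess(x):
    return x - vgg_mean

conv_part = Sequential([
    Lambda(vgg_preprocess, input_shape=(3, 224, 224)),
    # ... convolutional blocks would go here ...
])

# Activation normalization: batch norm between layers, with two trainable
# parameters per feature that let SGD rescale/shift activations if it wants.
fc_part = Sequential([
    Dense(4096, activation='relu', input_shape=(25088,)),
    BatchNormalization(),
    Dense(4096, activation='relu'),
    BatchNormalization(),
    Dense(1000, activation='softmax'),
])
```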
Specifically, where do you want it? You can put it after dense layers, you can put it after convolutional layers; really, you should put it after all of your layers. Here's the bad news: VGG wasn't originally trained with batch normalization, and adding batch normalization changes all of the weights. I think there is a way to calculate a new set of weights with batch normalization; I haven't gone through that process yet. So what I did today was actually grab the entirety of ImageNet and train this model on all of it, and that gave me a model which was basically VGG plus batch normalization. That's the model I'm loading here; this is the ImageNet large scale visual recognition competition 2012 dataset, and I trained this set of weights on the entirety of ImageNet so that I had, basically, VGG plus batch norm. Then I fine-tuned the VGG-plus-batch-norm model by popping off the end and adding a new dense layer, and then I trained it, and these epochs only took 6 seconds, because I pre-calculated the inputs to this part. Then I added data augmentation and started training that, and then I ran out of time because it was class. So I think this was on the right track; I think if I'd had another hour or so it would have got there, and you guys can play with this during the week, because this now has all the pieces together: batch norm, data augmentation, and as much dropout as you want. You'll see what I've got here is dropout layers with an arbitrary amount of dropout, and the way I've set it up, you can say, OK, create batch-norm layers with whatever amount of dropout you want, and then later on you can say, OK, now change the weights to use this new amount of dropout. So this is kind of the ultimate ImageNet fine-tuning setup, and I haven't seen anybody create it before, so this is a useful tool that didn't exist until today, and hopefully during the week we'll keep improving it. Interestingly, I found that when I went back to even 0.5 dropout it was still massively overfitting, so it seems that batch normalization allows the model to be so much better at finding the optimum that I actually needed more dropout rather than less. Anyway, as I said, this is all something I was doing today, and I haven't quite finalized it.

What I will show you, though, is something I did finalize, which I did on Sunday, which is going through, end to end, an entire model-building process on MNIST. I want to show you this entire process, and then you guys can play with it. MNIST is a great way to really experiment with and revise everything we know about CNNs, because it's very fast to train, the images are only 28 by 28, and there are also extensive benchmarks on the best approaches to MNIST. It's very, very easy to get started with MNIST, because Keras actually contains a copy of it, so we can just go from keras.datasets import mnist, call mnist.load_data(), and we're done. Now, MNIST images are grayscale, and everything in Keras, in terms of the convolutional stuff, expects there to be a number of channels, so we have to use expand_dims to add this empty dimension. So this is 60,000 images, with one channel, which are 28 by 28. If you try to use grayscale images and get weird errors, I'm pretty sure this is what you've forgotten to do: add this empty dimension. You actually have to tell it there is one channel, because otherwise it doesn't know how many channels there are. The other thing I had to do was take the y values, the labels, and one-hot encode them.
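Here is a small sketch of that MNIST setup, the empty channel dimension, the one-hot encoding, and the input normalization (Keras 1.x-style API, Theano "channels first" ordering assumed):

```python
import numpy as np
from keras.datasets import mnist
from keras.utils.np_utils import to_categorical

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Keras's conv layers expect a channel dimension, so add an empty one:
# (60000, 28, 28) -> (60000, 1, 28, 28).
X_train = np.expand_dims(X_train, 1).astype('float32')
X_test  = np.expand_dims(X_test, 1).astype('float32')

# One-hot encode the labels: 5 -> [0 0 0 0 0 1 0 0 0 0], and so on.
y_train = to_categorical(y_train)
y_test  = to_categorical(y_test)

# Normalize inputs with the training set's mean and standard deviation.
mean_px, std_px = X_train.mean(), X_train.std()
def norm_input(x):
    return (x - mean_px) / std_px
```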
Otherwise the labels were just the actual numbers, 5, 0, 4, 1, 9, and we need to one-hot encode them so that each label becomes a vector of ten 0s with a 1 in the position of that digit, because remember, that's the thing that the softmax function is trying to approximate; that's how the linear algebra works. So there are the two things I had to do to preprocess this: add the empty dimension and do my one-hot encoding. Then I normalize the input by subtracting the mean and dividing by the standard deviation, and then I try to build a linear model. Now, I can't fine-tune from ImageNet, because ImageNet is 224 by 224 and this is 28 by 28, and ImageNet is full color and this is grayscale, so we're going to start from scratch; all of these models are going to start from random. A linear model needs to normalize the input and needs to flatten it, because I'm not going to treat it as an image, I'm going to treat it as a single vector, and then I create my one dense layer with 10 outputs, grab my batches, and train my linear model. And you can see that, generally speaking, the best way to train a model is to start by doing one epoch with a pretty low learning rate. The default learning rate is 0.001, which is actually a pretty good default, so you'll find that nearly all of the time I just accept the default learning rate and do a single epoch to get it started. Once you've got it started, you can set the learning rate really high, and 0.1 is about as high as you ever want to go, and do another epoch, and that's going to move super fast. Then gradually you reduce the learning rate, basically by an order of magnitude at a time: I go to 0.01, do a few epochs, and keep going like that until you start overfitting. So I got to the point where I had 92.7% accuracy on the training set and 92.4% on the test set, and I thought, OK, that's about as far as I can go. That's a linear model; not very interesting. The next thing to do is to add one extra dense layer in the middle, so one hidden layer. This is what in the 80s and 90s people thought of as a neural network: one hidden layer, fully connected. That still takes only 5 seconds to train. Again we do the same thing: an epoch with a low learning rate, then pop up the learning rate for as long as we can, gradually decrease it, and we get to 94% accuracy. You wouldn't really expect a fully connected network to do that well, so let's create a CNN.

This was actually the first architecture I tried. Basically I thought, OK, we know VGG works pretty well, so how about I create an architecture that looks like VGG but is much simpler; this is just 28 by 28. VGG generally has a couple of convolutional layers of 3 by 3 and then a max pooling layer, and then a couple more with twice as many filters, so I just tried that. This is my inspired-by-VGG model. I thought, OK, after two lots of max pooling it'll go from 28 by 28, to 14 by 14, to 7 by 7, and that's probably enough, and I added my two dense layers again. So I didn't use any science here, just some intuition, and it actually worked pretty well. After my learning rate of 0.1 I had an accuracy of 98.9% and a validation accuracy of 99%, and then after a few epochs at 0.01 I had an accuracy of 99.75%, but look, my validation accuracy is only 99.2%, so I'm overfitting. And this is the trick: start by overfitting. Once you know you're overfitting, you know that you have a model that is complex enough to handle your data. So at this point I thought, OK, this is a good architecture; it's capable of overfitting.
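Here is a sketch of that VGG-inspired MNIST CNN together with the learning-rate schedule described above (Keras 1.x-style API, reusing `norm_input` from the earlier sketch; the exact filter counts and dense sizes are from memory and may differ slightly from the lesson notebook):

```python
from keras import backend as K
from keras.models import Sequential
from keras.layers import Lambda, Convolution2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Lambda(norm_input, input_shape=(1, 28, 28)),
    Convolution2D(32, 3, 3, activation='relu'),
    Convolution2D(32, 3, 3, activation='relu'),
    MaxPooling2D(),
    Convolution2D(64, 3, 3, activation='relu'),
    Convolution2D(64, 3, 3, activation='relu'),
    MaxPooling2D(),
    Flatten(),
    Dense(512, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# One epoch at the default learning rate (0.001) to get things started...
model.fit(X_train, y_train, nb_epoch=1, batch_size=64,
          validation_data=(X_test, y_test))
# ...then crank it up, then step back down by an order of magnitude at a time.
K.set_value(model.optimizer.lr, 0.1)
model.fit(X_train, y_train, nb_epoch=1, batch_size=64,
          validation_data=(X_test, y_test))
K.set_value(model.optimizer.lr, 0.01)
model.fit(X_train, y_train, nb_epoch=4, batch_size=64,
          validation_data=(X_test, y_test))
```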
So now let's use the same architecture and reduce the overfitting, while reducing the complexity of the model no more than necessary. Step one of my five-step list is data augmentation, so I added a bit of data augmentation, used exactly the same model as before, and trained it for a while. I found this time I could actually train it for even longer, as you can see, and I started to get some pretty good results here, 99.3, 99.34, but by the end you can see I'm massively overfitting again: 99.6 training versus 91.1 test. So data augmentation alone is not enough. And I said to you we'll always use batch norm anyway, so then I add batch norm, and I use batch norm on every layer. Notice that when you use batch norm on convolutional layers you have to add axis=1. I am not going to tell you why; I want you guys to read the documentation about batch norm and try to figure out why you need this, and discuss it on the forum, because it's a really interesting analysis. If you really want to understand batch norm, understand why you need this here. If you don't care about the details, that's fine, just type axis=1 any time you have batch norm on a convolutional layer. So this is a pretty good quality, modern network: you can see I've got convolutional layers, they're 3 by 3, and then I have batch norm, and then I have max pooling, and at the end I have some dense layers. It's actually a pretty decent-looking model, and not surprisingly it does pretty well. I train it for a while at 0.1, I train it for a while at 0.01, I train it for a while longer, and you can see I get up to 99.5%. That's not bad, but by the end I'm starting to overfit. So add a little bit of dropout, and remember what I said: nowadays the rule for dropout is to gradually increase it through the network. I only had time yesterday to try adding one layer of dropout right at the end, but as it happened, that seemed to be enough. When I just added one layer of dropout to the previous model and trained it for a while at 0.1 and then 0.01, it was like, oh great, my accuracy and my validation accuracy are pretty similar, and my validation accuracy is around 99.5% to 99.6% towards the end. So I thought, OK, that sounds pretty good; 99.5% or 99.6% accuracy on handwriting recognition is pretty good.

But there's one more trick you can do which makes every model better, and it's called ensembling. Ensembling refers to building multiple versions of your model. What I did was take all of the code from that last section and put it into a single function, so this is exactly the same model I had before, and these are the exact steps I took to train it, my learning rates of 0.1 and 0.01, and at the end of this it returns a trained model. Then I said: six times, fit a model, and return a list of the results. So models, at the end of this, contains six trained models using my preferred network. Then what I could do was say: go through every one of those six models and predict the output for everything in my test set. So now I have predictions for the 10,000 test images from six models, and I can take the average across the six models. I'm basically saying: here are six models, they've all been trained in the same way, but from different random starting points, so the idea is that they will make their errors in different places. Let's take the average of them, and I get an accuracy of 99.7%. How good is that? It's very, very, very good. It's so good that if we go to the academic list of the best MNIST results of all time, and remember many of these were specifically designed for handwriting recognition, it comes in right up there.
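Here is a sketch of that ensembling step. The helper names are hypothetical stand-ins for "build the batch-norm-plus-dropout CNN above and run the same learning-rate schedule"; the averaging of predictions is the actual idea being described.

```python
import numpy as np

def fit_model():
    model = get_model_bn_do()       # hypothetical: builds the BN + dropout CNN above
    train_with_lr_schedule(model)   # hypothetical: the 0.1 -> 0.01 schedule above
    return model

# Six copies of the same network, each from a different random starting point.
models = [fit_model() for _ in range(6)]

# Average the six models' predicted probabilities over the 10,000 test images.
all_preds = np.stack([m.predict(X_test, batch_size=256) for m in models])
avg_preds = all_preds.mean(axis=0)

# Accuracy of the ensemble (y_test here is the one-hot test labels).
ensemble_acc = (avg_preds.argmax(axis=1) == y_test.argmax(axis=1)).mean()
print(ensemble_acc)
```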
So one afternoon's work gets us into the list of the best results ever found on this dataset. As you can see, it's not rocket science; it's all stuff you've learnt before or you've learnt now, and it's a process which is pretty repeatable and can get you right up to the state of the art. Now, it was easier to do this on MNIST, because I only had to wait a few seconds for each of my trainings to finish to get to this point. On State Farm it's going to be harder, because you're going to have to think about how you do it in the time you have available, and how you do it in the context of fine-tuning, and things like that. But hopefully you can see that you now have all of the tools at your disposal to create literally a state-of-the-art model. I'm going to make all of these notebooks available; you can play with them, you can try to get a better result on dogs and cats. As you can see, what I've done here is kind of incomplete: I haven't found the best data augmentation, I haven't found the best dropout, and I probably need to, so there's some work for you to do.

So here are your assignments for this week. This is all review now, so I suggest you go back and actually read the notebooks; there's quite a bit of prose in every one of them. Hopefully now you can go back and read that prose, and some of the prose that at first was a bit mysterious is now going to make sense; you're going to be like, oh, OK, I see what it's saying. If you read something and it doesn't make sense, or if you read something and you want to check, oh, is this another way of saying this other thing, ask on the forum. These are all notebooks that we've looked at already, and you should definitely review them and ask us something on the forum. Make sure that you can replicate the steps shown in the lesson notebooks we've seen so far, using the technique in the "how to use the provided notebooks" section we looked at at the start of class. If you haven't yet got into the top 50% of dogs versus cats, you've now got the tools to do so; if you get stuck at any point, ask on the forum. And then this is your big challenge: can you get into the top 50% of State Farm? Now, this is tough. The first step to doing well on a Kaggle competition is to create a validation set that gives you accurate answers. So create a validation set, and then make sure that the validation set accuracy is the same as what you get when you submit to Kaggle. If it isn't, you don't have a good enough validation set yet, and creating a validation set for State Farm is really your first challenge; it requires thinking long and hard about the evaluation section on that page and what it means. Then think about which layers of the pre-trained network you should be retraining. I have actually read through the top 20 results from the competition, which closed about three months ago, and I think all of the top 20 methods are pretty hacky, pretty ugly. I feel like there's a better way to do this that's kind of within our grasp, so I'm hoping that somebody is going to come up with a top 20 result for State Farm that is elegant. We'll see how we go; if not this year, maybe next year. Because, honestly, nobody on Kaggle came up with a really good way of tackling this: they got some really good results, but with some really convoluted methods. OK, and then as you go through, if there's anything about any of these techniques, these five pieces, that you're not clear about, please go and have a look at this additional information and see if that helps.
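As one rough starting point for the validation-set challenge above, here is a sketch of carving a validation set out of a Kaggle-style train directory (hypothetical paths; written for a dogs-and-cats-style layout). Note the caveat in the comments: for State Farm specifically, a purely random split is unlikely to match Kaggle's evaluation, which is exactly the point of the exercise.

```python
# Rough sketch: move a random sample of training images into valid/.
# NB: for State Farm, a purely random split will probably be too optimistic,
# because the test set is evaluated on drivers you never see in training;
# think about holding out whole drivers instead.
import os
from glob import glob
from shutil import move
import numpy as np

path = 'data/statefarm/'
for d in glob(path + 'train/c?'):
    os.makedirs(d.replace('/train/', '/valid/'))

files = glob(path + 'train/c?/*.jpg')
np.random.shuffle(files)
for f in files[:2000]:
    move(f, f.replace('/train/', '/valid/'))
```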
All right, that was a pretty quick run-through, so I hope everything goes well, and I will see you next week.