This is week 7 of 7, although in a sense it's week 7 of 14 — no pressure and no commitment, but how many of you are thinking you might want to come back for Part 2 next year? Oh shit, okay. That's great. When we started this, I thought if 1 in 5 people came back for Part 2 I'd be happy, so that's the best thing I've ever seen. Thank you so much. Okay, well in that case that's perfect, because today I'm going to show you — and I think you'll be surprised, and maybe a little overwhelmed — what you can do with the little set of tools you've worked with already. So this is going to be kind of part one of this lesson: a whirlwind tour of a bunch of different architectures. And different architectures not just in the sense that some of them will be better at doing what they're doing, but in the sense that some of them will be doing different things. And I want to set your expectations and say that looking at an architecture and understanding how it does what it does is something that took me quite a few weeks to get an intuitive feel for. So don't feel bad, because as you'll see, it's kind of like un-programming: we describe something that we think would be great if the model knew how to do, then we say fit, and suddenly the model knows how to do it, and we look inside it and think, how did it learn to do that? The other thing I want to mention is that, having said that, everything we look at today uses only the things we've already covered. In fact, in the first half we're only going to use CNNs. There's going to be no cropping of images, no filtering, nothing hand-tuned. It's just going to be a bunch of convolutional or dense layers with activation functions — but we're going to put them together in some interesting ways.
So let me start with one of the most important developments of perhaps the last year or two, which is called ResNet. ResNet won the 2015 ImageNet competition, and I was delighted that it won, because it's an incredibly simple and intuitively understandable concept, and it's very simple to implement. In fact, what I would like to do is show you. So let me describe as best as I can how ResNet works — in fact, before I describe how it works, I'll show you why you should care that it works. So let's for now just put aside how ResNet works: it's another architecture, a lot like VGG, that's used for image classification or other CNN-type things — it's actually broader than just image classification — and we use it in just the same way as we use the VGG16 class you're familiar with. We just say create a ResNet. There are different sized ResNets; this one is ResNet-50 because it's the smallest one, and it works super well. I've started adding a parameter to my versions of these networks — I've added it to the new VGG as well — which is include_top. It's the same as what the Keras author has started doing with his models. Basically the idea is that if you say include_top=False, you don't have to go model.pop afterwards to remove the layers if you want to fine-tune. include_top=False means only include the convolutional layers, basically, and I'm going to stick my own final classification layers on top of that. So when I do this, it's not going to give me the last few layers.
Maybe the best way to explain that is to show you: when I create this network, I've got this thing at the end that says if include_top, then add the last few layers, with this last dense fully connected layer that maps to the ImageNet classes — the thousand categories. And if include_top is false, then don't add those additional layers. So this is just the thing which means you can load in a model which is specifically designed for fine-tuning — it's a little shortcut. And as you'll see shortly, it has some really helpful properties.
So we're in the cats and dogs competition here. The winner of the cats and dogs competition had an accuracy of around 0.985 on the public leaderboard and 0.989 on the private leaderboard. We use this ResNet model in the same way as usual: we grab our batches, and we pre-compute some features. In fact, for every single CNN model I'm going to show you today, we're always going to pre-compute the convolutional features. So everything we see today will be things you can do without retraining any of the convolutional layers, which means pretty much everything I train will train in a small number of seconds. And that's because, in my experience, when you're working with photos it's almost never helpful to retrain the convolutional layers. So we can stick something on top of our ResNet in the usual way, and we can say go ahead and compile and fit it, and in 48 seconds it's created a model with 0.986 accuracy, which would be first or second on the private leaderboard. So that's pretty impressive.
More impressive is — and I'm going to show you how this works in a moment — that ResNet is actually designed not to be used with a standard bunch of dense layers, but with something called a global average pooling layer, which I'm about to describe. So for now, let me just show you what happens if instead of the previous model I use this model, which has three layers, and compile it and fit it: I get 0.9875 in three seconds. In fact, I can even tell it that I don't want to use 224 by 224 images, but 400 by 400 images. And if I do that, and then get batches saying I want 400 by 400 images, and create those features, and compile and fit, I get 99.3. So this is kind of off the charts: to go from somewhere around 98.5 to 99.3, we're reducing the amount of error by somewhere around a third to a half. So this is why you should be interested in ResNet. It's incredibly accurate, and we're using it for the thing it's best at: ResNet was originally trained on ImageNet, and the dogs and cats competition looks a lot like ImageNet images — they're single pictures of a single thing that's reasonably large in the picture, and they're not very big images on the whole. So this is something which this kind of ResNet approach is particularly good for.
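To make that workflow concrete, here is a minimal, hedged sketch of the same idea using Keras' built-in ResNet50 rather than the course's own resnet50.py, so the details (preprocessing, exact layers, Keras version) differ from the notebook. The arrays trn_imgs, trn_labels, val_imgs and val_labels are assumed placeholders, not names from the lecture.

```python
from keras.applications.resnet50 import ResNet50
from keras.models import Model
from keras.layers import Input, GlobalAveragePooling2D, Dense

# Convolutional/residual body only (include_top=False), at a non-standard 400x400 size.
body = ResNet50(include_top=False, weights='imagenet', input_shape=(400, 400, 3))

# trn_imgs, val_imgs, trn_labels, val_labels: assumed preprocessed image arrays and
# one-hot labels (not defined here).
trn_feats = body.predict(trn_imgs, batch_size=32)   # pre-compute the conv features once
val_feats = body.predict(val_imgs, batch_size=32)

# Tiny top model: global average pooling, then a single softmax over cats vs dogs.
inp = Input(shape=trn_feats.shape[1:])
out = Dense(2, activation='softmax')(GlobalAveragePooling2D()(inp))
top = Model(inp, out)
top.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
top.fit(trn_feats, trn_labels, validation_data=(val_feats, val_labels), epochs=3)
```

Because only the tiny top model is trained on pre-computed features, each epoch takes seconds, which is the point being made above.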
So I do actually want to show you how it works, because first, I think it's fascinating and awesome. And I'm going to stick to the same approach that we've used so far when we've talked about architectures, which is that any shape represents a matrix of activations, and any arrow represents a layer operation — that is, a convolution or a dense layer with an activation function. ResNet looks a lot like VGG. So imagine there's some part of the model down here that we're not going to worry about too much — we're kind of halfway through the model, and there's some hidden activation layer that we've got to.
With VGG, the approach is generally to go: a 3 by 3 conv that gives you some activations, another 3 by 3 conv that gives you some activations, another 3 by 3 conv that gives you some activations, and then from time to time it also does a max pooling. Each of these arrows represents a convolution layer. ResNet looks a lot like this — in fact, it has basically exactly that path, which is a bunch of convolutions and relus on top of each other. But it does something else, which is there's this bit that comes out here — and remember, when we have two arrows coming into a shape, that means we're adding things. And you'll notice there are no shapes anywhere on the way here: this arrow does not represent a convolution, it does not represent a dense layer, it actually represents the identity. In other words, we do nothing at all. And this whole thing here is called a ResNet block. So if we represent a ResNet block as a square, ResNet is just a whole bunch of these blocks stacked on top of each other, with an input, which is the input data, and the output, of course, is our predictions.
So another way of looking at this is just to look at the code, and I think the code is nice and intuitive to understand. So let's have a look at this thing — in the code they call it an identity block. Here's the code for what I just described. And you might notice that everything I just selected here looks like a totally standard VGG block: I've got a Conv2D, a batch normalization, and an activation function — I guess it looks like our improved VGG because it's got batch norm. Another convolution, another batch norm, another activation. Another Conv2D, another batch norm. But then this is the magic that makes it ResNet. This single line of code, and it does something incredibly simple: it takes the result of all of those convolutions and adds it to our original input. So normally the output of some block is equal to convolutions of convolutions of convolutions of the input to that block. But here we're doing something different: the output of a block — call it the hidden state at time t plus 1 — is equal to the convolutions of the convolutions of the convolutions of the hidden state at time t, plus the hidden state at time t. That is the magic which makes it ResNet.
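Here is a hedged sketch of that idea, not the actual code from Keras' resnet50.py: the real ResNet-50 identity block uses a 1x1/3x3/1x1 bottleneck, whereas this sketch just uses 3x3 'same' convolutions so the shapes are obviously compatible with the shortcut.

```python
from keras.layers import Conv2D, BatchNormalization, Activation, add

def identity_block(x, filters):
    """Conv/batchnorm/relu steps whose output is added back onto the block's input."""
    shortcut = x                                    # identity path: no weights at all
    for i, f in enumerate(filters):
        x = Conv2D(f, (3, 3), padding='same')(x)    # 'same' padding keeps height/width fixed
        x = BatchNormalization()(x)
        if i < len(filters) - 1:                    # the last conv gets its relu after the merge
            x = Activation('relu')(x)
    x = add([x, shortcut])                          # the one line that makes it a ResNet
    return Activation('relu')(x)

# Example use on a 256-channel feature map; the last filter count must match the
# input's channel count so that add() works:
#   x = identity_block(x, [64, 64, 256])
```

There is no max pooling and no stride inside the block, so the shapes stay the same all the way through and the addition is always possible — which is the point made in the question below about dimensionality.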
So why is it that this can give us such a huge improvement in the state of the art in such a short period of time? This is, interestingly, somewhat controversial. The authors of the paper that originally developed this described it in a number of ways — they basically gave two main reasons. The first is they claim that you can create much deeper networks this way, because when you're back-propagating, back-propagating through an identity is easy: you're never going to have an explosion of gradients or an explosion of activations. And indeed, this did turn out to be true — the authors created a ResNet with over a thousand layers and got very good results. But it also turned out to be a bit of a red herring, because a few months ago some other folks created a ResNet which was not at all deep — I think it had like 40 or 50 layers — but instead was very wide, with a lot of activations, and that did even better.
So it's one of these funny things where it seems even the original authors might have been wrong about why they built what they built. The second reason they gave seems to have stood the test of time rather better. Which is: if we take this equation and rejig it — subtract h(t) from both sides — that gives us h(t+1) minus h(t). So the hidden activations at the next time period minus the hidden activations at the previous time period equals — and I'm going to replace all this with R, for ResNet block — a ResNet block: h(t+1) − h(t) = R(h(t)). Or it's actually not quite a ResNet block, it's just a convolution of a convolution of a convolution applied to the previous hidden state. When you write it like that, it might make you realise something, which is: all of the weights we're learning are in R. So we're learning a bunch of weights which allow us to make our previous guess at the predictions a little bit better. It's basically saying: let's take the previous predictions we've got, however we got to them, and try to build a set of things which makes them a little bit better. In statistics this is called the residual — the residual is the difference between the thing you're trying to predict and your current predictions. So what the authors of ResNet basically did here was design an architecture which, without us having to do anything special, automatically learns how to model the residuals: it learns how to build a bunch of layers which continually, slightly improve the previous answer. For those of you who have more of a machine learning background, you would recognise this as essentially being boosting. Boosting refers to the idea of having a bunch of models where each model tries to predict the errors of the previous model; if you have a whole chain of those, you can predict the errors on top of the errors, add them all together, and boosting is a way of getting much improved ensembles. So ResNet is not manually doing boosting — it's not manually doing anything. It's just a single extra line of code. It's all in the architecture.
Yes, Rachel? A question about dimensionality: I would have assumed that by the time we were close to the output, the dimensions would be so different that element-wise addition wouldn't be possible between the last layer and the first layer. So it's important to note that this input tensor is the input tensor to the block. You'll see there's no max pooling inside here and no strides inside here, so the dimensionality remains constant throughout all of these lines of code, so we can add them up; and then we can do our strides or max pooling, and then we do another identity block. So we're only adding it back to the input of the block, not the input of the original image, and that's indeed what we want. We want the input to each block to be our best prediction so far — that's effectively what it's doing. Then qualitatively, how does this compare to dropout? In most ways it's unrelated to dropout, and indeed you can add dropout to ResNet — at the end of a ResNet block, after this merge, you can add dropout. So ResNet is not a regularization technique per se. Having said that, it does seem to have excellent generalization characteristics, and if memory serves correctly — yes, I just searched this entire code base for dropout and it didn't appear. So for ImageNet they didn't use any dropout; they didn't find it was necessary. But this is very problem dependent.
If you've only got a small amount of data, you may well need dropout. And I'll explain another reason that we don't need dropout here in just a moment — in fact, I'll do that right now. Remember what I did here at the end was create a model which had a special kind of layer called a global average pooling layer. This is the next key thing I want to teach you about today — it's a really important concept, and it'll come up a couple more times during today's class. Let's describe what it is; it's actually very, very simple. Here is the output of the pre-computed ResNet on our 400 by 400 images — it's 13 by 13 — and this is the 224 by 224 version. So on the 400 by 400, the pre-computed convolutional, or residual, blocks give us a 13 by 13 output with 2048 filters in it. One way of thinking about this would be to say: each of those 13 by 13 grid cells could potentially try to say how catty or how doggy that cell is. And so then, rather than max pooling, which takes the maximum of that grid, we could do average pooling, which says: across those 13 by 13 areas, what is the average amount of doggyness, and what is the average amount of cattiness? And that's what global average pooling does. Global average pooling is identical to saying average pooling over the 13 by 13 — because the input to it is 13 by 13 — with a flatten on the end as well. So in other words, whatever the input to a global average pooling layer is, it will take all of the x and all of the y coordinates and just take the average, for every one of those 2048 filters. So let's take a look here. What this is doing is taking an input of 2048 by 13 by 13, and returning an output which is just a single vector of 2048, and that vector is: on average, how much does this whole image have each of those 2048 features?
And ResNet was actually originally trained with global average pooling — you can see that because this is the ResNet code; in fact, sorry, it's 7 by 7 in the original. This was written before the GlobalAveragePooling2D layer existed, so they just did it manually and put in an average pooling of 7 by 7. So because ResNet was trained originally with this layer here, it was trained such that the last identity block was creating features that were designed to be averaged together — and that means that when we used this tiny little architecture we got the best results, because that was how ResNet was originally designed to be used.
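As a quick illustration of what global average pooling is doing — nothing more than an average over the spatial grid for each filter — here is a tiny numpy sketch (the 2048 x 13 x 13 shape matches the channels-first numbers quoted above; a real Keras layer would do the same thing inside the model):

```python
import numpy as np

feats = np.random.rand(2048, 13, 13)   # stand-in for the ResNet conv output for one image
gap = feats.mean(axis=(1, 2))          # shape (2048,): "how much of feature f is in this
                                       # image on average", ignoring where in the grid it was
```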
If you had a wider network, without the input fed forward to the output activations, couldn't you get the same result? The extra activations in the wider network could pass the input all the way through all the layers. Well, you can in theory have convolutional filters that don't do anything — that just act as the identity — but the point is that having to learn that means learning lots and lots of filters designed to do it. Maybe the best way to describe this is that everything I'm telling you about architectures is, in some ways, irrelevant. You could create nothing but dense layers at every level of your model, and dense layers have every input connected to every output, so every architecture I'm telling you about is just a simplified version of that — we're just deleting some of those connections. But it's really helpful to do that: it's really helpful to help our SGD optimizer by making it so that every possible thing it can do is a thing we want it to do. So yes, in theory a convnet, or indeed a fully connected net, could learn to do the same thing that ResNet does; in practice it would take a lot of parameters and a lot of time to do so. And this is why we care about architectures: in practice, having a good architecture makes a huge difference. That's a good question. Interesting.
Another question: would it be fair to say that if VGG was trained with average pooling it would yield better results? I'm not sure. So let's talk about that a little bit. One of the reasons — or maybe the main reason — that ResNet didn't need dropout is that we're using global average pooling: there are a hell of a lot fewer parameters in this model. Remember, the vast majority of the parameters in a model are in the dense layers, because if you've got N inputs and M outputs you have N times M connections. So in VGG — I can't quite remember — that first dense layer has something like 300 million parameters, because it has every feature of the convolutional layer, by each of the locations of the convolutional layer, by every one of the 4,096 outputs. So it just creates a lot of parameters and makes it very easy to overfit. With global average pooling, and indeed without any dense layers, we have a lot fewer parameters, so it's going to generalise better. It also generalises better because we're treating every one of those 7 by 7 or 13 by 13 areas in the same way — we're saying how doggy or catty each of them is, and averaging them. So it turns out that these global average pooling models do seem to generalise very well, and we're going to be seeing more of that in a moment.
Why do we use global average pooling instead of max pooling? Well, it depends — you can try both. In this case, the images in the dogs and cats competition are basically images where the entire frame is a dog or a cat. So if you did max pooling you would be saying: which bit of that 7 by 7 or 13 by 13 grid that we've downsampled down to has the most doggyness or cattyness — and I only care about that. That's unlikely to give you as good a result as looking at every part of the image and averaging it all together. On the other hand — and I haven't tried this — in the fisheries competition the fish is generally a very small part of each image. So maybe in the fisheries competition you should use a global max pooling layer. Give it a try and tell us how it goes, because in that case you actually don't care about all the parts of the image which have nothing to do with fish. So that would be a very interesting thing to try.
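If you do want to try that experiment, the change is a one-liner. This is a hedged sketch, not the notebook's code; feat_shape and n_classes are placeholders:

```python
from keras.layers import Input, GlobalAveragePooling2D, GlobalMaxPooling2D, Dense
from keras.models import Model

def top_model(feat_shape, n_classes, pool='avg'):
    """Tiny top model over pre-computed conv features; swap the pooling to compare
    averaging over the whole grid vs keeping only the single most 'fishy' cell."""
    inp = Input(shape=feat_shape)
    x = (GlobalAveragePooling2D() if pool == 'avg' else GlobalMaxPooling2D())(inp)
    return Model(inp, Dense(n_classes, activation='softmax')(x))
```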
ResNet is very powerful, but it has not been studied much at all for transfer learning. This is not to say it won't work well for transfer learning — I just literally haven't found a single paper yet where somebody has analysed its effectiveness for transfer learning. And to me, 99.9999% of what you will work on will be transfer learning, because if you're not using transfer learning it means you're looking at a data set so different to anything anybody has looked at before that none of the features in any model are remotely helpful for you — and that's going to be rare. So nearly all of the work I've seen on transfer learning, both in terms of Kaggle winners and in terms of papers, uses VGG. And I think one of the reasons for that is, as we talked about in lesson one — actually lesson two — the VGG architecture really is designed to create layers of gradually increasing semantic complexity, and all the work I've seen on visualising layers tends to use VGG or something similar, like that Matt Zeiler stuff we saw, or those Jason Yosinski videos we saw. And so we've seen how VGG and networks like it create gradually more complex representations, which is exactly what we want for transfer learning, because it lets us ask how different this new domain is to the previous domain, and then we can pick a layer far enough back — we can try a few — where the features seem to work well. So for that reason we're going to go back to looking at VGG for the rest of these architectures, and I'm going to look at the fisheries competition.
The fisheries competition is actually very interesting. The pictures come from something like a dozen boats, each with a fixed camera, and there are both daytime and nighttime shots. So every picture has the same basic shape and structure for each of the 12 boats, because it's a fixed camera, and then somewhere in there, most of the time, there's one or more fish, and your job is to say what kind of fish it is. The fish are pretty small, and one of the things that makes this interesting is that this is the kind of somewhat weird, somewhat complex thing, quite different to ImageNet, which is exactly the kind of thing you're going to have to deal with any time you're doing a computer vision problem, or any kind of CNN problem. It's very likely that the thing you're doing won't be quite the same as what all the academics have been looking at, so trying to figure out how to do a good job of the fisheries competition is a great example.
So when I started on the fisheries competition I just did the usual thing, which was to create a VGG16 model, fine-tune it to have just eight outputs — because we have to say which of eight types of fish we see — and then, as per usual, pre-compute the convolutional layers using the pre-trained VGG network; everything after that just used those pre-computed convolutional features. And as per usual, the first thing I did was to stick a few dense layers on top and see how that goes. The nice thing about this is you can see each epoch takes less than a second to run — so when people talk about needing lots of data and lots of time, it's not really true, because most of the time you're only using pre-computed convolutional features. And on our validation set we get an accuracy of 96.2% and a loss of 0.18 — that's pretty good; it seems to be recognising the fish pretty well.
But here's the problem: there is all kinds of data leakage going on, and this is one of the most important concepts to understand when it comes to building any kind of model or any kind of machine learning project: leakage. There was a paper — I think it actually won the KDD best paper award — from Claudia Perlich and some of her colleagues which studied data leakage. Data leakage occurs when something about the target you're trying to predict is encoded in the things you're predicting with, but that information either won't be available or won't be helpful in practice when you go to use the model. For example, in the fisheries competition, different boats fish in different parts of the sea, and different parts of the sea have different fish in them, and so in the fisheries
competition, if you just use something representing which boat the image came from, you can get a pretty accurate validation set result. And what I mean by that, for example, is here's something which is very tricky: this is a list of the sizes of each photo, along with how many times each size appears. You can see I've just gone through each photo, opened it using PIL, the Python Imaging Library, and grabbed its size, and you can see that there's basically a small number of sizes that appear. It turns out that if you create a simple linear model that says: any image of size 1192 by 670 — what kind of fish is that? Anything 1280 by 720 — what kind of fish is that? — you get a pretty accurate model, because these are the different ships: different ships have different cameras, and different cameras have different resolutions. And this isn't helpful in practice, because what the fisheries people actually want to do is use this to find out when people are illegally or accidentally overfishing or fishing in the wrong way — so if they're bringing up dolphins or something, they want to know about it. So any model that says I know what kind of fish this is because I know which boat it is, is entirely useless. So this is an example of leakage.
In this particular paper I mentioned, the authors looked at machine learning competitions and discovered that over 50% of them had some kind of data leakage. I spoke to Claudia after she presented that paper, and I asked her whether she thought regular machine learning projects inside companies would have more or less leakage than that, and she said a lot more — because in competitions (and she's a three-time KDD Cup winner, so she knows this stuff very well) people have tried really hard to clean up the data ahead of time, because they know that lots and lots of people are going to be looking at it, and if there is leakage you can be almost certain that somebody's going to find it, because it's a competition. Whereas if you have leakage in your own project, you won't even know about it, and you'll try to put the model into production and discover that it doesn't work as well as you thought it would.
Oh, and I was just going to add that it might not even help you in the competition, if your test set is brand new boats that weren't in your training set.
Great, so let's talk about that. Trying to win a Kaggle competition and trying to do a good job are somewhat independent, and so when I'm working on Kaggle I focus on trying to win the Kaggle competition — I have a clear metric and I try to optimize the metric, and sometimes that means finding leakage and taking advantage of it. So in this case, step number one for me in the fisheries competition was to ask: okay, can I take advantage of this leakage? And I want to be very clear: this is the exact opposite of what you would want to do if you were trying to help the fisheries people create a good model. Having said that, there's $150,000 at stake, and I could donate that to the Fred Hollows Foundation and get lots of people their sight back, so, you know, winning this would be good. So let me show you how I tried to take advantage of this leakage — which is totally legal in a Kaggle competition — and see what happened, and then I'll talk more about Rachel's issue after that.
So the first thing I did was make a list, for every file, of what its image dimensions were, and I did that for the validation and the training set. I normalized them by subtracting the mean and dividing by the standard deviation, and then I created an almost exact copy of the previous model I
showed you — this one — but this time, rather than using the sequential API, I used the functional API. Other than that, the only difference is this part, where what I've done is take not just the input which is the output of the last convolutional layer of my VGG model, but a second input, and the second input is: what size image is it? I should mention I've one-hot encoded those image sizes, so they're treated as categories. So I now have two inputs: one is the output of the VGG convolutional layers, one is the one-hot encoded image size. I batch-normalize that, obviously, and then right at the very last step I concatenate the two together. So my model is basically a standard last-few-layers-of-VGG model — three dense layers — and I have my image input, and then another input, and at the end they're concatenated, and that creates an output. What this means is that the last dense layer can now learn to combine the image features along with this metadata. This is useful for all kinds of things other than taking advantage of leakage in a dastardly way. For example, if you were doing a collaborative filtering model, you might have information about the user such as their age, their gender, the favourite genres they mentioned on a survey — this is how you incorporate that kind of metadata into a standard neural net.
So I merge the two together and run it, and initially it's looking encouraging: if we go back and look at the standard model, we've got 0.84, 0.94, 0.95; this multi-input model is a little better: 0.86, 0.95, 0.96. So that's encouraging. Interestingly, the model without the leakage gets somewhere around 0.96, 0.97 — it's kind of all over the place, which isn't a great sign, but let's say somewhere around 0.97. This multi-input model, on the other hand, does not get better than that — its best is also around 0.97. Why is that? This is very, very common when people try to utilise metadata in deep learning models: it often turns out that the main thing you're looking at — in this case the image — already encodes everything that your metadata has anyway. In this case, the size of the image tells us which boat it comes from, but you can also just look at the picture and see which boat it comes from. So by the later epochs, the leakage turned out not to be helpful anyway. It's amazing how often people assume they need to find metadata and incorporate it into the model, and how often it turns out to be wasted time, because the raw, real data — the audio or the pictures or the language or whatever — turns out to encode all of that metadata anyway.
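Here is a hedged reconstruction of that multi-input idea with the functional API — the layer sizes and names are illustrative, not the notebook's, and the shapes are placeholders:

```python
from keras.layers import Input, Flatten, Dense, Dropout, BatchNormalization, concatenate
from keras.models import Model

conv_feat_shape = (14, 14, 512)   # assumed shape of the pre-computed VGG conv features
n_sizes = 8                       # assumed number of distinct image sizes in the data

img_inp  = Input(shape=conv_feat_shape)   # input 1: pre-computed VGG conv features
size_inp = Input(shape=(n_sizes,))        # input 2: one-hot encoded image size (a proxy for the boat)

x = Flatten()(img_inp)
x = Dense(512, activation='relu')(x)
x = BatchNormalization()(x)
x = Dropout(0.5)(x)
x = Dense(512, activation='relu')(x)

s = BatchNormalization()(size_inp)        # normalise the metadata input

merged = concatenate([x, s])              # combine image features with the metadata
out = Dense(8, activation='softmax')(merged)   # 8 fish classes

model = Model(inputs=[img_inp, size_inp], outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit([trn_feats, trn_sizes_onehot], trn_labels, ...)   # arrays assumed, not defined here
```

The same pattern — a second Input concatenated in just before the final layer — is how you would bring user age, gender or survey answers into a collaborative filtering model, as mentioned above.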
Finally, I wanted to go back to what Rachel was talking about, which is: what would have happened if this did work? Let's say this had actually given us a much better validation result than the non-leakage model. If I then submitted it to Kaggle and my leaderboard result was great, that would tell me that I have found leakage that the Kaggle competition administrators didn't, and I'm possibly on my way to winning the competition. Having said that, the Kaggle competition administrators first and foremost try to avoid leakage, and indeed if you do try to submit this to the leaderboard you'll find it doesn't do that great. I haven't really looked into it yet, but it seems the competition administrators have made some attempt to remove the leakage. The way we did that when I was at Kaggle would be to do things like stratified sampling, where we would say, oh, there are way more albacore from this ship — let's enforce that every ship has to have the same proportion of each kind of fish, or something like that, to limit the leakage. But honestly it's a very difficult thing to do, and this impacts a lot more than just machine learning competitions. In every one of your real-world projects you're going to have to think long and hard about how you can replicate real-world conditions in your test set. Maybe the best example I can come up with is: when you put your model into production, it will probably be a few months after you grabbed the data and trained it — how much has the world changed? So wouldn't it be great if, instead, you could create a test set with data from a few months later than the data you trained on? Then you can replicate the situation you'll actually have when you put your model into production.
Two questions. One is just a note that they're releasing another test set later on in the fisheries competition. A question: did you do two classifications, one for the boats and one for the fish, or is that a waste of time? I have two inputs, not two outputs, right? My second input is the one-hot encoded size of the image, which I assumed is a proxy for the boat ID — and some discussion on the Kaggle forum suggests that that's a reasonable assumption. We're going to look at multi-output in a moment — in fact, we're going to do it now. Oh yes, another question: can you find a good way of isolating the fish in the images and then do the classification on that? Let's do that now. Very good.
All right, multi-output. There are a lot of nice things about how Kaggle competitions are structured, and one of the things I really like is that in most of them you can create your own data sources as long as you share them with the community. One of the people in the fisheries competition has gone through and, by hand, put a little square around every fish — which is called annotating the data set; specifically, this kind of annotation is called a bounding box. A bounding box is a box in which your object of interest sits. Because of the rules of Kaggle you have to make that available to everybody in the Kaggle community, which he did — he provides a link on the Kaggle forum — so I went ahead and downloaded those. There are a bunch of JSON files that basically say, for each image, for each fish in that image, its height, width, x and y. The details of the code don't matter too much: I found the largest fish in each image and created a list of them, so I've now got my training bounding boxes and my validation bounding boxes. For images that didn't have a fish, I just used 0, 0, 0, 0 — that's my empty bounding box. So, as always, when I want to
understand new data, the first thing to do is to look at it. When we're doing computer vision problems it's very easy to look at the data, because it's pictures. So I went ahead and created this little show-bounding-box function, which I tried on an image, and here is the fish, and here is the bounding box.
There are two questions — I don't know if you wanted to get to a good stopping point in your thought. One is: is adding metadata useful for both CNNs and RNNs, or just for CNNs? And the other one is: VGG required images all the same size for training; in the fisheries case, are there different sized images being used for training, and how do you train a model on images with different dimensions? So regarding whether metadata is useful for RNNs or CNNs: it's got nothing to do with the architecture, it's entirely about the semantics of the data. If your text or audio or whatever unstructured data in some way encodes the same information that is in the metadata, the metadata is unlikely to be helpful. For example, in the Netflix prize, in the early stages of the competition people found it helpful to link to IMDB and bring in information about the movies; in later stages they found it wasn't. The reason is that in the later stages they had figured out how to get that information from the ratings themselves — the ratings basically contained all the same information. How do we deal with different sized images? I'm about to show you some tricks, but so far throughout this course we have always resized everything to 224 by 224. Whenever you use get_batches, it's resizing things to 224 by 224, because that's what ImageNet used — with the exception that in the ResNet model I just showed you, we resized to 400 by 400 instead. So far, and in fact for everything we're doing today, we're going to resize everything to be the same size. I had a question about the 400 by 400: is that because there are two different ResNet models? No, it's not — I'll show you how that happened in a moment; it's a little sneak peek of what we're coming to.
Okay, so now that we've got these bounding boxes, here is a complexity — a practical one and a Kaggle one. The Kaggle complexity is that the rules say you're not allowed to manually annotate the test set, so we can't put bounding boxes on the test set. So if, for example, we wanted to go through and crop out just the fish in every image and train on those crops, this is not enough to do that, because we can't do it on the test set — we don't have bounding boxes for it. And the practical version of this is basically the same: in practice, they're trying to create an automatic warning system to let them know if somebody is taking the wrong kind of fish; they don't want somebody drawing a box on every one. So what we're going to do is build a model that can find these bounding boxes automatically. And how do we do that? It may surprise you to know that we use exactly the same techniques we've always used. Here is the exact same model again, but this time, as well as having something at the end with 8 softmax outputs, we also have something with 4 linear outputs — i.e. 4 outputs with no activation function. What we're going to do is, when we train this model, we now have 2 outputs. So when we compile it, we're going to say: okay, this model has 2 outputs; one is the 4 outputs with no activation function, one is the 8 softmax outputs. When I compile it, the first of those I want you to optimize for mean squared error, and the second of those I want you to optimize for cross-entropy loss, and the
first of them I want you to multiply the loss by 0.001, because the mean squared error on the location of a fish in an image is going to be a much bigger number than the cross-entropy on the category — so this just makes them about the same size. And then when you train it, use the bounding boxes as the labels for the first output and the fish types as the labels for the second output. So what it's going to have to do is figure out how to come up with a bunch of dense layers capable of doing these two things simultaneously. In other words, we now have something that looks like this: 2 outputs, 1 input. Notice that the two outputs — you don't have to do it this way, but in the way I've got it — each have just their own dense layer. It would be possible to do it like this instead, that is, each of the two outputs could have two dense layers of their own. In the version I've got, though — and we could talk about the pros and cons — both of my last layers have to use the same set of features to generate both the bounding boxes and the fish classes.
So let's fit this. We just go fit as usual, but now that we have two outputs we get a lot more information: we get the bounding box loss, we get the fish classification loss, and we get the total loss — which is 0.001 times the bounding box loss plus the classification loss; you can see the bounding box loss is over 1,000 times bigger, which is why I multiplied it by 0.001 — and then we get the total validation loss, the validation bounding box loss and the validation classification loss.
So here is something pretty interesting. The first thing I want to point out is that after I fit it a little bit, we actually get a much better accuracy. Now maybe this is counter-intuitive, because our model has exactly the same capacity as before — our previous dense layer is of size 512 — and before, that last layer only had to do one thing, which is to tell us what kind of fish it was; now it has to do two things: it has to tell us where the fish is and what kind of fish it is. But yet it's still done better. Why? Well, the reason is that by telling it we want it to use those features to figure out where the fish is, we've given it a hint about what to look for — we've really given it more information about what to work on. So interestingly, even if we didn't use the bounding box for anything else and just threw it away at this point, we'd already have a much better model. And notice also that the model is much more stable: 97.8, 98, 98, 98, 98.2 — before, our loss was all over the place. So by having multiple outputs we've created a much more stable, resilient and accurate classification model. And we also have bounding boxes. The best way to see how accurate the bounding boxes are is to look at the pictures, so I do a prediction for the first 10 validation examples — it's important to use the validation set any time you're looking at how good your model is — and this time I extended the function that shows the bounding boxes to draw a yellow box for my prediction and the default red box for the actual. And there it is. I just want to make it very clear here: we haven't done anything clever. We didn't program any of this. We just said there is an output which will have 4 values and no activation function, and I want you to use mean squared error to find a set of weights such that the bounding boxes and your predictions are as close as possible.
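Here is a hedged sketch of that two-output setup — the layer sizes and names are illustrative rather than the notebook's, and the input shape is an assumed placeholder for the pre-computed VGG features:

```python
from keras.layers import Input, Flatten, Dense, Dropout, BatchNormalization
from keras.models import Model

conv_feat_shape = (14, 14, 512)       # assumed shape of the pre-computed VGG conv features

inp = Input(shape=conv_feat_shape)
x = Flatten()(inp)
x = Dense(512, activation='relu')(x)  # one shared stack of dense layers feeds both outputs
x = BatchNormalization()(x)
x = Dropout(0.5)(x)
x = Dense(512, activation='relu')(x)

bb_out    = Dense(4, name='bb')(x)                           # 4 linear outputs: the bounding box
class_out = Dense(8, activation='softmax', name='fish')(x)   # 8 softmax outputs: the fish type

model = Model(inputs=inp, outputs=[bb_out, class_out])
model.compile(optimizer='adam',
              loss={'bb': 'mse', 'fish': 'categorical_crossentropy'},
              loss_weights={'bb': 0.001, 'fish': 1.0},        # scale the box loss down ~1000x
              metrics=['accuracy'])
# model.fit(trn_feats, {'bb': trn_bboxes, 'fish': trn_labels}, ...)  # arrays assumed, not defined
```

Keras then reports the per-output losses plus the weighted total, which is where the bounding box loss, classification loss and combined loss mentioned above come from.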
And somehow it has done that. So that is to say: very often, if you're trying to get a neural net to do something, your first step — before you create some complex programmatic heuristic — is to just ask the neural net to do it, and very often it does.
Why do both in the same model instead of training the boxes first and then feeding that as input to recognise fish? We can, right? But the first thing I want to point out is that even then, I would still have the first stage do both at the same time: the more compatible tasks you can give it — like where is the fish and what kind of fish is it — the more it can create an internal representation that is as appropriate as possible. Now, if you want to go away over the next couple of weeks and crop out these fish and create a second model, I can almost guarantee you'll get towards the top of the table in this competition. And the reason I can almost guarantee that is because there was quite a similar competition on Kaggle last year, maybe earlier this year, which was trying to identify particular whales — right whales — literally saying which individual whale it is, and all of the top three in that competition did some kind of bounding box prediction and some kind of cropping, and then modelled on the cropped images.
Are the four bounding box outputs the vertical and horizontal size of the box and the two coordinates for its center? It's whatever we were given, which was not quite that — it was the height, width, x and y.
So how many of the people in this Kaggle competition are using this sort of model? If you came up with this with a bit of tinkering, why do you think you would actually stay in the top 10 — or would this just be an obvious thing that people would tend to do, so your ranking would basically drop over time as everyone else incorporates it? I'm going to show you a few techniques that I used this week, but as you'll see, they're all very basic, very normal. We're at a point now, in this $150,000 competition where over 500 people have entered, where I am currently 20th. So no, the stuff you're learning in this course is not at all widely known — there's never been an applied deep learning course like this before. The people who are above me in the competition are people who have figured these things out over time and read lots of papers and studied and whatever else. So I definitely think that people taking this course, particularly if some of you teamed up, would have a very good chance of winning this competition, because it's a perfect fit for everything we've been talking about — and particularly you can collaborate on the forums, stuff like that. And I should mention, this 20th place is without any cropping yet; this is just using the whole image, which is clearly not the right way to tackle this. I was actually intentionally trying not to do too well, because I'm going to have to release this notebook to everybody before I can say I've done this — and because it's $150,000, I didn't want to say here's a way to get into the top 10, because that's not fair to everybody else. So yeah, to answer your question: by the end of the competition, to win one of these things you've got to do everything right at every point, and every time you fail you have to keep trying again — tenacity is part of winning these things. I know from experience the feeling of being on top of the leaderboard and
waking up the next day and finding that five people have passed you, and you're just like... but then the thing is, you then know they have found something that is there and that you haven't found yet, and that's part of what makes competing in a Kaggle competition so different to writing academic papers or looking at old Kaggle competitions that are long gone. It's a really great test of your own processes. What you'll probably find yourself doing is repeatedly fucking around with hyperparameters and minor architectural details, because it's just so addictive, until eventually you step back and ask: what's a totally different way of thinking about this problem? So I hope some of you will seriously consider putting an hour a day into a competition, because I learned far more doing that than from everything else I've ever done in machine learning. It's totally different to just playing around, and every real-world project I've done since has greatly benefited from that experience.
Okay, so to give you a sense of this: that was number seven; here's number six. I can't even see that fish, but it's done a pretty good job, right? And I think maybe it kind of knows that people tend to stand around where the fish is, or something, because it's pretty hard to see — and remember, this is just a 224 by 224 image. So this model is doing a pretty great job, and the amount of time to train it was under 10 seconds. Is there a way to find the bounding box when you're not given one? Great — okay, so I've got a section here on data augmentation which I'm going to skip, because you already know about data augmentation; you can check the notebook later if you're interested.
Before we look at finding things without manually annotating bounding boxes, I want to talk more about different size images. So let's talk about sizes — specifically, in which situations is our model going to be sensitive to the size of the input, like a pre-trained model with pre-trained weights? It's all about what these layer operations are exactly. If it's a dense layer, then there's a weight going from every input to every output, so if you have a different sized input that's not going to work at all, because the weight matrix for your dense layer is simply the wrong size — who knows what it should do. What if it's a convolutional layer? If it's a convolutional layer, then we have a little set of weights — say a 3x3 block — for each different feature, and that 3x3 block is going to be slid over the image to create the outputs. If the image is bigger, it doesn't change the number of weights; it just means that block is going to be slid around more, and the output will be bigger. A max pooling layer doesn't have any weights. A batch normalization layer simply cares about the number of filters in the previous layer. So really, when you think about it, the only layer that really cares what size your input is is a dense layer — and remember that with VGG, nearly all of the layers are convolutional layers. So that's why we can not only say include_top equals false, we can also choose what size we want. If you look at my new version of the VGG model, I've actually got something here that says if the size is not 224 by 224, don't try to add the fully connected blocks at all — just return what we have. So in other words, if we cut off our architecture, whatever it is, before any dense layers happen, then we're going to be able to use it on any size input, to at least create those
convolutional features. And that's what I'm about to show you now. For a convolutional layer, shouldn't the input image size be fixed? There's no particular reason it has to be fixed. A dense layer has to be fixed because a dense layer has a specific weight matrix, and the input to that weight matrix is generally the flattened-out version of the previous convolutional layer, whose size depends on the size of the image. But the convolutional weight matrix simply depends on the filter size, not on the image size.
So let's try it. Specifically, we're going to try building something called a fully convolutional net, which is going to have no dense layers at all. So the input, as usual, will be the output of the last VGG convolutional layer, but this time, when we create our VGG16 model, we're going to tell it we want it to be 640 by 360. Now be careful here: when we talk about matrices we talk about rows by columns; when we talk about images we talk about columns by rows. So a 640 by 360 image is a 360 by 640 matrix. I mention this because I screwed it up — but I knew I'd screwed it up because I always draw pictures, so when I drew the picture and saw this little squashed boat, I knew I'd screwed it up. Do the weights we're loading here for the CNN layers also have batch normalization? Yeah, this is the exact same VGG16 network we've been using since I added batch norm — nothing's been changed other than this one piece of code I just showed you, which says you can use different sizes, and if you do, don't add the fully connected layers.
So now that I've got this VGG model expecting a 640 by 360 input, I can add my top layers, and this time my top layers are going to get an input of size 22 by 40. Normally our VGG final convolutional layer is 14 by 14, or 7 by 7 if you include the final max pooling. In this case, though, because we've told it we're passing 640 by 360 rather than 224 by 224, we end up with a different output shape. So if we now tried to pass that to the same dense layer we used before, it wouldn't work — it would be the wrong size. But we're actually going to do something very different anyway: we're not going to use any pre-trained fully connected weights. Instead we're going to go conv, max pool, conv, max pool, conv, max pool, conv, global average pooling. The best way to look at that is to see what's happening to our shape: it goes from 22 by 40, then after a max pooling 11 by 20, then after a max pooling 5 by 10, and then, because this is rectangular, I did the last max pooling with a (1, 2) shape so it gives a roughly square result — about 5 by 5. Then I do a convolutional layer with just 8 filters — and remember, there are 8 types of fish. There are no other weights after this; in fact, even the dropout isn't doing anything because I've set my p value to 0, so ignore that dropout layer. So we go straight from a convolutional layer — grid size 5 by 5, with 8 filters — to averaging across the 5 by 5, and that gives us something of size 8. So if we now say: please train this model, and please try to make these 8 things equal to the classes of fish — now you have to think backwards: how would it do that? If it was to do that for us — and it will, because it's going to use SGD — what would it have to do? Well, it has no ability to use any weights after this point, so it has to do everything by the time it gets to this point, which
means this Conv2D layer is going to have to have, in each of its 5 by 5 grid areas, something saying how fishy that area is — because that's all it can do; after that, all the model can do is average them together. So we haven't done anything to specifically calculate it that way; we've just created an architecture which has to do it. Now, my feeling was that this ought to work pretty well, because as we saw in that earlier picture, the fish only appears in one little spot — and indeed, as we discussed earlier, maybe a global max pooling could be even better. So let's try this. We fit it as per usual, and even without using bounding boxes we get a pretty stable and pretty good result in about 30 seconds — around 97.6. When I then tried this on the Kaggle leaderboard I got a much better result. In fact, to show you my submissions: the 20th place was me averaging together four different models from today, but this one on its own was 0.986, which would be about 22nd place. So that little model on its own would get us 22nd position — no data augmentation, no pseudo-labeling, not using the validation set to help us, which you should do when you enter. So you can get 22nd position with this very simple approach: use a slightly larger image and use a fully convolutional network.
There's something else cool about this fully convolutional network, and that is that we can actually look at the output of that layer — remember, it's 5 by 5. I think you have to go slow: how are you using VGG here? VGG, as always, is the input to this model. For every single model I'm showing you today, I pre-computed the output of the last convolutional layer; even with a different input size it's exactly the same process. I go get_data, and I say I want to get 360 by 640 sized data, and that gives me my images. Then — this is data augmentation, which I'm not doing at the moment, so ignore that — I create my model, pop off the last layer because I don't want the last max pooling layer (that's where the size comes from), and then call predict to get the features from that last layer. So it's what we always do; the only difference is that we passed 360, 640 to the constructor for the model, and we passed 360, 640 to the get_data command. And then you're adding a lot of layers later on, which is what you're focusing on? Yeah, exactly — thank you for checking; I keep skipping that bit, but everything I'm showing you today takes as input the last convolutional layer of VGG.
Another question: why did we replace all the dense layers with convolutions? A couple of reasons. The first is that the authors of the paper which created the fully convolutional net found that it worked pretty well. The second is that the GlobalAveragePooling2D layer, as we've discussed, turns out to have excellent generalization characteristics — and you'll notice here we have no dropout, and yet we're in 22nd place on the leaderboard without really trying. And the final reason is the thing I'm about to show you, which is that we've basically maintained a sense of x-y coordinates all the way through, which means we can actually now visualize this last layer — and I want to do that before I take the next question. So I can create a function which takes our model's input as input and our fourth-from-last layer as output — that is, that convolutional layer I showed you. I'm going to take that, and I'm going to pass into it the features of my first
validation image, and draw a picture of it. For this picture, here is my result — and you can see it's done exactly what we thought it would do, which is it's had to figure out, oh, there's a fishy bit here. So these fully convolutional networks have a nice side effect, which is that they allow us to find whereabouts the interesting parts are.
There's another question, which was: why does max pooling reduce the dimensions along x and y to half what they were previously? The default parameters for max pooling are 2, 2, so it's taking each 2 by 2 square and replacing it with the largest value in that square.
This isn't the highest-res heat map we've ever seen, and the obvious thing to make it more high-res would be to remove all the max pooling layers. So here's exactly the same thing as before, but I've removed all the max pooling layers, which means my model now stays at 22 by 40 all the way through; everything else is the same. That indeed does not give quite as accurate a result — we get 95.2 rather than the 97.6 we had — but on the other hand, we do have a much higher resolution grid. So if we now do exactly the same thing to create the heat map — and the other thing we're going to do is resize the heat map to 360 by 640; by default this resize command will interpolate, replacing big pixels with interpolated small pixels — that gives us, for this image, this answer, which is much more interesting. And now we can stick one on top of the other, like so, and this tells us a lot. It tells us that on the whole this is doing a good job of finding the thing that matters, the fishy thing — specifically because we're asking here for the albacore class. Remember, that layer of the model is 8 by 22 by 40, so we can ask how much like an albacore each of those areas is, or how much like a shark each of those areas is. So when we called this function it returned, basically, a heat map for every type of fish; we can pass in 0 for albacore — or actually, here's a cool one: class number 4 is "no fish". One of the classes you have to predict in this competition is "no fish", so we can say: tell us how much each part of this picture looks like the no-fish class. And if you look at the no-fish version, it's basically the exact opposite of this: you get a big blue spot here, around the fish. The other thing I wanted to point out is these areas of pinkishness that are not where the fish is. This is telling me that our model is not currently just looking for fish — if we look at this pink here, it looks like it's looking for particular characteristics of the boat. So this is suggesting to me that, since it's not all concentrated on the fish, there's probably some kind of data leakage still coming through.
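For reference, here is a hedged sketch of both ideas together — the fully convolutional top model and the heat-map extraction. The filter counts, layer names and feature shape are illustrative assumptions, not the notebook's exact values; the lecture extracts the layer with a Keras backend function, while this sketch uses the equivalent sub-model approach.

```python
from keras.layers import (Input, Conv2D, MaxPooling2D, BatchNormalization,
                          GlobalAveragePooling2D, Activation)
from keras.models import Model

vgg_feat_shape = (22, 40, 512)    # assumed shape of VGG's last conv output for 360x640 input

inp = Input(shape=vgg_feat_shape)
x = BatchNormalization()(inp)
for _ in range(3):
    x = Conv2D(128, (3, 3), padding='same', activation='relu')(x)
    x = BatchNormalization()(x)
    x = MaxPooling2D()(x)         # drop these poolings to get the high-res 22x40 heat map
heat = Conv2D(8, (3, 3), padding='same', name='heat')(x)   # one "how fishy is this cell" map per class
out = Activation('softmax')(GlobalAveragePooling2D()(heat))  # average each map -> 8 class scores

fcn = Model(inp, out)
fcn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Heat map: a sub-model that stops at the 'heat' layer, evaluated on pre-computed features.
heat_model = Model(fcn.input, fcn.get_layer('heat').output)
# maps = heat_model.predict(val_feats[:1])    # val_feats: assumed pre-computed VGG features
# albacore_map = maps[0, :, :, 0]             # class 0's grid, ready to resize and overlay
```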
How much do we know about why it's working? I think we know everything about why it's working. We have set up a model where we've said we want you to predict each of the 8 fish classes, and we've set it up such that the last layer simply averages the answers from the previous layer. The previous layer we have set up so that it has the 8 classes we need, since that's obviously the only way you can average and get the right number of classes. And we know that SGD is a general optimisation approach which will find a set of parameters that solve the problem you give it, if such a set exists. So really, when you think of it that way, unless it failed to train, which it could for all kinds of reasons, it could only get a decent answer if it solved it in this way: if it actually looked at each area and figured out how fishy it is.

Could we build some sort of attention model with the heat maps, to deal with that leakage? We're not doing attention models in this part of the course per se. For now, the simple "attention model" I would do is to find the largest area of the heat map and crop that, and maybe compare it to the bounding boxes and make sure they look about the same, and hand-fix those that don't (and if you hand-fix them you have to give that back to the Kaggle community, of course, because that's hand labeling). And honestly, that is the state of the art: in terms of who wins the money on Kaggle, the winners of these kinds of competitions have won them with a two-stage pipeline where first they find the thing of interest and then they zoom into it. The other thing you might want to do is orient the fish so that the tail is always in the same place and the head is in the same place, basically to make it as easy as possible for your convnet to do what it needs to do.

Okay, you might have heard of another architecture called Inception; a combination of Inception plus ResNet won this year's ImageNet competition, and I want to give you a very quick hint as to how it works. I have built the world's tiniest little Inception network here on this screen. One of the reasons I want to show it to you is that it actually uses the same technique we heard about from Ben Bowles: remember, in his language model Ben used a trick where he had multiple different convolution filter sizes, ran all of them, and concatenated them together. That's exactly what the Inception network does.

There are two questions on the previous material: one is how you would align the head and tail, and the other is how this is a better way to isolate the fish than just taking the bounding box approach the classifier generated. To align the head and tail, the easiest way would be to hand-annotate the head and hand-annotate the tail; that's what was done in the whale competition. As for how this is better than just taking the bounding boxes: hand labelling always has errors; there are a few people on the forum pointing at bounding boxes they don't think are correct. So it's great to have an automatic approach which ought to give about the same answer as the hand approach; you can then compare the two and use the best of both worlds. In general, this idea of combining human intelligence and machine intelligence seems to be a great approach, particularly early on: you can do that for the first few bounding boxes to improve your bounding box model, and then use it to gradually make the model ask you for your input less and less.
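Going back to the "find the largest area of the heat map and crop it" idea from a moment ago, here is a rough sketch of what that first stage of a two-stage pipeline could look like. It reuses the hypothetical heat_big and val_images names from the previous sketch, and the crop size is an arbitrary assumption.

```python
import numpy as np

def crop_around_peak(img, heat, crop_h=200, crop_w=200):
    """Crop a crop_h x crop_w window of img centred on the hottest cell of heat."""
    r, c = np.unravel_index(np.argmax(heat), heat.shape)
    top  = int(np.clip(r - crop_h // 2, 0, img.shape[0] - crop_h))
    left = int(np.clip(c - crop_w // 2, 0, img.shape[1] - crop_w))
    return img[top:top + crop_h, left:left + crop_w]

# e.g. crop the first validation image around its predicted fishiest point,
# then compare the crop against the hand-labelled bounding box
fish_crop = crop_around_peak(val_images[0], heat_big)
```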
We found the location of the fish by finding what kind of fish it is, so what was the point of finding the location if we'd already found out what fish it is? Is this about the bounding box or the heat map? I think the heat map. The heat map you don't need at all: we're just visualizing one of the layers of the network. We didn't use the bounding boxes, we didn't do anything special; it's just a side effect of this kind of model that you can visualize the last convolutional layer, and doing so gives you a heat map. I think it's also a nice point, because a lot of people refer to neural networks as black boxes, as not having interpretability. Absolutely: there are so many ways of interpreting neural nets, and one of them is to draw pictures of the intermediate activations. You can also draw pictures of the intermediate gradients; there are all kinds of things you can draw pictures of. Another question: you just showed us a way to build a model that implicitly finds its own bounding box and does the classification all in one model, but are you saying that people later on take the bounding box model, crop the images, and then run the classifier? Yes.

Okay, so the Inception network is going to use this trick where we use multiple different convolutional filter sizes and concatenate them all together. Just as in ResNet there's a ResNet block which is repeated again and again, in the Inception network there's an Inception block which is repeated again and again, and I've created a version of one here. I have one thing which takes my input and does a 1x1 convolution; one thing that takes the input and does a 5x5 convolution; one thing that takes the input and does two 3x3 convolutions; and one thing that takes the input and just average pools it. Then we concatenate them all together. So each Inception block is basically able to look for things at various different scales and create a single feature map at the end which puts all of those things together.

Once I've defined that, I can just create a little model: an Inception block, another Inception block, a Conv2D, and a GlobalAveragePooling2D output. I haven't managed to get this to work terribly well yet; I've got similar kinds of results, but I haven't actually tried submitting it to Kaggle to see how well it generalizes. Really, part of the purpose of this is to give you a sense of the kinds of things we'll be doing next year. We've now built the basic pieces, convolutions, fully connected layers, activation functions, SGD, and from here deep learning is about putting those pieces together: what are the ways people have learned to combine them so as to solve problems as well as possible? The Inception network is one of those ways. The other thing I wanted to do was give you plenty of things to think about and play with over the next couple of months, so hopefully this notebook is full of things you can experiment with, and maybe even try submitting to Kaggle.

I guess the warnings about the Inception network are a bit similar to the warnings about ResNet. Like ResNet, the Inception network is available in Keras: I haven't converted one to my standard approach, but Keras has an Inception network that you can download and use.
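Here is a hedged sketch of a tiny Inception-style block along the lines just described, using the Keras 2 functional API; the filter counts and pooling settings are illustrative guesses rather than the lesson's exact values.

```python
from keras.models import Model
from keras.layers import (Input, Conv2D, AveragePooling2D, Concatenate,
                          GlobalAveragePooling2D, Activation)

def inception_block(x, nf=32):
    """Concatenate 1x1, 5x5, double-3x3 and pooled views of the same input."""
    b1 = Conv2D(nf, (1, 1), activation='relu', padding='same')(x)
    b2 = Conv2D(nf, (5, 5), activation='relu', padding='same')(x)
    b3 = Conv2D(nf, (3, 3), activation='relu', padding='same')(x)
    b3 = Conv2D(nf, (3, 3), activation='relu', padding='same')(b3)
    b4 = AveragePooling2D((2, 2), strides=(1, 1), padding='same')(x)
    return Concatenate()([b1, b2, b3, b4])   # stack branches along the channel axis

# A little model: two Inception blocks on top of the precomputed VGG features,
# then an 8-filter conv and global average pooling, as in the earlier FCN sketch
inp = Input(shape=(22, 40, 512))
x = inception_block(inp)
x = inception_block(x)
x = Conv2D(8, (3, 3), padding='same')(x)
x = GlobalAveragePooling2D()(x)
out = Activation('softmax')(x)
incep = Model(inp, out)
incep.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

Note that the merge here is a concatenation: each branch's channels sit next to each other in the output rather than being summed, which is the distinction from ResNet's merge discussed just below.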
It hasn't been well studied in terms of its transfer-learning capabilities, and again I haven't seen people winning Kaggle competitions using transfer learning with the Inception network, so it's just a little bit less well studied. But like ResNet, the combination of Inception plus ResNet won the most recent ImageNet competition, so if you're looking to start with the most predictive model, this is where you would want to start.

I wanted to finish off on a very different note, which is looking at RNNs one more time. I've spent much more time on CNNs than RNNs, and the reason is that this course is really all about being pragmatic: it's about teaching you the stuff that works, and in the vast majority of areas where I see people using deep learning to solve their problems, they're using CNNs. Having said that, some of the most challenging problems are now being solved with RNNs, like speech recognition and language translation; when you use Google Translate now, you're using RNNs. My suspicion is you'll come across those problems a lot less often, but I also suspect that in a business context a very common kind of problem is a time series problem, like looking at the time series of click events on your website, or e-commerce transactions, or logistics, or whatever. And these sequence-to-sequence RNNs we've been looking at, which we've been using to generate Nietzschean philosophy, are identical to the ones you would use to analyze a sequence of e-commerce transactions and try to find anomalies. So I think CNNs are more practically important for most people in most organizations right now, but RNNs also have a lot of opportunities, and of course we'll also be looking at them when it comes to attentional models next year, which is figuring out, in a really big image, which part we should look at next.

Does Inception have the merge characteristic that ResNet had? Inception's merge is a concatenation rather than a sum, which is the same as what we saw when we looked at Ben Bowles' Quid NLP model: we're taking multiple convolution filter sizes and sticking them next to each other, so that the feature map contains information about 5x5 features and 3x3 features and 1x1 features separately. If you add them together, you lose that information. ResNet does add them, for a very specific reason: we want to force it to learn residuals. In Inception we don't want that; in Inception we want to keep them all in the feature space.

The other reason I wanted to look at RNNs is that last week we looked at building an RNN nearly from scratch in Theano, and I say nearly from scratch because there was one key step it did for us, which was the gradients. Really understanding how the gradients are calculated is not something you would probably ever have to do by hand, but I think it can be very helpful to your intuition about training neural networks to be able to trace it through. So for that reason, this is the one time in this course, over this year and next year, where we're going to really go through and calculate the gradients ourselves. So here is a recurrent neural network in pure Python, and the reason I'm doing a recurrent neural network in pure Python is that RNNs are about the hardest thing to get your head around when it comes to backpropagating gradients. If you look at this and study it and step through it over the next couple of months, you will really get a great understanding of what a neural net is actually doing: there's going to be no magic or mystery, because this whole thing is every line of code.
But if we're going to do it all ourselves, we have to write everything ourselves. So if we want a sigmoid function, we have to write the sigmoid function, and any time we write any function we also have to write its derivative; here is the derivative of this function, and I'm using the convention that underscore-d is the derivative of a function. So I have relu and the derivative of relu, and I just check as I go along that they look reasonable: the Euclidean distance and the derivative of the Euclidean distance, the cross entropy and the derivative of the cross entropy. Note here that I am clipping my predictions, because if you have zeros or ones in there you're going to get infinities and it destroys everything, so you have to be careful of this. This actually did happen: I didn't have the clipping at first, I started getting infinities, and it turned out to be necessary. Here is softmax, and here is the derivative of softmax. Then I go through and double check that the answers I get with my versions are the same as the answers I get with the Theano versions, to make sure they're all correct, and they all seem to be fine.

Okay, so I'm going to use relu as my activation function, which means its derivative is the relu derivative, and my loss function is cross entropy, so the loss derivative is the cross entropy derivative. I also have to write my own scan. You remember scan: scan is the thing where we go through a sequence one step at a time, calling a function on each element, and each time the function gets two things: the next element of the sequence and the previous result of the call. So, for example, scan of "add two things together" over the integers from 0 to 5 gives us the cumulative sums. Remember, the reason we do this is that GPUs don't know how to do loops, so our Theano version used a scan, and I wanted to make this as close to the Theano version as possible. In Theano, scan is not implemented like this with a for loop; they use a very clever approach which basically creates a tree where it does a whole lot of the steps simultaneously and combines them together. Next year we may even look at how that works, if anybody's interested.

Okay, so to create our Nietzschean philosophy we need an input and an output. We have the eight-character sequences, one-hot encoded, as our inputs, and the same eight-character sequences moved across by one, one-hot encoded, as our outputs, and our vocab size is 86 characters. So here are our input and output shapes: 75,000 phrases, each one with eight characters, and each of those eight characters is a one-hot encoded vector of size 86.

So first we need to do the forward pass. The forward pass is to scan through all of the characters in the nth phrase, the input and the output, calling some function, and here it is; it's basically identical to what we saw in Theano, where we also had to lay out the forward pass. To create the hidden state we take the dot product of x with its weight matrix and the dot product of the hidden state with its weight matrix, and put all of that through the activation function; then to create the predictions we take the dot product of the hidden state with its weight matrix and put that through softmax.
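As a condensed illustration of those pieces, here is a small numpy sketch of paired functions and derivatives, a clipped cross entropy, a pure-Python scan, and the forward step. The variable names (w_x, w_h, w_y and so on) are my own, and the details are simplified relative to the lesson's notebook.

```python
import numpy as np

def relu(x):    return np.maximum(0., x)
def relu_d(x):  return (x > 0.).astype(float)          # derivative of relu

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def crossentropy(pred, actual):
    pred = np.clip(pred, 1e-7, 1 - 1e-7)                # avoid log(0) blowing up
    return -(actual * np.log(pred)).sum()

def crossentropy_d(pred, actual):
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return -actual / pred                                # derivative w.r.t. pred

def scan(fn, start, seq):
    """Apply fn(prev_result, item) across seq, keeping every intermediate result."""
    res, prev = [], start
    for item in seq:
        prev = fn(prev, item)
        res.append(prev)
    return res

# e.g. the cumulative-sum example from the lecture:
# scan(lambda prev, x: prev + x, 0, range(6)) -> [0, 1, 3, 6, 10, 15]

def forward_step(hidden, x, w_x, w_h, w_y):
    """One character of the forward pass: new hidden state and prediction."""
    pre_hidden = x @ w_x + hidden @ w_h
    hidden = relu(pre_hidden)
    pre_pred = hidden @ w_y
    pred = softmax(pre_pred)
    return hidden, pre_hidden, pre_pred, pred
```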
We also have to make sure we keep track of all the state it needs, so at the end we return the loss, the pre-hidden and the pre-pred, because we'll use those each time we go through in the backprop; we need to keep track of the hidden state, of course, because we'll be using it the next time through the RNN; and of course we need our actual predictions. So that's the forward pass, very similar to Theano.

The backward pass is the bit I wanted to show you, and I want to show you how I think about it. This is how I think of it: I've reversed the direction of all of my arrows, and the reason is that when we create a derivative we're really asking how a change in the input impacts the output, and to do that we have to use the chain rule, going back from the end all the way to the start. This is our output, this is our last hidden layer activation matrix, and this is our loss, which adds together the losses at each of the characters. If we want the derivative of the loss with respect to this hidden activation, we have to take the derivative of the loss with respect to this output activation and multiply it by the derivative of this output activation with respect to this hidden activation. We multiply them together because that's the chain rule: for a function of a function of x, the derivative is the product of the derivatives. I find it really helpful to literally draw the arrows, so let's draw the arrow from the loss function to each of the outputs as well. To calculate the derivatives we basically have to go through and undo each of those steps: to figure out how an input would change an output, we have to go back along the arrow in the opposite direction.

So how do we get from the loss back to the output? We need the derivative of the loss function, and if we're going to go back through the activation function we're also going to need the derivative of the activation function. You can see it here: this is a single backward pass. We grab one of our inputs and one of our outputs, and then we go backwards through each of the eight characters, from the end to the start. We grab our input character and our output character, and the first thing we want is the derivative of pre-pred; pre-pred was the prediction prior to putting it through the softmax. As I just showed you, that's the derivative of the softmax times the derivative of the loss: the derivative of the loss gets us from here, and the derivative of the softmax gets us from here back to the other side of the activation function, so it gets us to here.

We want to keep going further: we want to get back to the other side of the hidden layer, all the way over to here. To do that (for those of you who haven't done vector calculus, which I'm sure is many of you, just take my word for it), the derivative of a matrix multiplication is multiplication by the transpose of that matrix. So to take the derivative of pre-hidden times its weights, we simply multiply by the transpose of those weights, and that's the derivative of that part. And remember, the hidden state actually has two arrows coming out of it, so going backwards it has two arrows coming into it, which means we have to add together that derivative and that derivative.
So here is the second part: that one is with respect to the outputs, and that one is with respect to the hidden state. Then finally we have to undo the activation function, so we multiply by the derivative of the activation function, and that's the chain rule that gets us all the way back to here.

Now that we have those two pieces of information, we can update our weights. For the blue line, we take the derivative we got to at this point, which we called d pre-pred, multiply it by our learning rate, which we're calling alpha, and then undo the multiplication by the hidden state to get the derivative with respect to the weights; I created this little columnify function to do that, which turns a vector on its side, taking its transpose if you like. That gives me my new output weights. My new hidden weights are basically the same thing: the learning rate times the derivative we just calculated, undoing its weights. And the new input weights are again the learning rate times d pre-hidden times the columnified version of x.

I'm going through that very quickly, and the details aren't important, but if you're interested it might be fun to look at over the Christmas break or the next few days, because here you can see all of the steps necessary to backpropagate through an RNN, which is also why we would never want to do this by hand again. When I wrote this code (luckily I did it before I got my cold), you can see I've written after every line the dimensions of each matrix and vector, because it makes your head hurt keeping everything straight. So thank God Theano does this for us, but I think it's useful to see it.

Finally, I just have to create my initial weight matrices, which are normally distributed matrices where the normal distribution uses the square root of 2 divided by the number of inputs, because that's that Glorot initialization thing; ditto for my y matrix. And remember, for my hidden matrix, for a simple RNN, we use the identity matrix to initialize it.
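Pulling the backward pass, the weight updates and the initialisation together, here is a small numpy sketch of one backward step, under the same assumed names as the forward sketch above. The softmax-plus-cross-entropy derivative is collapsed into the usual pred-minus-target shortcut, and the contribution flowing back from the next time step is omitted, so this is a simplification of the lesson's code rather than a transcription of it.

```python
def columnify(v):
    """Turn a vector on its side (a column), so the outer products line up."""
    return v[:, np.newaxis]

def backward_step(x, y, prev_hidden, pre_hidden, hidden, pred,
                  w_x, w_h, w_y, alpha=0.01):
    # derivative at pre_pred: softmax followed by cross entropy collapses to (pred - y)
    d_pre_pred = pred - y
    # back through the output weights (multiply by the transpose), then undo relu;
    # the second contribution coming back from the next character is omitted here
    d_pre_hidden = (d_pre_pred @ w_y.T) * relu_d(pre_hidden)
    # SGD updates: learning rate times derivative times (the transpose of) what we multiplied by
    w_y -= alpha * columnify(hidden) @ d_pre_pred[np.newaxis, :]
    w_h -= alpha * columnify(prev_hidden) @ d_pre_hidden[np.newaxis, :]
    w_x -= alpha * columnify(x) @ d_pre_hidden[np.newaxis, :]
    return d_pre_hidden

# Initialisation: scaled normals for the input and output weights,
# identity for the hidden-to-hidden matrix (the simple-RNN trick mentioned above)
vocab_size, n_hidden = 86, 256   # n_hidden is an assumed size
w_x = np.random.randn(vocab_size, n_hidden) * np.sqrt(2. / vocab_size)
w_y = np.random.randn(n_hidden, vocab_size) * np.sqrt(2. / n_hidden)
w_h = np.eye(n_hidden)
```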
Is state maintained across examples in this version? We haven't got to that bit yet; it depends how we use this. At this stage all we've done is define the matrices and the transitions, and whether we maintain state depends entirely on what we do next, which is the loop. So here is our loop: we go through a bunch of examples (we should really go through all of them, but I was too lazy to wait), run one forward step and then one backward step, and from time to time print out how we're getting along. In this case, if we go back to the forward step, it passes scan an initial state which is a whole bunch of zeros, so currently this is resetting the state; it's not doing it statefully. If you wanted to do it statefully it would be pretty easy to change: you would return the final state, keep track of it, and feed it back in the next time through the loop. If you're interested, maybe you could try that. Having said that, you probably won't get great results, because remember that when you do things statefully you're much more likely to have gradients and activations explode, unless you use a GRU or LSTM, which we're not, so my guess is it probably won't work very well.

So that was a very quick fly-through, really more to show you around the code so that if you're interested you can check it out. What I really wanted to do, though, was get on to a more interesting type of RNN. There are actually two interesting types: the long short-term memory (LSTM) and the gated recurrent unit (GRU); many of you will have heard of the one on the left, the LSTM.

One more question: was mini-batch anywhere in there? No. For stateful RNNs you can't exactly have mini-batches, because you're doing one example at a time, and in our case we were going through them in order. Using mini-batches is a great way to parallelize things on the GPU and make things run faster, but then you have to be careful about how you're tracking that state.

So, the LSTM is the one a lot of you will have heard about, because LSTMs have been pretty popular over the last couple of years for all kinds of cool stuff that Google does. On the right, however, is the GRU, which is simpler and better than the LSTM, so I'm not going to talk about the LSTM, I'm going to talk about the GRU. They're both techniques for building a recurrent neural network where your gradients are much less likely to explode, and they're another great example of a clever architecture, but it's just going to be more of the same ideas we've seen again and again.

What we have on the right hand side is a box which is basically zooming into what's going on inside one of these circles in a GRU. Normally, in our standard RNN, what's going on in there is pretty simple: we take the hidden state and multiply it by the W_h weight matrix, we take our input and multiply it by its weight matrix, and we add the two together and put the result through an activation function. A GRU, though, does something more complex. We still have the input coming in and the output going out (that's what these arrows are: they represent our new input character and our prediction), but what's going on in the middle is more complex. We still have our hidden state just like before, but whereas in a normal RNN this state each time simply updates itself, just goes through a weight matrix and an activation function and updates itself, in this case you can see that the loop looks like it's going back to connect directly with itself,
but then there's this gate here, so it's actually not just a self-loop; there's something more complicated going on. To understand what, let's go to the right hand side. There you can see that the hidden state goes through another gate, marked R. So what's a gate? A gate is simply a little mini neural network which outputs a bunch of numbers between 0 and 1 that we multiply by its input. In this particular one, the R stands for reset. If the outputs were all 0, then the thing coming out of the reset gate would just be a big bunch of zeros; in other words, it would allow the network to forget the hidden state. Or they could be a big bunch of ones, which would let the network remember all of the hidden state. Do we want it to remember or forget? We don't know, which is why we implement the gate as a little neural network: it has two inputs, the input to the GRU unit and the current hidden state, and it learns a set of weights it can use to decide when to forget. So the network now has the ability to forget what it knows; that's what the reset gate does. Assuming the reset gate has at least some non-zero entries, which it surely will most of the time, whatever comes through is what we call h tilde, or in my code h_new: the new candidate value of the hidden state after being reset. Then finally that goes up to this top bit here, together with the original hidden state, and there's a gate which decides how much of each one we should take. I don't know why it's called Z, but it's an update gate: if it's 1 we take more from this side, if it's 0 we take more from that side, and again it's implemented as a little neural network.

I think the easiest way to understand this is to look at the code, so I've implemented it in Theano. (You can use a GRU in Keras by simply replacing the word SimpleRNN with GRU; you don't really need to know any of this to use it, and you get pretty good results.) Here's what it looks like when implemented. We don't just have a hidden weight matrix and an input weight matrix any more; we also have hidden and input weight matrices for our little reset gate mini neural net, and for our update gate mini neural net. So here is the definition of a gate: a gate is something which takes its input, the hidden state, its hidden state weights, its input weights and its biases; it takes the dot product of x with its weights and of the hidden state with its weights, adds the biases, and puts the result through a sigmoid function. That's what I meant by a mini neural net: it's part of the neural net, it's just got one layer. And that's the definition of the reset gate and the update gate.

Then in our step function, the thing that runs each time in the scan, it looks almost exactly like what we looked at last week. The output equals the hidden state times the hidden weight matrix plus the hidden biases. The new candidate hidden state equals our input times its weights plus the hidden state times its weights plus the biases, but this time the hidden state's contribution is multiplied by the reset gate, and the reset gate is just our little neural net. So now that we have h_new, our actual new hidden state is equal to h_new times one minus the update gate, plus our previous hidden state times the update gate; update plus one minus update adds to one.
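Here is a small numpy sketch of that step, with my own variable names (the weight matrices, shapes and the tanh activation are assumptions, not the lesson's Theano variables), just to show how little code the reset gate, update gate and blend really are.

```python
import numpy as np

def sigmoid(x):
    return 1. / (1. + np.exp(-x))

def gate(x, h, w_x, w_h, b):
    """A 'mini neural net' with one layer and a sigmoid output in [0, 1]."""
    return sigmoid(x @ w_x + h @ w_h + b)

def gru_step(h, x, W, U, b, Wr, Ur, br, Wz, Uz, bz):
    r = gate(x, h, Wr, Ur, br)                  # reset gate: how much old state to forget
    z = gate(x, h, Wz, Uz, bz)                  # update gate: how much old state to keep
    h_new = np.tanh(x @ W + (r * h) @ U + b)    # candidate ("h tilde") hidden state
    return (1 - z) * h_new + z * h              # blend of candidate and previous state
```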
So you can see why it's been drawn like that: the result can really be anywhere, at either end or somewhere in between, and the update gate decides how much the new hidden state h_new replaces the old one. So although people tend to talk about LSTMs and GRUs as being pretty complex, it really wasn't much, and it actually wasn't that hard to write. The key outcome is that, because we now have these reset and update gates, the network has the ability to learn special sets of weights to make sure it throws away state when that's a good idea, or ignores new state when that's a good idea, and these extra degrees of freedom allow SGD to find better answers. So again, this is one of those cases where we come up with architectures which just try to make it easier for the optimizer to find good answers. Everything after this is identical to what we looked at last week: that step goes into the scan function, we calculate the loss, we calculate the gradients, we do the SGD updates, and we chuck it all into a little loop.

Okay, so really the main reason I wanted to do all that today was to show you the backprop example. I know some learning styles are more detail oriented, so hopefully some of you found that helpful: any time you find yourself wondering how the hell a neural network did something, you can come back to this piece of code, and that's all it did; that's all that's going on. That's one way of thinking about it. Where you really get successful with neural nets, though, is at a whole other level, where you don't think of it at that level any more, but instead you start thinking: if I'm an optimizer and I'm given an architecture like this, what would I have to do in order to optimize it? Once you start thinking like that, you can start thinking in the kind of upside-down way that's necessary to come up with good architectures; you can start to understand why this convolution layer followed by this average pooling layer gives the answers it does, why it works, and you'll get a real intuition for what's going to work for your problem. So there are two levels at which you need to think about neural nets, and the sooner you can think at that super high level, the sooner I feel you'll do well with them. One of the best ways to get there is, over the next couple of weeks, to run this fish notebook yourself and screw around with it a lot: make sure you know how to do the things I did, like creating a little function that lets me spit out the output of any of the layers and visualize it, make sure you know how to inspect a model, and really look at the inputs and outputs. That's the best way to build an intuition.

So this, particularly the first half of this class, was a bit of a preview of next year. In the first six weeks you learned all the pieces, and today we very rapidly tried putting those pieces together in a thousand different ways and saw what happened, and there are a million more ways that we know of and probably a billion more we don't know of. Knowing this little set of tools, convolutions, fully connected layers, activation functions, SGD, you're now able to be an architect and create these architectures. Keras's functional API makes it easy: I created all of the architectures you saw today this week, while I was sick and my baby wasn't sleeping and my brain was barely working. That's how
easy Keras makes this. It definitely takes, I think, a few weeks to build your comfort level up, but hopefully you can give it a try. Most importantly, over the next few weeks, as Rachel and I (maybe with some of your help) start to develop the MOOC, you can keep talking on the forums about whatever problems you're interested in working through, whether that's the projects you want to apply these things to at your own organizations, your personal passion projects, or trying to win a competition or two. Rachel and I are still going to be on the forums, and then in a few weeks' time, when the MOOC goes online, hopefully there will be thousands of people joining this community, and you'll be the seed. So I really hope you'll stay a part of it and help: can you imagine that first day, when half the people still think that a python is a snake and don't know how to connect to an AWS instance, and you'll all be able to say, read the wiki, here's the page, oh yeah, I had that problem too. Our goal here is to create a new generation of deep learning practitioners, people who have useful problems they're trying to solve and can use this tool to solve them, rather than creating more and more exclusive, heavily mathematical content that's designed to put people off. That's our hope; that's really why we're doing this. Rachel, anything else we should add before we wrap up?

Okay, well, thank you so much. It really has been a genuine pleasure, and I'm so happy to hear that most of you want to come back, so we'll see you again next year. You'll obviously get first dibs on places for next year's course; if the MOOC is successful, next year's course could be quite popular, so I do suggest you nonetheless get your applications in not too late. They'll certainly go through with priority, and we'll be notifying you through the forums. Yes, absolutely. And be aware, if you're not already, that we don't send email much; the forums are our main way to communicate, and Slack to some extent, so if you want to see what's going on, that's the place to look. And of course our wiki is the knowledge base we're creating for everybody, so any time you see something missing on the wiki, or something you think could be improved, edit it; even if you're not sure you're saying the right thing, you can add a little comment after it asking someone to come along and help. Well, thanks so much, everybody. I hope you all have a great vacation season.