So I noticed from some of the questions during the week that the idea of what a convolution is still seems a little counter-intuitive or surprising to some people. The only way I know to teach things effectively is by creating a spreadsheet, so here we are. This is the famous number seven from lesson zero, and I've just copied and pasted the numbers into a spreadsheet. They're not all exactly zero — they're actually floats, just rounded off — and I'm using conditional coloring, so you can see the shape of our little number seven.

I wanted to show you exactly what a convolution does, and specifically what a convolution does in a deep learning neural network. We generally use a modern kind of convolution, which means a three by three convolution. So here is a three by three convolution: I've just randomly generated nine random numbers. That is a filter. And here is my second filter — nine more random numbers. This is what we do in Keras when we ask for a convolutional layer: the first thing we pass it is how many filters we want, which is how many of these little random matrices we want it to build for us. So in this case it's as if I called Convolution2D where the first parameter is two, and the second parameter is three comma three, because it's a three by three.

Okay, so what happens with this little random matrix? To calculate the very first item, it takes these nine (the blue cells) times these nine (the red cells), all added together. Let's go down here where it's a bit darker — how does this get calculated? This is equal to these nine times these nine, and when I say "times" I mean element-wise times: the top left by the top left, the middle by the middle, and so forth — and then add them all together. That's all a convolution is. As we go through, we take the corresponding three by three area in the image, multiply each of those nine things by each of these nine things, and add those nine products together. That's it — that's a convolution. There's really nothing particularly weird or confusing about it, and I'll make this available in class so you can have a look.

You can see that when I get to the top left corner, I can't move any further left or up because I've reached the edge. This is why, when you do a three by three convolution without zero padding, you lose one pixel on each edge — you can't push the three by three any further. If we go down to the bottom left, you can see the same thing: you get stuck in the corner. That's why my result is one row smaller than my starting point.

I did this for two different filters. Here's my second filter, and when I calculate, say, this cell, it's exactly the same thing: these nine times each of these nine, added together, and these are just nine other random numbers. So that's how we start: in this case I've got two convolutional filters, and this is the output of those two convolutional filters, and they're just random at this point.

Now, for my second layer it's no longer enough to have just a three by three matrix — now I need a three by three by two tensor, because to calculate, say, the top-left cell of my second convolutional layer,
I need these nine times these nine, added together, plus these nine times these nine, added together — because at this point my previous layer is no longer just one thing, it's two things. And indeed, if our original picture had been a three-channel color picture, our very first convolutional layer would have had to be made of three by three by three tensors. So all of the convolutional layers from now on are going to be three-by-three-by-(number of filters in the previous layer) tensors. Here is my first one — I've just drawn it like this, a three by three by two tensor — and you can see it's taking nine from here and nine from here and adding the two together. Then for my second filter in my second layer, it's exactly the same thing: I've created one more random three by three by two tensor, and again I have these nine times these nine, summed, plus those nine times those nine, summed, and that gives me this cell. So that gives me my first two layers of my convolutional neural network.

Then I do max pooling. Max pooling is slightly more awkward to do in Excel, but we can still handle it. So here's max pooling. Because I'm doing two by two max pooling, it's going to decrease the resolution of my image by two on each axis. How do we calculate that number? It's simply the maximum of those four. Then that number is the maximum of those four, and so forth. So with max pooling, we started with two filters in the previous layer, and we still have two filters, but now each has half the resolution in each of the x and y axes.

Okay, so at that point we've done two convolutional layers. Yes, there's a question: "How did you go from one matrix to two matrices in the second layer?" How did I go from just this one thing to these two things? The answer is that I just created two random three by three filters. This is my first random three by three filter; this is my second random three by three filter. Each output was simply equal to the corresponding nine-element section, multiplied element-wise and added together. So because I had two random three by three matrices, I ended up with two outputs — two filters means two sets of outputs.

All right, so now that we've got our max pooling layer, let's use a dense layer to turn it into our output. A dense layer means that every single one of our activations from our max pooling layer needs a random weight — so these are a whole bunch of random numbers. What I do is take every one of those random numbers, multiply each one by the corresponding input, and add them all together: it's the sum-product of this and this. For MNIST we would have ten activations, because we need an activation for each of 0, 1, 2, 3 and so on up to 9, so we would need ten sets of these dense weight matrices so that we could calculate the ten outputs. If we were only calculating one output, this would be a perfectly reasonable way to do it: for one output, it's just a sum-product of everything from our final layer with a weight for each thing in that final layer, all added together. That's all a dense layer is. So really, both dense layers and convolutional layers couldn't be easier mathematically.
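(To make that concrete outside of Excel, here's a minimal numpy sketch of the same calculations — an illustration, not the actual course code: a 3x3 convolution with no zero padding, a second layer of 3x3x2 filters, 2x2 max pooling, and a dense layer as a single sum-product.)

    import numpy as np

    def conv3x3(img, filt):
        # each output cell = sum of the element-wise product of the 3x3 filter
        # and the 3x3 patch of the image underneath it
        h, w = img.shape
        out = np.zeros((h - 2, w - 2))      # no zero padding: lose a pixel on each edge
        for i in range(h - 2):
            for j in range(w - 2):
                out[i, j] = (img[i:i + 3, j:j + 3] * filt).sum()
        return out

    def maxpool2x2(act):
        # 2x2 max pooling: each output cell is the max of a 2x2 block
        h, w = act.shape
        return act[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    img = np.random.rand(28, 28)                      # stand-in for the "7" image
    layer1_filts = np.random.randn(2, 3, 3)           # two random 3x3 filters
    layer1 = [conv3x3(img, f) for f in layer1_filts]  # two 26x26 activation maps

    layer2_filts = np.random.randn(2, 2, 3, 3)        # two filters, each 3x3x2
    layer2 = [sum(conv3x3(layer1[c], f[c]) for c in range(2)) for f in layer2_filts]

    pooled = [maxpool2x2(a) for a in layer2]          # two maps at half the resolution

    flat = np.concatenate([p.ravel() for p in pooled])  # flatten into one long vector
    dense_w = np.random.randn(flat.size)                # one random weight per activation
    output = (flat * dense_w).sum()                     # a dense layer is just this sum-product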
I think the surprising thing is what happens when you then say: okay, rather than using random weights, let's calculate the derivative of what happens if we change each weight up by a bit or down by a bit, and how that would impact our loss. In this case I haven't actually got as far as calculating a loss function, but we could add one over here — we could add a sigmoid loss, for example — and then we can calculate the derivative of the loss with respect to every single weight in the dense layer, every single weight in all of the filters in this layer, and every single weight in all of the filters in that layer. With all of those derivatives we can work out how to optimize all of these weights, and the surprising thing is that when we optimize all of these weights, we end up with these incredibly powerful models, like the visualizations that we saw.

I'm not quite sure where the disconnect is between this incredibly simple math and the outcome. I think it might be that it's so easy it's hard to believe that's all it is — but I'm not skipping over anything; that really is it. And to help you really understand this, I'm going to talk more about SGD. Yes, Rachel? "Why would you use a sigmoid function here?" So the loss function we generally use is based on softmax, which is e^xi divided by the sum of the e^xi's. If it's just binary, that's equivalent to a sigmoid, 1/(1 + e^-x) — so softmax in the binary case simplifies into a sigmoid function. Thank you for clarifying that question.

So I think this is super fun: we're going to talk about not just SGD but every variant of SGD, including one invented just a week ago. We've already talked about SGD — yes, Rachel, two more questions. "Does SGD happen for all layers at once?" Yes, SGD happens for all layers at once: we calculate the derivative of all the weights with respect to the loss. "When do you have a max pool after a convolution, and when not?" Who knows — this is a very controversial question, and indeed some people now say never to use max pooling. Instead of using max pooling, when you're doing the convolutions, don't do a convolution over every set of nine pixels; instead skip a pixel each time, and that's another way of downsampling. Geoffrey Hinton, who is kind of the father of deep learning, has gone as far as saying that the great success of max pooling has been one of the greatest problems deep learning has faced, because in his view it really stops us from going further. I don't know whether that's true or not; I assume it is, because he's Geoffrey Hinton and I'm not. For now, we use max pooling every time we're doing fine-tuning, because we need to make sure that our architecture is identical to the original VGG authors' architecture, so we have to put max pooling wherever they did.

"Why do we want max pooling or downsampling at all? Are we just trying to look at bigger features of the input?" Yeah — there are a couple of reasons. The first is that max pooling helps with translation invariance: it basically says, if this feature is here, or here, or here, I don't care — it's roughly in the right spot — and that seems to work well. The second is exactly what you said: every time we max pool, we end up with a smaller grid, which means our three by three convolutions effectively cover a larger part of the original image, which means our convolutions can find larger and more complex features. I think those would be the two main reasons.
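(As a quick aside on the softmax question above: here's a tiny numpy check — an illustration, not from the lesson — that softmax over two classes gives exactly the same probability as a sigmoid of the difference between the two scores.)

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())          # subtract the max for numerical stability
        return e / e.sum()

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    scores = np.array([1.7, -0.3])                 # two arbitrary class scores
    p_softmax = softmax(scores)[0]                 # probability of class 0 under softmax
    p_sigmoid = sigmoid(scores[0] - scores[1])     # sigmoid of the score difference
    print(p_softmax, p_sigmoid)                    # the two numbers match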
"So is Geoffrey Hinton cool with the idea of doing the skipping instead? And why is that better?" Geoffrey Hinton thinks that we should be using something called a capsule architecture. The problem is that he hasn't invented it yet, so we don't have an answer to this. The capsule architecture — c-a-p-s-u-l-e — if you Google for "Geoffrey Hinton capsule", you can learn all about the thing that he thinks we ought to have but don't yet. He did point out that one of the key pieces of deep learning that he invented — I can't remember which one — took something like 17 years from conception to working, so he is somebody who sticks with these things and makes them work.

"Is max pooling unique to image processing?" Max pooling is not unique to image processing. It's likely to be useful for any kind of convolutional neural network, and a convolutional neural network can be used for any kind of data that has some kind of consistent ordering — things like speech, any kind of audio, or any consistent time series. All of these have some kind of ordering to them, so you can use a CNN, and therefore you can use max pooling. As we get to NLP we'll be looking more at convolutional neural networks for other data types, and interestingly, the author of Keras last week — or maybe the week before — made the contention that perhaps it will turn out that CNNs are the architecture that will be used for every type of ordered data. This was just after one of the leading NLP researchers released a paper showing a state-of-the-art result in NLP using convolutional neural networks. So although we'll start learning about recurrent neural networks next week, we have to be open to the possibility that they'll become redundant by the end of the year — but they're still interesting.

Okay, so: SGD. We looked at the SGD intro notebook, but I think things are sometimes clearer when you can see it all in front of you, so here is basically the identical thing to the SGD notebook, in Excel. We start by creating a line: I create 29 random numbers x, and then I create something that is equal to two times x... sorry, no — two times x plus 30. So here is 2x + 30, and that's my input data. I am trying to create something that can find the parameters of a line.

Now, the important thing — and this is the leap, which requires not thinking too hard lest you realize how surprising and amazing it is — is that everything we learn about how to fit this line is identical to how we fit the filters and weights in a convolutional neural network. Everything we learn about calculating the slope and the intercept, we will then use to let computers see. The answer to any question of "why would this work?" is basically "why not?" This is a function that takes some inputs and calculates an output; that is a function that takes some inputs and calculates an output. So why not?
The only reason it wouldn't work would be, for example, because it was too slow — and we know it's not too slow, because we try it and it works pretty well. So everything we're about to learn works for any kind of function that has the appropriate kinds of gradients — we can talk more about that later — and neural nets have the appropriate kinds of gradients.

So, SGD. We start with a guess: what do we think the parameters of our function are — in this case, the intercept and the slope? With Keras they would be randomized using the Glorot initialization procedure we learned about, which uses random numbers scaled by sqrt(6 / (n_in + n_out)), but here let's just assume they're both one. We are going to use very, very small mini-batches: our mini-batches are going to be of size one, basically because it's easier to do in Excel and easier to see, but everything we're about to see works equally well for a mini-batch of size 4 or 64 or 128 or whatever.

So here's our first row — our first mini-batch. Our input is 14 and our desired output is 58. Our guesses for the parameters are one and one, and therefore our predicted y value is 1 + 1 x 14, which is 15. If we're doing root mean squared error, our squared error is (prediction minus actual) squared.

The next thing we want to do is calculate the derivative with respect to each of our two parameters. One really easy way to do that is to add a tiny amount to each of them and see how the output varies. So let's add 0.01 to our intercept, calculate the line, and calculate the squared loss — this is the error if b is increased by 0.01. Then let's calculate the difference between that error and the actual error, and divide it by our change, which was 0.01, and that gives us our estimated gradient. (I'm writing it as dE/db, with E for the error; it probably should have been dL/db, for "loss".) So this is the change in loss with respect to b: negative 85.99. We can do the same thing for a: add 0.01 to a, calculate the line, subtract the actual, take the square, and there is our estimated loss; subtract it from the actual loss and divide by 0.01. So there are our two estimates of the derivative.

This approach to estimating the derivative is called finite differencing. Any time you calculate a derivative by hand, you should use finite differencing to make sure your calculation is correct. You're not very likely to ever have to do that, however, because all of the libraries do derivatives for you — and they do them analytically, not using finite differences. Here are the derivatives calculated analytically, which you can get by going to Wolfram Alpha, typing in your formula, and getting the derivative back. So this is the analytical derivative of the loss with respect to b, and this is the analytical derivative of the loss with respect to a, and you can see that the analytical and finite-difference values are very similar for b, and very similar for a. That makes me feel comfortable that we got the calculation correct.
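(Here's that finite-differencing check as a small Python sketch — a paraphrase of what the spreadsheet is doing, using the same first data point, x = 14 and y = 58, and starting guesses of one.)

    def loss(a, b, x, y):
        return (a * x + b - y) ** 2        # squared error for a single data point

    x, y = 14.0, 58.0                      # our first mini-batch of size one
    a, b = 1.0, 1.0                        # initial guesses for slope and intercept
    eps = 0.01

    # finite-difference estimates: nudge each parameter and see how the loss moves
    dL_da_fd = (loss(a + eps, b, x, y) - loss(a, b, x, y)) / eps
    dL_db_fd = (loss(a, b + eps, x, y) - loss(a, b, x, y)) / eps

    # analytical derivatives of (a*x + b - y)**2
    err = a * x + b - y
    dL_da = 2 * err * x
    dL_db = 2 * err

    print(dL_db_fd, dL_db)   # about -85.99 vs -86: very close
    print(dL_da_fd, dL_da)   # same story for a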
So all SGD does is say: this tells us how the loss changes if we change our weights by a little bit. We know that increasing our value of b by a little bit will decrease the loss function, and we know that increasing our value of a by a little bit will decrease the loss function, so let's nudge both of them in that direction. The way we do that is to multiply the derivative by a learning rate — that's the "little bit" — and subtract it from our previous guess. We do that for a, and we do that for b, and here are our new guesses: 1.12 and 1.01. So let's copy them over here — 1.12 and 1.01 — and then we do the same thing again, which gives us a new a and b, and we keep doing that again and again until we've gone through the whole data set, at the end of which we have a guess for a of 2.61 and a guess for b of 1.07. That's one epoch. In real life we would have shuffle=True, which means the rows would be randomized, so this isn't quite perfect, but apart from that, this is SGD with a mini-batch size of one.

So at the end of the epoch we say: okay, this is our new slope, so let's copy 2.61 over here; and this is our new intercept, so let's copy 1.06 over here; and now it starts again. We can keep doing that again and again: copy the numbers from the bottom, stick them back at the top, and each one of these is an epoch. I recorded a macro of me copying this to the bottom and pasting it at the top, and added a "for i = 1 to 5" around it, so now if I click Run it will copy and paste five times. You can see it's gradually getting closer: we know our goal is a = 2 and b = 30, and we've got as far as a = 2.5 and b = 1.3. So they're better than our starting point, and you can see our loss function gradually improving, but it's going to take a long time.

Yes, Rachel? "Can we still do analytic derivatives when we are using non-linear activation functions?" Yes, we can use analytical derivatives as long as we're using a function that has an analytical derivative, which is pretty much every useful function you can think of. You can't have something with an if-then statement in it, because that jumps from here to here — but even those you can approximate. A good example is ReLU, which is max(0, x). Strictly speaking, it doesn't have a well-defined derivative at every point: this is what ReLU looks like, its derivative here is zero and its derivative here is one — what is its derivative exactly here? Who knows. But mathematicians care about that kind of thing; we don't. In real life this is a computer, and computers are never exactly anything: we can assume it's an infinitesimal amount to this side, or an infinitesimal amount to that side, and who cares. As long as it has a derivative that you can calculate in a meaningful way in practice on a computer, it'll be fine.

Okay, so one thing you might have noticed is that this is going to take an awfully long time to get anywhere, and you might think: okay, let's increase the learning rate. Fine, let's increase the learning rate.
So let's get rid of one of these zeros... uh oh, something went crazy. What went crazy? Our a's and b's shot out to something like 11 million, which is not the correct answer. How did that happen? Well, here's the problem. Let's say this is the shape of our loss function and this is our initial guess. We figure out the derivative — actually the derivative is positive, so we want to go in the opposite direction — and we step a little bit over here, which takes us to here, and we step a little bit further, which takes us to here, and this looks good, right? But then we increased the learning rate: rather than stepping a little bit, we stepped a long way, which put us here, and then we stepped a long way again, which put us over here. If your learning rate is too high, you're going to get worse and worse — and that's what happened. So getting your learning rate right is critical to getting your thing to train at all. ("Is that called exploding gradients?") Exploding gradients is something a little bit different — and you can even have gradients that do the opposite — but it's a similar idea.

So it looks like 0.01 is the best we can do, and that's a bit sad because it's really slow. Let's try to improve it. One thing we could do — actually, let me do this in a few more dimensions. Let's say we had a three-dimensional set of axes, and a loss function shaped like a valley, and say our initial guess is somewhere over here. Over here the gradient is pointing in this direction, so we might make a step and end up there, then another step that puts us there, and another step that puts us there. This is actually the most common thing that happens in neural networks: something that's flat in one dimension like this is called a saddle point, and it's actually been shown that the vast majority of the space of a neural network's loss function is pretty much all saddle points. When you look at this, it's pretty obvious what should be done: if we go from here to here, then on average we're obviously heading in this direction — especially when we do it again, we're obviously heading in this direction. So let's take the average of how we've been going so far, and do a bit of that. That's exactly what momentum does.

There's a question. "If ReLU isn't the cost function, why are we concerned with its differentiability?" We care about the derivative of the output with respect to the inputs — the inputs to the filters. Remember, the loss function consists of a function of a function of a function of a function: it's categorical cross-entropy loss applied to softmax applied to ReLU applied to a dense layer applied to max pooling applied to ReLU applied to convolutions, etc., etc.
So in other words, to calculate the derivative of the loss with respect to the inputs, you have to calculate the derivative through that whole chain of functions. This is what's called back-propagation. With back-propagation it's easy to calculate that derivative, because we know from the chain rule that the derivative of a function of a function is simply equal to the product of the derivatives of those functions. So in practice, all we do is calculate the derivative of every layer with respect to its inputs, and then multiply them all together. That's why we need to know the derivative of the activation layers as well as the loss layer and everything else.

Okay, so here's the trick. Every time we take a step, we also calculate the average of the last few steps. After these two steps, the average is this direction, so for the next step we take our gradient step as usual and we add on the average of the last few steps — which means we actually end up going to here. Then we do the same thing again: we find the average of the last few steps, which is now even further in this direction, and so on. ("What is that surface?") This is the surface of the loss function with respect to some of the parameters — in this case just a couple of parameters; it's just an example of what a loss function might look like. This is the loss, this is some weight number one, and this is some weight number two. If you imagine this is like gravity, we're trying to get this little ball to travel down this valley, as far towards the bottom as possible, and the trick is that we keep taking not just the gradient step, but also the average of the last few steps — so in practice it ends up bouncing its way down the valley. That's the idea.

Doing that in Excel is pretty straightforward. To make things simpler I've removed the finite-differencing-based derivatives here, so we just have the analytical derivatives, but other than that this is identical to the previous spreadsheet — same data, same predictions, same derivatives — except we've done one extra thing. When we calculate our new b, we take our previous b minus our learning rate times — not our gradient, but this cell. And what is that cell? That cell is equal to our gradient times 0.1, plus the thing just above it times 0.9; and the thing just above it is equal to its gradient times 0.1 plus the thing above it times 0.9, and so forth. In other words, this column is keeping track of an average of the derivatives over the last few steps, which is exactly what we want, and we do that for both of our two parameters. This 0.9 is our momentum parameter — in Keras, when you use momentum, you say momentum= and how much momentum you want. Where does that beta come from? You just pick it, just like you pick your learning rate: it's something you get to choose, and you choose it by trying a few values and finding out what works best.
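(Here's what that momentum column is computing, as a Python sketch — a paraphrase rather than the spreadsheet itself: keep an exponentially weighted moving average of the gradients and step using that average instead of the raw gradient. The data points are made up, from y = 2x + 30.)

    def momentum_step(param, grad, avg_grad, lr=0.01, beta=0.9):
        # running average of the gradients: 0.1 of the new gradient plus
        # 0.9 of the previous average (the extra column in the sheet)
        avg_grad = beta * avg_grad + (1 - beta) * grad
        # step using the averaged gradient rather than the raw one
        return param - lr * avg_grad, avg_grad

    data = [(14.0, 58.0), (20.0, 70.0), (5.0, 40.0)]   # a few (x, y) points from y = 2x + 30
    a, b = 1.0, 1.0
    avg_da, avg_db = 0.0, 0.0
    for x, y in data:
        err = a * x + b - y
        a, avg_da = momentum_step(a, 2 * err * x, avg_da)
        b, avg_db = momentum_step(b, 2 * err, avg_db)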
So let's try running this... and you can see it is still not exactly zipping along. Why not? Well, when we look at it, we can see that the constant term needs to get all the way up to 30, and it's still way down at 1.5 — it's not moving fast enough — whereas the slope term moved very quickly to where we want it to be. So what we really want is different learning rates for different parameters. Doing this is called dynamic learning rates, and the first really effective dynamic learning rate approaches have only appeared in the last three years or so. One very popular one is called Adagrad, and it's very simple. All of these dynamic learning rate approaches share the same insight: if the derivative of the parameter I'm changing has consistently been of very low magnitude, and the derivative on this mini-batch is higher than that, then what I really care about is the relative difference between how much this parameter tends to change and how much it's going to change this time. In other words, we don't just care about what the gradient is, but whether the magnitude of the gradient is a lot more or a lot less than it has tended to be recently.

The easy way to calculate the overall recent size of the gradient is to keep track of the square of the gradient. So with Adagrad — you can see at the bottom of my epoch here — I've got the sum of the squares of all of my gradients, then I've taken the square root, and then I've divided by the count to get an average. So this is the average root-sum-of-squares of my gradients. This number will be high if the magnitudes of my gradients are high, and because it's squared, it will be particularly high if they are sometimes really high. ("Why is it okay to just use a mini-batch, since the surface is going to depend on which points are in your mini-batch?") It's not ideal to just use a mini-batch, and we'll learn about a better approach to this in a moment. In fact there are two closely related approaches, Adagrad and Adadelta: one of them does this over all of the gradients so far, and one of them uses a slightly more sophisticated approach. What I'm actually doing here — on an epoch-by-epoch basis — is slightly different from either, but it's similar enough to explain the concept.

("For a CNN, would dynamic learning rates mean that each filter would have its own learning rate?") It would mean that every parameter has its own learning rate. So this is one parameter, that's a parameter, that's a parameter, that's a parameter — and then in our dense layer, that's a parameter, that's a parameter,
that's a parameter. When you call model.summary() in Keras, it shows you how many parameters there are for every layer, so any time you're unclear on how many parameters there are, you can go back and look at these spreadsheets, and also at the Keras model.summary(), and make sure you understand how they turn out. For the first layer, it's going to be the size of your filter times the number of filters (if it's just grayscale); after that, the number of parameters will be the size of your filter times the number of filters coming in times the number of filters coming out. And then of course for your dense layers, every input goes to every output, so it's the number of inputs times the number of outputs. Each of these is a parameter of the function that is calculating whether it's a cat or a dog.

Okay, so what we do now is we say: look at this number here, 1857. This is saying that the derivative of the loss with respect to the slope varies a lot, whereas the derivative of the loss with respect to the intercept doesn't vary much at all. So at the end of every epoch I copy that up to here, and then I take my learning rate and divide it by it. So now for each of my parameters I have an adjusted learning rate, which is the learning rate divided by the recent average root-sum-of-squared gradient, and you can see that one of my learning rates is now a hundred times larger than the other. So let's see what happens when I run this. ("Is there a relationship with normalizing the input data?") No, there's not really a relationship with normalizing the input data. I mean, it can help — but if your inputs are on very different scales, it's still a lot more work for it to do. So yes, normalization helps, but dynamic learning rates don't help so much that they make it unnecessary; in fact it turns out that even with dynamic learning rates, having not just normalized inputs but batch-normalized activations is extremely helpful.

The thing about using Adagrad, or any kind of dynamic learning rate, is that generally you'll set the learning rate quite a lot higher, because remember you're dividing it by this recent average. There we go — so if I set it to 0.1... oh, too far. Okay, that's no good, so let's try 0.05 and run that. You can see that after just five steps I'm already halfway there. Another five steps: getting very close. Another five steps and... it's exploded. Why did that happen? Because as we get closer and closer to where we want to be, you need to take smaller and smaller steps, and by keeping the learning rate the same, eventually we went too far. So this is still something you have to be very careful of.

A more elegant approach, in my opinion, to the same thing Adagrad is doing is something called RMSprop. RMSprop was first introduced in Geoffrey Hinton's Coursera course — if you search for "Geoffrey Hinton neural networks Coursera" you'll find it; he introduces RMSprop in one of those classes. It's quite funny nowadays, because this comes up in academic papers a lot, and when people cite it they have to cite the Coursera course, lecture 6, at minute 14 and 30 seconds — Hinton has asked that this be the official way it is cited.
So there you go — you see how cool he is. Here's what RMSprop does. It does exactly the same thing as momentum, but instead of keeping track of a weighted running average of the gradients, we keep track of a weighted running average of the square of the gradients. So here it is: everything here is the same as the momentum spreadsheet so far, except that I take my gradient squared, multiply it by 0.1, and add it to my previous cell times 0.9. So this is keeping track of a recent running average of the squares of the gradients. Once I have that, I do exactly the same thing with it that I did in Adagrad, which is to divide the learning rate by it: I take my previous guess at b, and subtract from it my derivative times the learning rate divided by the square root of that recent running weighted average of the squared gradients. So it's doing basically the same thing as Adagrad, but in a way that updates continuously.

These are all different types of learning rate optimization; the last two are different types of dynamic learning rate approaches. So let's try this one and run it for a few steps. Again I'm going to have to guess what learning rate to start with — let's say 0.1... if anything that's a little slow, so let's try 0.2. As you can see, this is going pretty well, and I'll show you something really nice about RMSprop, which is what happens as we get very close — we know the right answer is 2 and 30. Is it about to explode? No, it doesn't explode, and the reason is that it's recalculating that running average every single mini-batch. So rather than waiting until the end of the epoch, by which stage it's gone so far that it can't come back, it just jumps a little bit too far, then recalculates the dynamic learning rates and tries again. So what happens with RMSprop is that if your learning rate is too high, it doesn't explode — it just ends up circling around the right answer. And so when you use RMSprop, as soon as you see your validation scores flatten out, you know this is what's going on, and you should probably divide your learning rate by ten. You'll see me do this all the time when I'm running Keras stuff: run a few steps, divide the learning rate by ten, run a few more steps. You don't see my loss function explode; you just see it flatten out.
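(Here's the RMSprop update as a Python sketch — again a paraphrase of the rule described above, not the course code: a moving average of the squared gradients, with the learning rate divided by its square root. Adagrad is the same idea, but accumulated over all the gradients so far rather than as a moving average.)

    import math

    def rmsprop_step(param, grad, avg_sq, lr=0.1, beta=0.9, eps=1e-8):
        # moving average of the *squared* gradients, weighted just like momentum
        avg_sq = beta * avg_sq + (1 - beta) * grad ** 2
        # divide the learning rate by the root of that average; eps avoids divide-by-zero
        return param - lr * grad / (math.sqrt(avg_sq) + eps), avg_sq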
"So do you want your learning rate to get smaller and smaller?" Yeah, you do. Your very first learning rate often has to start small — we'll talk about that in a moment — but once you've got started, you generally want to gradually decrease the learning rate. That's called learning rate annealing. "And can you repeat what you said earlier — that something does the same thing as Adagrad, but...?" Yes: RMSprop, which we're looking at now, does exactly the same thing as Adagrad — divide the learning rate by the root of the sum of squared gradients — but rather than doing it since the beginning of time, or per mini-batch, or per epoch, RMSprop does it continuously, using the same technique we learned from momentum: take the square of this gradient, multiply it by 0.1, and add it to 0.9 times the last calculation. That's called a moving average — a weighted moving average, where we're weighting it so that more recent squared gradients are weighted higher. (It's actually an exponentially weighted moving average, to be more precise.)

So there's something pretty obvious we could do here: momentum seems like a good idea, RMSprop seems like a good idea — why not do both? That's called Adam. Adam was invented something like 12 to 18 months ago, and hopefully one of the things you'll see from these spreadsheets is that these recently invented things are still at the ridiculously simple end of the spectrum. The stuff people are discovering in deep learning is a long, long way from being incredibly complex or sophisticated, and hopefully you'll find that encouraging: if you want to play at the state of the art of deep learning, that's not at all hard to do. So let's look at Adam. I remember when it came out, everybody was so excited, because suddenly it became so much easier and faster to train neural nets — but once I actually tried to create an Excel spreadsheet of it, I realized: oh my god, it's just RMSprop plus momentum. Literally all I did was copy my momentum page, then copy across my RMSprop columns and combine them. So you can see here I have my exponentially weighted moving average of the gradients — that's what these two columns are — and here is my exponentially weighted moving average of the squares of the gradients. When I calculate my new parameter, I take my old parameter and subtract not my derivative times the learning rate, but my momentum factor — the recent weighted moving average of the gradients — multiplied by the learning rate, divided by the root of the recent moving average of the squares of the derivatives. So it's literally just combining momentum plus RMSprop.

Let's see how that goes. Let's run five epochs — and we can use a pretty high learning rate now, because it's really handling a lot for us. Wow: in five epochs we're almost perfect. Another five epochs and it does exactly the same thing RMSprop does, which is to go too far and try to come back. So we need to do the same thing when we use Adam — and Adam is what I use all the time now: I just divide the learning rate by ten every time I see it flatten out.
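(Here's that combination as a Python sketch — the spreadsheet-style version of Adam as described above; note that the published Adam paper also adds bias-correction terms, which are left out here.)

    import math

    def adam_step(param, grad, avg_grad, avg_sq, lr=0.1, beta1=0.9, beta2=0.9, eps=1e-8):
        avg_grad = beta1 * avg_grad + (1 - beta1) * grad        # momentum part
        avg_sq = beta2 * avg_sq + (1 - beta2) * grad ** 2       # RMSprop part
        # averaged gradient times the learning rate, divided by the root of the
        # averaged squared gradient
        return param - lr * avg_grad / (math.sqrt(avg_sq) + eps), avg_grad, avg_sq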
Okay — so a week ago, somebody came out with something they called not Adam but Eve. Eve is an addition to Adam that attempts to deal with this learning rate annealing automatically. All of this is exactly the same as my Adam page, but at the bottom I've added some extra stuff: I keep track of the root mean squared error — that's just my loss function — and then I copy across my loss function from my previous epoch, and from the epoch before that. What Eve does is ask: how much has the loss function changed? So it's got this ratio between the previous loss and the loss before that — you can see it's the absolute value of the last one minus the one before, divided by whichever one is smaller. And then it says: okay, let's adjust the learning rate — instead of just using the learning rate we're given, let's adjust it. So the next thing we do is take the exponentially weighted moving average of these ratios — you can see another of these betas appearing here: this cell is equal to our last ratio times 0.9 plus our new ratio times 0.1. And then for our learning rate, we divide the learning rate from Adam by that value. What that says is: if the loss is moving around a lot — if it's very bumpy — we should probably decrease the learning rate, because it's going all over the place; remember how we saw before that once we've gone past where we want to get to, it just jumps up and down. On the other hand, if the loss function is staying pretty constant, then we probably want to increase the learning rate. That all seems like a good idea, so again, let's try it. Not bad, right? After five epochs it's gone a little bit too far.

I played with this a lot on State Farm during the week — I grabbed a Keras implementation that somebody wrote about a day after the paper came out. The problem is that because it can both decrease and increase the learning rate, sometimes as it gets down to the flat bottom point, where it's pretty much optimal, the loss gets pretty constant — and therefore Eve will try to increase the learning rate. So what I tended to find was that it would very quickly get pretty close to the answer, then suddenly jump somewhere really awful, then start getting to the answer again, and jump somewhere really awful again.

("Generally, for the exit condition, don't we give a delta — if the change in the gradient is below a certain delta, we just stop?") We haven't done any such thing. We have always said run for a specific number of epochs; we haven't defined any kind of stopping criterion. It is possible to define such a stopping criterion, but nobody's really come up with one that's remotely reliable, and the reason is that when you look at the graph of loss over time, it doesn't tend to look like a smooth curve — it tends to look like this. So in practice it's very hard to know when to stop; it's still a human judgment thing. ("Can't it also have lots of plateaus?")
Oh yeah, that's definitely true, and in particular with a type of architecture called ResNet, which we'll look at next week, the authors show that it tends to go like this. So in practice you kind of have to run your training for as long as you have patience for, at the best learning rate you can come up with.

Something I actually came up with six or twelve months ago — my interest in it got re-stimulated after reading this paper — is something which dynamically updates learning rates in such a way that they only ever go down, and which, rather than using the loss function (which, as I just said, is incredibly bumpy), uses something else that's less bumpy: the average sum of squared gradients. I created a little spreadsheet of the idea, and I hope to prototype it in Python this week or the week after; we'll see how it goes. The idea is basically this: keep track of the sum of squares of the derivatives, compare the sum of squared derivatives from the last epoch to the sum of squared derivatives from this epoch, and look at the ratio of the two. The derivatives should keep going down; if they ever go up by too much, that strongly suggests you've jumped out of the good part of the function, so any time they go up too much, you should decrease the learning rate. I literally added about two lines of code to my incredibly simple VBA — Adam, with annealing: if the gradient ratio is greater than two, i.e. if it doubles, divide the learning rate. Here's what happens when I run that: five steps, another five steps — you can see it's automatically changing the learning rate, so I don't have to do anything; I just keep running. So I'm pretty interested in this idea. I think it's going to work super well, because it allows me to focus on just running stuff without ever worrying about setting learning rates, and I'm hopeful that this approach to automatic learning rate annealing is something we can have in our toolbox by the end of this course.

("Hi — one thing that happened to me today is that I tried a lot of different learning rates and didn't get anywhere. I was working with the whole data set rather than a sample. What I'm trying to understand is: if I try with a sample and find something, does that apply to the whole data set, or how do I go about it?") Great question — hold that thought for five seconds. Was there another question at the back before we answer that one? No? Okay. So here is the answer to that question. The question was: it takes a long time to figure out the optimal learning rate — can we calculate it using just a sample? To answer it, I'm going to show you how I entered State Farm, and indeed, when I started entering State Farm, I started by using a sample. Step one was to think: okay, what insights can we gain from a sample which will still apply when we move to the whole data set? Running stuff on a sample took 10 or 20 seconds, while running stuff on the full data set took two to ten minutes per epoch. So after I created my sample — which I just created randomly — I first of all wanted to find out: what does it take to create a better-than-random model here?
So I always start with the simplest possible model, and the simplest possible model here has a single dense layer. Now, here's a handy trick: rather than worrying about calculating the mean and standard deviation of the input and subtracting them out in order to normalize your input layer, you can just start with a batch norm layer. If you start with a batch norm layer, it does that for you. So any time you create a Keras model from scratch, I'd recommend making your first layer a batch norm layer — it's going to normalize the data for me. That's a cool little trick which I haven't actually seen anybody use elsewhere, but I think it's a good default starting point. If I'm going to use a dense layer, then obviously I have to flatten everything into a single vector first. So this is a really minimal model. I compiled it, fit it, and... nothing happened. Not only did nothing happen to my validation accuracy, but nothing really happened to my training accuracy either. It only took seven seconds per epoch to find that out, so that's okay.

So what might be going on? I look at model.summary() and I see there are 1.5 million parameters, which makes me think it's probably not underfitting — it's pretty unlikely that with 1.5 million parameters there's really nothing useful it can do whatsoever. It's only a linear model, true, but I still think it should be able to do something. So that makes me think that what must be going on is that thing where it jumps too far — and it's particularly easy to jump too far at the very start of training. Let me explain why. It turns out there are often reasonably good answers that are way too easy to find. One reasonably good answer would be to always predict zero, because there are ten output classes — in the State Farm competition there are ten different types of distracted driving — and you are scored on the cross-entropy loss, which looks at how accurate each of your ten predictions is. So rather than trying to predict well, what if we just always predict "it's not this class" — say, always predict 0.01? Nine times out of ten you'd be right, because for nine of the ten categories, it isn't that one. So always predicting 0.01 would be pretty good. Now, it turns out the model can't quite do that, because we have a softmax layer — and a softmax layer, remember, is e^xi divided by the sum of the e^xi's, so everything has to add up to one. But if it makes one of the classes really high and all of the others really low, then nine times out of ten those predictions are going to be right. So in other words, it's a pretty good answer for the model to always predict some one class — class eight, say — with close to a hundred percent certainty. And that's what happens. Anybody who tried this — and I saw a lot of people on the forums this week saying "I tried to train it and nothing happened" — the folks who got the really interesting insight were the ones who went on to say "and then I looked at my predictions, and it kept predicting the same class with great confidence, again and again". Okay, that's why it did that. (And I just wanted to point out that we're getting close to break time, so we should take a break soon.)
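(For reference, here's a minimal Keras sketch of that kind of model — roughly what's being described, not the exact notebook code; the input shape and layer sizes are assumptions you'd adjust to your own data.)

    from keras.models import Sequential
    from keras.layers import BatchNormalization, Flatten, Dense
    from keras.optimizers import Adam

    # assumes 3x224x224 images with "channels first" ordering
    model = Sequential([
        BatchNormalization(axis=1, input_shape=(3, 224, 224)),  # normalizes the input for us
        Flatten(),                                              # dense layers need a flat vector
        Dense(10, activation='softmax'),                        # one output per class
    ])
    # with the default learning rate this tends to get stuck predicting a single
    # class, as described above -- see the lower learning rate tried next
    model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])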
Okay, so our next step is to try decreasing the learning rate. Here is exactly the same model, but now using a much lower learning rate — that was meant to be 1e-5 — and when I run that, okay, it's actually moving. So it was only 12 seconds of compute time to figure out that I'm going to have to start with a low learning rate. Once we've got to a point where the accuracy is reasonably better than random, we're well away from the part of the loss function that says "always predict everything is the same class", and therefore we can now increase the learning rate back up again. Generally speaking, for these harder problems, you'll need to train for an epoch or two at a low learning rate, and then you can increase it back up. So you can see I now put it back up to 0.1, and it very quickly increases my accuracy. My accuracy on my validation set is 0.5 using a linear model, and this is a good starting point, because it tells me that any time my validation accuracy is worse than about 0.5, the model is really no better than even a linear model, and so it's not worth spending more time on.

One obvious question is how you decide how big a sample to use. What I did was try a few different sizes for my validation set and then evaluate the model — that is, calculate the loss function on the validation set — for a whole bunch of randomly sampled batches, ten times over. Then I looked at how the accuracy changed: with the validation set at a thousand images, my accuracy moved around from about 0.47 or 0.48 to 0.51. That's not changing too much — it's stable enough that I can draw useful insights using a sample of this size.

What else can we learn from a sample? One thing is whether other architectures work well. The obvious thing to do with a computer vision problem is to try a convolutional neural network, and here's one of the simplest possible ones: two convolutional layers, each with a max pooling layer, then one dense layer followed by my dense output layer. I tried that, and found that it very quickly got to an accuracy of 100% on the training set, but only 24% on the validation set. That's because I was very careful to make sure my validation set included different drivers from my training set — on Kaggle, we're told that the test set has different drivers — and it's much harder to recognize what a driver is doing if we've never seen that driver before. So I could see that convolutional neural networks are clearly a great way to model this kind of data, but I'm going to have to think very carefully about overfitting.

Step one in avoiding overfitting is data augmentation, as we learned in our data augmentation class. So here's the exact same model, and I tried every type of data augmentation: I tried shifting it left and right a bit, shifting it up and down a bit, shearing it a bit, rotating it a bit, and shifting the color channels a bit. For each of those I tried four different levels, found the best in each case, and then combined them all together.
So here are my best data augmentation amounts. On 1,560 images — a very small set, just my sample — I then ran my very simple two-convolutional-layer model with these optimized augmentation parameters, and it didn't look very good: after five epochs I only had 0.1 accuracy on my validation set. But I could see that my training set was continuing to improve, and that makes me think: okay, don't give up yet — try decreasing the learning rate and do a few more. And lo and behold, it started improving. This is where you've got to be careful not to jump to conclusions too soon. So I ran a few more, and it kept improving, so I ran another 25, and look at what happened: it kept getting better and better until we were getting 67% accuracy. This 1.15 validation loss is well within the top 50% of this competition. So using an incredibly simple model on just a sample, we can get into the top half of this Kaggle competition simply by using the right kind of data augmentation. I think that's a really interesting insight about the power of this incredibly useful tool.

Okay, let's have a five minute break — but we'll do your question first; can you grab the microphone? ("With ten classes, would a class imbalance in the sample affect things?") It's unlikely that there's going to be a class imbalance in my sample unless there was an equivalent class imbalance in the real data, because I've got a thousand examples, so statistically speaking that's unlikely. And if there is a class imbalance in my original data, then I want my sample to have that class imbalance too.

At this point I felt pretty good that I knew we should be using a convolutional neural network — which is obviously a very strong hypothesis to start with anyway — and I also felt pretty confident about what kind of learning rate to start with and how to change it, and about what data augmentation to do. The next thing I wanted to think about was: how else do I handle overfitting? Because although I'm getting some pretty good results, I'm still overfitting hugely — 0.6 versus 0.9. So the next thing on our list of ways to avoid overfitting — and I hope you all remember that we have that list from lesson three, the five steps; let's go and have a look at it now to remind ourselves. Approaches to reducing overfitting: these are the five steps. We can't add more data; we've tried using data augmentation; we're already using batch norm and convnets; so the next step is to add regularization, and dropout is our kind of favored regularization technique. Before we do that, though, I'll just mention one more thing about this data augmentation approach.
I have literally never seen anybody write down a process for figuring out what kind of data augmentation to use and how much. The only posts I've seen on it rely on intuition — basically, "look at the images and think about how much it seems like they should be able to move around or rotate". I really tried this week to come up with a rigorous, repeatable process that you can use, and that process is: go through each data augmentation type one at a time; try three or four different levels of it on a sample, with a big enough validation set that the results are pretty stable; find the best value of each of the data augmentation parameters; and then try combining them all together. I hope you come away with this as a practical message — probably your colleagues, even the ones who claim to be deep learning experts, aren't doing this, so this is something you can hopefully get people into the practice of doing.

Regularization, however, we cannot tune on a sample, and the reason is that step one, "add more data", is very correlated with adding regularization: as we add more data, we need less regularization. So as we move from a sample to the full data set, we're going to need less regularization, and to figure out how much regularization to use, we have to use the whole data set. So at this point I changed it to use the whole data set, not the sample, and I started using dropout. You can see that I started with the data augmentation amounts you've already seen, started adding in some dropout, and ran it for a few epochs to see what would happen — and you can see it's worked pretty well: we're getting up into the 75% range now, where before we were in the 64% range. I haven't checked where that would put us on the Kaggle leaderboard once we add clipping — which is very important for getting the best cross-entropy loss — but I'm pretty sure it would be at least in the top third based on this accuracy. Then I ran a few more epochs with an even lower learning rate and got 0.78, 0.79 — so this is going to be well up into the top third, maybe even the top quarter, of the leaderboard.

I got to this point by just trying out a couple of different levels of dropout, and I just put them in my dense layers. There's no rule of thumb here — a lot of people put small amounts of dropout in their convolutional layers as well; all I can say is to try things. But what VGG does is put 50% dropout after each of its dense layers, and that doesn't seem like a bad rule of thumb, so that's what I was doing here, along with trying a few different sizes of dense layers to find something reasonable. I didn't spend a heap of time on this, so there are probably better architectures, but as you can see, this is still a pretty good one. So that was my step two.

Now, so far we have not used a pre-trained network at all. This is getting into roughly the top third of the leaderboard without using any ImageNet features, which is pretty damn cool. But we're pretty sure that ImageNet features would be helpful, so the next step was to use ImageNet features — VGG features specifically. I was reasonably confident that all of the convolutional layers of VGG were probably pretty much good enough.
So far we have not used a pre-trained network at all — this is getting into roughly the top third of the leaderboard without using any ImageNet features, which is pretty damn cool. But we're pretty sure that ImageNet features would be helpful, so the next step was to use them: VGG features, specifically. I was reasonably confident that the convolutional layers of VGG were probably pretty much good enough as they were — I didn't expect I would have to fine-tune them much, if at all, because the convolutional layers are the things that really look at the shape and structure of things rather than how they fit together, and these are photos of the real world, just as ImageNet is photos of the real world. So I really felt that most, if not all, of the time was likely to be spent on the dense layers.

Therefore, because calculating the convolutional layers takes nearly all the time — that's where all the computation is — I pre-computed the output of the convolutional layers. We've done this before: you might remember that when we looked at dropout, we did exactly this. We figured out the index of the last convolutional layer, grabbed all of the layers up to that one, built a model out of them, and then calculated the output of that model, which gave us the activations from VGG's last convolutional layer. So I did exactly the same thing — I basically copied and pasted that code. Grab VGG16, find the last convolutional layer, build a model that contains everything up to and including that layer, predict the output of that model (predicting the output means calculating the activations of that last convolutional layer), and, since that takes some time, save the result so I never have to do it again. In the future I can just load that array. So I am not going to calculate those features here — I am simply going to load them. Have a think about what you would expect the shape of this to be; you can work it out by looking at model.summary() and finding the last convolutional layer. Here it is, and we can see it is 512 filters by 14 by 14. So let's have a look at conv_val_feat.shape: 512 by 14 by 14, as expected.

There's a question: is there a reason you chose to leave out the max pooling and flatten layers? Basically because they take zero time to calculate, and the max pooling layer loses information. Given that I might want to play around with other types of pooling or other types of convolutions or whatever, I thought this layer is the last one that takes a lot of computation time, so that's the one to pre-calculate. Having said that, the first thing I did with it in my new model was to max pool it and flatten it — so yes, that's the only reason.

Okay, so now that I have the output of VGG's last conv layer, I can build a model that has dense layers on top of that, where the input to this model is the output of those conv layers. The nice thing is it won't take long to run, even on the whole data set, because the dense layers don't take much computation time. So here's my model, and by making p a parameter I could try a wide range of dropout amounts. I fit it, and one epoch takes five seconds on the entire data set, so this is a super good way to play around. You can see one epoch gets me 0.65, and three epochs gets me 0.75. So this is pretty cool.
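Here's a rough sketch of those two steps in Keras 1-style code — pre-computing the last conv layer's activations and then fitting dense layers on top of them, with the dropout amount p as a parameter. The helper names (vgg.model, batches, val_batches, trn_labels, val_labels), the file names, and the exact dense-layer sizes are assumptions based on the description above, not copied from the notebook; the batch generators are assumed to have been created with shuffle=False so that features line up with labels.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization

# build a model containing everything up to and including the last conv layer
layers = vgg.model.layers
last_conv_idx = [i for i, l in enumerate(layers) if type(l) is Convolution2D][-1]
conv_model = Sequential(layers[:last_conv_idx + 1])

# compute those activations once, save them, and from then on just load them
conv_feat = conv_model.predict_generator(batches, batches.nb_sample)
conv_val_feat = conv_model.predict_generator(val_batches, val_batches.nb_sample)
np.save('conv_feat.npy', conv_feat)
np.save('conv_val_feat.npy', conv_val_feat)

conv_val_feat = np.load('conv_val_feat.npy')
print(conv_val_feat.shape)   # e.g. (1000, 512, 14, 14): 512 filters of 14x14

# dense layers on top of the pre-computed features, with the dropout amount p as a parameter
def get_bn_model(p):
    model = Sequential([
        MaxPooling2D(input_shape=conv_feat.shape[1:]),
        Flatten(),
        Dense(4096, activation='relu'), BatchNormalization(), Dropout(p),
        Dense(4096, activation='relu'), BatchNormalization(), Dropout(p),
        Dense(10, activation='softmax')])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

bn_model = get_bn_model(p=0.6)
bn_model.fit(conv_feat, trn_labels, nb_epoch=3, batch_size=64,
             validation_data=(conv_val_feat, val_labels))
```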
I have something that in 15 seconds can get me 0.75 accuracy — and notice that here I'm not using any data augmentation. Why not? Because you can't pre-compute the output of convolutional layers if you're using data augmentation: with data augmentation, your convolutional layers give you a different output every time. That's just a bit of a bummer, but think about it — every time it sees the same cat photo, it's rotating it by a different amount, or shearing it by a different amount, or moving it by a different amount, so it would give a different output at the convolutional layer. So you can't pre-compute it. There is something you can do, which I played with a little bit, which is to pre-compute something like ten times bigger than your data set, consisting of ten differently data-augmented versions of it. That's what I was actually doing here when I brought in this data generator with augmentations and created the data-augmented convolutional features, where I predicted five times the amount of data — which basically gave me a data set five times bigger (there's a rough sketch of this idea at the end of this paragraph). And that actually worked pretty well. It's not as good as having a whole new augmented sample every time, but it's a kind of compromise.

Anyway, once I'd played around with these dense layers, I then did some more fine-tuning. I tried saying: okay, let's go through all of the layers in my model from layer 16 onwards, set them to trainable, and see what happens — so I tried fine-tuning some of the convolutional layers as well. They basically didn't help. So I experimented with my hypothesis and found it was correct: it seems that for this particular model, coming up with the right set of dense layers is what it's all about.

Yes, Rachel, there's a question: if we want rotational invariance, should we keep the max pooling, or can another layer do it as well? Max pooling doesn't really have anything to do with rotational invariance — max pooling gives you translation invariance.
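A rough sketch of that compromise — pre-computing several augmented copies of the training set's conv features. The augmentation values, directory names, and the factor of five are illustrative assumptions; conv_model and trn_labels are from the sketch above.

```python
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

gen_t = ImageDataGenerator(rotation_range=15, width_shift_range=0.1,
                           height_shift_range=0.05, shear_range=0.1, zoom_range=0.1)
da_batches = gen_t.flow_from_directory('train', target_size=(224, 224),
                                       batch_size=64, shuffle=False)

# ask the generator for 5x as many predictions as there are images:
# five differently-augmented passes over the training set
da_conv_feat = conv_model.predict_generator(da_batches, da_batches.nb_sample * 5)

# the labels just repeat five times to match
da_trn_labels = np.concatenate([trn_labels] * 5)
```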
Okay, so I'm going to show you one more cool trick. I'm going to show you a little bit of State Farm every week from now on, because there are so many cool things to try, and I want to keep reviewing CNNs, because convolutional neural nets really are becoming what deep learning is all about. This trick is actually a combination of two tricks: pseudo-labeling and knowledge distillation. If you google for "pseudo labeling semi-supervised learning" you'll find the original paper that came up with pseudo-labeling — there it is, from 2013. And knowledge distillation is a Geoffrey Hinton paper, "Distilling the Knowledge in a Neural Network", from 2015 — Hinton and Jeff Dean, not bad. We're going to combine them together, and they're kind of crazy.

What we're going to do is use the test set to give us more information, because in State Farm the test set has 80,000 images in it and the training set has 20,000 images in it. So what could we do with those 80,000 images, which we don't have labels for? It seems a shame to waste them — it seems like we should be able to do something with them. There's a great little picture here: imagine we only had two points and we knew their labels, white and black, and somebody said, how would you label this new point? And then they told you there's a whole lot of other unlabeled data — notice this is all gray, it's not labeled — but it has helped us, hasn't it? It's helped us because it's told us how the data is structured. This is what semi-supervised learning is all about: using the unlabeled data to try to understand something about the structure of it and using that to help you, just like in this picture. Pseudo-labeling and knowledge distillation are a way to do this.

And what we do is this — I'm not going to do it on the test set, I'm going to do it on the validation set, because it's a little bit easier to see the impact of it, and maybe next week we'll look at the test set, because that's going to be much cooler. It's this simple: we take some model we've already built, and we predict the outputs from that model for our unlabeled set — in this case I'm using the validation set as if it were unlabeled, so I'm ignoring its labels. Those predictions we call the pseudo-labels. Now that we have predictions for the test set or the validation set, it's not that they're true, but we can pretend they're true: they're not correct labels, but they're labels nonetheless. So what we then do is take our training labels and concatenate them with our validation or test set's pseudo-labels, so we now have labels for all of our data, and we can also concatenate our convolutional features with the convolutional features of the validation set or test set. We then use these to train a model — and the model we use is exactly the same model we had before, trained in exactly the same way as before — and our accuracy goes up from 0.75 to 0.82, so our error has dropped by something like 25%. And the reason is simply that we used this additional unlabeled data to figure out something about the structure of the data.
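Here's a minimal sketch of that, reusing the (assumed) names from the earlier sketches — bn_model, conv_feat, conv_val_feat, trn_labels, val_labels; the epoch count and batch sizes are illustrative.

```python
import numpy as np

# predict pseudo-labels for the "unlabeled" set (here, the validation set)
val_pseudo = bn_model.predict(conv_val_feat, batch_size=128)

# concatenate real training labels with pseudo-labels, and the features likewise
comb_feat = np.concatenate([conv_feat, conv_val_feat])
comb_labels = np.concatenate([trn_labels, val_pseudo])

# train exactly the same model as before, on the combined data
bn_model.fit(comb_feat, comb_labels, nb_epoch=4, batch_size=64,
             validation_data=(conv_val_feat, val_labels))
```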
Yes, Rachel, there's a question about model choice: how do you learn how to design a model, and when to stop messing with it? It seems like you've taken a few initial ideas and tweaked them to get higher accuracy, but unless your initial guesses are amazing, there should be plenty of other architectures that would also work. Okay — if and when you figure out how to find an architecture and stop messing with it, please tell me, because I don't sleep. We all want to know this, and I look back at these models I'm showing you and think: I bet there's something twice as good out there, and I don't know what it is. There are all kinds of ways of optimizing other hyperparameters of deep learning — for example there's something called Spearmint, which is a Bayesian-optimization hyperparameter tuning tool, and in fact just last week a new paper came out on hyperparameter tuning — but that's all about tuning things like the learning rate. Coming up with architectures? Who knows.

There are some people who have tried to come up with more general architectures, and we're going to look at one next week called ResNets, which seem pretty encouraging in that direction. But even then, the general question remains open. I'll give an example: ResNet, which we're going to learn about next week, is the architecture that won ImageNet in 2015, and the author of ResNet — Kaiming He from Microsoft, a super smart guy — basically said the reason ResNets are so great is that they let us build very, very, very deep networks, and indeed he showed a network with over a thousand layers that was totally state of the art. Then somebody else came along a few months ago and built wide ResNets with something like 50 layers and easily beat Kaiming He's best results. So the very author of the ImageNet winner completely got wrong the reason why his invention was good. The idea that any of us has any idea how to create optimal architectures is totally wrong — we don't. That's why I'm trying to show you what we know so far, which is the processes you can use to build them without waiting forever: doing your data augmentation on a small sample in a rigorous way, figuring out that the dense layers are probably where the action is and pre-computing the input to them. These are the kinds of things that can keep you sane, and I'm showing you the outcome of my last week of playing with this. I can tell you that during this time I continually fell into the trap of running stuff on the whole network, all the way through, and fiddling around with hyperparameters, and I had to stop myself, have a cup of tea, and ask: is this really a good use of time? We all do it — but not you anymore, because you've been to this class.

Green box — can you run us through this one more time? I'm just a little confused; it feels like maybe we're using our validation set as part of our training process, and I'm confused about how it's not. Sure — but look, we're not using the validation labels. Nowhere here does it say val_labels. So yes, we are absolutely using our validation set, but we're only using the validation set's inputs, and for our test set we also have the inputs. Next week I will show you this page again, and this time I'm going to use the test set — I just didn't have enough time to do it this time around — and hopefully we're going to see some great results. When we do it on the test set you'll be really convinced that it's not using the labels, because we don't have any labels for the test set. All it's doing here is creating pseudo-labels by calculating what it thinks the answers ought to be, based on the model we just built with that 75% accuracy, and then it's able to use the input data for the validation set in an intelligent way and therefore improve the accuracy. Follow-up: the pseudo-labels that are being generated — are they the same as the ones in the training set? What do you mean, the same?
So, val_pseudo — the contents of that? Yes, it will be based on what the model has learned by training on the training set. It's using bn_model, and bn_model is the thing we just fitted using the training labels — it's the bn_model with that 0.755 accuracy. Can you move a bit closer to the mic? The question is about supervised versus unsupervised learning, and where this — semi-supervised learning — fits. Right: semi-supervised learning works because you're giving it a model which already knows about a bunch of labels, whereas unsupervised learning has nothing — that's right. I wasn't particularly planning on covering this, but unsupervised learning is where you're trying to build a model when you have no labels at all. How many people here would be interested in hearing about unsupervised learning during this class? Okay, enough people — I will add it. During the week perhaps we can create a forum thread about unsupervised learning and I can learn about what you're interested in doing with it, because many of the things that people think of as unsupervised problems actually...

Okay, so pseudo-labeling is insane and awesome, and we need the green box back — there are a number of questions. One is: earlier you talked about learning about the structure of the data from the validation set; can you say more about that? Not much more than that picture I showed you before with the two little spirally clusters — that picture was showing how the unlabeled points cluster, and our real data is the same idea, just in a much higher dimension. Think about the Matt Zeiler paper we saw, or the Jason Yosinski visualization toolbox: the layers learn shapes and textures and concepts. In those 80,000 test images of people driving in different distracted ways, there are lots of concepts to learn about the ways in which people drive while distracted, even though the images aren't labeled. So what we're doing is trying to learn better convolutional or dense features. That's what I mean by learning the structure of the data: what do these pictures tend to look like, and, more importantly, in what ways do they differ — because it's the ways in which they differ that must be related to how they're labeled.

Okay. Can you use your updated model to make new pseudo-labels? Yes, you can absolutely do pseudo-labeling on pseudo-labeling, and you should, and if I don't get sick of running this code I will try it next week. Could that introduce bias towards your validation set? No, because we don't have any validation labels in there. One of the tricky parameters in pseudo-labeling is how much of each mini-batch should be training data versus pseudo-labeled data. One of the big things that stopped me from getting the test set in this week is that Keras doesn't have a way of creating batches which take, say, 80% from this set and 20% from that set, which is really what I want — because if I just pseudo-label the whole test set and concatenate it, then something like 80% of my batches are going to be pseudo-labels. Generally speaking, the rule of thumb I've read is that somewhere around a quarter to a third of your mini-batches should be pseudo-labeled. So I basically need to write some code to get Keras to generate batches which are a mix from two different places, before I can do this properly.
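Since Keras (at this point) has no built-in way to draw each mini-batch from two pools in a fixed ratio, here is one purely illustrative sketch of such a generator — roughly a quarter of each batch comes from the pseudo-labeled pool, matching the rule of thumb above. All names here are hypothetical, not from the notebook.

```python
import numpy as np

def mixed_batches(x_lab, y_lab, x_pseudo, y_pseudo, batch_size=64, pseudo_frac=0.25):
    """Yield batches where pseudo_frac of the items come from the pseudo-labeled pool."""
    n_pseudo = int(batch_size * pseudo_frac)
    n_lab = batch_size - n_pseudo
    while True:
        li = np.random.randint(0, len(x_lab), n_lab)
        pi = np.random.randint(0, len(x_pseudo), n_pseudo)
        yield (np.concatenate([x_lab[li], x_pseudo[pi]]),
               np.concatenate([y_lab[li], y_pseudo[pi]]))

# e.g. bn_model.fit_generator(mixed_batches(conv_feat, trn_labels,
#                                           conv_test_feat, test_pseudo),
#                             samples_per_epoch=len(conv_feat), nb_epoch=4)
```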
Then there are two questions which I think are asking the same thing: are your pseudo-labels only as good as the initial model you're beginning from, so do you need a particular accuracy first? Yes, your pseudo-labels are indeed only as good as the model you start from. People have not studied this enough to know how sensitive it is to those initial labels. Is there a rule of thumb about what accuracy level you need? No — this is too new; just try it. My guess is that pseudo-labels will be useful regardless of what accuracy level you're at, because they will make the model better, as long as you are in a semi-supervised learning context — that is, you have a lot of unlabeled data that you want to take advantage of.

Okay, I really want to move on, because I told you I wanted to get us down the path to NLP this week, and it turns out that the path to NLP, strange as that sounds, starts with collaborative filtering. You will learn why next week; this week we are going to learn about collaborative filtering itself. Collaborative filtering is a way of doing recommender systems. I sent you all an email today with links to more information about collaborative filtering and recommender systems, so please read those links if you haven't already, just to get a sense of what problem we're solving here. In short, what we're trying to do is learn to predict who is going to like what, and how much. For example the million-dollar Netflix prize: what rating will this person give this movie? If you're writing Amazon's recommender system, to figure out what to show on their home page: which products is this person likely to rate highly? If you're trying to figure out what to show in a news feed: which articles is this person likely to enjoy reading? There are a lot of different ways of doing this, but broadly speaking there are two main classes of recommender system. One is based on metadata — for example, this person filled out a survey in which they said they liked action movies and sci-fi, and we've also taken all of our movies and put them into genres, and here are all of our action sci-fi movies, so we'll recommend those. Broadly speaking, that would be a metadata-based approach. A collaborative filtering based approach is very different.
It says: let's find other people like you, find out what they liked, and assume you will like the same stuff — and specifically, when we say "people like you", we mean people who rated the movies you've watched in a similar way to you. That's called collaborative filtering. It turns out that on a large enough data set, collaborative filtering is so much better than the metadata-based approaches that adding metadata doesn't even improve it at all. When people in the Netflix prize actually went out to IMDB and so on and pulled in additional data and tried to use it to do better, at a certain point it didn't help — once their collaborative filtering models were good enough, it didn't help. And that matches something I learned about 20 years ago, when I used to do a lot of surveys in consulting: asking people about their behavior is terrible compared to actually looking at their behavior.

So let me show you what collaborative filtering looks like. We're going to use a data set called MovieLens, so hopefully you'll be able to play around with this during the week. Unfortunately, Rachel and I could not find any Kaggle competitions about recommender systems that were still open for entries, so instead we're going to use MovieLens, which is a widely studied data set in academia. Perhaps surprisingly, approaching or beating an academic state of the art is way easier than winning a Kaggle competition, because in Kaggle competitions lots and lots of people look at the data and try lots and lots of things with a really pragmatic approach, whereas academic states of the art are set by academics. So, with that said, the MovieLens benchmarks are going to be much easier to beat than any Kaggle competition, but it's still interesting. You can download the MovieLens data from the MovieLens website: there's one recommended for new research with 20 million ratings in it, and conveniently there's also a small one with only a hundred thousand ratings, so you don't have to build a sample — they've already built one for you. I am, of course, going to use the sample.

So what I do is read in ratings.csv, and as you'll see here I've started using pandas — pd is short for pandas. How many people here have tried pandas? Awesome. For those of you who haven't, hopefully the peer-group pressure is kicking in: pandas is a great way of dealing with structured data and you should use it. Reading a CSV file is this easy; showing the first few rows is this easy; finding out how big it is, and how many users and movies there are, is all this easy (there's a little sketch of these steps after this paragraph). I wanted to play with this in Excel, of course, because that's the only way I know how to teach. So I grouped the ratings by user ID and grabbed the 15 busiest movie-watching users, then grabbed the 15 most-watched movies, created a cross-tab of the two, and copied that into Excel.
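A rough sketch of those pandas steps, assuming the standard MovieLens ratings.csv layout (userId, movieId, rating, timestamp); the path and the exact way of picking the top 15 are illustrative, not from the notebook.

```python
import pandas as pd

ratings = pd.read_csv('ml-latest-small/ratings.csv')
ratings.head()                                  # first few rows
len(ratings)                                    # how many ratings
n_users = ratings.userId.nunique()              # how many distinct users
n_movies = ratings.movieId.nunique()            # how many distinct movies

# 15 busiest users, 15 most-watched movies, and a cross-tab of their ratings
top_users = ratings.userId.value_counts().index[:15]
top_movies = ratings.movieId.value_counts().index[:15]
subset = ratings[ratings.userId.isin(top_users) & ratings.movieId.isin(top_movies)]
crosstab = subset.pivot_table(index='userId', columns='movieId', values='rating')
```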
Here is the table I built from MovieLens for the 15 busiest movie-watching users and the 15 most widely watched movies, and here are the ratings — this is the rating of user 14 for movie 27, for example. Look at these users: these three have watched every single one of these movies (I'm probably one of them — I love movies), and these movies have been watched by every single one of these users. So user 14 kind of liked movie 27, loved movie 49, and hated movie 51. Is there anybody else like that here? This user really liked movie 49 and didn't much like movie 57, so maybe they feel the same way about movie 27 as user 14 did. That's the basic essence of collaborative filtering.

But we're going to try to automate it a little bit, and the way we'll do that is to say: let's pretend that for each movie we have, say, five characteristics — is it sci-fi, is it action, is it dialogue-heavy, is it new, does it have Bruce Willis? And then we could have those same five things for every user as well: is this user somebody who likes sci-fi, action, dialogue-heavy movies, new movies, and Bruce Willis? What we could then do is take the matrix product — really the dot product, since these are vectors — of that set of user features with that set of movie features. If this person likes sci-fi and it is sci-fi, and they like action and it is action, and so forth, then a high number will appear here for the dot product of those two vectors. So this would be a cool way to build a collaborative filtering system — if only we could create these five items for every movie and every user. Now, because we don't actually know which five things are most important for users and which five things are most important for movies, we are instead going to learn them, and the way we learn them is the way we learn everything: we start by randomizing them and then we use gradient descent. So here are five random numbers for every movie, here are five random numbers for every user, and in the middle is the dot product of that movie with that user. Once we have a good set of movie factors and user factors, each of these predicted ratings will be close to the corresponding observed rating, and therefore this sum of squared errors will be low. Currently it is high.

So we start with our random numbers and a loss of about 40, and now we want to use gradient descent. It turns out that every copy of Excel has a gradient descent solver in it, called Solver, so we're going to use it. We have to tell it what to minimize — this cell — and which things it's allowed to change, which is all of our factors; then we set it to minimize and say solve. You can see in the bottom left that it is trying to make this better and better using gradient descent. Notice I'm not saying stochastic gradient descent: stochastic gradient descent does one mini-batch at a time, whereas gradient descent uses the whole data set each time. Excel uses gradient descent, not stochastic gradient descent — they give the same answer. You might also wonder why it's so slow: it's slow because Excel doesn't know how to compute analytical derivatives, so it has to calculate the derivatives with finite differencing, which is slow. Okay, so here we've got a solution — it's got the loss down to about five, which is pretty good. We can see that it predicted 5.14 where the actual rating was 5, and predicted 3.05 where the actual rating was 3.
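Just to write down what the spreadsheet is optimizing — using $p_u$ for a user's five factors and $q_m$ for a movie's five factors (notation mine, not from the lecture) — the predicted rating and the loss are:

$$
\hat{r}_{u,m} = \sum_{k=1}^{5} p_{u,k}\, q_{m,k},
\qquad
L = \sum_{(u,m)\ \text{rated}} \bigl(r_{u,m} - \hat{r}_{u,m}\bigr)^2 .
$$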
So it's done a really, really good job. It's a little bit too easy, mind you — with five factors for every user and five for every movie we have nearly as many parameters as we have ratings to fit, so it's kind of over-specified — but the idea is right. There's one piece missing, though: some users probably just like movies more than others, and some movies are probably just more liked than others, and this dot product doesn't give us any way to say "this is an enthusiastic user" or "this is a popular movie". To do that we have to add bias terms. So here is exactly the same spreadsheet, but I've added one more row to the movies part and one more column to the users part for our biases, and I've updated the formula so that as well as the dot product it also adds the user bias and the movie bias. So this is saying "this is a very popular movie", and here, "this is a very enthusiastic user", for example. And now that we have collaborative filtering plus bias, we can run gradient descent on that (there's a small numpy sketch of this kind of gradient descent after this paragraph). Previously our loss was 5.6; we'd expect it to be better with the biases, because we can better specify what's going on. Let's try it: again we run Solver, let it zip along, and see what happens.

These things we're calculating are called latent factors. A latent factor is some factor that is influencing the outcome, but we don't quite know what it is — we're just assuming it's there. In fact, when people do collaborative filtering, they often go back afterwards and draw graphs where they say: okay, here are the movies that scored highly on this latent factor and low on that latent factor, and they discover the Bruce Willis factor and the sci-fi factor and so forth. If you look at the Netflix prize visualizations you will see these graphs, and the way people make them is literally by doing this — not in Excel, because they're not that cool — they calculate these latent factors, draw pictures of them, and write the names of the movies on the graph. Anyway: 4.6 — even better. Ah, that's interesting — actually I also have an error here: any time my rating is empty I really want the prediction term to be empty as well, which means my parenthesis was in the wrong place. So I'll recalculate with the error fixed and see if we get a better answer... no, not really — it's sticking at about 4.58. Worth a try. Green box, please.

Okay, I'm going to throw this — catch. There's a question: I may have forgotten or missed this, but where did the movie factors come from? They're random — they're randomly generated and then optimized with gradient descent. For some reason this seems crazier than what we were doing with CNNs — and that was pretty crazy; this is even crazier. I think that's because I understand movies more intuitively than I understand features of images. Okay — we can look at some pictures next week, but during the week, google for "Netflix prize visualizations" and you will see these pictures, and it really does work the way I described: it figures out what the most interesting dimensions are on which we can rate a movie.
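Here's a small, self-contained numpy sketch of the same idea as the spreadsheet — learning user and movie factors plus biases by plain (non-stochastic) gradient descent on the squared error over the rated cells. The data, sizes, learning rate, and number of steps are all made up for illustration.

```python
import numpy as np

np.random.seed(0)
n_users, n_movies, n_factors = 15, 15, 5
ratings = np.random.randint(1, 6, size=(n_users, n_movies)).astype(float)
rated = np.random.rand(n_users, n_movies) < 0.8        # which cells actually have a rating

P = np.random.randn(n_users, n_factors) * 0.1          # user factors
Q = np.random.randn(n_movies, n_factors) * 0.1         # movie factors
bu = np.zeros(n_users)                                  # user biases
bm = np.zeros(n_movies)                                 # movie biases

lr = 0.01
for step in range(2000):
    pred = P @ Q.T + bu[:, None] + bm[None, :]          # predicted ratings
    err = np.where(rated, pred - ratings, 0.0)          # ignore unrated cells
    gP, gQ = err @ Q, err.T @ P                         # gradients of the squared error
    P -= lr * gP
    Q -= lr * gQ
    bu -= lr * err.sum(axis=1)
    bm -= lr * err.sum(axis=0)

print('sum of squared errors:', (err ** 2).sum())
```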
And it turns out that things like the level of action, sci-fi, and how dialogue-driven a movie is really are very important features. But rather than pre-specifying those features, we learn them — and we have definitely learned in this class that calculating features using gradient descent gives us better features than trying to engineer them by hand. Interesting, but it feels crazy. Tell me next week if you find some particularly interesting things, or if it still seems crazy, and we can try to demystify it a little.

Okay, so let's do this in Keras. There's really only one main new concept we have to learn, which is that we start out with data not in cross-tab form but in this form — user ID, movie ID, rating triplets — and I cross-tabbed them. Question: so the rows and the columns above the random numbers — are they the variations in the features of the movies and the features of the users? Yes: each of these rows is one feature of a movie, and each of these columns is one feature of a user, and one of these sets of five is the set of features for one user — that user's latent factors. I think it's interesting and crazy that you're basically starting from random numbers and you can generate those features for people you don't know and movies you're not looking at. Yeah — this is the thing I said at the start of class: there's nothing mathematically complicated about gradient descent; the hard part is unlearning the idea that this should be hard. Gradient descent just figures it out.

Did you have a question? One behind you? Okay, a comment: I just wanted to point out that you can think of this as a sparser, smaller, more concise way to represent the movies and the users. In math there's the concept of matrix factorization — SVD, for example — which is where you take a big matrix and turn it into two much smaller matrices, a tall narrow one and a short wide one, and multiply the two together. That's essentially what we're doing: instead of storing how user 14 rated every single movie, we just have five numbers that represent that user. It's pretty cool. Right.

So, earlier, did you say that both the user features and the movie features start out random? Yes. I guess I'm having trouble with that — usually we run something like gradient descent on something that has inputs we know, and here, what do you know? That's all we know: the resulting ratings. So could it come out the wrong way around — could you flip the features for a movie and a user? If you're doing a multiplication, how do you know which value goes where?
Well, if one of the numbers was in the wrong spot, our loss function would be less good, and therefore there would be a gradient from that weight saying you should make this weight a little higher or a little lower. All gradient descent is doing is asking, for every weight: if we make it a little higher, does the loss get better, or if we make it a little lower, does it get better? And then we keep nudging them higher and lower until we can't do any better. We did have to decide how to combine the weights — that was our architecture. Our architecture was: take the dot product of some assumed user features and some assumed movie features and, in the second case, add some assumed bias terms. So we built the architecture using common sense: this seems like a reasonable way of thinking about the problem. I'm going to show you a better architecture in a moment — in fact we're running out of time, so let me jump to the better architecture.

I do want to point out that there's something new we'll have to learn here, which is: how do you start with a numeric user ID and look up that user's five-element latent factor vector? Remember that when we have user IDs like one, two, and three, one way to represent them is with one-hot encoding. So one way to handle this situation would be: if this were our user matrix, one-hot encoded, and we had a factor matrix containing a whole bunch of random numbers, we could take a matrix product of the two. For this first user that would basically grab the first column of the factor matrix, for the second user the second column, and for the third user the third column (there's a tiny numpy illustration of this after this paragraph). So one way to do this in Keras would be to represent our user IDs as one-hot encodings, create a user factor matrix as a regular matrix, and take a matrix product. But that's horribly slow: if we have ten thousand users, then this thing is ten thousand wide, and that's a really big matrix multiplication when all we're actually doing is saying "for user ID one, take the first column; for user ID two, take the second column; for user ID three, take the third". So Keras has something which does this for us, called an embedding layer. An embedding is literally something which takes an integer as input and looks up and grabs the corresponding column as output — it does exactly what we're seeing in this spreadsheet.

Two questions. One: how do you deal with missing values, where a user has not rated a particular movie? That's no problem — missing values are just ignored: if it's missing, I set the loss for that cell to zero. And then: how do you break up a training and test set? I broke up the training and test set randomly, by generating random numbers, checking whether they were greater or less than 0.8, and splitting my ratings into two groups based on that. And you're choosing those from the ratings, so that you have some ratings from all users and some ratings for all movies? They're just grabbed at random, yeah.
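A tiny numpy illustration of that point: multiplying a one-hot vector by a factor matrix just selects one row (or column, depending on how you lay the matrix out — the spreadsheet lays it out by columns) of that matrix, which is all an embedding layer effectively does. The numbers here are arbitrary.

```python
import numpy as np

factors = np.random.randn(4, 5)              # 4 users, 5 latent factors per user
one_hot_user_2 = np.array([0., 0., 1., 0.])  # one-hot encoding of user id 2

via_matmul = one_hot_user_2 @ factors        # (4,) @ (4, 5) -> that user's 5 factors
via_lookup = factors[2]                      # what an embedding layer does: a lookup

assert np.allclose(via_matmul, via_lookup)
```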
Okay, so here it is: here's our dot product in Keras. And there's one other thing — I'm going to stop using the sequential model in Keras and start using the functional model. I'll talk more about this next week, but you can read about it during the week. There are two ways of creating models in Keras, the sequential and the functional; they do similar things, but the functional one is much more flexible, and it's what we're going to need to use. So this is going to look slightly unfamiliar, but the ideas are the same (there's a rough sketch of these models after this paragraph). We create an input layer for a user, then say: create an embedding layer for n users, which is 671, with however many latent factors we want — I decided not to create five but fifty. Then I create a movie input and a movie embedding with 50 factors, and then I say: take the dot product of those, and that's our model. Now compile the model and train it, taking the user ID and movie ID as the input and the rating as the target, run it for six epochs, and I get a 1.27 loss — this is with an RMSE loss. Notice I'm not doing anything else clever; it's just that simple dot product that gets me to 1.27.

Here's how I add the bias. Here are exactly the same kinds of embedding inputs as before, which I've encapsulated in a function, so my user and movie embeddings are the same, and then I create the bias by simply creating an embedding with just a single output. So my new model is: take the dot product, then add the user bias and the movie bias. Fitting that takes me to a validation loss of about 1.1. How is that going? Well, there are lots of sites on the internet where you can find benchmarks for MovieLens, and on the hundred-thousand data set we're generally looking for an RMSE of about 0.89. The best ones here are around 0.9 — oh, here we are, 0.89, although that RMSE is on the one-million data set; going to the hundred-thousand numbers, it's around 0.89 as well. So high 0.89s and low 0.9s would be state of the art according to these benchmarks. We're on the right track, but we're not there yet.

So let's try something better: let's create a neural net. The neural net does the same thing — we create a movie embedding and a user embedding, again with 50 factors — but this time we don't take a dot product; we just concatenate the two vectors, sticking one on the end of the other. And because we now have one big vector we can create a neural net: add a dense layer, add dropout, add an activation, compile it, and fit it. After five epochs we get something way better than that state of the art — we couldn't find anything better than about 0.89. This whole notebook took me about half an hour to write, and I don't claim to be a collaborative filtering expert, but I think it's pretty cool that these benchmarks, which come from people who write collaborative filtering software for a living — these sites are basically from places that build things like LensKit, which is a software package for recommender systems — we have just beaten, and it took us ten seconds to train. So I think that's pretty neat, and we're right on time, so we'll take one last question. For the neural net, why is it that the number of factors... oh, actually, I thought that was an equals sign, not a comma — never mind, we're good.
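Here is a rough sketch of those three models using Keras 1's functional API and its merge function. The layer sizes, dropout amounts, and training calls are approximations of what's described above rather than the exact notebook code; n_users and n_movies are assumed to be the numbers of distinct, contiguously re-indexed users and movies (671 users in the lecture), and trn/val are assumed to be the two random splits of the ratings.

```python
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense, Dropout, merge

n_factors = 50
# assumed: n_users, n_movies already defined; trn, val are the random rating splits

user_in = Input(shape=(1,), dtype='int64', name='user_in')
u = Embedding(n_users, n_factors, input_length=1)(user_in)
movie_in = Input(shape=(1,), dtype='int64', name='movie_in')
m = Embedding(n_movies, n_factors, input_length=1)(movie_in)

# 1. plain dot-product model
x = Flatten()(merge([u, m], mode='dot'))
model = Model([user_in, movie_in], x)
model.compile(optimizer='adam', loss='mse')
model.fit([trn.userId, trn.movieId], trn.rating, nb_epoch=6, batch_size=64,
          validation_data=([val.userId, val.movieId], val.rating))

# 2. dot product plus per-user and per-movie biases (single-output embeddings)
ub = Flatten()(Embedding(n_users, 1, input_length=1)(user_in))
mb = Flatten()(Embedding(n_movies, 1, input_length=1)(movie_in))
x = merge([Flatten()(merge([u, m], mode='dot')), ub, mb], mode='sum')
bias_model = Model([user_in, movie_in], x)
bias_model.compile(optimizer='adam', loss='mse')

# 3. neural net: concatenate the two embeddings and put dense layers on top
x = merge([Flatten()(u), Flatten()(m)], mode='concat')
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dropout(0.7)(x)
x = Dense(1)(x)
nn = Model([user_in, movie_in], x)
nn.compile(optimizer='adam', loss='mse')
```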
Okay. All right, so now you can go home. That was a very, very quick introduction to embeddings — as per usual in this class, I stick the new stuff in at the end and say: go study it. So your job this week is to keep improving State Farm, and hopefully win the new fisheries competition. By the way, in the last half hour before class I created a little notebook in which I basically copied the dogs-and-cats-redux approach over to the fish data and quickly submitted a result, so we currently have one of us in 18th place. Yay. Hopefully you can beat that tomorrow. But most importantly, download the MovieLens data and have a play with that, and we'll talk more about embeddings next week. Thank you.