So one of the things I wanted to talk about, and it really came up when I was looking at the survey responses, is what is different about how we're trying to teach this course, and how will it impact you as participants? We're trying to teach this course in a very different way to the way most teaching is done, or at least most teaching in the United States. Rachel and I are both very keen fans of this guy called David Perkins, who has a wonderful book called Making Learning Whole: How Seven Principles of Teaching Can Transform Education. We are trying to put those principles into practice in this course. I'll give you a little anecdote from the book to give you a sense of how this works. If you were to learn baseball the way that math is taught, you would first of all learn about the shape of a parabola, and then you would learn about the materials science behind stitching baseballs, and so forth. And 20 years later, after you had completed your PhD and postdoc, you would be taken to your first baseball game, and you would be introduced to the rules of baseball, and then 10 years later you might get to hit. The way that baseball is actually taught is we take a kid down to the baseball diamond and we say, these people are playing baseball. Would you like to play? And they say, yeah, sure I would. You say, okay, stand here, I'm going to throw this, hit it. Okay, great. Now run. Good, you're playing baseball. So that's why we started our first class with: here are seven lines of code you can run to do deep learning. Not just to do deep learning, but to do image classification on any data set, as long as you structure it in the right way. So this means you will very often be in a situation, and we've heard a lot of your questions about this during the week, of: gosh, there's a whole lot of details I don't understand. Like this fine-tuning thing, what is fine-tuning?
And the answer is, we haven't told you yet. It's a thing you do in order to do effective image classification with deep learning. We're going to start at the top and gradually work our way down and down and down. The reason that you are going to want to learn the additional levels of detail is so that when you get to the point where you want to do something that no one's done before, you'll know how to go into that detail and create something that does what you want. So we're going to keep going down a level and down a level and down a level, through the hierarchy of software libraries, through the hierarchy of the way computers work, through the hierarchy of the algorithms and the math, but only at the speed that's necessary to get to the next level of: let's make a better model, or let's make a model that can do something we couldn't do before. Those will always be our goals. So it's very different to, I don't know if anybody has been reading the Ian Goodfellow and Yoshua Bengio Deep Learning book, which is a great mathematical deep learning book, but it literally starts with five chapters of everything you need to know about probability, everything you need to know about calculus, everything you need to know about linear algebra, everything you need to know about optimization, and so forth. And in fact, I don't know that in the whole book there's ever actually a point where it says here is how you do deep learning, even if you read the whole thing. I've read about two-thirds of it so far. It's a really good math book, and anybody who's interested in understanding the math of deep learning, I would strongly recommend it. But it's kind of the opposite of how we're teaching this course. So if you often find yourself thinking, I don't really know what's going on, that's fine, okay? But I also want you to always be thinking about, okay, well, how can I figure out a bit more about what's going on?
So we're trying to let you experiment. Generally speaking, the assignments during the week are trying to give you enough room to find a way to dig into what you've learned and learn a little bit more. Make sure you can do what you've seen, and also that you can learn a little bit more about it. So you are all coders, and therefore you are all expected to look at that first notebook and ask: what are the inputs to every one of those cells? What are the outputs from every one of those cells? How is it that the output of this cell can be used as the input to that cell? Why is this transformation going on? This is why we did not tell you how to use the Kaggle CLI, or how to prepare a submission in the correct format, and so forth. Because we wanted you to see if you could figure it out, and also to leverage the community that we have to ask questions when you're stuck, okay? So being stuck and failing is terrific, because it means you have found some limit of your knowledge or your current expertise. You can then think really hard, read lots of documentation, and ask the rest of the community until you are no longer stuck, at which point you now know something that you didn't know before. So that's the goal. Asking for help is a key part of this, and so there is a whole wiki page called How to Ask for Help. It's really important, and so far I would say about half the time I've seen people ask for help, there is not enough information for your colleagues to actually help them effectively. So when people point you at this page, it's not because they're trying to be a pain, it's because they're saying: I want to help you, but you haven't given me enough information. So in particular: what have you tried so far? What did you expect to happen? What actually happened? What do you think might be going wrong? What have you tried to test this out? And tell us everything you can about your computer and your software. Yes, Rachel? That's the first one: where have you looked so far?
Where have you looked so far? Yeah, like, I searched the wiki for these key terms. Great. Show us screenshots, paste in your error messages, show us your code. So the better you get at asking for help, the more enjoyable an experience you're going to have, because you'll continually find your problems are solved very quickly and you can move on. There was a terrific recommendation from the head of Google Brain, Vincent Vanhoucke, in a Reddit AMA a few weeks ago, where he said he tells everybody on his team: if you're stuck, work at it yourself for half an hour. You have to work at it yourself for half an hour. If you're still stuck, you have to ask for help from somebody else. The idea being that you always make sure you try everything you can, but you also never waste your time when somebody else can help you. I think that's a really good suggestion, so maybe you can adopt this half an hour rule yourself. I wanted to highlight a great example of a really successful ask for help. Who asked this particular question? This is really well done. So that was really nice. What's your background before being here at this class? Introduce yourself, please. Hey, I actually graduated from USF last year, and I came back for this class. Thank you. Well, happily, he adopted one of these fantastic approaches to asking for help. You can see here that he explained what he's trying to do, what happened last time, what error message he got. And we've got a screenshot showing what he typed and what came back. He showed us what resources he's currently used, what those resources say, and so forth. And did you get your question answered? Yes? Yes, Rachel will give you an answer. Okay, great. So thanks very much, very good, and thank you for coming back. I barely had to think about this question, because it's just so clear.
I was like, this is easy to answer, because it's a well-asked question. So as you might have noticed, the wiki is rapidly filling out with lots of great information, so please start exploring it. You'll see on the left-hand side there is a recent changes section, and you can see that every day lots of people have been contributing lots of things. So it's continually improving. There are some great diagnostic sections. If you are trying to diagnose something which is not covered, and you solve it, please add your solution to those diagnostic sections. One of the things I loved seeing today was Tom. Where's Tom? Maybe he's remote. Actually, I think he joined remotely yesterday. He was asking a question about how fine-tuning works, and we talked a bit about the answers. And then he went ahead and created a very small little wiki page. There's not much information there, but there's more than there used to be. And this is exactly what we want. And you can even see that in the places where he wasn't quite sure, he put some question marks. So now somebody else can go back and edit his wiki page, and Tom's going to come back tomorrow and say, oh, now I've got even more questions answered. So this is the kind of approach where you're going to learn a lot. Oh, we've already spoken to Melissa, so this is good. This is another great example of something that I think is very helpful, which is that Melissa, who we heard from earlier, went ahead and told us all: here is my understanding of the 17 steps necessary to complete the things that we were asked to do this week. Okay, so this is great, not only for Melissa to make sure she understands it correctly, but then everybody else can say, oh, that's a really handy resource that we can draw on as well. There are 718 messages in Slack in a single channel. That's way too much for you to expect to use this as a learning resource.
So this is kind of my suggestion as to where you might want to be careful about how you use Slack. Okay, so I wanted to spend maybe quite a lot of time, as you can see, talking about the resources that are available, because I feel like if we get that sorted out now, then we're all going to speed along a lot more quickly. So thanks for your patience as we talk about some non-deep-learning stuff. Yeah, we expect the vast majority of the learning work to happen outside of class. And in fact, if we go back and finish off our survey, one of the questions asked about that: how much time are you prepared to commit most weeks to this class? And the majority are 8 to 15 hours. Some are 15 to 30, and a small number are less than eight. Now, if you're in the less than eight group, I understand that's not something you can probably change, right? If you had more time, you'd put in more time. So if you're in the less than eight group, I guess just think about how you want to prioritize what you're getting out of this course, and be aware it's not really designed so that you're going to be able to do everything in less than eight hours a week. So maybe make more use of the forums and the wiki, and kind of focus your assignments during the week on the stuff that you're most interested in. And don't worry too much if you don't feel like you're getting everything, because you have less time available. For those of you in the 15 to 30 group, I really hope that you'll find you're getting a huge amount out of that time you're putting in. Something I'm really glad I asked, because I found it very helpful, was: how much of it was new to you? And for half of you, the answer is most of it. And for well over half of you, most of it or nearly all of it from lesson one was new. So if you're one of the many people I've spoken to during the week who are saying, holy shit, that was a fire hose of information, I feel kind of overwhelmed, but kind of excited.
You are amongst friends. Remember, during the week there are about a hundred of you going through this same journey. So if you wanna catch up with some people during the week and have a coffee to talk more about the class, or join a study group here at USF, or, if you're from the South Bay, find some people in the South Bay, I would strongly suggest doing that. So for example, if you're in Menlo Park, you could create a Menlo Park Slack channel and put out a message saying, hey, is anybody else in Menlo Park available on Wednesday night? I'd love to get together and maybe do some pair programming or whatever. For some of you, not very much of it was new. And for those of you, I do wanna make sure that you feel comfortable pushing ahead, trying out your own projects, and so forth. Basically, in the last lesson, what we learned was a pretty standard data science computing stack: AWS, Jupyter Notebook, a bit of NumPy, Bash. This is all stuff that, regardless of what kind of data science you do, you're gonna be seeing a lot more of if you stick with this area. So these are all very, very useful things. And those of you who have maybe spent some time in this field will have seen most of it before, so that's to be expected. Okay, so hopefully that is some useful background. So last week, we were really looking at the basic foundations, the kind of computing foundations necessary for data science more generally and for deep learning more particularly. This week, we're gonna do something very similar, but we're gonna be looking at the key algorithmic pieces. So in particular, we're gonna go back and say, hey, what did we actually do last week? And why did that work? And how did that work? For those of you who don't have much of an algorithmic background in machine learning, this is gonna be the same fire hose of information as last week was for those of you who don't have so much software and Bash and AWS background.
So again, if there's a lot of information, don't worry, this is being recorded, and all the resources are there during the week. And so the key thing is to come away with an understanding of: what are the pieces being discussed? Why are those pieces important? What are they kind of doing? Even if you don't understand the details. So if at any point you're thinking, okay, Jeremy's talking about activation functions, I have no idea what he just said about what an activation function is or why I should care, please go onto the in-class Slack channel and at-mention Rachel: @rachel, I don't know what Jeremy's talking about at all. And then Rachel's got a microphone and she can let me know, or else put up your hand and I will give you the microphone and you can ask. So I do wanna make sure you guys feel very comfortable asking questions. I have done this class once before now, because I did it for the remote students last night, so I've heard a few of the questions already, and hopefully I can cover some things that are likely to come up. Before we dig into what's going on, the first thing we're gonna do is see how to do the basic homework assignment from last week. So the basic homework assignment from last week was: can you enter the Kaggle Dogs vs. Cats Redux competition? So how many of you managed to submit something to that competition and get some kind of result? Okay, that's not bad. So maybe a third, right? For those of you who haven't yet, keep trying during this week and use all of those resources I showed you to help you, because now quite a few of your colleagues have done it successfully, and therefore we can all help you. And I will show you how I did it. Here is Redux. All right, so the basic idea here is we had to download the data to a directory. To do that, I just typed kg download, after using the kg config command.
kg is part of the Kaggle CLI, and Kaggle CLI can be installed by typing pip install kaggle-cli. This works fine without any changes if you're using our AWS instances and setup scripts. In fact, it works fine if you're using Anaconda pretty much anywhere. If you're not doing either of those two things, you may have found this step more challenging, okay? But once it's installed, it's as simple as saying kg config with your username, password, and competition name. As for the competition name, you can find that out by just going to the Kaggle website; when you go to the competition, the URL has a name in it. So just copy and paste that; that's the competition name. Kaggle CLI is a script that somebody created in their spare time and didn't spend a lot of time on. There's no error handling, there's no checking, there's nothing. For example, if you haven't gone to Kaggle and accepted the competition rules, then attempting to run kg download will not give you an error; it will create a zip file that actually contains the contents of the Kaggle webpage saying please accept the competition rules. So those of you that tried to unzip that and were told it's not a zip file: if you go ahead and cat that file, you'll see, oh, it's not a zip file, it's an HTML file, okay? So this is pretty common with recent-ish data science tools, and particularly with cutting-edge deep learning stuff; a lot of it's pretty new, it's pretty rough, and you really have to expect to do a lot of debugging, okay? It's very different to using Excel or Photoshop or something. Okay, so when I said kg download, it created a test.zip and a train.zip, so I went ahead and unzipped both of those things. That created a test and a train directory, and they contained a whole bunch of files called, you know, cat.1.jpg and so forth. So the next thing I did to make my life easier was I made a list of what I believed I had to do, okay? I find life much easier with a to-do list.
So I thought, okay: I need to create a validation set, I need to create a sample, I need to move my cats into a cats directory and dogs into a dogs directory, I then need to fine-tune and train, and I then need to submit. So I just went ahead and created markdown headings for each of those things and started filling them out. Okay, so: create validation set and sample. A very handy thing in Jupyter Notebook is that you can start a cell with a percent sign and type what they call magic commands. There are lots of magic commands that do all kinds of useful things, but they do include things like %cd and %mkdir. Another cool thing you can do is use an exclamation mark and then type any bash command. So the nice thing about doing this stuff in the notebook rather than in bash is that you've got a record of everything you did. If you need to go back and do it again, you can. If you make a mistake, you can go back and figure it out. So this kind of reproducible research I very, very highly recommend. I try to do everything in a single notebook so I can go back and fix the problems that I always make. So here you can see I've gone into the directory and created my validation set. I then used three lines of Python to grab all of the JPEG file names and create a random permutation of them. The first 2000 of that random permutation are 2000 random files, and I moved them into my validation directory. That gave me my valid directory. I did exactly the same thing for my sample, but rather than moving the files, I copied them. And then I did that for both my sample training and my sample validation sets. That was enough to create my validation set and sample. The next thing I had to do was to move all my cats into a cats directory and dogs into a dogs directory, which was as complex as typing mv cat.* cats/ and mv dog.* dogs/, okay, and that was that.
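That shuffle-and-move step can be sketched in a few lines using only the standard library and NumPy. The directory names, the *.jpg pattern, and the function name here are illustrative, not the exact code from the lesson notebook:

```python
import glob
import os
import shutil

import numpy as np

def make_validation_set(train_dir, valid_dir, n_valid):
    """Move n_valid randomly chosen JPEGs from train_dir to valid_dir."""
    os.makedirs(valid_dir, exist_ok=True)
    files = glob.glob(os.path.join(train_dir, '*.jpg'))
    shuf = np.random.permutation(files)   # random order over all file names
    for f in shuf[:n_valid]:              # first n_valid of the permutation
        shutil.move(f, valid_dir)         # use shutil.copy for a sample set
    return shuf[:n_valid]
```

For the sample directories you would do the same thing with shutil.copy instead of shutil.move, so the training data stays intact.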
And so the cool thing is that now I've done that, I can just copy and paste the seven lines of code from our previous lesson. These lines of code are totally unchanged. I added one more line of code, which was save_weights. Once you've trained something, it's a great idea to save the weights so you don't have to train it again. You can always go back later and say load_weights. Okay, so I now had a model which predicted cats and dogs for my Redux competition. My final step was to submit it to Kaggle. So Kaggle tells us exactly what they expect, and the way they do that is by showing us a sample of the submission file. Basically, the sample shows us that they expect an ID column and a label column. The ID is the file number. If you have a look at the test set, you'll see every file has a number. So it's expecting to get the number of the file along with your probability. So you have to figure out how to take your model and create something of that form. This is clearly something that you're gonna be doing a lot, so once I figured out how to do it, I actually created a method to do it in one step. So I'm gonna go and show you the method that I wrote. Rachel? Yes, I will absolutely share this notebook tomorrow morning. Okay, so I just added it to this utils module that I kind of chucked everything into. Actually, that's not true; I put it in my VGG module, because I added it to the Vgg16 class. So there's a few ways you could possibly do this. Basically, you know that you've got a way of grabbing a mini-batch of data at a time, or a mini-batch of predictions at a time. So one thing you could do would be to grab your, let's say your mini-batch size is 64, you could grab your 64 predictions and just keep appending them, 64 at a time, to an array, and eventually you'd have predictions for all 12,500 test images in an array. That is actually a perfectly valid way to do it. How many people solved it using that kind of approach? Okay, not many of you.
That's interesting, but it works perfectly well. Those of you who didn't, I guess, either asked on the forum or read the documentation and discovered that there's a very handy thing in Keras called predict_generator. What predict_generator does is it lets you send in a bunch of batches, so something that we created with get_batches, and it will run the predictions on every one of those batches and return them all in a single array. So that's what we wanted to do. If you read the Keras documentation, which you should do very often, you will find out that by default this will give you the class labels, so not the probabilities: cat one, dog zero, something like that. In this case, for this competition, they told us that they want probabilities, not labels. So, here is the get_batches that we wrote; you can see all it's doing is calling something else, which is flow_from_directory. To get predict_generator to give you probabilities instead of classes, you have to pass in an extra argument, which is class_mode, and rather than categorical, you have to say None. So in my case, I went ahead and modified get_batches to take an extra argument, class_mode, and then in the test method I created, I added class_mode=None. And so then I could call model.predict_generator passing in my batches, and that's gonna give me everything I need. So I will show you what that looks like. Basically, I say vgg.test, this is the thing I created, and pass in my test directory and my batch size. That returns two things: the predictions and the batches. I can then use batches.filenames to grab the file names, because I need the file names in order to grab the IDs. And that looks like this. Let's take a look at them. Preds, let's look at a few of those. Okay, so there's a few predictions, and let's look at a few file names.
Now, one interesting thing is that, at least for the first five, the probabilities are all ones and zeros, rather than, like, 0.6 and 0.8 and so forth. We're gonna talk about why that is in just a moment. For now, it is what it is; it's not doing anything wrong, it really thinks that that's the answer. So, because Kaggle wants something which is is-dog, all we need to do is grab the second column of the predictions and the numbers from the file names, paste them together as columns, and send that across. So here is grabbing that column from the predictions, which I call isdog. Here is grabbing from the eighth character until the dot in the file names and turning that into an integer, to get my IDs. NumPy has something called stack, which lets you put two columns next to each other, and so here are my IDs and my probabilities. And then NumPy lets you save that as a CSV file using savetxt. You can now either SSH to your AWS instance and use kg submit, or, my preferred technique, use a handy little IPython thing called FileLink. If you type FileLink and pass in a file that is on your server, it gives you a little URL like this, which I can click on, and it downloads the file to my computer. And so now, on my computer, I can go to Kaggle and just submit it in the usual way. I prefer that because, if there are any error messages or anything going wrong on Kaggle, I can see what's happening. So as you can see, rerunning what we learned last time to submit something to Kaggle really just requires a little bit of coding to create the submission file, a little bit of bash scripting to move things into the right place, and then rerunning the seven lines of code. The actual deep learning itself is incredibly straightforward. Okay, now here's where it gets interesting. When I submitted my ones and zeros to Kaggle, where was I put? Let's have a look at the leaderboard.
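The submission-building step can be sketched like this. The predictions and file names below are hypothetical stand-ins for what vgg.test returns, and the 'unknown/' prefix (which makes the ID start at character index 8) is an assumption about how the test directory is laid out:

```python
import numpy as np

# Stand-in values: each prediction row is (P(cat), P(dog)); file names come
# back from the batches as 'unknown/<id>.jpg'.
preds = np.array([[1., 0.], [0., 1.], [1., 0.]])
filenames = ['unknown/1.jpg', 'unknown/10.jpg', 'unknown/42.jpg']

isdog = preds[:, 1]                                 # second column = P(dog)
ids = np.array([int(f[8:f.rfind('.')]) for f in filenames])  # char 8 to dot
subm = np.stack([ids, isdog], axis=1)               # two columns side by side
np.savetxt('submission.csv', subm, fmt='%d,%.5f',
           header='id,label', comments='')          # Kaggle-style CSV
```

In the notebook you would then display FileLink('submission.csv') to get a clickable download link.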
Well, the first thing I did was accidentally submit is-cat rather than is-dog, and that put me in last place; my loss was 38. Then, when I submitted ones and zeros, I was in 110th place, which is still not that great. Now, the funny thing was, I was pretty confident that my model was doing well, because the validation set for my model told me that my accuracy was 97.5%; here it is, validation accuracy 97.5%. I'm pretty confident that the people on Kaggle are not all doing better than that. So I thought, something weird's going on. So that's a good time to figure out: well, what does this number mean? What is a 12, or what is a 17, or whatever? So let's go and find out. It says here that it is a log loss. So if we go to the evaluation page, we can find out what log loss is, and here is the definition. Log loss is known in Keras as binary cross-entropy or categorical cross-entropy. And it will actually be very familiar, because every single time we've been creating a model, we have been using it; let's go and find it. When we compile a model, we've always been using categorical cross-entropy. So it's probably a good time for us to find out what the hell this means. All right, so the short answer is: it is this mathematical function. But let's dig into this a little more and find out what's going on. I would strongly recommend that when you want to understand how something works, you whip out a spreadsheet. Spreadsheets are my favorite tool for doing small-scale data analysis. They are perhaps the least well utilized tool amongst professional data scientists, which I find really surprising, because back when I was in consulting, everybody used them for everything, and they were the most overused tool. So what I've done here is I've gone ahead and created a little column of is-cat and is-dog, so this is the correct answer, and I've created a little column of some possible predictions.
And then I've just gone in and typed in the formula from that Kaggle page. So here it is. It's basically the truth label times the log of the prediction, plus one minus the truth label times the log of one minus the prediction, all negated. Now, if you think about it, the truth label is always one or zero. So this is actually probably more easily understood using an if function; it's exactly the same thing, but rather than multiplying by one and zero, let's just use an if function. So in other words: if it's a cat, then take the log of the prediction; otherwise, take the log of one minus the prediction. Now, this is hopefully pretty intuitive. If it's a cat and your prediction is really high, then we're taking the log of that and getting a small number. If it's not a cat and our prediction is really low, then we want to take the log of one minus that. And so you can kind of get a sense of it by looking here. Here's a non-cat which we thought was a non-cat, and therefore we end up with the log of one minus that, which is a low number. Here's a cat which we were pretty confident wasn't a cat, so here is the log of that, which is a big number. And notice this all has a negative sign at the front, just so that smaller numbers are better. So this is log loss, or binary or categorical cross-entropy. And this is where we find out what's going on, because I'm now gonna say: well, what did I submit? I submitted predictions that were all ones and zeros. So what if I submit ones and zeros? Ouch. Okay, why is that happening? Because we're taking logs of ones and zeros. That's no good. So actually, Kaggle is being pretty nice not to just return an error. And I actually know why this happens, because I wrote this functionality on Kaggle. Kaggle clips it by a tiny amount, I can't remember exactly, something like 0.0001, just to make sure it doesn't die. So if you say one, it actually treats it as 0.9999; if you say zero, it actually treats it as 0.0001.
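The spreadsheet formula translates directly into NumPy. This is just a minimal sketch of the per-example log loss described above:

```python
import numpy as np

def log_loss(y, p):
    """Per-example log loss: -(y*log(p) + (1-y)*log(1-p)), y in {0, 1}."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident correct prediction costs almost nothing...
print(log_loss(1, 0.95))   # ~0.051
# ...while a confident wrong prediction is penalized heavily.
print(log_loss(1, 0.05))   # ~3.0
```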
So our incredibly overconfident model is getting massively penalized for that overconfidence. So what would be better to do, instead of sending across ones and zeros, is to send across actual probabilities you think are reasonable. In my case, what I did was I added a line, which is here: I said numpy.clip on the first column of my predictions, and clipped it to 0.05 and 0.95. So anything less than 0.05 becomes 0.05, and anything greater than 0.95 becomes 0.95. And then I tried submitting that, and that moved me from 110th place to 40th place. And suddenly I was in the top half. So the goal of this week was really to try to get in the top half of this competition, and that's all you had to do: run a single epoch, and then realize that with this evaluation function you need to be submitting things that aren't ones and zeros. Okay. There are two questions. Yes? The question is about cross-entropy: isn't uncertainty maximized when your prediction is 0.5, but log loss maximized if you're very confident in the wrong direction? Let's take that one offline to the forum, because I actually need to think about it properly. Sorry. Another question is: how did you decide on that clipping range? Yeah. So probably I should have used, and I'd be interested in trying this tomorrow maybe when I do a resubmission, probably I should have used 0.025 and 0.975, because I actually know that my accuracy on the validation set was 0.975, so that's probably the probability I should have used. I would need to think about it more, though, because with a nonlinear loss function like this, is it better to underestimate how confident you are, or to overestimate it? So I would need to think about it a little bit. In the end I kind of said, okay, well, it's about 97.5, and I have a feeling that being overconfident might be a bad thing because of the shape of the function, so I'll just be a little bit on the tame side. That's the short answer to that question.
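Here's a small self-contained demonstration of why the clipping helps. The 1e-4 epsilon mirrors the guard value Jeremy describes (he says he can't remember the exact number, so treat it as an assumption), and the truth/prediction arrays are made up to give 80% accuracy:

```python
import numpy as np

def mean_log_loss(y, p, eps=1e-4):
    """Mean log loss with a Kaggle-style guard against log(0)."""
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    y = np.asarray(y, float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])     # ground truth
hard = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1])  # 0/1 preds, 2 wrong (80%)
soft = np.clip(hard, 0.05, 0.95)                 # same preds, hedged

print(mean_log_loss(y, hard))  # huge: each wrong answer costs ~ -log(1e-4)
print(mean_log_loss(y, soft))  # much smaller, despite identical accuracy
```

Both submissions classify the images identically; only the confidence differs, and that alone moves the loss dramatically.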
I then later on tried 0.02 and 0.98, and I did actually get a slightly better answer. I actually got a little bit better than that: this afternoon I ran a couple more epochs just to see what would happen, and that got me to 0.24. Okay, so I'll show you how you can get to that position, and it's incredibly simple. You take these two lines here, fit and save_weights, and copy and paste them a bunch of times. You can see I saved the weights under a different file name each time, just so that I can always go back and use a model that I created earlier. Something we'll talk about more in the class later is this idea that halfway through, after two epochs, I changed my learning rate from 0.1 to 0.01, just because I happen to know this is often a good idea. I haven't actually tried it without doing that; I suspect it might be just as good or even better, but that was just something I tried. So, interestingly, by the time I'd run four epochs, my accuracy was 98.3%. That would have been second place in the original Cats vs. Dogs competition. So you can see it doesn't take much to get really good results. And each one of these epochs took, as you can see, about 10 minutes to run on my AWS P2 instance. I have a question: so how does this compare to the original competition? The original Cats vs. Dogs competition used a different evaluation function, which was just accuracy. They changed it for the Redux one to use log loss, which makes it a bit more interesting, I think. Yes. Another question is: is this the same as changing nb_epoch? Yes, and the reason I didn't just say nb_epoch equals four is that I really wanted to save the result after each epoch under a different weights file name, so that if at some point it overfit, I could always go back and use one that I got in the middle. Another question is: what decides the right number of runs? Do they get better and better with more training?
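The copy-and-paste pattern can equally be written as a loop. This is just a sketch of the per-epoch checkpoint naming; the actual training calls are left as comments, since they depend on the vgg object from the lesson notebook:

```python
# Checkpoint-per-epoch pattern: train one epoch at a time and save the
# weights under a distinct file name, so any intermediate model can be
# restored later if a subsequent epoch turns out to have overfit.
for epoch in range(4):
    # vgg.fit(batches, val_batches, nb_epoch=1)   # one epoch of training
    weights_file = 'ft%d.h5' % (epoch + 1)        # ft1.h5, ft2.h5, ...
    # vgg.model.save_weights(weights_file)
    print(weights_file)
```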
We're gonna learn a lot about that in the next couple of weeks. In this case, we have added a single linear layer to the end (we're about to learn a lot about this), and so we actually are not training very many parameters. So my guess would be that in this case we could probably run as many epochs as we like, and it would probably keep getting better and better until it eventually levels off. That would be my guess. So I wanted to talk about, like, what are these probabilities? And one way to do that, and also to talk about how you can make this model better, is: anytime I build a model and I think about how to make it better, my first step is to draw a picture. Yes. Okay, we have lots of questions, which are excellent. Good. What is the difference between categorical cross entropy and sparse categorical cross entropy? Let's take that one offline onto the forum, because we don't need to cover it today. Data scientists don't draw enough pictures. Now, when I say draw pictures, I mean everything from printing out the first five lines of your array to see what it looks like, to drawing complex plots. For computer vision, you can draw lots of pictures, because we're classifying pictures. I've given you some tips here about what I think are super useful things to visualize. So when I wanted to find out how come my Kaggle submission was in 110th place, I ran my kind of standard five steps. And the standard five steps are: let's look at a few examples of images we got right. Let's look at a few examples of images we got wrong. Let's look at some of the cats that we felt were the most cat-like, some of the dogs that we felt were the most dog-like, and vice versa: some of the cats that we were the most wrong about, some of the dogs we were most wrong about. And then finally, some of the cats and dogs that our model is the most unsure about. This little bit of code I suggest you keep around somewhere, because this is a super useful thing to do any time you do image recognition.
So the first thing I did was I loaded my weights back up, just to make sure that they were there, and I took them from my very first epoch and I used that vgg.test method that I just showed you that I created. And this time I passed in the validation set, not the test set, because for the validation set I know the correct answer. So then from the batches I could get the correct labels, and I could get the file names. I then grabbed the probabilities and the class predictions, and that then allowed me to do the five things I just mentioned. So here's number one, a few correct labels at random. So numpy.where the prediction is equal to the label; let's then get a random permutation, grab the first four, and plot them by index. So here are four examples of things that we got right. And not surprisingly, this cat looks like a cat and this dog looks like a dog. Here are four things we got wrong. And so that's interesting. You can kind of see: here's a very black, underexposed thing on a bright background. Here is something that is at a totally unusual angle, and here is something that's so curled up you can't see its face. And this one you can't see its face either. So this gives me a sense of, like, okay, the things it's getting wrong, it's reasonable to get those things wrong. If you looked at these and they were really obvious cats and dogs, you would think, okay, there's something wrong with your model. But in this case, I'm thinking, no, the things that it's finding hard are genuinely hard. Okay, here are some cats that we felt very sure are cats. Here are some dogs we felt very sure are dogs. Yes. The weights you're saving, are those the ImageNet weights, or from when we trained on cats and dogs? Yeah, so these weights, this one here, results/ft1.h5: the "ft" stands for fine-tune. And you can see here I saved my weights after I did my fine-tuning. So these are the cats and dogs ones. Oh, the fine-tuning, are you just training the last layer?
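The selection logic described here can be sketched with plain numpy. The labels and probabilities below are random stand-ins for the real model outputs, just to show the indexing pattern behind the five standard steps:

```python
import numpy as np

np.random.seed(0)
# Toy stand-ins for the validation set: true labels and predicted
# probabilities of class 1; in the notebook these come from vgg.test().
labels = np.random.randint(0, 2, size=20)
probs = np.clip(np.random.rand(20), 0.01, 0.99)
preds = (probs > 0.5).astype(int)

n_view = 4
# 1 & 2: a few random correct and incorrect examples
correct = np.where(preds == labels)[0]
incorrect = np.where(preds != labels)[0]
rand_correct = np.random.permutation(correct)[:n_view]
rand_incorrect = np.random.permutation(incorrect)[:n_view]

# 3 & 4: most confidently class 1 among the correct ones,
# and most confidently class 1 among the wrong ones
confident_right = correct[np.argsort(probs[correct])[::-1][:n_view]]
confident_wrong = incorrect[np.argsort(probs[incorrect])[::-1][:n_view]]

# 5: the examples the model is most unsure about (prob closest to 0.5)
most_uncertain = np.argsort(np.abs(probs - 0.5))[:n_view]
```

Each of these index arrays can then be passed to a plotting helper to display the corresponding validation images.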
Yes, I'm just training the last layer. We're not talking about that yet; we just used the fine-tune command, and later today we're gonna learn about what that does. Okay. So these I think are the most interesting: here are the images we were very confident were cats, but they're actually dogs. And you can see, okay, well, here's one that is only 50 by 60 pixels. That's very difficult. Here's one that's almost totally in front of a person and is also standing upright. That's difficult because it's unusual. This one is very white and is totally from the front. That's quite difficult. And this one, I'm guessing the color of the floor and the color of the fur are nearly identical. Okay, so again, this makes sense. These do look genuinely difficult. And so if we wanna do really well in this competition, we might start to think about, okay, should we start building some models of very, very small images? Because we now know that sometimes Kaggle gives us 50 by 50 images, which are gonna be very difficult for us to deal with. Here are some pictures that we were very confident were dogs, but they're actually cats. And again, not being able to see the face seems like a common problem. And then finally, here are some examples that we were most uncertain about. Now, notice that the most uncertain are still not very uncertain; they're still nearly one or nearly zero. So why is that? Well, we will learn in a moment about exactly what is going on from a mathematical point of view when we calculate these things. But the short answer is: the probabilities that come out of a deep learning network are not probabilities in any statistical sense of the term. So this is not actually saying that there is a one in 100,000 chance that this is a dog. It's only a probability in the mathematical sense of the word. And in math, a probability just means it's between zero and one, and all of the possibilities add up to one.
It's not a probability in the sense that it actually tells you how often this is gonna be right versus wrong. So for now, just be aware of that: when we talk about these probabilities that come out of neural network training, you can't interpret them in any kind of intuitive way. We will learn about how to create better probabilities down the track. Every time you do another epoch, your network is gonna get more and more confident. This is why, when I loaded the weights, I loaded the weights from the very first epoch. If I had loaded the weights from the last epoch, they all would have been one and zero. So this is just something to be aware of. Okay, so hopefully you can all go back and get great results on the Kaggle competition. Even though I'm gonna share all this, you will learn a lot more by trying to do it yourself and only referring to this when and if you're stuck. And then if you do get stuck, rather than copying and pasting my code, find out what I used, go to the Keras documentation and read about it, and then try and write that line of code without looking at mine. The more you can do that, the more you'll think, okay, I can do this. I understand how to do this myself. All right, just some suggestions; it's entirely up to you. Okay, so let's move on. So now that we know how to do this... oh, I wanted to show you one other thing, which is that the last part of the homework was: redo this on a different data set, right? And so I decided to grab the State Farm Distracted Driver Competition. The Kaggle State Farm Distracted Driver Competition has pictures of people in 10 different types of distracted driving, ranging from drinking coffee to changing the radio station. I wanted to show you how I entered this competition. It took me a quarter of an hour to enter, and all I did was duplicate my Cats and Dogs Redux notebook and then start basically rerunning everything.
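Here is a small illustration of why those output "probabilities" pile up near zero and one. This is my own toy example: a two-class softmax whose input scores simply grow in scale, which is roughly what more epochs of training do to the final-layer inputs:

```python
import numpy as np

def softmax(z):
    """Standard softmax: exponentiate and normalize to sum to one."""
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

# The same relative preference for class 1 over class 0, at growing scales.
for scale in [1, 5, 20]:
    print(softmax(np.array([1.0, 2.0]) * scale))
```

At scale 1 the output is a modest (0.27, 0.73); by scale 20 it is numerically indistinguishable from (0, 1), even though nothing about the underlying preference has changed. Values between zero and one that sum to one, but not calibrated frequencies.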
But in this case, it was even easier, because when you download the State Farm competition data, they have already put it into directories, one for each type of distracted driving, which I was delighted to discover. Let's go to it. Okay, so if I type tree -d, that shows you my directory structure. You can see that train already had 10 directories. Actually, it didn't have valid; I had to create that. But train already had the 10 directories, so I could skip that whole section. So I only had to create the validation and sample sets. If all I wanted to do was enter the competition, I wouldn't even have had to have done that, okay? So I won't go through it, but it's basically the same code as I had before to create my validation set and sample. I deleted all of the bits which moved things into separate subfolders. I then used exactly the same seven lines of code as before. And that was basically done, okay? I'm not getting good accuracy yet, and I don't know why, so I'm gonna have to actually figure out what's going on with this. But as you can see, this general approach works for any kind of image classification. There's nothing specific about cats and dogs. So you now have a very general tool in your toolbox. And all of the stuff I showed you about visualizing the errors, you can use all of that as well. So maybe when you're done, you could try this too. Yes, a question: suppose that instead of cats and dogs we had something like CT scans. Do you think that would work? Since I don't really know how much the cats and dogs features would generalize; my understanding is that basically there are some filters that are quite general. Yeah. So the question is, would this work for CT scans and, say, cancer? And I can tell you that the answer is yes, because I've done it.
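The "create a validation set" step amounts to moving a random slice of each class folder out of train. Here is a self-contained sketch using a made-up miniature directory tree; the class names (`c0`, `c1`) and file counts are invented for illustration:

```python
import os
import random
import shutil
import tempfile

# Build a hypothetical miniature version of the State Farm layout:
# train/<class>/ directories already exist, full of images.
root = tempfile.mkdtemp()
classes = ['c0', 'c1']
for c in classes:
    os.makedirs(os.path.join(root, 'train', c))
    for i in range(10):
        open(os.path.join(root, 'train', c, 'img_%d.jpg' % i), 'w').close()

# Carve a validation set out of train by moving a random 20% per class.
random.seed(0)
for c in classes:
    os.makedirs(os.path.join(root, 'valid', c))
    files = os.listdir(os.path.join(root, 'train', c))
    for f in random.sample(files, 2):
        shutil.move(os.path.join(root, 'train', c, f),
                    os.path.join(root, 'valid', c, f))
```

A sample set is built the same way, except the files are copied rather than moved.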
So my previous company was something called Enlitic, which was the first company to apply deep learning to medical diagnostics. And the first thing I did with four of my staff was we downloaded the National Lung Screening Trial data, which is a thousand examples of CT scans of the lungs of people with cancer, and 5,000 examples of CT scans of the lungs of people without cancer. We did the same thing: we took ImageNet, we fine-tuned ImageNet, but in this case, instead of cats and dogs, we had malignant tumour versus non-malignant tumour. We then took the result of that and saw how accurate it was. And we discovered that it was more accurate than a panel of four of the world's best radiologists. And that ended up getting covered on CNN. So making major breakthroughs in a domain is not necessarily technically that challenging. The technical challenges in this case were really about dealing with the fact that CT scans are pretty big, so we had to think about some resource issues. Also, they're black and white, so we had to think about how to adapt our ImageNet pre-training to black and white, and stuff like that. But the basic approach was really not much more or different code than what you see here. Oh, no, the State Farm data is four gigabytes, and I only downloaded it like half an hour before class started, so I only ran a small fraction of an epoch just to make sure that it works. Running a whole epoch probably would have taken overnight. All right. So let's go back to lesson one; there was a little bit at the end that we didn't look at. Actually, before we do, now's a good time for a break, so let's have a 12-minute break. Let's come back at 8 p.m. And one thing that you may consider doing during those 12 minutes, if you haven't done it already, is to fill out the survey. I will place the survey URL back onto the in-class page. So yeah, see you in 12 minutes. Okay, thanks everybody.
How many of you have watched this video? Okay, some of you haven't. You need to, because, as I've mentioned a couple of times in our emails, the last two thirds of it was actually a surprise lesson zero of this class, and it's where I teach about what convolutions are. Okay, so if you haven't watched it, please do. Rachel will add it to the in-class Slack channel and also to the lesson two resources wiki page. It's really, really important that you watch this video. The first 20 minutes or so is more of a general backgrounder, but the rest is a discussion of exactly what convolutions are. For now, I'll try not to assume too much that you know what they are; the rest of this hopefully will be reasonably standalone anyway. But I want to talk about fine-tuning, and I want to talk about why we do fine-tuning. Why do we start with an ImageNet network and then fine-tune it, rather than just train our own network? And the reason why is that an ImageNet network has learnt a hell of a lot of stuff about what the world looks like. A guy called Matt Zeiler wrote this fantastic paper a few years ago in which he showed us what these networks learn. And in fact, the year after he wrote this paper, he went on to win ImageNet. So this is a powerful example of why spending time thinking about visualizations is so helpful. By spending time visualizing networks, he realized what was wrong with the networks at the time, made them better, and won the next year's ImageNet. We're not going to talk about that; we're going to talk about some of these pictures he drew. Here are nine examples of what the very first layer of an ImageNet convolutional neural network looks like. What do the filters look like? And you can see here that, for example, here is a filter that learns to find a diagonal edge or diagonal line. You can see it's looking for something where there are no pixels, then bright pixels, then no pixels again: that's finding a diagonal line.
Here's something that finds a diagonal line in the other direction. Here's something that finds a gradient, horizontal, from orange to blue. Here's one diagonal from orange to blue. As I said, these are just nine of the filters; there's actually, I can't remember, maybe about 60 or so of them in layer one of this ImageNet-trained network. So what happens (those of you who have watched the video I just mentioned will be aware of this) is that each of these filters gets placed pixel by pixel, or group of pixels by group of pixels, over an image, to find which parts of the image it matches: so, which parts have a diagonal line. And over here it shows nine examples of little bits of actual ImageNet images which match this first filter. As you can see, they are all little diagonal lines. And here are nine examples which match the next filter, so diagonal lines in the opposite direction, and so forth. The filters in the very first layer of a deep learning network are very easy to visualize. This has been done for a long time, and we've known for a long time that this is what they look like. We also know, incidentally, that the human vision system is very, very similar: it has filters that look much the same. Yes. So, to really answer the question of what we're talking about here, I would say watch the video. But the short answer is: this is a seven by seven pixel patch which is slid over the image, one group of seven pixels at a time, to find which seven by seven patches look like that. And here is one example of a seven by seven patch that looks like that. So for example, for this gradient, here are some examples of seven by seven patches that look like that. So we know the human vision system actually looks for very similar kinds of things. These kinds of things that it looks for are called Gabor filters, and if you Google for Gabor filters, you can see some examples.
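You can get a feel for this sliding-filter idea with a few lines of numpy. This is my own toy example, a hand-made 3x3 diagonal filter over a tiny image, not the actual learnt 7x7 ImageNet filters:

```python
import numpy as np

# A hand-made 3x3 "diagonal line" filter: positive along the diagonal,
# negative elsewhere (a crude stand-in for the learnt first-layer filters).
filt = np.array([[ 1., -1., -1.],
                 [-1.,  1., -1.],
                 [-1., -1.,  1.]])

def correlate2d_valid(img, f):
    """Slide the filter over every patch and record how well it matches."""
    h, w = f.shape
    out = np.zeros((img.shape[0] - h + 1, img.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i+h, j:j+w] * f).sum()
    return out

img = np.eye(5)                       # a tiny image: one bright diagonal line
resp = correlate2d_valid(img, filt)   # strongest response along the diagonal
```

Patches that contain the diagonal line score 3 (the filter's maximum here), while off-diagonal patches score zero or negative, which is exactly the "which parts of the image match this filter" operation described above.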
It's a little bit harder to visualize what the second layer of a neural net looks like, but Zeiler figured out a way to do it. And in his paper, he shows us a number of examples of the second layer of his ImageNet-trained neural network. Because we can't directly visualize them, instead we have to show examples of what the filter looks for. So here is an example of a filter which clearly tends to pick up corners. In other words, it's taking the straight lines from the previous layer and combining them to find corners. There's another one which is learning to find circles, and another one which is learning to find curves. And you can see here are nine examples from actual pictures on ImageNet which did get heavily activated by this corner filter, and here are some that got heavily activated by this circle filter. The third layer then can take these filters and combine them. And remember, this is just 16 out of, I think, about 100 which are actually in the ImageNet architecture. So in layer three we can combine all of those to create even more sophisticated filters. In layer three there's a filter which can find repeating geometrical patterns. Here's another filter. What's this one finding? Let's go and look at the examples. Well, that's interesting: it's finding pieces of text. And here's something which is finding the edges of natural things like fur and plants. Okay. Layer four is finding certain kinds of dog face. Layer five is finding the eyeballs of birds and reptiles, and so forth. So there are 16 layers in our VGG network. What we do when we fine-tune is we say: let's keep all of these learnt filters and use them, and then just learn how to combine the most complex, subtle, nuanced filters to find cats versus dogs, rather than combining them to learn a thousand categories of ImageNet. This is why we do fine-tuning.
So, coming back to Yannet's earlier question about whether this works for CT scans and lung cancer, where the answer was yes: these kinds of filters that find dog faces are not very helpful for looking at a CT scan and looking for cancer, but the earlier ones that can recognize repeating patterns or corners or curves certainly are. So really, regardless of what computer vision work you're doing, starting with some kind of pre-trained network is almost certainly a good idea, because at some level that pre-trained network has learnt to find some kinds of features that are going to be useful to you. If you start from scratch, you have to learn them from scratch. In cats versus dogs, we only had 25,000 pictures, and to learn this whole hierarchy of geometric and semantic structures from 25,000 pictures would have been very difficult. Okay, so let's not learn it; let's use one that's already been learnt on ImageNet, which is one and a half million pictures. So that's the short answer to the question, why do fine-tuning? The longer answer really requires answering the question, what exactly is fine-tuning? And to answer that, we have to answer the question, what exactly is a neural network? Okay, before we do, let's take two more questions. How do you decide which layer to go back to, to fine-tune from? Do you have to select the right layer? And what's the other one? They're the same thing. Okay, so the question is: which layer should you fine-tune from? We'll learn more about this shortly, but the short answer is, if you're not sure, try all of them. Generally speaking, if you're doing something with natural images, the second-to-last layer is very likely to be the best, but I just tend to try a few, and we're going to see, actually today or next week, some ways that we can experiment with that question.
Okay, so as per usual, in order to learn about something, we will use Excel. And here is a deep neural network in Excel. Rather than having a picture with lots of pixels, I just have three inputs: a single row with three inputs, which are x1, x2 and x3, and the numbers are two, three and one. And rather than trying to pick out whether it's a dog or a cat, we're going to assume there are two outputs, five and six. So here's a single row that we're feeding into a deep neural network. So what is a deep neural network? A deep neural network basically is a bunch of matrix products. So what I've done here is I've created a bunch of random numbers. They are normally distributed random numbers, and this is the standard deviation that I'm using for my normal distribution, with zero as the mean. Okay, so here's a bunch of random numbers. What if I then take my input vector and matrix-multiply it by my random weights? Here it is: matrix-multiply that by that, and here is the answer I get. So for example, 24.03 equals 2 times 11.07, plus 3 times negative 2.81, plus 1 times 10.31, and so forth. Any of you who are either not familiar with, or a little shaky on, your matrix-vector products: tomorrow, please go to the Khan Academy website, look for linear algebra, and watch the videos about matrix-vector products. They are very, very simple, but you also need to understand them intuitively and comfortably, just like you understand plus and times in regular algebra. I really want you to get to that level of comfort with linear algebra, because this is the basic operation we're doing again and again. Yes, Rachel? Can you increase the font size of the formula? I don't think I can, but instead I will select it and point it out: there you go, you can see it. So it's matrix-multiply, and then the blue things times the red things. Okay. So that is a single layer.
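In numpy rather than Excel, that single layer is just the following. The weight values here are random, so only the arithmetic, not the specific 24.03, carries over from the spreadsheet:

```python
import numpy as np

np.random.seed(1)
x = np.array([2., 3., 1.])   # the spreadsheet's single input row
W1 = np.random.randn(3, 4)   # 3 inputs in, 4 activations out

a1 = x @ W1                  # the matrix-vector product: one "layer"
# Each activation is a weighted sum of the inputs, e.g.
# a1[0] == 2*W1[0, 0] + 3*W1[1, 0] + 1*W1[2, 0]
```

That one `@` is the entire computation of a linear layer; everything else in the spreadsheet is repetitions of it.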
How do we turn that into multiple layers? Not surprisingly, we create another bunch of weights, and now we take those new weights times the previous activations, with our matrix multiply, and we get a new set of activations. And then we do it again: create another bunch of weights, multiply them by our previous set of activations, and get another set of activations. Note that the number of columns in your weight matrix is something you can decide; you can make it as big or as small as you like, as long as the last one has the same number of columns as your output. So we had two outputs, five and six, so our final weight matrix had to have two columns, so that our final activations have two things. Okay, so with our random numbers, our activations are not very close to what we hoped they would be, not surprisingly. So the basic idea here is that we now have to use some kind of optimization algorithm to repeatedly make the weights a little bit better and a little bit better, and we will see how to do that in a moment. But for now, hopefully you're all familiar with the idea that there is such a thing as an optimization algorithm: something that takes the output of some kind of mathematical function and finds the inputs to that function that make the output as low as possible. And in this case, the thing we would want to make as low as possible would be something like the sum of squared errors between the activations and the outputs. I want to point out something here, which is that when we stuck in these random numbers, the activations that came out were not only wrong, they're not even on the same general scale as the activations that we wanted. That's a bad problem, precisely because they're so much bigger than the scale that we were looking for: as we change these weights just a little bit, it's going to change the activations by a lot, and this makes it very hard to train.
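Stacking the layers and scoring the result against the two targets looks like this in numpy. This is my own sketch of the spreadsheet; the 4- and 3-column middle layers are arbitrary choices, as discussed, but the last matrix must have 2 columns to match the 2 outputs:

```python
import numpy as np

np.random.seed(2)
x = np.array([2., 3., 1.])    # inputs
y = np.array([5., 6.])        # the outputs we hope to produce

# Three random weight matrices: 3 -> 4 -> 3 -> 2
W1 = np.random.randn(3, 4)
W2 = np.random.randn(4, 3)
W3 = np.random.randn(3, 2)

a = x @ W1 @ W2 @ W3          # a stack of matrix products
loss = ((a - y) ** 2).sum()   # sum of squared errors vs. the targets
```

With random weights the loss is large; the whole job of training is to drive this number down by adjusting W1, W2 and W3.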
In general, you want your neural network to start off, even with random weights, with activations which are all of a similar scale to each other, and with the final activations of a similar scale to the output. For a very long time, nobody really knew how to do this, and so for a very long time, people could not really train deep neural networks. It turns out that it is incredibly easy to do, and there is a whole body of work on neural network initialization. It turns out that a really simple and really effective scheme is called Xavier initialization, named after its inventor, Xavier Glorot, and it is 2 divided by (n_in plus n_out). Like many things in deep learning, you will find this complex-sounding thing, the Xavier weight initialization scheme, and when you look into it, you will find it is about this easy. This is about as complex as deep learning gets. So I am now going to go ahead and implement the Xavier deep learning weight initialization scheme in Excel. I'm going to go up here and type equals 2 divided by... okay, we have 3 in plus 4 out... and put that in brackets, because we are complex and sophisticated mathematicians, and press enter. Okay, there we go; so now my first set of weights has that as its standard deviation. My second set of weights I actually have pointing at the same place, because with 4 in and 3 out it gives the same value. And then my third needs to be equals 2 divided by (3 in plus 2 out). Okay, done. So I have now implemented it in Excel, and you can see that my activations are indeed of the right general scale. Okay, so generally speaking, you would normalize your inputs and outputs to be mean zero and standard deviation one, and if you use these... I was just going to say, to be explicit: this activation output in row 30, we want those to be close to the 5 and 6 we are trying to get towards. We want them to be of the same kind of scale.
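A sketch of the same idea in numpy. One detail worth flagging: in Glorot's paper, 2/(n_in + n_out) is the variance of the weights, so the standard deviation handed to the random number generator is its square root; the spreadsheet's spoken description glosses over this distinction:

```python
import numpy as np

def xavier(n_in, n_out):
    """Glorot/Xavier initialization: weight variance 2/(n_in + n_out)."""
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))

np.random.seed(3)
x = np.random.randn(1000, 100)                 # unit-scale inputs
a = x @ xavier(100, 100) @ xavier(100, 100)    # two Xavier-initialized layers
print(a.std())   # stays near 1 rather than exploding or vanishing
```

Swap `xavier` for plain `np.random.randn(n_in, n_out)` and the standard deviation of `a` blows up by roughly a factor of n_in per layer, which is exactly the "activations on the wrong scale" problem described above.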
I mean, obviously they're not going to be exactly 5 and 6, because we haven't done any optimization yet, but we don't want them to be like 100,000; we want them to be somewhere around 5 and 6. Eventually we want them to get close to 5 and 6, yes, exactly. And so if we start off with them really, really high, or really, really low, then optimization is going to be really finicky and really hard to do. And so for decades, when people tried to train deep neural networks, the training took forever or was so incredibly unresilient it was useless. And this one thing, better weight initialization, was a huge step. And we're talking, I think, maybe three years ago that this was invented; so this is not like we're going back a long time, this is relatively recent. Now, the good news is that Keras, and pretty much any decent neural network library, will handle your weight initialization for you. Until very recently, they pretty much all used this; there are some even more recent, slightly better approaches, but they'll give you a set of weights where your outputs will generally have a reasonable scale. A question: it seems a little bit arbitrary, the dimensions we're using here, the dimensions of the different matrices that we have? I just wanted to say that's about network architecture. Yeah. So what's not arbitrary is: you are given your input dimensionality, so in our case, for example, it would be 224 by 224 pixels; in this spreadsheet I'm saying it's just three things. You are also given your output dimensionality: for cats and dogs it's two, cat and dog, and here I'm saying it's two; we're told it's two. The thing in the middle, how many columns each of your weight matrices has, is entirely up to you. The more columns you add, the more complex your model, and we're going to learn a lot about that. And as Rachel said, this is
all about your choice of architecture. So in my first one here, I had four columns and therefore four outputs; in my next one, I had three columns and therefore three outputs; and in my final one, I had two columns and therefore two outputs, which is the number of outputs that I wanted. Okay, so this question of how many columns you have in your weight matrix is where you get to decide how complex your model is. So we're going to see that; let's go ahead and create a linear model. There's a question: whether the Xavier initialization is the variance or the standard deviation. If it says variance, I'm sure it's variance; I didn't study it closely enough. All right, so we're going to learn how to create a linear model. Let's first of all learn how to create a linear model from scratch, and this is something which we did in that original USF Data Institute launch video, but I'll just remind you. Without using Keras at all, I can define a line as being ax plus b. I can then create some synthetic data: let's say I'm going to assume a is three and b is eight, create some random x's, and my y will then be ax plus b. Okay, so here are some x's and some y's that I've created, and not surprisingly, the plot looks like so. The job of somebody creating a linear model is to say: I don't know what a and b are, how can we calculate them? So let's forget that we know they're three and eight, and say, okay, let's guess that they're minus one and one. How can we make our guess better? To make our guess better, we need a loss function. The loss function is a mathematical function that will be high if your guess is bad, and low if it's good. The loss function I'm using here is the sum of squared errors, which is just my actual minus my prediction, squared, and added up. Okay, so if I define my loss function like that, and then I say my guesses are minus one and one, I can then calculate my average
loss, and it's nine. Okay, so my average loss with my random guesses is not very good. In order to create an optimizer, I need something that can make my weights a little bit better; if I have something that can make my weights a little bit better, I can just call it again and again and again. That's actually very easy to do if you know the derivative of your loss function with respect to your weights: then all you need to do is update your weights by the opposite of that. Remember, the derivative is the thing that says: as your weight changes, your output changes by this amount. Okay, that's what the derivative is. Yes, Rachel? I just wanted to say that again, because I think it's surprising when you first hear it, and it's very powerful: the idea that you can start with something random, and by iterating, you're going to get to a solution that works. Yeah, so let's try it. So in this case, we have y equals ax plus b, and our loss function is actual minus predicted, squared, and then added up. We're now going to create a function called update, which is going to take our a guess and our b guess and make them a little bit better. And to make them a little bit better, we calculate the derivative of our loss function with respect to b, and the derivative of our loss function with respect to a. How do we calculate those? We go to Wolfram Alpha, enter our formula and the thing we want to take the derivative with respect to, and it tells us the answer. So that's all I did: I went to Wolfram Alpha, found the correct derivatives, and pasted them in here. And so what this means is that this formula here tells me: as I increase b by 1, my sum of squared errors will change by this amount. And this one says: as I change a by 1, my sum of squared errors will change by this amount. So if I know, let's say, that the derivative with respect to a is 3, so my loss function gets higher by 3 if I increase a by 1, then clearly I need to make a a little bit
smaller, because if I make it a little bit smaller, my loss function will go down. So that's why our final step is to say: okay, take our guess and subtract from it our derivative times a little bit. LR stands for learning rate, and as you can see, I'm setting it to 0.01. How much is "a little bit" is something which people spend a lot of time thinking about and studying, and we will spend time talking about it, but you can always use trial and error to find a good learning rate. When you use Keras, you will always need to tell it what learning rate you want to use, and what you want is the highest number you can get away with. We'll see more of this next week, but the important thing to realise here is that if we update our guess (minus equals our derivative times a little bit), our guess is going to be a little bit better, because we know that going in the opposite direction to the derivative makes the loss function a little bit lower. So let's run those two things; we've now got a function called update which, every time we run it, makes our predictions a little bit better. Finally, I'm doing a little animation here: each frame calls my animate function, which calls my update function 10 times. So let's see what happens when I animate that. There it is. It starts with a really bad line, which is my minus one, one, and it gets better and better. So this is how stochastic gradient descent works, and stochastic gradient descent is the most important algorithm in deep learning. Stochastic gradient descent is the thing that starts with random weights like this and ends with weights that do what you want them to do. So as you can see, stochastic gradient descent is incredibly simple, and yet incredibly powerful, because it can take any function and find the set of parameters that does exactly what we want with that function. And when that function is a deep learning neural network, that becomes particularly powerful. Yes? Sure, it has
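Here is the whole linear-fit loop as a self-contained numpy sketch. The derivative formulas are the standard ones for the sum-of-squared-errors loss (the kind of thing Wolfram Alpha returns); the data, seed and iteration count are my own choices:

```python
import numpy as np

np.random.seed(4)
a_true, b_true = 3.0, 8.0
x = np.random.rand(30)
y = a_true * x + b_true          # synthetic data from the "unknown" line

a_guess, b_guess = -1.0, 1.0     # deliberately bad starting guesses
lr = 0.01                        # the learning rate: "a little bit"

def update(a, b):
    pred = a * x + b
    # Derivatives of the sum of squared errors with respect to b and a
    dldb = (2 * (pred - y)).sum()
    dlda = (2 * x * (pred - y)).sum()
    # Step in the opposite direction of the derivative
    return a - lr * dlda, b - lr * dldb

for _ in range(2000):
    a_guess, b_guess = update(a_guess, b_guess)
```

After enough updates, a_guess and b_guess converge to the true 3 and 8, exactly the behavior the animation shows.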
Yes, sure. It has nothing to do with neural nets, except... so, just to remind ourselves about the setup for this: we started out by saying that this spreadsheet is showing us a deep neural network with a bunch of random parameters. Can we come up with a way to replace the random parameters with parameters that actually give us the right answer? So we need to come up with a way to do mathematical optimization. Rather than showing how to do that with a deep neural network, let's see how to do it with a line. So we started out by saying: okay, let's have a line, ax plus b, where a is three and b is eight. Pretend we didn't know that a was three and b was eight; make a wild guess as to what a and b might be; come up with an update function that, every time we call it, makes a and b a little bit better; and then call that update function lots of times and confirm that eventually our line fits our data. Conceptually, take that exact same idea and apply it to these weight matrices. Yes, Richard? Going back, there's a question: in the step function for SGD, shouldn't there be a random element, to pop out of bad spots? Okay, so the question is: is there a problem here that, as we run this update function, we might get stuck? Currently we're trying to optimize the sum of squared errors, and the sum of squared errors looks like this, which is fine. But let's say it's a more complex function that kind of looked like this: if we started here and gradually tried to make it better and better and better, we might get to a point where the derivative is zero, and we then can't get any better. This would be called a local minimum, and the question was suggesting a particular approach to avoiding that. Here's the good news: in deep learning, you don't have local minima. Why not? Well, the reason is that in an actual deep learning neural network you don't have one
or two parameters; you have hundreds of millions of parameters. So rather than looking like this, or even like a 3D version that's kind of something like this, it's a 600-million-dimensional space. And so for something to be a local minimum, it means that the stochastic gradient descent has wandered around and got to a point where, in every single one of those 600 million directions, it can't do any better, and the probability of that happening is something like 0.5 to the power of 600 million. So for actual deep learning in practice, there are always enough parameters that it's basically unheard of to get to a point where there's no direction you can go to get better. So the answer is: no, for deep learning, stochastic gradient descent really is exactly as simple as this. We will learn some tweaks to allow us to make it faster, but this basic approach works just fine. Yes? That's a great question: what if you don't know the derivative? Okay, so for a long time this was a royal goddamn pain in the ass. Anybody who wanted to use stochastic gradient descent for their neural network had to go through and calculate all of their derivatives, and if you've got 600 million parameters, that's a lot of trips to Wolfram Alpha. Nowadays we don't have to worry about that, because all of the modern neural network libraries do symbolic differentiation. In other words, it's like they have their own little copy of Wolfram Alpha inside them, and they calculate the derivatives for you. So you'll never be in a situation where you don't know the derivatives; you just tell it your architecture and it will automatically calculate the derivatives. So let's take this linear example and see what it looks like in Keras. In Keras we can do exactly the same thing, so let's start by creating some random numbers, but this time let's make it a bit more complex: we're going to have a random matrix with two columns, and to calculate our y value, we'll do a little matrix multiply of our x with a vector of (2, 3), and then we'll add in a constant of 1.
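Before moving on, the point above about never needing hand-derived gradients is worth a tiny illustration: you can always approximate a derivative numerically with finite differences, which is handy for intuition and sanity checks, even though the libraries compute exact gradients for you. This example is my own, not from the notebook:

```python
# A numerical (finite-difference) approximation to a derivative. Libraries
# like Keras compute gradients for you exactly, but this is a handy way to
# see what "the derivative" means, and to sanity-check one by hand.
def numeric_deriv(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

f = lambda x: (x - 3) ** 2            # a simple loss with its minimum at x = 3
print(numeric_deriv(f, 5.0))          # analytic answer: 2 * (5 - 3) = 4
```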
So first of all, we have to import everything. Here are our x's, or the first 5 out of 30 of them, and here are the first few y's. So here, 3.2 equals 0.56 times 2, plus 0.37 times 3, plus 1, for instance. Hopefully this looks very familiar, because it's exactly what we did in Excel in the very first lesson. So how do we create a linear model in Keras? The answer is that Keras calls a linear model Dense; it's also known in other libraries as fully connected. So when we go Dense with an input of two columns and an output of one column, we have defined a linear model that can go from this two-column array to this one-column output. The second thing we have in Keras is a way to build multiple-layer networks, and Keras calls this Sequential. Sequential takes an array that contains all of the layers in your neural network. So for example, in Excel here I would have had one, two, three layers; in a linear model we just have one layer. So to create a linear model in Keras, you say Sequential and pass in an array with a single layer, and that is a Dense layer; a Dense layer is just a simple linear layer. We tell it that there are two inputs and one output. It will automatically initialize the weights in a sensible way, and it will automatically calculate the derivatives, so all we have to tell it is how we want to optimize the weights, and we will say: please use stochastic gradient descent with a learning rate of 0.1, and we're attempting to minimize our loss of mean squared error. So if I do that, that does everything except the very last solving step that we saw in the previous notebook. To do the solving, we just type fit. Before we start, we can say evaluate to find out our loss function with random weights, which is pretty crappy, and then we run five epochs, and the loss function gets better and better and better, using the stochastic gradient descent update rule we just learned.
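What that Dense layer is fitting can be written out in plain NumPy. Here's a minimal sketch of the same problem solved with the same update rule (the data-generating weights (2, 3) and bias 1 are from the lecture; everything else, including the use of mean rather than summed error, is my own illustration, not the Keras internals):

```python
import numpy as np

np.random.seed(1)
x = np.random.uniform(size=(30, 2))        # a random matrix with two columns
y = x @ np.array([2.0, 3.0]) + 1.0         # y = x . (2, 3) + 1

w = np.random.uniform(size=2)              # randomly initialized weights
b = 0.0                                    # bias term
lr = 0.1                                   # learning rate

for step in range(3000):
    y_pred = x @ w + b
    err = y_pred - y
    # gradients of the mean squared error with respect to w and b
    w -= lr * 2 * (x.T @ err) / len(x)
    b -= lr * 2 * err.mean()

print(w.round(2), round(b, 2))             # close to [2, 3] and 1
```

This is the whole of what "fit" is doing for a one-layer linear model: repeat the gradient update until the weights stop improving.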
And so at the end, we can evaluate, and it's better. Then let's take a look at the weights: they should be equal to 2, 3, 1, and they're actually 1.8, 2.7, 1.2. That's not bad, so why don't we run another five epochs? The loss function keeps getting better; we evaluate it, and now it's better, and the weights are now closer again to 2, 3, 1. So we now know everything that Keras is doing behind the scenes. Exactly; I'm not hand-waving over details here. That is it. So now that we know what it's doing, if we say to Keras: don't just create a single layer, create multiple layers, by passing multiple layers to this Sequential, we can start to build and optimize neural networks. But before we do that, we can actually use this to create a pretty decent entry to our cats and dogs competition. So forget all the fine-tuning stuff, because I haven't told you how fine-tuning works yet: how do we take the output of an ImageNet network and, as simply as possible, create an entry to our cats and dogs competition? The basic problem here is that our current ImageNet network returns a thousand probabilities, in a lot of detail. It returns not just cat versus dog; to remind ourselves, if we explore ImageNet, here we are: animals, domestic animals, and ideally it would just be cat and dog here, but it's not; it keeps going, and you see Egyptian cats, Persian cats, and so forth. So one thing we could do would be to write code to take this hierarchy and roll it up into cats versus dogs. I've got a couple of ideas here for how we could do that: for instance, we could find the largest probability out of the thousand that's either a cat or a dog and use that, or we could average all of the cat categories and all of the dog categories and use those. But the downsides are that it would require manual coding for something that we could and should be learning from data, and, more importantly, it's ignoring information. So let's say, out of
those thousand categories, the category for a bone was very high. It's more likely that a dog is with a bone than that a cat is with a bone, so the model ought to take advantage of that: it should learn to recognize environments that cats are in versus environments that dogs are in, or even recognize things that look like cats versus things that look like dogs. So what we could do is learn a linear model that takes the output of the ImageNet model, the thousand predictions, and uses that as the input, and uses the dog or cat label as the target, and that linear model would solve our problem. We know everything we need to know to create this model now, so let me show you how that works. Let's again import our VGG model, and we're going to do three things: one, for every image, get the true label, is it a cat or is it a dog; two, get the 1,000 ImageNet category predictions, so that'll be a thousand floats for every image; and three, use the output of step two as the input to our linear model, use the output of step one as the target, create this linear model, and build some predictions. So, as per usual, we start by creating our validation batches and our batches, just like before. And I'll show you a trick, because one of the steps here, getting the thousand ImageNet category predictions for every image, takes a few minutes, and there's no need to do that again and again; once we've done it once, let's save the result. So I want to show you how you can save NumPy arrays. Unfortunately, most of the stuff you'll find online about saving NumPy arrays takes a very, very long time to run, and it takes a shitload of space. There's a really cool library called bcolz that almost nobody knows about, which can save NumPy arrays very quickly and in very little space. So I've created these two little functions here called save_array and load_array, which you should definitely add to your toolbox. They're actually in utils.py, so you can use them in the
future. Once you've grabbed the predictions, you can use these to just save them and load them back later, rather than recalculating them each time. I'll show you something else we've got: before we even worry about calculating the predictions, we need to load up the images. When we load the images, there are a few things we have to do: we have to decode the JPEG images, and we have to convert them into 224 by 224 pixel images, because that's what VGG expects. That's kind of slow too, so let's also save the result of that. I've created this little function called get_data, which basically grabs all of the validation images and all of the training images and sticks them in a NumPy array. And here's a cool trick: in IPython Notebook or Jupyter Notebook, if you put question mark question mark before something, it shows you the source code. So if you want to know what get_data is doing, go ??get_data, and you can see exactly what it's doing: it's just concatenating all of the different batches together. So any time you're using one of my little convenience functions, I strongly suggest you look at the source code and make sure you see what it's doing, because they're all super, super small. So I can grab the data for the validation data, I can grab it for the training data, and then I just save it, so that in the future, rather than having to watch and wait for that to pre-process, I can just go load_array, and that goes ahead and loads it off disk. It still takes a few seconds, but this is way faster than having to calculate it directly. So what that does is create a NumPy array with my 23,000 images, each of which has three colors and is 224 by 224 in size. If you remember from lesson one, the labels that Keras expects are in a very particular format; let's look at the format to see what it looks like. Here it is: each label has two things, the probability
that it's a cat and the probability that it's a dog, and they're always just zeros and ones. So here, (0, 1) is a dog, (1, 0) is a cat, (1, 0) is a cat, (0, 1) is a dog. This approach, where you have a vector in which every element is a zero except for a single one in the position of the class you want, is called one-hot encoding, and it's used for nearly all deep learning. That's why I've created a little function called onehot that makes it very easy for you to one-hot encode your data. So for example, if your raw data was just 0, 1, 2, 1, 0, the one-hot encoded form would look like this: (1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0). The reason we use one-hot encoding a lot is that if you take this and do a matrix multiply by a bunch of weights, w1, w2, w3, the two are compatible, so you can calculate the matrix multiply; this is what lets you do deep learning really easily with categorical variables. So the next thing I want to do is grab my labels and one-hot encode them using this onehot function, and you can take a look at that. I get my batches, and you can see here that the first few classes look like so, and the first few labels are one-hot encoded like so. So we're now at a point where we can finally do step number one... actually, as a reminder, step number one is done, so it's step number two: get the thousand ImageNet category predictions for every image. Keras makes that really easy for us: we can just say model.predict and pass in our data. So model.predict with our training data is going to give us the thousand predictions from ImageNet for our training data, and this gives them for our validation data. Again, running this takes a few minutes, so I save it, and then, instead of making you wait, I will load it. And you can see that our 23,000 images are no longer 23,000 by 3 by 224 by 224; they're now 23,000 by 1,000.
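The onehot helper mentioned above can be sketched in a couple of lines of NumPy. This is my own minimal version for illustration; the one in utils.py may be implemented differently:

```python
import numpy as np

def onehot(labels):
    """One-hot encode integer class labels: every row is all zeros
    except for a single one in the column for that row's class."""
    labels = np.asarray(labels)
    out = np.zeros((len(labels), labels.max() + 1))
    out[np.arange(len(labels)), labels] = 1
    return out

# the example from the lecture: raw labels 0, 1, 2, 1, 0
print(onehot([0, 1, 2, 1, 0]))
```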
So for every image, we have the 1,000 probabilities. Let's look at one of them: train_features[0]. Not surprisingly, nearly all of them are zero, because for these thousand categories, only one of these numbers should be big; there can't be lots of different things in one image, it's not a cat and a dog and a jet airplane. So, not surprisingly, nearly all of these things are very close to zero, and hopefully just one of them is very close to one, which is exactly what we'd expect. So now that we've got our 1,000 features for each of our training images and for each of our validation images, we can go ahead and create our linear model. So here it is: the input is a thousand columns, every one of those ImageNet predictions; the output is two columns, it's a dog or it's a cat. To optimize it, I'm actually not going to use SGD; I'm going to use a slightly better thing called RMSprop, which I will teach you about next week. It's a very minor tweak on SGD that tends to be a lot faster, so I suggest that in practice you use RMSprop rather than SGD, but it's almost the same thing. And now that we know how to fit the model once it's defined, we can just go model.fit, and it runs basically instantly, because, if we have a look at our model with lm.summary, we have just one layer with just 2,000 weights. So running three epochs took zero seconds, and we got an accuracy of 0.9734. Let's run another three epochs: 0.9770, even better. So you can see this is about the simplest possible model. I haven't done any fine-tuning; all I've done is take the ImageNet predictions for every image and build a linear model that maps from those predictions to cat or dog. In a lot of the kind of amateur deep learning papers that you see, like the couple I showed you last week, one classifying leaves by whether they're sick, one classifying skin lesions by type of lesion, often this is all people do: they take a pre-trained model, grab the outputs, stick them into a linear model, and then use it.
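The same idea can be sketched end to end in NumPy: a single linear layer mapping 1,000 ImageNet-style probabilities to 2 classes, trained by gradient descent on a cross-entropy loss. The data here is synthetic (the real inputs come from model.predict, as described above), the labelling rule is made up purely for illustration, and plain gradient descent stands in for RMSprop:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_in, n_out = 200, 1000, 2
feats = rng.random((n, n_in))                 # stand-ins for ImageNet probabilities
labels = (feats[:, 0] > 0.5).astype(int)      # a made-up "is it a dog" rule
targets = np.eye(n_out)[labels]               # one-hot encode the labels
feats = feats - feats.mean(axis=0)            # center the inputs (helps plain SGD)

w = np.zeros((n_in, n_out))                   # the 1000 -> 2 linear layer
b = np.zeros(n_out)
lr = 1.0

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(500):
    probs = softmax(feats @ w + b)
    grad = (probs - targets) / n              # gradient of the cross-entropy loss
    w -= lr * (feats.T @ grad)
    b -= lr * grad.sum(axis=0)

acc = (probs.argmax(axis=1) == labels).mean()
print(acc)                                     # training accuracy
```

Replace the synthetic features with saved model.predict outputs and the made-up rule with real cat/dog labels, and this is conceptually the whole "linear model on top of ImageNet" entry.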
And as you can see, it actually works, often pretty well. So I just want to point out here that in getting this 0.9770 result, we have not used any magic libraries at all. It looks like more code than it really is, just because we've done some saving and so on as we go. All we've done is this: we grabbed our batches, just to get the data; we turned the images into a NumPy array; we took the NumPy array and ran model.predict on it; we grabbed our labels and one-hot encoded them; and then finally we took the one-hot encoded labels and the thousand probabilities and fed them to a linear model with a thousand inputs and two outputs, and we trained it, and we ended up with a validation accuracy of 0.9770. So what we're really doing here is digging right down into the details. We know exactly how SGD works, we know exactly how the layers are being calculated, and we therefore know exactly what Keras is doing behind the scenes. We started way up high, with something that was kind of totally obscure as to what was going on, where we were just using it like you might use Excel, and we've gone all the way down to see exactly what's going on, and we've got a pretty good result. So the last thing we're going to do is take this and turn it into a fine-tuning model, to get a slightly better result. And so, what is fine-tuning? In order to understand fine-tuning, we're going to have to understand one more piece of a deep learning model, and that is activation functions. This is our last major piece. I want to point something out to you in this view of a deep learning model: we went matrix multiply, matrix multiply, matrix multiply. Who wants to tell me how you can simplify a matrix multiply on top of a matrix multiply on top of a matrix multiply? What's that actually doing?
A linear model on a linear model on a linear model is itself a linear model. So in fact, this whole thing could be turned into a single matrix multiply, because it's just doing linear on top of linear on top of linear. This clearly cannot be what deep learning is really doing, because deep learning is doing something a lot more than a linear model. So what is deep learning actually doing? What deep learning actually does is that at every one of these points where it says activations, we do one more thing: we put each of these activations through a non-linearity of some sort. There are various things that people use; sometimes people use tanh, sometimes people use sigmoid, but most commonly nowadays people use max(0, x), which is called ReLU, or rectified linear. When you see either of those terms, rectified linear or ReLU activation function, people actually mean max(0, x). So if we took this Excel spreadsheet and added =MAX(0, x) at each layer, replacing each activation with that, we would now have a genuine, modern deep learning neural network. Interestingly, it turns out that this kind of neural network is capable of approximating any given function, of arbitrary complexity. In the lesson notes you will see a link to a fantastic tutorial by Michael Nielsen on this topic, which is here. What he does is show you how, with exactly this kind of approach where you put functions on top of functions, he actually lets you drag them up and down to change the parameters and see what they do, and he gradually builds up, so that once you have a function of a function of a function of this type, he shows you how you can create arbitrarily complex shapes. So, using this incredibly simple approach, where you have a matrix multiplication followed by a rectified linear, which is max(0, x), and stack that on top of each other, on top of each other, on top of each other, that's actually what's going on in a deep learning neural net.
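Both halves of that argument can be checked in a few lines of NumPy: two stacked linear layers really are just one matrix multiply, and putting a ReLU in between breaks that equivalence. This is a small example of my own, not from the notebook:

```python
import numpy as np

x = np.array([1.0, -2.0])            # a tiny input
w1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])          # first "layer"
w2 = np.array([[1.0, 1.0]])          # second "layer"

# linear on top of linear collapses to a single matrix multiply
assert np.allclose(w2 @ (w1 @ x), (w2 @ w1) @ x)

relu = lambda z: np.maximum(0, z)    # the max(0, x) from the lecture

# with a ReLU in between, no single matrix gives the same answer
print((w2 @ relu(w1 @ x))[0])        # 1.0 (the -2 was clipped to 0)
print(((w2 @ w1) @ x)[0])            # -1.0
```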
And so you will see that in all of the deep neural networks we have created so far, we have always had this extra parameter, activation equals something, and generally you'll see activation='relu'. That's what it's doing: it's saying, after you do the matrix product, do a max(0, x). So what we need to do is take our final layer, which has both a matrix multiplication and an activation function, and remove it, and I'll show you why. Let's take a look at our VGG model: vgg.summary, sorry, vgg.model.summary, because the .model is the Keras model. Let's see what the end of it looks like: the very last layer is a dense layer, a linear layer. It seems weird, therefore, that in that previous section we added an extra dense layer on top of it. Given that this dense layer has been tuned to find the 1,000 ImageNet categories, why would we want to take that and add on top of it something that's tuned to find cats and dogs? How about we remove it instead, and use the previous dense layer, with its 4,096 activations, to find our cats and dogs? To do that, it's as simple as saying model.pop, which will remove the very last layer, and then we can go model.add and add in our new linear layer with two outputs, cat and dog. So when we said vgg.finetune earlier, that's actually what it was doing. We can have a look: ??vgg.finetune, and here is the source code: model.pop, then model.add of a dense layer with the correct number of classes, and the input equal to the... that's interesting, that's actually incorrect; I think it's being ignored, so forget that little part, and I will fix it later. So it's basically doing a model.pop and then a model.add of a dense layer, so that once we've done that, we will now have a new model which is designed to calculate cats versus dogs, rather than being designed to calculate ImageNet categories and then calculate cats versus dogs from those.
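Conceptually, then, finetune is just "drop the last weight matrix, append a fresh one". Here's a toy sketch of that idea, representing the dense layers as a plain Python list of weight matrices (the 4,096 and 1,000 shapes follow VGG as described above; the list-of-matrices representation is my own simplification, not the Keras implementation):

```python
import numpy as np

# the tail of VGG's fully connected layers, as (inputs, outputs) weight matrices
layers = [np.zeros((4096, 4096)),     # dense layer with 4,096 activations
          np.zeros((4096, 1000))]     # final 1,000-way ImageNet layer

layers.pop()                          # model.pop(): remove the last layer
layers.append(np.zeros((4096, 2)))    # model.add(...): new 2-way cats/dogs layer

print([w.shape for w in layers])      # [(4096, 4096), (4096, 2)]
```

The new last layer starts untrained, which is why the model still has to be compiled and fit afterwards.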
And so when we use that approach, everything else is exactly the same: we compile it, giving it an optimizer, and then we can call model.fit. By the way, anywhere that we want to use batches, in Keras we have to use something_generator, so here it's fit_generator, because we're passing in batches. And if we run it for two epochs, you can see we get 97.35, and if we run it for a little bit longer, eventually we get something quite a bit better than our previous linear-model-on-top-of-ImageNet approach; in fact we know we can, because we got 98.3 when we looked at this fine-tuning earlier. So that's the only difference between fine-tuning and adding an additional linear layer: we just do a pop first before we add. Of course, once I've calculated it, I then go ahead and save the weights so we can use them again in the future, and so from here on in you'll often find that after I create my fine-tuned model, I will go model.load_weights('finetune1.h5'), because this is now something we can use as a pretty good starting point for all of our future dogs and cats models. Okay, I think that's about everything I wanted to show you for now. For anybody who is interested in going further during the week, there is one more section in this lesson showing how you can train more than just the last layer, but we'll look at that next week as well. So during this week, the assignment is really very similar to last week's assignment, but it's to take it further, now that you actually know what's going on with fine-tuning and with linear layers. There are a couple of things you could do. One is, for those of you who haven't yet entered the cats and dogs competition, get your entry in, and then have a think about everything you know about the evaluation function, the categorical cross-entropy loss function,
fine-tuning, and see if you can find ways to make your model better, and see how high up the leaderboard you can get using this information. Maybe you can push yourself a little further: read some of the other forum threads on Kaggle and on our forums, and see if you can get the best result you can. If you want to really push yourself, see if you can do the same thing by writing all of the code yourself: don't use our fine-tune at all, don't use our notebooks at all, and see if you can build it from scratch, just to really make sure that you understand how it works. And then, of course, if you want to go further, see if you can enter not just the dogs and cats competition, but one of the other competitions that we talk about on our website, such as Galaxy Zoo, or the plankton competition, or the State Farm distracted driver competition, and so forth. Great! Well, thanks, everybody. I look forward to talking to you all during the week, and hopefully I'll see you here next Monday. Thanks very much.