Welcome to lesson two, where we're going to take a deeper dive into computer vision applications, take some of the amazing stuff you've all been doing during the week, and go even further. Before we do, a reminder that we have two really important topics on the forums, pinned at the top of the forum category. One is "FAQ, resources, and official course updates". If there's something useful for you to know during the course, we will post it there. Nobody else can reply to that thread, so if you set it to "watching" and turn on notifications, you won't be bugged by anybody except for the things we think you need to know for the course. It has all the official information about how to get set up on each platform. Please note, a lot of people post all kinds of other tidbits about how they've set things up in previous courses or other places. I don't recommend you use those, because the official instructions are the ones we're testing every day, and that the folks involved in these platforms are testing every day, and they definitely work. So I strongly suggest you follow those tips. And if you do have a question about using one of these platforms, please use these platform discussions, not some other topic that you create; that way the people involved in each platform will be able to see it and things won't get messy. Secondly, for every lesson there will be an official updates thread for that lesson, e.g. "Lesson 1: official updates". Only fast.ai people will post to it, so you can watch it safely, and it will have all the things like the videos, the notebooks and so forth. They're all wiki threads, so you can help us make them better as well.

I mentioned the idea of watching a thread. This is a really good idea: you can go to a thread, particularly those official update ones, and click "Watching" at the bottom. That enables notifications for any updates to the thread, and if you click your little username in the top right, go to Preferences, and turn on email notifications, you'll get an email as well. So any of you that have missed some of the updates so far, go back and have a look through, because we're really trying to keep you updated with anything we think is important.

One thing which can be more than a little overwhelming is that even now, after just one week, the most popular thread has 1,100 replies. That's an intimidatingly large number. I've actually read every single one of them, and I know Rachel has, and Sylvain has, and I think Francisco has, but you shouldn't need to. What you should do is click "Summarize This Topic", and then only the most liked replies will appear, with "view 31 hidden replies" (or whatever) in between. That's how you navigate these giant topics. It's also why it's important you click the like button, because that's what causes people to see a post in this summarized view.

When you come back to work, hopefully you've realized by now that on the official course website, course-v3.fast.ai, you click "Returning to work", click the name of the platform you're using, and then follow the two steps. Step one is how to make sure you've got the latest notebooks, and step two is how to make sure you've got the latest Python library software.
They all look pretty much like this, but they're slightly different from platform to platform, so please don't use some different set of commands you read somewhere else. Only use the commands you read about here and everything will go smoothly. If things aren't working for you, if you get into some kind of messy situation (which we all do), just delete your instance and start again, unless you've got mission-critical stuff there; it's the easiest way out of a sticky situation. If you follow the instructions here, you really should find it works fine.

What I really wanted to talk about most of all is what people have been doing this week. As a lot of you have noticed, 167 people have shared their work, which is really cool, because it's pretty intimidating to put yourself out there and say "I'm new to all this, but here's what I've done." One example I thought was really interesting was figuring out who's talking: is it Ben Affleck or Joe Rogan? This one is very practical: "I wanted to clean up my WhatsApp downloaded images to get rid of memes, so I built a little neural network." How cool is that, to be able to say "oh yeah, I've got something that cleans up my WhatsApp; it's a deep learning application I wrote last week"? It's so easy now that you can do stuff like this.

Then there have been some really interesting projects. One looked at the sound data used in a paper where the authors were trying to classify what kind of sounds things were, and, as you would expect since they published a paper, they got a state of the art of nearly 80% accuracy. Ethan Sutin then tried the lesson one techniques and got 80.5% accuracy, which I think is pretty awesome. As best as we know, it's a new state of the art for this problem; maybe somebody has since published something we haven't found, so take all of these with a slight grain of salt, but I mentioned them on Twitter and lots of people follow me there, so if everybody knew there was a much better approach, I'm sure somebody would have said so. This one is pretty cool: Suvash has a new state-of-the-art accuracy for Devanagari text recognition; I think he's got it even higher than this now. And this was actually confirmed on Twitter by the person who created the dataset. I don't think he had any idea; he just posted "hey, here's a nice thing I did," and this guy on Twitter said "oh, I made that dataset. Congratulations, you've got a new record." So that was pretty cool.

I really liked this post from Alena Harley. She describes in quite a bit of detail the issue of metastasizing cancers and the use of point mutations, and why that's a challenging, important problem. She's got some nice pictures describing what she wants to do and how she can go about turning this into pictures. This is the cool trick, right? It's the same as with the sound example: turn sounds into pictures and then use the lesson one approach. Here it's turning point mutations into pictures and then using the lesson one approach. And what did she find? It seems she's got a new state-of-the-art result, beating the previous best by more than 30%. Somebody on Twitter who is a VP at a genomics analysis company looked at this as well and thought it looked to be a state of the art for this particular point-mutation problem too.
So that's pretty exciting. You can see that the simple process we talked about last week really can take you a long way. I will mention that something like this last example in particular is using a lot of domain expertise, in figuring out what picture to create. I wouldn't know how to do that, because I don't even really know what a point mutation is, let alone how to turn one into something visually meaningful that a CNN could recognize. But the deep learning side is actually pretty straightforward.

Another very cool result: Simon Willison and Natalie Downe created a "Cougar or Not" web application over the weekend and won the Science Hack Day award in San Francisco, which I think is pretty fantastic. So there are lots of examples of people doing really interesting work. Hopefully this will be inspiring to you: it's cool that you can do this with what you've learnt. It can also be intimidating to think "wow, these people are doing amazing things," but it's important to realize that of the thousands of people doing this course, I'm just picking out a few of the really amazing ones. And in fact Simon is one of those very annoying people, like Christine Payne who we talked about last week, who seems to be good at everything he does. He created Django, one of the world's most popular web frameworks, he founded a very successful startup, and so on; one of those really annoying people who tends to keep being good at things, and now it turns out he's good at deep learning as well. That's fine. Simon can go and win a hackathon in his first week of playing with deep learning; maybe it'll take you two weeks to win your first hackathon. That's okay.

I also think it's important to mention this really inspiring blog post this week from James Dellinger, who talked about how he created a bird classifier using the techniques from lesson one. What I really found interesting was that at the end he said he nearly didn't start on deep learning at all, because he went to the scikit-learn website (scikit-learn is one of the most important Python machine learning libraries), saw what was there, and described in the blog post how he thought "that's not something I can do, that's not something I understand." And then came the realization: oh, I can do useful things without reading all the Greek first. I thought that was a really cool message.

I also really want to highlight Daniel Armstrong on the forum, who I think is a great role model here. He said: I want to contribute to the library, I looked at the docs, and I just found it overwhelming. His next message, one day later: I don't know what any of this is, I didn't know how much there was to it, it caught me off guard, my brain shut down, but I love the way it forces me to learn so much. And one day after that: I just submitted my first pull request. I think that's awesome. It's okay to feel intimidated; there's a lot. Just pick one piece and dig into it, whether you try to push a piece of code, or a documentation update, or create a classifier, or whatever.

So here are lots of cool classifiers people have built. It's been really, really inspiring. A Trinidad and Tobago islander versus masquerader classifier. A zucchini versus cucumber classifier. This one was really nice.
This was taking the dog and cat breeds thing from last week and actually doing some exploratory work to see what the main features were, discovering that you could create a hairiness classifier. And so here we have the hairiest dogs and the baldest cats. So there are interesting things you can do with interpretation. Somebody else on the forum took that and did the same thing for anime, and found they had accidentally discovered an anime hair color classifier. We can now detect the new versus the old Panamanian buses correctly; apparently these are the new ones. I much prefer the old ones, but maybe that's just me. This one was really interesting: Henri Palacci discovered he can recognize, with 85% accuracy, which of 110 countries a satellite image is of, which has to be beyond the performance of just about any human; I can't imagine anybody who could do that in practice. So that was fascinating. Batik cloth classification with 100% accuracy. Dave Reward did this interesting one: he went a little bit further, using some techniques we'll be discussing in the next couple of lessons, to build something that can recognize complete, incomplete, or foundation-only buildings and plot them on an aerial satellite view. So lots and lots of fascinating projects. Don't worry, it's only been one week; that doesn't mean everybody has to have a project out yet. A lot of the folks who already have a project out have done a previous course, so they've got a bit of a head start. But we'll see today how you can definitely create your own classifier this week.

So today we're going to dig a bit deeper into how to make these computer vision classifiers in particular work well. We're then going to look at the same thing for text, then the same thing for tabular data (which is more like spreadsheets and databases), then collaborative filtering (i.e. recommendation systems). That's going to take us into a topic called embeddings, which is basically a key underlying platform behind these applications. That will take us back into more computer vision, and then back into more NLP. The idea here is that it turns out to be much better for learning if you see things multiple times. So rather than "okay, that's computer vision, you won't see it again for the rest of the course," we're actually going to come back to the two key applications, NLP and computer vision, a few weeks apart, and that's going to force your brain to realize "oh, I have to remember this; it's not just something I can throw away."

People who have more of a hard-sciences kind of background in particular often find this "here's some code, type it in, start running it" approach, rather than a "here's lots of theory" approach, confusing and surprising and odd at first. And so for those of you in that situation, I just want to remind you of this basic tip: keep going. You're not expected to remember everything yet. You're not expected to understand everything yet. You're not expected to know why everything works yet. You just want to be in a situation where you can enter the code, run it, get something happening, and then start to experiment and get a feel for what's going on. Then push on. Most of the people who have done the course and gone on to be really successful watched the videos at least three times.
So they go through the whole lot once, then go through it more slowly the second time, then really slowly the third time, and I consistently hear them say "I get a lot more out of it each time I go through." So don't pause at lesson one and wait until you understand everything before you continue.

This approach is based on a lot of academic research into learning theory, and one researcher in particular, David Perkins from Harvard, has a really great analogy. He describes this approach as "the whole game": if you're teaching a kid to play soccer, you don't first teach them how the friction between a ball and grass works, then teach them how to stitch a soccer ball by hand, then teach them the mathematics of parabolas for when you kick something in the air. No: you say "here's a ball, let's watch some people playing soccer, okay, now we'll play soccer," and then gradually over the following years they learn more and more so they can get better and better at it. So this is what we're trying to get you to do: play soccer, which in our case means type code and look at the inputs and look at the outputs.

Okay, so let's dig into our first notebook, which is called lesson2-download. What we're going to do is see how to create your own classifier with your own images. It's going to be a lot like last week's pet detector, but it'll detect whatever you like, like some of those examples we just saw. How would you create your own Panamanian bus detector from scratch? The approach is inspired by Adrian Rosebrock, who has a terrific website called PyImageSearch, and he has a nice explanation of how to create a dataset using Google Images. So that was definitely an inspiration for some of the techniques we use here; thank you to Adrian, and you should definitely check out his site, it's full of lots of good resources.

So here we are. We're going to try to create a teddy bear detector, and we're going to try to separate teddy bears from black bears from grizzly bears. Now this is very important: I have a three-year-old daughter, and she needs to know what she's dealing with. In our house you would be surprised at the number of monsters, lions, and other terrifying threats that are around, particularly around Halloween, and we always need to be on the lookout to make sure that the thing we're about to cuddle is in fact a genuine teddy bear. So let's deal with that situation as best as we can.

Our starting point is to find some pictures of teddy bears so we can learn what they look like. So I go to images.google.com, type in "teddy bear", and just scroll through until I find a goodly bunch of them: okay, that looks like plenty of teddy bears to me. Then I go back to the notebook; you can see it says "search and scroll". The next thing we need to do is get a list of all of the URLs. To do that, back in your Google Images results, you hit Ctrl+Shift+J (or Cmd+Option+J on a Mac), and you paste this JavaScript snippet into the window that appears. I'm on Windows, so I press Ctrl+Shift+J and paste in that code. This is a JavaScript console, for those of you who haven't done any JavaScript before. I hit Enter, and it downloads my file for me. I'd call it something like teddies.txt and press Save. So I now have a file of URLs of teddies.
Then I repeat that process for black bears and for grizzly bears, since that's the classifier I want, and I put each one in a file with an appropriate name. So that's step one. Step two is that we now need to download those images to our server. Remember, when we're using a Jupyter notebook it's not running on our computer; it's running on SageMaker, or Crestle, or Google Cloud, or whatever. So to do that, we start running some Jupyter cells. Let's grab the fastai library, and let's start with black bears. I've already got my black bears URL file, so I click on the cell for black bears and run it. Here I've got three different cells doing the same thing but with different information. This is one way I like to work with Jupyter notebooks; it's something that a lot of people with a more strict scientific background are horrified by, because this is not reproducible research. I click here and run this cell to set up a folder called "black" and a file called "urls_black" for my black bears, I skip the next two cells, and then I run the cell that creates that folder. Then I go down to the next section and run the next cell, which downloads the images for black bears into that folder. Then I go back, click on teddies, run that cell, scroll back down and run the download cell again. So I'm just going backwards and forwards to download each of the classes that I want.

For me, that works well; I'm very iterative and very experimental. If you're better at planning ahead than I am, you can write a proper loop or whatever and do it that way. But when you see my notebooks and see places where there are configuration cells doing the same thing in different places, it's a strong sign that I didn't run the notebook in order; I clicked in one place, went to another, ran that, went back, and so on. I'm an experimentalist; I really like to experiment in my notebook, I treat it like a lab journal, I try things out and see what happens, and so this is how my notebooks end up looking. It's a really controversial topic: for a lot of people this feels wrong, and you should only ever run things top to bottom, with everything reproducible. For me, I don't think that's the best way of using human creativity; I think human creativity is best inspired by trying things out, seeing what happens, and fiddling around. See what works for you.

So that will download the images to your server, using multiple processes to do so. One problem there is that if something goes wrong, it's a bit hard to see what went wrong. So in the next section there's a commented-out version that sets max_workers=0; that will do it without spinning up a bunch of processes, and will report the errors better. So if things aren't downloading, try using that second version.

I grabbed a small number of each class, and the next thing I found I needed to do was to remove the images that aren't actually images at all. This happens all the time: there are always a few images in every batch that are corrupt for whatever reason; Google Images told us this URL had an image, but actually it doesn't any more. So we've got a function in the library called verify_images, which will check all of the images in a path and tell you if there's a problem.
If you say delete=True, it will actually delete the problem files for you. So that's a really nice, easy way to end up with a clean dataset.

At this point I now have a "bears" folder containing a "grizzly" folder, a "teddies" folder, and a "black" folder. In other words, I have the basic structure we need to create an ImageDataBunch and start doing some deep learning. So let's go ahead and do that. Now, very often when you download a dataset from Kaggle or from some academic source, there will be a folder called "train", a folder called "valid", and a folder called "test" containing the different datasets. In this case, we don't have a separate validation set, because we just grabbed these images from Google search. But you still need a validation set, otherwise you don't know how well your model is doing; we'll talk more about this in a moment. So whenever you create a DataBunch, if you don't have a separate training and validation set, you can just say "the training set is in the current folder" (by default it looks in a folder called "train") and "please set aside 20% of the data". This will create a validation set for you, automatically and randomly.

You'll see that whenever I create a validation set randomly, I always set my random seed to something fixed beforehand. This means that every time I run this code, I'll get the same validation set. In general, I'm not a fan of making my machine learning experiments reproducible (i.e. ensuring I get exactly the same result every time); to me, the randomness is a really important part of finding out whether your solution is stable, whether it's going to work each time you run it. But what is important is that you always have the same validation set. Otherwise, when you're trying to decide "has this hyperparameter change improved my model?" but you've got a different set of data you're testing it on, you don't know; maybe that set of data just happens to be a bit easier. So that's why I always set the random seed here.
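To make the sequence concrete, here is a minimal sketch of the download-and-DataBunch flow using the fastai v1 API as it stood around this lesson (download_images, verify_images, ImageDataBunch.from_folder). The folder and file names are just illustrative, and newer fastai versions rename some of these calls.

```python
from fastai.vision import *

path = Path('data/bears')

# one (folder, url-file) pair per class; the urls_*.txt files are what we saved from Google Images
for folder, file in [('black',   'urls_black.txt'),
                     ('grizzly', 'urls_grizzly.txt'),
                     ('teddies', 'urls_teddies.txt')]:
    dest = path/folder
    dest.mkdir(parents=True, exist_ok=True)
    download_images(path/file, dest, max_pics=200)
    # if downloads fail silently, re-run with max_workers=0 to see the errors

# drop anything that isn't actually a readable image
for c in ['black', 'grizzly', 'teddies']:
    verify_images(path/c, delete=True, max_size=500)

# fixed seed so the randomly chosen 20% validation set is the same on every run
np.random.seed(42)
data = (ImageDataBunch.from_folder(path, train='.', valid_pct=0.2,
                                   ds_tfms=get_transforms(), size=224, num_workers=4)
        .normalize(imagenet_stats))
```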
So now we've got a DataBunch (let's run that cell), and you can look inside at data.classes and you'll see these are the folders we created, so it knows that by "classes" we mean all the possible labels: black bear, grizzly bear, or teddy bear. We can run show_batch and take a little look, and it tells us straight away that some of these are going to be a little bit tricky: this one is not a photo, for instance, some of them are cropped funny, and some might be genuinely hard (if you ended up with a black bear standing on top of a grizzly bear, that would be tough). Anyway, you can double-check data.classes and there they are. Remember, c is the attribute that tells us how many possible labels there are; we'll learn about some other, more specific meanings of c later. We can see how many things are in our training set and in our validation set: 473 in the training set, 141 in the validation set.

At that point we can go ahead, and you'll see all these commands are identical to the pet classifier from last week. We create our CNN, our convolutional neural network, using that data; I tend to default to ResNet-34. We print out the error rate each time, run fit_one_cycle for four epochs, see how we go, and we get a 2% error rate. That's pretty good; it isn't always easy for me to recognize a black bear from a grizzly bear, so this one seems to be doing pretty well. After I've made some progress with my model and things are looking good, I always like to save where I'm up to, to save myself the 54 seconds of going back and doing it again.

Then, as per usual, we unfreeze the rest of our model (we'll learn more about what that means during the course), run the learning rate finder, and plot it; it tells you exactly what to type, and we take a look. We'll be learning about learning rates properly today, but for now, here's what you need to know: on the learning rate finder plot, what you're looking for is the strongest downward slope that sticks around for quite a while. This one here looks more like a bump, but this looks like an actual downward slope to me. It's something you're going to have to practice with and get a feel for; if you're not sure whether it's this bit or that bit, try both learning rates and see which works better. But I've been doing this for a while and I'm pretty sure this is where it's really learning properly, so I would pick something around there; here it's not so steep, so I'd pick something a bit before that. You can see I picked 3e-5, somewhere around here, as my bottom learning rate. For my top learning rate I normally pick 1e-4 or 3e-4; I don't really think about it too much, it's a rule of thumb that always works pretty well. One of the things you'll realize is that most of these parameters don't actually matter that much in detail; if you just copy the numbers that I use each time, the vast majority of the time it'll just work fine, and we'll see places where it doesn't today.

So we've got a 1.4% error rate after doing another couple of epochs, which is looking great. We've downloaded some images from Google image search and created a classifier with a 1.4% error rate.
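As a reference, here is roughly what that training sequence looks like in fastai v1. The learner constructor was called create_cnn at the time of this lesson and cnn_learner in later v1 releases, and the learning rates are just the rule-of-thumb values described above, not anything special.

```python
# assumes `data` is the ImageDataBunch built earlier
learn = cnn_learner(data, models.resnet34, metrics=error_rate)  # create_cnn(...) in older fastai v1
learn.fit_one_cycle(4)
learn.save('stage-1')                 # checkpoint before unfreezing

learn.unfreeze()
learn.lr_find()
learn.recorder.plot()                 # pick the strongest sustained downward slope

# bottom of the slice from the LR finder plot, top from the 1e-4 / 3e-4 rule of thumb
learn.fit_one_cycle(2, max_lr=slice(3e-5, 3e-4))
learn.save('stage-2')
```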
Then, as per usual, we can use the ClassificationInterpretation class to have a look at what's going on, and in this case we made one mistake: there was one black bear classified as a grizzly. So that's a really good start; we've come a long way. But possibly you could do even better if your dataset was less noisy; maybe Google image search didn't give you exactly the right images all the time. So how do we fix that? Combining a human expert with a computer learner is a really good idea. Very, very few people publish on this, very few people teach it, but to me it's one of the most useful skills. Most of the people watching this are domain experts, not computer science experts, so this is where you can use your knowledge of point mutations in genomics, or Panamanian buses, or whatever. So let's see how that would work.

Do you remember plot_top_losses from last time, where we saw the images the model was either most wrong about or least confident about? We're going to look at those and decide which of them are noisy. If you think about it, it's very unlikely that mislabeled data is going to be predicted correctly and with high confidence; that's really unlikely to happen. So we're going to focus on the ones which the model either wasn't confident about, or was confident about and got wrong; those are the things which might be mislabeled.

Big shout-out to the San Francisco fastai study group, who created a new widget this week called the FileDeleter: Zach, Jason and Francisco built this thing where we can basically take the top losses from the interpretation object we just created. There's not just plot_top_losses, there's also top_losses itself, and top_losses returns two things: the losses of the things that were worst, and the indexes into the dataset of the things that were worst. If you don't pass anything at all, it will actually return the entire dataset, but sorted, so the first things are the highest losses. As we'll learn during the course, every dataset in fastai has an x and a y: the x contains the things used to get the images (in this case, the image file names) and the y contains the labels. So if we grab the indexes and pass them into the dataset's x, this gives us the file names of the dataset ordered by which ones had the highest loss, i.e. which ones it was either confident and wrong about, or not confident about. And we can pass that to this new FileDeleter widget that they've created.

Just to clarify, top_loss_paths contains all of the file names in our dataset, and when I say "our dataset", this particular one is our validation dataset. So what this is going to do is clean up mislabeled images, or images that shouldn't be there, and remove them from the validation set, so that our metrics will be more correct. You then need to re-run these two steps replacing valid_ds with train_ds to clean up your training set and get the noise out of that as well; it's good practice to do both. We'll talk about test sets later too; if you also have a test set, you would repeat the same thing there.

So we run FileDeleter, passing in that sorted list of paths, and what pops up is basically the same thing as plot_top_losses: these are the ones which the model was either wrong about or least confident about. Not surprisingly, this one here does
not appear to be a teddy bear, a black bear, or a grizzly bear, so it shouldn't be in our dataset. So I click the Delete button under it. All the rest do indeed look like bears, so I click Confirm and it brings up another five. What's that? That's not a bear, is it? Does anybody know what that is? I'm going to say it's not a bear: delete, confirm. Not a bear; well, that one is a teddy bear, I'll leave it; that one's not really, I'll get rid of it; confirm. What I tend to do when I do this is keep clicking Confirm until I get to a couple of screenfuls of things that all look okay, and that suggests to me that I've got past the worst bits of the data. And that's it. So now you can go back, once you've done it for the training set as well, and retrain your model.

I'll just note here that what our San Francisco study group did was to actually build a little app inside a Jupyter notebook, which you might not have realized is possible. Not only is it possible, it's actually surprisingly straightforward, and just like everything else, you can hit double question mark to find out their secrets. Here is the source code. Really, if you've done any GUI programming before, it'll look incredibly normal: there are callbacks for what happens when you click on a button, where you just do standard Python things, and to actually render it you just use widgets, and you can lay it out using standard boxes and so on. This idea of creating applications inside notebooks is really underused, but it's super neat, because it lets you create tools for your fellow practitioners and fellow experimenters. You could definitely envisage taking this a lot further; in fact, by the time you're watching this on the MOOC, you'll probably find there's a whole lot more buttons here, because we've already got a long list of to-dos that we're going to add to this particular widget.

So I'd love for you to have a think, now that you know it's possible to write applications in your notebook, about what you're going to write. If you Google for "ipywidgets" you can learn about the little GUI framework, find out what kinds of widgets you can create, what they look like, how they work, and so forth; you'll find it's actually a pretty complete GUI programming environment you can play with, and it all works nicely with your models. It's not a great way to productionize an application, because it's sitting inside a notebook; this is really for things which are going to help other practitioners and experimentalists. For productionizing things, you need to build a proper production web app, which we'll look at next.

Okay, so after you've cleaned up your noisy images, you can retrain your model, and hopefully you'll find it's a little bit more accurate. One thing you might be interested to discover when you do this is that most of the time it actually doesn't matter very much: on the whole, these models are pretty good at dealing with moderate amounts of noisy data. The problem occurs when your data isn't randomly noisy but biased-noisy. So the main thing I'm saying is, if you go through this process of cleaning up your data, re-run your model, and find it's 0.001% better, that's normal, that's fine. But it's still a good idea to make sure you don't have too much noise in your data, in case it is biased.
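Here is a rough sketch of the cleaning step as it appeared in the lesson-era notebook: top_losses gives the sorted losses and indexes, the dataset's x gives the corresponding file paths, and those paths are handed to the study group's widget. The widget API changed soon after this lecture (later fastai v1 versions ship an ImageCleaner widget instead of FileDeleter), so treat the exact names as illustrative.

```python
from fastai.widgets import *

interp = ClassificationInterpretation.from_learner(learn)

# losses and dataset indexes, sorted from highest loss to lowest
losses, idxs = interp.top_losses()

# file paths for the validation set, ordered by loss; repeat with data.train_ds
# afterwards to clean the training set as well
top_loss_paths = data.valid_ds.x[idxs]

fd = FileDeleter(file_paths=top_loss_paths)   # renders the delete/confirm widget in the notebook
```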
So at this point we're ready to put our model in production, and this is where I hear a lot of people ask about which mega Google/Facebook highly distributed serving system they should use, and how to use a thousand GPUs at the same time, and whatever else. For the vast, vast, vast majority of things you all do, you will want to run inference in production on a CPU, not a GPU. Why is that? Because a GPU is good at doing lots of things at the same time, but unless you have a very busy website, it's pretty unlikely that you're going to have 64 images to classify at the same time to put into a batch on a GPU. And if you did, you'd have to deal with all the queuing and running it all together; all of your users have to wait until that batch has filled up and run; it's a whole lot of hassle, and then if you want to scale that, there's another whole lot of hassle. It's much easier to just grab one thing, throw it at a CPU, and get it back. Yes, it's going to take maybe 10 or 20 times longer, so maybe it'll take 0.2 seconds rather than 0.01 seconds (that's about the kind of times we're talking about), but it's so easy to scale: you can chuck it on any standard serving infrastructure, it's going to be cheap, and you can horizontally scale it really easily. So most people I know who are running deep learning apps that aren't at Google scale are using CPUs. The term we use is "inference": when you're not training a model but you've got a trained model and you're getting it to predict things, we call that inference. That's why we say here: you probably want to use a CPU for inference.

So at inference time, you've got your pre-trained model and you've saved those weights; how are you going to use them to create something like Simon Willison's cougar detector? The first thing you're going to need to know is what the classes were that you trained with, and not just what they are, but what order they were in. So you will need to serialize that, or just type them in, or in some way make sure you've got exactly the same classes that you trained with. If you don't have a GPU on your server, it will use the CPU automatically; if you have a GPU machine and you want to test using a CPU, you can just uncomment this line, which tells fastai that you want to use the CPU by passing it back to PyTorch.

So here's an example. We don't have a cougar detector; we have a teddy bear detector, and my daughter Claire is about to decide whether to cuddle this friend. What she does is take Daddy's deep learning model, and here is a picture of the potentially cuddlesome object that she's uploaded to the web app. We're going to store that in a variable called img; open_image is how you open an image in fastai, funnily enough. Here is that list of classes that we saved earlier, and, as per usual, we create a DataBunch, but this time we're not going to create one from a folder full of images; we create a special kind of DataBunch that's going to grab one single image at a time. So we're not actually passing it any data. The only reason we pass it a path is so that it knows where to load our model from; that's just the folder the model is going to be in. But what we do need to do is pass in the same information that we trained with: the same transforms, the same size, the same normalization. This is all stuff we'll learn more about, but just
make sure it's the same stuff that you used before. So now you've got a DataBunch that doesn't actually contain any data at all; it's just something that knows how to transform a new image in the same way that you trained, so that you can do inference. You can now create a CNN with this kind of empty DataBunch, again using exactly the same model you trained with, and load in those saved weights. This is the stuff you do just once, when your web app is starting up, and it takes about 0.1 seconds to run. Then you just call learn.predict on the image, and it's lucky we did, because this is not a teddy bear: it's actually a black bear. So, thanks to an excellent deep learning model, my daughter will avoid a very embarrassing black-bear-cuddling incident.

So what does this look like in production? I took Simon Willison's code, shamelessly stole it, and probably made it a little bit worse, but basically it's going to look something like this. Simon used a really cool web app toolkit called Starlette. If you've ever used Flask, this will look extremely similar, but it's a more modern approach, and by "modern" what I really mean is that you can use await: you can wait for something that takes a while, such as grabbing some data, without using up a process. So for things like "I want to get a prediction" or "I want to load up some data", it's really great to be able to use this modern Python 3 asynchronous stuff; Starlette comes highly recommended for creating your web app. You create a route as per usual in a web app, you mark it async to ensure it doesn't tie up a process while it's waiting for things, you open your image, you call .predict, and you return the response; then you can use a JavaScript client or whatever to show it. And that's it; that's basically the main contents of your web app.

So give it a go this week, even if you've never created a web application before. There are a lot of nice little tutorials online and starter code; if in doubt, why don't you try Starlette. There's free hosting you can use; there's one called PythonAnywhere, for example, and the one Simon used (we'll mention it on the forum) basically lets you package it up as a Docker thing and ship it off, and it'll serve it up for you, so it doesn't even need to cost you any money. All these classifiers you're creating you can turn into web applications; it'll be really fun to see what you're able to make of that.
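Putting those two pieces together, here is a hedged sketch of the inference setup and a minimal Starlette route in the style of Simon Willison's app. The single_from_classes call is the lesson-era fastai v1 API (later versions use learn.export plus load_learner instead), and the get_bytes helper, route name, and file names are illustrative assumptions rather than anything from the lesson.

```python
import aiohttp
from io import BytesIO
from fastai.vision import *
from starlette.applications import Starlette
from starlette.responses import JSONResponse

path = Path('data/bears')
classes = ['black', 'grizzly', 'teddies']      # must match the training classes, in the same order

# defaults.device = torch.device('cpu')        # uncomment to force CPU inference on a GPU box

# an "empty" DataBunch: no data, just the transforms/size/normalization used in training
data2 = (ImageDataBunch.single_from_classes(path, classes,
                                            ds_tfms=get_transforms(), size=224)
         .normalize(imagenet_stats))
learn = cnn_learner(data2, models.resnet34)
learn.load('stage-2')                          # the weights saved after training

app = Starlette()

async def get_bytes(url):
    # hypothetical helper: fetch the image bytes without blocking a worker process
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()

@app.route('/classify-url', methods=['GET'])
async def classify_url(request):
    img_bytes = await get_bytes(request.query_params['url'])
    img = open_image(BytesIO(img_bytes))
    pred_class, pred_idx, losses = learn.predict(img)
    return JSONResponse({'prediction': str(pred_class)})

# run with e.g.:  uvicorn server:app
```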
Okay, so let's take a break; we'll come back at 7:35. See you then.

Okay, so let's move on. I mentioned that most of the time, the rules of thumb I've shown you probably work, and if you look at the Share Your Work thread, you'll find most of the time people post things like "I downloaded these images, I tried this thing, it works much better than I expected." That's cool. And then about 1 out of 20 says "I had a problem." So let's talk about what happens when you have a problem. This is where we start getting into a little bit of theory, because in order to understand these problems and how we fix them, it really helps to know a little bit about what's going on. First of all, let's look at examples of some problems. The problems will basically be that either your learning rate is too high or too low, or your number of epochs is too high or too low. We're going to learn about what those mean and why they matter, but first of all, because we're experimentalists, let's try them.

Let's go with our teddy bear detector and make our learning rate really high. The default learning rate is 0.003, which works most of the time. So what if we try a learning rate of 0.5? That's huge. What happens? Our validation loss gets really, really high. Remember, this is normally something underneath 1. So if you see your validation loss do that, even before you know what validation loss is, just know this: if it does that, your learning rate is too high. That's all you need to know; make it lower. It doesn't matter how many epochs you do: if this happens, there's no way to undo it. You have to go back, create your neural net again, and fit from scratch with a lower learning rate.

That's learning rate too high. What about learning rate too low? Instead of the default 0.003, what if we used 1e-5 (0.00001)? For comparison, here's what happened when we trained before with our default learning rate: within one epoch we were down to a 2-3% error rate. With this really low learning rate, our error rate does get better, but very, very slowly. And you can plot it: learn.recorder is an object which keeps track of lots of things happening while you train, and you can call plot_losses to plot the validation and training loss. You can see them both going down so gradually, so slowly. If you see that happening, then you have a learning rate which is too small: bump it up by 10x or 100x and try again. The other thing you'll see if your learning rate is too small is that your training loss will be higher than your validation loss. You never want a model where your training loss is higher than your validation loss; that always means you haven't fitted enough, which means either your learning rate is too low or your number of epochs is too low. So if you have a model like that, train it some more, or train it with a higher learning rate.

Too few epochs: what if we train for just one epoch? Our error rate is certainly better than random, at 5%, but look at the difference between the training loss and the validation loss: the training loss is much higher than the validation loss. So too few epochs and too low a learning rate look very similar. You can just try running more epochs, and if it's taking forever, you can try a higher learning rate; if you try a higher learning rate and the loss goes off to 100,000 million, put it back to where it was and try a few more epochs. That's the balance; that's basically all you care about 99% of the time, and this is only the 1 time in 20 that the defaults don't work for you.
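These failure modes are easy to reproduce on the same bears DataBunch; here is a quick sketch (the specific learning rates and epoch counts are just the ones discussed above):

```python
# learning rate too high: validation loss explodes and you must restart from scratch
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(1, max_lr=0.5)

# learning rate too low: error improves very slowly, training loss stays above validation loss
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(5, max_lr=1e-5)
learn.recorder.plot_losses()          # plots training and validation loss over the run

# too few epochs looks much the same as too low a learning rate
learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(1)
```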
Too many epochs create something called overfitting, which we're going to talk more about. If you train for too long, as we're going to learn, the model will learn to recognize your particular teddy bears, but not teddy bears in general. Here's the thing, though: despite what you may have heard, it's very hard to overfit with deep learning. We were trying today to show you an example of overfitting, and I turned off everything (we're going to learn about all these terms soon): I turned off all the data augmentation, I turned off dropout, I turned off weight decay. I tried to make it overfit as much as I could, I trained it on a smallish learning rate for a really long time, and maybe I started to get it to overfit. Maybe.

The only thing that tells you you're overfitting is that the error rate improves for a while and then starts getting worse again. You will see a lot of people, even people who claim to understand machine learning, tell you that if your training loss is lower than your validation loss, then you are overfitting. As you'll learn today, and in more detail over the rest of the course, that is absolutely not true. Any model that is trained correctly will always have a training loss lower than its validation loss. That is not a sign of overfitting; that is not a sign you've done something wrong; that is a sign you have done something right. The sign that you are overfitting is that your error starts getting worse, because that's what you care about: you want your model to have a low error. As long as your model's error is improving, you are not overfitting. How could you be?

So those are basically the four main things that can go wrong. There are some other details we'll learn about during the rest of this course, but honestly, if you stopped listening now (please don't, that would be embarrassing) and just said "okay, I'm going to go and download images, I'm going to create CNNs with ResNet-34 or ResNet-50, I'm going to make sure my learning rate and number of epochs are okay, and then I'm going to chuck them up in a web app," most of the time you'd be done, at least for computer vision. Hopefully you'll stick around, because you want to learn about NLP and collaborative filtering and tabular data and segmentation and stuff like that as well.

Let's now understand what's actually going on. What does "loss" mean? What does an "epoch" mean? To really understand these ideas, you need to know what's going on underneath. So we're going to go all the way to the other side: rather than creating a state-of-the-art cougar detector, we're going to go back and create the simplest possible linear model. We're going to start seeing a little bit of math, but don't be turned off; it's going to be totally fine.

The first thing to realize is that when we see a picture like this number 8, it's actually just a bunch of numbers: for this grayscale image, it's a matrix of numbers. If it were a color image, it would have a third dimension, and then we call it a tensor rather than a matrix; it would be a 3D tensor of numbers for red, green, and blue. So when we created that teddy bear detector, what we actually did was create a mathematical function that took the numbers from the images of the teddy bears and converted those numbers into, in our case, three numbers: a number for the probability that it's a teddy, a
probability that it's a grizzly, and a probability that it's a black bear. In this digit example, there's some hypothetical function taking the pixels representing a handwritten digit and returning ten numbers: the probability for each possible outcome, the numbers from 0 to 9. What you'll often see in our code, and in other deep learning code, is that you'll find a bunch of probabilities and then a function called max or argmax attached to them. What that function does is find the highest number and tell you what its index is. So np.argmax or torch.argmax of this array would return this index here.

In fact, let's try it. We know that the function to predict something is called learn.predict, so we can put two question marks before or after it to get the source code, and here it is: pred = res.argmax, and then to get the class, you just pass that index into the classes array. You should find that reading the source code in the fastai library can both strengthen your understanding of the concepts and make sure you know what's going on; it can really help you here.

We've got a question: can we have a definition of the error rate being discussed, and how it is calculated? Is it cross-validation error? Sure. One way to answer the question of how error rate is calculated would be to type error_rate followed by a question mark and look at the source code: it is 1 minus accuracy. Fair enough. So then the question might be, what is accuracy? accuracy followed by a question mark: it is argmax (so find out which particular class the model predicts), then look at how often that equals the target (in other words, the actual value), and take the mean. That's basically what it is. So then the question is, what is that being applied to? In fastai, metrics (these things that we pass in, we call them metrics) are always applied to the validation set. Any time you put a metric here, it will be applied to the validation set, because that's best practice: you always want to check your performance on data that your model hasn't seen, and we'll be learning more about the validation set shortly.

Remember, you can also type doc if the source code isn't what you want; that will give you both a summary of the types of the function's inputs and output and a link to the full documentation, where you can find out all about how metrics work, what other metrics there are, and so forth. Generally speaking, you'll also find links to more information, sample code and so forth showing how to use all these things, so don't forget that the doc function is your friend. Also, both in the doc output and in the documentation, you'll see a "source" link; this is like the double question mark, but what the source link does is take you to the exact line of code in GitHub, so you can see exactly how it's implemented. There's lots of good stuff there.
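For reference, the two metrics discussed here boil down to a few lines; this is a simplified paraphrase of the fastai v1 definitions (the real library versions handle a few more argument shapes):

```python
import torch

def accuracy(input, targs):
    "Fraction of predictions whose argmax matches the target class."
    n = targs.shape[0]
    input = input.argmax(dim=-1).view(n, -1)
    targs = targs.view(n, -1)
    return (input == targs).float().mean()

def error_rate(input, targs):
    "1 - accuracy, i.e. the fraction of validation items predicted incorrectly."
    return 1 - accuracy(input, targs)
```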
Another question: why were you using threes for your learning rates earlier, with 3e-5 and 3e-4? We found that 3e-3 is just a really good default learning rate; it works most of the time for your initial fine-tuning, before you unfreeze, and then you just scale from there. So I generally find that for the next stage, I pick 10 times lower than that for the second part of the slice, and whatever the learning rate finder found for the first part of the slice. The second part of the slice doesn't come from the learning rate finder; it's just a rule of thumb, roughly 10 times less than your first-stage learning rate, which defaults to 3e-3. The first part of the slice is what comes out of the learning rate finder. We'll be learning a lot more about these learning rate details both today and in the coming lessons, but for now, all you need to remember is that your basic approach looks like this: learn.fit_one_cycle for some number of epochs (I often pick four) and some learning rate which defaults to 3e-3 (I'll type it out fully so you can see). We do that for a bit, then we unfreeze, and then we learn some more. This is the bit where I just take whatever I used last time and divide it by 10, and then I also have to put one more number in, which is the number I get from the learning rate finder: the bit where it's got the strongest slope. That's the "don't have to think about it, don't really have to know what's going on" rule of thumb that works most of the time.

But let's now dig in and actually understand it more completely. We're going to create this mathematical function that takes the numbers representing the pixels and spits out probabilities for each possible class. By the way, a lot of the material we're using here we are borrowing from other people who are awesome, and we put their details on the slides, so please check out their work, because it's great work that we're highlighting in our course. I really like this animated GIF of the numbers, so thank you to Adam Geitgey for creating it in that terrific series of Medium posts.

So let's look at how we create one of these functions, and let's start with the simplest function I know: y = ax + b. That's a line, where a is the gradient of the line and b is the intercept. Hopefully, when we said you need to know high school math to do this course, these are the things we're assuming you remember. If we do mention some math thing that I'm assuming you remember and you don't, don't freak out; it happens to all of us. Khan Academy is actually terrific; it's not just for school kids. Go to Khan Academy, find the concept you need a refresher on, and he explains things really well; I strongly recommend checking it out. Remember, I'm just a philosophy student, so all the time I'm either reminding myself about something or learning something I never learnt, and we have the whole internet to teach us these things.

So I'm going to rewrite this slightly as y = a1·x + a2; I'm just replacing b with a2, giving it a different name. Another way of saying the same thing would be y = a1·x + a2·1, multiplying a2 by the number one; this is still the same thing. And now, at this point, I'm actually going to say: let's not put the number one there, but let's put an x1 here and an x2 here, and say x2 = 1. So far this is pretty early high school math, multiplying by 1, which I think we can handle, so these two are equivalent with a bit of renaming.

Now, in machine learning we don't just have one equation, we've got lots. If we've got some data that represents temperature versus the number of ice creams sold, then we have lots of dots, and each one of those dots we might
hypothesize is based on this formula: y = a1·x1 + a2·x2. So basically, there are lots of values of y, so we can stick a little i subscript on it, and there are lots of values of x, so we can stick a little i on those too. The way we do that is a lot like numpy or PyTorch indexing, but rather than putting things in square brackets, we put them in the subscript of our equation. So this is now saying there are actually lots of different y_i, based on lots of different x_i,1 and x_i,2, but notice there's still only one of each of these coefficients; these are called the coefficients, or the parameters. So this is our equation, and we're still going to say that every x_i,2 is equal to 1.

Why did I do it that way? Because I want to do linear algebra. Why do I want to do linear algebra? Well, one reason is that Rachel teaches the world's best linear algebra course, so if you're interested, check out Computational Linear Algebra for Coders, a free course we make no money from, but never mind. More to the point right now, it's going to make life much easier, because I hate writing loops, I hate writing code, I just want the computer to do everything for me. Any time you see those little i subscripts, it sounds like you're going to have to write loops and all kinds of stuff. But what you might remember from school is that when you've got two things being multiplied together pairwise and then added up, that's called a dot product, and if you do that for lots and lots of different rows i, that's called a matrix product. So in fact this whole thing can be written like this: rather than lots of different y_i, we can say there's one vector called y which is equal to one matrix called X times one vector called a.

Now at this point, I know a lot of you don't remember that; that's fine, we have a picture to show you. I didn't know who created this, but now I do: somebody called André Staltz created this fantastic site called matrixmultiplication.xyz. Here we have a matrix multiplied by a vector, and we're going to do a matrix-vector product: the first row times the vector, multiplied element by element and added up, gives the first entry; then the same for the next row, and the next, until we're finished. That is what matrix-vector multiplication does. In other words, it's just our equation, except his version is much less messy.
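Written out, the rewriting being described is just the move from one equation per data point to a single matrix-vector product, with the dummy second input fixed at 1:

```latex
y_i = a_1 x_{i,1} + a_2 x_{i,2}, \qquad x_{i,2} = 1 \;\; \text{for every } i
\qquad\Longrightarrow\qquad
\vec{y} = X \vec{a}, \quad
X = \begin{bmatrix} x_{1,1} & 1 \\ x_{2,1} & 1 \\ \vdots & \vdots \\ x_{n,1} & 1 \end{bmatrix}, \quad
\vec{a} = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix}
```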
Unfortunately there's no shortcut — I wish there was. But I will say this: most of the time you need less data than you think. Organizations very commonly spend too much time gathering data, getting more data than it turned out they actually needed. So get a small amount first and see how you go.

"What do you do if you have unbalanced classes, such as 200 grizzlies and 50 teddies?" Nothing — try it, it works. A lot of people ask this question about how to deal with unbalanced data. I've done lots of analysis with unbalanced data over the last couple of years and I just can't make it not work; it always works. There's actually a paper that said if you want to get it slightly better, the best thing to do is to take that uncommon class and just make a few copies of it — that's called oversampling — but I haven't found a situation in practice where I needed to do that; it always just works fine for me.

"Once you unfreeze and retrain with one cycle again, if your training loss is still lower than your validation loss (likely underfitting), do you retrain it unfrozen again, which will technically be more than one cycle, or do you redo everything with a longer cycle?" Hey, you guys asked me that last week — my answer is still the same: I don't know. Either is fine. If you do another cycle, it'll maybe generalize a little bit better; if you start again and do it twice as long, it depends how patient you are. It won't make much difference. Personally, I normally just train a few more cycles, but it doesn't make much difference most of the time.

"In the code sample where you were creating a CNN with ResNet34 for the grizzly/teddy classifier — this requires ResNet34, which I find surprising. Would the file created by .save, which is about 85 megabytes on disk, be able to run without also needing a copy of ResNet34?" Yeah, and I understand — we're going to be learning all about this shortly. You don't: there's no copy of ResNet34. ResNet34 is actually what we call an architecture — we're going to be learning a lot about this — it's a functional form. It doesn't take up any room, it doesn't contain anything, it's just a function. ResNet34 is just a function; it doesn't store anything. I think the confusion here is that we often use a pre-trained neural net that's been learned on ImageNet. In this case we don't need to use a pre-trained neural net, and to avoid the pre-trained weights even getting loaded, you can actually pass pretrained=False; that will ensure that nothing even gets loaded, which will save you another 0.2 seconds, I guess. But we'll be learning a lot more about this, so don't worry if it's a bit unclear. The basic idea is that this thing here is the basic equivalent of saying "is it a line, or is it a quadratic, or is it a reciprocal function?" This is the ResNet34 function. It's a mathematical function; it doesn't take any storage, it doesn't have any numbers, it doesn't have to be loaded — as opposed to a pre-trained model. That's why, when we did it at inference time, the thing that took space was this bit, where we load our parameters — which is basically saying, as we're about to find out, what are the values of a and b; we have to store those numbers. But for ResNet34 you don't just store two numbers, you store a few million, or a few tens of millions, of numbers.
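As a rough sketch of what that looks like in the fastai v1 API used in this course (create_cnn was the constructor at the time, and data here is assumed to be the ImageDataBunch you built earlier — treat this as illustrative rather than verbatim):

```python
from fastai.vision import *   # fastai v1 style import used in the course notebooks

# resnet34 is just the architecture (a function); pretrained=False means the
# ImageNet weights are never downloaded or loaded into it.
learn = create_cnn(data, models.resnet34, metrics=error_rate, pretrained=False)
```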
So why did we do all this? It's because I wanted to be able to write it out like that: we can now do it in PyTorch with no loops, in a single line of code, and it's also going to run faster. PyTorch really doesn't like loops; it really wants you to send it a whole equation to do all at once, which means you really want to try to specify things in these linear algebra kinds of ways.

So let's go and take a look, because what we're going to do is take this — we're going to call it an architecture; it's like the world's tiniest neural network, with two parameters, a1 and a2 — and we're going to try to fit this architecture to some data. So let's jump into a notebook, generate some dots, and see if we can get it to fit a line somehow. And "somehow" is going to be using something called SGD. What is SGD? Well, there are two types of SGD. The first one is where I said in lesson 1, "hey, you should all try building these models and try to come up with something cool," and you all experimented and found really good stuff — so that's where the S would be "student": student gradient descent. That's version one of SGD. Version two of SGD, which is what we're going to talk about today, is where we're going to have a computer try lots of things and try to come up with a really good function, and that will be called stochastic gradient descent. The other one that you hear a lot about on Twitter is stochastic grad student descent.

So we're going to jump into the lesson 2 SGD notebook, and we're going to go bottom-up rather than top-down. We're going to create the simplest possible model we can, which is going to be a linear model, and the first thing that we need is some data. So we're going to generate some data, and the data we're going to generate looks like this: this might represent temperature, and this might represent the number of ice creams we sell, or something like that — but we're just going to create some synthetic data that we know follows a line. With all of this we're actually going to learn a little bit about PyTorch as well.

Basically, the way we're going to generate this data is by creating some coefficients — a1 will be 3 and a2 will be 2 — and we're going to create, like we looked at before, basically a column of numbers for our x's and a whole bunch of ones, and then we're going to do this: x@a. The @ in Python means a matrix product between x and a. It's actually even more general than that: it can be a vector–vector product, a matrix–vector product, a vector–matrix product, or a matrix–matrix product — and in PyTorch specifically it can mean even more general things, where we get into higher-rank tensors, which we'll learn all about very soon. But this is basically the key thing that's going to go on in all of our deep learning: the vast majority of the time, our computers are going to be multiplying numbers together and adding them up, which is a surprisingly useful thing to do.

Okay, so we're basically going to generate some data by creating a line and then adding some random numbers to it. But let's go back and see how we created x and a. I mentioned that we've basically got these two coefficients, 3 and 2, and you'll see that we've wrapped them in this function called tensor. You might have heard this word "tensor" before — who's heard the word tensor before? About two-thirds of you. It's one of these words that sounds scary, and apparently if you're a physicist it actually is scary, but in the world of deep learning it's not scary at all. Tensor means array — specifically, an array of a regular shape. So it's not an array where row 1 has 2 things, row 3 has 3 things, and row 4 has 1 thing — what you'd call a jagged array; that's not a tensor.
A tensor is any array that has a rectangular or cube (or whatever) shape, where every row is the same length and every column is the same length. So a 4 by 3 matrix would be a tensor; a vector of length 4 would be a tensor; a 3D array of 3 by 4 by 6 would be a tensor. That's all a tensor is, and we have these all the time. For example, an image is a three-dimensional tensor: it goes number of rows by number of columns by number of channels — normally red, green, blue. So a VGA picture, for example, would be 640 by 480 by 3 — although we do things backwards: when people talk about images they normally go width by height, but when we talk mathematically we always go number of rows by number of columns, so it would actually be 480 by 640. That will catch you out.

We don't say "dimensions," though. With tensors we use one of two words: we either say rank or axes. Rank specifically means how many axes there are, how many dimensions there are; so an image is generally a rank-3 tensor. What we've created here is a rank-1 tensor, also known as a vector. In math, people come up with very different words for slightly different concepts: why is a one-dimensional array a vector, a two-dimensional array a matrix, and a three-dimensional array — does that even have a name? Not really; it doesn't have a name. That doesn't make much sense, and with computers we try to have simple, consistent naming conventions: they're all called tensors — rank-1 tensor, rank-2 tensor, rank-3 tensor. You can certainly have a rank-4 tensor: if you've got 64 images, that would be a rank-4 tensor of 64 by 480 by 640 by 3, for example. So tensors are very simple; they just mean arrays.

So in PyTorch you say tensor and you pass in some numbers, and you get back, in this case, a vector. This then represents our coefficients: the slope and the intercept of our line. And because, remember, we're not actually going to have a special case of ax + b — instead we're going to say there's always this second x value, which is always 1 (you can see it here: always 1) — that allows us to do a simple matrix–vector product.

So that's a. Then we want to generate this x array of data, which is going to have random numbers in the first column and a whole bunch of ones in the second column. To do that, we say to PyTorch that we want to create a tensor of n by 2; since we passed in a total of two things, we get a rank-2 tensor — the number of rows will be n and the number of columns will be 2 — and every single thing in it will be a 1: that's what torch.ones means. Then, and this is really important, you can index into that just like you can index into a list in Python, but you can put a colon anywhere, and a colon means every single value on that axis, on that dimension. So this here means every single row, and this here means column 0 — so this is every row of column 0 — and I want you to grab a uniform random number. And here's another very important concept in PyTorch: any time you've got a function that ends in an underscore, it means don't return the uniform random number to me, but replace whatever this is called on with the result of this function. So this takes column 0 and replaces it with a uniform random number between minus 1 and 1.
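Put together, those lines look roughly like this — a sketch of the lesson2-sgd setup written in plain PyTorch (the notebook itself uses fastai's tensor(3., 2.) helper, which is equivalent to torch.tensor([3., 2.]), and the exact noise term is illustrative):

```python
import torch
import matplotlib.pyplot as plt

n = 100
x = torch.ones(n, 2)           # rank-2 tensor: n rows, 2 columns, all ones
x[:, 0].uniform_(-1., 1.)      # replace column 0 in place with uniform random numbers
a = torch.tensor([3., 2.])     # the "true" coefficients we'll later pretend not to know
y = x @ a + torch.rand(n)      # @ is the matrix product; torch.rand adds a little noise

plt.scatter(x[:, 0], y)        # plot the random column against y; ignore the column of ones
plt.show()
```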
There's a lot to unpack there, but the good news is that those two lines of code, plus this one, cover 95% of what you need to know about PyTorch: how to create an array, how to change things in an array, and how to do matrix operations on an array. It's a small number of concepts, but they're incredibly powerful.

So I can now print out the first five rows — colon-5 is standard Python slicing syntax for the first five rows — and here they are: five rows, two columns, looking like my random numbers and my ones. Now I can do a matrix product of that x with my a, add in some random numbers for a bit of noise, and then I can do a scatter plot. I'm not really interested in plotting the column of ones — they're just there to make my linear function more convenient — so I'm just going to plot my index-0 column against my y's, and there it is. plt is what we universally use to refer to the library matplotlib, and that's what most people use for most of their plotting in scientific Python. It's certainly a library you'll want to get familiar with, because being able to plot things is really important. There are lots of other plotting packages, and some of them are better at certain things than matplotlib, but matplotlib can do everything reasonably well; sometimes it's a little awkward, but I do pretty much everything in matplotlib because there's really nothing it can't do, even though some libraries do particular things a little better or a little prettier. So once you know matplotlib, you can do everything. Here I'm asking matplotlib to give me a scatter plot of my x's against my y's, and there it is: this is my dummy data representing, say, temperature and ice cream sales.

So now what we're going to do is pretend we were given this data and we don't know that the values of our coefficients are 3 and 2 — we're going to pretend we never knew that and we have to figure them out. So how would we figure them out? How would we draw a line to fit this data, and why would that even be interesting? Well, we'll look more at why it's interesting in just a moment, but the basic idea is this — and it may be really surprising: if we can find a way to find those two parameters to fit that line to those points (how many points were there? n was 100) — to fit a line to those 100 points — then we can also fit these arbitrary functions that convert from pixel values to probabilities. It turns out that the technique we're going to learn to find these two numbers works equally well for the 50 million numbers in ResNet34, so we're actually going to use an almost identical approach. And this is the bit that, in previous classes, I've found people have the most trouble digesting. I often find, even after week four or week five, people will come up to me and say, "I don't get it — how do we actually train these models?" And I'll say, "It's SGD, it's that thing we saw in the notebook with the two numbers." "But we're fitting a neural network." "I know, and we can't print the 50 million numbers any more, but it is literally, identically, doing the same thing." The reason this is hard to digest is that the human brain has a lot of trouble conceptualizing what an equation with 50 million numbers looks like and can do. So for now you'll just have to take my word for it: it can do things like recognize teddy bears. These functions turn out to be very powerful, and we're going to learn a little bit more in just a moment about how to make them extra powerful.
But this thing we're going to learn — how to fit these two numbers — is the same thing we've just been using to fit 50 million numbers. Okay, so we want to find what PyTorch calls parameters, or what in statistics you'll often hear called coefficients: these values a1 and a2. We want to find these parameters such that the line they create minimizes the error between that line and the points. In other words, if the a1 and a2 we came up with resulted in this line, then we'd look and see how far away the line is from each point, and we'd say, oh, that's quite a long way. Maybe there was some other a1 and a2 which resulted in this other line, and again we'd ask how far away each of those points is. Eventually we come up with this line, and in this case each of those is actually very close. So you can see how, in each case, we can say how far away the line is, at each spot, from its point, and then take the average of all of those — and that's called the loss. That is the value of our loss. So you need some mathematical function that can basically say how far away this line is from those points.

For this kind of problem — which is called a regression problem, a problem where your dependent variable is continuous, so rather than being grizzlies or teddies it's some number between minus one and six — the most common loss function is called mean squared error, which pretty much everybody calls MSE. You may also see RMSE, which is root mean squared error. The mean squared error is a loss: it's the difference between some prediction that you've made — the value on the line — and the actual number of ice cream sales. In the mathematics of this, people normally refer to the actual as y, and the prediction they normally call y hat — they write it like that. And when we're writing something like the mean squared error equation, there's no point writing "ice cream" here and "temperature" here, because we want it to apply to anything, so we tend to use these mathematical placeholders.

So the value of mean squared error is simply the difference between those two, squared; and then we can take the mean. Because remember, that is actually a vector — or, as we now say, a rank-1 tensor — and that is also a rank-1 tensor: the number of ice cream sales at each place. So when we subtract one vector from another vector — and we'll be learning a lot more about this — it does something called element-wise arithmetic; in other words, it subtracts each element from the corresponding one, and we end up with a vector of differences. If we take the square of that, it squares everything in that vector, and then we can take the mean to find the average of the squared differences between the actuals and the predictions. If you're more comfortable with mathematical notation, what we just wrote there was the sum of (which way around do we do it?) y hat minus y, squared, over n — so that equation is the same as this code. One thing I'll note here is that I don't think the math is any more complicated or unwieldy than the code, but the benefit of the code is that you can experiment with it: once you have defined it, you can use it, you can send things into it and get stuff out of it and see how it works. So for me, most of the time, I prefer to explain things with code rather than with math.
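For reference, here's that mean squared error function as a minimal sketch, matching the description above (y_hat is the prediction, y the actual):

```python
def mse(y_hat, y):
    # mean squared error: the average of the squared differences,
    # i.e. sum((y_hat - y)**2) / n
    return ((y_hat - y) ** 2).mean()
```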
They really are the same — in this case, and in all the cases we'll look at, they're exactly the same thing, just different notations for the same thing — but one of the notations is executable, something you can experiment with, and one of them is abstract. That's why I'm generally going to show code. So the good news is, if you're a coder without much of a math background, you actually do have a math background, because code is math. And if you've got more of a math background and less of a code background, then a lot of the stuff you learned in math is going to translate very directly into code, and now you can really start to experiment with your math.

Okay, so this is a loss function: something that tells us how good our line is. Now we have to come up with the line that fits through here — because remember, we're pretending we don't know. So what you actually have to do is guess: you have to come up with a guess for the values of a1 and a2. So let's say we guessed some values — sorry, the first one is minus 1, the second is 1 — so this is our tensor a, and here is how we create that tensor. I wanted to write it this way because you'll see it all the time: written out fully, it would be -1.0 and 1.0. We can't write it without the point, because then it's an int, not a floating point, and that's going to spit the dummy if you try to do calculations with it in a neural net. But I'm too lazy to type .0 every time, and Python knows perfectly well that if you add a dot to any of these numbers, the whole thing is now a float. That's why you'll often see it written this way, particularly by lazy people like me.

Okay, so a is a tensor. You can see it's floating point — even PyTorch is lazy, they just put a dot, they don't bother with a zero — but if you want to see exactly what it is, you can write .type() and you can see it's a FloatTensor. And so now we can calculate our predictions with this random guess — x@a, a matrix product of x and a — and we can calculate the mean squared error of our predictions against our actuals, and that's our loss. So for this regression, our loss is 0.9. We can now plot a scatter plot of x against y, and a scatter plot of x against y hat, our predictions, and there they are. This is the (-1, 1) line, and here are our actuals — so that's not great, which is not surprising; it's just a guess.

So SGD — or gradient descent, more generally; anybody who's done engineering, or probably computer science, at school will have done plenty of this, Newton's method and all that stuff you did at university, and if you didn't, don't worry, we're going to learn it now — is basically about taking this guess and trying to make it a little bit better. So how do we make it a little bit better? Well, there are only two numbers, and those two numbers are the intercept of that orange line and the gradient of that orange line. So what we're going to do with gradient descent is simply say: what if we changed those two numbers a little bit? What if we made the intercept a little bit higher or a little bit lower? What if we made the gradient a little bit more positive or a little bit more negative? That's four possibilities, and we can just calculate the loss for each of those four possibilities and see which worked: did lifting it up or down make it better? Did tilting it more positive or more negative make it better?
And then all we do is say: okay, whichever one of those made it better, that's what we're going to do. And that's it. But here's the cool thing, for those of you that remember calculus: you don't actually have to move it up and down and around — you can calculate the derivative. The derivative is the thing that tells you: would moving it up or down make it better, or would rotating it this way or that way make it better? So the good news is, if you didn't do calculus, or you don't remember calculus, I just told you everything you need to know about it: it tells you how changing one thing changes the function. That's what the derivative is — kind of, not quite strictly speaking, but close enough. It's also called the gradient. The gradient, or the derivative, tells you how changing a1 up or down would change our MSE, and how changing a2 up or down would change our MSE — and it does it much more quickly than actually moving them up and down.

In school, unfortunately, they forced us to sit there and calculate these derivatives by hand. We have computers; computers can do that for us. We are not going to calculate them by hand — instead we're going to have PyTorch calculate the gradient for us and read it out of .grad. So here's what we're going to do: we're going to create a loop, loop through 100 times, and call a function called update. That function is going to calculate y hat, our prediction; it's going to calculate loss, our mean squared error; from time to time it will print that out so we can see how we're going; and it will then calculate the gradient. In PyTorch, calculating the gradient is done by using a method called backward. So you'll see something really interesting: mean squared error is just a standard mathematical function, but PyTorch keeps track of how it was calculated and lets us calculate the derivative. So if you do a mathematical operation on a tensor in PyTorch, you can call backward to calculate the derivative, and that derivative gets stuck inside an attribute called .grad.

So I'm going to take my coefficients a and subtract my gradient from them — and there's an underscore here; why? Because that's going to do it in place: it's going to actually update those coefficients a, subtracting the gradients from them. Why do we subtract? Because the gradient tells us that if I move the whole thing in that direction, the loss goes up, and I want to do the opposite of the thing that makes it go up, because we want our loss to be small. That's why we subtract. And then there's something here called lr: lr is our learning rate, and literally all it is is the number we multiply the gradient by before we subtract it.
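Here's a minimal sketch of that loop — not the notebook verbatim; it also spells out the gradient-zeroing and no_grad details that are only mentioned in passing below, and the learning rate value is just illustrative:

```python
a = torch.tensor([-1., 1.], requires_grad=True)  # our initial guess at the coefficients
lr = 1e-1                                        # learning rate

def update():
    y_hat = x @ a                  # predictions
    loss = mse(y_hat, y)           # how far off are we?
    loss.backward()                # PyTorch fills a.grad with d(loss)/d(a)
    with torch.no_grad():          # don't track the update step itself
        a.sub_(lr * a.grad)        # step in the opposite direction to the gradient
        a.grad.zero_()             # reset the gradient for the next iteration
    return loss.item()

for t in range(100):
    loss = update()
    if t % 10 == 0:
        print(loss)
```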
So why is there an lr at all? Let me show you why. Let's take a really simple example: a quadratic. And let's say your algorithm's job is to find where that quadratic is at its lowest point. How could it do this? Well, just like what we're doing now, the starting point would be to pick some x value at random, and then pop up here to find out what the value of y is. That's its starting point. Then it can calculate the gradient, and the gradient is simply the slope: it tells you which direction to move in to go down. So the gradient tells you you have to go this way. If the gradient was really big, you might jump a very long way — you might jump all the way over to here, maybe even to here — and if you jumped over to there, that's not going to be very helpful, because look where it takes us: it's now worse. We jumped too far. So we don't want to jump too far; maybe we should just jump a little bit, maybe to here, and the good news is that that is actually a little bit closer. Then we just do another little jump, see what the gradient is, do another little jump that takes us to here, and another to here, and another to here.

In other words, we find our gradient to tell us roughly what direction to go and how far, but then we multiply it by some number less than one so that we don't jump too far. And hopefully at this point this might be reminding you of something — which is what happened when our learning rate was too high. Do you see why that happened now? A learning rate that's too high means we jump all the way past the right answer, further than where we started, and it gets worse and worse and worse. That's what a learning rate that's too high does. On the other hand, if our learning rate is too low, you just take tiny little steps, and eventually you're going to get there, but you're doing lots and lots of calculations along the way. So you really want something that gets in there quickly — maybe even bouncing backwards and forwards a little — but not so high that it jumps out and diverges, and not so low that it takes lots and lots of steps. That's why we need a good learning rate, and that's all it does.

So if you look inside the source code of any deep learning library, you'll find this: something that just says coefficients, subtract, learning rate times gradient. We'll learn about some — not minor, but easy and important — optimizations we can make to get this to go faster, but that's basically it. There are a couple of other little issues that we don't need to talk about now: one involving zeroing out the gradients, another involving making sure that you turn gradient calculation off when you do the SGD update. If you're interested, we can discuss them on the forum, or you can do our introduction to machine learning course, which covers all the mechanics of this in more detail. But this is the basic idea.

So if we run update 100 times, printing out the loss from time to time, you can see it starts at 8.9 and goes down, down, down, and we can then print out scatter plots, and there it is. That's it — believe it or not, that's gradient descent. So we just need to start with a function that's a bit more complex than x@a, but as long as we have a function that can represent things like "is this a teddy bear," we now have a way to fit it.
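If you want to get a feel for the learning rate story above with actual numbers, here's a tiny toy example (not from the notebook): gradient descent on y = x², whose derivative is 2x.

```python
x_val = 2.0     # arbitrary starting point
lr = 0.1        # try 1.1 to watch it diverge, or 0.001 to watch it crawl along
for i in range(20):
    grad = 2 * x_val        # slope of x**2 at the current point
    x_val -= lr * grad      # step downhill
print(x_val)                # ends up close to 0, the minimum of the quadratic
```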
And so let's now take a look at this as a picture, as an animation. This is one of the nice things you can do with matplotlib: you can take any plot and turn it into an animation, so you can actually see it updating at each step. Let's see what we did here: we simply said, as before, create a scatter plot, but then rather than writing a loop we used matplotlib's FuncAnimation to call this function 100 times, and this function just called that update we created earlier and updated the y data in our line. It did that 100 times, waiting 20 milliseconds after each one, and there it is. You might think that visualizing your algorithms with animations is some amazing and complex thing to do, but actually — now you know — it's eleven lines of code, so I think that's pretty damn cool. So that is SGD, visualized. We can't visualize as conveniently what updating 50 million parameters in a ResNet34 looks like, but it's basically doing the same thing, and studying these simple versions is a great way to build an intuition. You should try running this notebook with a really big learning rate and with a really small learning rate and see what the animation looks like, and try to get a feel for it. Maybe you can even try a 3D plot — I haven't tried that yet, but I'm sure it would work fine.

So the only difference between stochastic gradient descent and this is something called mini-batches. You'll see that what we did here was calculate the value of the loss on the whole dataset on every iteration, but if your dataset is the 1.5 million images in ImageNet, that's going to be really slow: just to do a single update of your parameters, you've got to calculate the loss on 1.5 million images. You wouldn't want to do that. So what we do instead is grab 64 or so images at a time, at random, calculate the loss on those 64 images, and update our weights; then we grab another 64 random images and update the weights again. In other words, the loop looks exactly the same, except at this point we'd have some random indexes on our x and the same random indexes on our y, to do one mini-batch at a time. That's the basic difference: we grab a random few points each time, those random few points are called your mini-batch, and that approach is called SGD, or stochastic gradient descent.

So there's quite a bit of vocab we've just covered; let's remind ourselves. The learning rate is a thing we multiply our gradient by to decide how much to update the weights by. An epoch is one complete run through all of our data points, all of our images. For the non-stochastic gradient descent we just did, every single loop used the entire dataset; but if you've got a dataset with a thousand images and your mini-batch size is a hundred, then it would take you ten iterations to see every image once, and that would be one epoch. Epochs are important because, if you do lots of epochs, you're looking at your images lots of times, and every time you see an image there's a bigger chance of overfitting, so we generally don't want to do too many epochs. A mini-batch is just a random bunch of points that you use to update your weights. SGD is just gradient descent using mini-batches. Architecture and model kind of mean the same thing — in this case our architecture is y = x@a; the architecture is the mathematical function that you're fitting the parameters to.
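To make the mini-batch idea concrete, here's a sketch of how the loop above changes (illustrative rather than the notebook's exact code; it reuses the x, y, a, lr and mse defined earlier):

```python
bs = 64                                   # mini-batch size
for t in range(100):
    idx = torch.randperm(n)[:bs]          # pick 64 random row indexes
    y_hat = x[idx] @ a                    # predictions on just this mini-batch
    loss = mse(y_hat, y[idx])             # loss on just this mini-batch
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)               # exactly the same update as before
        a.grad.zero_()
```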
We're going to learn, today or next week, what the mathematical function of something like ResNet34 actually is, but it's basically pretty much what you've just seen: a bunch of matrix products. Parameters — also known as coefficients, also known as weights — are the numbers that you're updating, and the loss function is the thing telling you how far away, or how close, you are to the correct answer. Any questions?

So these models, these teddy bear classifiers, are functions that take pixel values and return probabilities. They start with some functional form, like y = x@a, and they fit the parameters using SGD to try to make the best predictions they can. So far we've learned how to do regression, which is predicting a single number; we'll learn how to do the same thing for classification, where we have multiple numbers, but it's basically the same. In the process we had to do some math: some linear algebra and some calculus, and a lot of people get a bit scared at that point and tell us, "I am not a math person." If that is you, that's totally okay: you are a math person. In fact, it turns out from the actual academic research that there are no math people and non-math people — it turns out to be entirely a result of culture and expectation. So you should check out Rachel's talk, "There's no such thing as not a math person," where she will introduce you to some of that academic research. If you think of yourself as not a math person, you should watch it, so that you learn that you're wrong: those thoughts are there because somebody told you you're not a math person, and there's actually no academic research to suggest such a thing exists. In fact, there are some cultures, like Romania and China, where the not-a-math-person concept never even appeared — it's almost unheard of in some cultures for somebody to say "I'm not a math person," because that idea never entered the cultural identity. So don't freak out if words like derivative, gradient and matrix product are things you're kind of scared of: it's something you can learn, something you'll be okay with.

Okay, so the last thing that we're going to close with today — oh, I just got a message from Simon Willison; Simon's telling me he's actually not that special, lots of people won medals. That's the worst part about Simon: not only is he really smart, he's also really modest, which I think is just awful. I mean, if you're going to be that smart, at least be a horrible human being to make it okay. So the last thing I want to close with is the idea — and we're going to look at this more next week — of underfitting and overfitting. We just fit a line to our data, but imagine that our data wasn't actually line-shaped. If we tried to fit something which was constant plus constant times x — i.e. a line — to it,
then it's never going to fit very well: no matter how much we change these two coefficients, it's never going to get really close. On the other hand, we could fit some much bigger equation — in this case a higher-degree polynomial, with lots of wiggly bits like so — but if we did that, it's very unlikely that when we go and look at some other place, to find out the temperature and how much ice cream they're selling, we'll get a good result, because the wiggles are far too wiggly. This is called overfitting. We're looking for some mathematical function that fits just right — to stay with the bear analogy. You might think, if you have a statistics background, that the way to make things fit just right is to have exactly the right number of parameters, to use a mathematical function that doesn't have too many parameters in it. It turns out that's actually completely not the right way to think about it. There are other ways to make sure that we don't overfit, and in general this is called regularization: regularization is all the techniques to make sure that when we train our model, it's going to work well not only on the data it has seen but on the data it hasn't seen yet. The most important thing to know when you've trained a model is how well it works on data it hasn't been trained with, and, as we're going to learn a lot more about next week, that's why we have this thing called a validation set.

What happens with a validation set is that we do our mini-batch SGD training loop with one set of data — one set of teddy bears, grizzlies and black bears — and then, when we're done, we check the loss function and the accuracy to see how good it is on a bunch of images that were not included in the training. If we do that, then if we have something that's too wiggly, it will tell us: your loss function and your error are really bad, because on the bears it hasn't been trained with, the wiggly bits are in the wrong spots. And if it was underfitting, the validation set would also tell us it's really bad. So even for people who don't go through this course and don't learn the details of deep learning — if you've got managers or colleagues at work who want to learn about AI — the one thing you really need to teach them is the idea of a validation set, because that's the thing they can use to figure out whether somebody is selling them snake oil. They hold back some data, and then, when they get told "here's a model that we're going to roll out," they can say, "okay, fine, I'm just going to check it on this held-out data and see whether it generalizes." There are a lot of details to get right when you design your validation set; we'll talk about them briefly next week, but a fuller version is in Rachel's piece on the fast.ai blog called "How (and why) to create a good validation set," and it's also one of the things we go into in a lot of detail in the intro to machine learning course. So we're going to try to give you enough to get by for this course, but it's certainly something that's worth deeper study as well. Any questions or comments before we wrap up? Okay, good. Alright, well, thanks everybody — I hope you have a great time building your web applications, and see you next week.