So, hello everybody and welcome back to Practical Deep Learning for Coders. This is lesson 2. In the last lesson, we started training our first models. We didn't really have any idea how that training was working, but we were looking at a high level at what was going on. We learned about what machine learning is and how it works, and we realized that, based on how machine learning works, there are some fundamental limitations on what it can do. We talked about some of those limitations. And we also talked about how, after you've trained a machine learning model, you end up with a program which behaves much like a normal program: something with inputs, a thing in the middle, and outputs. So today we're going to finish up talking about that, and then we're going to look at how we get those models into production and what some of the issues with doing that might be. I wanted to remind you that there are two sets of notebooks available to you. One is the fastbook repo, the full notebooks containing all the text of the O'Reilly book, which lets you see everything that I'm telling you in much more detail. As well as that, there's the course-v4 repo, which contains exactly the same notebooks but with all the prose stripped away to help you study. That's where you really want to be doing your experimenting and your practice. So maybe as you listen to the video, you can switch back and forth between the video and reading, or do one and then the other, and then put it away, have a look at the course-v4 notebooks, and try to remember: okay, what was this section about? Run the code, see what happens, change it, and so forth. So we were looking at this line of code where we created our data by passing in information and, perhaps most importantly, some way to label the data. And we talked about the importance of labeling.
And in this case, for this particular data set, you can tell whether it's a cat or a dog by whether the first letter of the file name is uppercase or lowercase. That's just how this data set works; they tell you that in the readme. We also looked particularly at this idea of valid_pct=0.2 and what that means: it creates a validation set. That was something I wanted to talk more about. The first thing I want to do, though, is point out that this particular labeling function returns something that's either True or False. Actually, this data set, as we'll see later, also contains the actual breed, for 37 different cat and dog breeds, so you can also grab that from the file name. In each of those two cases, we're trying to predict a category. Is it a cat or is it a dog? Or is it a German shepherd or a beagle or a ragdoll cat or whatever? When you're trying to predict a category, so when the label is a category, we call that a classification model. On the other hand, you might try to predict how old the animal is, or how tall it is, or something like that, which is a continuous number that could be 13.2 or 26.5 or whatever. Any time you're trying to predict a number, so your label is a number, you call that regression. So those are the two main types of model: classification and regression. This is very important jargon to know about. A regression model attempts to predict one or more numeric quantities, such as temperature or location or whatever. This is a bit confusing, because sometimes people use the word regression as an abbreviation for a particular kind of model called linear regression. That's super confusing, because that's not what regression means; linear regression is just a particular kind of regression.
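The labeling rule described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function names `is_cat` and `breed`, the example file names, and the `<breed>_<number>.jpg` naming pattern are assumed here for the sake of the example, and the readme of the data set is the place to confirm the real convention.

```python
import re

def is_cat(filename: str) -> bool:
    """Classification label with two categories: True means cat, False means dog.

    Rule taken from the lesson: an uppercase first letter means a cat.
    """
    return filename[0].isupper()

def breed(filename: str) -> str:
    """Classification label with many categories: the breed.

    Assumes (for illustration) names shaped like '<breed>_<number>.jpg'.
    """
    match = re.match(r"(.+)_\d+\.jpg$", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return match.group(1)

# Hypothetical file names, made up for this example.
examples = ["Ragdoll_12.jpg", "beagle_7.jpg", "german_shorthaired_3.jpg"]
labels = [is_cat(name) for name in examples]
breeds = [breed(name) for name in examples]

for name, label, kind in zip(examples, labels, breeds):
    print(f"{name}: is_cat={label}, breed={kind}")
```

Both functions return a category, so both describe classification problems. A label function that returned the animal's age as a number would describe a regression problem instead.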
But I just wanted to warn you about that: when you start talking about regression, a lot of people will assume you're talking about linear regression, even though that's not what the word means. All right, so I wanted to talk about this valid_pct=0.2 thing. As we described, valid_pct grabs, in this case, 20% of the data (that's the 0.2) and puts it aside in a separate bucket. Then when you train your model, your model doesn't get to look at that data at all. That data is only used to show you how accurate your model is. If you train for too long, and/or with not enough data, and/or with a model with too many parameters, after a while the accuracy of your model will actually get worse. This is called overfitting. So we use the validation set to ensure that we're not overfitting. The next line of code that we looked at is this one, where we created something called a learner. We'll be learning a lot more about that, but a learner is basically something which contains your data and your architecture, that is, the mathematical function that you're optimizing. So a learner is the thing that tries to figure out which parameters best cause this function to match the labels in this data. We'll be talking a lot more about that. But basically, this particular function, resnet34, is the name of a particular architecture which is just very good for computer vision problems. In fact, the name really is ResNet, and the 34 tells you how many layers there are. So you can use ones with bigger numbers here to get more parameters; they will take longer to train, take more memory, and be more likely to overfit, but could also create more complex models. Right now, though, I wanted to focus on this part here, which is metrics=error_rate. This is where you list the functions that you want to be called with your validation data and printed out after each epoch.
An epoch is what we call it when you look at every single image in the data set once. So after you've looked at every image in the data set once, we print out some information about how you're doing. And the most important thing we print out is the result of calling these metrics. So error_rate is the name of a metric, and it's a function that just tells you what percent of the validation set is being incorrectly classified by your model. So a metric is a function that measures the quality of the predictions using the validation set. Error rate is one; another common metric is accuracy, which is just one minus error rate. Now, very important to remember: last week we talked about loss. Arthur Samuel had this important idea in machine learning that we need some way to figure out how well our model is doing, so that when we change the parameters we can figure out which set of parameters makes that performance measurement get better or worse. That performance measurement is called the loss. The loss is not necessarily the same as your metric. The reason why is a bit subtle, and we'll be seeing it in a lot of detail once we delve into the math in the coming lessons. But basically you need a loss function where, if you change the parameters by just a little bit up or just a little bit down, you can see if the loss gets a little bit better or a little bit worse. And it turns out that error rate and accuracy don't tell you that at all, because you might change the parameters by such a small amount that none of your dog predictions start becoming cats and none of your cat predictions start becoming dogs. So your predictions don't change and your error rate doesn't change. So loss and metric are closely related, but the metric is the thing that you care about. The loss is the thing which your computer is using as the measurement of performance to decide how to update your parameters.
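To make the difference between a metric and a loss concrete, here is a minimal sketch in plain Python. Everything in it is assumed for illustration: the one-parameter model, the made-up data, and the choice of cross-entropy as the loss (the lesson has not yet said which loss function is used). It shows that a tiny change to the parameter leaves the error rate untouched while the loss still moves.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(w: float, xs: list) -> list:
    """Toy one-parameter model: probability that each input is a 'cat'."""
    return [sigmoid(w * x) for x in xs]

def error_rate(probs: list, labels: list) -> float:
    """Metric: fraction of examples classified incorrectly at a 0.5 threshold."""
    wrong = sum((p >= 0.5) != bool(y) for p, y in zip(probs, labels))
    return wrong / len(labels)

def accuracy(probs: list, labels: list) -> float:
    """Metric: one minus the error rate."""
    return 1.0 - error_rate(probs, labels)

def cross_entropy(probs: list, labels: list) -> float:
    """Loss (assumed for this sketch): responds to any small change in the probabilities."""
    total = sum(math.log(p) if y else math.log(1.0 - p) for p, y in zip(probs, labels))
    return -total / len(labels)

# Made-up validation data: the last example is misclassified for any positive w.
xs = [-2.0, -1.0, 0.5, 1.5, -0.5]
ys = [0, 0, 1, 1, 1]

w_a, w_b = 1.0, 1.01  # nudge the single parameter by a tiny amount
err_a, err_b = error_rate(predict(w_a, xs), ys), error_rate(predict(w_b, xs), ys)
loss_a, loss_b = cross_entropy(predict(w_a, xs), ys), cross_entropy(predict(w_b, xs), ys)

print(f"error rate: {err_a} -> {err_b}")
print(f"loss:       {loss_a:.5f} -> {loss_b:.5f}")
```

The error rate gives the training algorithm no hint about which direction to move the parameter, whereas the loss does, which is why the two are kept separate.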
So we measure overfitting by looking at the metrics on the validation set. fastai always uses the validation set to print out your metrics. And overfitting is the key thing that machine learning is about. It's all about how we find a model which fits the data, not just for the data that we're training with, but for data that the training algorithm hasn't seen before. Overfitting results when our model is basically cheating. Our model can cheat by saying, oh, I've seen this exact picture before and I remember that that's a picture of a cat. So it might not have learned what cats look like in general. It just remembers that images one, four and eight are cats and two, three and five are dogs, and learns nothing about what they really look like. So that's the kind of cheating that we're trying to avoid. We don't want it to memorize our particular data set. So we split off our validation data. And most of these words you're seeing on the screen are from the book; I just copied and pasted them. So if we split off our validation data and make sure that our model never sees it during training, it's completely untainted by it, so we can't possibly cheat. Not quite true. We can cheat. The way we could cheat is we could fit a model, look at the result on the validation set, change something a little bit, fit another model, look at the validation set, change something a little bit. We could do that a hundred times until we find something where the validation set looks the best. But now we might have fit to the validation set. So if you want to be really rigorous about this, you should actually set aside a third bit of data, called the test set, that is not used for training and is not used for your metrics. You don't look at it until the whole project's finished. And this is what's used on competition platforms like Kaggle.
On Kaggle, after the competition finishes, your performance will be measured against a data set that you have never seen. That's a really helpful approach, and it's actually a great idea to do that even if you're not doing the modeling yourself. So if you're looking at vendors and you're trying to decide, should I go with IBM or Google or Microsoft, and they're all showing you how great their models are, what you should do is say: okay, you go and build your models, and I am going to hang on to 10% of my data, and I'm not going to let you see it at all. When you're all finished, come back and I'll run your model on the 10% of data you've never seen. Now, pulling out your validation and test sets is a bit subtle, though. Here's an example of a simple little data set. This comes from a fantastic blog post that Rachel wrote, which we will link to, about creating effective validation sets. You can see, basically, you have some kind of seasonal data set here. Now, if you just say, okay, fastai, I want to model that, I want to create my data loader using a valid_pct of 0.2, it would do this: it would randomly remove some of the dots. Now, this isn't very helpful, because we can still cheat: these dots are right in the middle of other dots. And this isn't what would happen in practice. This is sales by date, so what would happen in practice is we would want to predict the sales for next week, not the sales for 14 days ago, 18 days ago and 29 days ago. So what you actually need to do to create an effective validation set here is not do it randomly, but instead chop off the end. This is what happens in pretty much all Kaggle competitions that involve time: the thing that you have to predict is, for instance, the next two weeks or so after the last data point that they give you. And this is what you should do for your test set as well.
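The two ways of splitting described above can be sketched in plain Python. This is a minimal illustration, not fastai's implementation: the function names, the seed, and the 30 made-up day numbers are all assumptions.

```python
import random

def random_split(items: list, valid_pct: float = 0.2, seed: int = 42):
    """Hold out a random valid_pct of the items (fine when order carries no meaning)."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    valid_idx = set(indices[: int(len(items) * valid_pct)])
    train = [x for i, x in enumerate(items) if i not in valid_idx]
    valid = [x for i, x in enumerate(items) if i in valid_idx]
    return train, valid

def time_split(items: list, valid_pct: float = 0.2):
    """Hold out the last valid_pct of the items; assumes items are sorted by date."""
    cut = len(items) - int(len(items) * valid_pct)
    return items[:cut], items[cut:]

# Hypothetical data: day numbers 1..30 standing in for dated sales records.
days = list(range(1, 31))

r_train, r_valid = random_split(days)
t_train, t_valid = time_split(days)

print("random validation days:", r_valid)  # scattered in between training days
print("time-based validation days:", t_valid)  # strictly after every training day
```

With the random split, each validation day has training days on either side of it, so the model can interpolate. With the time-based split, every validation day lies after all of the training days, which matches the real task of predicting next week's sales.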
So again, if you've got vendors that you're looking at, you should say to them: okay, after you're all done modeling, we're going to check your model against data that is one week later than anything you've ever seen before, and you won't be able to retrain or anything, because that's what happens in practice. Okay. There's a question. I've heard people describe overfitting as training error being below validation error. Does this rule of thumb end up being roughly the same as yours? Okay, so that's a great question. I think what they mean there is training loss versus validation loss, because we don't print training error. We do print, at the end of each epoch, the value of your loss function for the training set and the value of the loss function for the validation set. If it's training nicely, your training loss will go down and your validation loss will go down, because by definition the loss function is defined such that a lower loss is a better model. If you start overfitting, your training loss will keep going down, because, why wouldn't it? You're getting better and better parameters. But your validation loss will start to go up, because you've started fitting to the specific data points in the training set, so it's not going to get better for the validation set. It'll start to get worse. However, that does not necessarily mean that you're overfitting, or at least not overfitting in a bad way. As we'll see, it's actually possible to be at a point where the validation loss is getting worse but the validation accuracy, or error, or whatever your metric is, is still improving. I'm not going to describe how that would happen mathematically yet, because we need to learn more about loss functions, but we will. For now, just realize that the important thing to look at is your metric getting worse, not your loss function getting worse.
Thank you for that fantastic question. The next important thing we need to learn about is called transfer learning. The next line of code said learn.fine_tune. Why does it say learn.fine_tune? Fine-tuning is what we do when we are transfer learning. Transfer learning is using a pre-trained model for a task that is different to what it was originally trained for. So, more jargon to understand our jargon. Let's look at that. What's a pre-trained model? Remember, I told you the architecture we're using is called ResNet34. When we take that ResNet34, that's just a mathematical function with lots of parameters that we're going to fit using machine learning. There's a big data set called ImageNet that contains 1.3 million pictures of a thousand different types of thing, whether it be mushrooms or animals or airplanes or hammers or whatever. There used to be a competition that ran every year to see who could get the best accuracy on ImageNet. And for the models that did really well, people would take those specific values of those parameters and make them available on the internet for anybody to download. So if you download that, you don't just have an architecture now, you have a trained model. You have a model that can recognize a thousand categories of thing in images, which probably isn't very useful unless you happen to want something that recognizes those exact thousand categories of thing. But it turns out you can start with those weights in your model and then train some more epochs on your data, and you'll end up with a far, far more accurate model than you would if you didn't start with that pre-trained model. We'll see why in just a moment. But this idea of transfer learning makes intuitive sense, right? ImageNet already has some cats and some dogs in it.
It can say this is a cat and this is a dog, but maybe you want to do something that recognizes lots of breeds that aren't in ImageNet. Well, for it to be able to recognize cats versus dogs versus airplanes versus hammers, it has to understand things like: what does metal look like? What does fur look like? What do ears look like? So it can say, oh, this breed of dog has pointy ears, and oh, this thing is metal, so it can't be a dog. All these kinds of concepts get implicitly learned by a pre-trained model. So if you start with a pre-trained model, you don't have to learn all these features from scratch. Transfer learning is the single most important thing for being able to use less data and less compute and get better accuracy. So that's a key focus for the fastai library and a key focus for this course. There's a question. I am a bit confused on the differences between loss, error, and metric. Sure. So error is just one kind of metric. There are lots of different possible labels you could have. Let's say you're trying to create a model which could predict how old a cat or dog is. The metric you might use is: on average, how many years were you off by? So that would be a metric. On the other hand, if you're trying to predict whether this is a cat or a dog, your metric could be: what percentage of the time am I wrong? That latter metric is called the error rate. So error is one particular metric. It's a thing that measures how well you're doing, and it should be the thing that you most care about. So you write a function, or use one of fastai's predefined ones, which measures how well you're doing. Loss is the thing that we talked about in lesson one. I'll give a quick summary, but go back to lesson one if you don't remember. Arthur Samuel talked about how a machine learning model needs some measure of performance which we can look at when we adjust our parameters up or down.
Does that measure of performance get better or worse? And as I mentioned earlier, some metrics possibly won't change at all if you move the parameters up and down just a little bit. So they can't be used for this purpose of adjusting the parameters to find a better measure of performance. So quite often we need to use a different function. We call this the loss function. The loss function is the measure of performance that the algorithm uses to try to make the parameters better. It's something which should track pretty closely the metric you care about, but it's something where, as you change the parameters a bit, the loss should always change a bit. There's a lot of hand-waving there, because we need to look at some of the math of how that works, and we'll be doing that in the next couple of lessons. Thanks for the great questions. Okay, so fine-tuning is a transfer learning technique where the weights (this is not quite the right word, we should say the parameters) of a pre-trained model are updated by training for additional epochs using a different task to that used for pre-training. So the pre-training task might have been ImageNet classification, and then our different task might be recognizing cats versus dogs. The way fastai does fine-tuning by default is that we use one epoch, which, remember, is looking at every image in the data set once, to fit just those parts of the model necessary to get the particular part of the model that's specially for your data set working. Then we use as many epochs as you ask for to fit the whole model. This is more for those people who might be a bit more advanced; we'll see exactly how this works later on in the lessons. So why does transfer learning work, and why does it work so well?
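The two-phase schedule described above (one epoch fitting only the new part of the model, then the requested number of epochs fitting everything) can be illustrated with a toy model in plain Python. This is a sketch under heavy assumptions and is not how fastai implements `fine_tune`: the two-number "model", the names `body` and `head`, the data, and the learning rate are all made up for illustration.

```python
def predict(x: float, body: float, head: float) -> float:
    """Toy model: 'body' stands in for pre-trained layers, 'head' for the new task-specific part."""
    return head * (body * x)

def mean_loss(data: list, body: float, head: float) -> float:
    return sum((predict(x, body, head) - y) ** 2 for x, y in data) / len(data)

def train_epoch(data: list, body: float, head: float, lr: float, train_body: bool):
    """One pass over every example; the body is only updated when train_body is True."""
    for x, y in data:
        err = predict(x, body, head) - y
        grad_head = 2 * err * body * x
        grad_body = 2 * err * head * x
        head -= lr * grad_head
        if train_body:
            body -= lr * grad_body
    return body, head

def fine_tune(data: list, body: float, head: float, epochs: int, lr: float = 0.01):
    # Phase 1: one epoch with the pre-trained body frozen, so only the new head learns.
    body, head = train_epoch(data, body, head, lr, train_body=False)
    after_frozen = (body, head)
    # Phase 2: the requested number of epochs with every parameter trainable.
    for _ in range(epochs):
        body, head = train_epoch(data, body, head, lr, train_body=True)
    return after_frozen, (body, head)

# Made-up task: the target is y = 6 * x.
data = [(0.5, 3.0), (1.0, 6.0), (1.5, 9.0), (2.0, 12.0)]
pretrained_body, new_head = 2.0, 0.1  # hypothetical starting values

after_frozen, after_full = fine_tune(data, pretrained_body, new_head, epochs=3)
print("start loss:       ", mean_loss(data, pretrained_body, new_head))
print("after frozen pass:", mean_loss(data, *after_frozen))
print("after full passes:", mean_loss(data, *after_full))
```

The design point is that the freshly added part starts out far from useful, so it is brought into a reasonable range first, before the pre-trained parameters are allowed to move.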
The best way, in my opinion, to look at this is to see this paper by Zeiler and Fergus, who were actually 2013 ImageNet winners. And interestingly, their key insights came from their ability to visualize what's going on inside a model. So visualization very often turns out to be super important to getting great results. Remember I told you a ResNet34 has 34 layers. They looked at something called AlexNet, which was the previous winner of the competition, which only had seven layers. At the time that was considered huge. So they took the seven-layer model and they said, what does the first layer of parameters look like? And they figured out how to draw a picture of them. The first layer had lots and lots of features, but here are nine of them: one, two, three, four, five, six, seven, eight, nine. And here's what nine of those features look like. One of them was something that could recognize diagonal lines from top left to bottom right. One of them could find diagonal lines from bottom left to top right. One of them could find gradients that went from orange at the top to blue at the bottom. One of them was specifically for finding things that were green, and so forth. Each of these nine is called a filter or a feature. Then something really interesting they did was to look at each one of these filters, each one of these features. We'll learn mathematically what these actually mean in the coming lessons, but for now let's just recognize them as saying, oh, there's something that looks at diagonal lines and something that looks at gradients. And they found, in the actual images in ImageNet, specific examples of parts of photos that match that filter. So for this top-left filter, here are nine actual patches of real photos that match that filter. And as you can see, they're all diagonal lines.
And so here's for the green one, here's parts of actual photos that match the green one. So layer one is super, super simple. And one of the interesting things to note here is that something that can recognize gradients and patches of color and lines is likely to be useful for lots of other tasks as well, not just ImageNet. So you can kind of see how something that can do this might also be good at many, many other computer vision tasks as well. This is layer two. Layer two takes the features of layer one and combines them. So it can not just find edges, but can find corners or repeating curving patterns or semicircles or full circles. And so you can see, for example, here's a, it's kind of hard to exactly visualize these layers after layer one. You kind of have to show examples of what the filters look like. But here you can see examples of parts of photos that these, this layer two circular filter has activated on. And as you can see, it's found things with circles. So interestingly, this one, which is this kind of blotchy gradient seems to be very good at finding sunsets. And this repeating vertical pattern is very good at finding like curtains and wheat fields and stuff. So the further we get, layer three then gets to combine all the kinds of features in layer two. And remember, we're only seeing, so we're only seeing here 12 of the features, but actually there's probably hundreds of them. I don't remember exactly in AlexNet, but there's lots. But by the time we get to layer three, by combining features from layer two, it already has something which is finding text. So this is a feature which can find bits of image that contain text. It's already got something which can find repeating geometric patterns. And you see this is not just like a matching specific pixel patterns. This is like a semantic concept. It can find repeating circles or repeating squares or repeating hexagons. So it's really like computing. It's not just matching a template. 
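To give a feel for what a layer-one filter that "recognizes diagonal lines" is doing, here is a minimal sketch in plain Python. The 3x3 filter values are hand-picked for illustration and are not the learned AlexNet parameters; the function names and the tiny 5x5 patches are also assumptions. The lesson covers the actual mathematics later.

```python
def correlate2d(image: list, kernel: list) -> list:
    """Slide the kernel over the image and sum the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_response(image: list, kernel: list) -> int:
    """Strongest activation of the filter anywhere in the image."""
    return max(max(row) for row in correlate2d(image, kernel))

# Hand-picked filter that responds to lines running from top left to bottom right.
diag_filter = [
    [2, -1, -1],
    [-1, 2, -1],
    [-1, -1, 2],
]

size = 5
# Patch with a line from top left to bottom right.
diag_patch = [[1 if r == c else 0 for c in range(size)] for r in range(size)]
# Patch with a line from bottom left to top right.
anti_patch = [[1 if r + c == size - 1 else 0 for c in range(size)] for r in range(size)]

diag_score = max_response(diag_patch, diag_filter)
anti_score = max_response(anti_patch, diag_filter)
print("matching diagonal:", diag_score)
print("opposite diagonal:", anti_score)
```

The filter responds much more strongly to the patch whose line runs in its preferred direction. A second layer can then combine several such responses, which is how corners and curves come out of edges.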
And remember, we know that neural networks can solve any possible computable function, so it can certainly do that. Layer four gets to combine all the filters from layer three any way it wants. And so by layer four, we have something that can find dog faces, for instance. So you can see how, with each layer, we get multiplicatively more sophisticated features. That's why these deep neural networks can be so incredibly powerful. It's also why transfer learning can work so well. If we wanted something that can find books, and I don't think there's a book category in ImageNet, well, it's actually already got something that can find text as an earlier filter, which I guess it must be using because maybe there's a category for library or bookshelf or something. So when you use transfer learning, you can take advantage of all of these pre-learned features to find things that are just combinations of these existing features. That's why transfer learning can be done so much more quickly and with so much less data than traditional approaches. One important thing to realize, then, is that these techniques for computer vision are not just good at recognizing photos. There are all kinds of things you can turn into pictures. For example, these are sounds that have been turned into pictures by representing their frequencies over time. And it turns out that if you convert a sound into these kinds of pictures, you can get basically state-of-the-art results at sound detection just by using the exact same ResNet learner that we've already seen. I wanted to highlight that it's 9:45, so if you want to take a break soon. A really cool example, from I think our very first year of running fast.ai: one of our students, who worked at Splunk in anti-fraud, created pictures of users moving their mouse.
If I remember correctly, as they moved their mouse, he basically drew a picture of where the mouse moved, and the color depended on how fast they moved, and these circular blobs are where they clicked the left or the right mouse button. What he did, as a project for the course, was to see whether he could use these pictures, with exactly the same approach we saw in lesson one, to create an anti-fraud model. And it worked so well that Splunk ended up patenting a new product based on this technique, and you can actually check it out. There's a blog post about it on the internet where they describe this breakthrough anti-fraud approach, which literally came from one of our really amazing and brilliant and creative students after lesson one of the course. Another cool example of this is looking at different viruses and, again, turning them into pictures. This is from a paper; check out the book for the citation. They've got three examples of a particular virus called VB.AT and another set of examples of a particular virus called Fakerean, and you can see that in each case the pictures all look kind of similar. That's why, again, they can get state-of-the-art results in virus detection by turning the program signatures into pictures and putting them through image recognition. In the book you'll find a list of all of the most important terms we've seen so far and what they mean. I'm not going to read through them, but I want you to, please, because these are the terms that we're going to be using from now on, and you've got to know what they mean. If you don't, you're going to be really confused, because I'll be talking about labels and architectures and models and parameters, and they have very specific, exact meanings, and I'll be using those exact meanings. So please review this. So to remind you, this is where we got to.
We ended up with Arthur Samuel's overall approach, and we replaced his terms with our terms. So we have an architecture which takes the parameters and the data as inputs; the architecture plus the parameters is the model, and with the inputs it's used to calculate predictions. Those are compared to the labels with the loss function, and that loss is used to update the parameters many, many times to make them better and better, until the loss gets nice and low. So this is the end of chapter one of the book. It's really important to look at the questionnaire, because the questionnaire is the thing where you can check whether you have taken away from this chapter the stuff that we hope you have. So go through it, and for anything that you're not sure about, the answer is in the text. Just go back to earlier in the chapter and you will find the answers. There's also a further research section after each questionnaire. For the first couple of chapters they're actually pretty simple; hopefully they're pretty fun and interesting. These are things where, to answer the question, it's not enough to just look in the chapter. You actually have to go and do your own thinking and experimenting and googling and so forth. In later chapters, some of these further research things are pretty significant projects that might take a few days or even weeks. So check them out, because hopefully they'll be a great way to expand your understanding of the material. Something that Sylvain points out in the book is that if you really want to make the most of this, then after each chapter please take the time to experiment with your own project and with the notebooks we provide, and then see if you can redo the notebooks on a new data set. Perhaps for chapter one that might be a bit hard, because we haven't really shown how to change things, but for chapter two, which we're going to start next, you'll absolutely be able to do that.
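The loop summarized above (architecture plus parameters produce predictions from inputs, the loss compares predictions to labels, and the loss drives parameter updates) can be sketched in plain Python. The straight-line architecture, the mean squared error loss, the gradient-descent update, and the data are all assumptions chosen to keep the example tiny; the course's real models are neural networks.

```python
def architecture(x: float, params: tuple) -> float:
    """The functional form; here a straight line with two parameters."""
    w, b = params
    return w * x + b

def loss_fn(preds: list, labels: list) -> float:
    """Mean squared error between predictions and labels (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

def fit(inputs: list, labels: list, params: tuple, lr: float = 0.05, steps: int = 500) -> tuple:
    """Repeat: predict, measure the loss gradient, nudge the parameters downhill."""
    w, b = params
    n = len(inputs)
    for _ in range(steps):
        preds = [architecture(x, (w, b)) for x in inputs]
        # Gradients of the mean squared error with respect to each parameter.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, labels, inputs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, labels)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up data generated by y = 2x + 1.
inputs = [0.0, 1.0, 2.0, 3.0]
labels = [1.0, 3.0, 5.0, 7.0]

start = (0.0, 0.0)
trained = fit(inputs, labels, start)

start_loss = loss_fn([architecture(x, start) for x in inputs], labels)
final_loss = loss_fn([architecture(x, trained) for x in inputs], labels)
print("parameters:", trained)
print("loss:", start_loss, "->", final_loss)
```

Once `fit` has finished, `architecture` with the trained parameters behaves like an ordinary program: inputs go in and predictions come out.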
Okay, so let's take a five-minute break, and we'll come back at 9:55 San Francisco time. Okay, so welcome back everybody. I think we've got a couple of questions to start with, so Rachel, please take it away. Sure. First: are filters independent? By that I mean, if filters are pre-trained, might they become less good at detecting features of previous images when fine-tuned? Oh, that is a great question. So, assuming I understand the question correctly, if you start with, say, an ImageNet model, and then you fine-tune it on dogs versus cats for a few epochs and you get something that's very good at recognizing dogs versus cats, it's going to be much less good as an ImageNet model after that. So it's not going to be very good at recognizing airplanes or hammers or whatever. This is called catastrophic forgetting in the literature: the idea that as you see more images about different things to what you saw earlier, you start to forget the things you saw earlier. So if you want to fine-tune something so that it's good at a new task but also continues to be good at the previous task, you need to keep putting in examples of the previous task as well. What are the differences between parameters and hyperparameters? If I am feeding an image of a dog as an input and then changing the hyperparameters of batch size in the model, what would be an example of a parameter? So the parameters are the things described in lesson one, which Arthur Samuel described as being the things which change what the model does, what the architecture does.
So we start with this infinitely flexible function, the thing called a neural network, that can do anything at all, and the way you get it to do one thing versus another thing is by changing its parameters. They are the numbers that you pass into that function. So there are two types of numbers you pass into the function: there are the numbers that represent your input, like the pixels of your dog, and there are the numbers that represent the learned parameters. In the example of something that's not a neural network, like a checkers-playing program such as Arthur Samuel might have used in the late 50s and early 60s, those parameters may have been things like: if there is an opportunity to take a piece versus an opportunity to get to the end of the board, how much more value should I give one versus the other? It's twice as important, or it's three times as important. That two versus three would be an example of a parameter. In a neural network, parameters are a much more abstract concept, so a detailed understanding of what they are will come in the next lesson or two, but it's the same basic idea. They're the numbers which change what the model does, to be something that recognizes malignant tumors, versus cats versus dogs, versus something that colorizes black-and-white pictures. Whereas the hyperparameters are the choices about what numbers you pass to the actual fitting function to decide how that fitting process happens.
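The checkers example above can be turned into a small sketch to show what "a parameter changes what the model does" means. This is purely illustrative and is not Arthur Samuel's program: the two features, the move descriptions, and the weight values are all hypothetical.

```python
def evaluate_move(captures: int, rows_advanced: int,
                  capture_weight: float, advance_weight: float) -> float:
    """Score a move; the two weights are the parameters of this tiny 'model'."""
    return capture_weight * captures + advance_weight * rows_advanced

def best_move(moves: list, capture_weight: float, advance_weight: float) -> str:
    """Pick the highest-scoring move under the given parameters."""
    best = max(
        moves,
        key=lambda m: evaluate_move(m["captures"], m["rows_advanced"],
                                    capture_weight, advance_weight),
    )
    return best["name"]

# Hypothetical position with two candidate moves (these are the inputs).
moves = [
    {"name": "take a piece", "captures": 1, "rows_advanced": 0},
    {"name": "advance toward the end of the board", "captures": 0, "rows_advanced": 1},
]

# Same function, same inputs, different parameters, different behavior.
choice_a = best_move(moves, capture_weight=2.0, advance_weight=1.0)
choice_b = best_move(moves, capture_weight=1.0, advance_weight=3.0)
print("capturing valued twice as much:", choice_a)
print("advancing valued three times as much:", choice_b)
```

The weights here play the role of parameters. A hyperparameter, by contrast, would be a setting of whatever procedure is used to learn those weights, such as how many games to play or how large a step to take on each update, and none of that appears in this sketch.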
There's a question: I'm curious about the pacing of this course. I'm concerned that all the material may not be covered. Depends what you mean by all the material. We certainly won't cover everything in the world, so yeah, we'll cover what we can in seven lessons. We're certainly not covering the whole book, if that's what you're wondering. The whole book will be covered in either two or three courses. In the past it's generally been two courses to cover about the amount of stuff in the book, but we'll see how it goes, because the book's pretty big, 500 pages. When you say two courses, you mean 14 lessons? 14 lessons, so it'll be like 14 or 21 lessons to get through the whole book. Although having said that, by the end of the first lesson hopefully there'll be enough momentum and understanding that reading the book independently will be more useful, and you'll have also gained a community of folks on the forums that you can hang out with and ask questions of and so forth. So in the second part of the course we're going to be talking about putting stuff in production, and to do that we need to understand what are the capabilities and limitations of deep learning, what are the kinds of projects that even make sense to try to put in production. And one of the key things I should mention, in the book and in this course, is that in the first two or three lessons and chapters there's a lot of stuff which is designed not just for coders but for everybody. There's lots of information about what are the practical things you need to know to make deep learning work, and one of the things you need to know is what's deep learning actually good at at the moment. So I'll summarize what the book says about this, but there are the four key areas that we have as applications in fastai: computer vision, text, tabular, and what I've called here recsys. This stands for recommendation systems, specifically a technique called collaborative filtering, which we briefly
saw last week. Sorry, another question: are there any pre-trained weights available other than the ones from ImageNet that we can use? If yes, when should we use others and when ImageNet? Oh, that's a really great question. So yes, there are a lot of pre-trained models, and one great way to find them is to look for a model zoo, which is a common name for places that have lots of different models. So here's lots of model zoos, or you can look for pre-trained models. And so there's quite a few. Unfortunately not as wide a variety as I would like; most are still on ImageNet or similar kinds of general photos. For example, for medical imaging there's hardly any. There's a lot of opportunities for people to create domain-specific pre-trained models. It's still an area that's really underdone, because not enough people are working on transfer learning. Okay, so as I was mentioning, we've got these four variations that we've talked about a bit, and deep learning is pretty good at all of those. Tabular data, like spreadsheets and database tables, is an area where deep learning is not always the best choice, but it's particularly good for things involving high cardinality variables. That means variables that have lots and lots of discrete levels, like zip code or something like that. Deep learning is really pretty great for those in particular. For text, it's pretty great at things like classification and translation. It's actually terrible for conversation, and so that's been a huge disappointment for a lot of companies. They try to create these conversation bots, but actually deep learning isn't good at providing accurate information. It's good at providing things that sound accurate and sound compelling, but we don't really have great ways yet of actually making sure it's correct. One big issue for recommendation systems, collaborative filtering, is that deep learning is focused on making predictions, which don't
necessarily actually mean creating useful recommendations. We'll see what that means in a moment. Deep learning is also good at multimodal. That means things where you've got multiple different types of data, so you might have some tabular data including a text column and an image and some collaborative filtering data, and combining that all together is something that deep learning is really good at. So for example, putting captions on photos is something which deep learning is pretty good at, although again it's not very good at being accurate, so it might say this is a picture of two birds when it's actually a picture of three birds. And then this other category: there's lots and lots of things that you can do with deep learning by being creative about the use of these kinds of other application-based approaches. For example, an approach that we developed for natural language processing called ULMFiT, that you'll be learning in the course, it turns out that it's also fantastic at doing protein analysis. If you think of the different proteins as being different words, and they're in a sequence which has some kind of state and meaning, it turns out that ULMFiT works really well for protein analysis. So often it's about being creative. So to decide, for the product that you're trying to build, is deep learning going to work well for it, you just have to try it and see. But if you do a search, hopefully you can find examples of other people that have tried something similar. Even if you can't, that doesn't mean it's not going to work. So for example, I mentioned the collaborative filtering issue, where a recommendation and a prediction are not necessarily the same thing. You can see this on Amazon, for example, quite often. So I bought a Terry Pratchett book, and then Amazon tried for months to get me to buy more Terry Pratchett books. Now that must be because their predictive model said that people who bought one particular Terry Pratchett book are likely to also buy other Terry
Pratchett books. But from the point of view of, well, is this going to change my buying behaviour? Probably not, right? If I liked that book, I already know I like that author, and I already know that they probably wrote other things, so I'll go and buy it anyway. So this would be an example of Amazon probably not being very smart here. They're actually showing me collaborative filtering predictions rather than actually figuring out how to optimize a recommendation. So an optimized recommendation would be something more like what your local human bookseller might do, where they might say: oh, you like Terry Pratchett, well, let me tell you about other comedy fantasy sci-fi writers in a similar vein who you might not have heard about before. So the difference between recommendations and predictions is super important. So I wanted to talk about a really important issue around interpreting models, and for a case study for this I thought let's pick something that's actually super important right now, which is a model in this paper. One of the things we're going to try and do in this course is learn how to read papers. So here is a paper which I would love for everybody to read, called High Temperature and High Humidity Reduce the Transmission of COVID-19. Now this is a very important issue, because if the claim of this paper is true, then that would mean that this is going to be a seasonal disease, and if this is a seasonal disease, then it's going to have massive policy implications. So let's try and find out how this was modeled and understand how to interpret this model. So this is a key picture from the paper, and what they've done here is they've taken 100 cities in China and they've plotted the temperature on one axis, in Celsius, and R on the other axis, where R is a measure of transmissibility. It says, for each person that has this disease, how many people on average will they infect. So if R is under 1 then the disease will not spread; if R is higher than like 2 it's going to
spread incredibly quickly, and basically any high R is going to create an exponential transmission impact. And you can see in this case they have plotted a best fit line through here, and then they've made a claim that there's some particular relationship in terms of a formula, that R is 1.99 minus 0.023 times temperature. So a very obvious concern I would have looking at this picture is that this might just be random. Maybe there's no relationship at all, but if you just picked 100 cities at random, perhaps they would sometimes show this level of relationship. So one simple way to see that would be to actually do it in a spreadsheet. So here is a spreadsheet. What I did was I eyeballed this data and I guessed about what is the mean degrees centigrade, I think it's about 5, and what's about the standard deviation of centigrade, I think it's probably about 5 as well. And then I did the same thing for R: I think the mean R looks like it's about 1.9 to me, and it looks like the standard deviation of R is probably about 0.5. So what I then did was I just jumped over here and I created a random normal value, so a random value from a normal distribution, so a bell curve, with that particular mean and standard deviation of temperature and that particular mean and standard deviation of R. And so this would be an example of a city that might be in this data set of 100 cities: something with 9 degrees Celsius and an R of 1.1, so something about here. And so then I just copied that formula down 100 times, so here are 100 cities that could be in China, where this is assuming that there is no relationship between temperature and R. They're just random numbers. And so each time I recalculate that, so if I hit Ctrl+= it will just recalculate it, I get different numbers, because they're random. And so you can see at the top here I've then got the average of all of the temperatures and the average of all of the
Rs, and the average of all of the temperatures varies and the average of all of the Rs varies as well. So then what I did was I copied those random numbers over here, so I'll go copy these 100 random numbers and paste them here, here, here, and so now I've got 1, 2, 3, 4, 5, 6, I've got 6 groups of 100 cities. And so let's stop those from randomly changing anymore by just fixing them in stone there. So now that I've pasted them in, I've got 6 examples of what 100 cities might look like if there was no relationship at all between temperature and R, and I've got their mean temperature and R in each of those 6 examples. And what I've done, as you can see here at least for the first one, is I've plotted it, and you can see in this case there's actually a slight positive slope. And I've actually calculated the slope for each, just by using the SLOPE function in Microsoft Excel, and you can see that actually in this particular case, and it's just random, 5 times it's been negative, and it's even more negative than their 0.023. And so it's kind of matching our intuition here that the slope of the line that we have here is something that absolutely can often happen totally by chance. It doesn't seem to be indicating any kind of real relationship at all. If we wanted to be more confident about that slope, we would need to look at more cities. So here I've got 3000, and you can see here the slope is 0.0002; it's almost exactly 0, which is what we'd expect when there's actually no relationship between C and R, and in this case they're all random. So if we look at lots and lots of randomly generated cities, then we can say, oh yeah, there's no slope, but when you only look at 100, as we did here, you're going to see relationships very, very often, right? So that's something that we need to be able to measure, and one way to measure that is we use something called a p-value. So here's how a p-value works. We start out with something called a null hypothesis, and the null hypothesis is
basically what's our starting point assumption. So our starting point assumption might be, oh, there's no relationship between temperature and R. And then we gather some data. And have you explained what R is? I have, yes: R is the transmissibility of the virus. So then we gather data of independent and dependent variables. In this case the independent variable is the thing that we think might cause a dependent variable, so here the independent variable would be temperature and the dependent variable would be R. So here we've gathered data, there's the data that was gathered in this example, and then we say: what percentage of the time would we see this amount of relationship, which is a slope of 0.023, by chance? And as we've seen, one way to do that is by what we would call a simulation, which is by generating random numbers, 100 pairs of random numbers, a bunch of times, and seeing how often you see this relationship. We don't actually have to do it that way though. There's actually a simple equation we can use to jump straight to this number, which is what percent of the time would we see that relationship by chance. And this is basically what that looks like. We have the most likely observation, which in this case would be, if there is no relationship between temperature and R, then the most likely slope would be 0. And sometimes you get positive slopes by chance, and sometimes you get pretty small slopes, and sometimes you get negative slopes by chance, and so the larger the number, the less likely it is to happen, whether it be on the positive side or the negative side. And so in our case our question was, how often are we going to get less than negative 0.023? So it would actually be somewhere down here. I actually copied this from Wikipedia, and it's just a little bit more qualitative numbers, and so they've coloured in this area above the number, so this is the p-value. And so we don't care about the math, but there's a simple little equation you can use to directly figure out this number, the
p-value from the data. So this is kind of how nearly all scientists really focus on this idea of p-values, and indeed in this particular study, as we'll see in a moment, they reported p-values. So probably a lot of you have seen p-values in your previous lives; they come up in a lot of different domains. Here's the thing: they are terrible. You almost always shouldn't be using them. Don't just trust me, trust the American Statistical Association. They point out six things about p-values, and those include: p-values do not measure the probability that the hypothesis is true, or the probability that the data were produced by random chance alone. Now we know this because we just saw that if we use more data, right, so if we sample 3000 random cities rather than 100, we get a much smaller value. So p-values don't just tell you about how big a relationship is, but they actually tell you about a combination of that and how much data you collected, right? So they don't measure the probability that the hypothesis is true. So therefore conclusions and policy decisions should not be based on whether a p-value passes some threshold. A p-value does not measure the importance of a result, right, because again it could just tell you that you collected lots of data, which doesn't tell you that the result's actually of any practical import. And so by itself it does not provide a good measure of evidence. So Frank Harrell, who is somebody whose book I read, and it's a really important part of my learning, he's a professor of biostatistics, has a number of great articles about this. He says null hypothesis testing and p-values have done significant harm to science, and he wrote another piece called Null Hypothesis Significance Testing Never Worked. So I've shown you what p-values are so that you know why they don't work, not so that you can use them. But they're a super important part of machine learning, because they come up all the time in people saying this is how we decide whether a drug worked or whether there is an
epidemiological relationship or whatever, and indeed p-values appear in this paper. So in the paper they show the results of a multiple linear regression, and they put three stars next to any relationship which has a p-value of 0.01 or less. So there is something useful to say about a small p-value like 0.01 or less, which is that the thing that we're looking at probably did not happen by chance. The biggest statistical error people make all the time is that they see that a p-value is not less than 0.05 and then they make the erroneous conclusion that no relationship exists, which doesn't make any sense, because let's say you only had three data points: then you almost certainly won't have enough data to have a p-value of less than 0.05 for any hypothesis. So the way to check is to go back and say: what if I picked the exact opposite null hypothesis? What if my null hypothesis was, there is a relationship between temperature and R? Then do I have enough data to reject that null hypothesis, right? And if the answer is no, then you just don't have enough data to make any conclusions at all, right? So in this case they do have enough data to be confident that there is a relationship between temperature and R. Now that's weird, because we just looked at the graph and we did a little back-of-the-envelope in Excel and we thought this could well be random. So here's where the issue is. The graph shows what we call a univariate relationship. A univariate relationship shows the relationship between one independent variable and one dependent variable, and that's what you can normally show in a graph. But in this case they did a multivariate model, in which they looked at temperature and humidity and GDP per capita and population density, and when you put all of those things into the model, then you end up with statistically significant results for temperature and humidity. Why does that happen? Well, the reason that happens is because all this variation in the blue dots is not random. There's a
reason they're different, right? And the reasons include: denser cities are going to have higher transmission, for instance, and probably more humid will have less transmission. So when you do a multivariate model, it actually allows you to be more confident of your results. But the p-value, as noted by the American Statistical Association, does not tell us whether this is of practical importance. The thing that tells us whether this is of practical importance is the actual slope that's found. So in this case the equation they come up with is that R equals 3.968 minus 0.038 times temperature minus 0.024 times relative humidity. So is this equation practically important? Well, we can again do a little back-of-the-envelope here by just putting that into Excel. Let's say there was one place that had a temperature of 10 centigrade and a humidity of 40; then if this equation is correct, R would be about 2.7. Somewhere with a temperature of 35 centigrade and a humidity of 80, R would be about 0.8. So is this practically important? Oh my god, yes, right? Two different cities with different climates, if they're the same in every other way and this model is correct: one city would have no spread of disease, because R is less than one, and one would have massive exponential explosion. So we can see from this model that if the modeling is correct, then this is a highly practically significant result. So this is how you determine practical significance of your models: not with p-values, but by looking at actual outcomes. So how do you think about the practical importance of a model, and how do you turn a predictive model into something useful in production? So I spent many, many years thinking about this, and I actually, with some other great folks, created a paper about it, Designing Great Data Products. And this is largely based on 10 years of work I did at a company I founded called Optimal Decisions Group, and Optimal Decisions Group was focused on the question of helping insurance
companies figure out what prices to set. And insurance companies up until that point had focused on predictive modeling. Actuaries in particular spent their time trying to figure out how likely is it that you're going to crash your car, and how much damage might you have, and then based on that try to figure out what price they should set for your policy. So for this company, what we did was we decided to use a different approach, which I ended up calling the Drivetrain Approach, which is described here, to set insurance prices and indeed to do all kinds of other things. And so for the insurance example, the objective would be: how do I maximize my, let's say, 5 year profit? And then what inputs can we control, which I call levers? So in this case it would be, what price can I set? And then data is data which can tell you, as you change your levers, how does that change your objective. So if I start increasing my price to people who are likely to crash their car, then we'll get fewer of them, which means we'll have less costs, but at the same time we'll also have less revenue coming in, for example. So to link up the levers to the objective via the data we collect, we build models that describe how the levers influence the objective. And this all seems pretty obvious when you say it like this, but when we started work with Optimal Decisions in 1999, nobody was doing this in insurance. Everybody in insurance was simply doing a predictive model to guess how likely people were to crash their car, and then pricing was set by adding 20% or whatever. It was just done in a very naive way. So what I did is, over many years, I took this basic process and tried to help lots of companies figure out how to use it to turn predictive models into actions. So the starting point in actually getting value in a predictive model is thinking about what is it you're trying to do, and what are the sources of value in that thing you're trying to do. The levers: what are the things you can change? Like,
what's the point of a predictive model if you can't do anything about it, right? Figuring out ways to find what data you have, which is suitable, what's available. Then think about what approaches to analytics you can then take. And then, super important: well, can you actually implement those changes? And super, super important: how do you actually change things as the environment changes? And interestingly, a lot of these things are areas where there's not very much academic research. There's a little bit, and some of the papers, particularly around maintenance, like how do you decide when your machine learning model is still okay, and how do you update it over time, have had many, many citations, but they don't pop up very often, because a lot of folks are so focused on the math. And then there's the whole question of what constraints are in place across this whole thing. So what you'll find in the book is there is a whole appendix which actually goes through every one of these six things and has a whole list of examples. So this is an example of how to think about value, and lots of questions that companies and organisations can use to try and think about all of these different pieces of the actual puzzle of getting stuff into production and actually into an effective product. We have a question. So do check out this appendix, because it actually originally appeared as a blog post, and I think except for my COVID-19 posts that I did with Rachel, it's actually the most popular blog post I've ever written. It's had hundreds of thousands of views, and it represents like 20 years of hard-won insights about how you actually get value from machine learning in practice and what you actually have to ask. So please check it out, because hopefully you'll find it helpful. So when we think about this for the question of how should people think about the relationship between seasonality and
transmissibility of COVID-19, you need to dig really deeply into the questions about, oh, not just what are those numbers in the data, but what does it really look like? So one of the things in the paper that they show is actual maps of temperature and humidity and R, and you can see, not surprisingly, that humidity and temperature in China are what we would call autocorrelated, which is to say that places that are close to each other, in this case geographically, have similar temperatures and similar humidities. And so this actually puts into question a lot of the p-values that they have, because you can't really think of these as 100 totally separate cities, because the ones that are close to each other probably have very close behavior. So maybe you should think of them as a small number of sets of cities, of larger geographies. So these are the kinds of things that, when you look actually into a model, you need to think about: what are the limitations? And it's like, well, what does that mean, what do I do about that? You need to think of it from this utility point of view, this end-to-end, what are the actions I can take, what are the results point of view, not just null hypothesis testing. So in this case, for example, there are basically four possible key ways this could end up: there really is a relationship between temperature and R, or, on the right-hand side, there is no real relationship between temperature and R; and we might act on the assumption that there is a relationship, or we might act on the assumption that there isn't a relationship. And so you want to look at each of these four possibilities and say, well, what would be the economic and societal consequences? And there is going to be a huge difference in lives lost and economies crashing and whatever else for each of these four. The paper actually has shown, if their model is correct, what's the likely R
value in March for every city in the world, and the likely R value in July for every city in the world. And so for example, if you look at New England and New York, and also the very coast of the west coast, the prediction here is that in July the disease will stop spreading. Now if that happens, if they are right, then that's going to be a disaster, because I think it's very likely in America and also the UK that people will say, oh, this disease is not a problem, it didn't really take off at all, the scientists were wrong. People will go back to their previous day-to-day life, and we could see what happened with the 1918 flu virus, where the second go-around, when winter hits, could be much worse than the start, right? So there's these huge potential policy impacts depending on whether this is true or false. And so to think about it, I also just wanted to say that it would be very irresponsible to think, oh, summer is going to solve it, we don't need to act now, just in that this is something growing exponentially and could do a huge, huge amount of damage. It could be a problem either way. If you assume that there will be seasonality and that summer will fix things, then it could lead you to be apathetic now. If you assume there's no seasonality and then there is, then you could end up creating a larger level of expectation of destruction than actually happens, and end up with your population being even more apathetic. So being wrong in either direction would be a problem. So one of the ways we tend to deal with this, with this kind of modeling, is we try to think about priors. So priors are basically things where, rather than just having a null hypothesis, we try and start with a guess as to what's more likely, right? So in this case, if memory serves correctly, I think we know that flu viruses become inactive at 27 centigrade, we know that the cold coronaviruses are seasonal, the 1918 flu pandemic was seasonal in every country and city
that's been studied. So far there's been quite a few studies like this, and they've always found climate relationships so far. So maybe we'd say, well, our prior belief is that this thing is probably seasonal, and so then we'd say, well, this particular paper adds some evidence to that. So it shows how incredibly complex it is to use a model in practice, in this case for policy discussions but also for organizational decisions, because there's always complexities, there's always uncertainties, and so you actually have to think about the utility and your best guesses, and try to combine everything together as best as you can. Okay, so with all that said, it's still nice to be able to get our models up and running. Even just a predictive model is sometimes useful on its own, sometimes it's useful to prototype something, and sometimes it's just going to be part of some bigger picture. So rather than try to create some huge end-to-end model here, we thought we would just show you how to get your PyTorch fastai model up and running in as raw a form as possible, so that from there you can build on top of it as you like. So to do that, we are going to download and curate our own dataset, and you're going to do the same thing. You've got to train your own model on that dataset, and then you're going to create an application, and then you're going to host it. Now there's lots of ways to curate an image dataset. You might have some photos on your own computer; there might be stuff at work you can use. One of the easiest, though, is just to download stuff off the internet. There's lots of services for downloading stuff off the internet. We're going to be using Bing image search here because they're super easy to use. A lot of the other easy-to-use things require breaking the terms of service of websites, so we're not going to show you how to do that, but there's lots of examples that do show you how to do that, so you can check them out
as well if you want to. Bing image search is actually pretty great, at least at the moment. These things change a lot, so keep an eye on our website to see if we've changed our recommendation. The biggest problem with Bing image search is that the sign-up process is a nightmare, at least at the moment. One of the hardest parts of this book is just signing up to their damn API, which requires going through Azure. It's called Cognitive Services, Azure Cognitive Services. So we'll make sure that all that information is on the website for you to follow through, just how to sign up. So we're going to start from the assumption that you've already signed up, but you can find it, just search for Bing image search API. And at the moment they give you seven days with a pretty high quota for free, and then after that you can keep using it as long as you like, but they limit it to like three transactions per second or something, which is still plenty. You can still do thousands for free, so at the moment it's pretty great even for free. So what will happen is, when you sign up for Bing image search or any of these kinds of services, they'll give you an API key, so just replace the XXX here with the API key that they give you. So that's now going to be called key. In fact, let's do it over here. Okay, so you'll put in your key, and then there's a function we've created called search_images_bing, which is just a super tiny little function, as you can see it's just two lines of code, I was just trying to save a little bit of time, which will take your API key and some search term and return a list of URLs that match that search term. As you can see, for using this particular service you have to install a particular package, so we show you how to do that on the site as well. So once you've done so, you'll be able to run this, and that will return by default, I think, 150 URLs. Okay, so fastai comes with a download_url function, so let's just download one of those images just to check and open
it up. And so what I did was I searched for grizzly bear, and here I have a grizzly bear. So then what I did was I said, okay, let's try and create a model that can recognize grizzly bears versus black bears versus teddy bears. So that way I could set up some video recognition system near our camp site when we're out camping that gives me bear warnings, but if it's a teddy bear coming then it doesn't warn me and wake me up; it doesn't scare me at all. So then I just go through each of those three bear types, create a directory with the name of grizzly or black or teddy, search Bing for that particular search term along with bear, and download. And so download_images is a fastai function as well. So after that I can call get_image_files, which is a fastai function that will just return recursively all of the image files inside this path, and you can see it's given me bears/black/ and then lots of numbers. So one of the things you have to be careful of is that a lot of the stuff you download will turn out to be not images at all, and will break. So you can call verify_images to check that all of these file names are actual images, and in this case I didn't have any failed, so it's empty, but if you did have some then you would call Path.unlink. Path.unlink is part of the Python standard library and it deletes a file, and map is something that will call this function for every element of this collection. This is part of a special fastai class called L. It's basically kind of a mix between the Python standard library list class and a NumPy array class, and we'll be learning more about it later in this course, but it basically tries to make it super easy to do more functional-style programming in Python. So in this case it's going to unlink everything that's in the failed list, which is probably what we want, because they're all the images that failed to verify. So we've now got a path that contains a whole bunch of images, and they're classified
according to black, grizzly or teddy based on what folder they're in. So to create a model, the first thing we need to do is to tell fastai what kind of data we have and how it's structured. Now in lesson one of the course we did that by using what we call a factory method, which is we just said ImageDataLoaders.from_name_func and it did it all for us. Those factory methods are fine for beginners, but now we're into lesson two, we're not quite beginners anymore, so we're going to show you the super, super flexible way to use data in whatever format you like, and it's called the data block API. The data block API looks like this. You tell fastai what your independent variable is and what your dependent variable is, so what your input data is and what your labels are. In this case our input data are images and our labels are categories, and the category is going to be either grizzly or black or teddy. So that's the first thing you tell it: that's the blocks parameter. Then you tell it how to get a list of all of the, in this case, file names, and we just saw how to do that because we just called the function ourselves. The function is called get_image_files, so we tell it what function to use to get that list of items. Then you tell it how to split the data into a training set and a validation set. We're going to use something called a random splitter, which just splits it randomly, and we're going to put 30% of it into the validation set. We're also going to set the random seed, which ensures that every time we run this, the validation set will be the same. And then you say, okay, how do you label the data? This is the name of a function called parent_label, which is going to look, for each item, at the name of its parent folder, so this particular one would become a black bear. This is the most common way for image data sets to be represented: the different files get put into folders according to their label. And then
finally here we've got something called item transforms. We'll be learning a lot more about transforms in a moment, but these are basically functions that get applied to each image, and so each image is going to be resized to a 128 by 128 square. We're going to be learning more about the data block API soon, but basically the process is: it's going to call whatever get_items is, which gives a list of image files, and then it's going to call get_x and get_y. In this case there's no get_x, but there is a get_y, which is just parent_label. Then it's going to call the create method for each of these two things: it's going to create an image and it's going to create a category. It's then going to call the item transforms, which is Resize, and then the next thing it does is it puts it into something called a data loader. A data loader is something that grabs a few images at a time, I think by default it's 64, and puts them all into a single thing called a batch. It just grabs 64 images and sticks them all together, and the reason it does that is it then puts them all onto the GPU at once, so it can pass them all to the model through the GPU in one go, and that's going to let the GPU go much faster, as we'll be learning about. And then finally, although we don't use any here, we can have something called batch transforms, which we will talk about later. Then somewhere in the middle, about here conceptually, is the splitter, which is the thing that splits into the training set and the validation set. So this is a super flexible way to tell fastai how to work with your data, and at the end of that it returns an object of type DataLoaders. That's why we always call these things dls. So DataLoaders has a validation and a training data loader, and a data loader, as I just mentioned, is something that grabs a batch of a few items at a time and puts it on the GPU for you. So this is basically the entire code of DataLoaders. The details don't matter; I just wanted to point out that a lot of these
concepts in fastai, when you actually look at what they are, they're incredibly simple little things. It's literally something that you just pass a few data loaders to, and it stores them in an attribute and gives you the first one back as .train and the second one back as .valid. So we create our DataLoaders by first of all creating the DataBlock, and then we call dataloaders on it, passing in our path, to create dls. Then you can call show_batch on that. You can call show_batch on pretty much anything in fastai to see your data, and look, we've got some grizzlies, we've got a teddy, we've got a grizzly, so you get the idea. I'm going to look at data augmentation next week, so I'm going to skip over data augmentation and let's just jump straight into training your model. Once we've got dls we can, just like in lesson one, call cnn_learner to create a ResNet. We're going to create a smaller ResNet this time, a ResNet-18, again asking for error rate. We can then call .fine_tune again, so you see it's all the same lines of code we've already seen, and you can see our error rate goes down from 9% to 1%, so we've got 1% error after training for about 25 seconds. So you can see we've only got 450 images, we've trained for well less than a minute, and we only have... let's look at the confusion matrix. So we can say, I want to create a classification interpretation class, I want to look at the confusion matrix. The confusion matrix, as you can see, is something that says, for things that are actually black bears, how many are predicted to be black bears versus grizzly bears versus teddy bears. The diagonal holds the ones that are all correct, and so it looks like we've got two errors: we've got one grizzly that was predicted to be black and one black that was predicted to be grizzly. A super, super useful method is plot_top_losses: show me what my errors actually look like. So this one here was predicted to be a grizzly bear but the label was black bear. This
one was the one that's predicted to be a black bear and the label was grizzly bear. These ones here are not actually wrong: this is predicted to be black and it's actually black. But the reason they appear in this is because these are the ones that the model was the least confident about. Okay, so we're going to look at ImageClassifierCleaner next week. Let's focus on how we then get this into production. To get it into production we need to export the model. What exporting the model does is it creates a new file, which by default is called export.pkl, which contains the architecture and all of the parameters of the model. So that is now something that you can copy over to a server somewhere and treat as a predefined program. The process of using your trained model on new data, kind of in production, is called inference. So here I've created an inference learner by loading that learner back again. Obviously it doesn't make sense to do it right after I've saved it in a notebook, but I'm just showing you how it would work; this is something that you would do on your server for inference. And remember that once you have trained a model you can just treat it as a program and pass inputs to it. So this is now our program, this is our bear predictor. I can now call predict on it and pass it an image, and it will tell me that it is 99.999% sure that this is a grizzly. So I think what we're going to do here is wrap it up, and next week we'll finish off by creating an actual GUI for our bear classifier. We will show how to run it for free on a service called Binder, and then I think we'll be ready to dive into some of the details of what's going on behind the scenes. Any questions or anything else before we wrap up, Rachel?
No? Great. Alright, thanks everybody. So hopefully, yeah, I think from here on we've covered most of the key underlying foundational stuff from a machine learning point of view that we're going to need to cover, so we'll be ready to dive into lower-level details of how deep learning works behind the scenes, and I think that'll be starting from next week. So see you then.
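
For anyone reviewing afterwards, here is a minimal sketch, in plain Python rather than fastai, of three ideas from the data block walkthrough: labeling an item by its parent folder, a random split that is reproducible because the seed is fixed, and grouping items into batches. The helper names (label_from_parent, seeded_split, batches) and the file names are made up for illustration. This is not how fastai implements these steps; it only mirrors the behavior described in the lesson, using the 30% validation split and batch size of 64 mentioned there.

```python
import random
from pathlib import Path


def label_from_parent(path):
    """Label an item by the name of the folder that contains it."""
    return Path(path).parent.name


def seeded_split(items, valid_pct=0.3, seed=42):
    """Shuffle indices with a fixed seed and hold out valid_pct for validation."""
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)  # same seed -> same shuffle every run
    cut = round(valid_pct * len(items))
    valid = [items[i] for i in idxs[:cut]]
    train = [items[i] for i in idxs[cut:]]
    return train, valid


def batches(items, batch_size=64):
    """Yield consecutive groups of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up file names standing in for the 450 downloaded bear images.
files = [f"bears/{kind}/{i:08d}.jpg"
         for kind in ("black", "grizzly", "teddy")
         for i in range(150)]

train, valid = seeded_split(files)
print(len(train), len(valid))          # 315 135
print(label_from_parent(files[0]))     # black
print(sum(1 for _ in batches(train)))  # 5 batches: four of 64 and one of 59
```

Calling seeded_split(files) a second time returns exactly the same two lists, which is the point of setting the seed: the validation set stays fixed, so results from different training runs remain comparable.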