Welcome back to Lesson 3. We're going to start with a quick correction, which is to let you know that when we referred to this chart as coming from Quora last week, we were correct. It did come from Quora, but we've since realized that originally it came from Andrew Ng's excellent machine learning course on Coursera. So apologies for the incorrect citation, but in exchange, let's talk about Andrew Ng's excellent machine learning course on Coursera. It's really great. You can see people gave it 4.9 out of 5 stars. In some ways it's a little dated, but a lot of the content really is as appropriate as ever, and it's taught in a more bottom-up style, so it can be quite nice to combine Andrew's bottom-up style and our top-down style and meet somewhere in the middle. Also, if you're interested in more machine learning foundations, you should check out our machine learning course as well. If you go to course.fast.ai and click on the machine learning button, that will take you to our course, which is about twice as long as this deep learning course and takes you much more gradually through some of the foundational stuff around validation sets and model interpretation and how PyTorch tensors work and stuff like that. So I think all these courses go well together; if you want to really dig deeply into the material, do all of them. A lot of people here have, and end up saying, oh, I got more out of each one by doing the whole lot. Or you can skip backwards and forwards and see which one works for you.
So we started talking about deploying your web app last week. One thing that's going to make life a lot easier for you is that on the course-v3 website, there's a production section where right now we have one platform, but more will be added by the time this video comes out, showing you how to deploy your web app really, really easily. And when I say easily, for example, here's the how to deploy on Zeit guide created by San Francisco study group member Navjot.
As you can see, it's just a page. There's almost nothing to do and it's free. It's not going to serve 10,000 simultaneous requests, but it'll certainly get you started. And I found it works really well. It's fast. And so deploying a model doesn't have to be slow or complicated anymore. And the nice thing is, you can kind of use this for an MVP. And if you do find you're starting to get 1,000 simultaneous requests, then you know that things are working out and you can start to upgrade your instance types or add to a more traditional big engineering approach. So if you actually use this starter kit, it will actually create my teddy bear finder for you. And this is an example of my teddy bear finder. So the idea is it's as simple as possible this template. So you can fill in your own style sheets, your own custom logic, and so forth. This is kind of designed to be a minimal thing so you can see exactly what's going on. The back end is a simple kind of REST style interface that sends back JSON, and the front end is a super simple little JavaScript thing. So yeah, it should be a good way to get a sense of how to build a web app which talks to a PyTorch model. So examples of web apps people have built during the week. Edward Ross built the what car is that app, or more specifically the what Australian car is that. I thought it was kind of interesting that Edward said on the forum that the building of the app was actually a great experience in terms of understanding how the model works himself better. And it's interesting that he's describing like trying it out on his phone. A lot of people think like, oh, if I want something on my phone, I have to create some kind of mobile TensorFlow, ONNX, whatever tricky mobile app. You really don't. You can run it all in the cloud and make it just a web app or use some kind of simple little GUI front end that talks to a REST back end. It's not that often that you'll need to actually run stuff on the phone. 
So this is a good example of that. Christian Werner has created a guitar classifier. You can decide whether your food is healthy or not. Apparently this one is healthy. That can't be right. I would have thought a hamburger is more what we're looking for, but there you go. Apparently Trinidad and Tobago is the home of the hummingbird, so if you're visiting, you can find out what kind of hummingbird you're looking at. You can decide whether or not to eat a mushroom. If you happen to be one of the cousins of Charlie Harrington, you can now figure out who is who. I believe this was actually designed for his fiancée. It will even tell you about the interests of this particular cousin. So a fairly niche application, but apparently there are 36 people who will appreciate this at least. I have no cousins. That's a lot of cousins. This is an example of an app which actually takes a video feed and turns it into an emotion classifier. That's pretty cool. I like it. Team 26, good job. Here's a similar one for American Sign Language. So it's not a big step from taking a single image model to taking a video model. You can just grab the occasional frame, put it through your model, and update the UI as the model results come in. So it's really cool that you can do this kind of stuff either in client or in browser nowadays.
Henri Palacci has built yourcityfrom.space, which he describes as creepy how accurate it is. So here is where I live, which it figured out was in the United States. It's interesting that he describes here how he actually had to be very thoughtful about the validation set he built, to make sure that the satellite tiles were not overlapping or close to each other. In doing so, he realized he had to download more data. But once he did, he got this amazingly effective model that can look at satellite imagery and figure out what country it's from.
I thought this one was pretty interesting, which was doing univariate time series analysis by converting the series into a picture using something I'd never heard of, a Gramian angular field. He says he's getting close to state-of-the-art results for univariate time series modeling by turning it into a picture. And so I like this idea of turning stuff that's not a picture into a picture.
Something really interesting about this next project, which was looking at emotion classification from faces, was that he was specifically asking the question: how well does it go without changing anything, just using the default settings? I think that's a really interesting experiment, because we're all told it's really hard to train models and it takes a lot of specific knowledge. And actually we're finding that that's often not the case. He looked at this facial expression recognition data set. There was a 2017 paper that he compared his results to, and he got equal or slightly better results than the state-of-the-art paper on facial emotion recognition without doing any custom hyperparameter tuning at all. So that was really cool.
And then Alena Harley, who I featured one of her works last week, has done another really cool work in the genomics space, which is looking at variant analysis, looking at false positives in these kinds of pictures. And she found she was able to decrease the number of false positives coming out of the kind of industry standard software she was using by 500% by using a deep learning workflow. I think this is a nice example of something where, if you're spending hours every day looking at something, in this case looking to get rid of the false positives, maybe you can make that a lot faster by using deep learning to do a lot of the work for you. And again, this is an example of a computer vision based approach on something which initially wasn't actually images. So that's a really cool application.
So it's really nice to see what people have been building, in terms of both web apps and just classifiers. What we're going to do today is look at a whole lot more different types of model that you can build. We're going to zip through them pretty quickly, and then we're going to go back and say, how did all these things work? What's the common denominator? You can create web apps from all of these as well, but you'll have to think about how to slightly change that template to make it work with these different applications. I think that'll be a really good exercise in making sure you understand the material.
So the first one we're going to look at is a data set of satellite images. Satellite imaging is a really fertile area for deep learning. There are certainly a lot of people already using deep learning in satellite imaging, but they're only scratching the surface. And the data set that we're going to look at looks like this. It has satellite tiles, and as you can see, there are a number of different labels for each tile. One of the labels always represents the weather that's shown, so in this case, cloudy or partly cloudy. And then all of the other labels tell you any interesting features that are seen in there. So primary means primary rainforest, agriculture means there's some farming, road means road, and so forth. So as I'm sure you can tell, this is a little different to all of the classifiers we've seen so far, because there's not just one label. There are potentially multiple labels. Multi-label classification can be done in a very similar way, but the first thing we're going to need to do is to download the data. Now this data comes from Kaggle. Kaggle is mainly known for being a competitions website. And it's really great to download data from Kaggle when you're learning, because you can see how you would have gone in that competition. It's a good way to see whether you kind of know what you're doing.
I tend to think the goal is to try and get in the top 10%. In my experience, all the people in the top 10% of a competition really know what they're doing. So if you can get in the top 10%, then that's a really good sign. Pretty much every Kaggle data set is not available for download outside of Kaggle, at least the competition data sets. So you have to download it through Kaggle. And the good news is that Kaggle provides a Python-based downloader tool which you can use. So we've got a quick description here of how to download stuff from Kaggle. To download stuff from Kaggle, you first have to install the Kaggle download tool, so just pip install kaggle. And you can see that what we tend to do when there are one-off things to do is show you the commented-out version in the notebook, and you can just remove the comment. So here's a cool tip for you. If you select a few lines and then hit Ctrl-/, it uncomments them all. And then when you're done, select them again, hit Ctrl-/ again, and it re-comments them all, okay? So if you run this line, it'll install kaggle for you. Depending on your platform, you may need sudo, you may need some other path to pip, you may need source activate. So have a look at the setup instructions, or actually the returning to work instructions, on the course website. When you see how we do a conda install there, you have to do the same basic steps for your pip install. Once you've got that module installed, you can then go ahead and download the data. Basically it's as simple as saying kaggle competitions download, the competition name, and then the files that you want. The only other step before you do that is that you have to authenticate yourself, and you'll see there's a little bit of information here on exactly how you can go about downloading from Kaggle the file containing your API authentication information. So I won't bother going through it here, but just follow these steps.
Sometimes stuff on Kaggle is not just zipped or tarred, but it's compressed with a program called 7-Zip, which will have a .7z extension. If that's the case, you'll need to either apt install p7zip, or here's something really nice. Some kind person has actually created a conda installation of 7-Zip that works on every platform. So you can always just run this conda install. It doesn't even require sudo or anything like that. And this is actually a good example of where conda is super handy: you can install binaries and libraries and stuff like that, and it's nicely cross-platform. So if you don't have 7-Zip installed, that's a good way to get it. And this is how you unzip a 7-Zip file. In this case, it's tarred and 7-zipped, so you can do this all in one step. 7za is the name of the 7-Zip archive program that you would run. Okay, so that's all basic stuff, but if you're not so familiar with the command line, it might take you a little bit of experimenting to get it working. Feel free to ask on the forum, but make sure you search the forum first.
Okay. So once you've got the data downloaded and unzipped, you can take a look at it. In this case, because we have multiple labels for each tile, we clearly can't have a different folder for each image telling us what the label is. We need some different way to label it. And the way that Kaggle did it was to provide a CSV file that has each file name along with a list of all of the labels. In order to just take a look at that CSV file, we can read it using the pandas library. If you haven't used pandas before, it's kind of the standard way of dealing with tabular data in Python. It pretty much always appears in the pd namespace. In this case, we're not really doing anything with it, other than just showing you the contents of this file. So we can read it, we can take a look at the first few lines, and there it is.
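To make the labeling scheme concrete, here is a small sketch of parsing a CSV in that shape. It uses Python's standard csv module rather than pandas, purely so that it runs with no extra installs. The column names (image_name, tags) and the three sample rows are assumptions made up for illustration, so check the real file for its actual header and contents.

```python
import csv
import io

# Hypothetical sample in the shape described above: one file name per row,
# plus a space-separated list of labels. Not real competition data.
SAMPLE_CSV = """image_name,tags
train_0,haze primary
train_1,agriculture clear primary water
train_2,clear primary
"""

def read_labels(csv_text, sep=" "):
    """Map each file name to its list of labels, splitting the tag string on `sep`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["image_name"]: row["tags"].split(sep) for row in reader}

labels = read_labels(SAMPLE_CSV)
for name, tags in labels.items():
    print(name, tags)

# The distinct labels across all tiles form the list of classes to predict.
classes = sorted({tag for tags in labels.values() for tag in tags})
print(classes)
```

The point to notice is that each row yields a list of labels rather than a single one, which is what makes this a multi-label problem.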
So we wanna turn this into something we can use for modeling. The kind of object that we use for modeling is an object of the DataBunch class. So we have to somehow create a data bunch out of this. Once we have a data bunch, we'll be able to call .show_batch to take a look at it. And then we'll be able to call create_cnn with it, and then we'll be able to start training, okay? So really the trickiest step previously in deep learning has often been getting your data into a form that you can get into a model. So far, we've been showing you how to do that using various factory methods. These are methods where you basically say, I wanna create this kind of data from this kind of source with these kinds of options. That works fine sometimes, and we showed you a few ways of doing it over the last couple of weeks. But sometimes you want more flexibility, because there are so many choices that you have to make about where the files live, what structure they're in, how the labels appear, how you split out the validation set, how you transform it, and so forth. So we've got this unique API that I'm really proud of called the data block API. The data block API makes each one of those decisions a separate decision that you make. There are separate methods, with their own parameters, for every choice that you make around how to create and set up your data. So for example, to grab the planet data, we would say we've got a list of image files that are in a folder, and they're labeled based on a CSV with this name. They have this separator. Remember I showed you back here that there's a space between them. So by passing in the separator, it's gonna create multiple labels. The images are in this folder, and they have this suffix. We're gonna randomly split out a validation set with 20% of the data. We're gonna create data sets from that, which we're then gonna transform with these transformations.
And then we're gonna create a data bunch out of that, which we'll then normalize using these statistics. So there are all these different steps. To give you a sense of what that looks like, the first thing I'm gonna do is go back and explain all of the PyTorch and fastai classes that you need to know about that are gonna appear in this process, cuz you're gonna see them all the time in the fastai docs and the PyTorch docs.
So the first one you need to know about is a class called Dataset. The Dataset class is part of PyTorch, and this is the source code for the Dataset class. As you can see, it actually does nothing at all. The Dataset class in PyTorch defines two things: __getitem__ and __len__. In Python, these special things that are underscore, underscore, something, underscore, underscore, Pythonistas call dunder something. So these would be dunder getitem and dunder len. They're basically special magical methods that do some special behavior, and you can look them up in the Python docs. This particular method, __getitem__, means that your object, if you had an object called o, can be indexed with square brackets, something like o[3], right? So that would call __getitem__ with three as the index. And then this one called __len__ means that you can go len(o), and it will call that method. And you can see in this case, they're both not implemented. So that is to say, although PyTorch says that to tell PyTorch about your data you have to create a data set, it doesn't really do anything to help you create the data set. It just defines what the data set needs to do. So in other words, the starting point for your data is something where you can say, what is the third item of data in my data set? That's what __getitem__ does. And how big is my data set? That's what __len__ does. So fastai has lots of Dataset subclasses that do that for all different kinds of stuff.
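As a rough illustration of that contract, here is a minimal sketch in plain Python, with no PyTorch required. The class name ListDataset and the file names are made up for the example; the only thing taken from the discussion above is that a data set is an object supporting indexing and len.

```python
class ListDataset:
    """A minimal data set: anything with __getitem__ and __len__ qualifies."""

    def __init__(self, items, labels):
        if len(items) != len(labels):
            raise ValueError("items and labels must have the same length")
        self.items = items
        self.labels = labels

    def __getitem__(self, index):
        # Called for ds[index]; returns one (item, label) pair.
        return self.items[index], self.labels[index]

    def __len__(self):
        # Called for len(ds).
        return len(self.items)

ds = ListDataset(["a.jpg", "b.jpg", "c.jpg", "d.jpg"], ["cat", "dog", "dog", "cat"])
print(ds[3])    # same as ds.__getitem__(3)
print(len(ds))  # same as ds.__len__()
```

A real image data set would open and decode the file inside __getitem__ instead of returning the file name, but the two-method shape stays the same.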
And so far, you've been seeing image classification data sets. They're data sets where __getitem__ will return an image and a single label saying what that image is. So that's what a data set is. Now, a data set is not enough to train a model. The first thing we know we have to do, if you think back to the gradient descent tutorial last week, is to have a few images or a few items at a time so that our GPU can work in parallel. Remember, we do this thing called a mini-batch. A mini-batch is a few items that we present to the model at a time that it can train from in parallel. So to create a mini-batch, we use another PyTorch class called a DataLoader. A data loader takes a data set in its constructor. So it's now saying, this is something I can get the third item and the fifth item and the ninth item from. And it's going to grab items at random and create a batch of whatever size you ask for and pop it on the GPU and send it off to your model for you. So a data loader is something that grabs individual items, combines them into a mini-batch, and pops them on the GPU for modeling. So that's what a data loader is, and it comes from a data set. So you can see already there are choices you have to make. What kind of data set am I creating? What is the data for it? Where is it going to come from? And then when I create my data loader, what batch size do I want to use, right? That still isn't enough to train a model, not really, because we've got no way to validate the model. If all we have is a training set, then we have no way to know how we're doing, because we need a separate set of held-out data, a validation set, to see how we're getting along. So for that, we use a fastai class called a DataBunch. And a data bunch is something which, as it says here, binds together a training data loader and a valid data loader.
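The relationship between those three pieces can be sketched in plain Python. This is a toy illustration only: SimpleLoader and SimpleBunch are hypothetical names, and unlike the real PyTorch and fastai classes they do not stack items into tensors or move anything to the GPU. They only show the shuffling, the batching, and the pairing of a training loader with a validation loader.

```python
import random

class SimpleLoader:
    """Toy stand-in for a data loader: shuffles indices and yields mini-batches."""

    def __init__(self, dataset, batch_size, shuffle=True, seed=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __iter__(self):
        indices = list(range(len(self.dataset)))
        if self.shuffle:
            self.rng.shuffle(indices)
        for start in range(0, len(indices), self.batch_size):
            chunk = indices[start:start + self.batch_size]
            # Index into the data set one item at a time, then group into a batch.
            yield [self.dataset[i] for i in chunk]

class SimpleBunch:
    """Toy stand-in for a data bunch: holds a training and a validation loader."""

    def __init__(self, train_dl, valid_dl):
        self.train_dl = train_dl
        self.valid_dl = valid_dl

# Anything with __getitem__ and __len__ works as the data set; a list of pairs will do.
items = [(f"img_{i}.jpg", i % 2) for i in range(10)]
train_items, valid_items = items[:8], items[8:]

bunch = SimpleBunch(
    train_dl=SimpleLoader(train_items, batch_size=4, seed=0),
    valid_dl=SimpleLoader(valid_items, batch_size=4, shuffle=False),
)
train_batches = list(bunch.train_dl)
valid_batches = list(bunch.valid_dl)
print([len(b) for b in train_batches], [len(b) for b in valid_batches])
```

Shuffling is switched off for the validation loader here because the order of held-out items does not affect the measured result; that choice is part of the sketch, not something stated in the lecture.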
And when you look at the fastai docs, when you see these kind of monospace font things, they're always referring to some symbol you can look up elsewhere. So in this case, you can see train_dl is here. And there's no point knowing that there's an argument with a certain name unless you know what that argument is. So you should always look after the colon to find out that it is a DataLoader. So when you create a data bunch, you're basically giving it a training set data loader and a validation set data loader. And that's now an object that you can send off to a learner and start learning, start fitting. So those are the basic pieces.
So coming back to here, this stuff plus this line is all the stuff which is creating the data set. So it's saying where do the images come from? Because when you index into the data set, it returns two things. It returns the image and the labels, assuming it's an image data set. So where do the images come from? Where do the labels come from? And then I'm going to create two separate data sets, the training and the validation. This is the thing that actually turns them into PyTorch data sets. This is the thing that transforms them, okay? And then this is actually going to create the data loader and the data bunch in one go.
So let's look at some examples of this data block API. Because once you understand the data block API, you'll never be lost for how to convert your data set into something you can start modeling with. So here are some examples of using the data block API. For example, if you're looking at MNIST, which remember is the pictures and classes of handwritten numerals, you can do something like this. What kind of data set is this going to be? It's going to come from a list of image files, which are in some folder. And they're labeled according to the folder name that they're in. And then we're going to split it into train and validation according to the folder that they're in, train and validation.
You can optionally add a test set. We're going to be talking more about test sets later in the course. Okay, we'll convert those into PyTorch data sets now that that's all set up. We'll then transform them using this set of transforms, and we're going to transform into something of this size. And then we're going to convert them into a data bunch. Each of those stages takes, inside these parentheses, various parameters you can pass to customize how that all works, right? But in the case of something like this MNIST data set, all the defaults pretty much work. So this is all fine. So here it is, so you can check. Let's grab something. So data.train_ds is the data set, not the data loader, the data set. So I can actually index into it with a particular number. So here is the zero-indexed item in the training data set. It's got an image and a label. We can call show_batch to see an example of the pictures in it, and we can then start training. Here are the classes that are in that data set. And this little cut-down sample of MNIST just has threes and sevens.
Here's an example using Planet. This is actually, again, a little subset of Planet we use to make it easy to try things out. For this case, again, it's an image file list. Again, we're grabbing it from a folder. This time we're labeling it based on a CSV file. We're randomly splitting it, and by default it's 20%. We're creating data sets and transforming them using these transforms. We're going to use a smaller size and then create a data bunch. There it is. And so data bunches know how to draw themselves, amongst other things.
So here are some more examples we're going to be seeing later today. What if we look at this data set called CamVid? CamVid looks like this. It contains pictures, and every pixel in the picture is color coded, right? So in this case, we have a list of files in a folder. And we're going to label them, in this case, using a function.
And so this function is basically the thing, we're going to see it later, which tells it whereabouts the color coding for each pixel is. It's in a different place. We randomly split it in some way and create some data sets in some way. We can tell it, for our particular list of classes, how we know what pixel value one versus pixel value two is, and that was something that we can basically read in, like so. Again, some transforms, then create a data bunch. You can optionally pass in things like what batch size you want. And again, it knows how to draw itself, and you can start learning with that.
Or one more example: what if we wanted to create something like this? It has like bars and chair and remote control and book. This is called an object detection data set. So again, we've got a little minimal COCO data set. COCO is kind of the most famous academic data set for object detection. We can create it using the same process. Grab a list of files from a folder, label them according to this little function, randomly split them, create an object detection data set, and create a data bunch. In this case, as you'll learn when we get to object detection, you generally have to use smaller batch sizes or you'll run out of memory. And as you'll also learn, you have to use something called a collate function. And once that's all done, we can again show it. And here's our object detection data set. So you get the idea, right?
So here's a really convenient notebook. Where will you find this? Ah, this notebook is the documentation. Remember how I told you that all of the documentation comes from notebooks? You'll find them in your fastai repo in docs_src. So this one, which you can play with, experimenting with inputs and outputs and trying all the different parameters, is the data block API examples of use. If you go to the documentation, here it is, the data block API examples of use.
So remember, everything that you want to use in fastai, you can look up in the documentation. So searching for data block API goes straight there, and away you go. And once you find some documentation that you actually want to try playing with yourself, just look up the name, data block, and then you can open up a notebook with the same name in the fastai repo and play with it yourself. Okay, so that's a quick overview of this really nice data block API. There's lots of documentation for all of the different ways you can label inputs and split data and create data sets and so forth. And so that's what we're using for Planet, we're using that API. You'll see in the documentation that these two steps were all joined up together. We can certainly do that here too, but you'll learn in a moment why it is that we're actually splitting these up into two separate steps, which is fine as well.
So a few interesting points about this, starting with transforms. Remember you can hit Shift-Tab to get all the information. Transforms by default will flip each image randomly. But they'll actually only flip them horizontally, which makes sense, right? If you're trying to tell whether something's a cat or a dog, it doesn't matter whether it's pointing left or right, but you wouldn't expect it to be upside down. On the other hand, with satellite imagery, whether something's cloudy or hazy or whether there's a road there or not, the image could absolutely be flipped upside down. There's no such thing as a right way up in space. So flip_vert, which defaults to False, we're going to flip over to True, to say, yeah, randomly you should actually do that. It doesn't just flip it vertically. It actually also tries each possible 90 degree rotation. So there are eight possible symmetries that it tries out. There are various other things here. I've found that these particular settings work pretty well for Planet. One that's interesting is warp.
Perspective warping is something which very few libraries provide, and in those that do provide it, it tends to be really slow. I think fastai is the first one to provide really fast perspective warping. And basically, the reason this is interesting is that if I look at you from below versus look at you from above, your shape changes, right? And so when you're taking a photo of a cat or a dog, sometimes you'll be higher, sometimes you'll be lower, and that kind of change of shape is certainly something that you would want to include as you're creating your training batches. You want to modify it a little bit each time. That's not true for satellite images. A satellite always points straight down at the planet. So if you added perspective warping, you would be making changes that aren't going to be there in real life. So I turn that off. This is all something called data augmentation. We'll be talking a lot more about it later in the course, but you can start to get a feel for the kinds of things that you can do to augment your data. And in general, maybe the most important one is this: if you're looking at astronomical data, or pathology digital slide data, or satellite data, data where there isn't really an up or a down, turning on flip_vert=True is generally going to make your models generalize better.
Okay, so here are the steps necessary to create our data bunch. And so now, to create a satellite imagery classifier, a multi-label classifier that's going to figure out for each satellite tile what the weather is and what else can be seen in it, there's basically nothing else to learn. Everything else that you've already learned is going to be pretty much exactly the same. Here it is: learn equals create_cnn, data, architecture, right? And in this case, when I first built this notebook, I used ResNet-34 as per usual. And then I tried ResNet-50, as I always like to do.
I found ResNet-50 helped a little bit and I had some time to run it. So in this case, I was using ResNet-50. There's one more change I make, which is metrics. Now to remind you, a metric has got nothing to do with how the model trains. Changing your metrics will not change your resulting model at all. The only thing that we use metrics for is that we print them out during training. So here it's printing out accuracy, and it's printing out this other metric called fbeta. So if you're trying to figure out how to do a better job with your model, changing the metrics will never be something that you need to do there. They're just to show you how you're going. So that's the first thing to know. You can have one metric or no metrics or a list of multiple metrics to be printed out as your model's training. In this case, I want to know two things. The first thing I want to know is the accuracy. And the second thing I want to know is how I would go on Kaggle. And Kaggle told me that I'm going to be judged on a particular metric called the F score. So I'm not going to bother telling you about the F score. It's not really interesting enough to be worth spending your time on. You can look it up. But it's basically this. When you have a classifier, you're going to have some false positives and you're going to have some false negatives. How do you weigh up those two things to create a single number? There are lots of different ways of doing that, and something called the F score is basically a nice way of combining them into a single number. There are various kinds of F scores: F1, F2, and so forth. And Kaggle said in the competition rules, we're going to use a metric called F2. So we have a metric called fbeta, which in other words is F with one or two or whatever, depending on the value of beta. And we can have a look at its signature. You can see that it's got a threshold and a beta. Okay, so the beta is two by default. And Kaggle said that they're going to use F2.
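For anyone who does want to look it up, the standard definition is F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall; with beta = 2, recall counts for more than precision. The sketch below computes it for a single multi-label example using plain Python sets. It is only an illustration of the formula, not the fastai metric, which works on batches of tensors; the function name f_beta and the sample labels are made up.

```python
def f_beta(predicted, actual, beta=2.0):
    """F-beta for one multi-label example, given collections of label names."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)  # share of predicted labels that were right
    recall = true_pos / len(actual)        # share of actual labels that were found
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

actual = {"clear", "primary", "road"}
too_many = {"clear", "primary", "road", "water"}  # one false positive
too_few = {"clear", "primary"}                    # one false negative

print(f_beta(too_many, actual))  # precision 3/4, recall 1
print(f_beta(too_few, actual))   # precision 1, recall 2/3
```

With beta = 2, the prediction with one missing label scores lower than the prediction with one extra label, which is the sense in which F2 penalizes false negatives more than false positives.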
So I don't have to change that. But there's one other thing that I need to set, which is a threshold. What does that mean? Well, here's the thing. Do you remember we had a little look the other day at the source code for the accuracy metric? If you put two question marks, you get the source code. And we found that it used this thing called argmax. The reason for that, if you remember, was that we had this input image that came in and it went through our model. And at the end, it came out with a table of ten numbers. Right, this is like if we're doing MNIST digit recognition. The ten numbers were the probability of each of the possible digits. And so then we had to look through all of those and find out which one was the biggest. The function in NumPy or PyTorch or just math notation that finds the biggest and returns its index is called argmax, right? So to get the accuracy for our pet detector, the accuracy function used argmax to find out behind the scenes which class ID pet was the one that we're looking at. And then it compared that to the actual and then took the average. That was the accuracy. We can't do that for satellite recognition in this case, because there isn't one label we're looking for. There's lots. So instead, what we do is this. I don't know if you remember, but a data bunch has a special attribute called c. And c is gonna be basically how many outputs we want our model to create. For any kind of classifier, we want one probability for each possible class. So in other words, data.c for classifiers is always gonna be equal to the length of data.classes, right? So data.classes, there they all are, there are the 17 possibilities, right? So we're gonna have one probability for each of those. But then we're not just gonna pick out one of those 17. We're gonna pick out n of those 17. And so what we do is we compare each probability to some threshold.
And then we say anything that's higher than that threshold, we're gonna assume that the model's saying it does have that feature. And so we can pick that threshold. I found that for this particular data set, a threshold of 0.2 seems to generally work pretty well. This is the kind of thing you can easily just experiment with to find a good threshold. So I decided I wanted to print out the accuracy at a threshold of 0.2. The normal accuracy function doesn't work that way, because it uses argmax. We have to use a different accuracy function called accuracy_thresh. That's the one that's gonna compare every probability to a threshold, return all the things higher than that threshold, and compare accuracy that way. And so one of the things we would pass in is the threshold. Now, of course, the training loop is gonna be calling our metric function for us, so we don't get to tell it every time it calls what threshold we want. So we really wanna create a special version of this function that always uses a threshold of 0.2. One way to do that would be to define something called acc_02 that takes some input and some target and returns accuracy_thresh with that input and that target and a threshold of 0.2. We could do it that way, right? But it's so common that you wanna say, create a new function that's just like that other function, but always call it with a particular parameter, that computer science has a term for it. It's called partial function application. And so Python 3 has something called partial that takes some function and some list of keywords and values and creates a new function that is exactly the same as this function, except that it is always gonna call it with that keyword argument. So here, this is exactly the same thing as the thing I just typed in. acc_02 is now a new function that calls accuracy_thresh with a threshold of 0.2.
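Here's a small sketch of both ways of doing that, side by side. The `accuracy_thresh` defined here is a toy pure-Python stand-in written for this example (it takes lists of lists rather than tensors and is not the fastai source), and the sample numbers are invented. The only library call it relies on is `functools.partial` from the Python standard library.

```python
from functools import partial


def accuracy_thresh(probs, targets, thresh=0.5):
    """Toy stand-in for a thresholded multi-label accuracy (not the fastai source).

    probs and targets are lists of rows, one row per image and one entry per
    class. A class counts as predicted when its probability exceeds thresh.
    """
    correct = 0
    total = 0
    for prob_row, target_row in zip(probs, targets):
        for prob, target in zip(prob_row, target_row):
            correct += int((prob > thresh) == bool(target))
            total += 1
    return correct / total


# Option 1: write the wrapper by hand.
def acc_02_manual(probs, targets):
    return accuracy_thresh(probs, targets, thresh=0.2)


# Option 2: let partial build the same wrapper for us.
acc_02 = partial(accuracy_thresh, thresh=0.2)

# Two made-up images with three possible labels each.
probs = [[0.9, 0.3, 0.1], [0.25, 0.05, 0.6]]
targets = [[1, 1, 0], [1, 0, 1]]

score_default = accuracy_thresh(probs, targets)  # threshold 0.5
score_02 = acc_02(probs, targets)                # threshold 0.2
```

Both `acc_02` and `acc_02_manual` take just an input and a target, which is the shape of function a metrics list expects.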
And so this is a really common thing to do, particularly with the fastai library, because there's lots of places where you have to pass in functions, and you very often wanna pass in a slightly customized version of a function. So here's how you do it. Here I've got an accuracy with threshold 0.2, and I've got an F beta with threshold 0.2. I can pass them both in as metrics. And I can then go ahead and do all the normal stuff: lr_find, recorder.plot, find the thing with the steepest slope. So I don't know, somewhere around 1e-2. So we'll make that our learning rate. And then fit for a while with 5, slice(lr), and see how we go, okay? And so we've got an accuracy of about 96% and an F beta of about 0.926. You could then go and have a look at the Planet leaderboard, the private leaderboard. And around 50th place is about 0.93. So we can say we're on the right track, okay? We're doing fine. So as you can see, once you get to a point that the data's there, there's very little extra to do most of the time.
So when your model makes an incorrect prediction in a deployed app, is there a good way to record that error and use that learning to improve the model in a more targeted way?
Yeah, that's a great question. So the first bit, is there a way to record that? Of course there is. How you record it is up to you, right? So maybe some of you can try it this week. You need to have your user tell you that you were wrong: this Australian car, you said it was a Holden and actually it's a Falcon. So first of all, you'll need to collect that feedback. And the only way to do that is to ask the user to tell you when it's wrong. So you now need to record in some log somewhere something saying: this was the file, I've stored it here; this was the prediction I made; this was the actual that they told me. And then at the end of the day or at the end of the week, you could set up a little job to run something or you can manually run something.
And what are you gonna do? You're gonna do some fine tuning. What does fine tuning look like? Good segue, Rachel. It looks like this, right? So let's pretend, here's your saved model, right? And so then we unfreeze, right? And then we fit a little bit more, right? Now in this case, I'm fitting with my original data set. But you could create a new data bunch with just the misclassified instances and go ahead and fit, right? And the misclassified ones are likely to be particularly interesting, so you might wanna fit at a slightly higher learning rate to make them kind of really mean more, or you might wanna run them through a few more epochs. But it's exactly the same thing, right? You just call fit with your misclassified examples and passing in the correct classification. And that should really help your model quite a lot. There are various other tweaks you can do to this, but that's the basic idea. The next question. Could someone talk a bit more about the data block ideology? I'm not quite sure how the blocks are meant to be used. Do they have to be in a certain order? Is there any other library that uses this type of programming that I could look at? Yes, they do have to be in a certain order. They do have to be in a certain order. And it's basically the order that you see in the example of use, right? It's what kind of data do you have? Where does it come from? How do you label it? How do you split it? What kind of data sets do you want? Optionally, how do I transform it? And then how do I create a data bunch from it? So they're the steps. I mean, we invented this API. I don't know if other people have independently invented it. The basic idea of kind of a pipeline of things that dot into each other is pretty common in a number of places. Not so much in Python, but you see it more in JavaScript. 
Although this kind of approach of each stage produces something slightly different, you tend to see it more in ETL software, like extraction transformation and loading software, where there's kind of particular stages in a pipeline. So yeah, I mean, it's been inspired by a bunch of things. But yeah, all you need to know is kind of use this example to guide you and then look up the documentation to see which particular kind of thing you want. And in this case, the image file list, you're actually not gonna find the documentation of image file list in data blocks documentation because this is specific to the vision application. So to then go and actually find out how to do something for your particular application, you would then go to look at text and vision and so forth. And that's where you can find out what are the data block API pieces available for that application. And of course, you can then look at the source code if you've got some totally new application, you could create your own part of any of these stages. Like pretty much all of these functions are, you know, very few lines of code. Maybe we could look an example of one. Image list from folder. So let's just put that somewhere temporary. And then we're gonna go to t.label from CSV. Then you can look at the documentation to see exactly what that does. And that's gonna call label from data frame. So I mean, this is already like useful. Like if you wanted to create a data frame, a pandas data frame from something other than the CSV, you now know that you could actually just call label from data frame and you can look up to find what that does. And as you can see, like most fast AI functions are no more than a few lines of code. They're normally pretty straightforward to see what are all the pieces there and how can you use them. And it's probably one of these things that as you play around with it, you'll get a good sense of how it all gets put together. 
But if during the week there are particular things where you're thinking, I don't understand how to do this, please let us know and we'll try to help you.
Sure. What resources do you recommend for getting started with video? For example, being able to pull frames and submit them to your model.
I guess the answer is, it depends. If you're using the web, which I guess probably most of you will be, then there are web APIs that basically do that for you. So you can grab the frames with the web API and then they're just images which you can pass along. If you're doing it client side, I guess most people tend to use OpenCV for that. But maybe people during the week who are doing these video apps can tell us what they've used and found useful, and we can start to prepare something in the lesson wiki with a list of video resources, since it sounds like some people are interested.
Okay, so just like usual, we unfreeze our model and then we fit some more and we get down to 0.929-ish. So one thing to notice here is that before we unfreeze, you'll tend to get this shape pretty much all the time. If you do your learning rate finder before you unfreeze, it's pretty easy: find the steepest slope, not the bottom. Remember, we're trying to find the bit where we can slide down it quickly. If you start at the bottom, it's just gonna send you straight off to the end here. So somewhere around here. And then we can call it again after you unfreeze, and you'll generally get a very different shape. This is a little bit harder to say what to look for, because it tends to be this kind of shape where you get a little bit of upward, then a very gradual downward, and then up here. So I tend to look for just before it shoots up and go back about 10x as a rule of thumb. So 1e-5. And that is what I do for the first half of my slice. Then for the second half of my slice, I normally take whatever learning rate I used for the frozen part.
So lr, which was 0.01, divided by five or divided by 10. Somewhere around that. So that's my rule of thumb, right? Look for the bit just before it shoots up and go about 10x smaller. That's the number that I put here. And then lr over five or lr over 10 is what I put there. It seems to work most of the time. This is called discriminative learning rates, and we'll be talking more about exactly what's going on here as the course continues. So how am I gonna get this better than 0.929? Because, you know, there are how many people in this competition? About a thousand teams, right? So we wanna get into the top 10%. The top 5% would be 0.931-ish. The top 10% is gonna be about 0.929-ish. So we're not quite there, right? So here's a trick. I don't know if you remember, but when I created my data set, I put size equals 128. And actually the images that Kaggle gave us are 256. I used the size of 128 partially because I wanted to experiment quickly. It's much quicker and easier to use small images to experiment. But there's a second reason. I now have a model that's pretty good at recognizing the contents of 128 by 128 satellite images. So what am I gonna do if I now wanna create a model that's pretty good at 256 by 256 satellite images? Well, why don't I use transfer learning? Why don't I start with a model that's good at 128 by 128 images and fine-tune that? So don't start again, right? And that's actually gonna be really interesting, because if I've trained quite a lot, if I'm on the verge of overfitting, which I don't wanna do, then I'm basically creating a whole new data set, effectively, one where my images are twice the size on each axis. So four times bigger. It's really a totally different data set as far as my convolutional neural network's concerned. So I kind of lose all that overfitting. I get to start again. So let's create a new learner, right?
Well, let's keep our same learner but use a new data bunch, where the data bunch is 256 by 256. That's why I actually stopped here, before I created my data bunch, because I'm now gonna take this data source and create a new data bunch with 256 instead. So let's have a look at how we do that. Here it is: take that source, transform it with the same transforms as before, but this time use size 256. Now that should be better anyway because these are gonna be higher resolution images. But also, I haven't got rid of my learner. It's the same learner I had before. So I'm gonna start with this pre-trained model. I'm gonna replace the data inside my learner with this new data bunch, and then I will freeze again. That means I'm going back to just training the last few layers, and I will do a new lr_find. And because I actually now have a pretty good model, like it was pretty good for 128 by 128, so it's probably gonna be at least okay for 256 by 256, I don't get that same sharp shape that I did before. But I can certainly see where it's way too high, right? So I'm gonna pick something well before where it's way too high. Again, maybe 10x smaller. So here I'm gonna go 1e-2 over 2. That seems well before it shoots up. And so let's fit a little bit more, okay? We've frozen again, so we're just training the last few layers, and we fit a little bit more. And as you can see, very quickly, remember 0.928 was where we got to before after quite a few epochs. We're straight up there and suddenly we've passed 0.93, right? So we're now already into the top 10%. We've hit our first goal. We're at the very least pretty competent at the problem of just recognizing satellite imagery. But of course now we can do the same thing as before. We can unfreeze and train a little more.
Okay, again, using the same kind of approach I described before, lr over five here and an even smaller one here. Train a little bit more: 0.9314. So that's actually pretty good, 0.931. Somewhere around top 20-ish. You can see, actually, when my friend Brendan and I entered this competition, we came 22nd with 0.9315. And we spent, this was a year or two ago, months trying to get there. So using pretty much the defaults with minor tweaks and one trick, which is the resizing tweak, you can get right up into the top of the leaderboard of this very challenging competition. Now, I should say we don't really know where we'd be. We'd actually have to check it on the test set that Kaggle gave us and actually submit to the competition, which you can do, you can do a late submission. Later on in the course, we'll learn how to do that. But we certainly know we're doing well. We're doing very well. So that's great news. You can see also that as I go along, I tend to save things. You can name your models whatever you like, but I just want to basically know: was it before or after the unfreeze? So I have stage one or two. What size was I training on? What architecture was I training on? That way I can always go back and experiment pretty easily. So that's Planet, multi-label classification.
Let's look at another example. The next example we're going to look at is this data set called CamVid. And it's going to be doing something called segmentation. We're going to start with a picture like this and we're going to try and create a color-coded picture like this, where all of the bicycle pixels are the same color, all of the road line pixels are the same color, all of the tree pixels are the same color, all of the building pixels are the same color, the sky is the same color, and so forth. Now, we're not actually going to make them colors.
We're actually going to do it where each of those pixels has a unique number. So in this case, the top left is building, so I guess building was number four, the top right is tree, so tree is 26, and so forth. So in other words, for this single top left pixel, we're basically going to do a classification problem, just like the pets classification. We're going to say, what is that top left pixel? Is it bicycle, road lines, sidewalk, building? What is the very top left pixel? And then what is the next pixel along? And the next pixel along? So we're going to do a little classification problem for every single pixel in every single image. That's called segmentation. In order to build a segmentation model, you actually need to download or create a data set where someone has labeled every pixel. As you can imagine, that's a lot of work. So you're probably not going to create your own segmentation data sets, but you're probably going to download or find them from somewhere else. This is very common in medicine and life sciences. If you're looking through slides at nuclei, it's very likely you already have a whole bunch of segmented cells and segmented nuclei. If you're in radiology, you probably already have lots of examples of segmented lesions and so forth. So there are a lot of different domain areas where there are domain-specific tools for creating these segmented images. As you could guess from this example, it's also very common in self-driving cars and stuff like that, where you need to see what objects are around and where they are. In this case, there's a nice data set called CamVid, which we can download, and they have already got a whole bunch of images and segment masks prepared for us, which is pretty cool. And remember, for pretty much all of the data sets that we have provided inbuilt URLs for, you can see their details at course.fast.ai slash data sets.
And nearly all of them are academic data sets where some very kind people have gone to all of this trouble for us and made the data available for us to use. So if you do use one of these data sets for any kind of project, it would be very, very nice if you were to go and find the citation and say, thanks to these people for this data set. Because they've provided it, and all they're asking in return is for us to give them that credit. Okay, so here is the CamVid data set, here is the citation, and on our data sets page, that will link to the academic paper where it came from. Okay, Rachel, now is a good time for a question.
Is there a way to use learn.lr_find and have it return a suggested number directly, rather than having to plot it as a graph and then pick a learning rate by visually inspecting that graph? And then there are a few other questions, I think, around more guidance on reading the learning rate finder graph.
Yeah, that's a great question. The short answer is no. And the reason the answer is no is because this is still a bit more artisanal than I would like. As you can see, I've been saying that how I read this learning rate graph depends a bit on what stage I'm at and what the shape of it is. I guess when you're just training the head, so before you unfreeze, it pretty much always looks like this. You could certainly create something that creates a smooth version of this, finds the sharpest negative slope and picks that, and you would probably be fine nearly all the time. But then for these kinds of ones, it requires a certain amount of experimentation. The good news is you can experiment, right? Obviously, if the line's going up, you don't want it. Almost certainly at the very bottom point, you don't want it either, because you need it to be going downwards.
But if you start with somewhere around 10x smaller than that, and then also try another 10x smaller than that, try a few numbers and find out which ones work best. And within a small number of weeks, you will find that you're picking the best learning rate most of the time, right? So at this stage, it still requires a bit of playing around to get a sense of the different kinds of shapes that you see and how to respond to them. Maybe by the time this video comes out, someone will have a pretty reliable auto learning rate finder. We're not there yet. It's probably not a massively difficult job to do. It'd be an interesting project: collect a whole bunch of different data sets, maybe grab all the data sets from our data sets page, try and come up with some simple heuristic, compare it to all the different lessons I've shown. That'd be a really fun project to do. But at the moment, we don't have that. I'm sure it's possible, but we haven't got there.
Okay, so how do we do image segmentation? Same way we do everything else. Basically we're gonna start with some path which has got some information in it of some sort. So I always start by untarring my data, doing an ls, and seeing what I was given. In this case, there's a folder called labels and a folder called images. So I'll create paths for each of those, and we'll take a look inside each of them. At this point, you can see there are some kind of coded file names for the images and some kind of coded file names for the segment masks. And then you have to figure out how to map from one to the other. Normally these kinds of data sets will come with a readme you can look at, or you can look at their website. Often it's kind of obvious. In this case, I can see these ones always have this particular format, and these ones always have exactly the same format with an _P. So honestly, I just guessed.
I thought, oh, it's probably the same thing with _P. And so I created a little function that basically took the file name, added the _P, and put it in the different place. I tried opening it and I noticed it worked. So I've created this little function that converts from the image file names to the equivalent label file names, and I opened one up to make sure it works. Normally we use open_image to open a file and then you can call show to take a look at it. But this, as we described, is not a usual image file. It contains integers. So you have to use open_mask rather than open_image, because we want to return integers, not floats. And fastai knows how to deal with masks, so if you go mask.show, it will automatically color code it for you in some appropriate way. That's why we say open_mask. So we can have a look inside, look at the data, see what the size is. It's 720 by 960. We can take a look at the data inside and so forth. The other thing you might have noticed is that they gave us a file called codes.txt and a file called valid.txt. So codes.txt, we can load it up and have a look inside. And not surprisingly, it's got a list telling us that, for example, number four is, zero, one, two, three, four, it's building. Top left is building. There you go, okay? So just like we had grizzlies, black bears and teddies, here we've got the coding for what each one of these pixels means. So we need to create a data bunch. To create a data bunch, we can go through the data block API and say, okay, we've got a list of image files that are in a folder. We need to create labels, which we can do with that get_y_fn function we just created. We then need to split into training and validation. In this case, I don't do it randomly. Why not? Because actually the pictures they've given us are frames from videos.
So if I did them randomly, I would be having two frames next to each other, one in the validation set, one in the training set. That would be far too easy. That's cheating, right? So the people that created this data set actually gave us a file saying, here is the list of file names that are meant to be in your validation set, and they're non-contiguous parts of the video. So here's how you can split your validation and training using a file of file names. From that, I can create my data sets. And I actually have a list of class names. Often with stuff like the Planet data set or the Pets data set, we actually have a string saying this is a pug or this is a Ragdoll or this is a Birman or this is cloudy or whatever. In this case, you don't have every single pixel labeled with an entire string. That would be incredibly inefficient. They're each labeled with just a number, and then there's a separate file telling you what those numbers mean. So here's where we get to tell the data block API: this is the list of what the numbers mean. These are the kinds of parameters that the data block API gives you. Here are our transformations. And here's an interesting point. Remember I told you that, for example, sometimes we randomly flip an image, right? What if we randomly flip the independent variable image but we don't also randomly flip this one? They're now not matching anymore, right? So we need to tell fastai that I wanna transform the y. X is our independent variable, y is our dependent. I wanna transform the y as well. So whatever you do to the x, I also want you to do to the y. So there are all these little parameters that we can play with, and I can create a data bunch. I'm using a smaller batch size because, as you can imagine, I'm creating a classifier for every pixel, and that's gonna take a lot more GPU memory. I found a batch size of eight is all I could handle. And then normalize in the usual way. And this is quite nice.
Because fastai knows that you've given it a segmentation problem, when you call show_batch it actually combines the two pieces for you and color codes the photo. Isn't that nice? So you can see here the green on the trees and the red on the lines and this kind of color on the walls and so forth. And you can see here, here are the pedestrians. This is the pedestrian's backpack. So this is what the ground truth data looks like. Once we've got that, we can go ahead and create a learner. I'll show you some more details in a moment. Call lr_find, find the sharpest bit, which looks about 1e-2, call fit, passing in slice(lr), and see the accuracy, and save the model, and unfreeze and train a little bit more. So that's the basic idea, okay? And so we're gonna have a break, and when we come back, I'm gonna show you some little tweaks that we can do, and I'm also gonna explain this custom metric that we've created, and then we'll be able to go on and look at some other cool things. So let's all come back at eight o'clock, six minutes.
Okay, welcome back everybody, and we're gonna start off with a question we got during the break. Could you use unsupervised learning here, pixel classification with the bike example, to avoid needing a human to label a heap of images?
Well, not exactly unsupervised learning, but you can certainly get a sense of where things are without needing these kinds of labels. And time permitting, we'll try and see some examples of how to do that. You're certainly not gonna get such a high quality and such a specific output as what you see here, though. If you wanna get this level of segmentation mask, you need a pretty good segmentation mask ground truth to work with.
Is there a reason we shouldn't deliberately make a lot of smaller data sets to step up from in tuning, let's say 64 by 64, 128 by 128, 256 by 256 and so on?
Yes, you should totally do that. It works great, try it.
This idea is something that I first came up with in the course a couple of years ago. I thought it seemed obvious and just presented it as a good idea, and then I later discovered that nobody had really published this before. Then we started experimenting with it, and it was basically the main trick that we used to win the DAWNBench ImageNet training competition. And we were like, wow, not only was this not standard, nobody had heard of it before. There have now been a few papers that use this trick for various specific purposes, but it's still largely unknown. It means that you can train much faster, and it generalizes better. There are still a lot of unknowns about exactly how small and how big and how much at each level and so forth. In as much as it has a name now, I guess we'd call it progressive resizing. I've found that going much under 64 by 64 tends not to help very much. But yeah, it's a great technique, and definitely try a few different sizes.
What does accuracy mean for pixel-wise segmentation? Is it correctly classified pixels divided by the total number of pixels?
Yep, that's it. If you imagine each pixel was a separate object you're classifying, it's exactly the same accuracy. And so you actually can just pass in accuracy as your metric. But in this case, we actually don't. We've created a new metric called acc_camvid. The reason for that is that when they labeled the images, sometimes they labeled a pixel as void. I'm not quite sure why, maybe it was something that they didn't know or somebody felt that they'd made a mistake or whatever, but some of the pixels are void. And in the CamVid paper, they say when you're reporting accuracy, you should remove the void pixels. So we've created an acc_camvid.
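As a rough picture of what "accuracy with the void pixels removed" means, here's a pure-Python sketch on a flattened list of pixels. The name `masked_accuracy`, the list-based inputs and the sample numbers are all invented for this example. The real metric in the notebook works on PyTorch tensors, so treat this as the idea rather than the implementation.

```python
def masked_accuracy(scores, target, void_code):
    """Pixel accuracy that skips pixels labeled void (illustrative sketch only).

    scores: one list of per-class scores for each pixel (the model output).
    target: one integer class id for each pixel (the ground truth mask).
    void_code: the class id that means "void" and should not be counted.
    """
    correct = 0
    counted = 0
    for pixel_scores, label in zip(scores, target):
        if label == void_code:
            continue  # the mask step: drop void pixels from the calculation
        # argmax: index of the highest-scoring class for this pixel
        predicted = max(range(len(pixel_scores)), key=pixel_scores.__getitem__)
        correct += int(predicted == label)
        counted += 1
    return correct / counted if counted else 0.0


# Four pixels, three classes; class 2 plays the role of void here.
scores = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.2, 0.2, 0.6], [0.3, 0.4, 0.3]]
target = [1, 0, 2, 0]
acc_without_void = masked_accuracy(scores, target, void_code=2)
```

In this toy case the third pixel is void, so it drops out and the accuracy is computed over the remaining three pixels.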
So all metrics take the actual output of the neural net, which is what they call the input, as in the input to the metric, and the target, the labels you're trying to predict. We then basically create a mask: we look for the places where the target is not equal to void. Then we take the input, do the argmax as per usual, just the standard accuracy argmax, but then we just grab those that are not equal to the void code. We do the same for the target, and we take the mean. So it's just a standard accuracy. It's almost exactly the same as the accuracy source code we saw before, with the addition of this mask. This quite often happens: with the particular Kaggle competition metric you're using, or the particular way your organization scores things, or whatever, there are often little tweaks you have to do. And this is how easy it is. As you'll see, to do this stuff, the main thing you need to know pretty well is how to do basic mathematical operations in PyTorch. So that's just something you need to practice.
I've noticed that most of the examples and most of my models result in a training loss greater than the validation loss. What are the best ways to correct that? I should add that this still happens after trying many variations on number of epochs and learning rate.
Okay, good question. So remember from last week, if your training loss is higher than your validation loss, then you're underfitting. It definitely means that you're underfitting. You want your training loss to be lower than your validation loss. If you're underfitting, you can train for longer. You can train the last bit at a lower learning rate. But if you're still underfitting, then you're gonna have to decrease regularization. And we haven't talked about that yet.
So in the second half of this part of the course, we're gonna be talking quite a lot about regularization, and specifically how to avoid overfitting or underfitting by using regularization. If you wanna skip ahead, weight decay, dropout and data augmentation will be the key things that we're talking about.
Okay. For segmentation, we don't just create a convolutional neural network. We can, but actually an architecture called U-Net turns out to be better. Let's find it. Okay, so this is what a U-Net looks like. This is from the university website where they talk about the U-Net. We'll be learning about this both in this part of the course and in part two, if you do it. But basically, this bit down on the left hand side is what a normal convolutional neural network looks like. It's something which starts with a big image and gradually makes it smaller and smaller and smaller until eventually you just have one prediction. What a U-Net does is it then takes that and makes it bigger and bigger and bigger again. And it takes every stage of the downward path and copies it across, and that creates this U shape. It was originally created, or published, as a biomedical image segmentation method, but it turns out to be useful for far more than just biomedical image segmentation. It was presented at MICCAI, which is the main medical imaging conference. And as of just yesterday, it actually became the most cited paper of all time from that conference. So it's been incredibly useful, over 3,000 citations. You don't really need to know any of the details at this stage. All you need to know is, if you wanna create a segmentation model, you wanna be saying Learner.create_unet rather than create_cnn. But you pass it the normal stuff: your data bunch, an architecture, and some metrics. Okay? So having done that, everything else works the same.
You can do the LR finder, find the slope, train it for a while, watch the accuracy go up, save it from time to time, unfreeze. Probably wanna go about 10 times less, so it's still going up. So probably 10 times less than that. So 1e-5, lr/5, train a bit more. And there we go, right?
Now, here's something interesting. learn.recorder is where we keep track of what's going on during training. And that's got a number of nice methods, one of which is plot_losses. And this plots your training loss and your validation loss. And you'll see quite often, they actually go up a bit before they go down. Why is that? That's because you can also plot your learning rate over time. And you'll see that the learning rate goes up and then it goes down. Why is that? Because we said fit_one_cycle. And that's what fit_one_cycle does. It actually makes the learning rate start low, go up and then go down again. Why is that a good idea?
Well, to find out why that's a good idea, let's first of all look at a really cool project done by Jose Fernandez-Portal during the week. He took our gradient descent demo notebook and actually plotted the weights over time, not just the ground truth and model over time. And he did it for a few different learning rates. And so remember, we had two weights. We were doing basically y = ax + b or, in his nomenclature here, y = w0 x + w1. And so we can actually look and see over time what happens to those weights. And we know this is the correct answer. Yeah, right? So at a learning rate of 0.1, it kind of slides on in here. And you can see that it takes a little bit of time to get to the right point. And you can see the loss improving. At a higher learning rate of 0.7, you can see that the model jumps to the ground truth really quickly. And you can see that the weights jump straight to the right place really quickly. What if we have a learning rate that's really too high?
You can see it takes a very, very, very long time to get to the right point. Or if it's really too high, it diverges. So you can see here why getting the right learning rate is important. When you get the right learning rate, it really zooms in to the best spot very quickly.
Now, as you get closer to the final spot, something interesting happens, which is that you really want your learning rate to decrease because you're getting close to the right spot. And what actually happens is (I can only draw 2D, sorry) you don't generally actually have some kind of loss function surface that looks like that. Remember, there's lots of dimensions, but it actually tends to kind of look bumpy, like that. And so you kind of want a learning rate that's high enough to jump over the bumps, right? But then once you get close to the middle, once you get close to the best answer, you don't want to be just jumping backwards and forwards between bumps. So you really want your learning rate to go down so that as you get closer, you take smaller and smaller steps. So that's why it is that we want our learning rate to go down at the end.
Now, this idea of decreasing the learning rate during training has been around forever. And it's just called learning rate annealing. But the idea of gradually increasing it at the start is much more recent. And it mainly comes from a guy called Leslie Smith. And if you're in San Francisco next week, actually you can come and join me and Leslie Smith. We're having a meetup where we'll be talking about this stuff, so come along to that. What Leslie discovered is that if you gradually increase your learning rate, what tends to happen is that, actually, loss function surfaces tend to kind of look something like this. Bumpy, bumpy, bumpy, bumpy, bumpy, bumpy, flat. Bumpy, bumpy, bumpy, bumpy, bumpy, bumpy. Something like this, right? They have flat areas and bumpy areas.
And if you end up in the bottom of a bumpy area, that solution will tend not to generalize very well because you found a solution that's good in that one place, but it's not very good in other places. Whereas if you found one in the flat area, it probably will generalize well because it's not only good in that one spot, but it's good kind of around it as well. If you have a really small learning rate, it'll tend to kind of plop down and stick in these places, right? But if you gradually increase the learning rate, then it'll kind of jump down and then as the learning rate goes up, it's gonna start kind of going up again like this, right? And then the learning rate's now gonna be up here. It's gonna be bumping backwards and forwards. And eventually the learning rate starts to come down again and so it'll tend to find its way to these flat areas. So it turns out that gradually increasing the learning rate is a really good way of helping the model to explore the whole function surface and try and find areas where both the loss is low and also it's not bumpy. Because if it was bumpy, it would get kicked out again. And so this allows us to train at really high learning rates. So it tends to mean that we solve our problem much more quickly and we tend to end up with much more generalizable solutions.
So if you call plot_losses and find that it's just getting a little bit worse and then it gets a lot better, you've found a really good maximum learning rate. So when you actually call fit_one_cycle, you're not actually passing in a learning rate. You're actually passing in a maximum learning rate. And if it's kind of always going down, particularly after you unfreeze, that suggests you could probably bump your learning rates up a little bit because you really wanna see this kind of shape. It's gonna train faster and generalize better, just a little bit. And you'll tend to particularly see it in the validation set; the orange is the validation set.
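The "start low, go up, then come down again" shape described above can be sketched in a few lines. This is only an illustration under stated assumptions, not fastai's actual implementation: the function name `one_cycle_lr`, the linear warm-up, the cosine decay, and the defaults `pct_start=0.3` and `div=25.0` are all choices made for this sketch.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div=25.0):
    """Learning rate at `step` for a simple one-cycle-style schedule.

    Hypothetical helper: warm-up fraction and divisor are assumed values.
    """
    warmup_steps = int(total_steps * pct_start)
    min_lr = max_lr / div
    if step < warmup_steps:
        # Phase 1: ramp linearly from min_lr up to max_lr.
        frac = step / warmup_steps
        return min_lr + frac * (max_lr - min_lr)
    # Phase 2: cosine decay from max_lr down towards zero.
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * frac))

# The value you pass in is the peak of the cycle, not a constant rate.
schedule = [one_cycle_lr(s, total_steps=100, max_lr=1e-3) for s in range(100)]
peak_step = schedule.index(max(schedule))
print("start:", schedule[0], "peak at step", peak_step, "end:", schedule[-1])
```

Plotting `schedule` against the step number gives the same rise-then-fall picture as the learning rate plot from the recorder, which is why the number you pass in is best thought of as a maximum learning rate.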
And again, the difference between kind of knowing the theory and being able to do it is looking at lots of these pictures. So after you train stuff, type learn.recorder. and hit tab and see what's in there, right? Particularly the things that start with plot, and start getting a sense of, what are these pictures looking like when you're getting good results? And then try making the learning rate much higher, try making it much lower, more epochs, fewer epochs, and get a sense for what these look like.
So in this case, we used a size in our transforms of the original image size over two. These two slashes in Python mean integer divide. Okay, because obviously we can't have half pixel amounts in our sizes. So integer divide by two. And we used a batch size of eight. I found that fits on my GPU. It might not fit on yours. If it doesn't, you can just decrease the batch size down to four. And this isn't really solving the problem because the problem is to segment all of the pixels, not half of the pixels. So I'm gonna use the same trick that I did last time, which is I'm now gonna put the size up to the full size of the source images, which means I now have to halve my batch size. Otherwise I run out of GPU memory. And I'm then gonna set my learner. I can either say learn.data equals my new data, or, as I actually found I've had a lot of trouble with GPU memory, I generally restarted my kernel, came back here, created a new learner, and loaded up the weights that I saved last time. But the key thing here being that this learner now has the same weights that I had here, but the data is now the full image size. So I can now do an LR find again, find an area where it's kind of, well, before it goes up. So I'm gonna use 1e-3 and fit some more. And then unfreeze and fit some more. And you can go learn.show_results to see how your predictions compare to the ground truth. And you gotta say, they really look pretty good. Not bad, huh?
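The size and batch-size arithmetic above can be written out as a small sketch. The source image size of 720 by 960 is an assumption made for illustration (the transcript does not state it), and the variable names are hypothetical.

```python
src_size = (720, 960)  # assumed (height, width) of the source images

# Stage 1: half-resolution images; integer division keeps the sizes whole.
half_size = tuple(side // 2 for side in src_size)
half_bs = 8

# Stage 2: full resolution has 4x the pixels per image, so shrink the batch.
full_size = src_size
full_bs = half_bs // 2

print("stage 1:", half_size, "batch size", half_bs)
print("stage 2:", full_size, "batch size", full_bs)
```

Whether halving the batch size is enough depends on your GPU; the point is only that `//` always gives whole-number sizes, where `/` would give floats.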
So how good is pretty good? An accuracy of 92.15. The best paper I know of for segmentation was a paper called The One Hundred Layers Tiramisu, which developed a convolutional DenseNet and came out about two years ago. So after I trained this today, I went back and looked at the paper to find their state of the art accuracy. Here it is. And I looked it up. And their best was 91.5 and we got 92.1. So I gotta say, when this happened today, I was like, wow. I don't know if better results have come out since this paper, but I remember when this paper came out and it was a really big deal. And I was like, wow, this is an exceptionally good segmentation result. Like when you compare it to the previous bests that they compared it to, it was a big step up. And so in last year's course, we spent a lot of time re-implementing the hundred layers tiramisu. And now with our totally default fastai class, I'm easily beating this. And I also remember this I had to train for hours and hours and hours, whereas today's I trained in minutes. So this is a super strong architecture for segmentation. So yeah, I'm not gonna promise that this is the definite state of the art today because I haven't done a complete literature search to see what's happened in the last two years. But it's certainly beating the world's best approach the last time I looked into this, which was in last year's course, basically. And so these are kind of just all the little tricks, I guess, we've picked up along the way in terms of how to train things well, things like using the pre-trained model and things like using the one-cycle convergence and all these little tricks. They work extraordinarily well. And it's really nice to be able to show something in class where we can say, we actually haven't published the paper on the exact details of how this variation of the U-Net works, there's a few little tweaks we do.
But if you come back for part two, we'll be going into all of the details about how we make this work so well. But for you, all you have to know at this stage is that you can say Learner.create_unet and you should get great results also.
There's another trick you can use if you're running out of memory a lot, which is you can actually do something called mixed precision training. And mixed precision training means that, for those of you that have done a little bit of computer science, instead of using single precision floating point numbers, you can do most of the calculations in your model with half precision floating point numbers. So 16 bits instead of 32 bits. I mean, the very idea of this has only been around really for the last couple of years in terms of hardware that actually does this reasonably quickly. And the fastai library, I think, is the first and probably still the only one that makes it actually easy to use this. If you add to_fp16() on the end of any learner call, you're actually gonna get a model that trains in 16 bit precision. Because it's so new, you'll need to have kind of the most recent CUDA drivers and all that stuff for this even to work. When I tried it this morning on some of the platforms, it just killed the kernel. So you need to make sure you've got the most recent drivers. But if you've got a really recent GPU, like a 2080 Ti, not only will it work, but it'll work about twice as fast as otherwise. Now, the reason I'm mentioning it is that it's gonna use less GPU RAM. So even if you don't have a 2080 Ti, you'll probably find that things that didn't fit into your GPU without this then do fit in with this. Now, I actually have never seen people use 16-bit mixed precision floating point for segmentation before. Just for a bit of a laugh, I tried it and actually discovered that I got an even better result.
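To make the 16 bits versus 32 bits point concrete, here is a small sketch using only Python's standard `struct` module. It is not how `to_fp16()` works internally; it just shows that a half precision number takes half the storage of a single precision one and represents the same value less exactly. The helper name `round_trip` is made up for this example.

```python
import struct

def round_trip(value, fmt):
    """Store `value` in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

# "<e" is IEEE half precision (16 bits), "<f" is single precision (32 bits).
half = round_trip(0.1, "<e")
single = round_trip(0.1, "<f")
half_bytes = struct.calcsize("<e")
single_bytes = struct.calcsize("<f")

print("half:  ", half_bytes, "bytes, error", abs(half - 0.1))
print("single:", single_bytes, "bytes, error", abs(single - 0.1))
```

Half the bytes per number is where the GPU memory saving comes from; the larger rounding error is the loss of precision being traded for it.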
So I only found this this morning, so I don't have anything more to add here other than quite often when you make things a little bit less precise in deep learning, it generalizes a little bit better. And I've never seen a 92.5 accuracy on CamVid before. So yeah, not only will this be faster, you'll be able to use bigger batch sizes, but you might even find like I did that you get an even better result. So that's a cool little trick. You just need to make sure that every time you create a learner, you add this to_fp16(). If your kernel dies, it probably means you have slightly out of date CUDA drivers or maybe even a too old graphics card. I'm not sure exactly which cards support FP16.
Okay, so one more before we kind of rewind. Sorry, two more. The first one I'm going to show you is an interesting data set called the BIWI head pose data set. And Gabriele Fanelli was kind enough to give us permission to use this in the class. His team created this cool data set. Here's what the data set looks like. It's pictures, it's actually got a few things in it. We're just going to do a simplified version. And one of the things they do is they have a dot saying this is the center of the face. And so we're going to try and create a model that can find the center of a face. So for this data set, there's a few data set specific things we have to do which I don't really even understand but I just know from the readme that you have to. They use some kind of depth sensing camera. I think they actually used a Kinect, Xbox Kinect. There's some kind of calibration numbers that they provide in a little file which I had to read in. And then they provided a little function that you have to use to take their coordinates to change it from this depth sensor calibration thing to end up with actual coordinates. So when you open this and you see these little conversion routines, that's just doing what they told us to do.
Basically it's got nothing particularly to do with deep learning to end up with this dot. The interesting bit really is where we create something which is not an Image or an ImageSegment but an ImagePoints. And we'll mainly learn about this later in the course but basically image points use this idea of kind of the coordinates, right? They're not pixel values, they're XY coordinates. There's just two numbers as you can see. Let me see. So here's an example for a particular image file name. This particular image file, and here it is. The coordinates of the center of the face are at 263, 428 and here it is. So there's just two numbers which represent whereabouts on this picture is the center of the face. So if we're gonna create a model that can find the center of a face, we need a neural network that spits out two numbers. But note, this is not a classification model. These are not two numbers that you look up in a list to find out that they're road or building or ragdoll cat or whatever, they're actual locations. So far everything we've done has been a classification model, something that's created labels or classes. This for the first time is what we'd call a regression model. A lot of people think regression means linear regression. It doesn't. Regression just means any kind of model where your output is some continuous number or set of numbers. So we need to create an image regression model, something that can predict these two numbers.
So how do you do that? Same way as always, right? So we can actually just say, I've got a list of image files, it's in a folder and I wanna label them using this function that we wrote that basically does the stuff that the readme says to grab the coordinates out of their text files. So that's gonna give me the two numbers for every one. And then I'm gonna split it according to some function. And so in this case, the files they gave us, again, they're from videos. And so I picked just one folder to be my validation set.
In other words, a different person. So again, I was trying to think about, how do I validate this fairly? So I said, well, the fair validation would be to make sure that it works well on a person that it's never seen before. So my validation set is all gonna be a particular person. Create a data set. And so this data set, I just tell it, what kind of data set is it? Well, they're gonna be a set of points. So points means specific coordinates. Do some transforms. Again, I have to say tfm_y=True because that red dot needs to move if I flip or rotate or whatever, right? Pick some size, I just picked a size that's gonna work pretty quickly. Create a data bunch, normalize it. And again, show batch, there it is, okay? I noticed that their red dots don't always seem to be quite in the middle of the face. I don't know exactly what their kind of internal algorithm for putting dots on is. It kind of sometimes looks like it's meant to be the nose but sometimes it's not quite the nose. Anyway, it's somewhere around the center of the face or the nose.
So how do we create a model? We create a CNN. But we're gonna be learning a lot about loss functions in the next few lessons. But generally, basically the loss function is that number that says how good is the model? And so for classification, we use this loss function called cross entropy loss which says basically, remember this from earlier lessons, did you predict the correct class and were you confident of that prediction? Now, we can't use that for regression. So instead, we use something called mean squared error. And if you remember from last lesson, we actually implemented mean squared error from scratch. It's just the difference between the two, squared, and averaged over all of them. Okay, so we need to tell it, this is not classification so we have to use mean squared error. And then once we've created the learner, we've told it what loss function to use.
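As a reminder of what mean squared error computes for this kind of output, here is a plain-Python sketch. The function name and the example points are hypothetical (arbitrary units, not values from the data set), and in practice you would use the library's own loss rather than this loop.

```python
def mean_squared_error(preds, targets):
    """Average of squared differences over every predicted number."""
    diffs = [(p - t) ** 2
             for pred, targ in zip(preds, targets)
             for p, t in zip(pred, targ)]
    return sum(diffs) / len(diffs)

# Two made-up face-center predictions as (x, y) points, with their targets.
preds = [(0.10, -0.20), (0.00, 0.35)]
targets = [(0.12, -0.25), (0.05, 0.30)]

mse = mean_squared_error(preds, targets)
print("MSE:", mse)
```

Unlike cross entropy, nothing here involves classes or confidence: the loss only measures how far each predicted coordinate is from the true one.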
We can go ahead and do LR find. We can then fit. And you can see here within a minute and a half, our mean squared error is 0.004. Now the nice thing about mean squared error is that it's very easy to interpret. So we're trying to predict something which is somewhere around a few hundred and we're getting a squared error on average of 0.004. So we can feel pretty confident that this is a really good model. And then we can look at the results by learn.show_results and we can see predictions, ground truth, it's doing a nearly perfect job. So that's how you can do image regression models. So anytime you've got something you're trying to predict which is some continuous value, you use an approach that's something like this.
So last example before we look at some kind of more foundational theory stuff: NLP. And next week we're gonna be looking at a lot more NLP. But let's now do the same thing, but rather than classifying pictures, let's try and classify documents. And so we're gonna go through this in a lot more detail next week but let's do the quick version. Rather than importing from fastai.vision, I now import for the first time from fastai.text. That's where you'll find all the application specific stuff for analyzing text documents. And in this case, we're gonna use a data set called IMDB. And IMDB has lots of movie reviews. They're generally about a couple of thousand words. And each movie review has been classified as either negative or positive. So it's just in a CSV file so we can use pandas to read it. We can take a little look. We can take a look at a review. And basically, as per usual, we can either use factory methods or the data block API to create a data bunch. So here's the quick way to create a data bunch from a CSV of texts: TextDataBunch.from_csv. And that's that. And yeah, at this point, I could create a learner and start training it. But we're gonna show you a little bit more detail which we're mainly gonna look at next week.
There's a few steps that actually happen when you create these data bunches. The first is it does something called tokenization, which is it takes those words and it converts them into a standard form of tokens where basically each token represents a word. But it does things like, see here, see how "didn't" has been turned here into two separate words, and you see how everything's been lower cased. See how "you're" has been turned into two separate words. So tokenization is trying to make sure that each token, each thing that we've got with spaces around it here, represents a single linguistic concept. Also, it finds words that are really rare, like really rare names and stuff like that, and replaces them with a special token called unknown. So anything starting with xx in fastai is some special token. So this is tokenization. So we end up with something where we've got a list of tokenized words. You'll also see that things like punctuation end up with spaces around them to make sure that they're separate tokens.
The next thing we do is we take a complete unique list of all of the possible tokens. That's called the vocab, and that gets created for us. And so here's the first 10 items of the vocab. So here is every possible token, the first 10 of them, that appear in all of the movie reviews. And we then replace every movie review with a list of numbers. And the list of numbers simply says what numbered thing in the vocab is in this place. So here six is zero, one, two, three, four, five, six. So this is the word "a". And this is three: zero, one, two, three. This was a comma, and so forth. So through tokenization and numericalization, this is a standard way in NLP of turning a document into a list of numbers.
We can do that with the data block API. So this time it's not ImageFileList, it's TextSplitData from a CSV, convert them to data sets, tokenize them, numericalize them, create a data bunch. And at that point, we can start to create a model.
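The two steps just described, tokenization and then numericalization, can be sketched in plain Python. This is a toy version under stated assumptions: the regular expressions, the `min_freq` threshold, and the function names are all made up for illustration, and the real fastai tokenizer does considerably more. Only the `xxunk` token name follows the fastai convention mentioned above.

```python
import re
from collections import Counter

UNK = "xxunk"  # fastai-style name for the unknown-word token

def tokenize(text):
    """Lower-case, split off contractions, and separate punctuation."""
    text = text.lower()
    text = re.sub(r"n't\b", " n't", text)                  # didn't -> did n't
    text = re.sub(r"'(re|s|ve|ll|d|m)\b", r" '\1", text)   # you're -> you 're
    return re.findall(r"n't|'\w+|\w+|[^\w\s]", text)

def build_vocab(token_lists, min_freq=2):
    """Index-to-token list: UNK first, then tokens by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [UNK] + [tok for tok, n in counts.most_common() if n >= min_freq]

def numericalize(tokens, vocab):
    """Replace each token with its position in the vocab (0 for unknown)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

reviews = ["You're right, the movie didn't work.",
           "The movie was great, the cast was great."]
token_lists = [tokenize(r) for r in reviews]
vocab = build_vocab(token_lists)
ids = [numericalize(toks, vocab) for toks in token_lists]

print(token_lists[0])
print(vocab)
print(ids[0])
```

Rare tokens (here, anything seen fewer than twice) all map to index 0, the unknown token, which is the same idea as replacing rare names with a special token.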
As we'll learn about next week, when we do NLP classification, we actually create two models. The first model is something called a language model, which, as you can see, we train in a kind of usual way. We say we wanna create a language model learner, we train it, we can save it, we unfreeze, we train some more. And then after we've created a language model, we fine-tune it to create the classifier. So here's the thing where we create the data bunch of the classifier, we create a learner, we train it, and we end up with some accuracy. So that's the really quick version, we're gonna go through it in more detail next week. But you can see the basic idea of training an NLP classifier is very, very, very similar to creating every other model we've seen so far. And this accuracy, so the current state of the art for IMDB classification is actually the algorithm that we built and published with a colleague named Sebastian Ruder. And basically, what I just showed you is pretty much the state of the art algorithm with some minor tweaks. You can get this up to about 95% if you try really hard. So this is very close to the state of the art accuracy that we developed.
There's a question. Okay, now's a great time for a question.
For a data set very different than ImageNet, like the satellite images or genomic images shown in lesson two, we should use our own stats. Jeremy once said, if you're using a pre-trained model, you need to use the same stats it was trained with. Why is that? Isn't it that normalized data with its own stats will have roughly the same distribution as ImageNet? The only thing I can think of which may differ is skewness. Is it the possibility of skewness or something else that's the reason for your statement? And does that mean you don't recommend using pre-trained models with very different data sets like the one point mutation that you mentioned in lesson two?
Nope. As you can see, I've used pre-trained models for all of those things.
Every time I've used an ImageNet pre-trained model and every time I've used ImageNet stats. Why is that? Because that model was trained with those stats. So for example, imagine you're trying to classify different types of green frogs. So if you were to use your own per channel means from your data set, you would end up converting them to a mean of zero, a standard deviation of one for each of your red, green, and blue channels, which means they don't look like green frogs anymore. They now look like gray frogs, right? But ImageNet expects frogs to be green, okay? So you need to normalize with the same stats that the ImageNet training people normalized with, otherwise the unique characteristics of your data set won't appear anymore. You've actually normalized them out in terms of the per channel statistics. So you should always use the same stats that the model was trained with. Okay, so in every case, what we're doing here is we're using gradient descent with mini batches, so stochastic gradient descent, to fit some parameters of a model. And those parameters are parameters to basically matrix multiplications. In the second half of this part, we're actually gonna learn about a little tweak called convolutions, but it's basically a type of matrix multiplication. The thing is though, no amount of matrix multiplications is possibly going to create something that can read IMDB movie reviews and decide if it's positive or negative, or look at satellite imagery and decide whether it's got a road in it. That's far more than a linear classifier can do. Now we know these are deep neural networks, and deep neural networks contain lots of these matrix multiplications, but every matrix multiplication is just a linear model and a linear function on top of a linear function is just another linear function. 
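The claim that a linear function on top of a linear function is just another linear function can be checked directly: c(ax + b) + d = (ca)x + (cb + d), which is a single new slope and a single new intercept. The sketch below uses made-up helper names and scalar weights for simplicity; the same argument holds for matrices.

```python
def linear(a, b):
    """Return the function x -> a * x + b."""
    return lambda x: a * x + b

def compose_linear(a, b, c, d):
    """Slope and intercept of c * (a * x + b) + d."""
    return c * a, c * b + d

a, b, c, d = 2.0, 1.0, -3.0, 4.0
slope, intercept = compose_linear(a, b, c, d)

first = linear(a, b)
second = linear(c, d)
collapsed = linear(slope, intercept)

# The stacked pair and the single collapsed layer agree on every input.
for x in [-2.0, 0.0, 0.5, 5.0]:
    print(x, second(first(x)), collapsed(x))
```

However many such layers you stack, the same collapse applies, so the stack can never represent more than one linear layer could.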
If you remember back to your high school math, you might remember that if you have y = ax + b and then you stick another cy + d on top of that, it's still just another slope and another intercept. So no amount of stacking matrix multiplications is gonna help in the slightest. So what are these models actually? What are we actually doing? And here's the interesting thing. All we're actually doing is we literally do have a matrix multiplication, or a slight variation like a convolution that we'll learn about, but after each one, we do something called a non-linearity or an activation function. An activation function is something that takes the result of that matrix multiplication and sticks it through some function. And these are some of the functions that we use. In the old days, the most common function that we used to use was basically this shape. These shapes are called sigmoid and they have particular mathematical definitions. Nowadays, we almost never use those between each matrix multiply. Nowadays, we nearly always use this one. It's called a rectified linear unit. It's very important when you're doing deep learning to use big, long words that sound impressive. Otherwise, normal people might think they can do it too. But just between you and me, a rectified linear unit is defined using the following function: max(x, 0). That's it. Okay, so if you wanna be really exclusive, of course, you then shorten the long version and you call it a ReLU to show that you're really in the exclusive team. So this is a ReLU activation.
So here's the crazy thing. If you take your red, green, blue pixel inputs and you chuck them through a matrix multiplication and then you replace the negatives with zero and you put it through another matrix multiplication, replace the negatives with zero, and you keep doing that again and again and again, you have a deep learning neural network. That's it. All right. So how the hell does that work?
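The recipe just described (matrix multiply, replace the negatives with zero, repeat) fits in a few lines of plain Python. This is a toy sketch, not library code: the weights are made-up numbers rather than trained values, the helper names are hypothetical, and leaving the ReLU off the final layer is an assumption of this example.

```python
def relu(values):
    """Replace every negative number with zero."""
    return [max(v, 0.0) for v in values]

def matvec(weights, x):
    """Multiply a weight matrix (a list of rows) by an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def forward(x, weight_matrices):
    """Matrix multiply, then ReLU, for every layer except the last."""
    t = x
    for i, w in enumerate(weight_matrices):
        t = matvec(w, t)
        if i < len(weight_matrices) - 1:
            t = relu(t)
    return t

pixel = [0.2, 0.5, 0.1]            # one red, green, blue input
w1 = [[1.0, -2.0, 0.5],            # 2 x 3 weight matrix (made-up values)
      [0.3, 0.8, -1.0]]
w2 = [[1.5, -0.5]]                 # 1 x 2 weight matrix (made-up values)

out = forward(pixel, [w1, w2])
print(out)
```

Training is then a matter of using gradient descent to find values for `w1` and `w2` that make `out` close to the target, which the automatic differentiation in a library such as PyTorch handles for you.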
So an extremely cool guy called Michael Nielsen showed how this works. He has a very nice website. It's actually more than a website, it's a book: neuralnetworksanddeeplearning.com. And he has these beautiful little JavaScript things that you get to play around with. This was back in the old days, back when we used to use sigmoids. And what he shows is that if you have enough little matrix multiplications followed by sigmoids (and exactly the same thing works for a matrix multiplication followed by a ReLU), you can actually create arbitrary shapes. Right? And so this idea that these combinations of linear functions and nonlinearities can create arbitrary shapes actually has a name. And this name is the universal approximation theorem. And what it says is that if you have stacks of linear functions and nonlinearities, the thing you end up with can approximate any function arbitrarily closely. So you just need to make sure that you have a big enough matrix to multiply by, or enough of them. So if you have now this function, just a sequence of matrix multiplies and nonlinearities, where the nonlinearities can be basically any of these things (we normally use this one), if that can approximate anything, then all you need is some way to find the particular values of the weight matrices in your matrix multiplies that solve the problem you wanna solve. And we already know how to find the values of parameters. We can use gradient descent. And so that's actually it, right?
And this is the bit I find the hardest thing normally to explain to students, that we're actually done now. People often come up to me after this lesson and they say, what's the rest? Please explain to me the rest of deep learning. But like, no, there's no rest. Like, we have a function where we take our input pixels or whatever. We multiply them by some weight matrix. We replace the negatives with zeros.
We multiply it by another weight matrix. We replace the negatives with zeros. We do that a few times. We see how close it is to our target. And then we use gradient descent to update our weight matrices using the derivatives. And we do that a few times. And eventually we end up with something that can classify movie reviews or can recognize pictures of ragdoll cats. That's actually it. Okay, so the reason it's hard to understand intuitively is because we're talking about weight matrices that have, once you've added them all up, something like 100 million parameters. They're very big weight matrices, right? So your intuition about what multiplying something by a linear model and replacing the negatives with zeros a bunch of times can do, your intuition doesn't hold, right? You just have to accept empirically the truth is, doing that works really well. So in part two of the course, we're actually gonna build these from scratch, right? But I mean, just to skip ahead, you'll basically find that it's gonna be kind of five lines of code, right? It's gonna be a little for loop that goes, you know, t = x @ weight_matrix_1, t2 = max(t, 0). Stick that in a for loop that goes through each weight matrix. And at the end calculate my loss function. And of course, we're not gonna calculate the gradients ourselves because PyTorch does that for us. And that's about it.
So, okay, question. There's a question about tokenization. I'm curious about how tokenizing words works when they depend on each other, such as San Francisco.
Yeah, okay. Okay, tokenization. How do you tokenize something like San Francisco? San Francisco contains two tokens: San, Francisco. That's it. That's how you tokenize San Francisco. The question may be coming from people who have done traditional NLP, who often need to kind of use these things called n-grams.
And n-grams are kind of this idea that a lot of NLP in the old days was all built on top of linear models where you basically counted how many times particular strings of text appeared, like the phrase San Francisco. That would be a bigram, or an n-gram with an n of two. The cool thing is that with deep learning, we don't have to worry about that. Like with many things, a lot of the complex feature engineering disappears when you do deep learning. So with deep learning, each token is literally just a word, or in the case that the word really consists of two words, like "you're", you split it into two words. And then what we're gonna do is we're going to let the deep learning model figure out how best to combine words together. Now, when we say, let the deep learning model figure it out, of course, all we really mean is find the weight matrices using gradient descent to give the right answer. Like there's not really much more to it than that. Again, there's some minor tweaks, right? In the second half of the course, we're gonna be learning about the particular tweak for image models, which is using a convolution, that'll be a CNN. For language, there's a particular tweak we do called using recurrent models, or an RNN, but they're very minor tweaks on what we've just described. So basically it turns out with an RNN that it can learn that San plus Francisco has a different meaning when those two things are together.
Some satellite images have four channels. How can we deal with data that has four channels or two channels when using pre-trained models?
Yeah, that's a good question. I think that's something that we're gonna try and incorporate into fastai. So hopefully by the time you watch this video, there'll be easier ways to do this. But the basic idea is a pre-trained ImageNet model expects red, green, and blue pixels. So if you've only got two channels, there's a few things you can do, but basically you wanna create a third channel.
And so you can create the third channel as either being all zeros, or it could be the average of the other two channels. And so you can just use normal PyTorch arithmetic to create that third channel. You could either do that ahead of time in a little loop and save your three channel versions, or you could create a custom dataset class that does that on demand. For four channels, you probably don't wanna get rid of the fourth channel. So instead what you'd have to do is actually modify the model itself. We'll only know how to do that in a couple more lessons' time. But basically the idea is that the initial weight matrix, and weight matrix is really the wrong term, they're not weight matrices, they're weight tensors, so they can have more than just two dimensions. So that initial weight matrix in the neural net is actually a tensor, and one of its axes is gonna have three slices in it. So you would just have to change that to add an extra slice, which I would generally just initialize to zero or to some random numbers. So that's the short version. But really to understand exactly what I meant by that, we're gonna need a couple more lessons to get there. Okay, so wrapping up, what have we looked at today? Basically, we started out by saying, hey, it's really easy now to create web apps. We've got starter kits for you that show you how to create web apps, and people have created some really cool web apps using what we've learned so far, which is single label classification. But the cool thing is, the exact same steps we use to do single label classification, you can also use to do multi-label classification, such as with the Planet data, or you could use to do segmentation, or you could use to do any kind of image regression, or, and this is probably a bit early for you to actually try yet, you could do NLP classification and a lot more.
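Going back to the channels question for a second, here is a rough sketch of both ideas in plain PyTorch. None of this is a fastai API: the helper names `two_to_three_channels` and `widen_first_conv` are made up for illustration, and `rgb_conv` is just a randomly initialized stand-in for the first layer of a pre-trained model. The sketch also assumes a plain convolution with default dilation and groups.

```python
import torch
import torch.nn as nn

def two_to_three_channels(img, fill="mean"):
    """Turn a (2, H, W) image tensor into (3, H, W) by adding a channel."""
    if fill == "zeros":
        extra = torch.zeros_like(img[:1])          # all-zeros third channel
    else:
        extra = img.mean(dim=0, keepdim=True)      # average of the other two
    return torch.cat([img, extra], dim=0)

def widen_first_conv(conv, extra_channels=1):
    """Copy `conv` into a layer that accepts extra input channels.

    The weight tensor has shape (out, in, kH, kW); the new input slices
    are initialized to zero and the existing slices are copied across.
    """
    new_conv = nn.Conv2d(
        conv.in_channels + extra_channels, conv.out_channels,
        kernel_size=conv.kernel_size, stride=conv.stride,
        padding=conv.padding, bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Stand-in for a pre-trained first layer that expects red, green, and blue.
rgb_conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
four_ch_conv = widen_first_conv(rgb_conv)
```

Because the extra slice starts at zero, the widened layer initially gives the same output as the original layer applied to the first three channels, and training can then learn to use the fourth.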
So in each case, all we're actually doing is gradient descent on not just two parameters, but on maybe a hundred million parameters. But it's still just plain gradient descent, along with a non-linearity, which is normally this one, which the universal approximation theorem tells us lets us arbitrarily accurately approximate any given function, including functions such as converting a spoken waveform into the thing the person was saying, or converting a sentence in Japanese to a sentence in English, or converting a picture of a dog into the word dog. These are all mathematical functions that we can learn using this approach. So, this week, see if you can come up with an interesting idea of a problem that you would like to solve, which is either multi-label classification or image regression or image segmentation, something like that, and see if you can try to solve that problem. You will probably find the hardest part of solving that problem is creating the data bunch, and so then you'll need to dig into the data block API to try to figure out how to create the data bunch from the data you have. And with some practice, you will start to get pretty good at that. It's not a huge API, there's a small number of pieces. It's also very easy to add your own, but for now, ask on the forum if you try something and you get stuck. Okay, great. So, next week, we're gonna come back and we're gonna look at some more NLP. We're gonna learn some more details about how we actually train with SGD quickly. We're gonna learn about things like Adam and RMSProp and so forth. And hopefully, we're also gonna show off lots of really cool web apps and models that you've all built during the week. So, I'll see you then. Thanks.