Hi everybody. Welcome to Practical Deep Learning for Coders. This is part one of our two-part course. I'm presenting this from the Data Institute in San Francisco. We'll be doing seven lessons in this part of the course. Most of them will be about a couple of hours long. This first one may be a little bit shorter. Practical Deep Learning for Coders is all about getting you up and running with deep learning in practice, getting world-class results, and it's a really coding-focused approach, as the name suggests. But we're not going to dumb it down. By the end of the course, you'll have learnt all of the theory and details that are necessary to rebuild all of the world-class results we're learning about from scratch. Now I should mention that our videos are hosted on YouTube, but we strongly recommend watching them via our website at course.fast.ai. Although they're exactly the same videos, the important thing about watching them through our website is that you'll get all of the information you need about updates to libraries, file locations, further information, frequently asked questions, and so forth. So if you're currently on YouTube watching this, why don't you switch over to course.fast.ai now and start watching through there, and make sure you read all of the material on the page before you start, just to make sure that you've got everything you need. The other thing to mention is that there is a really great, strong community at forums.fast.ai. From time to time, you'll find that you get stuck. You may get stuck very early on. You may not get stuck for quite a while, but at some point you might get stuck with understanding why something works the way it does, or there may be some computer problem that you have, or so forth. On forums.fast.ai there are thousands of other learners talking about every lesson and lots of other topics besides. It's the most active deep learning community on the internet by far.
So definitely register there and start getting involved. You'll get a lot more out of this course if you do that. So we're going to start by doing some coding. This is an approach we'll be talking about in a moment called the top-down approach to study, but let's learn it by doing it. So let's go ahead and try and actually train a neural network. Now in order to train a neural network, you almost certainly want a GPU. A GPU is a graphics processing unit. It's the thing that companies use to help you play games better. They let your computer render the game much more quickly than your CPU can. We'll be talking about them more shortly, but for now I'm going to show you how you can get access to a GPU. Specifically, you're going to need an NVIDIA GPU, because only NVIDIA GPUs support something called CUDA. CUDA is the language and framework that nearly all deep learning libraries and practitioners use to do their work. Obviously it's not ideal that we're stuck with one particular vendor's cards, and over time we hope to see more competition in this space, but for now we do need an NVIDIA GPU. Your laptop almost certainly doesn't have one, unless you specifically went out of your way to buy, like, a gaming laptop. So almost certainly you will need to rent one. The good news is that renting access, paying by the second for a GPU-based computer, is pretty easy and pretty cheap. I'm going to show you a couple of options. The first option I'll show you, which is probably the easiest, is called Crestle. If you go to crestle.com and click on sign up, or if you've been there before, sign in, you will find yourself at this screen, which has a big button that says Start Jupyter and another switch called Enable GPU. So if we make sure that is set to true, Enable GPU is on, and we click Start Jupyter, it's going to launch us into something called Jupyter Notebook.
Jupyter Notebook, in a recent survey of tens of thousands of data scientists, was rated as the third most important tool in the data scientist's toolbox. It's really important that you get to learn it well, and all of our courses will be run through Jupyter. Yes, Rachel, you have a question or a comment? Oh, I just wanted to point out that you get, I believe, ten free hours. So if you want to try Crestle out, you're not having to pay right away. Yeah, they may have changed that recently to fewer hours, but you can check the current pricing; you certainly get some free hours. The pricing varies because this actually runs on top of Amazon Web Services, so at the moment it's 60 cents an hour. The nice thing is, though, that you can always turn it on, you know, start your Jupyter without the GPU running, and pay a tenth of that price, which is pretty cool. So Jupyter Notebook is something we'll be doing all of this course in, and so to get started here, we're going to find our particular course. So we'd go to courses, and we'd go to fastai2, and there they are. Things have been moving around a little bit, so it may be in a different spot for you when you look at this, and we'll make sure all the current information is on the website. Now, having said that, you know, the Crestle approach is, as you can see, basically instant and easy, but if you've got an extra hour or so to get going, an even better option is something called Paperspace. Paperspace, unlike Crestle, doesn't run on top of Amazon. They have their own machines. So here's Paperspace, and so if I click on New Machine, I can pick which one of their three data centers to use. So pick the one closest to you; I'll say West Coast. And then I'll say Linux, and I'll say Ubuntu 16. And then it says Choose Machine, and you can see there's various different machines I can choose from and pay by the hour. So this is pretty cool.
For 40 cents an hour, so it's cheaper than Crestle, I get a machine that's actually going to be much faster than Crestle's 60-cents-an-hour machine, or for 65 cents an hour, way, way, way faster. So I'm going to actually show you how to get started with the Paperspace approach, because that actually does everything from scratch. You may find, if you try to use the 65-cents-an-hour one, that it requires you to contact Paperspace to say, like, why do you want it? That's just an anti-fraud thing. So if you say fast.ai there, then they'll quickly get you up and running. So I'm going to use the cheapest one here, 40 cents an hour. You can pick how much storage you want, and note that you pay for a month of storage as soon as you start the machine up. So don't start and stop lots of machines, because each time you pay for that month of storage. I think the 250 gig, $7 a month option is pretty good. But you only need 50 gig, so if you're trying to minimize the price, you can go there. The only other thing you need to do is turn on public IP so that we can actually log into this, and we can turn off auto snapshot to save the money of not having backups. So if you then click on Create Your Paperspace, about a minute later you will find that your machine will pop up. Here is my Ubuntu 16.04 machine. If you check your email, you will find that they have emailed you a password. So you can copy that, and you can go to your machine and enter your password. Now to paste the password, you would press Ctrl Shift V, or on Mac, I guess, Cmd Shift V. So it's slightly different to normal pasting, or of course you can just type it in. And here we are. Now we can make a little bit more room here by clicking on these little arrows. That can zoom in a little bit. And so as you can see, we've got, like, a terminal that's sitting inside our browser, which is quite a handy way to do it. So now we need to configure this for the course.
And so the way you configure it for the course is you type curl http://files.fast.ai/setup/paperspace | bash. Okay. And so that's then going to run a script, which is going to set up all of the CUDA drivers, the special Python distribution we use called Anaconda, all of the libraries, all of the courses, and the data we use for the first part of the course. Okay. So that takes an hour or so. And when it's finished running, you'll need to reboot your computer. Not your own computer, but your Paperspace computer. And to do that, you can just click on this little circular Restart Machine button. Okay. And when it comes back up, you'll be ready to go. So what you'll find is that you've now got an anaconda3 directory. That's where your Python is. You've got a data directory, which contains the data for the first part of this course, and the first lesson, which is our dogs and cats. And you've got a fastai directory, and that contains everything for this course. So what you should do is cd fastai. And from time to time, you should do git pull, and that will just make sure that all of your fastai stuff is up to date. And also from time to time, you might want to just check that your Python libraries are up to date, and you can type conda env update to do that. All right. So make sure that you've cd'd into fastai, and then you can type jupyter notebook. All right. There it is. So we now have a Jupyter notebook server running, and we want to connect to that. And so you can see here it says copy/paste this URL into your browser when you connect. So if you double-click on it, that will actually copy it for you. Then you can go and paste it. But you need to change this localhost to be the Paperspace IP address. So if you click on the little arrows to go smaller, you can see the IP address is here. So just copy that and paste it where it used to say localhost.
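To recap the steps above, the whole terminal session looks roughly like this. This is a sketch based on the commands given in the lecture; the script URL and directory names may have changed since, so check course.fast.ai for the current versions:

```shell
# One-time setup: installs CUDA drivers, Anaconda, the fastai library,
# the course notebooks, and the lesson data (takes around an hour;
# reboot the Paperspace machine from the web console when it finishes)
curl http://files.fast.ai/setup/paperspace | bash

# After rebooting, keep the course repo and Python libraries up to date
cd fastai
git pull
conda env update

# Start the notebook server, then open the printed URL in your browser,
# replacing "localhost" with the machine's public IP address
jupyter notebook
```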
So it's now http:// and then my IP and then everything else I copied before. And so there it is. So this is the fastai git repo. And our courses are all in courses. And in there, the deep learning part one is dl1. And in there you will find lesson1.ipynb, an IPython/Jupyter notebook. So here we are, ready to go, depending on whether you're using Crestle or Paperspace or something else. If you check course.fast.ai, we'll keep putting up additional videos and links to information about how to set up other good Jupyter notebook providers as well. So to run a cell in Jupyter notebook, you select the cell and you hold down shift and press enter. Or if you've got the toolbar showing, you can just click on the little run button. So you'll notice that some cells contain code, and some contain text, and some contain pictures, and some contain videos. So this environment basically gives us a way that we can run experiments and also tell you what's going on, show pictures. This is why it's, like, a super popular tool in data science. Data science is kind of all about running experiments, really. So let's go ahead and click run. And you'll see that cell turn into a star, the one turned into a star for a moment, and then it finished running. So let's try the next one. This time, instead of using the toolbar, I'm going to hold down shift and press enter. And you can see again it turned into a star and then it said two. So if I hold down shift and keep pressing enter, it just keeps running each cell. So I can put anything I like. For example, one plus one is two. So what we're going to do is we're going to, yes, Rachel? This is just a side note, but I wanted to point out that we're using Python 3 here. Yes, thank you. Python 3.6. If you're still using Python 2, it is important to switch to Python 3. You know, now, well, for fastai it's required.
But, you know, increasingly a lot of libraries are removing support for Python 2. Thanks, Rachel. Now it mentions here that you can download the data set for this lesson from this location. If you're using Crestle or the Paperspace script that we just used to set up, this will already be made available for you. If you're not, you'll need to wget it, as shown. Now, Crestle is quite a bit slower than Paperspace, and also there are some particular things it doesn't support that we really need, and so there are a couple of extra steps. If you're using Crestle, you have to run two more cells. So you can see these are commented out; they've got hashes at the start. So if you remove the hashes from these and run these two additional cells, that just runs the stuff that you only need for Crestle. I'm using Paperspace, so I'm not going to run it. Okay. So we set up this path to data/dogscats, which has already been set up for you. And inside there, you can see here I can use an exclamation mark to basically say, I don't want to run Python, but I want to run bash, I want to run shell. So this runs a bash command. And the bit inside the curly brackets actually refers to a Python variable, so it inserts that Python variable into the bash command. So here is the contents of our folder. There's a training set and a validation set. If you're not familiar with the idea of training sets and validation sets, it would be a very good idea to check out our practical machine learning course, which tells you a lot about this kind of stuff, the basics of how to set up and run machine learning projects more generally. Would you recommend that people take that course before this one? Actually, a lot of students who have gone through these have said they like doing them together. So you can kind of check it out and see. The machine learning course, yeah, it covers some similar stuff, but all in different directions.
So people who have done both say they find they each support each other. I wouldn't say it's a prerequisite. But, you know, if I say something like, hey, this is a training set and this is a validation set, and you're going, I don't know what that means, at least Google it, do a quick read, you know, because we're assuming that you know the very basics of what machine learning is and does, to some extent. And I have a whole blog post on this topic as well. Okay, and we'll make sure to link to that from course.fast.ai. I also just wanted to say, in general with fast.ai, our philosophy is to learn things on an as-needed basis. Yeah, exactly. Don't try and learn everything that you think you might need first. Otherwise, you'll never get around to learning the stuff you actually want to learn. Exactly. And that shows up in deep learning, I think, particularly a lot. Yes. Okay, so in our validation folder, there's a cats folder and a dogs folder, and inside the validation cats folder is a whole bunch of JPEGs. The reason that it's set up like this is that this is kind of the most common, standard approach for how image classification data sets are shared and provided. And the idea is that each folder tells you the label. So each of the images in the cats folder is labeled cat, and each of the images in the dogs folder is labeled dog. This is how Keras works as well, for example. So this is a pretty standard way to share image classification files. So we can have a look. If we go plt.imshow, we can see an example of the first of the cats. If you haven't seen this before, this is a Python 3.6 format string. So you can Google for that if you haven't seen it. It's a very convenient way to do string formatting, and we use it a lot. So there's our cat. But we're going to mainly be interested in the underlying data that makes up that cat.
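The Python 3.6 format strings mentioned above work like this. The path and file names here are hypothetical, just mirroring the lesson's folder layout:

```python
# Python 3.6+ "f-strings" interpolate variables directly into string literals
PATH = "data/dogscats/"           # hypothetical path, matching the lesson's layout
files = ["cat.1.jpg", "cat.2.jpg"]

# Build the full path to the first cat image in the validation set
first_cat = f'{PATH}valid/cats/{files[0]}'
print(first_cat)  # data/dogscats/valid/cats/cat.1.jpg
```

In the notebook, a string like this is what gets passed to plt.imshow (via an image-loading call) to display the picture.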
Specifically, it's an image whose shape, that is, the dimensions of the array, is 198 by 179 by 3. So it's a three-dimensional array, also called a rank-3 tensor. And here are the first four rows and four columns of that image. So as you can see, each of those cells has three items in it, and these are the red, green and blue pixel values, between 0 and 255. So here's a little subset of what a picture actually looks like inside your computer. So our idea is to take these kinds of numbers and use them to predict whether those kinds of numbers represent a cat or a dog, based on looking at lots of pictures of cats and dogs. So that's a pretty hard thing to do. This data set actually comes from a Kaggle competition, the Dogs vs. Cats Kaggle competition, and when it was released, in I think 2012, the state of the art was 80% accuracy. So computers weren't really able to at all accurately recognize dogs versus cats. So let's go ahead and train a model. So here are the three lines of code necessary to train a model. And so let's go ahead and run it. So I click on the cell, I press shift-enter, and then we'll wait a couple of seconds for it to pop up, and there it goes. Okay, and it's training. And so I've asked it to do three epochs. So that means it's going to look at every image three times in total, or look at the entire set of images three times. That's what we mean by an epoch. And as it does, it's going to print out the accuracy on the validation set; it's the last of the three numbers that prints out. The first two numbers we'll talk about later; in short, they're the value of the loss function, which is in this case the cross-entropy loss, for the training set and the validation set. And right at the start here is the epoch number. So you can see it's getting about 99% accuracy, and it took 17 seconds. So you can see we've come a long way since 2012.
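You can get a feel for what such a rank-3 tensor looks like with a small NumPy sketch. The image here is just random noise with the same shape as the cat picture, not the actual image:

```python
import numpy as np

# A colour image is a rank-3 tensor: rows x columns x (red, green, blue),
# with each pixel value between 0 and 255. Here we fabricate one at random.
img = np.random.randint(0, 256, size=(198, 179, 3), dtype=np.uint8)

print(img.shape)    # (198, 179, 3)
print(img[:4, :4])  # the first four rows and columns; each cell holds
                    # three values: the red, green, and blue intensities
```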
And in fact, this actually would have won the Kaggle competition of that time. The best in the Kaggle competition was 98.9, and we're getting about 99%. So it may surprise you that we're getting a Kaggle-winning, as of the end of 2012 or early 2013, image classifier in 17 seconds and three lines of code. And I think that's because a lot of people assume that deep learning takes a huge amount of time and lots of resources and lots of data. And as you'll learn in this course, that in general isn't true. One of the ways we've made it much simpler is that this code is written on top of a library we built, imaginatively called fastai. The fastai library is basically a library which takes all of the best-practice approaches that we can find. And so each time a paper comes out that looks interesting, we test it out. If it works well for a variety of data sets and we can figure out how to tune it, we implement it in fastai. And so fastai kind of curates all this stuff and packages it up for you, and much of the time, or most of the time, automatically figures out the best way to handle things. So the fastai library is why we're able to do this in just three lines of code. And the reason that we're able to make the fastai library work so well is because it in turn sits on top of something called PyTorch, which is a really flexible deep learning and machine learning and GPU computation library written by Facebook. Most people are more familiar with TensorFlow than PyTorch, because Google markets that pretty heavily. But most of the top researchers I know nowadays, at least the ones that aren't at Google, have switched across to PyTorch. Yes, Rachel? And we'll be covering some PyTorch later in the course. Yeah, one of the things that hopefully you'll really like about fastai is that it's really flexible, so you can use all these kind of curated best practices as much or as little as you want.
And so it's really easy to hook in at any point and write your own data augmentation, write your own loss function, write your own network architecture, whatever. And we'll do all of those things in this course. So what does this model look like? Well, what we can do is take a look at, say, the validation set's dependent variable, the y, and it's just a bunch of zeros and ones. If we look at data.classes, the zeros represent cats and the ones represent dogs. You'll see here there are basically two objects I'm working with. One is an object called data, which contains the validation and training data. And the other is an object called learn, which contains the model. So any time you want to find something out about the data, we can look inside data. So we want to get predictions for the validation set, and to do that we can call learn.predict. And so you can see here are the first 10 predictions, and what it's giving you is a prediction for dog and a prediction for cat. Now the way PyTorch generally works, and therefore the way fastai also works, is that most models return the log of the probabilities rather than the probabilities themselves. We'll learn why that is later in the course. So for now, just recognize that to get your probabilities you have to take e to the power of the prediction. You'll see here we're using numpy; np is numpy. If you're not familiar with numpy, that is one of the things we assume you have some familiarity with. So be sure to check out the material at course.fast.ai to learn the basics of numpy. It's the way that Python handles fast numerical programming, array computation, that kind of thing. Okay, so we can get the probabilities using np.exp. And there are a few functions here that you can look at yourself if you're interested, some plotting functions that we'll use. And so we can now plot some random correct images. And so here are some images that it was correct about.
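Here's a minimal sketch of that log-probability-to-probability step, with made-up numbers standing in for the model's actual output:

```python
import numpy as np

# Suppose the model returned these log-probabilities for two images,
# with columns [cat, dog] (made-up values for illustration)
log_preds = np.array([[-0.00002, -10.8],    # very confident cat
                      [-9.2,     -0.0001]]) # very confident dog

probs = np.exp(log_preds)  # undo the log to recover probabilities
dog_prob = probs[:, 1]     # column 1 is the probability of "dog"
print(dog_prob)
```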
And so remember, 1 is a dog, so anything greater than 0.5 is dog, and 0 is a cat. So this is 10 to the negative 5, obviously a cat. Here are some which are incorrect. So you can see that some of the ones it got incorrect are obviously just images that shouldn't be there at all. But clearly this one, which it called a dog, is not at all a dog. So there are some obvious mistakes. We can also take a look at which cats it is the most confident are cats, and which dogs it is the most confident are dogs, the most dog-like. Perhaps more interestingly, we can also see which cats it is the most confident are actually dogs. So which ones is it the most wrong about? And the same thing for the dogs that it really thinks are cats. And again, some of these are just pretty weird. I guess there is a dog in there. Yes, Rachel? I was just asking, do you want to say more about why you would want to look at your data? Yeah, sure. So finally, I'll just mention the last one we've got here, which is to see which ones have the probability closest to 0.5. So these are the ones that the model knows it doesn't really know what to do with, and for some of these, that's not surprising. So yeah, I mean, this is kind of, like, always the first thing I do after I build a model: try to find a way to visualize what it's built. Because if I want to make the model better, then I need to take advantage of the things it's doing well and fix the things it's doing badly. So in this case, and often this is the case, I've learned something about the data set itself, which is that there are some things in here that probably shouldn't be. But it's also clear that this model has room to improve. Like, to me, that's pretty obviously a dog. One thing I'm suspicious about here is that this image is very kind of fat and short. And as we'll learn, the way these algorithms work is they kind of grab a square piece at a time.
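One way these "most confident" and "closest to 0.5" rankings can be implemented is sketched below with NumPy, on made-up probabilities; this is just the underlying idea, not the actual plotting functions from the notebook:

```python
import numpy as np

# Made-up dog probabilities for eight validation images
probs = np.array([0.99, 0.02, 0.51, 0.97, 0.48, 0.10, 0.85, 0.50])

# Most confident "dog" predictions: highest probabilities first
most_dog = np.argsort(-probs)

# Most uncertain predictions: probability closest to 0.5 first
most_uncertain = np.argsort(np.abs(probs - 0.5))

print(most_dog[:3])        # indices of the three most dog-like images
print(most_uncertain[:3])  # indices of the three most uncertain images
```

The same trick with the true labels included (e.g. sorting the cats by dog probability) gives you the "most wrong" examples.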
So this rather makes me suspicious that we're going to need to use something called data augmentation, which we'll learn about later, to handle this properly. Okay, so that's it, right? We've now built an image classifier. And something that you should try now is to grab some data yourself, some pictures of two or more different types of thing, put them in different folders, and run the same three lines of code on them. And you'll find that it will work for that as well, as long as they are pictures of the kinds of things that people normally take photos of. So if they're microscope pictures or pathology pictures or CT scans or something, this won't work as well; as we'll learn later, there are some other things we need to do to make that work. But for things that look like normal photos, you can run exactly the same three lines of code and just point your path variable somewhere else to get your own image classifier. So for example, one student took those three lines of code, downloaded from Google Images ten examples of pictures of people playing cricket and ten examples of people playing baseball, and built a classifier of those images, which was nearly perfectly correct. The same student actually also tried downloading seven pictures of Canadian currency and seven pictures of American currency, and again, in that case, the model was 100% accurate. So you can just go to Google Images if you like and download a few things of a few different classes and see what works. And tell us on the forum both your successes and your failures. So what we just did was to train a neural network, but we didn't, first of all, tell you what a neural network is, or what training means, or anything. Why is that? Well, this is the start of our top-down approach to learning.
And basically, the idea is that, unlike the way math and technical subjects are usually taught, where you learn every little element piece by piece and you don't actually get to put them all together and build your own image classifier until the third year of graduate school, our approach is to say from the start, hey, let's show you how to train an image classifier, and then you can start doing stuff, and then gradually we dig deeper and deeper and deeper. And so the idea is that throughout the course, you're going to see, like, new problems that we want to solve. So for example, in the next lesson we'll look at, well, what if we're not looking at normal kinds of photos but we're looking at satellite images? And we'll see why it is that this approach that we're learning today doesn't quite work as well, and what things we have to change. And so we'll learn enough about the theory to understand why that happens, and then we'll learn about the libraries and change things with the libraries to make that work better. And so during the course, we're gradually going to learn to solve more and more problems. As we do so, we'll need to learn more and more parts of the library, more and more bits of the theory, until by the end we're actually going to learn how to create a world-class neural net architecture from scratch, and our own training loop from scratch, so that we build everything ourselves. So that's the general approach. Yes, Rachel? And we sometimes also call this the whole game, which is inspired by Harvard professor David Perkins. Yeah, and so the idea of the whole game is, like, this is more like how you would learn baseball or music. With baseball, you would get taken to a ball game. You would learn what baseball is. You would start playing it, and it would only be years later that you might learn about the physics of how a curveball works, for example.
Or with music, we put an instrument in your hand and you start banging the drum or hitting the xylophone, and it's not until years later that you learn about the circle of fifths and understand how to construct a cadence, for example. So yeah, this is kind of the approach we're using. It's very inspired by David Perkins and other writers on education. So what that does mean is that to take advantage of this, as we peel back the layers, we want you to keep looking under the hood yourself as well, like, experiment a lot, because this is a very code-driven approach. So here's basically what happens, right? We start out looking today at convolutional neural networks for images, and then in a couple of lessons we'll start to look at how to use neural nets to look at structured data, and then to look at language data, and then to look at recommendation system data. And then we take all of those steps and we go backwards through them in reverse order. So by the end of lesson four, you will know how to create a world-class image classifier, a world-class structured data analysis program, a world-class language classifier, and a world-class recommendation system. And then we're going to go back over all of them again and learn in depth about, like, what exactly did it do, and how did it work, and how do we change things around and use it in different situations, for recommendation systems, structured data, images, and then finally back to language. So that's how it's going to work. So what that kind of means is that most students find that they tend to watch the videos two or three times, but not, like, watch lesson one two or three times and then lesson two two or three times, but rather they do the whole thing end to end, lessons one through seven, and then go back and start lesson one again.
That's an approach which a lot of people find, when they want to go back and understand all the details, can work pretty well. So I would say, you know, aim to get through to the end of lesson seven as quickly as you can, rather than aiming to fully understand every detail from the start. So basically the plan is that in today's lesson you learn, in as few lines of code as possible and with as few details as possible, how to actually build an image classifier with deep learning, to, in this case, say, hey, here are some pictures of dogs as opposed to pictures of cats. Then we're going to learn how to look at different kinds of images; in particular, we're going to look at images from satellites, and we're going to say, for a satellite image, what kinds of things might you be seeing in that image? And there could be multiple things that we're looking at, so that's a multi-label classification problem. From there we'll move to something which is perhaps the most widely applicable for the most people, which is looking at what we call structured data. So data that kind of comes from databases or spreadsheets. We're going to specifically look at this data set of predicting sales, the number of things that are sold at different stores on different dates, based on different holidays and so on and so forth. And so we're going to be doing this sales forecasting exercise. After that we're going to look at language, and we're going to figure out what this person thinks about the movie Zombiegeddon. And we'll figure out how to create, just like we create image classifiers for any kind of image, NLP classifiers to classify any kind of language in lots of different ways. Then we'll look at something called collaborative filtering, which is used mainly for recommendation systems. We're going to be looking at this data set that shows, for different people and different movies, what rating did they give it?
Here are some of the movies. And so maybe an easier way to think about it is that there are lots of different users and lots of different movies, and for each one we can look up how much each user liked that movie. And the goal will be, of course, to predict, for user-movie combinations we haven't seen before, are they likely to enjoy that movie or not? And that's the really common approach used for deciding what stuff to put on your home page when somebody's visiting: what book might they want to read, or what film might they want to see, or so forth. From there we're going to dig back into language a bit more, and we're going to look at the writings of Nietzsche, the philosopher, and learn how to create our own Nietzsche philosophy from scratch, character by character. So this here, "perhaps, that every life or values of blood of intercourse when it senses there is unscrupulous his very rights and still impulse love", is not actually Nietzsche. That's actually some character-by-character generated text that we built with this recurrent neural network. And then finally we're going to loop all the way back to computer vision again. We're going to learn how not just to recognize cats from dogs, but actually find where the cat is with this kind of heat map. And we're also going to learn how to write our own architectures from scratch. So this is an example of a ResNet, which is the kind of network that we are using in today's lesson for computer vision, and we'll actually end up building the network and the training loop from scratch. And so those are basically the steps that we're going to be taking from here, and at each step we're going to be getting into increasing amounts of detail about how to actually do these things yourself.
So we've actually heard back from our students of past courses about what they found, and one of the things we've heard a lot is students saying that they spent too much time on theory and research and not enough time running the code. Even after we give people this warning, they still often come to the end of the course and say, I wish I had taken more seriously that advice to keep running code. These are actual quotes from our forum: "In retrospect I should have spent the majority of my time on the actual code in the notebooks. See what goes in, see what comes out." Now, this idea that you can create world-class models with a code-first approach, learning what you need as you go, is very different to a lot of the advice you'll read out there, such as this person on Hacker News who claimed that the best way to become an ML engineer is to learn all of math, learn C and C++, learn parallel programming, learn ML algorithms, implement them yourself using plain C, and finally start doing ML. We would say: if you want to become an effective practitioner, do exactly the opposite of this. Yes, Rachel? "Yeah, I was just highlighting that we think this is bad advice, and it can be very discouraging for a lot of people to come across things like this." We now have thousands or tens of thousands of people that have done this course, and lots and lots of examples of people who are now running research labs, or are Google Brain residents, or have created patents based on deep learning and so forth, who have done it by doing this course. So the top-down approach works super well. Now, one thing to mention: we've already seen how you can actually train a world-class image classifier in 17 seconds. I should mention, by the way, that the first time you run that code there are two things it has to do that take more than 17 seconds.
One is that it downloads a pre-trained model from the internet, so the first time you run it you'll see it say "downloading model"; that takes a minute or two. Also, the first time you run it, it pre-computes and caches some of the intermediate information that it needs, and that takes about a minute and a half as well. So if the first time you run it takes three or four minutes to download and pre-compute stuff, that's normal; if you run it again, you should find it takes 20 seconds or so. Now, you may not feel like you need to recognize cats versus dogs very often, and you can probably do it yourself pretty well, but what's interesting is that these image classification algorithms are really useful for lots and lots of things. For example, AlphaGo, which beat the Go world champion: the way it worked was to use something at its heart that looked almost exactly like our dogs-versus-cats image classification algorithm. It looked at thousands and thousands of Go boards, and for each one there was a label saying whether that Go board ended up belonging to the winning or the losing player. So it learnt, basically, an image classifier that was able to look at a Go board and figure out whether it was a good Go board or a bad Go board, and that's really the most important step in playing Go well: knowing which move is better. Another example: one of our earlier students, who actually got a couple of patents for this work, looked at anti-fraud. He had lots of examples of mouse movements from his customers' users, because his company provided user-tracking software to help avoid fraud. He took the mouse paths of the users on his customers' websites, turned them into pictures of where the mouse moved and how quickly it moved, and then built an image classifier that took those images as input and produced as output whether that was a fraudulent transaction or not. It turned out to get really great results for his company.
So image classifiers are much more flexible than you might imagine. These are some of the ways you can use deep learning, specifically for image recognition. And it's worth understanding that deep learning is not just a word that means the same thing as machine learning; rather, deep learning is a kind of machine learning. So what is it that we're actually doing here? Machine learning was invented by this guy, Arthur Samuel, who was pretty amazing. In the late 50s he got this IBM mainframe to play checkers better than he could, and the way he did it was he invented machine learning: he got the mainframe to play against itself lots of times and figure out which kinds of things led to victories and which kinds of things didn't, and used that to, kind of, almost write its own program. Arthur Samuel actually said in 1962 that he thought one day the vast majority of computer software would be written using this machine learning approach, rather than written by hand, writing the loops and so forth yourself. I guess that hasn't happened yet, but it seems to be in the process of happening. I think one of the reasons it didn't happen for a long time is that traditional machine learning actually was very difficult and very knowledge- and time-intensive. So for example, here's something called the computational pathologist, or C-Path, from a guy called Andy Beck, back when he was at Stanford; he's now moved on to somewhere on the east coast, Harvard I think. What he did was he took these pathology slides of breast cancer biopsies, and he worked with lots of pathologists to come up with ideas about what kinds of patterns or features might be associated with long-term survival versus dying quickly. So he came up with these ideas like relationship between epithelial nuclear neighbors, relationship between epithelial and stromal objects, and so forth.
So they came up with all of these ideas of features; these are just a few of the things they thought of. Then lots of smart computer programmers wrote specialist algorithms to calculate all these different features, and those features were passed into a logistic regression to predict survival. It ended up working very well: the survival predictions were more accurate than pathologists' own survival predictions. So machine learning can work really well, but the point here is that this was an approach that took lots of domain experts and computer experts many years of work to build. So we really want something better. Specifically, I'm going to show you something which, rather than being a very specific function with all this very domain-specific feature engineering, is an infinitely flexible function: a function that could solve any problem, right? It would solve any problem, if only you set the parameters of that function correctly. So then we need some all-purpose way of setting the parameters of that function, and we would need that to be fast and scalable. Now, if we had something with these three properties, you wouldn't need this incredibly time- and domain-knowledge-intensive approach anymore; instead, we could learn all of those things with this algorithm. As you might have guessed, the algorithm in question which has these three properties is called deep learning, or if not an algorithm, then maybe we would call it a class of algorithms. Let's look at each of these three things in turn. The underlying function that deep learning uses is something called the neural network. We're going to learn all about the neural network, and implement it ourselves from scratch, later on in the course.
But for now, all you need to know about it is that it consists of a number of simple linear layers interspersed with a number of simple non-linear layers. When you intersperse layers in this way, you get something called the universal approximation theorem, which says that this kind of function can solve any given problem to arbitrarily close accuracy, as long as you add enough parameters. So it's actually provably an infinitely flexible function. Now we need some way to fit the parameters so that this infinitely flexible neural network solves some specific problem, and the way we do that is using a technique that probably most of you will have come across before at some stage, called gradient descent. With gradient descent we basically say: okay, for the parameters we have, how good are they at solving my problem? Then let's figure out a slightly better set of parameters, and a slightly better set again, basically following the surface of the loss function downwards; it's kind of like a marble rolling down to find the minimum. And as you can see here, depending on where you start, you can end up in different places; these are called local minima. Now interestingly, it turns out that for neural networks in particular there aren't really multiple different local minima; or to think of it another way, there are different parts of the space which are all equally good. So gradient descent turns out to be an excellent way to solve this problem of fitting parameters to neural networks. The problem, though, is that we need to do it in a reasonable amount of time, and it's really only thanks to GPUs that that's become possible.
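To make that structure concrete, here's a minimal sketch in plain NumPy of what "simple linear layers interspersed with simple non-linear layers" means; the function names here are made up for illustration, and this is just the forward pass, not the fitting of the parameters, which is what gradient descent will do for us.

```python
import numpy as np

rng = np.random.default_rng(0)

def nonlinearity(x):
    # any simple element-wise non-linear function works; here, a sigmoid
    return 1 / (1 + np.exp(-x))

def neural_net(x, layers):
    """Alternate simple linear layers (matrix multiplies) with simple
    element-wise non-linear layers; with enough parameters, this family
    of functions can approximate any given function arbitrarily closely."""
    for w in layers[:-1]:
        x = nonlinearity(x @ w)
    return x @ layers[-1]          # the final layer stays linear

# one hidden layer: 1 input -> 8 hidden units -> 1 output
layers = [rng.normal(size=(1, 8)), rng.normal(size=(8, 1))]
y = neural_net(np.array([[0.5]]), layers)
```

The interesting part is not this function itself, which is just multiplies and a squashing function, but how we set those weight matrices, which is where gradient descent comes in.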
So, GPUs. This chart shows, over the last few years, how many gigaflops per second you can get out of a GPU (that's the red and green) versus a CPU (that's the blue), and this is on a log scale, so you can see that, generally speaking, the GPUs are about ten times faster than the CPUs. And what's really interesting is that nowadays, not only is the Titan X about ten times faster than the E5-2699 CPU, but, well, actually a better one to look at would be the GTX 1080 Ti GPU, which costs about $700, whereas the CPU, which is ten times slower, costs over $4,000. So GPUs turn out to be able to solve these neural network parameter-fitting problems incredibly quickly and also incredibly cheaply, so they've been absolutely key in bringing these three pieces together. As I mentioned, in these neural networks you can intersperse multiple sets of linear and then non-linear layers. In the particular example that's drawn here, there's actually only one what we call hidden layer, one layer in the middle. Something we learned in the last few years is that these kinds of neural networks, although they do support the universal approximation theorem, that is, they can solve any given problem arbitrarily closely, require an exponentially increasing number of parameters to do so; so they're not actually fast and scalable for even reasonably sized problems. But we've since discovered that if you add multiple hidden layers, then you get super-linear scaling: you can add a few more hidden layers to get multiplicatively more accuracy on more complex problems. And that is where it becomes called deep learning. So deep learning means a neural network with multiple hidden layers. When you put all this together, it's actually really amazing what happens. Google started investing in deep learning in 2012: they hired Geoffrey Hinton, who's kind of the father of deep learning, and his top student Alex Krizhevsky, and they started trying to build a team.
That team became known as Google Brain. And because things with these three properties are so incredibly powerful and so incredibly flexible, you can actually see over time how many projects at Google use deep learning. My graph here only goes up to a year ago, but I know it's been continuing to grow exponentially since then as well. What you see now is that, around Google, deep learning is used in every part of the business. So it's really interesting to see how this kind of simple idea, that we can solve machine learning problems using an algorithm that has these properties, when a big company invests heavily in actually making it happen, leads to this incredible growth in how much it's used. For example, if you use the Inbox by Gmail software, then when you receive an email from somebody, it will often tell you: here are some replies that I could send for you. It's actually using deep learning here to read the original email and to generate some suggested replies. This is a really great example of the kind of stuff that previously just wasn't possible. Another great example: Microsoft has also, a little bit more recently, invested heavily in deep learning, and so now in Skype you can speak into it in English and ask it to translate, in real time, into Chinese or Spanish at the other end. And then when they talk back to you in Chinese or Spanish, Skype will translate their speech into English speech in real time. Again, this is an example of stuff which we can only do thanks to deep learning. I also think it's really interesting to think about how deep learning can be combined with human expertise. Here somebody has drawn something, just sketching it out, and then used a program called Neural Doodle (this is from a couple of years ago) to say: please take that sketch and render it in the style of an artist. And here's the picture that it then created, rendering it as an impressionist painting.
And I think this is a really great example of how you can use deep learning to combine human expertise with what computers are good at. So a few years ago I decided to try this myself: what would happen if I took deep learning and tried to use it to solve a really important problem? The problem I picked was diagnosing lung cancer. It turns out that if you can find lung nodules earlier, there's a ten times higher probability of survival, so it's a really important problem to solve. I got together with three other people; none of us had any medical background. We grabbed a data set of CT scans, and we used a convolutional neural network, much like the dogs-versus-cats one we trained at the start of today's lesson, to try to predict which CT scans had malignant tumors in them. We ended up, after a couple of months, with something with a much lower false negative rate and a much lower false positive rate than a panel of four radiologists. We went on to build this into a startup, a company called Enlitic, which has become pretty successful, and since that time the idea of using deep learning for medical imaging has become hugely popular and is being used all around the world. What I've generally noticed is that the vast majority of things that people do in the world currently aren't using deep learning; then, each time somebody says, let's try using deep learning to improve performance at this thing, they nearly always get fantastic results, and suddenly everybody in that industry starts using it as well. So there are just lots and lots of opportunities at this particular time to use deep learning to help with all kinds of different stuff. I've jotted down a few ideas here: these are all things which I know you can use deep learning for right now to get good results, and which people spend a lot of money on or which have a lot of important business opportunities.
There's lots more as well, but these are some examples of things that maybe at your company you could think about applying deep learning to. So let's talk about what's actually going on: what actually happened when we trained that deep learning model earlier. As I briefly mentioned, the thing we created is something called a convolutional neural network, or CNN, and the key piece of a convolutional neural network is the convolution. Here's a great example from a website called Explained Visually; I've got the URL up here. It has an example of a convolution in practice. Over here in the bottom left is a very zoomed-in picture of somebody's face, and over here on the right is an example of applying a convolution to that image. You can see that this particular thing is obviously finding edges: the edges of his head, top and bottom edges in particular. Now how is it doing that? Well, it's moving over each of these little 3x3 areas of pixels; here are the pixel values for each thing in that 3x3 area. And it's multiplying each one of those 3x3 pixel values by the corresponding one of these 3x3 kernel values. In a convolution, this specific set of 9 values is called a kernel. It doesn't have to be 9: it could be 4x4, or 5x5, or 2x2, or whatever. In this case it's a 3x3 kernel, and in fact in deep learning nearly all of our kernels are 3x3. So in this case the kernel is 1, 2, 1; 0, 0, 0; -1, -2, -1. We take each of the black-through-white pixel values and multiply each of them, as you can see, by the corresponding value in the kernel, and then we add them all together. If you do that for every 3x3 area, you end up with the values that you see over here on the right-hand side. So very low values become black.
Very high values become white, and you can see that when we're at an edge where it's black at the bottom and white at the top, we're obviously going to get higher numbers over here, and vice versa. So that's a convolution. As you can see, it's a linear operation, so based on the definition of a neural net I described before, this can be a layer in our neural network: it is a simple linear operation. We're going to look much more at convolutions later, including building a little spreadsheet that implements them ourselves. So the next thing we're going to do is add a non-linear layer. A non-linearity, as it's called, is something which takes an input value and turns it into some different value in a non-linear way. This orange picture here is an example of a non-linear function; specifically, this is something called a sigmoid. A sigmoid has this kind of S shape, and it's what we used to use as our non-linearities in neural networks a lot. Nowadays, though, we nearly always use something else called a ReLU, or rectified linear unit. A ReLU simply takes any negative numbers and replaces them with 0, and leaves any positive numbers as they are. In other words, in code, that would be y = max(x, 0): replace the negatives with 0. Regardless of whether you use a sigmoid or a ReLU or something else, the key point about taking this combination of a linear layer followed by an element-wise non-linear function is that it allows us to create arbitrarily complex shapes, as you see in the bottom right. The reason why (and this is all from Michael Nielsen's Neural Networks and Deep Learning at neuralnetworksanddeeplearning.com, a really fantastic interactive book) is that as you change the values of your linear functions, it basically allows you to build these arbitrarily tall or thin blocks and then combine those blocks together. And this is actually the essence of the universal approximation theorem.
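To make those two pieces concrete, here's a rough sketch in plain NumPy of a convolution using the 3x3 edge kernel above, followed by a ReLU. The function names are made up for illustration; this is just the idea, not how a deep learning library actually implements it.

```python
import numpy as np

# The 3x3 edge-detecting kernel from the example: it responds strongly
# where the image is bright above and dark below.
kernel = np.array([[ 1,  2,  1],
                   [ 0,  0,  0],
                   [-1, -2, -1]])

def conv2d(image, kernel):
    """Slide the kernel over every 3x3 patch: multiply each pixel by the
    corresponding kernel value, then add them all together."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = (patch * kernel).sum()
    return out

def relu(x):
    """The non-linearity: replace negatives with 0, keep positives as-is."""
    return np.maximum(x, 0)

# A tiny image: white (1) on top, black (0) below -> a horizontal edge.
image = np.array([[1., 1., 1., 1.],
                  [1., 1., 1., 1.],
                  [0., 0., 0., 0.],
                  [0., 0., 0., 0.]])

edges = conv2d(image, kernel)
print(relu(edges))   # every patch straddles the edge, so all values are high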
This idea, that when you have a linear layer feeding into a non-linearity you can create these arbitrarily complex shapes, is the key idea behind why neural networks can solve any computable problem. So then we need a way, as we described, to actually set these parameters. It's all very well knowing that we can move the parameters around manually to create different shapes, but if we have some specific shape we want, how do we get to it? As we discussed earlier, the basic idea is to use something called gradient descent. This is an extract from a notebook, actually one of the fastai lessons, and it shows an example of using gradient descent to solve a simple linear regression problem. But let me show you the basic idea. Let's say you had a simple quadratic, and you were trying to find its minimum. To find the minimum, you start out by picking some point: so we'll say, okay, let's pick here. You go up there and calculate the value of your quadratic at that point. What you now want to do is find a slightly better point. So you could move a little bit to the left and a little bit to the right to find out which direction is down, and what you'd find is that moving a little bit to the left decreases the value of the function: that looks good. In other words, we're calculating the derivative of the function at that point; that tells you which way is down. It's the gradient. So now that we know that going to the left is down, we can take a small step in that direction to create a new point, and then we can repeat the process: okay, which way is down now? One step, and another step, and another step, and each time we're getting closer and closer. So the basic approach is to say: okay, we're at some point, we've got some value x, which is our current guess at time step n.
So our new guess at time step n+1 is just equal to our previous guess minus the derivative times some small number, because we want to take a small step. We need to pick a small number, because if we picked a big number, then we'd say: okay, we know we want to go to the left, let's jump a big long way to the left. We could go all the way over here and actually end up worse; then we do it again, and now we're even worse again. So if your step size is too high, you can actually end up with divergence rather than convergence. This number here, and we're going to be talking about it a lot during this course, and writing all this stuff out in code from scratch ourselves, is called the learning rate. Okay, so you can see here an example of basically starting out with some random line and then using gradient descent to gradually make the line better and better. So what happens when you combine these ideas: the convolution, the non-linearity, and gradient descent? Because they're all tiny, simple little things, it doesn't sound that exciting. But if you have enough of these kernels with enough layers, something really interesting happens, and we can actually draw it. This is a really interesting paper by Matt Zeiler and Rob Fergus, and what they did a few years ago was figure out how to basically draw a picture of what each layer in a deep learning network learnt. They showed that, for layer one of the network, here are nine examples of convolutional filters from a trained network, and they found that some of the filters learnt these diagonal lines or simple little grid patterns, and some of them learnt these simple gradients. And for each of these filters, they show nine examples of little pieces of actual photos which activate that filter quite highly.
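Going back to the gradient descent update we just walked through, here's a toy sketch of it in code on a simple quadratic; the names are made up for illustration, and a real neural network would have millions of parameters rather than one.

```python
def gradient_descent(grad, x0, lr, steps):
    """Repeatedly step downhill: new guess = old guess - lr * gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# f(x) = (x - 3)^2 has its minimum at x = 3; its derivative is 2 * (x - 3).
grad = lambda x: 2 * (x - 3)

# a suitably small learning rate converges towards the minimum at 3
print(gradient_descent(grad, x0=0.0, lr=0.1, steps=100))

# too big a learning rate overshoots further each step and diverges
print(gradient_descent(grad, x0=0.0, lr=1.1, steps=10))
```

The only difference between the two calls is the learning rate, which is exactly the "small number" in the update rule above: too small and it crawls, too big and it diverges.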
So you can see, for layer one, these filters are learnt using gradient descent: they were not programmed, they were learnt. In other words, we were learning these nine numbers. Layer two then takes these as inputs and combines them together. Here are nine attempts to draw examples of the filters in layer two; they're pretty hard to draw, but what we can do is say, for each filter, here are examples of little bits of images that activated it. And you can see that by layer two we've got something that's being activated nearly entirely by little bits of sunset, something that's being activated by circular objects, something that's being activated by repeating horizontal lines, and something that's being activated by corners. So you can see how we're basically combining layer one features together. If we combine those features together (and again, these are all convolutional filters learnt through gradient descent), by the third layer one filter has actually learnt to recognise the presence of text, another has learnt to recognise the presence of petals, and another has learnt to recognise the presence of human faces. So just three layers is enough to get some pretty rich behaviour.
By the time we get to layer five, we've got something that can recognise the eyeballs of insects and birds, and something that can recognise unicycle wheels. So we start with something incredibly simple, but if we use it at a big enough scale, thanks to the universal approximation theorem and the use of multiple hidden layers in deep learning, we actually get these very, very rich capabilities. That is what we used when we trained our little dog-versus-cat recognizer. So let's talk more about this dog-versus-cat recognizer. We've learnt that we can look at the pictures that come out the other end to see what the model is classifying well, or classifying badly, or which ones it's unsure about. But let's talk about this key thing I mentioned: the learning rate. I mentioned we have to set this thing, which I just called L before. You might have noticed there are a couple of magic numbers in our code. The first one is the learning rate, which you multiply the gradient by when you're taking each step in your gradient descent. We already talked about why you wouldn't want it to be too high, but it's probably also obvious why you wouldn't want it to be too low: if it's too low, you'd take a little step, and be a little bit closer, and it would take lots and lots and lots of steps, and it would take too long. So setting this number well is really important, and for the longest time this was driving deep learning researchers crazy, because they didn't really know a good way to set it reliably. The good news is that last year a researcher came up with an approach to set the learning rate quite reliably. Unfortunately almost nobody noticed, so almost no deep learning researchers I know of are actually aware of this approach, but it's incredibly successful and incredibly simple, and I'll show you the idea. It's built into the fastai library as something called lr_find, the learning rate finder, and it
comes from this paper, actually a 2015 paper, Cyclical Learning Rates for Training Neural Networks, by a terrific researcher called Leslie Smith. I'll show you Leslie's idea. It starts with the same basic setup we've seen before: if we're going to optimize something, pick some random point and take its gradient. But then, specifically, he said: take a tiny, tiny step, a learning rate of like 1e-7, and then do it again and again, but each time increase the learning rate, like, double it. So then we try 2e-7, 4e-7, 8e-7, and so on, and gradually your steps are getting bigger and bigger. You can see what's going to happen: it's going to start doing almost nothing, then suddenly the loss function will improve very quickly, but then it's going to step even further, and even further again (let me draw the rest of that line to be clear), and suddenly it's going to shoot off and get much worse. So the idea then is to go back and say: at what point did we see the best improvement? Here we've got our best improvement, so we'd say, okay, let's use that learning rate. In other words, if we were to plot the learning rate over time, it was increasing like so, and what we then want to do is plot the learning rate against the loss. When I say the loss, I basically mean how accurate is the model; in this case the loss would be how far away the prediction is from the goal. If we plotted the learning rate against the loss, we'd see that initially it didn't do very much for small learning rates, then it suddenly improved a lot, and then it suddenly got a lot worse. So that's the basic idea, and we're looking for the point where this graph is dropping quickly. We're not looking for its minimum point; we're not saying where it was lowest, because that could actually be the point where it's just jumped too far. We want the point where it was dropping the fastest. So if
you create your learn object in the same way that we did before (we'll be learning more about these details shortly), and you then call the lr_find method on it, you'll see that it'll start training a model like it did before, but it'll generally stop before it gets to 100%, because if it notices that the loss is getting a lot worse, it stops automatically; you can see here it stopped at 84%. Then you can call learn.sched: that gets you the learning rate scheduler, which is the object that actually does this learning rate finding, and that object has a plot_lr function. So you can see here, by iteration, the learning rate: at each step the learning rate is getting bigger and bigger. Here we can see it increasing exponentially; another way, which is how Leslie Smith did it, is linearly. I'm actually currently researching both of these approaches to see which works best; recently I've been mainly using exponential, but I'm starting to look more at using linear at the moment. If we then call sched.plot, that does the plot I just described, learning rate versus loss. We're looking for the highest learning rate we can find where the loss is still clearly improving. In this case I would say 10 to the negative 2, because at 10 to the negative 1 it's not improving, and at 10 to the negative 3 it is also improving, but I'm trying to find the highest learning rate where it's still clearly improving, so I'd say 10 to the negative 2. You might have noticed that when we ran our model before, we used 10 to the negative 2, that is, 0.01: that's why we picked that learning rate. So there's really only one other number that we have to pick, and that was this number 3. That number controls how many epochs we run. An epoch means going through our entire data set of images. Each time we do, we grab what are called mini-batches, like 64 images at a time, and use them to try to improve the model a
little bit using gradient descent; using all of the images once is called one epoch. At the end of each epoch we print out the accuracy, and the validation and training loss. So the question of how many epochs to run is the one other question you need to answer to run these three lines of code, and the answer, really, is: as many as you like. What you might find is that if you run it for too long, the accuracy will start getting worse; we'll learn later why that happens, it's something called overfitting. So you can run it for a while, run lots of epochs, and once you see it getting worse, you know how many epochs you can run. The other thing that might happen is that if you've got a really big model, or lots and lots of data, maybe it takes so long that you don't have time, and so you just run as many epochs as fit into the time you have available. So the number of epochs is a pretty easy thing to set. Those are the only two numbers you're going to have to set. So the goal this week will be to make sure that you can run not only these three lines of code on the data that I've provided, but to run it on a set of images that you either have on your computer, or that you get from work, or that you download from Google. Try to get a sense of which kinds of images this seems to work well for and which ones it doesn't, what kind of learning rates you need, how many images and how many epochs you need, how the learning rate changes the accuracy you get, and so forth: really experiment. Then try to get a sense of what's inside this data object, what the y values look like, what these classes mean. And if you're not familiar with NumPy, really practice a lot with NumPy, so that by the time you come back for the next lesson, when we get into a lot more detail, you'll really feel ready for that. Now, one thing that's really important to be able to do all that is that
You need to really know how to work with NumPy, the fastai library, and so forth, so I want to show you some tricks in Jupyter Notebook that make that much easier.

One trick to be aware of: if you can't quite remember how to spell something, or you're not quite sure what the method you want is, you can always hit Tab and you'll get a list of methods that start with those letters, so that's a quick way to find things. If you then can't remember what the arguments to a method are, hit Shift-Tab: it tells you the arguments to the method. Shift-Tab is one of the most helpful things I know. So let's take np.exp and hit Shift-Tab. Now you might be wondering what this function does and how it works; if you press Shift-Tab twice, it brings up the documentation, shows you what the parameters are and what it returns, and gives you examples. If you press it three times, it pops up a whole separate little window with that information. One way to grab that window straight away is to put a question mark at the start: that brings up the little documentation window directly.

The other thing to be aware of is that, increasingly during this course, we're going to be looking at the actual source code of fastai itself and learning how it's built. It's really helpful to look at source code in order to understand what you can do and how you can do it. If you want to look at the source code for learn.predict, for example, you can just put two question marks in front, and the source code pops up. It's just a single line of code; you'll very often find that fastai methods are designed to never be more than about half a screen of code, and they're often under six lines. You can see in this case it's calling another function, so we could then get the source code for that in the same way; that in turn calls a function called predict_with_targs, and we could get the documentation for that in the same way too. And here we are: finally, that's what it does. It iterates through a data loader, gets the predictions, and passes them back. So: two question marks to get source code, a single question mark to get documentation, and Shift-Tab to bring up the parameters, or press it more times to get the docs.

Another really helpful thing to know is how to use Jupyter Notebook well, and the key you want to know is H. If you press H, it brings up the keyboard shortcuts palette, and now you can see exactly what Jupyter Notebook can do and how to do it. I personally find all of these functions useful, so I generally tell students to try to learn four or five different keyboard shortcuts a day: try them out, see what they do, and practice them in that session.

One very important thing to remember: when you're finished with your work for the day, go back to Paperspace and click that little button which stops and starts the machine. After it has stopped, you'll see it says "connection closed" and you'll see it's off. If you leave it running, you'll be charged for it. The same goes for Crestle: be sure to go to your Crestle instance and stop it. You can't just turn your computer off or close the browser; you actually have to stop the instance in Crestle or in Paperspace. Don't forget to do that, or you'll keep being charged until you finally remember.

Okay, I think that's all the information you need to get started. Please remember the forums: if you get stuck at any point, check them out. But before you do, make sure you read the information on course.fast.ai for each lesson, because that is going to tell you about anything that has changed; if there's been some change to which notebook provider we suggest using, or how to set up Paperspace, or anything like that, it will all be on course.fast.ai. Okay, thanks very much for watching, and I look forward to seeing you in the next lesson.
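A footnote on the `?`, `??`, and Shift-Tab tricks described above: they're IPython/Jupyter conveniences, but plain Python's standard inspect module can recover the same information in an ordinary script, which is handy to know. A minimal sketch (the `area` function here is just a made-up example):

```python
import inspect

def area(width, height=1.0):
    """Return the area of a rectangle."""
    return width * height

# `?area` in a notebook shows roughly the signature and docstring:
print(inspect.signature(area))   # (width, height=1.0)
print(inspect.getdoc(area))      # Return the area of a rectangle.

# `??area` shows the source; inspect.getsource does the same for any
# object whose source file is available, e.g. a library function:
print(inspect.getsource(inspect.getdoc).splitlines()[0])
```

The same calls work on fastai or NumPy functions once they're imported, so you can explore a library's source from any Python session, not just a notebook.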