Okay, so let me introduce everybody to everybody else first of all. We're here at the University of San Francisco learning machine learning, or you might be at home watching this on video. So hey everybody, wave: here are the University of San Francisco graduate students. Thank you, everybody, and wave back from the future and from home to all the students here.

If you're watching this on YouTube, please stop, and instead go to course.fast.ai and watch it from there. There's nothing wrong with YouTube, but I can't edit these videos after I've created them, so I need to be able to give you updated information about things like what environments to use as the technology changes. So you need to go there. You can also watch the lessons from there; there are lots of lessons and so forth. That's tip number one for the video.

Tip number two for the video: because I can't edit them, all I can do is add these things called cards. Cards are little things that appear in the top right-hand corner of the screen. By the time this video comes out, I'm going to put a little card there for you to click on, so try that out. Unfortunately they're not easy to notice, so keep an eye out for them, because they're going to be important updates to the video.

All right, so welcome. We're going to be learning about machine learning today. For everybody in the class here: you all have Amazon Web Services set up, so you might want to go ahead and launch your AWS instance now, or launch a Jupyter notebook on your own computer. If you don't have Jupyter notebook set up, then what I recommend is that you go to crestle.com, sign up, and you can then turn off "Enable GPU" and click "Start Jupyter", and you'll have a Jupyter notebook instantly. It costs you some money: three cents an hour. So if you don't mind spending three cents an hour to learn machine learning,
here's a good way. So I'm going to go ahead and say "Start Jupyter". Whatever technique you use, one of the things you'll find on the website is links to lots of information about the costs, benefits, and approaches to setting up lots of different environments for Jupyter notebook, both for deep learning and for regular machine learning. So check them out, because there are lots of options.

If I then go "Open Jupyter" in a new tab, here I am, in Crestle, or on AWS, or your own computer. We use the Anaconda Python distribution for basically everything; you can install that yourself, and again there's lots of information on the website about how to set that up. We're also assuming that either you're using Crestle, or something else which I really like called paperspace.com, which is another place where you can fire up a notebook pretty much instantly. Both of these already have all of the fast.ai stuff pre-installed for you, so as soon as you open up Crestle or Paperspace (assuming you chose the Paperspace fast.ai template), you'll see that there's a fastai folder.

If you're using your own computer or AWS, you'll need to go to our GitHub repo, fastai/fastai, and clone it, and then you'll need to do a conda env update to install the libraries. Again, that's all information we've got on the website, and we've got some previous workshop videos to help you through all of those steps. So for this class, I'm assuming that you have a Jupyter notebook running.
Okay, so here we are in the Jupyter notebook. If I click on fastai, that's what you get if you git clone, or if you're on Crestle; you can see our repo here. All of our lessons are inside the courses folder, and machine learning part 1 is in the ml1 folder. If you're ever looking at my screen and wondering "where are you?", look up here and you'll see it tells you the path: fastai/courses/ml1. Today we're going to be looking at lesson 1, random forests. So here is lesson 1 RF.

There are a couple of different ways you can do this, both here in person and on the video. You can either attempt to follow along as you watch, or you can just watch and then follow along later with the video. It's up to you. I would maybe have a loose recommendation to watch now and follow along with the video later, just because it's quite hard to multitask, and if you're working on something you might miss a key piece of information, which you're welcome to ask about, okay. But if you follow along with the video afterwards, then you can pause, stop, experiment, and so forth. Anyway, you can choose either way.

I'm going to go View, Toggle Header; View, Toggle Toolbar; and then full-screen it, to get a bit more space.

The basic approach we're going to be taking here is to get straight into code and start building models, not to look at theory first. We're going to get to all the theory, but at the point where you deeply understand what it's for, and at the point that you're able to be an effective practitioner. So my hope is that you're going to spend your time focusing on experimenting. If you take these notebooks and try different variations of what I show you, and try it with your own data sets, the more coding you can do, the better; the more you'll learn.
My suggestion, or at least what all of my students have told me, is that the ones who went away and spent their time studying books of theory rather than coding found that they learnt less machine learning, and they often tell me they wish they'd spent more time coding.

A lot of the stuff that we're showing in this course has never been shown before. This is not a summary of other people's research; this is more a summary of 25 years of work that I've been doing in machine learning. So a lot of this is going to be shown for the first time, and that's kind of cool, because if you want to write a blog post about something you learn here, you might be writing about something that a lot of people find super useful. So there's a great opportunity to practice your technical writing, by showing people stuff where it's not "hey, I just learned this thing, I bet you all know it"; often it'll be "I just learned this thing and I'm going to tell you about it", and other people haven't seen it. In fact, this is the first course ever that's been built on top of the fastai library, so even just the stuff in the library is going to be new to basically everybody.

So, when we use a Jupyter notebook, or anything else in Python, we have to import the libraries that we're going to use. Something that's quite convenient is that if you use these two autoreload commands at the top of your notebook, you can go in and edit the source code of the modules, and your notebook will automatically update with those new modules; you won't have to restart anything. So that's super handy. Then, to show your plots inside the notebook, you'll want matplotlib inline. So these three lines appear at the top of all of my notebooks.

You'll notice when I import the libraries that, for anybody here who is an experienced Python programmer, I am doing something that would be widely considered very inappropriate.
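(For reference, the three header lines I just mentioned, as a notebook cell. These are Jupyter/IPython magics, so they only work inside a notebook, not in a plain Python script.)

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline
```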
The inappropriate thing I'm doing is importing star. Generally speaking, in software engineering we're taught to specifically figure out what we need and import those things. The more experienced you are as a Python programmer, the more extremely offensive practices you're going to see me use; for example, I don't follow what's called PEP 8, which is the normal style of code used in Python. So I'm going to mention a couple of things. First: go along with it for a while. Don't judge me just yet. There are reasons that I do these things, and if it really bothers you, then feel free to change it.

But the basic idea is that data science is not software engineering. There's a lot of overlap: we're using the same languages, and in the end these things may become software engineering projects. But what we're doing right now is prototyping models, and prototyping models has a very different set of best practices, which are taught basically nowhere. They're not even really written down. The key is to be able to do things very interactively and very iteratively. So, for example, `from library import *` means you don't have to figure out ahead of time what you're going to need from that library; it's all there.

Also, because we're in this wonderful interactive Jupyter environment, it lets us understand what's in the libraries really well. For example, later on I'm using a function called display. An obvious question is: what is display? You can just type the name of the function and press Shift+Enter (Shift+Enter is how you run a cell), and it will tell you where it's from.
So any time you see a function you're not familiar with, you can find out where it's from. Then, if you want to find out what it does, put a question mark at the start, and here you have the documentation. And, particularly helpful for the fastai library: in fastai I try to make as many functions as possible be no more than about five lines of code. It's designed to be really easy to read. If you put a second question mark at the start, it shows you the source code of the function. So you get all the documentation plus the source code, and nothing has to be mysterious.

The other library we'll use a lot is scikit-learn, which implements a lot of machine learning stuff in Python. The scikit-learn source code is often pretty readable, so very often, if I want to really understand something, I'll just type question mark, question mark, and the name of the scikit-learn function, and go ahead and read the source code. As I say, the fastai library in particular is designed to have source code that's very easy to read, and we're going to be reading it a lot.

All right, so today we're going to be working on a Kaggle competition called Blue Book for Bulldozers. The first thing we need is to get that data. If you search for "kaggle bulldozers", you can find it. Kaggle competitions let you download a real-world data set, a real problem that somebody's trying to solve, and solve it according to a specification that the actual person with the actual problem decided would actually be helpful to them. So these are pretty authentic experiences of applied machine learning. Of course, you're missing all the bits that went before: why did this company decide that predicting the auction sale price of bulldozers was important? Where did they get the data from? How did they clean the data? And so forth.
That's all important stuff as well, but the focus of this course is really on what happens next, which is: how do you actually build the model?

One of the great things about working on Kaggle competitions, whether they're running now or are old ones, is that you can submit to the leaderboard. Even on old, closed competitions you can submit to the leaderboard and find out how you would have gone. And there's really no other way in the world of knowing whether you're competent at this kind of data and this kind of model than doing that. Because otherwise, if your accuracy is really bad, is it because this is just very hard, it's just not possible, the data is so noisy you can't do better? Or is it actually an easy data set and you've made a mistake? When you finish this course and apply it to your own projects, this is something you're going to find very hard, and there isn't a simple solution to it: you're now using something that hasn't been on Kaggle, it's your own data set, and do you have a good enough answer or not? We'll talk about that more during the course, and in the end we just have to know that we have good, effective techniques for reliably building baseline models. Otherwise there's really no way to know; no way other than creating a Kaggle competition, or getting a hundred top data scientists to work on your problem, to really know what's possible. So Kaggle competitions are fantastic for learning. As I've said many times, I've learned more from competing in Kaggle competitions than from everything else I've done in my life.

So, to compete in a Kaggle competition, you need the data. This one's an old competition.
It's not running now, but we can still access everything. First of all we want to understand what the goal is. I suggest you read this later, but basically we're going to try and predict the sale price of heavy equipment. One of the nice things about this competition is that, if you're like me, you probably don't know very much about heavy industrial equipment auctions. I actually know more than I used to, because my toddler loves building equipment, so we watch YouTube videos about front-end loaders and forklifts. But two months ago I was a real layman.

One of the nice things is that machine learning should help us understand a data set, not just make predictions about it. So by picking an area which we're not familiar with, it's a good test of whether we can build an understanding. Because otherwise, what can happen is that your intuition about the data can make it very difficult for you to be open-minded enough to see what the data really says.

It's easy enough to download the data to your computer: you just have to click on the data set, here it's Train.zip, and click download. So you can go ahead and do that if you're running on your own computer right now. If you're running on AWS it's a little bit harder, because unless you're familiar with text-mode browsers like elinks, it's quite tricky to get the data set from Kaggle.
So, a couple of options. One is that you could download it to your computer and then scp it to AWS; scp works just like ssh, but it copies data rather than logging in. But I'll show you a trick that I really like. It relies on using Firefox; for some reason, Chrome doesn't work correctly with Kaggle for this.

So I go, in Firefox, to the website, and what we're going to do is use something called the JavaScript console. Every web browser comes with a set of tools for web developers to help them see what's going on, and you can get to it through the menu here, or hit Ctrl+Shift+I. So hit Ctrl+Shift+I to bring up the web developer tools, and one of the tabs is Network. Then, if I click on Train.zip and click download (and I'm not even going to download it, I'm just going to say cancel), you'll see down here that it's shown me all of the network connections that were just initiated. And here's one which is downloading a zip file from storage.googleapis.com/blah-blah-blah. That's probably what I want. All right, that looks good.
So what you can do is right-click on that and say Copy, Copy as cURL. curl is a Unix command, like wget, that downloads stuff. So if I go Copy as cURL, that's going to create a command that has all of my cookies, headers, everything in it necessary to download this authenticated data set. If I now go into my server and paste it, you can see a really, really long curl command.

One thing I've noticed is that at least recent versions have started adding this "--2.0" thing to the command, which doesn't seem to work with all versions of curl. So something you might want to do is pop the command into an editor, find that, get rid of it, and then use the edited command instead.

Now, one thing to be very careful about: by default, curl downloads the file and displays it in your terminal. If I try to display this one, it's going to display gigabytes of binary data in my terminal and crash it. So, to say that I want to output it to a file with a different name, I always type -o (for output file name) and then the name of the file, bulldozers-dot-something; make sure you give it a suitable extension. In this case the file was Train.zip, so: bulldozers.zip. There it is. So I can mkdir bulldozers and move my zip file in there (I got that the wrong way around at first). Then, if you don't have unzip installed, you may need to sudo apt install unzip; or if you're on a Mac, that would be brew install unzip (if brew doesn't work, you haven't got Homebrew installed, so make sure you install it). And then unzip. So those are the basic steps.

One nice thing is that if you're using Crestle, most of the data sets should already be pre-installed for you. So what I can do here is open a new tab. Here's a cool trick in Jupyter: you can say New, Terminal, and you actually get a web-based terminal. And so you'll find on Crestle
there's a /datasets folder: /datasets/kaggle, /datasets/fast.ai. Often the things you need are going to be in one of those places. So, assuming we don't have it already downloaded (Paperspace, actually, should have most of them as well), we would need to go into fastai, then into the courses/ml1 folder. What I tend to do is put all of my data for a course into a folder called data. If you're using git like we use, you'll find that this folder doesn't get added to git, because it's in the .gitignore. So don't worry about creating the data folder; it's not going to screw anything up. So I generally make a folder called data, and then I tend to create folders in there for everything I need. In this case: mkdir bulldozers, cd into it (and remember, the shortcut for the last word of the previous command is !$), then I'll go ahead and grab that curl command again, and unzip bulldozers.zip. There we go.

You can now see that I generally take anything that might change from person to person and put it in a constant, so here I've just defined PATH. If you've used the same path I just did, you should be able to go ahead and run that. So let's keep moving along: we've now got all of our libraries imported, and we've set the path to the data.

You can run shell commands from within a Jupyter notebook by using an exclamation mark. So if I want to check what's inside that path, I can go !ls data/bulldozers, and you can see that works. Or you can even use Python variables: if you use a Python variable inside a Jupyter shell command, you have to put it in curly braces. So that makes me feel good that my path is pointing at the right place. If you say !ls {PATH} and you get nothing at all, then you're pointing at the wrong spot. Yes?
Yes And this up here Yeah, so the curly brackets refer to the fact that I put an exclamation mark at the front which means the rest of this is not a Python command it's a bash command and bash doesn't know about capital path because capital path is part of Python so this is a special Jupyter thing which says expand this Python thing please before you pass it to the shell So the goal here is to use the training set Which contains data through the end of 2011 to predict the sale price of full doses and so The main thing to start with then is of course to look at the data Now the data is in CSV format Right, so one easy way to look at the data would be to use shell command head to look at the first two lines Head full doses and even tab completion works here. It's Jupyter does everything Right, so here's the first few Five lines. Okay, so there's like a bunch of column headers and then there's a bunch of data So that's pretty hard to look at so what we want to do is take this and read it into a nice tabular format okay, so Just Terence putting glasses on me. I should make this bigger or is it okay? Is this big enough font size? so This kind of data where you've got columns Representing a wide range of different types of things such as an identifier of value currency a date a size I refer to this as structured data now I say I refer to this as structured data because like there's there's been many arguments in the machine learning community on Twitter about What is structured data? Weirdly enough. 
this is the most important type of distinction: between data that looks like this, and data like images, where every column is of the same type. That's the most important distinction in machine learning, yet we don't have standard accepted terms for it. So I'm going to use the terms structured and unstructured, but note that other people you talk to, particularly NLP people, use "structured" to mean something totally different. When I refer to structured data, I mean columns of data that can have varying different types of data in them.

By far the most important tool in Python for working with structured data is pandas. Pandas is so important that it's one of the few libraries that everybody uses the same abbreviation for, which is pd. You'll see that one of the things I've got here is `from fastai.imports import *`. The fastai imports module has nothing but imports of a bunch of hopefully useful tools. All of the code for fastai is inside the fastai directory inside the fastai repo, so you can have a look at imports and you'll see it's literally just a list of imports, and you'll find there `import pandas as pd`. Everybody does this, so when you see lots of people using pd-dot-something, they're always talking about pandas.

So pandas lets us read a CSV file. When we read the CSV file, we just tell it the path to the CSV file, a list of any columns that contain dates, and I always add low_memory=False, which makes it actually read more of the file to decide what the types are. This here is something called a Python 3.6 format string.
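Before moving on, here's that read_csv call as a sketch, using a tiny inline stand-in for Train.csv (the values and the SalesID column are just placeholders; the real file has around 400,000 rows and many more columns):

```python
import io
import pandas as pd

# Tiny stand-in for Train.csv
csv = io.StringIO(
    "SalesID,saledate,SalePrice\n"
    "1,11/16/2006 0:00,66000\n"
    "2,3/26/2004 0:00,57000\n"
)
# parse_dates turns the named columns into datetimes;
# low_memory=False makes pandas scan more of the file before guessing column types
df_raw = pd.read_csv(csv, low_memory=False, parse_dates=["saledate"])
print(df_raw.dtypes["saledate"])   # datetime64[ns]
```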
The Python 3.6 format string is one of the coolest parts of Python 3.6. You've probably used lots of different ways of interpolating variables into your strings in Python in the past; Python 3.6 has a very simple way that you'll probably always want to use from now on. To create this kind of string, you type an f at the start, and then, if I define a variable, I can say hello, curly braces, Python variable. This is kind of confusing: these are not the same curly braces we saw earlier in that ls command. That ls command is specific to Jupyter, and it interpolates Python code into shell code. These curly braces are Python 3.6 format-string curly braces, and they require an f at the start. So if I get rid of the f, it doesn't interpolate; the f tells it to interpolate. And the cool thing is that inside those curly braces you can write just about any Python code you like. So, for example, name.upper() gives "Hello, JEREMY". I use this all the time. And because it's a format string, it doesn't matter if the thing is (they always forget my age; I think I'm 43) an integer. Normally, if you do string concatenation with integers, Python complains; no such problem here.

So this is going to read path/Train.csv into a thing called a DataFrame. Pandas data frames and R's data frames are pretty similar, so if you've used R before, you'll find this reasonably comfortable. This file is, sorry, 112 MB, and it has 400,000 rows in it. So it takes a moment to import, but when it's done we can type the name of the data frame, df_raw, and then use various methods on it. For example, df_raw.tail() will show us the last few rows of the data frame. By default it's going to show the columns along the top and the rows down the side, but in this case there are a lot of columns,
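The f-string behaviour described above, as a minimal sketch:

```python
name = "Jeremy"
age = 43

# The f prefix turns the braces into interpolation sites...
print(f"Hello {name}")          # Hello Jeremy
# ...without the f, the braces are just literal characters
print("Hello {name}")           # Hello {name}
# Any expression works inside the braces, including method calls and integers
print(f"Hello {name.upper()}, you are {age}")   # Hello JEREMY, you are 43
```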
so because there are so many columns, I've just said .transpose() to show it the other way around.

I've created one extra function here, display_all. Normally, if you just type df_raw and it's too big to show conveniently, it truncates it and puts little ellipses in the middle. The details don't matter, but this just changes a couple of settings to say: even if it's got a thousand rows and a thousand columns, please still show the whole thing.

Okay, so this has finished, and I can actually show you that. If I just type a variable, and this is really cool in Jupyter notebook, you can type a variable of almost any kind (video, HTML, an image, whatever) and it'll generally figure out a way of displaying it for you. In this case it's a pandas data frame, so it figures out a way of displaying it for me. And you can see here that by default it doesn't actually show me the whole thing.

So here's the data set. We've got a few different rows; this is the last bit, the tail of it, the last few rows. This is the thing we want to predict: price. We call this the dependent variable; the dependent variable is the price. And then we've got a whole bunch of things we could predict it with. And when I start with a data set, I tend...

Yes, Terence? ("Can I give you this?" "Hello, Jeremy." "Hi, Terence.") "I've read in books that you should never look at the data, because of the risk of overfitting. Why do you start by looking at the data?"

Yeah, so I was actually going to mention that I actually kind of don't. I want to find out at least enough to know that I've managed to import it okay, but I tend not to really study it at all at this point, because I don't want to make too many assumptions about it.
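(As an aside, the display_all helper mentioned above is just a couple of pandas option overrides. A minimal version might look like this; the notebook one calls display rather than print, but the idea is the same.)

```python
import pandas as pd

def display_all(df):
    # Temporarily lift pandas' truncation limits so the whole frame is shown,
    # then restore them automatically when the `with` block exits
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        print(df)   # inside a notebook you'd call display(df) instead

df = pd.DataFrame({"a": range(5), "b": range(5)})
display_all(df.tail().transpose())
```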
On the question of looking at the data, I would actually say most books say the opposite: most books do a whole lot of EDA, exploratory data analysis, first. ("Academic books?") The academic books I've read say that's one of the biggest risks of overfitting. ("The practical books say: let's do some EDA first.") Yeah. So the truth is somewhere in between, and I generally try to do machine-learning-driven EDA, and that's what we're going to learn today.

The thing I do care about, though, is: what's the purpose of the project? For Kaggle projects the purpose is very easy; we can just look and find out. There's always an evaluation section: how is it evaluated? This one is evaluated on root mean squared log error (RMSLE). That means they're going to look at the difference between the log of our prediction of the price and the log of the actual price, then square it, and add those up. Because they're going to be focusing on the difference of the logs, that means we should focus on the logs as well. And this is pretty common: for a price, generally you care not so much about "did I miss by $10" but "did I miss by 10%". If it's a million-dollar thing and you're a hundred thousand dollars off, or it's a ten-thousand-dollar thing and you're a thousand dollars off, we would often consider those equivalent-scale issues. So for this auction problem, the organizers are telling us they care about ratios more than differences.
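Written out, the metric looks like this. This is my own sketch of it, not the competition's official scoring code:

```python
import numpy as np

def rmsle(pred, actual):
    """Root mean squared log error: square the differences of the logs,
    average them, then take the square root."""
    return np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))

# Being 10% off scores the same whether the item costs $1.1m or $11k:
print(rmsle(np.array([1_100_000.0]), np.array([1_000_000.0])))  # ~0.0953
print(rmsle(np.array([11_000.0]), np.array([10_000.0])))        # ~0.0953
```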
And so the log is the thing we care about. So the first thing I do is take the log. Now, np is numpy; I'm assuming you have some familiarity with numpy. If you don't, we've got a video called the deep learning workshop, which actually isn't just for deep learning, it's basically for this as well, and one of the parts there (which we've got a time-coded link to) is a quick introduction to numpy. Basically, numpy lets us treat arrays, matrices, vectors, and high-dimensional tensors as if they're Python variables, and we can do stuff like take the log, and it'll apply it to everything. Numpy and pandas work together very nicely. In this case, df_raw.SalePrice is pulling a column out of a pandas data frame, which gives us a pandas Series; it shows us the sale prices and the indexes. And a Series can be passed to a numpy function, which is pretty handy. So you can see here how I can replace a column with a new column. Pretty easy.

Okay, now that we've replaced SalePrice with its log, we can go ahead and try to create a random forest. What's a random forest? We'll find out in detail, but in brief: a random forest is a kind of universal machine learning technique. It's a way of predicting something that can be of any kind. It could be a category, like "is it a dog or a cat", or it could be a continuous variable, like price. It can predict it with columns of pretty much any kind: pixel data, zip codes, revenues, whatever. In general, it doesn't overfit (it can, and we'll learn to check whether it is, but it doesn't generally overfit too badly, and it's very, very easy to stop it from overfitting). You don't need (and we'll talk more about this) a separate validation set, in general; it can tell you how well it generalizes even if you only have one data set. It has few, if any, statistical assumptions: it doesn't assume that your data is normally distributed, and it doesn't assume that the relationships are linear.
It doesn't assume that you've specified the interactions. It requires very few pieces of feature engineering for many different types of situations: you don't have to take the log of the data, and you don't have to multiply interactions together. So, in other words, it's a great place to start. If your first random forest does very little useful, that's a sign that there might be problems with your data; it's designed to work pretty much first off.

(Can you please throw the mic towards this gentleman? Thank you.)

Yeah, great question. So there's this concept of the curse of dimensionality. In fact, there are two concepts I'll touch on: the curse of dimensionality and the no free lunch theorem. These are two concepts you'll often hear a lot about. They're both largely meaningless and basically stupid, and yet I would say maybe the majority of people in the field not only don't know that, but think the opposite. So it's well worth explaining.

The curse of dimensionality is this idea that the more columns you have, the more it creates a space that's more and more empty. And there's this kind of fascinating mathematical idea, which is that the more dimensions you have, the more all of the points sit on the edge of that space. If you've just got a single dimension, and things are random, then they're spread out all over. Whereas if it's a square, then it's a little bit less likely that a point is not on the edge of either dimension. Each dimension you add, it becomes multiplicatively less likely that the point isn't on the edge of at least one dimension.
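That multiplicative argument is easy to check numerically. Here's a quick sketch of my own: sample uniform random points in the unit hypercube and count how many fall within 5% of at least one face:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 2, 10, 100):
    pts = rng.random((10_000, d))
    # A point is "on the edge" if ANY coordinate is within 0.05 of 0 or 1;
    # in theory the fraction is 1 - 0.9**d, which rushes towards 1 as d grows
    frac = ((pts < 0.05) | (pts > 0.95)).any(axis=1).mean()
    print(d, round(frac, 3))
```

Despite this, the distances between points still vary, which is the point made next: methods like k-nearest-neighbours keep working.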
So basically, in high dimensions, everything sits on the edge. What that means, in theory, is that the distance between points is much less meaningful. And if we assume that somehow that matters, it would suggest that when you've got lots and lots of columns, and you just use them without being very careful to remove the ones you don't care about, somehow things won't work. That turns out just not to be the case.

It's not the case for a number of reasons. One is that the points still do have different distances from each other. Just because they're on the edge, they still vary in how far away they are from each other, and so this point is more similar to this point than it is to that point. So even things like k-nearest-neighbours actually work really well, really really well, in high dimensions, despite what the theoreticians claimed.

What really happened here was that in the 90s, theory totally took over machine learning. In particular, there was this concept of support vector machines, which were theoretically very well justified and extremely easy to analyse mathematically (you could kind of prove things about them), and we lost a decade of real practical development, in my opinion. And all these theories became very popular, like the curse of dimensionality. Nowadays, and a lot of theoreticians hate this, the world of machine learning has become very empirical: which techniques actually work? And it turns out that, in practice, building models on lots and lots of columns works really, really well.

The other thing to quickly mention is the no free lunch theorem. There's a mathematical theorem by that name, which you'll often hear about, which claims that there is no type of model that works well for any kind of data set. That's true, and it's obviously true if you think about it in the mathematical sense: any random data set, by definition, is random.
So there isn't going to be some way of looking at every possible random data set that's in some way more useful than any other approach. In the real world, though, we look at data which is not random. Mathematically, we'd say it sits on some lower-dimensional manifold; it was created by some kind of causal structure. There are relationships in there. So the truth is that we're not using random data sets, and therefore in the real world there actually are techniques that work much better than other techniques for nearly all of the data sets you look at. Nowadays there are empirical researchers who spend a lot of time studying which techniques work a lot of the time, and ensembles of decision trees, of which random forests are one, is perhaps the technique which most often comes out on top. And that is despite the fact that, until the library we're showing you today, fast.ai, came along, there wasn't really any standard way to pre-process the data properly and to properly set their parameters, so I think it's even stronger than that. Yeah, I think this is where the difference between theory and practice is huge. So when I try to create a random forest regressor, what is that random forest regressor? It's part of something called sklearn. sklearn is scikit-learn. It is by far the most popular and important package for machine learning in Python. It does nearly everything. It's not the best at nearly everything, but it's perfectly good at nearly everything. You might find in the next part of this course with Yannet that you look at a different kind of decision tree ensemble called gradient boosted trees, where actually there's something called XGBoost which is better than the gradient boosted trees in scikit-learn. But scikit-learn is pretty good at everything, so I'm really going to focus on scikit-learn. You can do two kinds of things with a random forest. If I hit tab... ah, I haven't imported it.
So let's go back to where we import. You can hit tab in a Jupyter notebook to get tab completion for anything that's in your environment, and you'll see that there's also a RandomForestClassifier. In general, there's an important distinction between things which predict continuous variables, which is called regression (so a method for doing that is a regressor), and things that predict categorical variables, which is called classification (and the things that do that are called classifiers). In our case, we're trying to predict a continuous variable, price, so we are doing regression, and therefore we need a regressor. A lot of people incorrectly use the word regression to refer to linear regression, which is just not at all true or appropriate: regression means a machine learning model that's trying to predict some kind of continuous outcome; it has a continuous dependent variable. Pretty much everything in scikit-learn has the same form. You first create an instance of an object for the machine learning model you want; you then call fit, passing in the independent variables (the things you're going to use to predict) and the dependent variable (the thing you want to predict). In our case, the dependent variable is the data frame's SalePrice column, and the thing we want to use to predict is everything except that. In pandas, the drop method returns a new data frame with a list of rows or columns removed; axis=1 means remove columns. So this here is the data frame containing everything except for SalePrice. (Can I get the box?) "So if you want to remove multiple columns, do you just pass a list of strings of the column names?" Let's find out. To find out, I can hit shift-tab, and that will bring up a quick inspection of the parameters.
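As a minimal sketch of that create-then-fit pattern (using a tiny made-up, all-numeric frame rather than the actual bulldozers data; the column names and values here are invented):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up stand-in for the (already numeric) bulldozers data
df = pd.DataFrame({
    'YearMade':     [1995, 2001, 2007, 2010, 1998, 2004],
    'MachineHours': [1200, 300, 4500, 800, 2200, 150],
    'SalePrice':    [9.2, 10.1, 9.8, 10.5, 9.5, 10.3],  # log of price
})

m = RandomForestRegressor(n_jobs=-1, random_state=0)
# drop(axis=1) returns a new frame with the listed column(s) removed,
# so X is everything except the dependent variable
X, y = df.drop('SalePrice', axis=1), df['SalePrice']
m.fit(X, y)
print(m.score(X, y))  # R^2 on the training data
```

The same instantiate/fit/score shape applies to essentially every scikit-learn estimator, which is a big part of why the library is so convenient.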
In this case it doesn't quite tell me what I want, so if I hit shift-tab twice, it gives me a bit more information. Ah yes, that tells me it's a single label or list-like. List-like means anything you can index in Python, and there are lots of things like that, by the way. If I hit shift-tab three times, it will give me a whole little window at the bottom. Okay, so that was shift-tab. Another way of doing that, of course, which we learned, would be question mark... sorry, ??df_raw.drop would be the source code for it, and a single question mark is the documentation. So I think that trick of tab to complete, shift-tab for parameters, question mark for the docs and double question mark for the source code: if you know nothing else about using Python libraries, know that, because now you know how to find out everything else. Okay. So we try to run it, and it doesn't work. Why didn't it work? Any time you get a stack trace like this, an error, the trick is to go to the bottom, because the bottom tells you what went wrong, while above that it tells you all of the functions that called other functions that called other functions to get there. "Could not convert string to float: Conventional". So there was a value inside my data set, the word "Conventional", and it didn't know how to create a model using that string. Now, that's fair enough: we have to pass numbers to most machine learning models, and certainly to random forests. So step one is to convert everything into numbers. Our data set contains both continuous variables, numbers where the meaning is numeric, like price, and categorical variables, which could either be numbers where the meaning is not continuous, like a zip code, or a string, like large, small, and medium.
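You can reproduce that failure in a few lines if you want to see the error for yourself (a sketch; the column name and values here are made up, and the exact wording of the message can vary between versions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# A string column goes straight into a random forest: this will fail
df = pd.DataFrame({'Enclosure': ['Conventional', 'OROPS'],
                   'SalePrice': [9.2, 10.1]})

m = RandomForestRegressor(n_jobs=-1)
try:
    m.fit(df.drop('SalePrice', axis=1), df['SalePrice'])
except ValueError as e:
    msg = str(e)
    print(msg)  # mentions it could not convert string to float
```

As with any stack trace, the useful part is the final exception message, not the chain of frames above it.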
So: categorical and continuous variables. We want to get to a point where we have a data set where we can use all of these variables, so they all have to be numeric, and they have to be usable in some way. One issue is that we've got something called saledate, which, you might remember, right at the top we told pandas is a date. So it's been parsed as a date, and you can see here its data type (dtype, a very important thing) is datetime64. That's not a number. And this is actually where we need to do our first piece of feature engineering, because inside a date there's a lot of interesting stuff. So, since you've got the Catchbox, can you tell me: what are some of the interesting bits of information inside a date? "Well, you can see a time series pattern." That's true; I didn't express it very well. What are some columns that we could pull out of this? "Year, month." Yeah, year, month. "Quarter." Quarter. You want to pass it to your right and get some more? "Day of month." Yeah, keep going to the right. "Day of week." Day of week, yeah. Okay, I'll give you a few more that you might want to think about: is it a holiday? Is it a weekend? Was it raining that day? Was there a sports event that day? It depends a bit on what you're doing. If you're predicting soda sales in SoMa, you would probably want to know whether there was a San Francisco Giants ball game on that day. So what's in a date is one of the most important pieces of feature engineering you can do, and no machine learning algorithm can tell you whether the Giants were playing that day and that it was important. This is where you need to do feature engineering. I do as many things automatically as I can for you, so here I've got something called add_datepart.
What is that? It's something inside fastai.structured. And what is it? Well, let's read the source code. Here it is. You'll find most of my functions are less than half a page of code, so rather than always having docs (I'm going to try to add docs over time), the design is that you can understand them by reading the code. We're passing in a data frame and the name of some field, which in this case was saledate. Now, we can't go df.field_name, because that would literally look for a field called "field_name"; df[field_name] is how we grab a column where the column name is stored in a variable. So we've now got the field itself, the series. What we're going to do is go through all of these different strings, and there's a piece of Python, getattr, which actually looks inside an object and finds an attribute with that name. You can Google for Python getattr; it's a cool little advanced technique. So this is going to go through, and for this field it's going to find its year attribute, and so on. Now, pandas has this interesting idea. Let me actually look inside. Let's go field = ... (this is the kind of experiment I want you to do: play around) ... df_raw.saledate. So I've now got that in a field object, and I can go field, dot, tab. Let's see, is year in there? Oh, it's not. Why not? Well, that's because year is only going to apply to pandas series that are datetime objects.
What pandas does is split out the methods that are specific to what a series is into separate attributes, so datetime objects have a .dt attribute defined, and that is where you'll find all the datetime-specific stuff. So what I did was go through all of these and pick out all of the ones that could ever be interesting for any reason. This is like the opposite of the curse of dimensionality: if there is any column, or any variant of a column, that could ever be interesting at all, add it to your data set, and every variation of it you can think of. There's no harm in adding more columns, nearly all the time. So in this case, we're going to go ahead and add all of these different attributes, and for every one, I'm going to create a new field that's going to be called the name of your field with the word "date" removed, plus the name of the attribute. So we're going to get saleYear, saleMonth, saleWeek, saleDay, etc. And then at the very end, I'm going to remove the original field, because, remember, we can't use saledate directly, since it's not a number. "So this only works because it was a date type. Did you make it a date, or was it already saved as one in the original?" Yeah, it's already a date type, and the reason it was a date type is that when we imported it, we said parse_dates= and told pandas it's a date. So as long as it looks date-ish and we tell it to parse it as a date, it'll turn it into a date type. "Is there a way to do that so it would just look through all the columns and say, if it looks like a date, make it a date?" I think there might be, but for some reason it wasn't ideal; maybe it took lots of time, or it didn't always work, or for some reason I had to list them here. I would suggest checking out the docs for pandas.read_csv, and maybe on the forum you can tell us what you find, because I can't remember offhand.
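Here's a rough sketch of what add_datepart does, written against plain pandas rather than the actual fast.ai source (the naming mirrors the saleYear/saleMonth scheme described above; using seconds since the epoch for an elapsed-time column is my assumption):

```python
import pandas as pd

df = pd.DataFrame({'saledate': pd.to_datetime(['2011-11-14', '2012-03-26'])})

fld = df['saledate']
# Pull interesting attributes off the .dt accessor, naming each new column
# after the field with 'date' removed plus the attribute name
for attr in ['year', 'month', 'day', 'dayofweek', 'dayofyear']:
    df['sale' + attr.capitalize()] = getattr(fld.dt, attr)
df['saleElapsed'] = fld.astype('int64') // 10**9  # seconds since the epoch
df = df.drop('saledate', axis=1)  # the raw date itself isn't a number
print(df)
```

The real add_datepart pulls out more attributes than this (quarter, month start/end flags, and so on), but the getattr-over-the-.dt-accessor pattern is the core of it.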
Good throw. "So how about the time zone? How can we get the time zone?" Let's do that one on the same forum thread that Savannah creates, because I think it's a reasonably advanced question. But generally speaking, the time zone in a properly formatted date will be included in the string, and pandas should pull it out correctly and turn it into a universal time zone, so generally speaking it should handle it for you. "So I noticed you use both df.column and df['column']." The square brackets form is safer. Particularly if you're assigning to a column that didn't already exist, you need to use the square brackets format, otherwise you'll get weird errors. So the square brackets format is safer; the dot version saves me a couple of keystrokes, so I probably use it more than I should. In this particular case, because I wanted to grab a column whose name was stored inside field_name, rather than being the name itself, I had to use square brackets. So square brackets is going to be your safe bet if in doubt. After I run that, you'll notice that df_raw.columns gives me a list of all of the columns, just as strings, and at the end, there they all are: it's removed saledate and it's added all those new ones. That's not quite enough, though. The other problem is that we've got a whole bunch of strings in there. (You can just leave that there. Do you want to pass it back? Thanks.) So here's one: low, high, medium. Pandas actually has a concept of a category data type, but by default it doesn't turn anything into a category for you. So I've created something called train_cats, which creates categorical variables for everything that's a string. What that's going to do, behind the scenes, is create a column that's actually a number, an integer, and store a mapping from the integers to the strings.
The reason it's called train_cats is that you use it for the training set. The more advanced usage comes when we get to looking at the test and validation sets, and this is a really important idea. In fact, Terrence came to me the other day and said, "my model is not working. Why not?", and he figured it out for himself. It turned out the reason was that the mappings he was using from string to number in the training set were different to the mappings he was using from string to number in the test set. So in the training set, "High" might have been three, but in the test set it might have been two. The two were totally different, and so the model was basically non-predictive. So I have another function called apply_cats, where you can pass in your existing training set, and it will use the same mappings to make sure your test set or validation set uses the same mappings. So when I run train_cats, it's actually not going to make the data frame look different at all; behind the scenes, it turns them all into numbers. (We finish at... 12? 11:50? Here we go, I'll try to finish on time.) So, remember I mentioned there was this .dt attribute that gives you access to everything about a datetime, assuming it's a datetime? There's a .cat attribute that gives you access to things assuming something's a category. UsageBand was a string, and now that I've run train_cats it's turned into a category, so I can go df_raw.UsageBand.cat, and there's a whole bunch of other things we've got there.
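The mapping bug Terrence hit can be reproduced and fixed in a few lines of plain pandas (this shows the idea behind train_cats/apply_cats, not their actual source; the data is made up):

```python
import pandas as pd

train = pd.DataFrame({'UsageBand': ['High', 'Low', 'Medium', 'Low']})
test  = pd.DataFrame({'UsageBand': ['Medium', 'High']})

# train_cats idea: make the string column categorical on the training set
train['UsageBand'] = train['UsageBand'].astype('category')

# apply_cats idea: reuse the *training* categories on the test set, so each
# string maps to the same integer code in both frames
test['UsageBand'] = pd.Categorical(
    test['UsageBand'], categories=train['UsageBand'].cat.categories)

print(train['UsageBand'].cat.codes.tolist())  # High=0, Low=1, Medium=2
print(test['UsageBand'].cat.codes.tolist())   # same mapping reused
```

If instead you called astype('category') independently on the test set, a value that's missing from one set would shift all the codes after it, which is exactly the silent mismatch that made the model non-predictive.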
Okay, so one of the things we've got there is .categories, and you can see here is the list. Now, one of the things you might notice is that this list is in a bit of a weird order: high, low, medium. The truth is, it doesn't matter too much, but what's going to happen when we use the random forest is that this is going to be zero, this is going to be one, this is going to be two, and we're going to be creating decision trees. So we're going to have a decision tree that can split things at a single point: either high, versus low and medium, or medium, versus high and low. That would be kind of weird. It actually turns out not to work too badly, but it'll work a little bit better if you have these in a sensible order. So if you want to reorder a category, you can just go .cat.set_categories and pass in the order you want and tell it it's ordered. Almost every pandas method has an inplace parameter, which, rather than returning a new data frame, changes that data frame. Now, I didn't check carefully through all the categories which ones should be ordered, but this seemed like a pretty obvious one. Sure, so: the UsageBand column. This is actually what our random forest is going to see, these numbers, one, zero, two, one. They map to the position in this array, and as we're going to learn shortly, a random forest consists of a bunch of trees, each of which makes a single split, and the single split is going to be either greater than or less than one, or greater than or less than two.
So we could split it into high, versus low and medium, which semantically makes sense (is it big?), or we could split it into medium, versus high and low, which doesn't make much sense. In practice, the decision tree could then make a second split, say medium versus high and low, and then within the high-and-low group split into high and low. But by putting it in a sensible order, if it wants to split out low, it can do it in one decision rather than two. We'll be learning more about this shortly. Honestly, it's not a big deal, but I just wanted to mention it's there. It's also good to know that when people talk about different types of categorical variable, there's a kind called ordinal, and an ordinal categorical variable is one that has some kind of order, like high, medium, and low. Random forests aren't terribly sensitive to that fact, but it's worth knowing it's there and trying it out. "Still, the ordering wouldn't help that much?" That's what I'm saying: it helps a little bit. It means you can get there with one decision rather than two. "I notice there's a negative one in that list of codes. Is that like an NA?" Yeah, exactly. So for free we get a negative one, which refers to missing, and one of the things we're going to do later on is add one to all of our codes, to make missing zero. (Can somebody pass it back to Paul? Maybe in two goes. People know it's coming.) Yeah, we're going to get to that. "What about get_dummies?" So get_dummies, which we'll get to in a moment, creates three separate columns: ones and zeros for high, ones and zeros for medium, and ones and zeros for low, whereas this approach creates a single column with an integer, zero, one, or two. We're going to get to that one shortly. Did you have a question too, Paul, or were you just pointing? Okay.
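Here is what that reordering looks like on a toy series (a sketch; note that in recent pandas releases set_categories returns a new series rather than taking inplace=True, which is how it's written here):

```python
import pandas as pd

s = pd.Series(['High', 'Low', 'Medium', 'Low']).astype('category')
print(list(s.cat.categories))  # defaults to alphabetical: High, Low, Medium

# Put the levels in a sensible order, so a single split such as "code >= 2"
# cleanly separates Low from High-and-Medium
s = s.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)
print(s.cat.codes.tolist())    # High=0, Medium=1, Low=2
```

Missing values, had there been any, would show up in .cat.codes as -1 regardless of the ordering, which is why the codes get one added to them later on.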
Okay, so at this point, as long as we always make sure we use .cat.codes (the thing with the numbers in it), we're basically done: all of our strings have been turned into numbers, the date's been turned into a bunch of numeric columns, and everything else is already a number. The only other main thing we have to do is notice that we have lots of missing values. So here is df_raw.isnull(): that's going to return true or false depending on whether something is empty; .sum() adds up how many are empty for each series; and then I sort them and divide by the size of the data set. So here we have some things which have quite high percentages of nulls, or missing values as we call them. (Hmm, maybe I didn't run it.) Okay, so we're going to get to that in a moment, but I will point something out first. Reading the CSV took a minute or so, and the processing took another ten seconds or so. From time to time, when I've done a bit of work that I don't want to wait for again, I will tend to save where I'm at. So here I'm going to save it, and I'm going to save it in a format called feather format. This is very, very new, but what it does is save it to disk in exactly the same basic format that it's actually in in RAM. This is by far the fastest way to save something, and the fastest way to read it back. Most of the folks you deal with, unless they're on the cutting edge, won't be familiar with this format, so this would be something you can teach them about. It's becoming the standard.
It's actually becoming something that's going to be used not just in pandas, but in Java, in Spark, in lots of things, for communicating across computers, because it's incredibly fast. And it was actually co-designed by the guy that made pandas, Wes McKinney. So we can just go df_raw.to_feather and pass in some name. I tend to have a folder called tmp for all of my as-I-go-along stuff. And os.makedirs: you can pass in any path you like, it won't complain if it's already there (that's exist_ok=True), and if there are some subdirectories, it'll create them for you. So this is a super handy little function. Okay, so it's not installed; because I'm using Crestle for the first time, it's complaining about that. If you get a message that something's not installed and you're using Anaconda, you can conda install it; Crestle actually doesn't use Anaconda, it uses pip. So we wait for that to go along. And now if I run it... sometimes you may find you actually have to restart Jupyter. I won't do that now, because we're nearly out of time, but if you restart Jupyter, you'll be able to keep moving along. So from now on, you don't have to re-run all the stuff above; you can just say pd.read_feather, and we've got our data frame back. The last step is to actually replace the strings with their numeric codes, pull out the dependent variable, SalePrice, into a separate variable, and also handle missing continuous values. And how are we going to do that? You'll see here we've got a function called proc_df. What is that proc_df?
It's inside fastai.structured again, and here it is. Quite a lot of the functions have a few additional parameters that you can provide, and we'll talk about them later, but basically we're providing the data frame to process and the name of the dependent variable, the y field name. All it's going to do is make a copy of the data frame, grab the y values, drop the dependent variable from the original, and then fix missing values. So how do we fix missing? What we do is pretty simple. If it's numeric, we first check that it does have some missing values, in other words that the is-null sum is nonzero. If so, we create a new column with the same name as the original plus _na, and it's going to be a boolean column with a one any time the value was missing and a zero any time it wasn't. We're going to talk about this again next week, but I'll give you the quick version: having done that, we replace the NAs, the missing values, with the median. So anywhere that used to be missing is replaced with the median, and we add a new column to tell us which ones were missing. We only do that for numeric columns; we don't need it for categories, because pandas handles categorical variables automatically, by setting them to minus one. Then, if it's not numeric and it is a categorical type (we'll talk about the maximum number of categories later, but let's assume this condition is always true), we replace the column with its codes, the integers, plus one. By default, pandas uses minus one for missing, so now zero will be missing, and one, two, three, four will be all the other categories. We're going to talk about dummies later on in the course, but basically, optionally, if you already know about
dummy values, columns with a small number of possible values can be turned into dummies instead, but we're not going to do that for now. So for now, all we're doing is using the categorical codes plus one, replacing missing values with the median, adding an additional column telling us which ones were replaced, and removing the dependent variable. That's what proc_df does, and it runs very quickly. You'll see now SalePrice is no longer here; we've got a whole new variable called y that contains SalePrice. You'll see we've got a couple of extra something_na columns at the end, and if I look at the result, everything is a number. These booleans are treated as numbers; they're just considered zero or one, displayed as False and True. You can see here at the end: is-month-end, is-month-start, is-quarter-end. It's kind of funny, because we've got things like ModelID, which presumably is something like a serial number, or the model identifier that's created by the factory or something, and we've got a datasource ID. Some of these are numbers, but they're not continuous. It turns out that random forests actually work fine with those; we'll talk about why and how, and a lot about that in detail, but for now, all you need to know is: no problem. So as long as this is all numbers, which it now is, we can go ahead and create a random forest: m = RandomForestRegressor(...). Random forests are trivially parallelizable. What that means is that if you've got more than one CPU, which basically everybody will on their computers at home, and which you will have on a t2.medium or bigger at AWS, it will split up the data across your different CPUs and basically linearly scale.
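Stepping back for a second, the missing-value and category handling that proc_df performs can be sketched in plain pandas (this is a rough re-implementation of the idea just described, not the actual fast.ai source; the column names and data are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'MachineHours': [1200.0, None, 800.0, None],
    'UsageBand': pd.Categorical(['High', 'Low', None, 'High']),
    'SalePrice': [9.2, 10.1, 9.8, 10.5],
})

# Pull out the dependent variable and drop it from the frame
y = df['SalePrice']
df = df.drop('SalePrice', axis=1)

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        if df[col].isnull().sum():
            df[col + '_na'] = df[col].isnull()          # flag what was missing
            df[col] = df[col].fillna(df[col].median())  # then fill with median
    else:
        df[col] = df[col].cat.codes + 1                 # -1 (missing) becomes 0

print(df)
```

After this runs, every column is numeric: the median fill plus the _na flag preserves the information about which rows were missing, and the categorical codes are shifted so that zero means missing.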
So the more CPUs you have, pretty much, it will divide the time it takes by that number. Not exactly, but roughly. n_jobs=-1 tells the random forest regressor to create a separate job, basically a separate process, for each CPU you have, and that's pretty much what you want all the time. We fit the model using this new data frame we created and the y values we pulled out, and then get the score. The score is going to be the R squared; we'll define that next week. Hopefully some of you already know about R squared: one is very good, zero is very bad. And as you can see, we've immediately got a very high score. So that looks great, but what we'll talk about a lot more next week is that it's not quite that great. Maybe we had data with points that looked like this, and we fitted a line that looks like this, when actually we want one that looks like that. The only way to know whether we've actually done a good job is by having some other data set that we didn't use to train the model. Now, we're going to learn about some ways, with random forests, that we can kind of get away without even having that other data set, but for now, what we're going to do is split off 12,000 rows, which we'll put in a separate data set called the validation set, while the training set contains everything else. Our data set is sorted by date, and so that means the most recent 12,000 rows are going to be our validation set. Again, we'll talk more about this next week.
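A date-ordered split like that, plus a print_score-style summary, can be sketched as follows (the data here is synthetic, and the helper names and sizes are mine; in the real notebook the rows are sorted by saledate and n_valid is 12,000):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic all-numeric data standing in for the processed bulldozers frame;
# rows are assumed to already be sorted by date
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(size=(1000, 3)), columns=['a', 'b', 'c'])
y = X['a'] * 2 + X['b'] + rng.normal(scale=0.1, size=1000)

def split_vals(a, n):
    # First n rows for training, the remaining (most recent) for validation
    return a[:n].copy(), a[n:].copy()

n_valid = 200
X_train, X_valid = split_vals(X, len(X) - n_valid)
y_train, y_valid = split_vals(y, len(y) - n_valid)

def rmse(pred, actual):
    return np.sqrt(mean_squared_error(actual, pred))

m = RandomForestRegressor(n_jobs=-1, random_state=0)
m.fit(X_train, y_train)
# [train RMSE, valid RMSE, train R^2, valid R^2], like print_score reports
print([rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
       m.score(X_train, y_train), m.score(X_valid, y_valid)])
```

The pattern to look for is the same as in the lecture: the training numbers will be better than the validation numbers, and it's the validation numbers that tell you how good the model really is.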
It's a really important idea, but for now we can just recognize that if we do that and run it... I've created a little function called print_score, and it prints out the root mean squared error between the predictions and the actuals for the training set and for the validation set, and the R squared for the training set and the validation set. You'll see that the R squared for the training set was 0.98, but for the validation set it was 0.89. And the RMSE (remember, this is on the logs) was 0.09 for the training set and 0.25 for the validation set. Now, if you actually go to Kaggle and look at the leaderboard (in fact, let's do it right now; there's private and public, I'll check the public leaderboard), we can go down and find out where 0.25 sits. There are 475 teams, and generally speaking, if you're in the top half of a Kaggle competition, you're doing pretty well. So, 0.25... here we are. What was it exactly? 0.2507. Yeah, about 110th, so we're about in the top 25 percent. So this is pretty cool, right? With no thinking at all, using the defaults for everything, we're in the top 25 percent of a Kaggle competition. Random forests are insanely powerful, and this totally standardized process is insanely good for, like, any data set. So, we're going to wrap up. What I'm going to ask you to do for Tuesday is take as many Kaggle competitions as you can, whether they're running now or old ones, or data sets that you're interested in for hobbies or work, and please try this process. And if it doesn't work, tell us on the forum.
Say: here's the data set I'm using, here's where I got it from, here's the stack trace of where I got an error; or, if you use my print_score function or something like it, show us what the training versus validation results look like, and we'll try to figure it out. What I'm hoping we'll find is that all of you will be pleasantly surprised that, with the hour or two of information you got today, you can already get better models than most of the very serious practicing data scientists that compete in Kaggle competitions. Okay, great. Good luck, and I'll see you on the forums. Oh, one more thing: Friday. The other class said a lot of them had class during my office hours, so if I made them one till three instead of two till four on Fridays, is that okay? Seminar? Oh, okay, I'll have to find a whole other time. All right, I will talk to somebody who actually knows what they're doing, unlike me, about finding office hours. Absolutely.