Last lesson, we looked at what random forests are, and at some of the tweaks we could use to make them work better. To actually practice this, we needed a Jupyter notebook environment running: we can install Anaconda on our own computers, use AWS, use Crestle.com, which has everything up and running straight away, or use paperspace.com, which also works really well. So assuming you've got all that going, hopefully you've had a chance to practice running some random forests this week. One thing to point out, though, is that before we did any tweaking of hyperparameters or any tuning at all, the raw defaults already gave us a very good answer for an actual data set from Kaggle. So the tweaks aren't always the main piece; they're just tweaks. Sometimes they're totally necessary, but quite often you can go a long way without doing many tweaks at all.
So today we're going to look at something I think is maybe even more important than building a predictive model that's good at predicting, which is learning how to interpret that model to find out what it says about your data, to actually understand your data better by using machine learning. This is contrary to the common refrain that things like random forests are black boxes that hide meaning from us, and you'll see today that the truth is quite the opposite: random forests allow us to understand our data more deeply and more quickly than traditional approaches. The other thing we're going to learn today is how to look at larger data sets than those you can import with just the defaults. Specifically, we're going to look at a data set with over 100 million rows, which is the current Kaggle competition for grocery sales forecasting. Does anybody have any questions outside of those two areas, since we're covering those today, or comments they want to talk about?
Yeah, I apologize, this is kind of basic, just to make sure I'm understanding the concept. I'm sure I can hear you. Oh, sorry. Can you talk a little bit, in general, about when to use this? I understand the details of random forests more now, but when do you know this is an applicable model to use? In general, when should I think, oh, I should try a random forest here? Because that's the part I'm still unsure about. Yeah. If I'm told to I can.
Yeah, so the short answer is, I can't really think of anything offhand that it's definitely not going to be at least somewhat useful for, so it's always worth trying. I think really the question is, in what situations should I try other things as well? And the short answer to that question is: for unstructured data, or what I call unstructured data, where all the different data points represent the same kind of thing, like the waveform in a sound or speech, the words in a piece of text, or the pixels in an image, you're almost certainly going to want to try deep learning. And then outside of those two, there's a particular type of model we're going to look at today called a collaborative filtering model, and it so happens that the groceries competition is of that kind, where neither of those approaches is quite what you want without some tweaks. So that would be the other main one.
Sorry, you're saying neither; are you saying deep learning and random forests?
Neither deep learning nor random forests is exactly what you want; you need to do some tweaks, as you'll see. If anybody thinks of other places where maybe neither of those techniques is the right thing to use, mention it on the forums, even if you're not sure, so we can talk about it, because I think this is one of the more interesting questions. To some extent it is a case of practice and experience, but I do think there are two main classes to know about.
So last week, at the point where we had done some of the key steps, like the CSV reading in particular, which took a minute or two, we saved the result to a feather format file. Just to remind you, that's because feather is basically almost the same format that the data lives in in RAM, so it's ridiculously fast to read and write. So what we're going to do today is look at lesson 2, RF interpretation, and the first thing we're going to do is read that feather format file.
Now, one thing to mention is that a couple of you pointed out during the week a really interesting little bug, or little issue, in the proc_df function. The proc_df function, remember, finds the numeric columns which have missing values and creates an additional Boolean column for each, as well as replacing the missing values with medians, and it also turns the categorical objects into their integer codes. Those are the main things it does. And a couple of you pointed out some key points about the missing value handling. The first is that your test set may have missing values in some columns that had none in your training set, or vice versa. If that happens, you're going to get an error when you try to run the random forest: if an is-missing field appeared in your training set, and so ended up in the model, but not in your test set, it's going to say you can't use that data set with this model because you're missing one of the columns it requires. That's problem number one. Problem number two is that the median of the numeric values in the test set may be different from the median in the training set, so it may actually process the test set into something which has different semantics. I thought that was a really interesting point. So what I did was change proc_df so it returns a third thing, nas.
And the nas thing it returns, it doesn't matter in detail what it is, but I'll tell you just so you know: it's a dictionary where the keys are the names of the columns that had missing values, and the values are the medians. Then, optionally, you can pass nas as an additional argument to proc_df, and it'll make sure that it adds those specific columns and uses those specific medians. So it's giving you the ability to say: process this test set in exactly the same way as we processed this training set. Can you pass that please?
Hi, is this an updated feature? So you're just updating this?
Yeah, so I just did that.
So we have to get the update before we start? So just a git pull?
Yeah, in fact, that's a good point. Before you start doing work on any day, I would start by doing a git pull. And if something's not working today that was working yesterday, check the forum, where there'll be an explanation of why. This library in particular is moving fast, but pretty much all the libraries that we use, including PyTorch in particular, move fast. So one of the things to do if you're watching this through the MOOC is to make sure that you go to course.fast.ai and check the links there, because there'll be links saying, oh, these are the differences from the course. They're kept up to date, because I can't edit what I'm saying; I can only edit that. But yeah, do a git pull before you start each day. I haven't actually updated all of the notebooks to add the extra return value. I will over the next couple of days, but if you're using them, you'll just need to add an extra comma and a third name on the left-hand side, otherwise you'll get an error that it returned three things and you only have room for two.
Okay, I think what I want to do before I talk about interpretation is show you what the exact same process looks like when you're working with a really large dataset.
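To make the idea of reusing the training medians concrete, here is a minimal sketch in pandas. It is not the fastai proc_df implementation; the helper name fill_missing and the toy columns are hypothetical, and it only illustrates the "dictionary of column name to median" mechanism described above.

```python
import pandas as pd

def fill_missing(df, na_dict=None):
    """Fill numeric NaNs with medians and add '<col>_na' flags.

    Hypothetical helper for illustration only. If na_dict is None, the
    medians are learned from df (training); otherwise the supplied columns
    and medians are reused (validation or test).
    """
    df = df.copy()
    if na_dict is None:
        # Training: remember the median of every numeric column that has gaps
        numeric_cols = df.select_dtypes("number").columns
        na_dict = {c: df[c].median() for c in numeric_cols if df[c].isnull().any()}
    for col, median in na_dict.items():
        df[col + "_na"] = df[col].isnull()   # same indicator columns every time
        df[col] = df[col].fillna(median)     # same fill value every time
    return df, na_dict

train = pd.DataFrame({"a": [1.0, None, 3.0], "b": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"a": [None, 20.0, 40.0], "b": [4.0, 5.0, 6.0]})

train_p, nas = fill_missing(train)        # learns {"a": 2.0}
test_p, _ = fill_missing(test, nas)       # reuses the training median for "a"
```

The test frame ends up with the same columns as the training frame, and its gap is filled with the training median (2.0) rather than its own median (30.0). A column that has gaps only in the test set is not handled by this sketch.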
You'll see it's almost the same thing, but there are going to be a few cases where we can't use the defaults, because the defaults just run a little bit too slowly. Specifically, I'm going to look at the Kaggle groceries competition. What's it called? Here it is: Corporación Favorita Grocery Sales Forecasting. So who is entering this competition? Okay, a lot of you. Who would like to have a go at explaining what this competition involves, what the data is, and what you're trying to predict?
Okay, you're trying to predict the items on the shelf, depending on lots of factors like oil prices...
When you say predicting the items on the shelf, what do you mean? What are you actually predicting?
How much to have in stock to maximize their, I guess...
It's not quite what we're predicting, but we'll try and fix that in a moment, so go on.
And then there's a bunch of different data sets that you can use to do that. There's oil prices, there's stores, there's locations, and each of those can be used to try to predict it.
Okay, does anybody want to have a go at expanding on that?
All right, so we have a bunch of information on different products. For every store, for every item, for every day, we have a lot of related information available, like the location of the store, the class of the product, and the units sold. Based on this, we are supposed to forecast over a much shorter timeframe compared to the training data: for every item number, how much do we think it's going to sell? So only the units and nothing else.
Okay, good. Can somebody help get that back here? So your ability to explain the problem you're working on is really, really important. If you don't currently feel confident of your ability to do that, practice with someone who is not in this competition; tell them all about it.
So in this case, or in any case really, the key thing for understanding a machine learning problem is to say what the independent variables are and what the dependent variable is. The dependent variable is the thing that you're trying to predict. Here, the thing you're trying to predict is how many units of each kind of product were sold in each store on each day during a two-week period. The information you have to predict it is how many units of each product at each store on each day were sold in the last few years, and, for each store, some metadata about it, like where it is located and what class of store it is. For each type of product, you have some metadata about it, such as what category of product it is, and so forth. For each date, we have some metadata about it, such as what the oil price was on that date. This is what we would call a relational dataset: one where we have a number of different pieces of information that we can join together. Specifically, this kind of relational dataset is what we would refer to as a star schema. A star schema is a kind of data warehousing schema where there's some central transactions table. In this case, the central transactions table, if you go to the data section here, is train.csv, and it contains the number of units that were sold by date, by store ID, by item ID. That's the central transactions table: very small, very simple. And then from that, we can join various bits of metadata. It's called a star schema because you can imagine the transactions table in the middle and all these different metadata tables joined onto it, giving you more information about the date, the item ID, and the store ID.
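The star schema join can be sketched with pandas merges. The tiny tables below are made up for illustration, and the column names (store_nbr, item_nbr, city, family, oil_price) are assumptions rather than a faithful copy of the competition files.

```python
import pandas as pd

# Central transactions table (toy stand-in for train.csv)
train = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-14", "2017-08-14", "2017-08-15"]),
    "store_nbr": [1, 2, 1],
    "item_nbr": [100, 100, 200],
    "unit_sales": [3.0, 5.0, 1.0],
})

# Metadata tables, one per key of the transactions table
stores = pd.DataFrame({"store_nbr": [1, 2], "city": ["Quito", "Guayaquil"]})
items = pd.DataFrame({"item_nbr": [100, 200], "family": ["BEVERAGES", "BREAD/BAKERY"]})
oil = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-14", "2017-08-15"]),
    "oil_price": [47.6, 47.5],
})

# Left joins keep every transaction row and attach the metadata to it
joined = (
    train
    .merge(stores, on="store_nbr", how="left")
    .merge(items, on="item_nbr", how="left")
    .merge(oil, on="date", how="left")
)
```

Using how="left" means the number of transaction rows never changes, which is an easy thing to check after each join.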
Sometimes you'll also see a snowflake schema, which means there might be additional information joined onto, say, the items table that tells you about different item categories, and joined onto the store table telling you about the state that the store is in, and so forth, so you can have a whole snowflake. So that's the basic information about this problem: the independent variables, the dependent variable, and you probably also want to mention things like the timeframe.
Now, we start in exactly the same way as we did before, loading in exactly the same stuff and setting the path. But when we call read_csv, if you say low_memory=False, then you're basically saying use as much memory as you like to figure out what kinds of data are here, and it's going to run out of memory pretty much regardless of how much memory you have. So what we do in order to limit the amount of space it takes up when we read it in is create a dictionary mapping each column name to the data type of that column. To create this, it's basically up to you to run less or head or whatever on the dataset to see what the types are, figure that out, and pass them in. Then you can just pass in dtype= with that dictionary. And check this out: we can read in the whole CSV file in 1 minute and 48 seconds, and there are 125.5 million rows. So when people say Python is slow, no, Python is not slow. Python can be slow if you don't use it right, but we can actually parse 125 million CSV records in less than two minutes.
Putting my language hat on for just a moment: actually, if it's fast, almost certainly it's going to C.
Right. Yeah, so Python is usually a wrapper around a bunch of C code.
Yeah, exactly. So Python itself isn't actually very fast.
Yeah. So that was Terence Parr, who writes things for writing programming languages for a living. So he is right.
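As a rough sketch of the column-name-to-type dictionary, here is the same idea on a three-row CSV held in memory. The specific types below are illustrative choices for this toy data; for a real file you would check each column's range before picking a width.

```python
import io
import pandas as pd

# Toy stand-in for the competition's train.csv
csv_text = """id,date,store_nbr,item_nbr,unit_sales,onpromotion
0,2013-01-01,25,103665,7.0,
1,2013-01-01,25,105574,1.0,
2,2017-08-15,54,2116416,2.0,True
"""

# One entry per column, using the narrowest type that holds the values
types = {
    "id": "int64",
    "item_nbr": "int32",
    "store_nbr": "int8",
    "unit_sales": "float32",
    "onpromotion": "object",   # Boolean with gaps, cleaned up after loading
}

# The date column is parsed separately rather than listed in the dtype dict
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], dtype=types)
mem_bytes = df.memory_usage(deep=True).sum()
```

Because the types are given up front, pandas does not have to scan the file to guess them, and it does not fall back to 64-bit columns everywhere.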
Python itself is not fast, but almost everything we want to do in Python in data science has been written for us in C, or actually more often in Cython, which is a Python-like language that compiles to C. So most of the stuff we run in Python is actually running not just C code; in pandas, a lot of it is written in something close to assembly language. It's heavily optimized. Behind the scenes, a lot of it goes back to calling Fortran-based libraries for linear algebra. So there are layers upon layers of speed that allow us to spend less than two minutes reading that much data. If we wrote our own CSV reader in pure Python, it would take thousands of times longer, at least, than the optimized versions. So for us, what we care about is the speed we can get in practice, and this is pretty cool. As well as telling it what the different data types are, we also have to tell it, as before, which things we want to parse as dates.
I've noticed that in this dictionary you're specifying 64, 32 and 8. I was wondering, in practice, is it faster if you aim towards the lower ones, or is there any performance consideration?
So the key performance consideration here was to use the smallest number of bits that I could to fully represent the column. If I had used int8 for item number, there are more than 255 item numbers; more specifically, the maximum item number is bigger than 255. On the other hand, if I had used int64 for store number, it would be using more bits than necessary. Given that the whole purpose here was to avoid running out of RAM, we don't want to be using up eight times more memory than necessary. So the key thing was really about memory. And in fact, when you're working with large datasets, very often you'll find the slow piece is actually reading from and writing to RAM, not the CPU operations. So very often that's the key performance consideration.
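The memory argument is easy to check directly. This small sketch uses NumPy with a made-up column of one million store numbers; it also shows the value ranges, since the signed 8-bit type tops out at 127 and only the unsigned one reaches 255.

```python
import numpy as np

n_rows = 1_000_000

# The same column of small integers stored at two different widths
store_nbr_64 = np.ones(n_rows, dtype=np.int64)
store_nbr_8 = store_nbr_64.astype(np.int8)

bytes_64 = store_nbr_64.nbytes        # 8 bytes per value
bytes_8 = store_nbr_8.nbytes          # 1 byte per value
ratio = bytes_64 / bytes_8

# Largest value each type can hold, to check a column fits before narrowing
int8_max = np.iinfo(np.int8).max
uint8_max = np.iinfo(np.uint8).max
int32_max = np.iinfo(np.int32).max
```

Here ratio is 8.0, which is the "eight times more memory than necessary" mentioned above.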
Also, however, as a rule of thumb, smaller data types will often run faster, particularly if you can use SIMD, that is, single instruction, multiple data vectorized code, because it can pack more numbers into a single vector to run at once. That was all heavily simplified and not exactly right, but right enough, I think, for this purpose.
Once you do this, the shuffle thing beforehand is not needed anymore? Could you just take a random sub-selection?
Yeah, so although here I've read in the whole thing, when I start, I never start by reading in the whole thing. If you search the forum for shuf, S-H-U-F, you'll find some tips about how to use this Unix command to get a random sample of data at the command prompt, and then you can just read that. The nice thing is that's a good way, for example, to find out what data types to use: read in a random sample and let pandas figure it out for you. Oh, thank you. I think I'm still fine with this one. And in general, I do as much work as possible on a sample until I feel confident that I understand the sample before I move on. Having said that, what we're about to learn is some techniques for running models on this whole dataset that are actually going to work on arbitrarily large datasets. I also specifically wanted to talk about how to read in large datasets.
One thing to mention is that onpromotion is object. Object is like saying create a general purpose Python data type, which is slow and memory heavy. The reason for that is that this is a Boolean which also has missing values, so we need to deal with that before we can turn it into a Boolean. So you can see that after reading it in, I go ahead and fill in the missing values with false. Now, you wouldn't just do this without doing some checking ahead of time, but some exploratory data analysis shows that this is probably an appropriate thing to do. It seems that missing does mean false.
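A minimal sketch of that cleanup on a toy column follows. One deliberate difference from the spoken description: the gaps are filled with the string "False" rather than the Boolean, so that a single string-to-Boolean mapping covers every value. That ordering is an assumption of this sketch, not necessarily what the course notebook does.

```python
import numpy as np
import pandas as pd

# Toy onpromotion column as it arrives from read_csv with dtype "object":
# strings where a value was present, NaN where it was missing
df = pd.DataFrame({
    "onpromotion": pd.Series([np.nan, "True", "False", np.nan], dtype="object")
})

# Treat missing as "not on promotion" (check this assumption on real data first)
filled = df["onpromotion"].fillna("False")

# Map the strings to real Booleans, then store them in a compact bool column
df["onpromotion"] = filled.map({"False": False, "True": True}).astype(bool)
```

Mapping only the strings after filling with a Boolean would turn the filled values back into missing values, which is why the fill value and the mapping keys need to agree.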
Objects generally read in as strings, so I replace the strings true and false with actual Booleans, and then finally convert the column to an actual Boolean type. At this point, when I save this, this file of 125 million records takes up something under two and a half gigabytes of memory. So you can look at pretty large datasets even on pretty small computers, which is interesting. And now that it's in a nice fast format, look how fast it is: I can save it to feather format in under five seconds. So that's nice. And because pandas is generally pretty fast, you can do stuff like summarize every column of all 125 million records in 20 seconds.
The first thing I looked at here is actually the dates. Generally speaking, dates are going to be really important in a lot of the stuff you do, particularly because any model that you put into practice, you're going to be putting into practice at some date that is later than the date you trained it, by definition. So if anything in the world changes, you need to know how your predictive accuracy changes as well. What you'll see on Kaggle, and what you should always do in your own projects, is make sure that your dates don't overlap. In this case, the dates that we have in the training set go from 2013 to mid-August 2017. There's our first and last. And in our test set, they go from one day later, August the 16th, until the end of the month. This is a key thing: you can't really do any useful machine learning until you understand this basic piece, which is that you've got four years of data and you're trying to predict the next two weeks. That's just a fundamental thing you need to understand before you can do a good job of this. And so as soon as I see that, what does that say to you? If you wanted to now use a smaller data set, should you use a random sample, or is there something better you could do?
Probably from the bottom, more recent.
Yeah, get the most recent, right? And if you ever have trouble answering questions like this, just try to make it as physical as possible. So it's like, okay, I'm going to go to a shop next week, and I've got a $5 bet with my brother as to whether I can guess how many cans of Coke are going to be on the shelf. Well, probably the best way to do that would be to go to the shop on the same day of the previous week, see how many cans of Coke are on the shelf, and guess it's going to be the same. You wouldn't go and look at how many were there four years ago.
But couldn't four years ago, that same time of the year, be important? I mean, for example, how much Coke they have on the shelf at Christmas time is going to be way more than...
Exactly. So it's not that there's no useful information from four years ago, and we don't want to entirely throw it away. But as a first step, what is the simplest possible thing? It's kind of like submitting the means. I wouldn't submit the mean of 2012 sales; I would probably want to submit the mean of last month's sales, for example. So we just want to think about how we might create some initial easy models, and later on we might want to weight it. For example, we might want to weight recent dates more highly, since they're probably more relevant. But we should do a whole bunch of exploratory data analysis to check that.
So here's what the bottom of that data set looks like. You can see it's literally got a date, a store number, an item number, and unit sales, and it tells you whether or not that particular item was on sale at that particular store on that particular date. And then there's some arbitrary ID. So that's it. Now that we have read that in, we can do stuff like... So this is interesting. Again, we have to take the log of the sales, and it's for the same reason as we looked at last week, right?
Because we're trying to predict something that varies according to ratios. They told us in this competition that the root mean squared log error is the thing they care about, so we take a log. They also mention, if you check the competition details (and you should always read carefully the definition of any project you do), that there are some negative sales that represent returns, and they tell us that we should consider them to be zero for the purpose of this competition. So I clip the sales so that they fall between zero and no particular maximum. Clip just means cut it off at that point, truncate it. And then I take the log of that plus one. Why plus one? Because, again, if you check the details of the Kaggle competition, that's what they tell you they're going to use: not actually the root mean squared log error, but the root mean squared log-plus-one error, because log of zero doesn't make sense.
We can add the date part as usual. And again, it takes a couple of minutes. So I would run through all this on a sample first, so everything takes 10 seconds, to make sure it works and check that everything looks reasonable before I go back, because I don't want to wait two minutes for something when I don't know whether it's going to work. But as you can see, all these lines of code are identical to what we saw for the bulldozers competition. In this case, all I'm reading in is the training set. I didn't need to run train_cats because all of my data types are already numeric. If they weren't, I would need to call train_cats, and then I would need to call apply_cats to apply the same categorical codes that I now have in the training set to the validation set. I call proc_df as before to check for missing values and so forth. So all of those lines of code are identical. These lines of code again are identical, because root mean squared error is what we care about.
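The target transformation can be sketched in a few lines. The sales values are made up, and the rmse helper is a plain definition written for this example; applied to log-plus-one targets it gives the root mean squared log-plus-one error described above.

```python
import numpy as np
import pandas as pd

# Toy unit sales, including a negative value that represents a return
unit_sales = pd.Series([-2.0, 0.0, 1.0, 9.0])

# Returns count as zero sales, then take log(1 + x) so that zero stays finite
y = np.log1p(unit_sales.clip(lower=0))

def rmse(pred, actual):
    """Root mean squared error between two equal-length sequences."""
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Predicting zero sales everywhere, scored on the log-plus-one scale
zero_baseline_error = rmse(np.zeros(len(y)), y)
```

Predictions made on this scale can be turned back into unit sales with np.expm1.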
And then I've got two changes. The first is set_rf_samples, which we learned about last week. We've got 120-something million records, and we probably don't want to create a tree from 120-something million records. I don't even know how long that's going to take; I haven't had the time and patience to wait and see. So you could start with 10,000 or 100,000, maybe something that runs in a few seconds, make sure it works, and then figure out how much you can run. I found that setting it to a million, it runs in under a minute. And the point here is that there's no relationship between the size of the data set and how long it takes to build the random forest. The relationship is between the number of estimators multiplied by the sample size.
Just curious what n_jobs is, because in the past it has always been -1, and here you made it 8.
Yeah. So the number of jobs is the number of cores that it's going to use. I was running this on a computer that has about 60 cores, and I found that if you tried to use all of them, it spent so much time spinning up jobs that it was a little slower. So if you've got lots and lots of cores on your computer, sometimes you want fewer; -1 means use every single core.
There's one more change I made, which is that I converted the data frame into an array of floats and then fitted on that. Why did I do that? Because internally, inside the random forest code, they do that anyway. So given that I wanted to run a few different random forests with a few different hyperparameters, by doing it once myself, I saved that 1 minute 37 seconds. If you run a line of code and it takes quite a long time (the first time I ran this random forest regressor, it took two or three minutes, and I thought, I don't really want to wait two or three minutes), you can always add prun, %prun, in front of the line of code.
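Here is a hedged sketch of the same three ideas (a fixed sample per tree, a limited number of jobs, and converting to a float array once) on synthetic data. It does not use the course's set_rf_samples helper; instead it assumes a scikit-learn version of 0.22 or later, where RandomForestRegressor has a max_samples parameter that caps the rows drawn for each tree. The sizes and hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for a large training set
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=5000)

# Convert once to float32, the type the forest converts its input to anyway,
# so repeated fits with different hyperparameters skip that conversion
X_arr = np.asarray(X, dtype=np.float32)

model = RandomForestRegressor(
    n_estimators=20,
    min_samples_leaf=3,
    max_samples=1000,   # each tree sees 1,000 bootstrap rows, not all 5,000
    n_jobs=2,           # a fixed number of workers instead of -1 (every core)
    random_state=0,
)
model.fit(X_arr, y)
train_r2 = model.score(X_arr, y)
```

With the sample size fixed, training cost depends on n_estimators times max_samples rather than on the total number of rows.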
What %prun does is run something called a profiler, and what a profiler does is tell you which lines of code behind the scenes took the most time. In this case, I noticed that there was a line of code inside scikit-learn, this line of code, that was taking nearly all the time. So I thought, oh, I'll do that first, and then I'll pass in the result and I won't have to do it again. This thing of looking to see which things are taking up the time is called profiling, and in software engineering it's one of the most important tools you have. Data scientists really underappreciate this tool, but you'll find that in conversations on GitHub issues or on Twitter amongst the top data scientists, they're sharing and talking about profilers all the time. And that's how easy it is to get a profile. So for fun, try running %prun from time to time on stuff that's taking 10 or 20 seconds, and see if you can learn to interpret and use profiler outputs. Even though in this case I didn't write this scikit-learn code, I was still able to use the profiler to figure out how to make it run over twice as fast by avoiding recalculating this each time.
So in this case, I built my regressor. I decided to use 20 estimators. Something else that I noticed in the profiler is that I can't use the OOB score when I use set_rf_samples, because if I do, it's going to use the other 124 million rows to calculate the OOB score, which is still going to take forever. So I may as well have a proper validation set anyway; besides which, I want a validation set that's the most recent dates rather than random. So if you use set_rf_samples on a large data set, don't put the oob_score parameter in, because it takes forever. That got me a 0.76 validation root mean squared log error. And then I tried fiddling around with different min_samples_leaf values.
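In a notebook, the profiler is invoked by putting %prun in front of a statement. Outside a notebook, a similar report is available from the standard library's cProfile module. The sketch below profiles two made-up functions, slow_step and pipeline, which stand in for whatever code is being timed.

```python
import cProfile
import io
import pstats

def slow_step(n):
    # Deliberately wasteful work so it dominates the profile
    return sum(i * i for i in range(n))

def pipeline():
    slow_step(200_000)
    return "done"

# Record every function call made while the pipeline runs
profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# Sort by cumulative time so the most expensive call chains come first
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
```

Printing report shows one row per function with call counts and timings; the rows near the top are the ones worth trying to avoid or to compute only once.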
If I decreased min_samples_leaf from 100 to 10, it took a little bit more time to run, as we'd expect, and the error went down from 0.76 to 0.71. That looked pretty good, so I kept decreasing it, down to three, and that brought the error down to 0.70. When I decreased it down to one, it didn't really help. So I had a reasonable random forest here. When I say reasonable, though, it's not reasonable in the sense that it does not give a good result on the leaderboard. That's a very interesting question, why that is, and the reason really comes back to Savannah's question earlier: where might random forests not work as well?
Let's go back and look at the data. Here's the entire data set that we had, and here are all the columns that we used. The columns we have to predict with are the date, the store number, the item number, and whether it was on promotion or not. And then, of course, we used add_datepart, so there's also going to be day of week, day of month, day of year, is_quarter_start, et cetera. So if you think about it, most of the insight around how much of something you expect to sell tomorrow is likely to be wrapped up in details like: where is that store? What kinds of things do they tend to sell at that store? What category of item is it? If it's fresh bread, they might not sell much of it on Sundays, because on Sundays fresh bread doesn't get made. Whereas if it's gasoline, maybe they're going to sell a lot of it on Sundays, because people go and fill up their cars for the week ahead. Now, a random forest has no ability to do anything other than create a bunch of binary splits on things like day of week, store number, and item number. It doesn't know which one represents gasoline. It doesn't know which stores are in the center of the city versus which ones are out in the sticks. It doesn't know any of these things.
So its ability to really understand what's going on is somewhat limited. We're probably going to need to use the entire four years of data to even get some useful insights, but as soon as we start using the whole four years, a lot of the data we're using is really old. Interestingly, there's a Kaggle kernel that points out that what you could do is just take the last two weeks and take the average sales by date, by store number, by item number, and just submit that. And if you just submit that, you come about 30th. All right, so for those of you in the groceries competition... Terence has a comment or question.
I think this may have tripped me up, actually. I think you said date, store, item. I think it's actually store, item, sales, and then you take the mean across date.
Oh yeah, you're right. It's store, item, and on promotion.
On promotion, yeah.
If you do it by date as well, each row represents basically a cross tabulation of all of the sales on that date in that store for that item. So if you put date in there as well, there are only going to be one or two items being averaged in each of those cells, which is too much variation, basically. It's too sparse. It doesn't give you a terrible result, but it's not 30th. So your job, if you're looking at this competition, and we'll talk about this in the next class, is: how do you start with that model and make it a little bit better? Because if you can, then by the time we meet up next, hopefully you'll be above the top 30. Kaggle being Kaggle, lots of people have now taken this kernel and submitted it, and they all have about the same score. And the scores are ordered not just by score, but by date submitted. So if you now submit this kernel, you're not going to be 30th, because you're way down the list by when it was submitted. But if you can do a tiny bit better, you're going to be better than all of those people.
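A rough sketch of that kind of baseline, grouping by store, item and on promotion and averaging across dates, might look like the following. The data is made up, the two-week cutoff is computed from the last date in the toy table, and filling unseen combinations with zero is an assumption of this sketch rather than something the kernel is known to do.

```python
import pandas as pd

train = pd.DataFrame({
    "date": pd.to_datetime(["2017-07-01", "2017-08-10", "2017-08-15", "2017-08-15"]),
    "store_nbr": [1, 1, 1, 1],
    "item_nbr": [100, 100, 100, 200],
    "onpromotion": [False, False, False, True],
    "unit_sales": [50.0, 2.0, 4.0, 6.0],
})

# Keep only the most recent 14 days of the training period
cutoff = train["date"].max() - pd.Timedelta(days=13)
recent = train[train["date"] >= cutoff]

# Average across dates within each (store, item, onpromotion) cell
keys = ["store_nbr", "item_nbr", "onpromotion"]
baseline = recent.groupby(keys)["unit_sales"].mean().reset_index(name="pred")

# Look the averages up for the rows to be predicted
test = pd.DataFrame({
    "store_nbr": [1, 1],
    "item_nbr": [100, 200],
    "onpromotion": [False, False],
})
test_pred = test.merge(baseline, on=keys, how="left")
test_pred["pred"] = test_pred["pred"].fillna(0.0)   # assumption: unseen cell -> 0
```

The old July row (50 units) is ignored because it falls outside the two-week window, so the first prediction is the mean of 2 and 4.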
So try and think of how you can make this a tiny bit better.
Could you try to capture seasonality and trend effects by creating new columns, like these are the average sales in the month of August, these are the average sales for this year?
Yeah, I think that's a great idea. So the thing for you to think about is how to do that. See if you can make it work, because there are details to get right. I know Terence has been working on this for the last week, and he's gone almost crazy. The details are difficult. They're not intellectually difficult; they're difficult in the way that makes you want to bang your head against a desk at 2am. And this is something to mention in general: the coding you do for machine learning is incredibly frustrating and incredibly difficult. Not difficult technically, but difficult in that if you get a detail wrong, much of the time it's not going to give you an exception. It'll just silently be slightly less good than it otherwise would have been. If you're on Kaggle, at least you know, okay, I'm not doing as well as other people on Kaggle. But if you're not on Kaggle, you just don't know. You don't know if your company's model is half as good as it could be because you made a little mistake. So that's one of the reasons why practicing on Kaggle now is great: you're going to get practice in finding all of the ways in which you can infuriatingly screw things up. And you'll be amazed; for me, there's an extraordinary array of them. But as you get to know what they are, you'll start to know how to check for them as you go. You should assume that with every button you press, you're going to press the wrong button. And that's fine, as long as you have a way to find out.
So we'll talk about that more during the course, but unfortunately, there isn't a set of specific things I can tell you to always do. You just always have to think, okay, what do I know about the results of this thing I'm about to do? I'll give you a really simple example. Say you've actually created that basic entry where you take the mean by store number, by item, by on promotion, and you've submitted it and you've got a reasonable score. And then you think you've got something that's a little bit better, and you do predictions for that. How about you now create a scatter plot showing the predictions of your average model on one axis versus the predictions of your new model on the other axis? You should see that they just about form a line. And if they don't, then that's a very strong suggestion that you've screwed something up. So that would be an example. Okay. Can you pass that one to the end of that row? Possibly two steps. One step. So for a problem like this, unlike the car insurance problem on Kaggle where columns are unnamed, we know what the columns represent. How often do you pull in data from other sources to supplement that? Maybe weather data, for example. How often is that used? Very often. And so the whole point of this star schema is that you've got your central table, and then you've got these other tables coming off it that provide metadata about it. So for example, weather is metadata about a date. Now on Kaggle specifically, most competitions have the rule that you can use external data as long as you post on the forum that you're using it and that it's publicly available. But you have to check on a competition by competition basis; they will tell you. Outside of Kaggle, you should always be looking for what external data you could possibly leverage. All right. Yes, you can do it.
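The scatter-plot sanity check suggested above can also be done numerically. The sketch below uses made-up prediction arrays and a hypothetical `agreement` helper; it compares a baseline's predictions with those of a slightly improved model and with a deliberately broken one (rows shuffled out of order), using the correlation coefficient as a stand-in for "do the points roughly form a line".

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up predictions: a baseline, a slightly different "improved" model,
# and a broken model whose rows have been accidentally reordered.
baseline_preds = rng.uniform(0, 20, size=1000)
new_preds = baseline_preds + rng.normal(0, 0.5, size=1000)
broken_preds = rng.permutation(new_preds)

def agreement(a, b):
    """Pearson correlation between two prediction vectors (hypothetical helper)."""
    return float(np.corrcoef(a, b)[0, 1])

ok = agreement(baseline_preds, new_preds)
bad = agreement(baseline_preds, broken_preds)
print(f"baseline vs new:    {ok:.3f}")
print(f"baseline vs broken: {bad:.3f}")
```

A correlation close to 1 corresponds to the near-straight line you would hope to see in the plot; a value near 0 is the kind of warning sign that usually means rows were misaligned or a join went wrong.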
Because otherwise they can't be on the recording. Oh, okay. So are we still talking about how to tweak this data set? If you wish. Well, I'm not familiar with the country here. Ecuador. It's Ecuador's largest grocery chain. So maybe I would start looking for Ecuador's holidays and shopping holidays, maybe when they have a three day weekend. And actually that information is provided in this case. And so in general, one way of tackling this kind of problem is to create lots and lots of new columns containing things like average number of sales on holidays, average percent change in sales between January and February, and so on and so forth. And there's been a previous competition on Kaggle called Rossmann Store Sales that was almost identical. It was in Germany, in this case, for a major grocery chain: how many items are sold by day, by item type, by store. And in this case, the person who won, quite unusually, was actually something of a domain expert in this space. They're a specialist in doing logistics predictions, a professional sales forecast consultant. And basically what he did was create just lots and lots and lots of columns based on his experience of what kinds of things tend to be useful for making predictions. So that's an approach that can work. The third place team had almost no feature engineering, however. And also they had one big oversight, and I think they would have won if they hadn't had it. So you don't necessarily have to use this approach. So anyway, we'll be learning a lot more about how to win this competition and ones like it as we go. They did interview the third place team. So if you Google for Kaggle Rossmann, you'll see it. The short answer is they used big money. So one of the things, and these are a couple of charts. Terence is actually my teammate on this competition, so Terence drew a couple of these charts for us.
And I want to talk about this, which is that if you don't have a good validation set, it's hard, if not impossible, to create a good model. So in other words, if you're trying to predict next month's sales and you try to build a model, and you have no way of really knowing whether the models you've built are good at predicting sales a month ahead of time, then you have no way of knowing, when you put your model in production, whether it's actually going to be any good. So you need a validation set that you know is reliable at telling you whether or not your model is likely to work well when you put it into production or use it on the test set. So normally you should not use your test set for anything other than using it right at the end of the competition, or right at the end of the project, to find out how you've gone. But there's one thing I'm going to let you use the test set for in addition, and that is to calibrate your validation set. So what Terence did here was he built four different models, some of which he thought would be better than others, and he submitted each of the four models to Kaggle to find out its score. And so the x-axis is the score that Kaggle told us on the leaderboard. And then on the y-axis, he plotted the score on a particular validation set he was trying out, to see whether this validation set looked like it was going to be any good. So if your validation set is good, then the relationship between the leaderboard score, that is the test set score, and your validation set score should lie in a straight line. Ideally, it'll actually lie on the y equals x line. But honestly, that doesn't matter too much. As long as, relatively speaking, it tells you which models are better than which other models, then you know which model is the best. And you know how it's going to perform on the test set, because you know the linear relationship between the two things.
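To make the calibration idea concrete, here is a small NumPy sketch. The four pairs of scores are invented for illustration (they are not Terence's results); the code fits a straight line through them, reports how linear the relationship is, and checks whether the validation set ranks the models in the same order as the leaderboard.

```python
import numpy as np

# Hypothetical scores for four models (lower is better); NOT real results.
leaderboard = np.array([0.60, 0.58, 0.55, 0.53])   # x-axis: Kaggle score
validation = np.array([0.64, 0.61, 0.59, 0.56])    # y-axis: our validation score

# Straight-line fit: validation ~ slope * leaderboard + intercept.
slope, intercept = np.polyfit(leaderboard, validation, deg=1)

# How close to a straight line are the points?
r = np.corrcoef(leaderboard, validation)[0, 1]

# Does the validation set order the models the same way the leaderboard does?
same_ranking = bool((np.argsort(leaderboard) == np.argsort(validation)).all())

print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f} same_ranking={same_ranking}")
```

With only four points this is a rough check rather than a statistical test, but a reversed ranking or a low correlation is the "bad sign" that tells you to rethink how the validation set is constructed.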
So in this case, Terence has managed to come up with a validation set which is looking like it's going to predict our Kaggle leaderboard score pretty well. And that's really cool, right? Because now he can go away and try 100 different types of models, feature engineering, weighting, tweaks, hyperparameters, whatever else, see how they go on the validation set, and not have to submit to Kaggle. So we're going to get a lot more iterations and a lot more feedback. This is not just true of Kaggle, but of every machine learning project you do. And so here's a different one he tried where it wasn't as good. It's like, oh, these ones that were quite close to each other, it's showing us the opposite direction. That's a really bad sign. That's like, okay, this validation set idea didn't seem like a good idea. So in general, if your validation set's not showing a nice straight line, you need to think carefully: okay, how is the test set constructed? How is my validation set different? There's some way you're constructing it which is different. You're going to have to draw lots of charts and so forth. So one question is, and I'm going to try to guess how you did it: how do you actually try to construct this validation set to be as close as possible to the test set? What I would try to do is to sample points from the training set that are as close as possible to some of the points in the test set. Close in what sense? I don't know. I would have to find the features. Well, in this case, for this groceries competition, the last points. Yeah, close by date. So basically all the different things Terence was trying were different variations of close by date. So the most recent. Yeah, what I noticed was, so first I looked at the date range of the test set, and then I looked at the kernel that described how he or she... Here's the date range of the test set.
So the last two weeks of August 2017. That's right. And then the person who submitted the kernel that said how to get the 0.58 leaderboard position or whatever score. The average by group. I looked at the date range of that, and that was like nine or 10 days. Well, it was actually 14 days, and the test set is 16 days. But the interesting thing is the test set begins on the day after payday and ends on the payday. And so these are things I also paid attention to. But I think that's one of the bits of metadata that they told us. So these are the kinds of things you've just got to try. Like I said, plot lots of pictures. And even if you didn't know it was payday, you would want to draw the time series chart of sales, and you would hopefully see that every two weeks there would be a spike or whatever. And you'd be like, oh, I want to make sure that I have the same number of spikes in my validation set that I have in my test set, for example. Okay, let's take a five minute break and let's come back at 2:32. Okay, so this is my favorite bit, interpreting machine learning models. By the way, if you're looking for my notebook about the groceries competition, you won't find it on GitHub, because I'm not allowed to share code for running competitions with you unless you're on the same team as me. That's the rule. After the competition is finished, it'll be on GitHub forever. So if you're doing this through the video, you should be able to find it. So let's start by reading in our feather file. So our feather file is exactly the same as our CSV file. This is for our Blue Book for Bulldozers competition. So we're trying to predict the sale price of heavy industrial equipment at auction. And so reading the feather format file means that we've already read in the CSV and processed it into categories.
And so the next thing we do is to run proc_df in order to turn the categories into integers, deal with the missing values, and pull out the dependent variable. Okay, this is exactly the same thing as we used last time to create a validation set, where the validation set represents the last couple of weeks, the last 12,000 records by date. And I discovered, thanks to one of your excellent questions on the forum last week, I had a bug here, which is that proc_df was shuffling the order. Last week, we saw a particular version of proc_df where we passed in a subset. And when I passed in the subset, it was randomly shuffling. And so then when I called split_vals, it wasn't getting the last rows by date, it was getting a random set of rows. So I've now fixed that. So if you rerun the lesson one RF code, you'll see slightly different results. Specifically, you'll see in that section that my validation set results look less good. But that's only for this tiny little bit where I had subset=. I'm a little bit confused about the notation here. So nas is both an input variable, and it's also an output variable of this function. Why is that? proc_df returns a dictionary telling you which columns had missing values, and for each of those columns, what the median was. So when you call it on the larger dataset, the non-subset, you want to take that return value, and you don't pass in anything for that argument. You just want to get back the result. Later on, when you pass in a subset, you want to have the same missing columns and the same medians. And so you pass it in. And if this different subset, like if it was a whole different dataset, turned out to have some different missing columns, it would update that dictionary with some additional key values as well. So you don't have to pass it in. If you don't pass it in, then it just gives you the information about what was missing and the medians.
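The dictionary behaviour described here can be sketched with a small stand-alone function. `fill_missing` below is a hypothetical helper written for this illustration, not the library's `proc_df`; it only mimics the described contract: without a dictionary it records medians for columns with missing values, and with one it reuses the stored medians and adds entries for any newly missing columns. One assumption to note: it returns an updated copy of the dictionary rather than modifying the one passed in.

```python
import pandas as pd

def fill_missing(df, na_dict=None):
    """Fill numeric NaNs with medians and track them (hypothetical helper).

    Returns the filled frame and a dict mapping column name -> median used.
    """
    na_dict = {} if na_dict is None else dict(na_dict)  # work on a copy
    df = df.copy()
    for col in list(df.columns):
        if not pd.api.types.is_numeric_dtype(df[col]):
            continue
        if col in na_dict or df[col].isnull().any():
            # Reuse a stored median if there is one, else compute and record it.
            filler = na_dict.setdefault(col, df[col].median())
            df[col + "_na"] = df[col].isnull()   # flag which rows were missing
            df[col] = df[col].fillna(filler)
    return df, na_dict

full = pd.DataFrame({"a": [1.0, None, 3.0], "b": [1.0, 2.0, 3.0]})
subset = pd.DataFrame({"a": [None, 10.0], "b": [None, 4.0]})

# First call: nothing passed in, so we just get back what was missing.
full_filled, nas = fill_missing(full)

# Second call: pass the dictionary in so "a" is filled with the SAME median;
# "b" is newly missing here, so its median is added to the returned dictionary.
subset_filled, nas2 = fill_missing(subset, nas)
print(nas, nas2)
```

The point of reusing the dictionary is consistency: the subset ends up with the same `_na` indicator columns and the same fill values as the full dataset, so a model trained on one can be applied to the other.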
If you do pass it in, it uses that information for any missing columns that are there. And if there are some new missing columns, it will update that dictionary with that additional information. So it's like keeping the column information across all datasets? Yeah, it's going to keep track of any missing columns that you came across in anything you passed to proc_df. Thank you. Okay, so we split it into the training and test set, just like we did last week. And so to remind you, once we've done proc_df, this is what it looks like. This is the log of sale price. Okay, so the first thing to think about is we already know how to get the predictions, which is we take the average value in each leaf node in each tree after running a particular row through each tree. That's how we get the prediction. But normally, we don't just want a prediction, we also want to know how confident we are of that prediction. And so we would be less confident of a prediction if we haven't seen many examples of rows like this one. And if we haven't seen many examples of rows like this one, then we wouldn't expect any of the trees to have a path through which is really designed to help us predict that row. And so conceptually, you would expect that as you pass this unusual row through different trees, it's going to end up in very different places. So in other words, rather than just taking the mean of the predictions of the trees and saying that's our prediction, what if we took the standard deviation of the predictions of the trees? If the standard deviation of the predictions of the trees is high, that means each tree is giving us a very different estimate of this row's prediction. Whereas if this was a really common kind of row, then the trees will have learnt to make good predictions for it, because they've seen lots of opportunities to split based on those kinds of rows.
So the standard deviation of the predictions across the trees gives us at least a relative understanding of how confident we are of this prediction. So that is not something which exists in scikit-learn or in any library I know of. So we have to create it. But we already have almost the exact code we need, because remember last lesson, we actually manually calculated the averages across different sets of trees, so we can do exactly the same thing to calculate the standard deviations. So when I'm doing random forest interpretation, I pretty much never use the full dataset. I always call set_rf_samples, because we don't need a massively accurate random forest. We just need one which indicates the nature of the relationships involved. And so I just make sure this number is high enough that if I call the same interpretation commands multiple times, I don't get different results back each time. That's the rule of thumb about how big it needs to be. But in practice, 50,000 is a high number, and most of the time it'd be surprising if that wasn't enough, and it runs in seconds. So I generally start with 50,000. So with my 50,000 samples per tree set, I create 40 estimators. I know from last time that min_samples_leaf=3, max_features=0.5 isn't bad. And again, we're not trying to create the world's most predictive tree anyway. So that all sounds fine. We get an R squared on the validation set of 0.89. Again, we don't particularly care, as long as it's good enough, which it certainly is. And so here's where we can do that exact same list comprehension as last time. Remember, go through each estimator, that's each tree, call .predict on it with our validation set, make that a list comprehension, and pass that to np.stack, which concatenates everything in that list across a new axis. Okay, so now our rows are the results of each tree and our columns are the result of each row in the original dataset.
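A minimal sketch of the list comprehension and `np.stack` step, together with the per-row mean and standard deviation, might look like the following. It uses synthetic data rather than the bulldozers dataset, and the variable names (`m`, `X_valid`, `preds`) are just illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data standing in for the real training/validation split.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)
X_train, y_train, X_valid = X[:400], y[:400], X[400:]

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, random_state=0)
m.fit(X_train, y_train)

# One row per tree, one column per validation row.
preds = np.stack([t.predict(X_valid) for t in m.estimators_])

pred_mean = preds.mean(axis=0)   # the forest's prediction for each row
pred_std = preds.std(axis=0)     # how much the trees disagree for each row

print(preds.shape, pred_mean[0], pred_std[0])
```

The mean over trees reproduces what `m.predict` returns; the standard deviation is the extra piece of information, a relative indication of how much the trees disagree about a given row.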
And then, remember, we can calculate the mean. So here's the prediction for our dataset row number one. And here's our standard deviation. Okay, so here's how to do it for just one observation at the end here. We've calculated it for all of them, we're just printing it for one here. Now, this can take quite a while, and specifically, it's not taking advantage of the fact that my computer has lots of cores in it. The list comprehension itself is Python code, that is, my Python code. And Python code, unless you're doing special stuff, runs in serial, which means it runs on a single CPU. It doesn't take advantage of your multiple CPU hardware. And so if I wanted to run this on more trees and more data, this one second is going to go up. And you see here the wall time, the amount of actual time it took, is roughly equal to the CPU time, whereas if it was running on lots of cores, the CPU time would be higher than the wall time. So it turns out that fastai (actually not scikit-learn) provides a handy function called parallel_trees, which calls some stuff inside scikit-learn. And parallel_trees takes two things. It takes a random forest model that I trained, here it is, m, and some function to call. And it calls that function on every tree in parallel. So in other words, rather than calling t.predict(X_valid), let's create a function that calls t.predict(X_valid). Let's use parallel_trees to call it on our model for every tree. And it will return a list of the result of applying that function to every tree. And so then we can np.stack that. So hopefully you can see that that code and that code are basically the same thing, but this one is doing it in parallel. And so you can see here now our wall time has gone down to 500 milliseconds, and it's giving us exactly the same answer, so a little bit faster.
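To show the shape of the idea without depending on the library helper, here is a sketch that applies a function to every tree using Python's standard `concurrent.futures`. `parallel_over_trees` is a hypothetical stand-in written for this example; it is not the fastai implementation, and the choice of a thread pool with four workers is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def parallel_over_trees(model, fn, max_workers=4):
    """Apply fn to every fitted tree in model using a thread pool.

    Hypothetical stand-in for a 'call this on every tree in parallel' helper.
    Results come back in the same order as model.estimators_.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, model.estimators_))

# Synthetic data standing in for the real training/validation split.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)
X_train, y_train, X_valid = X[:400], y[:400], X[400:]

m = RandomForestRegressor(n_estimators=40, random_state=0).fit(X_train, y_train)

def get_preds(t):
    # The function applied to each individual tree.
    return t.predict(X_valid)

preds_parallel = np.stack(parallel_over_trees(m, get_preds))
preds_serial = np.stack([t.predict(X_valid) for t in m.estimators_])
print(preds_parallel.shape, np.array_equal(preds_parallel, preds_serial))
```

Whether this is actually faster depends on the data size and the machine; on a tiny example like this the overhead of the pool can outweigh the gain, so timing both versions is worthwhile.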
Time permitting, we'll talk about more general ways of writing code that runs in parallel, because it turns out to be super useful for data science. But here's one that we can use that's very specific to random forests. Okay, so what we can now do is we can always call this to get our predictions for each tree, and then we can call standard deviation to get them for every row. And so let's try using that. So what I could do is let's create a copy of our data and let's add an additional column to it, which is the standard deviation of the predictions across the first axis. And let's also add in the mean, so they're the predictions themselves. So you might remember from last lesson that one of the predictors we have is called Enclosure, and we'll see later on that this is an important predictor. And so let's start by just doing a histogram. So one of the nice things in pandas is it's got built-in plotting capabilities. It's well worth googling for pandas plotting to see how to do it. Yes, Terence? Can you remind me what enclosure is? So we don't know what it means, and it doesn't matter. That's the whole purpose of this process: we're going to learn about what things are, or at least what things are important, and we'll later on figure out what they are and how they're important. So we're going to start out knowing nothing about this data set, right? So I'm just going to look at something called Enclosure that has something called EROPS and something called OROPS, and I don't even know what this is yet. All I know is that the only three that really appear in any great quantity are OROPS, EROPS w AC, and EROPS. And this is really common as a data scientist. You often find yourself looking at data that you're not that familiar with, and you've got to figure out at least which bits to study more carefully and which bits seem to matter and so forth.
So in this case, I at least know that these three groups I really don't care about, because they basically don't exist. So given that, we're going to ignore those three. So we're going to focus on this one here, this one here, and this one here. And so here you can see what I've done is I've taken my data frame, and I've grouped by Enclosure, and I am taking the average of these three fields. So here you can see the average sale price, the average prediction, and the standard deviation of prediction for each of my three groups. So I can already start to learn a bit here. As you would expect, the prediction and the sale price are close to each other on average. So that's a good sign. And then the standard deviation varies a little bit. It's a little hard to see in a table. So what we could do is we could try to start plotting these things out. So here we've got the sale price for each level of Enclosure, and here we've got the prediction for each level of Enclosure. And for the error bars, I'm using the standard deviation of prediction. So here you can see the actual, and here's the prediction, and here's my confidence interval. Or at least it's the average of the standard deviation of the random forests. So this will tell us if there are some groups or some rows that we're not very confident of at all. So we could do something similar for product size. Right, so here's different product sizes. We can do exactly the same thing of looking at our predictions and standard deviations. And what we could say is, well, what's the ratio of the standard deviation of the predictions to the predictions themselves? Because you would kind of expect on average that when you're predicting something that's a bigger number, your standard deviation would be higher. So you can sort by that ratio.
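The group-and-average step, plus the ratio used for sorting, can be sketched in pandas as follows. The table below is made up (the numbers are not from the bulldozers data), and the column names `pred`, `pred_std`, and `std_ratio` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows: actual log sale price, mean tree prediction, std across trees.
x = pd.DataFrame({
    "Enclosure": ["EROPS", "EROPS", "OROPS", "OROPS", "EROPS w AC", "EROPS w AC"],
    "SalePrice": [9.8, 10.0, 9.6, 9.8, 10.6, 10.4],
    "pred":      [9.9, 9.9, 9.7, 9.7, 10.5, 10.5],
    "pred_std":  [0.30, 0.26, 0.20, 0.24, 0.25, 0.27],
})

# Average actual, prediction, and tree disagreement for each level.
summary = x.groupby("Enclosure", as_index=False)[["SalePrice", "pred", "pred_std"]].mean()

# Scale the disagreement by the size of the prediction, then sort so the
# groups we are relatively least confident about come first.
summary["std_ratio"] = summary["pred_std"] / summary["pred"]
summary = summary.sort_values("std_ratio", ascending=False)
print(summary)
```

Dividing by the prediction matters because larger predicted values tend to come with larger absolute spreads; the ratio makes groups with different price levels comparable.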
And what that tells us is that for product size large and product size compact, our predictions are less accurate, relatively speaking, as a ratio of the total price. And so then if we go back and have a look, well, there you go. That's why: from the histogram, those are the smallest groups. Okay, so as you would expect, in small groups we're doing a less good job. So this confidence interval you can really use for two main purposes. One is that you can group it up like this and look at the average confidence interval by group to find out whether there are some groups that you just don't seem to have confidence about. But perhaps more importantly, you can look at them for specific rows. And so when you put it in production, you might always want to see the confidence intervals. So if you're doing, say, credit scoring, deciding whether to give somebody a loan, you probably want to see not only what their level of risk is, but how confident we are. And if they want to borrow lots of money and we're not at all confident about our ability to predict whether they'll pay it back, we might want to give them a smaller amount. Okay, so those are the two ways in which you would use this. Okay, let me go to the next one, which is the most important. The most important is feature importance. And the only reason I didn't do this first is because I think the calculation of the confidence interval is the easiest one to understand intuitively. In fact, it's almost identical to something we've already calculated. But in terms of which one I look at first in practice, I always look at this one. So when I'm working on, whether it be a Kaggle competition or a real world project, I build a random forest as fast as I can, try and get it to the point that it's significantly better than random, but it doesn't have to be much better than that. And then the next thing I do is to plot the feature importance.
And the feature importance tells us, in this random forest, which columns mattered. So we had dozens and dozens of columns originally in this dataset. And here I'm just picking out the top 10. So you can just call rf_feat_importance. Again, this is part of the fastai library. It's leveraging stuff that's in scikit-learn. Pass in the model, pass in the data frame, because we need to know the names of the columns. And it'll give you back a pandas data frame showing you, in order of importance, how important each column was. And here I'm just going to pick out the top 10. So we can then plot that. Because it's a data frame, we can use data frame plotting commands. So here I've plotted all of the feature importances. And I haven't been able to write all of the names of the columns at the bottom, but that's not the important thing. The important thing is to see that some columns are really, really important, and most columns don't really matter at all. And in nearly every dataset you use in real life, this is what your feature importance is going to look like. It's going to say there's a handful of columns you care about. And this is why I always start here. Because at this point, in terms of learning about this domain of heavy industrial equipment auctions, I've only got to care about learning about the columns which matter. So are we going to bother learning about Enclosure? It depends whether Enclosure is important. And there it is. It's in the top 10. So we are going to have to learn about Enclosure. Okay? So then we could also plot this as a bar plot. So you can see here I've just created a tiny little function that's going to plot my bars. And I'm just going to do it for the top 30. And so you can see the same basic shape here. And I can see there's my Enclosure. Okay.
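A helper along these lines is easy to sketch with scikit-learn and pandas. The function below, `feat_importance`, is a hypothetical illustration of "pass in the model and the data frame, get back a sorted table"; it is not claimed to be the library's own code. The data is synthetic, with column names borrowed from the lecture purely as labels.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def feat_importance(model, df):
    """Return columns sorted by the model's importances (hypothetical helper)."""
    return (pd.DataFrame({"cols": df.columns, "imp": model.feature_importances_})
              .sort_values("imp", ascending=False))

# Synthetic data: two informative columns and three pure-noise columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(500, 5)),
    columns=["YearMade", "Coupler_System", "noise_a", "noise_b", "noise_c"],
)
y = 5 * df["YearMade"] + 2 * df["Coupler_System"] + rng.normal(scale=0.1, size=500)

m = RandomForestRegressor(n_estimators=40, random_state=0).fit(df, y)
fi = feat_importance(m, df)
print(fi.head(10))
```

Because the result is an ordinary data frame, it can be sliced for the top N rows or handed to pandas' plotting methods to get the line or bar plot described above.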
So we're going to learn about how this is calculated in just a moment. But before we talk about how it's calculated, much more important is to know what to do with it. So the most important thing to do with it is to now sit down with your client, or your data dictionary, or whatever your source of information is, and say to them, okay, tell me about YearMade. What does that mean? Where does it come from? Plot lots of things, like histograms of YearMade and scatter plots of YearMade against price, and learn everything you can, because YearMade and Coupler_System, they're the things that matter. And what will often happen in real world projects is that you'll sit with the client and you'll say, oh, it turns out that Coupler_System is the second most important thing. And then they might say, that makes no sense. Now that doesn't mean that there's a problem with your model. It means there's a problem with their understanding of the data that they gave you. So let me give you an example. I entered a Kaggle competition where the goal was to predict which applications for grants at a university would be successful. And I used this exact approach, and I discovered a number of columns which were almost entirely predictive of the dependent variable. And specifically, when I then looked to see in what way they were predictive, it turned out that whether they were missing or not was basically the only thing that mattered in this data set. And so I ended up winning that competition, and I think a lot of it was thanks to this insight. And later on, I heard what had happened. It turns out that at that university, there's an administrative burden to filling out the database. And so for a lot of the grant applications, they don't fill in the database for the folks whose applications weren't accepted.
So in other words, these missing values in the data set were saying, okay, this grant wasn't accepted, because if it was accepted, then the admin folks would go in and type in that information. So this is what we call data leakage. And data leakage means there's information in the data set that I was modeling with which the university wouldn't have had in real life at the point in time they were making a decision. So when they're actually deciding which grant applications to prioritize, they don't actually know which ones the admin staff are later on going to add information to because it turns out that they got accepted. You see what I mean? So one of the key things you'll find here is data leakage problems. And that's a serious problem that you need to deal with. The other thing that will happen is you'll often find signs of collinearity. And I think that's what's happened here with Coupler_System. I think Coupler_System tells you whether or not a particular kind of heavy industrial equipment has a particular feature on it. But if it's not that kind of industrial equipment at all, it will be empty. It will be missing. And so Coupler_System is really telling you whether or not it's a certain class of heavy industrial equipment. Now this is not leakage. This is actual information you actually have at the right time. It's just that in interpreting it, you have to be careful. So I would go through at least the top 10, or look for where the natural break points are, and really study these things carefully. To make life easier for myself, what I tend to do is I try to throw some data away and see if that matters. So in this case, I had a random forest, which, let's go and see how accurate it was: 0.889. What I did was I said, okay, let's go through our feature importance data frame and keep only those where the importance is greater than 0.005. So 0.005 is about here. It's kind of where they really flatten off.
So let's just keep those. And so that gives us a list of 25 column names. And so then I say, okay, let's now create a new data frame view which just contains those 25 columns, call split_vals on it again, split into test and training set, and create a new random forest. And let's see what happens. And you can see here, the R squared basically didn't change, 0.891 versus 0.889. So it's actually increased a tiny bit. I mean, generally speaking, removing redundant columns obviously shouldn't make it worse. If it makes it worse, they weren't redundant after all. It might make it a little better, because if you think about how we built these trees, when it's deciding what to split on, it's got fewer things to have to worry about trying. It's less often going to accidentally find a crappy column. So it's got a slightly better opportunity to create a slightly better tree with slightly less data. But it's not going to change it by much. But it's going to make it a bit faster, and it's going to let us focus on what matters. So if I rerun feature importance now, I've now got 25. Now the key thing that's happened when you remove redundant columns is that you're also removing sources of collinearity. In other words, two columns that might be related to each other. Now, collinearity doesn't make your random forest less predictive. But if you have two columns that are related to each other, say this column is a little bit related to this column, and this column is a strong driver of the dependent variable, then what's going to happen is that the importance is going to end up split between the two collinear columns. It's going to say, well, both of those columns matter. So it's going to split it up between the two. So by removing some of those columns with very little impact, it makes your feature importance a lot clearer. And so you can see here, actually, YearMade was pretty close to Coupler_System before.
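The "throw away low-importance columns and retrain" step can be sketched as below. Everything here is synthetic: the data, the column names, and the `fit_and_score` helper are assumptions for illustration, and the 0.005 threshold is simply the one quoted in the lecture, not a general recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: two informative columns plus eight noise columns.
rng = np.random.default_rng(0)
cols = ["YearMade", "Coupler_System"] + [f"noise_{i}" for i in range(8)]
df = pd.DataFrame(rng.normal(size=(600, len(cols))), columns=cols)
y = 5 * df["YearMade"] + df["Coupler_System"] + rng.normal(scale=0.1, size=600)

# Hold out the last rows as a validation set (mirrors a split by position).
n_trn = 450
X_train, X_valid = df.iloc[:n_trn], df.iloc[n_trn:]
y_train, y_valid = y.iloc[:n_trn], y.iloc[n_trn:]

def fit_and_score(columns):
    """Train on the given columns and return (model, validation R^2)."""
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, random_state=0)
    m.fit(X_train[columns], y_train)
    return m, m.score(X_valid[columns], y_valid)

m_full, r2_full = fit_and_score(cols)

# Keep only the columns whose importance clears the threshold, then retrain.
fi = (pd.DataFrame({"cols": cols, "imp": m_full.feature_importances_})
        .sort_values("imp", ascending=False))
to_keep = fi.loc[fi["imp"] > 0.005, "cols"].tolist()
m_keep, r2_keep = fit_and_score(to_keep)

print(f"{len(cols)} columns: R^2={r2_full:.3f}  |  {len(to_keep)} columns: R^2={r2_keep:.3f}")
```

If the validation score drops noticeably after filtering, that is a sign the removed columns were not redundant after all and the threshold should be revisited.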
But there must have been a bunch of things that were collinear with YearMade, which makes perfect sense, right? Old industrial equipment wouldn't have had a bunch of technical features that new ones would, for example. So it's actually saying, oh, okay, YearMade really, really matters. So I trust this feature importance better. The predictive accuracy of the model is a tiny bit better, but this feature importance has a lot less collinearity to confuse us. So let's talk about how this works. And it's actually really simple. And not only is it really simple, it's a technique you can use not just for random forests, but for basically any kind of machine learning model. And interestingly, almost no one knows that. Many people will tell you, oh, this particular kind of model, there's no way of interpreting it. And the most important interpretation of a model is knowing which things are important. And that's almost certainly not going to be true, because this technique I'm going to teach you actually works for any kind of model. So here's what we're going to do. We're going to take our dataset, the bulldozers, right? And we've got this column which we're trying to predict, which is price. And then we've got all of our independent variables. Okay. So here's an independent variable here, YearMade, plus a whole bunch of other variables. And remember, after we did a bit of trimming, we have 25 independent variables. Okay. How do we figure out how important YearMade is? Well, we've got our whole random forest, and we can find out our predictive accuracy. So we're going to put all of these rows through our random forest, and it's going to spit out some predictions. And we're going to compare them to the actual price. You get, in this case, for example, our root mean squared error and our R squared. And we're going to call that our starting point.
So now let's do exactly the same thing, but let's take the YearMade column and randomly shuffle it. We randomly permute just that column. So now YearMade has exactly the same distribution as before, same mean, same standard deviation. But it's going to have no relationship to the dependent variable at all, because we've totally randomly reordered it. So before, we might have found our R squared was 0.89, right? And then after we shuffle YearMade, we check again, and now it's like 0.8. Oh, our score got much worse when we destroyed that variable. So okay, let's try again. Let's put YearMade back to how it was. And this time, let's take Enclosure and shuffle that. And we find this time with Enclosure, it's 0.84. And we can say, oh, okay. So the amount of decrease in our score for YearMade was 0.09, and the amount of decrease in our score for Enclosure was 0.05. And this is going to give us our feature importances for each one of our columns. Yes? Couldn't you just exclude each column in turn, run a random forest, and check the decay in the performance? Yeah, so you could remove the column and train a whole new random forest, but that's going to be really slow. Whereas this way, we can keep our random forest and just test the predictive accuracy of it again. So this is nice and fast by comparison. In this case, we just have to re-run every row forward through the forest for each shuffled column. So we're just basically doing predictions? Exactly. Great question. So if you wanted to look at multicollinearity, would you do two of them with a random shuffle, and then three of them with a random shuffle? Yeah, so I don't think you mean multicollinearity. I think you mean looking for interaction effects. So if you want to say which pairs of variables are most important, you could do exactly the same thing, shuffling each pair in turn.
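To make the shuffling procedure above concrete, here is a minimal sketch in Python with scikit-learn. It is an illustration under stated assumptions, not the code from the lecture notebook: the data is synthetic rather than the bulldozers dataset, the column names strong, weak and noise are made up, and shuffle_importance is a hypothetical helper written just for this example. It scores the fitted forest once as a starting point, shuffles one column (or a group of columns together), scores again without retraining, and reports the drop in R squared.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score


def shuffle_importance(model, X, y, columns, seed=0):
    """Drop in R^2 when `columns` are shuffled together (hypothetical helper).

    The model is never retrained; we only run predictions again.
    """
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, model.predict(X))
    # One permutation shared by every column in the group, so the link to
    # the target is destroyed while each column keeps its own distribution.
    perm = rng.permutation(len(X))
    X_shuffled = X.copy()
    for col in columns:
        X_shuffled[col] = X[col].to_numpy()[perm]
    return baseline - r2_score(y, model.predict(X_shuffled))


# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "strong": rng.normal(size=n),  # strong driver of the target
    "weak": rng.normal(size=n),    # weaker driver
    "noise": rng.normal(size=n),   # unrelated to the target
})
y = 3.0 * X["strong"] + 1.0 * X["weak"] + 0.1 * rng.normal(size=n)

# Fit on the first 800 rows, score on the remaining 200.
X_train, X_valid = X.iloc[:800], X.iloc[800:]
y_train, y_valid = y.iloc[:800], y.iloc[800:]
model = RandomForestRegressor(n_estimators=40, random_state=0)
model.fit(X_train, y_train)

# One shuffle per column, each starting from the unshuffled data.
importances = {
    col: shuffle_importance(model, X_valid, y_valid, [col])
    for col in X.columns
}
for col, drop in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{col:>6}: R^2 drops by {drop:.3f}")

# The same helper handles a pair by shuffling both columns together.
pair_drop = shuffle_importance(model, X_valid, y_valid, ["strong", "weak"])
print(f"strong+weak together: R^2 drops by {pair_drop:.3f}")
```

Because a single shuffle is random, the numbers move a little from seed to seed; averaging over several shuffles per column gives a steadier ranking. scikit-learn also provides sklearn.inspection.permutation_importance, which does this repetition for single columns.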
In practice, there are better ways to do that because that's obviously computationally pretty expensive. We'll try to find time to come back to that. Okay, so we now have a model which is a little bit more accurate and we've learned a lot more about it. So we're out of time. What I would suggest you try doing before next class, for this bulldozers dataset, is go through the top, I don't know, five or ten predictors and try to learn what you can about how to draw plots in pandas, and try to come back with some insights about things like: what's the relationship between YearMade and the dependent variable? What's the histogram of YearMade? Now that you know YearMade is really important, is there some noise in that column which we could fix? Are there some weird encodings in that column that we could fix? This idea I had that maybe Coupler_System is there entirely because it's collinear with something else. Do you want to try and figure out whether that's true? If so, how would you do it? fiProductClassDesc, that rings alarm bells for me. It sounds like it might be a high cardinality categorical variable. It might be something with lots and lots of levels, because it sounds like it's a model name. So go and have a look at that model name. Does it have some order to it? Could you make it an ordinal variable to make it better? Does it have some kind of hierarchical structure in the string that we could split on, like a hyphen, to create more sub-columns? Have a think about this. And try to make it so that by Tuesday when you come back, ideally you've got a better accuracy than what I just showed because you've found some new insights, or at least you can tell the class about some things you've learnt about how heavy industrial equipment auctions work in practice. Okay, great. See you on Tuesday.