So we've looked at a lot of different random forest interpretation techniques, and a question that's come up a bit on the forums is: what are these for, really? How do these help me get a better score on Kaggle? And my answer has kind of been: they don't, necessarily. So I want to talk more about why we do machine learning. What's the point? To answer this question, I'm going to put this PowerPoint in the GitHub repo so you can have a look. I want to show you something really important, which is examples of how people have used machine learning, mainly in business, because that's where most of you are probably going to end up after this, working for some company. I'm going to show you applications of machine learning which are either based on things that I've been personally involved in myself, or things I know of people doing directly. So none of these are hypotheticals; these are all actual things that people are doing, which I have direct or second-hand knowledge of. I'm going to split them into two groups, horizontal and vertical. In business, horizontal means something that you do across different kinds of business, whereas vertical means something that you do within a business, or within a supply chain, or within a process. So an example of a horizontal application is everything involving marketing. Pretty much every company has to try to sell more products to its customers, and so therefore does marketing. And so each of these boxes is an example of something that people are using machine learning for in marketing. So let's take an example: churn. Churn refers to a model which attempts to predict who's going to leave. I've done some churn modeling fairly recently in telecommunications, where we were trying to figure out, for this big cell phone company, which customers are going to leave. That is not, of itself, that interesting. Building a highly predictive model that says Jeremy Howard is almost certainly going to leave next month is probably not that helpful, because if I'm almost certainly going to leave next month, there's probably nothing you can do about it. It's too late, and it's going to cost you too much to keep me. So in order to understand why we would do churn modeling, I've got a little framework that you might find helpful. If you Google for "Jeremy Howard data products" (I think I've mentioned this before), there's a paper you can find called Designing Great Data Products that I wrote with a couple of colleagues a few years ago. In it, I describe my experience of actually turning machine learning models into stuff that makes money. The basic trick is this thing I call the drivetrain approach, which is these four steps. The starting point for turning a machine learning project into something that's actually useful is to know: what am I trying to achieve? And that doesn't mean I'm trying to achieve a high area under the ROC curve, or a large difference between classes. No, it would be: I'm trying to sell more books, or I'm trying to reduce the number of customers that leave next month, or I'm trying to detect lung cancer earlier. These are objectives. The objective is something that absolutely directly is the thing that the company or the organization actually wants. No company or organization exists in order to create a more accurate predictive model.
There's some reason. So that's your objective. Now, that's obviously the most important thing: if you don't know the purpose of what you're modeling for, then you can't possibly do a good job of it. And hopefully people are starting to pick that up out there in the world of data science. But interestingly, what very few people are talking about, and it's just as important, is the next thing, which is levers. A lever is a thing that the organization can do to actually drive the objective. So let's take the example of churn modeling. What is a lever that an organization could use to reduce the number of customers that are leaving? They could take a closer look at the model, do some of this random forest interpretation, see some of the things that are causing people to leave, and potentially change those issues in the company. So that's a data scientist's answer, but I want you to go to the next level. The levers are the things they can do. Can you pass the mic behind you? What are the things that they can do? Just outreach, like calling or sending emails; they could call someone and say, are you happy, is there anything we could do? They can provide incentives to increase engagement with the product? Yeah, so they could give them a free pen or something if they buy 20 bucks worth of product next month. You were going to say that as well? Okay, so you guys are giving out carrots rather than handing out sticks. Do you want to send it over a couple of rows? Could you change the price of the product, or make it a subscription or something? Yeah, you could give them a special offer. So these are levers. And so whenever you're working as a data scientist, keep coming back and thinking: what are we trying to achieve, we being the organization, and how are we trying to achieve it, meaning what are the actual things we can do to make that objective happen? Building a model is never, ever a lever, but it could help you with the lever. So then the next step is: what data does the organization have that could possibly help them to set that lever to achieve that objective? And this is not what data did they give you when you started the project. Think about it from a first-principles point of view: okay, I'm working for a telecommunications company, they gave me a certain set of data, but I'm sure they must know where their customers live, how many phone calls they made last month, how many times they called customer service, whatever. And so have a think: okay, if we're trying to decide who we should proactively give a special offer to, then we want to figure out what information we have that might help us identify who's going to react well or badly to that. Perhaps more interesting would be if we were doing a fraud algorithm. So we're trying to figure out who's going to not pay for the phone that they take out of the store; they're on some 12-month payment plan, and we never see them again. Now, in that case, for the data we have available, it doesn't matter what's in the database; what matters is what data we can get when the customer is in the shop. So there are often constraints around the data that we can actually use. So we need to know: what am I trying to achieve? What can this organization actually do, specifically, to change that outcome?
And at the point that that decision is being made, what data do they have, or could they collect? And so the way I put that all together is with a model, and this is not a model in the sense of a predictive model, but a model in the sense of a simulation model. One of the main examples I give in this paper is one I spent many years building, which is: if an insurance company changes their prices, how does that impact their profitability? And so generally your simulation model contains a number of predictive models. So I had, for example, a predictive model called an elasticity model that said, for a specific customer, if we charge them a specific price for a specific product, what's the probability that they would say yes when it's new business, and then later, what's the probability that they'll renew. And then there's another predictive model, which is: what's the probability that they're going to make a claim, and how much is that claim going to be? And so you can combine these models together to say, all right, if we changed our pricing by reducing it by 10% for everybody between 18 and 25, and we run it through these models that combine together into a simulation, then the overall impact on our market share in 10 years' time is X, and our cost is Y, and our profit is Z, and so forth. So in practice, most of the time, you really are going to care more about the results of that simulation than you do about the predictive model directly. But most people are not doing this effectively at the moment. So for example, when I go to Amazon: I read all of Douglas Adams's books, and so having read all of Douglas Adams's books, the next time I went to Amazon, they said, would you like to buy the collected works of Douglas Adams? This is after I had bought every one of his books. From a machine learning point of view, some data scientist had said, oh, people that buy one of Douglas Adams's books often go on to buy the collected works. But recommending to me that I buy the collected works of Douglas Adams isn't smart. And it's actually not smart at a number of levels. Not only is it unlikely that I'm going to buy a box set of something of which I own every volume individually, but furthermore, it's not going to change my buying behavior. I already know about Douglas Adams. I already know I like him. So taking up your valuable web space to tell me, hey, maybe you should buy more of the author you're already familiar with and have bought lots of times, isn't actually going to change my behavior. So what if, instead of creating a predictive model, Amazon had built an optimization model that could simulate and say: if we show Jeremy this ad, how likely is he then to go on to buy this book, and if we don't show him this ad, how likely is he to go on to buy this book? And so that's the counterfactual. The counterfactual is what would have happened otherwise. And then you can take the difference and say, okay, what should we recommend to him that is going to maximally change his behavior, so maximally result in him buying more books? And so you'd probably say, oh, he's never bought any Terry Pratchett books. He probably doesn't know about Terry Pratchett, but lots of people that liked Douglas Adams did turn out to like Terry Pratchett. So let's introduce him to a new author. So that's the difference between a predictive model on the one hand and an optimization model on the other.
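To make that counterfactual idea a bit more concrete, here's a minimal sketch of the difference-of-two-predictions calculation. Everything in it is hypothetical (the file name, the column names, the shown_ad treatment flag); it's just one way you might score the "show the recommendation" lever, not how Amazon actually does it.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical historical data: customer features, whether we showed the recommendation,
# and whether the customer went on to buy.
df = pd.read_csv('recommendations.csv')
features = ['age', 'past_purchases', 'shown_ad']

m = RandomForestClassifier(n_estimators=100)
m.fit(df[features], df['bought'])

# Score every customer under both settings of the lever; the difference is the
# estimated change in behavior (the counterfactual uplift), not just P(buy).
p_shown = m.predict_proba(df[features].assign(shown_ad=1))[:, 1]
p_not_shown = m.predict_proba(df[features].assign(shown_ad=0))[:, 1]
df['uplift'] = p_shown - p_not_shown

# Recommend to the customers whose behavior we expect to change the most.
best_targets = df.sort_values('uplift', ascending=False).head(10)
```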
So the two tend to go hand in hand. The optimization model basically relies on a simulation model. The simulation model says: in a world where we put Terry Pratchett's book on the front page of Amazon for Jeremy Howard, this is what would have happened; he would have bought it with, say, a 4% probability. And so that then tells us, for this lever of what do I put on my home page for Jeremy today: of all the different settings of that lever, putting Terry Pratchett on the home page has the highest simulated outcome, and that's the thing which maximizes our profit from Jeremy's visit to Amazon.com today. So generally speaking, your predictive models feed into this simulation model, but you've got to think about how they all work together. So for example, let's go back to churn. It turns out that Jeremy Howard is very likely to leave his cell phone company next month. What are we going to do about it? Oh, let's call him. And I can tell you, if my cell phone company calls me right now and says, just calling to say we love you, I'd be like, I'm cancelling right now. That would be a terrible idea. So again, you would want a simulation model that says: what's the probability that Jeremy is going to change his behavior as a result of calling him right now? So one of the levers I have is call him. On the other hand, if I got a piece of mail tomorrow that said, for each month you stay with us, we're going to give you $100,000, then that's definitely going to change my behavior. But then, feeding that into the simulation model, it turns out that overall that would be an unprofitable choice to make. So do you see how all this fits together? When we look at something like churn, we want to be thinking: what are the levers we can pull, and what are the kinds of models that we could build, with what kinds of data, to help us pull those levers better to achieve our objectives? And when you think about it that way, you realize that the vast majority of these applications are not mainly about a predictive model at all; they're about interpretation, about understanding what happens if. So if we take the cross product (not the cross product, sorry, the intersection) between, on the one hand, all the levers that we could pull, all the things we can do, and on the other hand all of the features from our random forest feature importance that turn out to be strong drivers of the outcome, then the intersection of those is the levers we could pull that actually matter. Because if you can't change the thing, it's not very interesting, and if it's not actually a significant driver, it's not very interesting either. So we can actually use our random forest feature importance to tell us what we can actually do to make a difference, and then we can use the partial dependence to build this kind of simulation model and say, okay, well if we did change that, what would happen? So there are lots and lots of these examples. And what I want you to think about, as you think about the machine learning problems you're working on, is: why does somebody care about this? What would a good answer to them look like? And how could you actually positively impact this business?
So if you're creating a Kaggle kernel, try to think about it from the point of view of the competition organizer: what would they want to know, and how can you give them that information? Something like fraud detection, on the other hand, you probably just basically want to know who's fraudulent, so you probably do just care about the predictive model. But then you do have to think carefully about data availability. It's like, okay, we need to know who's fraudulent at the point that we're about to deliver them a product, so there's no point looking at data that only becomes available a month later, for instance. So you've got this key issue of thinking about the actual operational constraints that you're working under. There are lots of interesting applications in human resources too, like employee churn, which is another kind of churn model: finding out that Jeremy Howard is sick of lecturing and is going to leave tomorrow. What are you going to do about it? Well, knowing that wouldn't actually be helpful; it'd be too late. You would actually want a model that said, what kinds of people are leaving USF? And it turns out that, oh, everybody that goes to the downstairs cafe leaves USF, I guess their food is awful or whatever. Or everybody that we're paying less than half a million dollars a year is leaving USF because they can't afford basic housing in San Francisco. So you could use your employee churn model not so much to say which employees hate us, but why do employees leave. And so again, it's really the interpretation there that matters. Now, lead prioritization is a really interesting one. This is one where a lot of companies... yes, Dina, can you pass that over there? Yeah, so I was just wondering: for the churn thing, you suggested, for example, paying an employee a million a year or something, but then it sounds like there are two things you need to predict, one being the churn and one being the thing you need to optimize for, like your profit. So how does that work? Yeah, exactly. So this is what this simulation model is all about; it's a great question. You figure out the objective we're trying to maximize, which is, say, company profitability. You can create a pretty simple Excel-style model that says, here are the revenues and here are the costs, and the cost is equal to the number of people we employ multiplied by their salaries, blah, blah, blah. And inside that model there are certain cells, certain inputs, where you're like, oh, that thing's kind of stochastic, or that thing is uncertain, but we could predict it with a model. And so that's what I do then: I think, okay, we need a predictive model for how likely somebody is to stay if we change their salary, how likely they are to leave with their current salary, how likely they are to leave next year if I increase their salary now, and so on. So you build a bunch of these models, and then you can combine them together with simple business logic, and then you can optimize that. You can then say, okay, if I pay Jeremy Howard half a million dollars, that's probably a really good idea, and if I pay him less than that, it's probably not, or whatever. You can figure out the overall impact.
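As a rough sketch of what that looks like in code (all of the numbers, column names, and the churn_model here are made up for illustration; the point is just that the predictive model becomes one cell inside a simple profit simulation):

```python
import numpy as np

def expected_profit(churn_model, employee_row, salary,
                    revenue_per_year=300_000, replacement_cost=50_000):
    """Simple business logic wrapped around a predictive model: P(leave) comes from
    the model; everything else is arithmetic a manager could check in a spreadsheet."""
    row = employee_row.copy()
    row['salary'] = salary
    p_leave = churn_model.predict_proba(row.to_frame().T)[0, 1]
    stay_profit = revenue_per_year - salary
    return (1 - p_leave) * stay_profit - p_leave * replacement_cost

# Try each setting of the lever and pick the salary with the best simulated outcome
# (churn_model and employee are assumed to already exist).
candidate_salaries = np.arange(100_000, 500_001, 50_000)
best_salary = max(candidate_salaries,
                  key=lambda s: expected_profit(churn_model, employee, s))
```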
And it's really shocking to me how few people do this; most people in industry measure their models using AUC or RMSE or whatever, which is never actually the thing you want. Yes, can you pass it over here? I wanted to stress the point that you made before. In my experience, a lot of the problem is to define the problem. You're in a company, you're talking to somebody that doesn't have this mentality that you have; they don't know that you have to have an x and a y and so on. So you have to try to get that out of them: what exactly do you want? And try to go through a few iterations of understanding what they want. And then you know the data, you know what you can actually measure, which is often not exactly what they want, so you have to get a proxy for what they want. And so a lot of what you do is not so much building really good models (some people do just work on that), but working out how to pose this as a classification or regression or some other type of modeling problem. That's actually the most interesting part, I think, and it's also what you have to do well: understand the technical model building deeply, but also understand the strategic context deeply. So this is one way to think about it. And as I say, there aren't many things I wrote in 2012 that I'm still recommending, but this one I think is still equally valid today. So another great example is lead prioritization. For every one of these boxes I'm showing, you can generally find a company, or many companies, whose sole job in life is to build models of that thing. So there are lots of companies that sell lead prioritization systems. But again, the question is, how would we use that information? If it's like, oh, our best lead is Jeremy, he's our highest probability of buying, does that mean I should send a salesperson out to Jeremy, or that I shouldn't? If he's highly probable to buy, why waste my time with him? So again, you really want some kind of simulation that says: what's the likely change in Jeremy's behavior if I send my best salesperson, Yannette, out to go and encourage him to sign? So I think there are many, many opportunities for data scientists in the world today to move beyond predictive modeling, to actually bringing it all together with the kind of stuff that Dina was talking about in her question. So as well as these horizontal applications that basically apply to every company, there's a whole bunch of applications that are specific to every part of the world. For those of you that end up in healthcare, some of you will become experts in one or more of these areas, like readmission risk. So what's the probability that this patient is going to come back to the hospital? Readmission, depending on the details of the jurisdiction and so forth, can be a disaster for hospitals. So if you find out that this patient has a high probability of readmission, what do you do about it? Well, the predictive model is helpful of itself; it rather suggests we just shouldn't send them home yet, because they're going to come back.
But wouldn't it be nice if we had the tree interpreter and it said to us: the reason that they're at high risk is because we don't have a recent EKG for them, and without a recent EKG we can't have high confidence about their cardiac health. In which case it wouldn't be, well, let's keep them in the hospital for two weeks anyway; it would be, let's give them an EKG. So this is the interaction between interpretation and predictive accuracy. So, correct me if I'm wrong, but what I'm understanding you to be saying is that the predictive models are a really great starting point, but in order to actually answer these questions, we really need to focus on the interpretability of these models? Yeah, I think so. And more specifically, I'm saying we just learned a whole raft of random forest interpretation techniques, and so I'm trying to justify why. And the reason why is because, actually, maybe I'd say most of the time the interpretation is the thing we care about. You can create a chart or a table without machine learning, and indeed that's how most of the world works: most managers build all kinds of tables and charts without any machine learning behind them. But they often make terrible decisions, because they don't know the feature importance for the objective they're interested in, so the table they create is of things that are actually the least important things anyway. Or they just do a univariate chart rather than a partial dependence plot, so they don't realize that the relationship they thought they were looking at is due entirely to something else. So I'm arguing for data scientists getting much more deeply involved in strategy, and in trying to use machine learning to really help a business with all of its objectives. Now, there are companies like Dunnhumby, a huge company that does nothing but retail applications of machine learning. And I believe there's a Dunnhumby product you can buy which will help you figure out: if I put my new store in this location versus that location, how many people are going to shop there? Or if I put my diapers in this part of the shop versus that part of the shop, how's that going to impact purchasing behavior? So I think it's also good to realize that the subset of machine learning applications you tend to hear about in the tech press is this massively biased tiny subset of stuff that Google and Facebook do, whereas the vast majority of stuff that actually makes the world go round is these kinds of applications that help people make things, buy things, sell things, build things, and so forth. About tree interpretation: the way we looked at the tree was that we manually checked which feature was more important for a particular observation. But businesses would have a huge amount of data, and they want this interpretation for a lot of observations. So how do they automate it? I don't think the automation is at all difficult; you can run any of these algorithms looping through the rows, or doing them in parallel. It's all just code. Oh, maybe I'm misunderstanding your question. Is it like they set a threshold, so that if some feature is above it... different people will have different behavior? Oh, so, yeah. Okay, I get it. That's a good question.
The important thing here, and this is a really important issue actually, is that the vast majority of machine learning models don't automate anything; they're designed to provide information to humans. So for example, if you're a customer service phone operator for an insurance company and your customer asks you, why is my renewal $500 more expensive than last time, then hopefully the insurance company provides, in your terminal, a little screen that shows the result of the tree interpreter or whatever, so that you can tell the customer: okay, well, last year you were in this different zip code, which has lower amounts of car theft, and this year you've also changed your vehicle to a more expensive one, or whatever. So it's not so much about thresholds and automation, but about making these model outputs available to the decision makers in an organization, whether they're at the top strategic level, are we going to shut down this whole product or not, all the way down to the operational level, like that individual discussion with a customer. Another example is aircraft scheduling and gate management. There are lots of companies that do that. Basically what happens is that there are people at an airport whose job it is to tell each aircraft which gate to go to, to figure out when to close the doors, stuff like that. And so the idea is you're giving them software which has the information they need to make good decisions. The machine learning models end up embedded in that software, to say, okay, that plane that's currently coming in from Miami, there's a 48% chance that it's going to be over five minutes late, and if it is, then this is going to be the knock-on impact through the rest of the terminal, for instance. So that's how these things tend to fit together. And there are so many of these; there are lots and lots. I don't expect you to remember all these applications, but what I do want you to do is to spend some time thinking about them. Sit down with one of your friends and talk through a few examples: okay, how would we go about doing failure analysis in manufacturing? Who would be doing that? Why would they be doing it? What kind of models might they use? What kind of data might they use? Start to practice this and get a sense, because then when you're interviewing, and when you're at the workplace and you're talking to managers, you want to be able to straight away recognize, for the person you're talking to: what are they trying to achieve? What are the levers that they have to pull? What is the data they have available to pull those levers to achieve that thing? And therefore, how could we build models to help them do that, and what kind of predictions would they have to be making? And so then you can have this really thoughtful, empathetic conversation with those people, saying, hey, in order to reduce the number of customers that are leaving, I guess you're trying to figure out who you should be providing better pricing to, or whatever, and so forth.
So what I'm noticing from your beautiful little chart above is that a lot of this, to me at least, still seems like the primary purpose, at least at the base level, is predictive power. And so I guess my question is: for explanatory problems, like a lot of the ones people are faced with in the social sciences, is that something machine learning can be used for, or is used for, or is that not really its realm? Yeah, that's a great question. And I've had a lot of conversations about this with people in the social sciences, and currently machine learning is not well applied in economics or psychology or whatever, on the whole. But I'm convinced it can be, for the exact reasons we're talking about. So if you're trying to do some kind of behavioral economics and you're trying to understand why some people behave differently from other people, a random forest with a feature importance plot would be a great way to start. Or, more interestingly, if you're trying to do some kind of sociology experiment or analysis based on a large social network data set, where you have an observational study, you really want to try to pull out all of the sources of exogenous variables, all the stuff that's going on outside. And if you use a partial dependence plot with a random forest, that happens automatically. So I actually gave a talk at MIT a couple of years ago for the first Conference on Digital Experimentation, which was really about how we experiment in things like social networks and these digital environments. Economists mostly do things with classic statistical tests, but in this case the economists I talked to were absolutely fascinated by this, and they actually asked me to give an introduction to machine learning session at MIT to various faculty and graduate folks in the economics department. And some of those folks have gone on to write some pretty famous books and stuff, so hopefully it's been useful. So it's definitely early days, but it's a big, big opportunity. But as Yannette says, there's plenty of skepticism still out there. Huh? Well, the skepticism comes from unfamiliarity, basically, with this totally different approach. So if you've spent 20 years studying econometrics and somebody comes along and says, here's a totally different approach to all the stuff that econometricians do, naturally your first reaction will be: prove it. And that's fair enough. But I think, over time, the next generation of people who are growing up with machine learning, some of them will move into the social sciences, they'll make huge impacts that nobody's ever managed to make before, and people will start going, wow. Just like happened in computer vision. Computer vision spent a long time with people saying, hey, maybe you should use deep learning for computer vision, and everybody in computer vision was like, prove it, we have decades of work on amazing feature detectors for computer vision. And then finally in 2012, Hinton and Krizhevsky came along and said, okay, our model is twice as good as yours, and we've only just started on this.
And everybody was like, oh, okay, that's pretty convincing. And nowadays every computer vision researcher basically uses deep learning. So I think that time will come in this area too. Okay, I think what we might do then is take a break, and we're going to come back and talk about these random forest interpretation techniques and do a bit of a review. So let's come back at two o'clock. So let's have a go at talking about these different random forest interpretation methods, having talked about why they're important. Let's now remind ourselves what they are. I'm going to let you folks have a go. So let's start with confidence based on tree variance. Can one of you tell me one or more of the following things about confidence based on tree variance: what does it tell us, why would we be interested in it, and how is it calculated? This is going back a ways, because it was the first one we looked at. Even if you're not sure, or you only know a little piece of it, give us your piece and we'll build on it together. I think I've got a piece of it. It's getting the variance of our predictions from the random forest. That's true; that's the how. Can you be more specific? What is it the variance of? If I'm remembering correctly, I think it's just the overall prediction. The variance of the predictions of the trees, yes. So normally the prediction is just the average; this is the variance of the trees. So it kind of just gives you an idea of how much your prediction is going to vary. So if maybe you want to minimize variance, maybe that's your goal, for whatever reason that could be. That's not so much the reason. I like your calculation description; let's see if somebody else can tell us how you might use that. It's okay if you're not sure, have a start. So I remember that we talked about the independence of the trees, and so maybe something about whether the variance of the trees is higher or lower than... No, not so much that. That's an interesting question, but it's not what we're going to see here. Can you pass it back behind you? To remind you, just to fill in a detail here: what we generally do is take just one row, one observation often, and find out how confident we are about that, how much variance there is in the trees for that. Or we can do it, as we did here, for different groups. So according to me, the idea is that for each row, we calculate the standard deviation of the predictions we get from the random forest model, and then maybe group by different predictor variables and see for which particular predictor the standard deviation is high, and then dig down into why it is happening; maybe it is because a particular category of that variable has a very small number of observations. Yeah, that's great. So that would be one approach, which is kind of what we've done here: to say, are there any groups where we're very unconfident? Something that I think is even more important would be when you're using this operationally. Let's say you're doing a credit decisioning algorithm, so we're trying to say, okay, is Jeremy a good risk or a bad risk, should we loan him a million dollars? And the random forest says, I think he's a good risk, but I'm not at all confident. In which case we might say, okay, maybe I shouldn't give him a million dollars. Or else, if the random forest said, I think he's a good risk and I am very sure of that, then we're much more comfortable giving him a million dollars.
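In code, that looks roughly like this; a minimal sketch, assuming m is an already-fitted scikit-learn RandomForestRegressor and X_valid is your validation set.

```python
import numpy as np

# One prediction per tree, stacked up: shape (n_trees, n_rows).
preds = np.stack([t.predict(X_valid) for t in m.estimators_])

mean_pred = preds.mean(axis=0)   # the usual random forest prediction
std_pred = preds.std(axis=0)     # our confidence measure: how much the trees disagree

# For a credit-style decision you might act on a pessimistic estimate rather than the mean,
# e.g. the mean minus a couple of standard deviations for the row you care about.
worst_case = mean_pred[0] - 2 * std_pred[0]
```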
And I'm a very good risk, so feel free to give me a million dollars; I did check the random forest beforehand, in a different notebook, not in the repo. So it's quite hard for me to give you folks direct experience with this kind of single-observation interpretation stuff, because it's really the kind of stuff that you actually need to be putting out to the front line, do you know what I mean? It's not something which you can really use so much in a Kaggle context, but it's more like, okay, if you're actually putting out some algorithm which is making big decisions that could cost a lot of money, you probably don't so much care about the average prediction of the random forest; maybe you actually care about the average minus a couple of standard deviations, the kind of worst-case prediction. And as Shikar mentioned, maybe there's a whole group that we're unconfident about. So that's confidence based on tree variance. All right, who wants to have a go at answering feature importance? What is it, why is it interesting, how do we calculate it, or any subset thereof? Dina. I think it's basically to find out which features are important for your model. So you take each feature and you randomly shuffle all the values in that feature, and you see how the predictions change. If they're very different, it means that feature was actually important; else, if it's fine to take any random values for that feature, it means it's probably not very important. Okay, that was terrific; that was all exactly right. There were some details that maybe were skimmed over a little bit, so I wonder if anybody else wants to jump in with a more detailed description of how it's calculated, because I know this morning some people were not quite sure. Is there anybody who's not quite sure, maybe, who wants to have a go? Want to just pass it to the person next to you there? Let's see. How exactly do we calculate feature importance for a particular feature? So I think after you're done building the random forest model, you take each column and randomly shuffle it, generate predictions, and check the validation score. If it gets pretty bad after shuffling one of the columns, that means that column was important, so it has higher importance. I'm not exactly sure how we quantify the feature importance, though. Okay, great. Dina, do you know how we quantify the feature importance? That was a great description. I think we take the difference in the R squared. Or a score of some sort, exactly. Yeah, so let's say we've got our dependent variable, which is price, and there's a bunch of independent variables, including year made. We use the whole lot to build a random forest, and that gives us our predictions; let's call them y-hat. And so then we can compare them to the actuals to get, I don't know, R squared, RMSE, whatever you're interested in, from the model. Now, the key thing here is I don't want to have to retrain my whole random forest; that's kind of slow and boring. So using the existing random forest, how can I figure out how important year made was? And the suggestion was, let's randomly shuffle the whole column. So now that column is totally useless.
It's got the same mean, the same distribution; everything about it is the same, but there's no connection at all between a particular row's actual year made and what's now in that column. I've randomly shuffled it. And so now I put that new version through the same random forest, so there's no retraining done, to get some new predictions, call them y-hat-YM, and then I can compare those to my actuals to get an RMSE-YM. And so now I can start to create a little table where I've got the original RMSE, and then the RMSE with year made scrambled, and the RMSE with enclosure scrambled. So say the original had an RMSE of two, scrambling year made gave an RMSE of three, and scrambling enclosure gave an RMSE of 2.5. And so then I just take these differences. For year made, the importance is one: three minus two. For enclosure it's 0.5: two and a half minus two. And so forth. So, how much worse did my model get after I shuffled that variable? Does anybody have any questions about that? Can you pass that to Danielle, please? I assume you just chose those numbers for illustration, but my question, I guess, is: do all of the importances theoretically sum to one, or is that not the case? Honestly, I've never actually looked at what the units are, so I'm not quite sure. We can check it out during the week; if somebody's interested, have a look at the sklearn code and see exactly what those units of measure are, because I've never bothered to check. Although I don't check the units of measure specifically, what I do check is the relative importance. So here's an example. Rather than just saying what the top 10 are: yesterday one of the practicum students asked me about a feature importance where they said, I think these three are important, and I pointed out that the top one was a thousand times more important than the second one. So look at the relative numbers. In that case, it's like, no, don't look at the top three; look at the one that's a thousand times more important and ignore all the rest. And this is where your natural tendency to want to be precise and careful sometimes needs to be overridden so you can be very practical: okay, this thing's a thousand times more important, don't spend any time on anything else. So then you can go and talk to the manager of your project and say, okay, this thing's a thousand times more important. And then they might say, oh, that was a mistake, it shouldn't have been in there, we don't actually have that information at decision time, or for whatever reason we can't actually use that variable, and so then you could remove it and have a look. Or they might say, gosh, I had no idea that that was by far more important than everything else put together; so let's forget this random forest thing and just focus on understanding how we can better collect that one variable and better use that one variable. So that's something which comes up quite a lot.
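In code, that shuffle-and-score procedure looks roughly like this; a minimal sketch assuming a fitted model m and a validation set X_valid, y_valid (recent versions of scikit-learn also ship their own implementation as sklearn.inspection.permutation_importance).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(m, X, y):
    return np.sqrt(mean_squared_error(y, m.predict(X)))

def permutation_importance(m, X_valid, y_valid):
    base = rmse(m, X_valid, y_valid)          # score with nothing shuffled
    importances = {}
    for col in X_valid.columns:
        X_shuffled = X_valid.copy()
        # Shuffling destroys the relationship between this column and the target,
        # while keeping its mean and distribution the same. No retraining needed.
        X_shuffled[col] = np.random.permutation(X_shuffled[col].values)
        importances[col] = rmse(m, X_shuffled, y_valid) - base   # how much worse did we get?
    return importances
```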
And actually another place that came up just yesterday: another practicum student asked me, hey, I'm doing this medical diagnostics project and my R squared is 0.95 for a disease which I was told is very hard to diagnose; is this random forest a genius, or is something going wrong? And I said, remember, the second thing you do after you build a random forest is feature importance. So do feature importance, and what you'll probably find is that the top column is something that shouldn't be there. And that's what happened. He came back to me half an hour later and said, yeah, I did the feature importance, you were right, the top column was basically another encoding of the dependent variable. I've removed it, and now my R squared is negative 0.1. So that's an improvement. The other thing I like to look at is this chart: basically, where do things flatten off, in terms of which ones I should really be focusing on? So that's the most important one. And when I did credit scoring in telecommunications, I found there were nine variables that basically predicted very accurately who was going to end up paying for their phone and who wasn't. And apart from ending up with a model that saved them $3 billion a year in fraud and credit costs, it also let them basically rejig their processes so that they focused on collecting those nine variables much better. All right, who wants to do partial dependence? This is an interesting one, very important, but in some ways kind of tricky to think about. Well, go ahead and try. Yeah, please do. So from my understanding, what partial dependence is saying is that there's not always necessarily a relationship strictly between the dependent variable and the independent variable that's showing importance, but rather an interaction between two variables that are working together. So you're thinking of something like this, right? Yeah, where we're like, oh, that's weird, you'd expect this to be kind of flat and there's this weird spike in it. Yeah. And in this example, what we found was that it's not necessarily year made or the sale date, but actually the age of the equipment when it was sold. And so it's easier to tell a company, well, obviously your newer equipment is going to sell for more, and it's less about the year it was made. Yeah, exactly. So let's come back to how we calculate this in a moment. But the first thing to realize is that the vast majority of the time, after your course here, when somebody shows you a chart, it'll be a univariate chart: they'll just grab the data from the database and plot X against Y, and then managers have a tendency to want to make a decision from that. So it'll be like, oh, there's this drop-off here, so we should stop dealing in equipment made between 1990 and 1995, or whatever. And this is a big problem, because real-world data has lots of these interactions going on. Maybe there was a recession around the time those things were being sold, or maybe around that time people were buying more of a different type of equipment, or whatever. So generally, what we actually want to know is: all other things being equal, what's the relationship between year made and sale price? Because if you think about the drivetrain approach idea of the levers, you really want a model that says, if I change this lever, how will it change my objective? And so it's by pulling them apart using partial dependence that you can say, okay, actually this is the relationship between year made and sale price, all other things being equal. So how do we calculate that?
So for the variable year made, for example, you keep every other variable constant, and then for every single value of year made, you run it through the model. So for every row you're going to get one of the light blue lines, and the median is going to be the yellow line up there. Good, okay. So let's try to draw that. By leave everything else constant, what she means is leave them at whatever they are in the data set. So just like when we did feature importance, we're going to leave the rest of the data set as it is and do the partial dependence plot for year made. You've got all of these other rows of data that we just leave as they are, and instead of randomly shuffling year made, what we're going to do is replace every single value of it with exactly the same thing: 1960. And just like before, we now pass that through our existing random forest, which we have not retrained or changed in any way, to get back out a set of predictions, y-hat 1960. And so then we can plot that on a chart, year made against partial dependence: the value for 1960 goes here. And then we can do it for 1961, '62, '63, '64, '65, and so forth. And we can do that on average for all of the rows, or we could do it for just one of them. When we do it for just one row, changing its year made and passing that single row through our model, that gives us one of these blue lines. So each one of these blue lines is a single row, as we change its year made from 1960 up to 2008. And then we can just take the median of all of those blue lines to say, on average, what's the relationship between year made and price, all other things being equal? So why is it that this works? Why does this process tell us the relationship between year made and price, all other things being equal? Well, maybe it's good to think about a really simplified approach first. A really simplified approach would say, what's the average auction? What's the average sale date, what's the most common type of machine we sell, which location do we mostly sell things in? We could come up with a single row that represents the average auction, and then we could run that row through the random forest, replacing its year made with 1960, then do it again with 1961, and then again with 1962, and plot those on our little chart. And that would give us a version of the relationship between year made and sale price, all other things being equal. But what if tractors looked like that and backhoe loaders looked like that? Then taking the average one would hide the fact that there are these totally different relationships. So instead, we basically say, okay, our data tells us what kinds of things we tend to sell, and who we tend to sell them to, and when we tend to sell them, so let's use that. So then we actually find, for every blue line, actual examples of these relationships. And then what we can do, as well as plotting the median, is a cluster analysis to find a few different shapes. And we may find, as in this case, that they all look like pretty much different versions of the same thing with different slopes. So my main takeaway from this would be that the relationship between sale price and year made is basically a straight line. And remember, this was the log of sale price.
So this is actually showing us an exponential, and this is where I would then bring in the domain expertise: okay, things depreciate over time by a constant ratio, so therefore I would expect the relationship between year made and sale price to have this exponential shape. So this is where, as I mentioned, at the very start of a machine learning project I generally try to avoid using domain expertise as much as I can and let the data do the talking. One of the questions I got this morning was, if there's a sale ID or a model ID, should I throw those away, because they're just IDs? No. Don't assume anything about your data. Leave them in, and if they turn out to be super important predictors, you want to find out why. But now I'm at the other end of my project. I've done my feature importance, I've pulled out the redundant features using that dendrogram, I'm looking at the partial dependence, and now I'm thinking: okay, is this shape what I expected? Even better: before you plot this, first think about what shape you would expect it to be, because it's always easy to justify to yourself after the fact, oh, I knew it would look like this. So what shape do you expect, and then, is it that shape? In this case I'd be like, yeah, this is what I would expect, whereas the univariate plot is definitely not what I'd expect. So the partial dependence plot has really pulled out the underlying truth. Okay, so does anybody have any questions about why we use partial dependence or how we calculate it? Who's got the... oh, you've got it. Say you have, say you've found 20 features that are all important. Are you going to measure the partial dependence for every single one of them? Is there a limit on that? If there are 20 features that are important, then I will do the partial dependence for all of them, where important means it's a lever I can actually pull, the magnitude of its importance is not much smaller than the other 19, and based on all of these things, it's a feature I ought to care about; then I will want to know how it's related. It's pretty unusual to have that many features that are important both operationally and from a modeling point of view, in my experience. How do you define important, actually, now that I think about it? So important means it's a lever, something I can change, and it's kind of at the spiky end of this tail. Or maybe it's not a lever directly; maybe it's something like zip code, and I can't tell my customers where to live, but I could focus my new marketing attention on a different zip code. Would it make sense to do pairwise shuffling for every combination of two features, and hold everything else constant, like in feature importance, to see interactions and compare scores? So you wouldn't do that so much for partial dependence; I think your question is really getting at whether we could do that for feature importance. And I think interaction feature importance is a very important and interesting question, but doing it by randomly shuffling every pair of columns, if you've got a hundred columns, sounds computationally intensive, possibly infeasible. So what I'm going to do is, after we talk about tree interpreter, I'll talk about an interesting but largely unexplored approach that will probably work.
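Before we move on to the tree interpreter, here's roughly what the partial dependence calculation described above looks like in code, for a single variable. It's a minimal sketch assuming a fitted model m and a validation set X_valid; in practice you would normally use a library such as pdpbox or sklearn.inspection.partial_dependence rather than rolling your own.

```python
import numpy as np

def partial_dependence(m, X_valid, col, values):
    """For each candidate value, overwrite `col` for every row, re-predict with the
    existing (unretrained) forest, and average the predictions."""
    averages = []
    for v in values:
        X_pd = X_valid.copy()
        X_pd[col] = v                             # e.g. set YearMade to 1960 for every row
        averages.append(m.predict(X_pd).mean())   # the lecture plots the median of the
                                                  # individual blue lines; mean is the usual choice
    return np.array(averages)

years = np.arange(1960, 2009)
pdp_yearmade = partial_dependence(m, X_valid, 'YearMade', years)
```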
Okay, who wants to do tree interpreter? All right, over here, Prince. Can you pass that over to Prince? I think of this as being like feature importance, but feature importance is for the complete random forest model, and tree interpreter is for a particular observation. So say it's about hospital readmission: if a patient is going to be readmitted to hospital, which features for that particular patient are going to impact that, and how can we change them? And it's calculated starting from the prediction of the mean, then seeing how each feature changes the prediction for that particular patient. I'm smiling because that was one of the best examples of technical communication I've heard in a long time, so it's really worth thinking about why it was effective. What Prince did there was use as specific an example as possible. Humans are much less good at understanding abstractions, so rather than saying, it takes some kind of feature and there's an observation in that feature, it's: no, it's a hospital readmission. We take a specific example. The other thing he did, which is very effective, was to take an analogy to something we already understand: we already understand the idea of feature importance across all of the rows in a data set, so now we're going to do it for a single row. One of the things I was really hoping we would learn from this experience is how to become effective technical communicators, and that was a really great role model from Prince of using all the tricks we have at our disposal for effective technical communication. So hopefully you found that a useful explanation. I don't have a lot to add to that, other than to show you what it looks like. So with the tree interpreter, we picked out a row. Remember, when we talked about confidence based on tree variance at the very start, we said you'd probably mainly use that for a single row; this would also be for a row. So it's like, okay, why is this patient likely to be readmitted, or in this case, why is this auction so expensive? Here is all of the information we have about that auction. So then we call tree interpreter's predict, and we get back the prediction of the price; the bias, which is the value at the root of the tree, just the average price for everybody, so it's always going to be the same; and the contributions, which tell us how important each of these things was. And the way we calculated that was to say: okay, at the very start, the average price was 10, and then we split on enclosure, and for those with this enclosure, the average was 9.5. And then we split on year made, say less than 1990, and for those with that year made, the average price was 9.7. And then we split on the number of hours on the meter, and for this branch, we got 9.4. And we then have a particular auction which we pass through the tree, and it just so happens that it takes this path. One row can only have one path through the tree, and so we ended up at this point.
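Here's what that call looks like in code, as a minimal usage sketch with the treeinterpreter package, assuming a fitted forest m and a validation set X_valid (the row index is arbitrary).

```python
from treeinterpreter import treeinterpreter as ti

row = X_valid.iloc[[0]]                 # one auction, kept as a one-row DataFrame
prediction, bias, contributions = ti.predict(m, row.values)

# bias[0] is the mean at the root of the trees (the same for every row);
# contributions[0] has one value per feature, and bias plus contributions sums to the prediction.
ranked = sorted(zip(X_valid.columns, row.values[0], contributions[0]),
                key=lambda x: abs(x[2]), reverse=True)
for name, value, contrib in ranked:
    print(f'{name:25} {str(value):>12}  contribution {contrib:+.3f}')
```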
Okay, so then we can create a little table. As we go through, we start at the top with 10; that's our bias. The split on enclosure resulted in a change from 10 to 9.5, so minus 0.5. Year made changed it from 9.5 to 9.7, so plus 0.2. And then the meter changed it from 9.7 down to 9.4, which is minus 0.3. And if we add all that together: 10 minus a half is 9.5, plus 0.2 is 9.7, minus 0.3 is 9.4. Lo and behold, that's that number, which takes us to our Excel spreadsheet. Where's Chris, who did our waterfall? There you are. All right, so last week we had to use Excel for this, because there wasn't a good Python library for doing waterfall charts. And so we saw we got our starting point, which is the bias, and then each of our contributions, and we ended up with our total. The world is now a better place, because Chris has created a Python waterfall chart module for us and put it on PyPI, so never again will we have to use Excel for this. And I wanted to point out that waterfall charts have been very important in business communications at least as long as I've been in business, so that's about 25 years, and Python is a couple of decades old, give or take. But despite that, no one in the Python world ever got to the point where they thought, you know, I'm going to make a waterfall chart. So they didn't exist until two days ago. Which is to say, the world is full of stuff which ought to exist and doesn't, and doesn't necessarily take a lot of time to build. Chris, how long did it take you to build the first Python waterfall chart? Well, there was a gist of it, but it wasn't in a function. Yeah, about eight hours. Okay, so a hefty amount of time, but not unreasonable. And now, forever more, when people want a Python waterfall chart they will end up at Chris's GitHub repo, and hopefully find lots of other USF contributors who have made it even better. So in order for you to help improve Chris's Python waterfall, you need to know how to do that, which means you're going to need to submit a pull request. Life becomes very easy for submitting pull requests if you use something called hub. If you go to GitHub slash hub, that will send you over here, and what they suggest you do is alias git to hub, because it turns out that hub is actually a strict superset of git. What it lets you do is go git fork, git push, git pull-request, and you've now sent Chris a pull request. Without hub, this is actually a pain and requires going to the website and filling in forms and stuff. So this gives you no reason not to do pull requests. And I mention this because when you're interviewing for a job or whatever, I can promise you that the person you're talking to will check your GitHub, and if they see you have a history of submitting thoughtful pull requests that are accepted to interesting libraries, that looks great. It looks great because it shows you're somebody who actually contributes, and it also shows, if they're being accepted, that you know how to create code that fits with people's coding standards, has appropriate documentation, passes their tests and coverage, and so forth. So when people look at you and say, oh, here's somebody with a history of successfully contributing accepted pull requests to open source libraries, that's a great part of your portfolio. And you can specifically refer to it: either, I'm the person who built Python waterfall, here is my repo; or, I'm the person who contributed currency number formatting to Python waterfall, here's my pull request.
Here's my pull request. Any time you see something that doesn't work right in any open source software you use, that's not a problem — it's a great opportunity, because you can fix it and send in the pull request. So yeah, give it a go. It actually feels great the first time you have a pull request accepted. And of course, one big opportunity is the fastai library. And thank you to the person here who added all the docstrings to fastai.structured in the other class. Thanks to one of our students, we now have docstrings for most of the fastai.structured library, and that again came via a pull request. So thank you. Okay. Does anybody have any questions about how to calculate any of these random forest interpretation methods, or why we might want to use them? Towards the end of the week, you're going to need to be able to build all of these yourself from scratch. One note on that — can you pass that, please? Just looking at the tree interpreter, I noticed that some of the values are NaNs. I get why you keep them in the tree, but how can a NaN have a feature importance? Okay, let me pass it back to you: why not? In other words, how is NaN handled in pandas, and therefore in the tree? Is it set to some default value? Anybody remember how pandas handles this? These, you'll notice, are all in categorical variables. How does pandas handle NaNs in categorical variables, and how does fastai deal with them? Can somebody pass it to the person who's talking? Negative one for pandas? Yeah, pandas sets them to the category code negative one. And does anybody remember what we then do? We add one to all of the category codes, so the missing value ends up being zero. So in other words — remember, by the time it hits the random forest, it's just a number — the missing category is just the number zero, and we map it back to the descriptions back here. So the question really is, why shouldn't the random forest be able to split on zero? It's just another number. So it could be NaN, high, medium, or low: zero, one, two, three. And so, you know, missing values are one of these things that are generally taught really badly. Often people get taught, here are some ways to remove columns with missing values, or remove rows with missing values, or to replace missing values. That's never what we want, because missingness is very, very often interesting. And we actually learned from our feature importance that coupler_system NaN is one of the most important features. For some reason — well, I could guess — coupler_system NaN presumably means this is the kind of industrial equipment that doesn't have a coupler system. Now, I don't know what kind that is, but apparently it's a more expensive kind. Does that make sense? Yeah. Okay. So, yeah, I did this competition for university grant research success where by far the most important predictors were whether or not some of the fields were null, and it turned out that this was data leakage: these fields only got filled in, most of the time, after a research grant was accepted. So it allowed me to win that Kaggle competition, but didn't actually help the university very much. Okay, great. So let's talk about extrapolation. And I am going to do something risky and dangerous, which is we're going to do some live coding. And the reason we're going to do some live coding is I want to explore extrapolation together with you.
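To see that concretely, here is a small, self-contained pandas example: missing values in a categorical get the code -1, and adding one (as the fastai-style preprocessing described above does) turns them into 0. The column values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(['High', 'Low', np.nan, 'Medium', 'High']).astype('category')

print(s.cat.categories)    # Index(['High', 'Low', 'Medium'], dtype='object')
print(s.cat.codes.values)  # [ 0  1 -1  2  0]  -- the NaN becomes -1

codes = s.cat.codes + 1    # shift so that missing becomes 0
print(codes.values)        # [1 2 0 3 1]
```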
And I also want to help give you a feel for how you might go about writing code quickly in this notebook environment, because this is the kind of stuff that you're going to need to be able to do, in the real world and in the exam: quickly create the kind of code that we're going to talk about. So I really like creating synthetic data sets any time I'm trying to investigate the behavior of something, because if I have a synthetic data set, I know how it should behave. Which reminds me — before we do this, I promised that we would talk about interaction importance, and I just about forgot. The tree interpreter tells us the contributions for a particular row based on the differences as we go down the tree. We could calculate that for every row in our data set and add them up, and that would tell us feature importance — it would tell us feature importance in a different way. One way of doing feature importance is by shuffling the columns one at a time; another way is by doing tree interpreter for every row and adding them up. Neither one is more right than the other; they're actually both quite widely used. So this is kind of type one and type two feature importance. So we could try to expand this a little bit to do not just single-variable feature importance, but interaction feature importance. Now here's the thing. What I'm going to describe is very easy to describe. It was described by Breiman right back when random forests were first invented, and it is part of the commercial software product from Salford Systems, who have the trademark on random forests. But it is not part of any open source library I'm aware of, and I've never seen an academic paper that actually studies it closely. So what I'm going to describe here is a huge opportunity, but there are also lots and lots of details that need to be fleshed out. So here's the basic idea. This particular difference here is not just because of year made, but because of a combination of year made and enclosure. The fact that this is 9.7 is because enclosure was in this branch and year made was in this branch. So in other words, we could say the contribution of enclosure interacted with year made is minus 0.3. And what about that difference? Well, that's an interaction of year made and hours on the meter. I'm using star here not to mean times, but to mean interacted with — it's a common way of writing it; R's formulas do it this way as well. So year made interacted with meter has a contribution of minus 0.1. Perhaps we could also say, from here to here, that this also shows an interaction between meter and enclosure, with one thing in between them. So maybe we could say meter by enclosure equals — and what should it be? Minus 0.6? In some ways that seems unfair, because we're also including the impact of year made; so maybe it shouldn't be minus 0.6 — maybe we should take back out that plus 0.2 from year made. And these are details that I actually don't know the answer to: how should we best assign a contribution to each pair of variables in a path? But clearly, conceptually, the pairs of variables in that path all represent interactions. Yes, Chris — can someone pass that to Chris? Why don't you force them to be next to each other in the tree? I mean, I'm not going to say it's the wrong approach.
I don't think it's the right approach, though, because it feels like on this path here, meter and enclosure are interacting, so not recognizing that contribution seems like throwing away information. But I'm not sure. I had one of my staff at Kaggle actually do some R&D on this a few years ago, and they got it working pretty well — I wasn't close enough to know how they dealt with these details — but unfortunately it never saw the light of day as a software product. So this is something which maybe a group of you could get together and build. Do some Googling to check, but I really don't think that there are any interaction feature importance pieces in any open source library. Can you pass that back? Wouldn't this exclude interactions, though, between variables that don't matter until they interact? Say your tree never chooses to split down that path, but that variable interacting with another one would become your most important split. I don't think that happens, right? Because if there's an interaction that's important only as an interaction, and not on a univariate basis, it will still appear sometimes, assuming that you set max features to less than one, and so it will appear on some paths. What is meant by interaction — is it multiplication, ratio, addition? Interaction here means the two features appear on the same path through a tree. In this case, there's an interaction between enclosure and year made, because we branch on enclosure and then we branch on year made; so to get to here, we have to have some specific value of enclosure and some specific value of year made. Sorry, my brain's kind of working on this right now. What if you went down the middle leaves between the two things you're trying to observe, and you also took into account what the final measure is? I mean, if we extend the tree downwards, you'd have many measures, both of the two things you're trying to look at and also the in-between steps. There seems to be a way to average out the information in between them. There could be. So I think what we should do is talk about this on the forum. I think this is fascinating and I hope we build something great, but I need to do my live coding. That was a great discussion — keep thinking about it, and do some experiments. And to experiment with that, you almost certainly want to create a synthetic data set first. It's like y equals x1 plus x2 plus x1 times x2, or something — something where you know that there's this interaction effect and there isn't that other interaction effect — and then you want to make sure that the feature importance you get at the end is what you expected. And probably the first step would be to do single-variable feature importance using the tree-interpreter-style approach. One nice thing about this is that it doesn't really matter how much data you have: all you have to do to calculate feature importance is trace through the tree, so you should be able to write it in a way that's actually pretty fast, and even writing it in pure Python might be fast enough, depending on your tree size. There's a rough sketch of that kind of experiment below.
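As a starting point for that experiment, here is a minimal sketch (not from the lecture): a synthetic data set with a known interaction, a random forest fit to it, and single-variable feature importance computed the tree-interpreter way, by averaging absolute per-row contributions. It assumes the treeinterpreter package; all the variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from treeinterpreter import treeinterpreter as ti

rng = np.random.default_rng(42)
n = 5_000
x1, x2, x3 = rng.uniform(0, 1, (3, n))

# x1 and x2 matter, and they interact; x3 is pure noise
y = x1 + x2 + x1 * x2 + rng.normal(0, 0.05, n)
X = np.stack([x1, x2, x3], axis=1)

m = RandomForestRegressor(n_estimators=40, max_features=0.5, n_jobs=-1).fit(X, y)

# Tree-interpreter-style feature importance: average absolute contribution per column
_, _, contributions = ti.predict(m, X)
print(np.abs(contributions).mean(axis=0))   # x1 and x2 should dwarf x3

# Whatever interaction-importance routine you write should attribute something
# to the (x1, x2) pair and almost nothing to any pair involving x3.
```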
Okay, so we're going to talk about extrapolation. And the first thing I want to do is create a synthetic data set that has a simple linear relationship — we're going to pretend it's like a time series. So we need to basically create some x values. The easiest way to create some synthetic data of this type is to use linspace, which just creates some evenly spaced data between start and stop, with, by default, 50 observations. So if we just do that — there it is. And so then we're going to create a dependent variable. Let's assume there's just a linear relationship between x and y, and let's add a little bit of randomness to it: uniform random between low and high, so we could add somewhere between minus 0.2 and 0.2. And the next thing we need is a shape, which is basically: what dimensions do you want these random numbers to be? Obviously we want them to be the same shape as x, so we can just say x dot shape. Remember, when you see something in parentheses with a comma, that's a tuple with just one thing in it; so this is of shape (50,), and we've added 50 random numbers. And so now we can plot those, x against y. All right, so there's our data. So when you're working as a data scientist, or doing your exams in this course, you need to be able to quickly whip up a data set like that and throw it up on a plot without thinking too much. And as you can see, you don't have to really remember much, if anything — you just have to know how to hit shift-tab to check the names of parameters, and everything in the exam will be open book, open internet, so you can always Google for something to find linspace if you've forgotten what it's called. All right, so let's assume that's our data. We're now going to build a random forest model, and what I want to do is build a random forest model that kind of acts as if this is a time series. So I'm going to take this as the training set, and I'm going to take this as our validation or test set, just like we did in groceries or bulldozers or whatever. So we can use exactly the same kind of code that we used with split_vals: we can basically say x_trn comma x_val equals x up to 40 comma x from 40. That just splits it into the first 40 versus the last 10, and we can do the same thing for y. And there we go. So the next thing to do is to create a random forest and fit it, and that's going to require x's and y's. Now, that's actually going to give an error, and the reason why is that it expects x to be a matrix, not a vector, because it expects x to have some number of columns of data. So it's important to know that a matrix with one column is not the same thing as a vector. If I try to run this: expected 2D array, got 1D array instead. So we need to convert our 1D array into a 2D array. Remember, I said x dot shape is (50,), so x has one axis — its rank is one. The rank of a variable is equal to the length of its shape: how many axes does it have? So a vector we can think of as an array of rank one, and a matrix is an array of rank two. I very rarely use words like vector and matrix, because they're just specific examples of something more general, which is that they're all n-dimensional tensors, or n-dimensional arrays. So an n-dimensional array — we can say it's a tensor of rank n. They basically mean the same thing.
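Here is roughly what those first live-coded cells look like, reconstructed as a sketch; the exact numbers and names may differ slightly from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt

# 50 evenly spaced x values between 0 and 1, plus a linear y with a little noise
x = np.linspace(0, 1)
y = x + np.random.uniform(-0.2, 0.2, x.shape)
plt.scatter(x, y)
plt.show()

# Treat it like a time series: train on the first 40 points, validate on the last 10
x_trn, x_val = x[:40], x[40:]
y_trn, y_val = y[:40], y[40:]
```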
Physicists get crazy when you say that, because to a physicist a tensor has quite a specific meaning, but in machine learning we generally use it this way. Okay, so how do we turn a one-dimensional array into a two-dimensional array? There are a couple of ways we can do it, but basically we slice it. Colon means give me everything in that axis; colon comma None means give me everything in the first axis — which is the only axis we have — and then None is a special indexer which means add a unit axis here. So let me show you: that is of shape (50, 1). It's rank two — it has two axes, and one of them is a very boring, length-one axis. If we move the None over here instead, it's (1, 50). And then, to remind you, the original is just (50,). So you can see I can put None as a special indexer to introduce a new unit axis. So this thing has one row and 50 columns; this thing has 50 rows and one column. And that's what we want: 50 rows and one column. This kind of playing around with ranks and dimensions is going to become increasingly important in this course and in the deep learning course, so spend a lot of time slicing with None, slicing with other things, and try to create three-dimensional and four-dimensional tensors and so forth. I'll show you two tricks. The first is that you never, ever need to write comma colon — it's always assumed. So if I delete that, this is exactly the same thing, and you'll see that in code all the time, so you need to recognize it. The second trick: this is adding an axis in the second dimension, or I guess the index-one dimension. What if I always want to put it in the last dimension? Often our tensors change dimensions without us looking — you went from a one-channel image to a three-channel image, or you went from a single image to a mini-batch of images, and suddenly you get new dimensions appearing. So to make things general, I would say dot dot dot. Dot dot dot means as many dimensions as you need to fill this up. In this case it's exactly the same thing, but I would always try to write it that way, because it means it's going to continue to work as I get higher-dimensional tensors. Okay, so in this case I want 50 rows and one column, so I'll call that, say, x1. So let's now use that here — this is now a 2D array — and so I can create my random forest. Okay, so then I can plot that. And this is where you're going to have to turn your brains on, because the folks this morning got this very quickly, which was super impressive. I'm going to plot y_trn against m dot predict of x_trn. Before I hit go, what is this going to look like? Yeah, it should basically be the same — our predictions hopefully are close to the actuals, so this should fall roughly on a line, but there's some randomness, so it won't quite. I should have used a scatter plot. Okay, so that's cool. That was the easy one. Let's now do the hard one — the fun one. What's that going to look like? Okay, I'm going to say no, but nice try. It's like, hey, we're extrapolating to the validation set — that's what I'd like it to look like, but that's not what it is going to look like. Think about what trees do, and think about the fact that we have a validation set here and a training set here. And a forest is just a bunch of trees. So the first tree is going to...
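Continuing the sketch above (x, x_trn, y_trn come from the previous block; this is a rough reconstruction, not the exact notebook):

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

print(x.shape)             # (50,)    -- rank 1
print(x[:, None].shape)    # (50, 1)  -- unit axis added as the second axis
print(x[None, :].shape)    # (1, 50)  -- unit axis added as the first axis
print(x[..., None].shape)  # (50, 1)  -- same idea, whatever the leading shape is

# Reshape the training points to 40 rows and one column, then fit the forest
x1 = x_trn[..., None]
m = RandomForestRegressor().fit(x1, y_trn)

# In-sample predictions: these should sit roughly on a line against the actuals
plt.scatter(y_trn, m.predict(x1))
plt.show()
```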
Okay, Melissa's going to have a go — can you pass that to Melissa? Um, will it start grouping the dots? Yeah, that's basically what it does. Okay, but let's think about how it groups the dots. So, yeah, Tim? I'm guessing, since all the new data is actually outside the range of the original, it's all going to be basically the same — like one huge group. Yeah, right. So let's forget the forest and think about one tree. We're probably going to split somewhere around here first, and then we're probably going to split somewhere around here, and then somewhere around here, and somewhere around here — so our final split is here. So now take this validation point and put it through the tree: it ends up predicting this average. It can't predict anything higher than that, because there is nothing higher than that to average. So this is really important to realize: a random forest is not magic. It's just returning the average of nearby observations, where nearby means nearby in this tree space. So let's run it — let's see if Tim's right. Holy shit. That's awful. And if you don't know how random forests work, this is going to totally screw you, if you think that it's actually going to be able to extrapolate to any kind of data it hasn't seen before — particularly future time periods. It just can't. It's just averaging stuff it's already seen. That's all it can do. So we're going to be talking about how to avoid this problem. We talked a little bit in the last lesson about trying to avoid it by avoiding unnecessary time-dependent variables where we can. But in the end, if you really have a time series that looks like this, we have to deal with the problem. One way we could deal with it would be to use a neural net — something with a functional form that can actually fit something like this, and so it will extrapolate nicely. Another approach would be to use all the time series techniques you're learning about in the morning class to fit some kind of time series and then detrend it. Then you'll end up with detrended dots, and you can use the random forest to predict those. And that's particularly cool, because imagine that your random forest was actually trying to predict data from, say, two different states, and the blue ones are down here and the red ones are up here. If you tried to use a random forest directly, it's going to do a pretty crappy job, because time is going to seem much more important: it's basically still going to split like this, and then split like this, and then finally, once it gets down to this piece, it'll be like, oh, okay, now I can see the difference between the states. So in other words, when you've got this big time piece going on, you're not going to see the other relationships in the random forest until every tree has dealt with time.
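To make the detrending idea concrete on the toy data from the sketches above, here is a minimal sketch under the assumption that a simple linear trend is good enough (a real series would use the proper time series tools); x_trn, y_trn, and x_val come from the earlier blocks.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Naive forest, as above: validation predictions are capped at the training maximum
m = RandomForestRegressor().fit(x_trn[..., None], y_trn)
naive_preds = m.predict(x_val[..., None])

# Detrend: fit a linear trend on the training portion only, train the forest on
# the residuals, then add the trend back in at prediction time
slope, intercept = np.polyfit(x_trn, y_trn, deg=1)

def trend(t):
    return slope * t + intercept

m_detrended = RandomForestRegressor().fit(x_trn[..., None], y_trn - trend(x_trn))
detrended_preds = m_detrended.predict(x_val[..., None]) + trend(x_val)

# naive_preds stays flat; detrended_preds keeps rising with the trend
print(naive_preds[:5], detrended_preds[:5])
```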
So one way to fix this would be with a gradient boosting machine, a GBM. What a GBM does is it creates a little tree and runs everything through that first little tree — which could be like a time tree — and then it calculates the residuals, and then the next little tree just predicts the residuals. So it'd be kind of like detrending it. GBMs still can't extrapolate to the future, but at least they can deal with time-dependent data more conveniently. We're going to be talking about this quite a lot more over the next couple of weeks. And in the end, the solution is going to be: just use neural nets. But for now, using some kind of time series analysis, detrending it, and then using a random forest on that isn't a bad technique at all. And if you're playing around with something like the Ecuador groceries competition, that would be a really good thing to fiddle around with. All right. See you next time.
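As a small addendum (not from the lecture) illustrating the residual-fitting idea just described, using the toy x_trn, y_trn, and x_val from the earlier sketches: the first little tree captures the coarse trend, and the next one fits whatever is left over.

```python
from sklearn.tree import DecisionTreeRegressor

# First little tree: a stump that mostly captures the coarse time trend
t1 = DecisionTreeRegressor(max_depth=1).fit(x_trn[..., None], y_trn)
residuals = y_trn - t1.predict(x_trn[..., None])

# Second little tree: fits the residuals, i.e. whatever the first tree missed
t2 = DecisionTreeRegressor(max_depth=1).fit(x_trn[..., None], residuals)

# The boosted prediction is the sum of the stages
preds = t1.predict(x_val[..., None]) + t2.predict(x_val[..., None])
print(preds)
```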