Okay, so welcome back. We're going to start with some review: test sets, training sets, validation sets, and OOB. Something we haven't covered yet, but will cover in more detail later, is cross-validation, and I'm going to talk about that as well.

So we have a data set with a bunch of rows in it, and some dependent variable. What's the difference between machine learning and pretty much any other kind of work? The difference is that in machine learning, the thing we care about is the generalization accuracy, or the generalization error, whereas in pretty much everything else, all we care about is how well we could have mapped the observations, full stop. This thing about generalization is the key unique piece of machine learning. So if we want to know whether we're doing a good job of machine learning, we need to know whether we're doing a good job of generalizing. If we don't know that, we know nothing.

"By generalizing, do you mean scaling — being able to scale larger?"

No, I don't mean scaling at all. Scaling is an important thing in many, many areas — it's like, okay, we've got something that works on my computer with 10,000 items, and I now need to make it work on 10,000 items per second. So scaling is important, but not just in machine learning — for just about everything we put in production. Generalization is where I say: okay, here is a model that can tell cats from dogs.
I've looked at five pictures of cats and five pictures of dogs, and I've built a model that is perfect. Then I look at a different set of five cats and dogs, and it gets them all wrong. In that case, what it learned was not what a cat and a dog look like, but what those five exact cats and those five exact dogs look like. Or I've built a model predicting grocery sales for a particular product — say, toilet rolls in New Jersey last month — and then I put it into production, and it scales great (in other words, it has great latency and I don't have a high CPU load), but it fails to predict anything well other than toilet rolls in New Jersey. And it also turns out it only did well for last month, not the next month. These are all generalization failures.

The most common way people check for the ability to generalize is to create a random sample: they'll grab a few rows at random and pull them out into a test set, and then they'll build all of their models on the rest of the rows — everything else, which we call the training set. At the end of their modeling process on the training set, say they got an accuracy of 99% at predicting cats from dogs; at the very end, they check that accuracy against the test set to make sure the model really does generalize.

Now, the problem is: what if it doesn't? Well, I could go back and change some hyperparameters, do some data augmentation, whatever else, to try to create a more generalizable model, then check again — and it's still no good. And I'll keep doing this again and again until, eventually, after 50 attempts, it does generalize. But does it really generalize? Because maybe all I've done is accidentally found the one model which happens to work just for that test set, because I've tried 50 different things.
And if I've got something which is right just by coincidence, say, 5% of the time, then after 50 tries I'm very likely to accidentally get a good result. So what we generally do is put aside a second data set — grab a couple more rows and put them aside into a validation set. Everything that's not in the validation or test set is now the training set. And what we do is train a model, check it against the validation set to see if it generalizes, do that a few times, and then when we finally have something where we think, okay, this generalizes successfully based on the validation set — then, at the end of the project, we check it against the test set.

"So basically, by making this two-layer test and validation setup, if it gets one right and the other wrong, you're kind of double-checking your errors?"

Kind of — it's checking that we haven't overfit to the validation set. If we're using the validation set again and again, then we could end up not with a generalizable set of hyperparameters, but with a set of hyperparameters that just so happened to work on the training set and the validation set. So if we try 50 different models against the validation set, and at the end of all that we check against the test set and it still generalizes well, then we're going to say: okay, that's good — we've actually come up with a generalizable model. If it doesn't, that says we've actually overfit to the validation set, at which point you're in real trouble, because you don't have anything left behind. So the idea is to use effective techniques during the modeling so that doesn't happen. But if it is going to happen, you want to find out about it — you need that test set to be there, because otherwise, when you put it in production and it turns out it doesn't generalize, that would be a really bad outcome.
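The random train/validation/test split just described can be sketched in a few lines. This is just one way to do it, assuming the data is indexed by integer rows; the function name and the 15%/15% fractions are my own arbitrary choices, not anything prescribed in the lecture.

```python
import numpy as np

def split_train_val_test(n_rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Randomly partition row indices into train/validation/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)            # shuffle all row indices
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)
    test_idx = idx[:n_test]                  # held out until the very end
    val_idx = idx[n_test:n_test + n_val]     # used repeatedly during modeling
    train_idx = idx[n_test + n_val:]         # everything else is training
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_train_val_test(10_000)
```

The key discipline isn't in the code — it's that `test_idx` is only ever used once, at the very end.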
You end up with fewer people clicking on your ads, or selling less of your products, or providing car insurance to very risky drivers, or whatever.

"Just to make sure — do you ever need to check that the validation set and the test set are coherent?"

If you've done what I've just done here, which is to randomly sample, there's no particular reason to check, as long as they're big enough. But we're going to come back to your question in a different context in just a moment.

Now, another trick we've learned for random forests is a way of not needing a validation set at all, and that was to use the OOB error, or OOB score. The idea was: every time we train a tree in a random forest, there's a bunch of observations that are held out anyway, because that's how we get some of the randomness. So let's calculate our score for each tree based on those held-out samples, and therefore for the forest, by averaging over the trees that each row was not part of training. The OOB score gives us something which is pretty similar to the validation score — but, on average, it's a little less good. Can anybody either remember or figure out why, on average, it's a little less good? It's quite a subtle one. Thank you — Genchi?

"I'm not sure, but is it because of the preprocessing — so the OOB score is reflecting the performance on the test set?"

No — the OOB score is not using the test set at all. The OOB score is using the held-out rows in the training set for each tree.

"So you are basically testing each tree on some data from the training set — you have the potential of overfitting the trees?"

It shouldn't cause overfitting, because each one is looking at held-out samples; it's not an overfitting issue. It's quite a subtle issue. Ernest, do you want to have a try?

"Aren't the OOB samples drawn from bootstrap samples?"
They are — so then, on average, each bootstrap sample only grabs about 63% of the rows, so on average each row is out-of-bag for the remaining roughly 1 − 63% = 37% of the trees. Exactly — that's the issue.

"But then why would the score be lower than the validation score? That implies you're leaving a sort of black hole in the data — data points you're never going to sample, that are never represented by the model."

No, that's not true, because each tree is looking at a different set. With the OOB, we've got, say, dozens of models, and in each one there's a different set of rows which happened to be held out. So when we calculate the OOB score for, let's say, row three, we say: okay, row three is out-of-bag for this tree and this tree, and that's it — so we calculate the prediction on those trees and average those predictions. And with enough trees — each tree has a roughly 37% chance of having held any given row out — if you have 50 trees, it's almost certain that every row is going to get an OOB prediction from somewhere. Did you have an idea?

"With a validation set we can use the whole forest to make the predictions, but here we cannot use the whole forest."

Exactly. Every row is going to use only a subset of the trees to make its prediction, and with fewer trees, we know we get a less accurate prediction. That's the subtle one. If you didn't get it, have a think during the week until you understand why, because it's a really interesting test of your understanding of random forests: why is the OOB score, on average, less good than your validation score, when they're both using random held-out subsets? Anyway, it's generally close enough.

So why have a validation set at all when you're using a random forest?
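As an aside on the mechanics just discussed: the 63% figure is pure bootstrap arithmetic, and scikit-learn exposes the OOB score as a single flag. A quick sketch on synthetic data — the sample size, tree count, and data are all arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Sampling n rows *with replacement* from n rows sees about 1 - 1/e ~ 63.2% of them.
rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)
frac_seen = len(np.unique(sample)) / n            # ~0.632

# Chance a given row is out-of-bag for one tree (~1/e ~ 36.8%), and the
# chance it is out-of-bag for at least one of 50 trees (essentially certain):
p_oob_one_tree = (1 - 1 / n) ** n                 # ~0.368
p_oob_somewhere = 1 - (1 - p_oob_one_tree) ** 50

# scikit-learn computes the OOB score for you with oob_score=True:
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB R^2: {rf.oob_score_:.3f}")
```

`rf.oob_score_` is the R² computed, for each row, from only the trees that did not see that row — which is exactly why it uses fewer trees per prediction than a validation score would.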
If it's a randomly chosen validation set, it's not strictly speaking necessary. But you've got several levels of things to test: you can test on the OOB, and when that's working well, you can test on the validation set, and hopefully by the time you check against the test set, there are going to be no surprises. So that would be one good reason.

Then there's what Kaggle does, and the way they do it is kind of clever: Kaggle splits the test set into two pieces, a public part and a private part, and they don't tell you which is which. You submit your predictions to Kaggle, and a random 30% of them is used to tell you your leaderboard score. But at the end of the competition, that gets thrown away, and they use the other 70% to calculate your real score. What that's doing is making sure you're not continually using feedback from the leaderboard to find some set of hyperparameters that happens to do well on the public part but doesn't actually generalize. This is one of the reasons it's good practice to use Kaggle: at the end of some competition, this will happen to you — you'll drop a hundred places on the leaderboard on the last day, when they switch to the private test set, and you'll go: oh, okay, that's what it feels like to overfit. And it's much better to practice and get that sense there than in a company where there are hundreds of millions of dollars on the line.

Okay, so this is the easiest possible situation, where you're able to use a random sample for your validation set. Why might I not be able to use a random sample for my validation set?

"In the case of something where we're forecasting, we can't randomly sample, because we need to maintain the temporal ordering."

Go on — why is that?

"Because otherwise it doesn't make sense."
"In the case of, say, an ARMA model, I can't pull out random rows, because I'm trying to model a dependency that relies on a specific lag term. If I randomly sample, then that lag term isn't there for me to use."

Okay, so it could be a technical modeling issue: I'm using a model that relies on yesterday, the day before, and the day before that, and if I randomly remove some rows, I don't have yesterday, and my model might just fail. That's true — but there's a more fundamental issue. Do you want to pass it to Tyler? It's a really good point, although in general we're going to try to build models that are more resilient than that.

"With temporal order, we expect things that are close by in time to be related to things close to them, so if we destroy that order, we really aren't going to be able to use the fact that this time is close to this other time."

I don't think that's true, because I can pull out a random sample for a validation set and still keep everything nicely ordered.

"To predict things in the future, we would want as much data as possible close to the end."

Okay, that's true — we could be limiting the amount of data we have by taking some of it out. But my claim is stronger: my claim is that by using a random validation set, we could get totally the wrong idea about our model. Karen, do you want to have a try?

"If our data is imbalanced, for example, then by randomly sampling we could end up with only one class in our validation set, so our fitted model may be misleading."

That's true as well. Maybe you're trying to predict, in a medical situation, who's going to die of lung cancer — that's only one out of a hundred people — and we pick a validation set in which, accidentally, nobody died of lung cancer. That's also true.
These are all good niche examples, but none of them quite says why the validation set could be just plain wrong — why it could give you a totally inaccurate idea of whether the model is going to generalize. The closest is what Tyler was saying about closeness in time.

The important thing to remember is that when you build a model, you always have a systematic error, which is that you're going to use the model at a later time than the time you built it. You're going to put it into production, by which time the world is different to the world you're in now. And even while you're building the model, you're using data which is older than today anyway. So there's some lag between the data you're building it on and the data it's actually going to be used on in real life, and a lot of the time — if not most of the time — that matters.

So if we're predicting who's going to buy toilet paper in New Jersey, and it takes us two weeks to put the model into production, and we built it using data from the last couple of years, then by that time things may look very different. In particular, if we randomly sampled our validation set from a four-year period, then the vast majority of that data is going to be over a year old, and the toilet-paper-buying habits of folks in New Jersey may have dramatically shifted. Maybe there's a terrible recession there now and they can't afford high-quality toilet paper anymore; or maybe their paper-making industry has gone through the roof and they're buying lots more because it's so cheap, or whatever. The world changes. So if you use a random sample for your validation set, then what you're actually checking is: how good are you at predicting things that are totally obsolete now — how good are you at predicting things that happened four years ago?
That's not interesting. So what we want to do in practice, any time there's some temporal piece, is instead to say: assuming we've ordered the data by time — so this end is old and this end is new — the most recent chunk is our validation set. Or, if we do it properly, the second-most-recent chunk is our validation set, and the most recent chunk is our test set. Make sense? So here's our training set, and we use that to try to build a model that still works on stuff that's later in time than anything the model was built on. So we're not just testing generalization in some abstract sense, but in a very specific sense: that it generalizes to the future. Could you pass it to Suraj, please?

"As you said, there is some temporal ordering in the data. In that case, is it wise to take the entire data set for training, or only recent data?"

That's a whole other question: how do you get the validation set to be good? Say I build a random forest on all the training data. It looks good on the training data, and it looks good on the OOB — and this is actually a really good reason to have the OOB: if it looks good on the OOB, it means you're not overfitting in a statistical sense; it's working well on a random sample. But then it looks bad on the validation set. So what happened? What happened was that you somehow failed to predict the future — you only predicted the past. So Suraj's idea about how to fix that would be: maybe we shouldn't use the whole training set; maybe we should train on a recent period only. On the downside, we're now using less data, so we can create less rich models; on the upside, it's more up to date.
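The temporal hold-out just described — sort by date, peel the most recent rows off the end — might look like this in code, assuming a pandas DataFrame; the names `df` and `saledate` and the split sizes are hypothetical stand-ins for your own data:

```python
import pandas as pd

def time_split(df, date_col, n_valid, n_test=0):
    """Split so validation (and optionally test) are strictly the most
    recent rows -- no random sampling anywhere."""
    df = df.sort_values(date_col)
    if n_test:
        train = df.iloc[:-(n_valid + n_test)]
        valid = df.iloc[-(n_valid + n_test):-n_test]
        test = df.iloc[-n_test:]
        return train, valid, test
    return df.iloc[:-n_valid], df.iloc[-n_valid:]

# e.g. hold out the most recent 2,000 rows by sale date as validation:
# train, valid = time_split(df, "saledate", n_valid=2000)
```

Every row in `valid` is later in time than every row in `train`, which is exactly the property a random split destroys.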
It's something you have to play around with. Most machine learning functions have the ability to provide a weight that is given to each row. For example, with a random forest, rather than bootstrapping uniformly at random, you could put a weight on every row and pick each row with some probability — and we could choose a probability curve so that the most recent rows have a higher probability of being selected. That can work really well. It's something you have to try — and if you don't have a validation set that represents the future relative to what you're training on, you have no way to know which of your techniques are working.

"How do you make the compromise between amount of data versus recency of data?"

What I tend to do, when I have this kind of temporal issue — which is probably most of the time — is this: once I have something that's working well on the validation set, I wouldn't then just use that model on the test set, because the test set is much further in the future relative to the training set. Instead, I would replicate building that model again, but this time combining the training and validation sets together, and retrain. At that point you've got no way to test against a validation set, so you have to make sure you have a reproducible script or notebook that does exactly the same steps in exactly the same ways — because if you get something wrong, you're going to find out on the test set that you've got a problem.

So what I do in practice is this: I need to know whether my validation set is truly representative of the test set.
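Coming back to the weighting idea for a moment: a sketch of recency weighting on synthetic data. Note that scikit-learn's `sample_weight` weights rows in the loss and bootstrap rather than literally setting each row's selection probability, so this is the same idea in spirit rather than exactly the curve drawn on the board; the half-life here is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

n = 5000
X = np.random.default_rng(1).normal(size=(n, 5))
y = X[:, 0] + np.random.default_rng(2).normal(scale=0.1, size=n)

# Assume rows are ordered oldest -> newest. Exponential decay gives recent
# rows more influence: halve the weight every 1,000 rows of age.
age = np.arange(n)[::-1]            # 0 = newest row
weights = 0.5 ** (age / 1000)

rf = RandomForestRegressor(n_estimators=20, random_state=0)
rf.fit(X, y, sample_weight=weights)
```

Whether this helps is an empirical question — which is exactly why you need a future-looking validation set to judge it.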
So what I do is build five models on the training set, and I try to have them vary in how good I think they are. Then I score my five models on the validation set, and I also score them on the test set. I'm not cheating, because I'm not using any feedback from the test set to change my hyperparameters; I'm only using it for this one thing, which is to check my validation set. So I get my five scores from the test set, and then I check that they fall in a line against the validation scores. If they don't, then you're not going to get good enough feedback from the validation set — so keep repeating that process until you're getting a line, and that can be quite tricky.

Trying to create a test set that's as similar to the real-world outcome as possible is difficult, and in the real world the same is true: the test set has to be as close to production as possible. What's the actual mix of customers that are going to be using this? How much time is there actually going to be between when you build the model and when you put it in production? How often are you going to be able to refresh the model? These are all things to think about when you build that test set.

"So you're saying: first build five models on the training data, and then, until you get a straight-line relationship, change your validation and test sets?"
You can't really change the test set, generally — this is assuming the test set is given — so you change the validation set. You might start with a random-sample validation set, find the results are all over the place, and realize: oh, I should have picked the last two months. Then you pick the last two months, and it's still all over the place, and you realize: oh, I should also have picked it so that it runs from the first of the month to the fifteenth of the month. And you keep changing your validation set until you've found one which is indicative of your test set results.

"For the five models — would you start with, say, just random data, and then average, and then make it progressively better?"

Yeah, exactly — maybe five not-terrible ones, but you want some variety, and you particularly want some variety in how well they might generalize through time. So: one trained on the whole training set; one trained on the last two weeks; one trained on the last six weeks; one which used lots and lots of columns and might overfit a bit more. You want to get a sense of: if my validation set fails to generalize temporally, I want to see that; if it fails to generalize statistically, I want to see that.

"Sorry, can you explain in a bit more detail what you mean by changing the validation set so it's indicative of the test set? What does that look like?"

So let's take the groceries competition, where we're trying to predict the next two weeks of grocery sales. Possible validation sets that Terence and I played with were: a random sample; the last month of data; the last two weeks of data; and the same day-range one month earlier. The test set in this competition was the 1st to the 15th of August — sorry, I think it was the 15th to the 30th of August. So we tried a random sample across the four years.
We tried the 15th of July to the 15th of August; we tried the 1st of August to the 15th of August; and we tried the 15th of July to the 30th of July. So there were four different validation sets we tried. With the random sample, our results were all over the place. With the last month, they were not bad, but not great. With the last two weeks, there were a couple that didn't look good, but on the whole they were good. And with the same day-range a month earlier, we got a basically perfect line.

"What exactly are you comparing it to from the test set? I'm confused about how you're creating that graph."

So I've built five models. They might be: just predict the average; some kind of simple group mean over the whole data set; a group mean over the last month of the data; a random forest on the whole thing; and a random forest on the last two weeks. On each of those, I calculate the validation score. Then I retrain the model on the whole training set and calculate the same score on the test set. So each of these points tells me: how well did that model go on the validation set, and how well did it go on the test set?
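The "do they fall in a line" check, in its crudest form, is just asking whether the two orderings agree. The five scores below are made-up numbers for illustration, not figures from the lecture:

```python
import numpy as np

# Hypothetical scores for five models of varying quality, scored once on
# the validation set and once (only for this check!) on the test set.
val_scores = np.array([0.28, 0.31, 0.35, 0.38, 0.42])
test_scores = np.array([0.30, 0.33, 0.38, 0.40, 0.45])

# If the validation set is a good proxy, whenever the validation score
# improves, the test score should too -- i.e. the rankings should match.
val_order = np.argsort(val_scores)
test_order = np.argsort(test_scores)
orderings_agree = bool((val_order == test_order).all())
print(orderings_agree)   # prints True for these made-up scores
```

In practice you'd also scatter-plot the pairs and eyeball how straight the line is, rather than only comparing ranks.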
And so, if the validation set is useful, we would say: every time the validation score improves, the test score should also improve.

"So by 'retrain', do you mean retrain the model on training plus validation?"

Yeah, that was the step I was talking about earlier: once I've got the validation score based on just the training set, I retrain on training plus validation and check against the test set.

"Just to clarify — by 'test set', do you mean submitting it to Kaggle and checking the score?"

If it's Kaggle, then your test set is Kaggle's leaderboard. In the real world, the test set is this third data set that you put aside — and having it reflect real-world production differences is the most important step in a machine learning project. Why is it the most important step? Because if you screw up everything else but don't screw up that, you'll know you've screwed up. If you've got a good test set, then when you screw up something else, you test it, it doesn't work out, and — okay, you're not going to destroy the company. But if you screwed up creating the test set, that would be awful, because then you don't know whether you've made a mistake: you build a model, you test it on the test set, it looks good — but the test set was not indicative of the real-world environment, so you don't actually know whether you're going to destroy the company. Now, hopefully you've got ways to put things into production gradually, so you won't actually destroy the company — but you'll at least destroy your reputation at work.
It's like: oh, Jeremy tried to put this thing into production, and in the first week the cohort we tried it on had their sales halve, and we're never going to give Jeremy a machine learning job again. Whereas if Jeremy had used a proper test set, he would have known: uh-oh, this is half as good as my validation set said it would be; I'll keep trying — and now I'm not going to get into any trouble. Instead it's: oh, Jeremy's awesome — he identifies ahead of time when there's going to be a generalization problem.

Okay. So this is something that everybody talks about a little bit in machine learning classes, but often it stops at the point where you learn that there's a thing in scikit-learn called train_test_split, and it returns these things, and off you go — or: here's the cross-validation function. The fact that these things always give you random samples tells you that much, if not most, of the time you shouldn't be using them. The fact that random forests give you an OOB score for free is useful, but it only tells you that your model generalizes in a statistical sense, not in a practical sense.

So then, finally, there's cross-validation, which outside of class you guys have been talking about a lot — which makes me feel somebody's been overemphasizing the value of this technique. So I'll explain what cross-validation is, and then I'll explain why you probably shouldn't be using it most of the time.

Cross-validation says: let's not just pull out one validation set, but, say, five. We assume we're going to randomly shuffle the data first — this is critical.
We first randomly shuffle the data, then split it into five groups. For model number one, we call the first group the validation set and the rest the training set; we train, check against the validation set, and get some RMSE, R², whatever. Then we throw that away, call the second group the validation set and the rest the training set, and get another score. We do that five times and take the average. So that's cross-validation: the average of those five accuracies.

So who can tell me a benefit of using cross-validation over the standard validation set I talked about before? Could you pass it to Fun?

"If you have a small data set, then cross-validation will make use of all the data you have."

Yeah — you can use all of the data; you don't have to put anything aside. And you get a little extra benefit as well, in that you've now got five models you could ensemble together, each of which used 80% of the data — sometimes that ensembling can be helpful. Fun, could you tell me some reasons you wouldn't use cross-validation?

"If we have enough data, we might not want the validation set to be included in the model training process — to pollute the model."

I'm not sure that cross-validation necessarily pollutes the model. What would be a key downside of cross-validation?

"For deep learning, if the network has already seen the pictures, it will know the pictures and is more likely to predict them correctly."

Sure, but we've put aside some data each time in the cross-validation. Can you pass that to Suraj? I'm not so worried about that — I don't think any one of these validation sets is less statistically accurate, unless I'm mistaken.

"Would we be overfitting the data?"

I think that's what Fun was worried about too.
I don't see why that would happen. Each time we're fitting a model — just behind you — each time we're fitting a model, we are absolutely holding out 20% of the sample. So yes, the five models between them have seen all of the data, but it's a lot like a random forest: each model has only been trained on a subset of the data.

"If it's a large data set, it'll take a lot of time."

Yes, exactly. We have to fit five models rather than one — so here's key downside number one: time. If we're doing deep learning and it takes a day to run, it suddenly takes five days, or we need five GPUs. Okay, what about my earlier issues with validation sets? Do you want to pass it over there — what's your name?

"If you had temporal data, then by shuffling, wouldn't you be breaking that relation?"

Well, we could unshuffle it afterwards — get the training set out and then sort it by time; presumably there's a date column there. So I don't think that's going to stop us from building a model. Did you have an idea?

"With cross-validation, you're building five random validation sets, and if there's some sort of structure that you're trying to capture in your validation set, to mirror your test set, you're essentially throwing away the chance to construct that yourself."

Right — I think you said the same thing I was going to say, which is that our earlier concerns about why random validation sets are a problem are entirely relevant here: all these validation sets are random. So if a random validation set is not appropriate for your problem — most likely because, for example, of temporal issues — then none of these five validation sets are any good.
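For reference, the five-fold procedure described above maps directly onto scikit-learn's `cross_val_score` with a shuffled `KFold` — and note that `shuffle=True` is exactly the random sampling that makes it unsuitable for temporal data. A minimal sketch on synthetic data (the data and hyperparameters are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=1000, n_features=8, noise=5.0, random_state=0)

# shuffle=True is the "randomly shuffle the data first" step -- which is
# precisely why every fold is a *random* validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=20, random_state=0),
    X, y, cv=cv, scoring="r2")
print(scores, scores.mean())    # five fold scores and their average
```

Easy to run — just remember that ease is not an argument that it's the right validation scheme for your problem.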
They're all random. So if you have temporal data, like we did here, there's no way — or probably no good way — to do cross-validation. You want your validation set to be as close to the test set as possible, and you can't do that by randomly sampling different things.

So, as Fun said, you may well not need to do cross-validation, because most of the time in the real world we don't really have that little data — unless your data comes from some very, very expensive labeling process, or from experiments that cost a lot to run, or whatever. Nowadays data scientists are not very often doing that kind of work; some are, in which case this is an issue, but most of us aren't, so we probably don't need to do it. As Nishan said, if we do do it, it's going to take a whole lot of time. And as Ernest said, even if we did do it, and took all that time, it might give us totally the wrong answer, because random validation sets are inappropriate for our problem. So I'm not going to spend much time on cross-validation. I think it's an interesting tool to have, and it's easy to use — scikit-learn has a cross-validation function you can go ahead and use — but it's not that often that it's going to be an important part of your toolbox, in my opinion. It'll come up sometimes.

Okay, so that is validation sets. The other thing we started talking about last week — and got a little bit stuck on, because I screwed it up — was tree interpretation. So I'm going to cover that again, without the error, and dig into it in a bit more detail. Can anybody tell me what the tree interpreter does, and how it does it? Anybody remember? It's a difficult one to explain — I don't think I did a good job of explaining it — so don't worry if you don't do a great job. Does anybody want to have a go? Well, okay, that's fine.
So, let's start with the output of tree interpreter. If we look at a single model — a single tree, in other words — here is a single tree. And so, to remind us, the top of a tree is before there's been any split at all. So 10.189 is the average log price of all of the auctions in our training set. So I'm going to go ahead and draw that right here: 10.189 is the average of all. And then if I go Coupler_System less than or equal to 0.5, I get 10.345. So for this subset of 16,800 where Coupler_System is less than or equal to 0.5, the average is 10.345. Then, of the rows with Coupler_System less than or equal to 0.5, we take the subset where Enclosure is less than or equal to two, and the average there of log sale price is 9.955. So here's 9.955. And then the final step in our tree, just for this group with Coupler_System ≤ 0.5 and Enclosure ≤ 2, is to take ModelID less than or equal to 4573, and that gives us 10.226. So then we can say: starting with 10.189, the average for everybody in our training set (for this particular tree's subsample of 20,000), adding in the coupler decision — Coupler_System less than or equal to 0.5 — increased our prediction by 0.156. If we predicted with a naive model of just the mean, that would have been 10.189; adding in just the coupler decision changed it to 10.345. So this variable is responsible for a 0.156 increase in our prediction. From there, the enclosure decision was responsible for a 0.395 decrease, and the model ID was responsible for a 0.276 increase, until eventually we reach our final decision: 10.226, the prediction for this particular auction's sale price. So we can draw that as what's called a waterfall plot. Waterfall plots are one of the most useful plots I know about, and weirdly enough there's nothing in Python to do them. This is one of these things where there's a disconnect between the world of management consulting and business,
where everybody uses waterfall plots all the time, and academia, who have no idea what these things are. Every time you're looking at, say, here are last year's sales for Apple — and then there was a change: iPhones increased by this amount, Macs decreased by that amount, iPads increased by that amount. Every time you have a starting point, a number of changes, and a finishing point, waterfall charts are pretty much always the best way to show it. So here's our prediction for price based on everything: 10.189. There was an increase (blue means increase) of 0.156 for the coupler, a decrease of 0.395 for enclosure, an increase of 0.276 for model ID. So increase, decrease, increase, to get to our final 10.226. So you see how a waterfall chart works? With Excel 2016 it's built in: you just click insert waterfall chart, and there it is. If you want to be a hero, create a waterfall chart package for matplotlib, put it on pip, and everybody will love you for it. There are some really crappy gists and manual notebooks and stuff around, but these are actually super easy to build. You basically do a stacked column plot where the bottom of each bar is white. You can kind of do it, but if you can wrap that all up, put the data points in the right spots, and color them nicely, that would be totally awesome. I think you've all got the skills to do it, and it would be a terrific thing for your portfolio. So there's an idea. It could make an interesting kernel even: here's how to build a waterfall plot from scratch. And by the way, you can put this up on pip,
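If you do want a go at that package, the core geometry is simple. Here's a sketch in pure NumPy (using the numbers from this example) that computes each bar's bottom and height from a starting value and a list of changes; you'd then hand these to something like matplotlib's `ax.bar(labels, heights, bottom=bottoms)`.

```python
import numpy as np

def waterfall_bars(deltas, start):
    # Running totals: start, start+d1, start+d1+d2, ...
    totals = start + np.concatenate([[0.0], np.cumsum(deltas)])
    # Each bar spans from the previous total to the new total; its bottom is
    # whichever of the two is lower, and its height is the size of the change
    bottoms = np.minimum(totals[:-1], totals[1:])
    heights = np.abs(np.asarray(deltas, dtype=float))
    return bottoms, heights, totals[-1]

bottoms, heights, final = waterfall_bars([0.156, -0.395, 0.276], start=10.189)
print(round(final, 3))  # → 10.226
```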
so everyone can use it. In general, therefore, going from the overall average through each change, the sum of all of those is going to be equal to the final prediction. So that's how, if we were just doing a decision tree, you could come along and say: how come this particular auction got this particular predicted price? And the answer is: because these three things had these three impacts. So for a random forest, we can do that across all of the trees. Every time we see coupler, we add up that change; every time we see enclosure, we add up that change; every time we see model, we add up that change. Then we combine them all together, and we get what tree interpreter does. So you could go into the source code for tree interpreter — it's not at all complex logic — or you could build it yourself, and you can see how it does exactly this. So when you call tree interpreter's predict with a random forest model for some specific auction — I've got a specific row here, this is my zero-indexed row — it tells you: okay, this is the prediction, the same as the random forest's prediction. Bias — this is always going to be the same; it's the average sale price for everybody, for each of the random samples in the trees. And contributions is the total of all our contributions, for each time we see that specific column appear in a tree. So, last time I made the mistake of not sorting this correctly. This time: np.argsort is a super handy function. It doesn't actually sort the array; it just tells you where each item would move to if it were sorted. So now, by passing that index array to each of the column, the level, and the contribution, I can print out all of those in the right order. So I can see here: here's my column,
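The `np.argsort` trick is worth a tiny standalone example. The column names and contribution values here are just illustrative, in the style of this data set:

```python
import numpy as np

contributions = np.array([0.276, -0.395, 0.156])
cols = np.array(['ModelID', 'Enclosure', 'Coupler_System'])

# argsort doesn't sort anything; it returns the indices that WOULD sort
# the array, so the same ordering can be applied to several arrays at once
idxs = np.argsort(contributions)
for col, contrib in zip(cols[idxs], contributions[idxs]):
    print(col, contrib)  # smallest contribution first
```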
Here's my column Here's the the level And the contribution So the fact that it's a small version of this piece of industrial equipment meant that it was less expensive Right, but the fact it was made pretty recently meant that was more expensive The fact that it's pretty old however made that it was less expensive, right? So this is not going to Really help you much at all with like a Kaggle style situation where you just need predictions That's going to help you a lot in a production environment or even pre-production, right? So like something which Any good manager should you should do if you say here's a machine learning model? I think we should use Is they should go away and grab a few examples of actual customers or or actual auctions or whatever And check whether your model looks intuitive, right? And if it says like my prediction is that You know Lots and lots of people are going to really enjoy This crappy movie, you know, and it's like, well, that was a really crappy movie Then they're going to come back to you and say like explain why your model is telling me That I'm going to like this movie because I hate that movie And then you can go back and you say well, it's because you like this movie And because you're this age range and you're this gender on average actually people like you Did like that movie Yeah What's the second element of each table? This is saying for this particular row It was a mini and it was 11 years old and it was a hydraulic excavator track three to four metric tons Yeah So it's just feeding back and telling you it's it because this is actually what it was It was these numbers. So I just went back to the original data to actually pull out the The descriptive versions of each one. 
Okay. So if we sum up all the contributions together and then add them to the bias, that would be the same as adding up those three things and adding them to this; and as we know from our waterfall chart, that gives us our final prediction. This is an almost totally unknown technique, and this particular library is almost totally unknown as well, so it's a great opportunity to show something that, in my opinion, is totally critical but rarely known. So that's kind of the end of the random forest interpretation piece, and hopefully you've now seen enough that when somebody says "we can't use modern machine learning techniques because they're black boxes that aren't interpretable", you have enough information to say: you're full of shit. They're extremely interpretable. And the stuff that we've just done — trying to do that with a linear model, good luck to you. Even where you can do something similar with a linear model, trying to do it so that it's not giving you totally the wrong answer without you having any idea it's the wrong answer is going to be a real challenge. So the last step we're going to do, before we try to build our own random forest, is deal with this tricky issue of extrapolation. In this case, let's look at the accuracy of our most recent models. We still have a big difference between our validation score and our training score. Actually, in this case it's not too bad: the difference between the OOB and the validation is pretty close. If there was a big difference between validation and OOB, I'd be very worried about whether we'd dealt with the temporal side of things correctly. Let's just have a look at our most recent model here. Yeah, so there's a tiny difference. On Kaggle, at least, you kind of need that last decimal place; in the real world, I'd probably stop here. But quite often
you'll see there's a big difference between your validation score and your OOB score, and I want to show you how you would deal with that — particularly because we know the OOB should be a little worse, since it's using fewer trees. So it gives me a sense that we should be able to do a little bit better. And the way we should be able to do better is by handling the time component better. So here's the problem with random forests when it comes to extrapolation. When you've got a data set with, say, four years of sales data in it, and you create your tree, it says: oh, if it's in some particular store, and it's some particular item, and it is on special, here's the average price. It actually tells us the average price over the whole training set, which could be pretty old. So when you then want to step forward to, say, what's going to be the price next month — it's never seen next month. Whereas with a linear model, it can find a relationship between time and price such that, even though we only had this much data, when you go and predict something in the future, it can extrapolate. But a random forest can't do that. There's no way, if you think about it, for a tree to say: well, next month it would be higher still. So there are a few ways to deal with this, and we'll talk about them over the next couple of lessons. But one simple way is just to try to avoid using time variables as predictors, if there's something else we could use that gives us a stronger relationship that's actually going to work in the future. So in this case, what I wanted to do was first of all figure out: what's the difference between our validation set and our training set? If I understand the difference between our validation set and our training set, then that tells me what are the
predictors which have a strong temporal component, and which may therefore be irrelevant by the time I get to the future time period. So I do something really interesting, which is: I create a random forest where my dependent variable is "is it in the validation set?". I've gone back and got my whole data frame, with the training and validation all together, and I've created a new column called is_valid, which I've set to one for the validation rows, and zero for everything in the training set. So I've got a new column which is just: is this in the validation set or not? And I'm going to use that as my dependent variable and build a random forest. So this is a random forest not to predict price, but to predict: is this in the validation set or not? If your variables were not time dependent, then it shouldn't be possible to figure out whether something's in the validation set or not. This is a great trick on Kaggle, because they often won't tell you whether the test set is a random sample or not. So you can put the test set and the training set together, create a new column called is_test, and see if you can predict it. If you can, you don't have a random sample, which means you have to figure out how to create a validation set from it. In this case, I can see I don't have a random sample, because my validation set can be predicted with a 0.9999 R². And then if I look at feature importance, the top thing is SalesID. This is really interesting: it tells us very clearly that SalesID is not a random identifier, but probably something that's just assigned consecutively as time goes on — we just increase the SalesID. saleElapsed was the number of days since the first date in our data set, so not surprisingly that also is a good predictor. Interestingly, MachineID: clearly each machine is being labeled with some consecutive identifier as well. And then there's a big gap. Don't just look at the order; look at the values.
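Here's a minimal sketch of the is_valid trick, on toy stand-in frames (a `SalesID` that grows over time plus one genuinely random column). If the classifier can predict is_valid, the split isn't random, and the feature importances point at the columns carrying the temporal signal:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)
# Toy stand-ins for the real training/validation frames: SalesID increases
# over time, so the validation rows all have the biggest IDs
df_trn = pd.DataFrame({'SalesID': np.arange(0, 1000), 'feat': np.random.rand(1000)})
df_val = pd.DataFrame({'SalesID': np.arange(1000, 1250), 'feat': np.random.rand(250)})

# New dependent variable: is this row in the validation set?
df = pd.concat([df_trn.assign(is_valid=0), df_val.assign(is_valid=1)])
m = RandomForestClassifier(n_estimators=20, max_features=None, random_state=42)
m.fit(df.drop('is_valid', axis=1), df['is_valid'])

# Importances reveal which columns leak the temporal structure (here, SalesID)
fi = pd.Series(m.feature_importances_, index=df.drop('is_valid', axis=1).columns)
print(fi.idxmax())  # → SalesID
```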
So: 0.7, 0.1, 0.07, 0.002 — stop right there: these top three are hundreds of times more important than the rest. So let's grab those top three, and we can then have a look at their values both in the training set and in the validation set. We can see, for example, that SalesID on average (I've divided by a thousand) is 1.8 million in the training set and 5.8 million in the validation set — so you can just confirm: okay, they're very different. So let's drop them. After I drop them, let's see if I can still predict whether something's in the validation set. I still can, with 0.98 R². So once you remove some things, other things come to the front, and it now turns out — not surprisingly — it's age. Things that are old are more likely, I guess, to be in the validation set, because earlier on in the training set they can't be old yet. YearMade, same reason. So then we can try removing those as well. And so — let's see where we go, up here — what we can try doing is say: all right, let's take SalesID and MachineID from the first one; age, YearMade, and saleDayofyear from the second one; and say: okay, these are all time-dependent features. I still want them in my random forest if they're important. But if they're not important, and there are some other non-time-dependent variables that work just as well,
that would be better, because now I'm going to have a model that generalizes better over time. So here I'm just going to go through each one of those features, drop each one, one at a time, retrain a new random forest, and print out the score. Before we do any of that, our score was 0.88 for our validation, versus 0.89 OOB. And you can see here: when I remove SalesID, my score goes up. This is what we're hoping for: we've removed a time-dependent variable, there were other variables that could find similar relationships without the time dependency, so removing it caused our validation score to go up. Now, OOB didn't go up, because this is genuinely, statistically, a useful predictor — but it's a time-dependent one, and we have a time-dependent validation set. So this is really subtle, but it can be really important: it's trying to find the things that give you a prediction that generalizes across time, and here's how you can see it. So: we should remove SalesID for sure. But saleElapsed didn't get better, so we don't want that. MachineID did get better: it went from 0.888 to 0.893, so it's actually quite a bit better. Age got a bit better; YearMade got worse; saleDayofyear got a bit better. So now we can say: all right, let's get rid of the three where we know that getting rid of them actually made things better. And as a result, look at this: we're now up to 0.915. So we've got rid of three time-dependent things, and now, as expected, our validation is better than our OOB. So that was a super successful approach. And now we can check the feature importance, and say: all right, that was pretty damn good. Let's now leave it for a while.
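Here's a hedged sketch of that drop-one-feature-at-a-time loop on synthetic data (the column names `a`–`d` are placeholders, not the real bulldozer fields): score a baseline model with all columns, then rescore with each candidate time-dependent feature removed, and keep the drops that improve the validation score.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=4, noise=5.0, random_state=0)
X = pd.DataFrame(X, columns=['a', 'b', 'c', 'd'])
X_trn, X_val, y_trn, y_val = train_test_split(X, y, random_state=0)

def score_without(drop=None):
    """Validation R^2 of a fresh forest with one column (optionally) removed."""
    cols = [c for c in X.columns if c != drop]
    m = RandomForestRegressor(n_estimators=20, random_state=0)
    m.fit(X_trn[cols], y_trn)
    return m.score(X_val[cols], y_val)

baseline = score_without()
for f in ['a', 'b', 'c', 'd']:   # candidate time-dependent features
    print(f, round(score_without(f), 3))  # drop f if this beats the baseline
```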
So give it 160 trees, let it churn on it, and see how that goes. And as you can see, we did all of our interpretation and all of our fine tuning basically with smaller models and subsets, and at the end we run the whole thing — it actually still only took 16 seconds. And we've now got an RMSE of 0.21. So now we can check that against Kaggle. Unfortunately this is an older competition and we're not allowed to enter anymore to see how we would have gone, so the best we can do is check whether it looks like we would have done well based on our validation set. It should be in the right area, and yeah, based on that, we would have come first. So I think this is an interesting series of steps. You can go through the same series of steps in your Kaggle projects and, more importantly, your real-world projects. One of the challenges is that once you leave this learning environment, suddenly you're surrounded by people who never have enough time; they always want you to be in a hurry; they're always telling you, do this and then do that. You need to find the time to step away and go back, because this is a genuine real-world modeling process you can use. And when I say it gives world-class results, I mean it: the guy who won this, Leustagos — sadly, he's passed away — was the top Kaggle competitor of all time; he won, I believe, dozens of competitions. So if we can get a score even within cooee of him, then we are doing really, really well. Okay, so let's take a five minute break, and we're going to come back and build our own random forest. Okay. I just wanted to clarify something quickly. A very good point raised during the break: going back to the change in R² between here and here — it's not just due to the fact that we removed these three predictors; we also called reset_rf_samples.
So to actually see the impact of just removing them, we need to compare to the final step earlier. It's actually compared to 0.907: removing those three things took us from 0.907 to 0.915. And in the end, of course, what matters is our final model. Yeah, just to clarify. Okay. So, some of you have asked me about writing your own random forest from scratch. I don't know if any of you have given it a try yet. My original plan here was to do it in real time, and then as I started to do it I realized that would have been kind of boring for you, because I screw things up all the time. So instead we might do more of a walk through the code together. Just as an aside, this reminds me — talking about the exam, somebody asked on the forum about what you can expect on the exam. The basic plan is to make the exam very similar to these notebooks. So it'll probably be a notebook where you have to get a data set, create a model, train it, do feature importance, whatever. And the plan is that it'll be open book, open internet; you can use whatever resources you like. So basically, if you've been entering Kaggle competitions, the exam should be very straightforward. I also expect there will be some pieces like: here's a partially completed random forest, finish writing this step; or here's a random forest, implement feature importance; or implement one of the other things we've talked about. So the exam will be much like what we do in class and what you're expected to be doing during the week. There won't be any "define this" or "tell me the difference between this word and that word". There's not going to be any rote learning. It'll be entirely: are you an effective machine learning practitioner? i.e. can you use the algorithms, can you create an effective validation set, and can you create parts of the algorithms and
implement them from scratch. So it'll be all about writing code, basically. If you're not comfortable writing code to do machine learning, then you should be practicing that all the time. If you are comfortable, you should be practicing that all the time also. Whatever you're doing, write code to do machine learning. Okay, so I have a particular way of writing code. I'm not going to claim it's the only way of writing code, but it might be a little bit different from what you're used to, and hopefully you'll find it at least interesting. Implementing random forest algorithms is actually quite tricky — not because the code's tricky. Generally speaking, most random forest algorithms are pretty conceptually easy; academic papers and books have a knack for making them look difficult, but they're not difficult conceptually. What's difficult is getting all the details right, and knowing when you're right. In other words, we need a good way of doing testing. So if we're going to re-implement something that already exists — say we want to create a random forest in some different framework, different language, different operating system — I would always start with something that does exist. In this case, we're just going to do it as a learning exercise: writing a random forest in Python. For testing, I'm going to compare it to an existing random forest implementation. That's critical: any time you're doing anything involving non-trivial amounts of code in machine learning, knowing whether you've got it right or wrong is kind of the hardest bit. I always assume that I've screwed everything up at every step, and so I'm thinking: okay, assuming that I screwed it up, how do I figure out that I screwed it up?
Right, and then, much to my surprise, from time to time I actually get something right, and then I can move on. But most of the time I get it wrong. Unfortunately, with machine learning there are a lot of ways you can get things wrong that don't give you an error; they just make your result slightly less good. And that's what you want to pick up. So, given that I want to compare it to an existing implementation, I'm going to use our existing data set, our existing validation set, and then, to simplify things, I'm just going to use two columns to start with. So let's go ahead and start writing a random forest. My way of writing nearly all code is top down, just like my teaching. And by top down, I mean I start by assuming that everything I want already exists. So the first thing I want to do — I'm going to call this a TreeEnsemble. To create a random forest, the first question I have is: what do I need to pass in? What do I need to initialize my random forest? So I'm going to need some independent variables, some dependent variable, and to pick how many trees I want. I'm going to use the sample size parameter from the start here — how big do you want each sample to be — and then maybe an optional parameter for the smallest leaf size. For testing, it's nice to use a constant random seed, so we'll get the same result each time; this is just how you set a random seed. Okay. Maybe it's worth mentioning, for those of you who aren't familiar with it: random number generators on computers aren't random at all.
They're actually called pseudo random number generators. What they do is: given some initial starting point — in this case 42 — a pseudo random number generator is a mathematical function that generates a deterministic, always-the-same sequence of numbers, such that each number is designed to be as uncorrelated with the previous number as possible, as unpredictable as possible, and as uncorrelated as possible with a sequence started from a different random seed. So the second number in the sequence starting with 42 should be very different from the second number starting with 41. Generally they involve taking big prime numbers and taking mods and stuff like that; it's kind of an interesting area of math. If you want real random numbers, the only way to do that is to buy hardware called a hardware random number generator, which will have inside it a little bit of some radioactive substance and something that detects how many particles it's spitting out, or some other hardware thing. "Is getting the current system time a valid way of generating random numbers?" That would be, maybe, for a random seed. So this question of what we start the function with — one of the really interesting areas is: in your computer, if you don't set the random seed, what is it set to? And yeah, quite often people use the current time. For security — obviously we use a lot of random numbers for security; if you're generating an SSH key, it needs to be random — it turns out people can figure out roughly when you created a key. They could look at, say, oh, id_rsa has a timestamp, and they could try all of the different nanosecond starting points for a random number generator around that timestamp and figure out your key.
So in practice, a lot of applications that require really high randomness actually have a step that says: please move your mouse and type random stuff at the keyboard for a while — it gets you to be a source of what's called entropy. Other approaches will look at, say, the hash of some of your log files, stuff like that. It's a really fun area. So in our case, our purpose is actually to remove randomness. We're saying: okay, generate a series of pseudo random numbers starting with 42, so it should always be the same. So if you haven't done much stuff in Python, this is basically a standard idiom — at least, I write it this way; most people don't. If you pass in, say, five things that you're going to want to keep inside this object, then you basically have to say self.x = x, self.y = y, self.sample = sample, and so on. And we can assign to a tuple from a tuple. So again, this is my way of coding; most people think this is horrible, but I prefer to be able to see everything at once. And so I know in my code, any time I see something that looks like this, it's always all of the stuff in the method being set. If I did it a different way, then half the code would come off the bottom of the page and you couldn't see it. All right, so that was the first thing I thought about: to create a random forest, what information do you need? Then I need to store that information inside my object, and then I need to create some trees — a random forest is something that has some trees. So I basically figured: okay, a list comprehension to create a list of trees. How many trees do we have? We've got n_trees trees.
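Putting those pieces together, the constructor being described looks roughly like this (a sketch — the names `create_tree`, `sample_sz`, and `min_leaf` follow the lecture's conventions, and `create_tree` is still a stub at this point):

```python
import numpy as np

class TreeEnsemble:
    def __init__(self, x, y, n_trees, sample_sz, min_leaf=5):
        np.random.seed(42)  # constant seed: the same "random" forest every run, for testing
        # Tuple assignment: every constructor argument stored on a single line
        self.x, self.y, self.sample_sz, self.min_leaf = x, y, sample_sz, min_leaf
        self.trees = [self.create_tree() for i in range(n_trees)]

    def create_tree(self):
        # Placeholder: filled in once we've decided what a tree needs
        return None

ens = TreeEnsemble(np.random.rand(10, 2), np.random.rand(10), n_trees=3, sample_sz=5)
print(len(ens.trees))  # → 3
```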
That's what we asked for. So range(n_trees) gives me the numbers from zero up to n_trees minus one, and if I create a list comprehension that loops through that range, calling create_tree each time, I now have n_trees trees. And I had to write that without having to think at all — that's all obvious — so I've kind of delayed the thinking to the point where it's like: well, wait, we don't have something to create a tree. Okay, no worries, but let's pretend we did. If we did, we've now created a random forest. We still need to do a few things on top of that. For example, once we have it, we would need a predict function. So let's write a predict function. How do you predict in a random forest? Can somebody tell me, either based on their own understanding or based on this line of code — what would be your one or two sentence answer: how do you make a prediction in a random forest? "You would want to, for every tree, for the row that you're trying to predict on, average the values that each tree would produce for that." Exactly, good. And so that's a summary of what this says. For a particular row — or maybe this is a number of rows — go through each tree and calculate its prediction. So here is a list comprehension that is calculating the prediction for every tree for x. I don't know if x is one row or multiple rows; it doesn't matter, as long as tree.predict works on it. And then, once you've got a list of things, a cool trick to know is that you can pass np.mean a regular non-numpy list, and it'll take the mean.
You just need to tell it axis=0, which means average it across the lists. So this is going to return the average of .predict for each tree. And I find list comprehensions allow me to write code in the way my brain works. You could take the words Spencer said and translate them into this code, or take this code and translate it into words like the ones Spencer said. And so when I write code, I want it to be as much like that as possible — I want it to be readable. Hopefully you'll find, when you look at the fastai code and try to understand how it does X, that I try to write things in a way you can read and kind of turn into English in your head. "So if I see correctly, that predict method is recursive?" No, it's calling tree.predict, and we haven't written a tree yet. self.trees is going to contain tree objects. So this is TreeEnsemble.predict, and inside the trees is a tree, not a tree ensemble, so this is calling tree.predict, not TreeEnsemble.predict. Good question. Okay. So we've nearly finished writing our random forest, haven't we? All we need to do now is write create_tree. So, based on this code here, or on your own understanding of how we create trees in a random forest, can somebody tell me — let's take a few seconds, have a read, have a think — how do you create a tree in a random forest? Okay, who wants to tell me? Anybody else? Okay. "You're essentially taking a random sample of the original data and then just constructing a tree, however that happens." So: construct a decision tree — a non-random tree — from a random sample of the data. Okay. So again, we've delayed any actual thought process here.
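The ensemble predict described above can be sketched like this; the `ConstantTree` is a hypothetical stand-in for a fitted tree, just to show the averaging.

```python
import numpy as np

class ConstantTree:
    """Hypothetical stand-in for a fitted tree: always predicts one value."""
    def __init__(self, val): self.val = val
    def predict(self, x): return np.full(len(x), self.val)

class Ensemble:
    def __init__(self, trees): self.trees = trees
    def predict(self, x):
        # One prediction array per tree; np.mean happily takes a plain Python
        # list, and axis=0 averages across the trees rather than across rows
        return np.mean([t.predict(x) for t in self.trees], axis=0)

ens = Ensemble([ConstantTree(1.0), ConstantTree(2.0), ConstantTree(3.0)])
print(ens.predict(np.zeros((4, 2))))  # → [2. 2. 2. 2.]
```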
We've basically said: okay, we could pick some random IDs. This is a good trick to know: if you call np.random.permutation, passing in an int, it'll give you back a randomly shuffled sequence from zero to that int, and then if you grab the first [:n] items of that, that's now a random subsample. So this is not doing bootstrapping; we're not doing sampling with replacement here. No — which I think is fine. For my random forest, I'm deciding that it's going to be something where we do subsampling, not bootstrapping. So here's a good line of code to know how to write, because it comes up all the time: I find in machine learning most algorithms I use are somewhat random, so often I need some kind of random sample. Can you pass that over there? "Won't that give you one extra, because you said it'll go from zero to the length?" No: if len(self.y) is n, this will give you a sequence of length n, so zero to n minus one, and then from that I'm picking out [:self.sample_sz], so the first sample_sz IDs. "I have a comment on bootstrapping. I think this method is better, because with bootstrapping we have a chance of giving more weight to individual observations — weighting single observations more than we want — because when bootstrapping with replacement, we can have a single observation and duplicates of it in the same tree." Yeah, it does feel weird, but I'm not sure that the theory or empirical results back up our intuition that it's worse. It would be interesting to look back at that, actually. Personally, I prefer this because I feel like most of the time we have more data than we want to put in a tree at once. I feel like back when Breiman created random forests,
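That sampling line, on its own, looks like this:

```python
import numpy as np

np.random.seed(42)
n, sample_sz = 10, 4
# np.random.permutation(n) returns a shuffled version of 0..n-1, so taking
# the first sample_sz items gives a random subsample WITHOUT replacement
# (each index appears at most once -- subsampling, not a bootstrap)
idxs = np.random.permutation(n)[:sample_sz]
print(len(idxs), len(set(idxs.tolist())))  # → 4 4
```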
Back when Breiman created random forests, it was 1999, and it was kind of a very different world, where we pretty much always wanted to use all the data we had. But nowadays I would say that's generally not what we want: we normally have too much data. And so what people tend to do is fire up a Spark cluster and run it on hundreds of machines, when it makes no sense, because if they had just used a subsample each time, they could have done it on one machine. Spark has a huge amount of I/O overhead. I know you guys are doing distributed computing now; if you've looked at some of the benchmarks, yeah, exactly. So if you do something on a single machine, it can often be hundreds of times faster, because you don't have all this I/O overhead. It also tends to be easier to write the algorithms (you can use things like scikit-learn), easier to visualize, cheaper, and so forth.

So I almost always avoid distributed computing, and I have my whole life. Even 25 years ago, when I was starting in machine learning, I still didn't use clusters, because I always felt that whatever I could do with a cluster now, I could do with a single machine in five years' time. So let's focus on always being as good as possible with a single machine. That's going to be more interactive and more iterative, and it's worked for me.

OK, so again, we've delayed thinking to the point where we have to write DecisionTree. And so hopefully you get an idea that with this top-down approach, the goal is that we're going to keep delaying thinking so long that we delay it forever. Eventually we've somehow written the whole thing without actually having to think, right? And that's kind of what I need, because I'm kind of slow. So this is why I write code this way. And notice, you never have to design anything. You just say: hey, what if somebody already gave me the exact API I needed?
How would I use it? OK. And then, to implement that next stage, what would be the exact API I would need? You keep going down until eventually you're like: oh, that already exists.

OK, so this assumes we've got a class called DecisionTree, so we're going to have to create that. A decision tree: well, we already know what we're going to have to pass it, because we just passed it. So we're passing in a random sample of x's and a random sample of y's. Indexes is, actually, something we know about down the track (so I've got to plan a tiny bit): we know that a decision tree is going to contain decision trees, which themselves contain decision trees, and so as we go down the decision tree, there's going to be some subset of the original data that we've kind of got to. And so I'm going to pass in the indexes of the data that we're actually going to use here. So initially it's the entire random sample: I've got the whole range, and I turn that into an array. So that's the indexes from zero to the size of the sample. And then we'll just pass down min_leaf. So everything that we got for constructing the random forest, we're going to pass down to the decision tree, except of course n_trees, which is irrelevant for a single decision tree.

So again, now that we know that's the information we need, we can go ahead and store it inside this object. I'm pretty likely to need to know how many rows we have in this tree, which I generally call n, and how many columns I have, which I generally call c. So the number of rows is just equal to the number of indexes we were given, and the number of columns is just however many columns there are in our independent variables.

So then we're going to need this value here: we need to know, for this tree, what's its prediction?
So the prediction for this tree is the mean of our dependent variable for those indexes which are inside this part of the tree. So at the very top of the tree, it contains all the indexes. I'm assuming that by the time we've got to this point, remember, we've already done the random sampling. So when we talk about indexes, we're not talking about the random sampling used to create the tree; we're assuming this tree already has its random sample. This is one of the nice things: inside DecisionTree, the whole random sampling thing is gone. That was done by the random forest. So at this point we're building something that's just a plain old decision tree. It's not in any way a random-sampling anything; it's just a plain old decision tree.

So the indexes are literally: which subset of the data have we got to so far in this tree? And so at the top of the decision tree, it's all the data, so it's all of the indexes. So this is therefore all of the dependent variable that's in this part of the tree, and this, the value, is the mean of that. Does that make sense? Anybody have any questions about that? Yes, can you pass it to Chenxi?

"Actually, just to let you know, a large portion of us don't have OOP experience."

OK, sure. So a quick OOP primer would be helpful? Great. Yeah, OK. Who has done object-oriented programming in some programming language? OK. So actually, you've all used lots of object-oriented programming, in terms of using existing classes. Every time we've created a random forest, we've called the RandomForestRegressor constructor, and it's returned an object, and then we've called methods and attributes on that object. So fit is a method; you can tell because it's got parentheses after it. Whereas oob_score_ is a property, or an attribute; it doesn't have parentheses after it.
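Stepping back to the tree for a second, the DecisionTree constructor just walked through might be sketched like this (the toy x and y are mine; n, c, and val follow the naming above):

```python
import numpy as np

class DecisionTree:
    """One node of a plain (non-random) decision tree; the random
    sampling already happened in the ensemble before we got here."""
    def __init__(self, x, y, idxs, min_leaf=5):
        self.x, self.y, self.idxs, self.min_leaf = x, y, idxs, min_leaf
        self.n = len(idxs)           # rows in this part of the tree
        self.c = x.shape[1]          # columns in the independent variables
        self.val = np.mean(y[idxs])  # this node's prediction

x = np.array([[1., 2.], [3., 4.], [5., 6.]])
y = np.array([10., 20., 30.])
t = DecisionTree(x, y, np.arange(len(y)))  # top of the tree: all the indexes
print(t.n, t.c, t.val)  # → 3 2 20.0
```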
There are kind of two kinds of things inside an object. There are the functions that you can call, so you have object.function(arguments), and there are the properties or attributes you can grab, which is object-dot and then just the attribute name, with no parentheses. And then the other thing that we do with objects is we create them: we write the name of the class, and it returns us the object, and you have to tell it all of the parameters necessary for it to get constructed.

So let's just copy this code and see how we're going to go ahead and build this. The first step is, we're not going to go m = RandomForestRegressor; we're going to go m = TreeEnsemble. We're creating a class called TreeEnsemble, and we're going to pass in various bits of information. So maybe we'll have 10 trees, a sample size of a thousand, maybe a min_leaf of three. And you can always choose to name your arguments or not; when you've got quite a few, it's kind of nice to name them, just so we can see what each one means. It's always optional.

So we're going to try and create a class that we can use like this. And then, I'm not sure we're going to bother with .fit, because we've passed in the x and the y. In scikit-learn they use an approach where first of all you construct something without telling it what data to use, and then you pass in the data. We're doing those two steps at once; we're actually passing in the data. And so then after that, we're going to go preds = m.predict, passing in maybe some validation set. OK, so that's the API.
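Those three interactions (construct, call a method, read an attribute) in a throwaway example; Model, fit, and score_ are made-up names echoing the scikit-learn style:

```python
class Model:
    def __init__(self, n_trees=10):  # constructor: called as Model(...)
        self.n_trees = n_trees       # attribute: m.n_trees, no parentheses

    def fit(self, x, y):             # method: m.fit(x, y), with parentheses
        self.score_ = 0.99           # attributes can also be set later
        return self

m = Model(n_trees=3)  # naming the arguments shows what each one means
m.fit([[1]], [1])
print(m.n_trees, m.score_)  # → 3 0.99
```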
So this is the API we're kind of creating here. Now, this thing here is called a constructor; something that creates an object is called a constructor. In Python, there are a lot of ugly, hideous things, one of which is that it uses these special magic method names. __init__ (underscore underscore init underscore underscore) is a special magic method that's called when you try to construct a class. So when I call TreeEnsemble(...), it actually calls TreeEnsemble.__init__. You'll see people say "dunder init" for double-underscore init; I kind of hate it, but anyway. So that's why we've got this method called __init__: when I call TreeEnsemble, it's going to call this method.

Another hideously ugly thing about Python's OO is this special rule: if you have a class (and to create a class, you just write class and the name of the class), all of its methods automatically get sent one extra argument, which is the first argument, and you can call it anything you like. But if you call it anything other than self, everybody will hate you and you're a bad person. So call it anything you like, as long as it's self.

So that's why you always see this. And in fact, I can immediately see here that I have a bug. Anybody see the bug in my predict function? I should have self, right? Anytime you try to call a method on your own class and you get something saying you passed in two arguments and it was only expecting one, you forgot self.

OK, so this is a really dumb way to add OOP to a programming language, but older languages like Python often did this because they kind of needed to: they started out not being OO, and then they added OO in a way that was hideously ugly.
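The forgot-self bug just described, reproduced in a throwaway class:

```python
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet():        # bug: forgot self!
        return "hi"

g = Greeter("x")
try:
    g.greet()           # Python passes g as the first argument anyway...
except TypeError:
    # ...so you get "greet() takes 0 positional arguments but 1 was given"
    print("forgot self")
```

Whenever you see "takes N positional arguments but N+1 were given" on your own class, check for a missing self first.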
Perl, which predates Python by a little bit, I think really came up with this approach, and unfortunately other languages of that era stuck with it. So you have to add in this magic self.

So, the magic self. When you're inside this class, you can now pretend as if any property name you like exists. So I can now pretend there's something called self.x: I can read from it, I can write to it. But if I read from it and I haven't yet written to it, I'll get an error. The stuff that's passed to the constructor gets thrown away by default; there's nothing that says this class needs to remember what those things are. But anything that we stick inside self is remembered for all time, you know, as long as this object exists. You can access it; it's remembered.

So now, in fact, let's do this: let's create the TreeEnsemble class, and let's now instantiate it. Of course we haven't got x; we need to pass x_train, y_train. "DecisionTree is not defined", so let's create a really minimal DecisionTree. There we go. So here is enough to actually instantiate our tree ensemble: we've defined the __init__ for it, and we've defined the __init__ for DecisionTree. We need DecisionTree's __init__ to be defined because inside our ensemble's __init__ we call self.create_tree, and then self.create_tree calls the DecisionTree constructor, and the DecisionTree constructor basically does nothing at all other than save information.

So at this point we can go m-dot, and if I press Tab at this point, can anybody tell me what I would expect to see? Can you pass it to Taylor? Chenxi, could you pass it to Taylor?

"We would see a drop-down of all available methods for that class."

OK, which would be, in this case? "So if m is a TreeEnsemble, we would have create_tree and predict." OK. Anything else?
"Oh yeah, and, as someone whispered, the variables as well."

Yeah, so the right word there would be the attributes: the things that we put inside self. So if I hit Tab, there they are. As Taylor said, there's create_tree, there's predict, and then there's everything else we put inside self. So if I look at m.min_leaf and hit Shift-Enter, what will I see? Yeah, the number that I just put there: I passed min_leaf=3, so that went up here to min_leaf. This here is a default argument; it says if I don't pass anything, it'll be five, but I did pass something, so self.min_leaf is set equal to min_leaf here, which is three.

Because of this rather annoying way of doing OO, it does mean that it's very easy to accidentally forget to do that. If I don't assign it to self.min_leaf, then I get an error: "'TreeEnsemble' object has no attribute 'min_leaf'". So how do I create that attribute? I just put something in it. So if you don't know what its value should be yet, but you kind of need to be able to refer to it, you can always go self.min_leaf = None. At least then it's something you can read, you can check for None-ness, and not get an error.

Great. Now, interestingly, I was able to instantiate TreeEnsemble even though predict refers to a method of DecisionTree that doesn't exist. And this is actually something very nice about the dynamic nature of Python: because it's not compiling it, it's not checking anything unless you're using it. So we can go ahead and create DecisionTree.predict later, and then our already-instantiated object will magically start working.
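Python's late lookup in a nutshell; Tree and helper are throwaway names:

```python
class Tree:
    def predict(self, x):
        return self.helper(x)  # helper doesn't exist yet; Python doesn't care

t = Tree()                     # instantiating works fine: nothing is checked
try:
    t.predict(1)               # only actually *calling* it triggers the lookup
except AttributeError:
    print("fails only when used")

def helper(self, x): return x * 2
Tree.helper = helper           # define the missing method later...
print(t.predict(21))           # → 42 ...and the existing object starts working
```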
It doesn't actually look up that method's details until you use it, and so it really helps with top-down programming.

OK, so when you're inside a class definition, in other words at that indentation level, indented one in (these are all class definitions), any function that you create, unless you do some special things that we're not going to talk about yet, is automatically a method of that class. And every method of that class magically gets self passed to it. So, since we've got a tree ensemble, we could call m.create_tree, and we don't put anything inside those parentheses, because the magic self will be passed, and the magic self will be whatever m is. So m.create_tree() returns a DecisionTree, just like we asked it to. And so m.create_tree().idxs will give us the self.idxs inside that decision tree, which is set to np.array(range(self.sample_sz)).

Why, as data scientists, do we care about object-oriented programming? Because a lot of the stuff you use is going to require you to implement stuff with OOP. For example, every single PyTorch model of any kind is created with OOP; it's the only way to create PyTorch models. The good news is: what you see here is the entirety of what you need to know.
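Pulling the pieces together, here's one minimal, instantiable sketch of the two classes as described so far. This is a sketch under the lecture's naming, not the notebook verbatim, and the toy x_train and y_train are mine:

```python
import numpy as np

class DecisionTree:
    def __init__(self, x, y, idxs, min_leaf=5):
        # basically does nothing at all other than save information
        self.x, self.y, self.idxs, self.min_leaf = x, y, idxs, min_leaf

class TreeEnsemble:
    def __init__(self, x, y, n_trees, sample_sz, min_leaf=5):
        self.x, self.y = x, y
        self.sample_sz, self.min_leaf = sample_sz, min_leaf
        self.trees = [self.create_tree() for _ in range(n_trees)]

    def create_tree(self):
        # random subsample *without* replacement (permutation trick)
        idxs = np.random.permutation(len(self.y))[:self.sample_sz]
        # inside the tree, the sampling is gone: it gets plain 0..sample_sz-1
        return DecisionTree(self.x[idxs], self.y[idxs],
                            np.array(range(self.sample_sz)),
                            min_leaf=self.min_leaf)

x_train = np.random.randn(1000, 3)
y_train = np.random.randn(1000)
m = TreeEnsemble(x_train, y_train, n_trees=10, sample_sz=100, min_leaf=3)
print(len(m.trees), m.trees[0].idxs[:3])  # → 10 [0 1 2]
```

Notice that nothing here had to be designed up front: each class just saves what it was given, exactly as the top-down walkthrough built it.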
So this is all you need to know: you need to know to create something called __init__, to assign the things that are passed to __init__ to attributes of something called self, and then to stick the word self as the first argument of each of your methods. And the nice thing is, to think as an OOP programmer is to realize that you don't now have to pass around x, y, sample_sz, and min_leaf to every function that uses them. By assigning them to attributes of self, they're available, like magic. So this is why OOP is super handy. In fact, I started trying to create a decision tree initially without using OOP, and trying to keep track of what that decision tree was meant to know about was very difficult. Whereas with OOP, you can just say, inside the decision tree, self.idxs equals this, and everything just works.

OK, that's great. We're out of time, and I think that's great timing. So that's your introduction to OOP, but this week: next class I'm going to assume that you can use it. So you should create some classes, instantiate some classes, look at their methods and properties, have them call each other, and so forth, until you feel comfortable with them. And maybe, for those of you that haven't done OOP before, if you find some other useful resources, you could pop them onto the wiki thread so that other people know what you found useful.

Great, thanks everybody.