Okay. So welcome to lesson six, the first time we've been in lesson six. Welcome back to Practical Deep Learning for Coders. We just started looking at tabular data last time. And for those of you who've forgotten, what we did was we were looking at the Titanic data set. And we were looking at creating binary splits by looking at categorical variables or binary variables like sex, and continuous variables like the log of the fare that they paid. And using those, we also came up with a score, which was basically how good a job did that split do of grouping the survival characteristics into two groups, nearly all of one of which survived and nearly all of the other of which didn't. So they had a small standard deviation in each group. And so then we created the world's simplest little UI to allow us to fiddle around and try to find a good binary split. And we did come up with a very good binary split, which was on sex, and actually we created this little automated version. And so this is yet another time, I should say, that we have successfully created an actual machine learning algorithm from scratch. This one is about the world's simplest one. It's OneR: creating the single rule which does a good job of splitting your data set into two parts which differ as much as possible on the dependent variable. OneR is probably not going to cut it for a lot of things, though. It's surprisingly effective, but maybe we could go a step further. And the step further we could go is we could create like a TwoR. What if we took each of those groups, males and females in the Titanic data set, and split each of those into two other groups? So split the males into two groups and split the females into two groups. So to do that, we can repeat the exact same piece of code we just did.
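To make the scoring idea from the recap concrete, here is a minimal pure-Python sketch, assuming the score is the size-weighted standard deviation of the outcome on each side of the split (lower is better). The function names and the toy data are made up for illustration; this is not the notebook's actual code.

```python
from statistics import pstdev

def side_score(outcomes):
    """Spread of the outcome within one group, weighted by the group's size."""
    if len(outcomes) <= 1:
        return 0.0  # an empty or single-row group has no spread
    return pstdev(outcomes) * len(outcomes)

def split_score(values, outcomes, threshold):
    """Score a binary split `value <= threshold`; lower means purer groups."""
    left = [y for x, y in zip(values, outcomes) if x <= threshold]
    right = [y for x, y in zip(values, outcomes) if x > threshold]
    return (side_score(left) + side_score(right)) / len(outcomes)

def best_split(values, outcomes):
    """Try every distinct value as a threshold and keep the lowest score."""
    candidates = sorted(set(values))
    return min(candidates, key=lambda t: split_score(values, outcomes, t))

# Toy column (hypothetical): splitting at 2 separates the outcomes perfectly.
toy_values = [1, 2, 3, 4]
toy_outcomes = [0, 0, 1, 1]
toy_best = best_split(toy_values, toy_outcomes)
```

Running `best_split` once per column and keeping the column with the lowest score is the OneR idea; running it again inside each resulting group is the step described next.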
But let's remove sex from it, and then split the data set into males and females, and run the same piece of code that we just did before, but just for the males. And so this is going to be like a OneR rule for how we predict which males survived the Titanic. And let's have a look: 0.38, 0.37, 0.38, 0.38, 0.38. Okay, so it's age: whether they were greater than or less than six turns out to be, for the males, the biggest predictor of whether they were going to survive that shipwreck. And we can do the same thing for females. So for females, there we go, no great surprise: Pclass. So whether they were in first class or not was the biggest predictor for females of whether they would survive the shipwreck. So that has now given us a decision tree. It is a series of binary splits which will gradually split up our data more and more, such that in the end, in the leaf nodes, as we call them, we will hopefully get as strong a prediction as possible about survival. So we could just repeat this step for each of the four groups we've now created: male kids, males older than six, females in first class, and everybody else. And we could do it again, and then we'd have eight groups. We could do that manually with another couple of lines of code, or we can just use DecisionTreeClassifier, which is a class which does exactly that for us. So there's no magic in here. It's just doing what we've just described. And DecisionTreeClassifier comes from a library called scikit-learn. scikit-learn is a fantastic library that focuses on classical, non-deep-learning-ish machine learning methods, like decision trees. So to create the exact same decision tree, we can say: please create a decision tree classifier with at most four leaf nodes. And one very nice thing it has is it can draw the tree for us. So here's a tiny little draw tree function. And you can see here, it's going to first of all split on sex.
Now it looks a bit weird to say sex is less than or equal to 0.5. But remember, our binary characteristics are coded as zero and one. So that's just an easy way to say males versus females. And then here we've got, for the females, what class are they in? And for the males, what age are they? And here's our four leaf nodes. So for the females in first class, 116 of them survived and four of them didn't. So very good idea to be a well-to-do woman on the Titanic. On the other hand, male adults: 68 survived, 350 died. So very bad idea to be a male adult on the Titanic. So you can see you can get a quick summary of what's going on. And one of the reasons people tend to like decision trees, particularly for exploratory data analysis, is that it does allow us to get a quick picture of what the key driving variables in this data set are, and how much they predict what was happening in the data. Okay, so it's found the same splits as us. And it's got one additional piece of information we haven't seen before. This is something called Gini. Gini is just another way of measuring how good a split is. And I've put the code to calculate Gini here. Here's how you can think of Gini. If you go into that sample and grab one item, and then go in again and grab another item, how likely is it that you're going to grab the same kind of item each time? And so if the entire leaf node is just people who survived, or just people who didn't survive, the probability would be one. You'd get the same every time. If it was an exactly equal mix, the probability would be 0.5. So that's where this formula comes from in the binary case. And in fact, you can see it here, right? This group here is pretty much 50-50, so Gini is 0.5. Whereas this group here is nearly 100% in one class, so Gini is nearly zero. So I had it backwards: Gini is one minus that probability.
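Here is a small sketch of that Gini calculation, written as one minus the probability of drawing the same class twice (with replacement). The function name is illustrative, and the first-class-female counts are the ones read off the tree above.

```python
from collections import Counter

def gini(outcomes):
    """1 - P(two random draws, with replacement, land in the same class)."""
    n = len(outcomes)
    p_same = sum((count / n) ** 2 for count in Counter(outcomes).values())
    return 1 - p_same

pure_leaf = gini([1, 1, 1, 1])           # everyone survived: Gini is 0
even_leaf = gini([1, 1, 0, 0])           # exact 50-50 mix: Gini is 0.5
first_class_women = gini([1] * 116 + [0] * 4)  # nearly pure: Gini close to 0
```

So a lower Gini means a purer leaf, which is why the nearly all-survived group shows a value near zero and the mixed group shows a value near 0.5.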
And I think I've written it backwards here as well, so I'd better fix that. So this decision tree, we would expect it to be more accurate, so we can calculate its mean absolute error. And for the OneR, so just doing males versus females, what was our score? Here we go, 0.407. Actually, do we have an accuracy score? So here we are, 0.336. Oh, that was for LogFare. And for sex, it was 0.215. Okay, so 0.215. So that was for the OneR version. For the decision tree with four leaf nodes, 0.224. So it's actually a little worse, right? And I think this just reflects the fact that this is such a small data set, and the OneR version was so good, that we haven't really improved it that much, not enough to really see it amongst the randomness of such a small validation set. We could go further, to a minimum of 50 samples per leaf node. That means that in each of these leaves, see how it says samples, which in this case means passengers on the Titanic, there's at least 50. There's 67 people here that were female, first class, and under 28. That's how you define that group. So this decision tree keeps building, keeps splitting, until it gets to a point where there's going to be less than 50, at which point it stops splitting that leaf. So you can see they've all got at least 50 samples. And so here's the decision tree that builds. As you can see, it doesn't have to be constant depth, right? So this group here is males who had cheaper fares, and who were older than 20 but younger than 32, actually younger than 24, and actually super cheap fares, and so forth, right? So it keeps going down until we get to that group. So let's try that decision tree. So that decision tree has a mean absolute error of 0.183. So not surprisingly, once we get there, it's starting to look like it's a little bit better. So there's a model, and this is a Kaggle competition. So therefore, we should submit it to the leaderboard.
And one of the biggest mistakes I see, not just beginners, but every level of practitioner make on Kaggle is not submitting to the leaderboard: they spend months making some perfect thing, right? But you actually want to see how you're going, and you should try to submit something to the leaderboard every day, regardless of how rubbish it is, because you want to improve every day. So you want to keep iterating. So to submit something to the leaderboard, you generally have to provide a CSV file. And so we're going to create a CSV file. And we're going to apply the category codes to get the category for each one in our test set. We're going to set the Survived column to our predictions. And then we're going to send that off to a CSV. And so, yeah, I submitted that. And I got a score a little bit worse than most of our linear models and neural nets, but not terrible. It's doing an okay job. Now, one interesting thing for the decision tree is there was a lot less pre-processing to do. Did you notice that? We didn't have to create any dummy variables for our categories. And you certainly can create dummy variables, but you often don't have to. So, for example, for class, it's one, two, or three, and you can just split on one, two, or three. Even for, what was that thing, the embarkation city code, we just convert them kind of arbitrarily to numbers, and you can split on those numbers. So with random forests, sorry, not random forests, decision trees, you can generally get away with not doing stuff like dummy variables. In fact, even taking the log of fare, we only did that to make our graph look better. But if you think about it, splitting on log fare less than 2.7 is exactly the same as splitting on fare less than e to the 2.7, or whatever log base we used, I can't remember.
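A quick way to convince yourself of that last point: the sketch below assumes a natural log and strictly positive fares (the fare values are made up), and checks that thresholding the logged column and thresholding the raw column at `e**2.7` put every row on the same side.

```python
import math

# Hypothetical fares; all positive so that math.log is defined.
fares = [4.0, 7.25, 8.05, 13.0, 14.9, 26.55, 71.28, 512.33]

# Split on the transformed column...
goes_left_log = [math.log(f) <= 2.7 for f in fares]
# ...versus the equivalent split on the raw column.
goes_left_raw = [f <= math.exp(2.7) for f in fares]

# A monotonic transform also leaves the row ordering unchanged.
order_raw = sorted(range(len(fares)), key=lambda i: fares[i])
order_log = sorted(range(len(fares)), key=lambda i: math.log(fares[i]))
```

Because the log is monotonic, both the ordering and the resulting groups are identical, so a tree gets nothing extra from the transformed column.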
So all that a decision tree cares about is the ordering of the data. And this is another reason that decision tree based approaches are fantastic, because they don't care at all about outliers, long tail distributions, categorical variables, whatever. You can throw it all in, and it'll do a perfectly fine job. So for tabular data, I would always start by using a decision tree based approach to create some baselines and so forth, because it's really hard to mess it up. And that's important. So yeah, here, for example, is Embarked, right? It was coded originally as the first letter of the city they embarked in. But we turned it into a categorical variable. And so pandas creates for us this vocab, this list of all of the possible values. And if you look at the codes attribute, you can see it's 0, 1, 2. So S has become 2, C has become 0, and so forth. Right? So that's how we're converting the categories, the strings, into numbers that we can sort and group by. So if we wanted to split C into one group and Q and S into the other, we can just do: less than or equal to 0.5. Now, of course, if we wanted to split C and S into one group and Q into the other, we would need two binary splits. First C on one side and Q and S on the other, and then Q and S into Q versus S. And then the C and S leaf nodes could get similar predictions. So you do have that: sometimes it can take a little bit more messing around. But most of the time, I find categorical variables work fine as numeric in decision tree based approaches. And as I say here, I tend to use dummy variables only if there's less than four levels. Now, what if we wanted to make this more accurate? Could we grow the tree further? I mean, we could, but there's only 50 samples in these leaves, right? If we keep splitting, the leaf nodes are going to have so little data that they're not really going to make very useful predictions.
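As a rough illustration of the category-code idea, here is a pure-Python stand-in for what pandas does when a string column becomes categorical (in pandas itself the equivalents are the `.cat.categories` and `.cat.codes` attributes). The sample values and helper name are made up.

```python
# Hypothetical Embarked values for six passengers.
embarked = ["S", "C", "S", "Q", "S", "C"]

# The vocab is the sorted list of distinct values; a code is a value's index in it.
vocab = sorted(set(embarked))                      # ['C', 'Q', 'S']
code_of = {value: i for i, value in enumerate(vocab)}
codes = [code_of[value] for value in embarked]     # C -> 0, Q -> 1, S -> 2

def groups_at(threshold):
    """Which category letters fall on each side of `code <= threshold`."""
    left = {v for v in vocab if code_of[v] <= threshold}
    right = {v for v in vocab if code_of[v] > threshold}
    return left, right

c_vs_rest = groups_at(0.5)    # one split isolates C
cq_vs_s = groups_at(1.5)      # one split isolates S
```

No single threshold puts Q alone on one side, which is why isolating the middle code takes two splits.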
Now, there are limitations to how accurate a decision tree can be. So what can we do? We can do something that's actually very, I mean, I find it amazing and fascinating. It comes from a guy called Leo Breiman. And Leo Breiman came up with this idea called bagging. And here's the basic idea of bagging. Let's say we've got a model that's not very good. Because let's say it's a decision tree, it's really small, we've hardly used any data for it. It's not very good. So it's got error. It's got errors on predictions. It's not a systematically biased error. It's not always predicting too high or always predicting too low. I mean, decision trees, you know, on average will predict the average, right? But it has errors. So what I could do is I could build another decision tree in some slightly different way that would have different splits. And it would also be not a great model, but predicts the correct thing on average. It's not completely hopeless. And again, you know, some of the errors are a bit too high and some are a bit too low. And I could keep doing this. So if I could keep building lots and lots of slightly different decision trees, I'm going to end up with say 100 different models, all of which are unbiased, all of which are better than nothing, and all of which have some errors bit high, some bit low, whatever. So what would happen if I average their predictions? Assuming that the models are not correlated with each other, then you're going to end up with errors on either side of the correct prediction. Some are a bit high, some are a bit low. There'll be this kind of distribution of errors, right? And the average of those errors will be zero. And so that means the average of the predictions of these multiple uncorrelated models, each of which is unbiased, will be the correct prediction, because they have an error of zero. And this is a mind blowing insight. 
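The averaging argument can be checked with a tiny simulation. This sketch assumes each model's prediction is the true value plus independent, zero-mean Gaussian noise, which is an idealisation of "unbiased and uncorrelated"; the numbers are arbitrary. With a finite number of models the averaged error shrinks toward zero rather than hitting it exactly.

```python
import random
from statistics import mean

random.seed(0)

truth = 10.0       # the value every model is trying to predict
n_models = 100
noise_sd = 2.0     # each individual model is fairly noisy

# Each "model" is unbiased: right on average, wrong on any single prediction.
predictions = [truth + random.gauss(0, noise_sd) for _ in range(n_models)]

typical_single_error = mean(abs(p - truth) for p in predictions)
ensemble_error = abs(mean(predictions) - truth)
```

The errors on either side of the truth largely cancel, so `ensemble_error` comes out much smaller than `typical_single_error`.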
It says that if we can generate a whole bunch of uncorrelated, unbiased models, we can average them and get something better than any of the individual models because the average of the error will be zero. So all we need is a way to generate lots of models. Well, we already have a great way to build models, which is to create a decision tree. How do we create lots of them? How do we create lots of unbiased, but different models? Well, let's just grab a different subset of the data each time. Let's just grab at random half the rows and build a decision tree and then grab another half the rows and build a decision tree. Grab another half the rows and build a decision tree. Each of those decision trees is going to be not great. It's only using half the data, but it will be unbiased. It will be predicting the average on average. It will certainly be better than nothing because it's using some real data to try and create a real decision tree. They won't be correlated with each other because they're each random subsets. So that meets all of our criteria for bagging. When you do this, you create something called a random forest. So let's create one in four lines of code. So here is a function to create a decision tree. So let's say this is just the proportion of data. So let's say we put 75% of the data in each time, or we could change it to 50%, whatever. So this is the number of samples in this subset, n. And so let's at random choose n times the proportion we requested from the sample and build a decision tree from that. And so now let's 100 times get a tree and stick them all in a list using a list comprehension. And now let's grab the predictions for each one of those trees. And then let's stack all those predictions up together and take their mean. And that is a random forest. And what do we get? 1, 2, 3, 4, 5, 6, 7, 8, 7 lines of code. So random forests are very simple. This is a slight simplification. 
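The steps just described can be sketched as follows. This is a minimal version assuming scikit-learn and NumPy are installed; the synthetic data, the 75% proportion, `min_samples_leaf=5`, and sampling rows without replacement are illustrative choices, not necessarily what the lecture notebook uses.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for a tabular data set: 3 numeric columns, binary target.
n_rows = 400
X = rng.normal(size=(n_rows, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)
X_train, y_train = X[:300], y[:300]
X_valid, y_valid = X[300:], y[300:]

def get_tree(prop=0.75):
    """Fit one decision tree on a random subset of the training rows."""
    n = len(y_train)
    idxs = rng.choice(n, int(n * prop), replace=False)
    return DecisionTreeClassifier(min_samples_leaf=5).fit(X_train[idxs], y_train[idxs])

# Build 100 slightly different trees, then average their predictions.
trees = [get_tree() for _ in range(100)]
all_preds = np.stack([tree.predict(X_valid) for tree in trees])
avg_preds = all_preds.mean(axis=0)

mae = np.abs(y_valid - avg_preds).mean()
```

Each entry of `avg_preds` is the fraction of trees that voted for class 1 on that row, so it can be read as a rough probability.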
There's one other thing that random forests do differently, which is that when they build the decision tree, they also randomly select a subset of columns. And they select a different random subset of columns each time they do a split. And so the idea is you kind of want it to be as random as possible, but also somewhat useful. So we can do that by creating a RandomForestClassifier. So how many trees do we want? How many samples per leaf? And then fit does what we just did. And here's our mean absolute error, which again is not as good as that decision tree, but it's still pretty good. And again, it's such a small data set, it's hard to tell if that means anything. And so we can submit that to Kaggle. So earlier on, I created a little function to submit to Kaggle. So now I just create some predictions and I submit to Kaggle. And yeah, looks like it gave nearly identical results to a single tree. Now to one of my favorite things about random forests. And I should say, in most real world data sets of reasonable size, random forests basically always give you much better results than decision trees. This is just a small data set to show you what to do. One of my favorite things about random forests is we can do something quite cool with them. What we can do is we can look at the underlying decision trees they create. So we've now got 100 decision trees. And we can see what columns each one found a split on. And so here, okay, well, the first thing it split on was sex. And it improved the Gini from 0.47 to, if we take the weighted average of 0.38 and 0.31, weighted by the samples, probably about 0.33. So I would say, okay, it's like a 0.14 improvement in Gini, thanks to sex. And we can do that again. Okay, well then Pclass, how much did that improve Gini? Again, we keep weighting it by the number of samples as well. LogFare, how much did that improve Gini?
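The arithmetic behind that "improvement in Gini" can be written down as a small sketch. The Gini values 0.47, 0.38 and 0.31 are the ones read out above; the sample counts (100 and 250) and the function name are made up for illustration, chosen so the weighted average comes out at about 0.33.

```python
def gini_improvement(parent_gini, children):
    """Parent Gini minus the sample-weighted average Gini of the child nodes.

    `children` is a list of (gini, n_samples) pairs, one per child node.
    """
    total = sum(n for _, n in children)
    weighted_child_gini = sum(g * n for g, n in children) / total
    return parent_gini - weighted_child_gini

# Hypothetical sample counts for the two sides of the split on sex.
sex_gain = gini_improvement(0.47, [(0.38, 100), (0.31, 250)])

# Accumulate each split's gain under the column it split on.
importance = {}
for column, gain in [("Sex", sex_gain)]:
    importance[column] = importance.get(column, 0.0) + gain
```

Repeating the accumulation for every split in a tree gives a per-column total for that tree.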
And we can keep track, for each column, of how much in total it improved the Gini in this decision tree. And then do that for every decision tree. And then add them up per column. And that gives you something called a feature importance plot. And here it is. And a feature importance plot tells you how important each feature is. How often did the trees pick it? And how much did it improve the Gini when it did? And so we can see from the feature importance plot that sex was the most important, and class was the second most important, and everything else was a long way back. And this is another reason, by the way, why our random forest isn't really particularly helpful here, because it's just such an easy split to do, right? Basically, all that matters is what class you're in and whether you're male or female. And these feature importance plots, remember, are built on random forests, and random forests don't really care about the distribution of your data, and they can handle categorical variables and stuff like that. That means that for basically any tabular data set you have, you can just plot this right away. And random forests, for most data sets, only take a few seconds to train, really at most a minute or two. And so if you've got a big data set with hundreds of columns, do this first and find the 30 columns that might matter. It's such a helpful thing to do. So I've done that, for example, when I did some work in credit scoring. We were trying to find out which things would predict who's going to default on a loan. And I was given something like 7,000 columns from the database. And I put it straight into a random forest and found I think there were about 30 columns that seemed kind of interesting. I did that like two hours after I started the job. And I went to the head of marketing and the head of risk and I told them, here's the columns I think that we should focus on.
And they were like, oh my god, we just finished a two year consulting project with one of the big consultants, paid the millions of dollars, and they came up with a subset of these. There are other things that you can do with random forests along this path. I'll touch on them briefly. And specifically, I'm going to look at chapter eight of the book, which goes into this in a lot more detail. And particularly interestingly, chapter eight of the book uses a much bigger and more interesting data set, which is auction prices of heavy industrial equipment. I mean, it's less interesting historically, but more interestingly numerically. And so some of the things I did there on this data set, so this isn't from the data set, this is from the scikit-learn documentation. They looked at how, as you increase the number of estimators, so the number of trees, how much does the accuracy improve? So I then did the same thing on our data set. So I actually just added up to 40 more and more and more trees. And you can see that basically as predicted by that kind of initial bit of hand wavy theory I gave you that you'd expect the lower the error because the more things you're averaging, and that's exactly what we find the accuracy improves as we have more trees. John, what's up? Victor, you might have just answered his question actually as he typed it, but he's asking on the same theme the number of trees in a random forest. Does increasing the number of trees always translate to a better error? Yes, it does always. I mean, tiny bumps, right? But yeah, once you smooth it out. But decreasing returns, and if you end up productionising a random forest, then of course every one of these trees you have to go through at inference time. So it's not that there's no cost. I mean, having said that, zipping through a binary tree is the kind of thing you can really do fast. 
In fact, it's quite easy to like literally spit out C++ code with a bunch of if statements and compile it and get extremely fast performance. I don't often use more than 100 trees. This is a rule of thumb. Is that the only one John? Okay. So then there's another interesting feature around forests, which is remember how in our example, we trained with 75% of the data on each tree. So that means for each tree, there was 25% of the data we didn't train on. Now this actually means if you don't have much data in some situations, you can get away with not having a validation set. And the reason why is because for each tree, we can pick the 25% of rows that weren't in that tree and see how accurate that tree was on those rows. And we can average for each row their accuracy on all of the trees in which they were not part of the training. And that is called the out of bag error or OOB error. And this is built in also to SK learn. You can ask for an OOB prediction. John, just before we move on, Zaki has a question about bagging. So we know that bagging is powerful as an ensemble approach to machine learning. Would it be advisable to try out bagging then first when approaching a particular, say tabular task before deep learning? So that's the first part of the question. And the second part is, could we create a bagging model which includes fast AI deep learning models? Yes, absolutely. So to be clear, you know, bagging is kind of like a meta method. It's not a prediction. It's not a method of modeling itself. It's just a method of combining other models. So random forests in particular as a particular approach to bagging is a, you know, I would probably always start personally a tabular project with a random forest because they're nearly impossible to mess up and they give good insight and they give a good base case. But yeah, your question then about, can you bag other models is a very interesting one. And the answer is you absolutely can. And people very rarely do. 
But we will. We will quite soon. Maybe even today. So I, you know, you might be getting the impression I'm a bit of a fan of random forests. And before I was, before, you know, people thought of me as the deep learning guy, people thought of me as the random forests guy. I used to go on about random forests all the time. And one of the reasons I'm so enthused about them isn't just that they're very accurate or that they require, you know, that they're very hard to mess up and require very little processing, preprocessing. But they give you a lot of quick and easy insight. And specifically, these are the five things which I think that we're interested in and all of which are things that random forests are good at. They will tell us how confident are we in our predictions on some particular row. So when somebody, you know, when we're giving a loan to somebody, we don't necessarily just want to know how likely are they to repay. But we'd also like to know how confident are we that we know. Because if we're, if we're like, well, we think they'll repay, but we're not confident of that, we would probably want to give them less of a loan. And another thing that's very important is when we're then making a prediction, so again, for example, for credit, let's say you rejected that person's loan, why? And a random forest will tell us what is the, what is the reason that we made a prediction? And you'll see why all these things. Which columns are the strongest predictors? You've already seen that one, right? That's the feature importance plot. Which columns are effectively redundant with each other, i.e. they're basically highly correlated with each other. And then one of the most important ones is you vary a column. How does it vary the predictions? So for example, in your credit model, how does your prediction of risk vary as you vary? 
Well, something that the regulator would probably want to know about might be some protected variable like race, or some sociodemographic characteristic that you're not allowed to use in your model. So they might check things like that. For the first thing, how confident are we in our predictions for a particular row of data? There's a really simple thing we can do, which is, remember how when we calculated our predictions manually, we stacked up the predictions together and took their mean? Well, what if you took their standard deviation instead? So if you stack up your predictions and take their standard deviation, and that standard deviation is high, that means all of the trees are predicting something different. And that suggests that we don't really know what we're doing. And so that would happen if different subsets of the data end up giving completely different trees for this particular row. So that's a really simple thing you can do to get a sense of your prediction confidence. Okay, feature importance we've already discussed. After I do feature importance, like I said, when I had the 7,000 or so columns, I got rid of all but 30. That doesn't tend to improve the predictions of your random forest very much, if at all, but it certainly helps logistically: thinking about cleaning up the data, you can focus on cleaning those 30 columns, stuff like that. So I tend to remove the low importance variables. I'm going to skip over this bit about removing redundant features because it's a little bit outside what we're talking about, but definitely check it out in the book. It uses something called a dendrogram. But what I did want to mention is partial dependence. This is the thing which says, what is the relationship between a column and the dependent variable? And so this is something called a partial dependence plot. Now this one's actually not specific to random forests.
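Returning to the prediction-confidence idea above (stack the per-tree predictions and take the standard deviation instead of the mean), here is a minimal sketch. The per-tree predictions are invented numbers for four trees and three rows; with a real forest you would fill this table by calling each tree's predict method.

```python
from statistics import mean, pstdev

# Hypothetical predictions: one inner list per tree, one entry per row.
tree_preds = [
    [0.9, 0.2, 0.5],
    [0.8, 0.1, 0.9],
    [0.9, 0.2, 0.1],
    [0.8, 0.1, 0.6],
]

# Transpose so that each tuple holds every tree's prediction for one row.
per_row = list(zip(*tree_preds))

row_means = [mean(row) for row in per_row]       # the usual forest prediction
row_spreads = [pstdev(row) for row in per_row]   # disagreement between trees

# The row the trees disagree on most is the one to trust least.
least_confident_row = max(range(len(row_spreads)), key=lambda i: row_spreads[i])
```

Here the trees roughly agree on the first two rows and disagree widely on the third, so the third row gets the largest spread.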
A partial dependence plot is something you can do for basically any machine learning model. Let's first of all look at one and then talk about how we make it. So in this data set, we're looking at the sale price at auction of heavy industrial equipment like bulldozers. This is specifically the Blue Book for Bulldozers Kaggle competition. And the partial dependence plot between the year that the bulldozer or whatever was made and the price it was sold for (this is actually the log price) shows that it goes up. More recently made bulldozers are more expensive. And as you go back to older and older bulldozers, they're less and less expensive, to a point. And maybe these ones are some old classic bulldozers you pay a bit extra for. Now you might think that you could easily create this plot by simply looking at your data at each year and taking the average sale price. But that doesn't really work very well. I mean, it kind of does, but it kind of doesn't. Let me give you an example. It turns out that one of the biggest predictors of sale price for industrial equipment is whether it has air conditioning. And air conditioning is an expensive thing to add, and it makes the equipment more expensive to buy. And most things didn't have air conditioning back in the 60s and 70s, and most of them do now. So if you plot the relationship between year made and price, you're actually going to be seeing a whole bunch of how popular air conditioning was, right? So you get this cross-correlation going on. But we just want to know, what's just the impact of the year it was made, all else being equal? So there's actually a really easy way to do that, which is we take our data set, and we leave it exactly as it is, just using the training data set, but we take every single row and for the year made column, we set it to 1950.
And so then we predict for every row, what would the sale price of that have been if it was made in 1950? And then we repeat it for 1951, and then repeat it for 1952, and so forth. And then we plot the averages. And that does exactly what I just said. Remember, I said the special words, all else being equal? This is setting everything else equal. The everything else is the data as it actually occurred, and we're only varying year made. And that's what a partial dependence plot is. That works just as well for deep learning, or gradient boosted trees, or logistic regressions, or whatever. It's a really cool thing you can do. And you can do more than one column at a time; you can do two-way partial dependence plots, for example. Okay, so then another one I mentioned was, can you describe why a particular prediction was made? So how did you decide for this particular row to predict this particular value? And this is actually pretty easy to do. There's a thing called tree interpreter, but we could easily create this in about half a dozen lines of code. All we do is we're saying, okay, this customer's come in, they've asked for a loan, we've put all of their data through the random forest, and it's spat out a prediction. We can actually have a look and say, okay, well, in tree number one, what's the path that went down through the tree to get to the leaf node? And we can say, oh, well, first of all, it looked at sex, and then it looked at postcode, and then it looked at income. And so we can see exactly, in tree number one, which variables were used and what the change in Gini was for each one. And then we can do the same in tree two, tree three, tree four. Does this sound familiar? It's basically the same as our feature importance plot, right? But it's just for this one row of data. And so that will tell you basically the feature importances for that one particular prediction. And so then we can plot them like this.
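Going back to the partial dependence recipe described a moment ago (set the column to one value on every row, predict, average, repeat for the next value), here is a minimal model-agnostic sketch. The `predict` function is a hypothetical stand-in for any trained model, and the rows, column names, and coefficients are invented.

```python
def partial_dependence(predict, rows, column, grid):
    """Average prediction when `column` is forced to each value in `grid`."""
    averages = []
    for value in grid:
        # Copy every row, overwrite only the one column, leave the rest as observed.
        modified = [{**row, column: value} for row in rows]
        preds = [predict(row) for row in modified]
        averages.append(sum(preds) / len(preds))
    return averages

# Hypothetical "model": log price rises with year made and with air conditioning.
def predict(row):
    return 0.02 * (row["year_made"] - 1950) + (0.5 if row["has_aircon"] else 0.0)

# Invented rows where newer machines tend to have air conditioning.
rows = [
    {"year_made": 1965, "has_aircon": False},
    {"year_made": 1978, "has_aircon": False},
    {"year_made": 1999, "has_aircon": True},
    {"year_made": 2008, "has_aircon": True},
]

pd_year = partial_dependence(predict, rows, "year_made", [1950, 1980, 2010])
```

Because air conditioning is left as it actually occurred in every row, its contribution is the same constant at each grid point, and the curve shows only the effect of year made.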
So for example, this is an example of an auction price prediction. And according to this plot, we predicted that the net would be, oh, this is just the change, so I don't actually know what the price is, but this is how much each one impacted the price. So year made, I guess this must have been an older tractor, it caused the prediction of the price to go down. But then it must have been a larger machine: the product size caused it to go up. Coupler system made it go up. Model ID made it go up, and so forth, right? So you can see red says this made our prediction go down, green made our prediction go up. And so overall you can see which things had the biggest impact on the prediction and what the direction was for each one. So it's basically a feature importance plot, but just for a single row. Any questions, John? Yeah, there are a couple that have sort of queued up. This is a good spot to jump to them. So first of all, Andrew is asking, jumping back to the OOB error, would you ever exclude a tree from a forest if it had a bad out-of-bag error? I guess if you had a particularly bad tree in your ensemble, might you just drop it? Would you delete a tree that was not doing its thing, not playing its part? No, you wouldn't. If you start deleting trees, then you no longer have an unbiased prediction of the dependent variable. You are biasing it by making a choice. So even the bad ones will be improving the quality of the overall average. All right, thank you. Zaki followed up with the question about bagging, and we're just sort of going layers and layers here. We could go on and create ensembles of bagged models. And is it reasonable to assume that they would continue to improve? So that's not going to make much difference, right? You could take your 100 trees, split them into groups of 10, create 10 bagged ensembles, and then average those. But the average of an average is the same as the average.
You could like have a wider range of other kinds of models. You could have like neural nets trained on different subsets as well. But again, the average of an average will still give you the average. Right. So there's not a lot of value in kind of structuring the ensemble. I mean, some ensembles you can structure, but not bagging. Bagging is the simplest one. It's the one I mainly use. There are more sophisticated approaches, but this one is nice and easy. All right. And there's one that is a bit specific and it's referencing content you haven't covered, but we're here now. So, and it's on explainability. So feature importance of random forest models sometimes has different results when you compare to other explainability techniques like SHAP or LIME. And we haven't covered these in the course, but Amir is just curious if you've got any thoughts on which is more accurate or reliable, random forest feature importance or other techniques? I would lean towards more immediately trusting random forest feature importances over other techniques on the whole, on the basis that it's very hard to mess up a random forest. So yeah, I feel pretty confident that a random forest feature importance is going to be pretty reasonable, as long as this is the kind of data which a random forest is likely to be pretty good at doing. You know, if it's like a computer vision model, random forests aren't particularly good at that. And so one of the things that Breiman talked about a lot was explainability. And he's got a great essay called the two cultures of statistics, in which he talks about, I guess what we'd nowadays call kind of like data scientists, machine learning folks, versus classic statisticians. And he was, you know, definitely a data scientist well before the label existed. And he pointed out, yeah, you know, first and foremost, you need a model that's accurate. It needs to make good predictions.
A model that makes bad predictions will also be bad for making explanations because it doesn't actually know what's going on. So if you know, if you've got a deep learning model that's far more accurate than your random forest, then it, you know, explainability methods from the deep learning model will probably be more useful because it's explaining a model that's actually correct. All right, let's take a 10 minute break and we'll come back at five past seven. Welcome back. One person pointed out I noticed I got the chapter wrong. It's chapter nine, not chapter eight in the book. I guess I can't read. Somebody asked during the break about overfitting. Can you overfit a random forest? Basically no, not really. Adding more trees will make it more accurate. It kind of asymptotes, so you can't make it infinitely accurate by using infinite trees, but it certainly, you know, adding more trees won't make it worse. If you don't have enough trees and you let the trees grow very deep, that could overfit. So you just have to make sure you have enough trees. Radek told me about an experiment he did during, Radek told me during the break about an experiment he did, which is something I've done something similar, which is adding lots and lots of randomly generated columns to a data set and try to break the random forest. And if you try it, it basically doesn't work. It's like it's really hard to confuse a random forest by giving it lots of meaningless data. It does an amazingly good job of picking out the useful stuff. As I said, you know, I had 30 useful columns out of 7,000 and I found them perfectly well. And often, you know, when you find those 30 columns, you know, you could go to, you know, I was doing consulting at the time, go back to the client and say like, tell me more about these columns. And they'd say like, oh, well that one there, we've actually got a better version of that now. There's a new system, you know, we should grab that. 
And oh, this column, actually that was because of this thing that happened last year, but we don't do it anymore. Or, you know, like you can really have this kind of discussion about the stuff you've zoomed into. You know, there are other things that you have to think about with lots of kinds of models, like particularly regression models, things like interactions. You don't have to worry about that with random forests, because you split on one column and then split on another column, so you get interactions for free as well. Normalization, you don't have to worry about. You don't have to have normally distributed columns. So yeah, definitely worth a try. Now, something I haven't gone into is gradient boosting. But if you go to explained.ai, you'll see that my friend Terence and I have a three-part series about gradient boosting, including pictures of golf made by Terence. But to explain, gradient boosting is a lot like random forests, but rather than fitting a tree again and again and again on different random subsets of the data, instead what we do is we fit very, very, very small trees with hardly any splits. And we then say, okay, well, what's the error? So, you know, imagine the simplest tree would be our OneR rule tree of male versus female, say, and then you take what's called the residual. That's the difference between the prediction and the actual. And then you create another very small tree, which attempts to predict that. And then you create another very small tree, which tries to predict the error from that. And so forth. Each one is predicting the residual from all of the previous ones. And so then to calculate a prediction, rather than taking the average of all the trees, you take the sum of all the trees, because each one has predicted the difference between the actual and all of the previous trees. And that's called boosting versus bagging.
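The boosting loop just described (fit a very small tree, take the residual, fit the next small tree to that residual, and sum all the trees to predict) can be sketched from scratch. This is a minimal illustration on a made-up one-column data set, with hypothetical helper names `fit_stump`, `boost`, and `mse`. It is not how a production gradient boosting library is implemented; those typically also shrink each tree's contribution with a learning rate, which is left out here to match the description above.

```python
def fit_stump(xs, ys):
    """Fit a one-split tree: choose the threshold that minimises squared error."""
    best = None
    for t in sorted(set(xs))[1:]:  # each candidate leaves both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def boost(xs, ys, n_trees):
    """Fit each stump to the residual left by all of the previous stumps."""
    trees, residuals = [], list(ys)
    for _ in range(n_trees):
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        residuals = [r - tree(x) for x, r in zip(xs, residuals)]
    # The prediction is the sum of the trees, not the average.
    return lambda x: sum(tree(x) for tree in trees)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 2, 2, 3, 5, 6, 8, 9]

for n in (1, 2, 5):
    print(n, "trees, training MSE:", mse(boost(xs, ys, n), xs, ys))
```

The training error keeps falling as stumps are added, which is also why boosting can overfit if nothing stops it.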
So, boosting and bagging are two kind of meta-ensemble techniques. And when bagging is applied to trees, it's called a random forest. And when boosting is applied to trees, it's called a gradient boosting machine or gradient boosted decision tree. Gradient boosting is, generally speaking, more accurate than random forests. But you can absolutely overfit. And so therefore, it's not necessarily my first go-to thing. Having said that, there are ways to avoid overfitting. But yeah, because it's breakable, it's not my first choice. But yeah, check out our stuff here if you're interested. And there is stuff which largely automates the process. There's lots of hyperparameters you have to select. People generally just, you know, try every combination of hyperparameters. And in the end, you generally should be able to get a more accurate gradient boosting model than random forest. But not necessarily by much. Okay. So that was the Kaggle notebook on random forests, how random forests really work. So what we've been doing is having this daily walkthrough where me and, I don't know, 20 or 30 folks get together on a Zoom call and chat about, you know, getting through the course and setting up machines and stuff like that. And, you know, we've been trying to kind of practice things along the way. And so a couple of weeks ago, I wanted to show like, what does it look like to pick a Kaggle competition and just do the normal, sensible, kind of mechanical steps that you would do for any computer vision model. And so the competition I picked was Paddy Disease Classification, which is about recognizing rice diseases in rice paddies. And yeah, I spent, I don't know, a couple of hours or three, I can't remember, a few hours throwing together something. And I found that I was number one on the leaderboard. And I thought, oh, that's interesting. Like, because you never quite have a sense of how well these things work.
And then I thought, well, there's all these other things we should be doing as well. And I tried three more things. And each time I tried another thing, I got further ahead at the top of the leaderboard. So I thought it'd be cool to take you through the process. I'm going to do it reasonably quickly, because the walkthroughs are all available for you to see the entire thing in, you know, seven hours of detail or however long, it was probably six or seven hours of conversations. But I want to kind of take you through the basic process that I went through. So since I've been starting to do more stuff on Kaggle, you know, I realized there's some kind of menial steps I have to do each time, particularly because I like to run stuff on my own machine and then kind of upload it to Kaggle. So to make my life easier, I created a little module called fastkaggle, which you'll see in my notebooks from now on, which you can download from pip or conda. And as you'll see, it makes some things a bit easier. For example, downloading the data for the Paddy Disease Classification, you just run setup_comp and pass in the name of the competition. If you are on Kaggle, it will return a path to that competition data that's already on Kaggle. If you are not on Kaggle and you haven't downloaded it, it will download and unzip the data for you. If you're not on Kaggle and you have downloaded and unzipped the data, it will return a path to the one that you've already downloaded. Also, if you are on Kaggle, you can ask it to make sure that pip things are installed that might not be up to date otherwise. So this basically, one line of code now gets us all set up and ready to go. So this path, so I ran this particular one on my own machine, so it's downloaded and unzipped the data. I've also got links to the six walkthroughs so far, into the videos. So here's my result after these four attempts, plus a bit of fiddling around at the start.
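The three cases described for setup_comp (on Kaggle, not on Kaggle and not yet downloaded, not on Kaggle and already downloaded) boil down to a small piece of branching logic. The sketch below is not fastkaggle's implementation. `resolve_comp_path` and `fake_download` are hypothetical names, the `/kaggle/input/<competition>` location is an assumption about where Kaggle mounts competition data, and the download step is a stub so that the example runs without network access or credentials.

```python
import tempfile
from pathlib import Path

def resolve_comp_path(comp, on_kaggle, local_root, download):
    """Return the data directory for `comp`, fetching it only when needed."""
    if on_kaggle:
        # Assumed location of competition data inside a Kaggle notebook.
        return Path("/kaggle/input") / comp
    path = Path(local_root) / comp
    if not path.exists():
        download(comp, path)  # the caller supplies the actual download/unzip step
    return path

# Demo with a stub that only creates the directory and records the call.
download_calls = []

def fake_download(comp, path):
    download_calls.append(comp)
    path.mkdir(parents=True)

with tempfile.TemporaryDirectory() as root:
    first = resolve_comp_path("demo-comp", False, root, fake_download)
    second = resolve_comp_path("demo-comp", False, root, fake_download)
    print(first == second, download_calls)
```

The second call finds the directory already present and skips the download, which is the behaviour that makes it safe to leave the set-up line at the top of every notebook.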
So the overall approach is, well, this is not just for a Kaggle competition, right? The reason I like looking at Kaggle competitions is you can't hide from the truth in a Kaggle competition. You know, when you're working on some work project or something, you might be able to convince yourself and everybody around you that you've done a fantastic job of not overfitting and your model's better than what anybody else could have made or whatever else. But the brutal assessment of the private leaderboard will tell you the truth. Is your model actually predicting things correctly, and is it overfit? Until you've been through that process, you know, you're never going to know. And a lot of people don't go through that process because at some level they don't want to know. But it's okay, you know, nobody needs to know. You don't have to put your own name there. I always did, right from the very first one. I wanted, you know, if I was going to screw up royally, I wanted to have the pressure on myself of people seeing me in last place. But, you know, it's fine. You can do it all anonymously. And you'll actually find as you improve, you'll have so much self-confidence, you know. And the stuff we do in a Kaggle competition is indeed a subset of the things we need to do in real life. But it's an important subset, you know, building a model that actually predicts things correctly and doesn't overfit is important. And furthermore, structuring your code and analysis in such a way that you can keep improving over a three-month period without gradually getting into more and more of a tangled mess of impossible-to-understand code, and having no idea what untitled copy 13 was and why it was better than 25, right? This is all stuff you want to be practicing, ideally well away from customers or whatever, you know, before you've kind of figured things out. So the things I talk about here about doing things well in this Kaggle competition should work, you know, in other settings as well.
And so these are the two focuses that I recommend. Get a really good validation set together. We've talked about that before, right? And in a Kaggle competition, it's very rare to see people do well who don't have a good validation set. Sometimes that's easy. In this competition, actually, it is easy because the test set seems to be a random sample. But most of the time, it's not actually, I would say. And then how quickly can you iterate? How quickly can you try things and find out what worked? So obviously you need a good validation set, otherwise it's impossible to iterate. And so quickly iterating means not saying, what is the biggest, you know, OpenAI takes four months on 100 TPUs model that I can train. It's, what can I do that's going to train in a minute or so, and will quickly give me a sense of like, well, I could try this, I could try that, what thing's going to work. And then try, you know, 80 things. It also doesn't mean saying, like, oh, I heard about this amazing new Bayesian hyperparameter tuning approach, I'm going to spend three months implementing that, because that's going to like give you one thing. But to actually do well in these competitions or in machine learning in general, you actually have to do everything reasonably well. And doing just one thing really well will still put you somewhere about last place. I actually saw that a couple of years ago. An Aussie guy who's a very, very distinguished machine learning practitioner actually put together a team, entered a Kaggle competition and literally came in last place, because they spent the entire three months trying to build this amazing new fancy thing and never actually iterated. If you iterate, I guarantee you won't be in last place. Okay, so here's how we can grab our data with fastkaggle. And it tells us what path it's in. And then I set my random seed. And I only do this because I'm creating a notebook to share.
You know, when I share a notebook, I like to be able to say, as you can see, this is 0.83 blah, blah, blah, right? And know that when you see it, it'll be 0.83 as well. But when I'm doing stuff otherwise, I would never set a random seed. I want to be able to run things multiple times and see how much it changes each time. Because that'll give me a sense of like, are the modifications I'm making changing it because they're improving it or making it worse, or is it just random variation? So if you always set a random seed, that's a bad idea because you won't be able to see the random variation. So this is just here for presenting a notebook. Okay, so the data they've given us, as usual, they've got a sample submission, they've got some test set images, they've got some training set images, and a CSV file about the training set. And then these other two you can ignore because I created them. So let's grab a path to train images. And so do you remember get_image_files? So that gets us a list of the file names of all the images here, recursively. So we could just grab the first one and take a look. So it's 480 by 640. Now we've got to be careful. This is a Pillow image, Python Imaging Library image. In the imaging world, they generally say columns by rows. In the array slash tensor world, we always say rows by columns. So if you ask PyTorch what the size of this is, it'll say 640 by 480. And I guarantee at some point this is going to bite you. So try to recognize it now. Okay, so they're kind of taller than they are, at least this one is taller than it is wide. So I'd actually like to know if they're all this size, because it's really helpful if they are all the same size or at least similar. Believe it or not, the amount of time it takes to decode a JPEG is actually quite significant. And so figuring out what size these things are is actually going to be pretty slow. But my fastcore library has a parallel submodule, which can basically do anything that you can do in Python.
It can do it in parallel. So in this case, we wanted to create a Pillow image and get its size. So if we create a function that does that and pass it to parallel, passing in the function and the list of files, it does it in parallel. And that actually runs pretty fast. And so here is the answer. I don't know how this happened. 10,403 images are indeed 480 by 640 and four of them aren't. So basically what this says to me is that we should pre-process them or, you know, at some point process them so that they're probably all 480 by 640 or all basically the kind of same size. We'll pretend they're all this size. But we can't not do some initial resizing, otherwise this is going to screw things up. So probably the easiest way to do things, the most common way to do things, is to either squish or crop every image to be square. So squishing is when you just, in this case, squish the aspect ratio down, as opposed to cropping randomly a section out. So if we call Resize with squish, it will squish it down. And so this is 480 by 480 square. So this is what it's going to do to all of the images first on the CPU. That allows them to be all batched together into a single mini-batch. Everything in a mini-batch has to be the same shape, otherwise the GPU won't like it. And then that mini-batch is put through data augmentation, and it will grab a random subset of the image and make it 128 by 128 pixels. And here's what that looks like. Here's our data. So show_batch works for pretty much everything, not just in the fastai library, but even for things like fastaudio, which are kind of community-based things. You should be able to use show_batch on anything and see or hear or whatever what your data looks like. I don't know anything about rice disease, but apparently these are various rice diseases and this is what they look like. So I jump into creating models much more quickly than most people because I find models are a great way to understand my data, as we've seen before.
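The size check described above (run a small function over every file in parallel, then count the distinct results) can be sketched with the standard library alone. This is a stand-in for the approach rather than the lecture's code: it uses `concurrent.futures` instead of fastcore's parallel helper, and `FAKE_IMAGES` and `image_size` are hypothetical placeholders for real files and for opening each file with Pillow to read its size.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for image files on disk: name -> (width, height).
FAKE_IMAGES = {f"img_{i}.jpg": (480, 640) for i in range(20)}
FAKE_IMAGES["odd_0.jpg"] = (640, 480)

def image_size(name):
    # In the real workflow this would open the file and return the image size,
    # which is the slow part because the JPEG has to be decoded.
    return FAKE_IMAGES[name]

def count_sizes(names, workers=4):
    """Apply image_size to every name in parallel and tally the distinct sizes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = list(pool.map(image_size, names))
    return Counter(sizes)

size_counts = count_sizes(list(FAKE_IMAGES))
print(size_counts)
```

A tally like this makes the odd ones out obvious straight away, which is what prompts the decision to resize everything up front.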
So I basically build a model as soon as I can. And I want to create a model that's going to let me iterate quickly. So that means that I'm going to need a model that can train quickly. So Thomas Capelle and I recently did this big project, the best vision models for fine-tuning, where we looked at nearly 100 different architectures from Ross Wightman's timm library, PyTorch Image Models, and looked at which ones could we fine-tune, which ones had the best transfer learning results. And we tried two different data sets, very different data sets. One is the Pets data set that we've seen before, so trying to predict what breed of pet it is from 37 different breeds. And the other was a satellite imagery data set called Planet. So very, very different data sets in terms of what they contain and also very different sizes. The Planet one's a lot smaller, the Pets one's a lot bigger. And so the main things we measured were how much memory did it use? How accurate was it? And how long did it take to fit? And then I created this score, which combines the fit time and error rate together. And so this is a really useful table for picking a model. And now in this case, I want to pick something that's really fast. And there's one clear winner on speed, which is resnet26d. And so its error rate was 6% versus the best was like 4.1%. So okay, it's not amazingly accurate, but it's still pretty good. And it's going to be really fast. So that's why I picked resnet26d. A lot of people think that when they do deep learning, they're going to spend all of their time learning about exactly how a resnet26d is made and convolutions and ResNet blocks and transformers and blah, blah, blah. We will cover all that stuff in part two and a little bit of it next week. But it almost never matters. It's just a function. And what matters is the inputs to it and the outputs to it and how fast it is and how accurate it is.
So let's create a learner with a resnet26d from our data loaders. And let's run lr_find. So lr_find will put through one mini-batch at a time, starting at a very, very, very low learning rate, and gradually increase the learning rate and track the loss. And initially the loss won't improve because the learning rate is so small it doesn't really do anything. And at some point the learning rate is high enough that the loss will start coming down. Then at some other point the learning rate is so high that it's going to start jumping past the answer and it's going to get worse. And so somewhere around here is a learning rate we'd want to pick. We've got a couple of different ways of making suggestions. I generally ignore them because these suggestions are specifically designed to be conservative. They're a bit lower than perhaps the optimal, in order to make sure we don't recommend something that totally screws up. But I kind of like to say, well, how far right can I go and still see it clearly, really improving quickly? And so I'd pick somewhere around 0.01 for this. So I can now fine-tune our model with a learning rate of 0.01, three epochs. So look, the whole thing took a minute. That's what we want, right? We want to be able to iterate rapidly, just a minute or so. So that's enough time for me to go and, you know, grab a glass of water or do some reading. Like I'm not going to get too distracted. And what do we do before we submit? Nothing. We submit as soon as we can. Okay, let's get our submission in. So we've got a model. Let's get it in. So we read in our CSV file of the sample submission. And so the CSV file basically looks like we're going to have to have a list of the image file names in order, and then a column of labels. So we can get all the image files in the test image folder like so, and we can sort them.
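The behaviour of the learning rate finder described above (loss barely moves at tiny learning rates, then drops, then blows up) can be reproduced on a toy problem. This is only a sketch of the idea, not fastai's implementation: `lr_range_test` is a hypothetical name, the "model" is a single weight `w` with loss `w * w`, and there are no mini-batches or loss smoothing.

```python
def lr_range_test(start_lr=1e-4, end_lr=10.0, steps=100):
    """Take one gradient step per learning rate, increasing the rate exponentially."""
    w = 1.0
    lrs, losses = [], []
    for i in range(steps):
        lr = start_lr * (end_lr / start_lr) ** (i / (steps - 1))
        lrs.append(lr)
        losses.append(w * w)   # loss f(w) = w^2, recorded before the step
        w -= lr * 2 * w        # gradient of w^2 is 2w
    return lrs, losses

lrs, losses = lr_range_test()
best_step = min(range(len(losses)), key=losses.__getitem__)

print("loss at start:      ", losses[0])
print("lowest loss:        ", losses[best_step], "at lr", lrs[best_step])
print("loss at the end:    ", losses[-1])
```

In this toy the lowest loss sits right at the edge of where the steps start overshooting, which is one reason to pick a learning rate somewhat to the left of the minimum, where the loss is still clearly falling.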
And so now what we want is a data loader which is exactly like the data loader we used to train the model, except pointing at the test set. We want to use exactly the same transformations. So there's actually a dls.test_dl method, which does that. You just pass in the new set of items, so the test set files. So this is a data loader which we can use for our test set. A test data loader has a key difference to a normal data loader, which is that it does not have any labels. So that's a key distinction. So we can get the predictions for our learner, passing in that data loader. And in the case of a classification problem, you can also ask for them to be decoded. Decoded means rather than just getting returned the probability of every rice disease, of every class, it'll tell you what is the index of the most probable rice disease. That's what decoded means. So that'll return probabilities, targets, which obviously will be empty because it's a test set, so throw them away, and those decoded indexes, which look like this, numbers from nought to nine, because there's 10 possible rice diseases. The Kaggle submission does not expect numbers from nought to nine. It expects to see strings like these. So what do those numbers from nought to nine represent? We can look up our vocab to get a list. So that's zero, that's one, et cetera, that's nine. So I realized later, this is a slightly inefficient way to do it, but it does the job. I need to be able to map these to strings. So if I enumerate the vocab, that gives me pairs of numbers, zero, bacterial leaf blight, one, bacterial leaf streak, et cetera. I could then create a dictionary out of that. And then I can use pandas to look up each thing in a dictionary. They call that map. If you're a pandas user, you've probably seen map used before being passed a function, which is really, really slow. But if you pass map a dict, it's actually really, really fast. Do it this way if you can. So here's our predictions.
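The index-to-label step described above (enumerate the vocab, turn the pairs into a dictionary, look each index up) can be shown without any libraries. The vocab below is shortened to three entries for illustration and the decoded indexes are made up; in the notebook the lookup is done on a pandas Series with map and a dict, which performs the same dictionary lookup for every element.

```python
# Shortened, illustrative vocab; the real one has 10 classes.
vocab = ["bacterial_leaf_blight", "bacterial_leaf_streak", "normal"]

# Hypothetical decoded predictions: one class index per test image.
decoded = [2, 0, 1, 0]

# enumerate gives (index, label) pairs; dict turns them into a lookup table.
mapping = dict(enumerate(vocab))
labels = [mapping[i] for i in decoded]

print(mapping)
print(labels)
```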
So we've got our sample submission file, ss. So if we replace this column, label, with our predictions, like so, then we can turn that into a CSV. And remember, this means run a bash command, a shell command. Head is the first few rows. Let's just take a look. That looks reasonable. So we can now submit that to Kaggle. Now, iterating rapidly means everything needs to be fast and easy. Things that are slow and hard don't just take up your time, but they take up your mental energy. So even submitting to Kaggle needs to be fast. So I put it into a cell. So I can just run this cell. api.competition_submit, this CSV file, give it a description. So just run the cell and it submits to Kaggle. And as you can see, it says, here we go, successfully submitted. So that submission was terrible. Top 80%, also known as bottom 20%, which is not too surprising, right? I mean, it's one minute of training time. But it's something that we can start with. And that would be like, however long it takes to get to this point where you've put in a submission, now you've really started, right? Because then tomorrow, you can try to make a slightly better one. So I like to share my notebooks. And so even sharing the notebook, I've automated. So part of fastkaggle is you can use this thing called push_notebook. And that sends it off to Kaggle to create a notebook on Kaggle. There it is. And there's my score. As you can see, it's exactly the same thing. Why would you create public notebooks on Kaggle? Well, it's the same brutality of feedback that you get for entering a competition. But this time, rather than finding out in no uncertain terms whether you can predict things accurately, this time you can find out in no uncertain terms whether you can communicate things in a way that people find interesting and useful. And if you get zero votes, you know, so be it, right? That's something to know.
And then, you know, ideally go and ask some friends like, what do you think I could do to improve? And if they say, oh, nothing, it's fantastic, you can tell them, no, that's not true. I didn't get any votes. I'm going to try again. This isn't good. How do I make it better? You know, and you can try and improve. Because if you can create models that predict things well, and you can communicate your results in a way that is clear and compelling, you're a pretty good data scientist, you know, like they're two pretty important things. And so here's a great way to test yourself out on those things and improve. Yes, John. Yes, Jeremy, we have a sort of, I think, a timely question here from Zaki about your iterative approach. And they're asking, do you create different Kaggle notebooks for each model that you try? So one Kaggle notebook for the first one, then separate notebooks subsequently, or do you append to the bottom of a single notebook? What's your strategy? That's a great question. And I know Zaki is going through the daily walkthroughs but isn't quite caught up yet. So I will say keep it up, because in the six hours of going through this, you'll see me create all the notebooks. But if I go to the actual directory I used, you can see them. So basically, yeah, I started with, you know, what you just saw, a bit messier, without the prose, but that same basic thing. I then duplicated it to create the next one, which is here. And because I duplicated it, you know, this stuff which I still need, it's still there, right? And so I run it. And I don't always know what I'm doing, you know, and so at first, if I don't really know what I'm doing next, when I duplicate it, it will be called, you know, first steps in the road to the top part one dash copy one, you know, and that's okay. And as soon as I can, I'll try to rename that once I know what I'm doing, you know, or if it doesn't seem to go anywhere, I'll rename it into something like, you know, experiment, blah, blah, blah.
And I put some notes at the bottom and I might put it into a folder or something. But yeah, it's a very low-tech approach that I find works really well, which is just duplicating notebooks and editing them and naming them carefully and putting them in order and, you know, putting the file name in when you submit as well. And then of course, also if you've got things in git, you know, you can have a link to the git commit so you'll know exactly what it is. Generally speaking for me, you know, my notebooks will only have one submission in. And then I'll move on and create a new notebook. So I don't really worry about versioning so much. But you can do that as well, if that helps you. Yeah, so that's basically what I do. And I've worked with a lot of people who use much more sophisticated and complex processes and tools and stuff. But none of them seem to be able to stay as well organized as I am. I think they kind of get a bit lost in their tools sometimes. And file systems and file names, I think, are good. Great, thanks. So away from that kind of dev process, more towards the specifics of, you know, finding the best model and all that sort of stuff, we've got a couple of questions that are in the same space, which is, you know, we've got some people here talking about AutoML frameworks, which you might want to, you know, touch on for people who haven't heard of those. If you've got any particular AutoML frameworks you think are worth recommending, or just more generally, how do you go about trying different models, random forest, gradient boosting, neural network, et cetera? So in that space, if you could comment. Sure. I use AutoML less than anybody I know, I would guess. Which is to say, never. Hyperparameter optimization, never. And the reason why is I like being highly intentional. You know, I like to think more like a scientist and have hypotheses and test them carefully and come up with conclusions, which then I implement, you know.
So for example, in this best vision models for fine-tuning, I didn't try a huge grid search of every possible model, every possible learning rate, every possible preprocessing approach, blah, blah, blah. Instead, step one was to find out, well, which things matter, right? So, for example, does whether we squish or crop make a difference? You know, are some models better with squish and some models better with crop? And so we just tested that for, again, not every possible architecture, but for one or two versions of each of the main families. That took 20 minutes. And the answer was no, in every single case, the same thing was better. So we don't need to do a grid search over that anymore, you know. Or another classic one is like learning rates. Most people do a kind of grid search over learning rates, or they'll train a thousand models, you know, with different learning rates. But this fantastic researcher named Leslie Smith invented the learning rate finder a few years ago. We implemented it, I think, within days of it first coming out as a technical report. And that's what I've used ever since, because it works well and runs in a minute or so. Yeah, I mean, then like neural nets versus GBMs versus random forests, I mean, that shouldn't be too much of a question on the whole. Like they have pretty clear places that they go. Like if I'm doing computer vision, I'm obviously going to use a computer vision deep learning model. And which one I would use, well, if I'm transfer learning, which hopefully is always, I would look up the two tables here. This is my table for Pets, which is which are the best at fine-tuning to very similar things to what they're pre-trained on. And then the same thing for Planet is which ones are best for fine-tuning to data sets that are very different to what they're trained on. And as it happens, in both cases they're very similar. In particular, convnext is right up towards the top in both cases.
So I just like to have these rules of thumb. And yeah, my rule of thumb for tabular is random forests going to be the fastest, easiest way to get a pretty good result. GBMs probably going to give me a slightly better result if I need it and can be bothered fussing around. GBM, I would probably, yeah, actually, I probably would run a hyper parameter sweep, because it is fitly and and it's fast. So you may as well. So, yeah, so, you know, we were able to make a slightly better submission, slightly better model. And so I had a couple of thoughts about this. The first thing was that thing trained in a minute on my home computer. And then when I uploaded it to Kaggle, it took about four minutes per epoch, which was horrifying. And Kaggle's GPUs are not amazing, but they're not that bad. So I knew something was up. And what was up is I realized that they only have two virtual CPUs, which nowadays is tiny. Like, you know, you generally want as a rule of thumb about eight physical CPUs per GPU. And so it's spending all of its time just reading the damn data. Now, the data was 640 by 480, and we were ending up with any 128 pixel size bits for speed. So there's no point doing that every epoch. So step one was to make my Kaggle iteration faster as well. And so very simple thing to do, resize the images. So fast AI has a function called resize images. And you say, okay, take all the train images and stick them in the destination, making them this size recursively. And it will recreate the same folder structure over here. And so that's why I called this the training path, because this is now my training data. And so when I then trained on that on Kaggle, it went down to four times faster with no loss of accuracy. So that was kind of step one was to actually get my fast iteration working. Now, still a minute's a long time. And on Kaggle, you can actually see there's a little graph showing how much the CPU is being used, how much the GPU is being used on your own home machine. 
You can, there are tools, you know, free tools to do the same thing. I saw that the GPU was still hardly being used. So the CPU was still being driven pretty hard. I wanted to use a better model anyway to move up the leaderboard. So I moved from a, oh, by the way, this graph is very useful. So this is speed versus error rate by family. And so we're about to be looking at these ConvNeXt models. So we're going to be looking at this one, ConvNeXt Tiny. Here it is, ConvNeXt Tiny. So we were looking at ResNet26d, which took this long on this data set. But this one here is nearly the best. It's third best, but it's still very fast. And so it's the best overall. So let's use this, particularly because, you know, we're still spending all of our time waiting for the CPU anyway. So it turned out that when I switched my architecture to ConvNeXt, it basically ran just as fast on Kaggle. So we can then train that. Let me switch to the Kaggle version, because my outputs are missing for some reason. So, yeah, so I started out by running the ResNet26d on the resized images and got a similar error rate, but I ran a few more epochs and got a 12% error rate. And so then I do exactly the same thing, but with ConvNeXt Small, and get a 4.5% error rate. So don't think that different architectures are just tiny little differences. This is over twice as good. And a lot of folks you talk to will never have heard of ConvNeXt, because it's very new. And I've noticed a lot of people tend not to keep up to date with new things. They kind of learn something at university and then they stop learning. So if somebody's still just using ResNets all the time, you know, you can tell them we've actually moved on, you know. ResNets are still probably the fastest, but for the mix of speed and performance, you know, not so much. ConvNeXt, you know, again, you want these rules of thumb, right? If you're not sure what to use, try ConvNeXt, okay?
And then like most things, there's different sizes. There's a tiny, there's a small, there's a base, there's a large, there's an extra large. And, you know, it's just, well, let's look at the picture. This is it here, right? Large takes longer, but has lower error; tiny takes less time, but has higher error, right? So you pick the speed versus accuracy trade-off that's right for you. So for us, small is great. And so, yeah, now we've got a 4.5% error. That's terrific. Now let's iterate. On Kaggle, this is taking about a minute per epoch. On my computer, it's probably taking about 20 seconds per epoch. So not too bad. So, you know, one thing we could try is, instead of using squish as our pre-processing, let's try using crop. So that will randomly crop out an area. And that's the default. So if I remove the method equals squish, that will crop. So you see how I've tried to get everything into a single function, right? The single function, I can tell it (let's go find the definition): What architecture do I want to train? How do I want to transform the items? How do I want to transform the batches? And how many epochs do I want to do? That's basically it, right? So this time, I want to use the same architecture. Next, I want to resize without squishing, and then use the same data augmentation. And okay, the error rate's about the same. So not particularly, it's a tiny bit worse, but not enough to be interesting. Instead of cropping, we can pad. Now, padding's interesting. Do you see how these are all square? Right? But they've got black borders. So padding's interesting because it's the only way of pre-processing images which doesn't distort them and doesn't lose anything. If you crop, you lose things. If you squish, you distort things. This does neither. Now, of course, the downside is that there's pixels that are literally pointless. They contain zeros.
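The trade-offs between squish, crop, and pad described above can be put into numbers. The following is a minimal sketch of the geometry only, assuming a plain (non-random) crop and a square target; the function name `resize_tradeoffs` and its dictionary keys are hypothetical and are not part of fastai.

```python
def resize_tradeoffs(width, height, size):
    """Quantify what each resize method costs when fitting an image into size x size.

    Hypothetical helper for illustration; assumes a plain crop, not a random one.
    """
    short, long_side = min(width, height), max(width, height)
    # Pad: scale the long side to `size`; the short side ends up smaller than `size`.
    padded_short = round(size * short / long_side)
    return {
        # Squish: the long side is compressed this many times more than the short side.
        "squish_stretch": long_side / short,
        # Crop: scale the short side to `size`, then cut off this fraction of the long side.
        "crop_fraction_lost": 1 - short / long_side,
        # Pad: real image content occupies (long side, short side) pixels of the square.
        "pad_content_size": (size, padded_short),
        # Pad: fraction of the square that is filled with zeros.
        "pad_fraction_zeros": 1 - padded_short / size,
    }


report = resize_tradeoffs(640, 480, 128)
for name, value in report.items():
    print(name, value)
```

For a 640 by 480 image, crop throws away a quarter of the picture, pad spends a quarter of the pixels on zeros, and squish keeps everything but changes the proportions by a factor of about 1.33.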
So every way of getting this working has its compromises, but this approach of resizing where we pad with zeros is not used enough, and it can actually often work quite well. And in this case, it was about as good as our best so far. But no, not huge differences yet. What else could we do? Well, what we could do is, see these pictures? This is all the same picture, but it's gone through our data augmentation. So sometimes it's a bit darker. Sometimes it's flipped horizontally. Sometimes it's slightly rotated. Sometimes it's slightly warped. Sometimes it's zooming into a slightly different section. But this is all the same picture. Maybe our model would like some of these versions better than others. So what we can do is we can pass all of these to our model, get predictions for all of them, and take the average. Right? So it's our own kind of little mini bagging approach. And this is called test time augmentation. fastai is very unusual in making that available in a single method. You just call TTA, and it will pass multiple augmented versions of the image and average them for you. And so this is the same model as before, which had a 4.5% error rate. So instead, if we get TTA predictions and then get the error rate, wait, why does this say 4.8? Last time I did this, it was way better. Well, that's messing things up, isn't it? So when I did this originally on my home computer, it went from like 4.5 to 3.9. So possibly I got very bad luck this time. So this is the first time I've actually ever seen TTA give a worse result. So that's very weird. I wonder if it's... if I should do something other than the crop padding. All right, I'll have to check that out, and I'll try and come back to you and find out why, in this case, this one was worse. Anyway, take my word for it: every other time I've tried it, TTA has been better. So then, you know, now that we've got a pretty good way of resizing, we've got TTA, we've got a good training process, let's just make bigger images.
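The averaging step behind test time augmentation can be sketched in a few lines. This is not fastai's implementation; the probabilities below are made up for illustration, and the helper names `average_predictions` and `argmax` are hypothetical. Each inner list stands for the model's class probabilities on one augmented view of the same image.

```python
def average_predictions(per_view_probs):
    """Average class probabilities over several augmented views of one image."""
    n_views = len(per_view_probs)
    n_classes = len(per_view_probs[0])
    return [
        sum(view[c] for view in per_view_probs) / n_views
        for c in range(n_classes)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up probabilities for a 2-class problem (illustrative only).
views = [
    [0.55, 0.45],  # un-augmented image: narrowly prefers class 0
    [0.30, 0.70],  # flipped
    [0.40, 0.60],  # slightly rotated
    [0.35, 0.65],  # slightly zoomed
]

averaged = average_predictions(views)
single_view_class = argmax(views[0])
tta_class = argmax(averaged)
print(averaged, single_view_class, tta_class)
```

Here a single view narrowly picks one class, while the average over views picks the other. That is the sense in which TTA behaves like a small bagging ensemble over augmentations.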
And something that's really interesting, and a lot of people don't realize, is your images don't have to be square. They just all have to be the same size. And given that nearly all of our images are 640 by 480, we can just pick, you know, that aspect ratio. So for example 256 by 192, and we'll resize everything to the same rectangular aspect ratio. And that should work even better still. So if we do that, we'll do 12 epochs. Okay, now our error rate's down to 2.2%. And then we'll do TTA. Okay, this time you can see it's actually improving, down to under 2%. So that's pretty cool, right? At the start of this notebook, our error rate was 12%. And by the time we've got through our little experiments, we're down to under 2%. And nothing about this is in any way specific to rice or this competition, you know. This is a very mechanistic, you know, standardized approach, which you can use for certainly any kind of this type of computer vision competition, and almost any computer vision data set. But, you know, it would look very similar for a collaborative filtering model or a tabular model, an NLP model, whatever. So of course, again, I want to submit as soon as I can. So just copy and paste the exact same steps I took last time, basically, for creating a submission. So as I said, last time we did it using pandas, but there's actually an easier way. So the step where here I've got the numbers from 0 to 9, which is like, which rice disease is it? So here's a cute idea. We can take our vocab and make it an array. So that's going to be a list of 10 things. And then we can index into that vocab with our indices, which is kind of weird. This is a list of 10 things. This is a list of, I don't know, four or five thousand things. So this will give me four or five thousand results, which is the vocab item for each of those things. So this is another way of doing the same mapping.
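The "index into the vocab with our indices" trick above is NumPy integer array indexing. Here is a minimal sketch with a shortened, illustrative vocab and made-up predictions; the label names and index values are assumptions, and the real vocab has 10 entries.

```python
import numpy as np

# Shortened, illustrative vocab (the real one has 10 disease labels).
vocab = np.array(["blast", "brown_spot", "hispa", "normal"])

# Made-up predicted class indices, one per test image.
idxs = np.array([3, 0, 0, 2, 1, 3])

# Indexing an array with an array of integers returns one vocab item per index.
labels = vocab[idxs]

# The same mapping written as an explicit Python loop, for comparison.
labels_loop = [vocab[i] for i in idxs]

print(labels)
```

The result has the same length as `idxs`, not the same length as `vocab`, which is why a 10-item vocab indexed by several thousand predictions yields several thousand labels.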
And I would spend time playing with this code to understand what it does, because it's the kind of thing that's very fast, you know, not just in terms of writing, but this would optimize, you know, on the CPU very, very well. So this is the kind of coding you want to get used to, this kind of indexing. Anyway, so then we can submit it just like last time. And when I did that, I got in the top 25%. And that's where you want to be, right? Like, generally speaking, I find in Kaggle competitions, the top 25% is like, you're kind of at a solid, competent level. You know, that's not to say it's easy. You've got to know what you're doing. But if you get in the top 25%, I think you can really feel like, yeah, this is a, you know, very reasonable attempt. And so I think this is a very reasonable attempt. Okay. Before we wrap up, John, any last questions? Yeah, there's two, I think, that would be good if we could touch on quickly before you wrap up. One from Victor asking about TTA. When I use TTA during my training process, do I need to do something special during inference? Or is this something you use only during validation? Okay. So just to explain, TTA means test time augmentation. So specifically, it means inference. So I think you mean augmentation during training. So yeah, so during training, you basically always do augmentation, which means you're varying each image slightly so that the model never sees the exact same image twice, and so it can't memorize it. In fastai, and as I say, I don't think anybody else does this as far as I know, if you call TTA, it will use the exact same augmentation approach on whatever data set you pass it, multiple times on the same image, and will average out the predictions. So you don't have to do anything different. But if you didn't have any data augmentation in training, you can't use TTA.
It uses, by default, the same data augmentation you used for training. Great. Thank you. And the other one is about how, you know, when you first started this example, you squared the images, and you talked about squashing versus cropping versus clipping and scaling and so on. But then you went on to say that these models can actually take rectangular inputs. So there's a question that's kind of probing at that. If the models can take rectangular inputs, why would you ever even care, as long as they're all the same size? So I find most of the time data sets tend to have a wide variety of input sizes and aspect ratios. So if there's just as many tall skinny ones as wide short ones, you know, it doesn't make sense to create a rectangle, because some of them you're going to really destroy. So a square is the kind of best compromise in some ways. There are better things we can do, which we don't have any off-the-shelf library support for yet. And I don't know that anybody else has even published about this, but we experimented with kind of trying to batch things that are similar aspect ratios together and use the kind of median rectangle for those, and have had some good results with that. But honestly, 99.999 percent of people, given a wide variety of aspect ratios, chuck everything into a square. A follow up, this is my own interest. Have you ever looked at, you know, so the issue with padding, as you say, is that you're putting, you know, black pixels there. Those are not NaNs. Those are black pixels. That's right. And so there's something problematic to me, you know, conceptually about that. You know, when you see, for example, 4:3 aspect ratio footage presented for broadcast on 16:9, you get the kind of blurred stretch, that kind of stuff. No, we played with that a lot. Yeah, I used to be really into it actually.
And fastai still by default uses reflection padding, which means if this is, I don't know, let's say this is a 20 pixel wide thing, it takes the 20 pixels next to it and flips it over and sticks it here. And it looks pretty good, you know. Another one is copy, which simply takes the outside pixel, and it's a bit more like TV. You know, much to my chagrin, it turns out none of them really help. You know, if anything, they make it worse. Because in the end, the computer wants to know: no, this is the end of the image. There's nothing else here. And if you reflect it, for example, then you're kind of creating weird spikes that didn't exist. And the computer's going to be like, oh, I wonder what that spike is. So yeah, it's a great question. And I obviously spent like a couple of years assuming that we should be doing things that look more image-like, but actually, the computer likes things to be presented to it in as straightforward a way as possible. All right. Thanks, everybody. And I hope to see some of you in the walkthroughs. And otherwise, see you next time.
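The three padding styles mentioned in this answer (zeros, copy of the outside pixel, reflection) can be compared on a single row of pixels. This is a minimal sketch, not fastai's code. The function name `pad_row` and the mode names are hypothetical, and the reflection here mirrors around the edge pixel without repeating it, which is one common convention and an assumption about the exact behavior.

```python
def pad_row(row, n, mode):
    """Pad a 1-D row of pixel values with n extra pixels on each side.

    Hypothetical helper; `mode` is one of "zeros", "border", "reflection".
    """
    if mode == "zeros":
        left, right = [0] * n, [0] * n
    elif mode == "border":
        # Copy the outermost pixel outward.
        left, right = [row[0]] * n, [row[-1]] * n
    elif mode == "reflection":
        if n >= len(row):
            raise ValueError("reflection needs n < len(row)")
        # Mirror the pixels next to each edge, not repeating the edge pixel itself.
        left = row[n:0:-1]
        right = row[-2:-n - 2:-1]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return left + row + right


row = [1, 2, 3, 4]
zeros = pad_row(row, 2, "zeros")
border = pad_row(row, 2, "border")
reflection = pad_row(row, 2, "reflection")
print(zeros, border, reflection, sep="\n")
```

With zeros, the boundary of the real image is unambiguous. With reflection, the padded region contains plausible-looking structure that was never in the picture, which is the "weird spikes" concern raised above.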