Cool, we're good. So that's the only slide I have. Right, so thank you for coming. We'll do it really quickly, just a quick introduction. I'm Kevin, I work at a company called Cambridge Spark. We do training in data science for individuals and for companies on site. That was it. Now, the notebook. The link that you have here is available as well on the event page, so if you can't read it here, you can just copy-paste it from there. So today we're going to talk about ensemble models. First, quickly, why do we care about ensemble models? Well, they're really popular. They've been used a lot, and they're still used a lot in industry. They provide pretty good performance, and they're quite convenient in the sense that they don't require as much data as deep learning in general, so they can be a good alternative. In fact, they're used a lot in machine learning competitions as well, on Kaggle for example. The goal of this talk is to try to build a bridge between the theory and the implementation of those models in Python, so it's going to be very much applied. Quickly, the agenda: I'm going to start with an intuition of why ensemble models work, then discuss the main building block of the ensemble, which is usually a decision tree, then talk about two techniques, bagging and boosting, and finally talk about other libraries that you can use that are a bit more advanced. Right, so first, quickly, a definition of what we mean by an ensemble model: it's simply combining multiple simple models, which we will call weak learners, into a larger one that will be our ensemble. There are two popular techniques that you can use, bagging and boosting, and we'll discuss both of them here. As a core building block we use a weak learner; here it's going to be a decision tree, and it's usually a decision tree. Right, so quickly, just some intuition. 
So let's say you want to know if you have a given disease, which here I called A. You go to see three doctors, and all of them tell you that you have that disease. Let's say you also have access to their files of past diagnoses, so you can calculate their accuracy on previous patients, and all of them had 75% accuracy. So the question here is: do you think the probability of you having the disease, given that all the doctors told you that you have it, is higher than 75%, or equal to 75%? In the notebook that you have on GitHub, I've actually simulated that scenario, so you can play with it and run it if you're interested. Here I really don't have time to do it, so I'm just going to tell you the answer; there's not much suspense here. It's actually higher. So by combining all those predictions together, we get a higher probability. The reason, quite simply, is that since each doctor is more likely to be right than wrong, the probability of all of them being wrong at the same time is really low. So you get better accuracy by doing that. There are two main assumptions for this to work. The first one is that the models, our doctors here, have to make their predictions independently. If all the doctors made their predictions the exact same way, then consulting three doctors would give you exactly the same thing as consulting just one. The other assumption is that your models, or your doctors here, need to have an accuracy higher than 50%; you need them to be more right than wrong. The reason is that if you combine several diagnoses from people who are more wrong than right, you will just end up with something really wrong. So, really quickly, let me introduce the data set that we're going to use here. It's data from Facebook: different posts that were posted on pages. 
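The doctor scenario can be sketched in a few lines of Python. This is a minimal simulation of the idea, not the notebook's actual code; it assumes a balanced 50/50 prior on having the disease and three independent doctors who are each right 75% of the time:

```python
import random

random.seed(42)

N = 100_000      # simulated patients
ACCURACY = 0.75  # each doctor's individual accuracy

agree_positive = 0  # cases where all three doctors say "you have it"
correct = 0         # of those, cases where the patient really has it

for _ in range(N):
    has_disease = random.random() < 0.5  # balanced prior
    # each doctor independently reports the truth with probability 0.75
    diagnoses = [has_disease if random.random() < ACCURACY else not has_disease
                 for _ in range(3)]
    if all(diagnoses):  # all three say "yes"
        agree_positive += 1
        correct += has_disease

posterior = correct / agree_positive
print(posterior)  # well above 0.75
```

Analytically this matches Bayes' rule: with a 0.5 prior, P(disease | three yeses) = 0.75³ / (0.75³ + 0.25³) ≈ 0.96, which is exactly why three agreeing doctors beat any one of them.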
We have features such as the category of the page, the number of comments on the page, and so on and so forth. You can download the data set at this link. The last column in the data set is the target, the thing we are trying to predict: the number of comments the post will receive in the next hour. That's a regression problem. Here we want to stick to classification, in order to stay with the doctor analogy from above, so we'll define the problem as: will this post be commented on in the next hour or not? So basically, will we have zero comments, or more than zero? Let's just load the data really quickly, because we don't have that much time; I have 10 minutes left. All right. So we define the target really quickly here; that's just what I said: if we have more than zero comments, the label is true, otherwise false. The data is quite balanced: pretty much half of the data set has comments, and the other half doesn't have any. And let's create a training set and a test set. Right. So the first building block here is the decision tree. Quickly, why are we using decision trees to build our ensembles? Well, they match the two conditions that we stated above. The first one is that we need accuracy higher than 50%, and decision trees are actually pretty good at capturing complex relationships in the data, including nonlinear relationships. So we have that checked. The second reason we use decision trees is that they overfit easily. Usually that's seen as a bad thing, but here it's good, because it means that by perturbing the data a little bit, we will be able to build decision trees that are quite different from one another. So let's start here by building a decision tree that has a depth of two, so a really simple one. I'm going to fit it on my data and then plot it. 
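The setup so far might look like the following sketch. It uses synthetic data as a stand-in for the Facebook data set, since the real loading code lives in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the Facebook comments data; the binary target
# plays the role of "will the post be commented on in the next hour?"
X, y = make_classification(n_samples=2000, n_features=40,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# a really simple decision tree with a depth of two
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)
acc = tree.score(X_test, y_test)
print(acc)
```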
By the way, the function that I'm using to plot it is one I've defined myself; it's at the bottom of the notebook, so if you want to run this yourself, you need to run that first. Here you can see that really simple decision tree. It starts by looking at feature 30, checking whether it's smaller than 4.5. Then it continues that way, looking at another feature and so on, until it reaches the bottom, where it says that if a sample lands here, we predict it will have comments, and otherwise, no comments. Something important to note here is the samples value: it's the fraction of the data that is represented in that node. So at the first node, at the top, we are looking at 100% of the data. Then it's split in two: 63% of the data goes to the left-hand side, and the remaining 37% to the other side. This will become important in a minute. Let's just look at the accuracy: 81%. Right, so now we've built a really simple tree; we only have three rules. Surely our data is more complicated than that, so we'll try to build a more complex tree by increasing the depth. Here we're going to use a depth of 10. Let's plot it. Here you see a really complex tree. Let's zoom in a little bit. You see that we can't even see the whole tree here, because we can't plot the whole thing. But the important part is the samples: you see that we have only 0.4% of the data represented in this node, for example. So it looks like we are overfitting a little bit. If we look at other nodes, we see that, for instance, here we have 0.0% of the data. So we are really overfitting to some particular examples in our data. Surely that's not what we want. With scikit-learn, there is a parameter that we can use to control that: it's called min_samples_split. When we pass it as a fraction, it's the minimum percentage of samples that a node needs to contain before it can be split into new nodes. 
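The min_samples_split idea can be sketched like this (again with synthetic stand-in data): passing a float makes scikit-learn interpret it as a fraction of the training samples rather than an absolute count.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)

# max_depth=10 still allows complex relationships, but a node is only
# split further if it holds at least 20% of the training samples
tree = DecisionTreeClassifier(max_depth=10, min_samples_split=0.2,
                              random_state=0)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```

Because of the 20% constraint, the fitted tree ends up far smaller than an unconstrained depth-10 tree, even though the depth limit is the same.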
So here I've used a really large value, 20%, and we'll see how that looks. It's creating another tree with a max depth of 10, so we can still model complex relationships, but we're not overfitting to single examples. For example, here we have 30% of the data represented. If we were to split further from here, the children would fall below the threshold, so we don't create a new rule; we stop, and we say that all those samples will be predicted as having comments. Where we do have enough data to back up our nodes, we keep dividing in two and going down. Here we don't plot the whole thing, because that would just be too much to visualize, but we're still going deeper. Right. So with this first building block, we are able to build our first ensemble: a random forest. As you can see in this diagram, it's pretty similar to what we were talking about with the doctors. Here we're training two different trees. The main difference is that, in order to make those trees uncorrelated, we are not going to train them directly on the data; we will introduce perturbations into the data. The first stage is the bootstrapping: we sample from our data with replacement in order to create a new data set that has only a subset of the observations, with some observations duplicated as well, meaning that our sample one and our sample two are both statistically similar to the original data, but different from one another. The second step is feature subsampling: instead of providing all the features to our trees when we're training them, we only take a sample of the features. So we will have some trees that focus on some features, and other trees, hopefully, that focus on others. The last important thing to do when you're training a random forest is to make sure that your trees overfit a little bit. 
Because if you keep your trees really simple and put a really big constraint on them, for example, you will just create the exact same single node for all your sampled data sets, and you won't be able to benefit from the advantage of ensembling different models that we've seen before. Right. So the three points are summarized here. Let's try to apply it with scikit-learn now. We'll start with five trees. Here are the main parameters to pass to the scikit-learn implementation of the random forest. The first row that you see here are the parameters that constrain our trees; those are the same ones as before: we choose a depth of 10, and we keep 20 percent for min_samples_split. As we've mentioned, that's not a good idea; we'll see why in a minute. n_estimators: we are going to use five trees. max_features is the feature subsampling; here we're not using it. And we're not doing the bootstrap either. So we're expecting that to be quite bad. Let's see if that happens. We'll calculate the accuracy: 81 percent. With scikit-learn, we can look at every single decision tree in our ensemble, so let's do that. Here you see the list of all the decision trees within your ensemble. The nice thing is that since we can access every single tree, we can also access the accuracy of every single tree. By iterating over all my estimators, I can predict on the test set and then calculate the accuracy. If I do that, I see here the accuracy of my five trees. As you can see, they all have the exact same accuracy. In fact, we've built five trees that are exactly the same, and when we ensemble them, as we've seen before, we get the exact same accuracy as a single tree. So basically, by not using the parameters of my random forest properly, I've not gained anything by ensembling. Right. So now we'll try to do a bit better by using the feature subsampling here. 
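The deliberately misconfigured forest above can be sketched as follows (synthetic data in place of the real data set). With feature subsampling and bootstrapping both turned off, every tree sees the same data and the same features, so the per-tree accuracies come out the same:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# same tree constraints as before, but no randomness between trees:
# no feature subsampling (max_features=None) and no bootstrapping
rf = RandomForestClassifier(n_estimators=5, max_depth=10,
                            min_samples_split=0.2,
                            max_features=None, bootstrap=False,
                            random_state=0)
rf.fit(X_train, y_train)

# rf.estimators_ exposes each fitted tree, so we can score them one by one
per_tree = [accuracy_score(y_test, t.predict(X_test))
            for t in rf.estimators_]
print(per_tree)
```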
By setting max_features to 'auto', it's just going to pick the square root of the number of features every time it's building a split. And we're setting the bootstrap to be true as well. Here, for the moment, we are keeping the... okay, I'm skipping this one, I'm just going to go to the right one. Okay. So this is the good one. Here, I'm doing everything right: I'm letting my trees overfit by having a looser constraint on min_samples_split, only one percent here. I'm also using more trees, 15. I'm using the feature subsampling, and I'm using the bootstrapping. So let's see how it does: 83%. That's a bit better. Let's look at every individual tree. And here we see that we've managed to build trees that are different from one another. Some of them have a lower accuracy, some of them a higher one, and they compensate for each other, which is what we wanted. Right. So I have five more minutes to talk about boosting. Boosting is different from random forest in that instead of building all the trees in parallel, we're going to build them sequentially. We start from the data and build our first tree; that's stage one here. Then we compute the residuals from it: we check the difference between what we were supposed to predict and what we actually predicted. Then we build a second tree that is trained not on the original data but on the residuals, meaning that this tree will learn to predict what the first tree got wrong, so that by adding its correction to the first tree's prediction, we compensate for the error. Obviously, this second tree itself is not perfect, and it will make some error of its own. So we compute the residuals again, train another tree on them, et cetera. We can end up training a lot of trees. 
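The properly configured forest can be sketched like this (synthetic stand-in data again; note that recent scikit-learn versions removed the 'auto' alias used in the talk, so the sketch passes 'sqrt', which means the same thing for classifiers):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# let the trees overfit a bit (looser min_samples_split), and decorrelate
# them with feature subsampling ('sqrt') and bootstrapping
rf = RandomForestClassifier(n_estimators=15, max_depth=10,
                            min_samples_split=0.01,
                            max_features='sqrt', bootstrap=True,
                            random_state=0)
rf.fit(X_train, y_train)

ensemble_acc = accuracy_score(y_test, rf.predict(X_test))
per_tree = [accuracy_score(y_test, t.predict(X_test))
            for t in rf.estimators_]
print(ensemble_acc, min(per_tree), max(per_tree))
```

This time the individual trees differ, some weaker and some stronger, and the averaged ensemble typically beats most of them on its own.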
As you can anticipate, this overfits really easily, because at some point the error gets smaller and smaller, and we start building trees that, instead of compensating for the error, overfit to what is essentially noise. So it's quite important to make sure you properly tune the number of trees you're going to use here. Also, you don't want your early trees to overfit in this case, because if a tree already overfits, the next tree is only going to try to compensate for noise. Right. Here we still have access to the bootstrapping, which is called subsample for gradient boosting, and to the feature subsampling, in order to make sure that every subsequent tree you build will focus on something different in the data and try to correct the error in a different way, so we can benefit from the advantage of ensembles. Right. So let's do it again. Again, I've put the tree parameters in the first row, the same ones that we used before. I'm using five trees here, using 80 percent of the data set for every new tree, and I'm using feature subsampling. The learning rate is the weight we give every new tree when it corrects the error of the previous ones. If I have a large learning rate, building just one tree after the first one will correct a big amount of the error, so it will overfit really quickly with only a few trees, whereas if I keep the learning rate smaller, I will need to build more trees to be able to converge to the ideal solution. Right, so where was I? Cool. So I think I'm going to skip... okay, I will just show this one, and then I will skip that, because I just have two minutes left. All right, so with the defaults here, I've got 82 percent accuracy. 
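The gradient boosting setup described above might look like this (synthetic stand-in data; the parameter values mirror the ones mentioned in the talk):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# tree constraints first, then the boosting-specific parameters:
# subsample is the bootstrap-like row sampling (80% per tree),
# max_features the feature subsampling, learning_rate the weight
# given to each new tree's correction
gb = GradientBoostingClassifier(n_estimators=5, max_depth=10,
                                min_samples_split=0.2,
                                subsample=0.8, max_features='sqrt',
                                learning_rate=0.1, random_state=0)
gb.fit(X_train, y_train)
gb_acc = gb.score(X_test, y_test)
print(gb_acc)
```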
One more thing: here it doesn't make any sense to look at the individual accuracy of the trees, because if you look at the tree at stage two, for example, it was never meant to be accurate on the original data. In fact, it has never seen the original data; it was only trained on the residuals, so it only makes sense in the context of the previous tree and the error that the previous tree had. The only thing we can do here is cut at one stage and look at how all the trees before that point behave together. For example, if I cut here, I can look at how the first two trees do without the third one. So let's do that: with scikit-learn, we have access to the staged_predict function, which generates a prediction at every stage. We can iterate over that and calculate the accuracy score for each of those predictions. If we look at it, we see that we start with an accuracy of 54%, and then every new stage, every new tree that we're adding, improves the accuracy a little bit. So we go from 54 to 77, and so on and so forth. Here you can see it's always increasing, so you might wonder whether we should build more trees, and at what point we would stop seeing an improvement. So we're going to do just that and try the same thing with more trees this time. We're using 15 trees, and I've also changed the learning rate so it takes fewer trees to converge. Let's run that and get the accuracy score. I've got pretty much the same accuracy score here, but let's look at every stage to see how the error on the test set changes. Here we see that it's improving with the first few trees, but after a while it just converges to 83%. Then it starts going down again, to 82%, and it would probably go even lower after that, as we are overfitting to some noise in the training set. So what that means is that we really need to make sure we stop adding trees at some point and get the number of trees right, and also the learning rate. 
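The staged evaluation can be sketched as follows (synthetic stand-in data; staged_predict is the real scikit-learn API, yielding the ensemble's prediction truncated after each stage):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=15, learning_rate=0.3,
                                random_state=0)
gb.fit(X_train, y_train)

# accuracy of the ensemble cut off after stage 1, 2, ..., 15
staged_acc = [accuracy_score(y_test, y_pred)
              for y_pred in gb.staged_predict(X_test)]
print(staged_acc)
```

Plotting staged_acc against the stage number gives exactly the curve discussed above: rising at first, then flattening, and eventually dipping once the later trees start fitting noise.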
So, just really quickly to finish: here I'm using the scikit-learn implementation of gradient boosting. It's pretty good for demonstration purposes, but it has the issue that it can't run in parallel. So usually people in production will use other libraries that are more optimized for speed. We have XGBoost, which is probably the most popular one, and LightGBM and CatBoost, which are two more recent ones. The good thing with those libraries is that you can use them exactly the same way you would use gradient boosting in scikit-learn: you import the classifier the exact same way, as I'm doing here, then you instantiate your classifier with the same parameters, and you use .fit and .predict the exact same way. So you can keep the exact same pipeline and just change the instance that you're using. Here I'm training an XGBoost classifier, a LightGBM one, and finally a CatBoost one. Right, so that's it. You've got access to the notebook if you want to play a bit with it and try to replicate that. Yeah, that's it. Thank you very much. So officially the time is over, even for questions, so this is a bit of overtime we're taking, given the problems we had. There's a 10-minute break, well, now it's more like seven minutes, until the next session, which is the spring orientation. We have time for a couple of questions, so if anyone has a question? Thank you for the talk. You mentioned that in boosting the trees are trained to correct the error of the first tree, right? But how does that work when you predict? Because you don't know the error of the first tree when you predict, right? You don't know what's wrong? Right, yeah. The residuals are calculated on the training set, so as you said, you don't see the test data set at this stage. 
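The drop-in replacement pattern mentioned above can be sketched like this. The alternative imports and constructors are commented out so the snippet runs with scikit-learn alone; they assume the xgboost, lightgbm, and catboost packages are installed and follow their usual scikit-learn-compatible APIs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
# drop-in alternatives -- same fit/predict pipeline, different import:
# from xgboost import XGBClassifier
# from lightgbm import LGBMClassifier
# from catboost import CatBoostClassifier

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=15, random_state=0)
# clf = XGBClassifier(n_estimators=15)    # swap the instance,
# clf = LGBMClassifier(n_estimators=15)   # keep the rest of the pipeline
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(acc)
```

Because all of these classifiers expose the same .fit/.predict/.score interface, swapping one in is a one-line change and nothing downstream needs to move.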
You're calculating the difference, the error, on the training set that you have, and then you train your second tree on that data. Thanks. Could you recommend any publicly available data set for playing with and getting started with random forests and decision trees? Well, you can use this one; that's a good one. I think a lot of people start with the Titanic one that is on Kaggle, for example. It's pretty good, because it has a mix of categorical and numerical data, and it's a pretty simple problem to understand, so it might be a good one. But the advantage of the data set I'm using here is that it has more data, more rows, so you might get more benefit from libraries like XGBoost, which are more complex models, with this data. We have time for one last question, if anyone... No? So thank you, Kevin.