Hello and welcome to lecture eight of our introduction to machine learning course. Today we're going to talk about boosting and bagging. These were very hot topics in machine learning and statistics in the 90s and early 2000s, and they remain very popular and very useful algorithms today.

To introduce them, I will start with an image that I have used a lot in this course: the bias-variance trade-off as a function of model complexity, which we have discussed several times. Imagine some algorithm whose model complexity you can change. Very simple models underfit the data and show high bias, so the test error is high. As you increase the complexity, you slowly reach the regime where the model is so complex that it overfits your training data: the bias is low, but now the variance is high and the test error is high again. Somewhere in between is the sweet spot where you want to be.

On this plot, boosting and bagging can be illustrated as follows. Boosting usually starts with a very simple model that underfits the data, a model with high bias, and then the boosting procedure, which can be applied to different kinds of models, increases the model complexity and brings the error down. Bagging, in contrast, starts with a very complex model, a model that overfits the data very badly, and then we bag several models together to reduce the variance and improve the performance again. So boosting builds complex models out of simple ones: it is a sequential procedure that boosts the model to make it more complex and reduce the bias. Bagging averages complex models, hoping to average out the variance and thereby also reduce the test error.

I will start by introducing a particular model called the classification tree, because bagging and boosting very often use this particular kind of classifier. Imagine a very simple data set with just two features and a binary classification problem, so every point belongs either to the circle class or to the cross class. In a classification tree you try to find one feature with high predictive power: a feature that can be thresholded such that points below the threshold preferentially belong to one class and points above the threshold preferentially belong to the other class. If the value of this feature is below the threshold, we will, in this example, classify the point as a circle. If it is above the threshold, we can either classify it as a cross right away, or we can make the model more complicated and look for the next split: we now look at the second feature, and if it is above its threshold the point is a cross, while if it is below the threshold it is a circle. In principle one can keep building this tree.
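As a minimal sketch of a single-split tree, here is how one could fit a tree stump on synthetic two-feature data with scikit-learn; the data generation is made up purely for illustration and is not from the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature binary problem (0 = circle, 1 = cross), with some label noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
y = np.where(rng.random(200) < 0.1, 1 - y, y)

# max_depth=1 gives a "tree stump": one feature, one threshold, two leaves
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)

print("feature used for the split:", stump.tree_.feature[0])
print("threshold:", stump.tree_.threshold[0])
print("training accuracy:", stump.score(X, y))
```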
It is a binary tree, because every split divides the data in two, and one can keep adding branches until the entire training set is classified correctly and the training accuracy is 100%; you can always achieve that by growing the tree far enough. The tree is built in a greedy manner: we first look for the best possible split, which can be done relatively easily by brute force, scanning each feature and each candidate threshold. Since we always look at one feature at a time, this is not very costly. Once we have found the best split, we proceed to the next level and look for the next best split, and so on. So it is a greedy algorithm for building the tree, and that is all you need to know about these trees for today.

The important point is that the size of the tree regulates model complexity. The simplest possible tree has just one split and nothing else: everything on one side of the threshold is classified as one class and everything on the other side as the other class. This has a name: it is called a tree stump, and it is of course a very simplistic model. On the other side of the spectrum you have a fully grown tree, a tree that achieves 100% training set accuracy; you cannot build a tree larger than that. As another remark, you can also use trees for regression; these are called regression trees, as opposed to classification trees, and they work very similarly, except that in each region you predict a constant value. Today we are talking about classification, and for simplicity we only have two classes, so we will just be predicting class one or class two.
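To make the two ends of this complexity spectrum concrete, here is a small comparison of a stump and a fully grown tree on the same kind of synthetic data as above; again, the data is just an illustrative assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Same kind of synthetic toy data as in the stump sketch above
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
y = np.where(rng.random(200) < 0.1, 1 - y, y)

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)         # one split: high bias
full_tree = DecisionTreeClassifier(max_depth=None).fit(X, y)  # grown until every leaf is pure

print("stump training accuracy:", stump.score(X, y))
print("fully grown tree training accuracy:", full_tree.score(X, y))  # 1.0 by construction
print("fully grown tree depth:", full_tree.get_depth())
```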
Having introduced that, we can start talking about boosting. Boosting is a method that repeatedly applies a classifier, usually a weak one; later on we will use tree stumps as the example. You take a tree stump, a very simple classifier with huge bias, and you repeatedly apply it to your training set, but every time you modify the training set a little bit; specifically, you change the weights of individual samples.

Here is the generic outline of the boosting algorithm. You start with all samples having the same weight, which is the same as having no weights at all. You apply your classifier to the training data with these weights; on the first iteration this just means applying the classifier to the data set as it is. You then measure the performance on the training set, and depending on how well the classifier performs, you set its weight: if it performs very well it gets a high weight, and if it performs poorly it gets a lower weight. Most importantly, you then look for the samples that this classifier misclassified. If nothing is misclassified you are done, but since we apply a very weak classifier, many samples will usually be misclassified. You take the misclassified samples and increase their weights, and then you proceed to the second boosting iteration: you apply the same algorithm to the training set, but now it is forced to focus on the samples that were previously misclassified. You build another model, again increase the weights of the samples that it misclassifies, and keep doing this for a hundred or a thousand boosting iterations.

Notice that the individual models g_m will not necessarily become better as we proceed; instead, they will focus on different parts of the training set. The first model classifies some parts correctly and misclassifies others; the next model focuses on the misclassified parts, classifies them correctly, but maybe misclassifies something that was previously correct, and so on. With such a weak classifier no individual model achieves good performance, but you build a whole collection of different models that cover different parts of the data, so to say.

The final output of the algorithm, the resulting classifier, is a linear combination of all these boosted iterations: we combine the outputs g_m with weights alpha_m that depend on each model's individual performance. It is as if all iterations vote: for each test case, all of the boosted classifiers vote, and the weighted majority vote is the final classification. This is a very generic scheme, and notice that you can use any classifier as the base learner here.
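As a sketch of this generic scheme, here is a small weighted-voting boosting loop with stumps; the particular rules used below for the model weight and for upweighting misclassified samples are deliberately simple placeholders, not the AdaBoost formulas, which come next.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def generic_boost(X, y, n_rounds=100):
    """Generic boosting outline for labels y in {-1, +1}.
    The alpha and weight-update rules here are schematic placeholders;
    AdaBoost replaces them with specific formulas."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start with equal sample weights
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # fit the weak learner to the weighted data
        miss = stump.predict(X) != y           # which samples are misclassified
        err = w[miss].sum() / w.sum()          # weighted error rate
        alpha = 1.0 - err                      # placeholder: better models get a bigger vote
        w[miss] *= 2.0                         # placeholder: upweight misclassified samples
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def boosted_predict(models, alphas, X):
    # Weighted vote of all boosted classifiers
    votes = sum(a * m.predict(X) for a, m in zip(alphas, models))
    return np.sign(votes)
```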
Let me illustrate this on a very simple toy data set: again two features and a binary classification problem, and let's use tree stumps as the weak learner. On iteration one we get the best tree stump. It performs not very well, but above chance: we classify everything on one side as circles and everything on the other side as crosses, and these points here get misclassified. On the next iteration I draw the misclassified points larger, meaning they have an increased weight, and the algorithm is now forced to pay attention to them. The resulting stump will be different because the weights have changed: maybe this is now the optimal stump, and those points are now classified correctly together with these circles, but all of these crosses are now misclassified. So we increase their weights, go to the third iteration, fit another stump, and it may look like this: now all of these are correctly classified as crosses, but these points are misclassified, and you can keep going. This will typically never converge, in the sense that you keep getting different models as you go, but of course you stop after some number of iterations and do the averaging.

In this case I just average the three models. If you look at them, you see that here, for example, this model votes circle, this one also votes circle, and this one votes cross; it is two to one for circle, so the prediction is circle, and so on. The final decision boundary is not linear, and it can be pretty complicated; the longer you boost, the more complicated a decision boundary you can obtain in the final model. So it is the number of boosting iterations that regulates model complexity here.

The most popular boosting algorithm is called AdaBoost, for adaptive boosting. It implements exactly the logic from two slides ago, but what I did not show there is how we actually choose the alphas and the sample weights w. AdaBoost is a particular way of choosing those weights. To introduce it, I first need the weighted error. On any given iteration I can define the weighted error of my classifier as follows: the indicator I is one if a point is misclassified, that is, if the predicted label differs from the true label, and zero if the predicted label equals the true label. If I simply averaged these zero-one values I would get the misclassification rate, but here I take the weights into account, so I get the weighted error. Now AdaBoost makes a particular choice for the classifier weight alpha, depending on this error rate, and a particular choice for the sample weights w.

Let's do a sanity check on the alpha formula. What does a very large error mean here? For binary classification the worst case is actually an error of 0.5, because at chance level the error is 50 percent. If you plug 0.5 into the formula, you get the logarithm of one, which is zero, so that classifier gets zero weight.
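For reference, these are the standard AdaBoost formulas being described here, written out: the weighted error, the classifier weight, the sample-weight update, and the final weighted vote.

$$
\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i \, I\big(y_i \neq g_m(x_i)\big)}{\sum_{i=1}^{N} w_i},
\qquad
\alpha_m = \log\frac{1-\mathrm{err}_m}{\mathrm{err}_m},
$$
$$
w_i \leftarrow w_i \exp\big(\alpha_m \, I\big(y_i \neq g_m(x_i)\big)\big),
\qquad
G(x) = \operatorname{sign}\Big(\sum_{m=1}^{M} \alpha_m g_m(x)\Big).
$$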
So it's zero weight That makes sense if your error is very low this means it's close to zero Then this inside here will diverge to infinity, which means it will get an infinite weight which makes sense If you are perfect on any one iteration then this should get a very high weight So it's a sensible formula you will get weights between zero and very large depending on your performance and Here you see that everything that it's every sample that is misclassified increase gets gets the weight increased so the formulas Sort of makes sense if you look at them like we did now But they're also rather mysterious like why these choices I can write many different formulas that make sense here And they may perform differently like why this particular choice? So one perspective that I really like on on Erebus that actually appeared only a few years later after the Original algorithm was suggested is that actually we can see this entire thing as a greedy optimization of the exponential loss function. So let me explain what that means I can Introduce this exponential loss function for my binary classification problem so the why eyes here are the true labels that can be minus one or one in this case and These are my predictions. So if I predict the same label This will be minus one and you will get one over e as a contribution to the sum And if I predict a different label, for example, this is one and this is minus one Then I will get one inside the exponent, which means it will be e Contribution to the sum so every time I do a mis-classification. I pay price that is bigger So that makes sense and then I sum over my entire training set and this is my Exponential loss, but g can also predict not necessarily one minus one can in this case it can be predict Any real value and Let's say we want to find the g that is a sum linear sum with some weights of of of my classifiers here and now let's say we want to optimize this exponential loss function by adapting the this this weights and choosing the the individual gm terms in a greedy way so that means first I Find alpha one and g1 and I keep them fixed and now I'm looking for the optimal g2 and Alpha two and then I keep them fixed and I proceed like that So I'm like adding more and more terms to this expansion and every time I try to minimize The exponential loss function so one can prove and I'm not going to prove this in this lecture entirely One can prove that these results in exactly the adabost Procedure so it's it's just a different view of the adabost the only thing I will Show here like I will start the proof But not give it in full here is that let's just take a look at what happens when we use this exponential Loss function and we go from the model from m minus one terms to m terms We do one iteration right so I can write this this model is the sum of everything that I had before Plus this new thing that I'm actually optimizing on this step right and since I have exponent of a sum I can split it in the two terms and This is already fixed so I can just call this to w and this is what is actually allowed to be changed to be to be fit on this iteration so this Thanks to the exponential loss function this weight here appears. That's a sample weight, right? 
Starting from this, one can relatively straightforwardly derive that the optimal way to choose the alphas is exactly the AdaBoost formula, and the w's also turn out to be given by the AdaBoost formulas. So the entire AdaBoost machinery works thanks to the exponential loss function, because it allows this split; it is a very neat trick. In fact, it turns out this can be generalized: one can use other loss functions instead of the exponential loss, essentially anything you want, for example the squared error loss or the same loss function we used in logistic regression. Then it is no longer AdaBoost, but it generalizes to something called gradient boosting, which is also a very useful family of algorithms.

All right, some comments on AdaBoost to finish up this part. First of all, it is often considered one of the best off-the-shelf classifiers. Off-the-shelf means you have a classification problem that is not very domain specific: if it is a classification problem on image data, then a convolutional neural network will perform much better than AdaBoost, and whenever the data has special structure, such as images, there are other tools that may perform better. But if you are predicting something like an apartment price from a bunch of different predictors, none of which are images, then you will typically do a pretty good job with AdaBoost using tree stumps as the base classifier. So whenever you have a classification problem and just want to train something without spending a lot of time on fine-tuning, AdaBoost is one of the good choices.

The number of boosting iterations, as I said before, controls the model complexity: the longer you boost, the more complex the model. It can overfit, so in principle you could keep boosting and at some point see the test error start increasing again. However, the great thing is that it often overfits very, very slowly: you boost for a hundred or a thousand iterations, and even if it has already started overfitting, it does so slowly enough that you do not lose much performance. In many cases it does not seem to overfit at all: you keep boosting and the test performance simply stabilizes and does not get worse. That is pretty remarkable, and we will briefly come back to it at the end of the lecture.
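As a small usage sketch of this off-the-shelf behaviour, here is scikit-learn's AdaBoost implementation on made-up tabular data; the data and the choice of 500 iterations are assumptions for illustration. In scikit-learn the default base learner is a depth-1 tree, i.e. a stump.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Made-up tabular data: five generic predictors, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number of boosting iterations; the base learner defaults to a stump
ada = AdaBoostClassifier(n_estimators=500, random_state=0)
ada.fit(X_train, y_train)
print("training accuracy:", ada.score(X_train, y_train))
print("test accuracy:", ada.score(X_test, y_test))
```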
Another comment: if you boost for long enough, the training accuracy reaches a hundred percent; you can fit any training set perfectly if you boost long enough. And the treatment with the exponential loss shows that even after the training accuracy is already a hundred percent, the algorithm does not stop. You can keep boosting even though the aggregate model already classifies the training set perfectly; nothing stops the boosting iterations, the test error can keep decreasing, and the exponential loss on the training set can actually still keep decreasing after the classification accuracy hits a hundred percent. This shows that the algorithm really does optimize the exponential loss and not simply the misclassification rate.

All right, let's move to part two, which is bagging. Bagging stands for bootstrap aggregation, and in a way it is simpler than boosting. It refers to model averaging: we build a bunch of models and then just average them together. But we do not build them sequentially as in boosting, where we need to keep track of the weights and everything; we build the models completely independently, which is fully parallelizable, and then average. And we build these models on bootstrapped data sets: we take our data set, draw bootstrap copies of it, build a model on each, and average them.

What does a bootstrapped data set mean? You may have encountered this term in statistics; it is a very useful concept. Bootstrapping refers to drawing samples with replacement, where you draw the same number of samples as you originally had. Say you have 1,000 samples in your training data; to bootstrap, you draw 1,000 samples out of these 1,000, but with replacement. Some samples will be selected two, three, or four times, and some samples by chance will not be selected at all. So you get a sample that contains some repetitions but has the same size as the original: it still has 1,000 samples. This is very useful in statistics in different contexts; here we just use bootstrapped samples to build the models. The intuition is that each bootstrap sample perturbs the data slightly: we draw one bootstrap sample, another, a third, and build a model on each. The models will all be a bit different because the samples are a bit different, and then we average them. One comment on the bootstrapping procedure: one can show, as an exercise, that on average a fraction 1/e of the samples, roughly one third (about 37 percent), is left out of each bootstrap sample, so roughly two thirds of your samples make it into each bootstrap sample, a different two thirds every time.
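Here is a small numerical sketch of one bootstrap draw and of the roughly 1/e fraction that is left out; the numbers are synthetic and only illustrate the counting argument.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = np.arange(n)

# One bootstrap sample: draw n indices with replacement
boot = rng.choice(indices, size=n, replace=True)
in_bag = np.unique(boot)                     # samples that made it in (repetitions collapsed)
out_of_bag = np.setdiff1d(indices, in_bag)   # samples that were never drawn

# On average about 1/e (roughly 37%) of samples are left out, about 63% are included
print("fraction in the bootstrap sample:", len(in_bag) / n)
print("fraction left out (out of bag):", len(out_of_bag) / n)
```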
The idea of bagging is to apply this to models that have low bias and high variance, for example a fully grown tree. If you build a fully grown classification tree you achieve 100% training accuracy, but you will often have lousy test error because you overfit the training data: you are in a high-variance situation, but the bias is very low, maybe even zero. The hope is that if you build a bunch of these models and average them, the bias, if it was zero for each of them, stays zero, while the variance goes down thanks to the averaging. That is the hope of bagging: if the variance decreases and the bias stays low, the averaged model performs better than each individual model. In fact, if all the models were independent, the variance would even go to zero: as you average more and more independent terms, the variance of the average vanishes and you would have a perfect model. Of course, in reality this does not happen, because the models are not independent: they are built on bootstrapped samples of the same data set. It is not new data that you use for each successive model; each bootstrap sample is a different data set, but it overlaps with the previous ones. So the models are not independent, the variance will not go to zero, but it will hopefully decrease.

This is not yet a random forest; this is just the general bagging procedure. Random forest makes one important tweak, whose aim is to decrease the dependency between every pair of bagged models. How can we do that? We want to make every two models a little more different from each other than what we get just by fitting them to two bootstrapped samples. Here is how random forests do it. A random forest specifically uses fully grown classification trees, a forest of trees. When we build a tree on a given bootstrap sample, remember that we build it greedily by making binary splits. In a random forest, every time we make a split, we select a random subset of the variables as candidates for splitting, check only those, choose the best one among them, and make the split. So we do not scan all variables, just a subset, and it can be a small subset. This introduces additional randomness, which makes the trees more different from one another and hopefully allows the variance to decrease further. It is just a heuristic, but it turns out to work pretty well: it often performs better, sometimes much better, than bagging without this tweak.

An important comment is that here, unlike in AdaBoost and in boosting generally, the number of trees does not regulate model complexity in any meaningful way. If you keep adding trees, you just keep adding terms to your big average. The idea is that you choose a number that is relatively big in order to decrease the variance, maybe a thousand trees for example, but running this for longer does not make your model more or less complex: it converges to something and then stays there, whereas in boosting the complexity keeps increasing with every subsequent iteration.

Some comments on random forests now. They are also often considered one of the best off-the-shelf classifiers. They are very easy to build and use, and they are often one of the top recommendations if you need a classification algorithm; in practice they tend to perform similarly well to AdaBoost or gradient boosting. Another good thing is that a random forest requires very little tuning.
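A minimal usage sketch, again on made-up tabular data; the specific choices of 1000 trees and max_features="sqrt" below are common defaults used here only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is fully grown on a bootstrap sample; at every split only a random
# subset of the features (max_features) is considered, which decorrelates the trees
rf = RandomForestClassifier(n_estimators=1000, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("training accuracy:", rf.score(X_train, y_train))   # typically 1.0
print("test accuracy:", rf.score(X_test, y_test))
```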
There are almost no free parameters. There is a parameter for how big the trees are, but random forests usually use fully grown trees. There is another parameter for how many variables you consider at each split, but the performance usually depends only weakly on it, so you go with the default value and then there is nothing left to tune. There is no real regularization parameter: you just let it run, and miraculously it performs well.

An important remark is that you will typically get 100% training set accuracy with a random forest, and this may not be entirely obvious at first. Why is that? Remember that about two thirds of your samples on average go into each bootstrapped tree, and each tree is fully grown, so it has 100% training accuracy on its own bootstrap sample. Now, by the same logic, any given training sample appears in the training sets of about two thirds of the trees, and all of those trees classify it correctly. So when the trees vote on that training sample, about two thirds of them, or more, vote correctly, and you end up classifying all training set examples perfectly. It seems you are in an overfitting regime: whatever the training data is, you reach 100% training accuracy. Nevertheless, the test set performance can be pretty good.

Two more things that are specific to random forests and often useful in practice. One is that a random forest does not really require a validation set, a test set, or a cross-validation procedure, because when building the forest, you naturally have samples that are left out of each tree. There is a special term for this in the random forest literature: out-of-bag samples, the roughly one third of samples that are not used to train a given tree. For each sample in your data set, you can average the predictions only over the trees that came from bootstrap iterations in which this particular sample was not selected, in which it was out of bag. Evaluating the performance only on those trees acts like a test set: you do not have a separate test set, but each sample is only judged by trees that never saw it during training. And you get this for free, because you are running the bootstrapping procedure anyway when you build a random forest. Once you are done, you have an out-of-bag estimate of the performance that is almost as good as having a test set, which is pretty convenient.

The other thing is that you can use a very similar trick to assess variable importance, which is often useful: you have a bunch of predictors, you are predicting something, and you want to know which variables actually contribute a lot to the performance. You can do this very easily here. When you use the out-of-bag samples to check the performance of the random forest, you can permute one variable at a time, that is, scramble one variable at a time, and then check the performance again. If there is a variable for which the performance drops a lot when you scramble it, that was an important variable. This is pretty standard in the random forest literature and in applications: if you see a random forest used in a scientific paper, you will often also see a plot of the variables sorted by importance, and this is how that
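A small sketch of both ideas with scikit-learn, again on made-up data. The out-of-bag accuracy is exposed directly via oob_score=True; for the variable importance, this sketch uses scikit-learn's generic permutation_importance on a held-out set, which is a close relative of, but not identical to, the classic random-forest version that permutes within the out-of-bag samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_score=True: each training sample is scored only by the trees whose
# bootstrap sample did not contain it, giving a "free" test-like estimate
rf = RandomForestClassifier(n_estimators=1000, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print("out-of-bag accuracy estimate:", rf.oob_score_)
print("held-out test accuracy:", rf.score(X_test, y_test))

# Permutation importance: scramble one feature at a time and measure how much
# the score drops; a large drop marks an important variable
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: mean importance {result.importances_mean[i]:.3f}")
```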
importance is typically assessed.

We are almost done. As a closing remark, I want to come back to the figure I showed at the beginning, the bias-variance trade-off as a function of model complexity. It is tempting to think of boosting as starting somewhere on the left and increasing the model complexity until you reach the sweet spot, and of bagging or random forests as starting somewhere on the right and, by averaging models, decreasing the complexity until you also reach the sweet spot. But I think that is the wrong image; it is not what is happening, because as we discussed, both are examples of interpolating classifiers, classifiers that fit the training set perfectly. This means both of them are in the interpolating regime. We talked about that earlier, in lecture four I think: one can imagine even more complex models beyond the interpolation threshold, on the right of this plot. And remember, when we talked about regression, and in particular about neural networks in the previous lecture, we discussed the double descent phenomenon, where models can be so complex that they fit any training set perfectly, yet the test error is nevertheless good thanks to some implicit regularization. This is one property of neural networks that has attracted a lot of attention in recent years, but similar things happen here, in both cases.

In random forests, the training set performance is 100% by construction; nevertheless, the variance decreases and you end up with good test error. So you are at, or beyond, the interpolation threshold. The same, or something even more interesting, happens with boosting. If you use weak learners, you start on the left, you boost and boost, and the model complexity increases. If you boost long enough, the training set performance reaches 100%, which means you have crossed the interpolation threshold and are now in that regime. But what usually, or at least often, happens is that you essentially tunnel through the high-variance peak: you never see it while boosting, the performance keeps improving, and you somehow get from one side to the other, thanks to some implicit regularization operating in boosting, ending up with a model that lies beyond the interpolation threshold. There are relatively recent papers studying this phenomenon; my impression is that it is not fully understood why and how it happens, and in which situations it does or does not. So this is a very interesting and active field of research. It also shows something nice: there is a lot of discussion in the literature about why this happens in neural networks, why neural networks are in this atypical interpolating regime but nevertheless perform well, yet these two models from the 90s, AdaBoost and the random forest, exhibit similar phenomena. They are interpolating classifiers that nevertheless often perform really well and have really good predictive performance on test data. Thank you.