Hello everyone, my name is AJ Halthor, and in this video we're going to talk about boosting.

So you're a programmer working a dead-end job, not making enough money, and you decide that, well, you need a side hustle, something a bit more stable. And you decide to go into gambling on horse races. But clearly you're no expert, and you barely have an idea of what to look at while placing your bets. Realizing your ineptitude, you consult an expert gambler. You ask him: what are the most important factors in determining the winner of a horse race? And he says, well, it's complicated. There's no grand set of rules, you see? To which you say, okay then, but what if we take a look at just these 10 races? And he says, well, there are some rules of thumb we can establish. For example, for these 10 races, it looks like the horse with the most previous wins won the race. So just bet on the horse with the most previous wins. If we were to present another collection of 10 different races, the expert may say, well, for these 10 races, it looks like the lightest horse performed better. So you should bet on the lighter horses.

Now, these rules are very general, and while these rules of thumb are still better than a random guess, they are quite inaccurate. The lighter horses may have had an edge in these last 10 races, but we could probably come up with 100 other cases where the heavier horses won. And so we give our expert a different set of races, and he gives us back some rules of thumb. But here's what we need to figure out: what is the best way to group these horse races into sets in order to get the best rules of thumb from our expert? And even if we do that and we have these rules of thumb ready, how do we actually decide which horse to bet on in a race? Boosting solves both of these questions. It's a technique where we combine multiple rules of thumb to make an accurate, informed decision. But how is this possible? Let's take a step back and look at this from a machine learning standpoint.

Neural networks, logistic regression, support vector machines: all of these models answer the question, how do we learn to solve this problem? But a question we should actually ask before this is, is this problem learnable at all? To answer this question, a machine learning framework called the PAC learning model was introduced back in 1984 by Leslie Valiant. This framework quantitatively defined what it means for a problem to be learnable. So what is the PAC model exactly? Well, it stands for "probably approximately correct." Okay, there are a lot of words here. Let's break it down.

We want a correct model: a model that always makes correct predictions on unseen data. But you can only get this by seeing every possible input combination, and at that point it's just memorization, not generalization, which is the point of machine learning. So there's no learning. Okay, so if we can't get a correct model, can we get an approximately correct model? Well, we can gather a training data set and potentially train a model that has a very small nonzero error. However, we cannot always guarantee that the error is tolerable. What if we accidentally sample some bad training data? Then we're going to get a bad model with a pretty high error. So we have to make sure that the probability of getting this bad model as a result of bad training data is low, lower than some threshold. In other words, we cannot always be approximately correct, but we can probably be approximately correct. Hence the name: probably approximately correct, or PAC. Putting this all together, a problem is PAC learnable if a learning algorithm can find a solution that has an error less than some threshold, with a probability greater than some other threshold.
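As a quick aside, here's that condition written out (a standard formulation, where h is the learned hypothesis, ε is the error threshold, and δ bounds the failure probability):

```latex
\Pr\big[\, \mathrm{err}(h) \le \epsilon \,\big] \;\ge\; 1 - \delta
```

In words: with probability at least 1 − δ over the random draw of the training data, the learned hypothesis h has error at most ε on unseen data.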
This is what I meant by quantitatively defining what is learnable: we have actual numbers here. Now, this defines PAC learnability, but if we look at the same definition in terms of the learning algorithm, we can say that a learning algorithm that satisfies these threshold conditions is a strong learner. Okay, that's cool, but we're talking about boosting here. So where does that fit in? Keep listening.

Consider the simple problem of classifying a flower into its correct category based on four numeric features. This, you may know, is the classic iris data set, and this simple problem can be solved with reasonably low error using something like logistic regression. So for this particular problem, logistic regression is a strong learner: it fits our definition with the thresholds. That is, it is able to learn to model the iris data set with an error less than some ε, with probability greater than 1 − δ. But for more complex problems, a strong learner would need to be more complex to satisfy those threshold conditions. These problems require a lot more learnable parameters, a lot more training samples, and potentially some very demanding hardware. Solving complex problems with stronger algorithms is actually the direction the field is moving in, deep learning research specifically. But for general use, if we want to solve complex problems like these and don't have the most sophisticated hardware, what do we do? What if we don't have enough training examples? Can we just not solve these complex problems at all?

This is where weak learners come in. A 1988 paper by researchers Michael Kearns and Leslie Valiant defined weak learners as algorithms that perform just slightly better than random guessing. Shortly after, in 1990, another researcher, Robert Schapire, illustrated the power of these weak learners in his paper "The Strength of Weak Learnability." The main takeaway was that if a problem can be solved by a strong learner, then a weak learner should be able to solve it too, in some way. He showed this by introducing a technique called the hypothesis boosting mechanism. So let's talk about that.

A hypothesis represents what a model has learned after training. In the supervised learning case, think of it as the final equation with its learned weights. Instead of constructing one hypothesis, though, why not construct three, have them all make predictions, and then go by majority vote? We can get different hypotheses from the same algorithm just by training it on different data. So to construct the first hypothesis, feed it all the training data we have. For the second hypothesis, construct a data set where half the points were classified correctly by the first hypothesis and the other half incorrectly. This way, the second hypothesis can somewhat compensate for the shortcomings of the first. The third hypothesis is basically a tiebreaker: we take the points that the first two didn't agree on and use them to train it. So the third hypothesis overcomes the shortcomings of the first and the second. When we want to make predictions on unseen data, we feed the data to all three hypotheses, determine their outputs, and a majority vote wins.
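Here's a minimal Python sketch of this three-hypothesis scheme, using decision stumps from scikit-learn as the weak learner. (Note: the data selection below is a simplified approximation of Schapire's original filtering procedure, and the -1/+1 label convention is my assumption for the majority vote.)

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_stump(X, y):
    """Train a weak learner: a one-level decision tree (a stump)."""
    return DecisionTreeClassifier(max_depth=1).fit(X, y)

def three_hypothesis_boost(X, y, rng=np.random.default_rng(0)):
    # Hypothesis 1: train on all of the data.
    h1 = fit_stump(X, y)

    # Hypothesis 2: train on a set that is half correctly classified
    # by h1 and half misclassified, so h2 compensates for h1's mistakes.
    correct = np.flatnonzero(h1.predict(X) == y)
    wrong = np.flatnonzero(h1.predict(X) != y)
    n = min(len(correct), len(wrong))
    if n == 0:
        idx2 = np.arange(len(y))  # h1 is perfect (or hopeless); fall back to all data
    else:
        idx2 = np.concatenate([rng.choice(correct, n, replace=False),
                               rng.choice(wrong, n, replace=False)])
    h2 = fit_stump(X[idx2], y[idx2])

    # Hypothesis 3: the tiebreaker, trained on points where h1 and h2 disagree.
    disagree = np.flatnonzero(h1.predict(X) != h2.predict(X))
    idx3 = disagree if len(disagree) > 0 else np.arange(len(y))
    h3 = fit_stump(X[idx3], y[idx3])
    return h1, h2, h3

def predict_majority(hypotheses, X):
    """Majority vote over the hypotheses (labels assumed to be -1/+1)."""
    votes = sum(h.predict(X) for h in hypotheses)
    return np.sign(votes)
```

With three voters and -1/+1 labels, a tie is impossible, so np.sign always yields a clean decision.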
With this version of boosting, algorithms actually performed better than without it. Now, think about this for a second. What if the algorithm in question was a weak learner? Then we can construct three weak hypotheses by boosting, and by combining those three weak hypotheses, we may be able to make strong predictions and solve more complex problems. That is, weak learners can be used to make strong predictions, which was the big takeaway of the paper.

But what if the problems were even more complex? Three weak hypotheses may still not be enough to get adequate performance. So we can stack multiple such three-hypothesis units to get a hierarchical structure where the final output is a majority of majorities, if you will. But this does not scale well. In 1995, researchers at AT&T Labs introduced a modification to this algorithm: collapse the entire hierarchical structure and take a single majority vote over many hypotheses. This slightly improved performance, and for more complex problems, all we need to do is add more hypotheses until our training error goes down. So it's quite scalable.

Here's another thought. In this scheme, every hypothesis has an equal say in the final decision. But what if they were weighted instead? What if, every time we add a hypothesis, the importances of the hypotheses adapt depending on the errors of the newly added ones? This is the idea behind adaptive boosting, or AdaBoost, one of the well-known variants of boosting we see today. The difference between AdaBoost and the previous variants is twofold: every weak hypothesis has a weight in the final decision, and every sample has a weight associated with it while constructing a weak hypothesis. This is as opposed to the filtering of data samples that we discussed before in the hypothesis boosting mechanism.

Here's the algorithm. Step one: initialize the weights of all samples to be the same. Step two: construct the first hypothesis by training on these weighted data points. Step three: determine the training error of this hypothesis and then make the weight updates. There are two sets of weights that need updating. The first is the sample weights for the next hypothesis: we increase the weights of misclassified samples and decrease the weights of correctly classified samples, so the next hypothesis focuses more on what the current one missed. We also update the weights of the hypotheses themselves. On the first iteration, the first hypothesis gets all the say, since it's the only one; as we add more hypotheses over the iterations, you will see these weights adapt. The final weighted classifier can thus be used to make stronger decisions.
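Here's a minimal sketch of this loop in Python, using the standard AdaBoost update formulas and decision stumps as the weak learner (labels assumed to be in {-1, +1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """Minimal AdaBoost for labels in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # step 1: equal sample weights
    hypotheses, alphas = [], []
    for _ in range(n_rounds):
        # step 2: train a weak hypothesis on the weighted samples
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        # step 3: weighted training error of this hypothesis
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        # hypothesis weight: the lower its error, the bigger its say
        alpha = 0.5 * np.log((1 - err) / err)
        # sample weight update: up-weight misclassified points,
        # down-weight correctly classified ones, then renormalize
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas

def predict(hypotheses, alphas, X):
    """Weighted majority vote of all weak hypotheses."""
    score = sum(a * h.predict(X) for h, a in zip(hypotheses, alphas))
    return np.sign(score)
```

scikit-learn ships a production version of this loop as sklearn.ensemble.AdaBoostClassifier.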
Now, here's a quick case study on overfitting and underfitting, and how AdaBoost deals with them. If we don't add enough hypotheses, our partially boosted model will not be sophisticated enough to capture the patterns in the training data, so boosted weak learners may be prone to underfitting. On the other hand, boosting weak algorithms is relatively robust to overfitting, but only under the assumption that the base learner really is weak, like a single decision stump. If the hypotheses are weak, it's going to take a considerable number of iterations to add enough complexity to the final boosted model.

Now, another boosting technique, even more popular than AdaBoost, is gradient boosting. It's easiest to understand by drawing parallels with AdaBoost. Boosting techniques like AdaBoost and gradient boosting are additive in nature: after every iteration, we add a new hypothesis that strives to overcome the shortcomings of the previous hypotheses. AdaBoost constructs a new hypothesis by increasing the weights of the misclassified data samples, so the next hypothesis gets them right. Gradient boosting, on the other hand, uses gradients for everything. To find the data points to focus on, it computes the gradient of the loss of the overall hypothesis at each training sample. This works because the gradient also measures how poorly that particular data point is handled: the worse a point is predicted, the larger the magnitude of its gradient, and the more the next hypothesis will focus on getting it right. In AdaBoost, the shortcomings are flagged by high-weight data points; in gradient boosting, they are identified by high-gradient data points.

In AdaBoost, the weighting of data points follows an exponential curve. This just means that if a sample is classified incorrectly by the overall hypothesis, we weight it really heavily so that the next hypothesis really focuses on getting it right. But instead of a fixed exponential weighting scheme, what if we let the model determine the weighting on its own? After all, in gradient boosting everything is done with gradients, so why not use them to derive the weighting too? This allows for more flexibility, and AdaBoost is often seen as a special case of gradient boosting with an exponential loss function.

Gradient boosting seems like a clear winner. But as we use machine learning in real life, we face complex problems that involve millions, if not billions, of training examples, and gradient boosting has caveats at that scale: computing the gradients can become quite slow, and with so many moving parts and learnable parameters, the models can also overfit. Both of these are addressed in XGBoost. XGBoost is gradient boosting, but with some performance enhancements. Usually a data set is in table format, where the rows are data samples and the columns are features. This takes too much space, so XGBoost stores data in a compressed sparse column, or CSC, format instead. If we have a lot of data, which is usually the case, we divide the data into blocks, and these blocks can run on separate machines for parallel compute. XGBoost operates quickly on sorted columns: with the compressed sparse column format, we can sort every column in every block by value. To optimize even further, decision tree nodes are constructed in a breadth-first fashion, also in parallel. All of these optimizations together make XGBoost up to 10x faster than ordinary gradient boosting. And it's because of this that you see XGBoost everywhere: product recommender systems, house price prediction, fraud detection, you name it.
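To make the additive, gradient-driven loop concrete, here's a minimal from-scratch sketch of gradient boosting for regression with squared-error loss, where the negative gradient at each point is simply its residual. This is a bare-bones illustration of the idea, not how XGBoost is actually engineered.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, learning_rate=0.1):
    """Minimal gradient boosting for regression with squared-error loss."""
    # Start from a constant prediction: the mean of the targets.
    base = y.mean()
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        # Negative gradient of the squared-error loss = the residuals.
        # Badly-fit points have large residuals, so the next tree
        # automatically focuses on them.
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict(base, trees, X, learning_rate=0.1):
    """Additive prediction: base value plus all the trees' corrections."""
    return base + learning_rate * sum(t.predict(X) for t in trees)
```

In practice you'd reach for a tuned library, for example xgboost.XGBRegressor or xgboost.XGBClassifier, which implement this same additive idea on top of the CSC storage and parallel block structure described above.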
A fun fact I read in the XGBoost paper: among the 29 challenge-winning solutions published on Kaggle's blog in 2015, 17 used XGBoost, either on its own or as part of an ensemble. I encourage you to check out those Kaggle solutions on the blog. They are very well written and explain how and why the winners chose some variant of gradient boosting for their solutions. And now that you have some context on boosting and its evolution, you can better appreciate these library implementations, and for your future projects, you'll know whether to apply these or similar techniques. In the description, I lay out all the research resources I referenced in this video as a timeline of events: from the inception of PAC learning, which started the discussion of what problems are learnable, to the gradient boosting we see in production systems today for large-scale, complex problem solving. Hopefully, you're now better equipped to take on your newfound passion for horse race gambling and make ends meet in your plebeian household. Bye bye!