Thank you, guys. Let me try to start my screen share. Okay, can you see my screen now? Yep, it's perfect. Okay, great.

So thanks very much to the organizers for the invitation and for taking all the trouble to organize this wonderful workshop online. I'm going to talk about some properties of boosting algorithms in high dimensions, and this is based on joint work with Tengyuan Liang, who is currently an assistant professor at UChicago.

So boosting is an ensemble learning algorithm, and it's pretty old: its roots stretch back to the work of Valiant on the theory of learnability. Boosting is an iterative ensemble algorithm: at each iterate it calls a bunch of weak learners and then combines the predictions it receives from them in a smart way. The earliest and most popular version of boosting was AdaBoost, due to Freund and Schapire. You have a bunch of training examples — that's your training data — and you start with a uniform weight on them; at each iterate you upweight or downweight your training examples through an exponential weighting mechanism. The important feature of AdaBoost is that this weight parameter here, alpha_t, is chosen adaptively based on the error you incur at iteration t; there is a minimal sketch of this update right after this passage. And it turned out that moving away from earlier versions of boosting to an adaptive weighting scheme suddenly yielded an algorithm with extremely good generalization performance.

Okay, so AdaBoost was observed to perform extremely well in practice, and this is a picture that was observed in multiple works around '96 and '98. What you see here is iteration number on the horizontal axis, and we're plotting the training error of AdaBoost: you can see it goes to zero, around iteration five here. But the generalization error is this curve, and you will see that it just keeps on decreasing — there is no double descent or multiple descent here. This sort of ties to Diaz-Colis's talk earlier: this is an ensemble learning algorithm, so somehow the way the weights are combined automatically leads to this kind of generalization error curve.

Now, this phenomenon of interpolating the training data is very common these days, and we understand many aspects of it very well, but when it was first observed for boosting and bagging it naturally piqued the interest of a broad community of machine learning researchers and statisticians. One of the explanations put forth in those days was that the key quantity explaining this picture is the empirical margin distribution. So what is the empirical margin? Given a training example (x_i, y_i) and a classifier f, the margin of that example is defined to be the product y_i f(x_i). In the paper by Schapire et al., the following was observed. The plot on the right shows the fraction of examples for which the margin is below some threshold, for three iteration counts: this curve is for T equal to five, this one is for T equal to a hundred, and the solid one is for T equal to a thousand. When you run AdaBoost, initially the margin distribution has some structure, but along the path of iterations the fraction of examples with a large margin increases.
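To make the reweighting mechanics concrete, here is a minimal sketch in Python of the AdaBoost loop just described — not the authors' code, just the textbook Freund–Schapire update, with `weak_learner` a placeholder for whatever base learner you plug in. At the end it also computes the normalized margins y_i F(x_i) / ||alpha||_1 that the margin-distribution plot is built from.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Minimal AdaBoost sketch. `weak_learner(X, y, w)` is assumed to return
    a function h with h(X) in {-1, +1}, fit to the w-weighted sample."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start from uniform weights
    alphas, learners = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)
        pred = h(X)
        err = np.dot(w, pred != y)             # weighted 0-1 error this round
        err = np.clip(err, 1e-12, 1 - 1e-12)   # guard the log below
        alpha = 0.5 * np.log((1 - err) / err)  # adaptive step size alpha_t
        w *= np.exp(-alpha * y * pred)         # upweight mistakes, downweight hits
        w /= w.sum()                           # renormalize to a distribution
        alphas.append(alpha)
        learners.append(h)
    F = lambda Z: sum(a * h(Z) for a, h in zip(alphas, learners))
    # normalized margins y_i F(x_i) / ||alpha||_1: the empirical margin
    # distribution is the CDF of these values
    margins = y * F(X) / np.sum(np.abs(alphas))
    return F, margins
```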
So by the time you're at T equal to a hundred, there is almost no point with margin below 0.5, and furthermore the margin distribution stabilizes to some cumulative distribution function. Okay, so this quantity was put forth as a key quantity, and in fact what they showed were generalization error bounds. If you take any classifier of this form and scale by the L1 norm, then the generalization error can be upper bounded in this way: the first term is your empirical margin distribution, where kappa can be any threshold you would like to pick, and the second term is the generalization price you incur for that threshold — and this term is independent of kappa. So the natural thing to do, looking at this bound, if you want to obtain the best possible upper bound: you can fix theta and choose kappa to be the minimum margin over your examples, so that the first term is zero. And then, if you wanted to optimize over all classifiers of this form, you would choose kappa to be this maximum L1 margin — this quantity is known as the maximum L1 margin. If you do that, you immediately get an upper bound on the generalization error of this form: roughly one over the square root of n times this quantity. This upper bound has subsequently been improved, but still, in the literature, the connection between generalization error and the maximum margin is only known through upper bounds.

But it turned out that this maximum L1 margin is not just important for generalization. If you want to study how many steps AdaBoost takes before it reaches zero training error — before it starts perfectly interpolating the training data — one can also provide upper bounds on this quantity, and once again the maximum L1 margin shows up here. This bound is due to Zhang and Yu. But we don't really know whether this upper bound is tight.

And if we follow the literature, the maximum L1 margin comes back into the picture yet again if you want to understand the path of AdaBoost. Imagine that your data is linearly separable — we're in a classification setting where the outcome, or response if you will, is plus or minus one, and we say the data is linearly separable if there is a hyperplane that perfectly separates the two groups. Then you can define the min-L1-norm interpolating classifier: among all vectors that separate your data, take the one with the minimum L1 norm. For boosting it is well known that if you take the adaptive step size in the algorithm to be infinitesimal — in the sense that you send the step size off to zero and the iteration count off to infinity — then the boosting iterates, scaled by their L1 norm, actually converge to this min-L1-norm interpolant. So in this special limit boosting converges to this quantity, but these works give other insights too: you can also read off that boosting, in some sense, follows the path of an L1-regularized empirical risk minimization problem. And the min-L1-norm interpolant is directly connected to the max L1 margin, because whichever theta achieves the optimal value in this margin optimization is, up to rescaling, exactly the min-L1-norm interpolant.
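This duality is easy to operationalize: the min-L1-norm interpolant is a linear program, and the max L1 margin is the reciprocal of its optimal L1 norm. Here is a small sketch — again my own illustration, not the paper's code — using the standard split theta = u - v with u, v >= 0:

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_norm_interpolant(X, y):
    """Sketch: argmin ||theta||_1  subject to  y_i * <x_i, theta> >= 1,
    rewritten as an LP via theta = u - v with u >= 0, v >= 0."""
    n, p = X.shape
    Z = y[:, None] * X                      # rows are y_i * x_i
    c = np.ones(2 * p)                      # ||theta||_1 = sum(u) + sum(v)
    A_ub = -np.hstack([Z, -Z])              # y_i <x_i, u - v> >= 1, flipped for linprog
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub)  # variables are >= 0 by default
    if not res.success:                     # infeasible: data not linearly separable
        return None
    return res.x[:p] - res.x[p:]

# Max L1 margin via the duality discussed above:
#   theta_hat = min_l1_norm_interpolant(X, y)
#   kappa_n   = 1.0 / np.abs(theta_hat).sum()
```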
Okay, so these results seem to suggest that both this interpolant and the margin are key players in boosting. But in the past we only knew upper bounds on these two properties of the algorithm in terms of this object. So in this talk we are going to provide certain precise characterizations of these quantities. We're going to ask: exactly how large is this L1 margin going to be? And if we look at the min-L1-norm interpolant, what are the properties of that particular estimator? Then we will switch gears, come back to boosting, and study precisely how long boosting takes to reach this min-L1-norm interpolant. That will give rise to a precise characterization of the generalization error of the algorithm, and we'll be able to study certain other properties as well. And we'll do all of this in the high-dimensional scaling limits that have featured in the earlier talks today.

Here I just want to mention that boosting is an ensemble learning algorithm, and there are many other ensemble learning algorithms, like random forests and bagging. It's hard to study ensemble learners in general, but for boosting it turns out that, through this connection with the min-norm interpolant, you can study precise properties of the algorithm. My hope is that some of the proof techniques can be used to study other ensemble learners; some of these techniques have also been used in the context of neural networks.

Okay, so the inspiration for our work is tied to both historical and recent literature in statistical physics and in other areas. Our proof techniques are based on Gaussian comparison results that date back to Gordon in '88, and, more recently, on what is known as the convex Gaussian minimax theorem, which was proved in the context of high-dimensional M-estimation.

Now, I want to mention that the max L2 margin is a very well studied problem. In the case of isotropic design matrices, where there is independence between the response and the covariates, the predictions for the max L2 margin limit are due to Gardner in '88, and these were subsequently rigorized by Shcherbina and Tirozzi in 2003. More recently, the papers by Montanari et al. and Deng et al. studied these objects under more general covariance matrices and characterized their precise limits. So what's different between L1 and L2? The L1 and L2 geometries are significantly different, and the L1 case lacks certain important strong-convexity-type features that the L2 case has. Rigorizing the proofs therefore calls for certain novel techniques and new uniform convergence arguments, but I'll probably not have time to go over those today. And of course, as you can imagine, things will be characterized through fixed-point equations here, just as in the L2 case, and we will obtain different fixed-point equations for L1.

Okay, so now diving into the formal setting. I'm going to work under the usual high-dimensional scaling limit where p/n goes to a constant, and this will be the model: the labels are plus or minus one given some covariates. Let's skip over this condition on the signal strength. So we have three problem parameters.
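Since the slide with the model isn't reproduced here, the following is one plausible way to write the setting down — the Gaussian design and the link function f below are my assumptions, not a quote from the talk:

\[
x_i \overset{\text{iid}}{\sim} \mathcal{N}(0, I_p), \qquad
\Pr\big(y_i = +1 \mid x_i\big) = f\big(\langle x_i, \theta_\star \rangle\big), \qquad
\frac{p}{n} \to \psi \in (0, \infty),
\]

together with a condition on the signal strength. The maximum L1 margin being characterized is then

\[
\kappa_n \;:=\; \max_{\|\theta\|_1 \le 1} \; \min_{1 \le i \le n} \; y_i \langle x_i, \theta \rangle .
\]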
But we'll also be in the asymptotically linearly separable regime, which basically means that this overparameterization ratio is above a certain threshold, which is itself a function of the signal strength.

So what are the results? We have a very precise characterization of this maximum L1 margin: it can be related to a particular function of the problem parameters, and this function can be explicitly pinned down. This function is also related to the MLE-existence phase transition curve for logistic regression, but I'm just going to leave that as a comment. If we range kappa over the non-negative reals, this function is non-negative over that set, and we can prove that it is continuous and non-decreasing, so things are well defined. We have a very precise characterization of what this function should be, but I'm going to skip that in the interest of time. The characterization itself shows the connections with certain earlier results on separability of the data when you have a logistic model, or a general f here.

Then, regarding the min-L1-norm interpolant, what we are able to show is the following. This will be the generalization error of that estimator, and we can characterize the precise limit of this object. Y, Z1, Z2 are random variables with a specific description, and these constants, together with another constant s-star, are the unique solution to a nonlinear system of equations. The system is parametrized by your dimensionality ratio p over n and by the limit of your maximum L1 margin. These constants have certain interpretations: c1-star is the angle between the interpolant and the planted direction, c2-star is the norm of the residuals of this quantity, and s-star can be interpreted as a Lagrange multiplier. And we can also show certain other properties of the min-L1-norm interpolant which can only be described through these constants.

Now, coming back to boosting, what does all of this tell us? In the past we knew these results about the number of optimization steps along the boosting path. What we're able to show is the following: we can choose an appropriate step-size sequence such that, for any epsilon, if the iteration count is above a certain threshold, then the boosting solution approximates the min-L1-norm interpolating classifier up to error epsilon. That threshold can be characterized precisely through this form, where kappa-star is the limit we obtained for the maximum L1 margin. The threshold has a precise characterization when n and p are both diverging in the particular high-dimensional limit I mentioned, and the maximum L1 margin limit shows up here as well.

Comparing with classical results: the classical upper bound was this, and if we just plug in the limit instead of the empirical version, then this is the resulting bound; the y-axis is the ratio p over n. The precise limit that we characterize is over here, and the separability threshold is over here, so you can see there is some gap, and it persists as the overparameterization ratio increases. And then, revisiting this result, it also characterizes the stopping time precisely: if we plot this denominator as a function of psi, we see that this object increases as psi increases.
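Schematically, the characterizations just described take the following shape — the precise system of equations was on the slides, and the particular functional form below is my illustrative reconstruction of the speaker's description, not a quote from the paper:

\[
\text{test error of the interpolant} \;\longrightarrow\; \Pr\big( Y \,( c_1^\star Z_1 + c_2^\star Z_2 ) < 0 \big),
\]

where \((c_1^\star, c_2^\star, s^\star)\) is the unique solution of a system of three nonlinear equations indexed by \(\psi = \lim p/n\) and the margin limit \(\kappa^\star(\psi)\); and the number of boosting iterations needed to get \(\varepsilon\)-close to the min-L1-norm interpolant is governed by a threshold in which \(\kappa^\star(\psi)\) enters through the denominator just mentioned.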
So that plot says that when the effective dimensionality — the overparameterization — increases, the optimization is faster, and you converge to this min-L1-norm interpolant direction faster.

And as an aside, we can study some other structural properties of boosting as well. For instance, suppose you were interested in understanding what the boosting classifier looks like when it reaches zero training error — because it converges to an L1 direction soon after that, it is expected to have some degree of sparsity. And we can characterize that: if we look at the fraction of non-zero coordinates in the boosting solution at the time it first hits zero training error, we can give an upper bound on this fraction, and once again it is related to the maximum L1 margin. This upper bound shows that the solution is actually going to be quite sparse, and it is, to our knowledge, the tightest upper bound that exists in the literature.

Okay, so maybe I'll start wrapping up. Basically, in this work what we try to do is provide precise characterizations of the maximum L1 margin and of the min-L1-norm interpolant, and through the subtle connection with boosting that helps us understand this algorithm much better than before: we can improve the existing bounds and provide precise generalization error expressions and optimization speeds. For our proofs, the jump from L2 to L1 requires new uniform convergence arguments, but once you can do L1 you can actually go to any Lp geometry — our proofs extend directly to that. And we can also handle certain other data-generating models. There are many other perspectives on boosting which I didn't touch upon today, but I hope this helps us understand these ensemble learning methods more precisely, as opposed to only understanding upper bounds on their performance. And this opens up doors to further questions, so maybe I will just stop here. We have a draft on arXiv for this work, but we are revising it to include several extensions at the moment, and I'll be happy to send it around if someone is interested. Thank you.