I'm very happy to welcome Mikhail Belkin to talk to us today. He is a professor at the Halıcıoğlu Data Science Institute at the University of California, San Diego, and prior to that he was a professor in the Department of Computer Science and Engineering at the Ohio State University. His research interests are in the theory and applications of machine learning and data analysis. Some of his well-known work includes Laplacian eigenmaps and graph- and manifold-regularization algorithms, as well as polynomial learning of distribution families. Some of his recent work has been concerned with understanding statistical phenomena observed in deep learning, and one of the key recent findings is the double descent risk curve, which extends the textbook U-shaped bias-variance trade-off curve beyond the point of interpolation. So without further ado, let's start the seminar.

Thank you very much, Becky, and thank you for the invitation. It's very nice to be here, virtually. What I would like to talk about today are some recent issues, really in statistics, which have been identified through the empirical observations of deep learning. In some sense this is very much a physics-minded view: we observe empirical phenomena and we try to provide reasonable theoretical descriptions. In fact, I have been arguing that we have to take more of a physics- or science-based approach to machine learning. Of course, our experiments are done in a computer rather than in the physical world, but it's a similar setup. And by the way, please do interrupt me if there are any questions; I am more than happy to slow down and answer them.

So let me proceed. In recent years we have seen that deep neural networks, which are very complex objects, perform well on many tasks. They really are complex structures: they have a lot of elements, and they are quite non-transparent from the point of view of theory; it's not easy to figure out what's going on. In a sense, we have a crisis of machine learning theory. On one hand, there was the argument that machine learning has become alchemy; there is a very nice talk by Ali Rahimi at NIPS 2017 along these lines. You are basically using very complex recipes to do things, and sometimes these recipes seem to work, but it's hard to judge when and how they work, because there is no systematic theory of these things. On the other hand, there is the argument that machine learning theory is looking for lost keys under the lamp post, because that's where the light is; there is a cartoon of a drunk person who lost his keys and is looking for them under the lamp post. The theory looks at phenomena which are accessible through the light of this lamp post, but not where the key was probably lost. I should point out that there is no fundamental reason to think the light is fixed; hopefully we can move the light to shine in a different direction. But the criticism was that the theory was irrelevant to the practice of machine learning. So what is the issue here? There are really two issues: there is generalization and there is optimization.
The first question, the big one, is: why do these very complex over-parameterized models generalize? That's number one. And question number two: why can highly non-convex systems be optimized by methods such as stochastic gradient descent? This talk will primarily be about the first question, but I'll say something about the second one. Actually, in a sense, the second one is easier, but the first one is perhaps more fundamental for trying to understand what's going on. So I'll talk about the first one, for which we now have some understanding; it's not yet complete, but it's certainly much, much better than it used to be. So what will I talk about? First, I'll talk about what the issues were with the standard analysis, and then I'll point out some directions through which we now understand generalization from the theoretical point of view. That's the majority of the talk, and at the end I'll say something about optimization. And please do feel free to ask any questions and interrupt.

Okay, so just to set things up, we have standard supervised machine learning. We have training data (x_i, y_i), with x_i in R^d and, just for simplicity, labels y_i in {-1, +1}. The goal of machine learning is to use this data to construct a function that generalizes best to unseen data. And what does generalize mean? It is really a statement about the future: we want to predict what will happen in the future. If you make the most standard statistical assumption, that the data are sampled independently from some probability distribution, then prediction of the future simply becomes requiring that the expected value of some loss function over the unseen data be minimized. This loss function you can think of as maybe a square loss or a classification loss. So this is what we really want, under this assumption.

Now, most algorithms, and many analyses, are based on empirical risk minimization, which is the following: you take a class of functions H and you try to find the function which minimizes the loss over the training data. That's the procedure. Well, the procedure is of course not just taking an argmin; it usually has some algorithmic element, typically a gradient-descent-type method. The key aspect of this is the choice of H. How do you choose H? The traditional view is something like this, the so-called U-shaped generalization curve. The idea is that if H is small, that is, there are too few candidate functions, then your performance on the training sample and on the test sample, the red and blue curves correspondingly, is not very good. On the x-axis here you have model complexity; on the y-axis you have prediction error. So your generalization is not good, because the prediction error on the test sample is bad; that's underfitting. If, on the other hand, your model complexity is very high, then your performance on the training data is good, but the performance on the test data is not, because you are overfitting to the training sample.
The goal is to find the bottom of this curve, where the test performance is optimized; that's optimal generalization. And the classical corollary of this is that a model with zero training error is overfit and will generalize poorly. We will call this zero-training-error regime interpolation, because mathematically it is simply classical interpolation.

Now let me outline the classical analysis and then point out what goes wrong with it in view of the empirical observations. What is the goal of machine learning? As you have just seen, it is to find this f*. The goal of empirical risk minimization is to find f*_ERM. What's the difference? There are two differences: one, for ERM there is this class of functions H; and two, the average is taken over the training data rather than the unseen data. But other than that, it mathematically looks very, very similar. Now, Vapnik, in his book Statistical Learning Theory, basically said that the theory of induction is based on the uniform law of large numbers, and that effective methods of inference must include capacity control. That was the foundation of learning theory, at least according to this view, Vapnik's view. Let me unfold this a little bit. First, what is a uniform law of large numbers? It simply means that the empirical loss of any f in H approximates the expected loss of f. So this is a law-of-large-numbers type of statement, and it is uniform over the function class H. Second, capacity control just means that H contains functions that approximate f*. Well, if H doesn't contain a function that approximates f*, then there is no hope of connecting the two. And notice that the goal of the learning procedure is to find f*_ERM close to f*, so if you can connect those two things, you're doing well. It is easy to see, well, not immediately, but with a little bit of calculation, that if you have one and two, then you can connect f*_ERM and f*, and this is good because it means the performance of your algorithm is essentially the same as the optimal one. So that's machine learning theory.

Let me now be a little more precise about this. The uniform laws of large numbers usually have the following form, and there are many different versions of them. On the left of the inequality you have the expected risk, which is the future. On the right you have the empirical risk, and the uniform law of large numbers says that the expected risk is bounded by the empirical risk plus some term, usually something of the form square root of C over n; there are different forms as well. Here C is some notion of the complexity of the function class H, and in some cases it can be data dependent. Now, it's kind of nice to call these WYSIWYG bounds, what you see is what you get: what's on the left is what you get, that's the future, and the empirical risk on the right is what you see on your training set. Essentially they say that what you see on the training set is what you will get in the future.
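To have the formulas in front of us, here is a schematic way of writing the target of learning, the ERM objective, and the "WYSIWYG" bound just described; this is a sketch only, and the exact capacity term, constants and logarithmic factors differ between versions.

```latex
% target of learning: minimize the expected (future) loss over the data distribution P
f^{*} \;=\; \arg\min_{f}\; \mathbb{E}_{(x,y)\sim P}\big[\ell(f(x),y)\big]

% empirical risk minimization over the chosen function class H
f^{*}_{\mathrm{ERM}} \;=\; \arg\min_{f\in\mathcal{H}}\; \frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i),y_i\big)

% uniform ("WYSIWYG") bound: for every f in H, with high probability,
\mathbb{E}_{(x,y)\sim P}\big[\ell(f(x),y)\big]
\;\le\;
\frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i),y_i\big)
\;+\;
O\!\left(\sqrt{\frac{C(\mathcal{H})}{n}}\right)
```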
The upshot of this is a nice figure from Vapnik's textbook, which is similar to the one I showed before but a little more precise. On the x-axis you have the size of your class H; on the y-axis you have the error, the risk. You have the empirical risk, which is the first part of the bound; this goes down as H increases, because with more functions you can fit better. But the model complexity term, what he calls the confidence interval, goes up. Because the bound is the sum of the two, it has this U-shaped curve, and this is the optimum. I've spent some time on this to set it up, because to make clear what goes wrong with the empirical observations, we have to fix the idea of what is to be expected.

And now here are some really interesting observations. In the paper by Zhang et al. called "Understanding deep learning requires rethinking generalization", they make the case that you can train a neural network to 100% training accuracy, so it fits the data perfectly, and yet the test accuracy is still quite good. So if there is overfitting, it's not large: they get almost 90% test accuracy, which is pretty remarkable, using some standard neural network architecture. This suggests that interpolation, that is, fitting the training data exactly, doesn't overfit, or at least doesn't overfit much. This, I should point out, is not a new observation. In fact, in 1998 there was a well-known paper, "Boosting the Margin", by Schapire, Freund, Bartlett and Lee. They analyzed boosting, and the whole analysis is based on the observation that the test error of the generated hypothesis, I'm quoting, usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. So it's exactly this phenomenon: even after the training error, well, classification error, reaches zero, you still get better performance. So this kind of phenomenon was probably observed in the mid-90s, whether for the first time or not I don't know.

Okay. Now let me point out that this is suggestive but doesn't directly invalidate these bounds, because we don't know what the true accuracy should be. Maybe the true accuracy is 100%, maybe the data are totally separable, and then of course this observation is not inconsistent with the bound we have. So how do we test model complexity? Let me give you an experimental setup; it's a kind of physics-like experiment. There is a simple setup in which we can test model complexity, and the idea is to add label noise. Let me just tell you how it works. Imagine that I have a data set with two classes, and suppose my classes are actually linearly separable, like here: there is a line and the line separates them perfectly, so the test error should be zero for the optimal predictor. Now, notice that interpolating here is okay, because the line interpolates: everything to the left I'll call red and everything to the right I'll call blue. That's my interpolation. Now what do I do? I can add label noise. And what do I mean by label noise?
The easiest way to do it is probably just to flip some of the labels: I take 10% of the labels and flip them, at random. You can see I flipped these two and this one. Now, if I am to separate the red from the blue, it's going to be a very complicated boundary; I need to have something like this. I'm not doing a great job drawing it, but you can see that the model complexity required to interpolate this data is now dramatically higher than it was for the line. So by adding label noise I'm forcing the model complexity of the interpolating model to increase dramatically, and presumably that would require the model to overfit. Notice, however, that the optimal model does not change; the optimal is still this line. If you think about how the flipping works, you see that the optimal doesn't change, because this is still the best; it's just that some labels get flipped, as long as the flipping rate is less than 50%. So this is interesting, because adding label noise causes the optimal model and the interpolating model to diverge: one is very simple, the other is rather complex. And what we expect theoretically is that overfitting will become bad as the model complexity grows.

But let's see what actually happens in practice. This is an experiment which we did with my students, Siyuan Ma and Soumik Mandal. We did it, I think, on MNIST, which is a 10-class handwritten digit classification problem. On the x-axis I have the label noise, so we add more and more label noise; on the y-axis I have the classification error. Notice that zero classification error is the best, and 90% classification error is random, since it's a 10-class problem. The green line here is the theoretical optimum: when there is no label noise you cannot do better than 0%, and as you add noise it is just a straight line, because at 100% label noise the optimal prediction is completely random. So it's a line connecting zero at the left corner to 90% at the right corner.

Now, what am I showing here? I'm showing several methods, but in particular you can look at the Laplace kernel, because it's the nicest curve here. If you're not familiar with kernel machines, it doesn't really matter what it is exactly; the important part is that it's a method which fits my training data exactly. And by exactly I literally mean exactly, with zero training square loss; well, it's not technically zero, but it's machine zero, 10 to the minus 25 or so, very, very small, essentially at precision level. You can do something similar with a neural network, but with a kernel machine you have an analytical solution by matrix inversion; with a neural network you have to train it, so it's not possible to get quite such a low error, but you still get something quite small in square loss and zero in classification loss. Okay, so that's basically what I have. And now we observe a pretty remarkable thing: even when the label noise is quite high, like 70%, the classification error of my Laplace kernel machine is maybe only 5 to 8% worse than the optimum.
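To make the setup concrete, here is a minimal sketch of this kind of experiment, not the exact setup from the paper: it uses scikit-learn's small digits data set as a stand-in for MNIST, corrupts a fraction of the training labels, and fits a near-interpolating Laplace-kernel classifier by solving the kernel linear system directly. The gamma value, the tiny ridge added for numerical stability, and the train/test split are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import laplacian_kernel

# Label-noise experiment sketch: an interpolating Laplace-kernel fit
# (one-vs-all regression on one-hot labels) under corrupted training labels.
X, y = load_digits(return_X_y=True)            # 10-class digits, stand-in for MNIST
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

def corrupt(labels, noise_rate, n_classes=10, seed=0):
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    labels[flip] = rng.integers(0, n_classes, flip.sum())   # uniformly random label
    return labels

for noise in [0.0, 0.3, 0.7]:
    y_noisy = corrupt(ytr, noise)
    K = laplacian_kernel(Xtr, Xtr, gamma=1e-2)               # kernel matrix on training data
    Y = np.eye(10)[y_noisy]                                  # one-hot targets
    alpha = np.linalg.solve(K + 1e-8 * np.eye(len(Xtr)), Y)  # near-interpolating fit
    pred = laplacian_kernel(Xte, Xtr, gamma=1e-2) @ alpha
    err = np.mean(pred.argmax(axis=1) != yte)
    print(f"label noise {noise:.0%}: test error {err:.3f}")
```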
And that's pretty remarkable; think about what it's doing. It's fitting 70% noise, so 70% of my points are junk. I'm fitting a tremendous amount of noise, yet I'm paying very little for it; there is very, very little overfitting. That seems difficult to reconcile with the bounds we saw before, and let me be a little more analytic about why. What was the bound we had before? We had this WYSIWYG-type bound, and the question is: can uniform bounds like this account for generalization under interpolation? On the left you have the test loss, and in the presence of noise in my data the test loss is not zero. The training loss, however, is zero, so that part disappears, and now the test loss needs to be bounded by the model complexity term alone. And this is quite difficult. Why? Well, let's think about what this bound would have to look like. If you look at a high noise level, like 80%, then to explain what we see, the bound on the test loss would have to sit between 0.7 on the left and 0.9 on the right. Why? Because if the bound is bigger than 0.9, it's useless, since it tells me nothing better than random; and if it's smaller than 0.7, that's better than the best possible, so it's wrong. (Roughly: at 80% noise on a 10-class problem, random guessing is wrong 90% of the time, while predicting the true class is wrong only when the label was replaced by one of the other nine classes, about 0.8 times 9/10, which is 72% of the time, hence the 0.7.) So the bound has to sit in this very narrow band between 0.7 and 0.9 to be both correct and useful.

The problem is that it is very difficult to have a bound like that in any realistic scenario. Why? First, there are constants, and even a constant factor of two will already break this bound; and there are all sorts of other factors as well. So there are really no bounds like that. Okay, I should moderate this slightly: in some very special cases there can be bounds which are almost exact, for example for linear regression with Gaussian assumptions, and recently there was a nice work from Nati Srebro's group analyzing one specific example where this is possible. But generally you would not expect such a bound to exist. And conceptually, how would the quantity C, the model complexity, know about the Bayes risk? Somehow it has to know about this number on the left, and that seems difficult. There was actually some recent work, in particular by Nagarajan and Kolter and by Bartlett and Long, showing that in some fairly general settings such bounds do not exist.

So that is maybe a summary of the crisis we have for generalization: the observed phenomena of deep learning seem to contradict what we would expect from the theory. Theory and experiments do not fit. Now you could say, well, maybe this is some sort of marginal thing. But really, in some sense, interpolation is best practice for deep learning. I really liked this quote from Ruslan Salakhutdinov's tutorial; he said that the best way to solve the problem from a practical standpoint is: you build a big system and you basically want to make sure you hit zero training error. So you really want a system which interpolates.
Now, if you want state-of-the-art results this is not the end; you do other things to the system. But if you are already at the point where you get zero training error, it already works okay, you already have some generalization. And really the practice of machine learning has been building bigger and bigger systems. This is a summary from 2017 of different architectures; each circle corresponds to an architecture. The small ones have about 5 million parameters; the big ones about 150 million. The one I showed you on the first slide is actually a small one, with only about 6 million parameters. That was 2017. In 2020 we have GPT-3 with 175 billion parameters; this is drawn to scale, the area of a circle corresponds to the number of parameters. And in 2021 there is something called the Switch Transformer, which has on the order of a trillion parameters; by the way, you can see the other architectures are all still here. So there has been this relentless growth in the number of parameters, these extremely, extremely large systems. Although, arguably, one should add that they are also trained on very large data sets.

There has been some historical recognition of this. Yann LeCun, of course, has on many occasions said something like: deep learning breaks the basic rules of statistics. And remarkably, in 1995, Leo Breiman, a statistician from Berkeley, wrote a very interesting note called "Reflections After Refereeing Papers for NIPS". He asked several questions, and the first one was: why don't heavily parameterized neural networks overfit the data? So it was clear to him already in 1995 that there was some sort of issue. So that's maybe the summary of the empirical findings, and here is a nice way to state it: the goal of optimization, which is simply interpolating, or minimizing the empirical loss, aligns with the goal of machine learning, which is minimizing the expected loss. This is, I think, truly unexpected from a classical statistical point of view, because traditionally we use some sort of capacity control or similar ideas based on regularization and so on. So that's the meat of this crisis of generalization. To summarize: if we want a new theory of induction, which we definitely do want, it cannot be based on these classical uniform laws of large numbers with capacity control. And the question is: what's next? Any questions? Maybe this is a nice point to take a short break and see if there are any.

Yeah, I see Brink has a question. Hi, yeah. So you mentioned this paper earlier, the Bartlett paper on boosting the margin, and some of these interpolation-without-overfitting phenomena are seen in that setting as well. I'm wondering whether there has been much work studying boosting as an analogy for deep learning, because it seems like at the moment, after the whole NTK thing, most of it is concentrated around kernels. I'm wondering if you think there is stuff to be understood about deep learning from looking at boosting. I would say that's an interesting question. I think there was a lot of study of boosting early on, in the late nineties and early 2000s, and recently much less. Yeah, I don't know why.
Yeah, so my impression is that there hasn't been as much study. I mean, the thing with NTK is that it gives a very direct connection between kernels and neural networks, which I guess is not the case with boosting. Yeah, that's an interesting question. I should point out that boosting is still used very widely; what is it called, XGBoost, right? That's a form of boosting, and it's one of the most popular methods; if you look at Kaggle, it's very, very popular. But it's probably not used as much to analyze deep learning. Yeah, that's a good question.

I have a question; it's more of a historical question about the Zhang et al. paper. Mm-hmm. We know that neural networks are highly expressive, right? So it's not at all surprising that if you make a big neural network, you should be able to get zero training error. We also knew empirically that they generalize well, long before the Zhang et al. paper. So the one question, maybe it's more of a comment, is: why was it this particular paper that so caught everyone's imagination? The way I read the paper, what it really shows is that SGD is a very good optimizer, because they are able to find zero-training-error solutions on randomly labeled data. But I was not surprised at all that such solutions exist, because the networks are highly expressive.

That's right. So in some sense it is perhaps not surprising: there are tons and tons of parameters, why shouldn't you be able to fit the data? But I think the fact that... There are proofs, right? There are expressivity proofs. Sure, there are expressivity proofs, but I think what you say is correct. What is not obvious is that this kind of gradient-based method can find such a solution, because the proofs are very generic; they just say that some class of functions is big. But why would SGD find something like that? The classical analyses are for convex problems, and this is clearly very, very non-convex. So I think the surprising thing is that the non-convexity doesn't matter. And they really showed it very systematically and nicely, adding noise and so on. So that's what caught the imagination, I think; it was a historically very significant piece of work.

Okay, so let me continue if there are no further questions. So if your theory of induction cannot be based on this, where next? I think it is actually pretty remarkable, in retrospect, to look at the one-nearest-neighbor classifier. One nearest neighbor is probably the most classical predictor, certainly one of the most classical, along with linear regression. First, it is an interpolating classifier, because one nearest neighbor interpolates. Second, it has a non-trivial and sharp performance guarantee: its risk is at most twice the Bayes risk, due to Cover and Hart in 1967. They actually prove more than that, and when I say twice there is a more precise statement, but it is bounded by twice. And this is not explained, and as far as I know never was explained, by empirical risk minimization. So in retrospect it is kind of interesting, because we thought of Vapnik's theory, Vapnik's claim that this is the theory of learning, as reasonable, and yet it doesn't explain this very simple algorithm; one nearest neighbor couldn't be simpler.
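Just to spell out how simple that rule is, here is a one-nearest-neighbor predictor in a few lines; by construction it returns the training label at every training point, so it interpolates.

```python
import numpy as np

def one_nn_predict(x_query, X_train, y_train):
    # Predict the label of the closest training point. At a training point
    # itself, the closest point is that point, so the rule reproduces every
    # training label exactly: it interpolates, with no capacity control.
    i = np.argmin(np.linalg.norm(X_train - x_query, axis=1))
    return y_train[i]
```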
So there was some disconnect here, which went unnoticed. And actually, twice the Bayes risk is maybe still far from optimal, but you can push this further. There is a very interesting singular-kernel interpolation scheme, originally known as Shepard's interpolation. And there was a remarkable result by Devroye, Györfi and Krzyżak, I'm probably not pronouncing the names correctly, in 1998, where they showed that with a singular kernel this scheme is consistent for regression. The scheme is the classical Nadaraya-Watson kernel smoother, given by this equation, but using a singular kernel: the kernel has a singularity, it has a pole, it looks something like this. In any case, you can push this further, and we showed, with Daniel Hsu, and then in a follow-up with Sasha Rakhlin and Sasha Tsybakov, that if you allow this nearest-neighbor type of scheme with these weights, then for a class of singular kernels you can in fact even get optimal rates and so on. So you can get essentially statistically optimal predictors, which is pretty strange, because they look crazy. And this is what they look like. In the picture on the right, the data is one dimensional, the true function is y = x, so the data is y = x plus noise. The blue line is the optimal prediction; the red curve is what you get from one of these schemes. You can see that at every data point this red curve is not close to the blue line, and a priori the blue line is the best you could possibly do. Yet when you get enough data, somehow on average the red curve converges to the blue line and becomes optimal. So it is a very weird, very non-intuitive kind of predictor. And, in some sense you can quantify this, in high dimension it becomes even easier for it to be optimal; there is some sort of blessing of dimensionality here. But even in one dimension it is optimal, and that's what you get. As you can see, this is very counterintuitive from the traditional point of view, because it just looks strange.
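Here is a small sketch of this kind of estimator, a Nadaraya-Watson average with Shepard-style singular weights; the particular kernel, exponent, noise level, and sample size are illustrative choices, not the exact scheme from the papers.

```python
import numpy as np

# Nadaraya-Watson average with a singular, Shepard-style weight:
# w_i(x) proportional to |x - x_i|^(-a). The pole at each x_i forces the
# estimate to pass (essentially) through every data point, yet away from
# the data it is a local average, and with enough data it tracks y = x.
rng = np.random.default_rng(0)
n = 200
x_train = np.sort(rng.uniform(0.0, 1.0, n))
y_train = x_train + 0.3 * rng.normal(size=n)        # true function y = x, plus noise

def singular_nw(x, x_train, y_train, a=1.0, eps=1e-12):
    d = np.abs(x[:, None] - x_train[None, :]) + eps  # eps only avoids division by zero
    w = d ** (-a)
    w /= w.sum(axis=1, keepdims=True)
    return w @ y_train

xs = np.linspace(0.0, 1.0, 2000)
pred = singular_nw(xs, x_train, y_train)
# pred is spiky near every (x_i, y_i) but stays close to the line y = x in between.
```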
Maybe I'll skip this, adversarial examples. So here is what we have so far: (a) interpolation empirically aligns with generalization, that's the empirical observation; (b) a theory of interpolation cannot be based on uniform bounds; and (c) there are at least some methods, these nearest-neighbor-type interpolating methods, which have statistical validity. Now, there is an obvious mismatch between (a) and (c), because these nearest-neighbor-type methods, like one nearest neighbor, have no capacity control, no complexity control, and no optimization, yet practical methods always use optimization and, in some sense, use the largest feasible models. In some sense you build a model based on how many GPUs you have: Google has, say, 10,000 GPUs, and they build a model with a trillion parameters, something like that. So the key question is the dependence of generalization on model complexity, and in particular on the number of parameters. You can argue whether the number of parameters is the right measure of model complexity; perhaps it is not. But it is the thing we observe: when you choose a model, the number of parameters is what you have; you don't have spectral norms or these more complicated quantities.

And this is where the double descent curve comes in. If you look at the standard curve, on the left, you have underfitting and overfitting, and as you increase the complexity you go through the U curve. Then the question is: what happens if you increase the complexity even further, past the point of interpolation? At that point the model is complex enough to fit the data exactly, and the performance is actually not good. But as you keep increasing the model complexity, it turns out that all of these models to the right are interpolating, and the larger models actually generalize better. That's what we call double descent, because there are two descents. This is joint work with Daniel Hsu, Siyuan Ma and Soumik Mandal. So there is the classical descent, and then there is this modern descent. The interesting thing is that the modern descent often, though not always, continues all the way as the number of parameters goes to infinity: the more parameters you have, the better your performance becomes. Of course the risk doesn't go to zero, but it decreases monotonically in many cases. And often the bottom of the U is actually higher than the curve on the right, so very complex models can outperform classical models. That's what we observe, and that's what people in deep learning have been observing; it was the impetus for building such large models.

There have been a number of different observations of this type. Maybe let me skip this. I don't have a lot of time left, but let me say a few words about the mechanism behind this. Let me take a very simple model, a random ReLU network. You have two layers, and the first layer is fixed, it's just random; so you can say it's a random ReLU feature model. And this is what it looks like; I'm doing it in one dimension. If you have three neurons, you get a nice parametric fit, just a nice curve; I have, I think, 11 data points in one dimension. If I increase the number of neurons to 30, I get some sort of classical overfitting: I fit the data, but it looks awful and probably has no generalization power, it oscillates all over the place. And if I increase the number of neurons to 3000, this actually becomes a nice non-parametric model. Of course the model is technically parametric, it has, whatever, 3000 parameters, but it is non-parametric in the sense that it is close to the limit; if I took 3 million neurons, it would look essentially the same. So it's pretty nice, because you can see that as you increase the number of parameters, you go from parametric models, to some sort of bad overfitting, to non-parametric models. So you can put both on the same axis, which I think bodes well for having some sort of theory eventually, although we don't have that yet, not completely. Okay, maybe I'll skip this.
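Here is a minimal sketch of such a random ReLU feature model in one dimension: the first layer is random and fixed, and only the output layer is fit, here by (minimum-norm) least squares. The target function, the weight distributions, and the widths are illustrative choices, and the exact pictures depend on them.

```python
import numpy as np

# Random ReLU feature model in one dimension: phi_k(x) = relu(w_k * x + b_k)
# with w_k, b_k random and fixed; only the output layer is fit.
rng = np.random.default_rng(0)
n = 11
x = np.linspace(-1, 1, n)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=n)    # 11 noisy 1-d data points (illustrative)

def random_relu_features(x, width, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=width)
    b = rng.uniform(-1.0, 1.0, size=width)
    return np.maximum(0.0, np.outer(x, w) + b)       # shape (len(x), width)

xs = np.linspace(-1, 1, 500)
fits = {}
for width in [3, 30, 3000]:
    Phi = random_relu_features(x, width)
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # min-norm least-squares output layer
    fits[width] = random_relu_features(xs, width) @ coef
# width 3: a crude parametric fit; width 30 (just past interpolation): tends to
# oscillate between the data points; width 3000: close to the smooth, effectively
# non-parametric infinite-width limit.
```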
One point I would like to make concerns what happens as you increase the number of features; it's a similar model, but I don't want to discuss exactly what it is. What you can show is that the norm of the predictor initially increases with the number of features and then decreases again. That's because when the number of features equals the number of data points, you have a unique solution, and that solution is usually awful, because you have to fit the noise. But when the number of features is much larger than the number of data points, you have many solutions which fit the data, and if you choose the minimum-norm solution, it has some desirable properties. So more features, more over-parameterization, gives you a bigger space to choose your interpolating solutions from, and among those solutions there are some which are good; it happens that the minimum-norm solution is frequently, though not always, good. Basically, more features give you a better approximation to the true minimum-norm solution. Very briefly, that's what it is; let me skip this.

There has been quite a bit of work on trying to understand this, in particular for interpolating linear models and for random feature models. Recently there was a very nice work by Holzmüller, who showed that there is a lower bound. If you look at the formula at the bottom, ah, for some reason I cannot draw, in any case, it is a lower bound on the loss, where P is the number of features and N is the number of data points. It basically says that when P is close to N, this lower bound is quite large, so you have this peak; but when P is much larger than N, it goes to zero. It's a lower bound, so it doesn't tell you about the upper bound, but you can have very good predictors there. And the nice thing about this result is that it is quite general; it works for essentially any model.

Okay, I am almost out of time, so let me do a summary, and then I want to say one word about optimization. To summarize, you can think of a framework for modern machine learning as some sort of Occam's razor: you would like to maximize smoothness subject to interpolating the data. What do I mean by smoothness? It could come from some sort of averaging process, and I didn't really discuss averaging very much, or it could be some sort of minimum-norm solution. When you have a lot of random features, there is some sort of self-smoothing process going on. And there are really three ways to increase smoothness. You could take explicit minimum-functional-norm solutions, like exact kernel machines, or maybe approximate them with random features. You could get it from optimization, implicitly; we understand this, again, for these linear or random-feature-type models, not so much for neural networks in general. Or you could do averaging, like bagging or boosting-type processes; well, boosting is somehow a bit different. And interestingly, all of these ways coincide for kernel machines, so there is something quite nice about kernels. Okay, so that's all about generalization. Maybe just one more point here: I think the nice thing is that overfitting shows up in a different light in this curve. Overfitting is really a band of parameters around the interpolation threshold.
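That band around the interpolation threshold is also where the norm of the minimum-norm fit peaks. Here is a small sketch of the norm behavior described above, using an illustrative random feature map and the minimum-norm (pseudoinverse) fit; the peak near P = N and the decrease for P much larger than N are the qualitative points, the specific numbers are not meaningful.

```python
import numpy as np

# Norm of the minimum-norm fit as the number of features P grows: it typically
# peaks near P = N (a unique, noise-fitting solution) and then decreases again
# once P >> N, when we can pick the smallest-norm interpolating solution.
rng = np.random.default_rng(1)
N, d = 40, 10
X = rng.normal(size=(N, d))
y = X[:, 0] + 0.5 * rng.normal(size=N)               # noisy linear target (illustrative)

def random_features(X, P, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], P))
    return np.cos(X @ W)                              # a simple random nonlinear feature map

for P in [10, 30, 40, 60, 200, 2000]:
    Phi = random_features(X, P)
    beta = np.linalg.pinv(Phi) @ y                    # minimum-norm least-squares solution
    print(f"P = {P:5d}   train MSE = {np.mean((Phi @ beta - y) ** 2):.2e}"
          f"   ||beta|| = {np.linalg.norm(beta):.2f}")
```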
And you can combat overfitting by decreasing the number of parameters, or by adding regularization, or by increasing the number of parameters and essentially building bigger systems. Now, given that I don't have time, let me just point out something quite interesting about the landscapes of over-parameterized models and why neural networks seem to be not too difficult to optimize. The landscapes, of course, are not convex. Classically, we think of non-convex as something like the picture on the left, which is terrible: we cannot find anything by gradient descent, or rather we can find some local minimum, which is probably no good. But this is actually a rather misleading way to think about over-parameterized systems; you really need to think of an over-parameterized system as something like the picture on the right. These landscapes have manifolds of minima, and every minimum is a global minimum. You can see that on the left, gradient descent will not work: if you roll a ball downhill, it just gets stuck in the closest local minimum. But on the right, if you roll a ball downhill, it actually goes to a global minimum. So this is a benign landscape, even though it is non-convex, and it is non-convex in a very strong sense: it is locally non-convex at every point; as long as this manifold of minima has curvature, it is basically non-convex everywhere. This can be analyzed, and we argued that there is a classical idea, the Polyak-Łojasiewicz condition, which over-parameterized systems satisfy. But let me stop here.

So this is maybe the summary. You have the classical models, where you need careful parameter selection, there are many non-global minima, and you have issues with optimization. And there are the modern models, where essentially you just take the model to be quite large, you don't need a lot of regularization, and your gradient descent method converges more or less for free, which is pretty remarkable and is probably what makes deep learning possible to a large degree. So I'll stop here, and I'm very happy to take any questions.

Okay, thank you very much, Mikhail. Any questions from our audience?