So, can you all see my screen? Yes? Good to go. All right. Thank you very much to the organizers for putting together a very interesting program. I'm in the Department of Statistics, and today I'm going to talk about a statistical perspective on understanding generalization and the test error. This is joint work with Licong Lin, an exceptionally talented PhD student starting at Berkeley.

It looks like I shared the wrong slides, I apologize; I have a longer version of the talk that I gave earlier, and I need to share the short version, which is in a subfolder. Now we need some waiting music. Rookie mistake, but all right, there we go.

So let's start with the bias-variance decomposition, which is one of the most basic but also one of the most useful and important concepts in statistics, machine learning, and data science. We are trying to solve a prediction problem: we train some model on training data, call it f-hat, and then we want to predict the label y by applying f-hat to a new test point. We can decompose the squared prediction error into the intrinsic noise in y, plus the squared deviation between the mean of y and the mean of our prediction f-hat, which is the squared bias, plus the deviation of f-hat from its own mean, which is the variance. This is really fundamental, and it applies essentially universally to prediction problems, not only estimation problems.

Usually we think of this as wanting to strike the right balance, as shown in this picture: we want a model that is not too large, so it doesn't have too much variance, but large enough that it has small bias. Part of the interest in neural networks is that, experimentally, they somehow achieve a very good bias-variance trade-off, and there has been a lot of work in the direction of double descent and so on; I'll come back to that a little later.

Our perspective is different. We are interested in attributing the variance to the various components that are actually changing. If I train a neural network, I use a different random initialization for every network that I train. When new training data comes in, there is also variability due to the input features, and there is label noise. All of these components contribute to f-hat, so they contribute to its variability. There are other components too, such as the randomness in the optimization algorithm: if I use standard stochastic optimization and choose the mini-batches randomly, those choices also affect the randomness in f-hat, and hence the variance. So our question is: how can we, in a principled and unequivocal way, attribute the variability to these different components? It turns out there is a classical idea in statistics, at least a hundred years old, called ANOVA, the analysis of variance, and that is what we propose to use here. I'll explain how that goes next.
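Before moving to the concrete setting, let me write the decomposition I just described in symbols; the notation here is mine, just to fix ideas. For a test point x with label y drawn independently of the training data,

\[
\mathbb{E}\big[(y - \hat f(x))^2\big]
= \underbrace{\operatorname{Var}(y)}_{\text{intrinsic noise}}
+ \underbrace{\big(\mathbb{E}[y] - \mathbb{E}[\hat f(x)]\big)^2}_{\text{squared bias}}
+ \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{variance}},
\]

where the mean and variance of \(\hat f(x)\) are taken over everything that is random in training: the data, the label noise, the initialization. It is the last term, the variance, that we want to split further.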
But we don't just propose this at a high level; we also work out examples with random features models where we can actually calculate these variance components. To explain that, let me describe the concrete setting. We have n data points with features x_i and outcomes y_i, and we assume the true model that generates the data is linear, plus label noise epsilon_i. In matrix form, Y = X beta + epsilon. To simplify the analysis and get analytically tractable results, we make a lot of assumptions: the feature vectors x_i have i.i.d. entries, the noise is normal with common variance sigma squared, and the parameter beta is random, normalized so that its expected squared norm is alpha squared, where alpha squared is the signal strength. The randomness in beta could in principle also affect the variance, but its effect turns out to be vanishing, so we do not need to include it among the variance components.

We then fit a very simple linear random features model. I won't have time to talk about the nonlinear random features model that we also study, but the linear case is very simple: the prediction model is b transpose W x, where the weight matrix W is a uniformly random orthogonal matrix. So that's the data model and the prediction model; how do we train it? With something analytically tractable: an L2 loss, that is, mean squared error, plus a ridge regularization term. We then get an explicit predictor, which is just the ridge regression of Y on X times W transpose.

In red on the slide you can see the sources of randomness I am going to study. When I train this f-hat (this should probably say f-hat here), it depends on X, the training data; it depends on epsilon, implicitly, through Y; and it depends on the initialization W that I chose. So it depends on all three of these randomly chosen components, and we want to decompose the variability, the variance, of f-hat into these three components.
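Since the slide itself is not reproduced here, a sketch of the setup in symbols, in my own notation; the exact scaling of the ridge penalty (lambda versus n lambda) is from memory, so take the form rather than the constants:

\[
Y = X\beta + \varepsilon, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2), \qquad \mathbb{E}\,\|\beta\|_2^2 = \alpha^2,
\]
\[
f_b(x) = b^\top W x, \qquad W \in \mathbb{R}^{p \times d}\ \text{uniformly random (orthonormal rows, say, when } p \le d\text{)},
\]
\[
\hat b = \arg\min_b\ \|Y - X W^\top b\|_2^2 + n\lambda\,\|b\|_2^2
= \big(W X^\top X W^\top + n\lambda I_p\big)^{-1} W X^\top Y,
\qquad \hat f(x) = \hat b^\top W x,
\]

so \(\hat f\) visibly depends on the three random objects X, epsilon (through Y), and W.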
Just to repeat, this is the bias-variance decomposition, and we would like to understand the contributions to the variance. A key inspiration for us was the nice work of d'Ascoli and collaborators from last year, which looked at a very similar model, with a few technical differences, and decomposed the variance in a specifically chosen order: first take the conditional expectation given W and X, then the conditional expectation given X. By the tower property of conditional expectation, the variance can then be written as a sum of terms in this way. This is one of the things that really got us thinking: what is the meaning of these terms, what is their interpretation, and is there something more canonical to do here? The statistical idea of ANOVA turns out to be an answer, so let me explain what ANOVA is.

Let's give these quantities even shorter names: X are the samples, so I'll call that S; W is the initialization, so I'll call it I; and epsilon is the label noise, or outcome noise, so I'll call it L. So it's S, I, L. The variance can then be written as a sum of seven terms: the main effects V_S, V_I, V_L, the pairwise interactions V_SI, V_SL, V_IL, and the triple interaction V_SIL. They are defined as follows. For every index A in this set, I compute the conditional expectation of f-hat given that quantity, that is, given X, W, or epsilon. That is now a quantity that depends only on A, whatever A is, and I take its variance. That is V_A, called the marginal effect or main effect of varying A alone: on average, if I change only A, what is the effect of varying A by itself on f-hat? Then there are the interaction terms, defined very similarly: you take the conditional expectation with respect to A and B and subtract the marginal effects; V_AB is called the second-order interaction effect between A and B. The three-way interaction V_ABC, the third-order interaction, is defined analogously. It turns out all of these terms are non-negative. Of course, this is defined for any number of sources, not just three, and for completely general random variables; it is a really well-developed classical tool in statistics, and we are applying it here to a particular machine learning problem.
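In symbols, with S, I, L the three independent sources just named, these components are (again in my notation):

\[
V_A = \operatorname{Var}\big(\mathbb{E}[\hat f \mid A]\big), \qquad A \in \{S, I, L\},
\]
\[
V_{AB} = \operatorname{Var}\big(\mathbb{E}[\hat f \mid A, B]\big) - V_A - V_B, \qquad
V_{SIL} = \operatorname{Var}(\hat f) - \sum_{A} V_A - \sum_{A < B} V_{AB},
\]

and the decomposition itself reads

\[
\operatorname{Var}(\hat f) = V_S + V_I + V_L + V_{SI} + V_{SL} + V_{IL} + V_{SIL},
\]

with every term non-negative when the sources are independent; this is the classical functional ANOVA (Sobol) decomposition.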
To give you the results, we look at an asymptotic regime that has appeared in a few talks at this conference: the data dimension d grows to infinity, the number of parameters p grows proportionally, so p/d tends to a constant, a sort of parameterization level (p is the number of random features in the intermediate layer), and d/n also tends to a constant. There are a few quantities we need to express the results. Gamma is just pi times delta. The theta_j are the so-called resolvent moments: take the famous Marchenko-Pastur distribution with aspect-ratio parameter gamma; theta_j is the expectation of 1 over (x plus lambda) to the power j under that distribution, and we only need j equal to one and two. There are explicit formulas for these; for j equal to one it is just the Stieltjes transform, but the details don't matter here. There is also something more conceptual: besides the regularization parameter lambda, there is an effective regularization parameter lambda-tilde, slightly larger, due to the additional randomness induced by the random feature projection step.

Then there is one result, which takes a whole slide to state: in the linear random features model, under the asymptotics I stated, we find the almost-sure limits, as the dimension, the sample size, and the number of random features diverge together, of each of the seven components: the main effect of varying the samples X, the labels, the initialization, and so on, plus all the interactions. These are closed-form expressions, so in principle we know everything here, but it is not really possible to remember them; instead we plot them and analyze them in terms of their monotonicity, unimodality, and so on.

Something interesting is that two of the terms are zero. What does that mean? The main effect V_L tends to zero, so in this particular model the label noise affects the variance only through its interaction effect with the samples, and through the three-way interaction of all three sources together. Let me explain why I think such a result is meaningful. First, some plots. Here is a plot showing the bias and the non-zero variance components, for a particular reasonable choice of signal strength, noise level and so on, as a function of the dimension, effectively, if you wish. There is a large literature on double descent, and I'll give some references later; you see the overall mean squared error increase and then decrease. More interestingly, among the variance components, that is, everything other than the blue curve, the largest term is this interaction effect. The reason that is interesting and subtle is that if you want to say something like "the label noise is the biggest component of the variance," such a claim is very hard to make, because it is the interaction term that is large, not a marginal effect; a lot of the variance sits in the interactions. To me, at least, that was quite surprising, and I'm still puzzled by it every time I think about it.

We can make many such plots; here we display how these variance components behave as a function of the parameterization level, the number of neurons, and of the data dimension. These are heat maps, where yellow indicates high values. You can see that the components are large along this curve, the interpolation threshold, where, roughly speaking, a certain matrix has a poor condition number. You can then compare with what happens under optimal regularization; what I showed so far was for some fixed regularization parameter. With the optimal regularization parameter the heat maps look much different, and if you look more carefully, you will see that the interaction effect between the samples and the initialization is the one that decreases by a huge amount. The color scales here are not the same, I apologize: here it goes up to something like 3.6 and here to about 0.06, so it is roughly fifty times smaller. So if you regularize optimally, this sample-initialization interaction is the one that changes the most.
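To make the meaning of these components concrete, here is a brute-force Monte Carlo sketch of how one could estimate a few of them numerically in the linear model. This is not how anything is computed in the paper, which is analytic; all sizes, constants, and the fixed test point are illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the linear random-features setup described above.
# All sizes and constants are illustrative, not the ones from the talk.
n, d, p = 40, 30, 20            # samples, data dimension, random features
sigma, lam = 0.5, 0.1           # label-noise level, ridge parameter
beta = rng.standard_normal(d) / np.sqrt(d)   # fixed signal, ||beta||^2 ~ 1
x0 = rng.standard_normal(d)     # a fixed test point

def draw_X():
    return rng.standard_normal((n, d))

def draw_eps():
    return sigma * rng.standard_normal(n)

def draw_W():
    # p x d matrix with orthonormal rows, Haar-distributed (sign fix for uniformity)
    A = rng.standard_normal((d, p))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))
    return Q.T

def f_hat(X, eps, W):
    """Ridge regression of y on the projected features X W^T, evaluated at x0."""
    y = X @ beta + eps
    Z = X @ W.T
    b = np.linalg.solve(Z.T @ Z + n * lam * np.eye(p), Z.T @ y)
    return (W @ x0) @ b

def cond_mean(fixed, m=200):
    """Monte Carlo estimate of E[f_hat | the variables held in `fixed`]."""
    vals = []
    for _ in range(m):
        X = fixed["X"] if "X" in fixed else draw_X()
        eps = fixed["eps"] if "eps" in fixed else draw_eps()
        W = fixed["W"] if "W" in fixed else draw_W()
        vals.append(f_hat(X, eps, W))
    return np.mean(vals)

def var_of_cond_mean(draws, m_out=200):
    """Var(E[f_hat | A]), where A is whatever `draws` generates each round.
    Crude estimator: the inner Monte Carlo noise biases it slightly upward."""
    return np.var([cond_mean(draws()) for _ in range(m_out)])

V_S = var_of_cond_mean(lambda: {"X": draw_X()})                       # samples
V_L = var_of_cond_mean(lambda: {"eps": draw_eps()})                   # label noise
V_I = var_of_cond_mean(lambda: {"W": draw_W()})                       # initialization
V_SL = (var_of_cond_mean(lambda: {"X": draw_X(), "eps": draw_eps()})  # interaction
        - V_S - V_L)

print(f"V_S={V_S:.4f}  V_L={V_L:.4f}  V_I={V_I:.4f}  V_SL={V_SL:.4f}")
```

If the asymptotic picture is a guide, the V_L estimate should come out near zero, apart from the inner Monte Carlo noise, while the sample-label interaction V_SL does not, which is exactly the point made on the plots above.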
In addition, what I'd like to mention is that the order in which you do a sequential decomposition has a large effect. Remember that one of our inspirations was the nice work of d'Ascoli and collaborators. If we look at two different orders here, say samples, label, initialization versus label, initialization, samples, then in an ABC order the term for C is just the marginal effect, the term for B is the sum of the effects that involve B but not A, that is, its main effect plus its interaction with C, and the term for A is the sum of all the terms that involve A. What you can see is that, for instance, the value attributed to the samples, sigma_S, is really large in one order but small in the other. So, just to hammer the point home: if you want to say that the data points are the ones causing the large variance, that claim depends on the order, unless you look at each of the ANOVA terms separately.

We have also proved a number of results on monotonicity and unimodality. How many minutes do I have? Very little, okay; we are running late. In the interest of time, let me skip the monotonicity and unimodality results. They are very interesting; what I want to say is that without optimal regularization and with optimal regularization the monotonicity and unimodality behavior is really different. There are a number of nice works that inspired us, including by Preetum Nakkiran and collaborators, who spoke earlier, and we have done a quite thorough study of this phenomenon in our paper. There is a lot of related work here, and several talks at this workshop considered related random features models, by Refinetti, Misiakiewicz, and others; I hope I'm pronouncing the names roughly correctly. I also want to mention the most closely related work, by Ben Adlam and Jeffrey Pennington, which was parallel to ours and independently derived an ANOVA decomposition in a slightly different model. There is a neat Twitter post about this, showing how hundred-year-old statistical methods can provide the framework to study modern machine learning; that is, in a sense, our contribution. In terms of the proofs, we use deterministic equivalents from random matrix theory; that is the one-line summary. All right, so thank you very much.