Hi, I'm Stéphane d'Ascoli, a PhD student at ENS working with Facebook. Thanks Marc for the introduction. Today I'm going to present some joint work with collaborators at ENS and Facebook: Maria Refinetti, Levent Sagun, Giulio Biroli and Florent Krzakala. We're going to look at the generalization behavior of neural networks and try to reconcile its surprising features with older ideas.

In the first part of this talk, I'll try to reconcile double descent with the conventional wisdom of the bias-variance trade-off, which you're probably familiar with, but which I'll restate here. The classical statement is that the price to pay for achieving low bias is high variance. In other words, when we increase the number of parameters, or more generally the complexity of the machine learning model, we expect underfitting when the number of parameters is too small and overfitting when it is too large, so there should be a sweet spot in between where the test error is low. This gives the classical U-shaped curve. But this does not seem to be the true story in deep learning. Instead, if you plot the generalization error of a neural network versus the number of parameters, you observe an interpolation threshold, denoted here by the dotted line: to its left you get the classical U-shaped curve, and to its right you instead get a monotonic decrease. At the interpolation threshold, which is the point where the training error vanishes, you get a characteristic peak in the test error. Note that this peak can be suppressed by regularization, for instance by early stopping as in this paper, or even by ensembling, that is, averaging the predictions of differently initialized networks. So what we'd like to do is understand what, in this situation, is happening to the bias and the variance as we knew them before.

To study this question, we look at the random features model, which you'll probably hear about a couple of times during this conference. It was introduced as an approximation to kernel methods on large data sets, and essentially it can be viewed as a two-layer network where the first layer of weights theta is fixed and only the second layer of weights is learned. This model was recently shown, in a paper by Song Mei and Andrea Montanari, to display the double descent curve. But compared to simpler models like linear regression, which also display double descent, in this model we can disentangle the input dimension, in other words the number of nodes in purple, from the number of parameters in blue, and we also have a notion of nonlinearity in the projection. More crucially for the present study, in random features models the fixed first layer can mimic the effect of initialization in neural networks, and therefore lets us study the effect of ensembling.

To introduce notation, which is going to change a lot throughout the conference, I'm sure: we denote by n the number of training examples in the data set X, by p the number of hidden nodes, and by d the input dimension. The nonlinearity sigma is taken to be ReLU, and all the random quantities are sampled from a standard Gaussian.
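To make this setup concrete, here is a minimal NumPy sketch of the random features map just described. The 1/sqrt(d) scaling of the projection and the specific sizes are illustrative assumptions, not values from the talk; only the second layer of weights on top of these features will be learned, while theta stays fixed.

```python
import numpy as np

# Illustrative sizes (assumed): n training examples, d input dimensions, p hidden nodes
n, d, p = 200, 100, 300

rng = np.random.default_rng(0)

# All random quantities are drawn from a standard Gaussian, as in the talk
X = rng.standard_normal((n, d))        # data set
Theta = rng.standard_normal((p, d))    # fixed (untrained) first-layer weights

def random_features(X, Theta, d=d):
    """Project the data through the fixed first layer and the ReLU nonlinearity.
    The 1/sqrt(d) normalization is a common convention, assumed here."""
    return np.maximum(X @ Theta.T / np.sqrt(d), 0.0)   # shape (n, p)

Z = random_features(X, Theta)
# The learner's output is a linear combination of these p features,
# with only the second-layer weights being trained.
```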
So our learner takes the following functional form, and it is used to approximate a ground truth which is a linear function plus some additive noise epsilon. To learn the second layer of weights, we perform ridge regression with regularization parameter lambda, and we evaluate the test error of the model on fresh samples, again Gaussian i.i.d.

What we want to do here is disentangle the various sources of variance, which come from the sources of randomness in the system. This is a sequential decomposition. We first remove the variance with respect to the label noise epsilon; as you can see, once you've averaged over the label noise you get an average predictor which is closer to the true function in blue. Once this is done, we again remove the variance, but this time with respect to the first layer of weights theta, in other words the random features, and again we obtain an average predictor which is closer to the ground truth. Finally, the last source of randomness is the sampling of the data set. Once we've removed all the sources of variance, what remains is the bias, which is the distance from this fully averaged predictor to the ground truth.

To study this analytically, we go to the high-dimensional limit where n, d and p all go to infinity with their ratios fixed. In this limit, we use the replica method from statistical physics to compute the test error and the various sources of variance. What you can see is that, first of all, the test error displays double descent. But if we look at the various terms which intervene, we see that the noise variance and the initialization variance both diverge at the interpolation threshold when there is no regularization. In stark contrast, the two remaining terms, the sampling variance and the bias, reach a plateau at the interpolation threshold, with a distinctive kink here which is reminiscent of a phase transition and can be smoothed out by regularization. The conclusion is that the noise variance and the initialization variance lie at the heart of the double descent behavior, and in particular, the fact that they decrease monotonically past the interpolation threshold shows that overparameterizing is beneficial simply because it amounts to removing variance.

What's nice about this decomposition is that it also allows us to study the effect of ensembling, which again means averaging the predictions of learners with different first layers but trained on the same data set. We call K the number of such learners. What you see is that as you increase K, you drastically reduce the noise variance and the initialization variance, which were causing the divergence at the interpolation threshold, and when K goes to infinity the divergence is completely suppressed. To summarize, we reconciled the classical bias-variance decomposition with a more modern one which takes place beyond the interpolation threshold, where, in stark contrast with the classical picture, the variance decreases and the bias remains constant.

In the second part of this talk, I'd like to reconcile this double descent with an older version of double descent, known since the 90s in the statistical physics literature on the perceptron.
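As a rough finite-size illustration of this setup, the sketch below fits the second layer by ridge regression on a noisy linear teacher and then ensembles K learners that share the data but have independently drawn first layers. The regularization strength, noise level, teacher normalization, and ridge convention are assumed values for illustration; this mimics the setting but not the replica computation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p, K = 200, 100, 300, 10      # illustrative sizes; K = number of ensembled learners
lam, noise_std = 1e-3, 0.5          # assumed ridge parameter lambda and label-noise level

# Linear teacher with additive noise epsilon
beta = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ beta + noise_std * rng.standard_normal(n)

X_test = rng.standard_normal((1000, d))   # fresh Gaussian i.i.d. test samples
y_test = X_test @ beta

def fit_rf_ridge(X, y, Theta, lam):
    """Learn the second layer by ridge regression on the ReLU random features."""
    Z = np.maximum(X @ Theta.T / np.sqrt(d), 0.0)              # (n, p)
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

# Ensembling: average the predictions of K learners trained on the same data
# but with independently drawn first layers Theta (i.e. different "initializations").
preds = np.zeros(len(X_test))
for _ in range(K):
    Theta = rng.standard_normal((p, d))
    a = fit_rf_ridge(X, y, Theta, lam)
    Z_test = np.maximum(X_test @ Theta.T / np.sqrt(d), 0.0)
    preds += Z_test @ a / K

print("ensembled test error:", np.mean((preds - y_test) ** 2))
```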
Indeed, if we look at the parameter-wise profile, which is the test error plotted versus the number of parameters, as we just saw for neural networks, you get a peak when p is equal to n, the number of training examples. If you look at linear networks, you won't get such a peak; however, you will get a peak in the sample-wise profile, that is, the profile obtained when plotting versus the number of training examples. One difference between nonlinear and linear networks is that in linear networks the peak appears when n equals the input dimension d rather than p. If we plot this in the form of a phase space, the (p, n) phase space, you'll notice that these two kinds of overfitting correspond to different lines in this phase space. They do have one thing in common: they both occur exactly at the interpolation threshold. So the question we'd like to answer is: are they the same thing, are they due to the same phenomenon? The answer we provide here is no.

Indeed, if you look at a regression task with a lot of noise, you'll notice that the sample-wise profile, obtained by plotting versus n, presents two peaks when the activation function is weakly nonlinear: a peak at n = d, the linear peak, which is vertical in the phase space, and a nonlinear peak at n = p, which is diagonal. Note that when the activation is purely linear you only get the n = d peak, and when it is strongly nonlinear you only get the n = p peak. So what we'd like to know is what mechanisms underlie these peaks and how they differ.

To answer this question, we consider two different models. The first is our usual RF model, which I've already introduced, with an RF student and a linear teacher. We denote by Z the projected data matrix, that is, the data set once it has been projected through the first layer and the nonlinearity. One important quantity is the covariance of Z, whose bad conditioning, in other words small eigenvalues, causes peaks in the test error. We compare this analytically tractable model to a numerical toy model for deep learning, where both the teacher and the student are three-layer networks: the teacher is randomly initialized and fixed, whereas the student has its weights trained by gradient descent, again on a mean squared error loss, to mimic the labels of the teacher.

In both models, if you look at the (p, n) phase space of the test error, you'll notice two characteristic lines: the linear peak at n = d, which is the first overfitting peak, and the nonlinear peak at n = p, which is diagonal in the phase space. To describe this, we again go to the high-dimensional limit, and one important quantity which emerges is the degree of nonlinearity, or rather the degree of linearity, of the activation function. This quantity varies between zero and one. To illustrate it, we can take piecewise linear functions and continuously vary from the absolute value, for which r equals zero, to a linear function, for which r equals one. By doing so, you reinforce the linear peak at n = d but weaken the nonlinear peak at n = p. Now, what are the mechanisms underlying these two peaks?
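To make the degree of linearity concrete, here is a small sketch of a piecewise-linear family interpolating between the absolute value and the identity, with a Monte-Carlo estimate of one common linearity measure, r = E[sigma'(z)]^2 / E[sigma(z)^2] for Gaussian z. The talk only states that r runs from zero for the absolute value to one for a linear function, so this exact formula is an assumption; it does reproduce those two endpoints.

```python
import numpy as np

def piecewise_linear(x, a):
    """Slope 1 for x >= 0 and slope a for x < 0.
    a = -1 gives the absolute value, a = +1 gives the identity."""
    return np.where(x >= 0, x, a * x)

def degree_of_linearity(a, num_samples=10**6, seed=0):
    """Monte-Carlo estimate of r = E[sigma'(z)]^2 / E[sigma(z)^2], z ~ N(0, 1).
    (Assumed definition; the talk does not spell out the formula.)"""
    z = np.random.default_rng(seed).standard_normal(num_samples)
    mean_slope_sq = np.mean(np.where(z >= 0, 1.0, a)) ** 2
    return mean_slope_sq / np.mean(piecewise_linear(z, a) ** 2)

for a in (-1.0, 0.0, 1.0):
    print(f"slope a = {a:+.1f}  ->  r ~ {degree_of_linearity(a):.3f}")
# Expected output: r ~ 0 for the absolute value (a = -1) and r ~ 1 for the identity (a = +1).
```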
Well, to understand this, we consider the matrix Z, which we already discussed. In the high-dimensional limit we consider, this nonlinear projection can be decomposed as a linear term plus a nonlinear term, and the relative importance of these two terms is controlled by the degree of linearity. These two terms give separate contributions to the covariance of the random features, which we denoted Sigma previously, and investigating the properties of this covariance helps us identify the sources of the peaks. We do this using formulas established in random matrix theory.

If we first look at the purely linear case, r = 1, we notice that the spectral density of the covariance matrix Sigma has a vanishing gap, in other words vanishingly small eigenvalues, only when n equals d; this is what causes the overfitting peak we called the linear peak. If we look instead at the purely nonlinear case, r = 0, the vanishing gap again occurs, but at n = p rather than n = d. In both cases, the linear peak and the nonlinear peak are caused by small eigenvalues. However, when we go to an intermediate case, for instance tanh (I could also have shown ReLU), we see that the n = p gap survives, in other words we again see vanishing eigenvalues at n = p, whereas the n = d gap is regularized: as you can see, the spectrum is cut off at some point, so there are no vanishing eigenvalues.

Something else we'd like to understand is where the divergence comes from in terms of bias and variance. We already know from the first part of this talk that the nonlinear peak is caused by the noise variance and the initialization variance. But what we see here is that the linear peak at n = d is only caused by the noise: nothing special happens to the initialization variance. One important consequence is that if you go to the noiseless case, where there is no noise in the labels, the linear peak dies out, whereas the nonlinear peak survives. This explains the curious phenomenon in deep learning that even when there is no noise, you can still observe the overfitting peak.

Now I'll discuss two important differences between these two peaks in terms of phenomenology. In the top row here I show the RF model results and in the bottom row the deep learning toy model. What we see is that when we apply ensembling or regularization, the nonlinear peak is reduced in both models, but the linear peak is only weakly affected. The reason is that the linear peak is already implicitly regularized by the nonlinearity, as we saw from the analysis of the spectrum. The second important difference is that the linear peak, which again appears as the n = d line in the phase space, appears much earlier during training than the nonlinear peak, which appears along the diagonal. The reason is clear: the nonlinear peak is caused by vanishingly small eigenvalues in the spectrum, and those small eigenvalues are much slower to learn, which is why it appears later during training.

All right, thank you for your attention, and thanks to my collaborators shown here.
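As a rough numerical check of this spectral picture, the sketch below computes the smallest nonzero eigenvalue of the feature covariance Z^T Z / n for a linear activation, the absolute value, and tanh, at the two locations n = d and n = p. The sizes are illustrative assumptions, and at finite size the "vanishing gap" shows up only as a comparatively small eigenvalue, not an exact zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def smallest_nonzero_eig(n, d, p, sigma, tol=1e-10):
    """Smallest nonzero eigenvalue of the covariance of Z = sigma(X Theta^T / sqrt(d)).
    A (near-)vanishing value signals the bad conditioning behind an overfitting peak."""
    X = rng.standard_normal((n, d))
    Theta = rng.standard_normal((p, d))
    Z = sigma(X @ Theta.T / np.sqrt(d))
    ev = np.linalg.eigvalsh(Z.T @ Z / n)      # spectrum of the feature covariance
    return ev[ev > tol * ev.max()].min()      # drop the trivial exact zeros (rank deficiency)

activations = {
    "linear (r = 1)": lambda x: x,
    "abs    (r = 0)": np.abs,
    "tanh          ": np.tanh,
}

d, p = 100, 300   # illustrative sizes (an assumption): probe the n = d and n = p lines
for name, sigma in activations.items():
    at_n_d = smallest_nonzero_eig(n=d, d=d, p=p, sigma=sigma)
    at_n_p = smallest_nonzero_eig(n=p, d=d, p=p, sigma=sigma)
    print(f"{name}  smallest eig at n=d: {at_n_d:.2e}   at n=p: {at_n_p:.2e}")
# Expected trend: the linear case is badly conditioned only at n = d, the absolute value
# only at n = p, and tanh keeps a small eigenvalue at n = p while the n = d gap is cut off.
```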