Yes, so can you see my screen? Yep. Okay. Thank you for the invitation, I'm glad to speak at this workshop. Today I will present some work which is more than one year old, which might sound a bit ancient on the timescale of machine learning research, but I think it fits well with the topic of today's workshop, and you will see a lot of connections with topics that you have seen so far. I will present joint work with Francis Bach about the implicit regularization of gradient descent when training two-layer neural networks.

We consider a classification task in this talk. This means that the input variables x live in R^d, which might be a space of inputs like images or sounds, and the outputs y are in {-1, +1} because we are doing binary classification. The goal is to have a predictor that predicts the correct sign of y, given i.i.d. samples forming the training set.

To do that, we will consider in this talk wide two-layer ReLU neural networks. This is the function class written here: predictors which are a sum of simple functions phi(w_j; x). Each of these simple functions is a simple ReLU function: the ReLU applied to an affine function of the input, multiplied again by a scalar. So each simple function has parameters a and b, the input and the output weights, which I stack in a vector w_j, and these will be the trainable parameters. In the way I have written the predictor, we have a sum over m simple functions with a scaling 1/m, which will be useful because we will consider the limit where m goes to plus infinity, and the scaling is needed to obtain a non-degenerate limit. Here I have written the equations for the rectified linear unit, but more generally our analysis applies whenever phi is 2-homogeneous in the parameters w. This is really a crucial assumption in what I will present: when I multiply the parameters by some scalar lambda, the output is multiplied by lambda squared. For ReLU networks this holds because when I multiply both a and b by lambda, I get lambda squared in the output. So this is the class of predictors that I will consider.

Next we need the training algorithm used to select these weights w_j, and for that we will consider the gradient flow of the empirical risk. First we choose a loss which is monotonically decreasing and exponentially tailed; this assumption covers, for instance, the logistic loss or the exponential loss. Then we consider the unregularized empirical risk for this loss, which is simply the sum of the losses over the training set. Since we will work in the infinite-width limit, there will be many minimizers of this empirical risk: if you take any predictor which predicts the correct sign on all the training set and multiply its magnitude by some scalar lambda, you can make the empirical risk arbitrarily close to zero thanks to the tail of the loss. So it is useful, and in fact necessary, to study the training dynamics to understand which minimizer is selected by the algorithm; this is the standard setting for studying the implicit regularization of training algorithms.
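(As a rough illustration of this setup, here is a minimal sketch of the predictor class and of the unregularized empirical risk with the logistic loss. This is not code from the paper; the function and variable names are mine, and bias terms are ignored.)

```python
import numpy as np

def predict(A, b, X):
    # f(w; x) = (1/m) * sum_j b_j * max(a_j . x, 0): input weights A is (m, d),
    # output weights b is (m,), inputs X is (n, d).
    m = A.shape[0]
    return (np.maximum(X @ A.T, 0.0) @ b) / m

def empirical_risk(A, b, X, y):
    # Unregularized empirical risk with the logistic loss, labels y_i in {-1, +1}.
    margins = y * predict(A, b, X)
    return np.mean(np.logaddexp(0.0, -margins))   # log(1 + exp(-margin)), numerically stable

# 2-homogeneity: scaling all parameters by lam scales the output by lam**2.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
b = rng.choice([-1.0, 1.0], size=50)
x = rng.standard_normal((1, 3))
lam = 2.5
assert np.allclose(predict(lam * A, lam * b, x), lam**2 * predict(A, b, x))
```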
In this talk we will consider a very simple situation: the gradient flow of this empirical risk. This means that we initialize all the weights independently according to some distribution on the space of parameters, so one distribution for the input weights and another distribution for the output weights, and then we follow the gradient flow of the empirical risk, up to some normalization. The idea is that this is an idealization of the various gradient-based learning algorithms that we would like to understand.

Let me show you an illustration of such dynamics. Here is a video of the training dynamics of a wide two-layer ReLU network on a simple two-dimensional classification task. On the left is the parameter space: for each neuron I plot the input weights multiplied by the absolute magnitude of the output weight. Since here d equals 2, the parameters live in R^4, and this representation lets us see the dynamics in R^3. The color, red or blue, depends on the sign of b. On the right is the predictor space, with the pluses and minuses of the training set, and the line is the decision boundary. So here are the training dynamics of gradient descent for this loss, on a finite training set. In the parameter space, the unit circle is at infinity, because in fact the weights of all the neurons go to plus infinity, so I have rescaled the radial axis. What we observe is that the neurons cluster into a few groups of weights: although at initialization there were a lot of neurons, in the end we only see a few clusters, and in predictor space this corresponds to a predictor with a few edges, that is, a polygonal decision boundary. The question we would like to understand is: what is the performance of the predictor learned by these training dynamics? I will give some hints towards this question, but of course no full answer in this talk.

Let me briefly tell you where the inspiration for the analysis comes from: results on the implicit regularization of simple linear classifiers. Here I recall a result by Soudry and co-authors, which deals with a linear predictor parameterized by a weight vector w. They show that if the training set is linearly separable, then whatever the initialization, the gradient flow of the empirical risk diverges to plus infinity; but if we renormalize the weights and look at the direction, then in direction the weights converge to the L2 max-margin classifier. Since for classification only the direction of the classifier matters, it is not an issue that the weights diverge to plus infinity; in the end we converge to the L2 max-margin classifier, which is a specific classifier whose generalization properties can be analyzed.

We will now show a similar result, but for wide two-layer neural networks. To state the result we need some background on the function spaces associated with neural networks, and that is very convenient because the previous speaker has just introduced all these spaces, so I will simply recall them and adapt the notation; these are really the same definitions as in the other talk.
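(Before turning to those definitions, a quick aside: here is a minimal sketch of the training dynamics illustrated above, using plain gradient descent with a small constant step as a stand-in for the continuous-time gradient flow, and the initialization scheme assumed in the main result below. All names are mine, bias terms are ignored, and this is not the paper's code.)

```python
import numpy as np

def train_gradient_flow(X, y, m=200, step=0.01, iters=20000, seed=0):
    # Discretized gradient flow on the logistic empirical risk of the
    # 1/m-scaled two-layer ReLU network.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.standard_normal((m, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # input weights: uniform directions
    b = rng.choice([-1.0, 1.0], size=m)             # output weights: uniform on {-1, +1}
    for _ in range(iters):
        pre = X @ A.T                               # (n, m) pre-activations a_j . x_i
        act = np.maximum(pre, 0.0)                  # ReLU activations
        f = act @ b / m                             # predictions f(w; x_i)
        g = -y / (1.0 + np.exp(y * f)) / n          # derivative of the averaged logistic loss in f
        grad_b = act.T @ g / m                      # gradient w.r.t. output weights
        grad_A = ((pre > 0) * g[:, None]).T @ X * (b[:, None] / m)  # gradient w.r.t. input weights
        A -= step * grad_A
        b -= step * grad_b
    # Summary used in the parameter-space plot: |b_j| times the input weights.
    return A, b, np.abs(b)[:, None] * A
```

For the two-dimensional example of the video, X would be the (n, 2) array of training inputs and y the vector of labels in {-1, +1}; the returned |b_j| * a_j array is what clusters into a few directions as training proceeds.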
If you have a predictor h, we define its F2 norm as follows: you try to decompose h as a linear combination of ReLU functions with a weight function g(a), and you take the infimum, over all such decompositions, of the L2 norm of g. This gives you a norm, which is the norm associated with a reproducing kernel Hilbert space, sometimes called the conjugate RKHS of the neural network. We can also define a different norm on the space of predictors, where we replace the L2 norm by the L1 norm: we look for, in some sense, the sparsest decomposition, the one with the least L1 norm, of our predictor h over this dictionary of ReLU functions; the L1 norm is known to have this sparsifying effect. To briefly recall what we learned from the previous talk: roughly, the F2 norm is small for functions which are isotropically smooth, while the F1 norm can be small even for functions which are not really smooth, as long as they are non-smooth only in a few directions. So F1 can adapt to anisotropy in the predictor we want to learn. This was just a brief recap of the function spaces associated with neural networks.

Now let me state the main result that we have with Francis in this paper. Assume that at initialization the input weights are uniform on the sphere and the output weights are uniform on {-1, +1}; this is just to fix ideas with a specific initialization scheme, and for instance the absolute scale of the initialization does not matter for the result I will present. We do not need any linear separability assumption, because two-layer neural networks are universal approximators; we just need the training set to be consistent, so that there exists a function which interpolates it. There are some additional technical assumptions which unfortunately I do not have time to discuss, but they are quite important: we need to assume that several objects converge in the infinite-width dynamics to make the result rigorous. Under these assumptions, we show that the predictor learned by gradient descent, in the infinite-width limit and the infinite-time limit, converges to the F1 max-margin classifier. In equations, this corresponds to the following fact: when you take the limits in time and in width (in fact the limits in m and t can be interchanged), the margin of our predictor converges to the max margin with respect to the F1 norm.

Let me comment a little on this result. First, on the downside, it is really a qualitative result; there is nothing quantitative, and we do not know how large t or how large m needs to be. In fact, there are hints that some hard problems can be encoded as F1 max-margin problems, so we know there will be situations where this limit is very difficult to observe: we would need a very large number of neurons and a very long training time. Still, this gives us at least a qualitative feel for what kind of function is learned by a wide neural network: a function which, at least asymptotically, approaches the F1 max-margin classifier.
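(To fix ideas, here is a compact sketch of the two norms and of the limit statement, in my own notation; normalizations and technical details may differ from the paper's.)

```latex
% F_2 and F_1 norms: infimum over decompositions of h into ReLU features
% with a weight function (or measure) g on the sphere.
\|h\|_{\mathcal F_2}^2 = \inf\Big\{ \int_{\mathbb S^{d-1}} g(a)^2 \,\mathrm{d}a
  \;:\; h(x) = \int_{\mathbb S^{d-1}} g(a)\,(a^\top x)_+ \,\mathrm{d}a \Big\},
\qquad
\|h\|_{\mathcal F_1} = \inf\Big\{ \int_{\mathbb S^{d-1}} |g(a)| \,\mathrm{d}a
  \;:\; h(x) = \int_{\mathbb S^{d-1}} g(a)\,(a^\top x)_+ \,\mathrm{d}a \Big\}.

% Main result (informal): in the joint infinite-width / infinite-time limit,
% the normalized predictor attains the F_1 max margin on the training set.
\lim_{m,\,t \to \infty} \; \min_{i \le n} \;
  y_i \,\frac{f(w(t);\,x_i)}{\|f(w(t);\,\cdot)\|_{\mathcal F_1}}
\;=\; \max_{\|h\|_{\mathcal F_1} \le 1} \; \min_{i \le n} \; y_i\, h(x_i).
```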
To contrast with what happens when we only train the output layer of the neural network: in that case we recover the random feature setting that was presented in Teodor's talk, and by simply applying the previous result of Soudry and co-authors we can show convergence to the F2 max-margin classifier. So we have these two types of behavior, depending on whether we train both layers or only the output layer of the two-layer ReLU network.

Let me also show some illustrations. The setting is similar to the one in Maria's talk: a classification task with two classes in dimension 15, where only the first two coordinates are relevant for the classification and the remaining 13 coordinates are pure noise. For this training set, I first plot the margin as a function of m, the number of neurons in the network, and this is the F1 margin. Our result says that as m increases we converge to the F1 max margin, and in this specific example we observe that 100 neurons are already sufficient to reach it. Of course, this is just one specific example, and in general I do not expect convergence to be this fast, but in some settings this asymptotic regime can apparently be reached quite rapidly. Now, looking at the test error as a function of the number of samples, we compare training both layers, which gives the F1 max-margin classifier, against training only the output layer, which gives the kernel max-margin classifier, a kernel SVM. We see that after 500 samples we already have almost perfect generalization with the F1 max-margin classifier, while the kernel method still makes a lot of mistakes. This can be understood by arguments we have seen in the previous talks, but here in the specific setting of a classification task.

We have shown a simple statistical bound which confirms this numerical experiment; it uses classical margin-based generalization bounds from classification theory. Let me present it. We assume that the inputs are almost surely bounded, and that there exists some smaller dimension, which I will call r, such that after a projection of rank r the two classes, +1 and -1, can be separated by at least a distance delta. This is on the test distribution, so we assume that the two classes are well separated after a rank-r projection; this is precisely the setting of the previous experiment, where we had a two-dimensional projection for which the two classes could be separated. We then show that the proportion of mistakes of the F1 max-margin classifier can be bounded by the following quantity: a certain decrease as a function of the number of samples n, multiplied by a quantity which depends on the difficulty of the task, that is, on how close the two classes are. What is crucial here is that there is no explicit dependency on the dimension d: this is a dimension-independent bound. In practice the dimension can enter through r, but at least it does not enter the exponent. So this is another appearance of the good adaptivity property of the F1 max-margin classifier associated with two-layer neural networks.
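(Schematically, and only as my reading of what was said, the assumption and the shape of the bound can be written as follows. The 1/sqrt(n) rate is the typical rate of margin-based bounds and is my assumption here; the talk only states "a certain decrease" in n, and I am not reproducing the paper's exact constants.)

```latex
% Separation assumption on the test distribution: inputs bounded, and there is
% a rank-r projection P under which the two classes are at distance at least Delta.
\exists\, P \in \mathbb R^{r \times d} \ \text{of rank } r :\quad
\mathrm{dist}\big( P\,\mathrm{supp}(\rho_{+1}),\; P\,\mathrm{supp}(\rho_{-1}) \big) \;\ge\; \Delta .

% Shape of the bound for the F_1 max-margin classifier: a decrease in the number
% of samples n times a factor depending only on (r, Delta), with no explicit
% dependence on the ambient dimension d.
\mathbb P\big( y \, h_{\mathcal F_1}(x) \le 0 \big) \;\lesssim\; \frac{C(r, \Delta)}{\sqrt{n}} .
```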
Okay, I have one minute left, so let me give one last comment. Everything I have presented so far was non-quantitative in terms of optimization time, and in fact, when we look at the literature on the implicit regularization of gradient methods for exponentially tailed losses, there are very few quantitative results once we go beyond the setting of linearly parameterized predictors. Here we want to understand a little more precisely the behavior of these dynamics for two-layer neural networks. Although we cannot so far give training-time guarantees for the result I presented, we can look at simplified situations. In particular, we have looked at the specific situation where we train both layers but fix the directions of the input weights. In that case, we could show convergence, with a certain rate, by making an analogy between the behavior of gradient descent and online mirror descent; I will not go into the details of this result. As was seen in Teodor's talk, F1 functions cannot in general be well approximated with fixed directions, so this result does not, of course, solve the optimization problem in the F1 space, but at least it gives some hints about what can happen. Also, discretizing the time axis and making the main theorem quantitative remain fully open.

Since my time is over, I will conclude here and emphasize again that the main open question is to find interesting assumptions under which we can make the main theorem quantitative; this is a fully open problem. It could also be interesting to deal with more complex architectures in the future. Thank you for your attention.