Can you hear me and see my slides? Yep. Okay, so hello everyone. Thank you Marco, thank you to the organizers, and thank you all for connecting to this talk. My name is Theodor and I'm a PhD student at Stanford. Today I present a recent work with my advisor Andrea Montanari and with Michael Celentano, who is also one of the speakers this week; I believe he will speak about his own work on high-dimensional statistics on Friday. The title is "Minimum complexity interpolation in random features models". So I consider neural networks in some sense, which is the theme of this track, but instead of considering gradient-trained neural networks, I consider infinite-width neural networks that are solutions of a convex optimization problem over a space of measures. Okay, so let's start. I'll start by recalling the definition of RKHS methods, that is, kernel methods; in neural networks jargon, this corresponds to neural networks trained in the lazy training regime. We consider the non-parametric supervised learning setting: we are given n data points (x_i, y_i), where the covariate vectors x_i are i.i.d. samples from a probability space (X, P), and the response variable y_i is a noisy measurement of the target function f* at x_i. A popular method to fit this data is kernel machines. Here I'll give a different construction of the reproducing kernel Hilbert space than the one usually given in lectures. We start from a weight probability space Omega with measure nu, and we consider a featurization map phi, which to any weight w associates a square-integrable function phi(·, w) on the space of covariates. For example, you can think of phi(x, w) as an activation function sigma applied to the inner product of x and w. We define the RKHS, denoted F_2, to be the set of all functions f_a(x) that can be written as the integral over the weight space of a(w) phi(x, w) nu(dw), where the density a is square integrable. This construction is equivalent to the standard one, with associated kernel K(x1, x2) given by the integral of phi(x1, w) phi(x2, w) nu(dw). You can think of these functions as infinite-width two-layer neural networks, with first-layer weights w, nonlinearity phi, and second-layer weights a(w). Now, given this RKHS, we can fit the data using a convex loss L and doing empirical risk minimization with an added RKHS-norm regularization. Note that this is a convex optimization problem, but over an infinite-dimensional space. However, and this is a celebrated result, the problem can be solved efficiently thanks to the representer theorem, which guarantees that the solution a_hat(w) lies in the fixed n-dimensional subspace spanned by the phi(x_i, ·). So we only need to solve a convex optimization problem over n parameters c_i. Okay, so we can solve kernel methods efficiently; what about their statistical properties? Let us denote by F_2(R) the ball of radius R in the RKHS. Schematically, the generalization error of learning f* in F_2(R) is upper bounded by R over the square root of the number of samples. In other words, we expect RKHS methods to generalize better for target functions that have lower RKHS norm. The problem with these kernel methods is that they suffer from the curse of dimensionality: in high dimension, F_2(R) balls only contain very smooth functions, and kernel methods will fail for all other functions.
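To make the construction above concrete, here is a minimal numpy sketch of the kernel-method pipeline just described: a featurization map, the induced kernel approximated by Monte Carlo over the weights, and ridge regression over n coefficients as licensed by the representer theorem. The ReLU activation, spherical weights, and the synthetic target here are my own illustrative choices, not specified in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_mc = 200, 30, 20000          # samples, dimension, Monte Carlo weights (illustrative)

# Synthetic data: covariates on the sphere, noisy scalar responses.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.maximum(X[:, 0], 0.0) + 0.1 * rng.standard_normal(n)

# Featurization map phi(x, w) = sigma(<x, w>), here with sigma = ReLU and
# w uniform on the sphere -- both illustrative choices, not fixed by the talk.
W = rng.standard_normal((n_mc, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def kernel(XA, XB):
    # K(x1, x2) = E_w[phi(x1, w) phi(x2, w)], approximated by an average over W.
    PA = np.maximum(XA @ W.T, 0.0)
    PB = np.maximum(XB @ W.T, 0.0)
    return PA @ PB.T / n_mc

# Representer theorem: the regularized F_2 solution lies in span{K(., x_i)},
# so kernel ridge regression reduces to an n x n linear system for c.
K = kernel(X, X)
lam = 1e-3
c = np.linalg.solve(K + lam * np.eye(n), y)
f_hat = lambda X_new: kernel(X_new, X) @ c

print("train RMSE:", np.sqrt(np.mean((f_hat(X) - y) ** 2)))
```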
So here I took the example of x and w uniformly distributed on the sphere, and an inner-product activation phi(x, w) = sigma(<x, w>). Take any target function f* and a number of samples n between d^K and d^(K+1). We showed, with Ghorbani, Mei and Montanari, that the test error of any kernel method is lower bounded, up to a vanishing additive term, by the norm of the projection of f* orthogonal to the subspace of degree-K polynomials. Meaning that the best we can do is fit a degree-K polynomial approximation to f*. And you can indeed check that the F_2(sqrt(n)) ball essentially only contains degree-K polynomials. You can do the simulation: even for d equal to 30, which is not exactly high-dimensional, you need around 20 to 30 thousand samples to fit a cubic polynomial approximation, and this is true for any target function. Okay, so this is kind of disappointing, given that there are classes of functions that can be efficiently approximated by neural networks, and we expect to be able to do much better on them. For example, take the task of learning a single neuron, f*(x) = phi(x, w*). It is very easy to represent as a two-layer neural network: simply put a Dirac delta at w*. The problem with Dirac deltas is that they are not square integrable; in particular, f* is not in the RKHS. The reason it is difficult to learn a single neuron with kernels is that, in order to approximate this Dirac delta, we need a function a whose L2 norm explodes, so we won't be able to do well on such targets. A simple idea is therefore to penalize, instead of the L2 norm of a, its Lp norm, with p between one and two. The reason is that the Lp norm penalizes singular distributions less heavily. So we fit the same infinite-width two-layer neural network, but now penalize the function a with its Lp norm. These neural networks were introduced by Bengio et al. in 2006, who called them convex neural networks, because the optimization problem is still convex, but over a space of measures. We have strict inclusions between the balls: F_2 is included in F_p, which is included in F_1. The F_p balls correspond to richer and richer function classes as p decreases; F_p captures better functions that depend strongly on a low-dimensional projection. In particular, for p equal to one, we can approximate a Dirac delta by a sequence of densities a with uniformly bounded L1 norm, which is not the case for L2. Bach indeed verified that F_1 is adaptive to an unknown underlying linear structure and beats the curse of dimensionality for functions that depend on a low-dimensional projection of the covariates. On the other hand, F_2 is not adaptive: it does not differentiate based on whether the function has a low-dimensional structure, and only cares about global smoothness properties. The rest of the talk will focus on these F_p function spaces, and I will consider finding the minimum-norm interpolating solution to the data: minimizing the Lp norm of a subject to the interpolation constraints, as written out below. This corresponds to taking the regularization parameter to zero in the empirical risk minimization. Okay, so why look at this interpolating solution? First, conceptually this is the critical part of the problem; and second, it corresponds to the modern practice of training until interpolation.
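To write out the optimization problem just described, here is the minimum-norm interpolation problem in F_p, the lambda-to-zero limit of the penalized empirical risk minimization; this is a transcription of the spoken description into formulas, and the exact normalization of the norm may differ from the paper's conventions.

```latex
% Minimum L^p-norm interpolation in F_p (the lambda -> 0 limit of penalized ERM):
\begin{aligned}
\underset{a \in L^p(\nu)}{\text{minimize}} \quad
  & \|a\|_{L^p(\nu)} = \Big( \int_{\Omega} |a(w)|^{p} \, \nu(\mathrm{d}w) \Big)^{1/p} \\
\text{subject to} \quad
  & f_a(x_i) := \int_{\Omega} a(w)\, \varphi(x_i, w)\, \nu(\mathrm{d}w) = y_i,
    \qquad i = 1, \dots, n .
\end{aligned}
```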
Our problem is a convex problem, but over an infinite-dimensional space. In the case p = 2, there is the representer theorem, which makes the optimization very easy; however, for p different from two it is not clear what to do. In the case p = 1, Bengio and his coauthors, and Bach, proposed different incremental algorithms; however, these either have no convergence guarantees or each step is NP-hard. Here, instead, we consider the standard approach of replacing the density nu by an empirical density nu_M that is finitely supported, say on M i.i.d. weights w_j, and we replace the infinite-dimensional integration by a sum over the w_j's. This is also known as the random features model; it was originally introduced as a randomized approximation to kernel methods. So now, instead of the infinite-dimensional problem, we have a finite-width problem with a in R^M, which is much easier to solve. We can ask several questions about these F_p spaces and this random features approximation. In this talk, I will focus on the following: how large does M need to be for the finite-width solution to approximate the infinite-width solution? For p = 2, for example, we showed with Mei and Montanari that it is necessary and sufficient for M to be larger than roughly N log N. So what about p less than two? Let me describe the setting and assumptions very quickly. We have n i.i.d. samples; we do not assume much about them, only that the covariates have bounded support. We define the kernel matrix K, the kernel evaluated at the training points. We are given M weights w_j, i.i.d. samples from nu, and we denote by phi_j the random features vector associated to w_j, that is, phi(·, w_j) evaluated at the data points x_1 to x_n. We also introduce the whitened features psi_j, which are just K^(-1/2) phi_j; they are whitened because their covariance is the identity. We will show that the random features approximation concentrates on the infinite-width solution conditionally on the realization of the data. That is, we exploit the randomness of the weights w_j to show concentration. To see how: the phi_j, conditional on X, are i.i.d. random vectors in R^n, because the w_j are i.i.d. For our result to hold, we only make assumptions about the feature map, and the conditions only need to be verified conditionally on Y and X. I will skip over them quickly: we need some sub-Gaussianity of the features, Lipschitz continuity of these neurons, and a technical small-ball property. Basically, we expect sub-Gaussianity and the small-ball property to be very mild assumptions; however, they are hard to check in practice. I can come back to these assumptions during the discussion or offline. This is the main result. For p strictly between one and two, assuming conditions 1-3 hold conditionally on Y and X, then, with high probability over the realization of the w_j's, we have the following bound on the squared distance between f^RF and the infinite-width solution: it is the maximum of two terms, one involving n log M over M, and one involving n (log M)^(p/(p-1)) over M.
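Before interpreting the bound, here is a concrete illustration of the finite-width problem the theorem is about: the random features surrogate of the F_p interpolation problem, handed to an off-the-shelf convex solver. The use of cvxpy, the ReLU featurization, the spherical weights, and the single-neuron target are all my own illustrative choices; this is a sketch, not the code behind the talk's experiments.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, M, p = 100, 30, 400, 1.5       # samples, dimension, features, exponent (illustrative)

# Data and target: a single neuron f*(x) = relu(<x, w*>), as in the talk's example.
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
w_star = np.zeros(d); w_star[0] = 1.0
y = np.maximum(X @ w_star, 0.0)

# Random features: M i.i.d. weights w_j and features phi(x_i, w_j) = relu(<x_i, w_j>).
W = rng.standard_normal((M, d)); W /= np.linalg.norm(W, axis=1, keepdims=True)
Phi = np.maximum(X @ W.T, 0.0)       # n x M design matrix

# Finite-width surrogate of the F_p problem:
#   minimize ||a||_p  subject to  (1/M) * Phi @ a = y
# (the 1/M factor mirrors the Monte Carlo approximation of the integral over nu).
a = cp.Variable(M)
problem = cp.Problem(cp.Minimize(cp.norm(a, p)), [Phi @ a / M == y])
problem.solve()

# Test error of the interpolating random features solution.
f_hat = lambda X_new: np.maximum(X_new @ W.T, 0.0) @ a.value / M
X_test = rng.standard_normal((2000, d)); X_test /= np.linalg.norm(X_test, axis=1, keepdims=True)
test_mse = np.mean((f_hat(X_test) - np.maximum(X_test @ w_star, 0.0)) ** 2)
print(f"p = {p}, M = {M}, test MSE = {test_mse:.4f}")
```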
So, okay, to interpret this result, note that this quantity should be compared to the typical error scale of the problem. In order for the RF solution to be a good approximation of the infinite-width solution, we need M to be larger than (n log n) raised to the power max{2, (p - 1/2)/(p - 1)}. Let me make two remarks. First, we do not expect the bound to be optimal: for p = 2, for example, we expect that M larger than n log n should already be sufficient, and similarly for p less than two. Second, as p decreases to one, this exponent diverges. For fixed p away from one, M only needs to be polynomial in n, so in some sense this is tractable; but when p goes to one the exponent diverges, so the RF approximation cannot efficiently solve the F_1 problem, consistent with the counterexample of learning a single neuron. Okay, so here are some numerical experiments. We take a single-neuron target, with covariates uniformly distributed on the sphere, and dimension d equal to 30. On the left, I fix the number of samples n to be 150, and I plot the test error against the number of features, for different p. You can see that the test error decreases as the number of features increases, until it settles on a limiting value, which corresponds to the infinite-width solution. For p = 2, thanks to the representer theorem, I can compute the infinite-width solution exactly, and this is the dashed black line; for the other p I cannot compute it. Second remark: it seems that in order to reach this plateau we need more and more features, which is consistent with our bound getting worse and worse as p decreases to one; in particular, for p = 1 we do not reach the plateau, because we would need an exponential number of features. On the right, I fix the number of features to be very large, in order to have a rough approximation of the infinite-width limit, and I plot the test error as a function of the sample size. We see that as p decreases, the test error decreases, which is consistent with the notion that the class F_p captures better and better functions that depend strongly on low-dimensional projections of the covariates. Okay, I don't have much time, but let me give you some intuition about the proof. We want to compare two problems: the finite-width and infinite-width minimum-norm interpolation problems. We consider the dual problems; I skip all the details, but basically these two problems are now defined on the same parameter space, lambda in R^n. And we are making a very simple observation here: the primal problem is an overparametrized, high-dimensional problem, which is difficult to control, and this is usually why it is difficult to study overparametrized problems; its dual, however, is low-dimensional, underparametrized, and we can use classical techniques such as uniform convergence arguments. As a quick remark, we also recover a kind of representer theorem for p between one and two; it is slightly more involved than in the case p = 2. Okay, I will skip over this slide and go directly to the conclusion, but quickly: the landscape of the finite-width dual problem concentrates on the landscape of the infinite-width dual, and that is why we can compare the two solutions. So let me conclude.
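To make the duality argument a bit more explicit, here is what the standard Lagrange duals of the two interpolation problems look like, with q = p/(p-1) the conjugate exponent. This is a sketch from textbook L^p-L^q duality, and the exact normalizations in the paper may differ; the point is only that both duals live on the same low-dimensional variable lambda in R^n, and that the finite-width constraint is an empirical average over the w_j that concentrates around the infinite-width one.

```latex
% Dual of the infinite-width problem (q = p/(p-1)):
\max_{\lambda \in \mathbb{R}^{n}} \ \langle \lambda, y \rangle
\quad \text{s.t.} \quad
\Big\| \sum_{i=1}^{n} \lambda_i \, \varphi(x_i, \cdot) \Big\|_{L^{q}(\nu)} \le 1 .

% Dual of the finite-width (random features) problem:
\max_{\lambda \in \mathbb{R}^{n}} \ \langle \lambda, y \rangle
\quad \text{s.t.} \quad
\Big( \frac{1}{M} \sum_{j=1}^{M} \Big| \sum_{i=1}^{n} \lambda_i \, \varphi(x_i, w_j) \Big|^{q} \Big)^{1/q} \le 1 .
```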
So first, one of the things I find interesting in this work is that we now have access to these F_p spaces, which are spaces of infinite-width two-layer neural networks that are tractable and are not RKHSs. Second, one may be interested in studying the effect of regularization in random features models; here we showed that, with enough overparametrization, this reduces to studying the corresponding infinite-width model, which is much simpler. Third, we introduced a new, very simple proof of the double descent phenomenon that does not rely on strong assumptions such as Gaussian data or a linear model. When I say double descent here, I mean that the test error of the model does not explode when adding more parameters, but instead concentrates on a certain value. The mechanism here is very simple: it is just uniform concentration of the dual onto the infinite-width problem. There are several future directions. We are studying the generalization properties of F_p with nonzero regularization. And the most interesting case is p = 1: this is the one that breaks the curse of dimensionality, and the question is whether there is an efficient algorithm that solves this F_1 problem. Okay, that's it. Thank you.