OK, it seems fine. Perfect. Recording in progress. So for this last talk we have Alberto Bietti from NYU, and he's going to tell us about single index models and how to learn them with shallow neural networks. Alberto.

OK, thank you so much, Sebastian, for the intro, and thanks for inviting me to this wonderful workshop. Since I'm the last speaker, I have to say this was a really great week; I really enjoyed every talk and meeting everyone, so thanks to the organizers again. This is work that is not yet on arXiv, hopefully it will be there soon, so it's a little preliminary. Please stop me for any questions; I'm happy to take questions during the talk, and I don't have that many slides.

The idea is single index models. If you've heard of those, that's when you have a target function that depends on only one direction, a projection of your data onto a single direction. The goal is to recover that structure, but also to learn the potentially complex function defined on top of that projection. This is joint work with Joan Bruna at NYU and two students, Clayton from Columbia and Min Jae from NYU.

Let me start with some introduction about structure in neural networks. We know neural networks work very well on many problems, typically very high-dimensional ones: images, text, graphs, biology problems, et cetera. High dimensionality of the data is usually a crucial aspect of these problems, and deep learning seems genuinely successful on these kinds of tasks, so high dimensionality is a big deal. Another aspect is that optimization typically works: we just use simple algorithms like gradient descent, we can train these models, and things work out. So that has to be an important part of any theory that tries to explain these models. And finally, with a lot of parameters these models are very expressive: you typically have universal approximation, so you can fit any data, and even on a training set you can usually fit all your data with a neural network. So we're going to try to incorporate these three aspects.

When I talk about structure, here we'll consider one form of it: structure on the target function. We assume we have data x with a simple distribution, think of it as Gaussian, and the labels y are generated from some function f star of x, maybe with some noise. The point is that in these high-dimensional problems you typically need some structure on f star in order to learn efficiently. So one question is: for problems that work in high dimension, what is the right structure to put on f star? There are various ways to put in structure, and we'll think about one specific one in this talk. And then, once you put these assumptions on the target, part of what deep learning theory has to do is show that a neural network can actually learn efficiently, from few samples and with efficient algorithms, under such an assumption on the data distribution.

OK, so let me go through a few things that show up in the literature. The first case is a generic assumption on f star; this is classical non-parametric statistics.
So maybe f star is a Lipschitz function, or a smooth function with a certain number of derivatives, say beta. In this regime, even kernels, which are a kind of simplified model that has existed for a long time, actually give you optimal rates. And neural networks are linked to these kernels: when you think about neural networks with random weights, you basically get a kernel. So with these generic assumptions you can typically learn efficiently. But "efficiently" here is not really a great thing in high dimensions, because if f star is only Lipschitz, a non-smooth function in high dimension is really hard to learn and you pay exponentially in the dimension: the number of samples has to be exponential, the curse of dimensionality. So hopefully we can go beyond that with some structure.

Another simple setting that has been considered a lot, and here I'm just quoting two papers that are relevant to what I'll talk about, even though there has really been a lot of work on these kinds of models, is the single neuron: f star is phi of theta star inner product with x, where phi is, let's say, a ReLU activation. So you have one direction, and on top of it a specific neuron that matches the one you're using for training: you train a ReLU network and your target is a ReLU. In this case there are results that tell you that if you train even a single neuron in your model, you can fit this direction theta star and then do inference, and you can learn efficiently; the number of samples doesn't have to scale exponentially with the dimension. It has to grow with dimension, but not exponentially, let's say. The issue is that in these models you really have to know the activation: you have to assume your target is a ReLU. Why would it be a ReLU and not a more complicated function? That's a limitation of this framework. Also, you typically only need one neuron, so you don't have the expressivity aspect that lets you fit arbitrary functions.

The next step is the single index model, which is what I'll talk about here. We still consider functions that depend on only one planted direction theta star, but the function f star on top of it can be fairly arbitrary: maybe it's a Lipschitz function, maybe it's slightly smooth. Ideally we don't want to depend on the smoothness in the same way as the first, generic class, which also paid exponentially in dimension. There are some works here: the Bach paper, for instance, shows that certain convex formulations of neural networks adapt to this structure and can give you the right statistical rates. And there are ways to study neural networks in infinite-width limits, the mean-field regime, that potentially allow you to learn these models, but we don't really know how fast you converge, whether you converge at all, or what happens when the width isn't actually infinite. So it's a nice theoretical model, but we don't really know if the algorithms would work in practice, which is why I put "intractable".
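Before going on to the remaining approaches, it helps to put rough numbers on the gap this structure is meant to close. These are standard non-parametric rates, stated here for orientation rather than taken from the talk: for a generic beta-smooth target in dimension d, the minimax squared error scales as

$$ n^{-\frac{2\beta}{2\beta + d}}, $$

so reaching accuracy $\varepsilon$ needs on the order of $\varepsilon^{-(2\beta + d)/(2\beta)}$ samples, which blows up exponentially in $d$. Under the single-index assumption $f^\star(x) = g(\langle \theta^\star, x\rangle)$ with $g$ beta-smooth in one dimension, the hope is to pay only the one-dimensional rate

$$ n^{-\frac{2\beta}{2\beta + 1}}, $$

plus whatever it costs to estimate the direction $\theta^\star$.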
And then there are other works that consider models of this form but use very tailored algorithms; maybe you do something with specific polynomials at one degree and then another degree. These can solve the task, but they're not something you would use every day.

OK, just some other examples of structure. Multi-index is when, instead of depending on only one planted direction, you might depend on a few. And there are other kinds of assumptions you might make: perhaps f star has some symmetries and is invariant to certain translations of your image, say, or f star is actually a composition of functions. These are all very interesting models for deep learning, but not the focus of this talk. Here we'll talk about single index models and ask: with a shallow network architecture and gradient descent, can we learn these functions?

All right, let's keep going. One extra motivation for single index models that I usually think of: with CNNs in practice, convolutional networks that are very successful on images, if you look at the first-layer weights you typically see nice filters that look like Gabor filters. You initialize randomly, and yet these neurons align to these nice structured filters. So clearly some learning of features is happening, at least in these models, and it's a reasonable assumption that your data might be generated by some kind of multi-index model. The later layers are messier, but people interpret similar things there.

OK, let me set up the problem. As I said, we make a simple assumption on x: the data is Gaussian in d dimensions. The labels are generated by the single index model with a one-dimensional function f star, a direction theta star of norm one, and perhaps some additive noise at a noise level sigma. Here is the architecture we use. It looks like a shallow network: N is the number of neurons and the c_i are the second-layer weights, but at the first layer we fix a single direction theta shared by all neurons. So it's not a standard architecture; we're constraining all the first-layer weights to be the same theta, but each neuron has its own bias b_i. And the point is that the biases are random: we just draw a random initialization for the biases and we don't train them. All we train is the direction theta and the second-layer weights c_i. For the activation you can stick to the ReLU, even though this can potentially be adapted to other activations. And as I said, we'll think about a standard gradient descent algorithm: we have the empirical loss on our training set, a regularizer on the second-layer weights, and we train with respect to c and theta. We'll also use a projected version where we keep theta on the sphere, so a spherical gradient step on theta and plain gradient descent on c. So that's the setup. Nothing too complicated, just this trick of sharing the direction theta across neurons, mostly for simplicity, because we know one direction is potentially enough.
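Here is a minimal NumPy sketch of this setup, just to make it concrete. The link function, sizes, step size, and number of steps are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Single-index data: x ~ N(0, I_d), y = f*(<theta*, x>) + noise.
d, n, sigma = 50, 5000, 0.1
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
f_star = lambda z: np.abs(z)                     # illustrative 1D link function
X = rng.standard_normal((n, d))
y = f_star(X @ theta_star) + sigma * rng.standard_normal(n)

# Tied-direction shallow network: f(x) = sum_i c_i relu(<theta, x> + b_i),
# with random fixed biases b_i and a single trainable direction theta on the sphere.
N = 200
b = rng.standard_normal(N)                        # random biases, never trained
c = rng.standard_normal(N) / N                    # trainable second-layer weights
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)

relu = lambda t: np.maximum(t, 0.0)
lam, lr = 1e-3, 0.05

for step in range(500):
    z = X @ theta                                 # 1D projections, shape (n,)
    H = relu(z[:, None] + b[None, :])             # hidden activations, shape (n, N)
    resid = H @ c - y
    grad_c = H.T @ resid / n + lam * c            # gradient of regularized squared loss in c
    s = (z[:, None] + b[None, :] > 0).astype(float) @ c   # sum_i c_i * 1{<theta,x>+b_i > 0}
    grad_theta = X.T @ (resid * s) / n
    grad_theta -= (grad_theta @ theta) * theta    # spherical (projected) gradient on theta
    theta -= lr * grad_theta
    theta /= np.linalg.norm(theta)
    c -= lr * grad_c

print("correlation m = <theta, theta*> :", theta @ theta_star)
```

The only non-standard ingredients are the shared direction theta, kept on the unit sphere by removing the radial component of its gradient and renormalizing, and the biases, which are drawn once and never updated.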
Okay, so as a warm-up, let me talk about something that inspired this analysis: the paper on teacher-student networks where the activation and the target are the same function f star. Here we make the following assumption on f star. We have Gaussian data, and with Gaussian data you naturally think about decompositions in Hermite polynomials: H_j is the Hermite polynomial of degree j, and they form an orthonormal basis with respect to gamma, the one-dimensional Gaussian measure. So we assume the activation and the target decompose with these coefficients alpha_j.

In this case we can look at the squared loss objective on the population, in expectation over x. We assume we know f star, so both our activation and the target are f star; we train theta, and the target uses theta star. You expand the square, and the only term that ends up depending on theta is the cross term. Then there is a property of Hermite polynomials: when you take these functions of two different projections, one with respect to theta and one with respect to theta star, the expectation ends up depending only on the correlation between theta and theta star, which I'll call m. And because the coefficients on both sides are the same alpha_j, you get alpha_j squared times m to the j. So if you expand and use orthogonality, that's what you get.

So, at least when the correlation m is positive, the population landscape as a function of m simply goes down as m increases from zero to one. So if you initialize with a positive correlation, you can expect that gradient descent on the population loss starts going down and eventually reaches m equals one. That's basically what is shown in that paper. They define the information exponent s, which is the index of the first non-zero alpha_j. The first terms might be zero, which is why the landscape can be flat at the start: you have a saddle point. In this plot I think s equals two, so the zero-order term doesn't matter and the first-order coefficient is zero, so you get this saddle at m equals zero and you have to escape it at the beginning. With random initialization on the sphere, you expect m to be of order one over square root of d initially; that's a standard CLT-type fact. We assume we're on the right side, so m is positive with probability one half, and then you need enough samples to actually escape the saddle. What they show in the paper is roughly that d to the s samples are enough to escape this initial saddle; actually it can be a bit tighter, I think d to the s minus one when s is large enough.
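Spelling out the Hermite computation behind this, in my notation: write $f^\star = \sum_j \alpha_j H_j$ with the $H_j$ orthonormal for the one-dimensional Gaussian measure, and let $m = \langle \theta, \theta^\star \rangle$ for unit vectors $\theta, \theta^\star$. The key property is $\mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\big[ H_j(\langle \theta, x\rangle)\, H_k(\langle \theta^\star, x\rangle) \big] = \delta_{jk}\, m^j$, so the cross term becomes

$$ \mathbb{E}_x\big[ f^\star(\langle \theta, x\rangle)\, f^\star(\langle \theta^\star, x\rangle) \big] = \sum_{j,k} \alpha_j \alpha_k\, \mathbb{E}_x\big[ H_j(\langle \theta, x\rangle) H_k(\langle \theta^\star, x\rangle) \big] = \sum_{j \ge 0} \alpha_j^2\, m^{j}. $$

Up to terms that do not depend on $\theta$, the population loss is therefore $L(m) = \mathrm{const} - \sum_j \alpha_j^2 m^j$, which is decreasing on $m \in [0,1]$; near $m = 0$ it behaves like $\mathrm{const} - \alpha_s^2 m^s$ with $s$ the information exponent, which is exactly the flat saddle in the plot.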
Okay, so this actually relates to the previous talk, where you have these quantities that might shoot up; this might tell you the sample complexity you need when you have tensors of a certain order, so these things are linked. So we have this initial saddle; the other terms are negligible when m is very small, and that's what the landscape looks like, and with enough samples you can escape it. That's the starting point, and once you escape the saddle, things shoot down pretty quickly, the problem gets easier over time, and you can recover theta star perfectly. So that's the picture.

Now, what happens if we don't know f star? So f star still has its decomposition with coefficients alpha_j, but phi, our activation, has different coefficients beta_j. Maybe phi is a ReLU, which is what we use to train the network, but the target has different coefficients. In this case you can again write the population loss landscape; it still depends on the correlation m, but now the coefficients are alpha_j times beta_j rather than a square, so they can be negative, which means the landscape doesn't necessarily go down. Here's an example where some of the coefficients are negative and you get stuck before reaching one. In that case you don't get full recovery: the two directions don't align completely and you get stuck somewhere in between. So if some alpha_j beta_j is negative, we may get stuck, because there is a critical point in between. What matters at the start is the first non-zero term: if alpha_s beta_s, with s the information exponent, is positive, then you do go down a little until you hit the first point of zero gradient.

So what we hope to do with the neural network is to effectively learn these beta_j's. The beta_j are the activation's coefficients, but if instead of a fixed activation you use something richer, then hopefully we can learn the beta_j's, and if they align with the alpha_j's, then all these terms become non-negative, and we can go down all the way and keep aligning the directions until the end. That's the idea of this work: use the random biases to learn something about the outer function, which should then also let us learn theta perfectly. OK, so that was the high-level idea, and now I'll give a bit more detail so that you hopefully get something out of it.
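A tiny numerical illustration of the matched versus mismatched cases just described; the Hermite coefficients here are invented purely to exhibit the phenomenon.

```python
import numpy as np

# Illustrative Hermite coefficients (made up, not from the paper):
# alpha_j for the target f*, beta_j for the fixed activation phi.
alpha = np.array([0.0, 1.0, 1.0, 0.5])            # target coefficients, information exponent s = 1
beta_matched = alpha.copy()                       # teacher-student case: phi = f*
beta_mismatch = np.array([0.0, 1.0, -1.5, 0.0])   # sign flip in the degree-2 coefficient

def theta_term(a, bcoef, m):
    # theta-dependent part of the population loss: -sum_j alpha_j * beta_j * m^j
    return -sum(aj * bj * m**j for j, (aj, bj) in enumerate(zip(a, bcoef)))

m = np.linspace(0.0, 1.0, 101)
print("matched  argmin:", m[np.argmin(theta_term(alpha, beta_matched, m))])   # 1.0: full alignment
print("mismatch argmin:", m[np.argmin(theta_term(alpha, beta_mismatch, m))])  # ~0.33: stuck partway
```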
The starting point is that we can again write the population landscape. Here f star is the same target function with coefficients alpha_j, and our network is this f_{c,theta}, which depends both on the output layer c and on the direction theta; big Phi here is a feature map. If you know kernels, you can think of it as defining a kernel, but this kernel depends on the projection with respect to theta, so the kernel is itself learned, in the sense that theta is being moved. If you expand this, that's what the network looks like. So we have the parameter theta that we're optimizing, as well as the second-layer parameters c, and you can again write the population loss. Here we add the regularization term, and another term pops out from the square, the expectation of the square of the learned network; that's the first term. The second term again depends on the correlation m: we have the alpha_j's, but now multiplied by something that depends on c, our weights. So this part can potentially be learned, we can tune it; and if c is very high-dimensional, say an infinite-dimensional feature vector, then perhaps all of these coefficients can be learned, for every j. That's the hope.

So we have this operator T. For those who are curious, T is a standard operator that takes any function in L2 and projects it onto the random features we're using; it's a standard quantity that shows up when you study random feature kernels. When you project the Hermite polynomials, you get these vectors T_j. And T times the adjoint of T is an N by N matrix, which is the covariance matrix of the features; that's the Q here. Because we're looking at the population loss and everything is rotationally invariant, this term doesn't depend on theta; it only depends on the data, not on the specific direction, so theta doesn't show up there and you just have this quadratic form. So this is the landscape we get; any questions? All right.

The first point, as I said, is that initially, even without tuning the second-layer weights, if the very first coefficient here has the same sign as the corresponding alpha, we can at least get some weak alignment: m can move somewhat beyond zero, to some constant level. Then the question is what we can learn beyond that, and hopefully we can show that there are no bad critical points: all the critical points should be either at m equals zero or at the very end. That's the first result I'll talk about. We have to assume a regularization lambda that is small enough, and a number of random features that scales like one over lambda. Having a small lambda basically says we're close enough to approximating any function with our RKHS; in particular with infinite data you can approximate anything. Under these two conditions, the only critical points of the population objective are, one, the degenerate saddle at the beginning, where m equals zero and the second layer is also zero, and two, the points where theta is completely aligned with theta star or minus theta star, with c then uniquely determined by the population loss. So those are the critical points. What we know is that nothing gets stuck at an m strictly between zero and one: if we reach a critical point, either we're stuck at the initialization or we've actually recovered the direction.
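For what it's worth, here is a hedged Monte Carlo sketch of the objects just mentioned: the vectors T_j (the Hermite coefficients of each random feature), the feature covariance Q, and the theta-dependent part of the population loss as a function of the correlation m, once the second layer is optimized out in closed form. The target coefficients, sizes, and lambda below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda t: np.maximum(t, 0.0)

N, n_mc = 100, 50_000
b = rng.standard_normal(N)                       # random biases
z = rng.standard_normal(n_mc)                    # z = <theta, x> ~ N(0, 1) for Gaussian x, unit theta
Phi = relu(z[:, None] + b[None, :])              # random features evaluated at z, shape (n_mc, N)

# Orthonormal Hermite polynomials H_0..H_3 with respect to the standard Gaussian.
H = np.stack([np.ones_like(z), z, (z**2 - 1) / np.sqrt(2), (z**3 - 3*z) / np.sqrt(6)], axis=1)

T = Phi.T @ H / n_mc        # T[i, j] ~ E[ relu(z + b_i) H_j(z) ]: Hermite coefficients of each feature
Q = Phi.T @ Phi / n_mc      # Q[i, k] ~ E[ relu(z + b_i) relu(z + b_k) ]: feature covariance

alpha = np.array([0.0, 1.0, -0.7, 0.4])          # hypothetical target coefficients
lam = 1e-3

def profiled_loss(m):
    # For fixed correlation m, minimize over c the quadratic
    #   (1/2) c^T Q c - sum_j alpha_j m^j (T_j^T c) + (lam/2) ||c||^2,
    # i.e. the theta-dependent population term with the second layer optimized out.
    v = T @ (alpha * m ** np.arange(len(alpha)))
    c_star = np.linalg.solve(Q + lam * np.eye(N), v)
    return -0.5 * v @ c_star

for m in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"m = {m:4.2f}   profiled loss = {profiled_loss(m):.4f}")
```

Roughly speaking, this profiled curve over m is where any intermediate critical point would have to show up, and the claim above is that with lambda small enough and enough features there is none.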
Just for some details, this is where we actually use kernels. With random biases, what we're defining is this expected random feature kernel: the bias is random, the ReLU is the same, and we evaluate at x and x prime. You can study the approximation properties of this kernel, meaning how hard it is to approximate some function f. This P hat lambda is the projection onto the random feature space when you have a certain number of random features and this regularization parameter lambda, and you can control how close the projection is to the actual function; this scales with lambda. If lambda is very small, the projection gets very close, and the bound depends on the second derivative. So we make the assumption that all these functions are twice differentiable, with a second derivative bounded in the Gaussian norm. And to get this approximation we need enough random features; in particular, the number has to scale with one over lambda.

It turns out that when lambda is small enough, you can look at the equation for a critical point, and there is an approximation term that becomes so small it no longer matters; as long as m is strictly less than one, this approximation term has to cancel, and in the end you're left with only the right critical points. This is the condition on lambda that you need; it has to depend on alpha_s, the signal at the information exponent, which was the first non-zero alpha_j. The message is that the second layer can help us cancel those critical points at intermediate correlation, so if we train all the way we can actually get perfect alignment.

Now some intuition on what the algorithm looks like. Remember, this was the intuition on the population loss. One key thing is the initialization; this is what's relevant for the sample complexity, because the relevant quantities are very small at the start, due to the saddle, so you need to capture enough of the signal by using enough samples. We don't use the SGD analysis of the paper I mentioned before; here what we do is concentration of the empirical landscape around the population landscape, and those two need to be close enough that you don't get lost near the initialization. We also have anti-concentration results so that, at initialization, both m and this other factor, c transpose T_s, are bounded away from zero, and we assume the signs are such that we can go down; that's the probability one-half event.

Then we have two phases. The first phase only does what I said earlier, which is try to get to a non-zero constant level of m. Here we only train theta; c is left at its random initialization, and we tune the initialization norm so that once we reach this level gamma we will have escaped the initial saddle. In this regime, because m is small, this is what the population loss looks like.
So if you think about optimizing the population loss, it basically looks like m to the s until you escape. And after that, because we've chosen things so that this was enough to escape the bad region near the saddle, we can trust that there are no stationary points until we get to the end; we've essentially escaped a level set. Concretely, if you look at these two terms, the bad critical point at m equals zero gives you a lower bound of zero on the objective, and what we get instead is a value a little below zero, so we're sure we're not at that bad point and we can keep optimizing until the end. That's the second phase: we must get to something close to m equals one, so we essentially recover the direction. And finally, to get the right non-parametric rates from the kernel, there is a final fine-tuning phase in the theory: we retrain the second-layer weights with fresh samples and we can optimize a new lambda to get the right rates.

OK, so the final generalization guarantee looks like this. We have a certain number of samples for the first gradient descent phase, and then some fresh samples for the final fine-tuning, where we fix the direction and only optimize the second layer. For the first part we have to assume n is larger than this d to the s, up to factors; that's to escape the saddle point. For the second part we have n prime samples and we get this non-parametric rate. So the first part depends on dimension, that's the high-dimensional part, and in the second part we're basically doing regression in one dimension, a very easy problem that people solved in the seventies and eighties. The rate also depends on the kernel and its approximation properties.

A few comments. As I said, the intuition was on the population landscape; in practice we work on the empirical landscape, so we rely on landscape concentration results. Recovery is near optimal: we need n larger than about d to the s, which is similar to that earlier paper, even though they get potentially better exponents in some regimes. Fine-tuning, the final phase where we re-optimize the second layer, lets us inherit exactly what the kernel method would do, so you get the right rates for this 1D problem. One comment: if you don't do fine-tuning and only run gradient descent to the end, you still get vanishing excess risk, but something worse, with a dependence on d and a slower rate.

Here are some ugly, not very nice looking experiments, just to see what happens; this is without the fine-tuning phase. Here s equals one, a piecewise-linear teacher, so it starts going down immediately, whereas if you remove the first few harmonics, the first few Hermite coefficients, so that s equals three, you get stuck at the beginning, but with enough samples it starts going down. In these cases you can actually get full recovery. What remains to be done is experiments that actually include the fine-tuning phase, which should hopefully also improve the excess risk further.
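A sketch of what this fine-tuning phase could look like in code: freeze the direction found by gradient descent, draw fresh samples, and refit the second layer by ridge regression with a new, possibly smaller lambda. Everything below, including pretending the direction was recovered exactly, is illustrative rather than the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda t: np.maximum(t, 0.0)

def fine_tune(X_new, y_new, theta_hat, b, lam_new):
    """Freeze the learned direction theta_hat and refit the second layer on fresh
    samples by ridge regression: a 1D nonparametric regression on the projections."""
    z = X_new @ theta_hat
    Phi = relu(z[:, None] + b[None, :])
    n, N = Phi.shape
    return np.linalg.solve(Phi.T @ Phi / n + lam_new * np.eye(N), Phi.T @ y_new / n)

# Toy usage: pretend the gradient-descent phase recovered theta* exactly.
d, n_new, N, sigma = 50, 5000, 200, 0.1
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
b = rng.standard_normal(N)
X_new = rng.standard_normal((n_new, d))
y_new = np.abs(X_new @ theta_star) + sigma * rng.standard_normal(n_new)
c_hat = fine_tune(X_new, y_new, theta_star, b, lam_new=1e-4)  # smaller lambda than in training
```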
OK, so let me conclude: efficient learning of single index models. We have this tied direction, a somewhat unusual architecture where the direction is shared across all neurons and the biases are random. That's enough to do feature learning of one direction, plus some non-parametric regression of a 1D function, and combining the two we get essentially the right statistical guarantees.

There are lots of further questions. What if we train jointly from the start? Here we have this two-phase procedure where we first optimize theta and then optimize jointly; maybe we can optimize jointly right away, as long as the step size is large enough. SGD on the population loss: here we used the argument of concentrating the empirical landscape around the population one, but maybe we can directly study SGD on the population loss. Untied neuron directions: what happens if each neuron has its own weight vector? What happens if we also train the biases? What about multi-index models, with several directions to fit? Do we need fine-tuning, et cetera? These are all interesting questions. I'll stop here; thank you for your attention.

Thank you very much. We have time for a couple of questions. Mark wants to start. Thank you very much for the very nice talk. Can you make any guesses about what the landscape would look like, say, for a two-index model? Right. I think there are actually some papers starting to study that; the initial saddle point might be similar. What I don't know is how the different neurons move; they might move in strange ways. Actually, Sebastian has done some work on tracking these correlations between different neurons. So things might sometimes behave nicely, and sometimes in complicated ways. Anyway, I feel that's an interesting question, characterizing provable guarantees for when you can align. For example, if there were a thousand, would you expect a similar picture? Yeah, I think so; even if you have just two training neurons, I think they might align to the right things. Thanks.

Maybe I can follow up with a question. When you do the fine-tuning, what kind of solutions do you find? Is the network going to do something very sparse, where it basically turns off all the neurons but one with the second-layer weights, or does it take some kind of effective average at the second layer that you can then use? I think, basically, even without fine-tuning you already get a rich, L2-type solution, because we have this L2 regularizer. You get a Hilbert-norm penalty, so you don't get a sparse c; you get a dense c. And part of the point of fine-tuning is that this lambda regularization parameter, if you make it too small, hurts you in the first recovery phase. That's also why it's nice to have a separate phase where you can take a lambda that's potentially smaller and get the right rates for the non-parametric part. But basically you get a kernel method, a kind of L2, Euclidean solution. Okay, cool. Thanks.

There's another question, right there. In the case where you were to train the betas of the activation function, and so train your activation function, would you expect to get better rates? I think that would be, so, training the bias, right? Oh, beta; what is beta? The coefficients you had on the activation function, when you decompose the activation function in your harmonics. Oh, yeah.
Oh, these betas. They'd be hyperparameters that you could optimize as well, right? Yeah, but you have an infinite number of them. Okay, yes, but you could truncate at some point. Yeah, so that's actually what the other paper I mentioned, by Daniel Hsu, does: they optimize one polynomial at a time until they find a non-zero one, and that gives them the direction. They don't talk about the non-parametric rates, so that's potentially still to be studied, but you should be able to get the right thing. So that would mean you basically train with specific Hermite polynomials as your activations. Thank you for that question. Any other questions? Any last question for Alberto? Anything on the chat, maybe, Marco? No? Okay. Well, in that case, let's thank Alberto again, and all the speakers of this morning's sessions.