I'd like to start by thanking the organizers for taking the time to put this conference together, especially after everything that's been going on. We should say a big thanks to Jean, Florent, and Mathieu. Okay, anyway, so this is session six. I'll be quick. The first speaker is Marco Mondelli from IST Austria, who will talk on understanding gradient descent for over-parameterized neural networks. Hello, hello. So, let me share my screen. Full screen. Okay, can you all hear me and see my slides? Yeah, it looks like so. Okay, cool. Again, thanks Jean, thanks Florent, thanks Mathieu for organizing this. I guess I don't need too much of an introduction to the topic, since the first two days already had quite a few talks in this area, and tomorrow there are going to be more. The bottom line is that training neural networks is, in principle, difficult, but as many people have shown, it works remarkably well in practice. So today I would like to shed a little bit of light on this fact from a theoretical perspective, and I would like to emphasize that two things are really important here. One is over-parameterization, and the other is the use of a stochastic gradient descent algorithm. During most of my talk I will discuss a paper with my great student Alex here at IST, about landscape connectivity and dropout stability. Then, if I have time, since we've lost three or four minutes or so of my slot, I will briefly talk about convergence. Okay, so the point of this work is mainly to explain something that has been observed widely in practice, which is the landscape connectivity of neural networks. Actually, this very picture was used, I think, in a presentation on the first day. The point is that people have noticed that the local minima obtained by SGD can be connected via a piecewise linear path along which the loss stays pretty much the same. So here I have one minimum, and here I have another minimum. Now, if I just linearly interpolate between them, this is bad: the loss along the way is very large. But if I am a bit more clever, then I can find an interpolating path that goes from one to the other without the loss blowing up. This was related, in a NeurIPS paper last year, to properties of a well-trained network. There they considered multi-layer networks with ReLU activations, and they showed that if the neural network has some properties that are desirable in practice, namely noise stability or dropout stability, then this connectivity happens. What we show, basically, is that dropout stability actually holds for a wide class of neural networks, and hence landscape connectivity takes place. There is a bunch of papers that have looked at the landscape of neural networks; I'll be brief here in the interest of time. The only point I would like to make is that they typically have strong assumptions on the model and rather poor scaling with the parameters. So connectivity has been proven, for example, for linear networks, and for networks that require more neurons than training samples. It has been observed that the sub-level sets are connected for a class of networks that exhibits a pyramidal topology, but again with more neurons than training samples. And there was a paper by Joan Bruna and his group at NYU where they showed that there is an energy gap, but there the energy gap scales badly, essentially exponentially, with the dimension.
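(To make the straight-line versus clever-path picture concrete, here is a minimal sketch, not taken from the talk or the paper, of how one could measure the loss along the direct interpolation between two SGD solutions; loss_fn, theta_a, and theta_b are placeholders for whatever loss and trained parameter vectors one has at hand.)

```python
import numpy as np

def loss_along_line(loss_fn, theta_a, theta_b, n_points=21):
    """Evaluate loss_fn along the straight line between two parameter vectors.

    A large bump in the returned values is the bad case described above:
    direct interpolation between two SGD minima typically crosses a high-loss
    barrier, whereas a well-chosen piecewise-linear path can keep the loss flat.
    """
    ts = np.linspace(0.0, 1.0, n_points)
    return np.array([loss_fn((1.0 - t) * theta_a + t * theta_b) for t in ts])
```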
Okay, so I would like to offer a slightly different perspective that comes from a relatively recent line of work on the so-called mean-field view of neural networks. And I would like to take this view in order to prove stronger results about dropout stability and mode connectivity. The mean-field view was discovered independently by a bunch of different groups that have slightly different perspectives on it. One is the Stanford group of Andrea and Song, who also gave a talk here about something slightly different. This has been done mainly for two layers, just because the math is simpler, but it has also recently been extended to multiple layers, and there is actually a paper, updated I think a few weeks back, with a very nice characterization of the mean-field view of multi-layer neural networks. Okay, so let me start with a warm-up. This is a brief talk, so I will mostly talk about two-layer networks, where things are a bit easier, and then I will just flash a result about multi-layer networks that is similar in spirit, but harder to prove, as is often the case. So I'm going to consider a supervised learning model in which I'm given i.i.d. data samples (x, y), with x the features and y the labels. My goal is to minimize the loss, which is just the expected value of the squared error, where the expectation is over the random data. And I will do that via online SGD: I just take gradients and move in the direction of the negative gradient, with step size alpha. The factor N here is just so that the updates are of order one, so that the weights actually move. This is in stark contrast with another recent line of work that analyzes the so-called NTK regime, where the idea is that the neural network basically doesn't move much and stays close to its initialization. Here it is quite different, because the weights do move by order one during training. I'll be making a bunch of technical assumptions, nothing really special: the labels are bounded, the gradients are sub-Gaussian, the activation function is bounded and differentiable with a bounded, Lipschitz gradient, and there is an assumption on the initialization, but these are all qualitative assumptions. Okay, so now let me define formally what dropout stability is. We say that the network is dropout stable if the loss does not change much when we remove part of the neurons and suitably rescale the rest. Slightly more formally, L_N is the loss of the neural network with N neurons, and L_m is the loss of the network where only m neurons are kept, say removing the others at random and rescaling. We say that the parameters theta are epsilon_D dropout stable if the loss does not increase by more than epsilon_D. To put this in a picture: this is my original network with N neurons, this is my dropped-out network with m neurons, and these two must be pretty close, so the difference in loss between them has to be smaller than epsilon_D. Okay, cool. Connectivity is even easier: we say that two parameters are epsilon_C connected if there exists a continuous path in parameter space such that along that path the loss does not increase by more than epsilon_C. And this is the picture: I start from here, I want to end up here, and the loss, which in this case is the orange line, does not increase much. Okay, so I'm now ready to state our main results. Let me say again what my notation is.
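(As an illustration of the dropout-stability definition, here is a minimal sketch under the mean-field 1/N scaling described above; two_layer_net, loss_fn, and the random choice of kept neurons are my own placeholders, not the construction used in the paper.)

```python
import numpy as np

def two_layer_net(x, W, a, sigma=np.tanh):
    """Two-layer network in mean-field scaling: (1/N) * sum_i a_i * sigma(<w_i, x>)."""
    return a @ sigma(W @ x) / W.shape[0]

def dropout_gap(loss_fn, W, a, keep=None, rng=None):
    """Empirical epsilon_D: keep `keep` neurons chosen at random, rescale, and
    return how much the loss increases compared to the full network.

    `loss_fn` maps a prediction function x -> y_hat to a scalar loss (e.g. an
    average squared error over held-out data), standing in for the population risk.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = W.shape[0]
    keep = N // 2 if keep is None else keep
    idx = rng.choice(N, size=keep, replace=False)
    # two_layer_net normalizes by the number of neurons it receives, so passing
    # the kept subset already implements "remove the rest and rescale".
    full = loss_fn(lambda x: two_layer_net(x, W, a))
    dropped = loss_fn(lambda x: two_layer_net(x, W[idx], a[idx]))
    return dropped - full
```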
So N is the number of neurons of the full network, m is the number of neurons of the reduced network, so m is smaller than N, alpha is the step size of SGD, and d is the dimension. So consider k steps of SGD; then, with high probability, along the whole trajectory the parameters are epsilon_D dropout stable, and the price that I have to pay is this one. Let me just emphasize that the change in loss scales basically as one over root m, plus something that depends on the step size, roughly root alpha times log m. Okay, a couple of remarks about why I like this. One is that epsilon_D does not depend on the size of the original network, it just depends on how much stuff is left: m is completely decoupled from N, and the bound involves only m, d, and alpha, so I just need m to be large. The requirement on alpha is also quite mild, because it just needs to go down as one over log m. As for this other term, I think it appears just because we are somewhat lazy in the proof; it could probably be removed. Okay, then from this we can deduce something about connectivity. So now I run SGD twice: I run SGD once and obtain some parameters theta_k, then I run it a second time with a different initialization and different data samples, still coming from the same distribution, and also for a different number of steps, so k and k' don't need to be the same. Then, with high probability, these two solutions are epsilon_C connected, and epsilon_C looks pretty much like what I showed before. Basically this is the same formula as before, but with m equal to N over 2, because the proof is constructive and follows pretty much the lines of the previous NeurIPS paper: we literally construct a path along which the loss doesn't increase much. And to do that we crucially exploit dropout stability: we crucially exploit the fact that it's possible to remove a subset of the neurons without paying much. So the change in loss scales basically as one over root N plus small stuff. And, as I already said, the nice thing is that we can connect SGD solutions that come from different training data, but the same data distribution, and from different initializations. I'll have literally one slide about the proof idea. The whole idea of the mean-field characterization is basically to say that the discrete dynamics of SGD, which I can think of as the dynamics of a gas with N particles, gets approximated, as N goes large, by the continuous dynamics of a certain gradient flow. So the idea is that the SGD weights theta_k are close to ideal particles that evolve according to the continuous dynamics, and these ideal particles are i.i.d. Then, since these particles are i.i.d., if I could just plug them in instead of the weights of SGD, dropout stability would basically follow from the law of large numbers, because the full loss and the dropped-out loss concentrate to the same limit. And then we can pass from dropout stability to connectivity in a constructive way: we actually construct the path. Okay, now this is how the result looks for multi-layer networks. I don't want to spend too much time on this, but it is morally the same. Now I can remove neurons from multiple layers, m is the maximum number of neurons that are left across the layers, and the scaling is pretty much the same.
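(And here is a matching sketch, again only illustrative, of how one could check epsilon_C connectivity numerically: given the two SGD solutions plus the intermediate points of a constructed piecewise-linear path, measure how much the loss along the path exceeds the loss at the endpoints.)

```python
import numpy as np

def connectivity_gap(loss_fn, vertices, points_per_segment=20):
    """Empirical epsilon_C along a piecewise-linear path.

    `vertices` lists parameter vectors: the two SGD solutions at the ends plus
    the intermediate points of the path.  Returns the largest amount by which
    the loss along the path exceeds the worse of the two endpoint losses.
    """
    end_loss = max(loss_fn(vertices[0]), loss_fn(vertices[-1]))
    worst = -np.inf
    for a, b in zip(vertices[:-1], vertices[1:]):
        for t in np.linspace(0.0, 1.0, points_per_segment):
            worst = max(worst, loss_fn((1.0 - t) * a + t * b))
    return worst - end_loss
```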
Here I pay a linear dependence on the dimension, while the previous bound was dimension-free. This is again probably because our proof is suboptimal somewhere, and the d shows up here. Okay, but it's linear, at least it's not exponential, so it's not so bad. Maybe let me say something about the proof here. It is a bit more challenging, because the issue is that the neurons are not exactly independent anymore: if I condition on the previous layer, the weights of the next layer are independent, but if I consider paths through the network, this is not true anymore, and in particular we need a somewhat delicate bound on the distance between the ideal particles and the weights of SGD. Now, we submitted this to ICML, so I figured we should have some sort of numerical results. So we actually tested this on a bunch of classic data sets, MNIST and CIFAR-10, and we observed that the change in loss scales with the number of neurons pretty much as the bounds predict, so the bounds are actually pretty decent here. We just compute how much the loss changes, for different data sets, as the number of neurons increases, and I always remove half of the neurons. So I would expect these curves to go down as one over root N, and they pretty much do. Okay, I think I still have three or four minutes, so let me tell you something about convergence. Now, why do we like landscape connectivity? We like it because it suggests that the optimization problem is not too bad, so if we consider a local algorithm like SGD, then we have a chance of converging to the global optimum. In the original paper by Andrea, Song, and their coauthors, and also in the paper by Chizat and Bach, they provided some global convergence guarantees, but these are similar to the convergence guarantees that one has for Langevin dynamics, so they don't really give you quantitative rates. One question I had was whether one can actually get some rates, and I will now present a model in which it's indeed possible. I mean, it's a very specific model, two-layer networks, but here it's indeed possible to get rates. So I consider a regression problem: the features are uniform on a bounded convex set, the label is just f of the feature, and f is strongly concave, and this is actually very important. I want to minimize the loss, and my activation function looks like a bump with center w_i. So basically I'm trying to fit this function, and I'm trying to fit it by moving the centers of the bumps: the bumps have width delta, their centers move by gradient descent, and they try to fit the function. Okay, now let me show you the result, which is joint work with Adel and with Andrea. What we show is that if f is strongly concave and we have this regression problem, which can be modeled as a two-layer neural network, for example with radial or bump-like activations, then the following happens. The loss at step k is upper bounded by the loss at the beginning, which is a constant, times something that goes to zero exponentially fast in k times alpha; k alpha is basically the time of the evolution, since alpha is the step size of SGD and k is the number of iterations; plus something rather horrible, which I'll just hide here, and the horrible stuff goes to zero as the number of neurons grows large, the step size goes to zero, and delta goes to zero.
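(For the bump-activation regression model just described, here is a small self-contained sketch of online SGD on the bump centers, under assumptions of my own: a one-dimensional feature, a Gaussian-shaped bump of width delta, and a simple target function; this is only meant to illustrate the setup, not the actual experiments or analysis.)

```python
import numpy as np

def bump(z, delta):
    """Smooth bump of width delta centred at zero (one of many possible choices)."""
    return np.exp(-z**2 / (2 * delta**2))

def sgd_bump_centers(f, n_neurons=200, delta=0.05, alpha=1e-3, steps=50_000, rng=None):
    """Online SGD on the bump centres w_i for a 1-d version of the model:
    y = f(x), x uniform on [0, 1], prediction (1/N) * sum_i bump(x - w_i, delta).

    Only the centres move; alpha is the step size, and the update is scaled by N
    so that the weights move by order one, as in the mean-field setting.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = n_neurons
    w = rng.uniform(0.0, 1.0, size=N)      # initial centres
    for _ in range(steps):
        x = rng.uniform(0.0, 1.0)           # fresh sample at each step: online SGD
        z = x - w
        pred = bump(z, delta).mean()
        resid = pred - f(x)
        # gradient of (pred - f(x))**2 / 2 with respect to each centre w_i
        grad = resid * bump(z, delta) * (z / delta**2) / N
        w -= alpha * N * grad               # the factor N keeps the update order one
    return w
```

For instance, calling sgd_bump_centers(lambda x: 1.0 - (x - 0.5)**2) would try to fit a strongly concave target on [0, 1] by moving the bump centres.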
Okay, let me mention two things. The bound is almost quantitative: it is quantitative in N and in alpha, but not really in delta. The idea is that for this specific model, as delta goes to zero, the underlying gradient flow exhibits some very nice structure; in particular, what you can show is that as delta goes to zero, the gradient flow minimizes a loss that is displacement convex. Note that for any positive delta this is not true, so the gradient flow does not minimize a displacement convex loss, and the original problem is not convex either: if I consider the problem with N neurons, it is not convex in the N times d dimensional parameter space. So it is really only the limit that is very nice. And yeah, I think I'm done, that's all I had to say. Thanks Marco.