I'm very happy to be here and thank you for the invitation. I will talk about some of my recent work on the analytic characterization of the dynamics of multi-pass SGD. I will have to pretend I have not listened to the previous talk and still describe it. In particular, I will focus on a joint work with Pierfrancesco Urbani and Lenka Zdeborová from this year.

First of all, understanding the dynamics of gradient-based algorithms, and in particular of stochastic gradient descent, is a central question in machine learning. Indeed, it has been shown, for instance in a paper from 2019, that the landscape of supervised learning problems is far from trivial: there are regions of local and even global minima that have poor generalization and that gradient descent algorithms can reach if they are initialized in an adversarial way. However, with random initialization they are able to avoid these bad local and global minima and achieve, in practice, good generalization properties, and it is not clear why. So, for instance, some of the questions that arise are: which regions of the landscape are attractive for the algorithm, and in which cases? And what are the characteristics of SGD that make the algorithm better than gradient descent?

In order to understand this problem, it is necessary to somehow characterize the whole trajectory of the algorithm: the problem being so complex, the dynamics is central to understanding it. First I will overview some of the related works in this direction. There is a series of works that try to describe the dynamics of stochastic gradient descent by viewing the algorithm as a noisy approximation of gradient descent. One series of works models the noise as Gaussian, invoking the central limit theorem. This assumption is questioned in another series of works, where the authors observe numerically that in some tasks the noise looks heavy-tailed, and so they invoke a generalized version of the central limit theorem to characterize the trajectory of the algorithm. More related to my talk is a series of works that try to model the trajectory of the algorithm without resorting to approximations. This can be done, but only in some specific cases: for gradient descent in linear networks, and even deep linear networks, or for online SGD in two-layer neural networks, both with a finite number of hidden units and in the case of an infinitely wide hidden layer.

However, since in practice the algorithm that is applied is multi-pass SGD, where the samples are used multiple times, we wanted to model this scenario directly, and there is much less theory on this topic. We managed to derive analytic equations for the dynamics of the algorithm by applying dynamical mean-field theory from statistical physics, in a setting with a general non-convex loss function, in the limit of high input dimension. The key ingredients are that we consider data drawn from a specific generative model and we compute the typical-case performance in high dimension. This was first introduced in our work of last year, and I will talk a bit about some extensions. The setting is supervised learning, and in particular I will consider the phase retrieval problem, which is an inference problem where the task is to recover a hidden signal from a series of observations.
So we have n real measurements that are Gaussian vectors, and the labels are generated by a hidden teacher signal w*. Basically, we have access only to the absolute value of the projection of the data onto the teacher, and not to its sign, the phase; in particular, I will consider the case where the teacher weights are spherical. We consider this problem because it is a prototype of a high-dimensional non-convex problem, a really hard problem, even harder than the typical scenario where neural networks live: there, typically, there is a wide region of good minima, while in this case there is only one global minimum, the signal itself, and then a proliferation of bad local minima. The goal is of course to retrieve the vector w*.

The learning is done through empirical risk minimization, where we have a sum over all the samples of a loss function, which in this case is a square loss with an activation function; the label is squared simply because we do not have access to the sign. Then we can add a regularization, which can either be a ridge or, in this case, a spherical constraint. The optimization algorithm is SGD: in general, the weights are updated in the direction of the gradient, but this gradient is computed only on a mini-batch B of samples that is updated at each time step. We can rewrite the algorithm by introducing a binary variable s_mu associated to each sample, and its evolution defines the sampling protocol. For our analytic characterization we choose a slight variant of the SGD algorithm that we call persistent SGD. The difference is that in this case the process for the selection variables s_mu has a well-defined continuous-time limit: the average fraction of samples that are used to compute the gradient is fixed (in general the mini-batch size can fluctuate), and each sample stays in the gradient for a typical time. So there is a notion of persistence in this sampling procedure, which introduces a memory that normally is not there.

Finally, we consider different initializations: first random initialization, and then we focus on informed initialization, which in this case we simply implement by initializing the algorithm in the direction of the signal with a fixed projection m0. This is not unrealistic, because in practice it is what is really done in applications, and such an informed initialization can be achieved, for instance, via spectral methods; we use this m0 as an extra parameter to probe the landscape of the problem. Since the weights are spherical, the performance is entirely encoded in the magnetization, which is our order parameter: simply the projection of the weights onto the signal.

In practice, what is done to solve this problem is to consider informed initialization and trimming strategies, which are further regularizations of the loss; in particular, trimming consists in putting a cut-off on the activation function. However, recent works have shown that, even from random initialization and without resorting to this trick, the phase retrieval problem can be solved with a large enough number of samples even by gradient descent. So our questions were: can SGD improve significantly over gradient descent, and at which sample complexity? In this case we consider linear sample complexity, where the ratio of the number of samples to the dimension is fixed, and times that are not exponential in the system size but are large, yet still tractable.
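To make the setup concrete, here is a minimal finite-dimensional sketch of the model and of the persistent-SGD protocol described above: Gaussian data, a spherical teacher, a square loss comparing the squared preactivation to the squared label, and binary selection variables s_mu that are refreshed at rate 1/tau, so that each sample stays in the batch for a typical persistence time tau while the average batch fraction stays b. The step size, the exact form of the loss, and all parameter values are illustrative choices, not the ones used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_phase_retrieval_data(n, d):
    """Gaussian inputs x_mu and sign-less labels y_mu = |x_mu . w* / sqrt(d)|."""
    w_star = rng.standard_normal(d)
    w_star *= np.sqrt(d) / np.linalg.norm(w_star)   # spherical teacher, ||w*||^2 = d
    X = rng.standard_normal((n, d))
    y = np.abs(X @ w_star) / np.sqrt(d)             # only the modulus, not the sign
    return X, y, w_star

def persistent_sgd(X, y, w_star, b=0.5, tau=1.0, lr=0.1, dt=0.1, T=100.0, m0=0.1):
    """Persistent SGD on the sphere; returns the magnetization m(t) = w . w* / d."""
    n, d = X.shape
    # informed initialization with projection ~ m0 on the signal
    w = m0 * w_star + np.sqrt(1.0 - m0**2) * rng.standard_normal(d)
    w *= np.sqrt(d) / np.linalg.norm(w)              # spherical constraint
    s = rng.random(n) < b                            # binary selection variables s_mu
    ms = []
    for _ in range(int(T / dt)):
        # each s_mu is refreshed with rate 1/tau, so samples stay in the batch for a
        # typical persistence time tau while the average active fraction stays b
        refresh = rng.random(n) < dt / tau
        s[refresh] = rng.random(refresh.sum()) < b
        pre = X @ w / np.sqrt(d)                     # preactivations h_mu
        # gradient of the per-sample loss (pre_mu^2 - y_mu^2)^2 / 4, active samples only
        grad = X.T @ (s * (pre**2 - y**2) * pre) / (n * np.sqrt(d))
        w -= lr * dt * grad
        w *= np.sqrt(d) / np.linalg.norm(w)          # project back onto the sphere
        ms.append(w @ w_star / d)                    # magnetization m(t)
    return np.array(ms)

X, y, w_star = make_phase_retrieval_data(n=1500, d=500)
m_traj = persistent_sgd(X, y, w_star, b=0.5, tau=1.0)
print("final magnetization:", round(float(m_traj[-1]), 3))
```

Setting tau very small and b very small mimics vanilla SGD with fast mini-batch resampling, while b = 1 recovers full-batch gradient descent, which is how the three algorithms compared later can be seen as special cases of the same protocol.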
So, in particular, our derivation involves the application of dynamical mean-field theory (DMFT), an analytic framework that has been used many times in statistical physics to describe the dynamics of a large number of interacting degrees of freedom, in particular for the random perceptron problem in the context of spin glasses, where it has been applied by Agoritsas et al. We then generalized this approach to multi-pass SGD and structured data in 2020. I won't go into the details of the derivation of the dynamical mean-field theory, but in a nutshell this formalism allows us to perform a dimensional reduction: we start from a Markovian dynamics of a huge number of degrees of freedom, with the dimension going to infinity, and we reduce it to the dynamics of just a few effective degrees of freedom that really capture the performance of the algorithm, at the price of adding memory to the dynamics. We start from the continuous-time limit of the stochastic gradient descent equations, and this is why we defined the persistent variant of SGD, in order to have a well-defined continuous-time limit. We end up with the dynamics of an effective stochastic process h(t), a scalar variable that encodes the projection of the weights onto the noise in the data, and an ODE for the order parameter m, which is again the projection of the weights onto the signal.

These equations are quite tricky to solve, because they depend on a set of memory kernels and correlation kernels, which are functions defined as averages over the effective stochastic process itself, so basically they must be solved in a self-consistent way, by iteration. In the end, we observe that they capture very well the dynamics of the algorithm, with the advantage of being exact in the infinite-dimensional limit. In this particular problem, for instance, finite-size effects can be huge, so being able to track the dynamics in the infinite-dimensional limit gives us control over the simulations.

Indeed, as a first check, we compare our results for just the gradient descent algorithm, and we find that the theory, in red, has a very good agreement with the simulations, which are performed at dimension 1000. Here we plot the average magnetization as a function of the (continuous) training time. First of all, what we can observe is that even at a finite initial magnetization, for instance 0.5 in the left plot, the algorithm can still be stuck, and it can be stuck even very close to the signal, even at magnetization 0.9. So the first thing we observe is that a warm start is not enough to achieve perfect recovery. Then we performed some numerical simulations at fixed landscape and plotted different realizations of the dynamics. Again we observe that the landscape is really complicated: for instance, we see that more informed initializations do not guarantee better performance, and also that some regions can be trapping even if they are very close to the signal. For instance, if we look at the bottom-right figure, all simulations starting at initial magnetization 0.5 achieve perfect recovery, while if magnetization 0.8 is reached dynamically, we see in the left and central plots that many realizations can still get stuck below perfect recovery. This was a bit surprising.
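The talk does not write out the DMFT equations explicitly, so the sketch below is only meant to illustrate the structure of the "self-consistent solution by iteration" mentioned above: guess the correlation kernel of the effective noise, simulate many realizations of a one-dimensional effective process driven by noise with that covariance, re-estimate the kernel as an empirical average over those realizations, and repeat until the kernel stops changing. The drift of the effective process here is a toy placeholder, not the actual kernels or equations for persistent SGD on phase retrieval.

```python
import numpy as np

rng = np.random.default_rng(1)

T_steps, dt, n_paths, n_iters = 100, 0.05, 2000, 8

# initial guess for the correlation kernel C(t, t') of the effective noise
C = np.eye(T_steps)

for it in range(n_iters):
    # draw Gaussian noise trajectories with covariance C (small jitter for stability)
    L = np.linalg.cholesky(C + 1e-8 * np.eye(T_steps))
    noise = L @ rng.standard_normal((T_steps, n_paths))

    # integrate the scalar effective process h(t) for every noise realization;
    # the drift below (relaxation plus a memory term built from C) is a toy
    # placeholder, NOT the actual phase-retrieval DMFT drift
    h = np.zeros((T_steps, n_paths))
    for t in range(1, T_steps):
        memory = dt * (C[t, :t] @ h[:t]) / T_steps
        h[t] = h[t - 1] + dt * (-h[t - 1] - memory) + np.sqrt(dt) * noise[t]

    # close the loop: re-estimate the kernel as an empirical average over paths,
    # with a damped update for the fixed-point iteration
    C_new = h @ h.T / n_paths
    change = np.max(np.abs(C_new - C))
    C = 0.5 * C + 0.5 * C_new
    print(f"iteration {it}: max kernel change {change:.3e}")
```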
Next, we compare the performance of GD with SGD, and we observe that persistent SGD gives a huge improvement over the GD algorithm. Here we plot again the magnetization, averaged over realizations, as a function of time. Indeed, we see that this additional noise introduces a sort of built-in annealing in the algorithm, which can automatically reach perfect recovery even at very low sample complexity. If we look at the different realizations for the different algorithms, GD in red, vanilla SGD in green and persistent SGD in blue, we see that by tuning the persistence time we can prevent the algorithm from getting stuck in a plateau. We also see that the performance of SGD is significantly better than that of GD when starting from random initialization. We can also explore the role of the parameters, and maybe surprisingly we observed that, for the range of parameters that we analyzed in this problem, the best setting is a finite persistence time, while an infinitesimal persistence time would correspond to vanilla SGD, and also a finite fraction of samples in the batch, while vanilla SGD usually has an infinitesimal fraction.

I'm a bit out of time, so I will conclude with some open questions that we are investigating. The first would be to gain more insight from the DMFT: to try to characterize the noise introduced by the stochasticity of the algorithm and why it is beneficial, and also to gain some insight on optimal parameter settings. Then, this analysis can be extended to other models; I have not stressed it, but I think it was clear that the student here is a one-layer neural network, and this could be extended to networks with more than one hidden unit. We could also consider some variants of the algorithm: indeed, in a recent work, Sarao Mannelli and Urbani have derived the DMFT equations for gradient descent plus momentum, and we could try to see if there is an improvement if one does SGD with momentum. Finally, rigorous proofs of these equations could be a further direction of research. I think I will stop here. Thank you for your attention.