So thanks to the organizers for organizing such a nice workshop that I'm enjoying a lot. I will be telling you about analyzing the performance of gradient-descent-based algorithms, but before getting to that I have a motivational slide. The main bulk of the work was done by my students, notably Stefano Sarao Mannelli (one of the two students is actually not here), and our collaborators Giulio, Chiara, Florent and Pierfrancesco; part of the talk will also mention work done with Jean, Nicolas and Léo.

So, the motivational slide. We all know, and that's what we are discussing here, that there has been revolutionary progress in machine learning in the last 20 years or so, but there are also some open questions. Here I take a snapshot from an article I read recently that summarizes the three questions that are, at least for me, the most relevant and interesting. Let us go through them. Why don't heavily parametrized neural networks overfit the data? We already discussed this a little in the workshop: the riddle of overparametrization, that even though we have so many parameters, and there are many possible ways to set them to fit the training data, the algorithm somehow magically manages to choose a configuration that generalizes well. How is this lack of overfitting in neural networks possible? The second question is about the effective number of parameters: what is the effective dimension associated with the data set and the task, something that would translate directly, for instance, into the sample complexity, that is, how many samples of a given data set we need in order to learn correctly? We don't really have a good understanding of that either. And the third one: backpropagation, this gradient descent, is a very straightforward algorithm, so why doesn't it head for a poor local minimum? Why does it go to a local minimum with good generalization? These are the questions that are discussed these days at all deep learning theory meetings. The interesting thing is that although I read this article recently, it is actually really old: it's from '95, so 25 years ago, written by Leo Breiman as a reflection after refereeing papers for NIPS, and it's still very current. This tells me that although we moved a lot on the engineering and application side, we didn't move so much on the theory side, because we don't understand these questions much better than we did 25 years ago.

This being said, I like to put the question of how we go towards a theory of deep learning in this triangle diagram, where I say that in order to build a theory we really need to take into account the interplay of the following three ingredients: the architecture, the data structure and the algorithm. You see, if we don't take into account the algorithm, as for instance in yesterday's talk by Helmut, then we have a beautiful theory but we don't know whether the configurations that exist are tractably findable, so this is useless for practice. If we don't take into account the data structure, or in other words if we consider the data to be the worst possible case, then we have very strong theorems in computer science telling us that training even the simplest neural network is NP-complete.
Of course the real data are not worst-case data, but what we lack completely is a characterization of the magical property of the data that makes the task easy; we only know that worst-case data would not make it easy. And the architecture: of course, if you are not considering the right architecture... we don't really know that there is no shallow network able to do the tasks we see done by deep networks nowadays, we just haven't found any; the best thing we have are the deep networks, so if we are building a theory we had better take into account the right architecture.

The trouble with the interaction of these three ingredients is that each of them is studied in a different one of the traditional fields or subfields of engineering and computer science. It's computer science and optimization theory that deal with the algorithms, it's signal processing that deals with the data structure, and it's approximation theory and learning theory that deal with the architectures, and they are not always talking to each other. Nowadays maybe more and more, but traditionally not so much, and that's maybe where the trouble of putting these things together comes from. I also put here a talk by Elchanan Mossel this year at a bootcamp at the Simons Institute for the Theory of Computing, who put things slightly differently but made the same point. Elchanan went even further, and I loved what he did: he took some of the influential papers in the theory of deep learning and rated them according to how well they take these three ingredients into account. I was tempted to do something like that for the talks in this conference and then I shied away from it; maybe it would be better to do it as a collective effort, to remove the personal biases.

Now, going to statistical physics: the approach from statistical physics was, from the very beginning, taking all three into account. Not necessarily as well as it should, not necessarily with the architectures one would want, etc., but that's what I'm trying to build on. So where did it all start? In my view the most influential of these early papers are the works by Gardner and Derrida, notably the one where they studied the teacher-student perceptron. We already heard about this in Marc's talk and also a bit in Andrew's talk. What is this? It is a single-layer neural network with no hidden unit whatsoever; there is only the input and the output. The teacher takes some ground-truth vector of weights w*, pushes the input through some nonlinear function phi, imagine a sign here, and generates an output. The student observes the input data and the output labels; she doesn't know the ground-truth vector of weights w*, and she tries to reconstruct this w*, or at least some w that implements the same function. That's the teacher-student perceptron, and we study it in a regime that is not the traditional statistics regime where the number of samples is much bigger than the dimension: here the number of dimensions and the number of samples are comparable, their ratio is a constant called alpha. This is the regime in which current machine learning is working, where the two are comparable; one may be bigger than the other, but they are not vastly different. This problem was also solved back then.
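To make this setting concrete, here is a minimal sketch in Python of generating teacher-student perceptron data in this regime; the function name, the binary teacher prior and the Gaussian inputs are illustrative choices of this sketch, not necessarily the exact conventions of the papers discussed.

```python
import numpy as np

def generate_teacher_student_data(n=1000, alpha=2.0, phi=np.sign, seed=0):
    """Teacher-student perceptron: the teacher produces y = phi(X w* / sqrt(n)).

    n     : input dimension
    alpha : ratio (number of samples) / (dimension); the regime of interest
            keeps alpha of order one while n grows
    phi   : output activation (a sign in the classical case)
    """
    rng = np.random.default_rng(seed)
    m = int(alpha * n)                          # number of samples
    w_star = rng.choice([-1.0, 1.0], size=n)    # hidden teacher weights (binary here)
    X = rng.standard_normal((m, n))             # i.i.d. Gaussian inputs
    y = phi(X @ w_star / np.sqrt(n))            # labels produced by the teacher
    return X, y, w_star

# The student only sees (X, y) and tries to find some w implementing the same
# function; the generalization error is then measured on fresh inputs drawn
# from the same distribution.
X, y, w_star = generate_teacher_student_data()
```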
It was solved by Györgyi just one year later, with the replica method that in this community we know very well. And when I say solved, what he computed is summarized in this picture: the optimal generalization error that any algorithm, even an exponentially costly one, can reach, as a function of this parameter alpha. Alpha, again, is the number of samples divided by the dimension; when alpha is bigger I have more samples and it's easier to learn, and when alpha is small it's harder to learn and the error is big. What he realized is that, for instance in the case where the teacher weights are binary, this optimal generalization error has a phase transition: there is a sudden drop from a crappy generalization error to zero generalization error at a transition point around alpha equal to 1.245 or so. This sharp drop in the generalization curve is for physicists very interesting, of course. And from the point of view of learning theory, this is one of the few cases where we can actually compute something like that up to all constants; in general, for a general network and general data, this is very difficult and we have no idea. So the point is that we would like to get pictures like that for more realistic settings.

Now let me show you, this was all 35 years ago or so, the recent progress on this specific model, the teacher-student perceptron. This is from our paper from two years ago, and it will motivate the second part of the talk. I split the contributions we made into three points. First, Gardner and Derrida considered the activation to be a sign and the weights to be binary or spherical, but we can now write a closed formula for any activation function and any class of priors; we don't have to write a new paper for each of them. It's a formula with one special function that depends on the prior and one that depends on the activation, so it's nice to have. Second, the work of Györgyi was purely information-theoretic: he didn't say whether this generalization error is reachable algorithmically or not. What we understand now, by working with an algorithm called approximate message passing, which is closely related to what we call the TAP equations in physics, is which part of this curve is reachable and which one is not, and I will show you in two slides. And third, speaking in front of mathematicians and computer scientists, we always used to have to apologize that the replica method is non-rigorous, that nevertheless we believe it is correct, etc. We also managed to actually prove that these Gardner-Derrida and Györgyi-type formulas are rigorously correct formulas for the best generalization error, and that was largely thanks to the efforts of Jean, Nicolas and Léo in this collaboration. And Florent, well, everybody except me, whatever.

To show you what we get, let me show you some examples of phase diagrams of this teacher-student perceptron. Here I'm taking spherical weights and sign activation, the most commonly studied case. The best generalization error as a function of alpha is the red curve. The black points are the performance of the approximate message passing algorithm on instances of size something like 10,000, I forgot exactly. And the blue points, that's just some black-box binary classification.
In this case we just took logistic regression from scikit-learn. And you see that, first of all, the algorithm does what the theory predicts it should do; this is the case. And the black-box algorithm, which knows essentially nothing about the problem, is very close, so there is not a big gap. That's a nice case, but it's not always so nice. For instance, in this case where the teacher weights are binary and the activation is still a sign, the optimal curve, the red curve here, is the one I already showed from Györgyi's paper. Now what the algorithm does is the black points on a single instance, and the green curve on average from the theory, and you see that there is a gap: the algorithm doesn't do as well as the theory says we could do with an exponentially costly algorithm. We believe this gap is of the very same nature as the one Afonso was discussing on Monday: if approximate message passing doesn't manage to do it, no polynomial algorithm will. That is of course conjectured, not proven, but there is something fundamentally hard about this region. Maybe what's more interesting here is the difference with the black-box algorithm. This is still logistic regression from scikit-learn, but it doesn't know that the weights are binary; it just sees a black-box classification problem. And it's not picking up this phase transition, it just keeps decreasing basically as before; it's not picking up the structure. So there is a gap between the performance of approximate message passing and this kind of traditional gradient-based algorithm.

And there are cases where things are even worse. That would be, for instance, again binary weights, but now the activation is slightly different: I take the absolute value of the scalar product, shift it a little, and take the sign of that. We call this the symmetric door activation; the constant is chosen so that half of the outputs are plus one and half are minus one. Again the red curve is the information-theoretic curve and green is the performance of the approximate message passing algorithm. And here, in the inset, we tried with TensorFlow a bunch of networks and ended up with some three-layer fully connected neural network; it kind of managed to learn the function, say, for alpha around 10 or 15, and even then not quite exactly, whereas approximate message passing does it at alpha around one. So there is a gap here. Where does this gap come from? That will be the subject of the rest of the talk.

To summarize what I said so far: in these simple models we have a good understanding of the information-theoretically optimal performance and of what message-passing algorithms do. But in practice people don't use message-passing algorithms, because they are very sensitive to the data being really random and we don't have their versions for multilayer networks; in practice people train with gradient-descent-based algorithms. And even in these simple settings we have gaps between what gradient-descent-based algorithms are doing and what message passing is doing. We would like to understand where these gaps come from and whether we can get rid of them by changing the algorithms.
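As an aside, the black-box baseline mentioned above is really just off-the-shelf logistic regression run on the teacher-student data. A minimal sketch of that comparison, reusing the hypothetical generate_teacher_student_data from the earlier snippet, might look as follows; the sizes and seeds are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train the black-box classifier; it knows nothing about the binary prior
# on the teacher weights, it just sees a binary classification problem.
X_train, y_train, w_star = generate_teacher_student_data(n=1000, alpha=2.0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Generalization error: fresh inputs, labels produced by the same teacher.
n = X_train.shape[1]
rng = np.random.default_rng(1)
X_test = rng.standard_normal((5000, n))
y_test = np.sign(X_test @ w_star / np.sqrt(n))
print("black-box generalization error:",
      np.mean(clf.predict(X_test) != y_test))
```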
Understanding these gaps, and whether changing the algorithms removes them, is something we need if we are to progress on what gradient descent is doing in more practical settings, and we don't have any such understanding. We have some dynamical mean field theories, but none of them actually gives us an equation that we could plot and that would give us the performance of those algorithms. So this is, kind of embarrassingly, an open question, and it's the one we are trying to attack. Deep learning is fueled by gradient descent, so we need to understand gradient descent, and we need to make progress on the methods to do that in our community. I just mention here the nice exception where we do understand gradient descent, as was really nicely presented yesterday by Andrew: the linear networks. The dynamics there is really interesting, but the fixed point is not; the fixed point is just linear regression, so a lot of features are missing. We can learn a lot from it, but not everything, and we need to develop methods that let us analyze the nonlinear cases. So this is where I'm going.

What are these gradient-descent-based algorithms that we will be analyzing? In physics we most often talk about the Langevin algorithm: we take the Hamiltonian and the gradient of the Hamiltonian with respect to the spins, or variables. I will be working with a spherical model, so there is a spherical constraint; in machine learning this kind of constraint is called weight decay, something people often use in practice in training. And then there is the Langevin noise: it has zero mean and variance proportional to the temperature, and if this temperature is one, then it is designed exactly so that, for a finite system size and infinite time, the dynamics samples the Boltzmann measure of that Hamiltonian. That's a nice property of the Langevin algorithm. But we are of course not interested in what happens for a finite system at infinite time; we are interested in what happens for a large system in times that do not grow with the system size, because that's the practical scaling we can hope to achieve. Then we are not guaranteed to sample the posterior, and the whole question is what happens at large but constant time as the number of spins diverges. Being able to analyze this at any temperature T also gives us access to what gradient flow is doing, the continuous-time gradient descent algorithm that doesn't have the noise and just goes in the direction of the gradient.

So in which model can we analyze it? Can we analyze it in this teacher-student setting? Ideally, what do we want? We want some high-dimensional non-convex loss function that resembles these neural networks. The random perceptron would be a candidate, but we don't want the random perceptron because we also want a notion of generalization error. This is really important: we don't want to be analyzing an optimization problem because, as we correctly heard yesterday from Joachim, learning is not optimization. We want to understand performance on a learning problem, where we have a loss function that we optimize but that's not the goal; the goal is to achieve good generalization. So we need some other notion of what is good than just being low in the energy landscape, which means we need some teacher-student model.
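As an aside, before moving on to the model, here is a minimal discretized sketch of the Langevin update just described; the step size is a placeholder, grad_H stands for the gradient of whatever Hamiltonian is being studied, and the spherical constraint is enforced here by a simple projection rather than the Lagrange multiplier used in the actual analysis.

```python
import numpy as np

def langevin_step(x, grad_H, dt=0.01, temperature=1.0, rng=None):
    """One discretized Langevin step on a spherical model.

    dx/dt = -grad H(x) + noise, with noise of zero mean and variance 2*T per
    unit time; at temperature zero this reduces to plain gradient descent,
    the discretization of gradient flow.  The spherical constraint |x|^2 = n
    is enforced here by projecting back onto the sphere.
    """
    rng = rng or np.random.default_rng()
    n = x.size
    noise = np.sqrt(2.0 * temperature * dt) * rng.standard_normal(n)
    x = x - dt * grad_H(x) + noise
    return x * np.sqrt(n) / np.linalg.norm(x)   # project back onto the sphere
```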
We need some ground truth and a way to be close to the ground truth. The teacher-student perceptron would be a good model, but unfortunately we don't know how to analyze the dynamics for the teacher-student perceptron; that's too difficult. I could have put the citation: there is actually a paper by Giulio Biroli, Valentina Ros, Pierfrancesco Urbani and Francesco Zamponi, maybe I forgot somebody, where they go towards the dynamical mean field theory for the perceptron, but they end up with equations that they basically cannot solve, so we are not quite there; it's very difficult. We need a problem where we can solve the dynamics, and this brings us to the only problem in physics where we know how to solve the Langevin dynamics: the spherical p-spin model. So we try to do something like the spherical p-spin model, but it also needs to satisfy the other criteria. We could take the pure spherical p-spin model, but that one actually behaves bizarrely; we know in physics that it is very special. The one that is more generic, and has a better chance of being universal in how its landscape behaves, is the mixed spherical p-spin model. And so this is how we set up a problem that satisfies the whole shopping list; this is the model for which we will analyze the dynamics. We call it the mixed spiked matrix-tensor model.

What's the problem? There is some ground truth x*, an n-dimensional vector, that we observe in two different ways. We take the vector times its transpose and add noise of variance Delta_2; that's a matrix, and you observe the matrix. And we take the p-fold tensor product of this vector and add noise of variance Delta_p; that's a tensor, and you observe the tensor. The problem is: from the knowledge of this matrix Y and this tensor T, can we reconstruct x*? That's the inference problem, and the notion of goodness is whether at the end we are close to x* or not. But the way we will solve the problem, because that's the algorithm we want to analyze, is to write the corresponding Hamiltonian and run gradient descent on it, at temperature one or zero depending on whether we want to look at the Langevin algorithm or at gradient descent. This Hamiltonian is really the Hamiltonian of the spherical mixed 2+p spin glass, for those of you who know it, with a small difference: in the usual problem T and Y are completely random, and here they are not completely random, they have this small low-rank perturbation that makes it possible to discover the ground truth x*, and that changes the model slightly.

Now, before going to the dynamics, let me first show you what we know about the information-theoretic possibility of reconstructing the signal in this problem and what approximate message passing does, because this is what we will be comparing to. I have two slides on that. First of all, what is the estimator that tells us about the information-theoretic performance? It's the one where we take the posterior, sample it, and compute the marginals with respect to the posterior; that's the right estimator, and that's what the Langevin algorithm aims to do.
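To fix ideas, here is a minimal sketch of generating the two observation channels and the Hamiltonian of the spiked matrix-tensor model just described, for p equal to three; the normalizations and prefactors are schematic and may differ from the paper's conventions by constant factors.

```python
import numpy as np

def spiked_matrix_tensor(n=100, delta2=0.5, delta3=1.0, seed=0):
    """Observations of a (schematic) spiked matrix-tensor model with p = 3."""
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(n)
    x_star *= np.sqrt(n) / np.linalg.norm(x_star)             # spherical ground truth
    Y = (np.outer(x_star, x_star) / np.sqrt(n)
         + np.sqrt(delta2) * rng.standard_normal((n, n)))      # noisy matrix channel
    T = (np.einsum('i,j,k->ijk', x_star, x_star, x_star) / n
         + np.sqrt(delta3) * rng.standard_normal((n, n, n)))   # noisy order-3 tensor channel
    return Y, T, x_star

def hamiltonian(x, Y, T, delta2, delta3):
    """Quadratic-loss Hamiltonian on which gradient descent / Langevin is run."""
    n = x.size
    res2 = Y - np.outer(x, x) / np.sqrt(n)
    res3 = T - np.einsum('i,j,k->ijk', x, x, x) / n
    return np.sum(res2 ** 2) / (4.0 * delta2) + np.sum(res3 ** 2) / (4.0 * delta3)
```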
Sampling the posterior is one way to compute these marginals, but we can also do it with message passing. And then there is maximum a posteriori inference which, as we heard from Joachim yesterday, is not the right thing to do; sampling the posterior is the right thing to do, and here we know what the right thing is because we know the model. Nevertheless, everybody does maximum a posteriori, so let's analyze it as well. So these are the two things we will be analyzing. Let me show you the result for the first one. This is just a list of references; it goes pretty much along the same lines as the perceptron paper I was telling you about. In the last three or four years we have learned how to solve such problems for generic priors and generic activation functions, we understand the behavior of the approximate message passing algorithm, and we have a rigorous proof of all that; these are the references if you want the details. The result is that all we need to write for this problem is the so-called replica-symmetric free entropy, or free energy if you want; I just changed the sign and call it entropy. It looks like this for the present model, and it's very simple: it has one order parameter, the overlap with the ground-truth configuration, and that's it. Then it has p, the order of the tensor, and the two variances of the noise; those are the only parameters in the model, plus p if you want, but I will always take p equal to three up to the very last slide. So it's a very simple formula, and we know from the previous theory that the optimal error corresponds to the m that is the global maximum of this formula, and that the error the algorithm reaches corresponds to the local maximum with the worst error. Sometimes they are the same, but they don't have to be.

If I collect this information in a phase diagram, the x axis is the variance of the noise on the tensor and the y axis is one over the variance of the noise on the matrix. Maybe it's an odd choice, but when you stare at it for two years you get used to it. This is how the phase diagram looks. If the variance on the matrix is smaller than one, that's the green region, then approximate message passing matches the information-theoretic performance: it recovers something correlated with the signal, the magnetization is positive. In the red region, the information-theoretically optimal performance is such that the magnetization is still zero; this is a region where the noise is too big, the signal is completely hidden, and no matter what you do algorithmically, even with exponential cost, you cannot find it. It is information-theoretically undetectable. And the hard region is the one where the signal is information-theoretically detectable, physically this is a ferromagnet, but approximate message passing is not able to find it; this is the hard phase, conjectured to be hard for all polynomial algorithms.

Now the whole question is: how does the Langevin dynamics do in this diagram? It could do better than approximate message passing, but I just told you that this is conjectured not to be the case; it could do the same, it could be worse. That's our question: what does it do? So this is again the Langevin dynamics, and now, when we go to solving it, as I said, the mixed spherical p-spin model is the only one where we actually know how to solve the dynamical equations, with dynamical mean field theory.
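As a hedged illustration of the recipe above, where the global maximum of the replica-symmetric free entropy gives the Bayes-optimal overlap and the worst local maximum gives what the algorithm reaches, here is a small sketch; phi_rs below is only a qualitative stand-in (an entropic term plus matrix and tensor signal terms) and is not guaranteed to be the paper's exact expression.

```python
import numpy as np

def phi_rs(m, delta2, deltap, p=3):
    """Stand-in replica-symmetric free entropy as a function of the overlap m."""
    return 0.5 * (np.log(1.0 - m) + m) + m ** 2 / (4.0 * delta2) + m ** p / (2.0 * p * deltap)

def overlaps(delta2, deltap, p=3):
    """Bayes-optimal overlap (global maximum) and the overlap the algorithm
    reaches (the worst, lowest-overlap local maximum, possibly m ~ 0)."""
    m = np.linspace(1e-6, 1.0 - 1e-6, 200_000)
    f = phi_rs(m, delta2, deltap, p)
    cand = list(np.where((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:]))[0] + 1)
    if f[0] >= f[1]:          # m ~ 0 is itself a (boundary) local maximum
        cand = [0] + cand
    cand = np.asarray(cand)
    m_opt = m[cand[np.argmax(f[cand])]]   # information-theoretically optimal overlap
    m_amp = m[cand.min()]                 # overlap the algorithm is stuck with
    return m_opt, m_amp

# Easy-looking region (small matrix noise): the two coincide.
print(overlaps(delta2=0.5, deltap=1.0))
# Hard-looking region: a good maximum exists but the algorithm stays near zero.
print(overlaps(delta2=1.5, deltap=0.05))
```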
Solving the dynamics of the spherical p-spin model is something that was done a long time ago, in these very influential and nice works that basically laid the foundations of our understanding of structural glasses today, by Crisanti, Horner and Sommers and by Cugliandolo and Kurchan. For the mathematically minded colleagues in the audience, the correctness of these equations, derived with non-rigorous techniques, was proven by Gérard Ben Arous, Amir Dembo and Alice Guionnet some years later. But that is for the purely random spherical p-spin model; we are in the planted one, so we cannot use their equations and their proof completely off the shelf, we have to redo them to take the planted solution into account. Fortunately this is possible, it's not a big deal, and this is what the equations look like. Of course, unless you already know the Cugliandolo-Kurchan equations you will probably not understand where this comes from, but just to give you an idea of how these dynamical equations look: these three are for temperature one, these three are for temperature zero. They are written in terms of three order parameters: the autocorrelation between configurations at different times, the response function between a perturbation at time t prime and what happens at time t (these are the ones that also appear in the Cugliandolo-Kurchan equations), and, because we have the planted configuration, a third order parameter which is simply the overlap with the planted configuration, the one we are really interested in. So they are self-consistent equations for these three order parameters, but the one we care most about is the overlap, because it is directly the correlation with the ground truth; our question is whether it stays zero during the dynamics or becomes positively correlated with the signal.

We solve these equations numerically (the code is here if you are interested), and what we get is shown in this picture, compared to the corresponding dynamical evolution of the magnetization under the approximate message passing algorithm. In the approximate message passing algorithm the magnetization as a function of time does something very intuitive: as the noise gets smaller, the magnetization, the correlation, gets bigger, and the time it takes is basically independent of the noise, slightly shorter for smaller noise. That is intuitive: smaller noise, easier problem, bigger magnetization, a bit shorter time. What happens with the Langevin dynamics is intuitive in the sense that as the noise gets smaller, the magnetization we reach at the end is bigger, and it's actually the same as for the AMP algorithm, which is a consistency check, but it takes longer. So as the noise gets smaller the problem should get easier, yet converging to the right point actually takes longer; it's not getting easier, it's getting harder. This is not intuitive, but it is what is happening: as the noise on the tensor gets smaller, the tensor becomes stronger and makes the landscape more complex, and it takes the algorithm longer to find where it should go. That's what's going on, and I will develop on this intuition. What we can do now is collect these times it takes to converge and extrapolate where they diverge.
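A hedged sketch of that extrapolation step: fit the measured convergence times with an assumed power-law divergence and read off where they blow up. The numbers below are synthetic and purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def tau_model(delta, A, delta_c, gamma):
    """Power-law divergence of the convergence time at an unknown threshold."""
    return A * (delta - delta_c) ** (-gamma)

# Hypothetical convergence times of the Langevin dynamics measured at a few
# values of the tensor noise delta_p (synthetic numbers, illustration only).
delta_p = np.array([3.0, 2.5, 2.2, 2.0, 1.9, 1.8])
tau     = np.array([12., 18., 28., 45., 70., 130.])

popt, _ = curve_fit(tau_model, delta_p, tau, p0=[10.0, 1.5, 1.0],
                    bounds=([0.0, 0.0, 0.1], [np.inf, 1.7, 10.0]))
A, delta_c, gamma = popt
print(f"extrapolated threshold: delta_p ~ {delta_c:.2f}, exponent gamma ~ {gamma:.2f}")
```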
There was a slide with these extrapolations, but you can imagine what we are doing here: these are standard methods, we fit with a power law, and we are careful to do things properly, fitting not over half a decade but over several decades, et cetera. When we collect the points where the Langevin dynamics stops working, these are the points in this picture, we split the phase diagram by this line into a region where the Langevin algorithm finds a correlation with the signal as good as approximate message passing, and this region here where approximate message passing works but the Langevin algorithm stays stuck at zero correlation, zero magnetization. So this is what is going on, and it explains the pictures we saw before: the problems are harder for the Langevin algorithm than for message passing, even though the two algorithms have the same goal, they both set out to tell us something about the marginals of the posterior. We can do the same thing for gradient flow, simply setting the temperature to zero; this gives a curve slightly above. That is expected, it should be worse than Langevin because it is optimizing the wrong objective, as we discussed yesterday, but it is not infinitely far away; there is a region where gradient flow works.

Now, how do people in computer science and optimization theory explain something like that? A popular explanation is: look at the landscape of the problem, there are some spurious local minima, and you need the signal-to-noise ratio to be large enough that only the minimum you are interested in survives and all the spurious ones go away. There is a long line of work in computer science trying to prove that, thanks to the way the architecture is set up in deep learning, the landscape is such that there are no spurious local minima, of course under assumptions that are mostly not realistic; if it were true in general, things would be solved, it would be easy. Still, it is an interesting line of work, and our question here is: is this the only mechanism there is? Is it really true that we need to get rid of all the spurious local minima for the algorithm to work? If it were, the line I showed on the previous slide should be exactly the line where all the spurious local minima disappear. In general this is something we don't know how to analyze, but in our model, because it is so simple, basically fine-tuned for all the physics methods to work, we can analyze it. How do we do that? We count the number of critical points of a given index, that is, with a given number of negative directions (no negative direction in the case of minima), with the so-called Kac-Rice formula. Again I will not go into the details; this is a well-known technique, and just this year and last year it has been used for the non-mixed case of this model in this paper, that would be the annealed case, and the quenched case was developed in this very nice paper, again by co-authors and collaborators. Once again, that's the formula and I am not expecting you to really understand how we obtained it. What it gives us is the logarithm of the number of minima that have a given magnetization, that is, a given overlap with the signal, a given energy term for the matrix part, and a given energy term for the tensor part.
Okay, so this is how the calculation is organized, and when we analyze what it gives us, we can plot the line where all the spurious minima disappear and only the one corresponding to the planted configuration remains. So let's do it: that's the purple line, we call it the landscape trivialization threshold if you want. Now the interesting thing is that it is not the line where gradient flow stops working. This is interesting, because it means there is a region where there still are spurious local minima, but gradient flow doesn't care: it works, it finds the best possible correlation with the signal. So how is that possible, what's going on? Looking in more detail into how the dynamics behaves, we can understand the following. If we plot the energy reached as a function of time at four different points, corresponding to points along a line in the phase diagram, we see that either we are in the very hard region where we just stay stuck on some plateau (the same for the yellow curve), or we are stuck on the plateau but eventually escape, or we escape sooner. So what is this plateau? Here, this is from solving numerically the differential equations we derived for the dynamics, and these are the same quantities from simulations of a rather big system that we sparsified so that it could be that big; you see these are the same curves. Here you can't really see it, but there are many, many curves; this is the upper envelope and this is their average, and they all look like that, only the plateau has a different length. If you want to see the picture I will show you on the monitor afterwards. But what is this plateau? That's the key point.

This plateau is nothing else than what we call the threshold energy in the spin-glass literature, for the non-planted system. So what is the algorithm doing? The way to interpret it is that at the beginning the algorithm has no idea that the planted solution is there, so it just does what it would do if the planted solution were absent: it would happily converge, with the kind of aging that people have studied a lot in the physics literature, to these threshold states. Then what matters is the nature of these threshold states: are they actually nice minima that will block the dynamics, or do they have a negative direction towards the signal? If they have a negative direction towards the signal, the dynamics will find it and go to the signal. Looking at these curves, we hypothesize exactly this: let's assume the dynamics goes to the threshold states, and then all we need to check is whether the threshold states have a negative direction towards the signal. This is again something we can do within the equations; it is all there already, and once you have written all these equations it's just a matter of putting them together correctly. We need to compute the overlap of the threshold states and plug it into the condition that tells us whether or not there is a negative direction towards the signal. Putting these two together we get a formula, some expression relating Delta_p and Delta_2, and if we plot it, that's the blue curve, and it perfectly explains the points we obtained by integrating the dynamical equations numerically. So this seems to be the right curve, and it has a simple formula like that.
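The criterion of a negative direction towards the signal is obtained analytically in the paper, but as a purely illustrative numerical proxy one could, on a simulated configuration stuck at the plateau, look at the curvature of the Hamiltonian along the spherical direction pointing towards the planted configuration. The function below is such a hypothetical check, not the actual computation used.

```python
import numpy as np

def curvature_towards_signal(x, x_star, H, eps=1e-3):
    """Finite-difference second derivative of H along the spherical geodesic
    going from the current configuration x towards the planted one x_star.

    A negative value signals a descending direction towards the signal at x;
    this is only a crude numerical proxy for the analytical criterion.
    """
    n = x.size
    d = x_star - (x @ x_star / n) * x        # component of x_star orthogonal to x
    d *= np.sqrt(n) / np.linalg.norm(d)      # tangent direction, normalized to the sphere radius

    def on_sphere(theta):
        return np.cos(theta) * x + np.sin(theta) * d   # stays on the sphere of radius sqrt(n)

    return (H(on_sphere(eps)) - 2.0 * H(on_sphere(0.0)) + H(on_sphere(-eps))) / eps ** 2
```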
So that's the explanation of what the dynamics is doing in this problem. And this is the analogous curve: the previous one was for gradient descent, so zero temperature, and this green curve, which you have actually already seen, is the curve for the Langevin dynamics. Again it goes perfectly through the points and stops here at two; there is a multicritical point there which is an interesting one, and we don't really know its nature yet. But what I like to do is to correct the wrong explanation. The right explanation is the following: as the signal-to-noise ratio increases, there is eventually a trivialization of the landscape where only the good minimum remains and all the others become saddles, but that's not what matters for the dynamics. What matters for the dynamics are these threshold states, which happen to be the highest-lying ones, and whether they are saddles or minima: when they are minima they block the dynamics, and when they are saddles they let the dynamics go to the solution. It is this transition that explains whether gradient descent works or not, and that was the line I plotted on the previous slide.

This slide is just to say that all these plots were for p equal to three, so a tensor of order three. We don't quite understand everything, because if we go to a tensor of order four things are a bit trickier: the phase diagram is more complicated, there are some really interesting things happening close to the critical line, but it is close to the critical line and our integration of the equations is not precise enough to reach large enough times to properly explore what is happening there. So we don't quite understand what is happening here; this curve would be just our formula, which predicts some kind of re-entrance, but we don't really know whether that re-entrance is actually real or not. So for those of you who want to think about something more, you can think about the p equal four case. For p equal three everything seems consistent; there is no mystery that we could identify, and we always try, that's what we do.

A conclusion about this gradient descent part: gradient flow is worse than the Langevin algorithm, which is expected because it is optimizing the wrong objective, but nevertheless in practice it is more robust, faster, and so on. Both are worse than approximate message passing; two years ago I would have been surprised if you had told me that's the case, we didn't know, and now the evidence is clear. The question is how we can tune them, how we can change them, so that they actually reach the performance of approximate message passing. If you wait for a couple of months, I hope we will have an answer for that; I'm pretty confident we will, we actually do have one, we just need to confirm it. The second point is that gradient flow works even when spurious local minima are present, and this is something we can quantify with the Kac-Rice formula. And the third point is that this is the first model that non-trivially resembles learning in its nature, or at least it is an inference problem, where we have a closed-form conjecture for the threshold of gradient-based algorithms that includes all the constants. In computer science people would typically be happy with a bound up to some log or polynomial factor, whereas in physics we really care about the constants.
I mean, not always in statistics or computer science, but we should, I think. And so the question here is: is this applicable beyond the present model? Again, the answer is yes, and we are in the process of confirming it, because the recipe we followed in the end didn't need the full original dynamical equations. That's the nice result here: the recipe is simpler, and we can hopefully use it for models for which we are not actually able to write the dynamical mean field theory; that would be really cool, but again, wait a few months.

To finish, I want to come back to this triangle diagram and tell you where we are, we meaning my group. Everybody is making an effort in their own way, and my way of making an effort on this diagram, filling things in along each of these directions and making it realistic, is, on the algorithm side, analyzing realistic algorithms, the algorithms that people are actually using; that's this talk. On the side of the architecture, that's actually where we are not so good yet: we can deal with the committee machine, which has a single hidden layer with a few hidden units. For those of you who saw the posters yesterday, there was a poster on batch training of the committee machine by Benjamin and Antoine, and a poster by Sebastian on the online committee machine, so we are understanding things and making progress on that. But so far it is a technical challenge: we don't know how to derive the corresponding formulas for the case where the hidden layer is extensive, or where we have several extensive hidden layers. Even the non-rigorous replica method doesn't work, or at least it is not straightforward to apply, you don't get a nice formula. So this is a big open problem: how to go beyond the committee machine. On the side of the structure of the data, most of you heard the first talk of this conference by Marc Mézard; that's our current progress on how to put relevant structure into the data, making the data lie on a hidden low-dimensional manifold. We have really nice progress on the analysis of that model, but that, again, is a couple of months away. So that's where we are going; the biggest open problem is in the architecture direction, and for the other two we are fairly confident we are getting what we want.

With this, I just give you the list of references on which this talk is based. This one is unrelated, but for those interested in the general area of applying machine learning to the physical sciences, we reviewed what has been done in the past ten years or so here; I'm quite happy about that paper. And with this slide, let me open the discussion and the questions. Let's thank the speaker first. Thank you.