We go. Is that OK? OK, so let's go. Starting now. So hello, everybody. Thank you very much for the invitation. I'm a student at the École Normale Supérieure in Paris, working with Florent Krzakala, and the title of my talk is Exact Asymptotics with Realistic Data.

So the question of generalization in machine learning algorithms can be tackled within several different frameworks, and the framework I will be interested in here is high-dimensional statistics, which is very heavily linked to the statistical physics of learning, one of the topics of this conference. What is typically encountered there is a typical, average-case analysis, as opposed to the worst-case analysis found in statistical learning theory. The focus is on benchmark random-design problems, obtaining exact solutions as opposed to the upper bounds traditionally encountered in statistical learning theory. But there is a catch to these exact solutions: the assumptions are quite strong, namely high-dimensional problems, where the dimensions of the problem go to infinity, and random-design assumptions. So a question that naturally arises is: how realistic are the statistical physics benchmarks, and what can we do to make them more realistic?

Let me introduce the typical setup I will use during this talk: the teacher-student generalized linear model. The name teacher-student comes from the statistical physics of learning community; in machine learning you could call this learning a generative model, it is the same thing. So we observe a generative model parameterized by a function f0, a ground-truth vector w0, and a data matrix X, which is typically taken i.i.d. Gaussian. Then we try to reconstruct this model using a student. Here I will take a very simple one, a convex generalized linear model, parameterized by a convex loss function l and a regularizer, a penalty r(w). As I mentioned before, the dimensions will go to infinity, and the typical goal in this type of problem is to understand the statistical properties of the estimator w*. For example, can I compute the mean-squared error between w* and w0, an overlap, and so on? There is a very rich literature of results on this type of problem. The question I want to ask is: is it possible to go beyond this i.i.d. assumption and introduce some correlation, to reflect as much as possible a realistic scenario?

The first thing I would like to present is joint work with Bruno Loureiro, Hugo Cui, Sebastian Goldt, Marc Mézard, Florent Krzakala, and Lenka Zdeborová, where we propose a block-covariance model to create a scenario where the teacher and the student act on two different feature spaces. So consider samples of the form (u, v), a vector in R^{p+d}: u is described by a covariance matrix Psi, modeling a feature space; v by a covariance matrix Omega; and the interaction between the two by a cross-covariance matrix Phi. The generative model then generates labels using the samples u, so through the feature space with covariance Psi, and the student tries to learn from the samples v, with covariance matrix Omega.
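As a minimal sketch of this setup, the following snippet samples jointly Gaussian pairs (u, v) with a block covariance and fits a simple ridge student on v to labels generated from u. The particular construction of Psi, Omega, and Phi, the sign teacher, and the square-loss student are illustrative assumptions, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 2000, 300, 500             # samples; teacher space R^p, student space R^d

# Illustrative block covariance (my construction, chosen so the joint
# (u, v) covariance [[Psi, Phi], [Phi^T, Omega]] is positive semidefinite):
F = rng.standard_normal((p, d)) / np.sqrt(d)
Omega = np.eye(d)                    # student covariance Cov(v)
Phi = F                              # cross-covariance Cov(u, v)
Psi = F @ F.T + 0.1 * np.eye(p)      # teacher covariance Cov(u)

# Jointly Gaussian samples: v ~ N(0, I), then u = F v + sqrt(0.1) * noise,
# which gives exactly Cov(u) = Psi and Cov(u, v) = Phi above.
V = rng.standard_normal((n, d))
U = V @ F.T + np.sqrt(0.1) * rng.standard_normal((n, p))

# Teacher generates labels from u; the student only ever sees v.
theta0 = rng.standard_normal(p)
y = np.sign(U @ theta0 / np.sqrt(p))

# Student: ridge-regularized square loss on v (closed-form minimizer).
lam = 0.1
w_star = np.linalg.solve(V.T @ V / d + lam * np.eye(d), V.T @ y / np.sqrt(d))
print("training MSE:", np.mean((V @ w_star / np.sqrt(d) - y) ** 2))
```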
Many works have been proposed to deal with correlated scenarios, and perhaps the closest to this one is Hastie, Montanari, Rosset, and Tibshirani, who at some point used a similar block-covariance structure for the specific case of ridge regression. Here, the idea was to do this for a generic generalized linear model. So let's jump directly to the solution. It's not very nice, but I want to show what it looks like. The theorem we have is: if you consider the unique fixed point of this set of self-consistent equations, you can exactly characterize the training and generalization errors as the dimensions of the problem go to infinity. So there are closed-form formulas for these quantities. The proof uses convex Gaussian comparison inequalities, a family of proof techniques that is quite popular at the moment and worked very well for this type of problem; they were introduced by Mihailo Stojnic in 2013 and popularized in later works.

So how well did this work? Again, we were trying to reproduce realistic scenarios. For ridge regression, as had been observed in earlier works, a lot of realistic scenarios can be reproduced: feature maps can be included directly in the problem, and the random-design prediction is quite accurate. However, it was also interesting to look at when the model breaks down, which is what you see on the right-hand side of the plots. This is just logistic regression, a binary classification task. The red curves are for data generated with a GAN, and there the fit is quite good. The blue curves are taken directly on the real data, and there this random-design block-covariance model does not work so well. So we see that classification is more problematic. Still, the Gaussian model is quite interesting: it captures some features of the real curves. So it would be nice to have another realistic benchmark problem that includes correlated Gaussians and goes beyond the binary classification case, and this motivates us to study the classification of a mixture of K Gaussians with a convex GLM.

Further motivation for this problem comes from several scenarios observed recently in machine learning where Gaussian mixtures accurately describe a realistic setting. For example, the team of Romain Couillet has shown, using random matrix theory, that some data generated by GANs can be accurately described by Gaussian mixtures, and other works are related to deep learning. Also, classifying a Gaussian mixture with K clusters is clearly a benchmark problem in machine learning. And on another note, Gaussian mixtures are universal approximators, provided you have enough Gaussians, of course.

So let's go. The data distribution we chose, the generative model that implicitly defines the teacher in this teacher-student framework, is a Gaussian mixture model. I have data points x in R^d, one-hot encoded labels y in R^K, and a joint distribution with weights rho_k, the probabilities of the clusters, and Gaussian densities parameterized by generic means mu_k and covariances Sigma_k. Here is a familiar picture for three clusters in dimension two. The goal is to learn this Gaussian mixture, to learn separators, using again a convex generalized linear model.
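As a small illustration of this generative model, here is a sketch of sampling from a K-cluster Gaussian mixture with one-hot labels; the weights, means, and covariances chosen below are arbitrary, and the helper name sample_gaussian_mixture is mine.

```python
import numpy as np

def sample_gaussian_mixture(n, rho, mus, Sigmas, rng):
    """Draw n points from a K-cluster Gaussian mixture with one-hot labels."""
    K, d = mus.shape
    ks = rng.choice(K, size=n, p=rho)            # cluster assignments ~ rho
    X = np.empty((n, d))
    for k in range(K):
        idx = ks == k
        X[idx] = rng.multivariate_normal(mus[k], Sigmas[k], size=idx.sum())
    Y = np.eye(K)[ks]                            # one-hot encoded labels in R^K
    return X, Y

rng = np.random.default_rng(0)
K, d = 3, 2                                      # three clusters in dimension two
rho = np.array([0.3, 0.3, 0.4])                  # cluster probabilities rho_k
mus = 3.0 * rng.standard_normal((K, d))          # generic means mu_k
Sigmas = np.stack([np.eye(d) for _ in range(K)]) # covariances Sigma_k
X, Y = sample_gaussian_mixture(1000, rho, mus, Sigmas, rng)
```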
This time, instead of learning just a vector, we are learning K separating hyperplanes, so we are learning a matrix W in R^{d x K}. Examples include ridge regression, or softmax with a cross-entropy loss: any loss and regularizer, as long as they are convex, otherwise we don't have the solution. Again, I'm showing you the result, from a recent preprint with Bruno Loureiro, Gabriele Sicuro, Alessandro Pacco, Florent Krzakala, and Lenka Zdeborová. It has the same flavor as the previous one, so it's a bit heavy: consider a fixed point of a self-consistent set of equations; then, as the dimensions go to infinity, we can compute exactly the training and generalization errors, here the misclassification rate. The formula is not very nice, it's a bit heavy. I'm happy to give more intuition, because all these results share the same core intuition, so if someone is interested I'm happy to discuss this type of formula later on. What's interesting to note is that this is a very generic statement; very few assumptions are made. However, it simplifies greatly if you make assumptions on the covariances, for example that they are diagonalizable in the same basis, or on the separability of the functions: you can then get statements that are a lot clearer, with far fewer parameters. In most relevant cases it reduces to a low-dimensional statement, which allows for a more intuitive analysis of what is happening.

Now let's look directly at some examples, first with synthetic data, so random-design problems. On the left-hand side we plotted ridge-penalized logistic regression on K Gaussian clusters with identity covariances, and we plot the generalization error as a function of the sample complexity, for two, three, four, and five clusters. What we recover is a type of curve that had already been observed for two clusters, that is, for binary classification, where a separability phase transition occurs and there is a kink in the generalization curve, which otherwise remains monotonic. For a higher number of clusters we now observe these separability phase transitions at different places; there is a rich literature on this going back to Cover. On the right-hand side is a plot of the generalization error, again as a function of the sample complexity, but this time for different regularization strengths, and we see that regularization cancels this kink in the generalization curve. There is also a comparison to the Bayes-optimal scenario. Again, the fact that regularization removes this kink had already been observed in other models.

I talked about real data, so let's look at an example on real data: a binary classification task on MNIST. On the left-hand side, we took MNIST and considered the classification task of odd versus even digits. We took the empirical covariances of the two clusters, designed our generative model, the random-design model, with them, and compared the random-design prediction to the actual regression obtained with a standard machine learning solver. We see that logistic regression actually cannot tell the difference: the prediction is exactly the same for the real data and for the Gaussian approximation.
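To give a concrete feel for this comparison, here is a rough sketch using scikit-learn's small 8x8 digits set as a stand-in for MNIST: train logistic regression on the real odd-versus-even task and on a Gaussian clone built from the per-class empirical means and covariances. The small ridge added to the covariance and the solver settings are my assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the talk's experiment: odd vs even digits on the 8x8 digits set.
X, y = load_digits(return_X_y=True)
y = y % 2                                        # 1 = odd digit, 0 = even digit
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def gaussian_clone(X, y, rng):
    """Resample a binary dataset from per-class empirical means/covariances."""
    Xs, ys = [], []
    for c in (0, 1):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        # Small ridge keeps the empirical covariance numerically PSD
        Sigma = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        Xs.append(rng.multivariate_normal(mu, Sigma, size=len(Xc)))
        ys.append(np.full(len(Xc), c))
    return np.vstack(Xs), np.concatenate(ys)

Xg, yg = gaussian_clone(X_tr, y_tr, np.random.default_rng(0))

err_real = 1 - LogisticRegression(max_iter=5000).fit(X_tr, y_tr).score(X_te, y_te)
err_gauss = 1 - LogisticRegression(max_iter=5000).fit(Xg, yg).score(X_te, y_te)
print(f"test error trained on real data:      {err_real:.3f}")
print(f"test error trained on Gaussian clone: {err_gauss:.3f}")
```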
On the right-hand side is Fashion-MNIST, a data set which is, I guess, a bit more structured. And there, there is a discrepancy between the theoretical prediction in blue, the Gaussian approximation represented by the green stars, and the orange dots for the real data.

Maybe a brute-force way to address this: here is another binary classification task on MNIST, but this time classifying the digits 0 to 4 versus 5 to 9. Again you can do a Gaussian approximation with two clusters, the blue stars, or you can use ten clusters. I give an idealized view of this on the right-hand side: I have a data set, and I can approximate it with either two Gaussians or four Gaussians. Adding more Gaussians is a brute-force way to get closer to the real thing, and this is a bit what we observe on these curves. I will come back to this in the discussion at the end of the talk.

For the remaining time, I would like to talk a bit about the sketch of the proof and why it was challenging. What are the difficulties here? First, we are learning a matrix, so implicitly we need to understand how the different hyperplanes are correlated by the learning process, and this is something non-trivial. Second, each cluster is different: we want to allow different covariance matrices, so intuitively we cannot use the same quantities to characterize the effect of each cluster. Actually, this was observed in a recent paper by the group of Christos Thrampoulidis, where they showed that the convex Gaussian comparison inequalities, this nice family of proofs I mentioned earlier in the talk, actually break down beyond the least-squares case for multiclass classification problems. Just a quick disclaimer: perhaps there is a solution using this method; it's just that, as these inequalities are currently formulated, they could not find it. So we did not use this method to solve the problem.

How did we solve it, then? From the statistical physics side, the replica computation works, but how do we make it rigorous? We used approximate message passing algorithms. What is an approximate message passing algorithm? It's a family of iterations, inspired by statistical physics, with a very nice property: exact closed-form asymptotics called state evolution equations. What do I mean by that? At each time step of the iteration, you can exactly characterize the statistical properties of the iterates using a low-dimensional iteration. One of the nice things is that AMP can treat matrix-valued variables as a whole, this is no problem, and it can handle block correlation structures, which is called spatial coupling in the AMP parlance. Here, intuitively, we have different clusters that behave differently, so spatial coupling will be useful in the formulation of the problem. And AMP methods are very adaptable. Just a quick bit of history: initially, as I said, these state evolution equations were derived in statistical physics using heuristic methods. The first proof was due to a great mathematician, Erwin Bolthausen, and the proof method he used was popularized later on by Bayati and Montanari, notably in 2011.
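To illustrate state evolution on the simplest possible instance, here is a sketch of scalar AMP with a soft-thresholding denoiser for sparse regression, rather than the matrix-valued AMP of the paper; the threshold rule and problem sizes are illustrative assumptions. The mean-squared error of the high-dimensional iterates is tracked by a one-dimensional recursion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2000
delta = n / d                        # sample complexity
sparsity, sigma = 0.1, 0.05          # signal density and noise level (assumed)

x0 = rng.standard_normal(d) * (rng.random(d) < sparsity)   # sparse ground truth
A = rng.standard_normal((n, d)) / np.sqrt(n)               # i.i.d. Gaussian design
y = A @ x0 + sigma * rng.standard_normal(n)

def soft(u, t):                      # soft-thresholding denoiser eta
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

x, z = np.zeros(d), y.copy()
tau2 = sigma**2 + np.mean(x0**2) / delta                    # state evolution init
for _ in range(30):
    theta = 2.0 * np.sqrt(tau2)                             # threshold rule (assumed)
    u = x + A.T @ z
    x = soft(u, theta)
    # Onsager correction: previous residual times the average derivative of eta
    z = y - A @ x + z * np.mean(np.abs(u) > theta) / delta
    # State evolution: same tau2 recursion, evaluated by Monte Carlo on the
    # one-dimensional channel X0 + tau * W
    w = rng.standard_normal(200_000)
    x0s = rng.standard_normal(200_000) * (rng.random(200_000) < sparsity)
    tau2 = sigma**2 + np.mean((soft(x0s + np.sqrt(tau2) * w, theta) - x0s)**2) / delta

print(f"empirical MSE of AMP iterate: {np.mean((x - x0)**2):.4f}")
print(f"state evolution prediction:   {(tau2 - sigma**2) * delta:.4f}")
```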
So what does an AMP look like? There are many different forms of AMP, but the one I'm interested in is the following: a sequence of matrices u^t and v^t, defined through a Gaussian matrix Z, which can be a block Gaussian matrix, and two sets of nonlinearities, h and e, which here are matrix-valued functions. So there is a Z^T h(v^t) and a Z e(u^t); this looks like a first-order method, a classical descent method. In addition, you have these two correction terms, which look like momentum terms, and the bracket terms, which are Jacobian-like averages. They are inherent to AMP: you cannot discard them, and they are really crucial for getting the nice state evolution property.

Now, how do we use this to solve the problem? I have a target, the W* defined by the convex GLM on the Gaussian mixture, problem (1), and I have a tool, an iteration with this nice state evolution property, iteration (2). I need to design the nonlinearities h and e such that the fixed point of iteration (2) matches the optimality condition of problem (1), which is convex. Then one needs to find a converging trajectory; in this case, convexity is very helpful. Once you have done these two steps, what have you shown? That the estimator W* can be found as the fixed point of an AMP trajectory. This means that the state evolution equations characterize this fixed point, hence the estimator, and then all you need to do is take the fixed point of the state evolution equations. This is what we did.

Now just a quick look at the design of the AMP. An AMP is very often designed from a factor graph representation of the problem, but in this case the factor graph representation is really not obvious. So what we did is reformulate the optimality condition of the problem a bit, and then match it with the fixed point of the AMP iteration; this actually gives you the form you need for the functions h and e. And a quick technical point on the validity of the state evolution equations: when you reformulate the gradient, sorry, the optimality condition of the problem, you get this block structure with non-separable functions. Here L-tilde and R-tilde are reformulations of the loss and regularizer that include the covariance matrices, so there is non-trivial mixing. The necessary technical tools were developed by Javanmard and Montanari in 2012 for the block structure, and by Berthier, Montanari, and Nguyen in 2018 for non-separable functions. But the combination of both, and formally this is really a formal point, had not been included, so it appears as a side result in a recent work with Raphaël Berthier, where we prove state evolution equations for a larger class of approximate message passing algorithms. And yeah, this basically concludes the proof.
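Schematically, the recursion just discussed might be written as below. This is a hedged sketch for entrywise nonlinearities under Bayati-Montanari-style normalizations, not the exact block, matrix-valued iteration of the paper.

```python
import numpy as np

def amp(Z, h, e, dh, de, u0, iters):
    """Sketch of the generic AMP recursion, for entrywise nonlinearities
    h, e with derivatives dh, de, and Z with i.i.d. N(0, 1/n) entries:
        v^t     = Z e(u^t)   - h(v^{t-1}) * (1/n) sum de(u^t)
        u^{t+1} = Z^T h(v^t) - e(u^t)     * (1/n) sum dh(v^t)
    The Jacobian-average brackets are the Onsager correction terms;
    discarding them destroys the state evolution property."""
    n, _ = Z.shape
    u, v, h_prev = u0, np.zeros(n), np.zeros(n)
    for _ in range(iters):
        v = Z @ e(u) - h_prev * (de(u).sum() / n)    # forward pass + correction
        u = Z.T @ h(v) - e(u) * (dh(v).sum() / n)    # backward pass + correction
        h_prev = h(v)
    return u, v
```

Designing the AMP for the problem then amounts to choosing h and e so that a fixed point of this recursion reproduces the optimality condition of the convex problem, which is exactly the matching step described above.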
So what are the future directions that would be interesting with respect to this work? On the relevance to realistic scenarios, I guess it was quite clear from the plots: Gaussian models are relevant to a certain degree, and the brute-force way to make them more realistic is to add more clusters. But the formula is already quite heavy, and if I keep adding clusters, in the end I would just have a problem more complicated than the original one, which would not be useful. So is there a middle ground, a clever parameterization that describes machine learning problems in an exact or close-to-exact way, in the statistical physics sense, while at the same time remaining sufficiently intuitive to be useful?

On the technical side, there are a lot of improvements that can be made related to AMP methods. There are really a bunch of possibilities offered by approximate message passing: these methods can be composed to an almost arbitrary degree of complexity, and they are very interesting objects. Here we work with dimensions that go to infinity, but AMP methods are actually amenable to finite-size analysis: how fast do we need the dimensions to go to infinity? It's possible to characterize this, and a lot of work has been done in this direction by Rush and Venkataramanan in 2018. Finally, universality properties: the i.i.d. Gaussian randomness at the core can be replaced by a universal i.i.d. matrix, and one of the core works on this was done by Bayati, Lelarge, and Montanari in 2015.

And with this last slide, I would like to conclude. Thank you all very much for your attention, and I would like to thank the wonderful collaborators who worked with me on this project: Bruno Loureiro, Gabriele Sicuro, Raphaël Berthier, Lenka Zdeborová, and Florent Krzakala. Thank you again.

Thank you, Cédric, for this very clear talk. It was very useful and pedagogic. Thank you very much. Thanks a lot. So Cynthia Rush is the next speaker, from Columbia University. Whenever you, one second, sorry, I have to stop the recording.