He moved to ICTP; before that he was a student of the French group, so today he will talk about phase transitions in high-dimensional inference. So please, Jean. Thank you very much, Aydrun. Do you all see my slides and do you hear me properly? Yes. Okay, great.

So hello everyone. Thanks for these great workshops and thanks to the speakers for all this new knowledge. Today I would like to discuss a recent work, which is a joint work with my colleagues Nicolas Macris and Clément Luneau from EPFL in Lausanne and Cynthia Rush, who is a mathematician at Columbia University in New York. What I will discuss is a special type of phase transition that happens in very sparse high-dimensional inference and learning problems and has been coined the all-or-nothing phenomenon. Roughly, it means the following: if you try to infer very sparse signals, that is, signals with many zero components, or to learn from high-dimensional data that is effectively of much lower dimension, living for example on a lower-dimensional manifold, then algorithms in the very sparse limit typically either perform as well as they possibly can, or as badly as possible, essentially recovering nothing. This is why it is called the all-or-nothing phenomenon.

Let me motivate a bit this idea of high-dimensional data with an effectively low-dimensional structure. One well-known example in signal processing is the following: take a natural image like this Lena picture and represent it in the proper basis, the wavelet basis. Here I start from the image, I apply the wavelet transform, and then I threshold a large fraction of the wavelet coefficients, the ones that matter the least, with the lowest amplitude. If you apply the inverse wavelet transform to this new, extremely sparse signal, you realize that the resulting picture is essentially identical to the original one. What does it mean? This image, which a priori lives in a space of order 10^5 dimensions, the number of pixels, has a much lower intrinsic dimension, of the order of hundreds, which is the number of wavelet coefficients that actually matter.

Another example is the MNIST dataset in machine learning: each vector representing these images of digits lives in 784 dimensions. Many authors, using many different techniques, have estimated the intrinsic dimension of the manifold on which these digits actually live, and it is of the order of 10 to 15, so much lower. One very cheap way to convince yourself that this is the case is what I did here. I just wrote the simplest machine learning algorithm you can think of, a simple linear classifier, a perceptron, using a logistic cost with an L2 penalization, trained with plain stochastic gradient descent. I take two digits, say nine and one, or three and five, whichever two digits you want; this defines the two classes I try to classify. Of course, a priori there is no reason for this data to be linearly separable; it has been generated in a very complex way, by people.
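Just to make this concrete, here is a minimal sketch of such an experiment, assuming scikit-learn and its OpenML MNIST loader are available; the digit pair, the split sizes and the regularization strength are illustrative choices (recent scikit-learn versions name the logistic loss "log_loss"; older ones call it "log").

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier

# Load MNIST (784-dimensional vectors) and keep two classes, e.g. digits 1 and 9.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
mask = np.isin(y, ["1", "9"])
X, y = X[mask] / 255.0, y[mask]

# Illustrative split, roughly matching the sample sizes quoted in the talk.
X_train, y_train = X[:5000], y[:5000]
X_test, y_test = X[5000:13000], y[5000:13000]

# Plain SGD on a logistic cost with a bit of L2 penalization: a linear classifier.
clf = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-4, max_iter=100)
clf.fit(X_train, y_train)
print("test error:", 1.0 - clf.score(X_test, y_test))
```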
If you run this stochastic gradient descent, here with 5,000 training examples, 8,000 test examples, a bit of L2 penalization, and the digits one and nine defining the two classes, what you find is that the simple linear classifier is indeed able to obtain a very good test error, in this case less than one percent misclassification. One clear reason behind this success of a linear classifier on such high-dimensional data is that these digits actually live in a much lower-dimensional space, and therefore you have many degrees of freedom to find a hyperplane separating the two classes.

All right, so all of that motivates the so-called bet-on-sparsity principle, which says essentially that an intrinsic or effective low dimensionality is often a crucial ingredient for the interpretability of high-dimensional data. In this work we used a statistical physics approach, which consists in considering idealized models of high-dimensional data with a low intrinsic or effective dimension. Thanks to statistical physics we can perform an average-case or ensemble analysis, which yields the typical behavior of algorithms; this is to be contrasted with techniques from computer science that generally yield worst-case analyses. We are able to access the so-called information-theoretic and algorithmic phase transitions, where information-theoretic means the phase transition related to the Bayes-optimal algorithm, for which we do not take computational constraints into consideration, while algorithmic phase transitions correspond to practical algorithms that run in polynomial time. And what we are able to obtain, again in contrast with computer science approaches that generally give bounds, are exact asymptotic thresholds and formulas in the so-called high-dimensional regime. High-dimensional means here that the number of parameters or components of the signal to infer or to learn is of the same order as the number of data points you have access to, and that this number is extremely large. The type of quantities we can access are free energies or mutual informations, which are essentially the same thing, minimum mean square errors, generalization errors, and so on.

Okay, so the paradigmatic models that we have been considering are the following. These are probably the most studied high-dimensional inference problems nowadays; they play the role of the Ising model in physics, if you want. The first one is random linear estimation, also called compressive sensing, which corresponds to the following inference problem. You have some signal x, which can be sparse, with many zero components. You have some random matrix Phi, which is known and which projects the signal x. Some noise is added to this linear projection; the noise is of course unknown and is standard Gaussian in this case. What you obtain is the observation vector w, and the inference task is: given w and the projection matrix Phi, are you able to recover the sparse vector x?

This is a special case of a more generic model, the so-called generalized linear models, where now, instead of having a linear relationship between w and x, you can put here any function, which can be nonlinear, stochastic, whatever you want. This model is also very nice in the sense that you can interpret it not only as an inference problem but also as a learning problem.
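To make the two observation models concrete, here is a small sketch of the corresponding generative processes; the dimensions, the sparsity level, the noise scale, the Gauss-Bernoulli prior on the signal and the 1/sqrt(n) normalization of the matrix are illustrative choices rather than the exact conventions of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, rho, noise_std = 1000, 300, 0.05, 0.1   # illustrative sizes

# Sparse signal: Bernoulli(rho) support with standard Gaussian values on it.
x = rng.standard_normal(n) * (rng.random(n) < rho)

# Known random projection matrix Phi.
Phi = rng.standard_normal((m, n)) / np.sqrt(n)

# Random linear estimation / compressive sensing: w = Phi x + noise.
w_linear = Phi @ x + noise_std * rng.standard_normal(m)

# Generalized linear model: apply any (possibly nonlinear, stochastic) channel
# to Phi x, e.g. the sign activation, which gives the perceptron / 1-bit case.
w_glm = np.sign(Phi @ x + noise_std * rng.standard_normal(m))

# Inference task: recover the sparse x given the observations and Phi.
```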
For example, if you take the activation function phi to be the sign, you recover the so-called perceptron neural network, which is the algorithm I used just before to classify MNIST. And you are in the so-called teacher-student scenario, which also corresponds to the so-called realizable setting in machine learning. What does it mean? Imagine that you want to use a perceptron neural network as a model for data generation; this is a generative model. What you do is generate data points, inputs that you feed into this neural network, from which you obtain labels. This gives you training data, and underlying this training data there is this rule. And if you take the weights of this neural network to be sparse, you have just generated data with a low effective dimension with respect to the ambient dimension of the data. Indeed, the dimension of the data is the number of input neurons here, while the effective dimension of the data you obtain is much lower: the labels here are only functions of a few neurons. So the sparsity of the weights allows you to generate data with an effectively low dimension. The learning task is then: given this data generated by the so-called teacher neural network, a student is given the data and tries to learn the network, knowing the architecture. The student doesn't know which features have been selected by this neural network, doesn't know the values of the weights, but knows at least that such a neural network has been used.

This generalized linear model includes many applications and special cases, such as compressive sensing, which we already discussed; CDMA and superposition codes in communications; the so-called phase retrieval problem in signal processing; the perceptron with the sign function, also called one-bit compressive sensing in signal processing; the rectified linear unit, a very important nonlinearity used in deep neural networks; or the classical sigmoid, that is logistic regression, a probabilistic model where the probability of the label is given by this logistic function in which the weights appear, and these are the data points.

Another class of models that we consider are the so-called low-rank matrix factorizations, which give a simple probabilistic model for sparse principal component analysis. In this inference problem, which you can also think of as a learning problem, you are given a matrix W that you can think of as a matrix of data, and what you are trying to do is find a low-rank representation of this matrix. The generative model for this matrix in our setting is the following: there is a low-rank matrix X that is multiplied by itself, X X transpose. You can even think of X as a vector; in this case this is a rank-one matrix. A noise, which is of course unknown, is added. And the problem is: given this W, are you able to recover the low-rank structure hidden in it, namely this rank-one matrix X X transpose? Lambda here is the signal-to-noise ratio, the control parameter in physics terms, and N is the dimensionality of X; the corresponding scaling of the spike is what is needed for the problem to be non-trivial, to have phase transitions.

Okay, so before going to the regime of very sparse signals, or data with an effectively low dimension, let us consider the generic setting where the variables are not sparse.
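As a concrete picture of this spiked model, here is a sketch of the generative process with binary entries, the non-sparse case discussed next; swapping the prior for a Bernoulli(rho) one gives the sparse spike considered later. The symmetric-noise convention and the sqrt(lambda/N) scaling of the spike below are the standard ones for this kind of model, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam = 2000, 3.0                  # dimension and signal-to-noise ratio (illustrative)

# Binary spike with +/-1 entries; for the sparse case one would instead draw
# x from a Bernoulli(rho) prior, so that most entries are zero.
x = rng.choice([-1.0, 1.0], size=N)

# Symmetric Gaussian noise matrix.
Z = rng.standard_normal((N, N))
Z = (Z + Z.T) / np.sqrt(2)

# Observation: rank-one spike plus noise, with the usual 1/sqrt(N) scaling
# that makes the recovery problem neither trivial nor hopeless.
W = np.sqrt(lam / N) * np.outer(x, x) + Z
```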
In this case, let us consider binary entries for this vector X. The generic scenario in this high-dimensional inference problem is one with a discontinuous phase transition and the appearance of a so-called computational gap. What does it mean? It means that if you plot the error of algorithms, you will observe discontinuous phase transitions and the appearance of so-called impossible, hard and easy regimes. Consider the Bayes-optimal algorithm, which attains the so-called minimum mean square error; the optimal algorithm fully exploits the posterior distribution given the data, and the minimum mean square error is defined here. If you plot this quantity as a function of the signal-to-noise ratio, where the higher the SNR the easier the inference task, you see that up to some threshold the MMSE is very large, and therefore you are in the impossible regime: whatever algorithm you use, you cannot beat the optimal one, and therefore you cannot reconstruct the signal. Then, suddenly, at this discontinuity, the Bayes-optimal estimator becomes able to perform well, and you enter the so-called hard phase, in the sense that even though the best algorithm, the optimal one, gives you something, practical algorithms, practical in the sense that they do not need the age of the universe to run, do not. Such practical algorithms include plain PCA, where you just do the eigendecomposition of this matrix and look at the principal eigenvector, and the so-called approximate message passing (AMP), which I won't discuss in detail, but which is essentially a version, for dense factor graphs, of the belief propagation or sum-product algorithm, and which is conjectured to be optimal in this problem among the class of polynomial-time algorithms. You realize that these algorithms need a higher signal-to-noise ratio, and therefore you have this hard phase, because in this regime we don't know any practical algorithm able to match the optimal performance. Then suddenly these algorithms, in this case the AMP algorithm, also have a discontinuous phase transition, and you enter the easy regime in the sense that the algorithm is able, in a finite amount of time, to match the optimal performance.

Another example is the perceptron problem. Now I will instead use the learning interpretation and plot what actually matters in machine learning, which is the generalization error. What is it? It is the error of the optimal estimation of a new label that has not been used in the training data. The training data is this matrix Phi, each row of which is a data point, and these are the labels. I give you a new data point and ask you to estimate the associated label; the generalization error is the squared deviation between the Bayes-optimal estimate of this label and the true one. If you plot the generalization error for this perceptron model, in the case where the weights used by the teacher network to generate the data are binary, the picture looks like this: you again have these impossible, hard and easy regimes. Impossible here in the sense that the Bayes-optimal algorithm performs poorly, and then suddenly there is this discontinuous phase transition to perfect generalization. But if you use a practical algorithm, this AMP algorithm, you need more data; at some point you obtain perfect generalization as well, and you are in the easy regime.
Here the x-axis is the sampling rate, that is, the number of data points you have access to in the training set divided by the number of parameters to learn, the dimensionality of the weights of the teacher neural network. So a very generic picture appears. The question we wanted to answer in this work is: does anything special happen in the very high sparsity regime, that is, when the effective dimension of the data is much lower than the ambient dimension in which the data points live?

This question had been partially addressed in recent works by Gamarnik and Zadik and by Reeves, Xu and Zadik, who first considered the compressive sensing problem that I discussed, which I recall here. They considered a sparse signal which is simply Bernoulli, so made of zeros and ones, and this parameter rho controls the sparsity: the higher rho, the more ones you have; the lower rho, the sparser the signal. What they did is take the known rigorous result in the literature, which gives an exact formula, that I will not describe, for the minimum mean square error. There is a so-called replica symmetric formula, a vocabulary that comes from physics, which predicts the minimum mean square error asymptotically in the high-dimensional limit. If you plot this quantity while varying the sparsity level, this parameter rho, and if you rescale the x-axis, which is the sampling rate, by a proper quantity that depends on the sparsity level, you observe that as the sparsity increases, which means rho decreases, the curve tends, when rho is very small, to a step function. This is why they call it the all-or-nothing phenomenon: below this threshold you essentially get an extremely poor result, and this line means as bad as it can be, while above it you jump to a regime where you can perfectly reconstruct the signal.

In order to study this model, what they did is to allow all the control parameters of the problem, the signal-to-noise ratio, the sparsity level, the measurement rate, to be sequences in N: they are now allowed to depend on the system size. This is in contrast with the usual statistical mechanics techniques, where one usually considers all control parameters to be fixed, not depending on the number of spins, in physics language. And what they show is this so-called all-or-nothing phenomenon; the name is very appropriate. They show that when the sparsity goes to zero, if the measurement rate, the number of accessible data points, is less than some threshold that depends on the sparsity and on the signal-to-noise ratio, then weak recovery is impossible, which means that even with the Bayes-optimal estimator you cannot say anything about the signal: the MMSE is one, where one is the maximum value. Above this threshold, strong recovery is possible, which means that you are able to reconstruct the signal perfectly.

I was very curious when I saw this result, and I wondered how generic it is. So what I did is take one of my favorite models, this sparse principal component analysis, and develop an analysis which is similar in spirit but technically very different, where again all the control parameters of the problem are allowed to be sequences in N.
But before analyzing this quite complicated model, what we did first was to take the known results in the literature, as they did for compressive sensing, to plot the minimum mean square error for the spike, that is, this rank-one matrix, and to see what happens when we let the sparsity of this vector X, or equivalently of this rank-one matrix, the spike, increase, so that it is made of more and more zeros. We are zooming on the corner of this phase diagram. The phase diagram tells you, as a function of the SNR and the sparsity level, where reconstruction is possible, where the hard phase is, which is this yellow region, and where the information-theoretic and algorithmic phase transitions are. We are zooming on this corner to see what happens, and surprisingly, the very same phenomenon appears. If you rescale properly the signal-to-noise ratio by an information-theoretic transition that was derived heuristically in papers by Florent and Lenka, you see the same phenomenon: a curve that tends to a step function when the sparsity level rho goes to zero, when there are more and more zeros.

So what we did is prove a very fine result on this model. I won't parse the statement, and this is not really the point here, but essentially it tells you that if you take any sparse signal, where the sparsity is really allowed to depend on n, so you can have very, very sparse signals, the truly sparse regime where the number of non-zero components scales sub-linearly with the size of the system, then we have a very precise control on the mutual information, which in physics terms is the free energy, the fundamental object you want to compute. It tells you how close you are to a simple scalar formula that you can plug into your computer and evaluate, and it controls the deviation. And importantly, from this theorem on the mutual information, you obtain as a corollary the object that probably interests you the most in order to predict the performance of the optimal algorithm, which is the minimum mean square error. You get two bounds that you can precisely evaluate, and from these bounds you see that the MMSE indeed verifies this all-or-nothing phenomenon: when the sparsity goes to zero, in the large-system limit, in the high-dimensional regime, there is a threshold below which this quantity is one and above which it becomes zero. So the all-or-nothing is really there, and now we have a theorem for it.

But we wanted to go beyond the information-theoretic analysis and also study what happens for algorithms, which had not been done before. So what we studied is this AMP algorithm, which, I recall, is a version of belief propagation for dense graphs, originally introduced for compressive sensing by Yoshiyuki Kabashima and rediscovered later by Donoho, Maleki and Montanari. It is a very simple recursion, but extremely powerful. This algorithm enjoys two very special properties. The first one is an exact characterization in the proper asymptotic limit thanks to a so-called state evolution, a simple scalar recursion that tells you the mean square error the algorithm will reach in the large-size limit, without even running it.
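The talk does not write this recursion down, but to give its flavor, here is a minimal sketch of a Bayes-optimal state evolution in the standard textbook form for the rank-one spiked model with a Bernoulli(rho) prior on {0, 1}; the overlap parametrization, the posterior-mean denoiser and the Monte-Carlo evaluation of the expectation are my own illustrative choices, not the exact equations of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def denoiser(y, gamma, rho):
    """Posterior mean E[X | gamma*X + sqrt(gamma)*Z = y] for X ~ Bernoulli(rho) on {0, 1}."""
    log_odds = np.log(rho / (1.0 - rho)) + y - gamma / 2.0
    return 1.0 / (1.0 + np.exp(-log_odds))

def state_evolution(lam, rho, iters=100, mc=200_000):
    """Scalar recursion tracking the overlap m_t of AMP on the rank-one model;
    returns the predicted per-component mean square error rho - m at the fixed point."""
    X = (rng.random(mc) < rho).astype(float)   # Monte-Carlo samples of the prior
    Z = rng.standard_normal(mc)                # and of the effective Gaussian noise
    m = 1e-6                                   # small positive initial overlap
    for _ in range(iters):
        gamma = lam * m                        # effective SNR of the scalar channel
        y = gamma * X + np.sqrt(gamma) * Z
        m = np.mean(X * denoiser(y, gamma, rho))
    return rho - m                             # uses E[X^2] = rho for a Bernoulli(rho) spike

for lam in [0.5, 1.0, 2.0, 4.0, 8.0]:
    print(lam, state_evolution(lam, rho=0.05))
```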
You can predict the behavior of the algorithm. Moreover, this algorithm is conjectured to be optimal among all polynomial-time algorithms. So what we did first is show that we still have the right to use this state-evolution analysis in this new regime, where all the parameters depend on n, the sparsity is extremely high, and so on; a priori, state evolution does not apply in this case. What we prove is that we are still allowed to use the state-evolution analysis in this extremely sparse regime. Informally, we prove that the state evolution that tracks AMP, and therefore the curves I will discuss now, remain valid in the very sparse regime, where very sparse means that the sparsity parameter can go down as one over log n to any power.

What we observe is that there is also an algorithmic transition: if you collapse the curves, rescaling them by a threshold that depends on the sparsity level, you see that the algorithm, this AMP, also experiences this all-or-nothing transition when the sparsity parameter goes to zero. But what is also interesting is that this transition happens at a totally different scaling. The AMP transition goes as one over rho squared, so the required SNR increases much faster than the one required from the information-theoretic point of view, which scales as one over rho. If you plot the AMP and MMSE behaviors, the practical algorithm and the optimal one, on the scale of the information-theoretic transition, you see that when the sparsity is very small a large computational gap appears; the x-axis here is on a log scale. So the problem becomes extremely difficult from a computational point of view.

Then we wanted to know how generic this transition is, so we went beyond this sparse PCA problem to see what happens in this very generic generalized linear model. You do not have to read the theorem here, this is not really the point, but it again gives you guarantees on how far the fundamental object, the mutual information between the data, the labels and the weights of this neural network, is from a simple scalar expression, and you can control the deviation very precisely. All of that is valid precisely in the regime, actually in a broader regime, but at least in the regime, where you observe this all-or-nothing phase transition. From this very fine theorem, which gives you the finite-size fluctuations of the mutual information, of the free energy, you can also derive an exact analytic expression for the generalization error. Let me also mention that you can precisely take the limit when the sparsity parameter really goes to zero, when the signal is almost only made of zeros, the true sparse limit: you get an asymptotic expression for this potential, and it becomes extremely simple in this limit. Anyway, all that to say that from these results you can also plot the generalization error for this learning model. Here is the perceptron that I already discussed, and this is the ReLU neural network, again a nonlinearity used a lot in deep neural networks. You observe again that when the sparsity goes to zero with n, the curves collapse at a threshold which is explicit, the measurement-rate threshold; the x-axis is the number of data points you have access to in the training set.
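To be concrete about the quantity plotted on the y-axis, here is a small sketch of how one could estimate such a generalization error empirically for a sign-activation (perceptron) teacher, using a plug-in student estimate w_hat in place of the Bayes-optimal estimator that the theory actually tracks; the Gaussian inputs and the normalization are illustrative assumptions.

```python
import numpy as np

def generalization_error(w_teacher, w_hat, n_test=10_000, seed=3):
    """Mean square deviation between the teacher's label and the student's
    prediction on fresh inputs, for a sign-activation (perceptron) teacher."""
    rng = np.random.default_rng(seed)
    n = w_teacher.size
    X_new = rng.standard_normal((n_test, n)) / np.sqrt(n)  # fresh data points
    y_true = np.sign(X_new @ w_teacher)                    # labels from the teacher rule
    y_pred = np.sign(X_new @ w_hat)                        # plug-in student prediction
    return np.mean((y_true - y_pred) ** 2)
```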
You see that when the sparsity of the underlying weights of the teacher neural network goes to zero, which just means that the effective dimension of the data is much lower than its ambient dimension, this all-or-nothing behavior again appears in these different models. And again, we have an explicit formula for the threshold.

So let me summarize quickly what I said. The statistical physics predictions can be rigorously validated even in the corners of the phase diagram, and by corners I really mean regimes where the control parameters depend on the system size and can go to zero or diverge with the number of variables in the system. This all-or-nothing phenomenon seems extremely universal, at least in the models that we considered: universal in the reconstruction of very sparse information, that is, information with an effectively low dimension with respect to the ambient dimension of the data. How generic it is, I think, is an extremely interesting question; we don't really know. There is a very recent paper that seems to indicate that in a very large class of additive Gaussian models, which does not include these generalized linear models but at least includes the sparse PCA that I discussed, as well as other types of models, here X can be anything, it can be a tensor, this all-or-nothing actually happens for all sublinear sparsities, a very wide regime of sparsity. I also discussed this all-or-nothing in algorithms, in particular in this AMP algorithm; it happens at least in sparse PCA, and we proved it. A very interesting question is what happens for other algorithms, like MCMC or gradient-based ones. Is it specific to algorithms that are conjectured to be Bayes-optimal, at least in some regimes, like AMP, or is it a more generic phenomenon from an algorithmic point of view? Another question is what happens in more complex models. Here I discussed a single-layer learning problem; what happens when you have multiple layers, for example? Is it still the case, or do things smooth out? I have no idea. Of course, the most fundamental question is why this phenomenon happens. I don't have an answer right now, I don't have a real understanding of why this is going on. Now we know it happens in wide generality; I think it is very important to understand why this is the case. I will stop there. Thank you very much.

Thank you, Jean. Questions and comments? Okay, let me start. This signal-to-noise ratio, could it also be interpreted as a ratio of the number of measurements versus the dimension of the dataset? So here there were two main control parameters that I have been discussing. In the inference setting, in this sparse PCA problem, I was considering the signal-to-noise ratio, which controls the signal strength with respect to the noise. In the perceptron, for example, in this simple neural network that I have been discussing, a more natural control parameter is instead the number of data points you have access to in your training data; in this case this is the sampling rate. But fundamentally, the signal-to-noise ratio and the sampling rate play a similar role: they are just control parameters that allow you to tune the difficulty of the inference or learning task. So they play a similar role, essentially. Yes, there is a question from Haiping Fan. He asks: can deep neural networks be approximated by multiple layers of generalized linear models? Yes.
So deep neural networks are multi-layer generalized linear models: if you concatenate multiple generalized linear models, you exactly obtain a deep neural network. The learning task in deep neural networks is then, given this concatenated GLM, to learn appropriate weight matrices in order to fit the training data and hopefully to generalize from there. So the answer is yes. We have papers, so one can look at my papers and also at other papers by Florent and Lenka, where we start from our theory, which is under control for the single-layer GLM, and concatenate these GLMs in order to say something about deep neural networks. Yes. And also, this analysis you presented is quite similar to a finite-size scaling analysis? Yes. Can you repeat? So, this analysis is quite similar to a finite-size scaling analysis in such cases? Yes. So in order to obtain all these results, what we do is essentially a finite-size analysis. Instead of getting directly asymptotic results for the free energy or mutual information, we place ourselves at finite size, and in this case all the parameters of the problem can naturally be functions of this size, and what we do is to precisely control what happens at finite n. Yes. So this result is a finite-size result. Exactly. And from there you can essentially zoom in on the corners of the phase diagrams. Yes. Okay. So are there other questions? Okay, if not, let's thank again all the speakers of this session. So this morning's, or rather this afternoon's, session closes here.