Thank you. Thank you very much for the invitation. As Stefan said, I am here for the physics part, and I will talk about something that at first sight seems unrelated to what Laura was talking about, but as you will see during the talk, the methods are very much related and the phase transitions will be very much the same ones. So during the talk I will try to make links to what Laura was saying. The problem I will be concerned with today is learning a rule. What do I mean by learning a rule? An example would be that I give you a number of photos of cats and dogs, I tell you which one is a cat and which one is a dog, and I ask you to stare at them; then I give you new photos. At some point when you were little this happened to you, right: your parents told you "this is a cat, this is a dog", and so you learned, and now if I give you a new set of photos with dogs and cats, each of us in the room can very easily tell which one is which. So learning a rule in this sense means: learn what is the key thing about a picture so that we can tell whether there is a cat or a dog in it. When you think about it as humans, it seems like a really easy problem, but from the point of view of a computer it is not such an easy problem. Think about what the computer must do, about how a photo is stored: at best it is just an array of pixels, and for each pixel we have the three colors, and the intensity of each color is just stored in bits. So the beginning of a stored photo looks like that, and then there are millions of these bits. Now the goal of classifying what is a cat and what is a dog is to find a function such that, if this string of bits is fed into it, it outputs for instance plus one for a cat and minus one for a dog. When you put it like that, as a mathematical problem, it is rather complicated: how, in that string of zeros and ones, am I supposed to see that it is a dog? The excitement of today is that this used to be a very hard problem for computers even, say, 10 or 15 years ago, and now, thanks to convolutional deep neural networks, we are very good at reaching basically human performance on tasks like that. But from the point of view of theory we would like to understand what is going on, how it is done, what the limitations are, and how the parameters should be set. Now, what we have been doing in physics for hundreds of years is to take the object of study and say that, for instance, the chicken is spherical and it is in a vacuum; that is the caricature. And we want to do something like that with neural networks: take the simplest models and build from there, to build understanding. So what would be the simplest model of a neural network, where we can start understanding what is going on? That would be a single-layer feed-forward neural network, or perceptron, or a generalized linear regression as we will see later. And what is it? F would be the matrix of data, so different rows would be different pictures, and each column would be one pixel, one bit of that pixel. So the matrix F stores the data, Y would be the labels, who is a cat and who is a dog, and then
the function that this neural network is trying to fit to the data is just the sign of a linear combination of the different columns of the data. And these coefficients x_i are what we do not know and should learn from the many examples of F and Y that we have. Let me just say a word about notation, because many of you, being in statistics, are probably confused by it: why am I not calling the number of samples little n and the dimension d? I will stick to the physics notation, but let me write it here. N for me is the dimension (in physics N is the number of particles, so we like to call it big N) and M is the number of samples. F is the matrix of data; Y, this is probably the only one that looks familiar, are the labels, they are usually called Y; and X are the weights to fit. This can stay here, so that if you forget you can just look here; it is hopefully consistent throughout my slides. So far I have said nothing; this is just the simplest setting. But now, where is the model, where is the modeling? The model is here, and it is actually not something that I invented: it was studied in physics 30 years ago, exactly this simple perceptron, in what they call a teacher-student model. Instead of taking F to be the true pictures of cats and dogs, or whatever else you want to classify, you take the matrix F to be an i.i.d. Gaussian matrix, so each element is just an i.i.d. Gaussian number. Then you take some ground-truth vector of weights that I will call x*, with components x*_i, and each component is taken from some distribution p_X, which I will consider to be basically plus/minus one or Gaussian, something very simple. Then the teacher creates the labels in the following way: it takes the matrix of data, multiplies it by this ground-truth vector of weights, and takes the sign, and those are the true labels. The algorithmic or mathematical problem that the student is then faced with is this: the student gets the matrix F and the labels Y, so each row of the matrix F has dimension N and there are M samples, and the goal is to learn the rule, that is, to recover x*. I tell the student that the rule is the sign of this matrix times a vector, and I tell her the p_X, to be nice, to make it easy, but I do not tell her the vector x*. That is the goal. And how do we quantify what is best? We look at what people usually look at in learning, which is the generalization error. That is, if I present a new data sample, a new row of the matrix F for which the label was not given, the student has to come up with a procedure to produce a label that is as similar as possible to the one the teacher would give, that is, to the one obtained with the true x*. That is the measure I want to optimize. And we are in the high-dimensional regime. Of course, if the dimension N were, say, 5 and the number of samples went to infinity, then this would be a simple problem, like fitting a line through many points.
Algorithmically it is a simple problem. But here we will be in a limit where the number of samples is large, but the dimension is also large, and the ratio between the two is of order one, like two, or five, or 0.5. That is what makes this problem challenging: we really have few samples relative to the dimension, and the question is how well we can do in this situation. Now, not only was this model defined and studied back then, it was also solved, in 1990, in this paper by Géza Györgyi, again in physics, in Physical Review A, a canonical physics journal, for the case where the teacher generates the rule with an x* that is binary. So each component of x* is randomly either plus one or minus one; they are hidden, the student knows the p_X but does not know which components are plus one and which are minus one, and tries to find that out. Györgyi in his paper has this figure. This is some measure of error; I do not remember which one exactly, but big means bad and low means good. And alpha, what is alpha? That is the ratio between the number of samples and the dimension, so the more samples I have, the easier it should get. What he showed is that in the optimal generalization error there is a sharp jump, at the point where alpha equals 1.245-something, down to a generalization error that is zero, meaning that the student is in principle able to learn the x* exactly, hence multiply by exactly the right vector and get zero generalization error. So there is a phase transition, and in the physics sense this is a first-order phase transition, because some measure of error jumps discontinuously from a finite value to zero. Now, this was not said in the previous talk, but the hard phases that were discussed there are, from the physics point of view, induced by first-order phase transitions: when there is a first-order phase transition in a computational problem, those are exactly the cases where algorithmically there are these hard phases. From this point of view, as far as I know, this is probably the first paper where a problem in which this hard phase occurs was analyzed. I will get back to what is going on here, because so far this curve is information-theoretic: it asks, with unlimited computational power, how much information about the rule there is in this small number of samples. And then there is this branch, unphysical as he calls it, but it will not be so uninteresting, as we will see in the rest of the talk; back then they did not really have a good understanding of this branch, it was some unphysical branch, uninteresting for the analysis. Just a bit of history: this was in the 90s, and in the 90s, this being just one example, there were hundreds of papers in physics, in statistical physics, studying artificial neural networks. Of course, back then we did not have deep learning and things were not yet working in practice; there were pretty much no startups, maybe some, but not by the millions. Still, there was a lot of work done in statistical physics on these kinds of problems, and these would be some of the key names, people who wrote reviews and books about these problems.
But of course, while many questions were answered back then, many were left open; I will show the main ones on the next slide. Then, as it usually goes when there is an active topic, there came a winter of that topic in statistical physics: in the early 2000s, say during my PhD thesis, nobody was working on these problems in statistical physics. And then of course, with the appearance of deep learning, which became widely known and used, people realized: oh, but we actually have something to say about these things, maybe. So there is a huge comeback of these topics these days. What about the open questions that were not answered back then? With the replica method, a statistical physics method, they made an explicit prediction of what the optimal generalization error looks like. So there was a formula, but there was no theorem behind it; it was based on a method that is infamously known to be non-rigorous. In other contexts people have been trying to establish it rigorously, but in this context it stayed open for a long time, as you will see. Then there are always these two questions: what is the smallest error reachable information-theoretically, and what is the smallest error reachable algorithmically, efficiently? Sometimes these are the same and sometimes they are not, so how is it here? And then, if we change the activation function, for instance from the sign that was popular back then to the ReLU that is popular these days, how do the answers change? And what if the weights were different: instead of plus/minus ones we take, say, sparse ones, because that can also be interesting in other contexts. What do we need to do, do we have to write a new paper and repeat the calculation every time, does the calculation still go through, etc.? Those are the kinds of basic questions one should ask. Before going through the answers, I will talk about a completely different problem. Why am I doing that? Because I want to motivate the general model that I will then introduce, a model that encompasses them both; the general model is interesting precisely because it includes many interesting special cases. Another special case of the generic model is compressed sensing. What is that? Compressed sensing is this cute idea: if you take pictures and store them in your computer, you store them compressed. You do not store every pixel like this matrix F that I was writing; that is not actually how the computer stores the images, because it would take too much space, so it compresses them, everybody knows that. But classically this is done by acquiring all the pixels on the camera and then compressing. The idea of compressed sensing is to do the compression at the moment of the measurement: if we can compress, say, to one percent of the size, do we have to go through taking the full picture, or can it be done so that we never ever go through the full picture,
only reconstructing it at the end, if you really want to look at it? Right, so that is the idea of compressed sensing: can this be done computationally efficiently? And the answer that yes, it can, that is one of the most cited papers on the topic, so it is popular, to say the least. How does this work mathematically? When some apparatus takes a measurement, for instance a camera taking a picture, because that is the most direct example, but also say in a hospital, when we use MRI to create a map of the body, or computed tomography, in a lot of these signal processing applications, basically the majority of them, the measurement is some linear transformation of the signal, and when we want to reconstruct the signal from the measurements we have to invert that linear transformation. From that point of view, if we want to reconstruct all n points of the signal, we had better have an n-by-n matrix, so that there is a good chance we can invert it. The idea of compressed sensing is to do it with far fewer measurements than the dimension, using the fact that any signal of interest can be written in some basis (for pictures this would be wavelets, for instance; for other signals it would be a Fourier basis or something else) such that this change of basis is a square matrix, but the transformed signal x* is sparse. That is actually illustrated here: this is the wavelet transform of the picture. You see that many of these coefficients are close to zero and some of them are big; if I discard the ones that are close to zero and invert the wavelet transform using only the big ones, then, I know this is bad resolution, but even if it were good resolution, you would see very little difference between the three images. So there is some transformation, which is known, such that the signal becomes sparse, and if I multiply these together, what I get is that the measurements are some matrix that I know, times a vector of coefficients, many of which are zero, that represents the signal. So I want to reconstruct this x, the signal. And you see that I use the same letters here on purpose: here Y will not be the labels but the measurements, F represents the combination of the matrix of the apparatus and, say, the wavelet transform, and x is the signal. But otherwise, mathematically, this problem is the same. There is no sign, there is a different activation function here, but otherwise it is the same. So if I go back to those two open questions, what happens when x has a generic prior and the activation function is generic, this is where I introduce the generalized linear model, which is a unified view of the two examples that I showed and of many more. I list here some more that came to my mind; the list could probably be continued. So what is the setting here?
I still use the same notation. Y will be the labels, or the measurements; F will be some matrix that I know; X will be some vector of weights, or a signal, that I do not know. I push F times X through some function phi, which can be linear or nonlinear, as you want; before, it was the sign and the x's were plus/minus one. This is the generalized linear regression, or generalized linear model, and the goal will be to fit X from examples of F and Y. That will be the setting. Now, if I want to calculate something about it, I need some model, and this will be the spherical chicken in the vacuum. So what do I assume? The matrix F will be i.i.d. random, as before, as in the little model of a neural network from the papers of Gardner and Derrida that I was showing; the x* will be taken from some separable distribution p_X; and I will generate the labels by applying this function phi to F times x*. I will call this the teacher-student setting. Why this setting? Because from the point of view of, say, machine learning, it makes very little sense to say that the data are i.i.d. random; no real data are i.i.d. random. There is a kind of discrepancy in what we mean when we say a model: in physics we do not mean the same thing as in statistics. In statistics, when you say a model, you usually mean that you suppose the data actually look like the model. In physics that is not the case: surely I am not claiming that cats and dogs look i.i.d. Gaussian, that would be complete nonsense. In physics, when you say a model, you mean some setting where the important properties of the problem are still captured and the unimportant ones are not there, so that I can solve it and say something about it. Of course, a priori we never know what we need to keep so that the important properties stay there, so it is a back-and-forth procedure: we try something, see how it behaves, and if it does not work, maybe we need to keep more. So why is this an interesting model, why an i.i.d. random matrix and a random vector for the teacher? One reason is that we can write a tractable theory: we can compute what the optimal generalization error is in this case, and it is a non-trivial thing, it is not trivially zero and it is not trivially one, it is some non-trivial function of alpha. And it is complementary to the worst-case theories one would have in statistics, where one assumes the worst possible signal or the worst possible data. Here I just take them from some probabilistic model; I do not pretend this is realistic, but it is non-trivial and interesting, and I am asking what I would be able to do at best in this particular case.
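Just to make this data-generating process concrete, here is a minimal sketch in Python. The function name, the normalization of F, and the particular priors and channels are my own illustrative choices, not something from the slides; it covers the channels that will appear later in the talk, namely the sign for the perceptron, the noisy linear channel for compressed sensing, the absolute value for the phase-retrieval toy problem, and the ReLU.

```python
import numpy as np

def teacher_student_data(M, N, prior="binary", channel="sign", noise=0.0, seed=None):
    """Sketch of the teacher-student setting: F is an M x N i.i.d. Gaussian matrix,
    x* is drawn i.i.d. from the prior p_X, and y = phi(F x*) for the chosen channel."""
    rng = np.random.default_rng(seed)
    F = rng.normal(size=(M, N)) / np.sqrt(N)   # i.i.d. Gaussian matrix (one possible scaling)
    if prior == "binary":                      # x*_i = +/-1 uniformly at random
        x_star = rng.choice([-1.0, 1.0], size=N)
    elif prior == "gauss":                     # x*_i ~ N(0, 1)
        x_star = rng.normal(size=N)
    else:                                      # sparse Gauss-Bernoulli (compressed sensing)
        x_star = rng.normal(size=N) * (rng.random(N) < 0.4)
    z = F @ x_star
    if channel == "sign":                      # perceptron / classification labels
        y = np.sign(z)
    elif channel == "linear":                  # compressed sensing measurements
        y = z + noise * rng.normal(size=M)
    elif channel == "abs":                     # toy phase retrieval: only the sign is lost
        y = np.abs(z)
    else:                                      # ReLU channel
        y = np.maximum(z, 0.0)
    return F, y, x_star

# Example: M = 2N samples of the binary-weights perceptron, i.e. alpha = 2
F, y, x_star = teacher_student_data(M=2000, N=1000, prior="binary", channel="sign", seed=0)
```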
But sometimes a random F is actually realistic. One reason to study compressed sensing is that there we have the freedom to design the apparatus, and the same is true for error-correcting codes, so there are applications where we actually have the freedom to design this matrix, and then it can be random, and applications where this directly applies. It is interesting from the physics point of view because there are these phase transitions, as I have already highlighted, and we like those. And it is challenging from the math point of view, because you have these explicit conjectures about what the optimal errors look like, but the theorems are missing in many cases. So how, in principle, do we compute this optimal generalization error? In a setting such as the one I defined, we know how to do it: we simply write the posterior probability distribution of the unknown X given what we know. Since I told you exactly how the data were generated, this function phi can be translated into a probability distribution of Y given that F times X was fed in. So this is the likelihood, this is the prior, and this is the normalization. For estimating the X that is best in the sense of minimizing the mean squared error between the ground truth and the estimator, what we need to do is compute the posterior mean of x_i under this measure. This is a basic fact of Bayesian statistics: if we have access to the marginals of a posterior like that, then we can compute the optimal estimator that minimizes the distance between the ground truth and the estimate. But that is not the same as minimizing the generalization error. To minimize the generalization error, we need to compute another average, the one for the label of a new sample. Traditionally, in machine learning, you would write some loss function, minimize it with gradient descent or stochastic gradient descent, hope you found the minimum, and then plug the minimizer into the neural network; that would give you your estimate of the label. But if you really have a small number of samples, proportional to the dimension, that is not the best way. The best way is to take the new data point and, rather than plugging in the minimizer of a loss function, average over the posterior, over all the possible x's you could feed into the neural network; that gives you the estimator that minimizes the distance between the teacher's label and your estimated label. Okay, so we know what to do in principle; that is good. What is bad is that doing this directly would require exponentially long sums, because both of these numbers are large; it is not tractable directly, so there is a computational problem. You know what to do, but you do not know how to do it efficiently. Nevertheless, remember the slide from before: that is what they did 30 years ago with the replica method, they computed what the performance would be if you were able to do this.
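Before coming back to that computation, let me write the objects just described in symbols, schematically: a separable prior P_X, the channel phi written as a conditional distribution P_out, and Z the normalization constant.

```latex
% Posterior over the unknown weights or signal, as described above
P(x \mid F, y) \;=\; \frac{1}{Z(F,y)}\;\prod_{i=1}^{N} P_X(x_i)\;\prod_{\mu=1}^{M}
P_{\mathrm{out}}\!\Big(y_\mu \,\Big|\, \textstyle\sum_{i} F_{\mu i}\, x_i\Big)

% Bayes-optimal estimator of the weights (minimizes the mean squared error on x*):
\hat{x}_i \;=\; \mathbb{E}\big[\, x_i \mid F, y \,\big]

% Bayes-optimal prediction of the label of a new row F_new: an average over the
% posterior, rather than plugging in a single minimizer of a loss function
\hat{y}_{\mathrm{new}} \;=\; \mathbb{E}\big[\, \varphi(F_{\mathrm{new}}\cdot x) \mid F, y \,\big]
```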
All right, so this brings us back: they computed it, and unsurprisingly we can repeat that calculation for this generalized linear model, so for the setting with a general p_X and a general function phi. But can you also prove that it is actually correct? The answer is yes, in this particular case. Of course, as it usually is, what follows is a selection, but there is a long way to the result I want to present in the next few slides. It starts with a mathematical method for showing rigorous results about spin glasses, models in physics that resemble all these computational problems, by Guerra and Toninelli, already some time ago. Then, in the cases where this first-order phase transition, the jump in the performance, is not present, there is a considerable series of works by Andrea Montanari and collaborators proving that these replica formulas actually give the right result, but not in the cases where first-order phase transitions are present. What brought us to establish the result in those cases as well was, for me, a key paper of Korada and Macris some 10 years ago or so, which was on a seemingly unrelated problem, though of course not completely unrelated from the point of view of the math, and then through a series of works we got to this last one. So maybe let us go to that one directly. In that one we basically closed this question, this year I should say, since we are now in 2017, for that result from the 90s. The result is stated very informally here, but it basically says that this replica calculation from 30 years ago, for this particular problem, is indeed the correct thing to do and indeed leads to the optimal generalization error. It also gives you the formula for the free energy. What is the free energy? You take the normalization of the posterior, the logarithm of it, and then the expectation with respect to what you know, so over Y and F, in the limit we consider, where both N and M are large and their ratio is fixed to this alpha over here. And this free energy is given by this horrible-looking formula; it has two terms that are expectations of some integrals, and so on. So at first view you say: how is that helping, this is as horrible as the original problem? But remember, the original problem is high-dimensional, there was an integral over N parameters. This one is a scalar formula. The m and m-hat here are what in the physics language we call order parameters; they are just scalar numbers. And all the expectations and integrals here are only over scalar variables with some known distribution, either Gaussians or taken from this p_X; the x is always taken from p_X, and the V's, Z's and W's are just Gaussians. So this is a simple formula to compute; in information theory you would call it a single-letter characterization of what you want. If you plug this into Mathematica you can evaluate what these things are. But so far this is a statement about the normalization of the posterior.
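Written out, the quantity the theorem controls has, schematically, the following structure; I am only displaying what was described above, a limit of the log-normalization and an extremization over two scalar order parameters, with the precise form of the potential, the signs and the normalization conventions as in the paper.

```latex
% Free energy (free entropy) of the posterior in the high-dimensional limit
f \;=\; \lim_{N\to\infty}\; \frac{1}{N}\,\mathbb{E}_{F,\,y}\big[\,\log Z(F,y)\,\big],
\qquad \alpha = M/N \ \text{fixed}

% The theorem expresses it as an extremization over two scalars m, \hat m of a
% single-letter potential built only from expectations over scalar random variables
% (x drawn from P_X, Gaussian variables, and the output channel P_out):
f \;=\; \operatorname{extr}_{m,\,\hat m}\;\; \Phi\big(m, \hat m;\, \alpha, P_X, P_{\mathrm{out}}\big)
```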
So far, though, this is telling you nothing about the error. But the free energy is a kind of generating function, in the statistical physics sense: all kinds of quantities you are interested in can be obtained directly from it. So, unsurprisingly, if you take the m that achieves this supremum, then this little rho, which is just the second moment of your prior p_X, minus this m*, is directly the minimum mean squared error. And similarly for the optimal generalization error: it is a slightly more involved, less intuitive formula, but again a formula that is just some expectations over scalar variables, and it involves again this m*, the m that achieves the supremum of this little scalar function. So in principle, from this function, from analyzing what it looks like, you get both the optimal estimation error and the optimal generalization error. Let me spend two slides on how the proof looks, because there is one part of it that actually makes you understand this formula. There are these two terms that look like rather unintuitive integrals, but actually they are not unintuitive, because if you look at what they are, they are just free entropies of scalar denoising problems. In the first one, you would generate some x* from p_X, but this time it is just a scalar x*; it is a much simpler problem, dimension one, not dimension N. Then you have some known number m-hat that multiplies it, you add a noise of variance one, and you observe the sum, y'. How would you then estimate x* back from this y'? That is an exercise for Statistics 101: you just write down the posterior, and you will see that you need exactly those integrals and expectations that I wrote on the previous slide. Similarly for the second term; it is a slightly trickier denoising problem, where you again have an unknown scalar that you multiply by some constant, you add a Gaussian noise multiplied by some constant, and then you observe it through this P_out output channel. But it is again a scalar denoising problem, so you can compute everything you want about it, and that is exactly what leads to the second part of the formula. Once you realize that, the way the proof works is by the Guerra-Toninelli interpolation, where you show that the free energy of these decoupled problems, of which you would have M of one kind and N of the other, and the free energy of the original problem can be connected by a path along which the free energy is not changing, or is changing by a quantity that you can track. That is the core of how the proof works; I am not showing the pages explicitly, but on the way there are many concentrations and other things that you need to show. There is, however, one really key ingredient in all of that, without which it would completely break, and that is what we call the Nishimori property.
That is a key property of the Bayes-optimal setting: if I know the model that generated the data, and I can write exactly the posterior with all the likelihoods and priors being the right ones, then the way the Bayes formula works is that a sample from the posterior can, under expectation, be exchanged with the ground truth, and vice versa. That property, that under expectation I can exchange a sample from the posterior with the ground truth, is in physics called the Nishimori property, and it is used in some key parts of the proof. If we did not have it, the proof would break; many of the other ingredients could maybe be done differently, but this one cannot be taken out. So that is the current bottleneck of this method for proving replica formulas: we need to be in the Bayes-optimal setting. Do we need, for example, a Gaussian distribution for this? No, we do not need a Gaussian distribution anywhere; the p_X can be anything and the P_out can be anything, but we need to know the p_X and we need to know the P_out. And the random matrix F does not have to be Gaussian, but its elements need to be independent. The ongoing work, the future direction, is how to go beyond that; we can discuss it afterwards, but for the purposes of this talk the matrix is i.i.d. Okay, so now there is the second part of the question. This analysis and this proof tell you what the performance would be if we were actually able to implement the Bayes-optimal estimators, and we have these explicit formulas that we can evaluate, which is nice, but so far we have said nothing about the algorithm. So that is the important next step, and again a slight detour on history. The algorithm that we will be using here is approximate message passing, which has a long history, from my point of view starting in physics with the work of Thouless, Anderson and Palmer more than 40 years ago, which for the perceptron was extended by Mézard. But there were some problems with the way they wrote these iterative equations: the time indices were not quite the right ones, and written that way it did not quite work out. So people wrote the equations down, but were not really able to use them for what the theory was telling them they should be able to do. That was a puzzle for many years, and it got resolved in a math paper by Bolthausen around 2008, and around the same time people who had worked on belief propagation before, including Andrea Montanari and collaborators, realized how to write these things correctly. So there was this big revival of this type of algorithms, which got called approximate message passing; that name comes from these two papers. And one key element was that the proof of Bolthausen about the behavior of these iterative equations was extended in three works co-authored by Andrea Montanari and collaborators; I will get to what that gives us. So that is the history, and what it buys us will come in three slides.
You do not have to think of approximate message passing this way, but to make the connection with the lecture by Laura, who talked about belief propagation: one way to think of approximate message passing is through belief propagation. You take the graphical model that corresponds to this generalized linear model, where the squares are the checks, the factors, and the circles are the variables, the unknowns; the multiplication by the matrix gives this graphical model, and you write belief propagation for it. This is of course crazy, because Laura told you that belief propagation works for trees, for tree graphical models, and this one is fully connected: every check is connected to every variable and every variable to every check, so it has as many loops as you can possibly get. Yet, if you do it, you end up with these equations. Moreover, there is an integral over N variables in them, so they are in a form that is not tractable; it is not like the stochastic block model, where you can implement the equations as written and they work well. You cannot implement these. But fortunately, precisely because there are so many neighbors, there is a simplification that you can carry out, and you realize that in the leading order in N these equations can be simplified into a tractable form. It is just a few pages of calculation: you start with these equations and you end up with those. This is a slide that I do not expect you to digest in detail, because that is not the purpose; the purpose is just to say that the result is very simple. It is an iterative algorithm where you are basically just multiplying your estimates by that matrix. The complexity is N squared, which is as small as it can get, because you need N squared operations just to read the matrix. So it is a very fast thing, it can be distributed if you want; it is very comfortable from the algorithmic point of view. It estimates directly the marginal means of your variables, and it can also estimate directly the optimal estimator of the new label if you feed it a new sample; all of that is incorporated in it. So that is great, but what is really the best thing about it is the state evolution; that is what I wanted to mention three slides ago, what Montanari and collaborators proved in a series of papers. And that is something pretty magical, because here we are in a very high-dimensional problem and we run on it an iterative algorithm that is rather complicated. We could think of, say, spectral methods, power iteration, as iterative algorithms that we can understand, but those are very special cases. If I just cook up an arbitrary iterative algorithm to run on a problem like that, you will have no clue how to trace what it is doing at iteration 100; it is very complicated.
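As an aside, to make the AMP iteration described above concrete before getting to the state evolution: here is a minimal sketch of what such an iteration looks like for the simplest, noisy linear channel. This is the soft-threshold variant in the style of Donoho, Maleki and Montanari, with my own choice of threshold schedule and with F assumed to have i.i.d. entries of variance 1/M; it is an illustration of the structure (denoise, multiply by the matrix, correct the residual), not the exact Bayes-optimal AMP written on the slide.

```python
import numpy as np

def soft_threshold(u, theta):
    """Denoiser eta(u; theta) = sign(u) * max(|u| - theta, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def amp_linear_channel(F, y, n_iter=50, tau=1.5):
    """Sketch of AMP for y = F x + noise, with F of size M x N and i.i.d. entries
    of variance 1/M. Each step is one matrix-vector product each way, so the cost
    per iteration is O(M N), as small as reading the matrix."""
    M, N = F.shape
    delta = M / N
    x = np.zeros(N)                        # current estimate of the signal
    z = y.copy()                           # residual, including the Onsager correction
    for _ in range(n_iter):
        sigma = np.linalg.norm(z) / np.sqrt(M)        # estimated effective noise level
        pseudo = x + F.T @ z                          # "pseudo-data" fed to the denoiser
        x_new = soft_threshold(pseudo, tau * sigma)   # denoising step
        # Onsager reaction term: (1/delta) times the average derivative of the denoiser
        onsager = np.mean(np.abs(pseudo) > tau * sigma) / delta
        z = y - F @ x_new + onsager * z
        x = x_new
    return x
```

For this particular variant, the state evolution discussed next tracks the whole high-dimensional iteration through a single scalar quantity, the effective noise level, via a recursion of the form sigma_{t+1}^2 = sigma_noise^2 + (1/delta) E[(eta(X + sigma_t Z; tau sigma_t) - X)^2], with X drawn from the prior and Z a standard Gaussian: the same kind of scalar expectations that appeared in the replica formula.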
The point is that the different iterations talk to each other, many correlations arise, and it becomes really complicated. For other types of algorithms that look similar, such as variational mean field, Monte Carlo, or some Langevin algorithm, we have no proof of anything like that. But for approximate message passing we can define the correlation of the current estimate at iteration t with the ground truth, and then write these two simple equations, using the same functions that we saw before in this replica free energy, which tell us how this quantity evolves as we run the algorithm. And you should be somehow surprised: how come the same functions that appear in the replica formula, which has nothing to do with any algorithm, appear also here? That is the magic of it. To summarize that magic: you have this one formula for the free energy, and you look at its maxima. If you maximize over one of the parameters and look at this function as a function of only the other one, then its global maximum corresponds to the error that the Bayes-optimal estimator would be giving, while the algorithm, since I start with no correlation with the ground truth, gets stuck at the first local maximum that the state evolution hits coming from the worst possible correlation. That is the performance of this algorithm. So if this function happens to have a unique maximum, then fine, both agree; but if it has more than one, the algorithm may not reach the Bayes-optimal estimation. And you already saw a figure like that, flipped upside down, in the earlier talk: this is what characterizes the hard phase. So, after a lot of theory, let me show you concretely how these things behave. I start with compressed sensing. Alpha is M over N, so the bigger alpha, the more samples I have and the easier it should be; and rho is the sparsity, and here I just fixed it so that 40 percent of the elements of x* are non-zero, in some way. And I show you this free energy, or free entropy, it is just a sign change between one and the other, for different values of alpha. You see that the global maximum is always at error zero: for all these values of alpha, since alpha is bigger than rho, information-theoretically in compressed sensing you would be able to reconstruct the signal with zero error. But you also see that whereas for large alpha you do not get stuck in a local maximum on the way, for smaller alpha you do get stuck, here and here. So this is similar to the stochastic block model case. What is maybe a bit confusing is that in the stochastic block model we were talking about estimating as badly as random guessing versus being correlated with the ground truth, whereas here this is a pretty good correlation with the ground truth; we are correlated at, say, 0.85.
But this is not good for us, because we know that we could reconstruct the signal exactly here. So here the difference is between being able to extract the signal exactly and extracting it with some 15 percent error; if you can in principle extract it exactly, 15 percent error is not satisfying. So it is just a different kind of situation. Here is how the phase diagram looks if you summarize this in the alpha versus rho plane. You have this impossible phase: you need at least as many measurements as non-zeros, and below that, even if I told you where the non-zeros are, you could not information-theoretically recover anything. Then with the LASSO method, the l1 regularization, you can work in this region; I guess it used to be blue, I do not know how it renders here. And then you have a regime where you improve over the l1 minimization with this algorithm, but it is fair to say that this line does not depend on p_X while this line does depend on p_X, so it is not the same kind of robustness as the l1. And there is still this hard phase where this algorithm does not work. And again, okay, we introduced one algorithm, maybe a nice one, definitely an intriguing one, but that by itself does not mean there should not be a better one; I mean, there are a zillion other algorithms, heuristic ones, that one could hope would push a little further here. But that is not the case, at least as far as we know. And just to illustrate this kind of phase diagram: plotting things this way is what we like to do in physics, right? So here is a physical system; this is actually the phase diagram of diamond and graphite in pressure and temperature, and it is just to illustrate that the hard phase of compressed sensing, algorithmically, is exactly what diamond is at room temperature and atmospheric pressure. Did you know that diamond is actually a metastable state? If it actually turned into graphite at room temperature and atmospheric pressure, it would release heat, because energetically it would be much better for it to be graphite. That is a fact, and it should make us worried that all diamonds will spontaneously turn into graphite. But that is not happening, because it stays in that metastable state for so long that we do not need to worry about it. And this is the corresponding phase diagram: we would have to go to a pressure that is, this being a log scale, about two decades above atmospheric pressure for diamond to become the stable state, or we would have to go to a very, very high temperature. This is in thousands of Kelvin, so something like ten times room temperature; if you heat it up that much, then it turns into graphite, you cross this line, which is just a sketch, at very high temperatures. So you should not heat your diamonds too much, I guess; not in your oven at home, but in some industrial oven, and that would be a bad thing. So in the same sense, the metastable state corresponds to the lower accuracy; it is kind of the other way around, right? We are interested in the diamond,
so we are happy that it stays the way it stays, whereas in the algorithmic case the metastable state is the bad error: we would like to reach the good error and we are just not getting there, because the dynamics is stuck in the metastable state. So this is to show you where the connection with physics comes from. And this slide is to say that you have already seen two examples: the earlier talk was about a hard phase, I am talking about a hard phase, and we are using related algorithms; there it was spectral methods, which are a kind of linearization of belief propagation, and here it is a different version of belief propagation. But you can use many other algorithms, and somehow nobody seems to get into these hard phases. And there are also many other problems where such hard phases were identified. I realized that this is a list I had in some old talks, but I have now split it into three parts: Laura was covering the sparse cases, I am covering the generalized linear model cases, and Andrea will be covering the low-rank cases, so each entry on this slide broadly belongs to one of the three talks. The conjecture is that nothing efficient can, in these cases, do better than this approximate message passing. Of course, that will not be easy to prove; let us see how the conjecture stands. Here I want to give you a few more examples of what happens if I play with what I am measuring, with what the P_out and the p_X are. Here is an interesting one, where I do compressed sensing just as before, but instead of measuring F times x without noise, I measure just its absolute value. This is a toy problem for phase retrieval; phase retrieval is a famous and challenging problem in signal processing, and in real phase retrieval these numbers would be complex, whereas here they are real-valued, so all I am losing is the sign. For the information-theoretic transition, strictly nothing changes; it is the same as if I had kept the sign, so I lose nothing information-theoretically by losing the sign. But the algorithmic phase transition, which before was this line going down here, is now this red line, and what is striking about it is that as the sparsity goes to zero, instead of the red line going to zero, so that the sparser your signal the more you can reduce the number of measurements, it saturates at one half. However sparse your signal is, you always need at least 0.5 times the dimension measurements; not 0.5 times the sparsity, but times the dimension. In a sense this means that if you lose the signs, you cannot do compressive sensing; algorithmically, that is, not information-theoretically, since information-theoretically the black line still goes down. And that is striking: the great idea of compressed sensing, which got 20,000 citations and more, basically for one paper, completely goes away if you do not measure the sign. It is quite surprising. So, is that really true?
Well, the important scaling here is that the sparsity is of order one. If the sparsity went to zero as N grows, then on that smaller scale this curve would actually behave differently, but on the scale where the sparsity is of order one, this should hold. So that is one consequence of the phase diagrams that we plotted here. Another one is, for instance, this case where we were interested in the ReLU function. What it basically does is take the max between zero and what used to be y. So the negative measurements you lose completely, except for the fact that they were negative; you keep the information that they were negative, and the positive ones you keep. So naively you would expect that the whole phase diagram just rescales, that this scale, instead of one, goes to two, and that is how I plot it. And indeed, for the information-theoretic transition that is what happens: the naive expectation holds information-theoretically. But if you look at how the algorithmic line shifts, the dotted line is twice the previous one, which is what we would expect naively, and the true one is actually lower. So this is also interesting. It means that the zeros in the ReLU, the measurements that you lost and about which you only know that they were negative, are not helping you information-theoretically, they cannot, but they are helping you algorithmically. So again a kind of puzzle: why is that so, and why do these curves look like that, why is this not at one here? Maybe we can see it from the equations, but intuitively it is not so easy to understand. Now, to go to the neural-network case, I started with the perceptron problem, and the case I start with is the single-layer perceptron where the teacher generates the weights from a Gaussian distribution and the activation function is a sign. That is the first case that the papers by Gardner and Derrida were studying 30 years ago. I plot the optimal generalization error from the theory, that is the red line; from the AMP algorithm, the black points, on a somewhat large instance, but not so large, since I have to store the matrix; and then in blue, I just took scikit-learn and ran logistic regression on that classification problem, the first thing any student who comes to my group would do. And you see it is quite good, it is close to the optimal. As a function of alpha, and alpha is not so big here, this is five times the dimension, you see the error is not so close to zero, but it goes to zero as one over alpha, actually as some constant over alpha; all these things can be derived from the expressions. But now you change things a little bit. The only thing you change is that the teacher took plus/minus ones for the weights. Then we are back to this picture from Györgyi that I have shown you at the very beginning, and now I just interpret it from the current point of view. Indeed, my red line, which is what the Bayes-optimal estimator would be doing, is exactly his line.
So his line was correct, and we now have a proof that it was. So this fits. And what he called the unphysical branch, well, that is the branch that describes the performance of the AMP algorithm, because the AMP algorithm starts at bad generalization and gets stuck at the first local maximum, and that is exactly that branch. So that is kind of a redemption of the unphysical branch: it is not so unphysical after all. And again, the theoretical prediction from the state evolution fits what we get from the algorithm on not-so-large instances. This time I also ran the logistic regression, the very same one. But that one of course does not know what the AMP algorithm knows, namely that the x's were plus/minus one; the logistic regression does not know it and does not somehow magically recover it. There is a big gap; you can somehow extrapolate where it would go, but there is a gap here. So there is some value in putting in the prior information, if you have it, compared to generic black-box methods. And this is the hard phase again. So another example, where you change things just a little bit again. You keep binary x for the teacher, but now the labels are generated a bit differently; they are generated in such a way that the rule is even in x, so if I took minus x it would give exactly the same y. So instead of taking a step function, you take a kind of symmetric step function for the activation. Why not? And this is just to illustrate an interesting behavior: in that case the generalization error actually stays at one up to this critical alpha_c of 1.36, then it is slightly below one, and then it jumps discontinuously to zero. So that is an interesting behavior; that would be the hard phase, and this behavior is actually related more to what is going on in the stochastic block model than the previous cases, because a generalization error as bad as one, the way I am defining things, is basically random guessing: you just know the prior, you just know that the labels must be plus/minus one, so it is as bad as randomly guessing which label is plus and which is minus. And then there is this 1.36,
which is exactly the analog of the Kesten-Stigum transition in this case; that would be the point where the trivial fixed point of the message passing becomes unstable. And then you see that this is the case that Laura mentioned in her very last slide, when she said that the situation can be a bit more complicated than the situations she described. This is a case where, close to the Kesten-Stigum transition, you have some correlation with the ground truth, but you are not optimal yet; you need to reach this 1.566 to actually jump to the zero error, which is the optimal behavior in this case. So here, in a sense, you have one, two, three, four phases, and more if you put in other variants, etc. What is also interesting here is that this seems to be a challenging benchmark, a kind of trivial toy problem but a challenging benchmark, for black-box algorithms. Here the logistic regression was not doing anything, and here we took some three-layer neural network, optimized it a little bit, and got some performance, but notice that I completely changed the scale: here it is 2.5 and here it is 20. So there is a huge region in which, whatever time we spent playing with TensorFlow, we could not get the network to learn this rule from these few samples. Is it that we did not play with it enough, or is it that the stochastic gradient descent we were using is not as good an algorithm as the AMP, or is it something more fundamental? This kind of model gives us a playground for thinking about these kinds of questions. The last thing I want to show you is this: so far I was sticking to this single-layer neural network, which can only learn linearly separable things, so it is not as useful as the deeper ones. Can we go deeper and include hidden variables? The current situation is that we can, as long as their number stays of order one while the number of samples and the dimension both go to infinity with a fixed ratio. That is not ideal; we would like to have more of these hidden variables, to be more realistic, but that is the current limitation of the method. Still, already in this setting we can see some interesting things. This model is called the committee machine, and again, people in physics in the 90s already studied it, and the formulas that we are basically using were already in those old papers. But the story is the same: they were derived with the replica method, they were not rigorous, and in our paper, oh my gosh, I did not change the name here, I am so sorry for that, in our paper that will appear in a couple of weeks at NeurIPS, what we did is show that those replica formulas from back then were correct, and we highlighted some behavior of this slightly more complicated, but still simple, neural network. And it is a behavior that every practitioner knows: if you have data and you do not have enough samples, because, say, your application is a medical one and you only had a thousand patients and you just cannot have more, then we know that there is no point in using deep networks, because that will not give you good generalization; if you really have a very small number of samples, you can almost never beat kernel ridge regression.
You just do regression, or at best kernel ridge regression. So practitioners know that, and interestingly, this can be quantified in this model. What it means is that if you actually evaluate the optimal performance of the committee machine, of the Bayes-optimal estimator, there is a phase transition from a regime where each of the hidden neurons learns exactly the same thing, and that is the optimal thing to do, to a regime where the different hidden neurons realize that they should actually not learn the same thing, and that this way they will do better, as the number of samples increases. So that is kind of interesting: as I said, every practitioner knows that if you have very few samples you should just do regression and there is no point in going deep, but having a model where we can compute exactly, up to a constant, how this works is kind of interesting. And this was an example with just two hidden units. The thing is that when we have more hidden units, this hard phase comes back and these algorithmic thresholds appear. What it means here is that we are in a limit where M and N go to infinity together, their ratio is alpha, and the number of hidden units K is of order one, but we then consider K large. In that case, to learn well information-theoretically we need something like some constant times the number of hidden units times the dimension samples, but algorithmically we need some other constant times the dimension times K squared. So when we have more hidden units we need many more samples, and the gap is kind of large. What does that mean for practitioners? I do not know, but theoretically it is kind of interesting. Then I had this last part, but I will skip it; it is about how this hardness can be overcome, and in the cases where we can design the measurements, it can be; it is related to nucleation in physics, so for the physicists it is nice, but I will not run over my time, and I go to the summary. So I told you about this generalized linear regression, which is a nice model that encompasses simple neural networks; by the way, the multilayer one is still a special case of the generalized linear regression, the variables just become vectorial, but the formulas are still the same. We have this kind of plug-and-go formula where you can take an arbitrary output channel, an arbitrary activation function, and an arbitrary prior for the x, as long as it is separable, and simply evaluate. And we have the proof of that formula, including for the errors; for the errors there are some assumptions, which can maybe be removed, that are needed in the proof, and if you are interested we can talk about it, but basically we have the proof. We have the algorithm that matches the predicted performance outside of the hard phase. There is the hard phase, which we think is unavoidable. And if you can design this matrix F, then there is some way around it. And just to finish, as Laura did, with some open problems, here is what we do not know. Interestingly, one of my open problems is the same as Laura's: can we analyze other algorithms than message passing or spectral algorithms, and see what the thresholds are for something like Monte Carlo and Langevin sampling?
For the generalized linear model we do not know; that is hard. But we actually have a work that we are just writing up, for a problem that is much more related to the low-rank estimation that Andrea will be talking about than to generalized linear models, and there we actually realize that the Langevin dynamics in that particular model is not as good as approximate message passing: its hard phase is strictly bigger, and somehow physically we do not really understand why it has to be so, nor how to change it so that it becomes as good as AMP. So a lot of questions are still open here. What we also do not know, in terms of the proof, is how to go beyond this Bayes-optimal setting. A concrete example: in the neural networks we would like to study over-parametrization, that is, what happens if the neural network has more hidden units than the teacher had, because today's networks are clearly way over-parametrized, so what is the effect of that? But then the teacher would not be matching the student anymore. We still have the replica formula, at least as a conjecture, but we do not know how to make the proof work in that case, to show that that is the right formula, that that is the right error we would be getting. We believe it is the right one, but how do we make it rigorous? There is always this kind of discrepancy: in theoretical physics, when we say something is exact, it means that we believe it is true, but we do not have a theorem for it, and sometimes there are caveats that we have not seen and that were not included in what we thought was true; so making the replica predictions rigorous is interesting. Then there is this open problem, which is actually related to a question somebody asked me over the break: in the stochastic block model, can we treat the case where the number of groups is not a constant but scales with n? That is also an interesting case, and here it translates, method-wise, into the case where the number of hidden units is not of order one but is extensive, like the dimension and the number of samples. That turns out to be challenging even for the replica method; we actually had something that we thought was a solution, but it turned out to be wrong, so we still do not know how to solve this case, not even heuristically, not to speak of a proof. And another one, which is also a comment on Laura's open problems: what is really the theory of computational hardness in this probabilistic setting, what is really the nature of this hard phase, in which computational-complexity sense is it really hard? That is a different way of stating one of the open problems. And with this, okay, thank you for your attention. You said that proving that the hard phase is really hard is somehow equivalent to proving that P is not NP? No, it is not equivalent. I was just saying that if you somehow were able, if you thought you had a proof that the hard phase is really hard for every polynomial algorithm, then it would imply that P is not NP. So it would not be so easy to get that; that is what I am saying. So the best we can hope for is basically to prove that it is hard within some well-defined class of algorithms, and the question is how large that class is and how to characterize it,
Of course, the larger the class, the more interesting, but we really don't have much of it. For some very restricted classes of algorithms this has been proved; for instance, I cited at some point well-defined versions of approximate message passing, and Andrea has a paper where they prove that within that class you cannot improve over what the canonical one is doing. But how large you can make this class is a completely open question. There are some other works on local algorithms and on statistical algorithms, which are like little classes in the space of all possible algorithms for which you can show that in these hard phases they cannot work. But of course, how can we connect them, how can we find relations between them, and how do we make it the biggest possible class of algorithms? Perhaps all polynomial algorithms; that currently sounds ambitious, right? But definitely, making those classes bigger and bigger is something people are working on.

If you don't know the prior beforehand, can you have an outer loop, perhaps around approximate message passing?

Yes. If you have a parametrization of the prior, and if your parametrization is still separable, so it just means you are learning some parameters of that p of x, then it's completely straightforward to do it with expectation maximization, and you can even extend the state evolution to that case; it still holds. There are papers by the group around Sundeep Rangan where they do the state evolution for that case (a minimal sketch of such an outer EM loop is given at the very end). Then you can go farther and say, okay, now I don't even know if the prior is separable, or I don't know how to parametrize it; then you are entering different territory of how to do that.

But would it be a good addition to, say, scikit-learn, to have AMP plus an outer loop over the priors as a general-purpose method?

For including AMP in something like scikit-learn, it's not the prior that is the bottleneck. It's really the assumption that the matrix is i.i.d. that is the bottleneck, because usually when you do regression you do it on problems where the matrix is not i.i.d., and this algorithm is very sensitive to that. It's very non-robust: when the matrix is not i.i.d. it actually kind of fails. In our works, since we are talking about scikit-learn, I have been in a close collaboration with Bertrand Thirion and Gaël Varoquaux on functional MRI analysis, and there you go nowhere with AMP. But there is a version of AMP called vector AMP, VAMP, that is related, but not so closely related, and that one is actually much more robust to the different assumptions that the theory requires but that the algorithm in practice does not really need. That one we made work, and we actually sped up what they were using in some particular setting with a non-separable prior, because they didn't know how to deal with that with the other methods they were using. So these are like niches, right? The robustness of the algorithm is a problem, but in some special cases you can get an improvement even for somebody who is very practical. Still, it somehow doesn't fit the scikit-learn kind of algorithms that need to be very generic and robust to different datasets; that's what they want, right?
I mean, at least from talking to them, that's kind of the status of the algorithm.

But the fact that you don't know the distributions, won't it change your phase transition, that you don't know the distribution?

So the algorithm knows the distribution of the weights...

And the fact is that you don't know them. So if you want a theory without knowing them, then it will change the result.

Yes, yes. If the distribution were different, it changes the position of the phase transition; it may even kill it. I was showing one case where the only thing I changed was that the weights went from Gaussian to binary, and in the Gaussian case there was no phase transition while in the binary case there was one.

For a given distribution, but we don't know it...

Yes, so you could have a worse phase transition, or no transition at all. Right, you could just stay stuck at error one for all possible alphas, and so on. So that's true. No, but it shouldn't, if you assume the prior belongs to some class of densities that you can estimate. You can learn them, right? If you can include learning in your procedure, and if the true one lies in the class within which you learn, then you actually get the same performance. If you force it to mismatch, you agree, right? If you force it to mismatch, then he is right that it will be either worse, or maybe there will be no phase transition at all. Yeah, it cannot be better. That's one thing Andrea has a paper about, where he shows that, because naively, why not, right? It could be better; a priori it's not clear. Maybe that would be a way to beat the transition, but that actually is not possible within this class of algorithms. But I agree with both of you: in your language, it can be adapted. But then in the practical cases the distribution is not something separable, right, and then you completely have to change the algorithm to include that, which you could, but then it depends on the case. It's pretty adaptable, not in the out-of-the-box sense, but it can be adapted to the application.
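As a concrete illustration of the expectation-maximization outer loop for learning a separable prior that came up in the questions above, here is a minimal, hypothetical sketch. It assumes a linear Gaussian model y = F x + noise with an i.i.d. Gaussian prior of unknown variance; in the setting of the talk the inner inference step would be (G)AMP together with its state evolution, but it is replaced here by the exact Gaussian posterior so that the sketch stays short and self-contained.

```python
import numpy as np

# Hypothetical minimal sketch of an EM outer loop that learns a separable prior.
# Model assumed here (not the talk's exact setting): y = F x + noise, with
# x_i ~ N(0, v) i.i.d. and known noise variance delta; the unknown hyperparameter
# is the prior variance v. In the talk, the E-step would be done by (G)AMP;
# here it is the exact Gaussian posterior, to keep the example self-contained.

rng = np.random.default_rng(0)
n, m = 200, 400            # n = dimension, m = number of samples (talk's notation)
v_true, delta = 2.0, 0.1   # true prior variance, known noise variance

F = rng.standard_normal((m, n)) / np.sqrt(n)        # i.i.d. Gaussian data matrix
x_star = np.sqrt(v_true) * rng.standard_normal(n)   # ground-truth weights
y = F @ x_star + np.sqrt(delta) * rng.standard_normal(m)

v = 1.0  # initial guess for the unknown prior variance
for _ in range(100):
    # E-step: posterior of x given y under the current prior N(0, v) (Gaussian, exact)
    Sigma = np.linalg.inv(F.T @ F / delta + np.eye(n) / v)
    mu = Sigma @ F.T @ y / delta
    # M-step: maximize the expected complete-data log-likelihood over v,
    # which gives v = E[||x||^2 | y] / n
    v_new = (mu @ mu + np.trace(Sigma)) / n
    if abs(v_new - v) < 1e-8:
        break
    v = v_new

print(f"estimated prior variance: {v:.3f}  (ground truth: {v_true})")
```

As long as the true prior lies in the parametric family being learned (here, centered Gaussians), the outer loop recovers its parameter and hence the matched setting discussed above; forcing a mismatched family would, as said in the discussion, only make things worse.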