Let me just distribute these, so please take one and pass it to your neighbor. Okay. So our next lecturer is another worldwide specialist of random matrices, coming, I would say, from the statistical mechanics side. I'm very happy to... I'm sorry, go ahead. That is part of the school, and so I highly recommend it. Two references were asked about on the chat: of course the book by Marc Potters and Jean-Philippe Bouchaud that he mentioned, but also there are very clear and introductory notes by Paul that can be found on arXiv. They are great. So, what are you going to talk about? Okay. Can you hear me? So everything will run. Thanks. Thanks very much, first of all, for the great introduction; I'm really thrilled to be back at ICTP for a school. My last one was in 2017. I spent three years here at ICTP as a postdoc, ten years ago unfortunately, so this is a great comeback. Coming to the topics of the school: I will of course talk about random matrices. I will not talk much about random graphs. I will talk about statistical mechanics. I will not talk much about machine learning, but I promise I will mention the word "inference" at least once during this lecture — actually, technically twice, because I just said it. I have prepared some handouts with material that will be helpful during my lectures, and I encourage you to go over this material and ask me questions if things are not clear. It is just a collection of equations and formulas that will be helpful during these lectures. I selected a specific class of problems which I find very instructive. Most of these things have been developed at King's, where I am, by Yan Fyodorov and his PhD student Rashel Tublin. I wasn't very much involved in the development of this problem, but we are now coming back to it.
Because the problem is very rich and very instructive, and there are a lot of things that we don't know or don't understand, I thought it would be a perfect set of problems to start with. The class of problems goes under the name of Procrustes problems, from the character in Greek mythology, a son of Poseidon, a robber and bandit who essentially invited strangers over to his mansion, forced them onto his bed, and then proceeded by stretching their limbs or chopping their legs until they would fit the bed. Unfortunately, nobody would ever fit his bed exactly, and the class of problems takes its name from this particular operation: forcing people, with a constraint on their height, into a setting where they would not like to stay. In the orthogonal Procrustes problem, which is depicted in the first figure in the handout, you essentially have a set of points A, for example in the plane, and you rotate these points to get a new set of points, to which you add some noise; so in the end you displace the final positions of these points, and you get a new set of points — say a triangle — set B. So in the orthogonal Procrustes problem you are given the original set A, and the set B, which is the rotated set after the noise has been added. The question you are asked is: how can you reconstruct the best rotation matrix Ω that would map A into something as close as possible to the actual set B? The constraint is that the matrix that performs this operation must be a rotation matrix. In the oblique Procrustes problem, which is the one I am dealing with here, we are instead given a linear set of equations with a nonlinear constraint.
And this is as follows. You are given a matrix $A$ with $M$ rows and $N$ columns, and $A$ is applied to a vector $x$ to give a vector $b$:
$$A x = b .$$
This is the linear system we want to solve; $b$ is a real vector with $M$ entries. We have a constraint, though, on the solution $x$ of this linear system: the modulus square of $x$ must be equal to a constant,
$$|x|^2 = N ,$$
so $x$ lives on a sphere. This is a quadratic constraint, but $x$ must also be a solution of the linear system. In the Procrustes setting, $x$ is the poor passer-by, the person invited by Procrustes, who satisfies this constraint on his normal height: he does not want to be stretched or to have his legs chopped. $A$ is Procrustes' hammer, and $b$ is Procrustes' bed. So we want to act on something that has a constraint, in such a way that it fits another location, which is the bed. Now, to analyze this problem: $M$ is the number of rows and $N$ the number of columns, so in this linear-system setting $M$ is the number of equations and $N$ the number of variables. We will start by considering the case $M$ larger than $N$, but we will also relax this assumption, so we will get results for $M$ smaller than $N$ as well. Now, for a given instance you want to find the solution $x$ of this problem, but in general this system of equations will not have a solution. So how do we measure whether we are approximating the solution in a good way or a bad way? Well, we introduce a loss function, call it $H(x)$, defined for example in terms of the square norm of the difference:
$$H(x) = \frac{1}{2}\,\|A x - b\|^2 .$$
When this square norm of the difference is exactly equal to zero, the system has a solution.
If this square norm is larger than zero, then in general we are not able to solve the system, but it is a measure of how well we are approximating the solution. So this loss function can also be written explicitly in terms of the components:
$$H(x) = \frac{1}{2}\sum_{k=1}^{M}\Big(\sum_{j=1}^{N} A_{kj}\,x_j - b_k\Big)^{2} .$$
We will consider a randomized version of this problem, to understand questions of typicality — what is the typical behavior of the system. So we will take the entries of the matrix $A$ and the entries of the bed vector $b$ as random variables. The $A_{kj}$ will be i.i.d. Gaussian random variables, which clearly makes the cost function a random function for which we can compute the statistics of interesting observables, and the entries $b_k$ are i.i.d. Gaussian as well, with variance $\sigma^2$. So this is the setting, with parameters $M$ and $\sigma^2$. We are interested in two objects. The first is the statistics of the critical points — I will tell you in a minute what the critical points are. The second is the statistics of the minimal loss. The minimal loss is zero if the system can be solved; it is larger than zero if the system cannot be solved, but still we would like to understand how close we are to solving the system exactly. This problem can be tackled using the method of Lagrange multipliers. We define a Lagrangian, depending on $x$, equal to our loss function plus a term enforcing the constraint on the norm of the vector $x$:
$$\mathcal{L}(x) = H(x) - \frac{\lambda}{2}\left(|x|^2 - N\right) ,$$
since we want the squared norm to be equal to $N$ — the solution must live on a sphere. Then what we need to do is find the critical points of this Lagrangian; in components, we take the derivatives with respect to $x_r$.
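As a concrete illustration (an editorial sketch, not part of the lecture: the variable names, sizes, and the unit-variance assumption for the entries of $A$ are mine), the random setup and the loss can be coded in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, sigma = 8, 5, 0.7            # M equations, N unknowns, noise level
A = rng.normal(size=(M, N))        # i.i.d. Gaussian coefficients (unit variance assumed)
b = sigma * rng.normal(size=M)     # Procrustes' "bed": Gaussian entries, variance sigma^2

def loss(x, A, b):
    """H(x) = (1/2) * ||A x - b||^2, the squared-residual loss."""
    r = A @ x - b
    return 0.5 * r @ r

# a random point on the sphere |x|^2 = N, as the constraint requires
x = rng.normal(size=N)
x *= np.sqrt(N) / np.linalg.norm(x)

print(loss(x, A, b), x @ x)        # loss > 0 in general; |x|^2 equals N
```

For $M > N$ the system is overdetermined, so at a random point on the sphere the loss is strictly positive, which is exactly the situation the statistics of the minimal loss is meant to quantify.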
So this is $\partial\mathcal{L}/\partial x_r$, and we set it equal to zero. Yeah, sorry — by the symbol $|x|^2$ I mean the square norm of the vector $x$; it stands for the dot product of $x$ with itself. So if you take this object and differentiate it with respect to $x_r$, you get a factor of two coming down: one half times a factor of two, then the summation over $k$ of the object that was squared, which is $\sum_j A_{kj}x_j - b_k$, times an extra factor from differentiating inside, which is $A_{kr}$, minus $\lambda/2$ times $2x_r$, equal to zero:
$$\sum_{k=1}^{M}\Big(\sum_{j=1}^{N}A_{kj}x_j - b_k\Big)A_{kr} - \lambda\, x_r = 0 .$$
So the factors of two go away. We can recast this equation for the critical points of the loss function in matrix form: the object $Ax - b$, summed over $k$ against $A_{kr}$, is just multiplication on the left by $A^{\mathsf T}$, so
$$A^{\mathsf T}\left(A x - b\right) = \lambda x .$$
That's the equation for the critical points. We proceed by multiplying out the $A^{\mathsf T}$, getting $A^{\mathsf T}A\,x - A^{\mathsf T}b = \lambda x$, and moving $\lambda x$ to the left:
$$\left(A^{\mathsf T}A - \lambda \mathbb{1}\right) x = A^{\mathsf T} b .$$
Now, $A^{\mathsf T}A$ — we discussed this with Marc, and I thank him for helping me out a lot here — remember that $A$ is a rectangular matrix, $M\times N$, so $A^{\mathsf T}A$ is precisely the Wishart matrix corresponding to the matrix $A$. Let's call it $W$.
The solution to my problem of finding the critical points of this constrained loss function is
$$x = \left(W - \lambda \mathbb{1}\right)^{-1} A^{\mathsf T} b .$$
Now, this is, if I'm not mistaken, also the solution of standard ridge regression, right? [Audience: yes.] Yes — but don't take my word for it; this is the solution of regularized ridge regression. I was hesitant to make a statement of this type, so thanks for jumping in. So now we have this solution, but we are still missing something, because we have the constraint that $x$ must live on the sphere. Imposing that constraint fixes the Lagrange multiplier. So we need to compute $x^{\mathsf T}x$: transposing this expression, starting from the right, we get $b^{\mathsf T}A$, then $(W-\lambda\mathbb{1})^{-1}$; multiplying by $x$ itself, this becomes the inverse squared:
$$x^{\mathsf T}x \;=\; b^{\mathsf T}A \left(W - \lambda\mathbb{1}\right)^{-2} A^{\mathsf T} b \;=\; N ,$$
our constraint. This is just a number — if you analyze the sizes of the matrices and vectors involved, it is just a number — and this equation fixes the value of the Lagrange multiplier $\lambda$. Now we need to work on this bit. First of all, here we have $(W-\lambda\mathbb{1})^{-2}$, and I claim that this object can be expressed as a series,
$$\left(\lambda\mathbb{1}-W\right)^{-2} = \sum_{k\ge 0}\frac{(k+1)\,W^{k}}{\lambda^{k+2}} .$$
I will not prove this here, but you prove it by taking minus the derivative with respect to $\lambda$ of $(\lambda\mathbb{1}-W)^{-1}$ and expanding that object in a standard geometric series. So this is our first ingredient. The second ingredient: plugging this series in, the object we will have to deal with is something like $A\,W^{k}A^{\mathsf T}$.
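A quick numerical check of this algebra (my own sketch; $\lambda$ here is an arbitrary value away from the non-negative spectrum of $W$, not yet fixed by the constraint):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 8, 5
A = rng.normal(size=(M, N))
b = rng.normal(size=M)

W = A.T @ A                        # Wishart matrix, N x N
lam = -1.3                         # arbitrary lambda outside the non-negative spectrum of W

# candidate critical point: x(lambda) = (W - lambda I)^(-1) A^T b
x = np.linalg.solve(W - lam * np.eye(N), A.T @ b)

# it must satisfy the critical-point equation A^T (A x - b) = lambda x
print(np.allclose(A.T @ (A @ x - b), lam * x))   # True
```

Taking $\lambda$ negative guarantees $W - \lambda\mathbb{1}$ is invertible, since the Wishart eigenvalues are non-negative.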
That's the kind of elementary object we have to deal with: $A W^{k} A^{\mathsf T}$. And what is it? It is $A\,(A^{\mathsf T}A)\cdots(A^{\mathsf T}A)\,A^{\mathsf T}$, with $A^{\mathsf T}A$ appearing $k$ times — I'm just expanding the power $k$ of the Wishart matrix. Now I can regroup these terms starting from $A$, and I get $(AA^{\mathsf T})(AA^{\mathsf T})\cdots$; as Marc discussed, $AA^{\mathsf T}$ appears $k+1$ times. So
$$A W^{k} A^{\mathsf T} = \widetilde{W}^{\,k+1}, \qquad \widetilde{W} \equiv A A^{\mathsf T},$$
where $\widetilde W$ is the anti-Wishart matrix. If the Wishart matrix is $N\times N$, the anti-Wishart matrix is $M\times M$, where $M$ is larger than $N$ in general. Good. So now we are able to massage this constraint a bit. For the constraint we have
$$A\left(W-\lambda\mathbb{1}\right)^{-2}A^{\mathsf T} = \sum_{k\ge 0}\frac{k+1}{\lambda^{k+2}}\,A W^{k}A^{\mathsf T} = \sum_{k\ge 0}\frac{k+1}{\lambda^{k+2}}\,\widetilde{W}^{\,k+1},$$
using property number two, and we can refold this sum into the form
$$A\left(W-\lambda\mathbb{1}\right)^{-2}A^{\mathsf T} = \widetilde{W}\left(\lambda\mathbb{1}_M - \widetilde{W}\right)^{-2},$$
just rereading the first property from right to left. So now we have an expression that depends on the anti-Wishart matrix, which has the same non-zero eigenvalues as the original Wishart matrix, plus $M-N$ zero eigenvalues. It is natural now to perform a spectral decomposition and rewrite this object in terms of the eigenvalues and eigenvectors of the anti-Wishart matrix — which is to say, of the Wishart matrix as well. The spectral decomposition, I think, is on page one of the handout or so.
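Both identities above — the regrouping $A W^k A^{\mathsf T} = \widetilde W^{\,k+1}$ and the refolded resolvent form — can be verified numerically (an editorial sketch with arbitrary small sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 7, 4
A = rng.normal(size=(M, N))

W  = A.T @ A                       # Wishart, N x N
Wt = A @ A.T                       # anti-Wishart, M x M

# regrouping property: A W^k A^T == Wt^(k+1)
for k in range(4):
    lhs = A @ np.linalg.matrix_power(W, k) @ A.T
    assert np.allclose(lhs, np.linalg.matrix_power(Wt, k + 1))

# refolded identity: A (W - lam I)^-2 A^T == Wt (lam I - Wt)^-2
lam = -2.0
lhs = A @ np.linalg.matrix_power(np.linalg.inv(W - lam * np.eye(N)), 2) @ A.T
rhs = Wt @ np.linalg.matrix_power(np.linalg.inv(lam * np.eye(M) - Wt), 2)
print(np.allclose(lhs, rhs))       # True
```

Although the series derivation converges only for large $|\lambda|$, both sides are rational functions of $\lambda$, so the identity holds at any $\lambda$ away from the spectrum, such as the negative value used here.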
I'm not doing the full derivation, but essentially everything can be written in terms of the $N$ non-zero eigenvalues of the anti-Wishart matrix. This is the decomposition in terms of the projectors $|v_i\rangle\langle v_i|$ onto the eigenspaces — if you're not familiar with the bra-ket notation, this is just a column vector times a row vector, nothing more fancy than that. So now this object needs to be plugged into the constraint, and what we get is
$$x^{\mathsf T}x = \sum_{i=1}^{N}\frac{s_i}{(\lambda - s_i)^{2}}\,\langle b\,|\,v_i\rangle\langle v_i\,|\,b\rangle ,$$
where the $s_i$ are the eigenvalues of the Wishart matrix, or equivalently the non-zero eigenvalues of the anti-Wishart matrix. Here we are projecting the eigenvectors of the anti-Wishart matrix onto the vector $b$ of the Procrustes bed — the known term of the linear system. Of course, the term $\langle b|v_i\rangle\langle v_i|b\rangle$ is nothing but $(b^{\mathsf T}v_i)^{2}$; it is just a number, a coefficient in this expansion — a random number, of course, that depends on the randomness in the $b$'s. For non-mathematicians, again: this bracket notation is just a column vector and a row vector, each factor a dot product, and the two factors are identical, so this is the square of that number. Now, we know that $b$ is a Gaussian vector with variance $\sigma^2$, so we can write $b = \sigma\,\xi$, where each component of $\xi$ is a standard Gaussian $\mathcal{N}(0,1)$. If we substitute this in, we get the important point — let me highlight it in purple:
$$\sum_{i=1}^{N}\frac{s_i}{(\lambda - s_i)^{2}}\,\big(\xi^{\mathsf T}v_i\big)^{2} = \frac{N}{\sigma^{2}} .$$
I pulled a $\sigma^2$ out from the left-hand side and put it on the right.
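The spectral form of the constraint can be checked against the direct matrix computation of $x^{\mathsf T}x$ (a sketch of mine; $\lambda$ is again a trial value, so we verify the identity rather than solve for the multiplier):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, sigma = 7, 4, 0.5
A  = rng.normal(size=(M, N))
xi = rng.normal(size=M)            # standard Gaussian vector
b  = sigma * xi                    # b = sigma * xi

W  = A.T @ A
Wt = A @ A.T
s, V = np.linalg.eigh(Wt)          # eigenvalues s_i (M - N of them ~ 0) and eigenvectors v_i
lam = -0.8                         # trial lambda, away from the spectrum

# left-hand side of the "purple" equation: sum_i s_i (xi . v_i)^2 / (lam - s_i)^2
# (the ~zero eigenvalues contribute nothing, so summing over all M terms is harmless)
phi = np.sum(s * (V.T @ xi) ** 2 / (lam - s) ** 2)

# direct computation of |x|^2 / sigma^2 from x = (W - lam I)^-1 A^T b
x = np.linalg.solve(W - lam * np.eye(N), A.T @ b)
print(np.isclose(phi, x @ x / sigma**2))   # True
```

At the true Lagrange multipliers, `phi` would equal $N/\sigma^2$; here it just reproduces $|x|^2/\sigma^2$ for any admissible $\lambda$.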
So this is an equation for the Lagrange multiplier $\lambda$. It depends on the eigenvalues of the Wishart matrix corresponding to the matrix of coefficients of our linear system, and it depends on the noise of the Procrustes bed — the known terms. Good. Let's now try to analyze this constraint equation a bit. Let me just stress that the constraint in this problem, this nonlinear quadratic constraint, is extremely important: it changes the entire physics of the problem. The presence of a nonlinear constraint in a linear system changes the physics of the problem entirely. Is everything clear? If there are questions, stop me now, or problems might start snowballing. Okay, so let's try to draw a sketch of this function — call the left-hand side $\phi(\lambda)$ — as a function of the Lagrange multiplier, for a fixed realization of our disorder, so for a fixed realization of the eigenvalues. The eigenvalues of a Wishart matrix are non-negative, remember. So the situation is as follows: we have a set of locations on the $\lambda$ axis for the eigenvalues of our Wishart matrix, obtained from the coefficient matrix, and we plot $\phi(\lambda)$ against $\lambda$. You see that this function has poles at the location of every eigenvalue, so it diverges every time we hit an eigenvalue of the Wishart matrix, and between consecutive poles it has some sort of parabolic behavior, whose depth depends on the coefficients $(\xi^{\mathsf T}v_i)^2$, which are random. What happens is that this function $\phi(\lambda)$ must be equal to the constant $N/\sigma^2$, but this level changes depending on the noise of the bed. So we need to draw a horizontal line.
And the number of intersections of this line with the curve corresponds to the number of real Lagrange multipliers satisfying the equation for the critical points of my constrained loss function. The line sits at height $N/\sigma^2$. As the noise becomes large, the line goes down, and at some point the number of intersections you find decreases as the noise increases, up to a point where you will have only two solutions. There are typically zero or two solutions between every pair of eigenvalues — zero or two, apart from the critical limiting case of tangency — and the number of solutions of the critical-point equation changes with the noise. For $\sigma \to 0$ we typically have $2N$ solutions, corresponding to the points on either side of each eigenvalue; for very large noise, only two solutions survive. So if we plot the number of solutions as a function of $\sigma$, what we get is a staircase type of behavior that starts from $2N$ and then decreases, until for very large $\sigma$ only two critical solutions survive. This is what in the literature, in a somewhat fancy way, is called gradual topology trivialization. It means that when you increase the noise, the topology of your landscape trivializes, from a situation where you have $2N$ critical points. Remember, we are counting all critical points — not just minima, but minima, saddles and maxima of the loss function. For small noise we have a rough landscape with a lot of critical points; when the noise increases, you gradually lose complexity in your landscape, until only two of them, a minimum and a maximum, survive. This gradual topology trivialization has several consequences.
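The staircase can be reproduced numerically (my own sketch, not from the lecture): clearing the denominators turns the constraint $\phi(\lambda)=N/\sigma^2$ into a degree-$2N$ polynomial in $\lambda$, whose real roots are the Lagrange multipliers, so we can just count them as $\sigma$ varies.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 6, 3
A  = rng.normal(size=(M, N))
xi = rng.normal(size=M)

s_all, V = np.linalg.eigh(A @ A.T)
s = s_all[-N:]                       # the N non-zero (anti-)Wishart eigenvalues
c = s * (V[:, -N:].T @ xi) ** 2      # weights s_i * (xi . v_i)^2

def n_solutions(sigma):
    """Count real lambda solving sum_i c_i/(lam - s_i)^2 = N/sigma^2,
    by clearing denominators into a degree-2N polynomial."""
    p = -(N / sigma**2) * np.poly1d(np.repeat(s, 2), r=True)   # -(N/s^2) prod_j (lam-s_j)^2
    for i in range(N):
        p = p + c[i] * np.poly1d(np.repeat(np.delete(s, i), 2), r=True)
    return int(np.sum(np.abs(p.roots.imag) < 1e-6))

for sigma in [0.05, 0.5, 2.0, 20.0]:
    print(sigma, n_solutions(sigma))   # staircase: from 2N down to 2 as sigma grows
```

For very small $\sigma$ the horizontal line sits above every inter-pole minimum of $\phi$ and all $2N$ intersections are present; for very large $\sigma$ only the two intersections outside the spectrum survive, which is exactly the (instance-wise) staircase described above.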
And of course, this is just for one realization of your disorder. If you average over many such matrices $A$ and many vectors $b$, what you expect is that this staircase will smooth out, and you expect to get a smooth curve. That's the gradual topology trivialization. Now, this problem is one of the few examples where this smooth curve can be computed exactly — not only that, but also for finite $M$ and $N$. It is one of the few cases where the full landscape of critical points can be characterized analytically for finite $M$ and $N$, without any large-$N$ approximation, and in these lectures I will essentially try to describe how to sketch this calculation. The question is, of course: is this really what we are most interested in? This is not the best thing we would like to compute, right? What we would like to compute is the statistics of the minimal loss, or the statistics of the critical points grouped by type. At the moment we are totally unable to compute the statistics of critical points by type — separating the minima from the saddles and from the maxima while averaging over the disorder, that is something we are unable to do. I mean, the staircase is of course for a given instance, for a given realization; the average you can compute for finite $N$ and $M$, but it includes all critical points, not just minima. Okay. So, turning to the minimum, let me just mention one thing, and then we start. There is a theorem which helps us, a theorem by Browne from 1967, which states that for the loss function $H(x)$ that I gave at the beginning, a chain of inequalities is preserved between the Lagrange multipliers and the values of the loss function.
So, if you pick values of the Lagrange multipliers that are solutions of this equation, and sort them, then the loss function evaluated at the corresponding solutions satisfies exactly the same ordering: smaller Lagrange multipliers correspond to smaller values of the loss. If this is the case, then of course the implication is that the minimal value of the loss function is attained in correspondence with the minimal value of the Lagrange multiplier. So this solution is the most important one, because it gives the minimal loss: if this is zero, it means the system is compatible; if it is larger than zero, the system is not compatible, but how far we are from full compatibility is determined by this number. The minimal loss is obtained at $x_{\min}$, where — I hope... yes, of course I erased it, but —
$$x_{\min} = \left(W - \lambda_{\min}\mathbb{1}\right)^{-1}A^{\mathsf T}b ;$$
that was the formula for the solution of the critical-point equation that I gave you a few minutes ago and then erased. We will not use this now, but later, so we have a precise characterization of the minimizing vector — the argument of the loss function — in terms of the minimal Lagrange multiplier. [Question about the sign of the minimal multiplier.] No, it changes sign, actually, and you can characterize the point at which it changes sign using a mean-field type of argument, essentially a self-averaging property. So it does change sign, but once you plug it in here, everything is taken care of by the final form. Okay, any other question? [Inaudible question.] I absolutely have no idea — maybe yes; I don't know enough about that to answer. Sorry. Anything else? [Question: are the Lagrange multipliers negative or positive, and what changes?]
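Browne's ordering can also be checked numerically on a small instance (an editorial sketch: I reuse the denominator-clearing trick, here with level $N$ and weights built from $b$ directly, which is equivalent to the purple equation after absorbing $\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, sigma = 6, 3, 0.4
A = rng.normal(size=(M, N))
b = sigma * rng.normal(size=M)

W = A.T @ A
s_all, V = np.linalg.eigh(A @ A.T)
s = s_all[-N:]                                  # non-zero eigenvalues s_i
c = s * (V[:, -N:].T @ b) ** 2                  # weights s_i * (b . v_i)^2

# real Lagrange multipliers: roots of the constraint with denominators cleared
p = -N * np.poly1d(np.repeat(s, 2), r=True)
for i in range(N):
    p = p + c[i] * np.poly1d(np.repeat(np.delete(s, i), 2), r=True)
lams = np.sort(p.roots[np.abs(p.roots.imag) < 1e-6].real)

# Browne (1967): sorting the multipliers sorts the losses the same way
H = [0.5 * np.sum((A @ np.linalg.solve(W - l * np.eye(N), A.T @ b) - b) ** 2)
     for l in lams]
print(np.all(np.diff(H) >= -1e-9))              # True: smaller multiplier, smaller loss
```

In particular, the first entry of `lams` is $\lambda_{\min}$, and the corresponding $x_{\min}$ realizes the minimal loss.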
What changes... I mean, we are going to see a precise solution to this problem, like, tomorrow. Maybe we can wait until then to discuss it, because we will have a precise expression for the minimal loss and for the corresponding eigenvalues, so maybe it's just a bit premature. I prefer to go over the rest and then get to the final expression, so we can study exactly what happens when the sign in front changes — if you don't mind. [Question: can you say one word about this monotonicity property? It seems very strong that ordering the multipliers implies ordering the losses.] Yeah, I mean, the proof is very technical, and it is specific to this type of loss function, so I don't think it extends; it is not a general feature. I don't have a very strong intuition, because the proof is based precisely on this quadratic loss function and its associated constraint; it requires a convexity argument that only applies to this type of loss function. It is true that it is a very strong property. I don't know how general it is; in the paper it is specific, so it heavily hinges on this particular form of the loss function. And there is no sort of heuristic — at least, I couldn't find one. But we are lucky to have it, because it gives us an analytical handle on the minimum; it doesn't seem to be very general, though. It's an accident that makes this problem so rich. [Comment: in machine learning terms this cost function would represent the training error in this ridge regression problem, and so this Browne theorem tells you that to minimize the training error you should actually look for small regularizations, which I didn't know.]
Yeah, exactly — but again, it's not a very general property. [Question: what happens to the red box when $N$ is really large? I expect that scalar product to have some self-averaging properties, and as I heard five minutes ago, maybe it converges to one or something like that.] Yeah, the question is very interesting, and even more interesting is the fact that the right-hand side depends on the combination $N/\sigma^2$, right? What we are going to do is everything at finite $N$, essentially, or in a large-$N$ limit in which this term is non-problematic. But of course this term might behave differently, and the solution to this equation might behave very differently, depending on how $\sigma^2$ actually scales with $N$. So not only do we have the problem of how $N$ goes to infinity, but also of how the noise $\sigma^2$ scales with it. Unfortunately, we will not have time to discuss all the different regimes, but Yan, in his paper, discusses very different regimes depending on whether $\sigma$ is of order one or of order $1/N$, and the large-$N$ limit of this equation is very different in the two cases. In general, what you have is that this trivialization does not necessarily survive the large-$N$ limit: in a wide range of regimes you get a trivialization that is almost immediate, so you essentially go straight to the situation where you only have a minimum and a maximum in the large-$N$ limit. So gradual trivialization is a property of the finite-$M$, finite-$N$ problem. [Question: why do you actually call $\sigma^2$ "noise"? I see $x$ as the constraint that you are enforcing, but why would you call $\sigma^2$ noise?]
I mean, maybe it's not the best choice of words, but in the context of the orthogonal Procrustes problem I think the term is appropriate, because if you remember, in the orthogonal problem, which is the analog of this one, you had a set of points A that you would rigidly rotate into another set, and then you would add some perturbation; at the end of this operation you are given the result. So clearly calling this noise is very appropriate: if the noise is very large, the final points are very far from just the endpoint of the rotation. That is why I refer to it as noise. How much time do I have — a few minutes, maybe, or is it over? Well, let me just give you the technical bit that we are going to use, just a formula, and then we go for lunch. I just wanted to show you how we would compute this average number of critical points for finite $M$ and $N$, because there is a lot of technical machinery that might be useful to you — who knows when one of these tricks might come in handy. To compute this average curve, we will make use of the so-called Kac–Rice formalism, which is also summarized in the handout, formula three. The setting is: we have a system of $K$ equations $f_1(x)=0,\dots,f_K(x)=0$ in $K$ unknowns, which may have many solutions, and the number of isolated solutions in a domain $\mathcal D$ is given by the Kac–Rice formula:
$$\mathcal{N}_{\mathcal D} = \int_{\mathcal D} dx_1\cdots dx_K\;\prod_{i=1}^{K}\delta\!\left(f_i(x)\right)\,\left|\det\!\left(\frac{\partial f_i}{\partial x_j}\right)\right| .$$
The formula looks complicated, but if you look at it, it should be quite obvious: the Dirac deltas take care of the fact that $f_1,\dots,f_K$ must be equal to zero, so $x$ must be a solution of the system.
And this is the absolute value of the determinant — the Jacobian of the transformation — and the absolute value is crucial. The proof in general is difficult, but in one dimension it is heuristically more obvious. Imagine you have a function of a single variable and a certain level $u$, and you want to count the number of solutions of $f(t) = u$ in a certain domain $\mathcal D$. What would you do to compute this? Well, you would set a variable $v = f(t)$, integrate over $v$, and enforce that $v$ equals $u$ with $t$ inside your domain; changing variables, you get essentially
$$\mathcal{N}_{\mathcal D} = \int_{\mathcal D} dt\;\left|f'(t)\right|\,\delta\!\left(f(t)-u\right) .$$
So the Kac–Rice formalism in $K$ dimensions is the extension of the solution of this simple problem in one dimension, and you should not forget to include the Jacobian term with its absolute value. What we are going to do is write this Kac–Rice integral for our problem, and then average the integral over the disorder — the disorder being given by the matrix of coefficients $A$ and the vector of known terms $b$. There is quite a lot of work to do, but these averages can be performed exactly in this problem for finite $M$ and $N$, and in the course of the next lecture we will learn a few techniques for computing this type of averages. Okay, I'm done — any questions? We restart at... sorry, at two, 2pm.
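The one-dimensional Kac–Rice counting can be illustrated numerically (my own sketch: the function, level, and smoothing width are arbitrary choices), smoothing the Dirac delta into a narrow Gaussian and integrating on a fine grid:

```python
import numpy as np

# 1D Kac-Rice check: count solutions of f(t) = u on [0, 2*pi] via the
# integral of |f'(t)| * delta(f(t) - u), with the delta smoothed to width eps.
f  = lambda t: np.sin(3 * t)
fp = lambda t: 3 * np.cos(3 * t)
u, eps = 0.2, 1e-3

t = np.linspace(0, 2 * np.pi, 2_000_001)
delta_eps = np.exp(-(f(t) - u) ** 2 / (2 * eps**2)) / np.sqrt(2 * np.pi * eps**2)
n_kac_rice = np.sum(np.abs(fp(t)) * delta_eps) * (t[1] - t[0])

print(round(n_kac_rice))    # 6: sin(3t) crosses the level u = 0.2 six times on [0, 2*pi]
```

Each simple root contributes exactly one to the integral (the $|f'|$ factor cancels the width of the smoothed delta in $t$), which is why omitting the absolute value, or the Jacobian in higher dimensions, would count solutions with signs instead of counting them all.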