Thank you for the introduction and for the invitation to this great conference. I'm going to give a talk which is really more of a random matrix theory talk, about a problem that arises in the study of random neural networks. It's based on a couple of works: one from a couple of years ago and one which is ongoing, with a collaborator.

Let me give you the model and what I'm going to study. So here we are with a neural network. As we know, maybe from all the talks we saw yesterday, X is an input vector (and actually, later, an input matrix, if you take several samples of the data), and W is a matrix of weights. Then you apply a nonlinearity entrywise to this vector (or matrix, later) through the activation function. Here are some examples of activation functions people use, which will be the ones used in the simulations: the sigmoid, the absolute value, and also the widely used rectified linear unit (ReLU), which is max(0, x). The model I'm going to study is based on that. One can stack several layers; I'm actually going to focus on the single-layer case, but we still have some results on multiple layers.

As I said, I work in random matrix theory, and often, to try to understand very complicated systems, one good way is to make things random. So you can take random weights and random data and see what happens to different statistics. Of course, random neural networks are actually used: at initialization, for instance, you can take random weights, and things like this.

So where does random matrix theory come into play, and where do eigenvalues (and maybe eigenvectors too) arise? It's the following. Now X is an input data matrix (I added one more dimension to the data), and say A is an output target data set. The point of the training phase is to find the best output weights after one layer of the network. Here W is a random weight matrix with i.i.d. entries, and let's do ridge regression: we minimize the loss given by the squared error between the output βY and the target A, plus a regularization term γ times the norm of β, which is there to avoid overfitting. You can solve this explicitly: the ridge regressor has a closed formula. It depends on the target data set A, on Y = f(WX), and on the factor γ. And one interesting thing, especially from the random matrix theory point of view, is that a very particular matrix appears in this formula: if you know a bit of operator theory, it is just the resolvent, evaluated at −γ, of the sample covariance matrix built from Y. You can then look at different statistics, such as the expected loss over the random weights or the random data, et cetera. And you can see why eigenvalues and eigenvectors arise in this type of problem: the resolvent contains the information about the eigenvalues and eigenvectors of YᵀY. It's a symmetric matrix, so you get real eigenvalues and orthonormal eigenvectors, and you can reconstruct the resolvent from them. So the training phase depends, in particular, on the spectral measure, and that's where the interest in looking at these matrices comes from. So the model I'm going to consider is the following.
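As a concrete illustration, here is a minimal NumPy sketch of this random-features ridge regression. The loss convention (mean squared error plus γ‖β‖²), the tanh activation, and all the sizes are illustrative assumptions, not the talk's exact setup; the point is that the inverse appearing in the closed form is exactly the resolvent of the sample covariance matrix at −γ.

```python
import numpy as np

rng = np.random.default_rng(0)

n0, n1, m, d = 100, 150, 200, 10   # data dim, width, samples, output dim (illustrative)
gamma = 0.1                         # ridge regularization factor

X = rng.standard_normal((n0, m))    # random input data matrix
W = rng.standard_normal((n1, n0))   # random weight matrix
A = rng.standard_normal((d, m))     # target output data set (placeholder)

Y = np.tanh(W @ X / np.sqrt(n0))    # one layer with an entrywise activation

# Minimizing (1/m)||beta Y - A||^2 + gamma ||beta||^2 gives the ridge regressor
# beta = (1/m) A Y^T (Y Y^T / m + gamma I)^{-1}.
# The inverse is the resolvent of the sample covariance matrix Y Y^T / m at -gamma.
resolvent = np.linalg.inv(Y @ Y.T / m + gamma * np.eye(n1))
beta = (A @ Y.T / m) @ resolvent

train_loss = np.sum((beta @ Y - A) ** 2) / m + gamma * np.sum(beta ** 2)
print(train_loss)
```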
Now the assumptions; I'm going to go pretty quickly through them. I take W and X random. Of course you could ask whether, for applications, it is useful to take the data random too, but here, on a more theoretical level, some interesting things happen when we take both W and X random. I take them independent, centered, and also with symmetric distributions, so the third moment is zero (whether that's really important is a question). And I scale them to variance one, but this doesn't really matter. They also have sub-exponential tail decay. So for instance, Bernoulli weights work, Gaussian works, exponential works; many things work, and uniform, I guess, works.

The activation function satisfies something very strong, which is real analyticity; that's used in the proof, which goes through polynomial approximation. For applications this is not completely out of the way: of course it rules out ReLU and the absolute value, which are not analytic, but they have analytic counterparts, like the softplus, something like this. And I'm going to add an assumption, which will matter later, for the second part of the talk: I center the function, so that the expectation of f of a Gaussian is zero. This doesn't matter too much: not centering just adds a rank-one perturbation to your matrix, which may create a trivial large eigenvalue somewhere, so centering is the nice normalization.

Since this is high dimensions, I use a limit of dimensions, and I work in the usual random matrix regime where all dimensions grow together: φ and ψ are two parameters giving the ratios between the different sizes of W and X. The matrix I consider is the sample covariance matrix YYᵀ (suitably normalized), where Y = f(WX/√n₀); the √n₀ is a scaling that keeps the entries of order one. So I'm going to look at this matrix.

First, the eigenvalues, and in particular the ESD, the empirical eigenvalue distribution, which gives us information on the whole macroscopic behavior of the eigenvalues. The first theorem is the following: in the limit, there is a deterministic probability distribution μ_f for the ESD, and we can describe it through a quartic self-consistent equation for its Stieltjes transform. Maybe you don't know what that means, and it doesn't really matter. What does matter is that it's more complicated than the usual ones: the Wigner semicircle law or the Marchenko–Pastur law correspond to quadratic equations, whereas here it is something more complicated because it is degree four. It depends on f, φ, and ψ, but it does not depend on the distributions of W and X, so it is a universal limiting distribution.

This type of problem has actually been studied quite a lot recently. The first two papers looking at this were by Pennington and Worah and by Louart, Liao, and Couillet, under different assumptions, mostly Gaussian or conditionally Gaussian data, something like this. Last year, Fan and Wang (I think Zhou is going to give a talk later this week) looked at a similar problem with i.i.d. Gaussian entries for multiple layers, also with nice results. And very recently this problem was looked at again with a different method which is a bit more robust, the resolvent method, which is also what we use.
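To see the limiting ESD numerically, here is a short simulation in the spirit of the plots described next. The ratio conventions φ = n₀/m and ψ = n₀/n₁ are an assumption (the talk doesn't pin them down), and tanh is just one admissible analytic activation.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# All dimensions grow together with fixed ratios (assumed convention:
# phi = n0/m, psi = n0/n1).
n0 = 1000
phi, psi = 0.5, 0.8
m, n1 = int(n0 / phi), int(n0 / psi)

X = rng.standard_normal((n0, m))
W = rng.standard_normal((n1, n0))

f = np.tanh                        # entrywise activation
Y = f(W @ X / np.sqrt(n0))

# Empirical eigenvalue distribution of the sample covariance matrix Y Y^T / m.
eigs = np.linalg.eigvalsh(Y @ Y.T / m)
plt.hist(eigs, bins=60, density=True)
plt.title("ESD of Y Y^T / m with f = tanh")
plt.show()
```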
So let's look at what the eigenvalues look like. Here I take the sigmoid, the hyperbolic tangent, the ReLU, and the absolute value, so four different activation functions, and look at the ESD. You get four different ESDs, but something interesting happens for this last one: if you look very closely, it's actually the Marchenko–Pastur distribution. So it's kind of interesting that if you take WX, apply the absolute value to every entry, and form the sample covariance matrix, you recover the Marchenko–Pastur distribution. And it's not just the absolute value: you can find a whole class of functions, looking even more complicated, for which you still get Marchenko–Pastur. Here I take more complicated functions (combinations involving a cubic, a sine, and a logarithm of an absolute value), and they all give the Marchenko–Pastur distribution with a certain parameter depending on φ and ψ.

And why is that? Here is the limiting measure. It's a big equation, but don't look at it too much; it's not that important. What's very important is the first line: the equation depends on only two parameters of the function. There is θ₁, the expectation of f of a Gaussian, squared inside: θ₁ = E[f(g)²]. And there is θ₂, the expectation of f′ of a Gaussian, with the square outside the expectation: θ₂ = (E[f′(g)])². So in particular θ₂ can be zero, whereas θ₁ can't be zero unless f is zero. I wrote it like this because something interesting happens in two special cases. If θ₁ = θ₂, this term just vanishes and you get a cubic equation. This is not surprising, because the only functions for which θ₁ can equal θ₂ are linear functions: if you expand f in Hermite polynomials, θ₂ is the square of the first coefficient while θ₁ is the sum of the squares of all of them, so equality forces f to be linear. So of course your model is just a linear one, and you recover the product Wishart matrix, the linear network from yesterday: you recover exactly this matrix. The other special case is θ₂ = 0: then this other term vanishes, you get a quadratic equation, and you recover the Marchenko–Pastur distribution. So you actually have a whole class of functions, those with θ₂ = 0, that make you recover Marchenko–Pastur.

And there's actually an interesting related behavior for which we have a result; I'm not going to talk about it for time reasons, but if you look at several layers, so if you keep applying i.i.d. weights, and you choose your function so that you get Marchenko–Pastur at the start, you recover Marchenko–Pastur at every layer, provided you renormalize correctly each time. The Fan–Wang paper actually states exactly what happens for multiple layers, for any activation function.

So even though I told you not to look at this equation, you can read off the following from this model of random matrix theory: the limiting distribution is the same as the asymptotic one of a linear model. It's actually an interpolation between the product Wishart and Marchenko–Pastur; those are really the two extremal cases, and for any function which is not one of those, you get an interpolation between the two: a product Wishart part and a full i.i.d. matrix of the correct dimensions, and you just combine the two. If you want only the i.i.d. part, so Marchenko–Pastur, you just set θ₂ = 0.
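Since everything hinges on θ₁ and θ₂, here is a small Monte Carlo check of the three regimes just described; the specific test functions are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.standard_normal(10**6)          # samples of a standard Gaussian

def thetas(f, df):
    """Monte Carlo estimates of theta1 = E[f(g)^2] and theta2 = (E[f'(g)])^2."""
    return np.mean(f(g) ** 2), np.mean(df(g)) ** 2

# Generic odd function: tanh. Here theta1 > theta2 > 0, a genuine interpolation.
print(thetas(np.tanh, lambda x: 1 - np.tanh(x) ** 2))

# Even centered function: (x^2 - 1)/sqrt(2). Then theta1 = 1 and theta2 = 0,
# so the limiting ESD is Marchenko-Pastur.
print(thetas(lambda x: (x**2 - 1) / np.sqrt(2), lambda x: np.sqrt(2) * x))

# Linear function: theta1 = theta2, the product-Wishart (purely linear) case.
print(thetas(lambda x: x, lambda x: np.ones_like(x)))
```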
So that's the behavior of the empirical eigenvalue distribution, the whole cloud of eigenvalues, and we saw it was relevant for the ridge regression. But it doesn't tell us anything about another interesting random matrix theoretical problem, which is the behavior of the largest eigenvalue. Of course, if you know the asymptotic ESD, you don't know anything about individual eigenvalues: you just know what the whole cloud looks like, but not where the first or second or third eigenvalue is. So we're going to look at this now.

The largest eigenvalue, you can't really see it directly in the solution of the ridge regression or something like this, but there is numerical evidence, in the paper I cited earlier, suggesting that the existence of an outlier, meaning an eigenvalue that pops out of the bulk, could help predict the test error. The main observation, for a certain classification problem, is that the existence of an outlier, and its distance from the bulk, go together with a better test error: if you don't have an outlier, you actually can't even classify correctly, and the further the outlier is from the bulk, the better the test error. So this motivates the study of this largest eigenvalue, which we'll see is pretty non-trivial: some interesting things happen, very different from the empirical eigenvalue distribution.

I'm going to look at three different behaviors, three different ways to change the parameters, and see how they impact the largest eigenvalue. The first one I'll call non-universality. I fix the activation function f(x) = (x² − 1)/√2, just for simplicity: it's centered, so, as I told you, there's no trivial large eigenvalue anywhere; the √2 is just so that θ₁ = 1, which doesn't really matter; and θ₂ = 0. I fix φ and ψ, and I take X to be Bernoulli; by Bernoulli here I mean ±1 with probability one half each, the simplest distribution you can take. And I'm going to vary the distribution of W. If I take W to also be Bernoulli, of course I get Marchenko–Pastur, since θ₂ = 0, and I don't get any outlier. Now I take something with a bit of a heavier tail: W is a quarter of the time Bernoulli and three quarters of the time Gaussian, so the fourth moment is a bit higher, if you want. And I start to see something happening: I get one outlier. Then I take something with a heavier tail still, just Gaussian, and I still get an outlier, but even further away; the bulk edge was around two, and here the outlier is at 2.2 or something like this. So you can see that the heavier the tail of the distribution, the further from the bulk this outlier sits.

This is a behavior which is really interesting and, in a way, very nonlinear. What I mean by this is that if you look at the study of outliers for linear models in random matrix theory, the position of the outlier is usually universal: the existence and the position are universal, and only the fluctuations are sometimes non-universal.
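Here is a sketch of that non-universality experiment as described: same f, same Bernoulli data X, only the distribution of W changes; the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n0 = 800
m, n1 = 1600, 1000                      # illustrative sizes

f = lambda x: (x**2 - 1) / np.sqrt(2)   # centered, theta1 = 1, theta2 = 0

def largest_eig(draw_w):
    X = rng.choice([-1.0, 1.0], size=(n0, m))   # Bernoulli (+-1) data
    W = draw_w((n1, n0))
    Y = f(W @ X / np.sqrt(n0))
    return np.linalg.eigvalsh(Y @ Y.T / m)[-1]

bernoulli = lambda s: rng.choice([-1.0, 1.0], size=s)
gaussian  = lambda s: rng.standard_normal(s)
def mixture(s):                                  # 1/4 Bernoulli, 3/4 Gaussian
    mask = rng.random(s) < 0.25
    return np.where(mask, bernoulli(s), gaussian(s))

# Heavier tails of W (larger fourth moment) push the top eigenvalue
# further from the Marchenko-Pastur bulk:
for name, draw in [("Bernoulli", bernoulli), ("mixture", mixture), ("Gaussian", gaussian)]:
    print(name, largest_eig(draw))
```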
But here it's very different: the existence and the position of the outlier are themselves non-universal, because they really depend on the distribution of W, even though everything else is fixed. This observation was made in the paper I cited before, but they couldn't really explain mathematically why you get an outlier here.

The second interesting behavior is about the architecture of the model. By architecture, for a single-layer model, I don't mean the number of layers but the different dimensions: φ and ψ, so the number of samples, n₀, and so on. Here I still fix this activation function, I fix W and X with a given distribution, and I change φ and ψ. Of course the shape of the bulk will change, since the Marchenko–Pastur shape depends on φ and ψ, but the most interesting thing is the largest eigenvalue. For these values of φ and ψ, I don't get any outlier. Now I take slightly different values, and I get one outlier. And if I change them again, I get two outliers. So again something very different happens, and you want to think of these two outliers as one for X and one for W; that's the idea, and we'll see at the end why. So that's the second phenomenon: the architecture, the way the dimensions behave together, can also change the existence or the number of outliers.

Finally, I want to talk about the activation function, and how a different choice of activation function can also make an outlier arise or not. Here I take a family of functions that looks very complicated but is really just cos(αx), centered and scaled so that θ₁ is always equal to one, whatever α I take, and θ₂ is always zero because these are even functions. So I always get Marchenko–Pastur, and the two parameters are fixed: in particular, the asymptotic ESD never changes, since it depends only on these two parameters. I also fix φ and ψ and the distributions of W and X. So everything is fixed, the ESD is the same, and whether or not an outlier arises means that something different is happening with the activation function. And indeed it is: for the first value of α, I get Marchenko–Pastur and no outlier; for α = 1.5, I start to see one outlier coming up; and for α = 0.8, even smaller, I start to see two outliers. So the behavior of this largest eigenvalue really depends on the activation function in a different way than the ESD does; you have to go, maybe, one order further to understand what's happening with these outliers.

All of these observations will be captured in the theorem, but I wanted to show what's happening, because it might be less clear from just reading the statement that something different is going on. So yes, the largest eigenvalue depends on another parameter, which I call θ₃, which is the expectation of f″ of a Gaussian: θ₃ = E[f″(g)]. So whether or not you have an outlier, and the position of the outlier, depend on f″ of a Gaussian, not just on f′. And here I define κ for a later remark; it is built from the kurtosis, the fourth moment divided by the variance squared.
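Here is a sketch of that cosine family: each f_α is centered and normalized so that θ₁ = 1 (and θ₂ = 0 by evenness), while θ₃ = E[f″(g)] moves with α. The closed forms follow from E[cos(αg)] = e^(−α²/2) for a standard Gaussian g; the value α = 2.0 for the no-outlier case is just an illustrative choice, since only α = 1.5 and α = 0.8 are specified above.

```python
import numpy as np

rng = np.random.default_rng(4)
g = rng.standard_normal(10**6)

def stats(alpha):
    """Mean and variance of cos(alpha*g) for g ~ N(0,1), in closed form."""
    mean = np.exp(-alpha**2 / 2)
    var = (1 + np.exp(-2 * alpha**2)) / 2 - mean**2
    return mean, var

def f_alpha(alpha):
    """cos(alpha*x), centered and scaled so that theta1 = 1."""
    mean, var = stats(alpha)
    return lambda x: (np.cos(alpha * x) - mean) / np.sqrt(var)

for alpha in [2.0, 1.5, 0.8]:
    f = f_alpha(alpha)
    theta1 = np.mean(f(g) ** 2)          # Monte Carlo check: ~1 for every alpha
    mean, var = stats(alpha)
    theta3 = -alpha**2 * mean / np.sqrt(var)   # E[f''(g)] in closed form
    print(alpha, theta1, theta3)               # |theta3| grows as alpha shrinks
```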
And the largest eigenvalue will depend on that too, so on the fourth moment in particular. So what is the theorem? The behavior of the largest eigenvalue also relates to a linear model, but one which is distribution-dependent, depending on W and X. What I mean is that there exists a rank-two random matrix Ĵ. This Ĵ is not really complicated, but a bit big to write, and it depends on W and X: it depends on the distribution of W and on the distribution of X; if you want, you have one rank for W and one rank for X, in a way. It also depends on n₀, hidden inside the matrix, and on φ and ψ, but it doesn't depend on the function: only θ₃ depends on the function, not Ĵ. And the theorem says that the largest eigenvalue of our matrix is the same as that of the sample covariance matrix given by the linear model we saw earlier, perturbed by the rank-two random matrix θ₃ Ĵ. So we knew the ESD is given by this linear model, and now the largest eigenvalue is given by a perturbation of this linear model by θ₃ Ĵ, with the distributions of W and X entering.

So here, the existence or not of outliers is what's called the BBP phase transition. It's a phenomenon in random matrix theory which says, basically, that if the perturbation is big enough, so if θ₃ is big enough, if the fourth moments of W and X are big enough, or if φ and ψ are small enough or big enough, then you get one or two outliers. And that's a theory that is pretty well understood in the linear case, using either a moment method or subordination and free probability, something which is well understood by now.

There are several special cases I want to consider. If θ₂ = 0, which is the example I showed you in the simulations, you get Marchenko–Pastur, and in this case you have something a bit simpler: for the largest eigenvalue, you have a rank-one perturbation, and Ĵ is just a multiple of the matrix full of ones. It depends only on f″ and on the kurtosis of the distributions, so on the fourth moments: the larger the fourth moment, the further away from the bulk you are, and the bigger θ₃ is, the further you are from the bulk as well. So here it's a bit simpler, and the position and existence of the outlier are completely explicit: rank-one perturbations of full i.i.d. matrices have been very well understood since around 2005, with the first case of the BBP phase transition, and there have been many, many works to understand this. So you can even have a formula for the position of the outlier in this case, for W and X random.

Another observation is that if θ₃ = 0, then of course there is no perturbation, and so you don't have any outliers. So for example, if your function is odd, you don't have any outliers. Also, and I'm not showing the computation here, if W and X are Bernoulli, so ±1 with probability one half each, then Ĵ is zero as well. You can see it directly in this formula, because the coefficient involves κ, built from the kurtosis of W and X, and κ vanishes exactly in the Bernoulli case: Bernoulli is the only distribution with kurtosis one, so κ is always zero for this distribution. That's an observation that was also made in the paper I mentioned: if you take Bernoulli weights, you don't get any outliers. And that's essentially the reason why.
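To illustrate the BBP mechanism in the simplest case just mentioned (θ₂ = 0: a rank-one perturbation in the all-ones direction of a full i.i.d. matrix), here is a toy sketch. This is not the theorem's exact Ĵ, just the standard additive-spike picture with illustrative sizes and strengths: below a critical strength the top eigenvalue sticks to the Marchenko–Pastur edge, above it an outlier detaches and moves out.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 1000, 2000
Z = rng.standard_normal((n, m))            # full i.i.d. matrix

# Rank-one perturbation in the normalized all-ones direction.
u = np.ones(n) / np.sqrt(n)
v = np.ones(m) / np.sqrt(m)

edge = (1 + np.sqrt(n / m)) ** 2           # right edge of the Marchenko-Pastur bulk

for t in [0.0, 0.5, 1.0, 2.0, 4.0]:        # strength of the rank-one spike
    Yt = Z + t * np.sqrt(m) * np.outer(u, v)
    top = np.linalg.eigvalsh(Yt @ Yt.T / m)[-1]
    print(t, top, "bulk edge:", edge)
```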
So finally, the two little remarks I want to make. Here we have a rank-two random matrix, and usually rank two means you can have up to two outliers. And it seems to be true, right? You sometimes see two outliers, and we believe the result should hold for the second-largest eigenvalue as well: if you look at this model, the two candidate outliers it predicts do look like the two outliers in the simulations. However, the proof is not robust enough for that, and we can only handle the largest eigenvalue, not the second one; but it should be the same mechanism that gives the two outliers.

And I want to finish by saying one thing: this is not completely surprising, even though it is very nonlinear, in the sense that it behaves very differently from usual linear models. There's a paper from 2015 by Fan and Montanari who looked at a similar problem for kernel matrices, so f applied entrywise to XXᵀ instead of our f(WX). They have similar ideas and similar results, where everything depends on the same θ₃ and also on the fourth moment in a similar way; of course there you have only one source of randomness, so one outlier, but you get something similar. And actually I want to thank Zhou Fan in particular: we talked about this, and he really instigated this problem of looking at the largest eigenvalue. Our setting is slightly different because you have two sources of randomness, W and X both random, and that's why you can have two outliers. Okay, I think my time is up, and I want to thank you for your attention.