So this is joint work with a former PhD student, Lucas Benigni, who was also a student of Paul Bourgade, and it is about some nonlinear random matrix ensembles. Let me briefly give the plan of this talk. I will first explain where the question we consider comes from, namely from some neural network questions. Then I will give some previously existing results, and then I will turn to our results, which are more in random matrix theory, and give some ideas for the proof.

So the main motivation, as I said, for studying nonlinear random matrix ensembles comes from neural networks. In this setting we consider an input column vector X, which is our input data, and the data goes through the first layer of a neural network, which is a combination of both linear and nonlinear functions. The linear part here is given by two p times n matrices, and what is important is that there is an activation function f which is applied pointwise, that is, to each entry of the vector; it is here to activate, or not, the neurons. I give the example of what is called the ReLU function: there is a threshold, the function is zero here, and on this part the neuron is not activated. You obtain in this way an output vector, and you can consider a multi-stage architecture for the network, which is just alternating layers of such linear and nonlinear functions.

In artificial neural networks the task is as follows: you are given, say, two data sets X and Y, where X is the input and Y is the output, and you want to learn the synaptic weights, which are the matrices W1 and W2. There are different ways you can try to learn them, using for instance some gradient descent. I also need to say that the activation function is not unique, and one of the problems is actually to choose the activation function. A commonly used one is the ReLU function; for reasons I cannot explain, because I do not understand them, it accelerates learning, but there is a problem, which is that neurons can die in the architecture. There are other examples, such as the sigmoid function and various other functions.

If one wants to understand how such neural networks work, one possible way is to make them random, and this has already been done; I will state on the next slide some results by Romain Couillet and others. The idea is that the synaptic weights are given by a large-dimensional random matrix with i.i.d. entries. X, which is the input data set, is still deterministic, and again it is high-dimensional, and the output is the random matrix obtained by applying the function f pointwise to WX. The idea is then as follows: we are given, say, an input matrix which is a sample of p photographs, say photos of cats, and we want to identify the breeds of the cats. From the output we try to recover the correct breed on each photo, so the training set comes with a d times p matrix, the target output, which gives the correct breed for each photograph. To do that, because we are in the high-dimensional setting, a commonly used procedure is ridge regression: we consider this approximation for the target matrix Y, and the aim is to minimize the error we make when estimating Y with this matrix Z, so we try to minimize a loss function.
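To make this setup concrete, here is a minimal numerical sketch (not from the slides) of the single-layer random-features model and the ridge regression step; the dimensions, the ReLU choice, and the regularization parameter gamma are illustrative assumptions, and the closed form below is just the standard ridge solution for this type of loss, not necessarily the exact formula shown in the talk.

```python
import numpy as np

# Minimal sketch of the first layer and a ridge regression readout.
# Dimensions, activation and regularization are illustrative choices.
rng = np.random.default_rng(0)
n, p, d = 200, 500, 10                          # input dimension, sample size, output dimension
X = rng.standard_normal((n, p))                 # input data set (deterministic in the talk)
W = rng.standard_normal((n, n)) / np.sqrt(n)    # synaptic weights with i.i.d. entries
f = lambda t: np.maximum(t, 0.0)                # ReLU activation, applied entrywise
M = f(W @ X)                                    # output of the first layer
Y = rng.standard_normal((d, p))                 # target output (the "correct breeds")

# Ridge regression: minimize ||Y - B M||_F^2 + gamma ||B||_F^2 over B.
gamma = 1.0
B = Y @ M.T @ np.linalg.inv(M @ M.T + gamma * np.eye(n))
training_error = np.linalg.norm(Y - B @ M) ** 2 / p
```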
If one makes some computations, one can then show that the optimal B here, which has to be determined, can be expressed like this. What is interesting is that the learning performance can then be expressed in terms of the empirical eigenvalue distribution of a sample covariance matrix, which is M*M divided by p; this is a usual sample covariance matrix.

This question has been investigated in a series of papers by Romain Couillet and coauthors, as I said, and for the question on the preceding slide they have shown the following: they have identified the limiting empirical eigenvalue distribution of this sample covariance matrix. They consider a sub-Gaussian matrix W, f is a Lipschitz continuous function, and all the dimensions grow to infinity at the same speed, so that the ratios all stay bounded. They then show that the empirical eigenvalue distribution of the sample covariance matrix has the same limit as another spectral measure, which is defined through its Stieltjes transform by this equation; let me explain it in a few words. One important parameter in this equation is the matrix G-bar, which is the expected value of our sample covariance matrix. Actually, this equation has already been encountered in random matrix theory, since it is a deterministic equivalent for the limiting empirical eigenvalue distribution of a sample covariance matrix with non-identity covariance: if you consider a matrix T such that T T* is equal to G-bar, then the limiting spectral measure of M*M is almost surely the same as that of T X X* T*. What is interesting in their result is that the dependence on the activation function is hidden in this matrix G-bar, and it is non-universal: they have shown that there is quite a strong dependence on the first moments of the entries, and they use this to study the performance of neural networks. They also study a lot of other questions for this model.

Then there is another model, proposed by Pennington and Worah, who consider the fully random case. This time the input data set is also random, as well as the weights, and they consider the simplest case where W and X are Gaussian random matrices; for ease, say they are centered with variance one (actually they consider a more general case where the variances of W and X are different, but I state the simplest result), and the output data set is then also random. We consider exactly the same question: we consider the nonlinear matrix f(WX divided by square root of n) and the associated sample covariance matrix. You are given an input data set and the output, there is a loss function — I am coming back to this question — you try to minimize the loss function, you find your optimal B, and then the expected value of your loss function can be a measure of the performance, and it is expressed in terms of the empirical eigenvalue distribution of this matrix. What they show is the following: there exists a limiting empirical eigenvalue distribution, provided the ratios of the dimensions of X and W converge to some constants, and they obtain a fixed-point equation for the Stieltjes transform. The way they prove it is based on some more or less explicit computations, because W and X are Gaussian.
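As an illustration of this fully random model, here is a small simulation sketch; the dimensions and the choice of tanh (which is odd, hence centered under the Gaussian) are illustrative assumptions rather than choices from the paper.

```python
import numpy as np

# Sketch of the fully random model: W and X Gaussian, M = f(WX / sqrt(n0)),
# and we look at the spectrum of the sample covariance matrix M M^T / p.
rng = np.random.default_rng(1)
n0, n1, p = 1000, 1000, 2000       # illustrative dimensions; only the ratios matter
W = rng.standard_normal((n1, n0))
X = rng.standard_normal((n0, p))
f = np.tanh                         # example activation, centered under N(0,1) since it is odd
M = f(W @ X / np.sqrt(n0))          # entries of WX / sqrt(n0) are approximately N(0,1)
G = M @ M.T / p                     # sample covariance matrix of the output
eigenvalues = np.linalg.eigvalsh(G)
# A histogram of `eigenvalues` approximates the limiting eigenvalue distribution.
```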
Since W and X are Gaussian, one can use analytical tools: actually, they compute the moments of the spectral measure and use a saddle-point argument. So this was done for Gaussian matrices, and they describe the limiting spectral measure using some graphs which we could not really understand, so we tried to understand how they obtained this result, using more general models than Gaussian random matrices.

The model we consider with Lucas is again given by random matrices W and X. We do not assume that the entries are Gaussian, but we need some decay assumption on the tails of the entries. We also consider a class of real analytic functions; I do not give here all the assumptions, but all the derivatives of f grow at most polynomially on compact sets, and the important thing is that f is centered with respect to the Gaussian distribution, the N(0,1) distribution. We make the same assumption that the ratios of the dimensions converge to some constants as n goes to infinity. Then we obtain that the empirical eigenvalue distribution of G converges to the same limiting distribution mu_F as that of Pennington and Worah.

So I will try to explain what this distribution mu_F is. Actually, the dependence on f is through two parameters: theta_1, which is the expected value of f squared, and theta_2, which is the square of the expected value of f prime, the expectation being always with respect to the Gaussian distribution. This is the fixed-point equation for the Stieltjes transform of mu_F; I will not discuss this equation here. What is important is that the dependence on f really lies in these two parameters theta_1 and theta_2, which means that the limiting distribution mu_F is universal in this fully random setting, in contrast to the case where X was deterministic. The other constants appearing here are the limits of the dimension ratios. What is important is that if theta_2 is zero, mu_F is the Marchenko-Pastur distribution. The other extreme case is when theta_2 is equal to theta_1, and in this case the measure mu_F is the same as that of the linear random matrix model WX divided by square root of n. In the other cases, this is some kind of interpolation; I will come back to this later.

So this could be of interest for applications to neural networks, and it was actually already a question in the article of Pennington and Worah, because it can help to choose the activation function: the Marchenko-Pastur distribution is a good limiting distribution because it means that there is not much distortion between the output and the target data sets. A few simulations: the first one corresponds to the case where there is no special choice of the parameters theta_1 and theta_2, this is the generic case; the second one is the Marchenko-Pastur distribution, so theta_2 equal to zero; and the last one is when theta_2 is equal to theta_1, so we get a linear random matrix ensemble again. If, in addition, f is bounded, then the largest eigenvalue sticks to the support of the distribution mu_F; in all cases the support of mu_F is compact, which I did not mention.
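Since the limit only sees f through theta_1 and theta_2, one can estimate these two parameters numerically for a given activation. Here is a small Monte Carlo sketch, written in the convention theta_1 = E[f(Z)^2] and theta_2 = (E[f'(Z)])^2 with Z standard Gaussian; the particular functions below are illustrative choices, not examples from the talk.

```python
import numpy as np

# Monte Carlo estimates of theta_1 = E[f(Z)^2] and theta_2 = (E[f'(Z)])^2,
# Z ~ N(0,1); these are the only features of f entering the limit mu_F.
rng = np.random.default_rng(2)
Z = rng.standard_normal(10**6)

def thetas(f, fprime):
    return np.mean(f(Z) ** 2), np.mean(fprime(Z)) ** 2

# tanh: odd, hence centered under N(0,1); generic case with 0 < theta_2 < theta_1.
print(thetas(np.tanh, lambda t: 1.0 - np.tanh(t) ** 2))
# identity: the linear case, theta_1 = theta_2 = 1 (the limit of the linear model WX / sqrt(n)).
print(thetas(lambda t: t, lambda t: np.ones_like(t)))
# (t^2 - 1)/sqrt(2): centered, with E[f'(Z)] = E[sqrt(2) Z] = 0, so theta_2 = 0
# and the limit is the Marchenko-Pastur distribution.
print(thetas(lambda t: (t ** 2 - 1.0) / np.sqrt(2.0), lambda t: np.sqrt(2.0) * t))
```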
Now, another possible description of this limiting distribution mu_F. As I said, it is some kind of interpolation between two linear models, the Marchenko-Pastur one and the product Wishart case, and one can actually make this precise. To do so I need to introduce another random matrix Z, which is just Gaussian, and W and X here will again be Gaussian random matrices; this is not important for the limit. Then one can show that mu_F is also the limiting empirical eigenvalue distribution of an information-plus-noise sample covariance matrix. How can I explain this formula? An information-plus-noise matrix is of the form (Z + A)(Z + A)*, where Z is Gaussian, say; here we take A to be a product matrix of the type WX divided by square root of n, and the weights put on Z and on A depend on theta_2 and theta_1. This result is related to the free convolution, which has already been studied by Florent Benaych-Georges, and it is also somewhat similar to a result of El Karoui, who studied kernel random matrices.

Because we are interested in more than one layer — so far I only gave the limiting empirical eigenvalue distribution when there is one input, one output, and one goes through a single layer — we would like to consider more than one layer. Actually, we can only consider a fixed number of layers. At each step we introduce some new random matrices for the weights, the matrices W_i, which are independent and satisfy the same decay assumption as before, and we form the output matrix at each step: we start from the output at layer l minus one and again consider the output data set, except that we normalize the matrix so that at each step the variance of the entries of M is one; this is what is called batch normalization. This time we are interested in the limiting empirical eigenvalue distribution after step l of our network. We studied only the case where theta_2 is equal to zero and f is, in addition, bounded, and then we can show that at each step the limiting empirical eigenvalue distribution is the Marchenko-Pastur distribution. This finishes the results; all of this had been conjectured by Pennington and Worah. One of the main problems here is that we cannot let the number of layers L grow to infinity, which would be the interesting result.
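Here is a minimal simulation sketch of this multi-layer recursion with the normalization just mentioned; the widths, the depth, and the bounded centered activation cos(t) - e^{-1/2} (for which E[f'(Z)] = 0, hence theta_2 = 0) are illustrative assumptions of mine.

```python
import numpy as np

# Sketch of the multi-layer recursion: at each layer a fresh Gaussian weight
# matrix is drawn, f is applied entrywise, and the output is rescaled so that
# its entries keep unit variance (the normalization mentioned above).
rng = np.random.default_rng(3)
n, p, L = 800, 1600, 3                       # illustrative widths and a fixed number of layers
f = lambda t: np.cos(t) - np.exp(-0.5)       # bounded, E[f(Z)] = 0 and E[f'(Z)] = 0, so theta_2 = 0
M = rng.standard_normal((n, p))              # random input data set
for layer in range(L):
    W = rng.standard_normal((n, n))
    M = f(W @ M / np.sqrt(n))                # output of the next layer
    M = M / np.sqrt(np.mean(M ** 2))         # renormalize so the entrywise variance is one
eigs = np.linalg.eigvalsh(M @ M.T / p)       # expected to resemble Marchenko-Pastur at every step
```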
Okay, so I will now give some ideas for the proof. This is a simple moment method, and we start with the case where f is a polynomial; the easiest case is when f is an odd polynomial. We want to compute the moments of the spectral measure of G, and we proceed as for Wigner's theorem: we expand the trace in terms of the entries of W and X. There are a lot of indices, i, j and l, and because we assume that the matrices W and X are independent, with independent entries, we need each entry to appear at least twice. So we need to find a way to encode such a summand in order to see how one can compute the moments. The i and j indices will play a different role from the l indices — let me explain — because the j's and the i's are necessarily repeated. So we encode the summand in the following way. We first assume that all the i and j indices are pairwise distinct; this means that we can draw the i and j indices on a cycle of length 2Q, and the blue vertices on this cycle correspond to the l indices. If there is a label l here, in this niche between i_1 and j_1, this means that I read W_{i_1 l} X_{l j_1} in my expected value. Now I have the constraint that each of the W and X entries needs to appear twice, so I need to match the blue points. The basic idea, I would say, is that we want to maximize the number of pairwise distinct i, j, l indices; so, assuming the i and j indices are pairwise distinct, I will try to maximize the number of l indices.

If I want to do so, one can check that the right way is to match l indices inside niches. But then, as I said, k, the number of blue points inside a niche, is odd, so I cannot obtain a perfect matching inside the niche, and I need to add a matching cycle, which necessarily has to go along the whole cycle. This is the case where Q is bigger than one, and if one wants to count the number of ways to perform such a perfect matching and to determine the cycle: in each niche we have k possible choices for the vertex in the cycle, and then a perfect matching of the k minus one remaining blue indices, and this gives the parameter theta_2. The last case is when Q is equal to one: in this case we get a simple cycle of length two and 2k blue points, which can be matched without considering niches, and this gives this time the parameter theta_1.

Now, if I remove the assumption that the i and j indices are distinct and I make some identifications, we can obtain in this way cycles of length two or longer cycles, and we can then show that the graphs which contribute in the limit are what are actually called cactus graphs, which means that one can see them as a tree of cycles. So this black graph is fine, but if I add an identification here, this is not a contributing graph. So my black graph on the i and j indices is a tree of cycles, and then, inside each cycle, in order for it to contribute in the limit, a typical matching is as before: if the cycle has length greater than two, we have a full blue cycle and perfect matchings inside niches, which gives a theta_2, and if the cycle has length two, we obtain a contribution of theta_1.

So in the end we obtain this formula for the moments of the probability distribution mu_F. This quantity A_Q is actually the number of cactus graphs obtained using a certain number of identifications of the i indices, and similarly for the j indices, with b cycles of length two; with this we obtain this formula for the moments of mu_F. We have done this computation for an odd monomial for the moment, but the dependence on f is only through theta_1 and theta_2, and we recover the expected properties: if theta_2 is zero, we obtain the number of fat trees, and if theta_1 is equal to theta_2, say both equal to one for ease, we simply get a sum over the cactus graphs.

Then we can do the same for even polynomials. Due to the fact that we assume f is centered with respect to the Gaussian distribution, essentially we forbid perfect matchings inside niches, so we need to match blue vertices from one niche to another at least once. For long cycles this gives a negligible contribution, and only cycles of length two contribute in the limit. Then we go to arbitrary polynomials, and the extension to our class of activation functions simply follows from a Taylor approximation. For the largest eigenvalue, we simply push the argument up to Q of order log n: we do not know exactly where the top edge of the support is, but the largest eigenvalue cannot exit the support under our assumptions.
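A simple numerical counterpart of this moment computation is to check that the empirical moments (1/n) tr(G^q) see f only through theta_1 and theta_2: below, two quite different activations with matched parameters (theta_1, theta_2) = (1, 0) give nearly identical low moments. The functions and sizes are illustrative assumptions, not taken from the talk.

```python
import numpy as np

# Check that the empirical moments (1/n) tr(G^q) depend on f only through
# (theta_1, theta_2): two different activations with matched parameters
# should give nearly identical moments for large matrices.
rng = np.random.default_rng(4)
n0 = n1 = 1000
p = 2000

def empirical_moments(f, q_max=4):
    W = rng.standard_normal((n1, n0))
    X = rng.standard_normal((n0, p))
    M = f(W @ X / np.sqrt(n0))
    eigs = np.linalg.eigvalsh(M @ M.T / p)
    return [np.mean(eigs ** q) for q in range(1, q_max + 1)]

theta1_cos = 0.5 * (1.0 + np.exp(-2.0)) - np.exp(-1.0)           # E[(cos Z - e^{-1/2})^2]
f1 = lambda t: (t ** 2 - 1.0) / np.sqrt(2.0)                      # theta_1 = 1, theta_2 = 0
f2 = lambda t: (np.cos(t) - np.exp(-0.5)) / np.sqrt(theta1_cos)   # rescaled so theta_1 = 1, theta_2 = 0
print(empirical_moments(f1))
print(empirical_moments(f2))   # should be close to the line above
```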
And I will finish with the extension to multiple layers, just in the case of two layers; I will only give the ideas, this is far from a proof. After the first layer our output is the matrix M_1, and this is different from X because its entries are not independent. So the idea will be to come back to independent matrices. Again, we assume that the i and j indices are distinct, and we consider the output after layer two. We cannot say anything about matching the entries of M_1, because they are not independent, so we simply match the W entries and consider the induced graph: if I match these entries, this means I forget about the i indices and consider the induced graph on the j and l indices. This gives me some moment to be computed. We can show that the graphs which contribute in the limit are such that — the green edge will be important here — there is a single edge linking two niches adjacent to the same i vertex, which is a bridge. Then we can make identifications between bridges, so in the end this may result in a j–l graph which is not connected. The remaining edges inside each niche are matched according to a perfect matching, which gives what we call flowers. So there are k minus one flowers at each step, and when you add layers, you multiply the number of flowers at each step.

As a conclusion: this was actually a way to understand the result of Pennington and Worah, because they gave these graphs and we could not understand where they came from. This works for the fully random case; if W is deterministic, we get nothing interesting. It does not work either for f(X) when X is a Wigner matrix. One of the main problems is that it does not work for the most commonly used function, which is the ReLU function, but it does work for some other, simpler ones. And what is funny is that — I did not say it, but the matrices here are real — if we consider complex matrices, with the usual assumption that the expected value of W_{il} squared is zero, then we only obtain a Marchenko-Pastur or product Wishart limiting distribution. Thank you for your attention.