Good. So first of all, thank you very much. A big thanks to the organizers and a personal thanks for inviting me here. I could speak for four days about Wigner semicircle laws and the deeper subjects that arise therein, but it's not going to be quite that long; it's going to be only four hours of lectures. All right, so let's see how to transform this blue screen into a PDF. Good. Okay, so since I'm the first lecturer of the summer school, I feel like I should say a few words about random matrices in general and then delve into the more specific subject at hand. Of course, coming in here (and I know that some in the audience might not have been familiarized with random matrices before this lecture) you can ask: what is a random matrix? A random matrix, by and large, can be thought of as a model, an object that can be used to model things. In general, you want to model uncertain, noisy, hard-to-quantify information. For example, random matrices have arisen as sample covariance matrices in statistics. They have important roles to play in signal processing and information theory, where they quantify noise. They can serve as testers for algorithms in numerical linear algebra: what does it mean for an algorithm to work well on a typical case? Well, let me take a random matrix and see how the algorithm performs on it. And on and on. So they serve as models, and you can extract information from them about the system you're working with, or about the situation, or, as I said, about the algorithm you want to analyze. In classical random matrix theory there are roughly two paths by which random matrices were introduced. One of them came from statistics, with the work of John Wishart; that's the first prominent instance in which random matrices were used explicitly, and they were used to model sample covariance matrices. And then in the 1950s, Wigner, and a number of mathematical physicists coming after him, including Dyson, looked at random matrices in the context of particle interactions and many-body problems. The two types of matrices they worked with are similar in many respects, but different. In this series of lectures we're going to focus on the second kind, the Wigner random matrix. It was introduced to model the behavior of complex systems given by Hamiltonians. Wigner himself considered two types of random matrices: ones with Bernoulli entries, plus or minus one, and ones with Gaussian entries. But with that I'm already introducing the idea of a random matrix in terms of its entries, so let me say a little bit about that. How do we define a Wigner matrix? I'm going to talk about real random matrices; you might see that the title of this slide is "random real facts." "Random" just stands in for the fact that whatever I say is going to relate to random variables. And "real" is not meant to emphasize that these are facts (although in today's world you never know, so it's perhaps good to put a qualifier on things); it's meant to emphasize that we're going to talk about real entries. So I'm only going to talk about real entries in this lecture. Pretty much everything that I say will extend
nicely and smoothly to complex entries or quaternion entries, but that goes beyond the scope of my first lecture. Okay, so: random real facts. We look at an n-by-n matrix W and we require it to be symmetric, so W_ij = W_ji for all i, j. We sample these entries from distributions: the W_ij are independent and centered, which means expectation zero. The diagonal can be distributed differently from the off-diagonal, as long as everything is centered. And we normalize the off-diagonal entries to have variance one, that is, E[W_12^2] = 1; since the entries are already centered, we don't need to subtract the expectation. We don't constrain the variance of the diagonal; it will turn out not to matter for our purposes. So this is what a Wigner random matrix is. You can think of it as sampling one variable at a time from these distributions, with the constraints that I emphasized. For example, you can take the off-diagonal entries to be distributed as N(0,1), and this is notation I should explain: N(0,1) stands for the standard Gaussian, a centered Gaussian variable with variance one. And you can take the diagonal ones to be N(0,2), so the variance there is two instead of one. What you get if you do this is the Gaussian orthogonal ensemble. It has beautiful properties and it has been analyzed very closely. It serves, in a sense, as a benchmark for the rest of the ensembles (an ensemble here being defined by the kind of distributions you choose to put on W_ij and W_ii), something to project yourself onto: how different will a Wigner ensemble with some other distributions be from the one with Gaussian distributions? That question is going to be answered in more depth next week, I guess, when we talk about universality and the like. For starters, for this week and for this lecture in particular, we're not going to go quite as deep. We'll just mention that this ensemble is particularly nice because things can be computed for it exactly. Okay, so one more assumption that I'm going to make, which I will eventually remove, but which for now, as a stepping stone, makes things easy: I will assume that the distributions the W_ij come from have finite moments of all orders. That's what this assumption says. A moment is the expected value of x^k with respect to some distribution; that's the kth-order moment. So these are the assumptions I'm going to make on my Wigner matrices in this lecture. And with this, we have a model: we know what the entries look like, and we have an idea of what the symmetries are. What are we interested in about this matrix? Well, it turns out that to normalize things properly, we're going to have to work not with the matrix that I just described, but with a scaling of it: we scale by 1/√n. The reason you want to scale by 1/√n, intuitively, is that this keeps the total variance in each row bounded, and therefore the spectral norm is likely to be bounded. You want to be able to say, right off the bat, that with high probability the eigenvalues lie in some compact interval. So that's why we need to introduce this scaling by 1/√n.
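Here is a minimal MATLAB sketch of sampling such a matrix under the GOE convention just described. This is an illustration of mine, not code from the talk, and the variable names are mine.

```matlab
% Sample an n-by-n Wigner matrix with the GOE convention above:
% symmetric, independent centered entries, off-diagonal variance 1,
% diagonal variance 2.  Then apply the 1/sqrt(n) scaling.
n = 200;
U = triu(randn(n), 1);                   % independent N(0,1) above the diagonal
W = U + U' + diag(sqrt(2)*randn(n,1));   % symmetrize; N(0,2) on the diagonal
Wbar = W / sqrt(n);                      % the rescaled matrix we will work with
```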
So I'm going to work with the matrix W-bar, the rescaled version of the matrix I introduced: W-bar = W/√n. Now, the main quantity of interest, for Wigner specifically but also for the generations of mathematicians and mathematical physicists that came after him, has been the spectrum. The eigenvalues of this matrix were supposed to model the energy levels of the system he was interested in, the system he introduced this random matrix as a model for. And the question is: what's the distribution of the eigenvalues of this matrix? What do I mean by that? Well, let's take a look. Take some matrix A, square and symmetric, and suppose its eigenvalues are λ_1(A), λ_2(A), ..., λ_n(A). You can think of the distribution of one uniformly random eigenvalue chosen from this matrix as (1/n) Σ_i δ_{λ_i(A)}(x), where δ is the Dirac delta function. All this is saying is that you choose among the λ_i(A) uniformly. This is called the empirical spectral distribution, or ESD for short (I'm going to use this acronym), of the matrix A. In fact, let me write it down: if A is n-by-n, and I'll take it to be symmetric (it doesn't have to be, but it certainly helps me think of a distribution on the real line), with eigenvalues λ_1(A) ≤ ... ≤ λ_n(A), then the empirical spectral distribution of A is F^A(x) = (1/n) Σ_{i=1}^n δ_{λ_i(A)}(x). This is the quantity I'll be looking at. And now we can ask what this ESD is for the matrix W-bar, the random matrix I introduced. Let's pretend this is an actual real-time experiment that you can run on a machine that has MATLAB on it. I know we're supposed to use Mathematica and the like, but I'm a MATLAB girl, sorry about that. Let's take a look at what I'm doing here. I'm picking a 200-by-200 random matrix; in MATLAB, randn denotes sampling from the standard normal distribution, so this is a square matrix of independent standard Gaussians. It's not symmetric, so I have to symmetrize it, and I do that by adding the transpose and dividing by two. Then I scale it: W-bar is what I'm interested in, so I divide by the square root of n. I take its eigenvalues, I plot the histogram of the eigenvalues, I say "hold on," and then I plot something else, which is a density: the semicircular density, supported from minus 2 to 2. It should be a semicircle. Well, let me qualify that: it's not literally a semicircle, because of the factor 1/(2π) in front, so really it's half an ellipse, but it looks like a semicircle. And if you run this experiment in MATLAB, you're going to get something that looks like this: the empirically plotted spectral distribution is going to be awfully close to the continuous semicircular distribution that I've also plotted here. You can say: okay, this in itself is an interesting fact, of course, but you have chosen a very particular type of Wigner matrix, the one with Gaussian entries, and you just told us that those are special, that they're nice, that we like them, right? Well, it turns out that what you write here doesn't really matter. You could start off with rand(200), which samples from the uniform distribution.
Well, you'd have to center it, so you have to start with something that's centered. But you can start with your own distribution, go through this process, plot the results, and get something extremely similar to this, extremely similar. So you can immediately conjecture: it looks like the eigenvalue distribution for one of these Wigner matrices, for n large enough, approaches the semicircle. Here n is 200, but I think you can start seeing it much earlier; I did 200 because I wanted the histogram to have enough meat in it so you can look at it closely, but you can probably start seeing this at 50 or 100, it's already there. So it looks like the distribution of the eigenvalues, as n, the size of the matrix, increases, approaches this semicircular distribution, and you can conjecture this regardless of the distribution of the entries that you start from. That's pretty interesting, because it says two things. First, even though you have a random matrix, and therefore you'd expect random eigenvalues (that's clear: a random matrix means random eigenvalues), there is tremendous structure behind these random eigenvalues. They may be random, but as a group they arrange themselves in a very nice pattern. So there is structure behind these random matrices; in fact, there's very strong structure, and this is where the second thing comes in: it's a stable structure, what's called a universal structure, because it doesn't depend on the distribution of the entries you start with. This is pretty powerful. It's a very simple first step, really, in this examination of universality for Wigner matrices, but even so, it's pretty powerful. And now, hopefully after I've wowed you a little bit with this, I will tell you that there are two lies on this slide. Can anyone point them out? Yes. The normalization... oops, oh, sorry, I think that's my laptop conking out. Yes, you are completely correct: the normalization of the histogram is off. As you can see, I've plotted things pretending they run from minus two to two, but the plot shows minus one to one. I'm sorry about that; I realized it this morning. The plot comes from a time when I insisted on doing things my way and plotting things between minus one and one instead of minus two and two. Good, that's one of them. What's the second? How familiar are you with histograms? It's what? The frequencies? It's normalized, exactly. If I just plot the histogram, the heights are going to be integers; it's going to show me how many eigenvalues I have in each bin. But what I've actually done here is normalize the histogram to have area one, because if I don't, there's no point in comparing it with a distribution, with a curve that has area one, right? So that's the second one. Let's call them lies, since it's good to take responsibility, but they are easily fixable lies: you just have to introduce a little bit more code and plot a normalized histogram so that the area is one. (A corrected version of the little experiment is sketched below.) So this is the experiment that tells you what we're going to be doing for the next, let's say, lecture and a half. The fact that you pick a single matrix, and you see such a good convergence, such a good approximation to the semicircular distribution, tells you something.
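Here is a hedged MATLAB reconstruction of the experiment with the two fixes folded in; it is not the lecturer's original script. One adjustment of mine: I symmetrize with (A + A')/sqrt(2) rather than (A + A')/2, so that the off-diagonal variance is exactly 1 and the limiting support is [-2, 2], matching the normalization defined earlier.

```matlab
% Semicircle experiment, with an area-one histogram and the [-2,2] overlay.
n = 200;
A = randn(n);                        % i.i.d. N(0,1) entries
W = (A + A') / sqrt(2);              % symmetric; off-diagonal variance 1
Wbar = W / sqrt(n);                  % the rescaled matrix
e = eig(Wbar);                       % its n real eigenvalues
histogram(e, 30, 'Normalization', 'pdf');          % normalized ESD histogram
hold on
x = linspace(-2, 2, 400);
plot(x, sqrt(4 - x.^2) / (2*pi), 'LineWidth', 2);  % semicircular density
hold off
```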
It suggests that the convergence of this ESD to the semicircular distribution happens, in some sense, in probability, because you've chosen one matrix at random, you plotted it, and you see good agreement. That's what we'll be proving for the next, I guess, lecture and a half, maybe less now, because I've been talking for a while, so we don't technically have a full lecture left. This here is the formalization of what I was saying: we look at the ESD of W-bar, and we say that F^{W-bar} converges weakly, in probability, to the distribution whose density is given by the semicircle. That's what we're going to prove. And the way we're going to prove it is by the method of moments. The method of moments is a technique most often used in probability when you have to deal with countable structures, and we'll see in a moment why matrices qualify. It's based on the following idea: under certain conditions (I'm just going to name a few; you're going to see a couple of them in the first problem session for this course), such as Carleman's condition or compact support, a distribution is uniquely determined by its set of moments. So if you know the expected value of x^k with respect to this distribution for every k, then you know the distribution. If you think about distributions, you can probably get a feeling for why this is true: you do some sort of Weierstrass approximation. If you know what the distribution does to compactly supported continuous functions, then you know, in a sense, what it is; those compactly supported continuous functions can be approximated by polynomials, and integrating a polynomial against the distribution gives you a linear combination of moments. That's the idea. Now, the method of moments is the following statement: if a distribution σ is uniquely determined by its moments, and if you have a sequence of distributions F_n with the property that, for each k, the kth moment of F_n converges to the kth moment of σ, then F_n converges weakly, in probability, to σ. This is something you'll discuss more in depth in the problem session, but we're going to rely on it today. In other words, if I want to show convergence in probability to a certain distribution, it suffices to prove, for each k, convergence of the kth moment. Does that make sense? I hope so, because I'm going to use it a lot. So this is the moment method, or the method of moments, and as I mentioned, we'll use it many times. What are the moments of the semicircle? It turns out that the moments of the semicircle are very easily expressible: the kth moment is C_{k/2} times the indicator δ_{k/2 ∈ Z}. What do I mean by that? This indicator is 1 if k is even and 0 if k is odd. So the odd moments of the semicircle are 0, and the even moments are equal to C_{k/2}, the (k/2)th Catalan number. The Catalan numbers are easily defined: C_{k/2} is k choose k/2, divided by k/2 plus 1. These Catalan numbers are very important in combinatorics; they count all sorts of structures, and we'll see some of the structures they count. They're at the core of some very interesting phenomena, and one of them is that they are the moments of the semicircular distribution.
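A quick numerical sanity check of this fact (again a sketch of mine, not from the talk): compare the moments of the semicircular density on [-2, 2] with the Catalan-number formula.

```matlab
% k-th moments of the semicircle density sqrt(4-x^2)/(2*pi) on [-2,2]
% versus the prediction: 0 for odd k, the Catalan number C_{k/2} for even k.
for k = 1:8
    mk = integral(@(x) x.^k .* sqrt(4 - x.^2) / (2*pi), -2, 2);
    if mod(k, 2) == 0
        ck = nchoosek(k, k/2) / (k/2 + 1);   % Catalan number C_{k/2}
    else
        ck = 0;                              % odd moments vanish by symmetry
    end
    fprintf('k = %d:  numerical moment = %.4f,  predicted = %.4f\n', k, mk, ck);
end
```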
So, in other words, if I were able to show that for any k, the kth moment of the empirical spectral distribution converges to these Catalan numbers, I'd be done: it would follow that the distributions converge weakly, in probability, to the semicircular distribution, which is my goal. What are the moments of the ESD? Well, let's take a look. Here is the ESD; suppose I want the expected value of x^k with respect to this distribution. What is it going to be? Since I'm integrating x^k against Dirac delta functions, each term is just λ_i(W-bar)^k. So the kth moment of F^{W-bar} is (1/n) Σ_i λ_i(W-bar)^k, and lo and behold, that's just (1/n) trace(W-bar^k). That's the nice thing: the moment of the empirical spectral distribution is expressible in terms of the matrix, in terms of, as we will see, the entries of the matrix, and that's what allows us to calculate it. As I mentioned, this is what I want to prove, without any expectation. But I will start off by taking an additional expectation. I'll say: okay, I don't want to look at F^{W-bar} just yet; I want to look at the expected F^{W-bar}, and that is achieved by averaging over W-bar. How do you do that? You average over the entries. So I'm going to look at a slightly different object: the expectation over W-bar of this moment. This is the moment of the distribution that has, quote unquote, the level density. When Wigner looked at these matrices, he viewed the eigenvalues as energy levels; he was interested in the level density, and therefore he was looking, roughly, at how this would average out over the ensemble. The nice thing about this is that even though the ESD itself is always atomic (it's a sum of delta functions), when you average over the matrix you get a continuous object, a continuous distribution. In fact, these have beautiful expressions in terms of orthogonal polynomials in the case of the Gaussian orthogonal ensemble. And if you were to plot the level density against the semicircle, you would see what I like to call finger plots: for any n, you would see little bumps riding along the semicircle. If you were looking at n equal to five, say, you would have the semicircle and, on top of it, something with five bumps. But that's the level density, which is the average of the empirical spectral distribution, and I'm going to look at its moments first. The reason is that it will turn out that the empirical spectral distribution is actually very close to the level density: the moments of the ESD are very concentrated around their means, which are the moments of the level density, and that is going to be intrinsic to proving things. Okay. So let me look at this object: (1/n) times the expectation over W-bar of trace(W-bar^k). Remember that W-bar is W scaled by 1/√n, so when I raise it to the kth power I pull out a factor of 1/n^{k/2}; together with the 1/n in front, that makes 1/n^{k/2 + 1}.
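To keep the normalizations straight, here are the identities from this discussion written out, in my notation:

```latex
\[
  \int x^k \, dF^{\bar W}(x)
  \;=\; \frac{1}{n}\sum_{i=1}^{n} \lambda_i(\bar W)^k
  \;=\; \frac{1}{n}\,\operatorname{tr}\bar W^k ,
  \qquad
  \frac{1}{n}\,\mathbb{E}\!\left[\operatorname{tr}\bar W^k\right]
  \;=\; \frac{1}{n^{k/2+1}}\,\mathbb{E}\!\left[\operatorname{tr}W^k\right].
\]
```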
And what's the rest? Well, what does the trace of W-bar to the k look like, or really, the trace of any matrix raised to a power? The trace of A^k is the sum, over all indices i_1, i_2, ..., i_k between 1 and n, of A_{i_1 i_2} A_{i_2 i_3} ⋯ A_{i_k i_1}. This is elementary linear algebra; it's how you multiply matrices. If you multiply the matrix A by itself k minus one times, that's what you get. (A tiny brute-force check of this expansion is sketched below, after the example.) So I'm going to be interested in taking the expectation, over these variables, of products such as this. Let me call the ordered k-tuple i = (i_1, i_2, ..., i_k) the indices, and I will write W_i for the product W_{i_1 i_2} W_{i_2 i_3} ⋯ W_{i_k i_1}. This is just notation for now. I'll call this a word, although I'm probably not going to use the word "word" again much in this presentation. For each such word we have an associated graph G_i. Let me give you an example. Suppose I want to look at a term like W_{12} W_{24} W_{43} W_{34} W_{42} W_{24} W_{42} W_{21}. This is an acceptable word. What's the corresponding index tuple i? It's (1, 2, 4, 3, 4, 2, 4, 2); its length is 8, which is how many factors there are in the product that constitutes the word. If I were to represent this as a graph, I would do it as follows: first draw four points, 1, 2, 3, 4, and then connect them in the order indicated by the indices. So 1 to 2, then 2 to 4, then 4 to 3, then I come back, 3 to 4, then I come back to 2, then I advance again to 4, back to 2, and at the end I come back to 1. So what do I do? I execute a closed walk on this simple graph with four vertices and three edges. That's all I do. And for this graph (let me identify it with the graph G_i) I have three edges, just three, and four vertices. If you look again at this word, you'll see that you can collapse it, using the fact that the matrix is symmetric: W_{12} is the same as W_{21} by symmetry, W_{43} and W_{34} are also the same, and everything that remains, W_{24} and W_{42}, is one and the same variable. So this term is actually W_{12}^2 W_{24}^4 W_{34}^2. That's what it is. Knowing that every one of the words can be rewritten like this, and that each has a graph associated with it, let's think for a moment about what we're doing when we sum all of these terms. We're picking graphs, which have at most k edges, because each word gives a closed walk of length k; so you'll have at most k + 1 vertices in the graph. That means that for every k you have a finite number of such objects. Once you've picked a graph and a way of walking on it, all you have to choose are labels, to figure out how many terms in the sum correspond to that particular graph and that particular walk, but with different labels. You have to count the labels, and that's the part where you can get away with rather crude approximations. Now, I said that you can have up to k edges, but there's one thing to note: the entries W_ij are all independent. Since I'm taking an expectation, if an edge gets walked on only once, so if the collapsed expression, like the one on the board, contains some W_ij to the first power, then that term disappears when I take the expectation, because all the variables are centered. So I'm not interested in those; I'm only interested in walks in which each edge is walked on at least twice.
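As promised, a tiny brute-force check of the trace expansion (my own sketch; keep n and k small, since the sum has n^k terms):

```matlab
% Sum A(i1,i2)*A(i2,i3)*...*A(ik,i1) over all index tuples and compare
% with trace(A^k).  Any square matrix works; symmetric chosen for flavor.
n = 4; k = 3;
A = randn(n); A = (A + A') / 2;
s = 0;
for t = 0:n^k - 1
    idx = mod(floor(t ./ n.^(0:k-1)), n) + 1;   % decode the tuple (i1,...,ik)
    p = 1;
    for j = 1:k
        p = p * A(idx(j), idx(mod(j, k) + 1));  % wraps around: i_{k+1} = i_1
    end
    s = s + p;
end
fprintf('sum over tuples = %.6f,  trace(A^k) = %.6f\n', s, trace(A^k));
```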
But what does it mean that each edge has to be walked on at least twice? Well, it means that I get a sharper bound than the one I mentioned before on the number of edges. Since each edge has to be walked on at least twice, the length of the walk, k, has to be at least twice the number of effective edges; so on the one hand E is at most k/2. On the other hand, the graph G_i is connected; after all, I'm executing a walk on it. And if a graph is connected, then the number of edges it has is at least the number of vertices minus one: E ≥ V − 1. So we have this interesting pair of inequalities: the first comes from the fact that each edge has to be walked on at least twice, otherwise the term doesn't survive the expectation, and the second from the fact that the graph G_i is connected. All right, so what does that give? It means that V ≤ k/2 + 1. Why is this important? Well, as I mentioned, the sum I'm interested in is a sum, over all graphs of size at most k, of ways of walking on these graphs, and all of those are finite numbers because k is fixed. There's no n so far; n intervenes only when I start putting labels on these graphs, because that's where the meat is: how many times does this particular walk appear in the expression for the trace, given that you have n possible labels? And the answer is that for each effective vertex of the graph G_i you have about n choices. Of course, you don't have exactly n choices; the order in which you choose the vertices matters. But overall you have about n choices: about n for the vertex labeled 1, about n for the 2, for the 4, and for the 3. So, effectively, each walk on each graph appears about O(n^V) times, where big O is notation saying it's at most some constant, independent of n, times n^V, with V the number of vertices. So: O(n^V) choices of labels. Let me go back for a moment. For each walk, and there is a finite number of types of walks, I have O(n^V) choices, and those choices get divided by the n^{k/2 + 1}. So what do we notice? Asymptotically, as n grows large, the only walks that survive are those with exactly k/2 + 1 vertices. If you have fewer than that, the contribution from such terms washes away, because you have n^V over n^{k/2 + 1}, which is of order 1/n or smaller, and that disappears. So the only walks that matter, the asymptotically relevant ones, are those for which V is precisely k/2 + 1. Now, immediately you should be able to see that this creates a problem: if k is odd, k/2 + 1 is not an integer. So what can I conclude? All the odd moments are going to be zero: I can't satisfy V = k/2 + 1, I can only have fewer vertices than that, therefore all the terms in the sum are of lower order and, asymptotically, the odd moments are zero. But what if k is even? Then I can satisfy it: I can have V = k/2 + 1. What kind of walks do I get there? Let's think about it for a moment. (The counting bound we have just been using is summarized below.)
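For reference, here is the counting argument just described, written compactly in my notation:

```latex
\[
  \underbrace{E \le \tfrac{k}{2}}_{\text{each edge walked at least twice}}
  \quad\text{and}\quad
  \underbrace{V - 1 \le E}_{G_{\mathbf i}\ \text{connected}}
  \;\Longrightarrow\;
  V \le \tfrac{k}{2} + 1,
\]
\[
  \frac{\#\{\text{labelings of a given walk type}\}}{n^{k/2+1}}
  \;=\; \frac{O\!\left(n^{V}\right)}{n^{k/2+1}}
  \;\longrightarrow\;
  \begin{cases}
    \text{a constant}, & V = \tfrac{k}{2}+1,\\
    0, & V < \tfrac{k}{2}+1.
  \end{cases}
\]
```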
If V is precisely equal to k/2 + 1, what does that mean? Well, I obtained that inequality as a consequence of the string of two inequalities above, so to satisfy it as an equality, both of them have to be equalities. That means two things: one, each edge is walked on exactly twice; and two, the number of edges is precisely the number of vertices minus one. What kinds of connected graphs have the number of edges equal to the number of vertices minus one? Trees, of course. So the terms with this property will come from walks on trees; that's where I'm going to get this highest-order contribution, walks on trees. So here's the kicker: you have a one-to-one correspondence with rooted, unlabeled trees. What does that mean? Well, I mean something like this: you have a structure with no labels, but with a marked vertex, a starting marked vertex; a tree with a marked vertex. You can do what is called a breadth-first search or a depth-first search on it, and that gives you a way of walking, a type of walk; it's going to be in one-to-one correspondence with these asymptotically relevant walks. How do you find out which terms of the sum correspond to these walks? All you have to do is put labels on the tree, labels like 100, 275, 1310, and 1. That's all, and you have O(n^{k/2 + 1}) ways to do that. So all you really have to do is count the number of unlabeled but rooted trees; that's all, and all of those give you the asymptotically relevant terms. Why do we only count them, with no extra weight? I think I have it... oops, sorry, I have it over here. If you look at the asymptotic contribution, you are taking the expectation of products of these variables, and since in this case each edge is walked on exactly twice, all the powers are 2. Because of the way we set things up, E[W_ij^2] = 1, so these terms come with no extra weight. All you have to do really is count the number of rooted unlabeled trees, and that number is a Catalan number. And that's the end of it: it tells you that the moments of the expected ESD, the level density, do indeed converge, to 0 for the odd moments and to Catalan numbers for the even moments. So we've shown that the expected ESD converges to the semicircular distribution. That's not all there is, though. Let's see how much time I have. Good. We would actually like to show this not for the expected ESD, but for the moments of the ESD in general; I want to remove that expectation. Now, there's a very simple technique for removing the expectation, that is, for showing concentration of a random variable, and that is to examine the variance. If I look at a random variable and I know that its expectation is over here and its variance is very, very small, then the variable concentrates around its expectation (it converges to it in probability, and, with a bit more work, almost surely). Okay, that's all. So I'm going to do precisely that: I'm going to remove the expectation by showing that the variance is small. What's the variance? It's the variance of this quantity, (1/n) trace(W-bar^k), and it can be expressed as follows. When you take the variance, you get 1/n^{k+2} (the two prefactors of 1/n^{k/2+1} multiply, so the prefactor squares) times the sum, over pairs of k-tuples i and j, of the covariance of W_i and W_j, where W_i and W_j are the words corresponding to these k-tuples. It's a covariance because I'm looking at E[W_i W_j] minus E[W_i] times E[W_j]. Okay?
That's what I'll have to compute. So it's enough to focus on one of these covariances and see what happens. Each of the two words comes with its own graph, and in fact you can think of the join of the two graphs. So I have W_i and W_j, with graphs G_i and G_j; think of the two graphs as sitting somewhere, one over here and another over here, this is G_i, this is G_j, and I'm going to look at their join. The variables W_ij correspond to edges of these graphs, and they're independent. So if the two graphs don't have an edge in common, what is the covariance of the two words, of the product of the two words? What happens to covariances of independent variables? Zero, thank you. The covariance of independent variables is zero, and if the two graphs are edge-disjoint, then the two words are independent. In fact, this is true even if they have a vertex in common, because remember, the variables are indexed by edges, not by vertices. So when I look at these covariances, the only ones that stick around are those that correspond to words that overlap: the two graphs have to overlap in at least one edge, otherwise the covariance is zero and the term disappears. Okay? So now let's take a look. The total number of edges in the join of the two graphs, G_i and G_j, is at most k, by independence; it's the same consideration as before. If there's an edge in this join of graphs that is walked on only once by the concatenation of i and j, then, by the independence and centering of the variables, that term disappears. So, yet again, I'm only interested in walks for which each edge is walked on at least twice. Now, I walk on k edges (with multiplicity) in i and k in j, for a total of at most 2k edge traversals, but each effective edge has to be walked on at least twice; so the total number of distinct edges in G_i joined with G_j is at most k. Therefore, for the number of vertices: as we know, V − 1 ≤ E, and E ≤ k, so, applying the same kind of reasoning as before, V is necessarily at most k + 1. Again, I can wave my hands and say that for any fixed k, in the absence of labels, the number of pairs of such types of walks is finite; so the only thing that matters is in how many ways I can place labels. The number of ways to place labels is again O(n^V), with V at most k + 1. What do I normalize by? n^{k+2}. So the total contribution from any type of pair of walks goes to zero: I just don't have enough choices, because I have to overlap them. I just don't have enough choices. So the variance simply goes to zero, because the highest-order term is of order n^{k+1} and I divide by n^{k+2}. That means the variance goes to zero, which means the moments are concentrated, which is what I set out to show, but actually I can say a little bit more. The variances do go to zero, but they go to zero a little faster than I just argued. (A small numerical illustration of the rate is sketched below.)
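A quick empirical illustration of that rate (my own sketch, not from the talk): estimate the variance of the kth moment of the ESD from repeated samples and check that n squared times the variance stays roughly constant.

```matlab
% Estimate Var( (1/n) trace(Wbar^k) ) by Monte Carlo for several n.
% If the variance decays like 1/n^2, then n^2 * variance stays O(1).
k = 4; trials = 500;
for n = [50 100 200 400]
    m = zeros(trials, 1);
    for t = 1:trials
        A = randn(n);
        Wbar = (A + A') / (sqrt(2) * sqrt(n));   % same convention as before
        m(t) = trace(Wbar^k) / n;                % k-th moment of the ESD
    end
    fprintf('n = %3d:  variance = %.2e,  n^2 * variance = %.3f\n', ...
            n, var(m), n^2 * var(m));
end
```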
They go to zero like 1/n², and the reason for that is simple: it's the overlap condition. Let's look at why I cannot have terms for which V is k + 1. Suppose V were k + 1. What does that mean? Just as before, it means that these two inequalities are both equalities. So, in the join of the graphs, the number of edges equals the number of vertices minus one, which means the join is a tree. So I have a tree, and a tree in which each edge is walked on precisely twice, because the first inequality has become an equality. If the join of the two graphs is a tree, then, well, both G_i and G_j are connected, and since their join is a tree, it follows that both of them are trees. Now, the walks I execute on these trees have to be closed, because the words represent closed walks. So already within W_i, and within W_j separately, I walk on each tree in such a way that each of its edges is walked on at least twice. But in the join, each edge is walked on only twice in total, so I cannot have any overlap: if I had overlap, I would have at least one edge walked on four times. Is that clear? Okay. So that's a contradiction, because if there's no overlap between W_i and W_j, then the covariance is what? Zero. So I cannot be in this situation, because it doesn't allow for overlap, and that means this top-order term doesn't appear at all. In fact, the most choices we're going to have is n^k, and when you divide by n^{k+2}, you get that the variance of the moment of the ESD actually converges to zero like 1/n². You can do pretty careful estimates on the things I've just waved my hands about, the numbers of walks, the numbers of joins of graphs, and so on; you can estimate those things quite precisely, and that will allow you to construct an almost-sure argument using Borel-Cantelli, but that's a little beyond the scope of what I want to achieve today. So now, what do we know? We know that the moments are concentrated, and we know that they converge to the Catalan numbers, or to zero for the odd moments. I'm going to stop here, and tomorrow I will show you how to use these two facts, together with a Weierstrass approximation, to conclude that not just the level density converges to the semicircle law, but the ESD itself converges to the semicircle law. So rather than the smoothed-out deterministic object converging to the semicircle, we will show that the random object also converges to the semicircle. And there will be a problem session later today in which you'll be able to dig a little deeper and settle some of these notions before tomorrow's lecture. All right, thank you. I should have said from the beginning that I really appreciate questions and I'm trying to make this interactive. You didn't ask questions in the body of the presentation, but if you'd like to ask some now, you're completely welcome to do so. Any questions? We have time for maybe one or two. [Audience:] What happens if you let the variance of the off-diagonal elements go to zero? So, in other words, you're looking at a sparse matrix. Yes; you'll be talking about that in the problem session, I think, tomorrow.
It's an Erdős-Rényi ensemble, and you can still prove that, as long as the variance does not go to zero too quickly, you get the same kind of semicircular distribution. Very good question.