Okay, is that better? So first I'll be discussing community detection, also known as graph clustering, and second, graph alignment. The reasons I like these two graph learning problems are threefold. First, they each have lots of applications: community detection is useful for recommender engines or for clustering proteins in protein-protein interaction networks, whereas graph alignment is useful for de-anonymizing social network data, and it also has applications in biology. So if you have an interesting algorithm for one of these tasks, you may have some practical impact, which is not bad. Second, they have a nice mathematical structure: they are to some extent tractable, and there are non-trivial things that can be said about them. And finally, they shed some light on a very intriguing phenomenon that arises in many of these high-dimensional learning problems, which is the existence of what is known as a hard phase. That is a phase where we know that a brute-force algorithm can extract some useful signal from the observation, whereas we don't know of any polynomial-time algorithm that will succeed at extracting this signal. We don't quite understand this hard phase, but these problems are our entry points for trying to understand it better. All right, so my outline will be as follows. Roughly, the first three talks will be on community detection, and the last three sessions will be on graph alignment. Actually, I will start with something that is motivated by community detection but is not community detection per se: the so-called tree reconstruction problem.
This will allow me to introduce what is known as the Kesten-Stigum phase transition, or Kesten-Stigum threshold, and we will see later on why this is relevant to the problem of community detection in random graphs, which we will discuss after the first lecture. All right, so let's get started with tree reconstruction and the Kesten-Stigum threshold. What is the tree reconstruction problem? It is the following. Assume you are given a genealogical tree: there is an ancestor, it has its children, they have their own children, and so on and so forth. And there are traits carried by each of the individuals. If these were human beings, that could be the colors of their eyes: blue eyes, brown eyes. We will talk about traits or spins; since we're in a physics center, these will mostly be spins. And we have a probabilistic mechanism for the transmission of traits from parents to children: we give ourselves a stochastic matrix that has as many rows and columns as there are traits. The number of traits will be q; the most basic example will be q equal to 2, but we will allow any finite value of q. The transmission of traits is probabilistic and independent for each parent-child relationship, which is why we have the formula I write here. So that's my probabilistic transmission mechanism: given the tree, I transmit trait j to my child, given that I have trait i, with probability P_ij. The only things I need to specify are the tree that I'm considering and this stochastic matrix P, and I'll assume P is irreducible, so that it has a unique invariant distribution that I denote by ν.
And I'll assume, to complete the description, that the trait of the ancestor is distributed according to the stationary distribution. If you marginalize, you can figure out very easily that the trait of any descendant is also distributed according to ν, because there is a Markov chain of propagation from the ancestor to that descendant, and if we start in the stationary state we stay in the stationary state. So what is the tree reconstruction problem? It is the following. Given the observation of the tree, and given the observation of the traits of the individuals at generation D, can you non-trivially infer the trait of the ancestor? Given the eye colors of the population on Earth today, can we infer the eye color of Eve: was it blue or brown? Other possibilities exist, but that's the gist of it: can we non-trivially infer the trait of the ancestor given the traits of descendants far away, as we let the generation D go to infinity? We have not said what our tree will be. It will be either a deterministic tree, which we assume is given and has such and such properties, but most of the time we'll assume it's a random tree, a Galton-Watson branching tree. For instance, you assume that each individual has a random number of children that is a Poisson-distributed random variable. Okay, so let me recall some notation. I think this is just a recap for most of you, because in the last talk mutual information was presented, so let's just set the notation right. We write H(ν) for the entropy of a distribution, so H(ν) is the Shannon entropy of the stationary distribution of our stochastic matrix P. Conditional entropy can be defined in many ways; the conditional entropy of a pair of random variables is the joint entropy minus the entropy of the conditioning variable.
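To make the transmission mechanism concrete, here is a minimal Python sketch (an illustrative example, not from the lecture; the matrix P, the trait set q = 2, and the helper name `propagate` are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-trait example (q = 2).
# P[i, j] = Pr(child has trait j | parent has trait i).
P = np.array([[0.8, 0.2],
              [0.2, 0.8]])
nu = np.array([0.5, 0.5])  # stationary distribution of P (uniform here, by symmetry)

def propagate(parent_trait, num_children):
    """Draw each child's trait independently from row P[parent_trait]."""
    return rng.choice(len(P), size=num_children, p=P[parent_trait])

# The root's trait is drawn from the stationary distribution nu,
# so every descendant's trait is also nu-distributed.
root = rng.choice(len(P), p=nu)
children = propagate(root, num_children=3)
```

Since ν P = ν, marginalizing over a parent's trait leaves each child's trait ν-distributed, which is the stationarity property mentioned above.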
The mutual information can likewise be defined in many ways: the mutual information between variables X and Y can be defined as the sum of the entropies of X and of Y minus the joint entropy. It is also the Kullback-Leibler divergence between the joint distribution and the product of the marginals; that is probably the most useful characterization of mutual information. Some more background on information theory: there is also a notion of conditional mutual information. Basically, this is the expectation of the Kullback-Leibler divergence between the conditional joint distribution of two variables X and Y given Z and the product of the conditional distributions P(X|Z) and P(Y|Z), averaged over the conditioning random variable Z. So that's conditional mutual information. For now, the one thing we will use is the following fact, as I state in the lemma: if X and Y are conditionally independent given Z (sometimes depicted by a dependence diagram: conditionally on Z there is no dependence between X and Y), then the mutual information between X and Y is no more than the mutual information between X and Z. That is something you can easily derive from the basic formulas for mutual information, and it is known as the data processing inequality. All right, that's what we will need for now. This allows us to state mathematically what we mean by non-trivial tree reconstruction. We say that non-trivial tree reconstruction holds if the mutual information between the spin at the root (the thing we are interested in) and what we get to observe (the tree itself, plus the traits, or spins, of the nodes at generation D) does not go to zero as D goes to infinity.
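The data processing inequality can be checked numerically on a toy chain; this sketch (my own illustration, with made-up flip probabilities) computes mutual information as the KL divergence between the joint and the product of marginals, exactly the characterization given above:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) = KL(joint || product of marginals), in nats."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

# Toy Markov chain X - Z - Y: Z is a noisy copy of X, Y a noisy copy of Z.
flip = lambda eps: np.array([[1 - eps, eps], [eps, 1 - eps]])
p_x = np.array([0.5, 0.5])
p_xz = p_x[:, None] * flip(0.1)   # joint law of (X, Z)
p_xy = p_xz @ flip(0.2)           # joint law of (X, Y), via conditional independence

# Data processing inequality: I(X;Y) <= I(X;Z).
assert mutual_information(p_xy) <= mutual_information(p_xz)
```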
And we do know that a limit exists as D goes to infinity for this mutual information. Indeed, you have the root r here with its spin σ_r, you look at the tree down to depth D, and for each node i there you have its spin σ_i. Given our probabilistic mechanism for transmission, there is a conditional independence diagram: if I knew the spins at generation D, then what happens at generation D+1 would be conditionally independent from the spin at the root. Yes, to answer the question: you typically get exponentially many nodes, because we are going to look at branching processes, so at generation D you typically have exponentially many individuals, and you look at all their values. In some cases you have non-vanishing information about the spin at the root, and in other cases this information vanishes: there is nothing you can figure out about the root spin. Yes, we will take a discrete set of spin values, so think of binary values; actually any finite set of values, but think of two values to fix ideas. And yes, what I'll describe assumes I know everything about the tree itself as well as the probabilistic mechanism, that is to say the transition matrix; I assume I know all of that. All right, so as I was saying, we can apply this data processing inequality: the mutual information decreases with D, so it has a limit, and it either goes to zero or to something positive. If it is positive we say reconstruction is feasible; otherwise it is not. We want to understand when it is and when it is not, and we will see a characterization, based on the parameters of the problem, that determines reconstructibility or non-reconstructibility. Before I move to a second reconstruction problem, let me say one more word about this kind of reconstruction.
It turns out that you cannot reconstruct if and only if the conditional distribution of the spin at the root, the spin of the ancestor given what you observe at generation D, converges in probability to the unconditional distribution ν. That is something you can prove in a few lines; it is on the slide, but I won't do it here. Maybe we can put the slides online and you can read it at your leisure. One more word about this non-reconstructibility. Assume you have a symmetric distribution for the spin at the root, the uniform distribution on the q values it can take: in the binary case one half, one half; with q values, 1/q for each of the q values. By what I just said, you can see that non-reconstructibility holds if and only if the maximum over the q values of the conditional probability of the root spin, given what you observe down at generation D, converges to 1/q. It is the same thing for the distribution to converge to the uniform as for the maximum to converge to 1/q. A way to understand this non-reconstructibility property is then the following. Consider the best estimator σ̂_r of the root spin σ_r that you can construct from the observations. Under the assumption of non-reconstructibility, the probability that you guess right converges to 1/q, no matter how smart you are. Indeed, the way to maximize the probability of guessing correctly is to guess the trait that maximizes the conditional probability, and since this conditional probability converges to 1/q, that is the performance you get: a trivial performance that you could achieve by guessing at random. So that's a way to understand this property.
Let's now consider a second kind of reconstructibility property, which we call census reconstructibility. It is the same question, can we guess the spin at the root, but we give ourselves less information. Instead of having all the details about the tree down to generation D as well as the individual spins at generation D, we only get summaries: the census of how many individuals at generation D have each particular spin. So we get a q-dimensional vector of spin counts at generation D, which is less information. We say census reconstructibility holds if the mutual information between the spin at the root and this lesser set of information goes to something positive, and census non-reconstructibility holds if it goes to zero. An easy remark, again using the data processing inequality, this monotonicity of mutual information, is that the mutual information between the spin at the root and the census at generation D decreases with D, once more because given the census at generation D, the census at generation D+1 and the spin at the root are conditionally independent. So there is a limit. We can also tell that this mutual information is smaller than in the previous problem, because we have less information; that makes sense intuitively, but you can prove it using the data processing inequality. In particular, if census reconstructibility holds, we certainly have ordinary reconstructibility. Now, I announced the Kesten-Stigum threshold in the title, and this is where it appears. We will make a link between the property of census reconstructibility and a particular threshold on the parameters of the model, specifically the spectrum of the matrix P and the average number of children born to nodes in our tree. There will be a close relationship between the two.
For that we'll assume the tree is a Galton-Watson branching tree, and we will make two assumptions on it. We assume the average number of children is some constant α strictly larger than one, so there is a non-zero probability that the tree survives, and if it survives it tends to grow exponentially fast with the generation. We'll also assume a finite second moment for the number of children. Ah, yes, to answer the question: here is one way to construct the tree. You start with an ancestor, and you give yourself random variables X_{d,i}, for d ≥ 0 and i ≥ 1, that are i.i.d. and follow the same law as some variable Y; in our case it could be a Poisson variable with parameter α, if you like. To the ancestor, which is at generation zero, we assign X_{0,1} children. Then I keep track of these children, and for each of them I assign a number of children in turn: the first child gets X_{1,1} children, the second child gets X_{1,2}, the third gets X_{1,3}, and so on and so forth. And I can iterate: at each following generation, I use the first index to track the generation, and I endow each of the individuals with its own number of children. It's a random tree generated in this iterative manner from i.i.d. numbers of children. This model was proposed in 1873 by Sir Francis Galton as a way to describe the number of individuals of a noble family in England, and he asked the question: can you predict whether this family will survive or go extinct? He posed the question, and an answer was proposed by Reverend Watson in a journal of the time: a solution via a fixed-point equation involving the generating function of the random variable describing the law of the number of children.
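The iterative construction and Watson's fixed-point equation can both be sketched in a few lines (illustrative code, not from the lecture; the Poisson offspring law and the function names are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def generation_sizes(alpha, depth):
    """Sizes of generations 0..depth of a Galton-Watson tree with
    Poisson(alpha) offspring: each individual independently gets an
    i.i.d. Poisson(alpha) number of children."""
    sizes = [1]  # one ancestor at generation zero
    for _ in range(depth):
        sizes.append(int(rng.poisson(alpha, size=sizes[-1]).sum()))
    return sizes

def extinction_probability(alpha, iters=200):
    """Smallest fixed point q = g(q) of the offspring generating function,
    which for Poisson(alpha) is g(s) = exp(alpha * (s - 1)); found by
    iterating from 0."""
    q = 0.0
    for _ in range(iters):
        q = np.exp(alpha * (q - 1))
    return q
```

For α ≤ 1 the iteration converges to 1 (extinction is certain), while for α > 1 it converges to a value strictly below 1, matching the survival claim above.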
Apparently the answer provided by Reverend Watson was false, and some say the process should not be called Galton-Watson but rather named after Bienaymé, but I think this is a lost battle. So that's what we are looking at. So far, for reconstruction without census, we assume we observe the full tree, so we know everything about the tree, but we only observe the traits of the individuals at generation D. Yes: you stop at generation D and observe the spins of the leaves once you have cut everything below generation D; if you were given the information of the tree downstream, it would be of no use, because of conditional independence. So that's what reconstruction is: you want to guess the spin at the root, you are given the whole tree, and you get to see the spins only at generation D. All right, so we have this Galton-Watson tree, α the average number of children, the second moment is finite, and we have the transition matrix P, whose spectrum we will care about. We denote by λ₂(P) the eigenvalue of P with the second largest modulus. Since P is a stochastic matrix, the largest modulus is one; since it is irreducible, it could still have a second eigenvalue of modulus one, for instance an eigenvalue minus one, in which case you have a periodic chain. So λ₂ is the eigenvalue with second largest modulus of this transition matrix. Now for the theorem. The first statement, one direction, is that if α, the average number of children, times the square of the modulus of λ₂ is strictly larger than one, then census reconstructibility holds. I'll spend a bit of time explaining how this is shown. There is a construction you can make that involves an eigenvector associated with λ₂, this eigenvalue of P.
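The threshold condition α|λ₂|² > 1 is easy to check numerically for any stochastic matrix; here is a small sketch (my own illustration; the symmetric binary channel and the values of α are made-up examples):

```python
import numpy as np

def kesten_stigum_holds(P, alpha):
    """Check the condition alpha * |lambda_2|^2 > 1, where lambda_2 is the
    eigenvalue of P with second largest modulus."""
    moduli = sorted(np.abs(np.linalg.eigvals(P)), reverse=True)
    return alpha * moduli[1] ** 2 > 1

# Symmetric binary channel with flip probability eps: eigenvalues are 1 and 1 - 2*eps.
eps = 0.1
P = np.array([[1 - eps, eps], [eps, 1 - eps]])
# |lambda_2| = 0.8: alpha = 2 gives 2 * 0.64 = 1.28 > 1, alpha = 1.5 gives 0.96 < 1.
supercritical = kesten_stigum_holds(P, alpha=2.0)
subcritical = kesten_stigum_holds(P, alpha=1.5)
```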
So I give myself (maybe I don't need to rewrite what's on the slide) a q-dimensional vector x that is an eigenvector associated with λ₂, so we have Px = λ₂x. And we construct a statistic from our census vector: the statistic Z_D, which is a sum over the possible traits of the individuals. Let me denote a trait by the letter s. We take the sum over possible traits s of x_s, the entry of the eigenvector for that trait, times N_D(s), which, if you recall, is the number of individuals with trait s at generation D, and we rescale by (αλ₂)^(-D). That's our statistic, and we will show that census reconstructibility holds (remember, this means that the mutual information between the spin at the root and the census vector does not vanish) by showing that, in fact, the mutual information between the spin at the root and this one statistic does not vanish. How do we do that? We leverage martingale theory. I don't know if that is familiar to many of you here. No? No one is too familiar with martingale theory? That is one of the achievements of probability theory in the 20th century. It is a notion that generalizes independence: theorems like the law of large numbers, central limit theorems, and concentration inequalities, many results that you know to hold for sequences of variables with stronger forms of independence, still hold when you can exhibit a martingale structure. So what is a martingale? It is a sequence of random variables M_D; let's consider discrete time here, D being the discrete time for us. We have a martingale with respect to an increasing family of sigma-fields G_D if the conditional expectation of M at time D+1, given the information G_D at time D, is equal to M_D.
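The statistic Z_D is a one-line computation once you have the census; here is a sketch (illustrative, not from the lecture; the counts and parameter values are made up, and I use the fact that for a symmetric binary P an eigenvector for λ₂ is (1, -1)):

```python
import numpy as np

def census_statistic(counts, x, alpha, lam2, D):
    """Z_D = (alpha * lam2)**(-D) * sum_s x[s] * N_D(s), where
    counts[s] = N_D(s) is the number of individuals with trait s
    at generation D."""
    return (alpha * lam2) ** (-D) * float(np.dot(x, counts))

# For a symmetric binary P, an eigenvector for lambda_2 is x = (1, -1),
# so Z_D is a rescaled difference of the two trait counts.
x = np.array([1.0, -1.0])
counts = np.array([70, 30])  # hypothetical census: 70 with trait 0, 30 with trait 1
z = census_statistic(counts, x, alpha=2.0, lam2=0.8, D=5)
```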
So it's a process that is conditionally centered, if you like. To state the central result in the theory of martingales we need an extra definition, which is uniform integrability. This is the notion on the slide: a family of random variables is uniformly integrable if the expectation of these random variables, restricted to the set where they are in absolute value above a large threshold a, goes to zero as you raise the threshold, uniformly over the family. They may each have an expectation bounded by one, say, but some of them might achieve it by having a little bit of mass very far away; you could have bounded expectations without uniform integrability if somehow some of the mass were moving off to infinity. Uniform integrability precludes that. The central result in martingale theory is then: if you have a martingale and it is uniformly integrable, then it converges, almost surely and in L¹, to a limiting random variable. And if you denote by M_∞ the limit of the martingale, then the value taken by the martingale at a finite time D is the conditional expectation of the limit given the information at time D: M_D = E[M_∞ | G_D]. For the record, we will use that. There is a whole body of results on martingales, but what I was stating is the central result on convergence of martingales. A somewhat easier result that we might need later on is convergence for what we call backward martingales. A backward martingale is the following: instead of an increasing sequence of sigma-fields, you consider a decreasing sequence of sigma-fields H_D. For us, it will amount to considering the information below generation D: as I increase D, I get less and less information.
So consider random variables X_D defined as the conditional expectation of some variable X given H_D. Then X_D converges, almost surely and in L¹, to the conditional expectation of X given the information at infinity: for a decreasing sequence of sigma-fields you can define the limiting sigma-field H_∞, and we have this backward martingale convergence. Okay. The proof of the lemma is a simple calculation. I have the details here; I will give less and less detail as I proceed over the days, but let's see some details now. Remember, my statistic is a weighted sum of the census vector's coordinates, and I want to show that it is a uniformly integrable martingale. First I prove that it is a martingale. I look at the conditional expectation of my statistic given the information at generation D-1. I can pull out the normalization term, and then I have the conditional expectation of a sum, in which I can group the spins of individuals at generation D according to their parent. So I sum over individuals i in L_{D-1} (I have not introduced this notation: L_{D-1} for me is the set of nodes at generation D-1), then over their children, that is, over j such that the parent of j is i, for each such i, of x_{σ_j}, the eigenvector entry for the spin of that child. This is really just a rewriting of Z_D, with the normalization factor pulled out. Now I can use what I know about the construction of the tree: it is a branching tree, built according to the Galton-Watson mechanism, so I know that each individual has on average α children, and a factor α comes out from that.
Then, for one child, I condition on the spin of its parent and average over the values of its own spin. That gives a sum over the possible values s of the child's spin of the transition probability P_{σ_i s} times x_s. And this is where I use the fact that x is an eigenvector of P: this sum is exactly λ₂ times x_{σ_i}. That is how I get that my process is indeed a martingale, and note that I have not yet used the Kesten-Stigum condition α|λ₂|² > 1: the martingale property holds without it. Where I need the Kesten-Stigum condition is to show that the martingale is uniformly integrable. To show that, I can use a criterion which is how it's usually done: when you want to show that a family of variables is uniformly integrable, you establish a bound on some moment of order larger than one (1+ε works, 2 definitely works). If you can bound the second moments of the random variables uniformly over the family, you get uniform integrability at once. So that's what we do. We already control the expectations, because we have a martingale, so all expectations are the same; what we need is a bound on the variance, and that's what we do. I don't know if I want to detail all of it, but it boils down to the following. If you look at the variance of our statistic at generation D and apply what is known as the conditional variance formula, you easily get a result that separates into two contributions: the variance of the statistic at generation D-1, plus something you can bound by a term that decays exponentially in D, proportional to (α|λ₂|²)^(-D), and this is precisely where the quantity α|λ₂|² appears. So that is where the Kesten-Stigum threshold comes in.
Since we assume α|λ₂|² is larger than one, the contribution added in going from the variance at step D-1 to the variance at step D has a finite sum, so the variances are uniformly bounded. That is how we get uniform integrability. There is a bit more work to show that, because of this uniform integrability, there is non-vanishing mutual information between the spin at the root and the census at generation D. I won't go through all the details, but the key step is showing uniform integrability; after that, a few more arguments show that there must necessarily be some threshold t such that the event that Z_D is less than t and the spin at the root have non-vanishing mutual information. So one of the keys here is to identify the proper statistic, one which is a martingale, so that you can use these results. So, is there some kind of intuition, how do you guess that this Z_D is the correct statistic? Well, I don't know if there is a very clever answer to that. You try linear statistics; when you take the conditional expectation of a linear transform, P shows up. So once you look for linear statistics and you want a martingale, you naturally use weights that come from eigenvectors of P. And why, a priori, the one associated to the second eigenvalue: is it because you knew in advance that the second eigenvalue is related to the threshold, from physics? No, no. It's because you could use the first eigenvector and you would get a martingale as well, and it would be uniformly integrable, but at generation zero that martingale is a constant, it equals one, and so it is not correlated with the spin at the root. So you need to go to higher-order eigenvectors.
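As a sanity check on the whole story, one can simulate a supercritical case and verify that the sign of Z_D stays correlated with the root spin. This is my own illustration, not from the lecture: the parameters (α = 3, flip probability ε = 0.1, so α|λ₂|² = 1.92 > 1) and the helper name are assumptions, and only the census per generation is tracked, as in the census reconstruction problem:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_root_and_Z(alpha, eps, D):
    """Grow a Poisson(alpha) Galton-Watson tree with symmetric flip
    probability eps, tracking only the census (counts of the two traits)
    per generation; return the root trait and Z_D with x = (1, -1)."""
    root = int(rng.integers(2))
    counts = np.zeros(2, dtype=np.int64)
    counts[root] = 1
    for _ in range(D):
        new = np.zeros(2, dtype=np.int64)
        for s in range(2):
            # total children of all trait-s parents, then how many flip trait
            children = int(rng.poisson(alpha, size=counts[s]).sum())
            flips = int(rng.binomial(children, eps)) if children else 0
            new[1 - s] += flips
            new[s] += children - flips
        counts = new
    lam2 = 1 - 2 * eps
    return root, (alpha * lam2) ** (-D) * float(counts[0] - counts[1])

# Guess root = 0 whenever Z_D > 0; above the threshold this should beat
# random guessing by a clear margin.
trials, agree = 400, 0
for _ in range(trials):
    root, Z = sample_root_and_Z(alpha=3.0, eps=0.1, D=8)
    agree += int((Z > 0) == (root == 0))
rate = agree / trials
```

Below the threshold (for instance ε closer to 1/2), the same experiment should see `rate` drift back toward 1/2, consistent with the converse direction mentioned next.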
So you want to go to at least λ₂, and the larger the modulus, the better it is for ensuring uniform integrability, so you just go to the second largest modulus. So there is no freedom: this is the one statistic you want to use. And no, you cannot do better: as I was saying, there is a converse, so this is as good as it gets. Maybe I can move to that now... no, that will be for tomorrow, then. Okay, that's fine.