gamma to the D, where gamma is strictly less than lambda. Okay, so that's our first result, which is crucial. Now, if we take a correct match, then we know that the neighborhoods of that correct match share a common tree, the intersection tree, and this tree common to the two neighborhoods is, in distribution, close to a Galton-Watson branching process with a Poisson(lambda s) number of offspring. That's because the intersection graph is again a sparse random graph with average degree lambda s. So if we pick lambda s larger than this gamma, then we are good: for a correct match we tend to get a matching weight of about (lambda s)^D, whereas if we pick two nodes that are far apart, we tend to get a matching weight of gamma^D for a gamma strictly less than lambda s. That's how we carve out this little triangle in the phase space.

So, okay. I guess these slides show more or less what I was telling you. We have managed to improve this scheme. I wanted to describe it because the ideas needed in its analysis can be boosted, so we can construct better schemes, and now we have a better understanding of a large region of the phase space for which polynomial-time alignment is feasible. Let me not dwell on the numerical experiments for that scheme. I'll now discuss more recent results we obtained and where we stand, okay? So first I'll tell you about some results we obtained with Marc and Luca this year.

Remember, in the tree matching weight algorithm we have to compare two cases. The first case is where the neighborhoods correspond to an exact match. If we remember the construction of the two graphs, what we get is two correlated trees. How are they constructed? You start from a root node, then you sample Poisson(lambda s) children common to both trees, and then in each tree you get an extra number of neighbors that is Poisson(lambda(1 - s)), independently in the two trees, okay? So you have this structure for the two neighborhoods when you have a correct match, and you know how to pursue the construction of the correlated neighborhoods. Eventually there is a tree that appears in both neighborhoods, which is a Galton-Watson branching tree with offspring distribution Poisson(lambda s), and this gets augmented independently in the two trees. How is it augmented? Each node in the intersection tree gets an additional number of children that is Poisson(lambda(1 - s)), independently in the two trees, and each new vertex we add gets descendants forming a branching process with Poisson(lambda) offspring. So we get a Galton-Watson Poisson(lambda) tree there, and likewise for all the other new nodes. That's the joint distribution of the neighborhoods of two nodes that are an actual match, and that's one situation. The other situation is two nodes that are far apart in the master graph, for which the neighborhoods are independent, okay? There we have two independent Poisson(lambda) branching processes. In the tree matching weight algorithm we tried to distinguish between the two situations, and we had one statistic to do it, which was this tree matching weight. But actually we can ask for the best statistic there is to distinguish between these two situations.
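To make the two-neighborhood construction concrete, here is a minimal sampling sketch in Python. It is not from the talk; the parameter names lam (for lambda) and s, and the nested-tuple tree representation, are my own choices for illustration.

```python
import random
import numpy as np

rng = np.random.default_rng(0)

def gw(lam, depth):
    """Independent Galton-Watson tree with Poisson(lam) offspring, grown to 'depth'."""
    if depth == 0:
        return ()
    return tuple(gw(lam, depth - 1) for _ in range(rng.poisson(lam)))

def correlated_pair(lam, s, depth):
    """One sample of the correlated-neighborhood model described above:
    Poisson(lam*s) common children whose subtrees stay correlated, plus
    Poisson(lam*(1-s)) extra children per tree, each extra child carrying
    an independent GW(lam) subtree."""
    if depth == 0:
        return (), ()
    common = [correlated_pair(lam, s, depth - 1) for _ in range(rng.poisson(lam * s))]
    t  = [p[0] for p in common] + [gw(lam, depth - 1) for _ in range(rng.poisson(lam * (1 - s)))]
    tp = [p[1] for p in common] + [gw(lam, depth - 1) for _ in range(rng.poisson(lam * (1 - s)))]
    random.shuffle(t)   # hide which children were the common ones,
    random.shuffle(tp)  # as an observer of the graphs would see them
    return tuple(t), tuple(tp)

t, tp = correlated_pair(lam=2.0, s=0.8, depth=3)
```

Sampling two far-apart nodes' neighborhoods would instead call gw twice independently, matching the second situation described above.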
So this is in fact a hypothesis testing problem that we have to solve: we must distinguish between two hypotheses, whether the two neighborhoods come from this correlated distribution or whether they are independent branching processes. Once you view it like that, well, you go back to hypothesis testing theory, you have this Neyman-Pearson lemma from the 1930s, and you know exactly what the best test is.

[Question:] The Neyman... is this a classical result that I should know? I don't remember this criterion.

The Neyman-Pearson lemma, yeah. The Neyman-Pearson lemma is the following thing. You observe X; under H0, X follows distribution P0, and under H1, X follows distribution P1. You construct a test T(X): when T(X) = 0, that means you believe you are under hypothesis H0; when T(X) = 1, you believe you are under hypothesis H1. Your test should minimize the errors you can make, and there are two kinds of errors, of the first kind and of the second kind. You want to maximize P1(T(X) = 1), the probability of correctly detecting that you are under the alternative hypothesis H1. But if you take T always equal to one, you have something meaningless, so typically you maximize it under the constraint that P0(T(X) = 1) is less than some level alpha, okay? So as to strike a balance between the two types of errors you can make. The Neyman-Pearson lemma tells you that essentially the optimal test is of the form: T(X) = 1 if and only if P1(X)/P0(X) is above some threshold. The actual statement also says a bit about what you should do when you are exactly at the threshold, but essentially it is that: the best test there is just considers the likelihood ratio between the two distributions, decides for the alternative distribution if the likelihood ratio is above a threshold, and decides for the null hypothesis H0 if you are below this threshold.

[Question:] This threshold is a functional of P0 and P1? You can construct it explicitly?

Yes, yes. It's a function of this alpha, the trade-off parameter you have chosen: pick alpha, and it forces a value for the threshold. For given P0 and P1 that you know, and given alpha, there is a constructive way to get this threshold: you look at the cumulative distribution function of the likelihood ratio under P0, and that gives you how to choose the threshold as a function of alpha. There are some cases where you have jumps, and then you need to randomize when you are at a jump point, but that's a detail.

Okay, so we have this lemma, so we know we should not do this tree matching weight thing. Actually we should take the distributions of these pairs of trees, whether independent or correlated, do the likelihood ratio computation, and then decide whether to declare the two nodes a candidate match or not based on the value of the likelihood ratio. Okay, I guess that's essentially what it says here. We'll limit observations of neighborhoods to some depth parameter, and we'll take the two neighborhoods if they are tree-like; we'll leave aside the neighborhoods where you find loops in your graphs.
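As a toy illustration of the lemma (again my own sketch, not from the talk): a likelihood-ratio test between two Poisson distributions, with the threshold chosen from the distribution of the likelihood ratio under the null. The means 1.0 and 2.0 and the level alpha = 0.05 are arbitrary.

```python
import math

def pmf(mu):
    """Poisson(mu) probability mass function."""
    return lambda k: math.exp(-mu) * mu**k / math.factorial(k)

p0, p1 = pmf(1.0), pmf(2.0)      # H0: Poisson(1), H1: Poisson(2)
lr = lambda k: p1(k) / p0(k)     # likelihood ratio; monotone in k here

# Pick the smallest threshold tau whose false-positive rate P0(lr(X) > tau) <= alpha.
alpha, support = 0.05, range(50)
for tau in sorted({lr(k) for k in support}):
    false_pos = sum(p0(k) for k in support if lr(k) > tau)
    if false_pos <= alpha:
        break

power = sum(p1(k) for k in support if lr(k) > tau)
print(f"tau={tau:.3f}  level={false_pos:.4f}  power={power:.4f}")
# (At a jump of the LR's distribution one would randomize to hit alpha exactly.)
```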
So if they are tree-like, we'll compute the likelihood ratio of the pair of trees under the two distributions, the distribution of correlated trees and the distribution of independent trees, and we'll put the pair in our set of candidate matches if this is above a threshold. Okay, so it turns out that this likelihood ratio of the distributions of pairs of trees can be computed in a recursive manner. These branching processes are constructed in a recursive way, and you can compute all kinds of quantities of interest on them recursively, and that is the case here; it's a nice exercise, actually, to check that the likelihood ratio can be computed recursively. That's good for algorithmic purposes, because we can implement a test of that kind. The way you would do it is you would start at depth one and compute likelihood ratios. Remember, we use dangling trees: for a pair (i, j) we look at the dangling trees at depth one, and for a pair (u, v) we have dangling trees of depth one there, and we compute the likelihood ratio for each such pair of trees under the two hypotheses. Then we keep increasing the depth, using an induction formula which allows us to compute the likelihood ratio when we observe at depth D from the likelihood ratios when we observe at depth D - 1. It turns out this is a polynomial-time scheme; it takes some work to establish, but it is a polynomial-time scheme, okay? So let me not dwell on the recursive formula for the likelihood ratio; we just need to know it can be computed.

Okay, so once we have this angle of attack in mind for this problem of aligning trees, we can do some analysis. I'll not give you many details, all right? I don't know how much time I have, 20 minutes, yeah, okay, so I'll be sketchier and sketchier as I go on. So, have I... okay, I went too fast. Okay. We have one notion that is important for analyzing the scheme we will produce: the notion of one-sided detection. We have those two hypotheses, whether the two trees are correlated or whether they are independent. We'll say that we have one-sided detection in this hypothesis testing problem if we have a family of tests, indexed by the depth at which we consider our trees, such that the probability of correctly guessing the alternative hypothesis when it is true does not go to zero asymptotically as the depth goes to infinity. So we catch some correlated pairs; we don't miss out on all the correlated pairs, that's what it means. But we never wrongly classify as candidate correlated pairs some pairs that are not correlated: the probability under the uncorrelated distribution of deciding that we have something uncorrelated goes to one as the depth goes to infinity. That's the definition; that's the property we would like to have, okay? And we can characterize when this holds for this hypothesis testing problem: it holds precisely when the Kullback-Leibler divergence between the distribution of the correlated trees observed to depth D and that of the uncorrelated trees observed to depth D goes to infinity as D goes to infinity. Then we have this one-sided detection property. And okay, so, in order to progress on our understanding of this phase diagram that I was drawing here, with this tiny triangle here.
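Going back to the recursive likelihood-ratio computation mentioned a moment ago: the recursion itself is not spelled out here, so the following brute-force Python sketch is my own reconstruction of one way such a recursion can be organized for the model described above (Poisson(lam*s) common children, Poisson(lam*(1-s)) extra ones per tree). It sums over every way of pairing k children of one root with k children of the other, which is exponential in the degrees; the polynomial-time claim in the talk refers to a more efficiently organized recursion, so treat this purely as an illustration of the induction from depth d - 1 to depth d.

```python
import math
from itertools import combinations, permutations

def pois(mu, k):
    return math.exp(-mu) * mu**k / math.factorial(k)

def lratio(t, tp, lam, s, d):
    """Likelihood ratio L_d(t, t') of the correlated-pair distribution against
    two independent GW(lam) trees, both observed to depth d (nested tuples)."""
    if d == 0:
        return 1.0
    c, cp = len(t), len(tp)
    total = 0.0
    for k in range(min(c, cp) + 1):
        # relative weight of "k common children" vs the independent model,
        # spread uniformly over the C(c,k)*C(cp,k)*k! possible pairings
        w = (pois(lam * s, k) * pois(lam * (1 - s), c - k) * pois(lam * (1 - s), cp - k)
             / (pois(lam, c) * pois(lam, cp)
                * math.comb(c, k) * math.comb(cp, k) * math.factorial(k)))
        for S in combinations(range(c), k):         # which children of t are "common"
            for Sp in permutations(range(cp), k):   # matched children of t', in order
                prod = 1.0
                for i, j in zip(S, Sp):
                    prod *= lratio(t[i], tp[j], lam, s, d - 1)  # depth d-1 ratios
                total += w * prod
    return total

# e.g. lratio(t, tp, lam=2.0, s=0.8, d=2) on a sampled pair (t, tp)
```

A quick sanity check: at depth one the trees reduce to child counts (c, c'), and the recursion collapses to the ratio of the joint law of (c, c') under the correlated model to the product of two Poisson(lam) laws, as it should.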
We can fill out a larger region: in any region where we know that we have one-sided detection for this hypothesis testing problem, we know that the algorithm based on computing likelihood ratios, using thresholds, et cetera, will produce a non-vanishing overlap. Whenever we have one-sided detection. So now the work has been reduced to computing, or controlling, the Kullback-Leibler divergence in a hypothesis testing problem between pairs of trees under these two distributions.

[Question:] But why, in the definition of one-sided detection, did you need the probability of a false positive to go to zero?

Because if you think of the graph matching problem, we know there are exactly n correct correspondences. For each of the n nodes in the first graph there is only one good match, whereas there are of order n wrong matches, so of order n squared false pairs overall that we need to rule out. And so, if we want to produce a non-vanishing overlap, we need strong control on the number of false positives we produce. This is why we crafted this definition.

Okay, so now the game is to figure out for which values of lambda and s we have divergence to infinity of the KL divergence as D goes to infinity. In the paper I was quoting with Marc and Luca, we came up with a number of sufficient conditions for divergence to infinity of this Kullback-Leibler divergence. Here's one in particular; well, this is just a lower bound, but okay, here's one sufficient condition for divergence. There's, I guess, nothing for you to glean from the formula, but it's one that can be shown reasonably quickly based on the recursive expression for the likelihood ratios. Using such arguments, we know that this triangle was a pessimistic region; you can enlarge it, and we know that there is actually a larger portion.

But, okay, maybe I'll jump to what we know now. Recently we started collaborating on this problem with Guilhem Semerjian, who did amazing computations, and now we have a much better understanding of when this approach based on correlation detection in trees succeeds, okay? And so, okay. I think I should not try to say too much about the... well, okay, let me try. I've skipped many details. There are several ways in which you can describe those trees. We are dealing with rooted trees. One way, which is quite natural, is to assume you label the children of each node in some order; this is a representation in terms of labeled trees, okay? I did not mention it, but this is how we approached the problem initially. When we observe the neighborhoods of nodes in graphs, there is no intrinsic notion of an order on the children or the neighbors, so what we do is assume we label them uniformly at random. So we worked on that, and the expressions I wrote for the likelihood ratio were based on this representation, okay? That got us that far, but actually we are dealing with unlabeled trees, okay? Rooted trees that are labeled, but such that changing the labels would make them identical, are isomorphic, and we are dealing with the classes of this equivalence relation. So we are dealing with unlabeled rooted trees.

What we wanted to understand, in particular, is how this region behaves as lambda becomes large. If I look at neighborhoods to depth one, I have one node here, I have Poisson(lambda s), which is the common number of children in the two trees, and then I add Poisson(lambda(1 - s)) extra children, independently in each.
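In symbols (my transcription; the notation P_{0,d}, P_{1,d} for the depth-d tree distributions is mine, not the slides'), the definition and characterization just discussed read:

```latex
% One-sided detection: a family of depth-indexed tests (T_d) such that
\liminf_{d\to\infty} \mathbb{P}_{1,d}\!\left(T_d = 1\right) > 0
\qquad\text{while}\qquad
\lim_{d\to\infty} \mathbb{P}_{0,d}\!\left(T_d = 1\right) = 0.
% Characterization stated in the talk: such tests exist precisely when
\mathrm{KL}\!\left(\mathbb{P}_{1,d} \,\big\|\, \mathbb{P}_{0,d}\right)
\;\xrightarrow[d\to\infty]{}\; \infty,
% where P_{1,d} is the law of the correlated pair of trees observed to depth d
% and P_{0,d} that of the independent pair.
```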
Okay, so when lambda becomes large, I can center and rescale. A tree to depth one is just a number, the number of children, okay? So I can subtract its average and rescale it, and when lambda becomes large, I know that this, properly rescaled, will admit a Gaussian limit; that's just the central limit theorem. And I know that under the correlated model, after proper centering and rescaling, instead of having two independent Gaussians I'll have two correlated Gaussians: the parameter s will show up in the correlation of the two Gaussian random variables. What we did with Guilhem is push this idea of an existing Gaussian limit to arbitrary depth D, which allowed us to understand exactly how this region behaves as lambda gets large.

Well, to get a Gaussian limit, you want to apply centering and rescaling, so you'd better be in a vector space, where you can rescale things and subtract things; that will be helpful. It turns out that unlabeled trees can be seen as objects in a vector space. For instance, an unlabeled rooted tree of depth D can be viewed as a counting measure on unlabeled rooted trees of depth at most D - 1. You know, you have the space of trees of depth at most D - 1; each child of your root has as its downstream tree a tree of depth at most D - 1; for that particular tree you put a Dirac mass on the corresponding object, and you do that for all the children. That's an equivalent way to consider unlabeled rooted trees. So we are in a vector space, okay: measures. We can add measures, we can multiply them by scalars. And that's actually the right way to consider this hypothesis testing problem. Using this, we can show that for our joint distribution of trees, if we let lambda go to infinity, there is a Gaussian limit. It's a delicate operation, but we can center and rescale the measures that stem from those unlabeled trees, and we get limiting objects that have a Gaussian distribution in an infinite-dimensional, but vector, space. We can also compute the KL divergence for the Gaussian limit, and we can show some continuity: as lambda goes to infinity, the KL divergence we are interested in converges to the KL divergence of the Gaussian limit, for which we have an expression.

So putting all this together, we get... okay, what have I... all right, so, okay, maybe let me stick to the slides. What do we know about the limiting Gaussian objects? The KL divergence for the limiting Gaussian objects diverges to infinity as the depth goes to infinity if s, the correlation parameter, is larger than the square root of a constant alpha known as Otter's constant. That's mid-twentieth-century mathematics: Otter worked on the combinatorics of counting unlabeled rooted trees with a given number of nodes. This is a number which grows exponentially with the number of nodes, and the generating function has a radius of convergence which is this Otter constant alpha. So we know that for s larger than the square root of alpha, which is about 0.5817, the KL divergence in the Gaussian limit blows up, whereas it stays bounded if s is smaller than the square root of alpha. So eventually, what we know today is the following: we know that the region in which this approach produces, in polynomial time and with high probability, a partial alignment is a region whose boundary, as lambda goes to infinity, is asymptotic to the square root of alpha.
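To illustrate the unlabeled-tree and counting-measure viewpoint (my own sketch, not code from the talk): isomorphic rooted trees can be given a single canonical form by recursively sorting children, and a depth-d tree then becomes a counting measure, here a Counter, on trees of depth at most d - 1.

```python
from collections import Counter

def canonical(t):
    """Canonical form of an unlabeled rooted tree (nested tuples):
    recursively sort the children's canonical forms, so that two
    isomorphic trees compare equal as Python objects."""
    return tuple(sorted(canonical(child) for child in t))

def as_measure(t):
    """View a depth-d unlabeled rooted tree as a counting measure on
    unlabeled rooted trees of depth at most d-1: one Dirac mass (count)
    per child subtree."""
    return Counter(canonical(child) for child in t)

# Two differently-ordered versions of the same unlabeled tree:
a = (((), ()), ())
b = ((), ((), ()))
assert canonical(a) == canonical(b)
assert as_measure(a) == as_measure(b)  # same measure
```

Counters add componentwise, which is exactly the vector-space structure (adding measures, scaling them) in which the centering and rescaling toward the Gaussian limit makes sense.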
So we know exactly the boundary of this domain for large lambda, and we know that this region lies above the line at the square root of alpha, okay? So for lambda large, we know reasonably well what goes on. We don't have an exact characterization of the boundary for any finite lambda, but asymptotically, we know where it lies. Okay, so that's where we stand.

[Question:] And in between, you know that it has to stay above the square root of alpha because...?

Because we know that the KL divergence in the original model converges to the KL divergence in the Gaussian model; more precisely, it is bounded by a quantity that stems from the Gaussian model, and this quantity does not diverge if we are below. So we know the KL divergence in our original model can never blow up as D goes to infinity if we are below this line. Okay.

All right, so, conclusions. Okay, maybe you have to catch a bus, right? Okay. We have enough time, so take the time to conclude. As for information-theoretic feasibility, as I said, we believe that the true condition is lambda s larger than one, but it is still open. So if you are interested, you can try to improve the result of Wu, Xu and Yu from lambda s larger than four to lambda s larger than one; that's an open question. Okay. We are trying to understand even better this region where the algorithm I described, based on likelihood ratios, succeeds. We now have a good understanding, and these Gaussian limits that I mentioned are really a good way to learn more about this, but okay, we are trying to improve our understanding of what happens for finite lambda.

[Question:] Do you conjecture that your algorithm is optimal in this intermediate region?

So yes, I guess that's my third point. We did conjecture that where it fails, it would not be polynomial-time feasible to achieve partial alignment, but we don't have a strong argument for that; this is a gut feeling, or something like that. But that's completely open: showing that no polynomial-time algorithm can succeed in a region is something we will never, or can't, do in general. What could be done, though, is to show that if you limit yourself to a family of algorithms, say the low-degree polynomials that you can construct from the observations, if you limit yourself to such a family of estimators, maybe they are not powerful enough in some region. That would be, I guess, one way to approach it: showing that below this square-root-of-alpha line, no estimator of bounded polynomial degree can succeed. Okay, something like that. But this is not done at all; this is totally open.

Okay, and so, right, there are open questions as well for graph clustering, in understanding this hard region, in showing somehow that a whole class of algorithms will not succeed in the hard region. This is something I think is quite interesting, and it will improve our understanding of what these hard regions are. And, okay, so as a general conclusion, I'll just say that statistical physics brings a rich perspective on computational complexity. These hard regions, hard phases, this is very intriguing, but it is something I think we have the tools to make progress on, and this is exciting. Okay.

[Organizer:] We have time for some questions, no stress. I can't believe that we're actually on time.

[Question:] You used the recursive solution on the generated pair of trees to obtain the likelihood ratio L.
So is there any other way to solve this problem, to obtain the likelihood ratio L, or not?

You mean to compute this likelihood ratio? It has a lot of structure, so we could produce several formulas for it, but I think in terms of algorithms, the recursive formula is the one that is the most practical.

[Question:] And if we consider trees not generated by this model, is it possible to use the recursive solution again, or do we have to switch to something else?

So the algorithm can be implemented no matter what. I mean, you're given two trees, you can compute a likelihood ratio: for any two trees there is a value for the likelihood ratio, no matter whether they are sampled from one distribution or another. So it is always possible to implement the algorithm. The guarantees we can prove, however, are limited to the case where the graphs are sampled from these correlated distributions.

[Question:] But is it possible to use the recursive solution for non-regular trees?

Oh, typically they are not regular, okay? In these Galton-Watson branching trees, each time you have a node you sample the number of children at random, using Poisson distributions. So these are not regular trees at all; the number of children does vary from node to node.

Thank you. There's a question.

[Question:] Can you speak a little bit about the practical use of these results for social networks? If you manage a social network, what can you do to prevent people from making these graph alignments? What would you do if you find out that you can de-anonymize, say, Twitter?

So you try to align the graphs, and what if you try to prevent that from happening? The reverse problem. Well, yes, okay. So you could ask how many fake friends you need to make in the anonymized network: you want to be below the green region, and maybe we can tell you, okay, here are the parameters, here is the number of fake edges you need to create. That could be a way, I guess, to exploit those results. I can also comment on the algorithms that have typically been used to de-anonymize social network data. They typically start from what are called anchor nodes: if you know that a given user in one network has a certain identity in the other one, you have a correct match granted, and then you are in a much better situation to de-anonymize. I've discussed the case where you don't have those anchor nodes, but in practice, the successful de-anonymizations have always been based on anchor nodes. So the real-life situation is more complex. Okay.

[Organizer:] Okay, I think it's time to thank the organizer. Yes, thank you very much. Thank you very much. And thanks to Vianne, Parché and Sandrine Péché, who unfortunately could not be here, but who helped greatly on many aspects, including financially. So thanks to them. Also the ICTP staff: I didn't mention them, but really they did most of the job. I personally had a lot of fun and a lot of pleasure this week; I really missed that. I don't know about you, but you forget, during these two years, that having these kinds of events really changes many things. So thanks for being there and making it happen. Also, I got tons of new ideas for new projects, thanks to these great courses; I hope it is the same for you. And you are always welcome to come back to ICTP. Events are taking place all the time.
And I can advertise, for example, Youth in High Dimensions, which is a conference I organize with three other friends, and which is concerned with many fields related in some way or another to high-dimensional statistics, from statistical mechanics to computer science to mathematics to neuroscience. It's very broad. We give most of the floor to young researchers: PhD students, postdocs and young faculty. It's a very nice event; until now it was online because of, you know what, and it's now starting to take place in hybrid mode, exactly like this one, a bit bigger in size. It will take place in June. But keep it in mind for the next years, because we will make it a yearly event. So that's one occasion for you to come back here; I hope there will be many others, and so we'll keep in touch. Thank you very much.