Sorry. Should I just share my screen? Recording in progress. The next speaker is Yizhe Zhu from UC Irvine, and he's going to talk about non-backtracking spectral clustering. Yizhe, take it away.

Thank you for the introduction. It's very nice to be here, and thanks for the invitation. I really enjoyed all the nice talks and posters at this conference. Today I'm going to talk about joint work with Ludovic Stephan from EPFL, who is also in the audience. We'll talk about the spectral clustering problem in sparse random hypergraphs.

First of all, a hypergraph is a classical object in combinatorics and theoretical computer science; it is a slight generalization of a graph. Instead of a vertex set and an edge set, we have a vertex set and a set of hyperedges. On the left-hand side we have seven vertices, and each colored set is a hyperedge; hyperedges can have size one, two, or three. On the right-hand side, I give you a reason hypergraphs are useful: they can model higher-order relations among data. Here we have students taking different classes, and every class is a hyperedge, so students taking the same class share a hyperedge. From that you can infer where they come from: students taking information theory, linear algebra, and algorithms might come from one department, while students taking general psychology or British fiction might come from a different one. Based on such data, you can infer community information.

In recent years, people have started to look at higher-order networks as a generalization of graph networks. We already know graph networks are really good at modeling pairwise interactions; if we want to do more, we should take higher-order interactions into account. There is a lot of research on tensor methods beyond matrix models, and on simplicial complex representations, and today we'll focus on the hypergraph representation of these higher-order networks.

In network analysis, one of the fundamental questions is community detection. It is a well-studied problem in the graph setting. On a worst-case graph the problem is usually NP-hard, so one puts a probability measure on the data and studies the average-case behavior of community detection algorithms. The most fundamental such model is the stochastic block model, introduced by Holland et al. in 1983. The simplest version is the following: we consider an unknown partition of the vertices into two communities of equal size, and we generate edges between vertices depending on their community membership. If two vertices are in the same community, the edge appears with probability p; across communities, it appears with probability q < p. The task is: you sample a graph from this stochastic block model and run your favorite algorithm to find the unknown partition. Here we require a high-probability answer from an efficient and accurate algorithm.

The simplest way to study this model, and its connection to random matrix theory, is the spectral method on the adjacency matrix. If we write down all the information in this random graph, it is a Bernoulli random matrix where A_ij is an independent Bernoulli random variable with parameter p or q, depending on the location of the entry.
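To make the model concrete, here is a minimal sketch in Python of sampling such a Bernoulli adjacency matrix; the equal-split labeling convention and the function name are illustrative choices, not from the talk.

```python
import numpy as np

def sample_sbm(n, p, q, seed=None):
    """Sample the adjacency matrix of a two-community stochastic
    block model with equal-sized communities.

    Vertices 0..n/2-1 get label +1, the rest label -1.  An edge
    appears with probability p inside a community and q < p across
    communities, independently over pairs.
    """
    rng = np.random.default_rng(seed)
    labels = np.concatenate([np.ones(n // 2), -np.ones(n - n // 2)])
    same = np.equal.outer(labels, labels)      # True iff same community
    P = np.where(same, p, q)                   # entrywise edge probabilities
    U = np.triu(rng.random((n, n)) < P, k=1)   # sample upper triangle only
    A = (U | U.T).astype(int)                  # symmetrize, zero diagonal
    return A, labels
```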
If we write down the expectation matrix, it is a two-block matrix with p on the diagonal blocks and q on the off-diagonal blocks, and its eigenvalues carry the information of the network: the first eigenvalue, lambda_1 = n(p+q)/2, is the average degree, and the second, lambda_2 = n(p-q)/2, measures the discrepancy between p and q. The eigenvectors also give you information: the first eigenvector is the all-ones vector, because every row sum is exactly the same, and the second eigenvector tells you the exact membership of the vertices.

So the problem is essentially an inference problem. You have a random matrix A, which we can decompose as its expectation plus a noise term A - E[A]. The question becomes: how do you infer the low-rank structure from the noise? And this is not Gaussian noise; it is sparse matrix noise. If we have a concentration result, that is, if we know A concentrates around its expectation, then a perturbation analysis shows that the second eigenvector of A is close to that of the expectation. That means if we observe the adjacency matrix A and compute its second eigenvector, we can use the signs of its entries to recover the communities. But this is only true in the relatively dense regime: concentration of the adjacency matrix holds when the average degree is at least logarithmic. In that case, a perturbation bound like the Davis-Kahan inequality shows that all but an o(1) fraction of vertices are correctly classified. This fits the setup of sparse random matrices, and there is a lot of ongoing work to understand the behavior of eigenvalues and eigenvectors in different sparsity regimes.

But a more important regime is random graphs with bounded expected degree. If we try the same argument and compute the second eigenvector, we do not get the right answer, namely the partition between red and blue vertices. Instead, the second eigenvector outputs some high-degree vertices. This is the phenomenon called eigenvector localization in sparse random matrices: most of the mass of the second eigenvector is located on high-degree vertices. So the second eigenvector tells you who the popular people in the network are, but not the global partition. This is the fundamental obstacle, and instead of looking at the adjacency matrix, there has been a lot of development of other methods that reach the detection threshold.

What is the sparse stochastic block model setting? In the simplest case, we still consider a partition into two communities of equal size. Think of sigma as a label function giving every vertex a label +1 or -1 according to its community, with parameters p = a/n and q = b/n. There is a conjectured detection threshold: detection, meaning strictly better than random guessing, is possible if and only if a and b satisfy the inequality (a-b)^2 > 2(a+b), the so-called Kesten-Stigum threshold. This conjecture was first proposed by Decelle et al. in 2011. Following the conjecture, Mossel, Neeman, and Sly proved it is an if-and-only-if statement: above the threshold there is an algorithm, and below the threshold there is an information-theoretic impossibility result showing that no algorithm can perform better than random guessing. Mossel-Neeman-Sly, Massoulié, and Bordenave-Lelarge-Massoulié came up with different algorithms, and they all reach this detection threshold.
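To illustrate the vanilla adjacency-matrix method described above, and its failure mode, here is a short sketch assuming the sample_sbm helper from the earlier block: classify vertices by the sign of the second eigenvector of A. This is only reliable when the average degree is at least logarithmic in n.

```python
import numpy as np

def second_eigenvector_labels(A):
    """Guess communities from the sign of the second eigenvector of A.

    np.linalg.eigh returns eigenvalues in ascending order, so the
    eigenvector of the second-largest eigenvalue is column -2.  In the
    bounded-degree regime this vector localizes on high-degree
    vertices and its sign pattern is uninformative.
    """
    _, vecs = np.linalg.eigh(A)
    return np.sign(vecs[:, -2])

# Usage sketch: overlap close to 1 means near-perfect recovery,
# up to a global sign flip of the labels.
# A, labels = sample_sbm(2000, 0.05, 0.01, seed=0)
# guess = second_eigenvector_labels(A)
# print(abs(np.mean(guess * labels)))
```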
Today I want to mention one particular spectral method, based not on the eigenvectors of the adjacency matrix but on the so-called non-backtracking operator. This is an operator defined on the set of oriented edges: if we take each edge of the graph with both directions, the oriented edge set is twice as big as the edge set. The non-backtracking operator is defined on this oriented edge set in the following way: if (u,v) is an oriented edge and (x,y) is another oriented edge, and you can go from u to v, which equals x, and then from x to y without making a backtracking step (y different from u), you put a 1 in the matrix; otherwise it is 0. This non-backtracking relation is not symmetric, which makes the operator non-Hermitian. In recent years this non-backtracking operator has become a very important ingredient for analyzing sparse random matrices; many works on different models, such as random regular graphs, random bipartite regular graphs, and inhomogeneous Erdos-Renyi graphs, rely on this operator.

So what is the relation between the SBM and this operator? If you plot the eigenvalue distribution of this operator, the result of Bordenave, Lelarge, and Massoulié shows that if the parameters are above the threshold, then with high probability you see the following phenomenon: the outliers in the spectrum give you the average degree, (a+b)/2, and the discrepancy between a and b, while the rest of the eigenvalues are confined to a circle of radius sqrt((a+b)/2). So you already see the outliers, you can compute the eigenvectors associated with those outliers, and you can use those eigenvectors to detect the true labels sigma. The message here is that in this regime the spectral method on the adjacency matrix fails, but eigenvector information from the non-backtracking operator works down to the optimal information threshold.

Now we want to move to higher dimensions and look at higher-order networks. A natural generalization of the stochastic block model is the so-called hypergraph stochastic block model: instead of generating edges, we generate hyperedges independently, with probabilities depending on the memberships. In the simplest version, we still assign each vertex a label +1 or -1, and every hyperedge appears independently with a probability that depends on the memberships of its vertices.
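Before moving to hypergraphs, here is a hedged sketch of the graph non-backtracking matrix just defined, using dense loops for clarity rather than efficiency; the function name and edge-list format are illustrative.

```python
import numpy as np

def nonbacktracking_matrix(A):
    """Non-backtracking matrix B of a simple graph with adjacency A.

    B is indexed by oriented edges: B[(u,v),(x,y)] = 1 iff v == x and
    y != u, i.e. the walk continues from v without reversing.  The
    relation is not symmetric, so B is non-Hermitian and its
    eigenvalues are complex in general.
    """
    n = A.shape[0]
    oriented = [(u, v) for u in range(n) for v in range(n) if A[u, v]]
    idx = {e: i for i, e in enumerate(oriented)}
    B = np.zeros((len(oriented), len(oriented)))
    for (u, v) in oriented:
        for y in range(n):
            if A[v, y] and y != u:             # forbid the backtrack v -> u
                B[idx[(u, v)], idx[(v, y)]] = 1
    return B, oriented
```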
On the right-hand side, hyperedges whose vertices are all blue or all red are generated with probability p, and hyperedges across communities with another parameter q smaller than p, so this model captures some higher-order correlation. You can imagine a co-authorship network: the people on the left-hand side could be mathematicians, those on the right-hand side physicists, and every hyperedge could represent an arXiv paper they wrote together. You observe those collaborations and want to infer their departments. The goal is to construct a label estimator that is better than random guessing, but I don't care whether you assign +1 or -1 correctly: you can guess every mathematician wrong and every physicist wrong, and that is still a perfect answer, since the labels only matter up to a global flip.

This is an interesting question that has received a lot of attention in statistics, electrical engineering, and theoretical computer science in recent years, but most results so far focus on the regime where the expected degrees grow with n, which is not a realistic model for real networks. Here we want to focus on the case where the average degree is fixed.

One quick way to store all the information is to use tensors: to record all the hyperedges, a matrix is not enough, so you need a higher-order tensor that takes value 1 if i_1, ..., i_q form a hyperedge; an order-3 tensor, for example, represents a 3-uniform hypergraph. The bad news is that many tensor computation problems are NP-hard. Instead, you can do some tensor unfolding or higher-order singular value decomposition, and there is some work based on such tensor methods, which are not NP-hard, but so far there is no result below a growing-degree regime. So directly applying tensor methods might not be enough in the bounded-degree regime.

Here we want to study this more challenging regime with a very general model. Consider order q, meaning every hyperedge has size q, with a probability parameter set given by a tensor: each hyperedge e of size q is generated with probability p_{sigma(e)} divided by the scaling (n-1 choose q-1), which makes the expected degree of every vertex of order 1, and p_{sigma(e)} depends on the memberships of the vertices inside the hyperedge. So it is not just two values; you can have many parameters describing this model. We also do not assume the communities have the same number of vertices, so you can have general proportions of each type, but there is a regularity assumption: we assume every vertex has the same expected degree d. Otherwise the problem becomes much easier, since you can classify the groups by counting degrees.

In this case there is also a conjectured Kesten-Stigum threshold, which can be stated in terms of the adjacency matrix. Now the adjacency matrix takes values in the integers, not just zero and one: A_ij counts the number of hyperedges containing both i and j. You can compute the eigenvalues of the expected adjacency matrix; the first one must be d, and if you have r blocks there are r eigenvalues. The Kesten-Stigum threshold can be stated as follows: let r_0 be the number of informative eigenvalues above the threshold, meaning eigenvalues whose square times (q-1) exceeds d. Then there should be an algorithm that detects r_0 communities. This conjecture was stated by Angelini, Caltagirone, Krzakala, and Zdeborová.
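A hedged sketch of a simplified two-parameter instance of this hypergraph model, together with the integer-valued adjacency matrix A_ij counting hyperedges containing both i and j; the names are illustrative, and the general model in the talk allows a full tensor of probabilities rather than the two values a and b used here.

```python
import numpy as np
from itertools import combinations
from math import comb

def sample_hsbm(n, q, a, b, seed=None):
    """Sample a q-uniform hypergraph SBM with two equal communities.

    A candidate hyperedge on q vertices is kept with probability
    a / C(n-1, q-1) when all its labels agree and b / C(n-1, q-1)
    otherwise (b < a), so every vertex has expected degree O(1).
    Brute force over all C(n, q) candidates; fine for small n.
    """
    rng = np.random.default_rng(seed)
    labels = np.concatenate([np.ones(n // 2), -np.ones(n - n // 2)])
    scale = comb(n - 1, q - 1)
    edges = []
    for e in combinations(range(n), q):
        prob = a if len({labels[v] for v in e}) == 1 else b
        if rng.random() < prob / scale:
            edges.append(e)
    # Integer-valued adjacency matrix: A[i, j] counts the hyperedges
    # containing both i and j.
    A = np.zeros((n, n), dtype=int)
    for e in edges:
        for i, j in combinations(e, 2):
            A[i, j] += 1
            A[j, i] += 1
    return A, edges, labels
```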
The conjecture is that above the detection threshold, the belief propagation algorithm works, and they also proposed a spectral method based on the non-backtracking operator, which I will define on the next slides.

We already know how to define the non-backtracking operator on a graph, on the set of oriented edges. It gets trickier to define an oriented hyperedge, because every hyperedge has q vertices, and you can orient it in q ways: pick a vertex v and take the direction from v into the hyperedge e, so altogether you have q choices. The non-backtracking operator is then defined on those oriented hyperedges in a similar fashion: if I can go from u into the hyperedge e, land at a vertex v in e, and jump outside e to another hyperedge f, then you put a 1; otherwise you put a 0. The right-hand side of the slide shows a backtracking walk: I start from u, go into the hyperedge e, come back to v, and take e again, which is not allowed. This makes the operator non-Hermitian.

Our first result is a characterization of the spectrum of this operator. Remember that r_0 is the number of eigenvalues above the Kesten-Stigum threshold; those eigenvalues appear as outliers in the spectrum of this operator, while the remaining eigenvalues, below the information threshold, are confined to a ball of radius sqrt((q-1)d). In the simulation we have a stochastic block model with four blocks: you see that the first eigenvalue is around the average degree times (q-1), the other three eigenvalues are around their predicted locations, and the rest of the eigenvalues are inside the disk.

However, if we count the number of oriented hyperedges, it is roughly q times d times n. So even though the matrix is of size order n, for large parameters q and d it can still be a large matrix. There is a nice way to do a dimension reduction to a smaller problem: we can define a 2n-by-2n matrix B-tilde with a two-block structure, built from D, the diagonal degree matrix, and A, the adjacency matrix we defined before. It turns out there is a nice connection between the non-backtracking operator B and this operator B-tilde, through an Ihara-Bass formula: the characteristic polynomial of B, which records the eigenvalues of B, factors as some trivial eigenvalues, equal to 1 or -(q-1), times the characteristic polynomial of B-tilde. So if we care about the informative eigenvalues, we can just compute the eigenvalues of B-tilde, which makes the problem easier. The Ihara-Bass formula was discovered in 1992 by Bass for graphs, and we generalize it to hypergraphs.

Instead of eigenvalues, we care more about eigenvectors, because eigenvectors should give us an approximate answer for the signal. We can also perform an eigenvector analysis of the reduced operator B-tilde, in the following way: you compute the i-th eigenvector of B-tilde, which is a 2n-dimensional vector, take its last n entries, and normalize them to a unit vector. Then with high probability this vector is positively correlated with the true signal eigenvector of the expectation matrix, and the overlap can be calculated explicitly in terms of the signal-to-noise ratio. This quantifies the correlation between the true signal and the eigenvectors of B-tilde, and there is a standard way to go from an eigenvector guarantee to a guarantee for a spectral algorithm. In the picture we draw a 3-block model where we use the second and third eigenvectors of B-tilde to classify the three clusters.
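The talk does not spell out the blocks of B-tilde, so the sketch below covers only the classical graph specialization (q = 2), where the Bass companion matrix is standard; the hypergraph version has the same 2n-by-2n two-block shape in A and D but q-dependent entries, for which see the paper.

```python
import numpy as np

def bass_companion(A):
    """Classical graph (q = 2) companion matrix from the Bass formula.

    Every eigenvalue of the non-backtracking matrix B other than the
    trivial ones (+1 and -1 when q = 2) is an eigenvalue of

        B_tilde = [[ A,  I - D ],
                   [ I,    0   ]]

    where D is the diagonal degree matrix: if B_tilde (x, y) = lam (x, y)
    then x = lam * y, hence lam^2 y - lam A y + (D - I) y = 0.
    """
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    I = np.eye(n)
    return np.block([[A, I - D], [I, np.zeros((n, n))]])

# Detection sketch: take an informative eigenvector of B_tilde,
# keep its last n coordinates, normalize, and cluster by sign.
# vals, vecs = np.linalg.eig(bass_companion(A))
# order = np.argsort(-vals.real)
# v = vecs[n:, order[1]].real   # last n entries of the 2nd eigenvector
# guess = np.sign(v)
```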
You can see a strong correlation between the true labels and our eigenvector information.

Let me say a little about the proof ideas. There are two ingredients. The first concerns random hypergraph structure. In this very sparse regime you can always do a local tree approximation to reduce your graph problem to a Galton-Watson tree; since we have a higher-order structure, a hypergraph, we need to build a random hypertree analogue of the Galton-Watson tree and do a similar local approximation. The idea is to generate a labeled Galton-Watson hypertree in the following way: start with a root with a certain spin, generate a Poisson number of hyperedges, assign labels with certain probabilities, and propagate the hypertree. That gives a good approximation of the local geometry of the hypergraph. From there, you can calculate the eigenvector information on the tree and transfer it back by approximation, to get a good understanding of the top eigenvalues and eigenvectors of the hypergraph.

The other part is the random matrix part: we want to show that the rest of the eigenvalues are confined to the circle. We apply the moment method, but instead of taking a fixed power, you need to take a high power, at the level of log n. The idea is that taking a high power of the non-backtracking matrix counts the non-backtracking walks of that length in the hypergraph. We apply the high moment method not exactly to B: you need a modification to get rid of the expectation, so you center B in a certain way. Then it becomes a counting problem: how do you count such walks in the hypergraph? There is a convenient way to see those walks via a bijection. You can look at the hypergraph configuration on the left and pass to a bipartite representation: put the vertices on one side and the hyperedges on the other, and draw an edge whenever a vertex belongs to a hyperedge. This translates non-backtracking walks on the hypergraph into non-backtracking walks on the bipartite graph, and we do the counting at the bipartite level, which gives a good bound on the spectral radius. That is the structure of the proof.

The take-home message is that even though this looks like a sparse random tensor problem, community detection in sparse random hypergraphs can be reduced to an eigenvector problem for a 2n-by-2n non-Hermitian matrix constructed from the adjacency matrix A and the degree matrix D, and we show it works down to the conjectured generalized Kesten-Stigum threshold.

The next step is to make things more complicated: not just uniform hypergraphs, but taking into account all higher-order relations in the data set, which gives a non-uniform hypergraph. There are some partial results in this direction after our work, showing an Ihara-Bass formula for non-uniform hypergraphs, but the probabilistic analysis is more challenging and not covered yet. Another interesting problem is the impossibility result below the threshold: we used random matrix techniques to show there is a way to reach the detection threshold, but for the impossibility result you want to show that no algorithm, even with exponential running time, can gain any information below the threshold. And another interesting question is where the computational-to-statistical gap is in this model, since it behaves differently from the spiked tensor model.
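As an aside on the bipartite correspondence used in the counting step above, here is a small sketch, assuming the edge-list format returned by the earlier sample_hsbm helper:

```python
import numpy as np

def bipartite_adjacency(n, edges):
    """Adjacency matrix of the bipartite representation of a hypergraph.

    Put the n vertices on one side and the m hyperedges on the other,
    joining vertex i to hyperedge e whenever i is a member of e.
    Non-backtracking walks on the hypergraph correspond to
    non-backtracking walks on this bipartite graph, which is where
    the high-moment counting is carried out.
    """
    m = len(edges)
    M = np.zeros((n, m), dtype=int)        # vertex-hyperedge incidence
    for j, e in enumerate(edges):
        for i in e:
            M[i, j] = 1
    Z_n = np.zeros((n, n), dtype=int)
    Z_m = np.zeros((m, m), dtype=int)
    return np.block([[Z_n, M], [M.T, Z_m]])
```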
Here, even in a very sparse regime with a weak signal, you can still get a non-trivial correlation; but maybe some other tensor method works even below the threshold. We don't know, so that's an interesting question to look at. Thank you for your attention; I'm happy to answer questions.

Okay, so I'll break the ice. Do you have any other thoughts on the computational-to-statistical gaps? For example, there has been a fair amount of work recently, in the past say three or four years, that tries to deduce a web of reductions between problems, where the prototypical problems are hidden clique and sparse PCA. Do you think it's reasonable to expect an average-case reduction from these spectral hypergraph problems to others?

Okay, so this hypergraph stochastic block model and tensor PCA behave quite differently. In the tensor PCA case, if you want a polynomial-time algorithm, the signal must diverge with n, while the information threshold was proved to be at a constant level. Here you see our algorithm already works in the bounded-degree regime, so if there is a gap, it is only a constant gap: if the hypergraph is below the connectivity threshold, there is definitely no giant component, and there is no way to get any information. So if there is a gap, the thresholds differ only by a constant. That makes this problem different from tensor PCA. But I wonder whether, even at the constant level, there is another gap that could be closed by NP-hard tensor methods.

Other questions? You showed this dimension-reduction matrix B-tilde; can you explain a little, or give some intuition, about where it comes from?

Okay, this comes from earlier work; it is a deterministic formula, originally due to number theorists and combinatorialists. The idea is that you can factor the non-backtracking matrix B in a certain way and use a determinant trick: the characteristic polynomials of AB and BA differ only by some trivial eigenvalues. By doing that you reduce the dimension. It is hidden in the proof, but that's the main idea.

Other questions? If not, let's thank both the speakers. Thank you.
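For the record, the determinant identity invoked in that last answer is the standard Sylvester identity: for A of size m x n and B of size n x m, with m >= n,

```latex
\det(\lambda I_m - AB) = \lambda^{\,m-n}\,\det(\lambda I_n - BA)
```

so AB and BA share all nonzero eigenvalues, and the factor lambda^(m-n) accounts for the trivial eigenvalues mentioned in the answer.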