This is work done with Raffaele Marino in the last year at HUJI, before he moved to Switzerland, and we set ourselves to investigate the simplest possible model in which some of the basic themes in search could be explored. I'm going to talk about community detection, but everything I have to say applies also, I think, to constraint satisfaction and to the basic machine learning tasks. A critical difference shows up between searching for an acceptable solution when many solutions to a problem are present, and searching for a rare, unique, single optimal solution. In the first case, when there are many things to look for, there is not only the difficulty of navigating a local potential which is extremely complex, with high-lying early minima far from the optimum being sought; there is a different kind of search difficulty, easily isolated in the community detection problem, which you might call entropic difficulty. With many things to look for, the early stages of a search give hints that indicate multiple directions in which to search, and each of the goals interferes with the search for the others, so the total search is impeded, or undermined, by the early confusion. We chose a series of algorithms to explore in which it is possible to do early cleanup of mixed information, or mixed directions, rather than resolve such problems at the end of a search.

Secondly, there is a long history of exploring the phase structure of these problems. The easiest description is that many problems have gaps between what one can do trivially, what one can prove is possible, and what one can prove is impossible, and this gives a phase structure of easy, hard,
impossible, which you saw in Federico's talk yesterday. But there is a richer phase structure, known for problems that have been explored in greater depth using the advanced mean-field methods that arose with the understanding of replicas and other breakdowns of the phase space of random systems. The nice thing about community detection is that naturally occurring communities, without a whole lot of pre-specification of what kind of community one seeks, have multiple solutions. One can also plant communities, or plant cliques, in a graph, and that rewards entirely different search methods; at the end I'll show you a planted problem in which all of the elegant schemes the literature is full of can be defeated by very simple methods.

So let me go down to the middle of this slide. Imagine a country with a thousand oligarchs, two thousand politicians, and five thousand journalists in a population of many millions. You might want to learn things about the social communities in such a population by understanding who talks to whom, and telephone data in Mexico permits exactly that. Now, in Mexico the oligarchs are known, and it is a known fact that the telephone data connects all the oligarchs in a densely interconnected clique. So if we didn't know who was an oligarch, could we find that out starting from the telephone data? In fact, this is a good example where we probably wouldn't get very far. We might be given a hint: we might know that Carlos Slim, for example, is probably the oligarch, the capo di tutti capi. So we start with him, and we add to the community the person who talks to him and has almost as many connections as he does. But that might be a politician, and no oligarch owns all the politicians; each owns just enough to do whatever oligarchs need to make their life oligarchical. So now we only have
some of the politicians, but those politicians don't talk to all of the oligarchs; and if the next person added to the community turns out to be a journalist, we're hosed: we'll never find all of any of these communities. This is a good example of what I mean by mixed early information; we get confused.

A toy version of this problem has had a long and enjoyable history of analysis. Take an Erdős–Rényi random graph and, to make it as hard as possible, let half the bonds survive, chosen at random, and look to see if we can construct the maximum possible clique. The problem has a gap. The algorithm I described, start with the site with the largest number of neighbors, add from among its neighbors (now only half the graph) the site with the largest number of neighbors, then add, from the quarter of the graph that neighbors both, the site with the largest number of neighbors, and so on, runs out of sites to add at log base 2 of n, or log base 1/p of n if p is the fraction of bonds present. And it is a lovely phase boundary, an algorithmic phase boundary. You can do finite-size scaling for this problem, but you should be careful: if, instead of producing a fancy tightened-up finite-size-scaling plot, you simply shift each threshold by log base 2 of n, you will discover that every time log base 2 of n passes through an integer, this simple algorithm is likely to run out of space one or two steps before the threshold and occasionally able to find a solution two or three steps beyond it. So instead of a width which is sharp, a power law with an exponent less than one, this threshold has a width of a few integers on either side; it is a phase boundary with a softer threshold. And if you look carefully at a number of popular
combinatorial problems where the solutions are a sequence of integers, that is not uncommon behavior. Just saying that this was a phase boundary was shocking enough in the world of combinatorics when it was first encountered that the distinction between softer and sharper phase boundaries has never really received much attention, and that is understandable, because the study of this problem actually preceded the names P and NP and the question of whether they are different.

So what sort of phase boundary is this? I've said it is soft, but the natural question is whether it belongs somewhere in the Lenka cascade. Lenka tells me it is not just hers, but it is in her thesis, so I think we should give it her name, and she's not here to defend herself. The picture we start with is very much like this, but notice we are not doing what this diagram is meant to characterize: the cascade diagram describes a series of asymptotic solutions as one parameter, the ratio of links to sites, is made harder and harder, making such solutions harder to find and fewer. What I just described on the previous slide is the search for larger and larger cliques at a fixed value of n. So is the cascade the right way to think about that search, or is it a simpler easy-hard-impossible sequence? The question will be whether, across the top of what I'll show on the next slide, there is a cascade with intermediate points of structure that can be separated by more sophisticated arguments; I think the answer is that across the top there is indeed such structure.

One of the nice things about combinatorics is that you can calculate the probability of anything, if you characterize it precisely, by just counting. If you evaluate the expected number of cliques of a particular size k in a graph of size n, it is just n choose k times the probability that the number
of links you are going to need will be present, and the n choose k divides out the multiple counting of the different ways to label things. The maximum clique size follows from Markov's inequality: if the expectation value for the number of such cliques is less than one, the probability that there is such a clique rapidly goes to zero. What this tells you is that the phase boundary for the possibility of finding a maximum clique is not a smooth line. You can approximate it well by a smooth curve when p is close to one half, and less well as p goes to zero or one, but it is in fact a staircase: as we go up in clique size k, the boundary is a series of steps. Before there was a general appreciation that boundaries like this must exist, there was a theorem, worked out about 25 years ago by Jean Bourgain, Gil Kalai, and Kalai's student Ehud Friedgut. They refused to commit themselves to a smooth phase boundary and phase spaces and all the things familiar from statistical mechanics; they simply said there is a boundary, and refused to say anything more about it. This is a good example why: this is a staircase with uniform steps on a scale of log n, and no matter how large n gets, the staircase never smooths out.

What happens across the staircase is interesting; there is a concentration result. With the first two moments of a distribution on the non-negative integers, you can separate the probability of not seeing something from the probability of seeing some number of that thing, in this case cliques of the maximum size. At a staircase boundary we go from one value of k_max, with the likelihood that k_max is all you will see lying within one region, to the next value of k_max, whose likelihood lies within the next region, using Chebyshev
and Markov inequalities. It's a little trickier than that, a technical detail: if you just use the second moment of the number of such cliques you get a lower bound of zero, which isn't very useful, so you have to condition the count, for a positive-definite second-moment-like quantity, on graphs where there is at least one clique of the larger size; that turns out to be the easier quantity to evaluate anyway. You therefore get a way of showing that asymptotically only two values, the step below and the step above, are of interest, and as we pass through a step, which happens every time log n passes through an integer value, the fraction of graphs in which you will find the larger value passes through one half. It is an elegant evolution of the problem.

Now, what do we know about maximum cliques that would guide strategies for finding them? In the asymptotic regime they are rare: the expectation is of order one, but you will only find one in half the graphs, the distribution is long-tailed toward the high end, and because there is only one, you either find it or miss it completely. They are small, of size about log n. As you move across each of the steps the same thing happens over and over: you go from a regime where there is a tiny probability that any given randomly selected site touches a maximum-size clique, to the right-hand edge of the step, where the number of such cliques you will see passes through five, six, and presumably somewhere up around Avogadro's number, a diverging quantity. So along each step we clearly have a hard-to-easy transition. I didn't put the figures in the talk to show you, but along each step there is never much overlap between isolated maximum cliques.
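The first-moment counting argument above is easy to reproduce numerically. Below is a minimal sketch (the function names are mine, not the speaker's): it evaluates the expected number of k-cliques, E[N_k] = C(n,k) * p^C(k,2), in log form to avoid huge numbers, and finds the largest k for which the expectation is at least one; by Markov's inequality, cliques larger than that k_max are vanishingly unlikely.

```python
import math

def log_expected_cliques(n: int, k: int, p: float = 0.5) -> float:
    """log of E[number of k-cliques] = log[ C(n,k) * p^C(k,2) ] in G(n, p)."""
    return math.log(math.comb(n, k)) + math.comb(k, 2) * math.log(p)

def k_max(n: int, p: float = 0.5) -> int:
    """Largest k with E[N_k] >= 1. By Markov's inequality, cliques of size
    k_max + 1 or more appear in a vanishing fraction of graphs."""
    k = 1
    while log_expected_cliques(n, k + 1, p) >= 0.0:
        k += 1
    return k
```

Note that k_max grows like 2 log base 1/p of n, roughly twice what the greedy construction described earlier reaches, which is the gap the talk is about; and k_max increments by one each time log n passes the appropriate integer value, which is the staircase.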
Along each step, the probabilistic method shows that the likelihood that two of these maximum cliques overlap is basically just a cross-section; they show no correlation, and the overlaps are at most one or two sites. When you get to the right-hand side of a step you still have only a handful of overlaps between isolated maximum cliques, at any size for which one would ever do computer experiments to check. If instead you look at cliques which are two, three, four, or five sites smaller than the maximum clique at that value of n, then you begin to see a regime with clumps of cliques. To go back to the Lenka cascade: the intermediate phases in which you see clumps of cliques as well as isolated solutions are present, not for cliques at the maximum size on their own special step, but in the growth from easy-to-find cliques to almost-impossible-to-find cliques. So there are points along the Lenka cascade that we do see in this problem.

There is a literature which peaked in the late 1990s and early 2000s, around a DIMACS workshop at Rutgers: small, quickly written papers with tables, some of which contain encouraging results, and they are hard to read. We found that you can restate much of what was done in that period within a single model, and so, without discrediting the work done back then, we've been able to do as well and maybe a little better, as I'll show in the next couple of slides. The nomenclature SM doesn't really mean very much; call it Scott's method, or actually Raffaele's method. SM0 is a moderately smart version of what I described for finding oligarchs in the Mexican telephone data: pick a site at random, or pick the site with the most connections (or somebody hands you an oligarch, because he has a whole lot of connections), and add to it the neighboring site with the
most connections in the space that remains, and continue until you can't do any better. SM1 is: do SM0 from every site and keep the best result. SM2 is: try every link, follow the SM0 methodology from that point on, and keep the best result. What you can see is that the harder you work, the closer you get to the staircase. This is the dumb result, log n; this is the smart result, a huge improvement over the dumb result. Each of these lines costs another factor of n in work, and the reason you have to pay a full factor of n is that you are looking for many things: the cost of confusion is a power of n for each improved effort to follow the details of the staircase. And you'll notice it only works up to a size that today would be considered embarrassingly small, 5,000.

To go further, you can take these 1990s-ish algorithms and do what amounts to early cleanup. What we did was run SM0, stopping when SM0 stops, because that is super quick; then, as n gets larger, select from that intermediate-size clique all subsets of an optimal size, explore the best answer you can get from each, and maybe even do that a couple of times, resubsetting and recleaning, and go on from there. This is in effect a linear-time algorithm (we started with SM0) with a logarithmic, fairly expensive but logarithmic, cost for local search: local search with cleanup after completing the best inexpensive local search. The reason for the two staircases there is that the lower staircase is the best result you would get if, instead of cleaning, you started with a randomly selected subgroup of the size we selected from the SM0 solution as a starting point.
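As a concrete illustration, here is a minimal sketch of SM0 and SM1 as I understand them from the description above; the graph representation and function names are my own assumptions, not the speaker's code.

```python
import random

def random_graph(n, p=0.5, seed=0):
    """Erdos-Renyi G(n, p) as a dict mapping each site to its neighbor set."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def sm0(adj, start):
    """Greedy clique growth ('SM0'): from a starting site, repeatedly add the
    common neighbor with the most connections in the space that remains."""
    clique = [start]
    candidates = set(adj[start])
    while candidates:
        best = max(candidates, key=lambda v: len(adj[v] & candidates))
        clique.append(best)
        candidates &= adj[best]
    return clique

def sm1(adj):
    """'SM1': run SM0 from every site, keep the best result (a factor n more work)."""
    return max((sm0(adj, s) for s in adj), key=len)
```

SM2 would similarly seed the growth from both endpoints of every link, paying yet another factor of n, which is the "cost of confusion" pattern described above.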
we're better than any randomly we're better than any randomly selected subgroup of that size and still falling below the answer and we fall below with a slope which shows that we're asymptotically not on the right scale so that I would say is a qualified it's only a qualified success but ultimately it shows that the local search does not solve the problem in the limit of large n on the other hand it might be good enough for you know government work now supposing we the for the rest of the talk I'd like to shift to looking for just one thing where the cleanup issues are can be held to the end and I'll summarize a lot of work by indicating by suggesting that there's a large number of papers which tackle these problems using spectral methods the spectrum of interest is the spectrum of the adjacency matrix or matrices derived from the adjacency matrix the connections between the sites in the graph those are minimal in the same sense that our linear algorithm is minimal that is with a data structure of order n squared the adjacency matrix and of order n work added to that one does the best we can the short answer but I'm going to spend some time explaining it because it's interesting to it's it's worth knowing about is that cliques planted cliques of size square root of n proportional to square root of n can be found by the spectral method easily if your clique is some large factor alpha times square root of n so for for simplicity of proofs set alpha equal to 10 and you can learn a great number of things from the specter and they're good good linear algebraic methods that tell you things about the states the eigenstates of this particular random matrix but if you want alpha to become one or less than one things get a lot harder the spectral methods in the literature run out of steam and a linear algorithm from decollet all also runs out of steam a little bit above one and we found some tricks that can in fact push it below one below one if you use a hint and all the 
way down to one if you correct for the cheating that goes into using a hint to get there. Currently the clear winner among general methods for finding a planted clique in a random graph at p equal to one half is belief propagation: a belief propagation algorithm that comes out of the AMP work of Montanari and various colleagues is shown to be capable of extracting useful information down to alpha of one over the square root of e, and they have produced numerical examples. I will in effect repeat their work, in order to comment on some cautions that should apply to the use of belief propagation down in that regime. And then, finally, as a surprise, we found that our local methods can also play down there; not only our local methods but the parallel tempering you heard about yesterday. All have succeeded in getting solutions below this line. Here is the one over square root of e line, and parallel tempering, which because it is rather expensive doesn't go to terribly large samples, and our SM2 have obtained solutions significantly below it, almost down to the two log base 2 of n limit, below which you can no longer clearly distinguish a planted clique from the naturally occurring kind.

So it is fun to think about sources of information in the spectrum of a random graph, and I have to tell you what the spectrum of a random graph with a planted clique looks like. Before you plant the clique, you have the usual semicircular distribution of states and one special state, the uniform state, out at one half n. That uniform state isn't terribly interesting, except that there are tricks to make it go away if you want to use things like the power method to evaluate states at the edge of the distribution more rapidly. The spectrum of a complete graph is a delta function, actually two delta functions: there is a state associated with the
uniform eigenvector across the planted clique, with the rest at a small negative number and of course degenerate. When you join the two graphs, what you get is one eigenstate that lies outside the semicircular band, while the rest of the states originally associated with the planted clique are hybridized into the band of random states. I grew up learning about Wannier functions and tight-binding wavefunctions for electronic states in the gap of a semiconductor, because that was an interesting problem for a long time, and that is what the problem of diagonalizing this random matrix is like: in effect there is an impurity band of states derived from the planted clique, and it sits in the middle of the random, quasi-localized states that derive from the Erdős–Rényi overlaps.

That insight makes you think: maybe we should take one of the sites in the graph, or one of the sites in the planted clique if we have a hint to get us there, and use it to pull the other sites of the planted clique out of the band, where they can be inspected. That was the approach we took to see if we could push the limits of the spectral method down to alpha of one or below. If we don't use the trick, the special state from which the planted clique can be extracted disappears at alpha of about one, just below 1.2, and that is dependent on n as well as alpha. What you see on the right is all of the states on the planted clique as we decrease alpha; with a hint to pull them up toward the upper edge of the random band, you can keep all of that information away from the background spectrum of the adjacency matrix of the Erdős–Rényi graph. So it works, but I have to be honest about it: we were able to extract useful information about the planted clique, and I'm not going to go
through that in detail, well below alpha equal to one, down to about alpha of 0.75, which is not quite as far as one over the square root of e. But we used a hint, and each hint throws away half of the sites in the graph as obviously not connected to the hint site and therefore not members of the planted clique; take the square root, and that says we should expect to get down only by a factor of about one over the square root of two, roughly 0.71. So in effect the hint really didn't gain us anything, and I would say that the limits of the spectral method, as far as we know, are still at alpha of one.

So now let's ask what we can do to go below alpha of one. Belief propagation can be shown to have some non-zero chance of extracting information down to one over the square root of e, but our local methods have a unique advantage when there is only one thing to look for: they can stop when they find it. What we discovered is that it is actually remarkably easy to find a planted clique if you get anywhere close to it, and I'll show you how that works. We did one final competition: we set n equal to ten to the fourth and ran belief propagation on smaller and smaller planted cliques, and we also ran early-stopping versions of SM1 and SM2. Belief propagation finds a solution for this fraction of the graphs on which we ran it; and I know from running competitions with computer experiments that if your heart isn't in it you don't run as many samples as when your heart is in it, so our data is kind of crummy. But the fact is that the fraction of graphs for which belief propagation extracts essentially all of the hidden clique drops to a tenth or less at one over the square root of e, and the nature of belief propagation is that if it doesn't find the good fixed point, it goes to a fixed point with no useful information; it is all or nothing. The cool thing about the local search methods is that they stop when they find anything bigger than a naturally occurring clique, even two or three sites
bigger, because once they have returned some expected number of sites, not all of which are actually members of the planted clique, there is a fixed cost, or I guess one power of n of search, that will quickly restore the clique to its full size and eliminate the sites that don't belong. So on those graphs where we stop, we essentially find the entire clique: just as BP finds all or nothing, we find all or nothing, on a fraction of graphs that goes to zero at about four times log base 2 of n, and at two and a half times log base 2 of n using the more expensive method. And even the more expensive method, as you'll see here, stops so early that it runs in minutes. So in effect the best way to tackle this problem, at a cost less than any other method, is local search with early stopping, and results are obtainable on cliques which are too small to find any other way. I make no claims about how this scales up to billions, as opposed to tens or hundreds of thousands, but presumably this is the appropriate approach: I would say that local methods that have cleanup available and can stop early give us the power to deal with data sets of size up to about a million, and such problems abound and are of significant present value.

So what have we learned? I would claim that today's big-data problems are within the range of all sorts of hybrid and less fancy methods, and I've talked about some. Doing computational physics on problems with this kind of simplification is a valuable adjunct to the Lenka-cascade view of the whole nature of the phase boundaries. And finally, combinatoric problems over the integers definitely do not have smooth phase boundaries, and they
are not as sharp as the phase diagrams that we have in physics. It is possible that the phase diagrams that characterize glassy structure and glassy behavior may have some of the same uncertainties and extra width. So that's it, thank you.
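As an appendix of sorts, the spectral picture described in the talk can be sketched in a few lines: plant a clique of size k = alpha times the square root of n in G(n, 1/2), run power iteration on the centered adjacency matrix (subtracting the mean p suppresses the uninteresting uniform state), and read candidate clique sites off the largest entries of the leading eigenvector. All parameters and names here are my own illustrative choices, not the speaker's code; at alpha around 3 the special state sits well outside the semicircular band, while the talk's point is that this stops working near alpha of one.

```python
import math
import random

def planted_clique_graph(n, k, p=0.5, seed=0):
    """G(n, p) with a clique planted on k randomly chosen sites."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    clique = rng.sample(range(n), k)
    for i, u in enumerate(clique):
        for v in clique[i + 1:]:
            adj[u].add(v)
            adj[v].add(u)
    return adj, set(clique)

def top_eigenvector(adj, p=0.5, iters=60, seed=0):
    """Power iteration on B = A - p(J - I), the mean-subtracted adjacency
    matrix. The planted clique contributes an eigenvalue ~ k(1-p), outside
    the semicircular band of half-width ~ 2*sqrt(n*p*(1-p))."""
    n = len(adj)
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        s = sum(x)
        # (Bx)[u] = sum_{v ~ u} x[v] - p * (sum(x) - x[u])
        y = [sum(x[v] for v in adj[u]) - p * (s - x[u]) for u in range(n)]
        norm = math.sqrt(sum(t * t for t in y))
        x = [t / norm for t in y]
    return x

def spectral_candidates(adj, k):
    """The k sites carrying the largest weight in the leading eigenvector."""
    x = top_eigenvector(adj)
    return set(sorted(range(len(adj)), key=lambda u: -abs(x[u]))[:k])
```

In the regime where this works, a cleanup pass of the kind the talk describes (keep the candidates adjacent to most of the candidate set) restores the planted clique exactly; below alpha of about one the special eigenstate hybridizes into the band and the leading eigenvector carries no usable signal.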