Here, so I thought that perhaps a good starting point is to place this area of research a little bit in the context of modern statistics, since part of the audience is statisticians, but we don't know whether all of you are statisticians. So classical statistics is typically about learning a parameter, which is a vector in some fixed dimension, from a number of i.i.d. samples that is large compared to that dimension. There is a beautiful classical theory for this regime: the estimator theta hat converges to theta_0, you have limit theorems for the fluctuations, and so on. [several sentences inaudible] In modern high-dimensional statistics, instead, the parameter theta_0 belongs to a structured set Theta, and this set has some kind of effective dimension. Let me just skip over defining what an effective dimension means. Typically it means the number of non-zero entries, or the rank, or stuff like this. And then you look at a case in which the number of samples is actually much smaller than the ambient dimension. [several sentences inaudible] The simplest model of this kind, and the one I will use today, is the following. We observe a symmetric matrix X = (lambda/n) theta_0 theta_0^T + W, where theta_0 is the unknown vector in R^n and W is symmetric Gaussian noise, a GOE matrix. And here I scaled things in such a way that the operator norms of these two terms are of the same order: the operator norm of the noise is of order one, and the operator norm of the low-rank part is lambda. So lambda is the SNR, and I am interested in estimating theta_0, which belongs to this set Theta. So you have a statistical model: a set Theta of parameters inside R^n, and a probability distribution for X given theta_0. If somebody gives me an estimator theta hat(X), the quantity I will use to judge it is the normalized scalar product between theta hat and theta_0. This is what we want to compute.
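To make the model concrete, here is a minimal simulation sketch, assuming the spiked Wigner form just described with a uniform plus-or-minus-one prior (the function name and normalization conventions are my choices, not the speaker's):

```python
import numpy as np

def sample_instance(n, lam, seed=0):
    """Sample X = (lam/n) * theta0 theta0^T + W with W ~ GOE(n)."""
    rng = np.random.default_rng(seed)
    theta0 = rng.choice([-1.0, 1.0], size=n)     # assumed +/-1 prior
    G = rng.normal(size=(n, n)) / np.sqrt(n)     # i.i.d. N(0, 1/n) entries
    W = (G + G.T) / np.sqrt(2.0)                 # symmetrize: GOE scaling, ||W||_op -> 2
    X = (lam / n) * np.outer(theta0, theta0) + W
    return X, theta0

# The two terms have operator norms of the same order:
# ||(lam/n) theta0 theta0^T||_op = lam, while ||W||_op is of order one.
X, theta0 = sample_instance(n=500, lam=2.0)
```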
So now I have theta hat(X), theta_0, and the data X, and I want to understand this overlap between theta hat and theta_0 in the limit of large n. [a long stretch inaudible] The prototypical estimator is the spectral one: you take the principal eigenvector of X, suitably rescaled. This is the most classical thing you can do, and in the large-n limit its behavior is known exactly: the squared normalized overlap converges to (1 - 1/lambda^2)_+, the positive part of 1 minus 1 over lambda squared. So, for instance, here it is 0.81, and it converges to one as lambda goes to infinity; below lambda equal to one you get zero, so there is a phase transition at lambda equal to one. So this is nice: it is easy to compute, and the theory is sharp. But is it optimal? This brings me to some recent results. What is the best possible overlap, say q bar(lambda), over all estimators? Here a version of this was proved in a paper by Barbier et al. (Lenka is included in the list of authors), and another version by Lelarge and Miolane. And the theory is the following, under the model that I discussed before. The prior is that the entries of theta_0 are i.i.d., each entry being zero with some probability and plus or minus a constant otherwise, parametrized by epsilon; epsilon equal to one-half corresponds to the plus-or-minus-one problem. One then defines a function psi(gamma, lambda) of a scalar parameter gamma, in terms of an effective scalar Gaussian channel. [several sentences inaudible]
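A hedged illustration of the spectral estimator just mentioned: the squared normalized overlap of the principal eigenvector should approach (1 - 1/lambda^2)_+ (this reuses sample_instance from the sketch above):

```python
import numpy as np

def pca_overlap(X, theta0):
    """Squared normalized overlap <v, theta0>^2 / ||theta0||^2 of the top eigenvector."""
    _, eigvecs = np.linalg.eigh(X)          # eigenvalues in ascending order
    v = eigvecs[:, -1]                      # principal eigenvector (unit norm)
    return float((v @ theta0) ** 2 / (theta0 @ theta0))

lam = 2.0
X, theta0 = sample_instance(n=2000, lam=lam)
print(pca_overlap(X, theta0), max(0.0, 1.0 - 1.0 / lam**2))  # both close to 0.75
```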
[beginning inaudible] For this purpose you can think of the true vector as being the all-ones vector (1, 1, ..., 1). [several sentences inaudible] The answer is in terms of this function psi: you maximize the function psi(gamma, lambda) over gamma, call gamma_*(lambda) the maximizer, and then the limit as n goes to infinity of q_n(lambda) for theta hat Bayes (whatever theta hat Bayes is: it is the estimator that maximizes the overlap, the optimal estimator) is the square root of gamma_*(lambda), suitably rescaled. Okay? This is the story in Bayes estimation. The first step, let me write it here perhaps, the first step in the proof, or just for understanding this work, is to write the posterior of theta given X. What is this? It is proportional to the prior times e to the minus n over 4 times the squared Frobenius norm of X minus (lambda/n) theta theta^T. If you expand the square, the only terms that depend on theta are (lambda/2) times the scalar product of theta with X theta, and minus lambda squared over 4n times the norm of theta to the fourth; in the plus-or-minus-one case the second term is a constant. [several sentences inaudible] Now, all the quantities in the theorem can be evaluated numerically. So here is the picture for epsilon = 0.5: this axis is lambda, this axis is q, and this is the plot of the optimal overlap. [several sentences inaudible] There is an information-theoretic side of the story: below a certain threshold no estimator achieves a non-trivial overlap, and above it the Bayes estimator does. [several sentences inaudible] And then there is the algorithmic side. There is an iterative algorithm, the AMP algorithm, and the question is what it achieves and when it matches this curve. [several sentences inaudible]
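For reference, the posterior computation mentioned a moment ago can be written out as follows (my reconstruction from the quantities defined above, not a verbatim slide):

```latex
p(\theta \mid X)
 \;\propto\; p_0(\theta)\,
   \exp\!\Big(-\tfrac{n}{4}\,\big\|X - \tfrac{\lambda}{n}\,\theta\theta^{\mathsf{T}}\big\|_F^2\Big)
 \;\propto\; p_0(\theta)\,
   \exp\!\Big(\tfrac{\lambda}{2}\,\langle \theta, X\theta\rangle
              \;-\; \tfrac{\lambda^2}{4n}\,\|\theta\|_2^4\Big).
```

For theta in {+1, -1}^n the quartic term equals lambda^2 n / 4, a constant, so the posterior is proportional to exp((lambda/2) <theta, X theta>) times the prior.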
So how is AMP defined? The iteration is theta^{t+1} = X f(theta^t) - b_t f(theta^{t-1}): you apply a componentwise nonlinearity f to the current iterate, multiply by X, and subtract a memory term, the Onsager correction, proportional to the previous iterate. [several sentences inaudible] And what nonlinearity f should you apply in each component? The right way to think about it is that each component of the iterate behaves like an observation of the corresponding entry of theta_0 in Gaussian noise, with some SNR gamma_t, and so you apply the optimal estimator for that scalar channel, the posterior expectation of the scalar signal. [several sentences inaudible] About initialization: the algorithm works if you start with a random initialization, but it's an interesting problem to prove it. So empirically, okay, random initialization works, but we don't know how to prove it. We go around it by using a spectral initialization, and since this initialization is not trivial, we have to prove a separate theorem for it. Okay. So the theorem is the following: with the number of iterations t going to infinity after n goes to infinity, the limit of q_n(lambda) for this estimator theta hat AMP is equal to... so, what is the picture here? We have this function psi that I wrote out there. This is a function of gamma; it starts somewhere, and in general it can have multiple critical points. In this special example it is much simpler than that, but in general the algorithm gets stuck at the first critical point. So this is the algorithmic performance, while if you were able to do exact Bayes estimation, you would reach the global maximizer. So this describes quite precisely what is the gap between optimal estimation and algorithms, and the conjecture is that no polynomial-time algorithm beats this value. Now, the technique. I should mention the technique; part of this has already been explained by Lenka. The basic technique for proving this theorem is state evolution. It relies on several ideas, and it basically allows you to prove the following kind of theorem: for large n, if you look at the iterate z^t, this is approximately square root of gamma_t (a deterministic number) times the true vector theta_0, plus g, where g is a standard Gaussian vector, normal(0, I_n). So I wrote down the algorithm; this is an iterative algorithm, and in general this iterate might have a very messy distribution, and it does at any finite n; but in the limit of n to infinity, it is basically the true vector plus Gaussian noise. So this is extremely simple. Once you have this, we only have to keep track of these scalars gamma_t. And this approximation holds in the sense of convergence of finite-dimensional marginals, with the gamma_t pre-computed: using this theory, you can pre-compute them.
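A minimal sketch of the AMP iteration just described, under the plus-or-minus-one prior; in that case the scalar posterior-mean denoiser works out to tanh(lambda z) and happens not to depend on t, which simplifies the code (step count and initialization are illustrative choices, and sample_instance is reused from the first sketch):

```python
import numpy as np

def amp_z2(X, lam, t_max=30, seed=1):
    """AMP sketch for the +/-1 prior: z^{t+1} = X f(z^t) - b_t f(z^{t-1}).

    f(z) = tanh(lam * z) is the scalar posterior mean under this normalization,
    and b_t = (1/n) sum_i f'(z_i^t) is the Onsager memory coefficient.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)          # random start; the theorem uses spectral init
    m_old = np.zeros(n)
    for _ in range(t_max):
        m = np.tanh(lam * z)                 # componentwise scalar denoiser
        b = lam * np.mean(1.0 - m ** 2)      # Onsager correction coefficient
        z, m_old = X @ m - b * m_old, m      # AMP step with memory term
    return np.tanh(lam * z)

lam = 2.0
X, theta0 = sample_instance(n=2000, lam=lam)
theta_hat = amp_z2(X, lam)
print(abs(theta_hat @ theta0) / len(theta0))  # normalized overlap, near the Bayes value
```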
You can also estimate them from data, from the iterates of the algorithm itself, but for proving a theorem you can just derive them from the theory. Is there something for t which depends on n, or is it really...? Okay, so here, attention: I'm taking n very large first, so t is of order one. How big can I take t? Well, there is a generalization of this theorem, by Cynthia Rush and Ramji Venkataramanan, that works for growing t; if I remember correctly, t has to be at most of order log n over log log n. And you don't expect to be able to go generically much beyond this. There is the feeling that there can be unstable fixed points, for which the asymptotic theory tells you that you stay there forever, but at any finite n you will diverge after about log n iterations. But perhaps if you have some additional contractivity property, then you can prove theorems for larger t. This, I think, is an interesting research direction. Okay, now, am I satisfied with this theorem? Any other question about this theorem? Am I satisfied with this theorem? Well, not really. What is the answer? Do you have two separate curves in your plot? Excellent, excellent point. So, these curves, when you plot them here, let me use another color. Okay, here you don't do anything, because you cannot, and here you follow this curve. Fantastic. And you repeat the story here, but here you stay stuck at this point until the right threshold, and then you jump onto this curve. So, this is very similar to the picture of Laurent, except that here you don't have the intermediate part. Can you say that this is impossible, or that this can be done in polynomial time? [inaudible] The interesting point is what we cannot currently prove in sparse models. In these sorts of dense cases we can prove that above the threshold you follow the optimal curve; for instance, even in the two-group sparse stochastic block model, we cannot prove that above the threshold you follow the optimal curve. So, this is possible only for these models. So, this is very nice, and my conjecture is that one cannot improve over this algorithm. So why am I not happy with it? I am not happy because the algorithm is really fine-tuned to the distribution of the model: to the fact that the noise is i.i.d., in particular, and to the fact that the true vector is drawn from a prior distribution that I know. Okay? And that is a problem. So, one way to construct something that is... And then, also, it is iterative: this fact that I have to keep the number of iterations of order one, and only then let n go to infinity, is not very nice, right? I would like to have something that converges at finite n. In theory, you can say: I will stop after (log n)^{1/2} iterations; but (log n)^{1/2} is theory. So, something that everybody likes a lot is what statisticians call an M-estimator. So, what is an M-estimator? It's basically an estimator that comes as the solution of an optimization problem. Okay? So, I would like to construct a cost function F_n(m), a cost function in n dimensions, so m is an n-dimensional vector (it depends also on X, but I will drop that hereafter), such that I achieve the optimal performance: such that the solution of this optimization problem behaves like theta hat Bayes. Can I do this? So far, only for epsilon equal to one-half, and unfortunately we can... So, this means the plus-or-minus-one problem.
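In symbols, the goal is an M-estimator of the following kind (my paraphrase of the statement, with q_n the normalized overlap used throughout):

```latex
\hat{\theta}(X) \;=\; \operatorname*{arg\,min}_{m \in [-1,1]^n} F_n(m; X),
\qquad
\lim_{n\to\infty} q_n\big(\lambda;\hat{\theta}\big)
 \;=\; \lim_{n\to\infty} q_n\big(\lambda;\hat{\theta}_{\mathrm{Bayes}}\big).
```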
We are currently working on generalizing this, but for the moment we have a theorem only for epsilon equal to one-half, and lambda big enough, all right? So, there is a lambda_0, and lambda has to be bigger than this lambda_0; okay, and one would like to be able to get down to lambda equal to one. Okay, so what is the idea? The idea, which is very prevalent, very popular in machine learning, is what is called variational inference. So, what is this idea? It starts from the following observation. Look, what you are trying to compute is the posterior. How do I write this as an optimization problem? I write it as the minimization of the KL divergence between q and the posterior, over all probability distributions q on the hypercube; here, remember that our theta lives in the hypercube. So, if you can solve this optimization problem, then you have the posterior. Of course, by itself this is not a useful thing to say, because q is very high-dimensional, an object of dimension 2 to the n, but still, okay? So, this is the picture: this is the space of probability distributions on {+1, -1}^n, and here there is your posterior; we want to find it. What is the idea of variational inference? I will not minimize over this whole space; I will minimize over a subvariety, some submanifold of this space. And what submanifold do you want to take? Well, you know, we like simple probabilistic objects, so the simplest one is to take product-form distributions. Okay? So, for a given vector m in the solid cube [-1, 1]^n, I define q_m as the product distribution whose factor q_{m,i}(sigma_i) is just the distribution of a plus-or-minus-one variable with expectation m_i. If you take this idea and go ahead with it, you can plug this in here and compute the KL divergence. The KL divergence of q_m from the posterior is equal to a certain function of m, called the mean-field free energy, plus a constant independent of m. And this mean-field free energy is the following: minus lambda over 2 times <m, X m>, minus the sum over i from 1 to n of h(m_i), where h is the binary entropy function, the entropy of a single plus-or-minus-one bit with expectation m_i. So, this is an idea that is really popular in machine learning, with huge citation numbers. It is used, for instance, in what are called topic models, which are models for finding topics in corpora of documents. Okay. So: this doesn't work. I mean, I described you a model, and an approach that is extremely successful in practice; does it work in theory? No. Okay, so this is part of our work. It doesn't work in the following sense. Suppose that I can optimize this cost function. Notice, by the way, that this cost function is something that makes perfect sense. What you are trying to do is somehow fit the data: the first term is basically the log-likelihood of the data, so this is the maximum-likelihood objective, basically. But you say: okay, that is not quite the end of the story. If lambda is small, I don't want to trust my data too much, so I also try to maximize the entropy, which shrinks m towards zero a little bit. For lambda large, the likelihood objective will prevail; for lambda small, the entropy term will prevail. So, suppose that I can optimize this problem. Then the distance of this mean-field estimate from what it should be can be computed. And there is the extra complication that theta, in this model, can only be determined up to a global sign, because I observe theta theta transpose.
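The mean-field free energy just written, as a short sketch (a direct transcription of the formula above; natural logarithms are my convention):

```python
import numpy as np

def binary_entropy(m):
    """Entropy (nats) of a +/-1 random variable with expectation m."""
    p = np.clip((1.0 + m) / 2.0, 1e-12, 1.0 - 1e-12)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def free_energy_mf(m, X, lam):
    """Naive mean-field free energy: -(lam/2) <m, X m> - sum_i h(m_i)."""
    return -0.5 * lam * (m @ X @ m) - binary_entropy(m).sum()
```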
So, I have to minimize over this sign. Modulo this minimization over the sign, we look at the squared distance between the mean-field estimate and the posterior expectation; this lives on the scale of n, so I normalize by n. And this is bounded away from zero. So, this is a precise sense in which what is called naive mean field doesn't work. It doesn't work because, in a sense, we cannot neglect the correlations. The constant depends on lambda, right? And when lambda is large...? Yeah, the constant depends on lambda; if lambda goes to infinity, this goes to zero. And in fact, there are results by Harrison Zhou and one of his students that look at mean field for stochastic block models and things like this, and prove that if lambda grows with n, then this error becomes of lower order. That is interesting. But I want to stick to lambda of order one, which is the really noisy high-dimensional regime. So, if you are a physicist, you know the solution of this puzzle; you have known it for more than thirty years. It is what is called the TAP free energy. This is the same as the mean-field free energy plus an additional term. I don't have time to go into deriving this correction term, but its effect is to shrink a little bit more towards zero. Right? So what is the effect that we were neglecting? When you compute the expectation of sigma_i sigma_j under the posterior, it is a bit different from what you would hope: if you approximated sigma_i and sigma_j as independent, it would be m_i m_j. But this approximation is not good enough, because sigma_i and sigma_j are, under the posterior, slightly correlated: there is a correction of order x_ij squared times stuff of order one, and when you sum over i and j, since all of these corrections have the same sign, they contribute a term of the same order as the other terms. So, if you do this calculation carefully, you get this correction term. And there are many ways, actually, of deriving it. TAP stands for Thouless, Anderson, and Palmer. They were not interested in statistics; both Thouless and Anderson got Nobel Prizes, but neither of them for this. I think it's very interesting. Can we prove something about this free energy? This is in the same paper. So, the theorem says that this works. Namely, consider the set of critical points: the set of vectors m in the solid cube such that the gradient of the TAP free energy vanishes. Ideally we would like to prove that the minimizers of this free energy give the Bayes estimator; but this is a high-dimensional, non-convex function, so instead we look at all critical points that are below a certain level, not just the minima, but everything, right? Okay, first, let me state the statement. You look at the limit, as n goes to infinity, of 1 over n squared times the expectation of the squared Frobenius norm of m m transpose minus the posterior expectation of theta theta transpose. So, here I am quantifying the error without taking the minimum over the sign (it's always this sign problem, right?): instead of estimating theta, I try to estimate the expectation of theta theta transpose. And a side result of the theorem says that this matrix is effectively of rank one, basically of rank one, so you can do a rank-one decomposition and recover the vector estimate from it.
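A sketch of the TAP free energy: the talk does not write the correction term out, so the quartic Onsager term below, (n lambda^2 / 4)(1 - ||m||^2/n)^2, is taken from the literature on this model and should be read as an assumption of this illustration (it reuses free_energy_mf from the previous sketch):

```python
import numpy as np

def free_energy_tap(m, X, lam):
    """Mean-field free energy plus an Onsager correction (assumed quartic form)."""
    n = len(m)
    onsager = 0.25 * n * lam ** 2 * (1.0 - (m @ m) / n) ** 2
    return free_energy_mf(m, X, lam) + onsager
```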
So, basically: below a certain level, all critical points (there is more than one) are all very close to each other, and they are all very close to the Bayes estimator. This also implies that if you use this TAP estimator, you get the Bayes-optimal error. Any question about this? I have a question about the mean field. It tells you that you have a mismatch up to the sign; or can you say something about the sign of m itself? Oh, okay, good point. No, we don't prove anything like this here. We have another result, a smaller result in this paper: that for lambda between one half and one, basically, mean field gives you complete garbage, more or less. For lambda less than one the posterior mean should be exactly zero (okay, modulo signs, but let's forget about the signs), while the mean-field estimate is, in norm, big. So mean field gives you a non-zero object where the real posterior mean is zero. I think one can do a bit more with the same methods, but we don't state such results. Okay, so, I don't know if I have 10 more or 5 more minutes, because perhaps I'll try to do a 5-minute sketch of the proof. Sure. Maybe, but you erased your curve; where does it...? Ah, okay, okay, it's the same question. So, this result is only for the case epsilon equal to one-half. What is the curve? The axes are q and lambda. So, there is, you know, the spectral method, and there is AMP, which here coincides with Bayes. The theorem in the paper says that there exists a lambda_0 such that, above it, if you minimize the TAP free energy, you follow exactly this curve, right? Now, this is what we can prove. If you ask me to put on my conjecture hat for a moment: what I think is correct, and empirically seems to be the case, is that if you minimize the TAP free energy, you achieve this curve down to lambda equal to one. But, you know, proving it is quite non-trivial; you will see why, if I sketch the proof. What do you believe for epsilon very, very small? Ah, for epsilon very, very small, I don't know. I always think that if lambda is big enough, this approach will work. If you have epsilon close to zero, then, I believe (so this is the Bayes curve), for lambda big enough everything will be fine. Here there is the algorithmic threshold; I don't believe you get this bit. And in between, I don't know; I don't know if it reaches all the way down. Meaning AMP? I mean this, yes: where you can go with AMP. In practice (so this is another important question) this cost function is non-convex, so can we minimize it? You know, when I said that empirically it reaches this curve, this is by doing something very simple: gradient descent starting from a random initialization. I didn't do a lot of simulations; it was just a student who did a couple of simulations, and it seemed to work. It would be interesting to prove that this is the case. There are lots of other interesting computational questions. The most interesting, I think, is: can you construct a convex version of this? That would be very interesting. And that's related to the hard phase, because if you could always find the minimum, you would go into the hard phase.
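The experiment mentioned (gradient descent on the TAP free energy from a random start) might look like this; the step size, iteration count, and the clipping that keeps iterates inside the cube are my choices, and the gradient uses the quartic Onsager term assumed in the previous sketch:

```python
import numpy as np

def tap_gradient_descent(X, lam, steps=2000, lr=0.05, seed=2):
    """Plain gradient descent on the TAP free energy from a small random start."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    m = 0.01 * rng.normal(size=n)
    for _ in range(steps):
        q = (m @ m) / n
        grad = (-lam * (X @ m)               # gradient of -(lam/2) <m, X m>
                + np.arctanh(m)              # gradient of -sum_i h(m_i)
                - lam ** 2 * (1.0 - q) * m)  # gradient of the Onsager term
        m = np.clip(m - lr * grad, -0.999, 0.999)  # stay strictly inside the cube
    return m
```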
OK, so you would reach this; you could go up into the hard phase, if you could optimize this for any value of the parameters. So I'll give you the idea of the proof at a very high level. Without loss of generality, for this analysis, you can assume that the true vector is the all-plus-ones vector. Then I define a function tau that just computes a few statistics of this n-dimensional vector m. The statistics that I want to compute are: first, <m, 1>/n, so this is basically the overlap with the ground truth; second, the norm of m; the third is a funny object; and the fourth is a function that I will not write, because it's a bit messy, but the important point about it is that it is a function of the empirical distribution of m such that, at critical points, it is equal to the TAP free energy. It doesn't depend on X; it's a function of the empirical distribution of m, but at critical points it coincides with the free energy. And then, for a domain U in R^4, we look at the set of critical points with statistics in that domain, meaning the set of m in the solid cube such that (dropping the subscript) the gradient vanishes and tau(m) is in U. So, this is the set of constrained critical points, and what we do is just compute the expectation of its cardinality. In particular, we prove an upper bound on this expectation of the following form: the exponential of n times the supremum, over the set U, of some kind of large-deviations rate function. So this is the key part, technically, of the proof: this computation. Now, the computation is tricky, and there are quite a few subtleties; okay, we will probably say something about it. Because of that, we simplify, and we only prove an upper bound. At least in some regime of the parameters this should be an equality, in some regime of tau, because it formalizes the same computation that physicists had done. Physicists computed this kind of complexity function long ago; the first people who did it, for a related problem, were Bray and Moore, and we get the same formula. So this makes us believe that this upper bound is probably tight. Now, what does this function look like? I plot the function only as a function of tau_1, the first coordinate, meaning that I maximize over all the others, always with tau_4 less than minus lambda over 3. For lambda small, this looks like something like this: it has its maximum at zero. Here tau_1 is the overlap with the truth, so this is a number between minus one and one. In general, most points are at the equator, and also most critical points are at the equator; this is what you would expect. Then there is an intermediate regime of lambda (I'm being a little bit sketchy) in which this function has a bump around zero, then goes below zero, but then comes back and touches the axis. It touches the axis at some point tau_star. And this tau_star, you compute it, and it turns out to be the same as the Bayes overlap, the expectation of <theta, theta_0>/n under the posterior. Meaning: there might be some critical points here, where the Bayes estimator is, but there might also be a bunch of other critical points here, around zero, that you cannot rule out. And finally, there is the regime of lambda large. So this is lambda_0, this famous lambda_0, and this is what happens: the curve looks like this.
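In formulas, the counting bound just described reads as follows (my paraphrase; S is the complexity function and Crit_n(U) the set of constrained critical points):

```latex
\mathrm{Crit}_n(U) \;=\; \big\{\, m \in [-1,1]^n :\; \nabla F_{\mathrm{TAP}}(m) = 0,\;\; \tau(m) \in U \,\big\},
\qquad
\mathbb{E}\,\#\mathrm{Crit}_n(U)
  \;\le\; \exp\!\Big( n \sup_{\tau \in U} S(\tau) + o(n) \Big).
```

By Markov's inequality, sup of S over U being negative then implies that with high probability there are no critical points with statistics in U, which is the fact used below.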
We touch the axis here, after this bump here. So this means that if there is any critical point below this level, minus lambda over 3, it must have a scalar product with the truth that is close to the Bayes expectation. And separately you prove that there is at least one point below this level, so there must also be a critical point below this level, and it must be here. So, I will stop here and just say one word. As I said, the most interesting part of the proof is really computing this complexity, and the rest is basically studying the resulting formula and trying to get something out of it. So how do you compute this expected number of critical points? By the Kac-Rice formula: the expected number of critical points is the integral of the density of the gradient at zero times the conditional expectation of the absolute value of the determinant of the Hessian, with an indicator for the constraint. This is something that people have done before to count critical points of random functions, starting with Fyodorov, and with Auffinger and Ben Arous, who used it to count critical points of smooth random functions. If you have seen any of this, you know that the hard part of this calculation is computing the expectation of the absolute value of the determinant; this is a random matrix, and in this case it is quite a bit more difficult than in previous cases. It does not reduce to the previous cases, because the Hessian of this free energy, if you compute it, is basically the random matrix X plus a diagonal matrix (and the diagonal matrix depends on m) plus a low-rank part. The low-rank part does not really matter for the calculation that we do, but this is still a GOE matrix plus a diagonal matrix, and its spectrum, which you can compute by free probability, is a messy function of m that then has to be plugged back in. So, conceptually, what you do is the following. You know by free probability that the spectrum of the sum is the free convolution: the R-transform of the sum is the R-transform of this plus the R-transform of this. The expectation of the determinant is, on the exponential scale, the integral of the logarithm of the absolute value against the spectral measure. So you get a spectral measure nu_m, given in terms of R-transforms, you plug it in here, and then you integrate the log against it. If you write out the formulas... you know, we would not have been able to do it if we had not had the physics papers, which guide you a little bit, since you already know the answer. That's all, I think.
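The Kac-Rice identity invoked here, in its standard form for smooth random functions (my transcription, with p denoting the density of the gradient vector):

```latex
\mathbb{E}\,\#\mathrm{Crit}_n(U)
 \;=\; \int_{[-1,1]^n}
   \mathbb{E}\Big[\,\big|\det \nabla^2 F_{\mathrm{TAP}}(m)\big|\,
   \mathbf{1}\{\tau(m)\in U\} \;\Big|\; \nabla F_{\mathrm{TAP}}(m)=0 \Big]\;
   p_{\nabla F_{\mathrm{TAP}}(m)}(0)\, \mathrm{d}m .
```

The Hessian here has the structure of a GOE matrix plus an m-dependent diagonal plus a low-rank part, which is what brings in the free-probability computation of the expected log-determinant.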
This lambda_0 that you didn't evaluate: is it exactly the point where the maximum goes to zero? Yeah, so there are two things here; this is also to answer your question. There are two reasons why lambda_0 is not sharp. One is this bump in this expectation around zero, and the bump might be there for two reasons: the first is that there really are some critical points around zero; the second is that there is no critical point, but we are computing only an expectation, so we cannot tell. That already sets you apart from lambda equal to one. The other reason is that, once you have the formula, you have to study it analytically, rigorously, and prove that the shape is really this one. If I just had to compute it on my computer, I would get exactly that point; but for writing a proof, we give something away. I understand that for the proof they differ; but if you were just to plot it, lambda_0 would be some number. Do you have that number already? Is it bigger than one? I think that computing the expectation already makes you lose a factor, so the lambda_0 that you obtain as the point at which this bump disappears is already not sharp. Is lambda_0 like 100, or like 1.5? Yeah, I don't know; I don't think it's 100. Sorry, I missed a point. When you interpret these three diagrams, what was important: the fact that the maximum of the complexity function is zero, or the fact that it is reached at a point different from zero? Okay, so the main fact that we use about this complexity function is only one thing: if the complexity is negative in some region U (this is a function on, whatever, a four-dimensional space), that is, if the max of S(tau) over tau in U is less than zero, then we know that with high probability there are no critical points with statistics inside U. So this implies... so this is the complexity function, and I cannot plot it in full dimension, but if I could, I would be able to show you that for lambda large enough there are only two points at which this complexity function touches zero, always in the region below this level: in this four-dimensional space there is tau_4, this is zero, this is minus lambda over 3, and we prove that below this level the complexity function touches zero only at those two points. Okay. And separately we prove that there exists at least one point m such that F(m)/n is less than minus lambda over 3; in fact, by symmetry, two critical points below this level. So there must be critical points below this level, and they must all be there; once you have that, it proves that all the critical points below the level are close to the posterior expectation, and there are no others. Maybe I misunderstood, but at some point, when you were talking about this estimator, you were explaining that it is more robust than AMP, and I did not understand in which sense. In which sense do I think this estimator is more robust, and can one prove formal things about it? We didn't prove anything particularly strong; we didn't prove anything. But one thing that is clear is that if you change the matrix by something that has a small operator norm, then the estimator does not change much; with AMP you do not have such a statement. But the key thing, I think, is even less than that: give me any matrix. "Maximize this" is something that has a meaning independently of the model, even if the matrix is completely deterministic, built by an adversary: if it is a graph, maximizing the first term is the minimum bisection of the graph, so it has a meaning. Now what I am telling you is: add to this the entropy term and this reaction term. This is basically bisecting while shrinking a bit towards zero. This is again something that is perfectly well defined for any matrix, and it also has a meaning: it is basically something like a minimum cut shrunk towards zero, a relatively simple meaning for any matrix. If your model is correct, then it is the Bayes estimator; and it has some robustness properties, which we did not investigate too much, but obviously, for instance, a small operator-norm perturbation of X does not change it much. And maybe not just small operator-norm perturbations; others may be okay too. So your point is that, even though it is challenging to prove that this minimization works, the objective itself makes sense in general? Yes: if X does not come from your model, the minimizer is not necessarily optimal in any sense, but it is still something that has a meaning.
If you have, for instance, a community detection problem, and somebody tells you the graph is arbitrary, not coming from the stochastic block model, I can still tell this person: if you can compute the minimum bisection, just bisect the graph into two parts in such a way as to cut the minimum number of edges. This is of course NP-hard to do, but suppose you can do it: this way of partitioning a graph has a lot of nice properties. If instead I tell this person "compute BP and so on", BP, et cetera, for a general graph does not have a specific meaning. If you implement this minimum bisection on the stochastic block model, do you recover the communities? If you do it on the sparse model, this will not be the right thing. I think this generalizes: you might hope, more or less, to generalize this theory to adjacency matrices that are dense, where dense means logarithmic degrees or something. If you go to bounded degrees, then you have to change the free energy expression to something called the Bethe free energy, or the Bethe-Peierls free energy. This is a free energy that takes into account single-point marginals and pairwise correlations, and the stationary points of this free energy are basically the fixed points of belief propagation. And I would expect that minimizing this free energy would give you the Bayes-optimal estimator; but studying this is even more complicated. And one thing about this: one reason, at least as far as our proof goes, is that there is no simplification that comes from the Bayes property, the Nishimori property, like the one I was mentioning; in fact, we prove this formula for the Sherrington-Kirkpatrick model as well. There was a question: don't you have problems with divergences when some of the coordinates are at plus one or minus one? Okay, so you have to take care of that; it's a nice question, right? The way we take care of it is that we basically prove (we have a few pages in which we prove it) that all the critical points are in a hypercube of side one minus e to the minus something, I don't remember, between minus one plus epsilon_n and one minus epsilon_n. So all the critical points are in a slightly shrunk hypercube. For the free energy, could you add some constraint on m, in the sparse case, in order to improve things? So what you are saying is: in the sparse case this will not give you the Bayes estimator. You can always solve it, and we don't expect it to give you a bad thing, because even maximum likelihood (the first term here is basically maximum likelihood) is only a bit bad there: it does not get the right phase transition, but empirically it is very close to it. Still, minimizing this will not give you the Bayes marginals, and the reason is that here I am taking into account correlations between vertices a little better, but still only at the next order. If you now look at the sparse stochastic block model, the correlation between two random vertices is small, but two vertices that are connected by an edge have correlation of order one. So we have to take that into account, and the Bethe free energy does that by writing a free energy that is not a function of the one-point marginals only, but a function of all one-point marginals and all joint marginals along the edges. So now you have a function of the correlations on the edges and of the magnetizations, the expectations, on the vertices.
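For reference, the standard form of the Bethe free energy alluded to here, from the belief-propagation literature (not the speaker's slides), for a pairwise model on a graph G = (V, E) with vertex beliefs b_i, edge beliefs b_ij, edge potentials psi_ij, and no external fields:

```latex
F_{\mathrm{Bethe}}\big(\{b_i\},\{b_{ij}\}\big)
 \;=\; \sum_{(i,j)\in E}\;\sum_{\sigma_i,\sigma_j}
        b_{ij}(\sigma_i,\sigma_j)\,
        \log\frac{b_{ij}(\sigma_i,\sigma_j)}{\psi_{ij}(\sigma_i,\sigma_j)}
 \;+\; \sum_{i\in V} (d_i - 1)\, H(b_i),
```

where H is the entropy, d_i is the degree of vertex i, and the edge beliefs are constrained to marginalize to the vertex beliefs; its stationary points correspond to fixed points of belief propagation (Yedidia, Freeman, and Weiss).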
This set of variables is not huge, but the analysis now is quite a bit more complicated. Okay, we've got no more questions, so let's thank the speaker again.