So I guess we'll have to start without it; it'll get here. OK. So up until today our focus was the Boolean cube — maximizing a function over the Boolean cube — and today I want to start talking about a different variant, which is optimizing over the unit sphere: maximizing a polynomial over all unit vectors. As far as the SOS algorithm is concerned, there is not really much difference between the two. You define a pseudo-distribution, a function mu from the sphere (S^{n-1}, or S^n — I never remember the convention) to the non-negative reals, with pseudo-expectations given by integrating against mu, and you do the same things as before, except that instead of using the identities x_i^2 = x_i you now use the single identity sum_i x_i^2 = 1. So the sum-of-squares algorithm runs for optimization over the sphere just as well as over the Boolean cube.
But there are some algorithmic differences. In particular, for degree two — when f is a quadratic — maximizing over {0,1}^n is NP-hard (max cut is an example), while over the sphere it is in polynomial time, because it is an eigenvalue problem: maximizing a quadratic form over all unit vectors is just finding the maximum eigenvalue of a matrix. For degree three over the sphere, the problem is NP-hard; let me leave that as an exercise, for the usual reason: I don't exactly remember how it's done, I have a strong feeling it's true and that there's a reference somewhere, but I didn't get time to chase it up. (Typically, if I knew how to do it I would just do it; if I don't, I leave it as an exercise.)
But let me show you why, say, degree six is NP-hard — I first said degree four, and I think the construction can be adapted down, but six is easier. Given a SAT formula phi, define
f(x) = - sum over clauses i of p_i(x)^2 - sum_j (x_j^2 - 1/n)^2,
where p_i is a polynomial in the (at most three) variables of the i-th clause that is always non-negative on plus/minus assignments and is zero exactly when the assignment satisfies that clause. For instance, for a clause like x_7 OR x_15 OR x_18, interpreting +1 as "true," you can take something like p(x) = (1 - x_7)(1 - x_15)(1 - x_18), up to the scaling of the variables — I'd probably get the exact normalization wrong if I tried, but that's the idea. (Question: you replaced four with six? Yes — each p_i has degree three, so p_i^2 has degree six.) So f is minus a sum of squares; its maximum is always at most zero, and it equals zero exactly when x is a plus/minus 1/sqrt(n) vector — that's what the second sum enforces — whose corresponding assignment satisfies every clause. So you can transform a SAT formula into a degree-six polynomial whose maximum over the sphere is zero if and only if the formula is satisfiable.
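Going back to the degree-two case for a moment, here is a minimal numpy sketch (mine, not from the lecture) of why that case is easy: maximizing a quadratic form over the unit sphere is just a top-eigenvalue computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                                  # symmetric, so f(x) = x^T A x is a quadratic

eigvals, eigvecs = np.linalg.eigh(A)
lam_max, v_star = eigvals[-1], eigvecs[:, -1]      # max over the sphere and its maximizer

# sanity check: no random unit vector does better than the top eigenvector
xs = rng.standard_normal((1000, n))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
vals = np.einsum('ij,jk,ik->i', xs, A, xs)
assert np.all(vals <= lam_max + 1e-9)
print(lam_max, v_star @ A @ v_star)                # these two agree
```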
You can be a bit more clever and make the reduction above degree four, and even degree three, but let me leave that as an exercise. (Question: what do you need to modify to show that it's hard even to approximate the maximum? You have to be a little careful, because you have these two kinds of terms, but you can indeed also show hardness of approximation for these kinds of results. They are more tricky, and some of them are not fully worked out — there are still questions where we don't know exactly how hard these problems are.)
So it turns out there is a bunch of very interesting problems that all fall under this general paradigm of optimizing polynomials over the sphere, and they come from very different areas.
One is tensor PCA, or tensor SVD. PCA is principal component analysis, SVD is singular value decomposition — these are the bread and butter of working with matrices, and people doing data analysis do these things all the time. It turns out to be incredibly useful to be able to do the same with the higher-dimensional analogs of matrices, which are tensors, but the problem is exactly that when you move from degree two to degree higher than two, the computation suddenly becomes much harder.
Another problem is sparse coding. This is again an incredibly useful primitive. I think it started with a paper of neuroscientists arguing that, in some sense, this is something that happens in the visual cortex of the brain — I'll define the problem later — and they gave some evidence for that; nowadays people use it as a primitive in constructing deep learning networks. It is incredibly useful, and again there is this issue of computational efficiency.
There is the sparse vector problem, which is a fairly natural problem — you can think of it as a continuous analog of problems in coding theory — and it is also related to the small-set expansion problem.
And then there are problems in quantum information theory about entanglement — roughly, finding the best separable state for a given measurement.
All of these problems come from very different domains, but they share the following characteristics. One: they can all be phrased as maximizing a polynomial over the sphere. Two: all of them are NP-hard to solve exactly. Three: for all of them we don't really know the limits of what can be done efficiently. And four: for all of them the current best algorithms — or at least the best algorithms we know how to analyze — come from the sum-of-squares algorithm. For every one of these problems there is an interesting research program of trying to completely nail down its complexity, because the best algorithms and the best hardness results are often far apart, and it's all interesting. I don't know how much of this I'll get to — I'll see how far we get this week, and maybe I'll talk about QMA(2) next week.
So let me start by describing these problems. The plan is to describe them, and then in the next hours I hope to at least talk about maybe this one and this one, and actually show how the SOS algorithm gives results for them. But first let me describe the problems, so you can see what these things are beyond buzzwords and how they relate to optimizing polynomials over the sphere. Any questions at this point? I feel like between these four problems you should find something you like — between them they should cover close to a hundred percent of the population; maybe you need to add a problem five related to popular culture, but my knowledge of popular culture is not super great, so let's say heuristically we cover 80 percent. The sum-of-squares algorithm can be related to four out of these five, and I'll let you figure out which four.
OK, so let's start with tensor PCA. One thing everyone likes to do with data is linear regression: you have a lot of data points — this is one of those plots where you can tell the points were placed by a human and not by a random process — and you want to find the best line that approximates them; one way to do that is least-squares regression. The way we'll think about it, possibly in higher dimensions, is this: you have a distribution D_1 over samples of the form v = x * v_0 + e, where x is a standard normal random variable, v_0 is a fixed vector, and e is some noise. What you can do — and this is called principal component analysis — is look at the second moment: E_{v ~ D_1}[v v^T] = E[(x v_0 + e)(x v_0 + e)^T] = v_0 v_0^T + E[e e^T], since the cross terms vanish (x and e are independent with mean zero). Now if you pass to the eigenbasis, and e is small, you get a PSD matrix with a 1 in the direction corresponding to v_0 and epsilon everywhere else, so you can read off v_0 from the eigendecomposition of this matrix. That's principal component analysis, and v_0 is the principal component.
Of course, often you want the higher-dimensional generalization: say in three or more dimensions you have some plane, and points that lie approximately on that plane. The way we'll model it is a distribution D_plane over v = (1/sqrt(2)) x v_0 + (1/sqrt(2)) y v_1 + e, and taking the same expectation gives E[v v^T] = (1/2) v_0 v_0^T + (1/2) v_1 v_1^T + E[e e^T] — a half in each of the two directions, and epsilons everywhere else.
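Here is a small numpy sketch of the single-spike model D_1 just described (my own toy demo; the parameter names and values are my choices): generate samples x * v_0 + e and read off v_0 from the empirical second moment.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma = 100, 20000, 0.1                      # dimension, #samples, noise level (my choices)
v0 = rng.standard_normal(n)
v0 /= np.linalg.norm(v0)                           # hidden unit direction

x = rng.standard_normal(m)                         # x ~ N(0, 1)
E = sigma * rng.standard_normal((m, n))            # noise e
V = x[:, None] * v0[None, :] + E                   # samples v = x * v0 + e

M = V.T @ V / m                                    # empirical E[v v^T] ~= v0 v0^T + sigma^2 * I
top = np.linalg.eigh(M)[1][:, -1]                  # top eigenvector (the principal component)
print(abs(top @ v0))                               # close to 1: PCA recovers v0
```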
So far so good; this is what people do and they're very happy with it. But now look at another example. Take the same two directions, but instead of a distribution spread over the plane, take a mixture: with probability half you get a point close to the direction of v_0, and with probability half a point close to the direction of v_1. Call it D_mixture. Just looking at it, this is a different distribution — it has two clusters — but if you do the same computation, E[v v^T] is half of what you get from one cluster plus half of what you get from the other, which is again (1/2) v_0 v_0^T + (1/2) v_1 v_1^T plus noise: basically the same thing as before. So using PCA — or any analysis of the second-moment matrix — you are not able to distinguish between these two distributions, and in particular you are not able to recover the two clusters. That is a real limitation of PCA, and it is the reason we want to go to higher moments.
The second moment does not allow us to distinguish between these two things, but the third moment — or, in this case it's cleaner to look at the fourth moment — will. Let's see the calculation. For the fourth moment, E over D_mixture of v^{tensor 4} is basically (1/2) E[x^4] v_0^{tensor 4} + (1/2) E[x^4] v_1^{tensor 4}; let me ignore the noise, since its cross moments vanish. For the plane distribution you get the same two terms but with smaller coefficients — the 1/sqrt(2) raised to the fourth — plus cross terms with coefficients like (4 choose 2)/4 involving both v_0 and v_1. Viewed as n^2-by-n^2 matrices, the mixture's fourth moment is essentially a rank-two matrix supported on v_0^{tensor 2} and v_1^{tensor 2}, while the plane's is not — so the fourth moments do distinguish the two distributions. In particular, if you are able to take this fourth-moment tensor and decompose it into its rank-one components, you can read off the centers of the clusters.
This is what's known as tensor decomposition, which is the tensor analog of singular value decomposition: you get a tensor — say a degree-four tensor — and you find its decomposition into a sum of rank-one components. This is in general a computationally hard problem, but people actually use it all the time; they have heuristics that seem to work on realistic inputs, and we can sometimes analyze them and show that under certain distributional assumptions we actually succeed. Does everyone understand what tensor decomposition is? (Question about uniqueness — that's a great question. In general a tensor decomposition doesn't have to be unique; there are incoherence or genericity assumptions that imply the decomposition is unique, but even when it is unique it's not clear how to find it efficiently. We'll show an algorithmic result using the sum-of-squares algorithm that does recover the decomposition under certain conditions, and we'll also discuss some other algorithms for it.)
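To make the mixture-versus-plane point concrete, here is a small numpy check (my own toy demo, with orthonormal directions and parameters chosen for convenience): the two distributions have essentially the same empirical second moment, but visibly different flattened fourth moments.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 10, 100000
v0, v1 = np.eye(n)[0], np.eye(n)[1]                # two orthonormal hidden directions (toy choice)

x, y = rng.standard_normal((2, m))
plane = (x[:, None] * v0 + y[:, None] * v1) / np.sqrt(2)
coin = rng.integers(0, 2, m)
mixture = np.where(coin[:, None] == 0, x[:, None] * v0, x[:, None] * v1)

def second(V):
    return V.T @ V / len(V)                        # empirical E[v v^T]

def fourth(V):
    # empirical E[(v ⊗ v)(v ⊗ v)^T], an n^2 x n^2 flattening of the fourth moment
    W = np.einsum('mi,mj->mij', V, V).reshape(len(V), -1)
    return W.T @ W / len(V)

print(np.linalg.norm(second(plane) - second(mixture)))   # small: 2nd moments agree
print(np.linalg.norm(fourth(plane) - fourth(mixture)))   # order 1: 4th moments differ
```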
OK, so now for sparse coding, which is also sometimes known as dictionary learning. The way I like to think about sparse coding is through an alternative history of how the Fourier basis was discovered. The real history is long and complicated and goes back well before Fourier, but let's imagine it happened like this: this guy Fourier — who according to Wikipedia looked something like this — sat and listened to sounds being played, say on a violin, and he wanted to find the right representation for these sounds, the most succinct representation. He gets the sounds in the time domain, as vectors, and he wants to find a basis in which these vectors are sparse — maybe each sound is just two nonzero coefficients. And eventually, given enough sounds, you could ask: what is the basis that makes these signals as sparse as possible? That would be a way to discover the Fourier basis.
Why do we want the representation to be sparse? Because in some sense a sparse representation is meaningful. In the time domain it's hard to say what it means for a particular coordinate to be large or small — basically all the coordinates are comparable. But when the representation is sparse, most coordinates are "off," and when a coordinate is turned on you can say it means something: this coordinate corresponds to, I don't know, C-flat — I'm not a very musical person, I don't even know if that note exists — to a particular frequency, a particular note. And this is not just true for sounds; in general, with a sparse representation, a coordinate having large magnitude is meaningful. For example, look at images in the pixel basis: knowing that the pixel at position (100, 52) is black or white doesn't really tell you much about the image. You might want a representation of the image in which a particular coordinate corresponds to, say, whether the image contains an eye. And indeed, sparse coding is often used as a primitive for building deep neural networks; in fact this is what Olshausen and Field argued — that in the human visual cortex we maintain this kind of sparse linear representation of the images we see, which is somewhat more meaningful. Typically the first level of such a representation does not correspond to an eye; it corresponds to edges and things like that, and then you go deeper and deeper into the network and get more and more meaningful features. So sum of squares is related to deep learning — and if I could also manage to relate it to the Internet of Things, all my funding problems would be completely sorted out.
So, mathematically, what do we want to do? The input is samples y = A x, with y in R^n; x is chosen from some distribution D over sparse vectors, and A is the unknown dictionary. You can start by thinking of A as a basis — think of the question: how do we discover the Fourier basis from examples? So the input is samples y_1, ..., y_M — lots of them — and the goal is to recover A. Of course you might not be able to recover A exactly, so let's say: up to permutation, and approximately; you might discover the basis in some permuted order, with signs or scales changed. (Question: does A have to be square? Not necessarily — sometimes we want what's known as an overcomplete basis, with more columns than rows, and in many cases we actually want that; but you can start by thinking of A as square, so we're just trying to find a basis.) And again, you can already see a connection to the earlier discussion: if you only looked at the second moments of these samples, the Fourier basis and the standard basis would look exactly the same, so second moments will not help you discover the Fourier basis.
There is a related problem — not the one we want to solve here — called sparse recovery, also known as compressed sensing. In sparse recovery, A is known; we get y and want to find x, knowing that x is sparse. That is a different problem, and for it there are pretty good algorithms based on l_1 minimization — algorithms based on linear programming. You can apply similar algorithms to the sparse coding problem, but it turns out they are not so good for sparse coding, and sum of squares can do better. OK, so that is the sparse coding problem.
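Going back to the generative model y = A x, here is a tiny numpy sketch (my own toy setup; the sparse-coefficient distribution and parameters are assumptions, not from the lecture) of the remark that second moments cannot identify the dictionary: two completely different orthonormal dictionaries give essentially the same second-moment matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 64, 20000, 3                             # dimension, #samples, sparsity (my choices)

def dataset(A):
    # y = A x where x has k random +/-1 entries (a toy sparse-coefficient distribution)
    Y = np.zeros((m, n))
    for t in range(m):
        idx = rng.choice(n, k, replace=False)
        Y[t] = A[:, idx] @ rng.choice([-1.0, 1.0], k)
    return Y

A_standard = np.eye(n)                                      # "standard basis" dictionary
A_other = np.linalg.qr(rng.standard_normal((n, n)))[0]      # a completely different orthonormal one

Y1, Y2 = dataset(A_standard), dataset(A_other)
M1, M2 = Y1.T @ Y1 / m, Y2.T @ Y2 / m                       # both are roughly (k/n) * I
print(np.linalg.norm(M1 - M2))                              # small: 2nd moments don't identify A
```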
The third problem is the sparse vector problem. This is somehow a routine that's useful in other places; I don't know if it has gotten as much direct attention as the others, but I think it is a fairly natural problem. The input is a subspace — you are given a basis for it (Question: do we have a short description of the basis? Yes, you just get some basis; it's a linear subspace, and it doesn't matter which basis) — and the goal is to find the sparsest (nonzero, of course) vector inside the subspace. You can think of it as a continuous analog of the problem of finding the minimum-weight nonzero codeword of a code. It is related to the question of certifying the property known as the restricted isometry property, which also arises in compressed sensing, and it is also related to the small-set expansion problem in graphs. The small-set expansion problem is the following: you are given a graph and some parameter lambda, and the goal is to find a small set S such that 1_S^T L 1_S <= lambda * ||1_S||^2, where L is the Laplacian of the graph and 1_S is the indicator vector of S (note ||1_S||^2 = |S|) — in other words, a small set whose expansion is at most lambda. This is not exactly the same, but it is approximately the same as finding the sparsest vector in the span of the eigenvectors of the Laplacian with eigenvalue at most lambda. The small-set expansion problem is strongly connected to the Unique Games problem, so that is another motivation to look into the sparse vector problem — but it is natural in its own right, and you can study average-case versions of it.
By the way, it is not completely immediate why these problems are about optimizing something over the sphere, but we'll see how they are related. Typically what we'll do is find some proxy: in both of these cases we will find a proxy for the problem that looks like maximizing a polynomial over the sphere. For the fourth problem, it actually is pretty directly an optimization over the sphere.
So, the fourth problem: entanglement. Let me give you a super crash course in quantum information theory. One of the most mysterious things about quantum mechanics is entanglement, and one way this mysteriousness manifests itself is that it is computationally hard to understand whether a quantum state is entangled. So what is a quantum state? Suppose you have a system with n basic classical states — if you take a course on quantum computation you think of qubits, and you can think of the system as having log n bits, with the states being all possible assignments, but right now we don't care about that so much. A quantum state on this system is an n-by-n PSD matrix of trace one; I'm going to pretend quantum states are real rather than complex — it doesn't really matter much, the power of quantum doesn't come from the complex numbers. This is known as a density matrix in quantum notation. If rho is diagonal, the state is basically classical: the diagonal is just a set of probabilities — non-negative numbers summing to one — and the i-th diagonal entry tells you the probability of being in state i. That's the classical case; in quantum, states are PSD matrices that might be non-diagonal. (Notice that diagonal matrices all commute with each other; in some sense the quantumness comes from states that do not commute with one another.) When you talk about closed quantum systems, like a quantum computer, you can sometimes pretend the state has rank one: if rho is of the form v v^T for a unit vector v, we say rho is pure. The quantum people would write this as |v><v| — that's their notation for the column vector v and the row vector v^T — but I'm not going to use that notation. So that's a pure state.
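A minimal numpy sketch of the two kinds of states just defined (real case, as in the lecture; the dimensions and random choices are mine): both a diagonal "classical" state and a rank-one pure state are PSD with trace one.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4

# a classical state: diagonal density matrix = a probability distribution on n states
p = rng.random(n); p /= p.sum()
rho_classical = np.diag(p)

# a pure state: rho = v v^T for a unit vector v
v = rng.standard_normal(n); v /= np.linalg.norm(v)
rho_pure = np.outer(v, v)

for rho in (rho_classical, rho_pure):
    eigs = np.linalg.eigvalsh(rho)
    print(np.isclose(np.trace(rho), 1.0), eigs.min() >= -1e-12)   # trace one and PSD
```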
More generally, a quantum state can be a mixed state, which is not necessarily rank one. Please stop me with questions — if you haven't seen quantum at all this might be a little fast or abstract, but you can just take my word that this is a way to describe a quantum state.
Now, to talk about entanglement you want a system that has two parts. Suppose the system has n^2 states: two parts, each of which can be in n states, so n^2 states total. In the classical case, the state being independent between one part and the other corresponds to the distribution being a product distribution. In the quantum world, the analog of being a product (or a mixture of products) is being separable, and non-separable is the notion of entanglement — which is somewhat subtle. Let's say what it means for a pure state to be separable first. Take a pure state v; v now lives in dimension n^2 (really it should be in C^{n^2}, but I'm ignoring that issue). We say this pure state is separable if v, viewed as an n-by-n matrix, is rank one — that is, v = u tensor w for some vectors u and w. In general, a state rho is separable if it is a convex combination of such product pure states; "separable" is the formal way of saying "not entangled."
Why do we care whether a state is entangled or not? Let me not get into it — let me just say that people in quantum information theory deeply care about whether a state is entangled or not. And here is one way in which entanglement is more complicated than the classical notion of a product distribution. If I give you a probability distribution over n^2 outcomes — n^2 probabilities — and ask whether it is a product distribution, that is a computationally easy task: it is basically the same as giving you a non-negative matrix and asking whether it is rank one, which I can check. But if you give me a quantum state and ask whether it is separable or not, we do not have an algorithm for that.
In particular, one question people have been looking at is known as the best separable state problem. The input is a quantum measurement. A quantum measurement is an operator — think of it as an experiment — that, given a state, outputs 1 with some probability and 0 with the remaining probability, and you want to know the probability that it outputs 1. Mathematically, a measurement is a PSD matrix M with 0 <= M <= I (in the PSD order), and the probability that M accepts the state rho is simply the trace, Tr(M rho). Notice that if rho is PSD with trace one and 0 <= M <= I, then this is indeed a probability — always between zero and one. Now, finding the best probability with which a measurement can be made to accept, over all states, is an easy task — it is basically an eigenvalue problem.
But in the best separable state problem, you think of the measurement M as an n^2-by-n^2 matrix, and the goal is to compute the maximum of Tr(M rho) over all states rho that are separable. (Question: can I phrase it as distinguishing between the case where this maximum is 1 and the case where it is at most one half? Yes — we assume you are given M, and even for that promise version we don't know of any good algorithm.) (Another observation from the audience: the maximum over separable states is always attained at a pure product state, since any separable state is a convex combination of them. That's right — so you are really asking for the maximum over unit vectors u, w of (u tensor w)^T M (u tensor w), and it turns out that the hard case is essentially the symmetric one, where you can even enforce u = w; then it becomes the maximum over unit x of (x tensor x)^T M (x tensor x), which is just maximizing a degree-four polynomial over the sphere.) So we are back to the question we started with — maximizing a degree-four polynomial over the sphere — except that you also know something about the structure of this degree-four polynomial, because it comes from this question.
Let me say what is known. Doherty, Parrilo, and Spedalieri — I don't remember exactly when they did it — proposed using SOS for this problem. What we also know, based on works of Harrow and Montanaro, is that there are NP-hardness results showing that — for SOS or any other algorithm, under some natural assumptions — you cannot do this in polynomial time; it requires quasi-polynomial time. The main question is whether there is an n^{polylog(n)}-time algorithm; we really don't know. Until very recently we didn't know anything better than 2^n; we have a very recent work showing that SOS gives you roughly 2^{sqrt(n)} for this problem, and I hope to talk about that next week. Another result, by Brandão, Christandl, and Yard, shows that SOS does give you a quasi-polynomial-time algorithm if you restrict to a special class of measurements M, namely one-way LOCC measurements (local operations and classical communication) — a class that makes sense in the quantum context, but it is still only a special class, not all measurements.
In general, if we found an algorithm for this problem running in quasi-polynomial time, it would imply a complexity result: it would show that QMA(2) — quantum Merlin-Arthur with two provers — is contained in exponential time. So sometimes you will see the question phrased that way. The way I feel about it is that the main reason this question is interesting is that it is really about the nature of entanglement and about finding non-trivial ways to argue about entanglement; the QMA(2) version is just a way to talk about it — it's not that the complexity class on its own is super interesting — but people are very interested in this question because it is so related to these other questions in quantum information theory. There was a whole workshop, I think in August, dedicated just to this question.
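Here is a tiny numpy sketch of the reformulation just described (my own toy construction of M, not a physically meaningful measurement): restricted to symmetric product states, the acceptance probability Tr(M rho) is exactly a degree-four polynomial in a unit vector x, so the best separable state problem contains maximizing such polynomials over the sphere.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5

# a random measurement: a PSD matrix M on the n^2-dimensional system with 0 <= M <= I
B = rng.standard_normal((n * n, n * n))
B = B @ B.T
M = B / np.linalg.eigvalsh(B)[-1]                  # rescale so the top eigenvalue is 1

def accept_prob(x):
    # acceptance probability Tr(M rho) on the product state rho = (x ⊗ x)(x ⊗ x)^T,
    # which is exactly a degree-4 polynomial in the entries of x
    xx = np.kron(x, x)
    return xx @ M @ xx

x = rng.standard_normal(n); x /= np.linalg.norm(x)
print(accept_prob(x))                              # always lands in [0, 1]
```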
(Question: for the special M's for which we have algorithms, do we know whether they contain hard instances of the problem in general? No — in some sense what we know is that we can't pin the hardness on those special M's. There is a mismatch — Aram would know better about the latest results — but as far as I know there is a mismatch between the measurements that come out of the hardness results and the measurements covered by the easiness results. It could be that for these one-way LOCC measurements the problem is actually in polynomial time, while for general M we know you cannot do better than quasi-polynomial time; what the truth is, is a very interesting — I think really fascinating — question.)
OK, I think now is a good time to take a break.
So, I hope I've motivated why optimization over the sphere is very useful; now I'm going to start talking about how SOS is actually relevant to it, and let me start with the sparse vector problem. I'm going to focus on the average-case setting of this problem. The subspace V is the span of v_0, v_1, ..., v_k, where v_0 is sparse — meaning the number of coordinates i with (v_0)_i nonzero is at most epsilon*n — and v_1, ..., v_k are random Gaussian vectors. For the sake of presentation I might also assume that v_0 is orthogonal to them; it makes life simpler. Since they are random Gaussians they are not exactly pairwise orthogonal, but they are close to it, and at some points I might assume exact orthogonality to simplify calculations, even though they are really only nearly orthogonal.
What is the goal? Generally, the goal is to recover v_0; it spans a direction in the subspace, so of course you can only recover it up to scale. So let's say the goal is to output some w such that <w, v_0>^2 >= 0.9 * ||w||^2 * ||v_0||^2 — that is, w, up to scale, is very correlated with v_0. This is a concrete problem, so let's not talk too much more about motivation and just try to solve it.
I am pretty sure that if we just wrote down the natural sum-of-squares program for this problem directly and solved it, we would get the same kind of algorithm I am about to present, and that would arguably be nicer; let me leave that as an exercise — to show that you don't have to be creative in how you apply sum of squares to this problem. The way I will present it uses some creativity, in the sense that we will find a proxy for sparsity; but I believe the creativity is really only needed for the analysis, not for the actual algorithm. (Question: do you get a basis for the span? You get a basis for the subspace — not this particular basis, and certainly not v_0; that's the goal. You get an arbitrary basis; if you prefer, you can think of it as getting lots of samples of random vectors from the subspace.)
So, in general, what we want to do is maximize "spikiness(v)" subject to v being in the subspace. Spikiness should capture sparsity: here is a dense vector and here is a sparse one, and we want some function that takes a higher value on sparse vectors and a lower value on dense ones. It turns out a nice way to get such a function is to compare the q-norm to the p-norm for q > p. Recall the p-norm of a vector is ||v||_p = (sum_i |v_i|^p)^{1/p}; the higher the power p, the more weight you give to the large entries, so spiky vectors get relatively larger high-p norms. So a proxy for sparsity is the ratio ||v||_q / ||v||_p, and different choices of p and q give different proxies. People have actually looked at several of these.
Choice one: infinity versus one, that is, maximize ||v||_infinity / ||v||_1 over the subspace. The good thing about this choice — this might not be obvious, so I'll leave it as an exercise — is that it is computationally easy: you can efficiently find the vector v in the subspace that maximizes this ratio. The thing that's bad about it is that it requires epsilon to be smaller than, I think, about 1/sqrt(k), if I remember correctly; it is not a great proxy for sparsity, because the maximizer might not be v_0 — in fact you can show it will not be v_0 — unless v_0 is very, very sparse. But it is computationally easy, so it is a reasonable choice.
Choice two: the ratio ||v||_2 / ||v||_1. This solves that problem: it succeeds as long as epsilon is smaller than some constant, for essentially any k smaller than n. Unfortunately, this one is hard — there are not so many hardness results; it is more that we just don't know how to do it efficiently, though there may also be hardness results. This, by the way, is very related to the question of certifying the restricted isometry property of a subspace, if you have heard of that in the context of compressed sensing; it is a question people have looked at. Also, people in practice actually use heuristics based on this two-to-one ratio, and sometimes they work and sometimes they don't.
What I am going to talk about today is a third choice: the four-norm versus the two-norm. To some extent this seems like a somewhat idiotic thing to do, because it takes the worst of both worlds: it is not as good as the two-to-one ratio — it can handle this only if k is smaller than about sqrt(n) (there is a more general condition it handles, something like epsilon smaller than roughly n/k^2, but I always forget the exact formula) — and in the worst case it is also computationally hard.
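Here is a tiny numpy illustration of these proxies (my own toy vectors): for a sparse 0/1 vector versus a flat vector, each of the three norm ratios is larger on the sparse one.

```python
import numpy as np

n, eps = 10000, 0.01
sparse = np.zeros(n); sparse[: int(eps * n)] = 1.0   # eps*n nonzero coordinates
dense = np.ones(n)                                    # completely flat vector

def ratio(v, q, p):
    return np.linalg.norm(v, q) / np.linalg.norm(v, p)

for name, v in [("sparse", sparse), ("dense", dense)]:
    print(name, ratio(v, np.inf, 1), ratio(v, 2, 1), ratio(v, 4, 2))
# in every column the sparse vector wins, by more for the more extreme norm pairs
```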
But what we are going to show is that it is easy on average: for these random subspaces we can solve the four-versus-two maximization using sum of squares, and we get better results than what was previously known for computationally efficient algorithms. (Question about the parameters: say epsilon is 0.001. For the two-versus-one proxy, as long as k is sufficiently smaller than n you can handle epsilon that is pretty large; here you need k to be smaller than about sqrt(n), so this is a little worse — let me erase the exact formula, since I don't remember it. The two-versus-one maximization is almost optimal in the sense that it can detect even mildly sparse vectors and the subspace dimension can be very large; of course if k equals n then you cannot recover v_0, because the subspace will contain sparse vectors other than v_0. So if you could do the two-versus-one maximization you could basically solve the problem completely; with the four-versus-two maximization you solve it in a more limited regime — but still better than what was known before.) (Another question: is this something like an entropy? Yes — like a Rényi entropy; it measures how flat or spiky the vector is.)
You can try various proxies; in some sense what you want is a proxy that you are able to analyze. This is not necessarily the best one, and I think it is a great open question to find a different proxy that we can optimize efficiently — maybe in the average case — and that gives better performance. There is no evidence that this is the best we can do; it is just something we can do. In particular it seems somewhat strange that we can do this with degree-four sum of squares but couldn't do something better with degree-eight sum of squares — but we don't know. Maybe this is the open question of the week: what we know how to do is, say, epsilon = 1/1000 and k = sqrt(n)/1000; the open question is to push beyond that — say, to handle k much larger than sqrt(n) — which we don't know how to do efficiently.
So now let me describe the algorithm. The algorithm gets a basis — some arbitrary basis; it doesn't matter which one, and you'll see exactly how we use it. (And no, it doesn't get v_0 — that's the goal, to find it.) One thing you can prove is that for a random subspace — as long as k is sufficiently smaller than n, if you take k random vectors — they are not going to span any sparse vector. So if you have a way of finding the sparsest vector, or even just a somewhat sparse vector, in the subspace, it is going to be close to v_0, because there simply is no other vector in the subspace that is even somewhat sparse. OK, so here is the algorithm. The input is A, this basis — think of it as an n-by-(k+1) matrix whose columns form a basis — and we are going to maximize ||A x||_4^4, which is a degree-four polynomial, over the unit sphere.
Concretely, what we do is: step one, run degree-four SOS over the sphere to obtain a pseudo-distribution mu such that the pseudo-expectation of ||A x||_4^4 is at least alpha, where alpha is the largest value achievable here — think of alpha as ||v_0||_4^4, for v_0 rescaled to be a unit vector. Step two, sample w using the quadratic sampling lemma. This is a little disappointing — I'm still using only the quadratic sampling lemma; I promise that eventually we'll see an algorithm that uses something else. What is interesting here is that even though we only match the first two moments of mu when sampling, for the algorithm to work we actually need to run degree-four sum of squares; I want to claim that degree-two sum of squares will not succeed. So even though we sample from just two moments, we need the pseudo-distribution to come from (at least) degree four. We have actually already seen this in the ARV algorithm, where the rounding only used the quadratic sampling lemma, but we needed the squared triangle inequality to hold — and the reason it held was that the pseudo-distribution came from degree four. (Question: does it even make sense to talk about degree two here, when the objective has degree four? Not exactly — but you can try to come up with degree-two SDPs for this kind of sparse vector problem, and you can show that they will not succeed.)
OK, so now we want to analyze this. The analysis will go as follows: we first pretend that mu is an actual distribution, and then we apply Marley's corollary and say that, as long as our analysis didn't use things we weren't supposed to use, everything is going to be all right — even though mu is a pseudo-distribution and not an actual distribution. So the first thing we want to do is just prove that the algorithm succeeds when mu is an actual distribution.
We will prove the following. Define V' to be the span of v_1, ..., v_k (without v_0). Lemma 1: with high probability (assuming k is much smaller than sqrt(n)), every w in V' satisfies ||w||_4^4 <= (C/n) * ||w||_2^4 for some constant C. To see that this is an interesting bound, compare it with v_0. Let me not scale v_0 to be a unit vector; for simplicity, think of v_0 as a 0/1 vector which is one on an epsilon fraction of the coordinates. Then ||v_0||_4^4 = epsilon*n, while ||v_0||_2^4 = (epsilon*n)^2 = epsilon^2*n^2, so ||v_0||_4^4 = (1/(epsilon*n)) * ||v_0||_2^4. So when epsilon is small, v_0 violates the bound of Lemma 1: v_0 has a much bigger four-norm relative to its two-norm than anything in the span of v_1, ..., v_k. That also answers Greg's question from before: Lemma 1 implies that there is not going to be any other sparse vector in the span of these vectors. So at least intuitively it means we are doing the right thing — when we maximize this quantity we are not going to find something that has nothing to do with v_0. (Question: this is with high probability? Yes, with high probability over the choice of v_1, ..., v_k.)
Lemma 2 says the following: for every w in V (the full span, including v_0), if ||w||_4 / ||w||_2 >= (1/2) * ||v_0||_4 / ||v_0||_2, then w must be very correlated with v_0: <w, v_0>^2 >= (1 - O(epsilon)) * ||w||_2^2 * ||v_0||_2^2. So basically Lemma 1 tells you that you cannot get a vector with a large four-norm out of v_1, ..., v_k, and Lemma 2 tells you that if a vector in the span of v_0, v_1, ..., v_k has a large four-norm, then it really came from the v_0 part. Lemma 2 actually follows very easily from Lemma 1, so let's first show that Lemma 2 is implied by Lemma 1.
Lemma 1 and Lemma 2 together tell us that the algorithm works if mu were an actual distribution: if mu is an actual distribution over vectors satisfying the program's constraints — in particular, vectors whose four-norm is basically the same as v_0's — then it is a distribution over vectors that are all very, very correlated with v_0. That means the second-moment matrix of this distribution is very close to the rank-one matrix v_0 v_0^T, so if we sample a vector using the quadratic sampling lemma we are going to get a vector very close to v_0. In fact, we don't even need the quadratic sampling lemma here: step two could just as well take the second-moment matrix of mu and output its top eigenvector — that would also work, because the second-moment matrix of mu is very close to a rank-one matrix whose top eigenvector is v_0.
(Question: what is A relative to the basis? A is an arbitrary basis of the subspace — that's the input; let's say an arbitrary orthonormal basis; if you are given any basis you can always orthonormalize it. The reason we want it orthonormal is that then ||A x||_2 = ||x||_2, so "x on the unit sphere" means that A x is a unit vector, and maximizing ||A x||_4^4 over the sphere is literally the same as maximizing ||v||_4^4 over unit vectors v in the subspace. In that sense you can ignore A and think of the program as operating directly on the subspace.)
OK, so let me show you why Lemma 1 implies Lemma 2. Suppose w is a unit vector in V, and let's also assume v_0 is a unit vector — it's easier to think about. Write w = alpha * v_0 + v', where v' is in the span of v_1, ..., v_k. What we want to show — think of epsilon as going to zero; I'm not going to try to maintain all the constants — is that alpha is 1 - o(1). So what do we know? We know that the four-norm of v_0 is much, much larger than the four-norm of v': v_0 is a unit vector, v' is a vector of two-norm at most one, and by Lemma 1 the four-norm of anything in V' with two-norm at most one is tiny compared to the four-norm of a unit sparse vector. And we also know, by our assumption, that the four-norm of w is at least, say, half the four-norm of v_0. So now let's just write the inequalities.
We get that (a constant times) ||v_0||_4 <= ||w||_4 = ||alpha * v_0 + v'||_4 <= alpha * ||v_0||_4 + ||v'||_4, using the triangle inequality for the four-norm; and since ||v'||_4 is negligible compared to ||v_0||_4, dividing through by ||v_0||_4 gives a lower bound on alpha. With the constant 1/2 in the assumption this only bounds alpha away from zero — for some reason I believe the full statement is still true in that regime — but let me just make the assumption stronger: if ||w||_4 is at least 0.99 times ||v_0||_4, then we get alpha >= 0.99 - o(1), and that is good enough for us, because in our case the vectors w we care about will essentially have the same four-norm as v_0 itself, so we can assume the ratio is 0.99 rather than one half.
(Question: didn't we already prove a triangle inequality? We did prove a triangle inequality before, but that was for the squared two-norm — the "squared triangle inequality" from the ARV analysis; this one is for the four-norm.)
So the main thing we used is the triangle inequality for the four-norm. Now, if we have a sum-of-squares proof of Lemma 1, then to show that the whole argument also works for pseudo-distributions, the only additional thing I need is a sum-of-squares proof of the triangle inequality for the four-norm, and let me put that proof "in a box." By the way, if you look at the lecture notes, there is a weaker statement that is somewhat easier to prove and is already good enough for what we want, although you can also prove the full statement.
What does the triangle inequality for the four-norm even mean as a polynomial statement? The issue is that ||x + y||_4 <= ||x||_4 + ||y||_4 is not a polynomial inequality — four-norms involve fourth roots. What you can actually show is a statement of the following form: for every pseudo-distribution over pairs (x, y), the pseudo-expectation version (E~ ||x + y||_4^4)^{1/4} <= (E~ ||x||_4^4)^{1/4} + (E~ ||y||_4^4)^{1/4} holds — which I think is what we need; the exercises state exactly the version that is needed. As is typical with these things, the hardest part is even phrasing what you want as a polynomial statement; once you can do that, proving that the relevant polynomial is a sum of squares is not that hard. It also turns out that what you need for this argument is something even simpler, because you cannot afford to lose various constants.
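Just to make the "phrase it as a polynomial inequality, then certify it by SOS" recipe concrete, here is a toy example of my own (not the statement from the notes or exercises): the scalar inequality (a+b)^4 <= 8a^4 + 8b^4, together with an explicit PSD Gram matrix certifying it.

```python
import numpy as np

# Toy SOS certificate: 8a^4 + 8b^4 - (a+b)^4 = m^T Q m with m = (a^2, ab, b^2)
# and Q PSD, i.e. a degree-4 sum-of-squares proof of the scalar inequality.
Q = np.array([[ 7.0, -2.0, -5.0],
              [-2.0,  4.0, -2.0],
              [-5.0, -2.0,  7.0]])
print(np.linalg.eigvalsh(Q))                       # all >= 0 (up to numerical error): Q is PSD

rng = np.random.default_rng(8)
for a, b in rng.standard_normal((5, 2)):
    m = np.array([a * a, a * b, b * b])
    lhs = 8 * a**4 + 8 * b**4 - (a + b) ** 4
    assert np.isclose(lhs, m @ Q @ m)              # the polynomial identity holds
```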
So that leaves us with Lemma 1, and let me give you some intuition for why Lemma 1 should be true. The way to think about Lemma 1 is the following. Define a polynomial p on R^k by p(x) = ||sum_j x_j v_j||_4^4. Then Lemma 1 is the statement that, with high probability over the random choice of v_1, ..., v_k, and assuming k is much smaller than sqrt(n), we have p(x) <= (C/n) * ||x||_2^4 for every x (the constant being something like 10) — this is just another way of saying it, since for nearly orthonormal v_j the two-norm of sum_j x_j v_j is essentially ||x||_2. And I can already tell you that what will actually happen is that we prove something stronger: that (C/n) * ||x||_2^4 - p(x) is a sum of squares. (Yes — "this minus that" being a sum of squares; it is an inequality with an SOS proof.) That stronger statement already basically gives us the pseudo-distribution version.
So what does p really look like? Let B be the n-by-k matrix whose columns are v_1, ..., v_k, and let b_i in R^k denote its i-th row. Then p(x) = sum_{i=1}^n <b_i, x>^4. Another way to think about it: p(x) = (x tensor x)^T M (x tensor x), where M = sum_{i=1}^n (b_i b_i^T) tensor (b_i b_i^T), viewed as a k^2-by-k^2 matrix. So what we want to show — and this will give us not just the lemma but also its sum-of-squares version — is a spectral bound: roughly, that this matrix (or rather, a suitable matrix representation of the same polynomial) has spectral norm O(1/n). We are free to move between different matrix representations of the same polynomial, and we actually need to use that freedom to get this bound; but let me just give you some intuition for why O(1/n) is roughly the right answer. Each v_j is a unit vector in a random direction, so its entries are about 1/sqrt(n) in magnitude, which means <b_i, x>^4 is about ||x||^4 / n^2 for a typical coordinate i; summing over the n coordinates gives about ||x||^4 / n.
And the idea for proving it is basically the following: M is a sum of n independent random matrices of size k^2-by-k^2, so we are looking at the average of n independent random matrices, and there is a tool — the matrix Chernoff bound — which tells us how much such an average deviates from its expectation; and here n is much, much larger than k^2.
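Here is a quick numpy sanity check of the quantities in Lemma 1 (my own toy experiment; the parameters are my choices). Note that this only samples random directions in the subspace rather than certifying the bound for all of them — which is exactly what the SOS/spectral argument is for.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, eps = 4000, 20, 0.01                          # my parameter choices, with n >> k^2
V = rng.standard_normal((n, k)) / np.sqrt(n)        # columns v_1..v_k: random, roughly unit vectors

# empirically, vectors in span(v_1..v_k) have a tiny 4-vs-2 norm ratio, about 3/n ...
ratios = []
for _ in range(2000):
    w = V @ rng.standard_normal(k)
    ratios.append(np.sum(w ** 4) / np.sum(w ** 2) ** 2)
print(max(ratios), 3.0 / n)

# ... while the planted sparse vector's ratio is 1/(eps*n), much larger when eps is small
v0 = np.zeros(n); v0[: int(eps * n)] = 1.0
print(np.sum(v0 ** 4) / np.sum(v0 ** 2) ** 2, 1.0 / (eps * n))
```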
These matrices — after we subtract out what needs to be subtracted (we might need to subtract the identity-like part, but that's OK) — have expectation zero, and what we need is a concentration bound. If you walk through the matrix Chernoff bound, it turns out that because n is larger than k^2 you actually get what you want. (Question about where the k^2 comes from: what you get out of the matrix Chernoff bound is something like exp(-epsilon^2 * n / k^4) — let me not try to redo the calculation here because I'll get it wrong, but it is in the lecture notes.) (Another good point from the audience: in some sense what we are showing is a stronger version of Lemma 1 — a statement that lets you take a union bound not just over rank-one matrices of the form x tensor x, but over all matrices, period. And that is exactly what allows you to turn it into a sum-of-squares proof, because you can bound the quantity by a spectral argument.) This is in the lecture notes; by the way, the proof there loses a log n factor — you can look at the actual paper for the proof that avoids the log n, which is really a pain. But you can actually prove it, and eventually what you get is a pseudo-distribution (SOS-certified) version of Lemma 1; then you apply Marley's corollary and conclude that the whole argument also works for pseudo-distributions. OK, I know this was a little sketchy, but let me stop here so I can talk a little about tensor decomposition.
I want to talk about tensor decomposition at least at a high level; I might come back to it with more details next week and actually show the proofs. I had hoped to also get to sparse coding, but we will probably push sparse coding to next week, along with QMA(2); for now I just want to give you the high-level idea of how you do tensor decomposition with sum of squares.
For tensor decomposition, the input is a tensor T, which you can think of as a vector in R^{n^d}. Let's look at the symmetric case: T = sum_{i=1}^r a_i^{tensor d}. Generally you might also have some noise added to T; let's ignore the noise for now. The goal is to recover the set of vectors a_1, ..., a_r. Generally, if you keep r fixed and increase d, you get more and more data and the problem becomes easier — there is this general phenomenon that "more data beats better algorithms": the bigger d is, the easier the problem.
The larger d is, the easier the problem becomes. In particular, there is a very nice generic algorithm from the 70s, Jennrich's algorithm, for the case r = n and d = 3. It does something very simple: it defines a matrix M by contracting T with a random vector. Think of T as an n²-by-n matrix and let v be a random vector in R^n; then M = Σ_{i=1}^r ⟨a_i, v⟩·a_i^{⊗2}. Let's start with the case where the a_i's are orthogonal to each other. If they are orthogonal and you hit them with this random vector v, the coefficients ⟨a_i, v⟩ are most likely all distinct, so if you compute the eigendecomposition of M, the eigenvectors will be exactly the a_i's, because all the eigenvalues are distinct. So this basically solves the problem in the orthogonal case, and there are ways (whitening) to make the vectors orthogonal even when they are not.

This algorithm works as long as r is at most n. But, say for random vectors, what we could hope for is r up to roughly n². Why do I say that? Because a degree-3 tensor gives us n³ observations, and we have r·n parameters: r vectors, each n-dimensional. So you would think the components should be recoverable up to r ≈ n². Jennrich's algorithm gives you r up to n, and SOS can give you r up to roughly n^{1.5} (a little smaller than n^{1.5}), and there are some other advantages of the SOS-based algorithm besides this. So basically, SOS lets us use less data than the simple Jennrich-type algorithm. And there is an even simpler algorithm, which we may come back to next week: when d is large (roughly log r), you can think of T as an n²-by-n^{d−2} matrix, hit it with some v^{⊗(d−2)}, and it basically becomes close to a rank-one matrix, so you can read off one of the a_i's, then repeat and read them all off.

So the idea in using SOS for this problem is the following, and the way I like to think about it is that it's a little like the movie Inception, the one where you have a dream, you hold an object in the dream, and when you wake up you are still holding that object. This is basically what we want the algorithm to do. We are given these n^d observations; we define a pseudo-distribution μ that matches these observations; and then we can look at T̃, the pseudo-expectation according to μ of a^{⊗D}, where capital D is much larger than the small d we actually observed. So we get a fake tensor that has more moments than the ones we observed, and now we can run, say, Jennrich's algorithm, or some other algorithm, on it.
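Here is a minimal sketch of Jennrich's algorithm in the orthogonal case just described (my own illustration, assuming d = 3, r = n, orthonormal components, and no noise; not an implementation from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 30                            # orthogonal case: r <= n

# Random orthonormal components a_1, ..., a_r (columns of A).
A = np.linalg.qr(rng.standard_normal((n, r)))[0]

# Observed 3-tensor T = sum_i a_i (x) a_i (x) a_i.
T = np.einsum('ni,mi,ki->nmk', A, A, A)

# Jennrich: contract the last mode with a random vector v, so that
# M = sum_i <a_i, v> a_i a_i^T, then eigendecompose.
v = rng.standard_normal(n)
M = np.einsum('nmk,k->nm', T, v)
eigvals, eigvecs = np.linalg.eigh(M)     # M is symmetric

# With probability 1 the <a_i, v> are distinct, so each eigenvector is
# (up to sign) one of the a_i's.
recovered = eigvecs                      # columns are the recovered components

# Check recovery up to permutation and sign: every true component should
# match some recovered column up to sign.
match = np.max(np.abs(recovered.T @ A), axis=0)
print(np.min(match))                     # should be very close to 1
```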
How does Jennrich's algorithm scale? If r is n^c, then d needs to be about 2c + 1, roughly speaking. One way to see a weaker version of this is by reshaping the tensor: if r is, say, n², then instead of thinking of n² vectors in n dimensions, you can think of the vectors a_i ⊗ a_i as n² vectors in n² dimensions. So, for example, a 6-tensor over dimension n can be viewed as a 3-tensor over dimension n², and you can run Jennrich there. Intuitively this reshaping gives you d = 3c, and it turns out you can do it with d = 2c + 1; but for now just think of Jennrich's algorithm as requiring, when the rank is n^c, something like n^{3c} observations.

So the idea is: if you don't get as many observations as you deserve, you are going to dream about those observations. You set up this pseudo-distribution, and in your mind you now have a bigger tensor; you run Jennrich's algorithm on it, and it gives you something, some vectors a_1, ..., a_r. You take this object that you were still holding when you woke up from the dream, and you hope that it is actually a good tensor decomposition. And it turns out to work.

What do you really need for this? You need two components. One is an identifiability, or uniqueness, theorem: you need to show that the observations determine the parameters, that the data you have observed pins down the a_i's. The other is an analysis of, say, Jennrich's algorithm, which says that tons of observations imply efficient recovery. For Jennrich's algorithm we have such an analysis: given enough observations, you recover. And we also need to prove the identifiability statement at the parameters we are hoping for: when I said the hope is r = n², we need to prove that for r = n² the n³ observations are enough to determine the parameters.

What turns out to be the case is the following: if you have SOS proofs of both of these statements, then this SOS algorithm will succeed. Basically, if you have a sum-of-squares proof that the observations determine the parameters, it also implies that the larger, dreamed-up tensor will be consistent with the original tensor you had; and if in addition you have an SOS analysis of Jennrich's algorithm, then running it on the dreamed-up tensor will actually recover the components. And it turns out that while this identifiability theorem, observations determine parameters, is true as long as r is smaller than about n^{d−1}, SOS at low degree can only prove it when r is smaller than about n^{d/2}. We have some feeling that maybe this is actually the limit of efficient computation, that it really can't be done efficiently when r is much larger than n^{d/2}. So we can't reach the information-theoretic bound using SOS, but we can do better than what was known before.
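To summarize the counting that keeps coming up (as I read the discussion above):

```latex
% A symmetric d-tensor gives about n^d numbers, the unknown components have
% about r*n degrees of freedom, so one can hope to recover up to r ~ n^{d-1}.
\underbrace{n^{d}}_{\text{observations}} \;\gtrsim\; \underbrace{r\,n}_{\text{parameters}}
\quad\Longleftrightarrow\quad r \;\lesssim\; n^{\,d-1}.
```

Schematically, then: identifiability holds up to r ≲ n^{d−1}; low-degree SOS proofs of it are known up to roughly r ≲ n^{d/2}; and plain Jennrich-type algorithms need d at least 2c + 1 for r = n^c, i.e. r ≲ n^{(d−1)/2}.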
So let me not go through the analysis of Jennrich's algorithm itself; instead let me tell you a little about how we prove this "observations determine parameters" statement using SOS. The degree of SOS that you need turns out to be roughly the capital D. In the first algorithm based on this paradigm we did not know how to do the Jennrich part, so we used the even simpler algorithm from before, which needs D around log r; that was a work of mine with Jon Kelner and David Steurer. There is more recent work of David Steurer with Jonathan Shi and Tengyu Ma where they use Jennrich's algorithm and manage to do it in polynomial time, since there D is a constant. But the identifiability part is actually not that hard; it really only requires a fairly small degree, so let me show it to you.

Let me look at the degree-four case, which is fairly simple. By the way, these kinds of statements are never true for every set of vectors; we need to use something about genericity, so let me just assume the vectors are random. So here is a theorem: suppose a_1, ..., a_r are random unit vectors, for some r a bit smaller than n². Then Σ_i a_i^{⊗4} determines a_1, ..., a_r. I don't even want to say "there exists a possibly inefficient algorithm," just that it determines them, where by "determines" I mean that you can recover them up to permutation and up to some small error. That is not exactly the formal definition; making it precise, and connecting it to the questions we asked before, would be the first part of the actual proof.

The proof is basically the following. The claim is: if x is the arg max of ⟨T, x^{⊗4}⟩ over unit vectors, and say the maximum value is 1, then x is essentially one of the a_i's. That tells you that you can recover at least one of the a_i's, and you can actually show, using this, that you can recover all of them by solving this problem of maximizing the polynomial over the sphere. (Yes, this is degree four.) So, given T, this is an identifiability theorem: it tells you that if you could maximize this polynomial over the sphere, it would give you what you want.

The idea is simple. You write ⟨T, x^{⊗4}⟩ = Σ_i ⟨a_i, x⟩^4 and bound it by max_i ⟨a_i, x⟩² times Σ_i ⟨a_i, x⟩²; in general we may have to break the exponents up slightly differently, and I will probably get them wrong if I try on the board, but let's see. If r were equal to n, the a_i's would essentially be orthonormal, so Σ_i ⟨a_i, x⟩² would be at most ‖x‖² = 1, and we would get that ⟨T, x^{⊗4}⟩ is at most max_i ⟨a_i, x⟩². On the other hand, if x equals one of the a_i's, this quantity is at least 1. So we are good. And I think what happens when r = n^{2−ε} is that we break the fourth power into exponents ε and 4 − ε and make essentially the same kind of argument, bounding the polynomial by the same sort of maximum.
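For the record, here is the clean special case written out (r ≤ n with orthonormal a_i; the general r ≤ n^{2−ε} statement replaces the squares by a Hölder-type split with exponents ε and 4 − ε, as just described):

```latex
% Orthonormal case of the identifiability bound.
\sum_{i=1}^{r}\langle a_i,x\rangle^{4}
\;\le\;\Bigl(\max_{i}\langle a_i,x\rangle^{2}\Bigr)\sum_{i=1}^{r}\langle a_i,x\rangle^{2}
\;\le\;\max_{i}\langle a_i,x\rangle^{2}\,\|x\|^{2}
\;=\;\max_{i}\langle a_i,x\rangle^{2}
\qquad(\|x\|=1).
```

So if the left-hand side is at least 1 - δ, then ⟨a_i, x⟩² ≥ 1 - δ for some i, i.e. x is O(√δ)-close to ±a_i.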
It is actually a pretty similar argument to the one we made before: that the maximum over unit x of Σ_{i=1}^r ⟨a_i, x⟩^4, when r is smaller than n², should be at most roughly 1. The kind of calculation you have to do here is the kind of thing I always get confused about, so let's not do it now; it can be checked, and you can treat it as an exercise.

The point is the following. You can show that, on the one hand, ⟨T, a_i^{⊗4}⟩ is at least 1 for every i; and on the other hand, that if ⟨T, x^{⊗4}⟩ is at least 1 − δ, then max_i ⟨a_i, x⟩ is at least 1 minus something related to δ. So all the a_i's maximize this polynomial, and everything that approximately maximizes it is close to one of the a_i's.

Now, if you go and look at this proof, the hypothetical proof that I did not actually show you, you can already see that if it is a proof along these lines, it takes some norms to the ε power and some norms to the 4 − ε power, so clearly it is going to use some kind of Hölder inequality. If I were one of those people who can do Hölder on the blackboard without getting the exponents confused, I would have shown it to you. So this is a little like Inception again: you can dream of a proof that I have given you, and the object you take out of that dream is that this hypothetical proof used Hölder; and since it used Hölder, it is a sum-of-squares proof. Conditioned on believing that the proof exists, you can believe that it is an SOS proof.

Therefore, what this proof tells you is: if you let μ be a pseudo-distribution over the unit sphere such that the pseudo-expectation of this polynomial under μ is at least 1 or so, then μ has to behave as if it is supported close to the a_i's. But that alone does not immediately help us. Suppose you even have all this, you have the sum-of-squares proof, so you know μ satisfies this property of being close to the a_i's; how do you extract one of the a_i's from μ, given that we only have access to the moments of μ? In some sense we are back at a tensor decomposition problem. The beautiful thing about sum of squares, though, is that now we are at a tensor decomposition problem where, if we are willing to spend more time, we can get more moments, so we can use a tensor decomposition algorithm that requires more data. The standard Jennrich-type algorithms make the trade in one direction: given extra data, they save time. Sum of squares lets us do the opposite: if we are willing to spend time, it lets us manufacture more data even when we don't have it. So the idea is that once you have a sum-of-squares proof of identifiability, you can create the data, starting from the actual tensor T.
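Before moving on, here is a quick numerical sanity check of the identifiability gap above (my own illustration; random unit a_i with r well below n², and note that a random x is of course not the true maximizer, it just shows the typical scale away from the components):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 100, 500                      # r well below n^2 = 10000

# Random unit components a_1, ..., a_r.
A = rng.standard_normal((r, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)

def f(x):
    # f(x) = <T, x^{(x)4}> = sum_i <a_i, x>^4
    return np.sum((A @ x) ** 4)

# At a component: f(a_1) = 1 plus small positive cross terms, so at least 1.
print("f(a_1)      =", f(A[0]))

# At generic unit vectors: roughly 3*r/n^2, much smaller than 1 when r << n^2.
xs = rng.standard_normal((5, n))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
print("f(random x) =", [round(f(x), 3) for x in xs])
```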
What I think I will do is go through this more slowly and carefully; I will also add lecture notes, so it will be in the notes, and I will try to do it next week with a more careful proof. But let me say what the advantages of sum of squares are here. Roughly: if r = n^c, then Jennrich's algorithm needs d to be at least 2c + 1, while SOS, for r = n^c, needs d to be at least 2c. That may not sound like much: it saves a factor of n in the amount of data, and in fact a bit more, because Jennrich's algorithm doesn't really handle non-integer c, so the saving is somewhere between a factor of n and n², plus avoiding that integrality issue in the data. It's not trivial, but it's not exponential either.

This SOS-type approach also buys better robustness and control over the noise. In particular, one thing we are going to use is that it can recover even if T = Σ_{i=1}^r a_i^{⊗d} + E, where all we need is that E, viewed say as an n²-by-n² matrix (for d = 4), has small spectral norm, say at most 1 or so. To get a feel for what this spectral-norm condition means, compare it with the way people typically measure noise, the Frobenius norm. T is like an n^4-dimensional vector, i.e. an n²-by-n² matrix, and a spectral norm of at most 1 corresponds to a Frobenius norm as large as n: for example, an n²-by-n² matrix with ones on the diagonal has spectral norm 1 but Frobenius norm n. So the interesting thing here is that, in Frobenius norm, that is, as a vector, the noise E can be much bigger than any one of the a_i^{⊗4}'s, and you can still recover. In some sense this is recovery in a regime where, in a natural measure, the noise is larger than the signal, and this turns out to be useful in the sparse coding application.

Yes? Ah, right, this T need not be symmetric. In general it isn't; I write everything as if it were symmetric purely for notation, because otherwise I would have to carry around a_i ⊗ b_i ⊗ c_i and so on. All of these things generalize to the non-symmetric case, and in an actual application the reduction is often easy.
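To see concretely how much noise a spectral-norm condition allows, here is a tiny comparison (my own illustration of the point above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20

# Each signal component a_i^{(x)4}, flattened to an n^2-by-n^2 matrix,
# has Frobenius norm 1 when a_i is a unit vector.
a = rng.standard_normal(n)
a /= np.linalg.norm(a)
v = np.outer(a, a).reshape(-1)                 # a tensor a, as an n^2 vector
S = np.outer(v, v)                             # a_i^{(x)4} as an n^2-by-n^2 matrix
print(np.linalg.norm(S, 'fro'))                # ~ 1.0

# A noise matrix with spectral norm 1 can have Frobenius norm as large as n.
E = np.eye(n * n)                              # spectral norm 1
print(np.linalg.norm(E, 2), np.linalg.norm(E, 'fro'))   # 1.0 and n
```

So a single allowed noise term can be n times larger than every signal component in the "vector" (Frobenius) measure, which is why this robustness guarantee is not trivial.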
The last two parts were maybe a little rushed; I hope the lecture notes will help, and I will update the tensor decomposition lecture notes, hopefully this weekend. For next week, what I am planning is: a bit more of this tensor decomposition material, in more detail, then the sparse coding application, and then QMA(2); and if I learn enough over the next weeks I think I will manage to cover the rest as well, and if not, it will be in the lecture notes. By the way, there is the reading group talk: pizza starts in 15 minutes and the talk starts in half an hour. It is by Larry Guth; it is a pure math talk, but he is a very good mathematician and a very good expositor, and these seem like the kind of advances that might find applications in theoretical computer science. You are all welcome to go; of course there is no obligation, and even if you do go to the reading group you can always feel free to leave after the break. It is in building 34; I forgot the room, I don't know if one of you remembers it, and I think I even put it in the announcement with the pizza. Anyway, I guess I will see some of you there.