OK, thanks for coming to the final session. Let's begin. A quick recap: what did we do yesterday? We looked at tomography and other forms of learning quantum states. One fundamental drawback of the things we spoke about yesterday was, for example, that for tomography the sample complexity was essentially 2 to the 2n, i.e. exponential in the number of qubits. And then we looked at other models of learning, like shadow tomography and PAC learning, where the sample complexity was exponentially better, but the time complexity was exponentially bad. So that naturally motivates the question: is there a way to efficiently learn quantum states? Or are there interesting classes of quantum states that one can learn? This is the motivating question for this talk. And in this talk, I'm going to give you a few examples of quantum states that one can learn time efficiently, or at least sample efficiently. The first thing I'm going to talk about is Gibbs states of local Hamiltonians. I think Andras has motivated this topic already, so that saves me a little bit of time. The next thing is stabilizer states; those are output states of Clifford circuits. There's been a recent spate of works that learn states going beyond just stabilizer states, which I'll also discuss. And finally, I'll talk about some recent work that I've done on statistical learning. Good, let me begin. So the first thing I'll talk about is the Hamiltonian learning problem. Given Gibbs states of a Hamiltonian, the goal is to learn the Hamiltonian. This is essentially what Andras was discussing in the morning: he was trying to prepare the thermal state, i.e. the Gibbs state, of a Hamiltonian. And the question is: let's just say he was able to prepare it, can you verify it? So let me be a little more precise. Think of H as a k-local Hamiltonian acting on n qubits.
And I'm going to be assuming geometric locality. Think of {E_i} as an orthonormal basis for the set of all operators on C^{2^n x 2^n}; then you can write the Hamiltonian H as a decomposition in this orthonormal basis with coefficients mu_i. I'll come to one particular orthonormal basis, the Pauli decomposition, in a couple of slides; but for now, just think of the Hamiltonian H written in this format. Suppose somebody gives me T copies of the Gibbs state. Think of beta as the inverse temperature, and think of rho as the Gibbs state corresponding to the Hamiltonian H at this inverse temperature beta. The goal is to learn these coefficients mu, i.e. the coefficients of this Hamiltonian H in some orthonormal basis. You can pick the Pauli basis if you like: think of E_i as ranging over the k-local Pauli basis, and mu_i as the coefficients of the decomposition. So the goal is: given copies of the Gibbs state corresponding to this Hamiltonian, learn the coefficients up to, say, L2 error at most epsilon. That's the goal. And the motivation is, as Andras was saying in the morning, preparing this Gibbs state is an extremely important problem for near-term devices. But let's just say we managed to solve that: we managed to prepare a Gibbs state of a Hamiltonian. How do you verify it? Let's say somebody ran their quantum computer, prepared a quantum state, and claimed that it was the Gibbs state of a Hamiltonian at a certain temperature. Given copies of the state, how do you verify that they actually prepared the state? So that's one motivation. And in machine learning, there's been a plethora of works that have looked at graphical models and learning distributions from graphical models. Our work is a quantum analogue of that classical line of work, but with a lot more subtlety in the quantum setting. And it's also just natural.
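To make the setup concrete, here is a tiny numerical sketch (with made-up coefficients and terms, not anything from the talk): a 2-local, 2-qubit Hamiltonian H = sum_i mu_i E_i and its Gibbs state rho = e^{-beta H}/Z_beta. The learner must recover mu from copies of rho alone; here I only check that mu is well defined via the orthogonality of the E_i.

```python
import numpy as np

# Illustrative 2-qubit example: H = mu_1 Z⊗I + mu_2 I⊗Z + mu_3 Z⊗Z.
I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
E = [np.kron(Z, I2), np.kron(I2, Z), np.kron(Z, Z)]  # local terms E_i
mu = np.array([0.7, -0.3, 0.5])                      # unknown coefficients mu_i
beta = 1.0

H = sum(m * e for m, e in zip(mu, E))
w, V = np.linalg.eigh(H)                             # H is Hermitian
rho = V @ np.diag(np.exp(-beta * w)) @ V.conj().T    # e^{-beta H}
rho /= np.trace(rho)                                 # divide by Z_beta

# Since the E_i are orthogonal under the trace inner product, the
# coefficients are mu_i = tr(E_i H) / tr(E_i^2); the learner, however,
# only sees copies of rho, never H itself.
recovered = np.array([np.trace(e @ H).real / np.trace(e @ e).real for e in E])
```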
For example, these coefficients could encode properties of the Hamiltonian: electronic structure, magnetism, and so on. So if we could learn these coefficients mu, maybe we could learn some unknown properties of the Hamiltonian from the Gibbs state. So that's the motivation. And the main result that I'll be showing is that you can actually sample-efficiently learn these coefficients mu. When I say sample efficiently, think of n as the number of qubits: I'm essentially doing a version of tomography, but the sample complexity is no longer exponential in n, it's only polynomial in n. There are other parameters whose dependence I'm not getting into, but I'll touch on them in a couple of slides. So this is the main result, and I'll try to give you a proof sketch. Good. So let's see. Let's try to use some techniques that we developed yesterday and see whether the same techniques can solve this Hamiltonian learning question. Recall, I'm given copies of this Gibbs state, e to the minus beta H divided by Z_beta, where Z_beta is the partition function, that is, the trace of e to the minus beta H, and H is k-local, given in this decomposition. The goal is to approximate the coefficients mu. So first, let's just try shadow tomography. You're given copies of a quantum state, and we know that, sample efficiently, we can estimate trace expectation values. So what do you do? For all E_i, compute the expectation values trace of E_i rho_mu, where rho_mu is the Gibbs state, up to epsilon error. And you can use, for example, the algorithm of Aaronson, or Huang-Kueng-Preskill, or Badescu-O'Donnell. All of these are sample efficient.
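As a crude stand-in for step one (this is naive repeated measurement on an invented diagonal example state, not shadow tomography), here is how a single expectation value tr(E rho) gets estimated to within epsilon from measurement outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)

Z = np.diag([1.0, -1.0])
E = np.kron(Z, np.eye(2))                  # observable E = Z ⊗ I
# A diagonal example state rho, i.e. probabilities over the computational basis:
p = np.array([0.5, 0.2, 0.2, 0.1])
exact = float(np.diag(E) @ p)              # tr(E rho) for diagonal rho

shots = 20000
outcomes = rng.choice(4, size=shots, p=p)  # simulated measurement outcomes
estimate = np.diag(E)[outcomes].mean()     # empirical mean of eigenvalues
```

The statistical error decays like 1/sqrt(shots), which is where the 1/epsilon^2 factors in sample complexity bounds come from.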
And they give you approximations e_i of the expectation value of rho_mu with respect to E_i, up to epsilon error. The next question is: now that we have these, how do I recover mu, that is, the coefficients sitting in the exponent here? One thing you can do is: I have all these expectation values, so go back to your lab and try to find a quantum state rho prime that is close to rho_mu, i.e. a rho prime that matches all these expectation values. But the annoying thing is that the rho prime you find in your lab need not even be a Gibbs state; it could be an arbitrary quantum state. So you're given copies of a Gibbs state, you find all these expectation values, and then you find a rho prime that is close to rho_mu, but rho prime is not even a Gibbs state. And if it's not even a Gibbs state, how am I going to approximate the values mu? It's unclear. But maybe there's something more to be done here. Instead of searching over all rho primes in your lab, search over all Gibbs states: consider all rho_lambda of this format, where lambda is the vector of coefficients, and try to make sure that trace of rho_lambda E_i is close to trace of rho_mu E_i for every i. Recall that you have approximations of the latter. So you go over all possible rho_lambdas and try to ensure these match. And it's non-trivial, but you can observe that if the expectation values match exactly, then rho_lambda is equal to rho_mu and lambda is equal to mu. So we're getting somewhere: if you can find a rho_lambda, searching over all possible rho_lambda, such that these trace expectation values are the same, then my claim is that lambda equals mu. OK. But the question is, how do you search over all possible rho_lambdas? This again seems non-trivial. Can you just enumerate over all possible coefficient vectors, say in [0,1]^m, and things like that?
So is there a clean formulation of this problem? Indeed, there is the maximum entropy principle, which captures this search over all possible Gibbs states via the following optimization problem. This is actually fairly simple. It says: maximize, over all possible quantum states sigma (PSD, trace 1), the entropy of sigma, subject to trace of sigma E_i being equal to e_i. Recall, little e_i was the exact expectation value. So try to find a sigma whose expectation values match the e_i's. OK, so let's say we write this optimization problem out; we know all the e_i's, and we maximize the entropy over all quantum states satisfying the constraints. Note that I'm not maximizing over Gibbs states or anything of that form, just over all quantum states, and S of sigma is nothing but the von Neumann entropy. And it's non-trivial to show, but the maximizer of this optimization problem is exactly the unknown Gibbs state you're trying to learn. So suppose somebody magically told you these e_i values exactly. Go back to your lab and run this optimization problem, maximizing the entropy over all quantum states such that trace of sigma E_i equals little e_i. Then I claim that the maximizer is exactly the Gibbs state that you want. OK, we're getting somewhere now. If we know the trace expectation values, we can learn the unknown Gibbs state, and then we know rho_mu exactly. So, as I said: supposing you have matching marginals, the maximizer of the maximum entropy optimization problem from the previous slide is exactly the Gibbs state. But there's a small problem now. You don't have the e_i's exactly; you only have approximations of the e_i's.
Shadow tomography does not give you the trace expectation values exactly; it only gives you epsilon-approximations to them. So e_i was trace of E_i times rho_mu, and what does shadow tomography give you? It only gives you an approximation e_i prime that is epsilon-close to e_i. So the question is: how robust is the optimization problem? Recall, this was the optimization problem from the previous slide, and I said that if you optimize it with the exact e_i's, you get exactly rho_mu. But unfortunately you don't have the e_i's exactly; from shadow tomography you only have the e_i primes. Still, maybe there is something to be said. A simple calculation shows that if rho_mu is the optimum with the exact values and rho_mu prime is the optimum with the approximate values, then the trace distance between rho_mu and rho_mu prime is at most m times epsilon. For constant epsilon this bound is much greater than 1, and these are quantum states, so it would be a trivial upper bound; but if epsilon is tiny, maybe there is something to be said. Still, this is not so great, and beyond that, there's a more fundamental issue here. The question is: even if we had such an approximation, say this quantity were much less than 1, is it good enough to approximate mu by mu prime? And the answer is no. Recall what the goal is. The point is that rho_mu is proportional to e to the minus beta H, where H encodes these coefficients mu. So morally, if I take the log, I should recover the mu's: the log of rho_mu is approximately mu (being hand-wavy here), and the log of rho_mu prime is approximately mu prime. You want to learn mu, so you want to bound the distance between these logs. And the problem is that the log function behaves very badly near 0.
Near 0, two points can be exponentially close while their logs are very far apart; the log function is not Lipschitz there. So even if you get a good approximation of rho_mu in trace distance, taking the log of both quantities could give you something exponentially worse than the bound here, which is not good. So how do we fix this? You can fix it in the following way. The question is: how do I handle this quantity, log rho? Can I show that the difference is small? The idea is to look at the dual of the optimization problem. So this is the primal problem; if you take the dual, you get precisely this problem, where again Z_beta(lambda) is the partition function, that is, the trace of e to the minus beta H(lambda), where H(lambda) is the sum of the lambda_i times E_i. And as I said, we don't have the e_i's exactly, we have the e_i primes. So you solve this dual problem with the e_i primes obtained from shadow tomography. And the question is: how far is the resulting mu prime from mu? That's the question. And there is a technique based on strong convexity that says the following: if you show that the objective function is strongly convex, i.e. its Hessian is lower bounded by some alpha, then you can bound how close mu and mu prime are. So think of the objective function as the log of the partition function plus beta times the summation of lambda_i e_i. My claim is that this function is what we need to look at: if this function is strongly convex, then mu and mu prime are close to one another. That's essentially the definition of strong convexity at work. And that's precisely the statement that we prove in the paper.
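Here is a toy illustration of the dual approach (my own sketch, not the paper's algorithm): one qubit, terms E = [X, Z], and true coefficients mu. We minimize F(lambda) = log tr e^{-beta sum_i lambda_i E_i} + beta sum_i lambda_i e_i by plain gradient descent, using that the gradient is beta (e_i - tr(E_i rho_lambda)). With exact e_i's, the minimizer is mu itself.

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
E = [X, Z]
mu = np.array([0.5, -0.3])   # true coefficients (illustrative values)
beta = 1.0

def gibbs(lam):
    """Gibbs state e^{-beta H(lam)} / Z for H(lam) = sum_i lam_i E_i."""
    H = sum(l * e for l, e in zip(lam, E))
    w, V = np.linalg.eigh(H)
    G = V @ np.diag(np.exp(-beta * w)) @ V.conj().T
    return G / np.trace(G)

def expvals(state):
    return np.array([np.trace(Ei @ state).real for Ei in E])

e_exact = expvals(gibbs(mu))   # the e_i, here taken exact

# Gradient descent on the dual objective F(lam):
lam = np.zeros(2)
for _ in range(3000):
    lam -= 0.5 * beta * (e_exact - expvals(gibbs(lam)))
```

Strong convexity of F is exactly what guarantees that a small error in the e_i's perturbs the minimizer lam only slightly.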
That's the technical part. If you pick f to be the log of the classical partition function, showing that it is strongly convex is standard; but quantumly, the technical part is showing that the log of the quantum partition function is still strongly convex. Once we have this, it immediately follows that mu prime is epsilon-close to mu in the two-norm. So that's the idea. Again: you obtain approximations of trace of E_i rho_mu by shadow tomography. The observation is that the exact e_i's would have given me the Gibbs state exactly, but I don't have e_i's, I have e_i primes. To show that the optimization problem is robust, I go to the dual and establish a property of the objective there, and that shows I obtain a mu prime within epsilon of mu. Let me make a few quick remarks. The first is that our algorithm is not time efficient for generic Hamiltonians, because to evaluate the objective of this optimization problem you need to compute the partition function, which is time-expensive. But the nice feature is that the only thing the algorithm needs from the quantum state is measurement statistics. All you need to do is shadow tomography: once you have expectation values of the Gibbs state, you can learn properties of the Gibbs state. So it's a pseudo-classical algorithm; there's nothing more quantum than step one, estimating the marginals. There is an exponential factor in beta and k, where beta is the inverse temperature and k is the locality. Generically, this is unavoidable: even classically, based on complexity theory, you shouldn't expect an algorithm that's polynomial in all parameters.
And there's been some recent work by Haah, Kothari, and Tang, where they showed that if beta is small, below the critical inverse temperature, they get a sample complexity bound which is very nice; our polynomial bounds are pretty bad here, and theirs is pretty neat. They get a sample complexity bound scaling like log of n divided by beta squared epsilon squared. Any questions? They need to do something more: when beta is below the critical threshold, they can use cluster expansion techniques. So it's morally the same approach, but they use the fact that you're working at small beta to exploit more properties of the Hamiltonian. Note that even in the classical works, where the Hamiltonian is diagonal, the dependence of this distribution learning task is exponential in beta and k. And the claim, I don't know how they prove this, but the claim is that if RP is not equal to NP, which is a complexity-theoretic assumption, then the complexity has to be exponential in at least one of the two parameters. I don't know the intuition; that's just from the classical work. OK, good. And because I'm assuming geometric locality, think of m as polynomial in n, the number of qubits. Yes. Good. So Gibbs states are an interesting class of states, and we have just shown that you can sample-efficiently learn them; the time complexity is bad, but the sample complexity is good. That's the main message of the first application. So let's go to the second topic: learning stabilizer states. But before we talk about stabilizer states, I want to first introduce the Pauli basis and the Weyl operators, which will be important. Good. So these are just the Pauli matrices: I is the identity, and X, Y, Z are the usual Pauli matrices. And this will be an exercise that you'll be doing: if you take tensor products of I, X, Y, Z over n qubits, they form an orthogonal basis for C^{2^n x 2^n}.
There's a typo on the slide: it should be C^{2^n x 2^n}. So these Pauli matrices form an orthogonal basis for the set of all operators on C^{2^n}. And there's a nice succinct way to write this down. You can just write W_x, where W is the Weyl operator and x is a 2n-bit string, namely x = (a_1, b_1, ..., a_n, b_n). One thing to observe is that, up to phase, X times Z is Y, and all of these matrices square to the identity; so understanding the Weyl decomposition in terms of X and Z alone is sufficient. Given x = (a, b), think of W_x as X^{a_1} Z^{b_1}, tensor X^{a_2} Z^{b_2}, all the way to the n-th qubit, X^{a_n} Z^{b_n}. And there is a global phase i^{a.b}, which I need because the Y gate carries such a phase: Y = i XZ. So that's the Weyl operator. And you'll also be showing this: the Weyl operators form an orthogonal basis for all operators, and in particular every Hermitian operator can be written in the Weyl basis. Sorry, this should be x ranging over {0,1}^{2n} here. In particular, every pure state can be written as 1 over 2^n times the summation over x of alpha_x times W_x. So again, think of the W_x as a basis: you can decompose every operator in the Weyl basis with some coefficients alpha_x, and these coefficients are nothing but trace of W_x times ket-bra psi psi. This is another property that you'll prove in the exercise session. So every operator can be written in the Weyl basis, and the coefficients satisfy this nice property. And I'll use one definition on the next slide: p_psi(x) is alpha_x squared divided by 2^n, and because of this equality, we know that the p_psi(x) sum to 1.
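You can check the claimed properties numerically. A sketch (the helper name `weyl` is mine; the normalization convention is the one above, p_psi(x) = alpha_x^2 / 2^n, which forces sum_x alpha_x^2 = 2^n):

```python
import numpy as np
from itertools import product

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1, -1]).astype(complex)

def weyl(a, b):
    """W_{(a,b)} = i^{a.b} X^{a1}Z^{b1} ⊗ ... ⊗ X^{an}Z^{bn}."""
    W = np.eye(1, dtype=complex)
    for ai, bi in zip(a, b):
        W = np.kron(W, np.linalg.matrix_power(X, ai) @ np.linalg.matrix_power(Z, bi))
    return (1j ** int(np.dot(a, b))) * W   # the i^{a.b} phase makes W Hermitian

n = 2
labels = [(a, b) for a in product((0, 1), repeat=n) for b in product((0, 1), repeat=n)]
ops = [weyl(np.array(a), np.array(b)) for a, b in labels]

# Orthogonality: tr(W_x† W_y) = 2^n * delta_{xy}
gram = np.array([[np.trace(Wx.conj().T @ Wy) for Wy in ops] for Wx in ops])

# Coefficients of a random pure state: alpha_x = <psi|W_x|psi>, sum alpha_x^2 = 2^n
rng = np.random.default_rng(0)
v = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
psi = v / np.linalg.norm(v)
alpha = np.array([np.vdot(psi, W @ psi).real for W in ops])
```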
This should already remind you a little of Fourier analysis: every Boolean function f has a Fourier decomposition of f(x) in terms of the character functions with Fourier coefficients f-hat(s), and for {-1,1}-valued f, Parseval says the sum of squares of the f-hat(s) is equal to 1, so they form a distribution. That's exactly the intuition behind all these learning algorithms: think of f-hat(s) squared as being replaced by p_psi(x), which sums to 1. So the main message here is: there is this Weyl basis, every Hermitian operator can be written in the Weyl basis, and the suitably normalized squares of the coefficients sum to 1. Good. So let me introduce the concept of Bell sampling. This is the crucial subroutine which we are going to use for learning stabilizer states. I'm just recalling some things from the previous slide; the important thing to know is that p_psi(x) is a distribution. That's all I need. First, I'm going to define the Bell basis. What is the Bell basis? Think of W_b as an operator on two qubits, for b in {00, 01, 10, 11}; recall from the previous slide what W_00, W_01, W_10, and W_11 are. My claim is that if you write the states (W_b tensor I) applied to phi-plus, where phi-plus is the EPR state, this forms an orthonormal basis for C^4, the two-qubit space. So you get the Bell state |W_b> by applying the Weyl operator to the EPR state, and it's not too hard to show that this is an orthonormal basis, because the inner products of distinct such states are zero. So the Bell sampling procedure is as follows. You take four copies of an unknown quantum state psi.
For now, just think of it as an unknown n-qubit quantum state psi; you take four copies. And you do the following procedure on the first two copies and on the last two copies. Take the first two copies; take the first qubit of the first copy and the first qubit of the second copy, and measure them in the Bell basis. Recall, this is the Bell basis, and the measurement outputs two bits. So measuring the first qubit of the first copy together with the first qubit of the second copy, I get (b_1, b_1 prime); the second qubit of the first copy with the second qubit of the second copy gives (b_2, b_2 prime); and so on. I repeat this for all n qubits, so I get b_1, b_1 prime, all the way to b_n, b_n prime, and that's going to be my bit string x. That used up two copies. Now repeat this with the other two copies: you get c_1, c_1 prime, all the way to c_n, c_n prime; call that y. So this took four copies, measured in the Bell basis, and you output x plus y, where plus means addition over F_2, i.e. bitwise XOR. And the main result, which you'll be showing in the exercises, is that the output distribution of this Bell sampling procedure satisfies this formula: you can think of it as the convolution of p_psi with itself, evaluated at z. This might look slightly messy for now, but you'll prove it in the exercise, and I'll give you an example of why this equation is useful on the next slide. So, for an arbitrary quantum state: take four copies, measure in the Bell basis, and you can sample from this distribution. Good. Let me give you an example of why sampling from this distribution is actually useful, and one application is learning stabilizer states. So what's a stabilizer state?
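Before answering that, a small sanity check I added (not from the talk): for a single-qubit state with real amplitudes, measuring two copies pairwise in the Bell basis already yields outcome x with probability <psi|W_x|psi>^2 / 2. The reason the procedure takes four copies and XORs is a complex-conjugation subtlety, which I sidestep here by taking real amplitudes.

```python
import numpy as np

I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1, -1]).astype(complex)
Y = 1j * X @ Z                                    # W_11 = i * XZ
W = {(0, 0): I, (0, 1): Z, (1, 0): X, (1, 1): Y}

phi_plus = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)  # EPR state
bell = {x: np.kron(Wx, I) @ phi_plus for x, Wx in W.items()}   # Bell basis |W_x>

theta = 0.7
psi = np.array([np.cos(theta), np.sin(theta)], dtype=complex)  # real amplitudes
two_copies = np.kron(psi, psi)

prob = {x: abs(np.vdot(b, two_copies)) ** 2 for x, b in bell.items()}
pred = {x: np.vdot(psi, Wx @ psi).real ** 2 / 2 for x, Wx in W.items()}
```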
Think of a Clifford circuit, built from H gates, S gates, and CNOT gates: a polynomial-size Clifford circuit applied to the all-zero input. The output is called a stabilizer state. These circuits are classically simulable, and these states are interesting; they can be prepared in some sense. And there's an alternate way to view stabilizer states: a stabilizer state is a pure state that is stabilized by a subgroup of Paulis of size 2 to the n. Recall, the W_x form a set of size 4 to the n, because there are four Paulis per qubit and you're acting on n qubits, so there are 4 to the n elements in the Pauli group. Think of script S as a subgroup of the Pauli group of size 2 to the n such that P psi equals psi for every P in S. So that's an alternate way of viewing a stabilizer state: it's stabilized by a group of 2 to the n Pauli matrices. That's it. And in particular, there's yet another way to view this: if you use this fact and write out the Pauli decomposition of ket-bra psi psi, you see that the density matrix corresponding to a stabilizer state is nothing but 1 over 2 to the n times the summation over all sigmas coming from this group S. If you write the Pauli decomposition out and observe which Fourier coefficients are non-zero, the non-zero ones are exactly those sigmas that stabilize psi. So this is an alternative way of writing a stabilizer state: it's just the normalized sum of all Paulis in its stabilizer group. As I said, this group has dimension n over F_2. OK. Now let's try to use the Bell sampling procedure to understand what's really going on: if you want to learn stabilizer states based on just this decomposition, how do you learn it? Yes, sorry, the size of S is exactly 2 to the n. And stabilizer states are pure states.
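The group-sum identity |psi><psi| = (1/2^n) sum over sigma in S of sigma is easy to check on the simplest example, |00>, whose stabilizer group is {II, ZI, IZ, ZZ}:

```python
import numpy as np

I = np.eye(2)
Z = np.diag([1.0, -1.0])
# Stabilizer group of |00>: every element fixes |00>.
group = [np.kron(I, I), np.kron(Z, I), np.kron(I, Z), np.kron(Z, Z)]

rho = sum(group) / 2**2          # (1/2^n) * sum of the stabilizer group, n = 2
ket00 = np.zeros(4)
ket00[0] = 1.0                   # the state |00>
```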
Good. Let's look at the quantity p_psi(z) that I mentioned on the previous slide. If you write it out, p_psi(z) is 0 if W_z does not stabilize psi; and if W_z stabilizes psi, the inner product psi W_z psi is exactly 1, so p_psi(z) is 2 to the minus n. Now, on the previous slide I said that if I do Bell sampling on four copies, I sample from the distribution q_psi(z) given by this convolution expression, so we need to understand what this convolution actually is. Again, what are we trying to do? We did Bell sampling using four copies of a stabilizer state and obtained a z sampled from the distribution q_psi; we're trying to understand what this z tells us about the stabilizer state. To understand that, first write down q_psi(z) as the convolution, and then plug in what we know about p_psi. In the sum, p_psi(a) is non-zero only if W_a stabilizes psi, so the summation is only over a in the stabilizer group, and each such term pulls out a factor of 2 to the minus n. And because the stabilizers form a group, p_psi(a + z) is non-zero, for a in the group, exactly when z is also in the group; summing over the 2 to the n group elements, you get that q_psi(z) equals 2 to the minus n if W_z stabilizes psi, and 0 otherwise. So q_psi(z) is exactly p_psi(z). In some sense, what are we saying? By doing the Bell sampling procedure, we obtain a z from the distribution p_psi, which is supported only on those z's whose W_z stabilizes psi. And with that, we are almost done. So what do we do?
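Before that: the q_psi = p_psi identity is easy to verify numerically. My own check on |00> (n = 2), computing p_psi from the Weyl coefficients and the XOR convolution directly:

```python
import numpy as np
from itertools import product

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.diag([1, -1]).astype(complex)

def weyl(a, b):
    """W_{(a,b)} = i^{a.b} X^{a1}Z^{b1} ⊗ X^{a2}Z^{b2}."""
    W = np.eye(1, dtype=complex)
    for ai, bi in zip(a, b):
        W = np.kron(W, np.linalg.matrix_power(X, ai) @ np.linalg.matrix_power(Z, bi))
    return (1j ** int(np.dot(a, b))) * W

n = 2
labels = list(product((0, 1), repeat=2 * n))            # z = (a1, a2, b1, b2)
psi = np.zeros(2**n, dtype=complex)
psi[0] = 1.0                                            # stabilizer state |00>

p = {}
for x in labels:
    a, b = np.array(x[:n]), np.array(x[n:])
    alpha = np.vdot(psi, weyl(a, b) @ psi).real
    p[x] = alpha**2 / 2**n                              # p_psi(x) = alpha_x^2 / 2^n

def xor(x, y):
    return tuple((xi + yi) % 2 for xi, yi in zip(x, y))

# q_psi(z) = sum_a p_psi(a) * p_psi(a + z): convolution of p_psi with itself
q = {z: sum(p[a] * p[xor(a, z)] for a in labels) for z in labels}
```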
Each round of Bell sampling uses four copies of psi and produces a z that comes from the distribution p_psi. (Technically, on the previous slide, z was sampled from q_psi, but q_psi equals p_psi, so we're sampling from p_psi.) And we know p_psi is supported only on those z's that stabilize psi. So with four copies, we produce a W_z that stabilizes the unknown psi. That's good, but we also know that the z's that stabilize psi form a group, so we just need to find a basis for that group: n linearly independent elements, since by definition S is a group of dimension n. So we repeat this process many times, say 4 n log n times, and find n linearly independent z's that stabilize psi. Call them w_1 to w_n and take their span; that gives you exactly the group that stabilizes psi. And once you have the group, you can compute the expression from the previous slide on your own, and that is exactly your unknown stabilizer state. Yes, there is a subtlety I'm omitting: technically there could be signs, so either W_z or minus W_z stabilizes psi; for simplicity I'm avoiding the signs. Once you've learned the group, you can learn the signs on your own via the swap test, by applying W_z to psi and checking whether W_z or minus W_z stabilizes psi. Yes, I'm going to come to that on the next slide. Good. So stabilizer states: you can learn them time efficiently. All you need to do beyond the sampling is take a span, which takes time about n cubed. So the time complexity is order n cubed and the sample complexity is order n. This is a class of interesting states that is both sample-efficiently and time-efficiently learnable.
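The classical post-processing is just linear algebra over F_2. A minimal sketch (my own; the "samples" are hard-coded labels from the stabilizer group of |00>, with repeats as real sampling would give):

```python
import numpy as np

n = 2
# Bell-sampling outcomes z in {0,1}^{2n}, here labels (a, b) with a = 00,
# i.e. the stabilizer group of |00>, including duplicates and the identity:
samples = [np.array(bits, dtype=np.uint8)
           for bits in [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0),
                        (0, 0, 1, 1), (0, 0, 1, 1), (0, 0, 0, 1)]]

def gf2_basis(rows):
    """Gaussian elimination over F_2: keep a maximal independent subset."""
    basis = []
    for r in rows:
        r = r.copy()
        for b in basis:
            pivot = int(np.argmax(b))     # position of b's leading 1
            if r[pivot]:
                r ^= b                    # eliminate that bit of r
        if r.any():                       # r is independent of the basis so far
            basis.append(r)
    return basis

basis = gf2_basis(samples)                # generators of the stabilizer group
```

Once `len(basis)` reaches n, the span of the basis is the whole stabilizer group, and the state can be reconstructed from the group-sum formula.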
But, as I said earlier, the circuits that produce these states are classically simulable. So maybe we can do something slightly more. There have been several recent works; I'm just going to give you an overview of them, not proof sketches. The first was given by, yeah, I forget the authors. What they consider is a Clifford circuit with a single layer of T gates. What they showed is that if the single layer contains only logarithmically many T gates, I can actually learn the output state in polynomial time. The main technique they use is the stabilizer decomposition of the T gate: one technique people use for classical simulation of quantum circuits is to expand the non-Clifford gates in a stabilizer decomposition, which incurs an overhead exponential in the number of non-Clifford gates. They essentially do that, combined with a version of Bell sampling, and they show you can learn a Clifford circuit with one layer of log n many T gates in time polynomial in n. I need log n because the complexity depends on n and 2 to the k, where k is the number of T gates. So that's what they proved. That was a couple of years back, but just last year there have been, I think, five or six works extending their results. The main question has been: if I remove the restriction that the T gates sit in a single layer, so they can appear anywhere in the circuit, but there are still only log n many of them, can I still learn these circuits, or the states produced by them? And the main idea is something that somebody here already asked about: they looked at this notion of stabilizer dimension.
So, as earlier: every pure quantum state psi has a Pauli decomposition. Look at the z's for which psi W_z psi is non-zero, take their span, and look at the dimension of that space; they call this the stabilizer dimension. And one thing they showed is that you can actually learn states with stabilizer dimension at least n minus k. I don't know the proof completely, but the idea mainly uses Bell sampling: you can use the same Bell sampling procedure, keep collecting these W_z's, and take their span; if you're promised that the stabilizer dimension is large, that's good enough to learn the support of psi in the Pauli basis. And the other result they showed answers: what does stabilizer dimension have to do with the T count? They showed that if psi is produced by a Clifford circuit plus k non-Clifford gates, where, recall, the k gates can be anywhere in the circuit, not just in a single layer, then the stabilizer dimension of the output state is at least n minus k. Then you can do Bell sampling, learn the stabilizer part, and they have a procedure to learn the unknown state. Yeah, you're roughly correct; let me tell you at a high level what they do. The point is, if you have a Clifford circuit plus k non-Clifford gates, they split the state into a Clifford part and a non-Clifford part. The Clifford part can be learned efficiently; the non-Clifford part lives on k qubits, and you just do tomography on those k qubits. That's why you get the 2 to the k factor, and the polynomial in n comes from learning the Clifford part on its own. That's why the complexity depends on n and 2 to the k. Good.
If people have lost me because there's too much going on: there's another way I view Bell sampling which I think is much neater. It does not give the optimal results, but it's a very pretty way to see it — it's just taking derivatives of functions. Let me do exact Bell sampling, but in the language of derivatives, on a simple example. It's a folklore result in quantum computing that every stabilizer state can be written in the following format: think of A as a subspace of F_2^n; you have a quadratic phase x^T B x and a linear complex phase s·x, and you sum over all x in A with (−1) raised to the quadratic polynomial times i raised to the linear function of x. Let me make a couple of simplifying assumptions: A = {0,1}^n, the entire Boolean cube, and s = 0^n, so there's no complex part — just a sum over all x in {0,1}^n of (−1) raised to a degree-two polynomial. That's it. The question is: given copies of this unknown stabilizer state, how do I learn it? The goal is to learn B, because B is the only unknown quantity in this expression. So, as in Bell sampling, take two copies of ψ and write them out: a sum over x, y of (−1)^{x^T B x + y^T B y} |x, y⟩. Now apply a transversal CNOT between the first and the second register — XOR the first register into the second. The phase stays the same; |x, y⟩ just becomes |x, x + y⟩. OK, relabel x + y as z, so y becomes x + z.
So y becomes x + z and x stays x, and expanding the phase over F_2, (x + z)^T B (x + z) = x^T B x + x^T B z + z^T B x + z^T B z. The x^T B x from the first copy cancels against the x^T B x in this expansion, and since z^T B x = x^T B^T z, you're left with x^T (B + B^T) z + z^T B z: linear in x and quadratic in z. Now measure the second register; you get some z̃ out. Looking at the first register, the z̃^T B z̃ term is just a global phase, so you can pull it out, and what remains is linear in x: (−1)^{x^T (B + B^T) z̃}. And if you recall from the first slide, that's just a single parity in the phase — so run Bernstein-Vazirani, i.e., apply an n-fold Hadamard transform, and you learn exactly (B + B^T) z̃. I view z̃ as a direction, and (B + B^T) z̃ as the derivative of the function x^T B x in the direction z̃. And that's kind of it: two copies of ψ yield one derivative of this quadratic function in a random direction z̃. Repeat n-ish times and you get directional derivatives in n different directions, which suffices to learn the unknown B — and learning B completely determines ψ. So you can view Bell sampling as a way of taking derivatives of quadratic phase polynomials. That's it. Is this clear? [Q: which direction?] Yeah, so just one derivative is not enough.
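The algebra above, and the post-processing described next, can be checked classically. A sketch in pure Python — all names are mine, and this simulates only the arithmetic, not the quantum sampling: for illustration it queries derivatives in chosen standard-basis directions, whereas Bell sampling yields random ones, and it fixes the diagonal of B by point evaluations, since B + B^T has zero diagonal over F_2 and the derivatives cannot see it:

```python
import random

def quad(B, x):
    """f(x) = x^T B x mod 2 for a 0/1 matrix B."""
    n = len(x)
    return sum(x[i] * B[i][j] * x[j] for i in range(n) for j in range(n)) % 2

def deriv(B, z):
    """The parity (B + B^T) z mod 2 that one Bell-sampling round
    (direction z) followed by a Hadamard transform reveals."""
    n = len(z)
    return [sum((B[i][j] + B[j][i]) * z[j] for j in range(n)) % 2 for i in range(n)]

def recover_B(B, n):
    """Reconstruct upper-triangular B from derivatives in the standard-basis
    directions, plus point evaluations f(e_i) = B_ii for the diagonal."""
    e = lambda i: [1 if j == i else 0 for j in range(n)]
    cols = [deriv(B, e(j)) for j in range(n)]   # cols[j] = column j of B + B^T
    R = [[0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = quad(B, e(i))                 # diagonal, invisible to derivatives
        for j in range(i + 1, n):
            R[i][j] = cols[j][i]                # (B + B^T)_{ij} = B_ij for i < j
    return R

random.seed(1)
n = 6
B = [[random.randint(0, 1) if j >= i else 0 for j in range(n)] for i in range(n)]

# 1) the cancellation: f(x) + f(x+z) = x^T (B + B^T) z + z^T B z over F_2
for _ in range(20):
    x = [random.randint(0, 1) for _ in range(n)]
    z = [random.randint(0, 1) for _ in range(n)]
    xz = [(a + b) % 2 for a, b in zip(x, z)]
    cross = sum(x[i] * d for i, d in enumerate(deriv(B, z))) % 2
    assert (quad(B, x) + quad(B, xz)) % 2 == (cross + quad(B, z)) % 2

# 2) n directional derivatives (plus the diagonal) determine B
assert recover_B(B, n) == B
```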
If you have a quadratic polynomial and you know its derivative in n linearly independent directions, that can be used to completely recover the original quadratic polynomial. Right, there is some post-processing to do — for example, once you learn B + B^T... Oh, by the way, I'm implicitly assuming B is upper triangular: x^T B x only makes sense that way, because if B had entries both above and below the diagonal, the terms x_i B_{ij} x_j and x_j B_{ji} x_i would cancel mod 2 anyway. [Q: is this simulable, and by which circuits?] I'm not very sure — good question. I suspect it should be known, given how much work there's been in this direction in the last year, but I don't know. I suspect you can simulate with an overhead exponential in the number of non-Clifford gates, so with only log n of them it should be a polynomial overhead, but I'm not sure if there's a subtlety. With T gates I'm more familiar; the general magic case I'm less familiar with. Good, so let me move on to the second part. The point is that the previous slide motivates learning phase states: states of the form Σ_x (−1)^{f(x)} |x⟩. When learning stabilizer states, f was just degree two, but what if f has degree d — can we learn these states or not? The question is: how many copies of ψ_f are sufficient to learn f? And why care about these phase states? Again, in the last couple of years there have been several results that actually employ them. The first is this idea of pseudorandomness.
So we know that if f is a random function, or a pseudorandom function, then the ensemble ψ_f is computationally indistinguishable from a Haar-random state. So if you don't know how to construct a Haar-random state, you can take a pseudorandom function, put it in the phase, and get something that is close, in that sense, to Haar-random. The second thing: look at IQP circuits. IQP circuits are just a layer of Hadamards at the start and the end, with a middle consisting of Z, CZ, and CCZ gates — CCZ again being non-Clifford. The output states of IQP circuits are exactly degree-three phase states, because Z contributes degree one, CZ degree two, and CCZ degree three. So maybe degree two is learnable, but at degree three we don't know what's going on — maybe it's exponential time, since we believe IQP circuits are not classically simulable. There have also been results on quantum complexity and pseudoentanglement — in QMA, QCMA, and so on — where people have used these phase states to get new results. So there's plenty of motivation to look at states of this form and ask whether we can learn them. And our main result is essentially optimal bounds: with separable, that is single-copy, measurements, the best you can do is about n^d copies to learn the state, and with entangled measurements you need n^{d−1} copies. In the exercise session you'll look at d = 2 and see how to learn it with n^2 copies — it's a simple trick.
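To record the objects and bounds just described in symbols (as stated on the slide):

```latex
% Degree-d phase states and the sample complexities from the talk:
\[
  |\psi_f\rangle \;=\; \frac{1}{2^{n/2}} \sum_{x \in \{0,1\}^n}
      (-1)^{f(x)}\, |x\rangle ,
  \qquad \deg f \le d .
\]
% Separable (single-copy) measurements:  Theta(n^d)     copies.
% Entangled measurements:                Theta(n^{d-1}) copies.
```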
And I will now describe the algorithm that learns these states with entangled measurements in sample complexity n^{d−1}. The first approach: can we keep taking derivatives? I mentioned that for a degree-two function you take derivatives in n different directions and learn the function. What happens at degree three? Recall that for f of arbitrary degree, Bell sampling as before produces a state of this form, with f(x) + f(x + s) in the phase. If f was degree two, the phase is degree one and Bernstein-Vazirani learns it. But if f is degree three, what we have in the exponent is now degree two: f(x) + f(x + s) cancels the cubic terms, but the quadratic terms still stay. And it's unclear how to learn this quadratic term — without an exponential overhead, I don't know how to make this Bell sampling procedure, taking two copies and extracting a directional derivative of f, work when f has degree three. The second approach, which does work, is the standard idea I've been repeating since lecture one: apply the pretty good measurement. We're looking at a state identification task: f is a degree-d polynomial in n variables, you have k copies of ψ_f, and that's your ensemble. The failure probability of the pretty good measurement is this expression here: you sum, over all f ≠ g, the inner product of ψ_f and ψ_g raised to the power k, normalized by the size of the ensemble. I won't go through the calculation — we're running out of time — but the main observation, once you run through it, is that you can view the phase differences f + g as Reed-Muller codewords.
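Before continuing, let me write down the bound from the slide in symbols (this is the form stated in the talk, with the overlap computation that connects it to Reed-Muller weights):

```latex
% Failure probability of the pretty good measurement on k copies,
% over the ensemble of degree-d phase states:
\[
  P_{\mathrm{fail}}(\mathrm{PGM})
    \;\le\; \frac{1}{|\mathcal{C}|} \sum_{f \neq g}
      \bigl| \langle \psi_f | \psi_g \rangle \bigr|^{k},
  \qquad
  \langle \psi_f | \psi_g \rangle
    \;=\; \frac{1}{2^n} \sum_{x} (-1)^{f(x) + g(x)}
    \;=\; 1 - \frac{2\,\mathrm{wt}(f + g)}{2^n},
\]
% where f + g ranges over nonzero codewords of the Reed-Muller code RM(d, n),
% so weight bounds for RM(d, n) control the sum; k = O(n^{d-1}) makes it small.
```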
Using weight properties of Reed-Muller codewords, you can prove good bounds and upper bound this entire expression. In particular, if you pick k to be n^{d−1} up to constants, the failure probability is small: with k = O(n^{d−1}) copies, the pretty good measurement identifies the unknown f with high probability. And this is optimal: there are about n^d unknowns, since f is a degree-d function, and every quantum example gives at most n bits of information, so you need at least n^d / n = n^{d−1} copies. Good. Let me get to the last part of the talk and then conclude: the notion of quantum statistical query learning, which we introduced a few years back. The setup: C is a collection of Boolean functions mapping n bits to one bit. In the first lecture, we were given quantum example states — superpositions over |x, c(x)⟩ — and the question was whether we can learn the unknown function c. We looked at a few models of learning. In the first, we could make entangled measurements: somebody prepares a T-fold tensor power of the state, hands it to me, and asks whether I can learn the unknown c. But there's another model: I don't perform an entangled measurement on all T copies, only single-copy measurements. I get one ψ_c, perform a single-copy measurement, get another ψ_c, perform another single-copy measurement, and so on. Separable measurements are somewhat more doable in the lab than entangled measurements — at least it's more plausible to implement that kind of learner.
And the model we introduced a couple of years back with Alex Grilo and Henry Yuen is the QSQ model, the quantum statistical query model. As I said, maybe entangled and separable measurements are both far from near-term implementation — even performing an arbitrary measurement on one copy of the state could be hard, when preparing the state is already hard. But you could perform a two-outcome measurement. So here is the QSQ model: a quantum statistical query specifies a two-outcome measurement {M, I − M}, and what you get back is the probability that this measurement accepts ψ_c, up to ±τ, where τ is a tolerance. I have a quantum state; you specify a two-outcome measurement; I perform it and tell you, up to ±τ, the probability that it would have accepted. That's the output of one statistical query, and the question is how many quantum statistical queries are necessary and sufficient to learn the unknown concept c. The way I motivate this: say somebody in the cloud — IBM, for example — prepares a quantum state. Transmitting it to you would be hard, but you can specify a two-outcome measurement; they perform it and hand you a classical number α_M. The input (M, τ) is classical and the output is classical, so the learner is purely classical — yet it's trying to learn something about a quantum property.
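A toy sketch of that QSQ interface — the function name and the noise model are mine; a real oracle may answer adversarially anywhere within the tolerance, whereas here the slack is just bounded random noise:

```python
import random

def qsq_oracle(rho, M, tau, rng=random.Random(0)):
    """Answer one quantum statistical query (M, tau) on state rho:
    return Tr(M rho), the acceptance probability of the two-outcome
    measurement {M, I - M}, up to additive error tau.

    rho and M are d x d matrices given as nested lists (complex entries OK).
    """
    d = len(rho)
    exact = sum(M[i][j] * rho[j][i] for i in range(d) for j in range(d)).real
    return exact + rng.uniform(-tau, tau)

# |0><0| queried with the projector onto |1>: acceptance probability is 0
rho = [[1 + 0j, 0j], [0j, 0j]]
M = [[0j, 0j], [0j, 1 + 0j]]
assert abs(qsq_oracle(rho, M, 0.01)) <= 0.01
```

The learner never touches `rho`; it only ever sees the classical numbers this function returns, which is exactly what makes the model so weak.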
And one thing we observed in our paper (AGY) is that all the algorithms we discussed in lecture one — parities, juntas, DNFs, the coupon collector — need no entangled measurements. With statistical queries alone, the weakest form of measurement, you could have implemented all of those learning algorithms. That was surprising: even in the weakest model, you get a quantum speedup for interesting concept classes. The natural question is: how powerful are these statistical queries? Recall the concept class and the quantum example; given statistical queries, how can I learn the unknown Boolean function? As I said, almost all quantum learning algorithms I know can be implemented in this QSQ framework. So is it true that, generically, every quantum learning algorithm can be implemented in the QSQ framework — that you never need entangled measurements at all? Or is there a class for which you genuinely need entangled or separable measurements, and two-outcome measurements do not suffice? Let's go back to the classical picture. Classically, in uniform PAC learning, you get uniformly random pairs (x, c(x)). And in the previous slide's notation (M, τ), think of M as a diagonal matrix — that is exactly a classical SQ learner: he specifies a diagonal matrix (equivalently a predicate φ) and a tolerance, and learns, up to τ, the probability that φ outputs one on (x, c(x)). And the question is: is there a concept class that classically separates PAC learning from SQ learning? There's a standard answer: parity functions.
Parities are learnable in the PAC model — just Gaussian elimination — but in the SQ model there is an exponential lower bound. That gives an exponential separation between classical SQ and classical PAC. The question we're trying to answer quantumly is quantum SQ versus quantum PAC: entangled measurements versus statistical queries alone. And the result we showed recently: if you look at degree-two functions instead of degree-one, you can separate QSQ from quantum PAC. Let me be a little more precise. Look at the concept class of functions c(x) = x^T A x — the stabilizer-state setting I've discussed so far — where A is an unknown n × n matrix. Then: with entangled measurements, we saw how to learn with O(n) copies; with separable measurements, you'll see in the exercise how to do it with n^2; and in the QSQ model, we show a lower bound exponential in n. As a bonus, if you want to learn these states with random classification noise, that is learnable in time polynomial in n and 1/(1 − 2η). So this conclusively says there exists a simple class of function states, based just on stabilizer states, which requires at least separable measurements — with two-outcome measurements alone you could not have learned it, and not only that, we give an exponential lower bound. This is joint work with Vojtěch Havlíček and Louis Schatzki. What are the consequences — why even look at this? Beyond the natural motivation of what can be learned with separable but not two-outcome measurements, there is also a very long-standing open question in classical learning theory: can we separate PAC learning from learning with random classification noise?
There's a very unnatural separation between these two classes, but no natural one. Quantumly, though, we can actually separate QSQ from learning with classification noise — so we can solve the quantum analogue of this classical question, though I'm not sure it's as well motivated as the classical one. There is another application: people have been looking at error mitigation, and we can give an exponential separation between weak and strong error mitigation. I have a couple of minutes, so let me wrap up with the last slide. As I said, maybe you don't care about Boolean functions, only about learning quantum states. In the previous slide, I performed a statistical measurement on an example state, but say you care about learning arbitrary quantum states, not just those encoding Boolean functions. A statistical query still makes sense: think of 𝒞 as a concept class of quantum states and ρ as an unknown state. Say I prepare ρ; you specify (M, τ), a two-outcome measurement; and I tell you, up to ±τ, the probability that M would have accepted ρ. For example, there's recent work of Chen et al. on shadow tomography: we saw yesterday that shadow tomography used entangled measurements on copies of the state. They looked at separable measurements and showed you cannot do shadow tomography with them — a 2^n lower bound for shadow tomography with separable measurements, while with entangled measurements it's possible with poly(n) copies.
And one thing we do in our paper is show that if you want to learn the same states, not with separable measurements but with statistical queries, the lower bound is 4^n — which is as large as it can be, since the trivial upper bound is to estimate each of the 4^n Pauli coefficients with one query. So statistical queries are very weak if you want to learn these states, or to do shadow tomography. We also look at the hidden subgroup problem: even the abelian HSP, which can be done with separable measurements, needs sample complexity at least 2^n if you restrict to statistical queries, that is, to two-outcome measurements. We give some positive results too: as I mentioned a slide ago, learning those Gibbs states required only single-copy measurements, and we show that trivial states — states produced by constant-depth circuits — can be learned with statistical queries alone. So our work has both positive and negative results: when are statistical queries useful, and when are they not? Good, let me conclude. What have we seen through these lectures? First, Boolean functions and quantum examples: sometimes useful, for uniform-distribution learning, and sometimes not, for distribution-independent learning. Then, learning quantum states exactly, approximately, or just some of their properties: tomography, PAC learning, shadow tomography, and things like that.
And then in this lecture, we saw interesting classes of quantum states which you can sometimes learn sample-efficiently, sometimes even time-efficiently: Gibbs states, stabilizer states, extensions of stabilizer states with a few T gates in them, phase states, and so on. Good. Going ahead, some questions and directions I think are interesting: learning more classes of states sample- and time-efficiently — say, low-stabilizer-rank states or states produced by constant-depth circuits. And the statistical query model of learning I introduced was a theoretical way to model near-term quantum machine learning. Quantum machine learning is getting a lot of hype, and it's nice to bring out the theory in it and prove formally what can and cannot be done — we need more realistic theoretical models to understand what near-term quantum devices can do, and the quantum statistical query framework was one step in that direction. As I said, there are also connections from learning theory to Hamiltonian simulation, to the hidden subgroup problem, to communication complexity — maybe new ideas in learning can be used for these other topics. There have been several surveys on this subject; I have a recent survey with Anurag with about 25 open questions, so if you're interested in this topic, you could look at that and see if some of the open questions interest you. With that, I hope I've interested you enough in quantum learning. Thank you.