Welcome everyone to day four. I'm going to be talking more on quantum learning, and this time on quantum learning of quantum states. So let me begin. So far, what have we looked at? In the past couple of lectures, we wanted to learn Boolean functions: c was a Boolean function that mapped n bits to 1 bit, and we looked at examples of that form. In the first lecture we saw that for PAC learning these kinds of examples are useful, and when you fix D to be the uniform distribution, these kinds of examples are useful for learning Boolean functions. But I'm going to be looking at something slightly more general now. One question would be: why care about learning Boolean functions? Sure, it's interesting in its own right. But you could ask for something slightly more general. Say rho is an unknown quantum state, and I give you copies of it. Can you learn it? Think of an experimental device: we have a quantum computer, somebody claims that they prepared a quantum state rho, and as a verifier you want to understand, did you prepare the state rho or not? How do you verify such a thing? There are quite a few ways to put this in a theoretical framework, and that's what I'm going to be talking about here. First, you could think of: I just give you copies of rho, learn rho. Somebody keeps giving me copies of rho, and the goal is to learn rho. This is called tomography. Then, say I'm given copies of rho, but I only need to predict rho's behavior on some fixed collection of measurements, to reproduce its measurement statistics. That's another form of learning, called shadow tomography. Or say somebody gives me labeled measurement statistics of rho, pairs of a measurement and its expectation value, and I want to learn rho just approximately well, on most measurements. This is called PAC learning.
Now, tomography and PAC learning, and even shadow tomography, can be time-expensive. So maybe I can fix an interesting class of states, and for this class ask: can I learn time-efficiently? In this talk, I'll be talking about learning Gibbs states of local Hamiltonians. And in the next talk, I'll be talking about stabilizer states and learning circuits with a few T gates. So my plan is to go over protocols for tomography, PAC learning, and shadow tomography, and then talk about learning Gibbs states. That's what I'm going to be doing in the next hour. Good. So let's begin with tomography. What is tomography? rho is an unknown n-qubit quantum state, so rho is a d-by-d matrix, where d is 2 to the n, because rho is on n qubits. The question in tomography is: how many copies of rho are necessary and sufficient to produce a classical description of a state sigma that approximates rho well enough? More pictorially: you have QA, your quantum algorithm, and I keep giving it identical copies of rho. The goal is for the quantum algorithm to produce a state sigma such that sigma is close to rho. Here, the point is I don't care about how the output is given; in tomography, I just care about the number of copies of rho, the sample complexity. And this is a very fundamental question, going back to the start of quantum information: how do I verify a quantum state? If I give you copies of a quantum state, how do you learn it? I claim it to be rho, but suppose I actually produce some rho prime. Can you check whether rho was equal to rho prime or not? For example, I produce a noisy quantum state rho, and you want to understand the noise in the quantum device that produces it.
Maybe I do tomography on the noisy copies of the quantum state rho, learn the noise in the system, and use that for calibration, verification, and so on. So it's a very fundamental problem. One thing I've been a bit sloppy about: for simplicity, in what follows I'm going to be talking about outputting a state sigma such that sigma and rho are close in trace distance. You could look at other norms, for example the Frobenius norm or some other matrix norm, but for simplicity I'll stick to trace norm. Good. So there is a very trivial algorithm: somebody gives me copies of rho, and I just estimate the entries of rho one at a time, starting with the (1,1) entry, and do this for all d-squared entries. For the output sigma to be close to rho in trace distance, each entry has to be estimated to accuracy about epsilon over d squared, which takes about d to the 4 copies to estimate just one entry; and because there are d-squared entries, that's a trivial algorithm using about d to the 6 copies, and it outputs a state sigma that is close to rho in trace distance. Then there were subsequent algorithms, around 2003 and after, that reduced d to the 6 down to d to the 4: they showed d to the 4 copies suffice to output a state sigma that is close in trace distance. And then, using compressed sensing and matrix-recovery techniques inspired by classical TCS, people gave an algorithm that used d cubed copies of rho and still output a state sigma close to rho in trace distance. When I mention these bounds, I'm treating epsilon as a constant, but I'll get to the epsilon dependence in a second.
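Since trace distance is the figure of merit throughout, here is a small numerical sketch of how to compute it (my own illustration, not from the slides; the states below are random placeholders):

```python
import numpy as np

def random_density_matrix(d, rng):
    """Sample a random d x d density matrix (positive semidefinite, trace 1)."""
    g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def trace_distance(rho, sigma):
    """(1/2) * trace norm of (rho - sigma); both inputs Hermitian."""
    eigs = np.linalg.eigvalsh(rho - sigma)
    return 0.5 * np.sum(np.abs(eigs))

rng = np.random.default_rng(0)
rho = random_density_matrix(4, rng)

# Sanity checks: distance of a state to itself is 0, and two orthogonal
# pure states are at the maximal trace distance 1.
e0 = np.zeros((4, 4)); e0[0, 0] = 1.0
e1 = np.zeros((4, 4)); e1[1, 1] = 1.0
d_self = trace_distance(rho, rho)
d_orth = trace_distance(e0, e1)
```

The d-by-d matrix view also makes the "d squared parameters" intuition concrete: a density matrix has d squared real degrees of freedom, which is why one expects a d-squared-type sample complexity.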
But still, fundamentally, you know that rho just has d-squared entries in it, and you would expect the right answer to be d squared. Yet for a very long time, the best algorithm we had was d cubed. Then there were a couple of breakthrough results, by O'Donnell and Wright and by Haah et al., where they showed that the right sample complexity for tomography is d squared over epsilon squared. For every epsilon, given d squared over epsilon squared copies of rho, you can produce a state sigma that is close to rho in trace distance. This was a breakthrough back in 2015. And we also know that this is optimal: you cannot do better than d squared over epsilon squared. So this is an optimal algorithm for tomography. Good. So let's see. As I said, there were two papers that achieved tomography using d squared over epsilon squared copies. You could say the tools they used are similar, but I think they are slightly different. O'Donnell and Wright used a technique called Schur sampling to first estimate the spectrum, and then a classical algorithm based on the RSK correspondence to reconstruct the state. The algorithm of Haah et al. used the pretty good measurement, together with techniques from Schur-Weyl duality, to recover the state rho. There might be a unifying way to see these two approaches, but at least as far as I know, I don't see a way to unify them. In any case, these are two ways to get the d squared over epsilon squared sample-complexity upper bound for tomography. And they actually prove something slightly stronger: if your unknown state rho has rank r, then d times r over epsilon squared copies suffice to learn it. In particular, an arbitrary quantum state rho can have rank as big as d.
So r equals d, and that gives the d squared over epsilon squared. Good. I don't claim to completely understand these two algorithms, but one thing I can give you a protocol idea for is the case r equals 1. So let's look at pure states and talk about what happens in pure-state tomography. I give you t copies of an unknown pure quantum state psi, and your goal is to produce a state phi such that psi and phi have fairly good overlap. The protocol idea is fairly simple: you just apply the pretty good measurement. I give you copies of an unknown quantum state, you need to identify it, and as I've been saying in the past two lectures, one technique you can always try is the pretty good measurement; one way to solve this pure-state tomography task is to do exactly that. The ensemble is uniform over all possible pure states: you're given an unknown psi, so the ensemble is uniform over all psi, with the t-fold tensor copy. The POVM elements are E_v, one for each unit vector v, where E_v is proportional to the t-fold tensor power of the projector onto v; so for now this is an infinite, continuous POVM. Since v is a unit vector, I'll just write it as a pure state. Written out, the POVM element looks like binom(d + t - 1, t) times the t-fold tensor power of the projector onto phi, times d phi: the t is there because you're given the t-fold tensor copy, and the d because you're in C^d. I'll discuss the properties of this in the next slide, but for now let me just tell you the algorithm: I give you copies of the state psi, and you perform this POVM.
If you write down the pretty good measurement for this ensemble, it is exactly this continuous POVM. And as I said, it's continuous, but you can discretize it: take a net over pure states and apply the POVM corresponding to the net. This POVM outputs a state phi, and that's your output. So the point is: somebody gives you copies of psi, you apply the pretty good measurement. The POVM operators look a bit messy, but they are exactly these operators. Once you apply this POVM to psi tensored t times, it outputs some phi, and you just output phi. That's it. Again, I don't care about the description complexity, only the sample complexity. The thing I'm going to show is that if t is d over epsilon squared, then the distance between the two states, measured via the overlap, is at most epsilon in expectation. Two things remain to be seen. One: why is this a POVM? I just claimed it; I'll prove it to you. Two: why do these many copies suffice to reconstruct a state phi close to psi in overlap? That's what I'll do in the next slide. Any questions? Yeah, the point is that when you start the algorithm you know epsilon, so you discretize based on epsilon; a net of granularity some constant times epsilon is good enough. Good. So let's look at the first thing: why is this a valid POVM? To show it, you need to show that the POVM elements sum to identity; since we are working with a continuous family, the integral of all the POVM elements should equal the identity. That's not too hard to show. We have this quantity, the POVM element for every phi; we take the integral over all phi and show it equals the identity.
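As a numerical sanity check of the normalization claim (my own sketch, not part of the lecture's protocol): for d = 2 and t = 2, the Haar average of the two-fold tensor power of a random pure-state projector, scaled by binom(d + t - 1, t) = 3, should reproduce the projector onto the symmetric subspace, which for t = 2 is (I + SWAP)/2.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)
d, t = 2, 2  # two copies of a qubit

def haar_state(d, rng):
    """Haar-random pure state in C^d."""
    z = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    return z / np.linalg.norm(z)

# Monte Carlo average of |v><v| tensor |v><v| over Haar-random v.
avg = np.zeros((d**t, d**t), dtype=complex)
n_samples = 50_000
for _ in range(n_samples):
    v = haar_state(d, rng)
    vv = np.outer(v, v.conj())
    avg += np.kron(vv, vv)
avg /= n_samples

# For t = 2, the projector onto the symmetric subspace is (I + SWAP)/2.
swap = np.zeros((4, 4))
for i in range(2):
    for j in range(2):
        swap[2 * i + j, 2 * j + i] = 1.0
p_sym = (np.eye(4) + swap) / 2

# The POVM normalization says comb(d+t-1, t) * avg should equal p_sym.
error = np.linalg.norm(comb(d + t - 1, t) * avg - p_sym)
```

The Monte Carlo error decays like one over the square root of the sample count, so with 50,000 samples the Frobenius error is small but not zero.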
I can pull the binomial factor outside the integral, and look at what's left. It is well known in representation theory that the integral over phi of the t-fold tensor power of the projector onto phi is the projector onto the symmetric subspace divided by its trace, and that trace is exactly this binomial quantity. That's why we defined the POVM with this normalization in the first place: the binomial cancels, and what remains is the projector onto the symmetric subspace of the t-fold tensor power of C^d. Then you just need to observe that our setup was symmetric to begin with: the input psi tensored t times lives in the symmetric subspace, and on that subspace the projector acts as the identity. So the integral of all the POVM elements, one for every phi, is the identity on the relevant space, and that verifies why this is a POVM. Now let's analyze the output of the algorithm. What did I claim? I claimed that we apply this POVM, with one element for each phi, to the unknown quantum state: on input psi tensored t times, I apply this POVM, and we analyze what the expected output phi satisfies. This is a slightly ugly-looking calculation, but it's not too bad, actually. Let's look at the expected overlap of the phi output by the POVM with psi. Psi was the state given to you; phi is the output of the POVM. So we look at the expected value of their squared inner product. This is nothing but the integral over all phi, with measure d phi, of the squared overlap weighted by the probability of the POVM outputting phi. And we know exactly what that probability is: it's the inner product of the POVM element with psi tensored t times.
Taking the inner product of the POVM element with psi tensored t times gives the overlap between psi and phi raised to the power 2t, together with the binomial normalization pulled out front. Combined with the squared-overlap factor we are averaging, the integrand is the overlap to the power 2t plus 2. Using the same representation-theory fact as before, now at tensor power t plus 1 instead of t, this integral is 1 over binom(d + t, t + 1). Taking the ratio of the two binomial quantities, the expected squared overlap scales as approximately 1 minus d over t. And now we are almost done, because the expected trace distance between psi and phi is then about square root of d over t; recall that for pure states the trace distance is determined by this inner product. If you want this quantity to be epsilon, you pick t to be d over epsilon squared. In case that was a bit technical, let me give you the high-level idea again. I give you copies of the unknown state psi. I claimed the pretty good measurement is the thing to do to identify psi well enough. First I verified why it's a POVM; that's a couple of lines of calculation. Then I analyzed the output state phi when I feed the POVM the input psi tensored t times, and observed that the expected squared inner product between psi and phi is this quantity, so if I pick t to be d over epsilon squared, the error is at most epsilon. On the bookkeeping of exponents: the probability of outputting phi contributes the overlap to the power 2t, because you take the inner product with psi tensored t times and, since you're computing a trace, you pick up the overlap to the power t once from the bra side and once from the ket side; that's the 2t.
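For completeness, the calculation on the slide can be written out as follows (a reconstruction from the standard symmetric-subspace identities; the final inequality shows where the 1 - d/t approximation comes from):

```latex
% Fact: \int |\phi\rangle\langle\phi|^{\otimes s}\, d\phi
%       = \Pi_{\mathrm{sym}}^{(s)} \Big/ \binom{d+s-1}{s},
% hence \int |\langle\psi|\phi\rangle|^{2s}\, d\phi = 1 \Big/ \binom{d+s-1}{s}.
\mathbb{E}\big[\,|\langle\psi|\phi\rangle|^2\,\big]
  = \binom{d+t-1}{t} \int |\langle\psi|\phi\rangle|^{2t}\,
      |\langle\psi|\phi\rangle|^{2}\, d\phi
  = \frac{\binom{d+t-1}{t}}{\binom{d+t}{t+1}}
  = \frac{t+1}{t+d}
  = 1 - \frac{d-1}{t+d}
  \;\ge\; 1 - \frac{d}{t}.
% Picking t = d/\varepsilon^2 makes the expected infidelity at most
% \varepsilon^2, so the expected (pure-state) trace distance is at most \varepsilon.
```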
And because you're also looking at the expected overlap squared, there's an extra factor of the overlap squared, so in total the exponent is 2t plus 2. So this is a very high-level idea of how pure-state tomography works. The O'Donnell and Wright and Haah et al. papers generalize this to arbitrary quantum states; the analysis they use is much more technical, and I'll not be going into it. But what I want to emphasize is the following. The sample complexity here is d over epsilon squared, and d scales exponentially with n. So if we have an n-qubit quantum state and we want an epsilon-approximation via tomography, the sample complexity is d over epsilon squared, or 2 to the n over epsilon squared. That's exponential in n, which is not nice. There have been experimental demonstrations of tomography, and somehow they seem to be stuck around n equal to 10; that's the best tomography experiment we've been able to do. And that's maybe not surprising, because 2 to the 10 is already huge. It's already impressive that you can do tomography on 10 qubits, since the sample complexity scales as 2 to the 10. That seems big, but there have been experimental works doing this. And that motivates the question: if you want to learn an unknown quantum state, should you do tomography at all? Do you want to learn the entire quantum state, or just certain properties of it? Say, in physics language, you want to learn some magnetic or electrical properties, something of that sort. Maybe learning rho completely is overkill; maybe you just want some properties of the quantum state, say its expectation with respect to some measurement. In that case, learning the entire quantum state is wasteful; I just need to learn some properties of it.
And that naturally leads me to the notion of PAC learning. PAC learning is something we discussed on the first day; I'll now come back to PAC learning for quantum states. On the first day I spoke about PAC learning Boolean functions; now we'll talk about PAC learning quantum states. Good. So as I said, so far we looked at tomography, where you learn the entire quantum state up to trace distance: we learned sigma such that the trace distance between sigma and rho is at most epsilon. If you unpack the meaning of trace distance, it says that for every operator E with 0 at most E at most identity, trace of E times rho is close to trace of E times sigma. So what does tomography do? From copies of a quantum state rho, I can estimate essentially every observable pretty well. And as I said, maybe you don't want to estimate all observables well. Maybe you just want to learn rho well for most two-outcome measurements, not for every two-outcome measurement. That is, I want to output a state sigma such that trace of sigma times M minus trace of rho times M is at most epsilon, not for every M, but just for most M's. This relaxes the goal of tomography a little bit, and it motivates the notion of PAC learning. PAC stands for probably approximately correct: the "probably" means that for most M's I'm doing a good job, and by a good job I mean getting an epsilon-approximation. So let me define the notion of PAC learning. Scott Aaronson introduced this model in 2007, and I think the result is very pretty.
I think the motivation for introducing it in the first place was that it generalizes classical PAC learning to quantum states. And I mentioned this on the first day already. So in the PAC learning model, here's the setup. Recall that in tomography, you're given copies of an unknown quantum state rho. But now, you're just given pairs (E, trace of rho times E). Recall that on the first day I said I give you (x, f(x)) for uniformly random x. Now you're given random observables with their expectation values: (E_1, trace of rho times E_1), (E_2, trace of rho times E_2), up to (E_k, trace of rho times E_k). That is fed to a quantum algorithm, and the goal of the algorithm is much the same: it should output a state sigma such that sigma is close to rho. That's a pictorial way to understand PAC learning quantum states; let me make it a little more rigorous now. As in standard PAC learning, there is a distribution, even in the quantum setting: D is an unknown distribution over measurements, a function assigning probabilities to the set script-E of all possible measurements. Somebody samples E_1 through E_k from the distribution D and gives the learning algorithm trace of rho times E_1, trace of rho times E_2, all the way to trace of rho times E_k; that is, for each E_i sampled from D, the learner gets the pair (E_i, trace of rho times E_i). The goal is to produce a state sigma. But unlike in tomography, where sigma and rho have to be close in trace distance, the requirement on sigma is much weaker now. What do I want? I want the following to hold with high probability, where the probability is over an E sampled from the distribution D; think of E as coming from outside your training set.
When I give you a new E outside the training set, for most E's, you should be able to estimate trace of rho times E well: the output state sigma should satisfy that trace of sigma times E minus trace of rho times E is at most epsilon. Again, the outer probability is over the randomness of the algorithm and over E sampled from the distribution D. Note that if this error requirement held for all E, with D ranging over all possible observables, that would essentially be tomography. Here, you want it to be true for most measurements, not for all measurements, and of course "most" is always measured with respect to the unknown distribution D. So, probably, meaning with probability 1 minus delta over E sampled from D, I should be able to approximate trace of rho times E well. Is the model clear? Good, let me continue. I've just summarized the model again: rho is unknown, D is an unknown distribution, somebody gives me pairs (E_i, trace of rho times E_i), where E_i is sampled from the distribution, and the goal is to produce a state sigma such that, for most E's, trace of rho times E and trace of sigma times E are close. Let's say the E_i come with explicit descriptions; I care only about the sample complexity again here. Good. So the natural question: tomography has sample complexity d squared. PAC learning is a weaker task than tomography; is it really much cheaper? Or is it maybe still quadratic in d, which is exponential in n? What's going on? Surprisingly, PAC learning is actually exponentially more efficient than tomography. This is the surprising result that Scott Aaronson proved in 2007: if you just take log of d many labeled examples, which is little n, the number of qubits rho acts on, then those many examples of the form (E_i, trace of rho times E_i) are good enough to produce the state sigma.
Let me give you a quick proof sketch of how he proves it. Most of the proof is classically inspired, and I think it's very neat to see what's going on. Good. So what does Scott do? He takes log d many examples of the form (E_i, trace of rho times E_i), and he just finds a state sigma that is consistent with all the observed expectation values. That's it. He picks k to be log d; he takes E_1 up to E_{log d}, with trace of rho E_1 up to trace of rho E_{log d}. He has these labeled examples explicitly in hand, and then the learning algorithm simply brute-forces over all possible quantum states sigma. This could be an exponential-time algorithm, but the sample complexity is only little n: you try to find a state sigma that is consistent with all the expectation values you just saw. And of course, the natural question is: why does this work? Why do little-n many examples suffice when tomography required exponentially many copies? The answer is classical in spirit. Earlier, when we were speaking about learning Boolean functions, the reason learning worked was VC theory; there is an analogue of VC theory for learning real-valued functions, and if you just plug that in, you get this little-n result. I'm going to summarize the proof here, but in the exercises you'll be proving most of it. Consider the set of functions f_rho, where f_rho maps script-E to [0, 1]. Remember, script-E is the set of all possible two-outcome measurements, and f_rho of E is just trace of rho times E, a number between 0 and 1: think of E as satisfying 0 at most E at most identity, so trace of rho times E is between 0 and 1. Good.
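To make the "find any sigma consistent with the observed statistics" step concrete, here is a toy sketch for a single qubit (my own illustration, not Aaronson's actual procedure): parametrize sigma by its Bloch vector, solve for a vector consistent with the observed expectation values, and check that it predicts fresh measurements outside the training set.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_direction(rng):
    """Uniformly random unit vector in R^3."""
    v = rng.standard_normal(3)
    return v / np.linalg.norm(v)

# Unknown single-qubit state, parametrized by its Bloch vector r:
# rho = (I + r . pauli) / 2. For a two-outcome measurement
# E = (I + n . pauli) / 2, we have tr(rho E) = (1 + r . n) / 2.
r_true = 0.9 * random_direction(rng)  # a mixed state inside the Bloch ball

# Training set: measurement directions n_i with labels tr(rho E_i).
train_dirs = np.array([random_direction(rng) for _ in range(8)])
train_vals = (1 + train_dirs @ r_true) / 2

# "Find a consistent sigma": solve the linear system n_i . r = 2 y_i - 1
# for the Bloch vector r, in the least-squares sense.
r_hat, *_ = np.linalg.lstsq(train_dirs, 2 * train_vals - 1, rcond=None)

# Generalization check on fresh measurements outside the training set.
test_dirs = np.array([random_direction(rng) for _ in range(100)])
max_error = np.abs((test_dirs @ (r_hat - r_true)) / 2).max()
```

With noiseless labels and a handful of generic directions, the fit recovers the Bloch vector essentially exactly; the content of the PAC theorem is that, even with noise and in high dimension, about n labeled examples suffice for generalization on most measurements.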
So consider the set of functions, and in particular this concept class script-C: the set of all functions f_rho, one for every quantum state rho, mapping script-E to [0, 1]. For every unknown quantum state rho you can associate the function f_rho, and think of this as a real-valued concept class. When we were talking about learning Boolean functions, we were learning concept classes where the concepts mapped into {0, 1}; here they map into the real interval [0, 1], because the function is a trace expectation value. Now, it is well known that just as VC dimension captures the complexity of learning Boolean functions, the fat-shattering dimension captures the complexity of learning real-valued functions. So if I think of this as my concept class, and you give me labeled examples from it, knowing f_rho is real-valued, then computing the fat-shattering dimension of this concept class tells you how many examples suffice to learn script-C well enough. I'm not going to define fat-shattering dimension here; it's a very close analogue of VC dimension, and you'll look at the definition and examples in the exercise session. Then what remains to show is this: I claimed the fat-shattering dimension gives the sample complexity of learning this real-valued function class, and learning this real-valued function also translates to actually producing a sigma that is close to rho in the PAC sense. What remains is to upper bound the fat-shattering dimension of this concept class, which is a priori arbitrary, because rho ranges over all possible quantum states.
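Since the exercises will work with it, let me record the standard definition here for reference (textbook form, not taken from the slides):

```latex
% Fat-shattering dimension of a class F of functions X -> [0,1], at scale gamma.
% A set {x_1, ..., x_m} is gamma-fat-shattered by F if there exist
% witness levels r_1, ..., r_m in [0,1] such that for every subset
% S of {1, ..., m} there is an f in F with
%   f(x_i) >= r_i + gamma   for i in S, and
%   f(x_i) <= r_i - gamma   for i not in S.
\operatorname{fat}_{\mathcal{F}}(\gamma)
  = \max\{\, m : \exists\, x_1, \dots, x_m
      \text{ that are $\gamma$-fat-shattered by } \mathcal{F} \,\}.
```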
And Scott Aaronson observed that you can use ideas from random access codes to show that for this concept class of quantum states, the fat-shattering dimension is at most about log of the dimension, which is little n. So this shows that if you obtain about n labeled examples from this real-valued concept class, and do the time-expensive step of finding a sigma that matches all the observed expectation values, you will produce a state sigma satisfying the PAC guarantee with just about n labeled examples. You will be proving these things in the exercise session, along with the definition of fat-shattering dimension. But this was the main insight of the algorithm, and it shows that you can do PAC learning, a much weaker task than tomography, exponentially more efficiently. Any questions? So, we saw that PAC learning can be done with log d, which is n, many examples, and as I said, this is exponentially better than tomography: tomography requires sample complexity 2 to the n, and here it is just order of little n. That's very neat. The one unfortunate part is that the time complexity is expensive. How is E_i given? Maybe E_i is given explicitly, or even succinctly; either way, the first step of the algorithm requires you to go over all possible sigmas and find one consistent with the observed trace values, and searching over all quantum states, which are d-by-d matrices, takes time exponential in n. But maybe the one takeaway from the last two slides is this: just as VC dimension captures learning Boolean functions in the PAC model, fat-shattering dimension captures learning quantum states. Good. Let's see. So that's PAC learning, and now we've learned PAC learning.
One annoying thing here, which people of course complain about even in classical PAC learning, is the distribution. I find this distribution slightly unnecessary: why should examples come from a distribution, and why should I only have to do well on that distribution? You can strip this assumption away, and that naturally leads to the problem of shadow tomography. This was again introduced by Aaronson, a few years back. The goal is the following. As I said, in PAC learning you estimated rho on a distribution over measurements; what if we just want to learn rho on a fixed set of measurements? So again, here is my quantum algorithm. It receives copies of the quantum state rho, but also somebody specifies a set of measurements E_1 to E_k. The goal of the algorithm is just to learn trace of rho times E_1, trace of rho times E_2, and so on: it needs to output an estimate of trace of rho times E_i for all i from 1 to k. You could think of this as a version of PAC learning where I need to do well on every element of a fixed set of measurements; there is no distribution, just a fixed set of measurements on which I want to do well. To be a little more formal: given E_1 to E_k, how many copies of rho suffice to estimate trace of rho E_1 through trace of rho E_k? There are two trivial algorithms here. One is tomography: since I'm given copies of rho, I can do standard tomography with sample complexity d squared. I take d squared copies of rho, do tomography, and as I said, tomography outputs a state sigma such that trace of sigma E_i minus trace of rho E_i is small for all i, in particular for this set script-E, and I can just output that. But this uses d squared copies, which is pretty expensive. The other trivial protocol is empirical sampling.
I just take 1 over epsilon squared copies of rho and measure E_1 on each, estimating the probability that the measurement E_1 accepts rho, which is trace of E_1 times rho. Then I repeat this for every E_i. The sample complexity here is k over epsilon squared: 1 over epsilon squared copies for estimating trace of rho E_1, another 1 over epsilon squared for trace of rho E_2, and so on up to E_k, so overall k times 1 over epsilon squared. This doesn't seem too bad, but it's still linear in the number of measurements, and if the number of measurements is exponential in n, the sample complexity looks bad again. So: there is a much better algorithm, an exponentially better algorithm, using ideas from online learning and PAC learning, that Aaronson came up with. He showed that you can estimate trace of rho E_1 through trace of rho E_k using only about log k times log d many copies of rho. This is exponentially better than both trivial algorithms: the trivial algorithm based on tomography takes d squared copies, while his protocol requires only polylog d many copies; the empirical-sampling protocol takes k copies, while his protocol uses only polylog k. So to estimate trace of rho times E_i for i equals 1 to k, it suffices to take poly(log k, log d) many copies of rho. This is a very surprising result, and I'm going to give a proof sketch of it now. Is the model clear? If there are questions, I'm happy to answer them now. Yeah, right: think of E_i as an operator between 0 and identity. I perform the two-outcome measurement {E_1, identity minus E_1} on rho, repeat it 1 over epsilon squared many times, and get an epsilon-approximation. So this protocol is slightly non-trivial, but I think the ideas are very interesting, so let me tell you what's going on. I've just restated the problem at the top again.
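The empirical-sampling baseline is easy to simulate classically, since measuring {E_i, I - E_i} on a copy of rho is just a Bernoulli coin with bias trace of rho E_i. A minimal sketch (the random rho and projector measurements here are placeholder assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, eps = 4, 10, 0.05

# Unknown state rho, and k two-outcome measurements E_i = |u_i><u_i|.
g = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
rho = g @ g.conj().T
rho /= np.trace(rho).real

def rand_proj(d, rng):
    u = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return np.outer(u, u.conj())

E = [rand_proj(d, rng) for _ in range(k)]
p_true = np.array([np.trace(rho @ Ei).real for Ei in E])

# Empirical sampling: each measurement of {E_i, I - E_i} on a fresh copy
# of rho is a Bernoulli(p_i) trial; ~1/eps^2 copies per observable, so
# k/eps^2 copies in total.
shots = int(np.ceil(1 / eps**2))
p_hat = np.array([rng.binomial(shots, p) / shots for p in p_true])
max_err = np.abs(p_hat - p_true).max()
```

The point of shadow tomography is precisely to beat this k/eps^2 total, getting all k estimates from only polylog(k, d) copies.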
You have E1 to Ek and copies of rho, and your goal is to estimate trace of rho E1 through trace of rho Ek. The main idea behind shadow tomography comes from communication complexity, from, I think, the early 2000s. The setup is the following. Alice has a quantum state rho. Bob has k operators E1 to Ek, and the goal is for Bob to output trace of rho times Ei, and only Alice can communicate to Bob. Notice this almost looks like shadow tomography in some sense: Alice has an unknown quantum state rho, Bob has the k operators E1 to Ek, and Bob needs to output trace of rho Ei. It's a slightly different setting, though. Only Alice has rho; Bob does not. So even if Bob wants to estimate trace of rho Ei, he cannot, because he has no idea what rho is. And Alice could in principle just send trace of rho E1 through trace of rho Ek, but that costs k numbers of communication. There is the other trivial protocol, with complexity d squared: Alice sends rho completely, and Bob just learns it. There is a pretty idea in communication complexity, which Aaronson introduced, that solves this communication task exponentially faster in both k and d. What does Bob do? Bob does the most naive thing. Initially he has no idea what rho is, so he guesses that rho is maybe the maximally mixed state, and he slowly, iteratively updates his guess of Alice's state. So Bob starts guessing Alice's state iteratively: initially sigma 0 is the maximally mixed state, because he knows nothing about rho, then sigma 1, and he keeps updating, and so on.
The hope is that after repeating this iteration T times, T updates, he arrives at a state sigma T where trace of sigma T times Ei is close to trace of rho times Ei. That's the goal. The question is how the update happens and how communication helps. What does Alice do? Alice keeps giving him useful information; think of Alice as a teacher. She knows what Bob's protocol is going to be, and she's going to help him update from sigma 0 to sigma 1 all the way to sigma T. The way she helps him is the following. She knows E1 to Ek explicitly, so she can tell him an index i, from 1 to k, for which trace of rho Ei minus trace of sigma Ei is greater than epsilon. Recall, the goal was for Bob to output an epsilon approximation of these quantities. So suppose there is an i on which Bob's current guess fails: say Bob begins with sigma 0, and trace of sigma 0 times E1 is very far from trace of rho times E1. Alice says: look at E1 — your guess of what my state is, is very far from the true value there. So she tells him: I will send you i, the index on which your current guess is not doing a good job, together with trace of rho times Ei. She can do this because she knows Ei explicitly, and she knows rho as well. What does Bob do? He updates his guess. He realizes the state he currently has, sigma i, is not doing a good job on Ei, so he needs to update it to a sigma i plus 1 that does well at least on Ei. The way he updates it is the following: he considers a two-outcome observable F that applies Ei on log n many copies of sigma i.
He knows sigma i explicitly, so he prepares log n copies of sigma i, applies the two-outcome measurement Ei on those log n copies, and accepts only if a constant fraction of them accepted. If it accepted, he takes the post-measurement state, traces out all but the first of the log n registers, and the first register is his sigma i plus 1. Yes, Alice knows the Ei's completely — maybe I wasn't clear. There are two trivial protocols. One: she sends rho explicitly, which takes d squared entries. The other: because she knows all of E1 to Ek, she could just send trace of rho E1, trace of rho E2, all the way to trace of rho Ek, but the communication cost there is k. I'm going to describe a protocol that's exponentially better than both. That's why she's able to send i together with trace of rho times Ei here. This update is slightly non-trivial, but the point is it's a way for Bob to update sigma i to sigma i plus 1, with the guarantee that sigma i plus 1 now does well on the POVM element that Alice said sigma i wasn't doing well on. They keep repeating this process iteratively, and they stop at a certain time T, which is the complexity parameter here: an update is made only when a communication is sent, and each communication is only polylog in k and d. A communication is made only when the error condition is met, so they stop at the time T when the condition is no longer met.
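The update in the talk is via postselection on log n copies. A closely related update with the same flavor — and, I believe, the one used in the later online-learning-of-quantum-states line of work by Aaronson and coauthors — is matrix multiplicative weights, which is easier to simulate. This is a minimal numpy sketch of that variant, not the postselection protocol itself; the function names, learning rate, and round cap are my own choices.

```python
import numpy as np

def mat_exp_herm(H):
    """Matrix exponential of a Hermitian matrix via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(w)) @ V.conj().T

def online_update(rho, Es, eps=0.1, eta=0.05, max_rounds=500):
    """Sketch of the Alice/Bob iteration via matrix multiplicative weights.

    Bob maintains a guess sigma, starting from the maximally mixed state.
    Whenever "Alice" reports an index i with |tr(rho E_i) - tr(sigma E_i)|
    > eps, Bob nudges his guess toward agreeing with E_i.
    """
    d = rho.shape[0]
    H = np.zeros((d, d), dtype=complex)      # accumulated update matrix
    sigma = np.eye(d) / d                    # sigma_0: maximally mixed
    for _ in range(max_rounds):
        # Alice looks for a measurement where Bob's guess is off.
        gaps = [np.real(np.trace((rho - sigma) @ E)) for E in Es]
        i = int(np.argmax(np.abs(gaps)))
        if abs(gaps[i]) <= eps:
            return sigma                     # Bob's guess passes every E_i
        # Bob's update: reweight sigma toward (or away from) E_i.
        H = H + eta * np.sign(gaps[i]) * Es[i]
        sigma = mat_exp_herm(H)
        sigma = sigma / np.real(np.trace(sigma))
    return sigma

# One-qubit demo: rho = |0><0|, measurements are the Z-basis projectors.
rho = np.diag([1.0, 0.0]).astype(complex)
Es = [np.diag([1.0, 0.0]).astype(complex), np.diag([0.0, 1.0]).astype(complex)]
sigma = online_update(rho, Es, eps=0.1)
```

The key point mirrored here is the mistake bound: progress is made only in rounds where some gap exceeds epsilon, and the number of such rounds can be bounded by polylog in d, independent of k.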
At some point Alice will say: you're doing a good job, your guess of my state satisfies all these conditions. So sigma T is the stopping point: at step T, Alice says your guess sigma T does well in estimating trace of rho Ei for all the Ei's. Yeah. Yes. Good. The point is Alice knows deterministically what procedure Bob is going to follow: she knows he's going to take log n copies and repeat until he accepts. OK, fine — I see what your question is. The first non-trivial thing Aaronson had to prove is that the probability of accepting is high: if you run this procedure, you accept with probability, say, two thirds or so. When he accepts, I think there is a deterministic argument, which I'm missing at the moment, that lets Alice track the protocol as well. Actually, give me a second — maybe there's another way to see it. The update is a postselection, that's correct, but you ensure that the postselection probability is very high, so that when you do a union bound it's good enough. For example, say the accept probability is 1 minus 1 over k or something like that. Maybe at some point Alice realizes she thought Bob accepted when he didn't — the protocol makes a mistake — but because the error per step is 1 over k, by a union bound the whole thing still works. I think that's the idea, but maybe I can check this and get back to you. Yes, yes, that's correct — this is the non-trivial part one needs to show. But somehow the main thing Scott shows is that whenever you do this update step, you always make sure that on E1 through Ek minus 1 you're still doing well.
And on Ek — the F you apply only changes the behavior on Ek; it essentially acts as the identity with respect to E1 through Ek minus 1, so you ensure you're still doing well on those, and only on Ek do you make sure sigma i plus 1 does a good job. Right, this is the non-trivial part one needs to show. Yeah, that comes in through the POVM here. That's right: Bob not only needs to know i, he also needs to know trace of Ei times rho. This POVM is where the magic is happening, and I'm somewhat hiding it under the rug — maybe I don't understand it completely — but I want to give you the high-level idea. The main thing Scott shows is that if you keep repeating this process, after T equals polylog d times polylog k rounds you stop. The communication in each round is anyway only log k bits, so the total communication is polylog d times polylog k. To answer your question: the trivial protocols cost d squared (Alice sends rho entry by entry) or k bits (Alice sends trace of rho times E1 all the way to trace of rho times Ek), but Aaronson shows you can do it in polylog d times polylog k many bits, and this already suffices for Bob to output a sigma T with the right guarantee. Yeah, right, this is where the magic is happening, and again it's in this POVM. You might think you just repeat this process log d many times, but the point is subtler: the number of rounds could be large, but in some rounds Bob isn't making a mistake at all. For example, on Ek he might be totally fine, and it's Ek plus 1 where he needs an update. So you need to count the number of mistakes he makes across E1 to Ek.
And the number of mistakes he makes is actually not too large — that's what one needs to show. Only when Bob makes a mistake, that is, when there's an i with trace of rho times Ei minus trace of sigma i times Ei greater than epsilon, does Alice send a communication. So T could be large, but in many rounds she just doesn't need to send anything; when she does send something, Bob is actually making progress. Yes, exactly, the total communication is polylog d: polylog d many rounds is the number of times she needs to communicate to Bob. That's right. The Ei's are arbitrary, exactly. So this is where the magic is happening: when you apply this collective observable, you're ensuring you keep doing well on many of the measurements at once. This is where the main technical part of the proof goes — showing that polylog d rounds suffice, which is non-trivial. I see what you're saying: because there are k POVM elements, you'd expect that Alice could, in the worst case, have to send k many messages of the form (i, trace of Ei times rho). But the non-trivial part of the entire result is that he shows polylog k times polylog d many bits suffice to reach the output condition. Actually, even classically, in the online learning setting, you can get the log k dependence — that's another result Scott has. You mean Bob has to do much more than — oh, I see your point. You're saying Alice might have to do more than this? I don't think so; this is his exact protocol. The protocol is exactly what Scott does, but the analysis is the non-trivial part, which I'm not sure about. Okay, good. So this is what is happening.
Bob takes log n many copies of sigma i and applies this observable F. What F is doing is ensuring that, say you're at the t-th stage, on E1 through Et minus 1 you still maintain the same expectation values you had, and you use the fact that you know what trace of Et times rho is. So he applies this observable F, which applies Et on log n many copies, he applies this POVM, and whenever he accepts, that's when he stops. When he stops, he just takes the first of these log n copies, and that's going to be his new state sigma t plus 1. Because we're talking one-way communication here — if you allow two-way, you could have done that, exactly. Possibly, I'm not sure. Possibly, but I feel not much more. Yes. They're supposed to be the same — oh, sorry. For example, here in the t-th stage it should have been sigma t times Ei; that's a typo. Yes, Bob can do this entire thing classically, that's right. Maybe let's discuss this offline, on the board; let me at least get through this protocol first. Good. So this was the communication protocol Aaronson had back in the day, based on communication complexity. Now, what we actually have to do is come up with a protocol for shadow tomography, not just for communication complexity. For that, there are two lemmas Aaronson proved on top of this communication protocol. The first is a quantum OR lemma, which says: given log k copies of an unknown quantum state rho, I can decide whether there exists a j for which trace of Ej times rho is large, or whether for all j, trace of Ej times rho is small.
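One way the OR lemma gets used is to pair it with a halving search that actually finds a heavy index j. Here is a toy sketch of that search; the `or_test` oracle is a hypothetical stand-in for the quantum OR lemma (which would itself consume log k copies of rho per call), so only the classical bookkeeping is real here.

```python
def find_heavy_index(Es, or_test):
    """Binary search for a j with large tr(E_j rho), given an OR test.

    `or_test(subset)` stands in for the quantum OR lemma: it reports
    whether some E_j in the given subset of indices has large
    tr(E_j rho).  Halving the candidate set each round finds such a j
    with only ceil(log2(k)) calls, so the copy cost stays polylog(k).
    """
    lo, hi = 0, len(Es)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if or_test(range(lo, mid)):      # heavy index in left half?
            hi = mid
        else:
            lo = mid
    return lo

# Toy oracle: pretend index 5 is the heavy one.
heavy = 5
oracle = lambda idxs: heavy in idxs
print(find_heavy_index([None] * 8, oracle))   # 5
```

In the real protocol each `or_test` call is a collective measurement on fresh copies of rho, and the errors of the log k calls are handled by a union bound.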
That is, either there is one j for which trace of Ej times rho is large, or for every j, trace of Ej times rho is small. So that's what the quantum OR lemma can do: take log k copies of rho and decide between these two cases. But it doesn't tell you which j it is. Aaronson fixes this with a binary search: he groups the POVM elements into two halves of k over 2 each, runs the OR lemma on each half, and recurses, repeating this log k many times. So using polylog k overhead he can find a j for which trace of Ej times rho is large. The overall sample complexity he gets is log to the 4 of k, times log d, divided by epsilon to the 5. Right. There have been quite a few works improving Aaronson's shadow tomography protocol, and the state-of-the-art algorithm we have is by Bădescu and O'Donnell, with sample complexity log squared k times log d divided by epsilon to the 4. Good. Let me mention one more thing — an interesting corollary of this Bădescu–O'Donnell protocol, which I think is very interesting: a corollary on quantum hypothesis selection. Right. So as I said, Bădescu and O'Donnell gave a shadow tomography protocol with sample complexity log squared k times log d divided by epsilon to the 4, and they have this interesting corollary, which says the following. Think of script E as my concept class, consisting of k quantum states rho 1 to rho k, and let sigma be an unknown quantum state. Given copies of the unknown quantum state sigma, the goal is to find the rho i nearest to sigma. Another way to put it — sorry, this should be an l — is to find an l such that the distance from sigma to rho l is within epsilon of the best approximation to sigma from the class.
So think of opt as the distance from sigma to the closest rho in the class. You need to output a rho l whose distance to sigma is at most opt plus epsilon. That's the task here: I give you copies of an unknown quantum state sigma; output an l from this class such that the distance between sigma and rho l is at most the best approximation error for sigma from this concept class, plus epsilon. That's the quantum hypothesis selection question. And — maybe this remark should have come later, but — the point is they gave a protocol for this hypothesis selection question with sample complexity log squared k times log d divided by epsilon to the 4. And if you could improve the log squared k in the protocol to log k, you could actually recover the tomography protocol with sample complexity d squared — without using all the representation theory techniques I mentioned in the first two slides. That, I think, is interesting. But apart from that, just this task — given copies of an unknown quantum state, find the state in your ensemble closest to it — is an interesting task on its own. Let me quickly sketch the protocol. The idea is pretty simple, actually. We know that for every i not equal to j, by the variational characterization of the trace norm, there exists an operator A_ij such that trace of A_ij times (rho i minus rho j) equals the trace distance between rho i and rho j. We know such an operator exists by the definition of the trace norm, and we know that applying this A_ij to the unknown state lets you distinguish rho i from rho j with bias equal to that trace distance. So the idea — and this is the key observation — is to run shadow tomography on the copies of the unknown state sigma using the operators A_ij.
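The hypothesis selection idea can be put into a toy numpy simulation. This is a sketch under assumptions of my own: I take A_ij to be the projector onto the positive part of rho_i minus rho_j (one optimal choice from the trace-norm characterization), and I cheat by evaluating tr(sigma A_ij) exactly, where the real protocol would get these k^2 numbers from shadow tomography; the minimum-distance selection rule is the standard Scheffé-style one, not Bădescu–O'Donnell's exact analysis.

```python
import numpy as np

def positive_part_projector(M):
    """Projector onto the positive eigenspace of a Hermitian M.

    This A achieves tr(A (rho_i - rho_j)) = trace distance of the two
    states, by the variational characterization of the trace norm.
    """
    w, V = np.linalg.eigh(M)
    P = V[:, w > 0]
    return P @ P.conj().T

def hypothesis_selection(sigma, rhos):
    """Scheffe-style minimum-distance selection (a sketch).

    Here tr(sigma A_ij) is computed exactly; in the real protocol these
    k^2 numbers come from shadow tomography on copies of sigma.
    """
    k = len(rhos)
    A = {(i, j): positive_part_projector(rhos[i] - rhos[j])
         for i in range(k) for j in range(k) if i != j}
    alpha = {ij: np.real(np.trace(sigma @ Aij)) for ij, Aij in A.items()}
    def score(l):
        # Worst-case disagreement of candidate l with the estimates.
        return max(abs(np.real(np.trace(rhos[l] @ Aij)) - alpha[ij])
                   for ij, Aij in A.items())
    return min(range(k), key=score)

# Demo: sigma is a noisy |0><0|; the class contains |0><0| and |1><1|.
rho0 = np.diag([1.0, 0.0]).astype(complex)
rho1 = np.diag([0.0, 1.0]).astype(complex)
sigma = 0.9 * rho0 + 0.1 * rho1
print(hypothesis_selection(sigma, [rho0, rho1]))
```

With noisy alpha estimates from shadow tomography instead of exact traces, the same selection rule gives the 3·opt + epsilon guarantee mentioned in a moment.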
Recall that there are at most k squared operators A_ij, since i and j run from 1 to k. So you run shadow tomography on copies of sigma with these k squared operators, and the sample complexity is still log squared k times log d over epsilon to the 4, up to constants. The output of shadow tomography is a set of estimates alpha_ij such that alpha_ij minus trace of sigma times A_ij is at most epsilon over 2 in absolute value. Then they just go over all the rho's in the concept class and try to find the l that minimizes the worst-case discrepancy: find a rho l such that the maximum over i, j of trace of rho l times A_ij minus alpha_ij is smallest. And with a little bit of analysis, you can show that the trace distance between rho l and sigma is at most 3 times opt plus epsilon. That's almost what we wanted, with this factor 3 here. So using shadow tomography, you can solve this hypothesis selection question: given copies of an unknown quantum state sigma and a class of quantum states rho 1 to rho k, you can find a state in the class that's nearly as close to sigma as the best one. Good. Let me mention classical shadows, and then I'll stop. There was a follow-up work by Huang, Kueng, and Preskill where they introduced this concept of classical shadows. One potential problem with shadow tomography is that the protocol depends on the measurements E1 to Ek. What if E1 to Ek are given only after you receive the copies of the quantum state and do your classical post-processing? Then we don't know whether shadow tomography still works. But these authors came up with another protocol, based on classical shadows, that says the following: I first give you copies of an unknown quantum state rho, and you create a classical shadow of rho efficiently.
Then you can use the classical shadow to compute expectation values of rho on arbitrary observables of your choice. So: I give you copies of an unknown quantum state, you perform some operation — a unitary, then a measurement — and you produce a classical shadow. Using these classical shadows, you then give me a set of observables, and I can estimate the expectation values of these observables with respect to the quantum state. Let me briefly describe the procedure to obtain these shadows. You get a copy of the unknown quantum state rho. You pick a random unitary Ui — think of it as a random Clifford — and apply it to rho, so the resulting state is Ui rho Ui dagger. Then you measure this quantum state in the computational basis and get an n-bit string bi out. The classical shadow is exactly Ui dagger applied to this computational basis state: Ui dagger |bi><bi| Ui. You repeat this T many times: you take T copies, apply a fresh random unitary to each, measure each, and get bit strings b1 through bT out. For each i you apply Ui dagger to |bi><bi| and get the state si — those are my classical shadows. And if each Ui is a random Clifford, this is also classically efficient: you can write the shadow down on a piece of paper. One way to view this procedure — taking rho, applying a random unitary, then performing this measurement — is as a quantum channel M, where the expectation is over the random unitary and the measurement outcome. So think of this quantum channel M that takes rho and produces the shadows s1 to sT; I write each as a quantum state, but you can just think of it as a bit string together with the label of the unitary. And maybe the more intuitive way to think about it is the following: we know that the expectation of si equals M applied to rho.
You can apply the inverse of M on both sides and pull it through the expectation, so you've written rho as the expectation of M inverse applied to si. Now M inverse is not a physical operation — it may not be completely positive — but as a linear map you can still apply it operationally. In particular, if you drop the expectation, morally you can view the empirical average of M inverse of si as approximately equal to rho. And that is essentially the idea. At this point, what they do is the following. I run the procedure and get my shadows s1 to sT. Then you give me an observable E. They view the whole operation as applying this channel M, and compute the average of trace of E times M inverse of si — you can compute this on your own; call it alpha E. Since the average of M inverse of si is approximately rho, alpha E is approximately trace of E times rho. This is true in expectation: in expectation these quantities are close to one another. But the question is, what should T be for alpha E to be epsilon-close with high probability? For this they use a median-of-means trick. They pick T to be of order log of 1 over delta, times the shadow norm of E squared, divided by epsilon squared — the shadow norm being a quantity depending on the observable E and the ensemble of unitaries. So: for the operator E I gave the learning algorithm, you compute its shadow norm, take that many copies of the unknown quantum state conjugated by random unitaries and measured, and form the median-of-means empirical estimate over them. They observe that with T many copies, the output alpha E minus trace of rho times E is at most epsilon, and they prove that this bound is tight.
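The whole classical-shadows pipeline is easy to simulate for one qubit. This is a minimal sketch, assuming random Pauli-basis measurements rather than full Cliffords (for a single qubit the inverse channel is then the known depolarizing inverse, M^{-1}(X) = 3X − tr(X)·I); the function names, sample counts, and the demo state are my own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Single-qubit Pauli eigenbases: columns are the measurement basis states.
X_BASIS = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Y_BASIS = np.array([[1, 1], [1j, -1j]]) / np.sqrt(2)
Z_BASIS = np.eye(2)
BASES = [X_BASIS, Y_BASIS, Z_BASIS]

def classical_shadow(rho, T):
    """Single-qubit classical shadows with random Pauli measurements.

    Each round: pick a random basis U, measure rho in it to get a bit b,
    and store the snapshot M^{-1}(U|b><b|U^dagger) with the inverse
    channel M^{-1}(X) = 3X - tr(X) I.  The snapshots average to rho.
    """
    snaps = []
    for _ in range(T):
        U = BASES[rng.integers(3)]
        probs = np.real([U[:, b].conj() @ rho @ U[:, b] for b in (0, 1)])
        b = rng.choice(2, p=probs / probs.sum())
        ket = U[:, b:b + 1]
        proj = ket @ ket.conj().T
        snaps.append(3 * proj - np.trace(proj) * np.eye(2))
    return snaps

def estimate(snaps, E, batches=10):
    """Median-of-means estimate of tr(E rho) from the shadow."""
    means = [np.real(np.trace(E @ np.mean(chunk, axis=0)))
             for chunk in np.array_split(snaps, batches)]
    return np.median(means)

rho = np.array([[0.8, 0.3], [0.3, 0.2]], dtype=complex)   # a valid state
E = np.diag([1.0, 0.0]).astype(complex)
snaps = classical_shadow(rho, T=3000)
print(estimate(snaps, E))   # close to tr(E rho) = 0.8
```

Note that the observable E enters only in post-processing: the same stored shadows can be reused for any observables handed over later, which is exactly the advantage over protocols whose measurements depend on E1 to Ek up front.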
And you can recover shadow tomography: if I give you E1 to Ek instead of a single fixed E, they have a log 1 over delta dependence, so you can just use a union bound and reuse the same samples again and again. With log k times the shadow norm squared divided by epsilon squared many copies of rho, I can produce an epsilon approximation of trace of Ei times rho for every i. So this is what they get. The one thing I'm hiding from you is what the shadow norm is. It's actually fairly complicated, and I'm not going to get into its definition — you can look at their paper — but they give some nice upper bounds on it. One upper bound on the shadow norm squared, for random Clifford measurements, is of order trace of E squared. In particular, if you pick E to be a rank-1 observable — if you just want to do shadow tomography with rank-1 observables — then trace of E squared is at most 1, and the sample complexity is just of order 1 over epsilon squared. There has been a follow-up paper that does the same thing for pure states, exactly, using the idea of classical shadows. Yes, it has the same complexity, d over epsilon. I think they also get dependence on the observables you care about: if you don't want to do complete tomography on the state, but only care about certain observables, they get better dependencies. But if you want to do exact tomography of a pure state, it's exactly that, you're right. Yes, exactly — I wanted to describe it, but it's exactly the T that makes this work. Anyway, I'm done. Thanks.