The grand goal in machine learning is to somehow enable AI systems to improve themselves. This has gotten a lot of traction recently, for many reasons. But how do you expect the machine to learn on its own? You expect it to interact with nature, maybe through supervised learning, or unsupervised learning, or reinforcement learning; or you just give it a bunch of data and hope that the data helps you train the machine. There are many reasons why machine learning is becoming so successful, or hyped, whatever you call it, these days: we have much better algorithms than we had a decade back; there is a lot more available data; and more importantly, not just that the data is available, we're able to crunch it much more efficiently, because we have much better computational power. So it's becoming very interesting. And not only is it interesting, it has also produced results. The most recent example is ChatGPT. But before that, deep neural networks were good for image recognition and then natural language processing, and then there was reinforcement learning, the things that DeepMind came up with. For the public, for example for my dad, machine learning became known when the game of chess was broken by IBM, or when DeepMind had this system, AlphaGo, that beat the world Go champion. And a couple of years back there was even this protein-folding competition (CASP) where DeepMind had an AI-based algorithm that did much better than the state-of-the-art classical algorithms at the time. All of these things were, in a way, based on reinforcement learning. Right, so what do we care about? What I've been thinking about for the past few years is quantum machine learning. And the question is: what can quantum computing do for machine learning?
So maybe some people might say that the first practical quantum advantage, on a practical problem, could be in machine learning. This is a possible candidate, but we don't know: we don't have a solid machine learning problem where, with the current number of qubits that we have, we can provide a quantum advantage. But maybe this is one direction, one potential application for near-term quantum devices. And indeed, for many interesting tasks we do have polynomial speed-ups: say for training Boltzmann machines, clustering, perceptron learning, support vector machines; for a lot of these we do have polynomial speed-ups. And we even had exponential speed-ups for interesting problems like principal component analysis, recommendation systems, linear system solvers, semidefinite program solvers, and things like that. So we had exponential speed-ups, or what we thought to be exponential speed-ups, but then, a few years back, there was this era of dequantization, where Ewin Tang came up with interesting quantum-inspired classical algorithms, which she might be talking about in the later part of this workshop: she gave a classical polynomial-time algorithm for recommendation systems in a certain framework of learning. And since then there have been many such algorithms; for most of the tasks for which we thought we had exponential speed-ups, if you make realistic classical assumptions as well, you can get polynomial-time classical algorithms. So there is no exponential quantum speed-up, though there could still be polynomial speed-ups: for principal component analysis, low-rank linear system solvers, semidefinite programs, and so on and so forth.
And that motivates the main theme of what I'll be talking about in the next few lectures: we need a regime in which we can prove provable quantum learning speed-ups. One of the problems here is that we gave very fast quantum algorithms, but we couldn't prove classical lower bounds; and there was a reason we couldn't prove classical lower bounds, because efficient classical algorithms existed. So I would like to look at learning theory frameworks in which quantum learning is provably much better than classical learning. In classical machine learning there is an exact field for this, called computational learning theory, which focuses exactly on this: the theory of machine learning. And then there is practical or heuristic machine learning, where people implement algorithms that could be efficient or not, we don't know, but they do work well in practice, and we don't know why yet. In these talks I'll be focusing on computational learning theory, that is, the theory behind some of these learning speed-ups. Let me first give you a brief overview of what I'll be talking about in the next few lectures before I start on this one. The first thing I'll be talking about is learning Boolean functions, encoded as quantum examples. And what I'll spend the rest of this lecture on is the hardness of PAC learning. PAC (probably approximately correct) learning is a very fundamental model of classical learning, and in this model we can show that quantum examples do not provide an advantage compared to classical examples. So this is a negative result.
Tomorrow I'll be talking more about positive results, where under the uniform distribution you can get exponential quantum speed-ups, and I'll be talking about some negative results as well. In the last two lectures I'll be talking about learning quantum states. The first topic there is tomography: if you're given copies of an unknown quantum state, can you learn the quantum state, in terms of both sample complexity and time complexity? Eventually I'll be talking about learning certain classes of interesting quantum states time-efficiently. Then I'll look at weaker models of learning quantum states, like PAC learning of states, classical shadows, shadow tomography, and so on; I'll try to motivate these models and give you sample-efficient quantum learning algorithms for them. And finally, in the final lecture, I'll be talking about statistical learning. This is a learning theory framework which is more near-term motivated, and people have been looking at statistical learning for the past few months or years, I would say. Good, let me begin. Let me first tell you what the learning theory framework is. Leslie Valiant, in 1984, gave this complexity-theoretic notion of what it means to learn. This was the underpinning of classical machine learning: what does it mean to learn? He gave a formal definition. Before I give the formal definition, let me first give you an intuition of what the framework is. Often the goal is to learn a class of functions. Think of script C as just a collection of functions, C1, C2, C3, and so on, that we would like to learn. And think of these functions as mapping some finite alphabet X to {0, 1}, just for simplicity. One example is that you could think of the Ci's as half-spaces.
I'll tell you what a half-space is in a second. So what does it mean to learn? The point is that script C is a collection of functions that everybody knows. We know that we are going to try to learn script C, and we know all the elements C1, C2, C3, C4, and so on, but somebody, an adversary, picks a C star from script C which is unknown to you, and gives you points X together with the labels C star of X. Let me give you an example first. Say you're working in R2. You have red points and green points; think of red as zero and green as one. The point is that there is a separating hyperplane here. It could be the one that separates the red from the green this way, or this way, or this way; it could be anything. All that we know is that there is a line that separates the green points from the red points. So somebody just gave us two points: this one is red and this one is green. Good, so we ask for more examples. Somebody tells us, okay, here's one more red point and here's one more green point. That narrows things down: we know that the separator between red and green needs to be somewhere in this direction; it cannot, for example, be like this. We say, okay, give me some more points. He tells us one more green point, then a lot more red and green points. And at some point we think we kind of understand what C star is: we think it's this line here. So we say the blue line here could be the separating hyperplane between the red points and the green points. In this example, the hyperplane, this blue line, was the thing we were trying to learn. We knew we were trying to separate the red points from the green points with a line, but we didn't know where this line sat.
We needed all these red examples and green examples, and eventually, once we got all of them, we realized, okay, this should be the line that separates them. This is a very informal way to understand the framework that Leslie Valiant came up with; let me now tell you concretely what the learning theory framework is. Before we begin, there is a certain set of definitions that I will need, and I will use them constantly in the next few lectures. Script C is a concept class: just a collection of Boolean functions on n bits, and it is known to everybody. We know the family we are trying to learn; we just don't know which element of the family we are trying to learn. The target concept is little c, which comes from this known collection. So think of script C as known to everybody in this room; little c is something only I know, none of you do, and your goal is to learn which little c I picked from the concept class script C. D is a distribution on n-bit strings. All these elements little c are functions on n bits: they map n bits to {0, 1}. The distribution could be known or unknown; I'll come to that in a bit. And finally, what is the example that you are going to get? I sample a point x from this distribution D, and I tell you (x, c(x)). In the previous slide, you could think of x as the point and c(x) as telling you whether that point was red or green. So I gave you several pairs (x, c(x)); on the previous slide that translated to x being points in R2 and c(x) telling you whether each point is red or green. So I give you several points x, and I tell you c(x), the label of the point.
And your goal, after obtaining many pairs (x, c(x)), is to learn the unknown little c, which only I know. The only thing you know, again, is that little c is not an arbitrary function: it comes from a nice structured class called script C. Good, let me re-emphasize the point. Script C is something the algorithm knows; it knows, for example, that it's C1, C2, C3, and so on. The target concept is something the algorithm doesn't know, but it's trying to learn this target concept with only the knowledge that it comes from the concept class script C. So the learner is trying to learn little c. And how is it given information about little c? First, an x1 is sampled from this distribution D, and the algorithm gets (x1, c(x1)). It then gets (x2, c(x2)), and it keeps getting these points. Eventually, after a certain set of points, up to (xT, c(xT)), the algorithm thinks: okay, I have enough information about c, through these labeled examples, to approximate c well enough. That's still not clear: what do I mean by approximating c well? Here, the algorithm says it can output a hypothesis h that is close to c. So again: the learning algorithm obtains many pairs (x, c(x)), where x is sampled from D and little c is unknown to the algorithm, and after obtaining these labeled examples it outputs a hypothesis h. Let me be a bit more formal about what a learning algorithm should do. For every little c in the concept class (the point is that little c could be chosen adversarially, the hardest possible concept, to make your life hard), and for every distribution D that is unknown to the learning algorithm, with high probability the algorithm should output a hypothesis h such that h and c are close to one another.
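To make the halfspace picture concrete, here is a minimal simulation of this learning loop, where the target halfspace c*, the Gaussian choice of D, and the sample sizes are all my own illustrative choices, not from the lecture; the learner used is the classic perceptron, one simple way to find a consistent halfspace from labeled examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target concept: c*(x) = 1 iff w_star . x >= 0 (illustrative choice).
w_star = np.array([1.0, -2.0])
c_star = lambda X: (X @ w_star >= 0).astype(int)

# Labeled examples (x, c*(x)) with x ~ D; here D is a standard Gaussian on R^2.
X_train = rng.normal(size=(200, 2))
y_train = c_star(X_train)

# Perceptron: mistake-driven updates until (approximately) consistent with the data.
w = np.zeros(2)
for _ in range(100):                      # passes over the training set
    for x, y in zip(X_train, y_train):
        pred = 1 if w @ x >= 0 else 0
        if pred != y:
            w += (2 * y - 1) * x          # push w toward the mislabeled point

# "Close to c": error of the hypothesis h on fresh points from the same D.
X_test = rng.normal(size=(10000, 2))
err = np.mean((X_test @ w >= 0).astype(int) != c_star(X_test))
print(f"empirical error of h: {err:.3f}")
```

The error is measured under the same D that generated the training set, exactly as the PAC requirement demands.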
What do I mean by close? Suppose I give you a new point sampled from this distribution D, an x_{T+1}. What is the probability that your hypothesis h does a good job of classifying this new point? That is what a learning algorithm should do: if I give you a new point outside your training set, what is the probability that the function h you output agrees with the true c(x)? This is the goal of a learning algorithm. Let me conclude this slide, and if you have any questions, we can start off there. Sure, good question. Let me come to that in a second. So, sample complexity. In sample complexity, the only complexity measure we care about for the algorithm script A is the number of labeled examples it saw. For example, if I saw 100 examples and that was good enough to satisfy this requirement, then the sample complexity is 100. The next thing I care about is time complexity: the number of time steps used by the algorithm, maximized over the hardest concept in the concept class and the hardest distribution. So, coming to your question: when we talk about sample complexity, we don't care about how we represent c or how script C is stored. All we care about is the number of labeled examples that we see, and these labeled examples are each n + 1 bits: x is an n-bit string, and c(x) is 1 bit. The number of labeled examples we see is the metric we care about in sample complexity. Little c? Script C is known publicly to you; you know everything about script C. You explicitly know, for example, that these are parity functions, or DNF formulas: you explicitly know the class of functions that I care about. Maybe what will answer your question is this: when I talk about time complexity, how should I output h?
Then you would like a succinct representation of h, so that it can be a true approximation of c; and I will come back to succinct representations after a while. Yeah. Right. Ah, good, that's a good question. The point is, let's say D was a point distribution. It keeps giving me the same (x, c(x)) again and again: I keep getting the same x. But notice the requirement on the algorithm: the algorithm only needs to do well under the same D. So the new point that I'm going to get is always going to be in the training set, and then I'm always going to be correct. No, that's a good point. The point is the distinction between improper learning and proper learning: in improper learning, h can be an arbitrary function. Right. Yeah, that's a good point. For this lecture, I'm going to assume that D is efficiently sampleable, and in the next lecture I'll be talking about D just being the uniform distribution. Right. So the two complexity metrics I care about in this talk are sample complexity, that is, the number of pairs (x, c(x)) the algorithm saw, and time complexity, the total time the algorithm A takes. For this lecture I might not talk about time complexity; I'll just be talking about sample complexity. In the next lecture we might come to time complexity. OK, now let's talk about quantum PAC learning. Everything so far was classical, so what's quantum PAC learning? Of course, the first thing is that the learning algorithm is quantum. Great. And there are two ways to feed it: I can just give it classical data and hope for the quantum algorithm to do something, or I can also give it quantum data, and that's the model I'll be talking about here. Bshouty and Jackson, back in '95, introduced this model of a quantum superposition, a quantum labeled example: it is a superposition over the pairs (x, c(x)).
So this is an n + 1 qubit state, and the pair (x, c(x)) has amplitude square root of D(x). Good, so this is a quantum example. The first question is: is this at least as strong as a classical example? And it is. Why? Because I can just choose to measure it. A quantum learning algorithm can choose to behave classically: it gets such a quantum example, measures it, and obtains (x, c(x)) with probability the square of this amplitude, which is D(x). And that is precisely how the classical learning model was defined: somebody samples x from a distribution D and gives you (x, c(x)). So a quantum example is at least as powerful as a classical example, and the question is: can you do more? Let me state the quantum learning model again. You have a quantum algorithm here. It knows everything about script C; it doesn't know anything about the target concept. The only difference between quantum learning and classical learning is that the quantum learner is given many quantum copies of this form. The goal is still the same: to output a little h which is close to c, with high probability. Yeah? You're asking whether we're going to compare this with the classical analogue? Exactly, exactly: compare the two things. I don't think I'll be talking about that, but yes. For sample complexity, and for most of learning, we don't care about the cost of preparing this state. There are some works that have looked at having access to a unitary that prepares this state, in which case you can invert the unitary and things like that; but for this talk I'll just be looking at: given copies of this quantum state, is it useful or not? Some of you might think, OK, this is not so great, because the quantum learner keeps getting the same quantum example again and again; it doesn't seem so useful. It might not be in some cases, and it is in some cases.
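As a sanity check on the definition, one can write the quantum example state as an explicit amplitude vector and confirm that measuring it in the computational basis reproduces a classical labeled example. A small numpy sketch, where the parity concept and the uniform choice of D are just my illustrative choices:

```python
import numpy as np
from itertools import product

n = 3
xs = list(product([0, 1], repeat=n))          # all n-bit strings

# Hypothetical target concept: parity of the bits (illustrative choice).
c = lambda x: sum(x) % 2

# Distribution D on {0,1}^n; uniform here, but any distribution works.
D = np.full(2**n, 1 / 2**n)

# Quantum example |psi> = sum_x sqrt(D(x)) |x, c(x)>, an (n+1)-qubit state:
# first n qubits hold x, the last qubit holds the label c(x).
psi = np.zeros(2**(n + 1))
for i, x in enumerate(xs):
    idx = (i << 1) | c(x)                     # basis index of |x, c(x)>
    psi[idx] = np.sqrt(D[i])

# A computational-basis measurement yields (x, c(x)) with probability D(x),
# so the quantum example can always be "degraded" into a classical one.
probs = psi**2
outcome = np.random.default_rng(1).choice(2**(n + 1), p=probs)
x_bits, label = outcome >> 1, outcome & 1
print(x_bits, label)
```

The last two lines are exactly the "measure and behave classically" strategy from the lecture.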
So even this very simple model is sometimes more powerful than classical learning. Good. The motivating question for this talk is: do quantum examples give an advantage for PAC learning? Again, you need to do well for every concept in the concept class and for every distribution: given a point outside the training set, with high probability you should be able to classify it well. And the question is: is having these coherent superposition examples better than just having labeled examples? Is the quantum learning model clear? Good. Let's see. So there is this fundamental dimension in classical learning theory, called the VC dimension. I give you a concept class, and you can always associate this combinatorial parameter with it; and this combinatorial parameter characterizes classical PAC learning. Let me tell you what this parameter is before we go ahead. Script C, again, is a collection of Boolean functions that map n bits to 1 bit. Now think of M as follows: I take this collection of functions and write down a table for it, where each row is the truth table of one little c. The truth table of each little c has size 2 to the n: it is defined on n-bit strings, and there are 2 to the n of them, so I write down the entire truth table of little c as a vector of size 2 to the n, and stack these one below the other. The first row of this matrix is c1, the first concept in the concept class; the second row is c2; and so on. Since there are |script C| many concepts, the number of rows is the size of script C, and the number of columns is 2 to the n. Good. Defining the VC dimension is slightly hard, but I'll also give you a pictorial interpretation of what it is. So what is the VC dimension?
The VC dimension says: OK, you have this huge table in front of you. Find the largest set of columns in this matrix such that, if you project onto those columns, you see all possible d-bit strings. For example, find the largest d such that if I look at, say, columns 1, 7, 11, 3 (that's four columns) and project the entire truth table onto these four columns, I see everything from 0, 0, 0, 0 to 1, 1, 1, 1. That is the VC dimension, and these d column indices are said to be shattered by script C. Let me give you an example. Think of this concept class as having nine concepts in it, defined on two bits, so the truth table size is four. Now, my claim is that the VC dimension is two, and that it is achieved by columns 1 and 3. So let's look at columns 1 and 3. You need to see 0, 0; 0, 1; 1, 0; 1, 1. And you do see all four instantiations, so the VC dimension is two. You could try to check whether the VC dimension is three; you wouldn't be able to. Yes, strictly speaking this example shows the VC dimension is at least two. OK, let's look at another example. Say I pick nine concepts again, but a different concept class, and my claim is that the VC dimension is three here, witnessed by columns 2, 3, and 4. What should I check? Whether 0, 0, 0 is here: it is. 0, 0, 1 is here. 0, 1, 0, here. 0, 1, 1; 1, 0, 0; 1, 0, 1; 1, 1, 0; 1, 1, 1. The point is that all possible three-bit strings are found in these three columns, so the VC dimension is three. So again: I give you a concept class script C. Write down the truth table of the first concept as the first row, the second concept as the second row, and so on, writing down each concept's entire truth table.
And now just find the maximum d such that, if you concentrate on some d columns (those three columns here, or those two columns in the earlier example), you see everything from 0, 0, 0 to 1, 1, 1. If the VC dimension is d, you should see all possible d-bit strings. That is how the VC dimension is defined. The definition should be sort of clear; you can again work through the example. So this is the VC dimension. No, no, no: just arbitrary d columns in this matrix, they don't need to be contiguous. For example, here the columns are 1 and 3. Oh, really? Yeah. You see some of the three-bit strings there, but there is no 1, 0, 0: in columns 1, 3, and 4, you can't find a 1, 0, 0. Good. So I have just defined the VC dimension. Now there is something called the fundamental theorem of PAC learning; this is the backbone of the field, and if you open any computational learning theory textbook, they will always start with the fundamental theorem of PAC learning, which states the following. Suppose a concept class has VC dimension d. A very seminal result of Blumer, Ehrenfeucht, Haussler, and Warmuth showed that for every PAC learner, the sample complexity has to be at least d over epsilon: if you have a learner that is epsilon-good, it needs sample complexity at least d over epsilon, where d is the VC dimension. So given a concept class, the VC dimension divided by epsilon is a lower bound on the sample complexity of PAC learning. They also gave an upper bound that was only a log factor away, and it was actually a long-standing open question whether we could get rid of the log factor. Only in 2016 did Steve Hanneke show that there also exists, for every concept class, a learning algorithm that uses only order d over epsilon examples and learns the concept class in the PAC model. So in some sense, this lower bound and upper bound are exactly tight up to constants.
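The column-projection definition lends itself to a direct brute-force check. Here is a sketch; the 9-concept matrix below is my own small example, not the one from the slides:

```python
import numpy as np
from itertools import combinations, product

def vc_dimension(M):
    """Brute-force VC dimension of a concept class given as a 0/1 matrix M:
    rows are concepts (truth tables), columns are points of the domain.
    Returns the largest d such that some d columns are shattered, i.e. the
    rows restricted to those columns realize all 2^d bit patterns."""
    n_cols = M.shape[1]
    best = 0
    for d in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), d):
            patterns = {tuple(row) for row in M[:, cols]}
            if len(patterns) == 2**d:          # all d-bit strings appear
                best = d
                break                          # found a shattered d-set
        else:
            return best                        # no d-set shattered: stop
    return best

# Example in the spirit of the lecture: nine concepts on a 4-point domain.
M = np.array([[0, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 1],
              [1, 0, 1, 1],
              [1, 1, 1, 1]])
print(vc_dimension(M))   # -> 2: columns 0 and 2 are shattered, no 3 columns are
```

The exhaustive search is exponential in the domain size, which is fine for these toy tables; it is only meant to make the definition mechanical.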
So, d over epsilon. Forgetting the log(1/delta) factor for the time being, and even thinking of epsilon as a constant: d, the VC dimension, characterizes the exact PAC sample complexity. I give you a concept class, and you know you need to obtain on the order of VC-dimension many labeled examples to learn the unknown concept in the concept class. Good. So the next question is: OK, we are done with classical PAC sample complexity; what about quantum? The classical upper bound carries over to the quantum setting trivially, because the quantum learner can just choose to always measure the examples: it gets classical labeled examples and runs the classical algorithm of Steve Hanneke. Good, so the upper bound is done. For the lower bound, Atıcı and Servedio in 2004 looked at this problem, and they gave a lower bound of square root of d over epsilon. So they showed that the quantum sample complexity is at least square root of d over epsilon, which left it plausible that quantum examples could give a quadratic speed-up. In quantum computing we often see Grover-type speed-ups, which are quadratic, so they left open the question of whether, for PAC learning, this very general framework of learning, quantum examples could give a quadratic speed-up. That was the lower bound they proved. And we improved this lower bound in 2018: we showed a tight lower bound of d over epsilon, matching the upper bound that we have from classical learning. So we showed a matching quantum lower bound: even given quantum examples, you cannot get any advantage for PAC learning.
So in this model, where I have quantum examples, which are potentially stronger, fed into a quantum computer that can make a giant entangled measurement on many quantum examples at once, there is no advantage compared to classical PAC learning, at least in terms of sample complexity. We gave two proof techniques: one based on information theory, and one based on the pretty good measurement. The first one is very simple and nice; it gave us nearly tight bounds at the time, and only a few months back there was a paper that made that approach completely tight. So the information-theoretic approach could also have been used to prove the tight lower bound. But since the pretty good measurement approach is interesting, and it gave us a tight bound back then, that is the bound I'll be talking about here. And more interestingly, beyond the takeaway that quantum and classical examples are equivalent here, I would like you to take away that the pretty good measurement, which will be the main toolbox for proving this lower bound, is a very interesting tool in its own right: in learning theory in general, pretty good measurements are a very useful technique for proving results. And your exercise session will revolve around the pretty good measurement. So what is the pretty good measurement? It's a tool for state identification. Think of an ensemble: a collection of quantum states psi_1 through psi_m, together with an associated probability distribution p_1 through p_m. So I have this ensemble in mind; with probability p_z, I sample psi_z, and I give it to you. The question is: identify z. You know everything, psi_1 to psi_m, and you also know the distribution a priori. What you don't know is which state I picked from the ensemble. The goal is: what is z?
So this is just a simple state identification problem, and of course I could give you t copies of the quantum state, with the goal still being to identify z. Finding the optimal measurement could be complicated. It is a semidefinite program, but it could still be complicated: maybe its structure is not efficiently implementable, and you don't know what the structure is or how to come up with it given an ensemble; that might not be easy. But there is always the pretty good measurement. Although the optimal measurement could be complicated, the pretty good measurement is pretty good in the following sense. It has the following POVM operators (this is slightly messy to understand, but you'll come back to it in the exercise). Think of rho as the ensemble state: the mixture of the states in the ensemble, weighted by the probabilities. Then there is a measurement operator M_z for every z, so M_1 through M_m: little m many POVM operators, each given by these POVM elements built from rho. And what is the success probability of the pretty good measurement? It's just this: with probability p_z I sample psi_z and give it to you, and the probability that I do well is the probability that M_z is the outcome when I perform this POVM on the state psi_z. So this is the average success probability. Good. And here is the crucial property of the pretty good measurement that I need. Suppose, for the state identification problem where psi_z is sampled according to p_z and given to you, the optimal success probability of identifying z is P_opt. So I give you copies of this quantum state, and the best you can do at identifying z is P_opt. Then clearly this is true.
The pretty good measurement is just one measurement among all possible POVMs, so clearly its success probability is at most that of the optimal measurement. But it's not much worse: it's only quadratically worse. So if you want to do state identification and you don't know what the optimal measurement is, you can always use the pretty good measurement, which has a fixed set of measurement operators. I give you the state identification problem; you know everything, psi_1 to psi_m; you can construct these POVM operators on your own, since you know rho and hence M_1 through M_m; and, given copies of the state, you just implement this POVM and do almost as well as the optimal measurement. The optimal measurement, again, could be very complicated, but the pretty good measurement is at least easy to state, and it's succinct. Any questions about the pretty good measurement? Why could the optimal measurement be hard to find? Think of psi_1 to psi_m as having arbitrary structure in them; finding the exact structure needed to distinguish, say, psi_1 from psi_2 could be hard. There is no concrete reason why it's hard, but it could just potentially be hard. Yeah. Good. Yeah, as I was talking I realized I made a mistake there: technically, if I give you psi_z to the tensor power t, you should think of all these psi_z as the t-fold tensor products. You will need this in the later lectures, where it will become clear again. Good. Again, you will verify in the exercise session that this is indeed a POVM. No. Yes. I don't think we know anything more, because the ensemble could be literally arbitrary; all I know is that you can express the optimal measurement as the solution of an SDP, and I don't know that you know anything more than just solving it and finding it. Yeah. Oh, that's a good question.
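To make the construction concrete, here is a small numpy sketch of the pretty good measurement for a random pure-state ensemble, where the dimension, states, and priors are arbitrary illustrative choices. It builds the standard PGM elements M_z = p_z ρ^{-1/2} |ψ_z⟩⟨ψ_z| ρ^{-1/2}, checks that they form a POVM, and computes the average success probability:

```python
import numpy as np

rng = np.random.default_rng(7)

def rand_state(d):
    """A random pure state in C^d (illustrative ensemble members)."""
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

d, m = 3, 3
psis = [rand_state(d) for _ in range(m)]
p = np.array([0.5, 0.3, 0.2])                 # priors p_z

# Ensemble state rho = sum_z p_z |psi_z><psi_z|
rho = sum(p[z] * np.outer(psis[z], psis[z].conj()) for z in range(m))

# rho^{-1/2}, taken on the support of rho.
evals, evecs = np.linalg.eigh(rho)
inv_sqrt = evecs @ np.diag([e ** -0.5 if e > 1e-12 else 0.0 for e in evals]) @ evecs.conj().T

# PGM elements M_z = p_z rho^{-1/2} |psi_z><psi_z| rho^{-1/2}
Ms = [p[z] * inv_sqrt @ np.outer(psis[z], psis[z].conj()) @ inv_sqrt for z in range(m)]

# Average success probability: P_pgm = sum_z p_z <psi_z| M_z |psi_z>
P_pgm = sum(p[z] * np.real(psis[z].conj() @ Ms[z] @ psis[z]) for z in range(m))
print(f"P_pgm = {P_pgm:.3f}")
```

The key property quoted in the lecture is P_opt ≥ P_pgm ≥ P_opt², so this fixed, explicitly constructible measurement is at most quadratically worse than the (possibly complicated) optimal one.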
Maybe. Yeah, that's a good question — I don't actually know. I know when the pretty good measurement is the optimal measurement, but when there is a genuinely quadratic gap, that's a good question that I haven't thought about. Good. So the takeaway, at least from the first part of the slide: if you are solving a learning task and you want to solve any form of state identification, finding the optimal measurement may be hard, but you can use this concrete set of POVM elements and it could be useful. And again, in the exercise session there will be three questions, and one of them is to use this on a simple toy problem: you can get a log n speedup on the coupon collector problem just by using the pretty good measurement. Good. So, OK, fine. We had this interlude, and the question is: how does learning relate to identification? So far we only talked about learning approximately and so on; I never spoke about exact state identification. So recall the task again. In quantum PAC learning, I give you the t-fold tensor power of the quantum example state, built from the unknown distribution D and the concept c, and the goal is to learn c approximately. And the goal now is to prove the result stated earlier: the sample complexity of PAC learning is at least d/epsilon, the VC dimension divided by epsilon. Good — this is where the technical meat of the proof goes. I'm going to give you a proof sketch now, and in the exercise you will actually prove this line by line. Good. So we have a concept class, script C — just a collection of Boolean functions.
And say we know its VC dimension is d+1, and that it is shattered by the columns s_0 to s_d. Like I said, PAC learning is a strong model in the sense that you would like the learning algorithm to do well for every distribution. So in order to prove hardness, I can just pick a hard distribution of my own: I say, this is my hard distribution, and my quantum learning algorithm must do well under it. How do I pick it? The hard distribution puts almost all its mass on the first point, and just epsilon/d mass on each of the remaining d points. So it is a distribution on d+1 points: mass essentially 1 - epsilon on s_0, and essentially epsilon/d on each of the last d points. That is my hard distribution. Now I need the concept of an error correcting code, so let me go over it a little slowly. What's a good error correcting code? It's just a function E that maps k bits to d bits, and let's pick k to be approximately d/4, just for simplicity. We call the code good if it satisfies the following property: for every pair of distinct k-bit strings y and z, the Hamming distance between the d-bit strings E(y) and E(z) is at least d/8. So E(y) is a d-bit string, E(z) is a d-bit string, and if I compare them position by position, at least a constant fraction of the positions differ — and this should hold for every pair y, z in {0,1}^k. That's what a good error correcting code is. And the number of codewords — the number of elements in the image of E — is 2^k.
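Such codes are easy to find: random linear codes have large minimum distance with good probability (a Gilbert-Varshamov-style fact). Here is a brute-force sanity check for tiny parameters — my own illustration, not part of the proof:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def random_linear_code(k, d):
    """E(y) = y @ G mod 2 for a random k x d generator matrix G."""
    G = rng.integers(0, 2, size=(k, d))
    return lambda y: np.asarray(y) @ G % 2

def min_distance(E, k):
    """Minimum Hamming distance over all pairs of distinct messages.
    k is tiny here, so we just check all pairs directly."""
    ys = list(itertools.product([0, 1], repeat=k))
    return min(int(np.sum(E(np.array(y)) != E(np.array(z))))
               for y, z in itertools.combinations(ys, 2))

k, d = 4, 16   # k ~ d/4, as in the lecture
# retry until distance >= d/8; a random code achieves this with high probability
while True:
    E = random_linear_code(k, d)
    if min_distance(E, k) >= d // 8:
        break
print("found a good code with min distance", min_distance(E, k))
```

For these parameters d/8 = 2, and a union bound shows a random 4 x 16 linear code misses that distance only with probability well under 1%.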
But since k is essentially d/4, 2^k is 2 to the order d. So, OK, here is the proof strategy. I'm going to pick 2^k concepts c_1 to c_{2^k} in my concept class, all satisfying the following property: c_z(s_0) = 0, and c_z(s_i) equals the i-th bit of the codeword E(z). This is slightly technical, so let me go over it pictorially. Again, script C is the concept class, and the VC dimension is d+1, so there are d+1 shattered columns. If I project the concepts onto these columns, I get all possible (d+1)-bit strings. So we know there is a rectangle — script C restricted to these d+1 columns — containing every possible (d+1)-bit string, and my goal is to find a collection of concepts c_z satisfying the property above, where the code E is fixed a priori. So, for example, consider this concept class: script C is the collection of all its concepts, written out as a table, shattered on the d+1 columns s_0 to s_d. Because it is shattered, I know for a fact that projecting onto these columns gives everything from 0^{d+1} to 1^{d+1}. What I'm saying is: among these 2^{d+1} rows, first restrict to the rows whose entry on s_0 is 0 — always possible, since every pattern occurs — and then pick the red rows in which the remaining d columns spell out a codeword. So now I just need to look at the codewords of the error correcting code, and there are 2^k of them: E(0^k), and so on, all the way to E(1^k).
We know these 2^k codewords, and once we know them, we just look at the concepts in the shattered set that correspond to them — there are 2^k of them. So among the concepts realized on the shattered set, I pick the 2^k that correspond to codewords. And I know for a fact that these codeword patterns occur in the table, because the set is shattered: the shattered set realizes all bit strings from 0^{d+1} to 1^{d+1}, so any given (d+1)-bit string must occur — that, by definition, is the meaning of shattered. I'm just saying: instead of picking all 2^{d+1} concepts realized on the shattered set, pick only 2^k of them, which is essentially 2^{d/4}, satisfying the codeword property. And the only property of the code I need is that the distance between the red rows is at least d/8. You will again be proving this line by line in the exercise session. But at this point we are almost done — this was really the main idea. First pick a hard distribution on the shattered set; then take a good error correcting code, so the distance is at least a constant fraction of d; then pick the concepts c_1 to c_{2^k} in the concept class that satisfy the property from the previous slide. And the observation is: learning c_z approximately — because in PAC learning you don't need to learn exactly, only approximately — is exactly equivalent to state identification. So I have reduced the PAC learning task to a question of state identification, and as I said, you can always attack state identification with the pretty good measurement. That's what we're going to do. Good. So we have reduced the approximate learning task to exact state identification.
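The selection step above can be made concrete with a toy sketch. This is my own illustration with a hard-coded toy code (hypothetical, not the code used in the paper): treat the projection of the concept class onto the shattered columns as the full table of (d+1)-bit strings, and pick the rows whose first entry is 0 and whose remaining d entries spell a codeword.

```python
import itertools

d, k = 4, 2
# toy code E: {0,1}^2 -> {0,1}^4 (repetition-style, min distance 2 >= d/8)
code = {
    (0, 0): (0, 0, 0, 0),
    (0, 1): (0, 0, 1, 1),
    (1, 0): (1, 1, 0, 0),
    (1, 1): (1, 1, 1, 1),
}

# shattered set: every (d+1)-bit pattern occurs among the projected concepts
table = list(itertools.product([0, 1], repeat=d + 1))

# pick the concepts c_z with c_z(s_0) = 0 and (c_z(s_1), ..., c_z(s_d)) = E(z)
codeword_concepts = {z: (0,) + cw for z, cw in code.items()}
for row in codeword_concepts.values():
    assert row in table  # guaranteed: shattering realizes every pattern
print(len(codeword_concepts), "codeword concepts selected out of", len(table))
```

The point of the toy example is only that the selection is always possible: shattering guarantees every pattern is present, so the 2^k codeword rows exist.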
And what is the exact state identification task? I have these 2^k concepts, giving the codeword states, and I take the uniform distribution over them, so each has probability 1/2^k. We know that the pretty good measurement is quadratically related to the optimal measurement. So let's start with the assumption that you have a good learner. What is a good learner? It obtains t copies of the example state and can identify c_z exactly, so P_opt >= 1 - delta, where you should think of delta as a constant. Then the pretty good measurement must have success probability at least P_opt^2 >= (1 - delta)^2, which is at least a constant. So all you need to do is upper bound the success probability of the pretty good measurement, which we saw on the previous slide: we know the exact POVM elements and the exact expression for the success probability, and we just need to write it down and analyze it. This is where the slightly technical part comes in. A technical calculation shows that the success probability of the pretty good measurement is at most an expression which — looking at the dominant terms — is tiny unless t is at least of order d/epsilon. That is to say, if t were much smaller than d/epsilon, the success probability would have been tiny. So if you have a good learning algorithm, its sample complexity must be at least of order d/epsilon. And I think that's it.
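For reference, the chain of inequalities in this step can be written out compactly. This is my paraphrase of the argument as stated in the talk (the exact expression for the PGM success probability, with its constants, is in the underlying paper, which I am assuming is the Arunachalam-de Wolf lower bound):

```latex
\underbrace{(1-\delta)^2}_{\text{good learner}}
\;\le\; P_{\mathrm{opt}}^{\,2}
\;\le\; P_{\mathrm{pgm}}(T),
\qquad\text{while}\qquad
P_{\mathrm{pgm}}(T) = o(1) \ \text{ whenever } \ T = o\!\left(\frac{d}{\epsilon}\right),
\]
\[
\Longrightarrow\qquad T \;=\; \Omega\!\left(\frac{d}{\epsilon}\right).
```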
So this is the proof that quantum examples are essentially equivalent to classical examples for PAC learning. And the proof is exactly this: you have an approximate learning task; using the codeword structure, you reduce it to a state identification problem; and, as I always say, when you see state identification, you apply the pretty good measurement. The hard part with the pretty good measurement is analyzing it, but once you do the analysis you get exactly the bound you need, and you can prove the lower bound we want. So we proved here that the VC dimension over epsilon is also a lower bound for quantum PAC learning. Yeah — not half; a constant, say 1/8 or something like that. So the main reason we needed the distance property was this final step. The point is, in PAC learning you only need to output a hypothesis h that is close to the unknown concept c, which a priori has nothing to do with state identification, where your goal is to identify c_z exactly. The distance property shows that if you learn too well, you have actually learned exactly: say h is 1/10-close to c, but the distance between any two codeword concepts is like 1/2 — then h pins down my exact codeword, and if I have exactly identified my codeword, I have learned the unknown concept. So I have reduced the approximate learning question to an exact-identification question. And as I said before, I don't know the optimal measurement for an arbitrary learning task — exactly, exactly. The point is, we don't know how to analyze an arbitrary state identification task, because how would I prove a lower bound against a measurement I can't describe? But for the PGM we know the measurement operators exactly, so I just need to analyze the learning task under that specific set of observables — then I can hit it with that, exactly.
So this is the main idea, and the pretty good measurement was just the tool to analyze it. Are there any further questions? Again, you might go through this step by step in the exercise. Yeah, good. So the point is, there are two ways to think about a quantum example: as something very powerful, or as something very weak. If you think of it as weak — well, there are setups, interesting function classes like DNF formulas, where you can actually get exponential quantum speedups compared to the state-of-the-art classical algorithms. So even in the weak reading, the model gives quantum speedups. But if you think of it as too strong, that makes this result even stronger: even in the strong model, where you get these coherent superpositions, you do not get a quantum speedup. Whether it's natural or not — from a theoretical perspective, understanding the strength of these examples is what I think about. In our work we also handle arbitrary amplitudes: even then, you should not be able to obtain anything beyond what I showed you. Here I just gave the simple example states where we can prove our lower bound. And this point is very important: what makes this result work, I feel, is that we could pick a distribution that is adversarially hard. As I said, I pick the distribution that puts almost all its mass on s_0 and mass epsilon/d on each of the remaining points. This is admittedly slightly unnatural, but given that in PAC learning you must work for every distribution, the algorithm is required to do well even under this hard distribution. But I think it's an interesting open question: for which distributions can you get quantum speedups?
Just as a general learning question, I think that's very interesting, yeah. For sample complexity you can even get an exponential speedup — for parities, for example; I'll come to this again in the next talk. We know the classical sample complexity of learning parities on n bits under the uniform distribution is order n, but quantumly you can do it with just one quantum example. So you can't hope for a generic quantum lower bound over all distributions, or anything of that sort. But then the question is: why is the uniform distribution so nice and this nasty distribution so bad? Is there an interplay happening here? This hard distribution is as far from uniform as possible — it peaks at one point and is uniform everywhere else — and it seems to be hard for quantum, while uniform seems to be easy for quantum. Is there an interplay? I'm not clear. Good, let me conclude with a couple of quick slides. Yeah, so in this paper we also had a couple of other models. I'll introduce them now and come back to them again later this week. The first is called the random classification noise model. There is a parameter eta between 0 and 1. Classically, in PAC learning, if you remember, somebody gave you (x, c(x)). In the random classification noise model, you are instead given (x, b), where b = c(x) with probability 1 - eta. If eta = 0, this is just classical PAC learning. But with classification noise, I sample (x, c(x)) and flip the label c(x) with probability eta before handing it to you: with probability 1 - eta you get (x, c(x)), and with probability eta you get (x, 1 XOR c(x)). This is called the classification noise model.
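The classical noisy example oracle is easy to write down. A minimal sketch (my own; the example concept — a parity of two fixed bits — is a hypothetical choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_example(c, n, eta):
    """One random classification noise example: (x, b), where b = c(x)
    flipped with probability eta; here x is uniform over {0,1}^n."""
    x = rng.integers(0, 2, size=n)
    b = int(c(x)) ^ int(rng.random() < eta)
    return x, b

# example concept (hypothetical): the parity of bits 0 and 2
c = lambda x: (x[0] + x[2]) % 2

samples = [noisy_example(c, n=4, eta=0.1) for _ in range(10000)]
flip_rate = np.mean([b != c(x) for x, b in samples])
print("empirical flip rate:", flip_rate)  # should be close to eta = 0.1
```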
So this is all classical, and the goal is: given such noisy classical examples, can you learn the unknown concept? Quantumly you can also look at this. I call it an unnatural model, but it's actually interesting — I'll come to why in a second. In this slightly unnatural model there is still a parameter eta between 0 and 1. Earlier, in quantum PAC learning, you got superpositions over (x, c(x)) — think of that as eta = 0. In the quantum random classification noise model, the label register of each example is instead in the state sqrt(1 - eta) |c(x)> + sqrt(eta) |1 XOR c(x)>. So this is a quantum example subjected to random classification noise. The coherence is an assumption, but let's work with it. The goal, again: given copies of this state, learn the unknown concept c. One thing we showed in this paper is that even in this kind of unnatural model, quantum examples do not help for PAC learning — that's not too surprising. But one thing we'll get to tomorrow is: when you pick D to be the uniform distribution, this classical problem reduces to something called the learning parities with noise (LPN) problem, which is notoriously hard classically. The best known algorithm for learning parities with noise scales as 2^{n / log n}. So in the classification noise model, learning parities itself — the simplest concept class — is hard. But, surprisingly, if you have these coherent quantum example states — which may be hard to prepare, we don't know — you can solve learning parities with noise in polynomial time. So these simple example states are actually very powerful for solving something that is notoriously hard in classical learning theory.
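Why coherent examples are so powerful for parities can be seen in a small state-vector simulation. This is my own sketch of the noiseless case (eta = 0, Bernstein-Vazirani-style Fourier sampling; the hidden string is a made-up example): applying a Hadamard to every qubit of the uniform-superposition parity example yields, on measurement, either the all-zeros string or the hidden parity s with the label bit set — each with probability 1/2.

```python
import numpy as np

n = 3
s = np.array([1, 0, 1])          # hidden parity string (hypothetical)

# quantum example |psi> = 2^{-n/2} sum_x |x>|s.x mod 2> on n+1 qubits;
# index convention: x's bits are most significant, label bit least significant
dim = 2 ** (n + 1)
psi = np.zeros(dim)
for x in range(2 ** n):
    bits = [(x >> (n - 1 - i)) & 1 for i in range(n)]
    label = int(np.dot(s, bits) % 2)
    psi[(x << 1) | label] = 2 ** (-n / 2)

# apply a Hadamard to every qubit
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Hall = H
for _ in range(n):
    Hall = np.kron(Hall, H)
out = Hall @ psi
probs = out ** 2

# only two outcomes survive: |0...0>|0> and |s>|1>, each with probability 1/2
idx_s = (int("".join(map(str, s)), 2) << 1) | 1
print(round(probs[0], 3), round(probs[idx_s], 3))  # prints 0.5 0.5
```

So repeating until the label bit reads 1 reveals s after an expected two copies; the polynomial-time LPN algorithm mentioned in the talk is the noisy refinement of this idea.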
We'll come back to the protocol for this tomorrow. Yeah, yeah, sure — you could also think of it this way: I know an unknown parity, I prepare the corresponding noisy quantum parity state, and I give it to you; how many copies do you need? Just from a sample complexity perspective, with these noisy quantum states you could actually solve learning parities with noise — but preparing the state itself may be hard; I'm not sure about that. And the final thing: I'll talk about the random classification noise model tomorrow, and also about one more model, called agnostic learning. This is another learning model that I think is interesting, and classical learning theory has said a lot about it. So far, in all the models we have spoken about, you always got (x, c(x)): somebody samples an x and gives you c(x). But suppose there is no concept class generating the labels at all — somebody just gives you random pairs (x, b). In realistic situations, the agnostic model is in a sense the strongest learning model: an agnostic learning algorithm would pretty much imply learning algorithms in the other models as well. So far, we assumed noisy or perfect examples from a target concept, but maybe there is no target concept to begin with altogether. What do I mean by this? Somebody gives me examples of the form (x, l) — think of l as just a bit — where D is now an unknown distribution on n+1 bits. Previously, D was a distribution on just n bits and l was c(x).
Or, in the classification noise model, D was a distribution on n bits and l was c(x) with probability 1 - eta and 1 XOR c(x) with probability eta. But now let D be an arbitrary distribution on n+1 bits. The goal is to find the best concept in the concept class for approximating this distribution — think of c as the concept in the class that best predicts the labels. The agnostic learner is given many pairs (x, l) sampled from this unknown distribution on n+1 bits, and must output an h that is close to the unknown best optimizer c. You are also given some slack: the hypothesis must classify under D with error at most opt + epsilon, where opt is the error of the best classifier in the concept class. Classically, we know the sample complexity is d/epsilon^2, where d is again the VC dimension. Prior to our work, there were no quantum agnostic lower bounds, and one thing we showed in our papers is that quantum examples in the agnostic framework are also not useful for improving sample complexity. But the quantum agnostic model is much less studied than quantum learning with examples — even basic things like parity learning within the agnostic model are still open in the quantum setting. I'll come to a potential proof approach in tomorrow's talk, but so far quantum agnostic learning has been much less explored than classical agnostic learning, or than any other model of quantum learning. I think with that I'm done. Thank you.