Let me start with a recap of Jordan's algorithm, because we will apply it four times today — it's best to make sure we are on the same page about it. So in Jordan's algorithm we... (I don't know, do I hold the microphone at the wrong place, or what's happening? Omar knows how to do this properly, I guess; it didn't have issues for him. Okay, let's see if it's better now.)

Normally we had the integers Z_N and the Fourier transform over them. But now we will treat the numbers 0 to N−1 as numbers that we normalize so that they lie between 0 and 1. (I will put the microphone into my pocket and leave it there. Do you think the pocket matters? Well, let's see.) All right, so we have these binary strings, which we now think of as binary fractions between 0 and 1, and this is the discrete grid we work with. If we interpret our binary labels this way, the quantum Fourier transform changes as follows: with integer labels x and k we had to divide the product xk by N to get the Fourier states, e^{2πi xk/N}; now, because we divided both x and k by N, we need to multiply the product by N, giving the phase e^{2πi N xk}. And if we have such a phase factor for every x, then the quantum Fourier transform recovers the value of k in this binary representation, provided k is exactly one of these N grid numbers. So this is just a rescaling of how we treat the labels.

Assuming this representation, we can state Jordan's algorithm as follows. We assume we have a phase oracle for our function f: given a basis state |x⟩, where x holds the d coordinates each in binary representation, it applies the phase e^{2πi f(x)} to that particular vector. The output of the algorithm is, hopefully, an ε coordinate-wise estimate of the gradient of f. For this we need to assume that the function is roughly linear, f(x) ≈ f(0) + ∇f · x, so that the linear approximation by the gradient is indeed good.

If that holds, the algorithm works as follows. We prepare the uniform superposition over all grid points, by Hadamards or something similar. Then we apply the phase oracle — not once, but N times — to get N·f(x) in the phase. If our assumption was correct and the function is indeed roughly linear, this phase is approximately the phase corresponding to the value at zero plus the linear phase given by the gradient. The function value at zero contributes just a global phase, which we can pull out of the superposition, and what remains looks very similar to the Fourier states. So if we just apply the (inverse) Fourier transform on every coordinate, we get back a good binary approximation of the gradient in each coordinate. Yes — so this is Jordan's algorithm. And if you look at this phase, it is a sum over the coordinates. (I'm attempting to put the microphone in my pocket; maybe that's the issue.)
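Before moving on to the tensor-product structure, here is a minimal numpy sketch of the recap above, for an exactly linear function f(x) = g·x on the grid {0, 1/N, ..., (N−1)/N}^d. All the names (n, N, d, g) are mine, not from the lecture, and I use the inverse-QFT convention for the final transform.

```python
# A minimal sketch of Jordan's algorithm for an exactly linear function
# f(x) = g . x; the phase e^{2 pi i N g.x} factorizes over coordinates,
# so each coordinate register can be read out separately.
import numpy as np

n, d = 5, 2                 # qubits per coordinate, number of coordinates
N = 2**n                    # grid size per coordinate, precision ~ 1/N
g = np.array([13/N, 23/N])  # gradient, exactly representable on the grid
grid = np.arange(N) / N     # the rescaled labels x_i in [0, 1)

k = np.arange(N)
qft = np.exp(2j * np.pi * np.outer(k, k) / N) / np.sqrt(N)
for i in range(d):
    psi = np.exp(2j * np.pi * N * g[i] * grid) / np.sqrt(N)
    out = qft.conj().T @ psi             # inverse QFT on this coordinate
    k_hat = np.argmax(np.abs(out) ** 2)  # measurement outcome (w.p. ~1 here)
    print(i, k_hat / N)                  # recovers g_i: 13/32 and 23/32
```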
So if we have this state, whose phase is e^{2πi N x·∇f} — and x is a vector — then I can take the tensor product from i = 1 to d (sorry, I forgot to put the vector arrow and the normalization here), and this is the same thing as having the coordinates individually: each factor holds just x_i, with the corresponding phase e^{2πi N x_i (∇f)_i}, and each divided by √N. So on the one side I treated x as a vector with d coordinates and one joint phase; but this state is actually a tensor product of the d coordinate registers, each with the phase belonging to that coordinate. And because it is a tensor product state, I can apply the Fourier transform coordinate-wise and get back all the coordinates. So this is the idea behind the last step of Jordan's algorithm.

And we discussed that it basically requires a quantum circuit implementation of the function — that's usually how we get these phases, or how it's described in many cases. But if we have a recipe for computing f, then the cheap gradient principle very often means that the gradient can be computed classically essentially as cheaply as the function itself. So I will show you applications where we cannot compute the function nicely classically, and therefore Jordan's algorithm gives a genuine speedup. (Can you hear me now? For some reason it interferes with my pocket, which I don't understand, because I didn't learn that from physics.)

All right, let's look at the first application, which is distribution estimation. The question we start with is a classical problem, and we know the classical solution to it: how many samples do we need to estimate every probability in a probability distribution to ε precision? So we have some probability distribution over d elements — a non-negative vector — and we wish to estimate each of its entries to ε precision. It is standard knowledge that if you take something like log(1/δ)/ε² samples, then counting how many times you got outcome 1 and dividing by the total number of samples gives you an ε-precise estimate of p_1 with success probability at least 1−δ; this is standard, and you can prove it using a Chernoff bound, for example. But this estimates only a single coordinate of your distribution — p_1, say, though it could be any of them.

Now the trick: choose your failure probability δ to be something like 1/d. Then with log(d)/ε² samples, every coordinate i individually gets an ε-precise estimate with failure probability on the order of 1/d. Apply the union bound — what is the probability that any of my coordinate estimators fails to produce an ε-precise estimate? I just add up the individual failure probabilities, each of which was something like 1/d — say 1/(3d), if I choose δ = 1/(3d) — and all together they sum up to 1/3. So by the union bound I get an estimator which is good for all my coordinates simultaneously, not only for one of them. And the increase in the number of samples was only logarithmic in the number of different outcomes of my experiment — the size of my distribution — which is a very nice thing.
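Here is that classical baseline as a runnable sketch. The distribution p and the constants are illustrative; the sample count m is chosen so that the union-bound failure probability is at most 1/3.

```python
# Classical baseline: ~ log(d)/eps^2 samples estimate all d probabilities
# at once, via a Hoeffding bound per coordinate plus a union bound.
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.15, 0.05])            # unknown distribution
d, eps = len(p), 0.05
m = int(np.ceil(np.log(6 * d) / (2 * eps**2)))  # so 2d*exp(-2m eps^2) <= 1/3

samples = rng.choice(d, size=m, p=p)
p_hat = np.bincount(samples, minlength=d) / m   # empirical frequencies
print(m, np.max(np.abs(p_hat - p)))             # max error < eps w.h.p.
```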
And so we would be hoping to get something similar quantumly. The question is how we can improve on this in the quantum case — we can often remove the 1/ε² dependence and make it a 1/ε dependence.

First of all, for this I need to define an access model for my distribution. Previously I just assumed that I can get samples from my distribution, but that's a classical model. Now I want to do quantum operations on it, so the corresponding thing we can work with is the assumption that we can prepare these samples as quantum states: we have some unitary V which, on the all-zero state, prepares a superposition of the labels i, maybe with some garbage state |ψ_i⟩ attached, such that if we measured this state, the first register would give outcome i with probability p_i. This is a very natural quantum analog of classical sampling, and it is the weakest natural assumption that gives you some coherence in your input model. As an example of why this is a reasonable assumption: think of having some classical procedure that produces your samples — maybe you are running some Monte Carlo algorithm. Running that Monte Carlo algorithm on your quantum computer instead provides exactly this kind of access: a coherent way to prepare your samples, with the probabilities coming out correctly, just as they come out of your quantum computer. It's important to note that people sometimes assume a stronger input model where this garbage state is not present, but that makes it a much stronger assumption. Think about what |ψ_i⟩ is in the Monte Carlo example: it essentially describes the state of your Monte Carlo sampler — how it got to that particular sample. That is in general hard to erase, so allowing a garbage state greatly enhances the applicability of this result, because it applies more generally.

Okay, so if we have this natural access model for our distribution, we can just use amplitude estimation to estimate, for example, the value p_1 with roughly 1/ε steps — this we have seen. But now the question arises: what can we do to get all of them together? Amplitude estimation focuses on a single probability, so this approach would give something like d/ε, because you would need to repeat the procedure for all coordinates. We had a classical algorithm with 1/ε² complexity up to log factors, and now one with d/ε — this d can be very large, so you want to avoid it; that is not such a nice quantum algorithm, and we want a better speedup. And this is exactly what you can do with Jordan's algorithm: amplitude estimation is a Fourier transform in a single coordinate, and gradient computation by Jordan is multivariate phase estimation — we can think about it that way. And that is exactly what we want here: to estimate all the entries of this distribution at the same time. So it's a natural fit, but it's not immediately clear how to achieve it.
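In symbols, the access model just described is the following (my paraphrase of the slide; |ψ_i⟩ is the garbage register):

$$ V\,|0\rangle \;=\; \sum_{i=1}^{d} \sqrt{p_i}\;|i\rangle\,|\psi_i\rangle, \qquad \Pr[\text{measuring the first register gives } i] \;=\; p_i . $$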
And so for this I would like to introduce the notion of a probability oracle, which is the following: I want to modify my initial oracle, which prepared my distribution, using a version of quantum rejection sampling, in the following way. I want a unitary U which from |0⟩|x⟩ prepares a state that looks like this: it is basically the same thing as before, but if you would get outcome i of the distribution, then you want to accept it only with probability x_i. Acceptance means that the first qubit is set to |0⟩ — that's what I mean by accepting. And if I manage to prepare this state, where the p_i's are multiplied by the x_i's, it means I accept element i with probability x_i; to make it unitary, I need to rotate the rest of the amplitude onto the rejection branch, with the qubit in state |1⟩.

And how can you do this? Just apply our previous state-preparation oracle — remember, this was Σ_i √(p_i) |i⟩|ψ_i⟩ — and, controlled on which i you got and on the value x_i that I put there, apply a rotation. This is a rotation gate which puts the amplitude √(x_i) on the accepting branch, so that when you square it you get the probability you wish. In particular, if the second register is in the state |x⟩ for a vector x in [0,1]^d, then the probability that the first qubit is in state |0⟩ is exactly Σ_i x_i p_i: you get the i-th outcome with probability p_i, and then you accept it with probability x_i. So the overall acceptance probability — the probability of seeing 0 on the first qubit — is this sum, which I can also write as the inner product ⟨x, p⟩ between the vector x and the distribution p viewed as a vector.

Yes? Yes, so this is what I had: I had the unitary V, and I modified it by adding a new parameter x — someone gives me an x, and I want to do this operation. What do I do? I first prepare the sample as before, then I look at which i I actually got, then I look up the corresponding coordinate x_i of the vector, and once I know that, I apply this rotation, which does the probability reduction, so that ultimately I accept this sample with probability x_i. And this vector x can be arbitrary in [0,1]^d. As you can expect, we will later put x into a uniform superposition and run Jordan's algorithm, but we are not quite there yet.

So the end goal is to apply Jordan's algorithm, which means preparing the uniform superposition over every grid point and then applying the phase. Well, we don't have a phase oracle — we have a probability oracle — but we will get there. What we managed to construct is a probability oracle whose acceptance probability is a linear function of x: with x in its register, it is just ⟨x, p⟩, a nice linear function. But, once again, it produces the correct probability, which is linear — it is not a phase. And so here comes the trick: we are going to convert this probability oracle into a phase oracle, exactly what is needed in Jordan's algorithm. To reduce the clutter in the formula, I will just assume I have a generic probability oracle, by which I mean that it accepts the vector x with probability p(x) — this is the linear function from the previous slide. It doesn't matter what these ancilla states are; the only thing that matters is that, if I measure, I see value 0 on the first qubit with probability p(x). That's a probability oracle for the function p(x), and this is what I constructed for the linear function I have in mind. But what I wish to implement is a phase oracle for p(x).
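Written out, the probability oracle looks like this (the register ordering is mine; the accept/reject qubit is the "first qubit" of the lecture):

$$ U\,|0\rangle|x\rangle \;=\; \sum_{i=1}^{d}\sqrt{p_i}\,|i\rangle|\psi_i\rangle\Big(\sqrt{x_i}\,|0\rangle+\sqrt{1-x_i}\,|1\rangle\Big)|x\rangle, \qquad \Pr[\text{accept}] \;=\; \sum_{i} p_i x_i \;=\; \langle x,p\rangle . $$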
Okay, and maybe just to explain for clarity: my probability is a linear function of x. So what's the gradient of this linear function? The gradient is just p itself. So if I can estimate the gradient of this function, well, the gradient is exactly the probability distribution I have in mind. That's the ultimate goal: estimating this distribution is the same as learning the gradient of this linear function.

Okay, and for that we already constructed this probability oracle, and we want to convert it into a phase oracle; this is how we do it. As a first step we create a block encoding — a unitary matrix W. W is just the following: you apply your state preparation, attach a single ancilla qubit that you just don't touch, then swap the first two ancilla qubits, and then run the reverse of your state preparation. And what I claim is that this is a block encoding of the diagonal matrix with entries p(x): if you set the ancilla qubits of your state-preparation oracle to zero, apply W, and then project back onto the all-zero ancilla state at the end, you get exactly p(x) on the diagonal.

Okay, so this is a little computation that is hopefully not too difficult. What does it mean that W is a block encoding of this diagonal function? It means the following: the matrix element ⟨y|⟨0|W|0⟩|x⟩ should equal δ_{xy} p(x) — a diagonal matrix with p(x) in the middle. And if you write out the definition of W, you need to apply the state preparation on both sides. On one side you get a |0⟩ (the attached ancilla, which the identity doesn't change), another |0⟩, and the accepting state |ψ_accept⟩, all with amplitude √(p(x)) — plus a term starting with |0⟩|1⟩, which will fall out. In between we have SWAP ⊗ identity, and applying the state preparation on the other side gives √(p(y)) with a |0⟩, another |0⟩, and the accepting state — plus, again, a term starting with |0⟩|1⟩, which I don't care about. So I have a |0⟩|0⟩ term and a |0⟩|1⟩ term on each side. Now, when I apply the SWAP gate to, say, the |0⟩|1⟩ term on one side, it becomes |1⟩|0⟩, and this |1⟩|0⟩ has no overlap with the |0⟩|1⟩ term on the other side — they just don't meet. So the cross terms fall out because of the SWAP gate, and the only thing that remains is the |0⟩|0⟩ term. And, sorry, I forgot to mention: this oracle crucially keeps the vectors x and y that you started with lying around in their register. So we have one state with x, some accepting state, and a square-root probability, and on the other side y, some accepting state, and √(p(y)); they have a non-zero inner product only if x equals y, and that gives exactly δ_{xy} times the product of the two square-root probabilities. So hopefully this convinces you that this simple circuit is indeed a block encoding of the diagonal matrix containing the values p(x) on the diagonal.
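Here is a numerical sanity check of that claim, as a sketch: I drop the garbage register and fix a classical x to keep the matrices small, and all names are mine. The check is that W = U†(SWAP ⊗ I)U has ⟨0,0,0|W|0,0,0⟩ = p(x) = ⟨x, p⟩.

```python
# U_x |0>_q1 |0>_sys = sum_i sqrt(p_i) (sqrt(x_i)|0> + sqrt(1-x_i)|1>)_q1 |i>_sys.
# Only the first column of U_x matters; we complete it to a unitary via QR.
import numpy as np

rng = np.random.default_rng(1)
d = 4
p = rng.random(d); p /= p.sum()     # the distribution
x = rng.random(d)                   # one fixed grid point x in [0,1]^d

col = np.concatenate([np.sqrt(p * x), np.sqrt(p * (1 - x))])  # basis |q1>|i>
M = np.column_stack([col, rng.standard_normal((2 * d, 2 * d - 1))])
Q, _ = np.linalg.qr(M)
U = Q * np.sign(col @ Q[:, 0])      # fix the sign so that U[:, 0] = col

SWAP = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]])
Ufull = np.kron(np.eye(2), U)       # attach the untouched ancilla q0
W = Ufull.conj().T @ np.kron(SWAP, np.eye(d)) @ Ufull
print(W[0, 0], x @ p)               # both equal p(x)
```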
Ok, so p(x) is a non-negative number — certainly a real number — and because our matrix is diagonal, it's also symmetric, so there is nothing preventing me from thinking about this diagonal matrix as a Hamiltonian. It's a Hamiltonian, sure, fine. But then I can use the well-established techniques for Hamiltonian simulation, which in this case means I can implement this operator — and, sorry, what is missing from this slide is the term that keeps the value of the vector x: I have a sum over x, and whatever phase I wrote here should be multiplied by |x⟩ and summed over all the x's. So this is exactly a phase oracle, which now applies my phase, N times the value. And from the complexity of Hamiltonian simulation we know that this exponentiation of the Hamiltonian for time N has cost roughly N, up to some log factors. This is now precisely a phase oracle: given the vector x, it applies exactly the phase e^{2πi N p(x)}, with p(x) coming from the diagonal. So I managed to convert my probability oracle into a phase oracle — and it is N times the value because I have Jordan's algorithm in mind.

And so now we have everything ready: we have a phase oracle for the function p(x), which was just the inner product between x and p, whose gradient is the probability distribution p itself. So now you just choose your precision parameter, N = 1/ε, and use Jordan's algorithm to estimate this entire distribution with only about 1/ε uses of V, the preparation circuit for your distribution.

Yes? Yeah, yeah — but everything there is logarithmic, so I didn't want to confuse things with where these failure probabilities come in; the Õ hides factors logarithmic in the failure probability and similar factors.

Okay, so this is one thing — this is what you get with ordinary phase estimation. But we have studied advanced versions of phase estimation. If, instead of using the uniform superposition over this grid of vectors x, we use a Gaussian distribution over all coordinates — so an overall Gaussian starting state for x — and use Jordan's algorithm with that, meaning that each individual coordinate-wise phase estimation is Gaussian-windowed, then it is not only an estimate of p with 1/ε precision: it will also be an unbiased estimator with Gaussian noise. So you get basically the best estimator you can hope for, just by using this advanced version of phase estimation inside Jordan's algorithm. All right.
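So, schematically, the whole chain for this first application is the following (my one-line paraphrase of the steps above):

$$ V \;\longrightarrow\; U:\ \Pr[\text{accept}\mid x]=\langle x,p\rangle \;\longrightarrow\; W:\ \text{block encoding of } \operatorname{diag}\!\big(\langle x,p\rangle\big) \;\xrightarrow{\ \text{Ham. sim., cost } \widetilde O(N)\ }\; |x\rangle\mapsto e^{2\pi i N\langle x,p\rangle}|x\rangle \;\xrightarrow{\ \text{Jordan},\ N=1/\varepsilon\ }\; \widehat{p}\,. $$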
Yes? Sorry, I don't hear you — which function is which? Yeah, it is this inner product ⟨x, p⟩. Why is it — what, sorry? So on this slide I showed how, from being able to prepare a distribution p in the sense I described, you can transform this ability into a probability oracle which outputs a qubit in state 0 with probability ⟨x, p⟩ — this is a number between 0 and 1, and the probability that my circuit outputs 0 is exactly this. And — sorry, this should be a vector x — it is the inner product of the vector x with p. So this circuit achieves just that: if you input a vector x, it outputs 0 on the first qubit with probability ⟨x, p⟩. That is a probability oracle, and on the next slide I showed how from any probability oracle I can construct a phase oracle. Okay, let me try to write it down. The probability of 0 given x in this circuit — that is, if you put the vector x in your ancilla, the probability of seeing 0 on the first qubit — equals the inner product ⟨x, p⟩. This is what I constructed, and this is what I call the function p(x). It is a linear function, so in particular its gradient is just my distribution. I constructed a probability oracle for this function p(x) and then converted it into a phase oracle, mapping |x⟩ to e^{2πi N p(x)} |x⟩ — and I multiplied the phase by N because I have Jordan's algorithm in mind. So these are the steps. Is it clarified? Okay, thanks.

Yes? No, no — because classically it's log(d)/ε². Well, if you are interested in total variation distance, that's a different story; there are improvements there too, but that's a bit different. Yeah, but you may be interested in the histogram — this particular speedup, if you execute this algorithm, is exactly histogram estimation to ε precision, which is a meaningful task; the next example I will show is more of a quantum generalization. But for this particular problem the classical complexity is 1/ε² and this algorithm is 1/ε. For total variation distance, naively converting this guarantee is not going to give the best algorithm. Okay — I would have to double-check the actual complexity in total variation distance; maybe you get the same complexity, but it's more complicated. So if you want total variation distance, this is not the algorithm of your choice, but it is an optimal algorithm for this particular distribution-estimation task.

Yes? Yeah, that's a good question — why don't you violate Holevo's bound? Maybe you don't actually learn that much information. Your distribution is mostly going to be zero; it's similar in the classical case, which is also why it works classically. It's true that you have d coordinates, but most of them will be zero, or very small, and then you basically just output zero for those values — that's fine, it's going to be ε-precise. So the classical intuition is, I think, the best guide here.

Okay, so I would like to describe another application at a high level. That was a kind of tomography for distributions, so let's move on to tomography of quantum states — that's the next level. Here, again, I need some sort of coherence to get an improvement over plain sampling, so I will assume the following input model: I have, once again, a state-preparation unitary V which on the all-zero state prepares a pure state on two registers, A and B, such that tracing out the first register gives some mixed state ρ on register B. This is my assumption — an access model for the density operator, the mixed state ρ, and it is a coherent access model to this particular mixed state, defined via its purification. In tomography, what we want is to estimate ρ, say in trace distance. A motivating example: maybe you can prepare some many-body quantum state, and people often want to understand its reduced density matrix on a few particles — suppose you can prepare your many-body state and want to understand what the reduced density matrix looks like on, I don't know, qubits 1, 2, 3 or something like this.
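In symbols, the purified access model is (my paraphrase):

$$ V\,|0\rangle \;=\; |\psi\rangle_{AB}, \qquad \rho \;=\; \operatorname{Tr}_A\big(|\psi\rangle\langle\psi|_{AB}\big). $$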
Since you can prepare the entire state, you have that for free: just trace out the rest of the qubits. Now, how can you estimate it precisely? Well, in physics the natural notion is estimation in trace distance, which is the quantum analog of the total variation distance you asked about before.

Yes — so once again the idea is to consider a linear function whose derivative is exactly the density operator we are looking for. It is the linear function mapping a matrix x to Tr(xρ); the trace is a linear operation, so its derivative with respect to the matrix x is ρ itself. So it's a very similar strategy as before, in a more complicated setting. To understand how this algorithm works, let me go through it at a high level. Once again I think of the matrix elements of x as my coordinates, and I will sample them uniformly at random from [−1, 1] — ultimately I will set up a superposition this way, with matrix elements from [−1, 1]. And because I need to somehow implement x on a quantum computer to compute this linear function, I actually need x to be normalized. Unfortunately, in the worst case — for example, if all my random choices happen to be 1 — I get the all-ones matrix, and the all-ones matrix has operator norm d. That's too bad, that's large: I would already lose a factor of d there. However, it turns out that, due to a matrix Chernoff bound, if you choose all your matrix elements uniformly at random from [−1, 1], then apart from some exponentially small probability, the norm of your matrix will only be about √d — as opposed to the worst case, which is d. That is good: if my matrix x has norm √d, I can just normalize it by roughly √d and then work with it on a quantum computer.

Generalizing the idea of how I built a diagonal matrix with the diagonal entries p(x) before, I can do the same here — but because of this normalization issue, what I can build is a block encoding of a diagonal matrix with entries Tr(xρ)/√d. This Tr(xρ) is the same as the Hilbert–Schmidt inner product of x and ρ, if you know that notion; if you don't, just stick with Tr(xρ) — I only put the Hilbert–Schmidt form there as well. So the only difference from the probability case is that, because of this normalization, I cannot directly build a diagonal matrix of this linear function, only of the function divided by √d; but that I can do. Once I have this diagonal matrix — which, for every matrix x, or rather every discrete representation of the matrix x, puts this value on the diagonal — I once again do Hamiltonian simulation and convert it to a phase oracle. But because of this subnormalization, the cost now increases by a factor of √d. So if I do this conversion to a phase oracle and apply Jordan's algorithm, then with √d/ε uses of the state-preparation oracle I get an ε coordinate-wise estimate of ρ, the density operator — because this was a linear function, and its derivative is once again ρ. Moreover, my individual coordinates will be roughly independent, because of the tensor product structure of this phase state. The reason I say "almost" is that I cannot build this block encoding for every x, only for the majority of x's, because of the tiny issue that some outlier matrices have a higher norm, and there my algorithm fails — but that is an exponentially small probability, so I can just forget about it.
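A quick empirical check of that norm claim, as an illustrative sketch: a d×d matrix with iid uniform[−1, 1] entries has operator norm Θ(√d), far below the worst case d, except with tiny probability.

```python
# The ratio ||X|| / sqrt(d) stays roughly constant (about 1.15) as d grows.
import numpy as np

rng = np.random.default_rng(0)
for d in [64, 256, 1024]:
    X = rng.uniform(-1.0, 1.0, size=(d, d))
    print(d, np.linalg.norm(X, 2) / np.sqrt(d))
```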
So it is a roughly independent estimator for every entry of my density operator, with precision ε, and so far I have spent √d/ε resources. Now, if all these errors lined up, my error would be d·ε — and I eventually want operator norm or trace norm; consider operator norm first. If all my errors line up in the same direction, they give d·ε error in operator norm. But because these estimates are independent, and I make them unbiased using the unbiased phase estimation, I get a much smaller error: with exponentially high probability my error will be only about ε·√d in operator norm. (Yeah, thanks.) So now I have an ε·√d estimate in operator norm, and, just converting norms, if my density operator has rank r, this implies an ε·r·√d estimate of the density operator in trace norm. So far I spent √d/ε and got this precision; my ultimate goal is ε precision, so I need to further increase the number of uses of V, and if you multiply out this cost you get d·r/ε complexity. So you can learn a rank-r density operator of dimension d with that many uses of the state-preparation oracle. This is once again a quadratic improvement in the precision, because if you could only get samples of these density operators — that is, measure copies — it would be d·r/ε², and that is optimal in the sampling case; in this more advanced setting, where we have a state-preparation oracle for the purification, d·r/ε is also tight. So Jordan's algorithm, now with unbiased errors, gives rise to this estimator — and for these matrix Chernoff bounds it was essential that our estimators were coordinate-wise unbiased. This was the original motivation for developing these techniques, because it gives an optimal algorithm for this purified density operator estimation problem.

Yes? Sorry, I can't hear you properly. — Okay, I just glossed over a technical detail: how do you actually block-encode this matrix? You somehow need to prepare the matrix x as a block encoding and somehow apply it to ρ, but you can only block-encode a matrix of norm at most 1, so I need to reduce the norm. Well, in most cases it suffices to reduce the norm by √d to make the norm of my x at most 1; in the worst case I would need to reduce it by d. So what I do instead: I just assume that my matrix has norm at most √d and reduce the norm by √d, and in the unfortunate case where my matrix x actually has larger norm, something else happens — my block encoding will do something arbitrary, there will be an error — but that only happens with exponentially small probability, so it's fine. Understanding this properly means going a little bit under the hood, which I don't want to do here, but basically that's the idea: I need a block encoding of the matrix x, that requires x to be normalized, and this is a normalization that works for most x's. — Well, no: I just block-encode this, and my procedure will work for every x whose operator norm is at most √d. I do this over a huge superposition of many x values, and some of those x values will not satisfy this — their operator norm will actually be larger than √d — and at those particular points my phase will be incorrect, something bad happens. But that is only a tiny fraction of all the points at which I apply the phase oracle.
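Bookkeeping, in one line (my notation, reconstructing the stated d·r/ε total): with per-entry precision ε′ the algorithm costs √d/ε′ uses of V, and

$$ \|\widehat\rho-\rho\|_{\mathrm{op}} \lesssim \varepsilon'\sqrt d \;\Rightarrow\; \|\widehat\rho-\rho\|_{\mathrm{tr}} \lesssim \varepsilon'\, r\sqrt d \quad (\text{rank } r); \qquad \varepsilon' := \frac{\varepsilon}{r\sqrt d} \;\Rightarrow\; \text{cost } \frac{\sqrt d}{\varepsilon'} = O\!\Big(\frac{d\,r}{\varepsilon}\Big). $$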
And that's fine — it can tolerate a little bit of error. So, once again, the strategy, as in the classical case, was to first create the block encoding of the linear function; because of these normalization issues you cannot block-encode Tr(xρ) itself with a single oracle call, but it turns out that for these technical reasons what you can do gives rise to a block encoding of Tr(xρ)/√d. That is just what comes out, because in your quantum algorithm you will not work with the matrix x directly but with x/√d, and that gives you exactly this dependence. Yeah, so this is just where the normalization enters. And because of this reduction, your Hamiltonian is shrunk by a factor of √d, so you need to apply Hamiltonian simulation √d times longer to get back the original phase you wanted. Having a block encoding subnormalized by some factor — in this case √d — means your Hamiltonian simulation is that factor times more costly to reach the same phase. Subnormalization hurts you.

So that was the application to state tomography. What should I do here... I was talking about what happens if you have a nonlinear function. In the earlier applications I constructed functions that were perfectly linear, so I didn't need to worry about nonlinearities at all. In the general case, when I have a nonlinear function, I somehow need to make sure it looks almost linear in the region where I evaluate it. One method for that is just zooming into the function around the point, which works nicely; however, it again shrinks my phases, and it means that if I zoom into my function by a factor r, I need to apply my phase oracle r times more at the end — and that's not really good. But there is a nicer method: instead of zooming into your function, you can use numerical differentiation formulas to essentially kill the nonlinear terms. For example, I can define a function f′ which basically just evaluates the function at the point x and at −x, subtracts, and divides by 2. This is only 2 evaluations of the function, but it completely kills the second-order term. Imagine a smooth function with a constant term, a linear term, a quadratic, a cubic, and so on in its Taylor expansion: if you manage to kill the second-order term, your error is only third order, and so on. This single formula kills the second-order terms, and it has higher-order analogues — you can kill all terms up to order k with something like k evaluations of the function. This is much nicer and much more efficient than zooming in. And let me just state the result here at a high level: if you have some nice analytic function, which is in an appropriate sense smooth, then this smoothed version of the gradient computation gives you something like √d/ε complexity. In the special case where you happen to have a polynomial function, it has terms only up to some order k, so with the order-k numerical differentiation formula you can kill all nonlinear terms, and the query complexity becomes k/ε; in the general case — this comes down to the geometry of these high-dimensional functions — everything comes out to √d/ε.
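Here is a sketch of that degree-2 cancellation in one dimension, written in the standard symmetric-difference form (the function is illustrative): the quadratic Taylor term cancels, so the error drops from O(h) to O(h²).

```python
# Compare one-sided and symmetric difference quotients at 0.
import numpy as np

f = lambda t: 0.3 + 0.7 * t + 5.0 * t**2 + 4.0 * t**3
g_exact = 0.7                                # f'(0)

for h in [1e-1, 1e-2, 1e-3]:
    fwd = (f(h) - f(0)) / h                  # one-sided: error ~ 5h
    ctr = (f(h) - f(-h)) / (2 * h)           # symmetric: error ~ 4h^2
    print(h, abs(fwd - g_exact), abs(ctr - g_exact))
```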
Okay, so the third application is the optimization of parametrized stochastic circuits. These are the circuits you see in a lot of cases these days — stochastic circuits such as quantum machine learning and variational circuits. (Sorry, let me just skip ahead a few slides here.) Okay, so I would like to explain this slide better for you. When you run, for example, a variational quantum eigensolver, or a quantum approximate optimization algorithm, or some other quantum machine learning routine, usually what happens is that you have an ansatz circuit — this is in this box. The ansatz circuit has some fixed elements, like CNOTs, other fixed rotations, X gates, what not, and some parametrized gates; and what you do is tune the parameters of these rotation gates and similar things, hoping to prepare, for example, a low-energy state. In a variational eigensolver you want to prepare a state with high overlap with the ground state; once you have prepared the state, a verification circuit somehow checks, say, the energy of your state. So: a parametrized circuit, and then a verification circuit which tells you the quality of your state — and the quality is read off from the probability of measuring zero on the last qubit. You can pretty much convert everything of this kind into this form; your quality is better the smaller the probability of measuring zero after verification. Basically all of QAOA, the quantum variational circuits, and quantum machine learning ansatz optimization fits this model. And how do you optimize it? You compute the gradient of this objective value, which is encoded in the probability of measuring zero, and then you adjust your rotations. So this is a very generic view of these variational circuits.

Now the trick is that you want to tune your parameters in superposition. This is not something people do today, but in the long run maybe you want to. Imagine that your parametrized rotations and other gates are now controlled by qubits that describe the parameter values; with respect to these qubits you can compute the gradient of the objective value. That, once again, can be modelled as a probability oracle: given the parameters of your ansatz circuit, you get a probability of outcome zero, which you want to optimize by computing the gradient and adjusting your parameters. So you just convert your probability oracle into a phase oracle, smooth it using the numerical differentiation formulas I told you about, and apply Jordan's algorithm. What this gives you: you can compute the gradient of your variational circuit to ε precision per coordinate with about √d/ε uses of your circuit. And it turns out that if you just treat your parametrized circuit as a black box — you don't really understand its structure, you can only infer anything by measuring the output qubit — then under this black-box assumption this is actually optimal: you cannot do better. So if you want to learn the gradient to ε precision, the cost is √d/ε — which is bad news. You would imagine that in the long run you have zillions of parameters and you want to optimize them for a long time; this is what happens in classical machine learning, where they optimize their neural networks with a lot of gradient steps — but there they have backpropagation and the cheap gradient principle.
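In symbols (my formalization of the slide's setup, with θ the vector of circuit parameters):

$$ q(\theta) \;=\; \Pr[\text{measuring } 0 \text{ after verification}], \quad \theta\in\mathbb R^{d}; \qquad \text{black-box cost of learning } \nabla q \text{ to } \varepsilon \text{ per coordinate} \;=\; \Theta\!\Big(\tfrac{\sqrt d}{\varepsilon}\Big)\ \text{circuit uses}. $$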
So they can evaluate their gradients very efficiently, to very high precision. In the quantum version, however, there is no quantum backpropagation: you cannot do better than this complexity of using Jordan's algorithm to learn the gradient. This is a warning sign for me that these variational circuits will probably not be the future of quantum computing. Maybe while you use small, noisy quantum computers it makes sense, but I think once we have large quantum computers they will be very hard to train, because it is very costly to evaluate the gradients, unlike in classical machine learning. For that reason this is a fundamental limitation, very different from the classical case, where the precision dependence of your gradient computation is logarithmic in the precision — and when you work with very high-dimensional objects you probably want very high precision. Too bad.

And this is the final example — and now this is a classical application. I told you that when you can evaluate your function with an arithmetic circuit, you have the cheap gradient principle; but there is a setting where this doesn't work, and this is the setting of convex optimization. You have a convex body — this darker, deeper purple one here — and you want to optimize over this object, this body, which lives in a very high-dimensional space and is hard to understand; maybe you want to find the extreme point in some direction. For this one usually assumes one of two types of access models to the convex body. One is when you can ask, from any point, whether it is in your set or not — that's called a membership query. The stronger one is a separation query: if you are outside the set, you don't only get a "no" answer — no, you are not in the set — but you also get a separating hyperplane which separates your point from the convex body. The most efficient algorithms used in convex optimization actually assume this separation oracle. And you can in fact classically construct a separation oracle from a membership oracle, but that requires dimension-many queries, as you would expect. In the quantum case you can do the following. Imagine this is the point which happens not to be in your convex set, so your membership query told you, "well, unfortunately you are outside the convex set," but it doesn't give you a separating hyperplane. Now, from this point you can "look at" the convex body, and its silhouette will be a convex function — a convex function which mostly has something like a derivative, or a subgradient, at every point; and subgradient estimation is very similar to gradient estimation. It turns out that with binary search you can efficiently evaluate this function: you basically just query points until you hit the boundary — that is the binary search — and this way you can compute the plot of this function. And once you can evaluate this function, you can apply gradient computation by Jordan, and the gradient at this point is exactly a separating hyperplane. So this shows that with only a few membership queries you can get a full separating hyperplane on a quantum computer, which is exponentially better than the classical case, where you really need about d queries. And in some cases it is indeed the situation that you have membership queries but don't know how to get separating hyperplanes, and in such cases you can get an exponential speedup for this classical problem on a quantum computer.
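Here is the classical ingredient of that construction, as a sketch: locating the boundary of a convex body along a segment by bisecting on a membership oracle. The quantum part — Jordan's algorithm estimating the gradient of the resulting function to produce the separating hyperplane — is not simulated here, and the unit ball is an illustrative body.

```python
# Bisection on a membership oracle: ~50 queries locate the boundary point.
import numpy as np

def member(z):                               # membership oracle
    return np.linalg.norm(z) <= 1.0

def hit_boundary(inside, outside, iters=50):
    """Bisect along [inside, outside] until the boundary is located."""
    lo, hi = 0.0, 1.0                        # lo stays inside, hi outside
    for _ in range(iters):
        mid = (lo + hi) / 2
        if member(inside + mid * (outside - inside)):
            lo = mid
        else:
            hi = mid
    return inside + lo * (outside - inside)

b = hit_boundary(np.zeros(3), np.array([2.0, 1.0, 0.5]))
print(b, np.linalg.norm(b))                  # boundary point, norm ~ 1
```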
But I must tell you that this situation does not arise very often. So if you have any problem of this sort in mind, where this cute technique could be applied, then please tell me. And with that, I have finished the fourth example application — see you at the exercise session.