Three, two, one. Hello and welcome to our presentation on concrete quantum cryptanalysis of binary elliptic curves, a paper published at CHES 2021, written by Gustavo Banegas, Dan Bernstein, Tanja Lange, and myself, Iggy van Hoof. So what does this actually mean, concrete quantum cryptanalysis? As you might be aware, in 1994 Shor published Shor's algorithm, and Shor's algorithm allows a sufficiently large quantum computer to break RSA and the discrete logarithm problem. And so our question for today is: how big is sufficiently large? Previous work looked at RSA as well as prime-field elliptic curve Diffie-Hellman, but today we will be talking about binary elliptic curve Diffie-Hellman. And when we say concrete, we really mean the number of qubits and the number of gates, and we will explain during this presentation what these numbers actually mean. All right, so Shor's algorithm: we are going to be treating the quantum parts of Shor's algorithm as a black box, and we will really be focusing on adding pre-computed points, which is the non-quantum step of Shor that is repeated a lot and is by far the most expensive one. So we are adding multiples of a point on the binary elliptic curve, and to do that we need addition, multiplication, and division in binary finite fields. So today we will be talking about building quantum circuits for these operations, then about circuits for the full point addition, and finally we'll put it all together for the full result. All right, so these quantum circuits: we built them using quantum gates acting on quantum bits, and we call quantum bits qubits. Again, today we will not be talking about the quantum parts of Shor's algorithm; we will be talking about the reversible parts, which means these gates are also part of classical reversible computing, which you might have heard of. And we will be using four reversible gates today.
So the first one is the NOT gate. Hopefully you're familiar with that one from classical computing, and it works the same here. And you can see it's reversible because if you repeat the NOT gate, you get 1 − (1 − a), so you get a back. The second gate is the SWAP gate. The SWAP gate we don't treat as a real gate; rather, we treat it as overhead, where you replace a with b and b with a, just renaming the wires rather than physically building a swapping gate. So this is free. But then the expensive gates: we start with the CNOT gate, which is the quantum, or reversible, equivalent of XOR. And you see it's reversible because you keep one of the inputs around. If you keep b around, and you have a + b, you can add b to it again; so you repeat the gate to undo it. And quantum gates acting on multiple qubits are really much more expensive. So the most expensive gate, by far, is the Toffoli gate, which replaces AND. Of course, with AND on two bits, if it outputs zero, you cannot recover the inputs, even if you keep one of the bits around. So in the quantum case, we have to use three qubits, and we have to keep both inputs around: we XOR the AND result onto a third qubit c in this example. And again, if you repeat the Toffoli gate, you get c + ab + ab, which modulo 2 is c again. We only need these four gates to build the circuits we will be talking about today. All right, so for these circuits, as I said at the beginning, we will be looking at how big a quantum computer needs to be to actually implement them. So our primary concern will be the number of qubits, but we still need more measures of quality. In classical computing, you might think of just counting the number of gates, for example the number of XOR and AND gates, and taking that as the complexity measure.
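As a minimal illustration of the reversibility argument (my own toy Boolean functions, not a quantum simulation; the names are mine), the four gates can be sketched on classical bits, and applying each one twice restores the inputs:

```python
# Toy bit-level sketch of the four reversible gates (function names are mine).
def not_gate(a):
    return 1 - a                      # NOT: 1 - (1 - a) = a, so self-inverse

def swap_gate(a, b):
    return b, a                       # SWAP: just a renaming of wires, free

def cnot(a, b):
    return a, a ^ b                   # CNOT: reversible XOR, input a is kept

def toffoli(a, b, c):
    return a, b, c ^ (a & b)          # Toffoli: reversible AND onto target c

# Applying each gate twice restores the original inputs.
assert not_gate(not_gate(0)) == 0
assert cnot(*cnot(1, 1)) == (1, 1)
assert toffoli(*toffoli(1, 1, 0)) == (1, 1, 0)
```

The Toffoli check is exactly the c + ab + ab = c (mod 2) argument from above, written out on bits.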
But in quantum computing, the Toffoli gate is much, much more expensive than the CNOT gate, because it acts on three qubits instead of two, and estimates put it at anywhere from at least seven times to many, many more times as expensive. So our secondary concern for today will be counting the number of Toffoli gates. But you can go beyond that, because as you might be aware, you can parallelize some of these Toffoli gates. So you have a measure called depth, the actual number of time steps it takes when executing multiple gates at the same time. We will not be focusing on that today, but it is also a very good measure of quality of a circuit. And finally, I will be talking today about what we call logical qubits. That means qubits with a very low rate of error. But if you actually want to implement this on a physical quantum computer, you're dealing with physical qubits, and you need many physical qubits to simulate one logical qubit. So when we say today that you need, for example, around 2,000 qubits to break ECDH, that means logical qubits. To actually implement it on a quantum computer, you need many, many more physical qubits. So if you read a headline saying Google built a quantum computer with 2,000 qubits, that doesn't right away mean that ECDH is completely broken, although it would be a very good sign that we should probably be moving on to post-quantum cryptography. All right, so let's go through the operations we need. Addition is a straightforward operation. In the simple case, if you add a constant in a binary finite field, you just use NOT gates, the same as in the classical case. And if you want to add two variables, classically you would use bit-wise XOR, but we need to keep one of those inputs around, so we use CNOT gates. And since we are dealing with up to n bits and qubits, we need n CNOT gates. And again, we keep one of the inputs around.
We can undo this computation by just repeating it in this case. All right, so next we look at multiplication by x, and for that we need to go a bit more in-depth into our structure. Today we represent every field element as a polynomial, and since every polynomial is a binary polynomial, you can very nicely implement them as bit strings. These polynomials all have degree up to n − 1, and the field polynomial has degree n. So if you just want to multiply by x, ignoring the modular reduction for now, this is free: you just swap every qubit one position over, or in the picture, swap everything one down. You can see in the picture on the slide that the first three swap gates are just the multiplication by x. And then we need to do the actual modular reduction, and in a binary finite field the cost depends on the number of coefficients of the field polynomial: if it's a trinomial, we just need one CNOT gate, and if it's a pentanomial, we need three CNOT gates. So that's also fairly efficient. In the picture, that's the CNOT gate you can see on the right. So now we have multiplication by x, and since we are doing this in place and it is a linear map, we actually also get division by x: if you want to reverse it, you just run the circuit from right to left, and you suddenly have an algorithm for division by x without having to take any extra steps to create it. For linear maps, that's a very nice property. All right, so that's multiplication by just x; what if we want to multiply by a more complex, but still constant, polynomial? In a finite field, multiplication by a nonzero constant polynomial is an invertible linear map, so given your polynomial representation, you can write it down as a matrix. And in general, if we have a linear map given by an invertible matrix, we can always turn that into a quantum circuit consisting only of CNOT gates.
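The shift-then-reduce structure can be sketched classically (my own illustrative code: field elements are ints used as bit strings, and the trinomial is the B-233 field polynomial x^233 + x^74 + 1, a field size that comes up later in the talk; in the actual circuit the shift is free swaps and the trinomial reduction is a single CNOT):

```python
# Classical sketch of in-place multiplication by x in GF(2^n), with the
# element stored as an int bitstring, reduced modulo the trinomial
# x^233 + x^74 + 1 (the B-233 field polynomial).
N, MOD = 233, (1 << 233) | (1 << 74) | 1

def mul_x(f):
    f <<= 1                  # the chain of swaps: every coefficient moves up
    if f >> N & 1:           # if the degree-n coefficient is now set ...
        f ^= MOD             # ... reduce (one CNOT in the rotated circuit)
    return f

def div_x(f):                # the same circuit, read from right to left
    if f & 1:
        f ^= MOD
    return f >> 1

assert mul_x(1 << 5) == 1 << 6
assert mul_x(1 << 232) == (1 << 74) | 1   # x^233 = x^74 + 1 in this field
assert div_x(mul_x(1 << 74)) == 1 << 74
```

Running `div_x` is literally `mul_x` with the two steps in reverse order, which is the "read the circuit right to left" point from above.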
So we do that using an LUP decomposition to decide where the CNOT gates go. And that takes up to n squared CNOT gates, but no Toffoli gates, so that's very nice. And specifically in binary finite fields, we can do the same with squaring, because squaring in a binary finite field is also a linear map, which means again we can use an LUP decomposition to create a series of CNOT gates giving in-place squaring. And if you invert it, you also get in-place square roots and in-place division by a constant polynomial. Later we will need to add a squaring result to a different polynomial, and you can actually implement that with fewer CNOT gates, because squaring in binary finite fields is very well behaved and takes up to about three n CNOT gates. All right, so we have multiplication by a constant. Now we're going to look at something much more complex, which is general multiplication: you have two variable polynomials and you want to multiply them. So you will need some Toffoli gates now. In one of my earlier works, I looked at doing quantum Karatsuba multiplication, and Karatsuba multiplication is a fairly efficient multiplication algorithm. Really the main part of that result is that we only end up needing 3n space: 2n qubits for the input and n for the output, because we need to keep the input around in order to be able to uncompute the output. It's not like the in-place multiplication by x, where if we want to uncompute, we can just run the circuit from right to left and get division. Sadly, going from multiplication to division will again take some extra work. And what's nice about this algorithm is that it needs no ancillary qubits; these ancillary qubits are usually used to store intermediate values. We generally consider having these bad, because ancillary qubits have to be uncomputed, set back to zero, and qubits are very expensive.
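The classical arithmetic that the quantum Karatsuba circuit follows can be sketched like this (my own illustrative, non-reversible Python: binary polynomials are stored as ints, and the recursion threshold of 8 bits is an arbitrary choice of mine). The point is the three half-size multiplications instead of four:

```python
def clmul(a, b):
    """Schoolbook carry-less multiply of binary polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba(a, b, n):
    """Karatsuba for binary polynomials of degree < n: three half-size
    multiplies instead of four, mirroring the structure of the circuit."""
    if n <= 8:                         # arbitrary base-case threshold
        return clmul(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h          # a = a1*x^h + a0
    b0, b1 = b & mask, b >> h
    lo = karatsuba(a0, b0, h)
    hi = karatsuba(a1, b1, n - h)
    mid = karatsuba(a0 ^ a1, b0 ^ b1, n - h) ^ lo ^ hi
    return lo ^ (mid << h) ^ (hi << (2 * h))

assert karatsuba(0b1011, 0b1101, 16) == clmul(0b1011, 0b1101)
```

Over GF(2) the usual Karatsuba subtractions become XORs, which is part of why the reversible version works out so cleanly with CNOTs.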
And so by having no ancillary qubits in this multiplication algorithm, we consider it very efficient. And the Toffoli gate count we also consider very efficient, because it's minimal, at least for Karatsuba multiplication. All right, so we have multiplication. Finally, we have division, or in our finite field, inversion. And this will be the most expensive step of our algorithm. For it we compare two methods for inversion. The first one is division based on the extended Euclidean algorithm, and the second is division based on Fermat's little theorem. You're hopefully familiar with the extended Euclidean algorithm, but as you might be aware, it has a variable number of steps, so if you want to implement it reversibly, you have to keep track of the number of steps and keep that around. In order to fix that, we implemented an inversion based on a classical constant-time extended Euclidean algorithm. This is nice for the quantum case, because it means you don't have to keep around a counter or anything like that at the end of your algorithm. And in this picture, you can see a representation of this algorithm as a big circuit. All right, so that was the extended Euclidean algorithm; you're hopefully aware of how it works, and otherwise you can look at our paper for more. The other option is Fermat's little theorem, which says x^p = x mod p, and from that we can find the inverse by exponentiation. This works just as well in binary finite fields; we just need to take a bigger power. The issue is that with plain square-and-multiply you end up doing a lot of multiplications, and multiplications, again, are expensive. So what we do instead is use Itoh-Tsujii inversion, which optimizes the number of multiplications in the exponentiation.
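As a classical illustration of the Fermat approach (my own toy code, not the paper's circuit, and the AES field GF(2^8) is an illustrative field choice, not one from the paper): since x^(2^n) = x in GF(2^n), the inverse is x^(2^n − 2), computed here with n − 1 squarings and n − 1 multiplications. Itoh-Tsujii then reorders this chain so that only about log2(n) of the operations are genuine multiplications:

```python
# Toy Fermat-based inversion in GF(2^8) with the AES polynomial
# (illustrative field, not one from the paper).
N, MOD = 8, 0x11B            # x^8 + x^4 + x^3 + x + 1

def fmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N & 1:
            a ^= MOD         # reduce whenever a overflows degree N
    return r

def finv(a):
    # x^(2^N - 2) = x^2 * x^4 * ... * x^(2^(N-1)): N-1 squarings and
    # N-1 multiplications; Itoh-Tsujii reorders this so only about
    # log2(N) of them are real multiplications.
    r, s = 1, a
    for _ in range(N - 1):
        s = fmul(s, s)       # squaring: only CNOTs in the quantum circuit
        r = fmul(r, s)
    return r

assert fmul(finv(0x53), 0x53) == 1
assert finv(1) == 1
```

In the quantum setting the squarings are nearly free (linear maps, so CNOTs only), which is exactly why trading squarings for multiplications pays off.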
It does have a rather large number of squarings, but for us that's not really an issue, because we're mainly looking at the number of Toffoli gates, and squaring, being a linear map, only takes CNOT gates. So it has a relatively low number of multiplications, roughly logarithmic in n. And these multiplications, again, we implement using the Karatsuba-based algorithm. Here you have an example of an inversion, in this case for n = 10: you just do a number of squarings and then a multiplication. All right, so now we want to compare these two. If we look at how efficient they are: the extended-Euclidean-based inversion certainly uses a fair number of Toffoli gates, but it uses a relatively low number of qubits, our main objective, while the Fermat's-little-theorem-based inversion uses more qubits but far fewer Toffoli gates. Here's an example for n = 233, a commonly used field size, and you can see the numerical results. And we know that no matter the size of n, this comparison roughly holds: the XGCD-based algorithm has the lower qubit count but the higher Toffoli gate count. All right, so now we want to put all these things together and look at point addition. For point addition, we have to combine all these pieces, and we need to formalize a bit what we mean by it. We're adding a pre-computed point P2, which is a multiple of P that we pre-computed, conditioned on a qubit q, to a point P1 which is in superposition, which is a fancy quantum word. But for our case, it does not matter that q and P1 are quantum; they behave exactly as we want whether they're quantum or classical. Our point addition algorithm uses two squarings, two multiplications, and two divisions. And the divisions really are the expensive part of this algorithm.
And we need two divisions, despite the result needing only one, because we need to clear our ancillary qubits. Again, keeping ancillary qubits around at every step is bad, so we need to uncompute these intermediate values. If you're familiar with point addition, you might see this and think there are some issues, specifically the special-case additions. The first one is adding the point at infinity, the zero of the elliptic curve group: adding it to a point P1 should always give P1 back. The second is adding P1 to its own inverse or to itself, which also gives special cases. But as it turns out, the chance of a special case occurring is fairly low. In fact, it's so low that by just repeating our calculation a very small number of times, you can make sure that these do not show up as a problem in the final result. All right, so the last thing we can do with our algorithm is pre-compute more points, because right now we're pre-computing only a very small number of points and not really looking them up. In the classical case, you can often speed up algorithms like this by pre-computing some points, storing them in ROM, and then looking them up. And in quantum computing, this is intuitively even smarter, because even in 50 or 100 years, when we might have a big quantum computer, classical computation will still be much, much cheaper than quantum computation, just because of how limited quantum computing currently is. So even if we do many pre-computations, we can still get a speedup. If we pre-compute some of these points, we have to do a quantum random-access memory lookup, and this is expensive. I will tell you how expensive we currently think it is, but it's much more expensive than a classical random-access memory lookup, so you have to limit the window size. And again, we have an example here with n = 233.
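To make the two-squarings, two-multiplications, two-divisions shape concrete, here is a classical sketch of the affine addition formulas for a binary curve y² + xy = x³ + Ax² + B, over a toy field GF(2^4) with made-up curve coefficients (none of these parameters are from the paper, and this handles only the generic case of distinct x-coordinates, ignoring the special cases discussed above):

```python
# Toy binary-curve point addition over GF(2^4); illustrative parameters.
N, MOD = 4, 0b10011          # field polynomial x^4 + x + 1
A, B = 1, 1                  # curve coefficients (assumed for illustration)

def fmul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> N & 1:
            a ^= MOD
    return r

def finv(a):                 # Fermat: a^(2^N - 2)
    r, s = 1, a
    for _ in range(N - 1):
        s = fmul(s, s)
        r = fmul(r, s)
    return r

def on_curve(x, y):          # y^2 + xy == x^3 + A x^2 + B
    return fmul(y, y) ^ fmul(x, y) == fmul(fmul(x, x), x) ^ fmul(A, fmul(x, x)) ^ B

def add(P, Q):               # generic case only: distinct x-coordinates
    (x1, y1), (x2, y2) = P, Q
    lam = fmul(y1 ^ y2, finv(x1 ^ x2))   # the division: the expensive step
    x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = fmul(lam, x1 ^ x3) ^ x3 ^ y1
    return x3, y3

pts = [(x, y) for x in range(16) for y in range(16) if on_curve(x, y)]
P, Q = next((p, q) for p in pts for q in pts if p[0] != q[0])
assert on_curve(*add(P, Q))
```

The single `finv` call per addition is where the quantum circuit spends its Toffoli budget, and the second division in the actual circuit exists only to uncompute the slope ancilla.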
You can see the number of Toffoli gates is dominated by the division again, but with a window size of seven you get roughly one-seventh the number of Toffoli gates. And you can take this to the extreme: with a window size of 32, you need to pre-compute around 69 billion points, but you need far fewer Toffoli gates. As it turns out, with our current approximations, we think the optimal window size for every field, for every n, is probably between 7 and 16, and I'll explain why. First, let's go through the summary of the results without windowing. In our results, you see that the division really is the most expensive step. We have some small results, so you can see how the cost increases, and in the bottom three rows you can look at currently deployed binary elliptic curve cryptography. For the final case, we can say that you need a quantum computer with roughly 2,000 to 4,000 logical qubits, so not physical qubits, logical qubits, to solve binary elliptic curve Diffie-Hellman very efficiently. And now let's look at windowing. For this we do need to approximate the cost of a QROM lookup, so we take from previous work an approximation of the cost of each lookup based on the size of our window. And again, you can see our results; the bottom three rows are binary elliptic curves currently in wide use. The number of lookups and the optimal window size increase, but the total Toffoli gate count in this case is much lower than the total Toffoli gate count without windowing. All right, so we are fairly happy with these results, because given our current division and multiplication algorithms, we think they're very efficient, and for any given n we can give a good estimate of the number of logical qubits you need.
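A toy cost model (entirely my own illustrative numbers, not the paper's formulas) shows why some intermediate window size wins: windowing cuts the number of point additions by a factor of w, but the lookup circuit grows exponentially in w, so the optimum sits in between:

```python
# Toy windowing trade-off (illustrative constants, not the paper's):
# ~2n controlled additions shrink to ~2n/w with window size w, but each
# windowed addition needs a table lookup priced here at ~2^w Toffolis.
def total_toffoli(n, w, t_add):
    additions = -(-2 * n // w)          # ceil(2n / w) point additions
    return additions * (t_add + (1 << w))

n, t_add = 233, 100_000                 # t_add: assumed Toffolis per addition
best = min(range(1, 33), key=lambda w: total_toffoli(n, w, t_add))
assert 7 <= best <= 16                  # the ballpark quoted in the talk
```

With these assumed constants the optimum lands in the 7 to 16 range mentioned above; the real optimum of course depends on the actual division and lookup costs for each n.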
And most of these qubits, a little over half, end up being ancillary qubits needed for the division. We can also say something about the number of Toffoli gates, which unfortunately is still rather large, but we do think it's optimal given our division and multiplication algorithms. So previous work in prime fields has given similar results, and you can really see how nice the binary finite fields are in this case, because they have cheaper addition, cheaper multiplication, and cheaper division, and that gives us a pretty significant speedup. And finally, if you are familiar with elliptic curve arithmetic, you know that you can sometimes use projective coordinates, and they can really reduce the number of divisions. However, all work we are currently aware of that uses projective coordinates does not optimize for space. The previous work we compare with in the paper did not optimize for space at all: they have very few divisions, so they have a much lower Toffoli gate count, but because they keep a lot of intermediate values around, they use significantly more space. So we think our results will be very useful for future work. Thank you for your attention.