This talk is about fast, constant-time GCD computation and modular inversion. It is joint work by Dan Bernstein and Bo-Yin Yang, and Bo-Yin is going to give the talk. Okay, so the executive summary of this talk is that we have fast and safe GCDs and inversions. Normally we compute the inverse in the field with p elements using Fermat's little theorem. That is n cubed if you are using schoolbook multiplication, something like n to the 2.6 using Karatsuba, and essentially n squared if you are using FFT-based multiplication. So the next question is: why are we not using the extended Euclidean algorithm, which is roughly a factor of n faster? Typically the answer is that we need our algorithms to be constant time. Our algorithm is constant time, it achieves n^(1+o(1)) bit operations, and it is simpler than previous variable-time algorithms: there is no division subroutine between the two recursive calls.

There are several examples of modern cryptography that need inversions. There is NTRU key generation, where we need to find an inverse in F3[x] modulo x^n - 1, or for HRSS modulo the cyclotomic polynomial. We also need an inverse in (Z/2^k Z)[x] modulo x^n - 1, which depends on another inversion in F2[x] modulo x^n - 1. Similarly, NTRU Prime key generation needs two more kinds of inverses, and in CSIDH, for example, there is an inversion modulo a 511-bit prime. For all of these cases except the one where Fermat-style inversion is already fast enough, we are able to improve the situation with our algorithm.

So let's take a look at the Euclid-Stevin algorithm, the extended GCD, in F7[x]. We have R0 and R1; you will recognize that their coefficients are digits of pi and e, and you see sevens and eights because sometimes the results are written in non-reduced form. This is not a constant-time algorithm: the ideal Euclidean step has a dividend whose degree is one higher than the divisor's, and a remainder whose degree is one lower than the divisor's, but here there is one step that is not ideal. What does this mean? If we look at this as a sequence of subtractions, there are 15 coefficients and we want to reduce them to one coefficient, the constant, so there should be 14 steps; but there is one skip, so there are only 13 steps. The number of steps depends on the number of imperfections in the Euclidean sequence.

So let's take another look at the Euclidean subtraction stage. We start with a dividend of higher degree than the divisor. The regular subtraction stage subtracts from the dividend an appropriate multiple of the divisor. If the dividend's leading term is zero, there is no problem: we make a dummy subtraction, decrement the dividend degree, and if the divisor now has a higher degree, we swap the two polynomials. However, if the divisor has a zero leading term, we still need to decrement the divisor degree and make a dummy subtraction, and we have to account for this whether it happens or not, because we want constant time.
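To make the constant-time problem concrete, here is a minimal Python sketch of that subtraction-stage view over F7. This is only an illustration with my own helper names, not the authors' code; the point is that both the branches and the number of iterations depend on the secret coefficients, the "imperfections" just mentioned.

```python
# Variable-time sketch of the Euclid-Stevin subtraction stage over F7.
# Polynomials are lists of coefficients, constant term first.

q = 7

def degree(p):
    """Degree of p, or -1 for the zero polynomial."""
    d = -1
    for i, c in enumerate(p):
        if c % q:
            d = i
    return d

def gcd_subtraction_stages(r0, r1):
    """Cancel the leading term of the larger polynomial with the smaller one,
    one subtraction stage at a time, until one of them is a constant (or zero).
    The swap, the exit test, and the step count all depend on secret data,
    which is exactly why this straightforward version is not constant time."""
    steps = 0
    while degree(r0) > 0 and degree(r1) > 0:
        if degree(r0) < degree(r1):                  # secret-dependent swap
            r0, r1 = r1, r0
        d0, d1 = degree(r0), degree(r1)
        c = r0[d0] * pow(r1[d1] % q, q - 2, q) % q   # ratio of leading coefficients
        for i in range(d1 + 1):                      # subtract c * x^(d0-d1) * r1 from r0
            r0[i + d0 - d1] = (r0[i + d0 - d1] - c * r1[i]) % q
        steps += 1
    return steps, r0, r1

# Degree-7 and degree-6 inputs with coefficients loosely taken from digits of
# pi and e (reduced mod 7); illustrative only, not necessarily the slide's values.
print(gcd_subtraction_stages([6, 2, 2, 5, 1, 4, 1, 3], [1, 1, 2, 1, 1, 0, 2]))
```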
So how did the existing constant-time GCDs do it? They mostly kept the polynomials as arrays, kept track of the degrees, and did the GCD in rising order from the constant term up. Now, here is a better way to do the subtraction stage. We can take the polynomial that is known to be bigger as the divisor; the reason to do that is that we can then ensure its leading term is nonzero. We track only one number, delta, which is the degree difference: the degree of the divisor minus the degree of the dividend. And we reverse the polynomials so that the leading term corresponds to the constant term.

So here is our subtraction stage, which we call a divstep. If delta is positive and the dividend has a nonzero leading term, we swap the two polynomials and negate delta. Then we take an appropriate linear combination of the two polynomials, we shift the dividend, which means dividing by x, and we increment delta.

So what do we do exactly? Here are the details of the computation when we are doing a GCD of R0 and R1, or computing the inverse of R1 modulo R0, where R0 has degree d and R1 has degree less than d. We set up by letting f be the reverse of R0, and the dividend g is x^(d-1) R1(1/x). The degree difference delta we set to 1. We then do 2d - 1 divsteps and collect the return values; that gives us the answer.

To make this more precise, divstep is a map from the set Z x {power series with nonzero constant term} x {all power series} to itself, given by the following formula. There are two cases: one applies if delta is greater than zero and g has a nonzero constant term, and the other applies otherwise.

This is the example we gave earlier. Notice that we start with one polynomial of degree seven and one of degree six; after 13 steps f stays constant, and after 14 steps g is zero. So, like I said, after 2d - 1 divsteps we have the result.

So what do we do to get a constant-time divstep? That is not too hard, and here are the steps. We first make a mask, we XOR the correct mask into f and g, we XOR the correct mask into delta, and then there is a uniform second half. We can do this with AVX2 instructions if necessary.
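Here is a minimal Python sketch, with my own variable names, of the polynomial divstep and the 2d - 1 iterations just described, over F7 to match the example. The branch is written out for readability; a real implementation replaces it with the mask trick above and vectorizes it, and the real code additionally collects the data needed for the modular inverse, which this sketch omits.

```python
# Sketch of the polynomial divstep over F7. Polynomials are lists of
# coefficients with the constant term first, so reversing R0 and R1 turns
# their leading terms into constant terms, as in the talk.

q = 7

def divstep(delta, f, g):
    """One divstep:
       (delta, f, g) -> (1 - delta, g, (g(0)f - f(0)g)/x)  if delta > 0 and g(0) != 0
       (delta, f, g) -> (1 + delta, f, (f(0)g - g(0)f)/x)  otherwise."""
    if delta > 0 and g[0] % q != 0:          # conditional swap (a branch here; masks in real code)
        delta, f, g = -delta, g, f
    # eliminate the constant term of g, then divide by x (shift down)
    g = [(f[0] * g[i] - g[0] * f[i]) % q for i in range(len(g))]
    g = g[1:] + [0]
    return delta + 1, f, g

def gcd_via_divsteps(r0, r1):
    """r0 of degree exactly d = len(r0) - 1, r1 of degree < d, constant term first.
    After 2d - 1 divsteps, f (read back in reverse at its own degree) is
    gcd(r0, r1) up to a nonzero scalar. The modular inverse additionally needs
    the transition matrices, which this sketch does not track."""
    d = len(r0) - 1
    f = list(reversed(r0))                           # x^d * r0(1/x)
    g = list(reversed((r1 + [0] * d)[:d])) + [0]     # x^(d-1) * r1(1/x), padded to length d+1
    delta = 1
    for _ in range(2 * d - 1):
        delta, f, g = divstep(delta, f, g)
    return f

# Example: r0 = (x+1)(x+2) = 2 + 3x + x^2 and r1 = 2(x+1) = 2 + 2x over F7.
# The result is [2, 2, 0], i.e. 2 + 2x = 2*(x + 1): the gcd up to the scalar 2.
print(gcd_via_divsteps([2, 3, 1], [2, 2]))
```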
If we just do divsteps, then for example, for the NTRU and NTRU Prime rings, we can do the inversion in F3[x] modulo x^700 + x^699 + ... + x + 1. The original NTRU-HRSS code took about 150,000 Haswell cycles; that algorithm tracks two extra indices compared to ours and requires a variable scaling by x^r at the end. If we just use our method, it takes about 90,000 cycles. We can also look at NTRU Prime key generation, which is mostly an inversion in F4591[x] modulo x^761 - x - 1, plus another inversion in F3[x] modulo x^761 - x - 1. Originally this takes six million cycles in the NTRU Prime code; with our code it takes less than one million, 0.94 million cycles to be exact.

Okay, so that is the divstep, and this is still an n-squared algorithm. Next, let's look at the integer case. We have a radix-2 analogue of the divstep, which we can consider as a map from Z x {odd 2-adic integers} x {all 2-adic integers} to itself. There is again a formula that gives the result, and again we can split this 2-adic divstep into two halves. First there is a conditional swap: (delta, f, g) is mapped to (-delta, g, -f) if g is odd and delta is greater than zero. Notice that f is negated here; if we did not negate f, the iteration might not terminate. Then, for the second half, we eliminate: we increment delta and replace g by (g + (g mod 2) f)/2.

We have a termination theorem which says that if f and g are k-bit integers, roughly 2.883 k 2-adic divsteps are enough to reach a zero, and at that point we have the GCD and we can compute the modular inverse.

We can turn this into a subquadratic GCD or modular inversion with the following recursion, which is simpler in structure than previous subquadratic GCDs and modular inversions because there is no middle division step. The transition matrix of n divsteps applied to (delta, f, g) depends only on the bottom n bits, or coefficients, of f and g. The result is that we can do half of the steps using half the precision, then update f and g using fast multiplication, usually FFT-based, and then do the remaining half of the steps. Doing this recursively, n divsteps take time n (log n)^(2 + o(1)) when using FFT multiplication, and this is constant time. Previously, the structure was two recursive calls sandwiching a division; the division is necessary to ensure that there is progress, but it is not naturally constant time, making it constant time takes some non-trivial modification, and the split is not even, which makes it slower. Our algorithm has two identical recursive calls, the split is even, and there is no division step.

So what are the results? For integer inversion, for the 25519 prime on Intel CPUs we get 10050 cycles on Haswell, 8778 on Skylake, and 8543 on Kaby Lake; these are medians over about 10,000 runs. The previously best known result is due to Nath and Sarkar, and it is anywhere between five and ten percent slower. Now, this is not very much faster, and our code is much more complicated, so you might say it is not quite worthwhile, and we would agree that if that were the whole story you would not need this algorithm. However, if we look at smaller CPUs that do not have as big a multiplier, for example the ARM Cortex-A7, which is a very common microprocessor, and take the same 25519 prime, our code takes about 35,000 A7 cycles, which is about 40 percent faster than Fujii and Aranha. And if we take a 511-bit prime like the one used in CSIDH, which unlike 2^255 - 19 is not a pseudo-Mersenne prime, so Fermat's little theorem is at more of a disadvantage, we are anywhere between 50 and 120 percent faster. Finally, for the NTRU polynomial inversions, the subquadratic code is about 10 percent faster than the plain divstep code. That is not much faster, but think of it this way: the person who programmed the divsteps is Dan Bernstein, and the person who programmed the subquadratic algorithm is mostly me, so you should consider that an achievement.

What is left? There are more use cases. There is CSIDH, where the integer inversions get a 1.5 to 2 times speedup, and at the PQC standardization workshop the LEDAcrypt developers said that they can get a two to four times speedup. Our proof of the termination theorem uses exhaustive search, and we would like to find a prettier theorem. As future work we want to try to verify our fairly complex code, and we think people could afford to use more inversions if the ratio of inversion cost to multiplication cost, I over M, is small enough. I think that is it, and I will take any questions.
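For reference, here is the same kind of minimal Python sketch for the 2-adic divstep described earlier, again with my own names and with the secret-dependent branch written out rather than masked, plus a small check of the bottom-bits property that the subquadratic recursion relies on.

```python
# Sketch of the 2-adic divstep for integers, with f odd.

def divstep2(delta, f, g):
    """(delta, f, g) -> (1 - delta, g, (g - f)/2)             if delta > 0 and g is odd
       (delta, f, g) -> (1 + delta, f, (g + (g mod 2)f)/2)    otherwise."""
    if delta > 0 and g % 2 == 1:
        delta, f, g = -delta, g, -f      # negating f here is what makes the iteration terminate
    g = (g + (g % 2) * f) // 2           # clear the bottom bit of g, then halve (exact)
    return delta + 1, f, g

def gcd2(f, g, iterations):
    """Iterate divsteps; per the termination theorem roughly 2.883*k iterations
    suffice for k-bit inputs, after which g = 0 and |f| is the gcd."""
    delta = 1
    for _ in range(iterations):
        delta, f, g = divstep2(delta, f, g)
    return abs(f)

def decisions(delta, f, g, n):
    """Branch pattern of the first n divsteps. It depends only on delta and the
    bottom n bits of f and g, which is the fact the subquadratic recursion uses."""
    pattern = []
    for _ in range(n):
        pattern.append(delta > 0 and g % 2 == 1)
        delta, f, g = divstep2(delta, f, g)
    return pattern

print(gcd2(15, 6, 20))                   # 3
assert decisions(1, 1234567891, 987654320, 8) == \
       decisions(1, 1234567891 % 256, 987654320 % 256, 8)
```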
Probably time for a quick question. Doesn't seem like it. No? Okay. Hi, I was wondering if you have a feeling for how this compares, for very small binary fields, to powering up using a normal basis, as in Itoh-Tsujii; I am thinking in particular of the inversion in the AES S-box. Binary fields, that is an interesting question, but I would say for the most part that it probably takes a larger polynomial for this to work. I think there is some work on McEliece using a similar method, but I am not sure if that fits your question. Thank you. One quick question. You mentioned other use cases, such as SIDH and CSIDH. The speedup you mentioned, the 1.5x to 2x, does that correspond to 64-bit machines, or do you get better performance on smaller machines? Right. So that speedup, is it for the 64-bit machines, the large ones, or the small ones? That is for a 64-bit machine. Okay, so on small ones you should get a better speedup, right? Yeah, I mean, when I actually take the time to write proper code for that, yes, it should be better, even better than two to four times. Okay, great. Okay, let's thank Bo-Yin again.