Hi. Together with Cyril Bouvier, we recorded this brief presentation of our PKC 2020 paper, so hopefully it's somewhat watchable.

The context of this talk is integer factorization. As you may know, the best known algorithm for factoring a very big integer is the number field sieve. The current record, from February 2020, is a 250-digit RSA number that was factored using the open-source software CADO-NFS. The number field sieve algorithm has several steps. One of the first is called the sieving phase, and part of this sieving phase consists of breaking millions or even billions of, say, medium-sized integers into primes. This step is called cofactorization, and the time spent on it is significant: we estimate that it accounted for about one third of the time for the earlier record, RSA-768.

The usual method for finding the prime factors of these medium-sized integers is the elliptic curve method (ECM). I will not go into the details of ECM, for lack of time. The only thing you need to know to understand the rest of the talk is that stage 1 of ECM simply consists of computing a very big scalar multiplication k times P, where P is a point on an elliptic curve and k is a special number: the product, for every prime up to some predetermined bound b1, of the largest power of that prime not exceeding b1. As you can see in the example, if we pick b1 = 32, then the scalar k is 2^5 times 3^3 times 5^2 times all the remaining primes up to b1.

To compute k times P, we can think of two rather naive options. The first one is to evaluate the integer k and then compute the scalar multiplication k times P. Or we can take the primes one after the other and accumulate the result in the current point. Which of the two is better depends on the way you perform the scalar multiplication, either k times P or prime by prime.

Now, what do we know when we do this as part of cofactorization in the number field sieve context? We know that the integers we want to break into primes have a size of roughly 150 bits. Of course, it depends on the size of the integer you want to factor using NFS, but this is more or less what you expect. We also know that these integers have no small factors, since those have been eliminated by the sieving phase, and we usually expect them to have three or four prime factors. We also know that the bounds b1 are usually rather small and fixed. That's important: they can be hard-coded in your implementation. For example, in CADO-NFS the b1 values go from 105 to 8192. The consequence is that the scalars k for all these b1 values are known in advance. So the goal is to design some kind of optimal algorithm for computing k times P for those b1 values.
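To make the shape of k concrete, here is a minimal Python sketch of ours, purely for illustration, of how the stage-1 scalar for a given b1 can be built:

```python
def primes_up_to(n):
    """Simple sieve of Eratosthenes; enough for the small b1 values used here."""
    is_prime = [True] * (n + 1)
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            for m in range(p * p, n + 1, p):
                is_prime[m] = False
    return [p for p in range(2, n + 1) if is_prime[p]]

def stage1_scalar(b1):
    """k = product, over primes p <= b1, of the largest power of p not exceeding b1."""
    k = 1
    for p in primes_up_to(b1):
        pe = p
        while pe * p <= b1:
            pe *= p
        k *= pe
    return k

# For b1 = 32 this gives 2^5 * 3^3 * 5^2 * 7 * 11 * 13 * 17 * 19 * 23 * 29 * 31.
print(stage1_scalar(32))
```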
The first algorithm that used something rather different from these two naive options is due to Dixon and Lenstra. Their idea was to regroup some of the primes into what we are going to call blocks for the rest of this talk, in order to reduce the number of additions. For example, in the formula you can see that you can write k as a product of products of well-chosen primes. The best way to see why this helps is an example. I pick three primes, p1, p2 and p3, and give their Hamming weights. If you use the double-and-add algorithm for this kind of application, the number of additions is equal to the Hamming weight minus 1. The first prime has Hamming weight 10, the second 16 and the third 11. But if you now consider the product of those three primes, you can see that the Hamming weight of the product is only 8. So it is much more efficient to compute p1 times p2 times p3 and then multiply this number by P than to take the primes one after the other. That was the idea of Dixon and Lenstra. They implemented this algorithm, but of course finding the best combination over all the primes was way too expensive; that limitation is why they only considered blocks of at most three primes.

Some years later, Bos and Kleinjung proposed a nice improvement. As I just said, it was just not possible to consider blocks of more than three primes; it was way too expensive for practical b1 values. So instead they used the opposite strategy: they generated a huge number of integers with very low Hamming weight, because they knew that the corresponding scalar multiplications would need a small number of additions, and they checked those integers for smoothness. Let's pick an example with the small value b1 = 32. The first binary string has Hamming weight 3 and, as you can see, if you evaluate the integer that corresponds to it, you get a b1-smooth value, meaning that all the prime factors of this number are at most b1. So we save this number and can hopefully use it in the algorithm. The second binary string also has Hamming weight 3, but this time it is not b1-smooth, because one of its factors is greater than 32, so we do not keep this one. And of course we can also consider signed representations, as with the NAF for example; the third example has Hamming weight 2 and is also b1-smooth, so we keep this one. So the idea of Bos and Kleinjung was to generate a huge number of integers of that form, check them for smoothness, keep those that were smooth, and try to recombine everything in a nearly optimal way.
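Here is a toy version of that reverse search in Python, just to illustrate the idea; the bit length, weight and naive smoothness test are illustrative choices of ours, not the actual parameters of Bos and Kleinjung:

```python
from itertools import combinations

def is_smooth(n, b1):
    """True iff all prime factors of n are <= b1 (naive trial division)."""
    for d in range(2, b1 + 1):
        while n % d == 0:
            n //= d
    return n == 1

def low_weight_smooth(bits, weight, b1):
    """Enumerate `bits`-bit integers of given Hamming weight, keep the b1-smooth ones."""
    found = []
    for rest in combinations(range(bits - 1), weight - 1):
        n = (1 << (bits - 1)) | sum(1 << i for i in rest)  # top bit always set
        if is_smooth(n, b1):
            found.append(n)  # costs only weight-1 additions with double-and-add
    return found

print(low_weight_smooth(10, 3, 32))
```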
Another parameter that is important when you implement ECM is which curve model you pick. Well, it's not clear; it depends on several things. We usually have two competitors. The original option, Montgomery curves, has several advantages: you can represent a point using only its X and Z coordinates, it has a very efficient doubling, and you can enjoy a tripling if you want. The bad point is that we do not have a general addition, only what we call a differential addition; we will get back to that in a few minutes. Because of this restriction on addition, the scalar multiplication is also somewhat constrained, and we need to generate what is called a Lucas chain. On the other hand, twisted Edwards curves have very good arithmetic for doubling, tripling and addition, and you can use the scalar multiplication algorithm of your choice. The best choice between the two is not clear. In this work, we used the theorem that says that every twisted Edwards curve is birationally equivalent to a Montgomery curve, so we tried to use the best of both worlds.

So what is the contribution of this work? As I just said, we try to use a good mix of Montgomery and twisted Edwards curves. The algorithm works as follows. We start the computation on a twisted Edwards curve. At some point, which we will see later, we switch to the equivalent Montgomery curve. For that, we introduce a new operation called addM: given two points in Edwards coordinates, we can compute the sum P1 + P2 on the corresponding Montgomery curve, in X, Z coordinates only, and this operation costs only four multiplications. This operation is partial, but that is sufficient in this context. We can then finish the computation on the Montgomery curve, including stage 2 of ECM, which we are not going to talk about in this presentation. For the scalar multiplication itself, we propose an extension of Bos and Kleinjung's algorithm in two directions: the generation of blocks of various types, beyond the usual NAF, and a better combination algorithm.

All right, so, block generation. In order to take advantage of the fast tripling operation on twisted Edwards curves, we consider double-base expansions and double-base chains. In that form, a number is written as a sum of mixed powers of two and three. In the picture, the dots are placed according to the corresponding powers of two and three: the powers of two are on the x-axis, the powers of three on the y-axis, and each dot corresponds to a term of the sum. As an example, I give you an integer that can be written as a sum of three terms of that form. If you use this representation to compute your scalar multiplication, you will need eleven doublings (the largest power of two), seven triplings (the largest power of three), and only two additions; you may also need to precompute three points. Double-base chains are a subset of double-base expansions: it is the same representation as a sum of mixed powers of two and three, but with extra divisibility conditions on the terms, which makes it possible to compute the scalar multiplication without precomputed points. In this example, we have a double-base chain with only two terms that requires twelve doublings, eight triplings, and only one addition.

So much for twisted Edwards curves. On Montgomery curves, we do not have a classical addition, only a differential addition: if you want to compute P + Q, you need P and Q, but you also need the difference P - Q. And if you want to use this differential addition to compute your scalar multiplication, you need to use what is called a Lucas chain. This is exactly a chain that satisfies the above condition: a term in the sequence is defined as the sum of two previous terms, with the extra requirement that the difference of these two terms also belongs to the chain. One way to produce Lucas chains was proposed by Peter Montgomery in the PRAC algorithm. I am not going to explain PRAC here, but you can see it as a big switch: you have a rule A, chosen according to some invariants, with a corresponding sequence of operations, then a rule B, and so on, for some number of rules. The way we produce Lucas chains is in the same vein as before: we simply generate short words over the alphabet of PRAC rules. That means that we generate short Lucas chains, then we compute the corresponding integers and test them for smoothness, as we did before.
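As an illustration, here are two small Python helpers of ours (not the generators from the paper) that instantiate these definitions: one counts the curve operations of a double-base expansion, the other checks the Lucas-chain condition needed by differential addition. The three-term expansion below is made up so that its counts match the example above; it is not the actual integer from the slide.

```python
def dbe_ops(terms):
    """terms: list of (s, a, b) meaning s * 2^a * 3^b with s = +-1.
    Evaluating the expansion costs max(a) doublings, max(b) triplings
    and len(terms) - 1 additions."""
    n = sum(s * 2**a * 3**b for s, a, b in terms)
    doublings = max(a for _, a, _ in terms)
    triplings = max(b for _, _, b in terms)
    return n, doublings, triplings, len(terms) - 1

def is_lucas_chain(chain):
    """Check that every term after the first is a_i + a_j for previous
    terms a_i, a_j whose difference is also in the chain (a_i == a_j
    corresponds to a doubling), as differential addition requires."""
    prefix = {chain[0]}
    for c in chain[1:]:
        if not any(c - a in prefix and (abs(2 * a - c) in prefix or c == 2 * a)
                   for a in prefix):
            return False
        prefix.add(c)
    return True

print(dbe_ops([(1, 11, 7), (1, 3, 2), (-1, 0, 0)]))  # 11 doublings, 7 triplings, 2 additions
print(is_lucas_chain([1, 2, 3, 5, 8, 13]))           # Fibonacci numbers: True
```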
For the overall block generation, we use a strategy similar to Bos and Kleinjung's, with the same reverse approach: we consider a very large number of blocks of each type (double-base expansions, double-base chains and Lucas chains) and we eliminate the blocks that do not correspond to smooth integers. Also, if for example we had a double-base expansion and a double-base chain representing the same integer, we only kept the double-base chain, because it needs no precomputation. In total, we generated quite a lot of numbers, as you can see in this table, and it took about 10,000 hours of intensive computation to produce all these blocks. When you compute such a big amount of combinatorial objects, you want to make sure you do not compute the same object twice; that is what we did, and no block was generated more than once. For the rest of the talk, the block combination and the results, I hand over to Cyril, who will finish the presentation. Thank you.

We are now going to see how to use the blocks that were just generated to compute the scalar multiplication in ECM. For small values of b1, like in the example b1 = 32, it is quite easy to find a way to use them: for example, with eight blocks, one double-base chain on the twisted Edwards curve and seven Lucas chains on the corresponding Montgomery curve. The goal here is, given a b1 value, to find the subset of all the generated blocks with the smallest cost such that the product of the integers represented by those blocks is exactly k, which means that this subset of blocks allows us to compute the scalar multiplication in ECM for the given value of b1. Here, by the cost we mean the arithmetic cost, which is just counting the number of multiplications and squarings that are needed.

In their paper, Bos and Kleinjung propose a greedy algorithm to combine the blocks. Their algorithm is very fast and generates good solutions, but not optimal ones. To choose the best block to add to the current solution set, they use two values. The first is the ratio between the number of doublings and the number of additions. The second is a score function that they designed to favor blocks with a large number of large factors. They also propose a randomized version of their algorithm, in which they choose the best block only with a given probability, or else the second block, or the third, etc. This allows them to generate lots of solutions and keep the best one.

When we tried to adapt Bos and Kleinjung's algorithm to our setting, we ran into multiple problems. First, we looked at the ratio between the number of doublings and the number of additions that they use to sort the blocks. The first problem was that we use triplings, so we would need to add those to the ratio. But also, we use both twisted Edwards and Montgomery curves, where the costs of addition and doubling are different, so it was not easy to find one single value to replace this ratio. We also observed that the score function does not always achieve its goal of favoring blocks with large factors: we were able to find examples where their score function favored blocks with smaller factors, or fewer factors. As with the ratio, we were not able to find a suitable replacement for the score function. We first tried to sort the blocks by arithmetic cost per bit, but that alone did not yield better results. Here, by arithmetic cost per bit we mean that we divide the arithmetic cost of a block, the number of multiplications and squarings needed to compute it, by the number of bits of the integer that the block represents.
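In Python, the metric looks like this; note that the per-operation costs below are placeholders of ours, since the actual counts depend on the curve model and coordinate system:

```python
# Assumed costs in field multiplications (squarings counted as multiplications).
COST = {"DBL": 7, "TPL": 11, "ADD": 8, "dADD": 6}  # illustrative values only

def cost_per_bit(block):
    """block = (n, ops): the integer the block represents and its operation counts."""
    n, ops = block
    arithmetic_cost = sum(COST[op] * cnt for op, cnt in ops.items())
    return arithmetic_cost / n.bit_length()

blocks = [
    (4479047, {"DBL": 11, "TPL": 7, "ADD": 2}),  # e.g. a double-base expansion
    (1048573, {"DBL": 20, "dADD": 19}),          # e.g. a Lucas chain (made-up counts)
]
for b in sorted(blocks, key=cost_per_bit):
    print(b[0], round(cost_per_bit(b), 2))
```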
So we wanted to do something more exhaustive, but a complete exhaustive search is totally out of reach, even for small b1 values. So we tried to reduce the enumeration in two ways. First, we shrink the enumeration depth by using an upper bound on the number of blocks in a solution set. Here we lose solutions, but we hope that the best solutions have a small number of blocks. Then, we found a way to reduce the enumeration width at each step, using the knowledge of an upper bound on the minimal cost; with this, we do not lose any solution. Note that an upper bound on the minimal cost can be found with any method, like Bos and Kleinjung's algorithm or plain double-and-add. Using this knowledge, at each step of the exhaustive enumeration we are able to compute an upper bound on the arithmetic cost per bit of any block that can be added to the current set if we want to obtain a solution with a smaller cost.

So now we can describe our algorithm. First, we sort the set of all generated blocks by increasing arithmetic cost per bit. Then we enumerate, depth first, all subsets of blocks of size less than a given bound. At each step of the enumeration, we use the upper bound on the arithmetic cost per bit to discard inadmissible blocks. The bound on the arithmetic cost of the best solution is updated throughout the algorithm, which means that as we obtain better solutions with smaller costs, we get a sharper bound on the arithmetic cost per bit at each enumeration step.
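Here is a condensed Python sketch of this pruned search, showing the structure only; it is our illustration, and the real search, the block data and the cost model are in our code:

```python
def combine(blocks, k, depth_limit):
    """blocks: list of (m, cost) pairs sorted by increasing cost per bit.
    Returns (cost, subset) for the cheapest subset whose product is exactly k,
    among subsets of at most depth_limit blocks."""
    def dfs(start, prod, cost, chosen, best):
        if prod == k:
            return (cost, chosen) if best is None or cost < best[0] else best
        if len(chosen) == depth_limit:
            return best
        remaining_bits = (k // prod).bit_length()
        for i in range(start, len(blocks)):
            m, c = blocks[i]
            # Pruning: blocks are sorted by cost per bit, so once a block's
            # cost per bit exceeds (budget left) / (bits left), no later
            # block can lead to a solution cheaper than the current best.
            if best is not None and c / m.bit_length() > (best[0] - cost) / remaining_bits:
                break
            if (k // prod) % m == 0:  # the block must divide the remaining cofactor
                best = dfs(i + 1, prod * m, cost + c, chosen + [m], best)
        return best
    return dfs(0, 1, 0, [], None)

# Tiny made-up instance: k = 210, blocks given as (integer, cost).
blocks = sorted([(35, 12), (21, 11), (10, 9), (6, 10), (210, 30)],
                key=lambda b: b[1] / b[0].bit_length())
print(combine(blocks, 210, 3))  # -> (20, [21, 10])
```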
Now let's look at the results we obtained with this algorithm. First, consider the best solution we found for b1 = 105, which is the smallest b1 value used by CADO-NFS during the cofactorization step. You can see that this solution uses both the twisted Edwards curve and the Montgomery curve: a double-base chain and double-base expansions with triplings on the twisted Edwards curve, and Lucas chains on the Montgomery curve.

Now let's compare our results with other implementations of ECM. We looked at the implementation of ECM in CADO-NFS, which uses Montgomery curves. Then we looked at other implementations of ECM using twisted Edwards curves, like EECM-MPFQ, the one from the "ECM at Work" paper, and an implementation of ECM on the Kalray many-core processor. We compared the number of multiplications needed to compute the scalar multiplication, what we call the arithmetic cost. For all the b1 values we looked at, we are always better by a few percent than all the other implementations of ECM. If we look at the graph of the arithmetic cost per bit of the solutions we found across many b1 values, we can see that with our method we always found a solution with an arithmetic cost per bit between 7.6 and 7.7, while all the other solutions have a higher arithmetic cost per bit.

We then implemented our method in CADO-NFS and reran some parts of the cofactorization step for RSA-200 and RSA-220. We observed that with our algorithm for the scalar multiplication in ECM, the time of the cofactorization step decreased by 5 to 10 percent, which corresponds to what we estimated from our theoretical results.

To conclude, in this talk we proposed a new and better implementation of ECM in the context of NFS cofactorization. Following the ideas of Dixon and Lenstra and of Bos and Kleinjung, we generated blocks of various types, like double-base chains, double-base expansions and Lucas chains, and we combined them using a new quasi-exhaustive approach for various b1 values. Our ECM implementation uses both twisted Edwards curves and Montgomery curves, uses a new combined addition-and-switch operation that takes two points on a twisted Edwards curve and computes their sum on the Montgomery model, and uses not only the NAF but also double-base expansions and chains, as well as PRAC-generated Lucas chains. All our results and our source code are available at the following address. Thank you for your attention.