Hello everyone. This is our work on a software implementation of Koblitz curves over quadratic fields. The main motivation for this work is to combine Koblitz curves, which allow efficient scalar multiplication through applications of the Frobenius map, with quadratic field arithmetic: the Frobenius map considerably accelerates the scalar multiplication computation, and the quadratic field arithmetic gives us opportunities to exploit the instruction-level parallelism available in current desktop architectures. The goal is to design a fast, 128-bit secure, constant-time variable-point scalar multiplication. The outline of this talk will be as follows. First I will give a brief introduction to Koblitz curves over F2. Then we will see some aspects of Koblitz curves over F4. And finally we will have some details about our implementation: base field arithmetic, quadratic field arithmetic, our scalar multiplication, and a summary with some results. Okay. So, Koblitz curves over F2. These curves, also called anomalous binary curves, are generally referred to as Koblitz curves because they were proposed by Neal Koblitz in 1991 for cryptographic use. This is the Weierstrass form of the curve; you choose between the parameters a = 0 and a = 1. Since their introduction they have been extensively studied, because they have a structure which allows us to substitute point doublings 2P, as you see, by the cheaper operation tau(P), where tau is the Frobenius map. And tau(P) is very fast, because we just have to perform two squarings in affine coordinates: we square x and we square y. That is very cheap compared to a point doubling, which requires multiplications as well as squarings. Okay. So why can we substitute point doublings by applications of the Frobenius map? We have the curve, and let us define mu = (-1)^(1-a).
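To illustrate why the squarings behind tau are so cheap: squaring over F2 is a linear map, (sum a_i x^i)^2 = sum a_i x^(2i), so it amounts to spreading the bits of the operand apart with zeros. Here is a minimal portable sketch; the function name and the 32-bit toy width are our own illustration, not the paper's code, and a real implementation would use vector or carry-less-multiply instructions.

```c
#include <stdint.h>

/* Square a GF(2) polynomial held in the low 32 bits of a.
 * Squaring over GF(2) is linear, so a(x)^2 = a(x^2): each bit i
 * simply moves to bit 2i (classic "bit spreading" / interleave-with-zero). */
uint64_t gf2_square_32(uint32_t a)
{
    uint64_t r = a;
    r = (r | (r << 16)) & 0x0000FFFF0000FFFFULL;
    r = (r | (r << 8))  & 0x00FF00FF00FF00FFULL;
    r = (r | (r << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    r = (r | (r << 2))  & 0x3333333333333333ULL;
    r = (r | (r << 1))  & 0x5555555555555555ULL;
    return r;
}
```

For example, (x + 1)^2 = x^2 + 1, so gf2_square_32(3) returns 5.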
So this Frobenius map can be seen as a complex number which satisfies tau^2 + 2 = mu*tau. As a result, we can multiply points in this group by elements of the polynomial ring Z[tau], that is, by integer combinations of powers of tau. In 2000, Jerome Solinas presented a method to represent the scalar k in this tau-adic form, a very efficient representation whose length l is approximately m + a. That way we could devise very fast methods for computing scalar multiplication on Koblitz curves. So, as a summary: Koblitz curves allow this substitution of point doublings by the Frobenius map, which results in efficient scalar multiplication algorithms, and they provide a rigid curve generation process. You only choose between a = 0 and a = 1; that is the only parameter choice. In 2000, several of these curves were standardized by NIST, providing different security levels. However, there are two problems with these curves. First, they are defined over prime-degree extension fields, so their arithmetic is somewhat costly on modern desktops. Second, if you want to design a 128-bit secure point multiplication, you must choose extension degree 277 or 283; their subgroup orders are larger than required for 128-bit security, while all the extensions below them do not have a subgroup order from which 128 bits of security can be derived. So we need more iterations in the main loop than necessary to implement this security level. Okay, so let us see Koblitz curves over F4. This is the form of the curve: we again have a = 0 or 1, and we choose a gamma in F4 which satisfies gamma^2 = gamma + 1. Now the Frobenius map is defined like this: instead of one squaring, we perform two consecutive squarings on each coordinate of an affine point. And if we let mu = (-1)^a, the Frobenius map can be seen as a complex number which satisfies tau^2 + 4 = mu*tau.
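As a sketch of the tau-adic representation just mentioned, here is a toy version of Solinas-style recoding for the F2 case, where tau^2 = mu*tau - 2 (over F4 the constant 2 becomes 4). Everything here, the names and the small 64-bit range, is our own illustration, not the paper's code; the second routine simply checks that the digits evaluate back to the scalar.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy Solinas-style tau-adic NAF: digits u_i in {0, +1, -1} with
 * k = sum u_i * tau^i, where tau satisfies tau^2 = mu*tau - 2 (mu = +/-1).
 * Input is the element r0 + r1*tau of Z[tau]; digits come out LSB-first. */
int tnaf(int64_t r0, int64_t r1, int mu, int8_t *digits, int max)
{
    int len = 0;
    while (r0 != 0 || r1 != 0) {
        int8_t u = 0;
        if (r0 & 1) {                       /* choose +1 or -1 to keep NAF form */
            u = (int8_t)(2 - (int)(((r0 - 2 * r1) % 4 + 4) % 4));
            r0 -= u;
        }
        if (len >= max) abort();
        digits[len++] = u;
        /* divide (r0 + r1*tau) by tau, using tau*conj(tau) = 2 */
        int64_t t = r0;
        r0 = r1 + mu * (t / 2);
        r1 = -(t / 2);
    }
    return len;
}

/* Horner evaluation of sum u_i * tau^i back into a + b*tau (self-check). */
void tnaf_eval(const int8_t *digits, int len, int mu, int64_t *a, int64_t *b)
{
    int64_t x = 0, y = 0;                   /* z = x + y*tau */
    for (int i = len - 1; i >= 0; i--) {
        int64_t nx = -2 * y + digits[i];    /* z <- tau*z + u_i, tau^2 = mu*tau - 2 */
        int64_t ny = x + mu * y;
        x = nx; y = ny;
    }
    *a = x; *b = y;
}
```

For k = 9 and mu = 1 this produces the digits 1, 0, 0, -1, 0, 1, that is, 9 = tau^5 - tau^3 + 1.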
So we can also represent the scalar in the polynomial ring Z[tau]. And the curve can have an almost-prime group order, of the form h*n, where the cofactor h is 4 or 6 depending on the parameter a, and n can be prime. So here are the different groups. We wanted to implement a 128-bit secure scalar multiplication on a 64-bit architecture, so our base field elements should fit in three 64-bit words, at most 192 bits, giving us a three-word multiplication. So we considered prime degrees m from 127 to 191, and we have these subgroup orders; the ones in red are feasible for implementing 128-bit secure scalar multiplications. We chose the group over F4^149, because it has a subgroup of 255 bits, so we have a suitable number of iterations to be done in the main loop of the scalar multiplication. The group order factorizes like this. Adapting Solinas' algorithm for representing the scalar over Z[tau] requires only minor changes for the F4 case. And window methods can be implemented by computing a regular recoding based on the work of Joye and Tunstall. For a given width w, we need to precompute 2^(2w-3) points, which is a big number of points, as we will see in the next slide. In the case of left-to-right scalar multiplication, we need to precompute these points here. This is not necessarily an optimal way of precomputing the points; we designed it by hand, and there may be a cheaper schedule. This is the precomputation cost: for width 2, one doubling plus one full addition; for width 3, one doubling, four additions, three mixed additions, and four applications of tau. And you can see that the cost increases very sharply with the width: for width 4, we have one doubling, 24 additions, 11 mixed additions, and five applications of tau. So we need to be careful when choosing the width.
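The regular recoding in the spirit of Joye and Tunstall can be sketched as follows: for an odd scalar it emits only odd, nonzero digits in a completely fixed pattern, which is what makes a constant-time windowed loop, with a table lookup at every window, possible. This is a hypothetical toy version over plain 64-bit integers, with our own naming, not the recoding from the paper.

```c
#include <stdint.h>

/* Regular w-bit recoding in the style of Joye-Tunstall: for an odd
 * scalar k, produce odd digits d_i in {+/-1, +/-3, ..., +/-(2^(w-1) - 1)}
 * such that k = sum d_i * 2^(i*(w-1)). Every digit is nonzero, so the
 * main loop has no secret-dependent branches. Returns the digit count. */
int regular_recode(uint64_t k, int w, int64_t *d)
{
    int i = 0;
    while (k >= (1ULL << (w - 1))) {
        /* take w bits, recenter to an odd digit */
        d[i] = (int64_t)(k & ((1ULL << w) - 1)) - (1LL << (w - 1));
        k = (k - (uint64_t)d[i]) >> (w - 1);   /* quotient stays odd */
        i++;
    }
    d[i++] = (int64_t)k;                       /* final digit, still odd */
    return i;
}
```

With w = 3, for example, the scalar 11 is recoded as the digits -1, 3, since -1 + 3*4 = 11.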
Okay, so as a summary, these curves combine the effectiveness of the Frobenius map with the parallelism opportunities given by quadratic fields. Also, we have a subgroup order of 255 bits, which is very nice for implementing this security level. But, as we saw, we have to be careful with the selection of the window width w, because the pre- and post-computation can be very, very costly. And we also have to be careful with the Frobenius map, because it is now more expensive: we need six squarings in projective coordinates. And we cannot implement those squarings with lookup tables, because in that case the implementation would be vulnerable to timing attacks. Okay, so let us see our implementation. Our code was designed for 64-bit platforms equipped with vector instructions, 128-bit vector instructions, and a 64-bit carry-less multiplier. The benchmarking was performed on a Haswell architecture, a 2.4 GHz machine, with Turbo Boost and Hyper-Threading disabled. We coded our library in C and assembly, and we compiled our code with different compilers for the sake of comparison, with these optimization flags. Okay. So first of all, let us see our base field arithmetic. We want to implement efficient field arithmetic, so the first thing to choose is the polynomial used to construct our base field F2^149. This polynomial should allow a fast modular reduction, because reduction is a very important operation in field arithmetic: after every multiplication and squaring we need to perform a modular reduction. However, there are no irreducible trinomials of degree 149 over F2. There are many irreducible pentanomials of this form, but these pentanomials make the modular reduction very costly. Why? Even in the best case, when all the parameters are favorable, we need about four XORs per step, where the number of steps is determined by the difference between m and a.
And in the worst case, we need 12 XORs and 16 shifts per step. We could not find any pentanomial with the property of being reducible with only four XORs per step; the ones we found all lie between these two costs. So, in short, it is very costly to perform the modular reduction with pentanomials. For that reason, we chose a strategy introduced by Brent and Zimmermann and by Doche, in 2003 and 2005, which is called redundant trinomials. What is the idea? We have to find a trinomial g(x), not itself irreducible, which has an irreducible factor f(x) of the desired degree m. So in our case, we have to find a trinomial g(x) with an irreducible factor of degree 149. We then construct our field with the polynomial f(x), but throughout our algorithms we reduce modulo g(x), the trinomial. In the case of elliptic curves, all the operations during the main loop reduce the point coordinates modulo g(x), and only at the end of the scalar multiplication do we return the point Q with coordinates reduced modulo f(x). So we have to perform the reduction modulo f(x) just once, at the end of the algorithm. Since we have a 64-bit carry-less multiplier, we searched for trinomials up to degree 192, which keeps us at a three-word multiplication. We found the trinomial g(x) = x^192 + x^19 + 1, which has a 69-term irreducible factor f(x) of degree 149. That f(x) is very costly to reduce by; we would spend too many clock cycles reducing modulo this polynomial, but we only use it at the very end, and throughout our algorithms we reduce modulo g(x) instead. And this g(x) has two great advantages: the difference between 192 and 19 is more than 128, which is the size of our registers, since we are using 128-bit vector instructions; and since 192 mod 64 = 0, we can reduce the number of shifts during the folding steps. At the end, as we said, we reduce modulo f(x), the 69-term polynomial.
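To see why g(x) = x^192 + x^19 + 1 is so pleasant, here is an illustrative scalar C sketch of reducing a 384-bit product modulo g(x); the naming is ours and the real code works on 128-bit vector registers. Because 192 mod 64 = 0 the folding is word-aligned, and the only bits that cross degree 192 again come from the shift by 19, a polynomial of degree at most 18 that folds in one extra step.

```c
#include <stdint.h>

/* Reduce a 384-bit polynomial product t[0..5] (t[0] holds the least
 * significant 64 bits) modulo g(x) = x^192 + x^19 + 1, writing the
 * result into three 64-bit words r[0..2]. Uses x^192 == x^19 + 1. */
void reduce_g(const uint64_t t[6], uint64_t r[3])
{
    /* fold the high half h(x) = t[3..5]: contribute h * 1 */
    r[0] = t[0] ^ t[3];
    r[1] = t[1] ^ t[4];
    r[2] = t[2] ^ t[5];
    /* contribute h * x^19 (cross-word shifts) */
    r[0] ^= t[3] << 19;
    r[1] ^= (t[4] << 19) | (t[3] >> 45);
    r[2] ^= (t[5] << 19) | (t[4] >> 45);
    /* bits that crossed degree 192 again: a polynomial of degree <= 18 */
    uint64_t s = t[5] >> 45;
    r[0] ^= s ^ (s << 19);   /* second fold fits entirely in one word */
}
```

For instance, x^192 reduces to x^19 + 1, and the whole routine is just a handful of shifts and XORs.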
Because f(x) has so many terms, it is better to perform this final reduction via multiply-and-xor instead of shift-and-xor. It costs about 7.3 multiplications, which is expensive, but we have to do it only twice, once for each coordinate of the final point Q = kP. So that is our base field arithmetic; let us move to our quadratic field. As usual, we construct the quadratic field using the trinomial u^2 + u + 1, so our quadratic field elements are represented with two terms, a constant term and a linear term. The constant term occupies three 64-bit words C, B, A, and the linear term another three 64-bit words C', B', and A'. And we have three 128-bit registers r0, r1, and r2 to store these 64-bit words. Okay, so how do we store them? Well, just to remind you, these are the terms of the quadratic field element. The usual way is to store the first two words of the constant term in the first register, then the last word of the constant term together with the first word of the linear term in the second register, and the rest in the third. However, with that layout, in the Karatsuba multiplication we have to multiply the two terms separately, and some words of the result element end up split across registers, so one step of the reduction looks like this, with shifts between the registers. It is a very complicated reduction, which takes about 24 shifts and 20 XORs, because we need two such steps and we have two terms to reduce. Instead, we store the element in a method called interleaving: we store each word separately, with the first word of the constant term here and the first word of the linear term next to it, and so on; we interleave the linear and the constant terms of the quadratic field element. Because of that, the two terms travel together through the quadratic field arithmetic, and we can reduce them as a group.
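The quadratic field multiplication itself is a Karatsuba product in F[u]/(u^2 + u + 1): using u^2 = u + 1, the product (a0 + a1*u)(b0 + b1*u) has constant term a0*b0 + a1*b1 and linear term (a0 + a1)(b0 + b1) + a0*b0, so only three base-field multiplications are needed. Here is a toy C sketch over 32-bit polynomials, with our own naming; clmul32 stands in for the hardware carry-less multiplier, and we skip the base-field reduction.

```c
#include <stdint.h>

/* Carry-less multiply of two GF(2) polynomials of degree < 32
 * (a stand-in for PCLMULQDQ; the 64-bit result is left unreduced). */
uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            r ^= (uint64_t)a << i;
    return r;
}

/* Multiply (a0 + a1*u)(b0 + b1*u) in F[u]/(u^2 + u + 1) by Karatsuba:
 * with u^2 = u + 1,
 *   c0 = a0*b0 ^ a1*b1
 *   c1 = (a0 ^ a1)*(b0 ^ b1) ^ a0*b0
 * so three base-field multiplications suffice instead of four. */
void mul_fq(uint32_t a0, uint32_t a1, uint32_t b0, uint32_t b1,
            uint64_t *c0, uint64_t *c1)
{
    uint64_t m0 = clmul32(a0, b0);
    uint64_t m1 = clmul32(a1, b1);
    uint64_t m2 = clmul32(a0 ^ a1, b0 ^ b1);
    *c0 = m0 ^ m1;
    *c1 = m2 ^ m0;
}
```

As a sanity check, u * u gives c0 = c1 = 1, that is, u^2 = u + 1.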
So the multiplication now has no more shifts between registers, because we can treat the operands as 128-bit quantities. We now need three steps, but they reduce the two terms of the quadratic field element simultaneously, with only nine shifts and nine XORs; and with some optimization techniques we can bring this down to six shifts and nine XORs, like this. This layout also has a very good advantage in that it allows savings in the precomputation phase of the Karatsuba algorithm. The drawback is that we need to reorganize the registers after multiplication and squaring, but this is very simple; it takes about four clock cycles, which is small compared to the savings in the modular reduction, which now needs just six shifts and nine XORs to reduce the whole quadratic field element. Okay, so these are the field timings: with GCC, multiplication is about 52 cycles and squaring about 20. The inversion is very expensive, even though we applied different strategies to reduce its cost. Inversion should be avoided more and more in binary field arithmetic, because it is becoming very, very expensive in software. These are the point arithmetic timings: addition, doubling, and the tau endomorphism. You can see that, since we are using a left-to-right approach, we perform the tau endomorphism on three coordinates, and it is still much faster than a doubling, costing less than a third of one. So it is still very efficient to perform applications of the Frobenius map on Koblitz curves over F4. So we implemented, on this curve, constant-time left-to-right and right-to-left tau-and-add scalar multiplications, and because of the number of points to be precomputed, we restricted our window width to 2, 3, and 4. So these are our timings. This is the regular recoding based on Joye and Tunstall.
It takes about 2,000 clock cycles. The linear passes over the table of points, which must be done to avoid cache attacks, increase with the width, because there are more points. And these are the right-to-left and left-to-right multiplications. Right-to-left is more costly, because it needs two linear passes per iteration, and its post-computation is more expensive than the pre-computation of the left-to-right version. So these two strategies were crucial for achieving our scalar multiplication timings. And we have a nice subgroup size, which allows us to compute a 128-bit secure scalar multiplication with a near-optimal number of iterations, compared to Koblitz curves over F2. The Frobenius map, besides being more expensive, is still efficient, costing less than a third of a point doubling, so we can still substitute point doublings by tau in this case. However, the drawback is the number of points generated by the regular recoding in the F4 case. We need too many, the linear passes become costly, and you can see that with w = 4 almost 40% of our scalar multiplication is spent precomputing and postcomputing points, so we needed to restrict ourselves to w = 3. So these are our results against the state of the art, with w = 3. Our implementation is the fastest one on Koblitz curves: we surpassed the previous state of the art by almost 30%, and it is very competitive with other 128-bit secure multiplications on binary and prime curves. On Skylake, our timing was about 52,000 clock cycles. So that's it. Thank you very much. Any questions?