Okay. Thank you. It is my pleasure to be here and talk about our work on this high-performance elliptic curve that we call FourQ, with an extra focus on the analysis of this curve in high-performance applications and in applications such as IoT scenarios, where you care about low power or low energy consumption. Following the vein of what Michael Naehrig mentioned, I want to highlight how awesome ECC continues to be, and we will show a few reasons here in the presentation. But before going to the main part of the presentation, let me give some context about our work. Probably most of you are aware that there was a kind of competition to select new elliptic curves. The IETF, through the CFRG, finally selected two elliptic curves for standardization: Dan Bernstein's Curve25519 and Mike Hamburg's Goldilocks curve. There is already an RFC describing the curves, giving the generation details, and then focusing on cryptographic schemes; in the case of the RFC that I show here, it specifies how to do Diffie-Hellman key exchange for both curves. Ongoing work is focusing on a signature scheme, specifically targeting EdDSA. Now, in the CFRG discussion there was a focus on which requirements should be important at the moment of selecting the curves. This is a quote that I've taken from the NIST ECC workshop: "the real motivation for the work in CFRG is the better performance and side-channel resistance of the new curves developed by academic cryptographers over the last decade." So there was a real emphasis on performance and side-channel resistance at the moment of selecting these curves. I can add other requirements that were important in the selection; I'll mention two that I think are relevant: rigidity in the curve generation, and support for existing cryptographic algorithms, such as ECDH, especially the ephemeral case, and signature schemes.
So, given this context and with this motivation in mind, Craig Costello and I sat down and said: okay, how fast can we go, how can we design a curve that is really very efficient and at the same time secure? We began to put together the best of ECC at the moment, and we call what we came up with FourQ. It really takes the best out of the literature. It combines, in an optimal way, two endomorphisms; this is a line of work that began with the seminal paper by Gallant, Lambert, and Vanstone, and continued until very recently with work by Ben Smith and by Guillevic and Ionica. We also combine the use of the twisted Edwards form with efficient extended twisted Edwards coordinates, and finally, arithmetic over the very compact Mersenne prime 2^127 − 1. Now, after implementing and assessing the security and performance of the curve, we observed that it really supports side-channel-secure implementations, as we wanted, and also achieves top performance, another very important requirement. Another very nice feature of this curve is its uniqueness: this is actually the only curve at, or close to, the 128-bit security level with the properties shown above, which gives this curve a very special rigidity characteristic, let's say. Now, let's focus for a moment on the performance side. What are we talking about here? We analyzed the performance that we could achieve with FourQ on different kinds of platforms. In this case, for example, I show the speed in thousands of cycles to compute the main, standard operation in ECC, called variable-base scalar multiplication, and I show the numbers we obtain for different computer classes: an Intel desktop class, a smartphone class such as an ARM Cortex-A15 processor, and a microcontroller class such as a Cortex-M4 microcontroller.
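To illustrate why the Mersenne prime 2^127 − 1 mentioned above is so compact and convenient: reduction modulo a Mersenne prime needs only shifts, masks, and additions, never a division. The sketch below is a toy model in Python of that folding trick, not FourQlib's optimized GF(p^2) field code.

```python
# Why p = 2^127 - 1 is attractive: since 2^127 = 1 (mod p), the high bits
# of a product can simply be folded back onto the low bits.
# Illustrative model only, not FourQlib's optimized field arithmetic.

P127 = 2**127 - 1

def mersenne_reduce(x):
    """Reduce a non-negative integer modulo 2^127 - 1 with shifts and adds."""
    while x >> 127:
        x = (x & P127) + (x >> 127)   # fold high part onto low part
    return 0 if x == P127 else x

def fpmul(a, b):
    # Field multiplication in GF(p): schoolbook product, then Mersenne fold.
    return mersenne_reduce(a * b)

# Quick check against plain modular arithmetic.
a, b = 2**126 + 12345, 2**125 + 67890
assert fpmul(a, b) == (a * b) % P127
```

The same fold works for any Mersenne prime; the loop runs at most a couple of times because each fold shrinks the operand to just over 127 bits.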
And you can see that, compared with Curve25519, we observe a significant speedup, well above 2x and even close to 3x in some cases. So, from a certain perspective, we can think of FourQ as the roadrunner of elliptic curves, or something like that. So this is in terms of performance, and we analyzed how the performance looked on very different platforms. The next step was: well, let's look at what the CFRG was doing. They obviously focused on making the curves work in basic cryptographic schemes, specifically ECDH and signature schemes, and we will focus on those at some point in the presentation. First of all, let me give you some basics about FourQ, what it is, specifically. This is a curve defined over a quadratic extension field whose characteristic is the large prime p = 2^127 − 1, the same prime that I mentioned at the beginning. We have these specific parameters, and the most efficient representation of this curve is the twisted Edwards form, which is shown here. The cardinality of the group of points on this curve is actually 392 times a very large number, in this case a 246-bit prime. So you can see that there is a cofactor that is, let's say, relatively large, and we have to deal with it. Other important facts: it supports the fastest addition laws for elliptic curves that are also complete, which means they allow very secure side-channel-protected implementations. And importantly, it comes equipped with two endomorphisms. Just to give you a rough idea, an endomorphism can be thought of as a shortcut in the scalar multiplication: it maps a point on the curve to another point, let's say, at the middle or at a quarter of the whole computation. So it's like a shortcut to obtain values on the curve and save a number of point operations.
So this curve has two endomorphisms, used optimally, and what you can do with them is basically transform your original classical scalar multiplication, expressed in this case as m times P, into a multi-scalar multiplication with mini-scalars a_1 to a_4. Now, we describe the decomposition method in the main paper that we are publishing at Asiacrypt 2015. It is optimal in the sense that you can input any 256-bit scalar m and decompose it into exactly four mini-scalars a_i with a maximum size of 64 bits each. So, something like the following: suppose we have a scalar m; with the decomposition method, you obtain these four mini-scalars, which are at most 64 bits. Now, each of the mini-scalars will be attached either to the base point or to a value obtained by applying the endomorphism maps. So let's take a look at how the scalar multiplication works in this case. Before proceeding, because we want to do it efficiently and in a side-channel-resistant way, we have to recode the mini-scalars. These are the values, taken in binary form, from the previous mini-scalars, and we perform two steps to convert them to the signed representation below. And if you notice, every column is now non-zero, because in the top row all of the values are digits that are either one or minus one. So if you look at it from a column perspective, you can construct digit values from one to eight, each accompanied by a sign, and the sign is determined by the digit in the top row, that is, the first mini-scalar. Now, using this representation after decomposing and recoding, you can proceed with the scalar multiplication.
What you have to do first is construct a table with eight values, combining all the possible sums of the base point and its mappings under the endomorphisms; each digit will then point to an entry in the table. So, doing the computation from left to right: first you take entry six and load it, and then you perform exactly 64 doubling-and-addition (or subtraction, depending on the sign) operations, each time doubling and then adding or subtracting one value from the table, with the entry indicated by the digit, of course. This is exactly 64 iterations, which means a very regular execution containing exactly 64 doublings and 64 additions, and that facilitates protection against timing and simple side-channel attacks. Also, the table contains only eight points; naively it would contain 16, but after recoding we can make it more efficient, only eight points in this case. Okay, so that is a very quick overview of the kind of computation that we need for the high-speed version of the scalar multiplication using FourQ. Now, what about doing some basic stuff in the cryptographic sense? We recently wrote an Internet-Draft in collaboration with Watson Ladd and Richard Barnes, and it's available at this link; the title of the draft is Curve4Q. The current version describes the compressed version of cofactor ECDH, meaning that the public keys are compressed and are 32 bytes in size. Inside the document we describe two implementations of the scalar multiplication: the one that I described, the high-speed version using endomorphisms, but also a naive version without endomorphisms that could be suitable for memory-constrained applications, for example.
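The regular evaluation loop just described can be sketched abstractly. In the toy below, "points" are integers modulo a stand-in modulus and the recoded digit columns arrive as (sign, table index) pairs; the point is the shape of the loop, one doubling plus exactly one signed table addition per step (64 of each for FourQ), never its arithmetic.

```python
# Sketch of the regular double-and-add loop: after decomposition and
# recoding, every step performs exactly one doubling followed by one signed
# table addition, so the operation sequence never depends on the scalar.
# Toy model: integers mod N stand in for curve points.

N = 1009  # toy stand-in for the group order

def regular_mul(columns, table):
    """columns: most-significant first; each entry is (sign, table_index)."""
    sign, idx = columns[0]
    Q = sign * table[idx] % N            # load the first table entry
    for sign, idx in columns[1:]:
        Q = 2 * Q % N                    # one doubling per step...
        Q = (Q + sign * table[idx]) % N  # ...and exactly one signed addition
    return Q

# 8-entry toy table and four digit columns: computes 8*T6 - 4*T3 + 2*T0 + T5.
table = [3, 5, 7, 11, 13, 17, 19, 23]
assert regular_mul([(1, 6), (-1, 3), (1, 0), (1, 5)], table) == 131
```

The fixed doubling/addition count per iteration is what gives the timing-attack resistance mentioned above; combined with a constant-time table lookup, the whole scalar multiplication runs in the same sequence of operations for every scalar.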
Now let's look very roughly at the basic cofactor ECDH; for completeness, let me go through it very quickly. In this scheme you have Alice and Bob, who want to communicate. They establish secrets a and b, compute the scalar multiplication with a generator, and then proceed to compress the values. That means that if you have a point on the curve with x and y coordinates, the compressed form is basically the y-coordinate plus a sign bit derived from the x-coordinate. When you receive the value, you have to decompress it and perform a validation to make sure the point is on the curve. Now, here is the cofactor part: you have to compute 392 times the validated value, and this computation actually consists of eight doublings and two additions; then we finish the computation to get to the supposedly shared secret. On the other side you perform roughly the same operations and hopefully arrive at the same shared value. Now, as I said, these are compressed public keys, 32 bytes long, similar for example to Curve25519, but you can also proceed with an uncompressed version that uses 64 bytes and is slightly faster and more power-efficient, so it could be more suitable for certain applications; it's just a straightforward implementation of cofactor ECDH without the compression part. The next part is that we wanted to do something about signatures, so we came up with a specification of a high-speed, high-security signature scheme that we call SchnorrQ, which is basically a Schnorr-type signature scheme closely following the EdDSA specification, but in this case using FourQ, so we applied some changes in some places. As in the case of EdDSA we have two versions, one using pre-hashing, which makes it efficient to support a single-pass interface for signing.
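The "eight doublings and two additions" for the cofactor step follow directly from the binary expansion of 392 = 0b110001000. A minimal sketch, with integers modulo a toy modulus standing in for curve points and the group operations passed in abstractly:

```python
# Sketch of the cofactor step [392]*P: a left-to-right double-and-add over
# the bits of 392 (0b110001000) costs exactly 8 doublings and 2 additions,
# matching the count mentioned in the talk. Toy group model only.

N = 10007  # toy modulus, not FourQ's group order

def times_392(P, dbl, add):
    """Compute [392]P with an explicit double-and-add schedule."""
    bits = bin(392)[2:]              # '110001000'
    Q = P                            # leading 1 bit
    doublings = additions = 0
    for b in bits[1:]:
        Q = dbl(Q); doublings += 1
        if b == '1':
            Q = add(Q, P); additions += 1
    return Q, doublings, additions

Q, d, a = times_392(5, lambda x: 2 * x % N, lambda x, y: (x + y) % N)
assert Q == 392 * 5 % N              # 1960
assert (d, a) == (8, 2)              # eight doublings, two additions
```

Multiplying by the cofactor like this kills the order-392 component of any validated point, so both parties land in the large prime-order subgroup before deriving the shared secret.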
During signing you have to load the message at least twice, and that can be inefficient if the message is too long, so applying pre-hashing can make things more efficient in certain scenarios. There is also the version without pre-hashing, and in this case we have resilience against hash-function collisions, of course, for the version without pre-hashing. Now, this is a Schnorr-type signature scheme, so you have deterministic generation: you don't need fresh randomness per signature, just as in EdDSA. And as in the case of EdDSA, we have small signatures, 64 bytes in size, and small public keys, 32 bytes in size. But the relevant feature here is that we have, at least to my knowledge, a very efficient, probably the fastest, Schnorr-based signature scheme at the 128-bit security level. To show you the kind of performance that we achieve: these are numbers for an Intel Haswell processor. Signing takes 39,000 cycles, compared against 61,000 cycles for Ed25519; verification is even much faster, 64,000 cycles compared to 185,000 cycles for Ed25519. So, very high speed, which can be suitable for high-performance applications. And here is the link with the specification, for those of you who are curious about the complete details of SchnorrQ. Now we are putting these schemes together: we implemented them, and we are preparing the release of version 3.0 of our library, which is called FourQlib.
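The EdDSA-style flow that SchnorrQ follows, with a deterministic per-message nonce, a signature (R, s), and a single short public key, can be sketched generically. The toy below works in a small prime-order subgroup of the integers modulo a safe prime instead of FourQ, and the hash and encoding choices are purely illustrative; it shows the structure, not the real scheme.

```python
import hashlib

# Toy sketch of a deterministic Schnorr-type signature in the EdDSA/SchnorrQ
# style: the nonce r is derived from the secret key and the message, so no
# per-signature randomness is needed. The tiny group below (order-509
# subgroup mod 1019, generator 4) and the encodings are illustrative
# stand-ins for FourQ; real SchnorrQ works on the curve with 512-bit hashes.

p, q, g = 1019, 509, 4           # safe-prime group: p = 2q + 1, g has order q

def H(*parts):
    h = hashlib.sha512(b"|".join(str(x).encode() for x in parts))
    return int.from_bytes(h.digest(), "little")

def keygen(seed):
    a = H("key", seed) % q       # secret scalar
    return a, pow(g, a, p)       # (private a, public A = g^a)

def sign(a, A, msg):
    r = H("nonce", a, msg) % q   # deterministic per-message nonce
    R = pow(g, r, p)
    s = (r + H(R, A, msg) * a) % q
    return R, s

def verify(A, msg, sig):
    R, s = sig
    return pow(g, s, p) == R * pow(A, H(R, A, msg), p) % p

a, A = keygen("demo")
sig = sign(a, A, "hello FourQ")
assert verify(A, "hello FourQ", sig)
assert sign(a, A, "hello FourQ") == sig   # deterministic: same message, same signature
```

The verification equation g^s = R * A^H(R, A, m) is exactly the Schnorr relation; deriving r from (key, message) is what removes the need for a random-number generator at signing time, which matters on the constrained devices discussed later.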
The current version is 2.0, but we are releasing version 3.0, including cofactor ECDH and the SchnorrQ digital signatures as I described them. FourQlib already includes a bunch of implementations, but version 3.0 will include many more: for example, we have a portable C implementation, an optimized implementation for 64-bit platforms, an optimized implementation for 32-bit platforms, an optimized implementation for ARM using NEON, and also implementations for some 32-bit ARM microcontrollers such as the Cortex-M4. All of the crypto operations in the library are protected against timing attacks, cache attacks, exception attacks, invalid-curve attacks, and small-subgroup attacks. Now, for the next slides, let me focus on these implementations. We have been doing some work on small embedded devices, such as 8-bit, 16-bit, and 32-bit microcontrollers, trying to assess what kind of performance, memory, and energy numbers you can achieve with FourQ. This is joint work, by the way, with Zhe Liu, Geovandro Pereira, and Hwajeong Seo. We basically ported and specialized the library to various microcontrollers; here is a very quick summary of the kind of performance that we are achieving. These are preliminary results I'm showing for an 8-bit AVR ATmega microcontroller and also for a Cortex-M4. You can see, comparing against Curve25519, a significant speedup with respect to that curve, and I think we can still go further in the case of AVR: I think this is still a lower bound. So now let's focus on the AVR implementation. We want to assess efficiency not only in terms of performance but also in terms of energy consumption, which can be relevant for IoT applications that care a lot about latency response but also about energy consumption. And here I have preliminary results showing the computation time in seconds on AVR microcontrollers for crypto operations; I'm showing here ECDH and also signatures.
Comparing here: the numbers are for NIST P-256, shown in green, Curve25519 and Ed25519 in gray, and FourQ and SchnorrQ in blue. And you can see that in each case we achieve much better performance; these are in seconds on this AVR microcontroller. There are more details about the implementations being compared, but one thing that I want to point out is, for example, the very large speedup in the case of key generation. That's because the state-of-the-art implementations of those other curves don't support fixed-base scalar multiplication, so this shows how much improvement you can get if you exploit precomputed tables, as we do. In general, this also shows that FourQ is the only option that computes most of the operations in less than one second, and that kind of latency can be important in some applications. As you can see, only verification is slightly above one second; all the other cases are below it, and that can be an important factor in certain applications. Now let's look more closely at the case of static and ephemeral ECDH. Again, NIST P-256 results are in green, Curve25519 in gray, and FourQ in blue. By the way, the "C" means compressed public keys and the "U" is the uncompressed version that I described in the previous slides. Again, you can see the performance difference. These are estimated energy consumptions in millijoules on the AVR, and this analysis corresponds to a popular wireless sensor node called MICAz running at 7.37 MHz. So again you can see the significant improvement in terms of energy, especially notable for ephemeral ECDH, and that, again, is because most implementations of, say, Curve25519 don't exploit precomputation for ECDH. So for truly ephemeral computations our library performs much better, and that can be important, again, for low-energy applications. Now, our implementation, as I'm showing, prioritizes speed and also low power consumption.
So there is a trade-off, of course: we have higher memory consumption, and this is shown as an example here. Our variable-base scalar multiplication requires almost twice the code size compared with the Curve25519 implementation. So, depending on the application, you can balance and decide what is best: in the case that speed and latency are important, or low energy is more important, FourQ gives a very important edge, but you have to balance depending on the application. On the other hand, FourQ has very rich arithmetic, so if memory consumption is more important for your project, you can even implement FourQ using a Montgomery ladder, as Curve25519 does, and have a very compact implementation that is still faster and more power-efficient. So that's another advantage of FourQ: the kind of flexibility that you can get out of it. Now let me finish this presentation by showing our work in progress. We are working on plugging FourQ into OpenSSL; this is joint work with Billy Brumley and Nicola Tuveri. We have been working on this integration with the current version of FourQlib. Actually, that integration is complete: version 2.0 works with version 1.1.0 of OpenSSL, so there is wide support, and any ECC protocol that is out there can right away use FourQ; we will be releasing the patch for OpenSSL soon. Now, one downside of that approach of supporting all the available protocols is that we have to use some of the slow parts of OpenSSL, and in particular it hurts to use some of its multi-precision operations, as I will show in the last two slides. We have more work in progress to address these performance issues: we plan to have an additional option using an external engine to provide FourQ, which should solve most of the performance-degradation issues, and we are also integrating SchnorrQ for signatures.
Let me show you the preliminary results in OpenSSL, in this case on a 64-bit Intel Skylake processor, to show the kind of performance that we achieve when we compare NIST P-256 and Curve25519 against the FourQ implementation. Here we show the cases of static ECDH and also ECDSA, though not ECDSA for Curve25519: as far as I know, that curve has not been plugged into ECDSA, so I cannot show numbers for it. In any case, when you compare the performance of FourQ against the numbers for the other two curves, there is a significant improvement in the operations per second shown here, especially in the green values that correspond to ECDH. There is also a significant speedup in the case of ECDSA verification. We don't observe that much of a speedup in the case of signing, and that is related to the OpenSSL performance issues, which will become obvious in the next slide. Now, I have a comment here: at least the implementation of Curve25519 in OpenSSL that we tested was kind of slow. Nicola Tuveri is actually plugging in a new engine for that curve based on donna-c64, and that performs much better.
Now here is the breakdown of the timings on the same Skylake processor, now focused on P-256 and FourQ for ECDSA and ECDH, to show which kinds of operations are causing performance issues. As you can see, the numbers in blue, which correspond to the scalar multiplication, pretty much make sense: in every case FourQ is much faster, as it should be. But you can see these blocks in pink and orange: they are pretty expensive, and they correspond to modular inversions. Modular inversions shouldn't be that expensive, but in OpenSSL they are, so when we use their functions we really take a hit here. That's one of the issues we are solving by using the external engine and our own functions, so pretty soon we will see most of that overhead in FourQ going away and a much better improvement ratio. With this I'm finishing. If anybody is interested in additional information about our project related to FourQ: there is the paper, the link to the library, the RFC draft, a reference implementation in Python that comes together with the RFC draft posted on GitHub, SchnorrQ, and so on, and all the stuff that is coming soon and will be released. If you want to participate in this project, there is a bunch of stuff that can be done, so talk to us, send us an email. And that's all that I wanted to say. Thank you.

Q: Thanks for your talk. Early on you showed the trick with the endomorphisms, and you had that table of eight precomputed points. How did you protect that table of eight against cache-timing attacks?

A: Yes. As in all other ECC implementations, when you have multiple points that you have to access, you have to do it in a constant-time way: basically, you do a linear pass through all the points and use logical operations to access each point and take the one that you want.
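The linear pass with logical operations described in that answer can be sketched as follows; this is an illustrative constant-flow selection in Python, with tuples of integers standing in for point coordinates, whereas real implementations do this on fixed-width words in C or assembly.

```python
# Sketch of a constant-time table lookup: scan every entry and combine with
# masks, so the memory-access pattern and the work done are independent of
# the secret index. Tuples of integers model point coordinates here; real
# code operates on fixed-width words in C or assembly.

def ct_select(table, secret_index):
    acc = [0] * len(table[0])
    for i, entry in enumerate(table):
        d = i ^ secret_index
        nonzero = ((d | -d) >> 63) & 1   # 1 unless i == secret_index
        m = -(nonzero ^ 1)               # all-ones mask only for the match
        for j, coord in enumerate(entry):
            acc[j] |= coord & m          # same work for every entry
    return tuple(acc)

# Toy 8-entry table of "points" (x, y).
table = [(i + 1, 10 * (i + 1)) for i in range(8)]
assert ct_select(table, 5) == (6, 60)
assert all(ct_select(table, k) == table[k] for k in range(8))
```

Every entry is read and masked on every lookup, so a cache-timing observer sees the same access pattern regardless of which of the eight points is actually selected.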
Q: Interesting. So is that actually faster? Because the eight points there are just sums of three different points. Is that actually faster than computing them on the fly?

A: Yes, it is much faster.

Q: You mentioned you have a faster engine for modular inversion for OpenSSL. Users don't really like using engines in OpenSSL; it's a pain. Is it possible to get that contributed and integrated directly?

A: Well, I'm not the best one to answer that question.

Q: Are you going to make it available?

A: Yes, we are going to make it available. I passed too quickly over that, but the release of the patches to OpenSSL is coming soon; the idea is to release them publicly and openly.

Q: All right, but pull requests are welcome if you have a faster modular inversion; we're on GitHub.

A: Okay, great.

Q: You showed some nice speedups over Curve25519. Ben Smith provided some nice speedups with the Kummer surface. Any comparison to this hyperelliptic curve?

A: Yes. If you go to the papers, you will see comparisons with other curves available, including genus-2 work like the Kummer surface, and in every case, with a few exceptions, we still see a significant speedup in comparison with the Kummer; but the Kummer is also an efficient alternative.

Q: Can I just ask about patents? I know the GLV method was patented, or is patented; I'm not sure of my facts, but perhaps you could say: are there any patent issues that arise?
A: I cannot comment on patents. The only thing that I can say from our side is that all our software is released under the MIT license, completely open source, and we are not planning to apply, and have not applied, for any patents on any of the work that you see. If there are companies with patents, I cannot comment on that. What I will say is that our library supports both kinds of implementations, with and without endomorphisms; if there is any kind of concern, you can still use the version without endomorphisms, get a nice speedup, and that's an alternative that we also provide. Thank you.

Moderator: Let's thank Patrick again.