As Junghän said, I will talk about our implementation of the new FourQ elliptic curve on an FPGA. This is joint work between four people from four different institutions: myself from Aalto University in Finland; Andrea Miele, now at Intel but at EPFL at the time of this work; Reza Azarderakhsh from Rochester; and finally Patrick Longa from Microsoft Research. So what is FourQ? It is a new elliptic curve that offers very high performance, especially in software — so far this has been shown only in software. It has been shown to be two to three times faster than Curve25519, for example, and this has been shown to hold on many different processor platforms. But the speedup comes from employing four-dimensional scalar decompositions, which require extensive pre-computations and may turn into complex control logic when implementing the operations in hardware. For this reason, it was not really clear how suitable this curve is for efficient hardware implementation, and that is the question we are trying to answer in this work. Okay, so FourQ was introduced by Craig Costello and Patrick Longa at Asiacrypt last year, so it is a just over a year old curve. It is a twisted Edwards curve whose group has cardinality 392 times a 246-bit prime, so it offers a security level of over 120 bits. It is a very nice curve in many respects. First of all, it is defined over a quadratic extension of the finite field with the Mersenne prime p = 2^127 − 1, which gives us very efficient reductions. Then we can use the complete addition formulas of Hisil et al., which lead to efficient point arithmetic. And it has two efficiently computable endomorphisms, ψ and φ. If we compute a four-dimensional decomposition of the 256-bit scalar, we can compute the scalar multiplication m·P as a combination of four smaller scalar multiplications with different points mapped through the efficiently computable endomorphisms: m·P = a₁·P + a₂·φ(P) + a₃·ψ(P) + a₄·ψ(φ(P)).
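The very efficient reductions mentioned here come from the Mersenne shape of the prime: for p = 2^127 − 1 we have 2^127 ≡ 1 (mod p), so the high half of a product can simply be folded back onto the low half. A minimal Python sketch of this idea (function names are mine, not from the talk):

```python
P = 2**127 - 1  # the Mersenne prime used by FourQ

def reduce_mersenne(x):
    """Reduce x (< 2^254) modulo p = 2^127 - 1 by folding the
    high bits onto the low bits, since 2^127 ≡ 1 (mod p)."""
    x = (x & P) + (x >> 127)   # first fold: result < 2^128
    x = (x & P) + (x >> 127)   # second fold: result <= p + 1
    if x >= P:                 # absorb the final carry
        x -= P
    return x

def mul_fp(a, b):
    """Base-field multiplication in GF(p) with fold-based reduction."""
    return reduce_mersenne(a * b)
```

This is only a software model of the reduction; the hardware version described later performs the fold as one wide addition plus carry absorption.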
And these a₁ to a₄ are only 64-bit numbers, so these smaller scalar multiplications are fast to compute, especially if we combine them, as I show later in the talk. So we implement scalar multiplication only, using the scalar multiplication algorithm shown on the left here, and now I will go step by step through the operations required in this algorithm. First we decompose and recode the scalar into a multi-scalar, as I explained, of four values a₁ to a₄. In addition to having four short values, we recode them into a sign-aligned form, so that the scalar is represented as a matrix of four rows and 65 columns. The first row contains only plus and minus ones, and the following rows are sign-aligned with the first one: they contain only zeros and ones where the first row has the value one, and only zeros and minus ones where the first row has the value minus one. This sign-alignment allows us to recode the multi-scalar so that we have a sign, which is the first row, and a point index from zero to seven, which selects the particular point we need from the pre-computation, which I will explain next. So we pre-compute all eight different points with different endomorphisms and store them with five coordinates in the memory. Why five coordinates? Because the point addition formula we use, shown at the bottom, needs four of these five coordinates. If we compute a point addition, we fetch four values in one order; but if we want to subtract the same point, we read the first two coordinates in the opposite order, then the same third coordinate, and then the negative of the last value, which is also pre-computed. So we do not have to compute anything anymore; we just read the coordinates in the correct order.
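The coordinate-reordering trick can be sketched as follows. I am assuming here the usual extended-coordinate table layout where each point is stored as (X+Y, Y−X, 2Z, 2dT) plus the pre-negated −2dT as the fifth value; the names and layout are my own simplification of what the talk describes:

```python
def fetch_table_point(entry, sign):
    """Return the four coordinates consumed by the unified
    addition formula. entry = (x_plus_y, y_minus_x, z2, t2d, t2d_neg),
    where t2d_neg = -t2d is pre-computed and stored as the 5th value.
    sign = +1 selects addition, sign = -1 selects subtraction of the
    same point; only the read order changes, no arithmetic is done."""
    x_plus_y, y_minus_x, z2, t2d, t2d_neg = entry
    if sign == +1:
        return (x_plus_y, y_minus_x, z2, t2d)
    # subtraction: swap the first two reads, keep z2,
    # and read the stored negation instead of t2d
    return (y_minus_x, x_plus_y, z2, t2d_neg)
```

The point is that addition and subtraction become literally the same hardware operation, only with different RAM read addresses, which is what makes the main loop fully regular.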
But this pre-computation requires quite a lot of operations: 16 multiplications, 27 squarings, and several additions. So this is a non-negligible computation that we have to do before we can proceed to the main for loop. This for loop is fully regular and constant-time: we always compute 64 point doublings, each followed by either a point addition or a point subtraction, but because of this reading of the pre-computed point, we always compute exactly the same operation, just with different values, regardless of whether we are adding or subtracting a point. Here we use Hisil et al.'s formulas, as I explained earlier. From a hardware point of view, this algorithm yields quite a natural division of operations into two different units: the scalar decomposition and recoding are done in a scalar recoding unit, and everything else goes into the field arithmetic unit, which is highly optimized for this particular Mersenne prime arithmetic. So first about the scalar unit. The decomposition is mainly multiplications by constants, which we perform with a truncated multiplier built around a 17 × 264-bit multiplier, implemented using 11 of the DSP blocks available in Xilinx FPGAs. This gives us the four smaller scalars, and then we recode those scalars using only simple bit manipulations and 64-bit additions, which is easy compared to the decomposition part of the scalar manipulations. This unit outputs the recoded values starting from m₀ and v₀, but the scalar multiplication algorithm starts from the other end, so we first store these values in a last-in-first-out buffer, which then outputs them in reversed order. Then the main component is the field arithmetic unit, which has this top-level architecture, with a dual-port RAM of 127-bit width.
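The LIFO buffer between the recoder and the main loop is just a stack: the recoder pushes (sign, index) pairs in the order it produces them, and the scalar-multiplication loop pops them in the reverse order. A toy model using a Python list as the stack:

```python
def reverse_with_lifo(recoded_digits):
    """Model the LIFO buffer: the recoder emits (sign, index) pairs
    starting from position 0, but the main loop must consume them
    starting from the most significant position."""
    lifo = []
    for digit in recoded_digits:   # pushed as the recoder emits them
        lifo.append(digit)
    out = []
    while lifo:                    # popped in reverse for the main loop
        out.append(lifo.pop())
    return out
```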
So we always read one element of F_p at once, and we use two addresses for storing one element of F_{p²}. Then we have a 127-bit datapath and control logic, which consists of an FSM and a fixed program ROM. So let's take a look at the datapath of this field arithmetic unit. It consists of two paths. We have a multiplier path, which is used for computing the integer multiplications — 127-bit × 127-bit multiplications — using a pipelined 64 × 64-bit multiplier and an accumulator. Then we have another path, which is used mainly for two purposes. First, we reduce the results of the integer multiplication by computing an addition of the two halves of the result — it is thanks to the efficient Mersenne prime that we can do this — and then absorbing the potential carry. We always perform that second addition so that we remain constant-time. Second, we use this path for computing the additions and subtractions in the field. Okay, let's look at an example of how we actually use this datapath for computing operations in F_{p²}. One multiplication in F_{p²} consists of three multiplications, two additions, and three subtractions in F_p, using these formulas. The diagram at the bottom represents the datapath: this is the dual-port RAM, each block is one clock cycle, and this shows how the data flows through the datapath. Here we have the input registers, then the multiplier pipeline, and then the adders on the right. We begin by reading the operands for the first multiplication from the memory and feeding them into the multiplier pipeline, but at the same time we read the operands we need for these two additions and compute them while the integer multiplication is in progress. Then we start the second integer multiplication and wait until it is ready.
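The count of three multiplications, two additions, and three subtractions matches the standard Karatsuba-style formula for GF(p²) = GF(p)[i] with i² = −1; a sketch under that assumption (helper names mine, reduction written with plain `%` for brevity):

```python
P = 2**127 - 1  # Mersenne prime; GF(p^2) = GF(p)[i] / (i^2 + 1)

def mul_fp2(a, b):
    """(a0 + a1*i) * (b0 + b1*i) using 3 base-field multiplications:
         c0 = a0*b0 - a1*b1
         c1 = (a0 + a1)*(b0 + b1) - a0*b0 - a1*b1
    i.e. 3 muls, 2 adds, 3 subs in GF(p)."""
    a0, a1 = a
    b0, b1 = b
    t0 = a0 * b0 % P                  # mul 1
    t1 = a1 * b1 % P                  # mul 2
    t2 = (a0 + a1) * (b0 + b1) % P    # mul 3, operands from the 2 adds
    c0 = (t0 - t1) % P                # sub 1
    c1 = (t2 - t0 - t1) % P           # subs 2 and 3
    return (c0, c1)
```

In the hardware described above, the three integer multiplications flow through the multiplier path while the adder path computes the additions and subtractions in the gaps, which is what the interleaving discussion below is about.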
And by then we already have the operands for the third multiplication ready in the memory, so we can continue directly. But once we have proceeded far enough with this third multiplication, the multiplier pipeline becomes idle, and we can actually start computing the first multiplication of the next F_{p²} multiplication before finishing this one. So it is a highly interleaved architecture. Finally, once we have computed all the integer multiplications of our first F_{p²} multiplication, we compute the final subtractions at the same time as the second F_{p²} multiplication. After 45 clock cycles we write back the last result of our first multiplication, but by that time we have already processed more than half of the second multiplication and already begun the third one. With this kind of scheduling we get the following latencies for our field operations. We see, for example, that a multiplication takes 45 clock cycles, but we can start the next one already after 21. We also have a slightly faster variant, but then we cannot start the next one as soon; it basically just computes the final subtractions a bit earlier. In practice most of the additions can also be computed at the same time as the multiplications, so they do not add any latency at all. And when we hand-optimize the routines for the different operations needed in this scalar multiplication algorithm, we get the following latencies: for example, slightly over 4,000 clock cycles for the pre-computation, 354 clock cycles for one double-and-add step, and, combining everything, slightly less than 30,000 clock cycles for the full scalar multiplication.
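The overall cycle count can be roughly reconstructed from the per-step figures just quoted (this simple model ignores any further overlaps, so it is only a sanity check, not the exact schedule):

```python
PRECOMP_CYCLES = 4000   # "slightly over 4,000" for the pre-computation
DBLADD_CYCLES = 354     # one doubling + addition/subtraction step
STEPS = 64              # iterations of the main loop

total = PRECOMP_CYCLES + STEPS * DBLADD_CYCLES
print(total)  # lands below the ~30,000 cycles reported in the talk
```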
What is noteworthy here is that the scalar decomposition takes about 2,000 clock cycles, but because we can compute it at the same time as the pre-computation, its effective latency is zero. This matters in the multi-core architecture, which I will explain next. Because the scalar decomposition is so much faster than the operations computed in the field arithmetic core, we can actually share one scalar unit among several field arithmetic cores: we first decompose one scalar, which corresponds to the field arithmetic operations computed in this core, then we start with the next one, and so on. Okay, we then implemented the design on a Xilinx Zynq-7020 FPGA and got the following area results. We use less than 13% of the resources available on this particular FPGA. It looks like the critical resource would be the number of slices, but if we take a closer look, we see that most of these slices go into the scalar recoding unit, and in the multi-core architecture the limiting factor is actually the number of DSPs. That is what we get when we put 11 cores on this FPGA — in theory we should be able to fit 13, but routing failed with more than 11. So this utilizes the resources of the whole FPGA, although the percentages are less than 100%, and this is how much of it is used by the scalar unit. Okay, then about the performance. The single core runs at 190 MHz, which gives us 157 microseconds for one scalar multiplication, or over 6,000 operations per second. The multi-core design runs at a slightly lower clock frequency, which means a longer latency, but we still get more than 10 times the throughput of the single core. Our implementation also supports other operations like point validation and cofactor killing; if you are interested in those, see the paper.
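The single-core performance figures are self-consistent: at 190 MHz, roughly 30,000 cycles per scalar multiplication gives about 157 µs and over 6,000 operations per second. A quick check with the numbers from the talk:

```python
FREQ_HZ = 190e6   # single-core clock frequency
CYCLES = 30_000   # approximate cycles per scalar multiplication

latency_us = CYCLES / FREQ_HZ * 1e6
ops_per_sec = FREQ_HZ / CYCLES
print(f"{latency_us:.0f} us, {ops_per_sec:.0f} ops/s")
# prints "158 us, 6333 ops/s"
```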
We also designed a third variant of the design, which uses only the Montgomery ladder. It does not utilize the endomorphisms, so there is no scalar recoding unit and no pre-computation; the implementation is significantly simpler and also smaller, but slower — we only get slightly over 3,000 operations per second with that variant. It is of course important to compare one's work to what is found in the literature, and there are many, many ECC implementations available, also over prime fields, but the comparison is extremely difficult. There are multiple reasons for this: the optimization goals might be different, the FPGA that is used is different, and so on. We present full comparisons in the paper, but in this presentation I will focus on comparing our design against its closest counterpart, Sasdrich and Güneysu's Curve25519 design, which is also implemented on the same FPGA with similar optimization goals. So if we compare our single-core architecture against their single-core architecture, we see that our design is slightly larger but quite a lot faster: we get over two and a half times the performance of this Curve25519 design, and even if we compare the throughput per DSP — the speed-area metric — we get almost twice the performance out of the FPGA. Then if we take the Montgomery ladder version, where we do not utilize the endomorphisms, we are actually smaller except for the DSPs, and we still get a 28% improvement in throughput; the speed-area metric is also better.
Sasdrich and Güneysu also presented a multi-core architecture, but in their case the shared resource is not a scalar unit, because they do not do any scalar manipulations; they used a shared inverter, because inversion in the field of Curve25519 is a much more demanding operation than it is in our quadratic extension field, so for them it makes sense to share the inverter. In our case the shared resource is the scalar unit, and in theirs it is the inverter. They managed to use all the DSPs for their design and did not experience any reduction in clock frequency, but we are still exactly twice as fast as they are with their multi-core architecture. Both designs fully utilize the resources of the Zynq-7020 FPGA, so these are quite well comparable results. Okay, so what are the conclusions? We showed that the speed advantages that are visible in software for FourQ also carry over to the hardware side — that is the main result: we get two to two and a half times the performance of Curve25519 with this new FourQ elliptic curve on an FPGA. How to continue from this? This was the first implementation, and there is room for optimization. For example, we focused on optimizing the speed-area ratio, but in some applications it is very important to have the shortest possible latency, so that is one direction. Our design is protected against timing attacks and simple side-channel attacks like SPA, but it does not claim any protection against DPA or more advanced single-trace attacks, so that is important future work as well. So thank you. Any questions?