So the next talk is given by Philipp. It's joint work with Fabrizio De Santis, Johann Heyszl, and Georg Sigl, and it's about FPGA implementations of Diffie-Hellman on the Kummer surface of a genus-2 curve.

Thank you for the introduction, and welcome, everyone. In this talk, I'm going to present two highly optimized FPGA implementations that use a genus-2 hyperelliptic curve to enable very fast Diffie-Hellman key exchanges. The main question we tried to answer in our work is how efficiently a hyperelliptic-curve-based scheme can be implemented on an FPGA, and how it compares with similar elliptic-curve prime-field implementations. To answer that right away: we will see that our performance is quite decent, and in fact outperforms all previous comparable works in terms of latency and throughput.

Now, when we operate in time-sensitive environments, what we usually want to use is elliptic curve cryptography. Why? Fast arithmetic, small key sizes; you all know the benefits. This naturally raises the question: why do we need hyperelliptic curve cryptography at all? What are its benefits, and what is the related work on that topic? For that, I'd like to give you a quick overview of the relevant works from the past few years. We will mostly discuss two elliptic-curve-based works: the well-known Curve25519 from Daniel Bernstein, and the very efficient FourQ implementation. Both of these schemes were implemented on FPGAs, in 2014 and 2016 respectively. The nice thing about those two implementations is that they have very similar optimization goals: both present two architectures, one targeting low latency, the other targeting high throughput. We will see how we align our results with them.

The only hyperelliptic-curve-based implementation shown here is the one by Bos et al., published in 2013. That was the first software implementation to use the Kummer surface of Gaudry and Schost's genus-2 curve. Everyone basically knew beforehand that this should theoretically perform quite well, and they confirmed it in practice. What was still missing, however, was the proof for a hardware implementation, and that is what we try to deliver in this talk.

Before we jump into the implementation, let's take a look at how hyperelliptic curves differ from elliptic curves, and how the group operation differs as well. First of all, how is a hyperelliptic curve different from an elliptic curve? You distinguish them by their definition: curves are classified by their so-called genus, which is directly related to the degree of the polynomial that defines the curve. Elliptic curves are of genus 1, corresponding to degree 3. Hyperelliptic curves, or in our case so-called genus-2 curves, are of degree 5.
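To make the genus-degree relation concrete, here is a sketch with generic coefficients (these are placeholders, not the specific Gaudry-Schost parameters):

```latex
% genus 1 (elliptic), degree 3:
E:\quad y^2 = x^3 + a\,x + b
% genus 2 (hyperelliptic), degree 5:
C:\quad y^2 = x^5 + f_4 x^4 + f_3 x^3 + f_2 x^2 + f_1 x + f_0
```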
Now, take a look at that curve, and suppose you want to design some kind of Diffie-Hellman key exchange, for which you need a scalar multiplication and therefore a group operation. If you tried to use the standard chord-and-tangent rule on this hyperelliptic curve, it wouldn't work. Why? Because a line through this curve would intersect it in five points, and from that you can't construct a group operation. What we do instead is this: rather than having a one-to-one correspondence between a group element and a single point, as in elliptic curve cryptography, we take two points, for example P1 and P2. These have a certain structure, and together they form one group element (there is a small sketch of this encoding just before the implementation part). We can do the same for Q1 and Q2.

When you want to determine the group operation, what you do is actually quite similar to elliptic curve cryptography. You determine a polynomial, now of degree 3, which again intersects the curve in all those points, and also in two further points, R1 and R2. Then you do the same thing as in elliptic curve cryptography: mirror them across the x-axis, and you obtain your group operation.

One last thing to clarify, because you are probably asking yourself: what is this Kummer surface? I don't want to explain here what it actually is; I think it's more interesting to see what it does and where the benefits are. When you map group elements from, let's say, the standard representation onto the Kummer surface, you identify the group elements with their inverses. If you're familiar with elliptic curve cryptography, you've probably done that already, because it is essentially the same as x-only arithmetic, where you drop the y-coordinate. And of course, the Kummer surface is used to speed up the operation.

Now, what does this mean implementation-wise, and how do the schemes differ from each other? Compare the Kummer-surface-based implementation against Curve25519 as an example. There are two interesting parameters to look at: the field size, and the number of field operations per ladder step. The nice thing about the Kummer-surface-based implementation, with the specific Kummer surface of the Gaudry-Schost curve, is that the field size is half that of Curve25519, namely 127 bits, which gives you a great advantage when implementing the field operations. On the other hand, there is an increase in the field operations per ladder step, which raises the question: does the reduced field size outweigh this increase in field operations per ladder step?
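Before moving to the implementation side, a short aside on the "two points form one group element" idea: the standard way to encode such a pair is the Mumford representation, where the pair {P1, P2} becomes a polynomial pair (u, v). Here is a minimal, illustrative Python sketch under simplifying assumptions (the tangent and special cases are omitted, curve membership is not checked, and the helper name is mine):

```python
q = 2**127 - 1  # the field prime from the talk; any odd prime works for this sketch

def mumford(P1, P2):
    # Encode the unordered pair {P1, P2} as polynomials (u, v):
    # u(x) = (x - x1)(x - x2) vanishes at both x-coordinates, and
    # v(x) is the chord through P1 and P2, so that v(xi) = yi.
    (x1, y1), (x2, y2) = P1, P2
    if x1 == x2:
        raise NotImplementedError("tangent/special cases omitted in this sketch")
    u = (1, -(x1 + x2) % q, (x1 * x2) % q)      # coefficients of x^2, x^1, x^0
    lam = (y2 - y1) * pow(x2 - x1, -1, q) % q   # slope of the chord
    v = (lam, (y1 - lam * x1) % q)              # coefficients of x^1, x^0
    return u, v
```

The pair (u, v) is exactly the "certain structure" mentioned above: one group element, independent of the order of the two points.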
Now, before we jump into the actual implementation, let's take a look at the functions we need to implement. First of all, we didn't come up with these functions ourselves; they were published in earlier works, and we simply use them. We need to implement three functions: an unwrapping function, the scalar multiplication, and a wrapping function.

The scalar multiplication is, I think, standard: you have some input point P, or rather some group element in this case, which you multiply by your secret key k to obtain either your shared secret or your public key Q. But you also have these two wrapping and unwrapping functions, which are probably not familiar from elliptic curve cryptography. Why do you have them? As you can see, the bit size of the input group element is quite large, 508 bits. The wrapping function reduces the size of this point, and it also benefits the scalar multiplication, because parts of the wrapped point are reused when it is fed into those functions, which gains you speed.

Of the operations we need to implement, the scalar multiplication is quite standard: you use a standard Montgomery ladder, which consists of modular multiplications, modular squarings, and the so-called Hadamard transform, which in the end is just a chain of additions and subtractions; I'll show a small sketch of it after this part. For the wrapping function, you also need a modular inversion, which is known to be quite time-intensive.

Now, let's look at the implementation. We will present two implementations. The first is the so-called single-core implementation, which means we are interested in multiplying just one point at a time; in other words, we mostly care about latency here, not throughput. We realize the functions we've seen with three building blocks. First, the control logic, which provides the control signals for the other modules. Then a memory module, which consists of a distributed RAM (not a block RAM) holding some constants we need, and a register file holding temporary values from the datapath. For the datapath, we decided to implement three modules: the Hadamard module, the modular multiplier, and the constant modular multiplier. We decided not to implement a dedicated squaring module, because the modular multiplier is already quite area-intensive, so we left it out.

We don't have time to discuss all these modules in detail, so I'll focus on the two that are, in my opinion, the most interesting and also the ones responsible for the performance we achieve: the modular multiplier and the control logic. We start with the modular multiplier, where we apply three techniques. First, since we are interested in low latency, and later also in throughput, we want a fully parallel multiplier: we compute all the individual digit products in parallel, and we also accumulate all of them in parallel. Second, we use a so-called non-standard tiling technique, published a few years ago, which reduces the number of DSP blocks needed to compute those smaller digit products. That is a very nice technique because it doesn't cost you anything: it reduces hardware without affecting performance.
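As promised, here is the Hadamard transform made concrete. This is a minimal behavioral sketch in Python, assuming the standard four-point transform used in Kummer ladder formulas; the real design computes this with a small add/subtract network rather than sequential modular operations:

```python
P = 2**127 - 1  # Mersenne prime of the Gaudry-Schost Kummer surface

def hadamard(x, y, z, t):
    # Chain of additions and subtractions computing
    # (x+y+z+t, x+y-z-t, x-y+z-t, x-y-z+t) mod P.
    a, b = (x + y) % P, (x - y) % P
    c, d = (z + t) % P, (z - t) % P
    return ((a + c) % P, (a - c) % P, (b + d) % P, (b - d) % P)
```

Each ladder step applies this transform together with the modular multiplications and squarings mentioned above.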
The third technique is combining the multiplication and reduction procedures for better performance. I would like to spend the next few slides on that, because I think it is a bit different from what we are used to. To be honest, this is also work we published earlier, as a stand-alone modular multiplier; I'm just summarizing the results here.

When we perform multiplication in Mersenne prime fields, we essentially have two steps: first the multiplication, then the reduction. Mersenne primes have the very specific structure p = 2^k - 1, which is very nice because it allows specific computational tricks. In our case, k is 127, which means our field elements are also 127 bits wide. You multiply the two input operands, both 127 bits wide, and obtain your result c. In the next step, some logic performs the reduction: because we can use the standard Mersenne reduction, you take the upper part of c, shift it to the right, and add it onto the lower part of c to obtain the reduced result.

Now let's look at how we do it, starting with the picture on the left; we oversimplify things a bit here. When you perform a large multiplication on an FPGA, you need to decompose it into smaller digit products, which you can see on the left. Once you have computed those digit products with a standard algorithm such as the schoolbook method, you put them into an adder tree, in our case a fully parallel adder tree, obtain your result c, and then apply the procedure we've just seen: take the upper part of c, add it onto the lower part, and so on.

Instead of computing the digit products, accumulating them, and then shifting the result, you can also shift before you accumulate. As the middle picture shows, we take the upper part of each digit product before accumulation and wrap it around to the right. Now, if you fed this directly into a hardware design, the adder tree would contain unused bits, which is of course not efficient, so we want to avoid that. What you can do is slice the digit products into their individual bits and reorder them along vertical lines; horizontally it's not possible, of course, because that would change the value of a bit. Once you have regrouped those bits, we arrive at the picture on the right, with a very nice, symmetric structure, and now it's quite easy to put that into an adder tree and process it at a very high maximum frequency. I should also mention that we designed the modular multiplier, and all the other finite-field modules, such that a new operand can be fed in every cycle, which means we have basically no busy cycles or stalls.
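To tie the multiplication and reduction together, here is a minimal behavioral model in Python. It is a sketch of the idea, not the hardware: the digit width is an illustrative assumption, and the bit-level vertical reordering into the symmetric adder-tree structure is not modeled, only the shift-before-accumulate arithmetic that makes it possible:

```python
import random

P_BITS = 127
P = 2**P_BITS - 1            # Mersenne prime: 2^127 ≡ 1 (mod p)
DIGIT = 32                   # illustrative digit width; real tilings follow DSP sizes
MASK = (1 << DIGIT) - 1

def reduce_mod_p(c):
    # Standard Mersenne reduction: fold the upper bits onto the lower bits.
    c = (c & P) + (c >> P_BITS)
    c = (c & P) + (c >> P_BITS)   # second fold absorbs the carry
    return c - P if c >= P else c

def mul_fold_early(a, b):
    # Shift-before-accumulate: wrap each digit product's bits above
    # position 127 down to the bottom before accumulation, instead of
    # building the full double-width product and reducing afterwards.
    acc = 0
    for i in range(0, P_BITS, DIGIT):
        for j in range(0, P_BITS, DIGIT):
            t = (((a >> i) & MASK) * ((b >> j) & MASK)) << (i + j)
            acc += (t & P) + (t >> P_BITS)
    return reduce_mod_p(acc)   # short final reduction of the accumulator

# Sanity check against plain multiply-then-reduce:
for _ in range(100):
    a, b = random.randrange(P), random.randrange(P)
    assert mul_fold_early(a, b) == (a * b) % P
```

In hardware, the wrapped bits are what get sliced and regrouped vertically into the symmetric columns of the right-hand picture, so the fully parallel adder tree has no unused inputs.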
We built the multiplier first, then continued with the other field-operation modules, and once we had all of those, we came to the scalar multiplication, where we needed to schedule the individual field operations. What you see now is the scheduling of the field operations for the first ladder step. On the left are the three modules I mentioned earlier: the Hadamard module, the modular multiplier, and the constant modular multiplier. The blue bars mark whenever a new operation is scheduled. That doesn't mean the output is valid at that point; it simply means a new operation is issued, which in turn means that in all the unused cycles you could potentially schedule another operation.

That led to the idea: why not schedule a second scalar multiplication in between, ideally without losing any cycles? We did that, and as you can see, we now have a second scalar multiplication interleaved. The nice thing is that you really don't lose anything in terms of performance. Of course, we need some more area, because some of the input operands have to be stored in memory; that is a disadvantage. But the advantage is that you increase your throughput: if you use this to multiply two points at a time, you double your throughput. And if you are not interested in doubling the throughput and only want to multiply one point at a time, you can instead use this as a fault countermeasure: you simply perform the scalar multiplication twice on the same input point and then perform an equivalence check at the end of the computation (see the small sketch after the results).

Now let's look at the results, starting again with the single-core implementation, which we compare, as mentioned earlier, to FourQ and Curve25519. In terms of latency, I think it's quite obvious that we outperform the two other works. What is probably more interesting is the area, and here we clearly have to admit that in terms of slices and also DSP blocks, we suffer compared to the other implementations. I do want to note, though, that you should consider the ratio between latency and area, and there, I think, we are at least comparable or even outperform the two other works. One last thing: we don't use any BRAM, which could be nice if you want further functionality on your FPGA.

Now to the multi-core implementation. Its goal is to compute as many points per second as possible, so throughput, and the constraint is the FPGA itself. I forgot to mention: all works use the same FPGA, the Zynq-7020. What you do is put as many cores as you can onto the device and then measure the performance. It's a bit odd, but that's also how the two other works did it, so we followed their example. For throughput, we see a very similar result; in this case, of course, we activated the feature that doubles the throughput. For slices and DSP blocks, things are similar, but I would say the comparison itself is now a bit fairer, at least against Curve25519. FourQ again has much lower area utilization, and for BRAM we have the same situation as before.
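As referenced above, here is a minimal sketch of the fault-countermeasure mode. It is behavioral only: `scalar_mult` is a hypothetical stand-in for the core's scalar multiplication, and the two runs model the two interleaved slots of the datapath:

```python
def checked_scalar_mult(k, point, scalar_mult):
    # Run the scalar multiplication twice on the same input (in hardware:
    # both interleaved slots) and release the result only if they agree.
    q1 = scalar_mult(k, point)
    q2 = scalar_mult(k, point)
    if q1 != q2:
        raise RuntimeError("fault detected: interleaved runs disagree")
    return q1
```

A fault injected into only one of the two interleaved computations then shows up as a mismatch in the final equivalence check.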
Now I would like to conclude my talk with three take-home messages. The first comes back to the initial question I asked at the beginning: can we achieve such a high-speed implementation based on a Kummer surface? Yes, it is possible to do that on an FPGA. On the other hand, and that is the second point, I want to be careful with the comparison. Why? Because in the end we use very specifically optimized modules: the modular multiplier is extremely optimized, and the interleaved scalar multiplication technique of course gives us a huge advantage in terms of throughput. Those techniques could potentially also be applied to the comparable works, Curve25519 and FourQ. The last point is basically to re-emphasize that hyperelliptic curve cryptography is an interesting alternative to elliptic curve cryptography, but more research is needed; for example, an area-optimized implementation would be interesting, or similar comparable things. Okay, thank you.

Thank you very much. We have time for questions. Any questions? While people are warming up, I have a quick question: which hardware platforms did you implement this on, which FPGAs? Are they similar to the ones you compared against, and how many different platforms did you compare on?

We only compared on one platform, and that is the Zynq-7020 FPGA. We used basically the exact same device that the other works mentioned, for easier comparison.

Thank you. Still no questions? Hello. You said you built the adder tree. Is the adder tree made of compressors, or did you just put in a lot of additions and let the tool solve it for you?

Yeah, in that case we didn't really optimize the adders themselves; we simply instantiated them and basically told the synthesis tool to synthesize them, so we didn't optimize them by hand.

And for timing, is the adder tree the bottleneck or not?

Yes, the adder tree is the bottleneck. And by aligning those levels, as we saw in what I showed earlier, the nice thing is that all the stages of the adder tree operate at a very similar maximum frequency, so you don't have one stage that is faster than another; we pipeline all of them.

Okay. All right, let's thank the speaker again.