Hello everyone, I'm going to present the CHES paper entitled "A Compact and Scalable Hardware/Software Co-design of SIKE". My name is Pedro Maat Costa Massolino, and this work was done together with Lejla Batina, Patrick Longa and Joost Renes. While doing this work, Joost Renes and I were both at Radboud University; neither of us is affiliated with Radboud University anymore. Patrick Longa was at Microsoft Research, and Lejla Batina is still at Radboud University. So let's get started. In this presentation, I'm going to focus mostly on the contribution of the paper, which is the hardware architecture. I'm not going to talk so much about how SIKE works, or how Montgomery multiplication works, or about the algorithms themselves, because I mostly want to highlight how the architecture was chosen; even though those algorithms influenced the architecture, I want to focus on the architecture side. But because they were influential, I have to talk a little bit about the requirements of SIKE. After that, I'm going to go through how the literature approached this problem, and then our approach itself, which is where I'm going to spend most of the time. Later, I'm going to talk about the results, and together with the results at the end, I'm going to mention some differences between our current ePrint paper and the TCHES version, because some changes were made. So what do we need to make SIKE? SIKE is a protocol that works on top of elliptic curves, in this case Montgomery elliptic curves. It can work with other elliptic curves, but it works best with Montgomery curves. These Montgomery elliptic curves are built on top of a quadratic extension field, and this quadratic extension field is built on top of a prime field. The prime field is just operations modulo a prime, and this prime ranges between 434 bits and 751 bits.
Before, when this project started, it ranged between 503 bits and 964 bits. This change in size really highlights the fact that when you start implementing new post-quantum schemes, or even new public-key cryptosystems, you should be prepared to change parameters and parts of the design along the way. So it's always nice to have an easily tunable implementation. On top of all of that, besides the Montgomery elliptic curve operations, we're also talking about a tree-traversal procedure. That is the fast way of doing SIDH: walking between the elliptic curves via the isogenies. This procedure itself requires a lot of stack operations and data-array operations. You also need SHAKE256, which is part of the SHA-3 standard. So how do we tackle this problem? The best idea is to focus on the base of the triangle, that is, the prime field operations. You focus on those operations, you focus on the Keccak-f permutation as well, and you do those in hardware. Then you build everything on top as a software layer: the Fp inversion and the quadratic extension field are a software layer. You could also push some of them a bit further into the hardware side, but then it's a trade-off between flexibility and ease of maintenance, and software is usually easier to maintain than hardware. And as you go up, you make everything a function, so you can call it and add or remove arguments. So what's our solution? Our solution is a 16-bit CPU. As I said, we need function calls, data arrays, and stack operations, and those are usually handled well by CPUs. So we chose to build a CPU and implement the prime field operations as a coprocessor of that CPU. That coprocessor is Carmella, and it is where most of the project time was spent. The Keccak-f comes from the Keccak package.
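As a toy illustration of this layering, here is a sketch of my own (not code from the paper) of how the quadratic-extension software layer can be expressed purely in terms of base-field primitives. The prime and the function names are placeholders; the real design exposes wide hardware operations instead of Python arithmetic.

```python
# Toy sketch (not from the paper): the software layer builds the quadratic
# extension field out of base-field calls the coprocessor would provide.
p = 2**13 - 1  # placeholder prime with p % 4 == 3; SIKE primes are 434-751 bits

def fp_mul(a, b): return (a * b) % p   # stand-in for the hardware multiplier
def fp_add(a, b): return (a + b) % p
def fp_sub(a, b): return (a - b) % p

def fp2_mul(a, b):
    """(a0 + a1*i)(b0 + b1*i) with i^2 = -1, schoolbook over F_p."""
    a0, a1 = a
    b0, b1 = b
    c0 = fp_sub(fp_mul(a0, b0), fp_mul(a1, b1))
    c1 = fp_add(fp_mul(a0, b1), fp_mul(a1, b0))
    return (c0, c1)
```

Swapping the prime or the word width then only touches the base layer, which is exactly the tunability argument made above.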
The authors of Keccak had already done a good hardware design, so we just took it and integrated it into our CPU. So what is the approach in the literature? The older literature is more on the SIDH side; they did only SIDH. They started with affine formulas and fast inversion units, but that didn't go so well, so they changed to projective formulas, which gave better results. Later they did more optimizations on the multiplier, added the other parameters, and added SHAKE, so they could support SIKE. In that architecture, roughly, you have several multipliers and a special adder/subtractor unit together with a main block of memory. Those multipliers are fed by loads and stores, and the problem turns into a scheduling problem, because now you have to schedule all those multipliers, so you need a really good scheduler for that. A very different approach is to focus only on the field arithmetic: no SIDH, no SIKE, just trying to build really fast field arithmetic. There was no SIDH in that paper, but it's a difficult problem, and if you build very fast field arithmetic you are most likely going to get really good results. There was also another paper that uses a carry-save notation; that one is a multiplier for SIDH as well. Our solution, and this is what I want you to pay attention to, focuses instead on building a CPU: very flexible, easily tunable for any parameter in case of any change. Okay, so let's start focusing on Carmella. Carmella is built around a multiplier and an adder that together form the MAC, the multiplier-accumulator. The multiplier is 256 or 128 bits wide, depending on the version of Carmella.
We have two versions: the 256-bit version of Carmella is made for the Xilinx 7-series FPGAs, which use rectangular multipliers, and the 128-bit version of Carmella is for smaller devices with square multipliers, like the Spartan-6 and older series. The adder itself is designed around the multiplier, so the two can be combined into the multiplier-accumulator; that's why we have this special adder. On top of the multiplier-accumulator we put some pipeline stages, 8 or 4 depending on the operation, and on top of that we build the Fp operations. So let's focus first on the multiplier, which is a problem by itself. The 7-series I mentioned has a rectangular multiplier of 18 by 25 bits signed, or 17 by 24 bits unsigned. The idea is that we want to build a signed multiplier, so we can use it for both unsigned and signed operations. So we build it 257 bits wide, and the extra most significant bit basically flags the multiplier into unsigned mode or signed mode. One way to build it is to ignore the fact that the DSP multiplier is rectangular, use it as a square one, and then you basically need a lot of multipliers for 256 bits, and that's quite a lot. To reduce the number of multipliers, we ended up using a technique by Roy et al., proposed a while ago, called tiling. The idea is to see the problem as follows: my 256- or 257-bit multiplication is a 257-by-257 square, and I want to fit as many of the small rectangular multipliers as possible inside that square. You can picture the 257-bit square with the small rectangles fitted into it; the figure in the slides is to scale, and you end up with 161 multipliers.
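To make the tiling idea concrete, here is a hedged sketch of my own: it decomposes a wide multiplication into 17-by-24-bit partial products, the unsigned DSP dimensions mentioned above. A naive grid like this uses more DSP-sized products than the paper's optimized placement (which reaches 161); it only shows the principle.

```python
A_CHUNK, B_CHUNK = 17, 24  # unsigned dimensions of the 7-series DSP multiplier

def rect_mul(a, b, bits=257):
    """Wide multiplication built from rectangular DSP-sized partial products."""
    acc = 0
    for i in range(0, bits, A_CHUNK):          # 17-bit slices of a
        ai = (a >> i) & ((1 << A_CHUNK) - 1)
        for j in range(0, bits, B_CHUNK):      # 24-bit slices of b
            bj = (b >> j) & ((1 << B_CHUNK) - 1)
            acc += (ai * bj) << (i + j)        # one DSP-sized product per tile
    return acc
```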
For the 128-bit multiplier we don't use tiling, because we're talking about the Spartan-6 series, which has square multipliers; we built everything around square multipliers, so we don't need tiling, we just use schoolbook multiplication. The catch is that we don't do the final multiplication directly; we just generate partial products. Partial products are what you get when you do long multiplication like at school: you take one digit, multiply it by the big number, and later you have a lot of values that you have to add. It's the same thing here: you have a lot of values that are just partial multiplications that you want to add. In this case we have 30 partial products to add. The best way to add them is to use a unit called a compressor. A compressor is basically a unit that does additions one bit at a time: it takes bit 0 of every one of the 30 addends and adds them all together, generating a 5-bit number. If you do this for every bit position, you are basically compressing all 30 additions into 5 additions, which is much easier because you are not propagating carries, and carries are usually what makes adders slow. So you reduce from 30 to 5, then from 5 to 3, then from 3 to 2. When you are at 2, you cannot escape: you have to resolve the carries, and you resolve them using a special adder. The adder we used is employed both in this part and in the accumulator. It is special because it has a special architecture around it. Usually, for adders, it's better to ask the tool to do it, like Vivado or Xilinx ISE, and they will give you a really fast adder.
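The 30-to-5-to-3-to-2 reduction just described can be sketched with full-adder (3:2) compressors. This is my own toy model, operating on Python integers rather than wire vectors:

```python
def compress_3_2(x, y, z):
    """Full-adder identity per bit: x + y + z == sum_bits + carry_bits."""
    s = x ^ y ^ z                              # carry-free bitwise sum
    c = ((x & y) | (x & z) | (y & z)) << 1     # carries, shifted up one place
    return s, c

def reduce_addends(addends):
    """Compress a pile of addends down to two, with no carry propagation."""
    while len(addends) > 2:
        x, y, z = addends[:3]
        addends = addends[3:] + list(compress_3_2(x, y, z))
    return addends  # a final real adder resolves the carries on these two
```

Only the very last two-input addition needs a real carry chain, which is where the special adder comes in.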
But for adders of this size, around 200 to 500 bits, the tool's result will most likely not be as fast as possible; it will most likely be a very compact one-cycle adder, but not as fast as it could be. So you can use another architecture, proposed by the authors cited on the slide, called the add-add-multiplex adder. Basically, you add the two numbers twice, with the idea: I'm going to compute this sum assuming carry-in 0, and also assuming carry-in 1; later I resolve the actual carry with a regular adder, and once the carry is recovered, I choose between the solution with carry 0 and the one with carry 1. This makes the adder faster than just doing a plain addition. The cost of this technique is that you use more lookup tables, but that's okay. So, after deciding: I have this multiplier, I have this accumulator, we still need pipeline stages to raise the frequency, otherwise the frequency would be very low. But how many pipeline stages?
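Before answering that, the add-add-multiplex adder just described can be sketched as a carry-select scheme; this is my interpretation, and the chunk size and width here are illustrative choices, not the paper's.

```python
def carry_select_add(a, b, width=256, chunk=64):
    """Each chunk is added twice (carry-in 0 and carry-in 1); a multiplexer
    picks the right speculative sum once the incoming carry is known."""
    mask = (1 << chunk) - 1
    carry, result = 0, 0
    for i in range(0, width, chunk):
        ai, bi = (a >> i) & mask, (b >> i) & mask
        s0 = ai + bi              # speculative sum with carry-in 0
        s1 = ai + bi + 1          # speculative sum with carry-in 1
        s = s1 if carry else s0   # the multiplexer
        result |= (s & mask) << i
        carry = s >> chunk
    return result | (carry << width)
```

Both speculative sums per chunk exist in parallel; the extra lookup tables buy a shorter critical path, as noted above.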
So we want to match two things here. One thing I want to match is doing two Fp squaring operations: two Fp squarings can be done with eight multiplications and four additions using the schoolbook technique. So I want to put eight pipeline stages in my MAC, because with eight pipeline stages I can run eight operations in parallel, which fits the idea of always doing two Fp squarings. For additions I want to have only four pipeline stages. I can use fewer pipeline stages for an addition because I basically turn off, or ignore, the output of the multiplier; the next slide shows this better. Here you can see the inputs in registers A and B, and the multiplier, which internally has five pipeline stages. Then the accumulator is used here: I multiply two values, accumulate them with the optimized adder, and the accumulator feeds back here, and can also be shifted. When I'm doing an addition, I ignore the output of the multiplier and compress the two input values instead: the two values are added and compressed here, we skip the extra pipeline stages that exist only to match the multiplier, and add them together, giving four pipeline stages. The S here is a mask for the addition, so I can do masked additions and masked subtractions. The idea is that I always execute the addition, but the mask decides whether it takes effect or not. It's a way to do additions without any 'if' logic, just like people do in constant-time software, where they just use masks. So now let's explain how the state machine works. Given the MAC, I'm going to do the 8 or 4 parallel, 256- or 128-bit
operations. So we have this 257-bit signed MAC; how do we make words of up to 1024 bits? We just split values into words. For example, if I need only one word, that word is signed and is 256 bits. If I need up to 512 bits, the least significant word is unsigned and the other word is signed; the same for 768 bits, where only the most significant word is signed. By a word here I mean 256 bits. That's how it works. For multiplying these words, we use a product-scanning technique for the Montgomery multiplication. In product scanning, you scan over the words of the product output instead of scanning over the operands. For example, to multiply two multi-word values: first I look at which multiplications affect word 0 of the product, I do those operations, then I move to the next word of the product, word 1, and so on. You can see that in operand scanning the operand indices increment, or maybe decrement, in a linear fashion, while here you add one to one operand index and subtract one from the other, so that the sum of the two indices stays constant. We do additions and subtractions directly, meaning we just do the addition or the subtraction and the result may be a negative or a positive number; since we use a signed representation, the multiplication can work with negative numbers. After all the processing, we need to be sure that the output is between 0 and p-1, so we have one extra operation that takes a value from between minus p and p to between 0 and p-1. After the CHES paper, we also added an addition and a subtraction that perform this extra reduction step in
case it's needed. So how do we control all of this? I said that we use a state machine. Basically, you can think of all those algorithms I just described, for all the operand sizes, and you unroll all of them; if you do that, you end up with 300 to 500 states, which is really a lot of states. So, to be sure that everything stays correct and everything works, I decided to make my own state machine, and it's quite easy to make your own state machine if the number of 'ifs' in its control path is fairly small. Everything is really simple, because what we have here is a ROM and a program counter. The program counter is set to the first state that I want to execute, say the first state of the operation, and then it keeps incrementing as it goes until it reaches the end of the operation for that operand size. For multiplication you can have the same program for two operand sizes, and at some point just say: for this operand size, you have to branch to this path. So you need that kind of 'if' logic, but besides that, it's just the ROM. For the addresses of the operands, you basically use shift registers that keep rotating the addresses. Every time I'm doing an operation, say on value 0 of the multiplication, it rotates through 0, 1, 2, 3, 4, 5, 6, 7, and it always knows which address I'm working on, because everything rotates together. So, is this everything enough?
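To close the loop on the multiplication order described a moment ago, here is a small product-scanning sketch of my own; it uses 16-bit toy words instead of the hardware's 256-bit words, and it omits the interleaved Montgomery reduction.

```python
W = 16  # toy word size; the hardware works on 256-bit words

def product_scanning_mul(a_words, b_words):
    """Scan over words of the product: for output word k, accumulate every
    a[i] * b[j] with i + j == k, then carry the excess into column k + 1."""
    n = len(a_words)
    out = [0] * (2 * n)
    acc = 0
    for k in range(2 * n - 1):
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc += a_words[i] * b_words[k - i]   # operand indices sum to k
        out[k] = acc & ((1 << W) - 1)
        acc >>= W                                # carry on to the next column
    out[2 * n - 1] = acc
    return out
```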
Well, I just told you how I did all the prime field arithmetic, which is already a pretty hard problem by itself, but that's not SIKE; that's only the prime field arithmetic. We're still missing a lot: the inversion in the prime field, the Fp2 operations, the Montgomery elliptic curves, and everything else necessary for SIKE and SIDH. The remaining parts of SIKE run on top of the CPU, so now we're going to focus mostly on the CPU itself and not so much on Carmella. The CPU is a Harvard CPU with a custom instruction set. I made the instruction set as simple as possible and easy to decode; that was my entire goal. It performs the basic operations: additions, subtractions, shifts, rotations, logical operations, comparisons, jumps, conditional jumps, loads and stores, and push and pop. I want to emphasize the loads, stores, pushes and pops, because those are the operations that are really needed for function calls and for the data pointers; the other operations are needed as well, of course. This is a multi-cycle CPU: it takes an instruction, decodes it, executes it, then fetches the next one, without trying to do any pipelining. If an instruction is not for the CPU itself, it is sent to Carmella, and Carmella executes it. The CPU does have a small amount of pipelining in the address resolution of instructions, which can be resolved a little later; there is an extra state to make sure the address resolution is updated in case the register values have also been updated, so it's fine in that sense. For the 16-bit ALU, there is a DSP and a barrel shifter; I ended up building the barrel shifter separately instead of using the DSP, because it
looked a little bit better that way; for the barrel shifter, if you look at the paper, there is a citation for where I took it from. For the RD registers and the base RAM, one thing to take into account is that they share a memory region: the base RAM and the RD registers hold the same values in both locations. This can be seen somewhat as a cache, but it isn't a cache in the usual sense of speeding up memory; it's just that the registers mirror the same positions. So for some operations you use the values from the base RAM, and sometimes you want the values from the RD registers. At the high level, the architecture has a local 16-bit bus that connects everything. The program ROM itself is not connected to this bus; it's a Harvard CPU. To program it, only external loading is possible, which is done through the external communication interface. The CPU cannot read or write the program ROM; only the external communication can access it. The values from the program ROM go to a next-instruction register, then the addresses are resolved, then it becomes the current instruction, which is either executed or passed to the coprocessor. One thing to take into account: because of how Carmella works with the 8 or 4 parallel operations, and because each Carmella instruction takes the same size as one CPU instruction, all the Carmella instructions have to come one after another; you always have blocks of 4 Carmella addition instructions or 8 Carmella multiplication instructions. You also have a stack counter, because of the pushes and pops. The MAC RAM is the RAM that Carmella uses, with a 256-bit bus, but it can also be accessed from the local bus. The base RAM is another RAM mainly used by the CPU, where it can store values related more to the CPU than to Carmella,
like the pointers and the data arrays; some other things are stored there as well, together with the Keccak values, such as the SHAKE outputs. Now, for the results, I want to highlight a few things. The first thing I really want to highlight is that there is never a single solution that is good for everything. The solution we want to give here is one that works for all the parameters: it doesn't need to be reprogrammed, it's not really fast, but it's extremely flexible. Once deployed, it works with all the parameters; everything is done and ready. That's a good property, but if you want something really, really fast, then you should look at other results. The selected related works have much faster results than ours, and if you have requirements for those kinds of timings, you will need to spend more resources. Our solution, on the other hand, is well suited if you can accept larger timings: it uses far fewer DSPs and slices than the selected results for the same parameter, and as a consequence we support not only that parameter but all the other parameters as well. That's pretty much our contribution. When we compare this with other schemes, these are pretty old results; there are newer lattice-based implementations, but I kept these results here because, although I think the new results use a bit more resources, SIKE needs a lot of time and a lot of resources to do its work; it's not a very resource-friendly scheme. One thing I do want to show here is that, because our SIKE implementation is a CPU with the Carmella coprocessor, and we have all the elliptic curve machinery and all the prime field arithmetic, we can also do ECC: we can do scalar multiplication from P-224 up to P-521, and even a bigger
prime field, up to a limit of almost a thousand bits. And of course this is all done with the same CPU: no VHDL needs to be redone, no FPGA has to be reprogrammed, just the CPU program. It could even share the same program memory with SIKE: you could have the SIKE program and the ECC program together in the program ROM and start one or the other. So our solution could also be used later for a hybrid; at the end of the day it cannot do the hybrid by itself, but it could be used as part of one. Between the TCHES version and the ePrint version that has been online since, I think, June or maybe May, the main things I want to highlight are that we not only added the scalar multiplication, but we also fixed a bug in P503. P503 had a bug that was discovered by Utimaco: they took the architecture, did some testing themselves, and discovered a bug for the P503 parameter. As a consequence, I had to update the VHDL files and some other things related to the limits of the Montgomery representation, and then everything worked fine; this has already been updated and is available. I now want to conclude my talk. I hope you liked it, and if you did, don't forget to subscribe to the IACR channel and press the like button.