Hi everyone, thanks for attending this talk. Today I will talk about the parameterized hardware accelerators for lattice-based cryptography that we designed, as well as the application of these accelerators to a software-hardware co-design of qTESLA. This is joint work with my collaborators Shanquan Tian and Jakub Szefer from Yale University, Bernhard Jungk from Germany, Nina Bindel from the University of Waterloo, and Patrick Longa from Microsoft Research. During today's talk I will first explain why we are providing a new hardware design for lattice-based schemes, given that there is already a lot of existing work. Then I will give a brief background introduction to the digital signature scheme qTESLA, which we used to showcase the performance of the hardware accelerators we developed. Following that, I will focus on two of the hardware blocks and expand on their design details. In the end, I will show the software-hardware co-design of qTESLA, which is prototyped on a RISC-V platform.

Now let's take a look at the new lattice-based hardware design that we are targeting. As I mentioned briefly, lattice-based schemes are very popular nowadays, and as a result there are many different hardware designs for different lattice-based schemes. One type of existing design focuses only on the development of one or two building blocks, for example the Gaussian sampler. These blocks cannot cover all the expensive computations in a full lattice-based scheme, so the schemes are only partly accelerated. Moreover, most of the existing work supports only one set of security parameters, and the hardware architecture it provides is fixed as well. Also, the target of such work is usually just one specific lattice-based scheme. Finally, these designs do not take the portability metric into account, and therefore they offer no support for standard interface communication.
Another type of existing design provides a full hardware realization of one specific lattice-based scheme, and these designs achieve very good speed-ups. However, the same problems remain: fixed security parameters, a fixed architecture, no support for standard I/O, and so on. More recently, a more flexible approach has gradually been proposed and used, namely the software-hardware co-design approach. This brings more flexibility to the architecture design, but at the same time the same problems remain: fixed parameters, being tied to one specific scheme, etc. Therefore, in this work we propose a new hardware design for lattice-based schemes that is able to solve all of the problems we have just discussed. The idea is to first develop a set of hardware accelerators that are generic enough to support different lattice-based schemes. Further, we want to make sure that these hardware accelerators can be tuned not only according to the security parameters, but also according to performance parameters chosen by the user when targeting different applications. Another very important design decision in our work is that we use a standard handshaking protocol throughout our design, to make sure that our hardware accelerators, as well as the software-hardware co-design, can be ported very easily among different standard platforms, for example RISC-V-based and Arm-based architectures.

Now let's take a look at the qTESLA scheme. qTESLA is a lattice-based signature scheme that was a second-round candidate in the NIST PQC standardization effort, but it did not advance to the third round. Its reference implementation is already integrated in several open-source libraries. qTESLA is secure against both classical and quantum adversaries, and is also secure against some implementation attacks, for example certain side-channel and fault attacks.
Another advantage of qTESLA is that it uses only very simple arithmetic operations, which are perfect targets for hardware acceleration. But the property that actually makes qTESLA stand out from the rest is that it comes with provably secure instantiations. That means the security of a given instantiation is provably guaranteed as long as its corresponding R-LWE instance remains secure. This leads, however, to rather large parameters, as you can see from the table here. So in this work, a very interesting research question for us to answer is how much the performance of qTESLA can be improved by the use of hardware acceleration.

Now let's take a look at the operations needed in qTESLA. For signature generation, given the secret key and the input message, you first sample a random polynomial y, and then you compute a hash based on the secret key, the random polynomial, and the input message. Then a check is carried out to ensure that the signature will be accepted during the verification step. If the check succeeds, you go on and compute a potential signature, which is later checked to ensure security. If this check passes, you send out the signature; otherwise you go back to the random polynomial sampling step and repeat the whole process. For verification, given the public key, the signature, and the input message, you first hash them together, and then you compare the result with part of the signature. If this comparison is valid, you go further and check the security property. If this check passes, the verification passes; otherwise it fails. That's pretty much how qTESLA's signing and verification work. As we can see from the whole procedure, the operations involved in qTESLA are pretty straightforward and simple: basically, all you need are sampling, hashing, some comparisons, and some multiplication and addition operations.
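To make the sample-hash-check-retry flow concrete, here is a toy, completely insecure sketch of a Fiat-Shamir-with-aborts signature of this general shape. This is my own illustrative code, not qTESLA itself: the parameters, the ring (x^N - 1 instead of qTESLA's actual ring), the challenge derivation, and the single rejection check are all simplifications, and real qTESLA uses Gaussian sampling and additional correctness checks omitted here.

```python
import hashlib
import random

N, Q = 8, 97          # toy parameters, far too small to be secure
B, ETA = 20, 1        # bound on y coefficients / on secret coefficients
REJECT = B - N        # reject z if any centered coefficient exceeds this

def polymul(a, b):
    # schoolbook multiplication mod (x^N - 1) and mod Q
    c = [0] * N
    for i in range(N):
        for j in range(N):
            c[(i + j) % N] = (c[(i + j) % N] + a[i] * b[j]) % Q
    return c

def center(x):
    # representative of x mod Q in (-Q/2, Q/2]
    x %= Q
    return x - Q if x > Q // 2 else x

def hash_to_c(v, msg):
    # derive a small challenge polynomial from H(v, msg)
    h = hashlib.sha256(bytes(x % Q for x in v) + msg).digest()
    return [h[i] % 3 - 1 for i in range(N)]   # coefficients in {-1, 0, 1}

def keygen(rng):
    a = [rng.randrange(Q) for _ in range(N)]
    s = [rng.randrange(-ETA, ETA + 1) for _ in range(N)]
    t = polymul(a, s)
    return (a, t), s                          # public key, secret key

def sign(msg, s, a, rng):
    while True:                               # rejection-sampling loop
        y = [rng.randrange(-B, B + 1) for _ in range(N)]
        v = polymul(a, y)
        c = hash_to_c(v, msg)
        sc = polymul(s, c)
        z = [y[i] + center(sc[i]) for i in range(N)]
        if all(abs(zi) <= REJECT for zi in z):  # security check; else retry
            return z, c

def verify(msg, sig, pk):
    (a, t), (z, c) = pk, sig
    if any(abs(zi) > REJECT for zi in z):
        return False
    az = polymul(a, [zi % Q for zi in z])
    tc = polymul(t, c)
    w = [(az[i] - tc[i]) % Q for i in range(N)]  # equals a*y for a valid sig
    return hash_to_c(w, msg) == c
```

The point of the sketch is the structure: the only primitives the signer and verifier need are sampling, hashing, polynomial multiplication, addition, and comparisons, which is exactly the set of operations we accelerate in hardware.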
Now let's take a look at the full list of accelerators that we implemented for lattice-based schemes. We used qTESLA as an example to get the reference software profiling results, and further to get an idea of the computation cost of the different functions used in lattice-based schemes. As expected, most of the time within a lattice-based scheme is actually taken by the hash function and by polynomial multiplication. Further, the Gaussian sampling process also takes a fair chunk of the computation time. We also found that for qTESLA, sparse polynomial multiplication takes a big portion of the computation time as well. Such patterns can also be found in many other lattice-based schemes. Therefore, based on these profiling results, as well as the evaluation results for the full software-hardware co-design, we implemented the following hardware blocks for lattice-based schemes: a unified hardware core for both SHAKE and cSHAKE, targeting the 128- and 256-bit security levels; a novel parameterized binary-search CDT-based Gaussian sampler; and a novel, fully pipelined NTT-based polynomial multiplier. Further, targeting qTESLA, we also implemented a parameterized sparse polynomial multiplier as well as a very lightweight h max-sum module. More details can be found in our paper. During today's talk, I will only focus on the Gaussian sampler and the NTT-based polynomial multiplier.

Now let's take a look at the design of the CDT sampler. Here I'm showing the pseudocode for the binary-search CDT-based Gaussian sampler. The algorithm works as follows. You are given a random number x and a precomputed CDT table, whose size is not necessarily a power of two. In order to carry out the binary search, you first split the CDT table into two chunks, and later you focus your search within the chunk whose size is a power of two.
Then, depending on the value of the input random number and the entries of the CDT table, you carry out the binary search, and in the end you find the corresponding row of the CDT table and extract its index. This index is later sent out as one sample. In the hardware design, we implemented a binary search engine, which keeps interacting with a pre-initialized memory storing the values of the CDT. The diagram here shows the full architecture of the sampler. On the left side, you can see that the CDT sampler gets inputs from an outside cSHAKE module. These inputs are further processed by a PRNG module, which prepares random inputs at our targeted precision. Then the PRNG pushes its results into an input FIFO, and this FIFO feeds inputs to the binary search engine. Once results are available, they are pushed into an output FIFO, which is responsible for communicating with the outside world. One thing to note is that within the design of the CDT sampler, the communication inside the sampler, as well as the communication between the sampler and the outside world, is all implemented with an AXI4-Lite-like handshaking protocol. Therefore, the parallelism and the synchronization among different blocks are very easily maintained in our design.

Overall, our CDT-based sampler has a few good features. First of all, the whole design is fully parameterized: we can freely choose the security parameters, namely the standard deviation, the targeted precision, and the tail cut. Further, you can also determine how many samples you want to generate at a time, which is set by the batch-size parameter. Therefore, as long as your lattice-based scheme involves a relatively small standard deviation, you can use our CDT sampler in your design. Further, our design for the CDT sampler is fully constant-time and also pretty lightweight; we will see that in a minute in the following table. This table shows the performance of the CDT sampler.
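The sampling procedure described above can be sketched in software roughly as follows. This is an illustrative model, not the exact hardware algorithm: the table construction and the fixed-iteration search starting from the next power of two are my own simplifications of the chunk-splitting trick, the branch is not actually constant-time in Python, and a real implementation would draw its randomness from a cryptographic PRNG such as cSHAKE.

```python
import math
import random

def build_cdt(sigma, tail_cut, precision):
    # cumulative table over the non-negative half of a discrete Gaussian;
    # the probability of 0 is halved because a random sign is applied later
    bound = int(math.ceil(tail_cut * sigma))
    rho = [math.exp(-x * x / (2 * sigma * sigma)) for x in range(bound + 1)]
    rho[0] /= 2
    total = sum(rho)
    scale = 1 << precision
    cdt, acc = [], 0.0
    for p in rho:
        acc += p
        cdt.append(int(acc / total * scale))
    cdt[-1] = scale      # ensure every random value falls inside the table
    return cdt

def cdt_sample(cdt, precision, rng):
    r = rng.getrandbits(precision)
    # binary search with a fixed number of iterations: the step size starts
    # at the smallest power of two >= len(cdt), so the table size itself
    # does not have to be a power of two
    idx = 0
    step = 1 << (len(cdt) - 1).bit_length()
    while step > 1:
        step >>= 1
        if idx + step < len(cdt) and cdt[idx + step - 1] <= r:
            idx += step
    # idx is now the row with cdt[idx - 1] <= r < cdt[idx]; attach a sign
    sign = 1 - 2 * rng.getrandbits(1)
    return sign * idx
```

Because the loop always runs the same number of iterations regardless of the random input, the search cost is data-independent, which is the property the hardware version enforces with constant-time comparisons.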
As you can see, our sampler can easily be tuned to support different security and performance parameters. Further, you can see that the area usage of the sampler is very low, which shows that the design is very lightweight. Another important finding from this table is that, due to the use of the handshaking protocol between the different blocks, the search phase and the PRNG phase overlap almost perfectly. You can see this from the table, because the total cycle count is very close to the cycles taken by the PRNG alone. When we compare our design with state-of-the-art Gaussian sampler implementations, you can see that none of the existing designs can easily be plugged into a real-world application, because the computations in these designs are fully sequential: there is no synchronization mechanism implemented between the different sub-blocks of their designs. Moreover, these designs use much faster but arguably less cryptographically secure PRNGs, while our design uses a cSHAKE-based, cryptographically strong PRNG.

Now let's take a look at the design of the NTT-based polynomial multiplier. Currently, there are two widely adopted approaches for implementing NTT-based polynomial multiplication, both in software and in hardware. One method is to use a unified algorithm, basically by using the same butterfly for both the forward and the inverse NTT. The other approach uses separate algorithms, where the forward NTT uses the CT (Cooley-Tukey) butterfly and the inverse NTT uses a different butterfly, the GS (Gentleman-Sande) butterfly. These two approaches both have pros and cons. For the unified approach, if you think about it from the hardware perspective, only one hardware module is needed, because you have the same algorithm for both the forward and the inverse NTT. But this requires a few extra computations in the algorithm, for example pre-scaling, bit reversal, and some post-scaling operations.
With the separated approach, because of the different butterfly structures, you don't need such extra computations. However, in a hardware design, you need two separate modules for the forward and the inverse NTT. In our work, we propose a new NTT algorithm, which we call the CT-GS NTT algorithm. It is a unified algorithm in the sense that the same algorithm is used for both the forward and the inverse NTT, and this is achieved by designing a merged CT-GS butterfly unit that can function either as a CT butterfly unit or as a GS butterfly unit. We will see more details in the following pseudocode. Therefore, our approach involves only one hardware module, and no extra computations are needed. This is the pseudocode for the CT-GS NTT algorithm. The core part of the algorithm is that the constants for the control logic have to be initialized properly at the very beginning. As you can see, the loop structure, or the control logic when you map it to a hardware design, is pretty much the same for the forward and the inverse NTT. The only thing that differs is the inner-loop logic, which is handled by our merged butterfly unit. Another feature of our design worth mentioning is that we implemented an efficient memory access scheme to ensure that there are no idle cycles in the computational units; in this way, the NTT-based polynomial multiplier is fully pipelined. In the end, we present such an NTT-based polynomial multiplier; the diagram here shows its structure. It involves a few components: first of all, input memories, as well as memories storing the precomputed twiddle factors. As for the computational units, we have a merged butterfly unit, a generic Montgomery multiplier, a point-wise multiplication unit, and some control logic. Overall, our design for the polynomial multiplier is fully parameterized in terms of the length of the polynomial as well as the modulus.
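As a software reference for what the two butterfly modes of the merged unit must compute, here is a textbook negacyclic NTT multiplication in the Longa-Naehrig style, with CT butterflies in the forward direction and GS butterflies in the inverse direction. This is my own illustrative code with toy parameters, not the paper's pseudocode; because the twiddle factors are stored in bit-reversed order, no separate pre-scaling, bit-reversal, or post-scaling passes are needed.

```python
Q, N = 17, 8      # toy parameters: Q prime, 2N divides Q - 1
PSI = 3           # primitive 2N-th root of unity mod Q (3^8 = -1 mod 17)

def brv(x, bits):
    # bit-reverse x within the given number of bits
    return int(format(x, f'0{bits}b')[::-1], 2)

BITS = N.bit_length() - 1
PSI_REV = [pow(PSI, brv(i, BITS), Q) for i in range(N)]    # forward twiddles
IPSI_REV = [pow(PSI, -brv(i, BITS), Q) for i in range(N)]  # inverse twiddles

def ntt_ct(a):
    # forward negacyclic NTT with Cooley-Tukey butterflies
    a, t, m = a[:], N, 1
    while m < N:
        t //= 2
        for i in range(m):
            s = PSI_REV[m + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t] * s % Q
                a[j], a[j + t] = (u + v) % Q, (u - v) % Q
        m *= 2
    return a

def intt_gs(a):
    # inverse negacyclic NTT with Gentleman-Sande butterflies
    a, t, m = a[:], 1, N
    while m > 1:
        j1, h = 0, m // 2
        for i in range(h):
            s = IPSI_REV[h + i]
            for j in range(j1, j1 + t):
                u, v = a[j], a[j + t]
                a[j], a[j + t] = (u + v) % Q, (u - v) * s % Q
            j1 += 2 * t
        t, m = 2 * t, h
    n_inv = pow(N, -1, Q)
    return [x * n_inv % Q for x in a]

def polymul_ntt(a, b):
    # multiplication mod (x^N + 1): NTT, pointwise product, inverse NTT
    fa, fb = ntt_ct(a), ntt_ct(b)
    return intt_gs([x * y % Q for x, y in zip(fa, fb)])
```

Notice that the outer loop structure of `ntt_ct` and `intt_gs` is nearly identical; only the butterfly body differs, which is exactly what motivates handling both directions with one merged butterfly unit driven by properly initialized control constants.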
Further, our design is fully pipelined and fully constant-time, and similar to the design of the CDT sampler, the multiplier also supports a standard interface. This table shows the performance of the polynomial multiplier and the comparison with state-of-the-art related work. As we can see, compared with related work, our design is parameterized: it supports flexible parameter tuning. Further, our design achieves a very good cycle count; basically, the cycle count we achieve is very close to the theoretical limit. And as we can see from the table, when compared with a high-performance hardware design that instantiates four parallel butterfly units, our design achieves a much better time-area product. Compared with another related work, we achieve similar cycle counts but a bigger area consumption. This is due to the fact that these related works fix the length of the polynomial as well as the modulus q, and the shape of this specific q supports very cheap reduction, essentially a bunch of addition and shift operations, which can easily be done in hardware within one clock cycle. So the pipeline structure in that work is pretty simple to design.

Okay, now let's take a look at the prototype of a real lattice-based scheme, qTESLA, on a RISC-V-based software-hardware co-design. We use the Murax SoC as the platform in our work. This is a very lightweight SoC, which is fully open-sourced. Murax integrates a few components: first of all, a 32-bit standard RISC-V CPU called VexRiscv; a memory block shared by data and instructions; and an APB bus, which can talk to different peripherals, for example the UART module. Similarly, we add our hardware accelerators to the SoC by attaching them to the APB bus as peripherals. This is the setup we use for the real experiments.
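Before moving on to the experiments, here is a quick sketch of the generic Montgomery multiplication mentioned above, which is what lets our multiplier support an arbitrary odd modulus instead of relying on a cheap reduction for one fixed q. The modulus and word size below are illustrative assumptions, not values fixed by our design; the hardware version computes the same REDC step in a pipelined fashion.

```python
Q = 12289            # example odd modulus (an assumption for illustration)
K = 14               # R = 2^K with R > Q
R = 1 << K
Q_INV_NEG = (-pow(Q, -1, R)) % R   # precompute -Q^{-1} mod R once

def mont_mul(a, b):
    # Montgomery multiplication: returns a * b * R^{-1} mod Q using only
    # multiplications, additions, and shifts (no division by Q)
    t = a * b
    m = (t * Q_INV_NEG) & (R - 1)  # m = t * (-Q^{-1}) mod R
    u = (t + m * Q) >> K           # t + m*Q is exactly divisible by R
    return u - Q if u >= Q else u  # single conditional subtraction

def to_mont(a):
    return a * R % Q               # enter the Montgomery domain

def from_mont(a):
    return mont_mul(a, 1)          # leave the Montgomery domain
```

Keeping all NTT operands in the Montgomery domain means the extra R factors cancel throughout the butterflies, so the only costs are the one-time domain conversions at the boundaries.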
On the left side there is a workstation, which first programs the FPGA and later loads the compiled software code onto the RISC-V. On the right side we show an Artix-7 FPGA, which is actually the FPGA recommended by NIST for PQC hardware implementations. The RISC-V, together with the lattice-based accelerators we designed, runs on the FPGA. Once the computation on the FPGA is finished, the results are sent back to the workstation through the UART module.

Now let's take a look at the evaluation results. This table shows the speed-ups brought by the hardware accelerators for the different functions in lattice-based schemes, namely SHAKE, the Gaussian sampler, polynomial multiplication, and sparse polynomial multiplication. As we can see from the table, very good speed-ups are achieved for all of these functions. You may have noticed that the speed-up for the Gaussian sampler is especially high, much higher than for the other functions. This is partly due to the fact that the hardware accelerator itself accelerates the software function very well. Another important reason is that the data communication overhead between software and hardware is almost perfectly overlapped with the computation inside the sampler; therefore, the overhead becomes nearly negligible in this case. That's also why the speed-up of the hardware co-design is very high here.

This table shows the performance of qTESLA key generation for different design configurations. The leftmost column shows the different configurations; here, a plus sign means that the corresponding hardware accelerator is integrated with the Murax SoC. Depending on the user application, you can either add just one accelerator to the design, or you can add several accelerators in a merged fashion. If you want to achieve the full speed-up, you can simply add all the available hardware accelerators to the SoC.
We can see that when all the available hardware accelerators are added to the design, the best speed-up is achieved for both the qTESLA-p-I and qTESLA-p-III variants. For the p-III variant, an over 100 times speed-up is achieved, and this configuration also gives the best time-area product. We can see that although qTESLA is not competitive in terms of software performance, when you run qTESLA on our software-hardware co-design you can, for example for the p-I variant, finish the key generation operation in less than 8 milliseconds. We can observe a similar pattern for the signing operation of qTESLA: an around ten times speed-up is achieved when all the hardware accelerators are integrated, along with the smallest time-area product. Similarly, we see the same pattern for verification. We then compare the performance of qTESLA on the Murax SoC, with all the hardware accelerators integrated, against state-of-the-art implementations of two other lattice-based signature schemes, namely Dilithium and Falcon. One thing to note here is that, given that there is no dedicated hardware-based implementation of these schemes, a fair comparison is not possible, but let's still try to do some comparison. Here we can see that although the provably secure qTESLA variants are slower than the software implementations of Dilithium and Falcon, once the dedicated hardware accelerators are integrated into the design, they achieve comparable performance to the other two signature schemes. We also compared with a RISC-V-based software-hardware co-design which focuses on generic lattice-based schemes. That work included the performance of qTESLA with outdated heuristic parameters, so for a better comparison, we synthesized our design with those heuristic parameters and did the comparison.
Note that in this related work, they implemented a customized, non-standard processor that is closely coupled with the hardware part, because their design targets low-power and low-cycle-count IC applications. You can see that their work presents a much smaller cycle count, because they pack more computation into one clock cycle, but in the end this leads to a poor clock frequency. So overall, our design actually achieves better performance in terms of time, even though we are using a standard processor.

Okay, to summarize: in this work, we presented the design and implementation of several hardware accelerators for lattice-based schemes, including a unified SHAKE and cSHAKE core, a binary-search CDT-based Gaussian sampler, and an NTT-based polynomial multiplier, as well as a sparse polynomial multiplier and an h max-sum module. Further, we showcased the prototype of the full qTESLA scheme using the accelerators we developed. And last but not least, you can find our open-source code at this link. Thank you for your attention.