Hello! I want to present to you the topic RISQ-V – tightly coupled RISC-V accelerators for post-quantum cryptography. My name is Tim Fritzmann, and this was joint work with Georg Sigl and Johanna Sepúlveda. As a reaction to the fast development of quantum computers, the National Institute of Standards and Technology (NIST) announced the standardization process for post-quantum cryptography in 2016. And just a couple of weeks ago, the third round of this standardization process was announced. Lattice-based cryptography forms the largest class within this competition. In recent years, there has been a strong focus on efficient implementations. While software implementations achieve a very high flexibility, hardware implementations achieve a high performance. In order to combine both advantages, hardware/software co-design strategies can be used. Now the question is: which operations should we accelerate in hardware? To answer this question, let us have a look at the ring learning with errors problem. It can be used as the main tool to create ideal lattice-based cryptography. In order to create a ring learning with errors instance, we require polynomial multiplications and additions. Moreover, we need to sample random polynomials such as the random polynomial A, the secret polynomial S, and the error polynomial E. So the performance bottlenecks of ideal lattice-based cryptography are the sampling of random polynomials and the polynomial multiplication. One of the first hardware/software co-designs for post-quantum cryptography was developed for the scheme NTRU. In that work, the ARM processor of an FPGA SoC was coupled with a ternary hardware multiplier implemented in the programmable logic. And in recent years, we have seen a clear trend towards the development of RISC-V-based hardware/software co-designs. 
Post-quantum schemes were implemented on RISC-V and accelerated using polynomial multiplication and sampling accelerators. However, most of the previous designs are based on loosely coupled accelerators, for example connected via an AXI bus. The disadvantage of a loose coupling is that the connection to a bus system leads to a high communication overhead. In order to decrease this communication overhead, one can either implement the whole cryptographic algorithm in hardware, or at least large parts of it. But this, on the other side, leads to a high area consumption and also decreases the flexibility. Moreover, the area is quite high for loosely coupled accelerators because they require large I/O buffers and also hardware resources for the control circuit. Now in this work, we investigate the suitability of tightly coupled accelerators for the post-quantum schemes NewHope, Kyber, and Saber. In contrast to a loosely coupled design, tightly coupled accelerators do not require complex bus communication, and they are usually designed in a flexible fashion. The disadvantages are that the instruction set architecture must be extended and the core must be modified, which complicates the integration process. Previous tightly coupled accelerators accelerate the modular arithmetic by integrating a finite field accelerator. In this work, however, we develop more powerful hardware accelerators, and accelerators for all the performance bottlenecks. So instead of only using a single finite field multiplier, we developed a vectorized modular arithmetic unit capable of performing parallel modular additions, subtractions, multiplications, and further operations. Moreover, we developed accelerators for the NTT, hash, and sampling computations. We integrated our accelerators directly into a RISC-V core and extended the RISC-V instruction set by 29 new instructions. 
Some algorithms use the number theoretic transform (NTT) in order to increase the performance of the polynomial multiplication, as the polynomial multiplication in the spectral domain corresponds to a coefficient-wise multiplication. In order to transform a polynomial into the spectral domain, we need to multiply the coefficients by the n-th root of unity to the power of i·j, which is also called the twiddle factor, and by a so-called pre-processing scaling factor. For the inverse NTT, we do not require a pre-processing step, but a post-processing step. As the polynomial length has a high influence on the performance but also on the memory footprint, we developed two different approaches for NewHope and Kyber. For large polynomial lengths, which means for NewHope, we compute the twiddle factors on the fly in order to reduce the pre-computations and also the memory footprint. For a fast access to the floating-point register set, we also developed an address controller in order to increase the performance. For small polynomial lengths, that means for Kyber, we use an LUT-based approach in order to avoid a post-processing step, as the memory consumption is not as high as for NewHope, for example. Moreover, we use two different algorithms for the forward and inverse transform in order to avoid a bit reversal step. As the modulus in Saber is not directly suitable for the NTT, we use a significantly larger prime in order to avoid precision errors in the NTT computation. And in order to use the same hardware architecture for NewHope, Kyber, and Saber, we split this large prime into several smaller pieces, perform the multiplications on the several shares, and recombine the results using the Chinese remainder theorem. This approach, unfortunately, is less efficient than other ones, therefore we do not go into the details in this presentation. 
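To make the role of the twiddle and scaling factors concrete, here is a minimal Python sketch of an NTT-based polynomial multiplication modulo x^n + 1. The parameters (q = 17, n = 8, γ = 3) are toy values chosen for illustration, not the actual NewHope or Kyber parameters; the pre-processing multiplies by powers of γ, and the inverse transform is followed by the post-processing with n⁻¹ and γ⁻ⁱ:

```python
def naive_ntt(a, q, n, omega):
    # O(n^2) forward transform: A_i = sum_j a_j * omega^(i*j) mod q
    return [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q
            for i in range(n)]

def negacyclic_mul_ntt(a, b, q=17, n=8, gamma=3):
    # gamma is a primitive 2n-th root of unity mod q; omega = gamma^2
    omega = pow(gamma, 2, q)
    # pre-processing: scale coefficient i by gamma^i (folds in x^n = -1)
    a_s = [a[i] * pow(gamma, i, q) % q for i in range(n)]
    b_s = [b[i] * pow(gamma, i, q) % q for i in range(n)]
    # coefficient-wise multiplication in the spectral domain
    c_hat = [x * y % q for x, y in zip(naive_ntt(a_s, q, n, omega),
                                       naive_ntt(b_s, q, n, omega))]
    # inverse transform uses omega^-1; post-process with n^-1 * gamma^-i
    c = naive_ntt(c_hat, q, n, pow(omega, q - 2, q))
    n_inv, g_inv = pow(n, q - 2, q), pow(gamma, q - 2, q)
    return [c[i] * n_inv % q * pow(g_inv, i, q) % q for i in range(n)]

def schoolbook_negacyclic(a, b, q=17, n=8):
    # reference implementation: multiply mod x^n + 1 directly
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c
```

Both paths compute the same negacyclic product; the NTT path replaces the quadratic convolution by transforms plus a single coefficient-wise multiplication.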
To discuss the optimizations that we use for the NTT, let us have a look at the following NTT diagram, which has 16 input coefficients, and let us assume that we have four registers available in our register set. The first step will be to load the first eight coefficients into our register set, where we store two coefficients in one register. The next step will be to load the first coefficients A0 and A8 into the butterfly unit BF0, where a modular multiplication, an addition, and a subtraction are performed, which is the so-called butterfly operation. At the same time, we will load the coefficients A4 and A12 into the butterfly unit BF1 and also perform a butterfly operation there. Before we write the result back to the register set, we perform a swapping operation. This swapping operation is used to prepare the coefficients for the next layer. The same approach is applied to the next four coefficients. And now, instead of finalizing the first layer, we continue with the second layer, as the coefficients for the first layer are not available in the register set. Continuing with the second layer saves us all the store operations of the first layer and all the load operations of the second layer. The coefficients for the third round are not all in the register set, therefore we have to refresh our register set, and we will continue with the first layer. After four layers, our result is available and we can proceed. So our optimizations can be summarized as follows. We calculate the powers of omega, the twiddle factors, on the fly. We store two coefficients in one register in order to decrease the memory accesses. We swap the coefficients in hardware in order to prepare them for the next round. We compute two parallel butterfly operations, and we calculate the next layer before finalizing the previous one. This saves us a lot of store and load operations. 
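The packed-register trick described above can be modeled in a few lines of Python. This is an illustrative sketch only — the 32-bit register width, the low/high packing layout, and the exact swap wiring are assumptions for the example, not the precise RISQ-V datapath: two 16-bit coefficients share one word, each butterfly computes (a + ω·b, a − ω·b) mod q, and the swap regroups the results for the next layer:

```python
MASK16 = 0xFFFF

def pack(lo, hi):
    # two 16-bit coefficients in one 32-bit register word
    return (lo & MASK16) | ((hi & MASK16) << 16)

def unpack(word):
    return word & MASK16, (word >> 16) & MASK16

def butterfly(a, b, omega, q):
    # Cooley-Tukey butterfly: (a + omega*b, a - omega*b) mod q
    t = omega * b % q
    return (a + t) % q, (a - t) % q

def packed_butterfly_swap(r1, r2, omega, q):
    # two parallel butterflies on the halfwords of r1 and r2, then a
    # swap so that the next layer finds its operand pairs in one register
    a0, a1 = unpack(r1)
    b0, b1 = unpack(r2)
    u0, v0 = butterfly(a0, b0, omega, q)
    u1, v1 = butterfly(a1, b1, omega, q)
    return pack(u0, u1), pack(v0, v1)
```

With q = 17 and ω = 9, feeding in the packed pairs (1, 2) and (3, 4) produces the packed sums in one output word and the packed differences in the other, which is exactly the regrouping the next layer needs.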
Now let us have a look at the NTT hardware architecture. It has as main input the content of two registers from the general-purpose or the floating-point register set. Depending on the respective instruction from the instruction decoder, the NTT and modular arithmetic unit performs its operation and stores the result back into the register set. The NTT and modular arithmetic unit consists of three parts: an address unit for the automatic address calculation of the first merged NTT layers, a twiddle update unit to compute the twiddle factors on the fly, and the modular arithmetic unit. We will now have a closer look at the modular arithmetic unit. The modular arithmetic unit is capable of performing two parallel butterfly operations. It first takes the higher halfwords of the first and second register, H1 and H2, and multiplies them with the twiddle factor omega. The result is then forwarded to the modular adders and subtractors, where the modular additions and subtractions with the lower halfwords are performed. The modular arithmetic unit is also capable of performing the post-processing step for the inverse transform, that is, a multiplication with n to the power of minus one and gamma to the power of minus i. Therefore we have two operations: the first one takes the first register and multiplies it with the scaling factor, and the second one takes the second register and multiplies it with the scaling factor. In order to avoid lookup tables for the scaling factor, we compute the scaling factor on the fly by reusing the existing hardware multipliers of the modular arithmetic unit. Moreover, our modular arithmetic unit is capable of performing vectorized modular additions, subtractions, and multiplications. We followed the SIMD principle, which means that we compute several operations with only a single instruction. 
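The on-the-fly computation of the post-processing scaling factor can be illustrated as follows. This Python sketch (with generic toy parameters, not the concrete unit) keeps a running scale n⁻¹·γ⁻ⁱ that is updated with one extra modular multiplication per coefficient, so no lookup table is required:

```python
def inverse_postprocess(c, q, n, gamma):
    # multiply coefficient i by n^-1 * gamma^-i, updating the scale on the fly
    n_inv = pow(n, q - 2, q)       # modular inverses via Fermat's little theorem
    g_inv = pow(gamma, q - 2, q)
    scale = n_inv
    out = []
    for coeff in c:
        out.append(coeff * scale % q)
        scale = scale * g_inv % q  # one multiply instead of a table lookup
    return out
```

The running update reuses the same multiplier that already exists for the butterfly, which mirrors the resource-reuse argument made above.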
To evaluate the performance of our accelerators, we compare the cycle count of the NTT for the following implementations: the latest assembler-optimized Cortex-M4 implementation, a RISC-V-based implementation using the finite field multiplier, and our work, which uses the NTT and modular arithmetic unit. The results show a clear cycle count reduction when using our approach. In contrast to the other two works, we stick to the NewHope reference implementation and require a bit reversal step for the inverse transformation. In order to accelerate this function, we developed a dedicated bit reversal function. Apart from a cycle count reduction, we were also able to reduce the memory footprint. For example, for the scheme NewHope-1024 we reduced the amount of pre-computations from more than 7000 bytes to 44 bytes. Now, as the NTT is not directly suitable for all the schemes, we will also have a look at another multiplication approach: the Karatsuba step. Karatsuba splits two polynomials into length-m/2 polynomials, a lower part and a higher part, and instead of using four polynomial multiplications with the smaller polynomials, the Karatsuba step requires only three polynomial multiplications. Karatsuba can also make heavy use of the multiply-accumulate function. As in Saber the result can be directly reduced to a halfword after the multiplication, we can use a vectorized modular multiply-accumulate function, which we call in this work the PQ-MAC operation. The PQ-MAC operation and the hardware architecture are shown on the following slide. First we take the lower halfwords of the source registers Rs1 and Rs2 and perform a multiply-accumulate operation. Then we also take the higher halfwords and perform a MAC operation, and the results are recombined in the destination register Rd. 
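One Karatsuba step can be sketched in Python as follows. This is a generic illustration (with an arbitrary modulus, not Saber's concrete parameters): the three products z0 = a_l·b_l, z2 = a_h·b_h, and z1 = (a_l + a_h)(b_l + b_h) replace the four schoolbook sub-products, with z1 − z0 − z2 supplying the middle term:

```python
def schoolbook(a, b, q):
    # plain polynomial multiplication mod q, result length len(a)+len(b)-1
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % q
    return c

def karatsuba_step(a, b, q):
    # one splitting step: three half-size multiplications instead of four
    m = len(a)
    h = m // 2
    a_l, a_h = a[:h], a[h:]
    b_l, b_h = b[:h], b[h:]
    z0 = schoolbook(a_l, b_l, q)
    z2 = schoolbook(a_h, b_h, q)
    z1 = schoolbook([(x + y) % q for x, y in zip(a_l, a_h)],
                    [(x + y) % q for x, y in zip(b_l, b_h)], q)
    mid = [(z1[i] - z0[i] - z2[i]) % q for i in range(len(z1))]
    # recombine: z0 + mid * x^h + z2 * x^(2h)
    c = [0] * (2 * m - 1)
    for i, v in enumerate(z0):
        c[i] = (c[i] + v) % q
    for i, v in enumerate(mid):
        c[i + h] = (c[i + h] + v) % q
    for i, v in enumerate(z2):
        c[i + 2 * h] = (c[i + 2 * h] + v) % q
    return c
```

The inner half-size multiplications are exactly where a vectorized multiply-accumulate pays off, since each output coefficient is a running sum of products.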
When carefully using our PQ-MAC operation, we can reduce the amount of clock cycles by more than 30 percent compared to the reference implementation. Now, the sampling of random polynomials is the second performance bottleneck. It usually requires a huge amount of randomness, and this randomness can be obtained by taking a truly random seed and expanding it using a pseudo-random number generator, for example the SHAKE function from the Keccak family. Previous Keccak hardware implementations either use a completely standalone core, which however requires a lot of resources for the I/O buffers and for the control circuit, or they connect the Keccak core to a 32-bit system memory interface. In this work we follow a completely different approach and store the Keccak state in the register sets: in the floating-point register set and in a part of the general-purpose register set. This allows us a fully parallel access to the complete Keccak state, and instead of accelerating the whole cryptographic Keccak algorithm, we accelerate the performance bottleneck of the Keccak operation, which is one round of the Keccak state permutation. In order to keep a high flexibility, the rest of the Keccak algorithm is performed in software. We use the following design strategies. First, we reuse existing hardware resources. Then, we keep the state for the whole sampling process within the registers. That means we do not write the state back to the system memory, which saves us a lot of load and store operations, and when we require fresh randomness for our sampling process, we simply permute the state in the registers. When comparing our result with previous ones, we can see that we do not require any further registers. 
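In software, the seed-expansion idea looks like the following Python sketch using the standard hashlib SHAKE-128 extendable-output function. This models only the functional behavior — one short true-random seed squeezed into an arbitrarily long pseudorandom byte stream — not the register-resident state handling described above:

```python
import hashlib

def expand_seed(seed: bytes, n_bytes: int) -> bytes:
    # SHAKE-128 is an extendable-output function (XOF): from one short
    # seed, an arbitrary number of pseudorandom bytes can be squeezed
    return hashlib.shake_128(seed).digest(n_bytes)
```

The expansion is deterministic in the seed, so two parties sharing a seed derive the same public polynomial, while different seeds give independent-looking streams.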
Moreover, we were able to decrease the amount of logic because we do not require a control circuit, and we also have to consider that previous works, such as in the reference, would additionally require some resources for the connection to the bus system. The secret and error polynomials in ring learning with errors based schemes usually require binomially distributed samples. In order to turn a uniform sample into a binomial one, we can use the following equation, which is basically a modular subtraction of the Hamming weights of the k-bit integers b and b'. This can also be done by the following circuit, which computes the Hamming weights for several values of k, where k depends on the parameter set of the post-quantum algorithm, and the multiplexers then forward, depending on the mode signal, the respective Hamming weight to the modular subtractors. In order to evaluate the performance of our accelerators, we integrate them directly into a 32-bit RISC-V processor, which has a four-stage in-order execution pipeline consisting of a prefetch buffer, instruction decoder, ALU, and load and store unit. We enhance the design by two completely new elements: the PQ-RF-ALU and the PQ-ALU. The PQ-RF-ALU contains the accelerators which require a direct coupling to the general-purpose and floating-point register sets, while the PQ-ALU has a similar structure as the usual ALU and requires only the content of two source registers and one destination register. In order to use the existing multipliers of the multiplication unit, we integrate the PQ-MAC operation directly into the multiplication unit. To evaluate the performance, we compare the cycle counts of the NewHope, Kyber, and Saber implementations for the following implementations. 
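The centered binomial sampling described above can be sketched functionally in Python (the function names and the little-endian bit consumption order are illustrative assumptions; k is the parameter-set-dependent width): each coefficient is the Hamming weight of k uniform bits minus the Hamming weight of another k uniform bits, reduced mod q:

```python
def cbd_sample(bits, k, q):
    # centered binomial distribution: HW(b) - HW(b') mod q, where b and b'
    # are two k-bit chunks drawn from a uniform bit stream
    b = bits & ((1 << k) - 1)
    b_prime = (bits >> k) & ((1 << k) - 1)
    return (bin(b).count("1") - bin(b_prime).count("1")) % q

def sample_poly(byte_stream, n, k, q):
    # consume 2k bits per coefficient from a SHAKE-expanded byte stream
    value = int.from_bytes(byte_stream, "little")
    coeffs = []
    for _ in range(n):
        coeffs.append(cbd_sample(value, k, q))
        value >>= 2 * k
    return coeffs
```

The hardware circuit does the same per-coefficient work in parallel: popcount trees for the supported values of k, multiplexers selecting the active width, and modular subtractors producing the signed result.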
The latest Cortex-M4 implementation, which uses assembler optimizations; the RISC-V-based design using the finite field multiplier; our baseline implementation, which consists of the reference implementation compiled for our target platform; an implementation which uses the sampling accelerators; and an implementation which uses all the accelerators developed in this work. Basically, we can draw three different conclusions. The first conclusion is that when we compare our baseline implementation with the Cortex-M4 implementation, we can see a large performance gap. Therefore, we require hardware accelerators to get to a similar performance. The second conclusion is that when we use our sampling accelerators, consisting of the Keccak accelerator and the binomial sampling unit, we already get very close to the cycle count of the Cortex-M4 implementation, and for some schemes we are even better. The third conclusion is that when using stronger accelerators than only a single finite field accelerator, we can achieve a clear cycle count reduction. That means it is worthwhile to develop powerful accelerators, and accelerators for all the performance bottlenecks. It also has to be noted that Saber uses fewer hardware resources than, for example, NewHope or Kyber. Therefore, it is difficult to compare between those schemes. Now, when looking at the ASIC synthesis results of our RISQ-V implementation, which integrates the powerful tightly coupled accelerators, we can measure a cell count increase by a factor of 1.6. However, we were also able to reduce the energy consumption by factors of up to 9.5 for NewHope, 7.7 for Kyber, and 2.1 for Saber. To summarize this talk: in this work we developed RISQ-V, an enhanced RISC-V architecture that integrates powerful tightly coupled accelerators. Moreover, we developed instruction set extensions for the modular arithmetic, but also for the other performance bottlenecks, such as the Keccak and binomial sampling. 
Our design strategies were to reuse existing hardware resources, such as the general-purpose and floating-point register sets or the multipliers of the multiplication unit, and to reduce the memory access rate for the NTT computation and also the Keccak computation. So this summarizes my talk. Thank you very much for your attention.