Hi everybody, my name is Andrea Basso and I am a PhD student at the University of Birmingham. Today I would like to present our work on a high-speed instruction-set coprocessor for lattice-based key encapsulation mechanisms, or Saber in hardware, where we present a full in-hardware implementation of Saber. This is joint work with my supervisor, Sujoy Sinha Roy. At the time of writing, Saber was one of the round-two submissions to the NIST post-quantum standardization process, but it has since become one of the four round-three finalists, and according to NIST itself, Saber is one of the most promising schemes being considered for standardization. Thus, investigating its hardware implementations is an interesting research problem. Saber also has some peculiar characteristics that make its implementation require a different approach compared to other lattice-based protocols. In particular, its parameter sets do not allow for the usage of the NTT. Thus, we wanted to investigate non-NTT-based polynomial multipliers and their performance. Let's quickly see how the Saber protocol works. There are two variants, a public-key encryption scheme and a key encapsulation mechanism. For the former, during key generation, the first party generates a random seed, from which they can then compute a matrix A. Then they generate the secret S, which is a vector consisting of polynomials with small coefficients. Lastly, they compute the vector B by multiplying the transpose of A by the secret S, then rescaling and rounding it. The public key then consists of the vector B together with the seed from which A can be recomputed. During encryption, the other party indeed recomputes the matrix A and generates their own secret S', in a similar fashion to S. Indeed, S' is also a vector consisting of polynomials with small coefficients.
Then they compute the vector B', similar to what happened during key generation, by multiplying the matrix A by the secret S', then rescaling and rounding it. Lastly, they compute the product between B, taken from the public key, and their secret S', and use this value to embed the message M. The ciphertext then consists of the vector B' and the polynomial Cm. During decryption, the first party computes the product between B', given in the ciphertext, and their secret S. Since this value is close to the product between B and S', it is then possible to recover the message M. To obtain the key encapsulation mechanism, we simply apply the Fujisaki-Okamoto transform to this protocol. From an implementation point of view, the Fujisaki-Okamoto transform mainly consists of calls to SHA-3 and SHAKE. So we can see from here that the majority of the computation has to do with either SHA-3/SHAKE or polynomial multiplication. For the former, we see that on software platforms Keccak, which is at the core of SHA-3 and SHAKE, takes about 70% to 80% of the overall computation time. However, on a hardware platform Keccak is extremely performant, and we simply decided to utilize the high-speed implementation provided by the Keccak team. One thing to note here is that Saber has a serialized approach to SHA-3/SHAKE: there is little use for multiple Keccak cores, since the computation cannot be easily parallelized. We thus decided to use only one core, since that allows for almost optimal performance while maintaining a moderate area consumption. So SHA-3/SHAKE was not our main concern, given the high speed of Keccak in hardware. Thus, we were left with polynomial multiplication as the main performance bottleneck, and indeed optimizing it was the main focus of our work. There are several notable characteristics of polynomial multiplication in Saber.
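The rescale-and-round step mentioned above is cheap precisely because both moduli are powers of two. Here is a minimal Python sketch of rounding one coefficient from Z_q down to Z_p, using Saber's actual values q = 2^13 and p = 2^10; the rounding-constant formulation is a simplified illustration, not the exact constant-vector form of the specification.

```python
# Toy sketch of Saber-style "rescale and round" for a single coefficient.
# q = 2^13, p = 2^10, so rounding drops the low 3 bits after adding
# half the dropped range -- a pure bit shift, no division needed.
EQ, EP = 13, 10          # log2(q) and log2(p)

def round_q_to_p(x):
    """Rescale x from Z_q to Z_p: add half the dropped range, truncate."""
    return ((x + (1 << (EQ - EP - 1))) >> (EQ - EP)) & ((1 << EP) - 1)
```

In hardware this is just an adder and a wire selection of the top bits, which is why the rounding steps contribute almost nothing to the cycle count.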
In particular, Saber bases its security on the module learning with rounding problem. Given the module nature of the problem, it can achieve different security levels by varying the module rank. This means that all polynomials keep the same degree, 255, across security levels. From an implementation point of view, this is highly convenient, because it means we can create only one polynomial arithmetic unit, dedicated to polynomials of degree 255, and reuse it across different security levels. Another characteristic of Saber is that it uses small secrets. Indeed, all secret polynomials have coefficients that fall in the range [-3, 3], [-4, 4], or [-5, 5], depending on the security level. Lastly, the main defining feature of Saber is that it uses power-of-two moduli instead of prime moduli. This means that all coefficient-wise multiplications take place modulo 2^13 or 2^10. This has the main advantage of rendering modular reduction free, in the sense that there is no need for a dedicated algorithm to compute the reduction, since reduction modulo a power of two is simply a bit truncation. However, it also has some important consequences, mainly that it does not allow the usage of the NTT. The number theoretic transform indeed requires the modulus to be prime, plus some other conditions, and it is the default go-to algorithm for polynomial multiplication because it has the best asymptotic performance. However, given the limited degree of the polynomials we are dealing with, 255, we believe that other approaches can also obtain high performance, in some cases even higher than the NTT. Indeed, in software, Saber uses an improved Toom-Cook-based algorithm, which incidentally is also presented here at CHES 2020. In hardware, however, Toom-Cook- and Karatsuba-based algorithms are not particularly convenient because of their recursive nature.
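The "free reduction" point can be made concrete in one line: reducing modulo 2^13 is just keeping the low 13 bits, so no Montgomery or Barrett machinery is needed. A small illustrative sketch:

```python
# Reduction modulo a power of two is a bit truncation: keeping the low
# 13 bits of a product equals reducing it mod 2^13. In hardware this is
# simply wiring up the low 13 bits -- no reduction circuit at all.
Q = 1 << 13

def mulmod_q(a, b):
    return (a * b) & (Q - 1)
```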
Indeed, being recursive, each level would have to be implemented differently. Furthermore, on hardware platforms we can exploit high levels of parallelism and create ad hoc solutions. So we decided to go for the schoolbook algorithm. The schoolbook algorithm, the classical algorithm that is also used to compute polynomial multiplications with pen and paper, is represented here. On a high level, it mainly consists of computing all possible coefficient-wise multiplications and adding them together. In more detail, we start by setting the accumulator to zero and then we iterate over all coefficients of the first polynomial. Each one of those is multiplied by all the coefficients of the second polynomial and the result is stored in the accumulator. Once the inner loop is completed, we multiply the second polynomial by x. Now, this last operation is particularly simple to implement, since we are operating modulo x^256 + 1. That means that on a hardware platform it can be easily implemented via a negacyclic shift, which is a shift where each coefficient moves by one position and the coefficient at the last position is sent to the first position while being multiplied by minus one. Such an approach has several advantages: it is very simple to implement, it is highly flexible, and it offers great performance. The reason why such an approach can offer great performance lies mainly in how we compute this line, the coefficient-wise multiplication and the accumulator update. Let's then see what our multiply-and-accumulate (MAC) units look like. We started from the fact that Saber uses small secrets. Therefore, a coefficient-wise multiplication between a 13-bit number and a small number can be efficiently computed via bit-shift operations and additions. Moreover, since we are operating modulo a power of two, there was no need for a modular reduction circuit.
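The loop structure just described can be sketched in Python. This is a toy model with N = 4 for readability (Saber uses N = 256 and q = 2^13), not our Verilog implementation:

```python
# Schoolbook negacyclic multiplication mod (x^N + 1) and mod q, exactly
# as in the loop above: accumulate a[i] * (b * x^i) for each i, where
# multiplying b by x is a negacyclic shift.
N, Q = 4, 1 << 13        # toy size; Saber uses N = 256, Q = 2^13

def poly_mul_schoolbook(a, b):
    acc = [0] * N                        # accumulator, initially zero
    for i in range(N):                   # iterate over coefficients of a
        for j in range(N):               # inner loop: multiply-accumulate
            acc[j] = (acc[j] + a[i] * b[j]) & (Q - 1)
        # multiply b by x mod (x^N + 1): the last coefficient wraps
        # around to position 0 with its sign flipped
        b = [(-b[N - 1]) & (Q - 1)] + b[:N - 1]
    return acc
```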
This means that we can implement a full MAC unit in only 50 LUTs. Given the limited area consumption of a MAC unit, we can parallelize several of them, and indeed we chose to use 256 MAC units in parallel, one per coefficient of the secret polynomial. This is the full architecture of our polynomial multiplier. We start with the BRAM, where all the data is stored. At the beginning, we load the secret polynomial inside a buffer that can store the whole polynomial. This is possible because the secret polynomial is relatively small, given that each coefficient is up to four bits long. Then, we also load the public polynomial. However, the public polynomial has 13-bit-long coefficients, and of course this does not play well with 64-bit-long BRAM words. Indeed, we often have that one coefficient is split across two different words. Thus, we had to create a buffer to resolve this issue. To reduce the overhead of loading the polynomial buffer, we also implemented a coefficient selector that allows reading the coefficients while they are being loaded. In this way, we reduce the overhead of loading the public polynomial to only one cycle per full polynomial multiplication. Then, we have 256 MAC units in parallel, each connected to one coefficient of the secret polynomial. Each cycle, we feed one coefficient of the public polynomial to all MAC units, which then compute the multiplication between that public coefficient and the secret coefficient they are connected to. They then store the result in the accumulator coefficient they are connected to. By doing this, we can compute the full inner loop of the algorithm in one cycle. We then need to compute the multiplication by x, given by the negacyclic shift, which is computed at the end of each cycle.
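To make this dataflow concrete, here is a cycle-level toy model in Python (N = 4 instead of 256). One consistent way to realize the end-of-cycle shift is to rotate the accumulator and stream the public coefficients from the top down, which amounts to a Horner-style evaluation of the product mod x^N + 1; this sketch assumes that ordering and is an interpretation of the dataflow, not the actual RTL.

```python
# Per-cycle model of the parallel multiplier: each "cycle", one public
# coefficient is broadcast to all N MAC units (each wired to one secret
# coefficient), then the accumulator is negacyclically shifted.
N, Q = 4, 1 << 13        # toy size; the real design uses N = 256

def poly_mul_parallel(pub, sec):
    acc = [0] * N
    for i in reversed(range(N)):         # one "cycle" per public coeff
        c = pub[i]                       # broadcast coefficient
        # all N MAC units fire in the same cycle
        acc = [(acc[j] + c * sec[j]) & (Q - 1) for j in range(N)]
        if i:                            # end-of-cycle negacyclic shift
            acc = [(-acc[N - 1]) & (Q - 1)] + acc[:N - 1]
    return acc
```

The full product thus takes N broadcast cycles, matching the "full inner loop in one cycle" claim above.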
With such an approach, we can compute a full polynomial multiplication in less than 256 cycles, which goes to show that non-NTT-based polynomial multiplication can still achieve very high levels of performance. We also implemented the remaining building blocks, and we decided to organize the design as an instruction-set coprocessor architecture. That means that each block is built independently of the others and everything is handled by a central controller. In this way, we do lose some slight performance: it is not possible to parallelize the different blocks, since they are handled in a sequential manner by the program memory. However, this has several advantages. Firstly, it is highly modular. Indeed, it is possible to replace, add or remove each module independently of the others. This allows us to propose a generic framework that can also work for other protocols. For instance, if one wants to implement Kyber, which uses an NTT-based polynomial multiplier, it is possible to simply replace this polynomial multiplier with an NTT-based one while reusing all the other components. Furthermore, such an architecture allows for programmability. To demonstrate the advantages of programmability, we propose a unified architecture that can compute operations for LightSaber, Saber and FireSaber, the three security levels of Saber that target 128, 192 and 256 bits of security, respectively. To further demonstrate the flexibility of our design, we also propose a framework to target any performance-area trade-off. In particular, we implemented a variant of the MAC unit that fits two coefficient-wise multipliers; in this way, the overall polynomial multiplier fits 512 coefficient-wise multipliers. Thus, it is possible to complete a full polynomial multiplication in only 128 cycles. For the overall encryption and decryption performance, this brings about a 20% improvement in speed, but of course it comes with a significant area cost.
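The doubled variant can be modelled the same way: each "cycle" now consumes two public coefficients (assuming N is even), halving the cycle count from N to N/2. Again, this is a toy Python model of the arithmetic, not the RTL:

```python
# Toy model of the doubled MAC variant: two broadcast public
# coefficients per "cycle", with a negacyclic shift between and after
# them, so a full product needs only N/2 cycles instead of N.
N, Q = 4, 1 << 13        # toy size; the real design uses N = 256

def negacyclic_shift(p):
    # multiply by x mod (x^N + 1): last coefficient wraps, negated
    return [(-p[N - 1]) & (Q - 1)] + p[:N - 1]

def poly_mul_double(pub, sec):
    acc = [0] * N
    for i in range(N - 1, 0, -2):        # two public coeffs per cycle
        acc = [(acc[j] + pub[i] * sec[j]) & (Q - 1) for j in range(N)]
        acc = negacyclic_shift(acc)
        acc = [(acc[j] + pub[i - 1] * sec[j]) & (Q - 1) for j in range(N)]
        if i > 1:                        # no shift after the last cycle
            acc = negacyclic_shift(acc)
    return acc
```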
Let's now see the performance of all the operations. We implemented our design in Verilog and ran it on an UltraScale+ FPGA. We see that key generation requires less than 22 microseconds, encapsulation less than 27 microseconds, and decapsulation about 32 microseconds. This gives us a very high throughput of about 46,000 key generations per second, almost 38,000 encapsulations per second, and about 31,000 decapsulations per second. If we now look at the breakdown, we can see the importance and the relevance of polynomial multiplication: it requires about 50% of the overall computation time, while Keccak computations take about one third. Such high performance still comes with only a moderate area consumption. Indeed, our design consumes less than 9% of the LUTs available on the FPGA and less than 2% of the flip-flops. Moreover, our design does not use DSPs at all and only consumes two BRAM tiles. If we look again at the breakdown, we can see again the relevance of polynomial multiplication: it consumes about three quarters of the LUTs and about one half of the flip-flops used by the overall architecture, whereas the Keccak core only requires between about one fifth and one third of the resources. Given the relatively moderate area consumption, it is actually possible to fit up to 11 parallel coprocessors on a single FPGA. If we do so, we can achieve an incredibly high throughput of about 500,000 key generations per second, 400,000 encapsulations per second, and 350,000 decapsulations per second. We can now compare our results to existing implementations, both of other post-quantum algorithms and of previous implementations of Saber. Compared to some implementations, our proposed solution achieves orders-of-magnitude speed-ups, whereas for others our results are more comparable.
If we focus our attention on existing implementations of Saber, we can see that we achieve about two orders of magnitude of speed-up compared to the first reported implementation, whereas we achieve about half the computation time of the second implementation of Saber. We achieve such speed-ups while also reducing the DSP count from 256 down to zero. We are very excited about this work, and we would like to extend it further along three main directions. Firstly, we would like to investigate other protocols. As mentioned, our full architectural design is highly modular, which allows for easily replacing individual modules and makes targeting other protocols fairly simple. The main candidate, of course, would be Kyber, given its similarity with Saber, but also other lattice-based schemes. We are also interested in investigating signature schemes, how well they can play together with key encapsulation mechanisms, and whether there can be significant resource reutilization. When designing this architecture, we mainly focused on achieving high performance, even at the cost of greater area or energy consumption. We would also like to look into a lightweight implementation that would reduce the number of multipliers and make all sorts of accommodations to reduce the overall energy consumption. Lastly, we are very interested in researching the side-channel resistance of the current implementation and obtaining a fully side-channel-secure implementation, which would mainly mean having a masked implementation. This is particularly interesting because the proposed polynomial multiplier exploits the fact that Saber uses small coefficients for the secret polynomial. However, in a masked implementation this property is lost, since the secret coefficients are then masked. Thus, we would need to find a different approach that can guarantee side-channel security while also maintaining high performance.
So, to summarize, we propose a complete hardware architecture for Saber that can compute all three KEM operations, key generation, encapsulation and decapsulation, and can target all security levels: LightSaber, Saber, and FireSaber. We did this with very high performance levels, while still maintaining a high degree of flexibility and only a moderate area consumption. All of our code is available on GitHub. Going beyond Saber, we also propose a generic framework that can be used for other lattice-based protocols, and we demonstrate that high performance can also be achieved with non-NTT-based polynomial multipliers. Thank you very much for your attention.