Okay, thank you for the introduction. Good morning, everyone. Today I'm here to present Sapphire, which is a configurable lattice crypto-processor that we used for an ASIC demonstration of several post-quantum key encapsulation and digital signature schemes. We already know why post-quantum crypto is important, and this entire session is on post-quantum cryptography. In the previous two talks, we looked at different ways of implementing the NTT and how we can implement the existing lattice-based schemes on hardware designed for pre-quantum cryptography. In this talk, I will present an ASIC implementation for accelerating lattice-based cryptography schemes. NIST has been standardizing post-quantum crypto recently, and round two has 26 candidates, which can be divided into these broad categories. Lattice-based cryptography accounts for almost half of them, and this includes both key encapsulation and digital signature schemes. This was our motivation behind building dedicated hardware for lattice-based cryptography. We have configurable parameters in this implementation, and we use it to demonstrate five round-two candidates from the NIST standardization. This way, we also enable post-quantum crypto for low-power embedded devices. In this work, we are mostly focusing on protocols based on the learning with errors (LWE) problem and its variants: basically LWE, Ring-LWE, and Module-LWE. LWE uses matrix computations, Ring-LWE works on polynomials, and Module-LWE involves operating on matrices of polynomials. From an implementation point of view, of course we need the standard arithmetic and logical operations, but apart from that, these schemes have three main computational requirements. First, we have modular arithmetic. LWE uses power-of-two moduli, which makes modular reduction very straightforward, but Ring-LWE and Module-LWE use small prime moduli.
Next, we have polynomial arithmetic, especially the number theoretic transform (NTT), which is used to speed up polynomial multiplication. It's one of the main components of Ring-LWE and Module-LWE, and it's what makes them so efficient. Finally, for all three kinds of schemes, we need to sample matrices and polynomials from different discrete probability distributions, and if you look at software implementations, the sampling process is actually one of the most expensive components. So we accelerate all of these computations in our configurable crypto-processor. This is the overall architecture of our hardware implementation. We have dedicated circuitry to perform the number theoretic transform, modular arithmetic, sampling, and other polynomial arithmetic operations. As I mentioned before, all the key protocol parameters, such as the modulus, the size of the polynomials, and the error distributions, can be configured at runtime. We also have a small one-kilobyte instruction memory, which we program using custom instructions to accelerate entire protocol functions. This is the outline of my talk for today. First, I will talk about some implementation aspects of our lattice crypto hardware. Then we'll look at our test chip architecture and how we have implemented the different lattice-based schemes. Finally, I will summarize with some measurement results and preliminary side-channel analysis of our test chip. First, we take a look at modular multiplication. In our implementation, we support a fully configurable 24-bit prime modulus, and we use a 24-bit multiplier followed by standard Barrett reduction. The algorithm for Barrett reduction is shown here. To explore the different trade-offs between efficiency and flexibility in the modular reduction process, we implemented two different architectures for the modular multiplier.
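Since the slide with the algorithm isn't reproduced in this transcript, here is a small Python sketch of the Barrett reduction step; the parameter names m, k, and q follow the talk, but the concrete values and code structure are illustrative, not the chip's exact datapath.

```python
def barrett_reduce(x: int, q: int, k: int) -> int:
    """Reduce x modulo q using Barrett reduction.

    m = floor(2**k / q) is the precomputed reciprocal; with k chosen
    large enough relative to q, the quotient estimate is off by at
    most one, so a single conditional subtraction finishes the job.
    """
    m = (1 << k) // q          # precomputed once per modulus q
    t = (x * m) >> k           # cheap estimate of x // q (shift, no divide)
    r = x - t * q              # candidate remainder, lies in [0, 2q)
    if r >= q:                 # at most one correction step
        r -= q
    return r

# e.g. a 24-bit product reduced modulo Kyber's prime q = 3329
q, k = 3329, 24
assert barrett_reduce(123456, q, k) == 123456 % q
```

The point of the two architectures in the talk is where m, k, and q come from: fully configurable registers in the first design, versus hard-wired constants for a fixed set of primes in the second.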
This is the first multiplier architecture, where the prime modulus is fully configurable, which means that all the Barrett reduction parameters, like m, k, and q, can be configured freely. Because it is configurable, we need two more multipliers for the reduction process. This is our second architecture for modular multiplication, which is not as flexible. We call it reduction with pseudo-configurable modulus, where we can choose the prime q from a set of commonly used primes, and the rest of the reduction process is hard-coded in digital logic. We lose some flexibility here, but now we can exploit the special structure of the primes to make the modular reduction more efficient. From simulation and measurement results, we observed that this modular reduction is around six times more energy-efficient than the previous one, and the overall modular multiplication is around three times more efficient. This modular multiplier is then part of a butterfly unit, which is used to accelerate the NTT computations. The multiplier, adder, and subtractor inside the butterfly are all reused for the other polynomial arithmetic operations. Also, to provide further flexibility and to eliminate the need for expensive bit reversals, our butterfly unit can be used in both the Cooley-Tukey and the Gentleman-Sande configurations. Now we look at how we have implemented the number theoretic transform using these modular arithmetic and butterfly units. If you look at traditional NTT hardware implementations, there are two memory banks, and the butterfly inputs and outputs will usually ping-pong between these memory banks, which are typically implemented using two-port or four-port RAMs. Now, the NTT memory and logic account for a majority of the area in ASIC implementations of lattice-based schemes, so using two-port or four-port RAMs poses a large area overhead.
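To make the two butterfly configurations concrete, here is a small Python sketch (an illustration, not the chip's datapath) of the Cooley-Tukey and Gentleman-Sande butterflies; chaining one after the other with the inverse twiddle recovers twice the inputs, which is one way to see how supporting both lets forward and inverse transforms share hardware.

```python
def ct_butterfly(a, b, w, q):
    # Cooley-Tukey (decimation-in-time): multiply first, then add/subtract
    t = (b * w) % q
    return (a + t) % q, (a - t) % q

def gs_butterfly(a, b, w, q):
    # Gentleman-Sande (decimation-in-frequency): add/subtract first, then multiply
    return (a + b) % q, ((a - b) * w) % q

# A GS butterfly with the inverse twiddle undoes a CT butterfly,
# up to a factor of 2 on each output:
q, w = 3329, 17                  # Kyber's prime, an arbitrary twiddle factor
x, y = ct_butterfly(5, 7, w, q)
u, v = gs_butterfly(x, y, pow(w, -1, q), q)
assert (u, v) == (10, 14)        # (2*5, 2*7) mod q
```

Both butterflies use one modular multiplier, one adder, and one subtractor, which is exactly the set of units the talk says are reused for the other polynomial arithmetic operations.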
In our architecture, we propose a single-port-RAM-based NTT to solve this problem. Specifically, we refer to the constant-geometry FFT data flow. We take each polynomial and split it among four single-port RAMs instead of storing it in one dual-port RAM, and this splitting is done based on the parity of the coefficient indices. Using this technique, we achieve around 30% area savings compared to the dual-port implementation. The constant-geometry data flow still allows the butterfly inputs and outputs to alternate between these single-port RAMs without any read or write hazards. To demonstrate how our NTT data flow works, here we have shown a toy example with an eight-point NTT, both for decimation-in-time and decimation-in-frequency. In each stage of the NTT, the two inputs of the butterfly come from two different single-port RAMs, and the outputs are written to two other single-port RAMs. Therefore, we still have one butterfly per cycle, and we maintain throughput. Also, compared to the traditional implementation, we don't have any extra energy overhead. In fact, we save some energy, because reading from a dual-port ten-transistor SRAM bit cell consumes more energy than reading or writing a single-port six-transistor bit cell. Finally, we look at our implementation of the sampler. One of the key components of the distribution sampler in lattice-based schemes is a cryptographically secure PRNG, and because sampling accounts for a major part of the computational overhead in these schemes, it is important that we have an energy-efficient PRNG. We have compared some of the standard primitives in hardware in terms of energy per bit and also number of bits per cycle, and for a fair comparison, we have implemented them as fully parallel architectures with comparable area-energy product.
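One way to see why this works: with a constant-geometry schedule, butterfly i always reads coefficients (i, i + n/2) and writes (2i, 2i + 1), so a bank assignment keyed to one "which half" bit and one parity bit keeps the two reads, and likewise the two writes, in different RAMs. The rule below is an illustrative assumption for the sake of the sketch, not necessarily the exact mapping on the chip.

```python
def bank(j: int, n: int) -> int:
    # Illustrative bank-assignment rule (an assumption): one bit from
    # which half of the array index j falls in, one bit from its parity,
    # giving four single-port RAM banks.
    return 2 * (j >= n // 2) + (j & 1)

def check_stage(n: int) -> None:
    # Constant-geometry stage: butterfly i reads (i, i + n/2)
    # and writes (2*i, 2*i + 1).
    for i in range(n // 2):
        r0, r1 = bank(i, n), bank(i + n // 2, n)
        w0, w1 = bank(2 * i, n), bank(2 * i + 1, n)
        assert r0 != r1, "two reads must hit different single-port RAMs"
        assert w0 != w1, "two writes must hit different single-port RAMs"

check_stage(8)     # the 8-point toy example from the talk
check_stage(1024)  # a NewHope-1024-sized polynomial
```

The reads differ in the "half" bit and the writes differ in the parity bit, so no single-port bank ever has to serve two accesses of the same butterfly in one cycle.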
We observed that SHA-3, when used in the SHAKE-128 mode, is around two times more energy-efficient than ChaCha20 and three times more efficient than AES-128 in counter mode. So we decided to implement our PRNG based on Keccak. We have a 24-cycle Keccak core, which can be configured in the different SHA-3 modes. Our Keccak implementation processes its entire 1600-bit state in parallel, so we don't need any expensive register shifts or multiplexing, and this is how we are able to save energy. There is some area overhead associated with this, but it is small if we look at the entire implementation; the PRNG accounts for less than 10% of the total area. This is a very high-level architecture of our sampler. We have the Keccak-based PRNG, which generates uniformly random samples from a seed that we can choose from between two seed registers at runtime. These samples can then be post-processed in various ways to convert them into samples from the desired distribution, and we can configure at runtime the precision, the support, and the standard deviations of these distributions. Now we look at an overview of our test chip. In our test chip implementation, we have integrated this cryptographic core with a low-power RISC-V microprocessor to provide more programmability. Our RISC-V processor implements the RV32IM instruction set, and it has Dhrystone performance comparable to the ARM Cortex-M0 processor. The crypto core, including its memory and internal registers, can be accessed from the RISC-V processor through software using a simple memory-mapped interface. The same memory-mapped interface is also used to control some peripherals, like GPIO, SPI, and UART, which we use for debug purposes.
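As an illustration of this sampler pipeline (uniform PRNG output post-processed into a target distribution), here is a Python sketch of a centered binomial sampler in the style of NewHope/Kyber, using Python's built-in SHAKE-128 as a stand-in for the hardware Keccak-based PRNG; the bit-packing convention and parameter names are assumptions for the sketch, not the chip's.

```python
import hashlib

def sample_cbd(seed: bytes, n: int, eta: int, q: int) -> list[int]:
    """Sample n coefficients from a centered binomial distribution
    with parameter eta, reduced mod q.

    Each coefficient is the difference of the Hamming weights of two
    eta-bit chunks of the SHAKE-128 output stream.
    """
    bits_needed = n * 2 * eta
    stream = hashlib.shake_128(seed).digest((bits_needed + 7) // 8)
    bits = [(stream[i // 8] >> (i % 8)) & 1 for i in range(bits_needed)]
    out = []
    for i in range(n):
        chunk = bits[2 * eta * i : 2 * eta * (i + 1)]
        a = sum(chunk[:eta])      # weight of the first eta bits...
        b = sum(chunk[eta:])      # ...minus the weight of the next eta bits
        out.append((a - b) % q)   # value in {-eta, ..., eta} mod q
    return out
```

Changing eta here is the software analogue of what the chip does in hardware when the distribution parameters are reconfigured at runtime.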
We chose this memory-mapped interface because it lets us access the crypto core through only load and store instructions, so we did not have to make any changes to the compilation toolchain. On the right, we also show the chip micrograph of our implementation. Our chip was fabricated in a 40-nanometer TSMC low-power process; the Sapphire crypto core occupies 106K logic gates and 40.25 kilobytes of SRAM, for a total crypto core area of 0.28 mm². Next, we look at how we have implemented the different lattice-based protocols on our test chip. To demonstrate the configurability of our chip, we have implemented three CCA-secure KEM schemes, Frodo, NewHope, and Kyber, and two signature schemes, qTESLA and Dilithium. The Sapphire crypto core is used to accelerate all the cryptographic computations, and the RISC-V processor is used to schedule these cryptographic workloads and also to perform the encoding, decoding, compression, and decompression of public keys and ciphertexts. The Keccak core inside our crypto-processor can also be accessed standalone, so when the RISC-V processor is performing any hashing or random oracles for the CCA transformation, it can still get the benefit of hardware acceleration of Keccak. Also, while a crypto operation is running inside the Sapphire core, we can clock-gate the RISC-V processor using the wait-for-interrupt instruction to provide some additional power savings. Here we look at how we have implemented the Ring-LWE and Module-LWE schemes, specifically how we utilize the polynomial memory inside our chip to implement the different configurations. Our 24-kilobyte polynomial memory has 8192 elements, which can be accessed in chunks of power-of-two sizes: 256, 512, or 1024, according to the requirements of the protocol. The size of the memory is just enough that we can support parameters even for the highest security level, for example NewHope-1024.
We did not stop at just Ring-LWE and Module-LWE. To demonstrate the flexibility of our implementation, we have also evaluated the LWE-based key encapsulation scheme Frodo. Unlike the Ring-LWE and Module-LWE schemes, here the matrix dimensions are not powers of two, so we cannot use the polynomial memory as is. What we do is tile the array into rows and columns of specific sizes. For example, for Frodo-640, we split the 640-element array into 512 and 128, and we can still access the polynomial memory in these non-uniformly sized chunks. For Frodo-976, we just use a 1024-sized array and zero out or ignore the last 48 elements. Here we have some protocol evaluation results. We have compared our hardware-accelerated implementation with a full software implementation running on the RISC-V processor. In our paper, we also have a very detailed comparison of our implementation with assembly-optimized Cortex-M4 software, which we have referenced from the open-source pqm4 library, and we observe around an order of magnitude improvement in both energy efficiency and performance. In this plot, we show the energy consumption of the key encapsulation step and the signature step of the Ring-LWE and Module-LWE protocols, and how it varies with respect to post-quantum security. Again, because our implementation has configurable parameters, we can implement all of these different modes, and thus allow flexibility and energy scalability by varying the security level. In this table, we compare our work with some of the previous work on hardware-accelerated implementations of the lattice-based NIST candidates; there is a more detailed comparison of each sub-module in our chip with respect to previous work in the paper. The key three ideas here are the configurable parameters, the single-port-RAM NTT, which makes our design area-efficient,
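The tiling step for Frodo can be sketched as follows; tile_lengths is a hypothetical helper invented for this illustration (not part of the chip's interface) that decomposes a non-power-of-two dimension into the power-of-two chunks the polynomial memory can address.

```python
def tile_lengths(n: int, max_chunk: int = 1024) -> list[int]:
    """Greedily decompose a row/column length n into power-of-two
    chunks no larger than max_chunk.

    Frodo-640 splits as [512, 128], matching the talk; Frodo-976 would
    split as [512, 256, 128, 64, 16], though the chip instead pads it
    to a single 1024-element array and ignores the last 48 elements.
    """
    chunks, p = [], max_chunk
    while n:
        while p > n:          # largest power of two that still fits
            p //= 2
        chunks.append(p)
        n -= p
    return chunks

assert tile_lengths(640) == [512, 128]
```

A greedy decomposition like this keeps the number of chunks small, which matters because each chunk is a separately addressed region of the polynomial memory.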
and the fast Keccak-based PRNG, which makes our sampling energy-efficient. In the next couple of slides, I will spend some time on preliminary side-channel analysis of our chip. Just like any other public-key cryptography implementation, side-channel security is a very important implementation aspect of lattice-based schemes. This is our power side-channel analysis setup, where we show our test chip and the test board. It's a simple setup with a series resistor, a differential amplifier, and a very high sampling-rate oscilloscope. All the key building blocks in our hardware, like binomial sampling, Gaussian sampling, polynomial arithmetic, and the number theoretic transform, are constant-time. To verify their constant-time behavior, we have measured the run times of these operations over 10,000 random executions. We also observed that the energy consumption of these operations follows a narrow distribution, with coefficient of variation less than or equal to 0.5%. Also, our crypto-processor and the RISC-V processor have a single-level memory hierarchy, which eliminates any possibility of cache-timing attacks. If you look at existing SPA attacks on lattice-based schemes, they mostly exploit data-dependent branching or non-uniform execution times in the polynomial arithmetic operations. To further quantitatively evaluate the SPA resistance, we have performed difference-of-means tests on these operations with a 99.99% confidence interval; the measured results and the details are available in our paper. That was about SPA, but we should also look at DPA. The protocol evaluations that I talked about earlier do not have any explicit DPA countermeasures, but our crypto-processor is programmable, so we thought we should explore whether we can do masked implementations. As a first step, we looked at the additively homomorphic masking scheme which was proposed at PQCrypto 2016.
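To make the masking flow concrete, here is a Python sketch using a deliberately simplified toy encoding (each bit encoded as 0 or q/2 plus small noise, with no key) in place of real NewHope encryption. It only illustrates why decrypting a re-randomized ciphertext and then XOR-ing out the mask recovers the message; it is not the PQCrypto 2016 scheme itself, and provides no security.

```python
import secrets

Q = 12289          # NewHope's prime modulus
N = 256            # number of message bits (toy size)

def toy_encrypt(bits):
    # Toy "ciphertext": encode each bit as 0 or Q//2, plus small noise.
    # (Keyless on purpose -- this shows only the masking data flow.)
    return [(b * (Q // 2) + secrets.randbelow(7) - 3) % Q for b in bits]

def toy_decrypt(ct):
    # Threshold decoding: values near Q/2 decode to 1, values near 0 to 0.
    return [1 if Q // 4 < c < 3 * Q // 4 else 0 for c in ct]

def masked_decrypt(ct):
    # 1. sample a random mask message r
    r = [secrets.randbelow(2) for _ in range(N)]
    # 2. encrypt r and add it to the original ciphertext; encodings add,
    #    so the masked ciphertext decodes to m XOR r
    masked_ct = [(c + e) % Q for c, e in zip(ct, toy_encrypt(r))]
    # 3. decrypt the masked ciphertext, then remove the mask with XOR
    return [m ^ ri for m, ri in zip(toy_decrypt(masked_ct), r)]

msg = [secrets.randbelow(2) for _ in range(N)]
assert masked_decrypt(toy_encrypt(msg)) == msg
```

Because the noise grows when two ciphertexts are added, a real masked decryption has a higher failure rate, which is exactly the effect mentioned next in the talk.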
We follow a similar scheme and implement a masked version of the NewHope CPA public-key decryption. At a very high level, what is done in this scheme is: we generate a random secret message, encrypt it, add it to the original ciphertext, decrypt the result, and XOR it, and we get back the original message. We were able to implement this masked decryption on our test chip, and it was around three times slower than the unmasked version. We also know that masking this scheme increases the decryption failure rate; again, referring to the PQCrypto paper, we can resolve this by slightly decreasing the standard deviation of the error distribution, but of course at the cost of a slight reduction in the security level. As work in progress, we are performing more detailed leakage tests on our chip, and we are also planning to implement masked CCA-secure schemes on our test chip. To summarize, in this work we have presented a configurable crypto-processor for LWE, Ring-LWE, and Module-LWE protocols, with an area-efficient NTT, an energy-efficient sampler, and flexible parameters. We also provide detailed benchmarking of the NIST key encapsulation and signature protocols, and we observe an order of magnitude improvement in both energy efficiency and performance compared to state-of-the-art software and hardware. Finally, in terms of side-channel security, our key hardware building blocks are constant-time and SPA-secure, and we also show how we can use the programmability of our design to implement masking as a DPA countermeasure. I'd like to thank Texas Instruments for funding this work and the TSMC University Shuttle Program for chip fabrication. Thank you. Are there any quick questions? If not, then I'll be impolite and not ask a question, because the coffee break is starting. Please be back here at ten past for the next session, and please join me in thanking the speaker again.