Hello everyone, and welcome to this presentation on the paper "Scabbard", a suite of efficient learning-with-rounding key-encapsulation mechanisms. I'm Jose, and I'm going to start presenting this paper; later my co-author and colleague Suparna will follow. This is joint work with two other co-authors, Angshuman and Ingrid. So let's look at the contents of this presentation. I will start with an introduction to our paper and the motivation of our work. Then I will talk about the design of the schemes that form this suite, which is called Scabbard. Later, Suparna will talk about the security analysis of the schemes in the suite, and she will also present the software implementations. I will close the presentation by telling you a bit about the hardware implementations and with a small conclusion.

So let's first look at the context of this work. NIST, the National Institute of Standards and Technology, is going to release in the coming year the first standard for quantum-resistant cryptography, and we already know the finalists of this contest for both the KEM and signature categories. We can see that lattice-based schemes turned out to be the preferred solution for post-quantum cryptography. Among these lattice-based finalists, the selection criteria used by NIST will be the security analysis of each scheme and also implementation aspects: both the efficiency of the schemes on different platforms and their side-channel security, among other issues. For that reason, it is worth getting a better understanding of how the efficiency is affected by the design of the scheme.
Let's talk about the lattice problems that are used in post-quantum cryptography. The first problem is the learning with errors (LWE) problem. In this problem, we state that it is hard to distinguish a uniformly random sample from a sample b that is formed as the product of a public element A with a secret s, plus an error e. As we can see in the figure, we have a public matrix, the secret is a vector, the error is a vector, and the learning with errors sample is also a vector. So what are the bottlenecks of this problem? Well, of course, the sampling of the public matrix, the sampling of the secret, the expensive matrix-vector multiplication, and also the sampling of the error, which requires extra randomness. One optimization that can be applied to the learning with errors problem is to switch to the learning with rounding (LWR) problem. In this problem, the error is not sampled and added to the sample; instead, it is inherently generated by a rounding operation. What we have done for this suite is to use this learning with rounding problem as the base on which to build our schemes.

Within the learning with rounding problem there are different variants. The first one is the ring learning with rounding problem, analogous to ring learning with errors, in which instead of a public matrix we have only a public vector. We can see it as if the public matrix were formed by rotations of this vector. Another way to see it in practice is that we have two polynomials, a public polynomial and a secret polynomial, and we perform a ring multiplication between these polynomials, which is a convolution. Another variant uses module lattices: here we don't have a single polynomial, but a small matrix with each of its elements being a polynomial. The length of the polynomials in this problem is of course lower than in the ring version, and we can achieve a similar level of
security. So we use both the ring and the module variants of the learning with rounding problem to build all our schemes.

Why is our paper interesting? Well, the goal of our paper was to take advantage of all the latest advances in lattice-based cryptography, regarding both the parameter selection and the constructions based on these problems, ring learning with rounding and module learning with rounding, to create or improve state-of-the-art key encapsulation mechanisms. The contributions of our work are: first, a fast ring learning with rounding KEM, which we call Florete; second, a compact module learning with rounding KEM, which is also highly parallelizable in hardware, and we call this scheme Espada; and third, Sable, which is an alternative version of Saber in which we modify the shape of the secrets and some of the parameters, while maintaining the same construction, to achieve a faster and more efficient scheme. And of course, we provide implementations on software platforms and also hardware/software accelerators for all three schemes in the suite.

So now let's start looking at the design of our schemes. For that, I will introduce the framework we have used to build these learning-with-rounding-based key encapsulation mechanisms. We have used the same framework for all three schemes in our suite; the only differences are how the sample b that you can see here in this image is built, how the public matrix is sampled in the ring and module cases, and how the secret and the error are generated. As you can see, this is a generic construction of a key encapsulation mechanism, and in the case of lattices we need to use encoding functions, error-correcting codes, to recover the same message on both sides.
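To make this generic construction a bit more concrete, here is a minimal toy sketch in Python of an LWR-style sample and of the message encoding step. All names and parameter values here are my own illustration (and far too small to be secure); they are not the actual Scabbard parameters.

```python
import secrets

Q = 2**13   # toy ciphertext modulus q (a power of two, Saber-style)
P = 2**10   # toy rounding modulus p < q
N = 8       # toy dimension; the real schemes use e.g. n = 768 (Florete)

def small_secret(n):
    # narrow secret: every coefficient is -1, 0 or 1
    return [secrets.randbelow(3) - 1 for _ in range(n)]

def lwr_sample(A, s):
    # b = round((p/q) * A*s) mod p: the deterministic rounding plays
    # the role that the explicitly sampled error term plays in LWE
    return [((sum(aij * sj for aij, sj in zip(row, s)) % Q) * P + Q // 2) // Q % P
            for row in A]

def encode_bit(bit):
    # place the message bit in the most significant part of a mod-p coefficient
    return bit * (P // 2)

def decode_bit(coeff):
    # decide whether the noisy coefficient is closer to 0 or to p/2
    return 1 if P // 4 <= coeff % P < 3 * P // 4 else 0
```

Decoding succeeds as long as the accumulated rounding noise stays below p/4; Florete's repetition encoding additionally repeats each bit three times and takes a majority vote, which is what lowers its failure probability.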
So now let's look at the design of the schemes, and let's start with Florete. Florete has been designed using the ring learning with rounding problem, because the goal was to design a high-performance scheme. We use the quotient ring with x^768 - x^384 + 1 for security reasons. The main characteristic of the design of this scheme is that it has reduced randomness requirements with respect to, for instance, Saber, which is the scheme we take as a reference for comparison throughout our work. We have a public polynomial of 768 coefficients instead of a matrix of three-by-three polynomials of 256 coefficients each, so we have far fewer coefficients to sample for the public element. Also, the coefficients of the secret polynomial, which in Saber are sampled from a binomial distribution, are here sampled from a narrower distribution: all coefficients are minus one, zero, or one. Another characteristic of Florete is that the failure probability is lower, because each message bit is repeated three times: we have to encode a 256-bit message into 768 coefficients, so we use a repetition encoding.

The next scheme in our suite is Espada, and here the goal was to design a scheme that is very compact and has a very low memory footprint. So we introduced a design that, as far as we know, is novel: a module learning with rounding scheme in which the length of the polynomials that form the public matrix is shorter than usual. As you can see here, the quotient ring is a ring of polynomials with x^64 + 1, so every polynomial has 64 coefficients. To compensate for this, the public matrix has a higher rank, so we can achieve a very low memory footprint. The downside is that we need more random numbers, because we have to sample more elements for the public matrix, and a way to mitigate the performance penalty, if we want to implement it in
hardware, is to parallelize the operations. Lastly, we have the third scheme in our suite, which is Sable. For Sable we use the same construction as for Saber, but we introduce some changes; that's why we can see Sable as an alternative version of Saber. We reduce the randomness requirements by choosing a narrower distribution for the coefficients of the secrets, and we also adjust the moduli p and q; in fact, we use only the moduli p and q to tune the security level of the scheme. So now I will leave the floor to Suparna; she will talk about the security parameters and the software implementations.

Hello, everyone. I am Suparna Kundu. Before going into the implementation details of our suite Scabbard, I would like to mention the preliminary design goals of our schemes. In this performance-memory graph, we point out where our schemes Florete, Espada and Sable stand with respect to Saber. For Florete, we intended to achieve high performance; for Espada, our goal was to use a smaller memory footprint; and we designed Sable to achieve a better memory-performance trade-off than Saber. Every LWR-based KEM has two important notions.
These are security and failure probability. In any LWR-based KEM cryptosystem, both parties have to agree on a key with very high probability, but the keys can differ with a certain probability: that is the failure probability. To achieve the strongest security notion, which is CCA security, we need to strike a balance between security and failure probability. That balance depends on the modulus q, the degree n of the generating polynomial of the ring, and sigma, the noise, which is the minimum of the standard deviations of the distributions of secret and error. If we increase q keeping n and sigma fixed, then security and failure probability both decrease; if we increase n keeping q and sigma fixed, or if we increase sigma keeping n and q fixed, then security and failure probability both increase. To find the failure probability, we modified the Saber script, and for the security evaluation we used Ducas et al.'s leaky LWE estimator. We obtained the final parameter set of each scheme by searching exhaustively over all possible values of the parameters of the LWR-based KEM, keeping n fixed, with our security goal in mind. Here our goal is to obtain a bit-security level greater than or equal to 128 and a failure probability strictly less than 2^-128.

For Florete, we fixed n = 768, and as it is ring-based, the matrix dimension is l = 1. We chose the underlying ring modulus q of 10 bits, the rounding modulus p of 9 bits, and t, the modulus of the rounded ciphertext polynomial, of 3 bits. We used one coefficient of the ciphertext to hide 1 bit of the message, so B = 1, and for the secret sampling we used a centered binomial distribution with parameter mu = 1. This set of parameters helps us obtain 157 bits of security together with a failure probability of 2^-131. For the 768 x 768 polynomial multiplication in Florete, we used Toom-Cook 3-way multiplication on top of Saber's efficient 256 x 256 polynomial multiplication. This is a hybrid method which uses Toom-Cook, Karatsuba and schoolbook
multiplication. In Espada, the length of the ciphertext polynomial is 64 and the message length is 256 bits; that's why we needed one coefficient of the ciphertext to hide four bits of the message, so B = 4 here. We have chosen n = 64, vector dimension l = 12, q of 15 bits, p of 13 bits and t of 3 bits, and the parameter mu of the centered binomial distribution here is 3. This helps us obtain 128 bits of security together with a failure probability of 2^-167. Matrix-vector multiplication is one of the most time-consuming operations here, as this scheme is module-LWR-based, so parallel computation will reduce the computation time. Also, the length of the polynomials in the public matrix A and the secret vector s is 64, and 64 x 64 polynomial multiplication is very fast in hardware, so here we can perform the multiplications directly and efficiently in hardware. Hence we can use l parallel processors R1, R2, ..., Rl for computing the l polynomial multiplications of the matrix-vector multiplication.
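As a rough illustration of the multiplication just described, here is a schoolbook sketch in Python of a product in Z_q[x]/(x^64 + 1), and of one row of the matrix-vector product. The modulus value and helper names are my own toy choices; a real implementation runs the l inner products on parallel hardware processors rather than in a Python loop.

```python
N = 64       # polynomial length in Espada's ring Z_q[x]/(x^N + 1)
Q = 2**15    # toy modulus matching the 15-bit q mentioned above

def negacyclic_mul(a, b):
    # schoolbook product in Z_Q[x]/(x^N + 1): x^N wraps around to -1
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] = (c[i + j] + a[i] * b[j]) % Q
            else:
                c[i + j - N] = (c[i + j - N] - a[i] * b[j]) % Q
    return c

def matvec_row(row, s_vec):
    # one row of the matrix-vector product: the sum of l independent
    # 64 x 64 multiplications, each of which can run on its own processor
    acc = [0] * N
    for a_poly, s_poly in zip(row, s_vec):
        prod = negacyclic_mul(a_poly, s_poly)
        acc = [(x + y) % Q for x, y in zip(acc, prod)]
    return acc
```

Because each of the 64 x 64 products inside a row is independent, the l parallel processors R1 through Rl can each compute one of them, which is what makes the scheme so amenable to hardware parallelization.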
This helps to make the scheme efficient. Now for Sable: for the different security levels of Saber, they kept q and p fixed and adjusted the standard deviation sigma of the secret. Here we instead keep sigma fixed, vary q and p, and try to make the standard deviations of error and secret the same. We found parameter sets for three security levels of Sable by maintaining the security bound and the corresponding failure probability. The polynomial multiplication used in Sable is the same as in Saber.

This slide contains the results of the implementations in C and AVX2. For benchmarking, we used a system with an Intel Core i7, with hyper-threading, Turbo Boost and multi-core support disabled, compiled with GCC with the optimization flag -O3. Here the security level of each scheme is greater than or equal to 128 bits. As you can see, Florete is faster than all the other schemes in both the C and AVX2 implementations, which was our initial goal; the performance of Florete in C and AVX2 is better than Saber's by at least 45 percent, 26 percent and 10 percent for key generation, encapsulation and decapsulation respectively. The performance of Sable in C and AVX2 is also better than Saber's, and the performance of Espada in C and AVX2 is approximately two times slower than Saber.

For the Cortex-M4 implementation, we used the STM32F4 Discovery board running at 24 MHz, with the PQM4 framework. Kyber is another NIST round 3 finalist KEM, like Saber. As you can see, Florete is faster than all the other schemes, and Espada needs the least stack memory. Florete performs better than Saber for all of the algorithms, but each algorithm of Florete requires almost three times more stack memory than Saber's. The key generation and encapsulation algorithms of Florete are faster than Kyber's, while the decapsulation algorithm of Kyber is faster than Florete's; still, the overall performance of Florete is better than that of Kyber. Sable performs better than Saber in all the algorithms and also needs a little less stack memory than Saber for each of them. Espada needs twice as much time as Saber for each of the algorithms,
but the stack memory requirement of Espada is lower than Saber's for each of the three algorithms: key generation, encapsulation and decapsulation. The stack memory requirement of Espada is also lower than that of Kyber, which uses in-place NTT multiplication.

The number-theoretic transform (NTT) is another method of polynomial multiplication. It wasn't used in Saber due to its power-of-two moduli. Recently, Chung et al. applied the NTT to perform the polynomial multiplication in Saber by considering a larger ring, such that the product of any two elements of the previously used ring always lies inside this large new ring. In this plot, we name that scheme Saber-NTT; Saber-NTT achieves a performance improvement over Saber. By applying a similar technique, Sable also gets a speed-up, although this is just a first draft, not an efficient implementation: Sable-NTT is faster than Sable for all three algorithms, key generation, encapsulation and decapsulation. Florete and Espada would also receive a certain speed-up by applying this technique. The remaining part of our talk will be continued by my colleague Jose Maria Bermudo Mera. Thank you, Suparna.
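To illustrate the larger-ring idea mentioned above, here is a toy Python sketch: the product of two polynomials with coefficients modulo a power of two is computed modulo a larger prime P (in practice, that mod-P multiplication is what the NTT accelerates), lifted to its centered representative, and only then reduced modulo 2^k. The dimension, modulus values and function names are illustrative assumptions, not the actual parameters of Chung et al.

```python
N = 8                  # toy length; the same idea applies to degree-256 polynomials
Q = 2**13              # power-of-two modulus, Saber/Sable-style: no NTT mod Q directly
P = 2**31 - 1          # prime larger than 2*N*Q^2, so products are recovered exactly
                       # (a real NTT implementation picks a prime P = 1 mod 2N instead)

def negacyclic_mul_mod(a, b, mod):
    # schoolbook product in Z_mod[x]/(x^N + 1)
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] = (c[i + j] + a[i] * b[j]) % mod
            else:
                c[i + j - N] = (c[i + j - N] - a[i] * b[j]) % mod
    return c

def mul_via_larger_ring(a, b):
    # 1) multiply in Z_P[x]/(x^N + 1); in practice this step is an NTT mod P
    c_p = negacyclic_mul_mod(a, b, P)
    # 2) lift each coefficient to its centered representative in (-P/2, P/2],
    #    which equals the true integer coefficient because |it| < N*Q^2 < P/2;
    # 3) reduce that integer modulo the power-of-two Q
    return [(x if x <= P // 2 else x - P) % Q for x in c_p]
```

The correctness argument is simply that every coefficient of the true integer product is bounded in absolute value by N·Q², so as long as P exceeds twice that bound, the centered mod-P representative is the exact integer coefficient.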
Okay. As for the hardware implementations, we want to stress that the goal of our paper was not to perform a full exploration of the hardware architectures that can be used to accelerate these schemes, but rather to provide some guidelines and to have a tool to compare the different schemes. For this reason we haven't implemented the full schemes in hardware; instead, we used a hardware/software co-design approach that allowed us more flexibility in the whole process, but also faster feedback in the design cycle while implementing the algorithms or tuning the parameters.

In this hardware/software co-design approach, we decided to implement only the polynomial multiplication in hardware. Why? Because this is a critical operation for all the schemes, but it is also the operation that is different for each scheme; even within each scheme it can vary between parameter sets, because it depends on the length of the polynomials involved, which differs between Florete, Espada and Sable, and on the parameter set, if the error distribution changes. However, we would also like to note that hashing is an important bottleneck in lattice-based cryptography, and all the schemes would benefit greatly from having a hashing module in hardware as well. But the hashing module would be the same for all three schemes in our suite, because at this moment we are using Keccak as the hashing function for all three. So we decided to focus only on the polynomial multiplication, and I'm going to discuss now the differences between the three schemes.

First of all, I will start with Florete and explain its hardware partition. As you remember, it is a ring learning with rounding scheme in which we have polynomials of 768 coefficients.
So what we did here was to break down this multiplication between polynomials of 768 coefficients into five multiplications of polynomials of 256 coefficients by using the Toom-Cook 3-way algorithm. After that, we can reuse the hardware polynomial multiplier for Saber, for 256-coefficient polynomials, which doesn't exploit any particular shape of the polynomials. So this is what we did for Florete: we reuse the hardware for Saber.

In the case of Espada, the hardware/software partition as it stands is that only the polynomial multiplication is implemented in hardware, so we had to design a polynomial multiplier for 64-coefficient polynomials. We did this with the idea of achieving decent performance while keeping the area low, because the goal of Espada is compactness. Here we want to stress that if a designer wants to achieve high performance, they should also put the hashing in hardware and parallelize it, so that the coefficients of the polynomials that will be multiplied in parallel can also be generated in parallel.

Finally, the hardware acceleration for Sable. For Sable we can use the same polynomial multiplier as for Saber, which is also the one we use for the internal multiplications of Florete, the 256-coefficient polynomial multiplication. But in this case we can also use a specific construction, and we decided to design a high-performance multiplier for Sable which exploits the shape of the secrets, the fact that the secrets are small, as has been done for Saber in previous work. You can see here that, in contrast to Saber,
this is easier to do for Sable, because the coefficients are even smaller.

Finally, I will sum up all the hardware results in this table. In the paper you have numbers and performance for the full schemes when using the hardware/software accelerator, but here I focus only on the performance and the area of the multiplier, because that is what we implemented in hardware. We can see that for Florete we are using the same multiplier as for Saber, our compact accelerator for Saber. For Espada we are also using a compact design, so we can compare the Florete and Espada numbers to the numbers of the first Saber implementation, shown in row 4, whereas the fair comparison for Sable is with the implementation in the fifth row of this table. We can see that, even though the FPGA technologies are different, we achieve very similar results with much less area.

So that's all for our presentation, and I will wrap up with the conclusions. First of all, in this work we have improved the practical aspects of the state of the art in lattice-based cryptography: by providing Florete, which is a faster key encapsulation mechanism; by providing Espada, which is the most compact key encapsulation mechanism; and by tweaking Saber to create an alternative version, Sable, that improves certain characteristics of Saber. We have also introduced new design decisions for building KEMs; in particular, the way the parameters are chosen in Espada is novel with respect to how it was done in the state of the art. Finally, the future lines of work: of course, we are going to provide parameters for other security levels for Florete and Espada, as we already did for Sable, and we should explore different hardware architectures for the acceleration of these schemes. So that's it. Thank you for your attention, and we will be glad to answer your questions.