Hi everyone, I am going to present the paper "Time-memory trade-off in Toom-Cook multiplication: an application to module-lattice based cryptography". This paper is joint work with Angshuman and Ingrid. I am only going to show the slides on the screen, so let me first introduce myself as the speaker so you also get to know my face. I am Jose, a PhD researcher at KU Leuven, and my research is focused on implementation aspects of lattice-based cryptography. You can find a list of my publications at the link shown on this slide, and you can also contact me by mail with any inquiry, or just because you feel like it. Ok, so now let's jump into the presentation. I will start by introducing some necessary background, such as the context of this work, the basics of module lattices in cryptography, and efficient polynomial multiplication algorithms. I will continue by explaining how we have improved these algorithms and how we apply these optimizations to speed up Saber, which is one of the finalist schemes in the NIST standardization contest. I will also explain some memory optimizations that can be applied to Saber when a low memory footprint is needed, for instance in embedded implementations, and finally I will show the most relevant results of this work and draw some conclusions. As I am sure you have all heard, the security of the currently deployed public key infrastructure is based on number-theoretical problems such as integer factorization or the discrete logarithm. However, these hard problems can be efficiently solved by quantum computers, thus threatening our security and privacy. To anticipate this threat, NIST is running a standardization competition for post-quantum cryptography, and they have recently announced the finalists of this contest, listed in this table for each category. As we can see, lattice-based schemes seem to offer the most promising solution for post-quantum cryptography.
Moreover, among these lattice-based schemes, the constructions based on module lattices are the most popular. The first lattice problem used to build a cryptosystem was learning with errors. In learning with errors, one can create a sample b which is equal to the product of a public matrix A and a secret vector s, plus an additive error e. The error term is what makes the system nonlinear and therefore hard to solve. If the dimension l of the matrix is large enough, it will be hard to distinguish this sample from a randomly sampled vector, as long as you don't have any information about the secret or the error. Module lattices are built in a similar way to learning with errors, except that each element of the public matrix A is a polynomial with n coefficients rather than an integer. This allows us to reduce the parameter l while achieving an equivalent security level, and this also makes module lattices more efficient. The challenge in the implementation of module-lattice schemes is no longer the matrix-vector multiplication, but the polynomial multiplication. The most straightforward algorithm for polynomial multiplication is the so-called schoolbook multiplication: starting from the coefficients of the operands and performing n squared multiplications plus some additions, one gets the result. However, this method is not the most efficient for polynomial multiplication. As I said, to use the schoolbook method we need the polynomials to be represented in coefficient form. However, the most efficient polynomial multiplication algorithms are based on a point-value representation of the polynomials. In practice, one performs an operation called evaluation to transform the polynomials from their coefficient-form domain into their point-value domain, then the point-value multiplication, and finally the inverse of the evaluation, called interpolation, to recover the result in its coefficient form.
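[Editor's note: as an illustration of the schoolbook method just described, here is a minimal sketch, not the paper's implementation. It multiplies in the negacyclic ring Z_q[x]/(x^n + 1) that Saber works in (n = 256 and q = 2^13 in the real scheme, but any small n works here); the function and variable names are my own.]

```python
def schoolbook_mul(a, b, q):
    """Schoolbook multiplication in Z_q[x]/(x^n + 1):
    n^2 coefficient products, reduced on the fly using x^n = -1."""
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                res[k] = (res[k] + a[i] * b[j]) % q
            else:
                # x^(i+j) wraps around to -x^(i+j-n) modulo x^n + 1
                res[k - n] = (res[k - n] - a[i] * b[j]) % q
    return res
```

For example, with n = 4 and q = 17, multiplying x by x^3 gives x^4 = -1, i.e. the coefficient vector [16, 0, 0, 0].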
Since the point-value multiplication is linear, the complexity of the overall multiplication will be determined by the evaluation and interpolation algorithms. The most popular choices for cryptographic applications are the NTT, Karatsuba and Toom-Cook. Among these three, the NTT, being quasi-linear in time, has the lowest complexity. However, it imposes some requirements on the parameters: the modular arithmetic requires a prime modulus, and this modulus has to be congruent to 1 modulo twice the length of the polynomials. Both Karatsuba and Toom-Cook achieve a sub-quadratic complexity, which is worse than the NTT's, but they don't impose any requirement on the parameters, giving more freedom to the designer of the cryptosystem. In practice, the NTT is used in CRYSTALS-Kyber and Dilithium, while combinations of Toom-Cook and Karatsuba are used in Saber. Our work is focused on improving Karatsuba and Toom-Cook in the context of module-lattice based cryptography, so we demonstrate our speed-ups on Saber. I am not going to explain here all the details of Saber, but for the sake of this presentation I will recall that for the standard security level of Saber, the parameter l is equal to 3. For this category, a matrix-vector multiplication and an additional vector-vector multiplication must be performed during the encapsulation, which amounts to 12 polynomial multiplications. For the decapsulation, even more polynomial multiplications are necessary. The polynomial multiplication is thus critical for Saber's operations, and it has an optimal algorithmic choice consisting of a top layer where a 256-coefficient polynomial multiplication is broken down into 7 multiplications of 64-coefficient polynomials using Toom-Cook 4-way.
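[Editor's note: to make the sub-quadratic idea concrete, here is a minimal recursive Karatsuba sketch. It is illustrative only, not Saber's code: at each level it splits the operands in halves, a = a0 + a1·x^m, and replaces the 4 half-size products with 3, which is what yields the sub-quadratic cost. The base-case threshold of 4 is an arbitrary choice for this sketch.]

```python
def karatsuba(a, b):
    """Plain (non-reduced) polynomial product via recursive Karatsuba:
    3 half-size products per level instead of 4."""
    n = len(a)
    if n <= 4:  # schoolbook base case
        res = [0] * (2 * n - 1)
        for i in range(n):
            for j in range(n):
                res[i + j] += a[i] * b[j]
        return res
    m = n // 2
    a0, a1 = a[:m], a[m:]
    b0, b1 = b[:m], b[m:]
    p0 = karatsuba(a0, b0)                        # low halves
    p2 = karatsuba(a1, b1)                        # high halves
    pm = karatsuba([x + y for x, y in zip(a0, a1)],
                   [x + y for x, y in zip(b0, b1)])  # (a0+a1)(b0+b1)
    # recombine: ab = p0 + (pm - p0 - p2) x^m + p2 x^(2m)
    res = [0] * (2 * n - 1)
    for i, c in enumerate(p0):
        res[i] += c
        res[i + m] -= c
    for i, c in enumerate(p2):
        res[i + m] -= c
        res[i + 2 * m] += c
    for i, c in enumerate(pm):
        res[i + m] += c
    return res
```

The splitting into low, high and combined halves is exactly the "evaluation" step of the generic scheme above, and the recombination is the corresponding interpolation.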
Each of these is in turn broken down into 9 multiplications of 16-coefficient polynomials, resulting in a total of 63 schoolbook multiplications of polynomials with 16 coefficients, plus the additions and multiplications by constants of the evaluation and interpolation, to perform a single multiplication of polynomials with 256 coefficients. If we represent a polynomial multiplication in a figure, we would start with an operand a, to which we apply the Toom-Cook evaluation followed by the Karatsuba evaluation. We do the same for the other operand b, and then we perform the multiplication of a and b in the point-value domain. In the coming figures, I will also use this dashed purple box to highlight the evaluations and point-value multiplication altogether. After performing this point-value multiplication, we perform the interpolation to retrieve the result. When applied to module lattices, we propose two optimizations. First, lazy interpolation takes advantage of the fact that we don't really need the result of each individual polynomial multiplication, but only that of the row-column multiplication, to accumulate the result in the point-value domain, saving us most of the interpolations. On the other hand, precomputation takes advantage of the fact that one of the operands, in particular the secret vector, is reused in different row-vector products during the matrix-vector multiplication. We can then perform the evaluation of these polynomials only once and store them in the point-value domain, saving the time of some evaluations. In this figure, we summarize lazy interpolation as applied to Saber. As you can see, we perform the evaluation and point-value multiplication for each of the three products. Then, we add the results in the point-value domain. Since both evaluation and interpolation are linear operations, after running the interpolation only once on this accumulated result, we get the result of the full row-column product.
For the precomputation, compared to the original figure for the multiplication, we would not need to perform the evaluation of the operand b for some of the multiplications, since it has been previously computed and cached. Lazy interpolation and precomputation can be combined, and we illustrate this combination with this figure. As you can see, the evaluation of one operand does not always need to be computed. As for the rest, the result is accumulated in the point-value domain after running the evaluation and point-value multiplication for each pair of operands. And lastly, only one interpolation per row-column product is performed to retrieve the final result. These optimizations can be generalized to any security category, defined by the size of the matrix through the parameter l, or indeed to any matrix size. In this table, we show the improvement in the number of operations for the key generation, the encapsulation and the decapsulation of Saber for any value of the parameter l. As you can see, the greatest improvement is due to the lazy interpolation. We know that a theoretical improvement does not always translate into an effective speed-up when implemented, so we have applied our methods to the available optimized implementations of Saber. The implementation optimized for AVX2 takes advantage of the vector registers of these processors to accommodate 16-coefficient polynomials and perform a schoolbook multiplication using the vectorized instructions. Something that has to be taken into account is that the polynomial needs to be transposed after performing the multiplication, but this plays in our favor: using the same approach as for the lazy interpolation, we can transpose the registers only once for each row-column product, when we are going to carry out the interpolation. Therefore, the AVX2 implementation will especially benefit from our methods.
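[Editor's note: a rough way to see where the two savings come from is to count calls. The accounting below is a hypothetical simplification of mine, not the paper's operation-count table: it assumes each polynomial multiplication normally costs 2 evaluations and 1 interpolation, that precomputation evaluates each of the l secret polynomials once instead of l times, and that lazy interpolation leaves one interpolation per row.]

```python
def call_counts(l, lazy=False, precompute=False):
    """Evaluation/interpolation call counts for an l x l matrix
    times length-l vector product (simplified accounting)."""
    muls = l * l            # one polynomial multiplication per matrix entry
    evals = 2 * muls        # both operands evaluated for every product
    interps = muls          # one interpolation per product
    if precompute:
        # each secret polynomial is evaluated once, then reused across rows
        evals -= l * (l - 1)
    if lazy:
        # accumulate in the point-value domain: one interpolation per row
        interps = l
    return evals, interps
```

Under this toy accounting, for l = 3 the baseline needs 18 evaluations and 9 interpolations, while the combined optimizations need 12 and 3, matching the qualitative claim that lazy interpolation contributes the larger saving.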
On the other hand, the performance of the implementation optimized for ARM Cortex-M4 processors is very much dependent on the schoolbook multiplications, since they are performed 63 times for a single polynomial multiplication and this embedded processor does not enjoy the powerful vectorized instructions of AVX2. However, we can still benefit from the DSP extensions to minimize the overhead. The only penalization in performance will come from the extra preloads that are necessary to perform the accumulation of the result. Lazy interpolation overcomes this penalization only for the larger Saber parameter sets, as we will show in the results. Having to store polynomials in the point-value domain to implement lazy interpolation and precomputation has an important cost in terms of memory. For some applications, it may be desirable to have a low memory footprint rather than the highest performance. This is particularly true for embedded implementations, and this motivated us to research some memory optimizations for Saber. Firstly, the secrets, like any other element in the scheme, are stored as polynomials with n coefficients, with n equal to 256 for all security levels, and these coefficients are taken modulo q, with q equal to 2 to the power of 13; that is, there are 13 bits per coefficient. However, these secrets are sampled from a binomial distribution. We know that a binomial distribution defined by its parameter mu will have samples lying in the interval from minus mu to mu, and we also know that for Saber the worst case is given by LightSaber, where mu equals 5. This means that we need far fewer than 13 bits per coefficient. In practice, we have decided to store the secrets using only 4 bits per coefficient. This has two advantages. First, the obvious reduction in the memory required to store the secret key, which now needs between 35 and 40% less memory.
Additionally, the packing and unpacking functions, from a bitstring to a polynomial and vice versa, become much simpler than with 13 bits per coefficient: two coefficients can be directly packed into one byte. Moreover, we can embed the unpacking into the evaluation of the multiplication routine. The other memory optimizations that we have applied to Saber are the bookkeeping of the randomness for the computation of the hash functions, the just-in-time generation of the polynomials of the public matrix A from the seed, the in-place verification of the ciphertext during the decapsulation, the use of only Karatsuba rather than a combination of Toom-Cook and Karatsuba for the polynomial multiplication, since Toom-Cook is more memory-demanding, and, as we already mentioned, the in-place unpacking of the secrets during the Karatsuba evaluation. So now it's time to discuss the results. First, I want to show the impact of lazy interpolation and precomputation on the matrix-vector multiplication alone. As we anticipated, the gain for the AVX2 implementation is quite significant, while for the Cortex-M4 it is more limited, as well as dependent on the size of the matrix. The speed-up of the matrix-vector multiplication for AVX2 reaches 30%. On the other hand, the implementation for the Cortex-M4 with the parameters of LightSaber, that is l equal to 2, offers no improvement at all. For larger values of l, the speed-up is around 12% to 18%. For the overall results on Saber, we start by showing the results of a generic C implementation without any platform-dependent optimization. The blue colors correspond to the previous state-of-the-art implementation, and the wine colors to the implementation provided in this paper. The three blocks from left to right correspond to the key generation, encapsulation and decapsulation of LightSaber, Saber and FireSaber, respectively. Similar color codes will be used in the following plots.
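[Editor's note: the two-coefficients-per-byte packing mentioned earlier can be sketched as follows. This is my own illustration, not the paper's encoding: it assumes secret coefficients in [-5, 5] and stores each one as a 4-bit two's-complement nibble, with sign extension on unpacking.]

```python
def pack_secret(coeffs):
    """Pack small signed coefficients (assumed in [-5, 5]) two per byte,
    as 4-bit two's-complement nibbles: low nibble first, high nibble second."""
    out = bytearray()
    for lo, hi in zip(coeffs[0::2], coeffs[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)

def unpack_secret(data):
    """Recover the signed coefficients, sign-extending each 4-bit nibble."""
    def sext(v):
        return v - 16 if v >= 8 else v
    coeffs = []
    for byte in data:
        coeffs.append(sext(byte & 0x0F))
        coeffs.append(sext(byte >> 4))
    return coeffs
```

A 256-coefficient secret polynomial then fits in 128 bytes instead of the 416 bytes needed at 13 bits per coefficient, and the unpacking is simple enough to be merged into the multiplication routine's evaluation step, as the talk describes.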
So, as I said, in this setting of a C-only implementation, we can observe that the proposed methods yield impressive speed-ups, since algorithmic optimizations are the only ones available. In the AVX2 implementation, the polynomial arithmetic is so optimized that the hashing operations have become the bottleneck. Thus, even a speed-up as large as 30% for the matrix-vector multiplication has a limited impact of only around 10% overall speed-up in Saber, at any of its security levels. Even so, a 10% speed-up can be considered significant for an already optimized implementation. For the Cortex-M4, we first show the results for the speed-optimized implementation. We can see that the overall speed-ups are quite limited, of approximately 0.5% to 10%. In addition to the bars, we have plotted in orange and yellow the memory requirements for the previous state-of-the-art implementation and the one provided in this paper, respectively. We can see that there is a big penalization in the memory utilization. Finally, we also show the Cortex-M4 implementation optimized for memory utilization. In this case, we show the results for Saber with l equal to 3, since we can only compare to a similar implementation in the state of the art. We can see that our implementation is almost as fast while achieving an impressive memory reduction. So, to conclude this presentation: in our paper, we have formalized and generalized lazy interpolation and precomputation and their application to module-lattice based cryptography. We have shown how an optimization at the theoretical level does not necessarily translate into an optimization at the implementation level due to different time-memory trade-offs. We have provided the fastest software implementations of Saber for C, AVX2 and Cortex-M4 platforms. And additionally, we have provided the smallest Saber implementation for embedded platforms and effectively reduced the required storage for the secret keys of Saber.
And that's all. Thank you for your attention.