Hello. My name is Yusuf Alper Bilgin. I am going to present our paper titled "Cortex-M4 Optimizations for {R,M}LWE Schemes", that is, ring and module learning with errors. This is a joint work with Erdem Alkım, Murat Cenk, and François Gérard. This presentation consists of four main sections: introduction, implementation details, results, and conclusion. In the introduction, I will briefly talk about the NIST post-quantum standardization process, our target algorithms, and our target platform. In the implementation details, I will explain our optimization techniques, which can be categorized as optimizations for speed, optimizations for stack usage, and optimizations for secret key size. Then I will share our results, and finally I will conclude this presentation. NIST has initiated a standardization project for post-quantum algorithms, which includes both key encapsulation mechanisms and digital signatures. The project was started in 2016. The first round took place mainly during 2018 and assessed different possible quantum-safe algorithms. There are five main categories: lattice-based, code-based, multivariate, symmetric-based, and others such as supersingular elliptic-curve isogenies or zero-knowledge proofs. In the first round, performance was not the main consideration; instead, NIST considered security and cost as the primary factors in its decisions. In early 2019, 26 out of 64 initial algorithms advanced to the second round of the project. In the second round, it is stated that the practical performance of the candidates will play an important role in the selection for future standardization. This paper was accepted during the second round of the process. We aim to improve the practical performance of some lattice-based candidates, namely Kyber and NewHope, utilizing the number-theoretic transform (NTT) for faster polynomial multiplication.
Recently, the second-round evaluation was completed, and NIST announced the third-round candidates, which contain four key encapsulation mechanisms and three digital signatures. In addition, five key encapsulation mechanisms and three digital signatures also advanced to the next round as alternate candidates. Therefore, in total, there are nine key encapsulation mechanisms and six digital signatures. Now, let me talk a bit about our target schemes. The first one is Kyber. It is one of the third-round finalists and is fast on almost every platform. Its security is based on the module learning with errors (MLWE) problem. Polynomial multiplication is performed using the NTT, the number-theoretic transform. It uses a 7-level NTT in Z_3329[x] modulo x^256 + 1. This ring does not fully split, so we do not carry out the full NTT but rather stop before reaching base-field arithmetic. This also means that the coefficient-wise multiplications in the NTT domain are done on small polynomials instead of integer coefficients, in this case two-coefficient polynomials modulo the quadratic factors. These small-polynomial multiplications are performed with the schoolbook method. The second one is NewHope. It was a promising candidate, but it was eliminated in the second round. Its security is based on the ring learning with errors (RLWE) problem. It utilizes a 9- or 10-level full NTT with the two corresponding rings. The last one is NewHope-Compact. It is a faster and smaller variant of NewHope. It was presented at LATINCRYPT in 2019. Its security is based on the ring learning with errors problem. It uses a 7-level NTT with three different rings. As you may notice, these rings do not fully split either; hence, at the end we have small polynomials with 4, 6, or 8 coefficients, respectively, and we use schoolbook multiplications for these small polynomials. Now I will also explain the general structure of the algorithms. Like NewHope, each scheme consists of an IND-CPA secure public-key encryption scheme and a CCA-secure key encapsulation mechanism.
Here you can see the generic structure of these algorithms. In general, we have key generation, encryption, and decryption, and in this case we are working with polynomials. In key generation, the public key b is computed as A·s + e, where we can use fast multiplication. During encryption, the message is added, together with some noise, to a product computed with the public key b to form the ciphertext. Finally, the ciphertext is decrypted by using the secret key s by performing this multiplication. Now let me talk about our target architecture, the Cortex-M4. It is the NIST-recommended platform for PQC evaluation. The Cortex-M4 is a 32-bit platform; it implements the ARMv7E-M instruction set and provides special digital signal processing (DSP) instructions. These instructions offer single instruction, multiple data (SIMD) operations that can perform arithmetic on two halfwords or four bytes in parallel. This microcontroller has the advantage of having large enough memory to support post-quantum algorithms while still being reasonably small and cheap in the grand scheme of computing. Its popularity led to the development of pqm4, a library aiming to offer a common framework for benchmarking implementations of post-quantum algorithms on this platform. This architecture comes with the restriction of a limited number of registers: 16 general-purpose 32-bit registers, out of which only 14 are available to the developer. In this slide, we list the previous optimizations for Kyber on the Cortex-M4. Note that we also implement all of them; we use these techniques for NewHope, NewHope-Compact, and also Kyber. The Cortex-M4 is a 32-bit architecture, while the polynomial coefficients fit in fewer than 16 bits. Therefore, in order to fully utilize the properties of the Cortex-M4 platform, we represent polynomials as arrays of signed halfwords. We also pack two coefficients into one register; thereby we can utilize SIMD instructions and perform
addition and subtraction on two halfwords in parallel by using uadd16 or usub16. Moreover, we implement a double butterfly, which takes a packed register as input and returns a packed butterfly result. We perform all computations in the Montgomery domain, since this enables a fast Montgomery reduction, specifically after multiplication. We precompute all twiddle factors in the Montgomery domain and store them in flash memory. We pre-order these constants before storing them in flash memory so that they appear in memory in the same order as they are used during the computation; hence we can easily load the next one without computing its address. The load instruction on the Cortex-M4 has the ability to move the pointer to the next twiddle factor while fetching the current value from memory; thus, moving to the next factor has no extra cost. Link-time optimization (-flto) can give a performance boost, so we enable it. The critical performance gain with link-time optimization comes from cross-module function inlining, which is not directly possible without -flto. This will tend to increase the code size, since inlining functions across source files introduces code duplication. However, it should also be noted that link-time optimization is more effective at identifying unused code, or code that has no impact on the output. The Montgomery reduction on the left side is implemented in three clock cycles. In this work, we optimize the implementation of the Montgomery reduction such that it can be performed in only two clock cycles. We achieve this by storing −q⁻¹ instead of q⁻¹ and using smlabb, a signed multiply-accumulate instruction, which multiplies two halfwords and adds the 32-bit result to another 32-bit value in one clock cycle. The Kyber implementation performs 3,200 Montgomery reductions in a full polynomial multiplication a·b using the NTT, where a and b are polynomials in this ring. Therefore, this change saves 3,200 clock cycles for a
full polynomial multiplication. We also implemented a double Montgomery reduction on a packed argument; it is slightly faster than the double Barrett reduction, depending on the size of the modulus and the register size of the underlying architecture. It is not always necessary to reduce the results after an addition or subtraction; skipping unnecessary reductions is called lazy reduction. It is common for optimized NTT implementations to heavily use this technique to speed up the code; however, those lazy reductions are mostly performed after an addition or subtraction. In this work, we also perform lazy reductions during coefficient-wise multiplications. Here is an example. As you can see, we skip these two intermediate reductions and add a single reduction at the end, and in fact we lower the bound: while the bound for the first equation is −2q to 2q, for the second one it is only −q to q. We save a lot of reductions by using this approach, as you can see here. We can also skip the Montgomery reductions after the multiplications in the first layer of the NTT butterflies, where the inputs are polynomials with small coefficients sampled from the centered binomial distribution. In this work, we use eight registers to keep 16 coefficients and perform three or four layers of the NTT, depending on the distance between the coefficients used in the same butterfly on the next layer. In other words, we load coefficients into these eight registers in such a way that a maximum number of NTT layers can be performed before storing the results. Thanks to the structure of the NTT used in NewHope and NewHope-Compact, we can merge four layers, since at some point we need coefficients with distance one, and loading consecutive coefficients from memory is free with ldr instructions. Now let us move on to the stack optimizations. The NTT is already stack friendly: it is entirely in place. We use these two previous optimizations. The first one is inline comparison in CCA
decapsulation. CCA-secure decapsulation first decrypts the ciphertext and then re-encrypts the obtained plaintext. This produces a ciphertext, which is then compared to the original; only if they are equal is the shared secret key returned. This additional ciphertext is eliminated from the stack by inlining the comparison into encryption. Moreover, the matrix A is only required once, for the matrix-vector multiplication and accumulation. The memory footprint can be reduced using an approach that shrinks the storage requirement of A to only the state of the extendable output function at a time, allowing a small number of coefficients to be generated for multiplication; in other words, on-the-fly generation of the matrix A in the matrix-vector multiplication. In this work, these methods are also implemented for NewHope and NewHope-Compact. An optimization to reduce the stack usage of key generation is presented here: instead of computing the first equation, we compute the second one. Now let me explain the differences. Firstly, a hat means that the value is in the NTT domain. We have already talked about A: we do not need all values of A, since we can do on-the-fly multiplication. However, we need at least two polynomials in RAM to compute the NTTs of s and e, the secret term and the error term. For the second equation, we move the second NTT here to the beginning; therefore, we can compute all operations using only one polynomial on the stack, s here. Then we can do on-the-fly error addition and on-the-fly multiplication by A, at the cost of one inverse NTT. The stack usage is decreased by almost one polynomial. The last topic is the secret key size optimization. This is actually a well-known and obvious idea, and there are different options. We can store the secret key in the NTT domain, which is what Kyber and NewHope currently do, or we can store only the 32-byte seed and rerun key generation during decapsulation; this can be preferred if the size of the secret key is really critical, but it gives a significant performance penalty. A middle way
could be storing the secret key in the normal domain. Another method could be storing only the 32-byte seed used to sample the secret key. We prefer the last one, since our NTT implementation is fast enough. This reduces the secret key size by about 96%, but it has a drawback: it increases the decapsulation time by 7% for Kyber, 18% for NewHope, and 9% for NewHope-Compact. This table presents our cycle-count comparison. One can see that NewHope and Kyber perform around 10% better with our optimizations. Furthermore, NewHope-Compact is more than 40% faster compared to NewHope and more than 25% faster compared to Kyber for all security levels. The stack usage comparison in bytes is given on this slide. Our results for NewHope are almost half of the previous implementation, and we also improve the stack usage for Kyber. In this slide, you can see the improvement in the polynomial multiplication functions. It can be seen that Kyber and NewHope-Compact have similar performance, while NewHope is slower; this is mainly due to the extra layers of the NTT and the increased number of reductions caused by the larger modulus. It can also be seen that our implementation of the NewHope NTT is slightly slower compared to the previous implementation, while we have noticeably better performance for the inverse NTT. To sum up, we propose various optimizations. The first one is a more efficient modular reduction: we implemented a two-cycle Montgomery reduction. More aggressive layer merging: we merge up to four layers of the NTT by using the registers carefully. More aggressive lazy reduction: we are lazy with the reductions even after multiplications. We optimize the small-degree schoolbook polynomial multiplications. We reduce the stack usage of key generation by adding the error term on the fly. We also give a trade-off between secret key size and decapsulation time. Finally, we give a unified framework to compare the three schemes, Kyber, NewHope, and NewHope-Compact, as they use the same level of optimizations. Moreover, there is still some room for
further improvement. One interesting point is that our NTT implementation for NewHope has slightly slower performance than the one reported in pqm4 before our commit, so using Gentleman-Sande butterflies might improve the performance at the cost of storing more twiddle factors. Moreover, representing coefficients as 32-bit signed integers instead of 16-bit might improve performance: by doing so, we could be more aggressive with the lazy reduction, at the cost of using more registers. You can check our source code on the GitHub link; it is publicly available. We have also committed our changes to pqm4. Thank you for listening.