Welcome to the post-quantum cryptography session, part two. In this session we will have two presentations, the first titled "Saber on ARM: CCA-Secure Module Lattice-Based Key Encapsulation on ARM." The paper was authored by Angshuman Karmakar, Jose Maria Bermudo Mera, Sujoy Sinha Roy, and Ingrid Verbauwhede. And here we have Angshuman to give the presentation. OK, please. Yeah, thank you, Patrick, for the introduction. So today we are going to talk about implementing module lattice-based post-quantum cryptography on ARM microcontrollers. Our specific focus is on Saber, which is our submission to the ongoing NIST post-quantum standardization process in the key encapsulation mechanism category. So, a little bit about Saber. Saber is a CCA-secure post-quantum KEM. Its hardness is based on the module learning with rounding (Module-LWR) problem. So what are modules? Modules are a trade-off between the security of standard lattices and the efficiency of ideal lattices. And LWR stands for learning with rounding. It is very close to the well-known learning with errors (LWE) problem. The only difference is that in LWE you take a sample and add external noise to it, such as Gaussian or binomial noise. In learning with rounding we don't need to do that: we get inherent noise simply by rounding the sample down to a smaller modulus. So we use less randomness. During the design phase of Saber, implementation efficiency was one of our main considerations. As a result, Saber is very efficient and very flexible. If you want higher or lower security, you can just increase or decrease the dimension of the matrix. And because all the basic operations stay the same, you don't have to change the code much; it's a minimal code change. Another very important aspect of Saber, which differs from the other lattice-based submissions, is that we don't use prime moduli. We use power-of-two moduli.
This has two big advantages. First, the rounding I mentioned earlier, which generates the inherent noise, and the modular reduction are basically free on both hardware and software platforms: they are just bit shifts. But of course, we cannot use the NTT (number-theoretic transform), because the NTT requires a specific structure of prime moduli. And you may know that in most lattice-based protocols, polynomial multiplication is the most computationally intensive part. The NTT is the best-known polynomial multiplication algorithm, and we cannot use it, so you might think Saber would suffer in performance. But actually, if we use a combination of Toom-Cook, Karatsuba, and schoolbook multiplication, and since our polynomials are small, only 256 coefficients, we will show later that we don't lose much in efficiency or speed. OK, so just to recapitulate how this multiplication is actually done. We have two polynomials, A and B, both of size 256, and we want to multiply them to get the product C. In the first level, we use Toom-Cook 4-way. It reduces the big 256×256 multiplication into seven smaller 64×64 multiplications. Then we use two levels of Karatsuba, which reduces each 64×64 multiplication into nine 16×16 multiplications. And then we do these using schoolbook multiplication. After we are done with the schoolbook multiplications, we go back up and actually generate our product C. So you see, to compute one 256×256 multiplication, we have to do 7 × 9 = 63 schoolbook multiplications. So if we can speed up the schoolbook multiplication, we speed up the overall multiplication. We have seen that Saber is very efficient on high-end processors. In this work, we want to show that Saber is also very efficient on low-end, resource-constrained platforms like the Cortex-M0 and M4. For this, our two target devices are the M0 and the M4. The M0 is a very low-power microcontroller.
It has only 8 or 16 kB of RAM and eight registers for data processing. The other is the Cortex-M4, which is not very high-end, but sits in the middle of the Cortex-M series. It is the first processor in the Cortex-M series with DSP instructions, and we'll show how we can use these DSP instructions to our benefit. The Cortex-M4 is also a very popular platform for implementing public-key cryptography targeted at IoT devices and low-power microcontrollers. We provide two types of implementations. One is a very high-speed implementation on the Cortex-M4. We'll show how we can use the DSP instructions to reduce the number of instructions in each schoolbook multiplication. We also provide an in-register version of the Toom-Cook evaluation, or wrapper, which greatly reduces memory accesses. The other is a compact, very memory-efficient version on the Cortex-M0. We'll show a just-in-time approach to generating the public matrix, which is very memory-hungry in module lattice-based schemes. We also provide some optimizations using an in-place Karatsuba multiplication. This in-place Karatsuba has been known for quite some time, but we haven't seen it used much in implementations of public-key crypto. OK, now for the schoolbook multiplication. In our case, each coefficient is 13 bits long, so it easily fits into a halfword of a register. In other words, each register can hold two coefficients. Now, we have DSP instructions like SMLABT which operate on the halfwords of a register: you can multiply the bottom halfword with the top, the top with the bottom, or any combination, and accumulate. OK, here is a toy example just to illustrate the optimizations. It is a multiplication of two 4-coefficient polynomials. In a very naive way, even using these DSP instructions, it takes 16 instructions to calculate the 16 small products.
But we also have the SMLADX instruction, which can cross-multiply the two halfwords of one register with the two halfwords of another and accumulate both products. OK, now focus on these two products, a1·b0 and a0·b1. These are two products, so with SMLABT-style instructions we would need two instructions. Fine. Now, if we take the accumulator register holding c1 and the two packed registers holding the coefficients of A and B, what do we get? We get both products in only one instruction. In this way, wherever two such products accumulate into the same coefficient, we can compute them in a single instruction, and thus reduce our instruction count. We can apply this trick across all these registers: similarly, we can calculate each of those pairs in a single instruction. Now our total instruction count has dropped to only 12, which is a 25% reduction. But we can do even better. Consider two coefficients which are adjacent but do not reside in the same register. Then we cannot use the SMLADX trick directly. But we have some spare registers; using some tricks, we can free registers up. There is a packing instruction, PKHBT, with which we can pack these two coefficients into a spare register and then apply the same SMLADX trick again. So using that instruction, we can again perform these four products in only two instructions. In this case we save two instructions but lose one to the packing instruction. So for this example multiplication, we now need only 11 instructions instead of 16. In our actual case, as I showed earlier, we have 16×16 polynomial multiplications. In the naive way, each takes 256 instructions; here, we need only 168 instructions, a reduction of roughly a third in the total instruction count. It may look small, but recall the earlier slide: to do one 256×256 multiplication, you need 63 schoolbook multiplications. So even a very small saving here results in a big saving in the overall multiplication.
So in our multiplication we have Toom-Cook at the top level, then Karatsuba, and then the schoolbook. I have shown you how we sped up the schoolbook multiplication. For Karatsuba, we just unrolled it and did some small optimizations; the details are in the paper. Next, I'm going to show you how we make the Toom-Cook multiplication faster. Toom-Cook initially has an evaluation phase. In the evaluation phase, it partitions each polynomial into four smaller polynomials. Our polynomial has 256 coefficients; it is partitioned into A3 down to A0, four parts of 64 coefficients each. Then it needs to create weighted sums of these polynomials. Actually, it needs seven weighted polynomials, but AW0 and AW6 are just A0 and A3, so only five require computation. Now focus on AW2. Here you see that to get one weighted polynomial, we have to access all of A0, A1, A2, and A3, and each of them has 64 coefficients. So we need to access main memory 256 times. It works like this: we load the coefficients into registers, do the weighted arithmetic, and put the results in their respective positions. Now think about doing this for all five weighted polynomials: we have to make 5 × 256 accesses to main memory, which is a huge overhead. Instead, we use a vertical coefficient scan and an in-register version of the Toom-Cook evaluation. We load our coefficients as usual into registers, and we have some spare registers; we do all the weighted arithmetic inside the registers and write the results back to their corresponding positions. Actually, the weighted arithmetic is quite complex and we don't have that many spare registers, so we compute some partial sums and do some rearranging inside the registers, just to reduce the memory accesses as much as possible.
So here you can see that instead of 5 × 256 memory accesses, we now need only 256 loads from external memory to generate all the weighted polynomials. That is a huge saving in memory accesses, from 5 × 256 down to 256. The trade-off is that we now have to keep space in memory to store the weighted polynomials. Okay, so until now I showed you optimizations that increase efficiency; now for some memory optimizations. In the reference implementation of Saber, we first need to generate a public matrix A, which is a collection of nine polynomials. What the reference implementation does is: first generate a random seed, then run SHAKE-128 into a huge array of 3.8 kilobytes. We put all the pseudorandom bytes there, then generate each of the polynomials one by one from it. But this array is 3.8 kB, which is huge, and for platforms like the Cortex-M0 it is prohibitively large; we cannot even fit it. So we took a just-in-time approach. SHAKE is composed of a Keccak absorb and a squeeze, and we generate the polynomials only when they are needed. We first take the random seed and absorb it, then run the Keccak squeeze to generate the required number of pseudorandom bytes, generate one polynomial, and do all our computations with that polynomial. We also save the state of the Keccak squeeze, and when we need the next polynomial, we come back, feed the state back into the squeeze, squeeze out the next batch of pseudorandom bytes, generate the next polynomial, and again do whatever we need with it. This goes on for each polynomial. So now, instead of needing space for nine polynomials, we need space for only one. Of course, it requires quite a bit of bookkeeping so that we don't break consistency with the NIST submission, but the details are all in the paper.
But the memory requirement decreases to one ninth of the initial requirement. Okay, so for the results. Here is our fastest implementation on the Cortex-M4, and here is our most compact version on the M0. As you can see, we are using Toom-Cook, Karatsuba, and schoolbook, and we compare against Kyber, which is a similar module lattice-based KEM but uses the NTT and prime moduli. In the fastest version of our implementation, we are actually a little bit faster than them. And it's important to remember that on microcontrollers we are rarely going to run key generation and decapsulation; it's mostly encapsulation. So that is the most important operation, and there we are a little faster than Kyber. In the most memory-efficient version, we need at most 6.3 kB. Our initial requirement was 18 kB, so we have an almost three-fold reduction in memory. I should also say that all the optimizations I have described, which are in the paper, can be applied on top of each other. It all depends on the user: if you want some balance of speed efficiency and memory efficiency, you can take some of the optimizations, merge them together with minimal code changes, and have a very good implementation for your needs. That is what we did here: we combined some of our memory-saving techniques with our speed optimizations. You can see the memory requirement drops by almost half in all the cases, and we don't lose much in performance; we are still very close to Kyber with its NTT-based multiplication. OK, so in conclusion, we show that module lattice-based cryptography is very practical on resource-constrained platforms. On the Cortex-M0 we need at most 6.2 kB, which is for decapsulation; encapsulation is even less. That is about one third of our reference implementation.
On the Cortex-M4, we can do the most critical decapsulation operation in only nine milliseconds, which is around eight times faster than our reference implementation. As I said earlier, the optimizations I described can be applied on top of each other. We also showed that the choice of parameters is very crucial here: for small dimensions, the asymptotic edge of the NTT over the other multiplication algorithms doesn't matter much, because this edge is eaten up by the irregular memory accesses of the NTT. And the NTT cannot directly use the DSP instructions; it requires special considerations. It's like insertion sort versus quicksort: quicksort is asymptotically faster, but for small inputs insertion sort performs very well. The asymptotic edge is sometimes lost to the overhead of recursion and memory accesses. The paper and the implementation are public, so you are most welcome to look at them. The implementation is on GitHub, on our KU Leuven GitHub page, and the paper is on ePrint. That concludes my talk. Thank you very much. Thanks, Angshuman, for the nice presentation. Are there any questions? If not, I have a question. Could you go back to slide 11? Slide 11? Yeah. So as far as I understand, this implementation of Saber matches category three of the NIST categories, right? It's almost 120 bits of post-quantum security. Yeah, so it's pretty high. So the first question is: if you're targeting microcontrollers, is that a little too much? Why use that high a security level? Why not take advantage of a lower security level? Saber was our recommended parameter set, so we just implemented it. And as I showed earlier, decreasing or increasing the security is not an issue; it's just a small code change, and you can have that. So we just went with the recommended set: not too high, not too low.
But yeah, if you reduce the security and the target category, maybe you would get much better performance. OK. Thank you. Nice. The second question is regarding the bit security: how does the bit security compare between Saber and the schemes you're comparing against here? With Kyber? Yeah, with Kyber and your code. With the same levels of security. OK. Yeah, the same category. OK. Category three, then. Yes, as I remember, yes. OK. OK, thank you. Are there any other questions? No? Then let's thank Angshuman again, please. OK, thank you very much.