Hello everybody, my name is Daan Sprenkels. I'm presenting joint work with Denisa Greconici and Matthias Kannwischer, titled Compact Dilithium Implementations on Cortex-M3 and Cortex-M4. So I will be talking about research on implementing the Dilithium signature scheme on the Cortex-M4 and the Cortex-M3. I will give a brief introduction to Dilithium, as there are probably people here who don't know what Dilithium is and how it works. I will explain how we dealt with non-constant-time multiplications on the Cortex-M3 and how we made the multiplications in Dilithium constant time. I'll briefly touch on optimizing both the performance and the memory usage of the scheme. Then I will go into the results, and I will conclude. So first, the Dilithium signature scheme. It's obviously a signature scheme. It's part of the CRYSTALS submission together with Kyber, and it has currently progressed to the third round of the NIST competition. The idea of the scheme is that it's a Fiat-Shamir with aborts scheme. It looks a lot like a general Fiat-Shamir scheme, but sometimes it can occur that the signature is incorrect and could leak something about the secret key. So we check whether the signature is correct, and if it is incorrect, we restart the Fiat-Shamir loop; we call one of these cases an abort. At some point, we will end up with a good signature. The underlying hard problems for Dilithium are the Module Learning With Errors (MLWE) and Module Short Integer Solution (MSIS) problems. In general, Dilithium, like most lattice schemes, has pretty small keys and pretty small signatures. And the most important thing about Dilithium is that it operates in the polynomial ring you see here on the slide, Z_q[x]/(x^256 + 1), with a special prime q. The point of using this ring is that it allows very efficient polynomial multiplication using the number theoretic transform (NTT).
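For readers following along without the slide: the special prime in question is Dilithium's modulus q = 8380417, and a quick Python check (illustrative, not from the talk) shows the structure that makes the NTT work in this ring.

```python
# Dilithium's modulus, the "special prime" on the slide.
q = 8380417

# It has a sparse bit structure, which helps with modular reduction.
assert q == 2**23 - 2**13 + 1

# 512 divides q - 1, so Z_q contains a primitive 512th root of unity,
# which is exactly what a full 256-point negacyclic NTT needs.
assert (q - 1) % 512 == 0
```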
So the number theoretic transform is basically a transformation in a ring: if you have a polynomial in your ring, you evaluate it at the powers of the nth primitive root of unity in this ring, and you end up with another representation of the same polynomial. The formal definition you can see here on the slide; these are all really complicated multiplications, and computed naively they are also inefficient. In practice, we use the fast Fourier transform algorithm to compute the NTT of a polynomial. What we then use is the fact that if we want to multiply two of these polynomials, we can transform both of them into the NTT domain, do a pointwise multiplication, and transform the resulting polynomial back into the time domain to get the product, in this case of a and b. The benefit of this is that while schoolbook multiplication has a complexity of n squared, where n is the number of coefficients in your polynomial, the NTT and inverse NTT algorithms only have n log n complexity, and the pointwise multiplication has n complexity. So using the NTT is faster than more traditional means of multiplication. The signature scheme itself has three parts: key generation, signing, and verification. This is how it looks. I won't go into it too much, but the most important thing is that all these bold characters are vectors and matrices. Most of the operations involved in the scheme are vector and matrix multiplications, and as such the scheme uses the NTT a lot. So then the target platforms; we chose two different target platforms. First, there's the ARM Cortex-M4, where we use the STM32F407 Discovery board. This board is, I think, the main target for all the NIST candidates on microcontrollers; I think NIST said so at some point in a presentation. It's a 32-bit platform.
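The transform-multiply-transform-back pipeline described above can be sketched in Python. This is a toy illustration, not the talk's code: it uses tiny parameters (q = 17, n = 4) instead of Dilithium's (q = 8380417, n = 256), and writes the NTT as its O(n^2) definition rather than an FFT, to keep the evaluate/pointwise/interpolate structure visible.

```python
# Toy negacyclic NTT multiplication in Z_q[x]/(x^n + 1).
q, n = 17, 4
psi = 2                 # primitive 2n-th root of unity: psi^n == -1 (mod q)
omega = psi * psi % q   # primitive n-th root of unity

def ntt(a):
    # Evaluate a at psi * omega^j, the n roots of x^n + 1, for j = 0..n-1.
    return [sum(a[i] * pow(psi, i, q) * pow(omega, i * j, q)
                for i in range(n)) % q
            for j in range(n)]

def intt(A):
    # Inverse: cyclic inverse DFT, then strip the psi^i twist.
    n_inv = pow(n, -1, q)
    return [n_inv * pow(psi, -i, q)
            * sum(A[j] * pow(omega, -i * j, q) for j in range(n)) % q
            for i in range(n)]

def ntt_mul(a, b):
    # Transform both inputs, multiply pointwise, transform back.
    return intt([x * y % q for x, y in zip(ntt(a), ntt(b))])
```

With an actual FFT for `ntt`/`intt`, the two transforms cost O(n log n) and the pointwise step O(n), which is the speedup the talk refers to.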
It has the ARMv7E-M instruction set. It has a lot of ROM for a microcontroller and also a lot of RAM, and it runs at 168 MHz. Because it's an M4, it has a couple of really nice instructions where you can do wide 32-bit multiplications in a single cycle. For the ARM Cortex-M3, we used the Arduino Due, which has an Atmel SAM3X8E chip. It looks a lot like the M4, but the main difference is that the flash size and the RAM size are a lot smaller. And on the M3, there are none of these nice multiplications that multiply in one cycle. It has the same instructions, but they have variable runtime, which means that we cannot use them for side-channel-resistant code. How does that look? Well, we thought about maybe tricking these instructions into behaving in constant time, but apparently the flow chart for these instructions, for example for this one, is so involved that it's really hard to actually do that. And if we want to make sure that we always end up, for example, in the five-cycle path, then it's almost impossible to properly implement crypto with this. So the first obstacle to overcome was: how do we actually implement constant-time multiplications on the Cortex-M3? Basically, the instructions that produce 64-bit products do not have a constant cycle count, but the multipliers that return only a 32-bit result are actually constant time. The MUL instruction is one cycle, and the MLA and MLS instructions, where MLA does multiply-accumulate and MLS does multiply-subtract, are both two cycles. Our solution is to use these 32-bit multipliers and represent the 64-bit values in radix 2^16. So what we basically do is normal schoolbook multiplication on 16-bit limbs, which is what you will see here. And it kind of looks like this.
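The radix-2^16 schoolbook idea can be sketched as follows. This is an illustrative Python model, not the assembly from the talk: it builds a 32x32-to-64-bit product out of partial products that each fit in 32 bits, which is the only kind the Cortex-M3's constant-time MUL/MLA/MLS instructions can produce.

```python
MASK16 = 0xFFFF

def mul32x32_64(a, b):
    # Split each 32-bit operand into two 16-bit limbs (radix 2^16).
    a0, a1 = a & MASK16, a >> 16
    b0, b1 = b & MASK16, b >> 16
    # Four 16x16 partial products; each result fits in 32 bits, so on the
    # M3 each maps to a constant-time MUL or MLA.
    lo  = a0 * b0
    mid = a0 * b1 + a1 * b0  # a real implementation must bound its inputs
                             # so this addition cannot overflow 32 bits
    hi  = a1 * b1
    # Recombine: a*b = hi*2^32 + mid*2^16 + lo.
    return lo + (mid << 16) + (hi << 32)
```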
One of the important things is that while you can do this for the values we used in our Dilithium implementation, it is not universal: we could only do this because the values in our Dilithium implementation were bounded by specific bounds. Using those bounds, we could make sure that no overflows would happen; especially the addition of the middle partial products is, I think, a place where an overflow can occur. OK, so that is basically how we implemented the multiplication in the end. I will also describe another trick that we thought of, but didn't end up using because it didn't actually pay off. For optimizing the performance, there are three main tricks that we used, or at least tried. First, applying the Chinese remainder theorem to split larger numbers into smaller numbers. Second, moving from an unsigned to a signed representation, compared to the previous implementations that ours was based on. And the last optimization is that in the number theoretic transform computations, we merged different layers of the NTT so that we have to do fewer loads and fewer stores. So first, applying the Chinese remainder theorem. This trick is based on the code from NTRU Prime, and the idea is as follows. We want to compute c = a * b, where c, a, and b are all polynomials with 256 coefficients. You would do that in Dilithium by first computing the number theoretic transforms of a and b, then doing the pointwise multiplication, which is fast, and in the end computing the inverse NTT of the result, which gives you the normal representation of c. The downside of this on the Cortex-M3 is that all these multiplications are modulo q, and q is 23 bits. So for these multiplications, we need a 64-bit multiplier; that is, we need to multiply two 32-bit numbers and get a 64-bit result.
And then we have to use the schoolbook method that I described earlier, or one of these big variable-time multipliers, which is actually kind of slow. So the idea is as follows. Instead of computing all the numbers modulo q, we take a Chinese remainder theorem basis with different q's that all support number theoretic transforms. For each polynomial, we take the representation modulo some smaller NTT-friendly q; for example, the Kyber q or the NewHope q. If these q's are smaller than 16 bits, we can use that fact to circumvent the really big multipliers. So what we would do is take each polynomial modulo several different smaller q's, giving multiple versions of the same polynomial. For each of these residue polynomials, we compute the NTT and do the multiplication the same way as always. And in the end, we use the Chinese remainder theorem to reconstruct the polynomial c from the CRT basis. The requirement for this to work is that all of these q's have to be NTT-friendly primes, and there are actually not a lot of values that are smaller than 16 bits and also NTT-friendly. But we managed to find some. However, because in the CRT basis we are not representing the same ring anymore—we are actually computing over the integers and not modulo the regular q—the product of these different primes has to be larger than the coefficients in c before they are reduced. For Dilithium, this would mean that we had to come up with four different q's and split each of our polynomials into four different residues. We found out that it is actually slower to do this than just using the schoolbook multiplication.
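The residue-split-and-reconstruct step can be illustrated with two of the primes mentioned above. This is a toy Python sketch, not the paper's code: it uses two primes where Dilithium would need four, and shows only the integer reconstruction, with scalars standing in for polynomial coefficients.

```python
# Two small NTT-friendly primes, both below 2^16 (Kyber's and NewHope's).
q1, q2 = 3329, 12289
M = q1 * q2  # reconstruction is exact only for results below M

def crt(r1, r2):
    # Reconstruct x mod M from x mod q1 and x mod q2 (Garner's formula).
    t = (r2 - r1) * pow(q1, -1, q2) % q2
    return (r1 + q1 * t) % M

# Multiply via the residues, then recombine.
a, b = 1234, 5678            # a * b = 7006652 < M, so this stays exact
product = crt(a * b % q1, a * b % q2)
assert product == a * b
```

The requirement stated in the talk shows up here as `M`: with four 16-bit primes instead of two, the product of the moduli is large enough to cover Dilithium's unreduced coefficients.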
But we want to stress that this is probably still useful on other platforms, and we recommend you check whether it would work for your implementation. The other performance upgrade we did is that we moved from an unsigned to a signed representation. Basically, in an unsigned implementation, every time you do a subtraction, it is possible for the subtraction to underflow. You mitigate this by adding a multiple of q every time, and every time you have to do this, you need an extra addition. Furthermore, you also have to do more reductions, because you are constantly adding these multiples of q; that means your numbers grow faster, and in the end you have to reduce more often. We found that for the Cortex-M3 and the Cortex-M4, we can easily implement all the NTTs and all the arithmetic in signed representation, so we moved to signed representation: no extra additions, and fewer reductions in the end. The last main optimization concerns the NTT, which is implemented using the fast Fourier transform algorithm. The FFT basically recurses over a binary tree, and you can compute it in different ways. If you compute it in a depth-first manner, you have to do a lot of reloads of the different primitive roots of unity, and these reloads take quite a lot of time on an embedded platform like this. If you do it breadth-first, then you're constantly loading and spilling coefficients of the polynomial that you're transforming into the NTT domain, so that's also not very nice. The hybrid approach that fixes this is to merge layers. So see here a representation of the fast Fourier transform algorithm. There are different ways to do this.
This would be the breadth-first method, where first you do the first layer, then the second layer, then the third layer. But you can also take a hybrid approach, where you do the butterfly operations over the first coefficients of a layer, then immediately do the butterfly operations for the next layer, and basically interleave those computations; that is what we call merging layers. You can merge three layers, two layers, et cetera, depending on what you think will be fastest. In our case, the number of layers you can merge depends on how high your register pressure is, and the register pressure is higher on the M3. So on the M3, we were not able to merge as many layers. But on the M4, where we could use the wide multiply instructions, and in the variable-time NTT on the M3, where we could use those same instructions, we could actually merge two more layers. So apart from performance, we also optimized memory. The first strategy we thought of starts from the question: when you're implementing one of these schemes in the wild, how would you actually do this? We thought: if you're generating a signature, then your secret key is probably static, and you only have one secret key. In the Dilithium specification, it's common to fully expand the big public matrix A that you will need, right at the beginning of the algorithm, and precompute the whole thing. This is very annoying for stack space, because you need a lot of kilobytes of stack during this operation. So what we thought is: if this A matrix is always the same, then you can just write it to flash, that is, to ROM, and reuse it from flash all the time. You don't need the RAM for that.
That's strategy one. Strategy two is basically what the Dilithium spec says: we generate A once during signing, and then we use it for a single signature. And the last one is that we stream A and y. There, it's very likely that we get a very small stack footprint, but we expect the scheme to be a lot slower. Basically, the biggest bottleneck for stack optimization is the computation w = A * y. We found that if you do mild stack optimization, then you either have to keep w completely in RAM or keep y completely in RAM. That means you always have either k or l polynomials' worth of kilobytes around. So that is the lower bound we expect if you're not willing to completely sacrifice performance. So after implementing all this, these are our results. On the M4, we measured using the SysTick timer; on the M3, we used the DWT cycle counter. We measured stack usage by filling the stack with dummy values, running the algorithm, and counting how many of these dummy values were overwritten. For the NTT, we sped up a little bit compared to the previous work, and we see that the constant-time M3 NTT is about three times as slow as the M4 implementation, while the variable-time NTT is about two times as slow. For the M4, we have these speeds and stack values compared to the previous work; these are all the numbers, you can read them here. Basically, at the time we wrote this software, we had the fastest Dilithium implementation for the Cortex-M4: a 13% to 27% speedup compared to one of the previous works, and a 14% to 20% speedup compared to the other. For the M3, we don't have anything to compare to, so I present the numbers here as is.
We see that signing takes 40% to 100% more cycles than on the M4, so that's a good guess if you want to estimate how slow the scheme would be on the Cortex-M3. But we also see that verification is only about 20% slower, and that is because in verification we don't need to use constant-time operations, so the extra overhead from not being able to use the 64-bit multiply instructions is lower. For the memory, we see that key generation and verification are always pretty cheap, but for signing we generally need 40 to 70 kilobytes of memory, depending on the Dilithium variant. And we see that if we put some of that data in flash, we can save 24 to 48 kilobytes of memory, which can be very useful. We can get signing down to around 10 kilobytes without optimizing a lot, that is, without actually compromising the performance of the scheme anywhere. And at a cost of a factor of three or four in signing time, we can actually save 40 to 60 kilobytes of RAM. So, the conclusion. We implemented the Dilithium scheme on the Cortex-M3 and Cortex-M4. We have quite fast results, but we think the memory footprint is still quite large, so there might be some research to be done to get this even smaller. We didn't take into account that there could be a hardware accelerator on the platform; we did all the Keccak evaluations in software, which is really slow. If you have a hardware accelerator, this might be really fast. And we think it's a shame that we could not use the CRT trick, because we think it could be very useful in some of these lattice schemes, so we hope to see more of that in the future. The link to the paper is on the slide, and we also have the code on GitHub. Questions are taken in the chat after the talk, and feel free to send us an email if you have any more questions.
Thank you.