Hello and welcome to my talk on Rainbow on the Cortex-M4. This is joint work with Tung Chou and Bo-Yin Yang.

Let me start by introducing Rainbow. Rainbow is a NIST PQC signature finalist, and it's based on multivariate quadratic (MQ) equations. It's one of the three signature finalists in the NIST PQC competition. The other finalists are Falcon and Dilithium, and those are both based on structured lattices. So Rainbow is the only one left that's based on MQ, and as such it comes with quite different performance characteristics, and the operations in there are also quite different from the lattice-based schemes. So it's very interesting to look into how Rainbow performs compared to the other finalists.

There are also three alternative schemes. One is called GeMSS, which is also an MQ scheme, but there has been a recent attack on GeMSS, so maybe GeMSS is no longer that interesting because it's now basically broken. Then there's SPHINCS+, which is a hash-based signature scheme, and there's Picnic, which is based on zero-knowledge proofs.

What Rainbow and basically all MQ schemes are famous for is that they have big public keys and very small signatures. Due to the big public keys, they have a memory footprint that makes them quite unsuitable for microcontrollers. For example, if you look at the table of Rainbow parameters here on the slide, you see that even the lowest security level has public keys of 162 kilobytes, which might be too large for many microcontrollers. Nonetheless, we have looked into implementing Rainbow on the Cortex-M4. Our paper is available on ePrint, and our code is completely open source and available on GitHub, so please have a look and try it out.

Let's dive a little bit more into Rainbow. Rainbow comes in three variants. The first one is called classic.
So that's the standard Rainbow scheme. The second is called CZ, and previously it was called cyclic. The idea here is that we sample part of the public key from a seed, which makes the public key quite a bit smaller, but also leads to much slower verification, because in verification you first have to sample that part of the public key, which is quite expensive. The third variant is compressed, which is basically CZ plus sampling the entire secret key from a seed. This leads to a tiny secret key, basically just a seed, but also to much slower signing, because we first have to sample the entire secret key.

If we look at the parameter sets here, we see that at each security level we have each of these variants. For example, at the first security level, CZ has a 60-kilobyte public key compared to the 162 kilobytes of classic, so it has a significantly smaller public key. But you can see in the paper that the cost of this is quite huge for verification.
So it really depends on what you're optimizing for.

Let's talk a little bit about the platform we are using. You will see that most PQC papers use the same board, and that's the STM32F407 Discovery board. For basically all the other papers in this session, you will see that they are using this board, and it is also the one used by the PQC framework pqm4. This board comes with one megabyte of flash and 128 kilobytes of RAM, and you immediately see a problem for us: 128 kilobytes of RAM is not even enough to fit the public key of Rainbow classic. To solve this there are multiple alternatives, and I'll talk a little bit about that later. We went for the easy solution here and just bought a board with more RAM. So we are instead using the EFM32GG11B, from a series that is also called Giant Gecko. This comes with two megabytes of flash, but more importantly, we have 512 kilobytes of RAM, so this allows us to actually put the keys in RAM, and that's a lot easier.

What's maybe also interesting to some is that it comes with a crypto accelerator, which supports SHA-2, AES, and also some ECC operations. This might be useful for some PQC schemes. For Rainbow it's very useful, because it's using SHA-2 and AES. However, in this talk I will not talk about the results when using this crypto accelerator, to keep the results a little bit more comparable to other boards. In the paper, we also report the results when using this accelerator. What's also important to mention here is that this core produces cycle counts quite comparable to the STM32. When measuring pqm4, we saw less than a percent difference, so that's quite comparable.

If you don't want to use this EFM32 board, there's also an STM32 Nucleo board, which has quite a bit more RAM than the other one.
It comes with two megabytes of flash as well and 640 kilobytes of RAM, so this is also something you could use for implementing Rainbow, and this board is now also supported by pqm4. So if someone wants to try it out, that's a nice board as well. However, even with 512 or 640 kilobytes of RAM, still only Rainbow I is feasible. Rainbow III and Rainbow V are a little bit out of reach with this amount of RAM. So in this talk and also in the paper we focus on Rainbow I.

Okay, so let's dive a little bit more into the details of Rainbow, and I'll briefly describe it here. We first need to pick parameters n and m, where n is the number of variables and m is the number of equations. For MQ signatures, n is always larger than m. For example, Rainbow I is using n = 100 and m = 64, so 100 variables in the equations and 64 equations in total. Then we have to pick a finite field. This is F16 for Rainbow I, and F256 for Rainbow III and Rainbow V, so for this talk only F16 is relevant.

For key generation, we first need to sample two invertible linear transformations T and S, where T maps m elements to m elements and S maps n elements to n elements. Then we need to sample an invertible quadratic central map Q mapping n elements to m elements. From these three maps we can compute the public key P as the composition of T, Q, and S, so the resulting map will be from n elements to m elements. The private key basically consists of the individual maps T, Q, and S.

Signing works as follows. We first compute the digest of the message, call that w here, which consists of m field elements. Then we map this w, using the inverse of T, the inverse of Q, and the inverse of S, to a signature z, which consists of n elements. This can then be used in verification, where we again
compute the digest of the message, then apply the public map to the signature z and get some w'. We check if this w' is equal to the digest of the message w, and if that's the case, the signature is valid.

What's most interesting here is the central map and how it's implemented in Rainbow. It is defined as basically two layers of UOV. So we have two sets of equations, where the first set consists of 32 equations and the second one also consists of 32. You see here in the equations that in the first set only the first v1 + o1 variables are used, whereas in the second set all of the variables are used. The way we solve this in practice is the following. Remember, we are given x and we are trying to find y. What we do is pick the first v1 variables at random, which makes the first set of equations a linear system of equations. Then we can solve this set of equations for the remaining o1 variables. Then we can plug all of this into the second set of equations, which then is again a linear set of equations, and we can solve that to obtain the remaining variables. So we see that for Rainbow, this is two layers.
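
The scheme and the two-layer central map described above can be written down compactly. The following is a sketch in the usual Rainbow notation, not a copy of the slide: the coefficient names \alpha and \beta and the index k are introduced here for illustration, linear and constant terms are left out (assuming homogeneous quadratic maps), and v_1 = 36 is derived from n = 100 and the two sets of 32 equations, assuming n = v_1 + o_1 + o_2.

```latex
% Public map, signing, and verification
\[
P = T \circ Q \circ S : \mathbb{F}_{16}^{\,n} \to \mathbb{F}_{16}^{\,m},
\qquad
z = S^{-1}\!\bigl(Q^{-1}(T^{-1}(w))\bigr),
\qquad
\text{accept iff } P(z) = w .
\]

% Two-layer central map Q = (q_1, \dots, q_m), with
% n = v_1 + o_1 + o_2 = 36 + 32 + 32 and m = o_1 + o_2 = 64 for Rainbow I
\[
\text{Layer 1 } (1 \le k \le o_1):\quad
q_k(y) = \sum_{1 \le i \le j \le v_1} \alpha^{(k)}_{ij}\, y_i y_j
       + \sum_{i=1}^{v_1} \; \sum_{j=v_1+1}^{v_1+o_1} \beta^{(k)}_{ij}\, y_i y_j
\]
\[
\text{Layer 2 } (o_1 < k \le o_1 + o_2):\quad
q_k(y) = \sum_{1 \le i \le j \le v_1+o_1} \alpha^{(k)}_{ij}\, y_i y_j
       + \sum_{i=1}^{v_1+o_1} \; \sum_{j=v_1+o_1+1}^{n} \beta^{(k)}_{ij}\, y_i y_j
\]
```

Neither layer contains a product of two of its own "oil" variables (the y_j in the second sum). Once y_1, ..., y_{v_1} are fixed, layer 1 is therefore 32 linear equations in the 32 unknowns y_{v_1+1}, ..., y_{v_1+o_1}, and with those known, layer 2 is again 32 linear equations in the last 32 unknowns.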
One could in theory also construct this with more than two layers, but in the NIST submission this is not done.

Okay, so this gives us roughly a list of things we need to implement to get a fast implementation of Rainbow. The first one is field multiplication, which is used everywhere: it's used in keygen, signing, and verification. In some places this needs to be constant time. Mostly in signing it needs to be constant time, but in other places, mostly in verification, it can be non-constant time because it's only operating on public data. Then we need efficient linear equation solving, which is used in signing for inverting the central map. We can either implement this using matrix inversion followed by multiplication, or we can directly solve the equations. Here it's important that this is constant time, because this is operating on secret data depending on the secret key. The third one is that we need to evaluate the public map P, and that's basically the only operation in verification besides hashing. So if you speed up the evaluation of the public map, you directly speed up verification. And since that's verification, the runtime here may depend on the signature or the public key in most use cases.

Let's start with the finite field multiplication. Here's a quick overview. In Rainbow, F16 is defined as a tower field, so we represent an F16 element with two elements in F4, each of which is again represented by two elements in F2. In the end, each F16 element is represented by four bits. As almost always in Rainbow, what we are doing is multiplying a large vector by a scalar. The easiest way to do this is using lookup tables, and we have basically two choices here. We can either do one multiplication per lookup, which will take us at least one cycle for the index computation, then at least one cycle for the fetch, and then another cycle for packing it back.
So this will take at least three cycles per multiplication. Or we could use a somewhat larger lookup table and do two multiplications in parallel, which basically halves this, so then we need one and a half cycles per multiplication. But what's important here is that Cortex-M4 cores may have a cache, so the lookup approach should really only be used on public data. Most of the time, when it needs to be constant time, what we will instead be doing is bitslicing the entire operation. We bitslice the F16 elements into four registers, and as we'll see on the next slides, this will then take 32 cycles for 32 multiplications, excluding the bitslicing. So if we ignore the bitslicing, we need one cycle per multiplication.

A little bit more detail on what we will be doing: we're given elements a and b and want to find the product. We can represent the elements using their bits, and the bits are called e_i here. Given the bits of a and the bits of b, we can express the multiplication as logical bitwise operations and express the product in this way. Here the dots are logical ANDs and the pluses are XORs, and if we have the elements bitsliced into registers, we can directly implement these ANDs and XORs, and that will give us a multiplication.

To understand how we can implement this on ARM, I need to introduce one feature that's very useful, and that's conditional execution. ARM allows you to execute a block of up to four instructions conditionally on a flag: if the flag is set, then the instruction is executed, and otherwise it's not. What's important here is what happens if the condition is not satisfied. In that case the instruction has no effect, so it doesn't actually write anything back to the registers.
It will still take one cycle, though. That means that if these are just logical operations and not branch instructions, then this will be constant time, so we can actually use this for secret data. The IT instructions, or if-then instructions, are used to encode which of the following instructions are in the then branch and which are in the else branch.

Okay, let me give you one example here. We can compare r0 to 17, and then we can have an if-then-else on equal, where in the then branch we do an add of r2, and in the not-equal branch we do an add of r3. So depending on whether r0 was equal to 17, we will be adding either r2 or r3. You can also do this with more instructions, and the IT instruction will then encode which instructions are in which branch. In this case, all of the instructions are in the then branch. Here we do a test on r0, which is basically checking if the second bit of r0 is set, and if that's the case, so the result is not equal to zero, then all of these four instructions will have an effect.
Otherwise, they will not have any effect.

Okay, so we can now use these instructions to implement finite field multiplication, in this case with accumulation, because that's what you mostly need. Let's assume we have 32 elements a_i bitsliced into four registers, and we have one element b that we want to multiply the vector by, and that's in the least significant nibble of a register. We also input an accumulator c to which we want to add the products. What the function does is multiply all of the elements a_i by the element b and add them to c_i. We can implement this using four conditional execution blocks that, conditional on the bits of b, add to the accumulator or to some temporary registers. We see that this is 32 instructions, which gives us the 32-cycle multiplication.

That's the finite field arithmetic, and now I'm moving on to verification. What this is doing is basically applying the public map to z and then checking if the result is equal to the digest of the message. The way that this works is that we have this formula where we plug in the elements of the signature z_i and multiply them by this matrix A_i, and in total we have m of these matrices. We see that this matrix is half zeros. What we actually get in the public key is m of these matrices, and each matrix is always one row of the public key. The way it's stored is actually in column-major form, so we first get the first element of all m matrices, then the second element of all m matrices, and so on.

We can easily see how this will usually be implemented. We multiply z_0 by z_0, multiply the product by the first column of the public key, and add this to the accumulator. Then we move on: multiply z_0 by z_1, multiply it by the second column of the public key, and add it into the accumulator. So that's the standard way to do this. That's the
approach that I just described: you multiply z_i by z_j, multiply the result by one column of the public key, and accumulate the result in w. However, we use a different approach here. Instead of multiplying each column by z_i z_j and then accumulating in w, we have 15 accumulators, one for each possible value of z_i z_j except for zero, because multiplying by zero gives zero anyway, so we can throw that away immediately. Then, depending on the product z_i z_j, we add the column into the corresponding accumulator, and in the very end we do the multiplications. We have some other tricks in here as well. One is that for the F16 multiplication we can now use lookup tables, because this is actually faster for scalar-times-scalar multiplication. Another trick is that if either z_i or z_j is zero, we can skip the corresponding columns, which gives us quite a bit of speed-up.

Looking at the results, you see that this is much faster than the previous state of the art. I also need to mention here that the previous implementation was actually a round-2 Rainbow implementation, which has a smaller parameter set, and that this previous implementation was using lookup tables, so it is actually not secure in case you have a cache. We see in the plot here that we outperform this implementation by 2x for signing and, even more significantly, by 7x for verification. We can also compare the results to the other finalists, and there we see that Rainbow is by far the fastest of the NIST PQC signature finalists on the Cortex-M4: signing is four times faster than Dilithium and 45 times faster than Falcon, and verification is five times faster than Dilithium and two times faster than Falcon.

Then we've done some more tricks to make it a little bit faster. One is the precomputation of the bitslicing, and the other one is that we tried out an alternative field representation to see what this
would change, because it allows a little bit faster bitsliced multiplication.

Let me give a little bit of detail here, first for the precomputation. We know that the secret key is input to multiplications in signing, so at some point we need to bitslice it. Option one, and that's what we did in the results that I just presented, is to bitslice it as a part of signing. Option two is to bitslice ahead of time. For example, we could just do the bitslicing in keygen. Of course, option two conflicts a little bit with the compatibility of this secret key with non-bitsliced implementations, but in most cases that's probably fine if the secret key is not portable. We'll see that this saves up to 20 percent in signing time. Note that a similar approach could be used for the public key in verification, but there are only a couple of multiplications in verification, so the gain is negligible, and of course for the public key the compatibility issues are a much larger concern. So we don't do this for verification.

Then the other thing we did is try a different F16 representation. The spec actually forces you to use the tower field representation, because that's faster on some platforms, as you can use Karatsuba when multiplying the F4 elements. But one could also think of using a direct representation rather than the tower field, and we see that this actually results in faster bitsliced multiplication, which is about five times faster than with the tower field representation. One important note here is that this is actually incompatible with the Rainbow spec: since the conversion would be way too expensive, you would have to change the spec to actually sample everything in the different representation. And so this is incompatible.
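
To make the bitslicing discussed in this part of the talk concrete, here is a minimal sketch in portable C of the tower-field multiplication and of a bitsliced multiply-accumulate over 32 elements. It is not the code from the paper. The function names are made up for illustration, the tower is assumed to be F4 = F2[x]/(x^2 + x + 1) and F16 = F4[y]/(y^2 + y + x), and all-ones/all-zeros masks stand in for the conditional execution blocks, since those are not expressible in C. `bitslice32` is the step that option two would move into keygen.

```c
#include <stdint.h>

/* Scalar reference in the ASSUMED tower: F4 = F2[x]/(x^2+x+1), x = 2. */
static uint8_t gf4_mul(uint8_t a, uint8_t b) {
    uint8_t a0 = a & 1, a1 = (a >> 1) & 1, b0 = b & 1, b1 = (b >> 1) & 1;
    uint8_t hi = a1 & b1;                      /* x^2 reduces to x + 1 */
    return (uint8_t)(((a0 & b0) ^ hi) | ((((a0 & b1) ^ (a1 & b0)) ^ hi) << 1));
}

/* F16 = F4[y]/(y^2+y+x): low two bits = F4 constant term, high two = y term. */
static uint8_t gf16_mul(uint8_t a, uint8_t b) {
    uint8_t a0 = a & 3, a1 = (a >> 2) & 3, b0 = b & 3, b1 = (b >> 2) & 3;
    uint8_t lo  = gf4_mul(a0, b0);
    uint8_t hi  = gf4_mul(a1, b1);             /* y^2 reduces to y + x */
    uint8_t mid = gf4_mul(a0, b1) ^ gf4_mul(a1, b0);
    return (uint8_t)((lo ^ gf4_mul(hi, 2)) | ((mid ^ hi) << 2));
}

/* Bitslice 32 nibbles: bit j of out[k] is bit k of in[j]. */
static void bitslice32(uint32_t out[4], const uint8_t in[32]) {
    for (int k = 0; k < 4; k++) {
        out[k] = 0;
        for (int j = 0; j < 32; j++)
            out[k] |= (uint32_t)((in[j] >> k) & 1u) << j;
    }
}

/* Inverse of bitslice32. */
static void unbitslice32(uint8_t out[32], const uint32_t in[4]) {
    for (int j = 0; j < 32; j++) {
        out[j] = 0;
        for (int k = 0; k < 4; k++)
            out[j] |= (uint8_t)(((in[k] >> j) & 1u) << k);
    }
}

/* acc[i] += a[i] * b for 32 bitsliced elements, using only AND and XOR.
 * One block per bit of b; each block adds a * (1, x, y, x*y) under a mask. */
static void gf16_madd_bitsliced(uint32_t acc[4], const uint32_t a[4], uint8_t b) {
    uint32_t m0 = 0u - (uint32_t)(b & 1u);
    uint32_t m1 = 0u - (uint32_t)((b >> 1) & 1u);
    uint32_t m2 = 0u - (uint32_t)((b >> 2) & 1u);
    uint32_t m3 = 0u - (uint32_t)((b >> 3) & 1u);
    uint32_t a01 = a[0] ^ a[1], a23 = a[2] ^ a[3];
    uint32_t a02 = a[0] ^ a[2], a13 = a[1] ^ a[3];

    /* bit 0 of b: a * 1 */
    acc[0] ^= a[0] & m0; acc[1] ^= a[1] & m0; acc[2] ^= a[2] & m0; acc[3] ^= a[3] & m0;
    /* bit 1 of b: a * x */
    acc[0] ^= a[1] & m1; acc[1] ^= a01 & m1;  acc[2] ^= a[3] & m1; acc[3] ^= a23 & m1;
    /* bit 2 of b: a * y */
    acc[0] ^= a[3] & m2; acc[1] ^= a23 & m2;  acc[2] ^= a02 & m2;  acc[3] ^= a13 & m2;
    /* bit 3 of b: a * x*y */
    acc[0] ^= a23 & m3;  acc[1] ^= a[2] & m3; acc[2] ^= a13 & m3;  acc[3] ^= (a01 ^ a23) & m3;
}
```

The four blocks mirror the structure described in the talk, one block per bit of b. The sketch makes no attempt to match the 32-instruction count of the assembly version.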
So let's look at the results. We see that both of these things don't really change the verification time. Signing becomes about 20 percent faster when using precomputation, and about five percent faster if we use the direct F16 representation.

So let me conclude this talk. We showed in this work that Rainbow can actually run on a small microcontroller like a Cortex-M4. In our experiments, we used the Giant Gecko with 512 kilobytes of RAM. One could also use an STM32 Nucleo with 640 kilobytes of RAM. Our implementation is now actually merged into pqm4, again for use with the Nucleo board. Rainbow is by far the fastest NIST PQC signature finalist on the Cortex-M4.

Let me also discuss some different approaches that we can use if we don't want to buy more RAM. One alternative approach would be to stream in the public key. This was done in recent work by Gonzalez et al., where they implemented verification of post-quantum signatures in less than 8 kilobytes of RAM, and this work actually includes Rainbow, so that would be another path you could go. Yet another approach would be to store the keys in flash memory. That's the approach used in a recent paper by Chen and Chou, which is also presented at CHES in this session. They implement Classic McEliece, and a similar approach could also be used for Rainbow: write the keys to flash in keygen and then use them from there.

Yeah, and that's all. Thank you very much for your attention.
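
For readers following along in text, the accumulator trick from the verification part of the talk can be sketched in C as follows. This is an illustration under simplifying assumptions, not the paper's implementation: each F16 element occupies one byte instead of a packed nibble, the public key is an array of columns (one column of m coefficients per monomial z_i z_j with i <= j), the tower field is assumed to be F4 = F2[x]/(x^2 + x + 1) and F16 = F4[y]/(y^2 + y + x), and the names `eval_standard`, `eval_accumulators`, and `MAX_M` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define MAX_M 64  /* upper bound on the number of equations in this sketch */

/* F4 and F16 multiplication in the ASSUMED tower representation. */
static uint8_t gf4_mul(uint8_t a, uint8_t b) {
    uint8_t a0 = a & 1, a1 = (a >> 1) & 1, b0 = b & 1, b1 = (b >> 1) & 1;
    uint8_t hi = a1 & b1;
    return (uint8_t)(((a0 & b0) ^ hi) | ((((a0 & b1) ^ (a1 & b0)) ^ hi) << 1));
}

static uint8_t gf16_mul(uint8_t a, uint8_t b) {
    uint8_t a0 = a & 3, a1 = (a >> 2) & 3, b0 = b & 3, b1 = (b >> 2) & 3;
    uint8_t lo  = gf4_mul(a0, b0);
    uint8_t hi  = gf4_mul(a1, b1);
    uint8_t mid = gf4_mul(a0, b1) ^ gf4_mul(a1, b0);
    return (uint8_t)((lo ^ gf4_mul(hi, 2)) | ((mid ^ hi) << 2));
}

/* Standard approach: one scalar-times-column multiplication per monomial. */
static void eval_standard(uint8_t *w, const uint8_t *pk, const uint8_t *z,
                          int n, int m) {
    const uint8_t *col = pk;
    memset(w, 0, (size_t)m);
    for (int i = 0; i < n; i++) {
        for (int j = i; j < n; j++, col += m) {
            uint8_t s = gf16_mul(z[i], z[j]);
            for (int k = 0; k < m; k++)
                w[k] ^= gf16_mul(col[k], s);
        }
    }
}

/* Accumulator approach: XOR each column into the accumulator selected by
 * z_i * z_j, skip columns where z_i or z_j is zero, and do the 15
 * scalar-times-vector multiplications once at the very end.
 * Requires m <= MAX_M. The runtime depends on z, which is acceptable here
 * only because verification handles public data. */
static void eval_accumulators(uint8_t *w, const uint8_t *pk, const uint8_t *z,
                              int n, int m) {
    uint8_t acc[16][MAX_M];
    const uint8_t *col = pk;
    memset(acc, 0, sizeof acc);
    for (int i = 0; i < n; i++) {
        if (z[i] == 0) {                    /* whole row of monomials vanishes */
            col += (size_t)(n - i) * (size_t)m;
            continue;
        }
        for (int j = i; j < n; j++, col += m) {
            if (z[j] == 0)
                continue;                   /* the loop increment still advances col */
            uint8_t *a = acc[gf16_mul(z[i], z[j])];
            for (int k = 0; k < m; k++)
                a[k] ^= col[k];
        }
    }
    memset(w, 0, (size_t)m);
    for (int s = 1; s < 16; s++)
        for (int k = 0; k < m; k++)
            w[k] ^= gf16_mul(acc[s][k], (uint8_t)s);
}
```

Both functions compute the same vector w. The second one replaces one field multiplication per column entry with a plain XOR and leaves only 15 vector multiplications at the end, which is where the speed-up described in the talk comes from.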