Again on lattices: the second talk in the session is on implementing ring LWE-based schemes using an RSA co-processor. The speaker is Fernando Virdia, on joint work with Martin Albrecht, Christian Hanser, Andrea Höller, Thomas Pöppelmann, and Andreas Wallner. Thank you for the introduction. Can you hear me? I'll take it as a yes. So, this is work on trying to do lattice schemes on hardware that is currently available. The talk before was about how to do a very fast NTT, but that requires new hardware design, so we look at alternatives for deploying lattice-based crypto. As an overview: we'll do the obligatory slide on post-quantum crypto, then talk a little bit about deploying crypto, and then we get to the main dish, which is the ring arithmetic and how we can make it work using RSA co-processors. Then we'll look at an implementation of a scheme and at some future directions. I think some in the room have already heard about this, but Shor came up with a quantum algorithm allowing factoring and discrete logs, and this means that much of the currently deployed public-key crypto may not be viable in the mid- to long-term future. So NIST started a standardization process, and they are looking at many different schemes based on different algebraic assumptions; a very common one is assumptions over polynomial rings, so ring LWE-based assumptions or NTRU-based assumptions. Now, in practice, cryptographic schemes have two crucial requirements. Well, maybe three: they should be secure first of all, but then, if we want to actually deploy them and actually do crypto rather than just talk crypto, we need them to be performant and we need them to be easy to deploy. So, of course, optimized implementations are a very active area of research.
As part of the NIST process (this work was published during the first round, so there have been some changes in strategy since), teams were required to provide a reference implementation and strongly advised to provide an optimized implementation, with a particular focus on modern CPUs. Since then there has also been a lot of work on FPGAs and ASICs, and some work in the direction of constrained and embedded devices: for example, microcontrollers, HSMs, or smart cards. If we look at the current smart card environment, we're often dealing with a low-power 16-bit or 32-bit CPU with a very tiny amount of RAM. These are usually augmented with some specific co-processor that enables them to run public-key cryptography, because the CPU is not very fast, and so, for example, modular exponentiation can be quite expensive. The smart card that we focus on in this work is part of the SLE 78 family from Infineon: it provides a 16-bit CPU clocked at 50 megahertz, 16 kilobytes of RAM, and 500 kilobytes of non-volatile memory, and it comes with AES and SHA-256 co-processors, plus what we call the RSA co-processor, which is a co-processor for adding and multiplying mod N, where N is about 2,200 bits. I guess the obvious application of this would be to do RSA 2048. So what we are asking is: in this currently available smart card context, what would it take to run ideal lattice-based cryptography? The main operation — maybe the most expensive operation — in ring LWE-based schemes is the MULADD operation. Here we have a, b, and c; these are polynomials, modulo q and modulo some other polynomial f, which is often, for example, x^n + 1 where n is some power of two. Now, schoolbook multiplication is quite slow.
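To make the operation concrete, here is a minimal Python sketch of the MULADD operation a·b + c in Z_q[x]/(x^n + 1), computed the slow way the talk starts from. This is purely illustrative (the function name and toy parameters are mine, not from the authors' implementation):

```python
# Minimal sketch of MULADD: a*b + c in Z_q[x]/(x^n + 1), via schoolbook
# multiplication, then folding x^(k+n) = -x^k and reducing mod q.
# Polynomials are coefficient lists, lowest degree first.

def muladd_schoolbook(a, b, c, q, n):
    res = [0] * (2 * n)
    for i in range(n):
        for j in range(n):            # n^2 coefficient multiplications
            res[i + j] += a[i] * b[j]
    # fold the top half back in negated (x^n = -1), add c, reduce mod q
    return [(res[k] - res[k + n] + c[k]) % q for k in range(n)]

# toy example, n = 4, q = 17: (1 + 2x)(3 + x^2) = 3 + 6x + x^2 + 2x^3
print(muladd_schoolbook([1, 2, 0, 0], [3, 0, 1, 0], [0, 0, 0, 0], 17, 4))
# -> [3, 6, 1, 2]
```

The double loop is the quadratic cost that motivates everything that follows.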
So usually what many schemes propose is to use the number theoretic transform (NTT) to perform the multiplication. On the embedded hardware side, there has been a lot of work to design co-processors for ring LWE, and often a central component is a design for the NTT directly in hardware, so that one can provide a sort of NTT co-processor. The issue with this is that a new hardware design requires implementing, testing, and certifying across countries, and then deploying. Only recently, for example, did we start having ECC crypto on smart cards, and so the idea of having post-quantum crypto on smart cards might seem a little bit far in the future. There is also the issue that, since the competition is most likely still going to run for a few years, there is no obvious candidate for what actually gets deployed, and so there might be some lack of incentive for companies producing smart card chips to start this whole process, when it may turn out that they built the wrong-sized hardware co-processor. Our alternative approach is the following: we try to build a flexible MULADD co-processor, or gadget, by reusing the RSA co-processor that is already available on this smart card. We demonstrate it by taking Kyber as it was parametrized in the first round (there have been some changes since) and making a variant of CCA Kyber; we'll see how it differs. First of all, it doesn't use the NTT, and so there are some incompatibilities in the API. We run it on this SLE 78 platform, and we argue that we run it in a viable way, so that this technique could be a possible way of transitioning to post-quantum crypto while new hardware co-processors are still being designed. The main ingredient of all of this is Kronecker substitution, a classic technique in computer algebra for reducing multiplication of polynomials with integer coefficients to plain integer multiplication.
And so, to scare everybody with some maths, we have two columns here running more or less the same operation. On the left we have a times b, two polynomials, and we can compute a times b by, for example, schoolbook multiplication. On the right we take a and b and evaluate them at 100, so we get two integers, 102 and 304, and we just multiply these two integers to get 31,008. Then we write the result in base 100, and what we notice is that the digits in base 100 translate to the coefficients of the resulting polynomial. The reason this works is that 100 is a large enough number that multiplying in schoolbook manner — 4 times 2, 4 times x, et cetera — is the same as multiplying the two integers, and the carries of the integer multiplication are going to be small enough that they won't overflow from one digit into the next. This can also be done with signed coefficients, and with smaller integers; there are a few techniques that we use and discuss in more detail in the paper, but they're a bit tricky for a presentation. Now, this technique also works with modular reduction modulo f. Here we have a, which has been relabeled to be the result of the previous operation — it's quadratic — and we have f, our cyclotomic polynomial, x^n + 1, and we want to reduce a mod f. What we're really doing is taking an appropriate multiple of f and subtracting it, so that the highest power of the resulting polynomial is less than the degree of f. If instead, on the right, we evaluate a and f at 100, and then reduce a mod f(100) as just a classical integer modular reduction, what we're doing there is also removing some multiple of big F, which is 100 squared plus 1, such that the result is smaller than F.
And this can be seen as a parallel operation between the two techniques: the result is 1005, which is just 10x + 5 if we write it out in base 100. Now that we have these two techniques, we can combine them to compute the MULADD operation. We have a times b plus c as polynomials, mod q and mod f, and we choose t, some base that is large enough — we're going to choose a power of 2, for efficiency reasons of course. Then we take a, b, and c, evaluate them at t, also evaluate f at t, and do the operation directly over the integers: a times b plus c, mod f(t). This is more or less like working in Z mod f(t) using the RSA co-processor: before, we had big N as the RSA modulus; now we're just using f(t). This results in a polynomial that is equivalent mod q to our result once we have unpacked all the coefficients; so after this operation there are still some modular reductions to do on each coefficient of the resulting polynomial. Now the question is how to choose t such that we can actually do all of this. In the paper, we provide a lower bound and a proof of why this works for our algorithm. So, okay, let's just plug in Kyber. We have polynomials of degree 256 and some other properties of the scheme — again, as presented in round one — and it comes out that t should be 2^l, with l larger than 24.5, so we just choose, for example, 25. Okay, cool, so it seems like we're done. So what is the modulus that we get? Well, log2 of f(t) is going to be about 6,400 bits. Now we have a problem: the co-processor on this smart card was more or less designed for RSA 2048, and in particular doesn't take integers longer than 2,200 bits.
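The whole pipeline can be sketched in a few lines of Python. This is a toy model, not the authors' implementation: the single big-integer operation in the middle is the part the RSA co-processor would perform, and the signed-digit unpacking is my stand-in for the carry-handling tricks the paper describes for signed coefficients:

```python
# Toy sketch of MULADD via Kronecker substitution: pack a, b, c at t = 2^l,
# do one big-integer multiply-add mod f(t) (the RSA co-processor's job),
# then unpack and reduce each coefficient mod q.

def to_int(poly, t):
    """Evaluate the polynomial at t, i.e. pack coefficients in base t."""
    return sum(c * t**i for i, c in enumerate(poly))

def muladd_packed(a, b, c, q, n, l):
    t = 1 << l                        # packing base t = 2^l
    F = t**n + 1                      # f(t), with f = x^n + 1
    res = (to_int(a, t) * to_int(b, t) + to_int(c, t)) % F
    coeffs = []
    for _ in range(n):                # unpack base-t digits; digits above
        res, d = divmod(res, t)       # t/2 encode negative coefficients
        if d > t // 2:                # (these arise from x^n = -1 folds)
            d, res = d - t, res + 1
        coeffs.append(d)
    coeffs[0] -= res                  # any leftover t^n term folds as -1
    return [d % q for d in coeffs]

# toy inputs, n = 4, q = 17; l = 16 is comfortably above the needed bound
print(muladd_packed([1, 2, 0, 0], [3, 0, 1, 0], [0, 0, 0, 0], 17, 4, 16))
# -> [3, 6, 1, 2]
```

With Kyber's n = 256 and l = 25, the packed modulus f(2^25) comes out around 6,400 bits — which is exactly the size problem just mentioned.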
So it seems that Kronecker substitution alone won't cut it. But luckily, we can try to interpolate: we went from full polynomial multiplication to full integer multiplication, but maybe there is something in the middle. Indeed, this idea has been explored before, for example by Schönhage and by Nussbaumer, and there are multiple similar tricks in the literature. The idea is the following. Say we have a polynomial of degree less than 6: we can split it into the even powers and the odd powers, where the odd powers are shifted by an x. Then we introduce a dummy variable y, which stands in for x squared. So we have a_0 plus a_2 x squared, where x squared becomes y, and so on; and then a_1 times x, a_3 times x squared times x, and so on. We can write the same polynomial in this way. Now we can use Kronecker substitution on these y-polynomials, which are of smaller degree but have many of the same properties as the polynomial before — and these might actually fit in the co-processor. We are left with a low-degree polynomial in x with large coefficients, and we can use, for example, Karatsuba, or schoolbook, or Toom-Cook, or some other efficient polynomial multiplication technique to run the operation. At the end there is some tidying up to do, which we describe in the paper, but it's just fiddling with some extra modular reductions — nothing too worrying. Using this technique, for example, we can take the degree-256 polynomials from Kyber and reduce our operation to multiplications of degree-5 polynomials in x using Karatsuba. So we did all this work, and we have this flexible MULADD gadget, because we can split and interpolate and try different strategies. Is it worth it?
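Here is a small Python sketch of that splitting step. It is illustrative only (the function name and the degree-1 recombination are mine, and real parameters would need the paper's bounds): split into even and odd parts with y = x², Kronecker-pack each part, and multiply the two resulting degree-1 polynomials in x whose coefficients are big integers:

```python
# Sketch of the even/odd split: a(x) = a_e(x^2) + x * a_o(x^2).
# Substitute y = x^2, Kronecker-pack the two y-polynomials at base t, and
# multiply the degree-1 polynomials in x with big-integer coefficients
# (schoolbook here; Karatsuba or Toom-Cook would apply the same way).

def to_int(poly, t):
    return sum(c * t**i for i, c in enumerate(poly))

def from_int(m, t, length):
    digits = []
    for _ in range(length):
        m, d = divmod(m, t)
        digits.append(d)
    return digits

def split_mul(a, b, t):
    """Multiply polynomials a, b (non-negative coeffs) via even/odd split."""
    a0, a1 = to_int(a[0::2], t), to_int(a[1::2], t)   # packed even/odd parts
    b0, b1 = to_int(b[0::2], t), to_int(b[1::2], t)
    # (a0 + x*a1)(b0 + x*b1), remembering x^2 = y, i.e. a factor of t:
    c_even = a0 * b0 + t * a1 * b1    # even-power coefficients
    c_odd = a0 * b1 + a1 * b0         # odd-power coefficients
    n = len(a) + len(b)
    out = [0] * n
    out[0::2] = from_int(c_even, t, n // 2)
    out[1::2] = from_int(c_odd, t, n // 2)
    return out

# (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print(split_mul([1, 2, 3, 0], [4, 5, 0, 0], 1000))
# -> [4, 13, 22, 15, 0, 0, 0, 0]
```

Each packed integer holds only half the coefficients, so it is half the size — which is how the pieces are made to fit inside the co-processor's operand limit.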
Round-one Kyber makes use of Keccak for many of its random functions — for the random oracles in the CCA transform and so on. The SLE 78 doesn't currently have a Keccak hardware co-processor, so running Keccak in software is extremely slow, and it would actually make all of this completely useless. We circumvented this problem by defining all our random functions based on the AES and SHA-256 co-processors and using those instead. And we noticed that in round two, Kyber indeed introduced a new variant called Kyber-90s that does something similar, using AES and SHA-256 as the source for the random oracles so as not to depend on SHA-3 implementations. So here are some numbers. The table is much larger and much scarier in the paper, but here we are looking only at implementations on the same SLE platform, to make comparison as easy as possible. The first two lines are our work: this is the Kyber CPA encryption scheme that is used to construct the CCA KEM, and we have cycle counts for key generation, encryption, and decryption. As a ballpark figure, these numbers are more or less 130 milliseconds — we report them more precisely in the paper, but that's roughly what 6 million cycles translates into. Then we have RSA, both plain RSA and RSA using the Chinese remainder theorem, which is really how you want to compute it. Funnily enough, they don't report numbers for key generation in the spec, but they have very fast encryption, of course — RSA has a small exponent. If we look at the decryption numbers, though, the fastest RSA version takes more or less 6 million cycles, which is more or less what it takes us to do the same operation. So it seems that on the RSA smart card, we can actually be more or less as fast as RSA.
Of course, this comparison is always a little bit difficult, because these RSA implementations contain some extra security measures that our work doesn't, but still, the ballpark figures look similar. Then we have a CPA implementation of Kyber — so it should be compared to the first line — which is just Kyber in software, and it's definitely losing all around. And then we have a NewHope implementation that also uses a co-processor, but it's still much slower. As for future work, we could investigate other schemes. ThreeBears, for example, is in round two; it uses only integers, building on a similar idea from the ground up. Their integers don't fit in the SLE 78 co-processor that we had available, so we would probably need some more algebra tricks, but it could be interesting to investigate, and they may also fit on a larger co-processor. Saber, instead, is a design very similar to Kyber, but they use rounding noise, so there are fewer AES cycles to run to generate the noise for the scheme; it also uses a power-of-two q, so all those mod-q reductions that have to be done after we unpack the result can be made much faster. Then maybe there could be room for designing a scheme that fits exactly into our co-processor, without having to do the noise-bounding tricks. And of course we could look at implementing some signature scheme, say Dilithium; the issue there is that q is much larger, so it would probably be a rather different design challenge. A final idea: LWE by its very nature supports adding errors. We had the number before saying l has to be 25 bits, so we have to pack coefficients at least every 2^25. It would be very nice if it were 24, because 24 is byte-aligned; indeed, in our case we need 25, but we end up using 32, because byte alignment is so helpful.
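On the point about Saber's power-of-two modulus: with q = 2^k, the per-coefficient mod-q reductions after unpacking become single bit-masks, whereas a prime q like Kyber's needs an actual division or a Barrett/Montgomery-style reduction. A one-line illustration (2^13 is just an example value, not a parameter from the talk):

```python
# With a power-of-two modulus, reduction is a bit-mask:
# x mod 2^k == x & (2^k - 1) for non-negative x.
q = 1 << 13
coeffs = [123456789, 987654, 42]
print([c % q for c in coeffs] == [c & (q - 1) for c in coeffs])
# -> True
```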
And so maybe there is some way of bounding the carry errors — or the probability that a carry introduces a decryption error — and maybe there could be a sort of even noisier LWE-based scheme that just uses 24-bit coefficients instead of 25. But of course it's a little bit tricky, so that's left for future work. That's all I wanted to say. Thank you very much. Thank you. Are there any questions? The microphones are up front here. I think this is our next speaker. So you looked at RSA co-processors. If you had looked at an ECC smart card, those numbers are much, much smaller. How much overhead would that introduce? I mean, bouncing between the different representations would give you much worse noise bounds. Yeah, I'm not exactly sure of the numbers. Maybe a strategy could be — the co-processor is too small for Kyber as is, but maybe one could use a higher-rank module and a smaller-degree polynomial, and that could be a way. Without sitting down with Python and making some computations it's a bit hard to tell, but maybe using degree-128 polynomials could help. And the second question: you were motivating your research as a kind of easy or cheap way to get some idea of how fast it would be if it were built in hardware. How much overhead is there over a dedicated Kyber co-processor? It would certainly be slower than having a proper NTT co-processor, no question. The point is rather that currently, without having to redesign the hardware, we can compare to current cryptography and say: okay, we could probably be much faster with an NTT co-processor, but we are also not much slower than RSA on the RSA smart card. There are no further questions, so please join me in thanking the speaker again.