All right, then the second talk is titled "CRYSTALS-Dilithium: A Lattice-Based Digital Signature Scheme". This is one of the submissions to the NIST post-quantum call for proposals. It's by Léo Ducas, Eike Kiltz, Tancrède Lepoint, Vadim Lyubashevsky, Peter Schwabe, Gregor Seiler and Damien Stehlé, and Gregor will give the talk.

Yeah, thank you very much for the introduction. So Dilithium is the signature scheme of the CRYSTALS package, and to give you some overview: we have submitted the scheme, as was already mentioned, to the NIST PQC standardization process, and as such it is actually one out of five lattice-based signatures which were submitted to the first round. One of the most important features of Dilithium is that it's actually quite short, both with regard to public keys and to signatures: public keys are about 1.5 kilobytes large and signatures 2.7 kilobytes. The scheme falls into the family of schemes which use the Fiat-Shamir with aborts technique. This was invented by Lyubashevsky in 2009, and on top of this the scheme also includes an important improvement, developed in a series of papers in 2012 and 2014, where you basically only send half of the data in the signature and therefore get a signature compression which makes signatures 50 percent smaller. Beyond this there are a couple of new things in Dilithium. The most important is that Dilithium now also has a public key compression.
Compared to earlier schemes, public keys are about 60 percent smaller, with a small cost on the side of the signatures: they grow by about 100 bytes because of this. Also new in Dilithium is that the hardness is based neither on the Ring-LWE assumption nor on the plain LWE assumption, but on something which basically interpolates between the two and is called Module-LWE. And third, we have produced a very efficient implementation of the scheme, and I'm going to talk about this later.

So what were the main design goals when designing Dilithium? First, the scheme is supposed to be really easy to implement, and for this reason there's no Gaussian sampling whatsoever. The reason is that we think it's very difficult to get Gaussian sampling correct, and even more difficult to implement it in constant time, and therefore we completely omitted it. The second goal was to really get the total size of public keys plus signatures small, and if you take this as your metric, then Dilithium really is one of the shortest signatures in the first round of the NIST process. There's actually only one signature which is considerably smaller, and this is Falcon, which uses Gaussian sampling. Then, parameters are chosen very conservatively so that there's headroom for future cryptanalytic developments. And Dilithium features a somewhat modular design, which comes mainly from the use of this Module-LWE assumption. To explain this a bit: in Module-LWE, different to Ring-LWE, the secret of a Module-LWE sample is a whole vector over the polynomial ring, and you can easily adjust the security by choosing the length of this vector, but the underlying ring always stays the same. This means that for all security levels, and possibly even for future adaptations of the security parameters, you will always stay over the same ring, and the arithmetic in this ring
you only need to implement once. So you can really optimize it, and then you're done with it for all times, basically.

So how does one choose this ring? The idea is basically to choose the smallest ring that gives you all the advantages of Ring-LWE, and in the case of Fiat-Shamir signature schemes, what you need is that there's a large set of small polynomials, the so-called challenge polynomials. If your dimension is 256 then you have this, so this is what we did: we choose a 256-dimensional cyclotomic ring. Then, to make the choice complete, you need to decide on a modulus q, and because we want to have NTT-based multiplication, the modulus is NTT-friendly, and it is on the order of 2^23. So this is the ring we use.

With this introduction I can now give a very short idea of how the scheme works. This is a simplified version; actually it is the Bai-Galbraith scheme from 2014, which differs from Dilithium mainly in that there is no public key compression. So how does this work? For key generation you essentially just sample an LWE sample: you pick a matrix A and two short vectors s1 and s2, and compute t = A·s1 + s2. So t is an LWE vector, and you put A and t into your public key. Signing works as follows. Signing is basically a Fiat-Shamir transform of a Sigma protocol. What you do is pick a short vector y, compute w = A·y, and then put this into a hash function to get the challenge polynomial c. But because we have the signature compression, you only take the high bits of w.
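The "high bits" splitting used here — and, later in the talk, its power-of-two variant for the public key — can be sketched as follows. The parameter alpha and the helper names are ours for illustration only; the real scheme additionally handles a wrap-around corner case at the top of the range, which this sketch ignores.

```python
# Toy sketch (ours, not the reference code) of the two high/low splittings the
# scheme relies on: decompose() is the HighBits/LowBits split applied to w,
# with alpha an even divisor of q - 1; power2round() is the power-of-two
# variant used later to split the public key vector t into (t1, t0).
q = 8380417                    # Dilithium's modulus, 2^23 - 2^13 + 1

def decompose(r, alpha=(q - 1) // 44):
    """Split r mod q as r = high * alpha + low with low in (-alpha/2, alpha/2]."""
    r = r % q
    low = r % alpha
    if low > alpha // 2:
        low -= alpha           # centre the low part
    return (r - low) // alpha, low

def high_bits(r, alpha=(q - 1) // 44):
    return decompose(r, alpha)[0]

def power2round(r, d=14):
    """Split r mod q as r = t1 * 2**d + t0 with t0 in (-2**(d-1), 2**(d-1)]."""
    r = r % q
    t0 = r % (1 << d)
    if t0 > 1 << (d - 1):
        t0 -= 1 << d           # centre the low part
    return (r - t0) >> d, t0
```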
This is what HighBits(w) is supposed to mean. And with the c that you get, you compute z = y + c·s1, and then there's the important rejection sampling step, where you need to reject if either the z vector you got or w − c·s2 would reveal secret information. The second rejection step is also needed for correctness, because in verification what you want to do is basically recompute this challenge polynomial. What you would need for this is the high part of w, but what you actually have is only something which is equivalent to w − c·s2. Because of this second rejection condition, though, the high part of w − c·s2 is actually equal to the high part of w, so this is correct.

Now I can explain how the new public key compression works on top of this. Basically, the idea is that you don't put the whole MLWE vector t into your public key; instead you decompose it as well, into a high and a low part t1 and t0, and only put t1 into the public key. Then the question is how you can still verify signatures if you only have t1. For this, remember that you need to compute the high part of A·z − c·t. But now you only have t1, so you can only compute A·z − c·t1·2^14. To overcome this problem, the idea is basically to add carries from the addition of the missing −c·t0 into the signature, so that in verification you can correct the term and get the correct high part of the vector that you need. So this is basically the idea.

Now, security-wise, there is a tight reduction, even in the quantum random oracle model, to Module-LWE, Module-SIS and a new assumption which is called SelfTargetMSIS. SelfTargetMSIS is basically a convolution of MSIS and a hash function, and this is also the reason why we think it's secure.
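Before going on, the key generation, signing and verification flow described above can be collected into one toy sketch. Everything here is ours for illustration: insecure toy parameters, plain integer vectors instead of ring elements, and a shrunken challenge space; it only mirrors the structure of the Bai-Galbraith scheme just described, without the public key compression.

```python
# Toy, insecure sketch of the Bai-Galbraith / Dilithium signing flow, with
# integers mod q standing in for ring elements.  All parameter values are
# made up for illustration; do not read them as the real scheme's parameters.
import hashlib, random

q = 8380417              # Dilithium's modulus
k = 4                    # toy dimension: vectors of scalars, not polynomials
eta = 5                  # secret coefficients lie in [-eta, eta]
gamma1 = 1 << 17         # masking range for y
gamma2 = (q - 1) // 88   # high/low decomposition parameter
beta = 64                # >= max |c * s_i|, slack for the rejection bounds

def decompose(r):
    r = r % q
    low = r % (2 * gamma2)
    if low > gamma2:
        low -= 2 * gamma2
    return (r - low) // (2 * gamma2), low    # (high, low)

def high_bits(vec):
    return [decompose(r)[0] for r in vec]

def challenge(w1, msg):
    h = hashlib.shake_128(repr(w1).encode() + msg).digest(2)
    return int.from_bytes(h, "little") % 11 - 5   # tiny challenge in [-5, 5]

def keygen(rng):
    A = [[rng.randrange(q) for _ in range(k)] for _ in range(k)]
    s1 = [rng.randint(-eta, eta) for _ in range(k)]
    s2 = [rng.randint(-eta, eta) for _ in range(k)]
    t = [(sum(A[i][j] * s1[j] for j in range(k)) + s2[i]) % q
         for i in range(k)]
    return (A, t), (s1, s2)

def sign(sk, pk, msg, rng):
    (A, _), (s1, s2) = pk, sk
    while True:                                   # rejection sampling loop
        y = [rng.randint(-gamma1 + 1, gamma1) for _ in range(k)]
        w = [sum(A[i][j] * y[j] for j in range(k)) % q for i in range(k)]
        c = challenge(high_bits(w), msg)
        z = [y[i] + c * s1[i] for i in range(k)]
        if max(map(abs, z)) >= gamma1 - beta:
            continue                              # z would leak s1
        if any(abs(decompose((w[i] - c * s2[i]) % q)[1]) >= gamma2 - beta
               for i in range(k)):
            continue                              # needed for correctness
        return z, c

def verify(pk, msg, sig):
    A, t = pk
    z, c = sig
    if max(map(abs, z)) >= gamma1 - beta:
        return False
    w_approx = [(sum(A[i][j] * z[j] for j in range(k)) - c * t[i]) % q
                for i in range(k)]                # equals w - c*s2 mod q
    return challenge(high_bits(w_approx), msg) == c
```

The two `continue` branches are exactly the two rejection conditions from the talk: the first protects s1, the second makes the HighBits recomputation in `verify` come out right.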
Since the hash function doesn't share any algebraic property with MSIS, it seems unlikely that you can solve this more easily than by solving some MSIS problem. But if you don't believe this, then you can actually go all the way down to Module-SIS, but only in the classical random oracle model and with a standard forking lemma argument.

Okay, with this I want to turn to the implementation now. We have produced two implementations: a reference implementation in plain C and an AVX2-optimized implementation. You can find both of them in our GitHub repository. In general, the operations which are most important for the speed of the final scheme are, first, polynomial multiplication in the ring — I will say more about this later — and secondly, expansion of the SHAKE XOF. SHAKE is used everywhere in the scheme: to sample polynomials, to expand the matrix which appears in MLWE, and so on. We actually designed the use of SHAKE in such a way that if you need to sample a vector of polynomials, then every polynomial is sampled with a fresh input to SHAKE, so this is very easily parallelizable; we also exploit this in our AVX2-optimized implementation, but I'll come to this later. Our implementations are constant-time, so they are protected against timing side-channel attacks; for example, we never use the C modulo operator to compute any modular reductions or anything like this. Maybe one note here about constant time for Fiat-Shamir signatures with rejection sampling.
This is a bit subtle. As one example, the sampling of the challenge polynomials in our implementation is not constant-time, although at first you might think it needs to be. From the timing of the implementation you can, for example, learn the challenges even for iterations of the rejection sampling loop which don't end up in the signature, which at first sight would reveal secret information. But in the case of the challenges there's this hash function in between: all you get is an output of a hash function, and there's enough entropy at its input, so this is actually safe.

Yeah, so with this I want to show you the speed of our reference implementation. All these numbers are cycle counts on a Skylake processor, and I invite you to look at the signing column. There you see that really multiplication and the use of SHAKE make up by far the most time, and this is why I want to say something about NTT multiplication now. I said in the beginning that we picked our ring to be NTT-friendly, and I want to argue now that, at least for Fiat-Shamir signatures based on Module-LWE, this is the right choice. The reason is not so much that NTT multiplication is fast, but that it's very easy to save a lot of NTTs. To explain this: if you count multiplications in Dilithium, you find that to sign a message you need to do, on average, about 224 polynomial multiplications. If you did this naively with NTTs, which means three NTTs per multiplication, you would end up doing 672 NTTs — just 224 times three.
So this is just 224 times three But as I said, you can save a lot of them and the the most important saving comes from directly sampling the matrix a which appears in m lwe in the entity domain representation, but There are other other possible savings because in the in the loop over the rejection sampling There are certain polynomials that you need to multiply with which stay constant So you can transform them outside of the loop and in the end it turns out that we only do 172 entities So this means You immediately very easily get a four times speed up over karetzuba multiplication from this And I want to note here that although we have this Speed up also in the reference implementation entities still make up for the most time consuming operation. So this is really Something which ends up to be important in in practice Okay, with this I come to the avx2 optimized Implementation so this differs in four ways from our reference implementation So first we have a very fast vectorized entity in assembly language And I will talk about this a bit later Also, we use a four-way parallel shake for the expansion of the of the matrix in mlwe and for sampling polynomial vectors and then there are two new things which actually are from last week basically so in the So the the the public key and signature compression is now done somewhat more cleverly and we also have a fast assembly modular reduction now and with these For improvements our avx2 version is about 3.5 times faster For signing messages And as I said this recent update is really important So it is also now faster about by about 40 percent then then the numbers we have announced in the chess paper So yeah, then let me Somehow come to an end with this talk to talk Oh, yeah a bit about entity or implement here our entity implementation. 
So The the basically the the prior state of the art before deletion for fast entities, which are used in lettuce cryptography They they were based on floating point arithmetic So for example, the new hope entity is using floating point arithmetic and what is new in deletion is that we have Come up with a really fast approach that uses integer arithmetic and then the question is how do you How do you do modular reductions and it actually does the same Montgomery reduction strategy as as the reference implementation There's there's a small caveat to this. So unfortunately our deletion entity is not as as fast as the kaiba entity um so Kaiba uses a prime which is on a modulus q which is below 16 bits. So you expect a factor of two speed up when using these Small smaller integers because you can pack your vectors Double the number of integers, but there's another factor of two Speed up in in kaiba on top of this okay The kaiba entity is about four times faster than deletion entity and the reason why we cannot use the same tricks in deletion is because there is no um instruction to only compute the high part of 32 bit integers in The inter-instruction set. So yeah, you see the numbers, but yeah, so our and from them you see that our Deletion entity ends up to be about two times faster still compared to the floating point approach before and i've also listed as a last column the the multiplication implementation of sabre because this is also 256 dimensional ring that they use, but they don't really use Entity friendly prime. So they need to resort to tom kuk multiplication and you see that yeah If you compare this to kaiba, which is also 16 bit that we are still much faster with entity base multiplication yeah, and So it's basically the last thing in the talk. 
These are the performance numbers of our AVX2-optimized implementation, and you see that for signing we are down to 510 thousand cycles in the median, or 635 thousand on average, which we think is very competitive. And with this I want to end my presentation. Maybe one more thing: although there's a very fast multiplication and SHAKE implementation now, this is somehow still the most time-consuming operation, but no longer with such a large gap to the rest. Okay, thank you very much.

Any questions for Gregor? All right, maybe I can ask a question then. You said SHAKE is, besides multiplication, one of the most time-consuming operations. Did you consider maybe using something else than SHAKE?

Yes, we did. We ended up using SHAKE because it is a NIST-approved scheme, and it's also easy to implement in constant time. But this is not crucial for the scheme itself, so you could easily exchange SHAKE for something else which is faster on your platform, and then you might get some speedup.

And the current four-way SIMD AVX implementation gives really nice results. How would that carry over to, for instance, NEON on ARM? Can we expect similar performance results?

You mean for SHAKE or for the scheme in general? For the scheme in general. Yes, so the same vectorization things you can do on basically every platform; there's no reason why this should only be possible with AVX2.

All right, so for these numbers you'd expect similar results? I'm not an expert, I don't know, but the implementation strategy would carry over in general, I guess. Oh yes, okay.

Any other questions? If not, then let's thank Gregor again.