OK, so our next talk is about masking techniques in the context of tweakable block ciphers, with applications to authenticated encryption. This is joint work by Robert Granger, Philippe Jovanovic, Bart Mennink, and Samuel Neves, and Philippe will be giving the talk. Exactly. So thanks for the introduction, and hello, everyone. Let me start with some terminology. We all know what a block cipher is: it takes a message and a secret key as input, applies some non-linear operations, and you get a ciphertext as output. A tweakable block cipher, on the other hand, has an additional input, namely a tweak T, which is a public value that adds some flexibility to the cipher by providing internal randomization. This means that if you have fixed inputs but different tweaks, you also get different ciphertexts as output. Another important concept that we will encounter in this talk is authenticated encryption. It takes associated data, a message, a secret key, and a nonce as input, and produces a ciphertext and a tag. The ciphertext is the encryption of the message M, and the tag protects the authenticity and integrity of the associated data and the message. The nonce, denoted by N, has a similar function as the tweak in a tweakable block cipher, namely it randomizes the scheme. So how can we put these two things together? I will show you this on the example of the OCB mode, and in fact the whole family of OCB modes, which was introduced by Rogaway. As you can see here, when you look at this first block, it is basically a tweakable block cipher. The tweak is composed of the nonce N and a value t_A,1, which depends on the position of the input block. That means, in particular, that each input block here is transformed with a different tweak.
And since we do this very often, we of course want the change of the tweak to be very efficient. Tweakable block ciphers already have quite a history. It all started in 1998, when ciphers still had funny names like the Hasty Pudding Cipher. It was an AES submission, and it was the first tweakable block cipher. I put this in parentheses here because the cipher didn't make it past the first round, since there were some flaws in it, but it nevertheless introduced the concept of tweakable block ciphers. And then there is a whole history. This is not the full history of tweakable block ciphers; I just picked some interesting points. For example, Mercy was a tweakable block cipher for disk encryption. Then there is the Threefish block cipher, which was used in the SHA-3 finalist Skein. And more recently, in 2014, the TWEAKEY framework was introduced, which is used in CAESAR submissions like Deoxys, Joltik, and KIASU. Our focus in this talk is generic tweakable block cipher design. When we started working on this, we asked ourselves: what is the simplest approach to realize such a tweakable block cipher? What you in fact do is take your existing block cipher, somehow generate a tweak-based mask, and add the mask before and after the call to your block cipher. You can of course also use a public permutation, so that in the middle you have not a keyed block cipher but a public permutation, and you do it in a similar way. But you have to take care here that your mask now also depends on the secret key; otherwise you don't have any key material at all. Usually the block-cipher-based approaches are done with a 128-bit block, because there you typically use the AES as the block cipher, while the permutation-based approaches have much, much larger block sizes, between 256 and 1600 bits when you think, for example, of the Keccak permutation.
So what are approaches to generate these masks that we add before and after our block cipher? One approach is the so-called powering-up masking, which was introduced in the XEX construction by Rogaway in 2004. The tweak is basically composed of values alpha, beta, and gamma and the nonce N. What you do is first encrypt your nonce using the block cipher, and then you modify this base value by multiplying it by 2 to the power of alpha, 3 to the power of beta, and 7 to the power of gamma. This way, by changing the values alpha, beta, and gamma, you can generate new masks. This was very popular and was used in the OCB2 variant of the OCB family and also in various CAESAR candidates. Later, in 2014, the Tweakable Even-Mansour construction was introduced, which works in a similar way but is permutation-based. Instead of the block cipher call, you concatenate the key with the nonce, XOR this value into the input, apply the permutation, and XOR it again at the output. This was also used in CAESAR candidates like Minalpher and Prøst. So let's have a concrete look at how powering-up masking is used in OCB2. When you look at the first block, you see how the mask is constructed: namely, you take this value L, which is your encrypted nonce, and multiply it by 2 to the power of 1 times 3 to the power of 2. When you go to the next block, you just multiply by 2, and you continue until you have processed all of your associated data blocks. Over here you do the same thing, but, as you can see, the factor 3 to the power of 2 is missing. That is for the domain separation of the different masks, so that you don't get any mask collisions and, for example, use the same mask twice to mask different blocks.
When you're done with encryption, you go back and compute the checksum of the message blocks, and you have, again, a unique mask, 2 to the power of m times 3 times L, that you use to mask this checksum. When you look at this, the mask update can in fact be done very efficiently: it's just a shift and an XOR. But what you have to take care about when you implement this is that the XOR is conditional. Because you're computing all of this over a Galois field, you also have to do a reduction with respect to a polynomial once in a while, and if you're not careful, this might introduce timing leaks into your block cipher construction. Also, on certain platforms, for example smaller platforms, this might get quite expensive to implement. Then there is another approach, word-based powering-up masking, which was introduced by Chakraborty and Sarkar in 2006. Here, again, the tweak is a tuple of an index i and the nonce N. What you do here is basically construct a tower of fields, where this value is now not an element of F_2 but in fact an element of F_2^w. This is a little more software-friendly, since you're doing a word-based instead of a bitwise approach, but it still has drawbacks similar to the regular powering-up approach with respect to timing leaks. A last approach is Gray code masking, where, as the name says, you use a Gray code to construct the masking. It is used in the OCB1 and OCB3 variants. Here, to update the mask, you basically only need a single XOR, provided that you did some precomputation before. If you don't do this precomputation, then you need up to log2(i) field doublings to update the mask. And again, this might introduce timing leaks if you are not careful about how you implement it.
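The doubling step just described can be sketched in Python. This is a minimal illustration, assuming the reduction polynomial x^128 + x^7 + x^2 + x + 1 (constant 0x87) commonly used for GF(2^128) in OCB; the function names are mine, not from the talk.

```python
MASK128 = (1 << 128) - 1
POLY = 0x87  # reduction constant for x^128 + x^7 + x^2 + x + 1

def double_branchy(L):
    # Doubling as described: a shift, then a reduction XOR only when the
    # top bit overflows -- the conditional branch that can leak timing.
    L <<= 1
    if L >> 128:
        L = (L & MASK128) ^ POLY
    return L

def double_ct(L):
    # Branch-free alternative: always XOR, scaling the reduction constant
    # by the carried-out top bit (0 or 1).
    carry = L >> 127
    return ((L << 1) & MASK128) ^ (POLY * carry)

def triple(L):
    # Multiplication by 3 is 2*L XOR L (used for the 3^beta factor).
    return double_ct(L) ^ L
```

The branch-free variant costs the same XOR on every call, which is exactly why the talk stresses that a naive implementation of the conditional reduction is where timing leaks come from.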
However, as Rogaway and Krovetz showed in 2011, this is more efficient than the powering-up approach, and that's also why it's used in the latest variant of OCB. So our contribution is a new tweakable block cipher construction that we call the Masked Even-Mansour (MEM) construction, which uses an improved masking approach. The nice thing about it is that an implementer does not need to know anything about finite field arithmetic whatsoever. It is much more efficient than the previous approaches, and by default it's also constant time: when you implement just the basic version, it is constant time by default. What's also quite nice, coming back to the domain separation I talked about before, is that to establish it in our setting you need some discrete logarithm computations over big finite fields. And I promised you some applications to authenticated encryption: I will show you two schemes, a nonce-respecting authenticated encryption scheme, which runs at 0.55 cycles per byte, and a misuse-resistant variant, which runs at 1.06 cycles per byte. So how does this construction concretely look? Our tweak is, again, composed of four values: alpha, beta, gamma, and the nonce. What we do is use our permutation here to get a base value, by concatenating the nonce N with the key and applying the permutation to it. Afterwards, we have some LFSRs here, phi 0, phi 1, and phi 2, which operate on this base value, and with these we can generate new masks. This basically combines the advantages of the powering-up masking that I showed you before with word-based LFSRs. As I said, it's very simple, because all you do is evaluate your permutation once and then only do LFSR-based updates of your mask; you don't need any Galois field arithmetic whatsoever. It's also very efficient, as I will show you later on, and, as I said, it's constant time.
When we started this work, we were particularly interested in getting an efficient masking scheme for large states, and we also wanted to keep the number of operations needed for the update very low, so that we get something very efficient. Here are some LFSRs that we found with our approach. So what does this table mean? We operate on a state of b bits, which is split into n words of w bits. For example, when you look at the first row, this is a 128-bit state split into 16 one-byte words. This is very suitable, for example, for masking AES-based block ciphers. I also want to point out the LFSR highlighted in red here, which operates on 1024 bits, based on 16 64-bit words. What you basically do is take your array of 16 words, shift it by one word to the left, and then compute the new last word with this instruction sequence. As you can see, it takes only one bit rotation, one XOR, and one bit shift. And as you might imagine, since efficient implementations of ARX primitives are usually word-sliced, this is very suitable for combining with ARX primitives; I will show you in a moment why this is the case. One thing you have to think about, however, is how to get domain separation when applying different LFSRs to a base value: how do I ensure that the different values do not collide? Intuitively, when you have two different tweaks alpha, beta, gamma and alpha prime, beta prime, gamma prime, you want to make sure that the corresponding evaluations of the LFSRs on your base state are different. The challenge is how to set the proper domains for alpha, beta, and gamma, and that's where the discrete logarithms come into play. So let's have a look at what has been done and what we did.
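The 1024-bit LFSR update just described can be sketched as follows. This is a sketch, not the paper's specification: the rotation and shift amounts below are illustrative placeholders, so check the paper for the exact constants; only the shape (shift the word array, derive the new last word from words 0 and 5 with one rotation, one shift, and one XOR) is taken from the talk.

```python
W = 64
MASKW = (1 << W) - 1
R_ROT, R_SHIFT = 53, 13  # illustrative constants, not from the talk

def rotl(x, r):
    # 64-bit left rotation.
    return ((x << r) | (x >> (W - r))) & MASKW

def phi(state):
    # One mask update on a 16-word (1024-bit) state: shift the word array
    # left by one position and append a new word computed from words 0
    # and 5. Three word operations, no field arithmetic, no branches --
    # hence constant time by construction.
    new_word = rotl(state[0], R_ROT) ^ ((state[5] << R_SHIFT) & MASKW)
    return state[1:] + [new_word]
```

Note also that an all-zero state is a fixed point of any such LFSR, which is exactly the issue raised in the Q&A at the end of the talk.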
In 2004, Rogaway showed how to do this masking for 64 up to 128 bits. Then some ciphers implicitly used results between 256 and 512 bits, such as Prøst. What we did is solve this whole range here, to be able to do the domain separation for the masking schemes. And of course this does not end at 1024: with the recent breakthroughs in discrete logarithm computations over finite fields, this can be done for much larger sizes. So it's also no problem to do 2 to the power of 11, 2 to the power of 12, or 2 to the power of 13; for cipher design, or at least for our case here, it was not that relevant, but you can still do it. To give you an intuition for the 1024-bit case: if you remember the TLS talk from this morning, you have a huge amount of precomputation to do. In our case this took about 33 hours, if I remember correctly, and then computing the discrete logarithms took around 50 hours overall, on a very normal desktop machine. Now some implementation results of our masking scheme. This table shows the cycles per mask update. If you take, for example, the Haswell architecture, powering-up masking takes 10 cycles per update for a 1024-bit state, Gray code masking takes 3.6 cycles per update, and our approach takes roughly 2.7. But you have to take this with a grain of salt, because when you plug the masking into a concrete construction, these performance numbers are somewhat hidden by the block cipher or by the permutation, which is usually much more costly to evaluate than the masking. So, on to the applications to authenticated encryption. The first one is the Offset Public Permutation mode (OPP), which has security against nonce-respecting adversaries.
What it does is: you have an authentication part, which processes the associated data and the checksum of the message. As you can see, for example, this first block applies phi to the power of 0 to the base mask L; then you update your base value with the LFSRs and continue. To do the domain separation here, as you can see, you just apply a different LFSR operation, phi 1 squared, where phi 1 is basically phi XORed with the identity map. For the encryption part you have something similar. Then MRO is the misuse-resistant variant, which has this authentication part and a very similar approach to domain separation; you compute the authentication tag and use it as a seed for your CTR mode of encryption. To give you some more details about the implementation: we went with the 1024-bit state and the LFSR that I showed you before, and for the permutation we used the one from BLAKE2b with four or six rounds. What you end up with, for the nonce-respecting scheme on Haswell, is 0.55 cycles per byte, or 0.75 if you go a little higher with the round numbers. What's also nice is that, in comparison to the AES-based variants, it has very good performance numbers on ARM architectures. A similar picture holds for the misuse-resistant variant, where on Haswell you get values around one cycle per byte, or 1.4 for the six-round variant. When you translate this to throughput, we end up with roughly 6.36 gigabytes per second for OPP on Haswell, and roughly 3.3 gigabytes per second for MRO. On my second-to-last slide, I want to show you why this is very nice from an implementation point of view.
When you start your LFSR with the 16 words, for the first update you take x0 and x5 and compute the new word x16. For the next step, you use x1 and x6; for the next, x2 and x7; and for the last, x3 and x8. As you can see, since these depend on different words, you can parallelize this very well. In addition, with four additional words of storage, you can keep four complete masks in memory. And, as I said, since ARX primitives are usually word-sliced to get good performance, this goes together very nicely with these primitives. As a conclusion, what I showed you is the Masked Even-Mansour construction, which is very simple, efficient, and constant time. The domain separation was justified by these breakthroughs in discrete logarithm computation. I also showed you two schemes for authenticated encryption with very, very nice performance numbers. For more information, you can go to the ePrint archive, which has the full version with all the security proofs, and you can find all the implementations on GitHub, with C reference code and optimized versions for AVX, AVX2, and NEON instructions. OK, that's it from my side. Thanks for your attention. Thank you, Philippe. Any quick questions? [Audience question about the case L equal to 0.] So the question is what happens if L is equal to 0, that is, if you feed your LFSR with zeros. When you really want to instantiate this, you of course have to make sure that your state is not completely zero in the beginning, so you put in some constant values at the start; that's what I left out here. Otherwise your LFSRs would just keep updating the value 0, and you would always have the constant zero mask, so you want to prevent that. Any other questions?
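The four-way parallel update described at the end of the talk can be sketched as follows. As before, the rotation and shift constants are illustrative placeholders (the exact values are in the paper); the point is only that four consecutive steps read disjoint word pairs, so they can be computed independently, for example in one SIMD vector operation.

```python
W = 64
MASKW = (1 << W) - 1
R_ROT, R_SHIFT = 53, 13  # illustrative constants, not from the talk

def rotl(x, r):
    # 64-bit left rotation.
    return ((x << r) | (x >> (W - r))) & MASKW

def phi(state):
    # Single LFSR step: shift words left, append one freshly derived word.
    return state[1:] + [rotl(state[0], R_ROT) ^ ((state[5] << R_SHIFT) & MASKW)]

def phi4(state):
    # Four steps at once: the new words come from the disjoint pairs
    # (x0,x5), (x1,x6), (x2,x7), (x3,x8), so they are mutually independent.
    new = [rotl(state[i], R_ROT) ^ ((state[i + 5] << R_SHIFT) & MASKW)
           for i in range(4)]
    return state[4:] + new
```

Because the sliding window advances by four words per call, the extra four words of storage mentioned in the talk are enough to keep four complete masks live at a time.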
If not, then it's time for the coffee break. So let's thank Philippe again. Thanks.