I think we can start the AES and white-box session. We have three papers. The first one is Simpira v2: A Family of Efficient Permutations Using the AES Round Function. The paper is written by Shay Gueron, who is at the University of Haifa, and Nicky Mouha, who is now at NIST, and Nicky is giving the talk.
We're in a very small, cozy room, so I won't take up too much of your time. I think I can give the presentation in 20 minutes. So there will either be enough time for questions afterwards, or you can briefly interrupt me if something is unclear, and I'll give a bit more detail about particular slides. I guess something that may have caught the attention of some of you here already is that it says version 2 here in the title. Version 1 was never published, but it was broken. It was a very early version of the work that was presented at Dagstuhl earlier this year. We got a huge amount of feedback from many people who are actually here in the room today, some of whom also found attacks against the early version. And I think this led to a subsequent design that is much better than it would have been if we hadn't involved the community from a very early stage. I already want to thank all of you present here for all the feedback that we received on Simpira.
Let me first give a bit of context about who I am and what I'm involved in. During just about all of my PhD studies, I came to the conclusion that there's no need for any new design in symmetric-key cryptography, because all of the things that we need already exist, more or less. But that changed since I graduated four and a half years ago. Now I've been involved in three designs. There is APE, which is part of PRIMATEs, an authenticated encryption algorithm that was submitted to the CAESAR competition and advanced to the second round.
Then you have Chaskey, published at SAC 2014, a MAC algorithm for microcontrollers that is now undergoing ISO standardization and that also performs quite well when you look at benchmarks. For example, the benchmarking framework of the University of Luxembourg has Chaskey at number one according to their figure of merit. And now, I hope you're in the right room, this is Asiacrypt 2016, I'm presenting Simpira, the latest design. This will be a family of permutations based on the AES round function.
Let me talk a bit more about the background of AES, or why you want to use AES. AES instructions were available on Intel processors initially, and then later on AMD and also ARM. Actually, if you look at any recent 64-bit processor, you will have instructions available to accelerate the Advanced Encryption Standard, AES. The latest Intel Skylake processor, which we were working on during this research, has an AES instruction that can compute one round of AES. It's an instruction that has a latency of four cycles and a throughput of one cycle. So it takes four cycles to get the output of one round of AES, but already after one cycle you can start the computation of one round of AES on another input. What is important here is that if you want an efficient solution with AES, you need to use a parallelizable mode of operation, or you need to be processing independent messages. Otherwise, you will need to wait four cycles for every output that you need to use again, and this will lead to very inefficient use of the processor. Now we will take this into account, and even embrace it, and say that for the solutions we are going to look at, we only focus on throughput. We don't focus on latency. This means that the latency may not be so good, but that is a problem that is already inherent to AES.
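The latency and throughput figures above can be turned into a small scheduling model. The following Python sketch is only an illustration under simplifying assumptions: at most one AES round is issued per cycle, in order, round-robin over the independent inputs. The function name `total_cycles` is hypothetical, and this is not a model of any specific processor.

```python
def total_cycles(rounds, streams, latency=4):
    """Cycles until all results are ready when `streams` independent inputs
    each go through `rounds` dependent AES rounds.

    Assumed model: at most one AES round is issued per cycle (throughput 1),
    and each result becomes available `latency` cycles after it is issued.
    """
    ready = [0] * streams   # cycle at which each stream's latest result is ready
    next_slot = 0           # earliest cycle at which the next round can be issued
    for _ in range(rounds):
        for s in range(streams):
            # A round can only start once the previous round of the same
            # stream has produced its output.
            issue = max(next_slot, ready[s])
            ready[s] = issue + latency
            next_slot = issue + 1
    return max(ready)


if __name__ == "__main__":
    for streams in (1, 2, 4, 8):
        cycles = total_cycles(rounds=12, streams=streams)
        print(streams, cycles, cycles / (12 * streams))
```

In this model a single input pays the full latency for each of its 12 dependent rounds, while four or more independent inputs bring the cost close to one cycle per AES round.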
So you should really understand this presentation in the context that you need to use a parallelizable mode or you need to use independent data. This seems to be the main thing that confuses people about this talk. But once we agree on this, if people say, OK, so the latency is going to be bad, then what you will see will hopefully make sense in the rest of the presentation.
Let me give a small example to motivate why you want to use things based on AES, or on one round of AES, in the first place. Instead of looking at all the details, all the algorithms, and all the possibilities, I think it can be interesting to just look at a higher level at what, for example, Google Chrome is doing. They have a privileged position where, if you go to the Google website, they control both the client side and the server side. They can use whatever algorithm they like, and they apparently don't even want to restrict themselves to standardized algorithms. So the field is wide open. Well, for Google Chrome it is the case that only if AES instructions are not supported on the client side will you use ChaCha20 with a Poly1305 authenticator. In other cases, on any recent processor that has support for AES, you will use AES-128-GCM. Also, at least that's what Shay has informed me, who works not just at the University of Haifa but also at Intel, so he should know, I guess: if you want to look at designs that are going to be efficient not just on today's processors but also on processors in the future, then using AES instructions is the way to go, because those will only become faster and better on future processors. This will allow you to get the most out of your cryptographic algorithm for future applications. OK, so this slide just says: use AES. So we can stop here.
But actually, AES cannot do everything. So let me discuss the limitations of AES a bit, and then it will become clear what we will develop in this presentation. For AES, one big problem is that you need a key schedule, and the key schedule of AES is quite complex. This means that you have two choices. Either you precompute the round keys of AES, but that means that you need to store them in RAM, that you need to store those round keys securely, and that it becomes heavier to change to another key because you need to redo this computation. Or you compute them on the fly, and then you always have the overhead of a key schedule, which can become a large cost. Also, for some of the modes of operation or some of the uses of a block cipher, you may want a tweak, but this is not natively supported by AES. Another limitation is that the block size of AES is always 128 bits. Different key sizes are supported, but only one block size. This means that the most commonly used modes of operation become insecure after about two to the power 64 blocks of data. It also means that there's no secure hashing support, if that's something you want to do with the AES block cipher.
You can think about what alternatives to AES we can use. There are many different options. Originally, Rijndael, which is the algorithm that was submitted to the NIST competition for AES, also supported different block sizes and even different key sizes than the ones that are now standardized. So you could think of maybe using Rijndael with 256-bit blocks. You can think of maybe SHA-2; for SHA-2, there are also instructions available in Intel processors. Or you can base yourself on some other type of primitive.
We do an analysis in the paper of the alternatives that you can look at, at least the most obvious ones, and see that they are, I mean, not inefficient, but not as efficient as they could be. In this paper, we propose Simpira, which is a much faster solution if you want to tackle these problems. Simpira is a family of cryptographic permutations that supports any input size that is an integer multiple of 128 bits, that is, 128 times B bits. Now, we're not saying that you're supposed to use the very biggest variants in this family, but as designers we also don't want to restrict the size of the permutation that we recommend. The design is scalable. If you find a use for a permutation of a very particular size, and it's a multiple of 128 bits, then we are offering it, and that's the one you can implement in your protocol, in your mode of operation, or in your system as a building block.
For B of two or larger, we look at Feistel and generalized Feistel constructions. The way that they are designed, and here you see a Feistel on the right side, is that we use two rounds of AES inside the F-function. We also look in the paper at what happens if you use one round of AES, and then the security results are not as good. So, two rounds of AES, and a constant is also needed. Basically, if you use AESENC, which is the instruction that does one round of AES, then as one of the operands of this instruction you give what is typically, if you use it for AES, the round key for this round. What we do in this case is that for the first AES round we use a constant, which we need to add in to destroy the symmetry. If you don't do this, then if all of the bytes are equal at the input, they will remain equal, because the AES round function has no way of breaking this structure. And then we use the XOR that we get as part of the second AESENC to combine the branches in the Feistel.
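The symmetry argument can be checked with a small software model of one AES round. The Python sketch below implements the round as SubBytes, ShiftRows, MixColumns and a final XOR with the second operand, which is the order of operations of the AESENC instruction. The helper names, the placeholder constant `C`, and the function `f_then_xor` are illustrative assumptions; they are not the constants or the code from the paper.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def _sbox_entry(x):
    """AES S-box: multiplicative inverse followed by the affine map."""
    inv = 0
    if x:
        inv = 1
        for _ in range(254):          # x^254 is the inverse of x in GF(2^8)
            inv = gf_mul(inv, x)
    y = inv
    for i in range(1, 5):             # XOR of four left rotations of the inverse
        y ^= ((inv << i) | (inv >> (8 - i))) & 0xFF
    return y ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def aesenc(state, round_key):
    """One AES round on 16 bytes stored column by column."""
    s = [SBOX[b] for b in state]                                        # SubBytes
    s = [s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]  # ShiftRows
    out = []
    for c in range(4):                                                  # MixColumns
        col = s[4 * c: 4 * c + 4]
        for r in range(4):
            out.append(gf_mul(2, col[r]) ^ gf_mul(3, col[(r + 1) % 4])
                       ^ col[(r + 2) % 4] ^ col[(r + 3) % 4])
    return bytes(o ^ k for o, k in zip(out, round_key))                 # XOR operand


C = bytes(range(16))  # placeholder constant for illustration, NOT a Simpira constant


def f_then_xor(x, other_branch, constant=C):
    """Two AES rounds on x; the XOR of the second round adds the other branch."""
    return aesenc(aesenc(x, constant), other_branch)


if __name__ == "__main__":
    zero = bytes(16)
    sym = bytes([0x3A] * 16)                           # all input bytes equal
    print(len(set(aesenc(aesenc(sym, zero), zero))))   # number of distinct output bytes
    print(len(set(f_then_xor(sym, zero))))             # same, with the constant added
```

Without a constant, an input whose bytes are all equal yields an output whose bytes are all equal, whatever the number of rounds; with a non-symmetric constant in the first round this structure disappears.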
So the design goal here is to have security up to two to the power 128 queries. This is security in the sense of structural distinguishers, the same as is done for SHA-3. We have a very easy analysis to support this. Just from the fact that we say two to the power 128, we're not considering any of the modes of operation using AES, which typically only have security that breaks down around two to the power 64. And if we can use the trick that I explained earlier, then in the ideal case the number of cycles that the processor executes is roughly equal, in the best case equal, to the number of rounds of AES, because it means that you can launch a new AES round every clock cycle. That would be the ideal case, where your implementation has the properties that you were looking for during the design.
So now we need some basic requirements to analyze the security, and we follow quite closely what is done for AES. We look at counting the number of active S-boxes: you calculate the number of rounds for which you have at least 25 active S-boxes. You also calculate the number of rounds for which you have full bit diffusion, meaning every input bit affects every output bit. The first criterion covers, of course, both linear and differential cryptanalysis. Then you multiply this number of rounds by three, and we argue in the paper that this is going to give us not just security, but also a very good security margin against all of the common attacks that would apply to AES-like constructions. And then, of all the constructions that we look at, we want to use the one with the fewest F-functions, and as every F-function contains two AESENC instructions, two rounds of AES, that means that you have the fewest AESENC instructions, the fewest rounds of AES.
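As a sketch of how a round number follows from these criteria, the rule can be written as a one-line function. The Python below assumes that the number of rounds being tripled is the larger of the two counts, which the talk does not spell out, and the function name is hypothetical. The AES figures in the example, 25 active S-boxes after four rounds and full diffusion after two rounds, are the well-known properties of the AES round function.

```python
def rounds_required(rounds_for_25_active_sboxes, rounds_for_full_diffusion, margin=3):
    """Assumed rule: triple the larger of the two per-criterion round counts."""
    return margin * max(rounds_for_25_active_sboxes, rounds_for_full_diffusion)


if __name__ == "__main__":
    # B = 1 (plain AES rounds): 4 rounds give 25 active S-boxes, 2 give full diffusion.
    print(rounds_required(4, 2))
```

Under this reading, the smallest family member needs three times four AES rounds.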
When we have multiple designs that would satisfy those criteria, the goal is to choose the simplest design. And of course, these design criteria are only supposed to guide the selection of the algorithms. Once you have the algorithm, you're still supposed to do cryptanalysis to see if it is secure or not.
So here I present Simpira with B equal to one, the smallest member of the family, processing 128-bit blocks. This is AES with fixed round keys. You have 12 rounds of AES if you want to satisfy the criteria of the previous slide. You may see that this is a design that has a very bad latency, because you need to wait quite a lot of cycles to get the output, in this case 12 times four is 48. But actually, already in the next cycle you can launch another instruction. So this is inherent to everything I will present here. We have a normal Feistel for B equal to two, where we have 15 rounds, corresponding, because every F-function has two AESENCs, to 30 AESENCs to satisfy those criteria. And then for B equal to three, it's also a really simple design. This is supposed to be the advantage: when you see that the designs are simple, it means that they're probably secure; there's not a lot of doubt you can have once you see the design, in this case with 21 rounds.
And when we look at B larger than or equal to four, except for two values of B which I will discuss later, we found that the construction by Yanagihara and Iwata is the one that minimizes the number of F-functions among all the generalized Feistel constructions that we found in the literature. The exceptions are B equal to six and, on the next slide, B equal to eight.
There are two designs in the literature that we found that are better, in the sense of having fewer AES rounds to be executed. So if you want to be fair, those are the ones you really should use: the standard constructions that have been analyzed for quite a while already. They just do a shuffling of the blocks, in this case for 15 rounds. And then here you have the variant with B equal to eight using AES, but this is not what we use now, because that was the early design that turned out to be broken.
The first attack, by Dobraunig et al., who are in the room, was published following this, and it shows a collision on a Simpira-based hash function where the full-round Simpira is attacked, with a complexity of two to the power 83. We said you should have security up to two to the power 128, so this version is clearly broken. Then there's another attack that was published in just about the same week by Rønjom, where he shows an invariant subspace attack on the same variant. His attack has the interesting properties that it is not just independent of the number of rounds, so increasing the number of rounds will not restore security, but also that you can break the design with only two queries. That means the design is completely broken. If you look into these two attacks to see what actually went wrong here, it turns out that the problem is only with the Simpira variants that use this Yanagihara-Iwata generalized Feistel construction, which we analyze in detail and explain in the paper. Basically, you need to be careful with independence assumptions.
When you have a cryptographic permutation, there is no secret key, so it becomes difficult to assume that in every round the probabilities that you calculate for the behavior of the cipher are independent of the previous round. That's an assumption that classically was never a problem, because we classically used very heavy key schedules to justify this Markov cipher assumption with independent subkeys. But in the modern world, especially since a lot of research has moved into hash functions, that doesn't hold anymore, and we can have, for example, things like invariant subspace attacks that are often overlooked. You can get an interesting situation where you have security against just about every attack in the book, linear cryptanalysis, differential cryptanalysis, meet-in-the-middle attacks, but you are completely insecure against this type of attack. This is, of course, not a defense. I mean, this design was broken, so this is something we should have avoided. But when you look at recent literature, also including a paper that will be presented here at Asiacrypt, a lot of research is going into invariant subspace attacks, and it's something many designers have gotten burned by.
How does it need to be fixed? Well, basically this type-1 GFS by Yanagihara and Iwata is problematic and needs to be replaced. Also, the round constants are not the problem, but they could have prevented the problem, and strengthening the round constants is something that will not make the cipher weaker. So what we do is strengthen the round constants. We basically make exactly the same change that was made for the Grøstl hash function when it moved into the final round of the SHA-3 competition, the same change in how the round constants are computed. And, I mean, I'm a cryptanalysis guy myself, we add for fairness in the paper that we still think that with the old round constants the design should be secure.
So the problem should not be a result of choosing the round constants badly. This is more a logical change to strengthen the cipher and to remove any suspicion of insecurity, but we also want to add that even with the old round constants, as soon as the type-1 GFS is replaced, the design should be secure.
So as you can see here, this is the updated variant for B equal to four, which is the simplest design. For B equal to five and bigger, there is an animation I will show, because the construction is recursive. Here you see what it's like for four inputs. But this thing, as I'm also mentioning at the bottom, is not the whole construction: you need to iterate it for three rounds. So this thing needs to be repeated three times. Then you can turn this into five. This is the construction we use, adding two F-functions on the right, at the top right and the bottom right. That gives you the result for five 128-bit blocks. Increasingly, when it becomes bigger, it becomes less readable, but luckily you basically get a big X somewhere. So we lost the simplicity in the sense that we don't have identical round functions anymore, but we hope that this construction is still simple enough to allow an easy analysis. The number of calls to the F-function is the same as in Yanagihara and Iwata's construction, so we don't lose any efficiency between the scheme that had the problem and the solution.
And to round up, if you look at benchmarks, we basically show that we can reach what we claimed in the beginning, which is that every cycle we dispatch a new AES instruction. But there's a small remark that you need to be clear about.
And that is that you can look at an interleaved or a non-interleaved implementation, where interleaved means that you would have laid out in memory the first block of the first message, the first block of the second message, the first block of the third message, and so on, and then likewise the second blocks. So you interleave them instead of having the blocks of a whole message right after each other. When you interleave messages, you can have the optimal performance even for very large permutations. Otherwise, you will start to see that the overhead becomes quite significant for permutation inputs larger than around 128 bytes.
The main goal of this presentation is not to go into applications, so I want to move here to my conclusion to allow a bit of time for questions. Simpira is a family of permutations based on the AES round function that supports any multiple of 128 bits. It uses two rounds of AES as a building block, it has an easy security analysis, and it has security up to two to the power 128 queries. And when you look at the implementation figures, we show that we are very close to the theoretical optimum, which would be to dispatch a new AES round every clock cycle. Thank you very much.
Thank you. Any questions?
The picture with the X, the big X. I'm afraid... The big one. The big one, okay, this one. Yes, so I'm a bit worried about the highly restricted influence that the left side can have on the right side, due to the fact that you have a very narrow bottleneck. It's only in the middle where data from one side influences the other side. So if you're thinking about collision attacks, for example, you can show that such a design cannot be secure up to the square root bound. Exactly, yeah. So did you consider schemes in which you don't have such bottlenecks, which limit very strongly how one part can influence the other part in your permutation?
Actually, I mean, it's a very valid remark, and the simple answer could be just to say that we repeat this three times and this is where the security should come from. But actually, replying to your question, this is a problem that not just this design has, but all of the Feistel designs that you can consider. You need a certain number of rounds so that every input can start affecting every output. Otherwise you have these bottlenecks, and as you can see, more rounds are needed. No, the problem is actually deeper, because in this design every input bit can influence every output bit, but if you are going to have a collision in the information that is passed from one side to the other, you just don't have sufficient variety, information-theoretically, in how much information about the left side is going to influence the right side. So it's not that single bits do not have influence, but that many combinations are going to have exactly the same influence from one side on the other. Let's not forget that we're only targeting security up to two to the power 128. That's what will save you. Yeah. Any other questions?
That number down there, it says number of AES calls, 24B. So it means that for every message you're doing 24 AES rounds, is that correct? So you need to say what B is, right? When you fix the value of B, and B is the number of 128-bit blocks of input to your permutation, then 24 times B minus 36 is the number of AES rounds that you're executing for the whole permutation. Let's forget about the minus 36. So 24B means in general that you're using 24 AES rounds per 128-bit message block, no? No, of course, okay, I see where you want to go. I mean, B is the number of blocks, so the input is B times 128 bits. Look, you have 24B, you forget about the 36, right? You have 24B for processing B times... Of course, 16B bytes of data.
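The arithmetic behind this exchange can be written out. The expression below assumes the ideal case described earlier in the talk, where one AES round is dispatched per cycle, and takes the count of 24B - 36 AES rounds for 16B bytes of input at face value:

```latex
\[
\frac{\text{cycles}}{\text{byte}} \approx \frac{24B - 36}{16B}
  = 1.5 - \frac{2.25}{B}
  \;\longrightarrow\; 1.5 \qquad (B \to \infty).
\]
```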
So this means, and I have this here on the last slide, that when your input becomes large, when you can ignore the 36, your performance is 1.5 cycles per byte. That's bad, right? That's what you want to say. You can do better, but we want something that is secure up to two to the power 128, right? Exactly, yeah. For 128 times B bits, exactly, yeah. I think the details we should leave for the coffee break. Okay, we can discuss further, yeah.
I have a short question about, I mean, we don't have much time, but I would like to know about... Well, you're in the session. Efficiency, yeah. About efficiency of the implementation. Can you go again to the large figure? No, no, the large one, the construction that you have. Yeah, this one. You designed this to, let's say, use all the pipeline stages of the AES implementation in the Intel processor. And then two of them, let's say the left one and the right one, can go into the pipeline after each other, and, because in each of these F-functions you have two AES rounds, then with these two you completely fill the pipeline, right? And the next one, when one of them comes out, then you can start one of the other ones. Is the order of the operations important for you to reach the highest efficiency, or doesn't it matter which of them you give first to the pipeline? If you want to execute just one permutation call, you cannot fill the pipeline. No, but you have two F at the top, one at the left, one at the right. But you have a latency of four. Yeah, but... You need to be executing four independent AES rounds if you want to fill the pipeline. No, you give one of them, the left one, to the F, which is one AES round. Next clock cycle, you give the next one up there. That one you cannot keep, but that one you can keep, right? To the left. The right one. Yeah, and then you're stuck. But then you are not using the whole pipeline again.
And this is why, I mean, if people can agree with the very first slide, which says that you already have this problem inherently from the fact that you're using AES: you need to have a parallelizable mode, or you need to be processing independent data. So, putting it into Shay's words, the way he explained this to me is that he has been advocating for such a long time at Intel that we need to use parallelizable modes, because that's important to make AES work efficiently. So if you're willing to accept that the mode needs to be parallelizable, then that's not a problem, because you just have your different input blocks with which you will be able to fill the pipeline. Sorry, but your construction is not using the whole pipeline of the AES available in the market. To use the whole pipeline, you need to have a parallelizable mode, or you need to have independent data. Otherwise, you don't fill the pipeline. Okay, thank you. Okay, let's thank Nicky again. Thank you.