Hello everyone. My name is Fatih Balli and I will be presenting our work, Swap and Rotate. The goal of our work is to minimize the circuit area of block ciphers. As many of you already know, there is an ongoing NIST competition for lightweight cryptography. The goal there is to design an authenticated encryption primitive that is small and lightweight, and among the 32 candidates that are still active as of the second round, 15 of them are actually based on block ciphers. That means it is still an active trend to take a block cipher, define a mode of operation, and hence get an authenticated encryption primitive. Among them, GIFT is quite popular as a choice of block cipher, as four of the candidates are already using it. In that sense, this paper provides the smallest GIFT and PRESENT implementations, more concretely for the 64-bit block variants of those families of block ciphers. So the question is whether there are novel design ideas that will further reduce the footprint of block ciphers, and that is the question we answer in this paper. Before that, we have to understand the trend towards reduction in the data path size. If you take AES as an example and look back almost two decades, the implementation by Satoh et al. was about 5,400 gate equivalents, normalized so that the differences between cell libraries are somewhat reduced. As we go towards 8-bit and 1-bit implementations, we naturally see a reduction in area, because the auxiliary gates, such as the XORs for key addition or the MUXes for selecting signals, can be dispensed with as we go lower in the data path size. Before this work, the state-of-the-art smallest implementations for these particular block ciphers were as follows: the 1-bit PRESENT implementation, that is, the encryption-only circuit, was by Moradi et al. in 2017; the combined PRESENT implementation was by Banik et al.; and there was also an encryption-only GIFT circuit by Banik et al. as well.
In comparison, our paper provides much smaller implementations. Namely, we provide both encryption-only and encryption-decryption circuits for both ciphers. Among them, I think the most impressive one is the 1-bit encryption-only PRESENT implementation, which is smaller than 700 gate equivalents. I will explain how we actually make this gain. First, I have to give a very brief summary of why we have chosen GIFT and PRESENT as the block ciphers of this paper. Well, because they are SPN block ciphers, meaning that they consist of substitution and permutation layers. What is special about these block ciphers is that the permutation layer is defined at the bit level. That means there is a permutation function, or a table as you see in this picture, with a specific instruction to move each bit from position i to the position given as P(i). For example, the bit at position 0 is supposed to remain the same, the bit at position 1 is supposed to go to position 16, and the bit at position 2 is supposed to go to bit position 32 after the application of the permutation layer in each round of the encryption. That is different from what is happening, for example, in AES or SKINNY, because there we have ShiftRows, which operates at the byte level. The handicap of this type of permutation is that it is harder to implement in a 1-bit serialized hardware implementation. That is because the bits in a 1-bit implementation are generally arranged in a pipeline fashion, meaning that the bits enter as input at the flip-flop denoted b0, move all the way towards b63, and then finally exit the pipeline. At the exit of the pipeline, you typically have some kind of key addition, and then the fresh bit is fed back into the pipeline. That is how a round is typically implemented.
The trivial way to implement the permutation on top of such an architecture is basically to add a MUX before each of these flip-flops, so that we can either choose the default rotation operation or alternatively choose the P(i) values. For example, if the bit at position 24 is going to 63, then we would put an extra wire from here to there. That would be the trivial solution, but it is quite expensive, and it would negate almost all of the gain obtained by using the 1-bit serial implementation, so we have to do better than this. That is where swaps come into play in our work. A swap is a very basic primitive introduced on top of the pipeline. The pipeline consists of a basic rotation operation: at each clock cycle, you can assume that the contents of these flip-flops are rotated by one. Now we add a swap operation, denoted here with the pink color, which allows us to exchange these two contents just before the rotation. So we can either do the rotation on this pipeline, or alternatively, if we choose to enable the swap, we first exchange the b1 and b0 values and then apply the rotation. This can easily be done by putting MUXes after these colored boxes and then wiring them from b1 to here and from the b0 output to here. As you can see, that requires only two MUXes. The interesting observation, from a theoretical perspective, is that once you introduce a swap, such as this one between b1 and b0, and you already have a rotation operation, you can execute any permutation, as long as you are given a sufficiently large number of clock cycles. That means you can derive a very long sequence of operations which eventually performs whatever permutation you want. Before moving on to that part, let's look at a very simple example of how swaps work. Assume that these symbols, the numbers 3 to 0, are stored in these flip-flops at the beginning, and that we want to achieve a permutation.
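Before walking through the example, here is a minimal software model of the mechanism just described. This is an illustrative sketch, not the paper's circuit: one clock cycle optionally exchanges the contents of a few flip-flop pairs and then rotates the whole register by one position.

```python
def clock(state, swaps=()):
    """One clock cycle of a swap-and-rotate pipeline (illustrative model).

    `swaps` lists the pairs of positions (e.g. (0, 1) for b0 and b1)
    whose contents are exchanged just before the rotation; in hardware
    each such pair costs only two MUXes.
    """
    state = list(state)
    for i, j in swaps:
        state[i], state[j] = state[j], state[i]
    # the rotation: every symbol moves one position along the register
    return state[1:] + state[:1]

state = clock([3, 2, 1, 0])           # rotate only      -> [2, 1, 0, 3]
state = clock(state, swaps=[(0, 1)])  # swap, then rotate -> [2, 0, 3, 1]
```

Enabling the swap or not in each cycle is the only control decision, which is why the sequence of enable bits is all that a controller has to provide.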
The permutation just obtains this arrangement; it is basically a rotation by one. Assume that we want to obtain it after 4 clock cycles. Well, we can use the swap operation as follows. We first do a swap-and-rotate, during which we imagine that there is a swap between b1 and b0, so that the contained values are exchanged, and then the whole state is rotated by one. In the next step we do the same: we swap these two values and then rotate. After this rotation, as you can see, everything is in place. Then once more we swap these two, and at the final step we just rotate, and we eventually arrive at this value in 4 clock cycles. That is a very simple example, so let's look at a more involved one. Here, assume that you want to do a transposition of a matrix. Assume again that the pipeline entries come in from here and exit from here. In this example of a transposition, what we want is basically swapping these two values, these two values, and so on and so forth; there is some kind of diagonal change in the positions of the symbols, or numbers, whatever you call them. Here we will actually make three full rounds, meaning that each of these rounds consists of 16 executions of either the swap-and-rotate or just the simple rotate operation. We will use the first round to fix the positions of, for example, 14 here. The swap here is denoted with this green color. It allows you to exchange any pair of bits that are aligned in this fashion: you can exchange this pair, and then, in the next clock cycle, because each of the bits will have moved, you can exchange these two, and so on and so forth. That is how it goes; the swap gives you this kind of diagonal exchange power. So in the first round, as you can see, we fix the position of 14 with respect to the final result.
We also use this round to move 13 all the way from the bottom row to its correct row, and similarly 12 moves here. So essentially all of these values are fine now; we have fixed all of them. But there are still things to fix along these diagonals. For that we use the second round, and again this positional swap, as you can see. This sequence allows us to obtain this matrix value, and now this row and this value are correct. The last ones that are incorrect are 6 and 3, and we use the last round simply to fix them. That is how we can actually perform this transposition operation. Well, the transposition does not really represent what PRESENT is doing. In the PRESENT case, like I said before, we can just treat it as an arbitrary permutation: we can pick one swap, derive a sequence of swap-and-rotates, and then look at how many cycles it takes. In this case, it turns out that this is not quite useful, because executing just a single permutation over these bits takes 1,472 cycles, which is way too large. In comparison, you would expect something like 64, because 64 is the number of cycles needed just to rotate everything around completely once. So this leads to a very high latency, and on top of that, it also makes the circuit larger than the previous implementations of PRESENT and GIFT, simply because the sequence of operations needs to be stored somehow, and that sequence is not repetitive. What we can do, naturally, is add a few more swaps: as you can see, the swaps have some kind of power, and by adding different variants of swaps we can enhance our expressiveness in realizing this permutation. So here the goal is to add some swaps and create a kind of trade-off between the number of cycles and the number of swaps that we add. Let's again look at the previous example, but this time we add one more swap operation.
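As an aside, this trade-off can be explored by brute force on a toy register. The following sketch is hypothetical, not the paper's search procedure: it measures, by breadth-first search, the fewest clock cycles needed to reach a target arrangement when any subset of a given swap set may be enabled in each cycle.

```python
from collections import deque

def step(state, enabled):
    """One clock cycle: apply the enabled swaps, then rotate by one."""
    state = list(state)
    for i, j in enabled:
        state[i], state[j] = state[j], state[i]
    return tuple(state[1:] + state[:1])

def min_cycles(n, target, swaps):
    """Fewest cycles turning (0, 1, ..., n-1) into `target`, enabling any
    subset of `swaps` each cycle (in the hardware the swaps are disjoint,
    so they may fire simultaneously). Returns None if unreachable."""
    start = tuple(range(n))
    subsets = [[swaps[k] for k in range(len(swaps)) if mask >> k & 1]
               for mask in range(1 << len(swaps))]
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        if s == target:
            return dist[s]
        for enabled in subsets:
            t = step(s, enabled)
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return None
```

For instance, on a 4-symbol register the arrangement (1, 0, 3, 2) needs 4 cycles with the single swap (0, 1); adding a second swap can only keep the count the same or lower it, since extra swaps only add transitions to the search.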
This new swap allows us to exchange at a distance of 2, instead of only the distance-1 swap we had before. Naturally, you can see that in order to fix the positions of 7 and 13 in this transposition, we can directly use the swap denoted by purple. And if you wait a sufficient number of clock cycles, so that the elements denoted 2 and 8 arrive at these exact positions, and then enable the swap again, we can also fix these ones. So here we use the purple one for this pair, and even for this long-distance one we can use the power of purple to make two jumps at a time. Eventually this leads to a better result in terms of the number of clock cycles, because we can now complete the full transposition in two rounds. We can add one more swap, with distance 3, so that we can actually complete everything at once. Here we can execute all three swaps simultaneously, because there is nothing at the circuit level to prevent that: we can just do all of them at once and then rotate by one, which comes naturally with the swap operation. Then we can complete everything in one round. So, as you can see, introducing a few more swaps allows us to reduce the number of clock cycles. Well, we have seen transpositions, but the PRESENT and GIFT permutations are not transpositions; they look much more arbitrary. The truth is that these permutations have some structure embedded in them, so we just have to look for a composition of permutations that we can perform more easily with the swaps. In this case it turns out that, for example, if you look at the initial nibble here, bits 0 to 3, and look at their new bit positions, they are 0, 16, 32 and 48: you just take this nibble and put it vertically here in the new version, after the execution of the permutation. Similarly, you take the next nibble and put it vertically here. And it goes on like this: you take this nibble and put it here.
So this is the structure that you have. This could instead be done by defining these four different matrices, each of them containing 16 of the bits, and executing over them the transposition that I have shown in the previous slide. Later we just have to do some additional fixing, because each of these vertical columns would not actually be in the exact position that we wanted; we have to do some extra permutation, which is fortunately something we can again do easily with swaps. This is the case for PRESENT. It allows us to find a sequence of swap-and-rotate operations, if we carefully choose the positions of the swaps. Here, as you can see, we have chosen two different swap operations, and these two swaps allow us to complete the full PRESENT permutation in six rounds, where each round has a different sequence of operations. In the case of four swaps, if we increase our power by adding two more swaps, we can make it even shorter: four rounds, meaning four times 64 clock cycles. And if we add two more and use six in total, we can actually complete everything in two rounds; there we use the first round to execute the transposition layer of the permutation, and the second round to fix the positions of the columns. How about GIFT? GIFT does not have the exact same permutation, but we can still use the same technique and divide the permutation into two layers. One of them still looks like a transposition, with the difference that the bits are chosen at intervals, jumping by four positions; we take the transposition of these bits, and then there is another permutation to fix the positions of the vertical columns again. So the same idea works for GIFT, and in the case of GIFT we actually come up with even more efficient sequences.
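For reference, the PRESENT bit permutation being decomposed here has a simple closed form, given in the PRESENT specification, and a few lines of code confirm the nibble-to-column structure described above:

```python
# PRESENT's 64-bit permutation layer in closed form (per the cipher
# specification): bit i moves to position (16 * i) mod 63 for i < 63,
# and bit 63 stays in place.
def present_p(i):
    return i if i == 63 else (16 * i) % 63

# The first nibble (bits 0..3) lands on positions 0, 16, 32, 48 --
# one vertical column when the state is viewed as a 4 x 16 array --
# and the next nibble (bits 4..7) lands on the adjacent column.
print([present_p(i) for i in range(4)])     # [0, 16, 32, 48]
print([present_p(i) for i in range(4, 8)])  # [1, 17, 33, 49]
```

Checking that `sorted(present_p(i) for i in range(64))` gives back 0..63 confirms it is indeed a permutation, since 16 is coprime to 63.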
Two swaps allow us to complete everything in five rounds, four swaps allow us to complete everything in three, and five swaps are sufficient to complete everything in two rounds. This gives us 128 clock cycles and requires only five swaps, meaning 10 MUXes in the circuit, far fewer than the 64 you would have had otherwise with the straightforward technique. Now the more important question, and the harder part, is to push even further and reduce everything to one round. One round matters because, in a 1-bit serial implementation, you already have to run a full round of clock cycles for the key additions. So if you complete even the permutation in 64 clock cycles, the two would be perfectly synchronized, and you basically would not spend any extra clock cycles on the permutation layer at all. That is something quite novel in this work, because previous works always had to freeze the pipeline, or somehow use another latency-incurring approach, to resolve the permutation layer. In this case we find a very good way to interleave these swap operations so that they can actually be completed in 64 clock cycles. And here we have it not only for encryption but for decryption as well, meaning that we have solutions not only for the forward direction of the permutation layer, but also decompositions for the inverse permutations, for both PRESENT and GIFT. In both cases, as you can see, there are six different swap operations introduced in the circuit.
Just to recap the benefits of our approach: we can resolve the dependencies between the two layers in the decomposition of the permutation and interleave them completely, so that it is possible to do everything in one round. The benefit is that the pipeline is continuously active, so you do not have to use extra clock-gating techniques, which are hard in practice, and you do not have to freeze the pipeline, which would require something like an enable signal in the flip-flops and hence larger flip-flops. Of course, there is an extra challenge here, which is the positioning of the S-box and the key addition; we also resolve that in this paper. The last benefit is that if you want to go from an encryption-only circuit to a combined encryption-decryption circuit, the transition is much simpler, and you only need to add a very small number of gates, because we can reuse the same swaps to execute the inverses of the permutations. This is what the combined encryption-decryption PRESENT circuit looks like.
In this figure, the top part is the state pipeline, where you have the blue swaps and the green swaps. The green swaps are the ones executing the transposition layer of the permutation: as you have seen in the previous slides, there was a transposition which we could do with three swap operations in one pass, and that is exactly what is happening here. The blue ones then basically fix the relative positioning of the columns. The bottom part is the key pipeline; there we do not need any swaps. The rest is basically the controller circuit and the key-addition part. To summarize our results: we obtain much smaller PRESENT and GIFT 64-bit implementations by using this neat technique of swaps, which allows us to make further gains in the permutation layers. As you see, we obtain results such as 694 gate equivalents here. Using only one swap is actually not that efficient, not smaller than the previous ones, but one thing to notice is that the difference between the combined encryption-decryption version and the encryption-only circuit is quite small; this growth basically comes from the fact that you have to implement the inverse S-box and do some more complicated things in the key-scheduling part, because of the inverse key scheduling. In the case of GIFT, we again have the smallest results, and not only are they small, they are actually faster per round as well: if you compare, for example, this one with the previous work, you can see that it actually completes a round even faster. The same goes for GIFT, as you can see, 64 versus 96. Our results were later also extended to a GIFT-128 version, which is actually the version used quite frequently in the NIST competition, and to SKINNY and AES, because they are also quite popular among the 15 NIST candidates. Thank you very much for your time.