Hello everyone, my name is Fatih Balli, and this is my presentation for our paper titled "The Area-Latency Symbiosis: Towards Improved Serial Encryption Circuits". Let me start by giving you an example application in which lightweight cryptography is essential. On the left side you see an identity card equipped with a chip, and on the right side there is a terminal which wirelessly powers this card for communication. The eventual goal is to establish a tunnel between these two parties, so that there is bidirectional communication secured from outsiders. This is typically realized by the primitive called authenticated encryption with associated data (AEAD). In this paper we are particularly concerned with the efficiency, or lightweightness, of the implementation of such a scheme. When we talk about efficiency, what we refer to is first the silicon area of the final circuit, and then the throughput and latency metrics. Needless to say, our effort goes in parallel with the NIST lightweight cryptography (LWC) standardization process. Currently it is in the final round, but at the time of writing this paper it was at the end of the second round. What typically happens through standardization, or in the cryptographic discipline in general, is that on the left side, as you can see, we have design paradigms which are used by designers to instantiate their own schemes. In the case of modes of operation, we see different flavors of mode-of-operation definitions employing common block ciphers such as GIFT, SKINNY and AES. In the second round there were four, three and four candidates respectively for each of these block ciphers, whereas in the final round there is only one candidate using GIFT and another one using SKINNY.
A particular question that our work tries to deal with is: how can we make these AEAD schemes as small as possible on the ASIC platform, while also considering some of the other metrics at the same time? Naturally it boils down to how we can efficiently build and implement the underlying block ciphers, so in this paper we look particularly at the implementation of block ciphers. At this point we should remember what these block ciphers look like on the inside, because from an implementation perspective, when we try to build smaller and smaller circuits, we have to deal with different types of operations which may or may not be suitable for the type of strategy we follow in our implementation. In all of these block ciphers, GIFT, SKINNY and AES, what we have is a round function (RF) and a key schedule (KS): algorithms that run once every round. Inside the round function you have the addition of the key, followed by the substitution layer, followed by the permutation layer, and then, depending on the block cipher, a linear matrix operation as well. As I said before, our main goal in this work is to obtain the smallest implementations in terms of silicon area. For example, one could first consider a fully combinatorial (unrolled) implementation of a given block cipher, where each of the RF and KS blocks would be implemented and copy-pasted in a row, so that the final large combinatorial circuit executes the whole block cipher operation. But of course this is very large, because we keep repeating the same type of operations inside the same circuit. A better way would be a round-based implementation, where register blocks are used to store the intermediate values coming out of the RF and KS blocks. Round-based implementations are known to be sweet spots in terms of their performance regarding various metrics.
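The round structure just described, a key addition, a substitution layer, a permutation layer and optionally a linear layer, can be sketched in a few lines of software. This is a hedged toy model of my own for illustration, not the paper's hardware code; it uses the real 4-bit PRESENT S-box as the substitution layer and takes the permutation as a parameter:

```python
# Toy software model of one SPN round over a 64-bit state held as 16
# nibbles: key addition, S-box layer, then a bit permutation.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]  # PRESENT S-box

def add_round_key(state, key):
    return [s ^ k for s, k in zip(state, key)]

def sub_cells(state):
    return [SBOX[s] for s in state]

def permute_bits(state, perm):
    # perm[i] gives the destination index of bit i within the 64-bit state
    bits = [(state[i // 4] >> (i % 4)) & 1 for i in range(64)]
    out = [0] * 16
    for i, b in enumerate(bits):
        out[perm[i] // 4] |= b << (perm[i] % 4)
    return out

def round_function(state, key, perm):
    return permute_bits(sub_cells(add_round_key(state, key)), perm)

# Example: all-zero state and key, identity permutation
print(round_function([0] * 16, [0] * 16, list(range(64))))
```

In hardware, each of these three layers becomes its own sub-circuit, and the question the talk addresses is how much of that circuitry can be shared when the state is processed a few bits at a time.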
For example, they are not necessarily very large, because they use only one RF and one KS block. They also have better, smaller energy consumption compared to other types of architectures, as well as an acceptable range of latency, in the sense that the encryption operation of the block cipher takes a small amount of time. But our interest here is partially theoretical, and we want to know what is the best we can do in terms of area. So we want to optimize further for smaller silicon area, probably at the expense of other metrics. In order to do that, one would follow the trend of further and further serialization. For example, one could first consider a 32-bit serialization for something like the AES block cipher. This means that we update only 32 bits of data every clock cycle, both for the key schedule and the round function operations. This makes the KS and RF blocks smaller, because we avoid repeating some of the same sub-blocks embedded inside the RF and KS functions. We can push this idea even further and have 8-bit serial implementations, and in particular, as in this paper, we can go for 1-bit serial implementations. The 1-bit serial implementations will be the smallest, because they naturally avoid the reuse of the same type of gates inside the KS and RF functions. The two previous works that our results directly correlate to, or build on top of, are as follows. The first one is the bit-sliding paper by Jean et al. from CHES 2017, where the authors provide for the first time 1-bit serial implementations of the block ciphers AES, SKINNY and PRESENT. At the time of that paper these were the smallest known implementations, and this area gain comes from the fact that we are able to remove the gates doing the same operation by reducing the datapath to 1 bit. The second is another paper, written by Banik et al., including myself.
In this paper the authors look in particular at the PRESENT and GIFT-64 block ciphers, because they have the interesting property that their permutation layer operates at 1-bit granularity, and it is hard to implement a permutation like that in a pipelined architecture. They introduce a notion called permutation through swaps: in other words, they define swap operations on top of the pipeline, because those operations can be implemented very cheaply in hardware, and then there is the mathematical question of how we can execute a predefined permutation through conveniently chosen swap operations. Following in the footsteps of these two works, the starting point for us was whether we can apply the same techniques from swap-and-rotate, namely use the idea of swaps, in order to make the circuits of AES, SKINNY or GIFT-128 (the more popular variant) smaller, and also whether we can make other gains in various metrics, namely latency and energy consumption. The motivation was again to explore what I call the continuous-execution paradigm, a different type of implementation strategy for serial architectures, and our idea was to get a closure on the 1-bit implementation domain. Just to summarize our contribution beforehand: we provide 1-bit and 4/8-bit implementations of AES-128, the SKINNY-128 variants, GIFT-128, and another variant of GIFT which assumes a different input and output ordering for its bits. What we provide is a continuously executing pipeline, without any stopping or freezing type of operations, which means that we do not have to resort to something like clock gating, and we do not need things like enable flip-flops, which are larger than the default flip-flops. We also achieve the minimal latency per round, and the source code is available as a CHES artifact.
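Before we look at the designs themselves, the basic area saving behind serialization can be illustrated very simply. This is my own toy sketch, not code from either paper: a 1-bit serial key addition reuses a single XOR gate over 128 clock cycles, instead of instantiating 128 parallel XOR gates.

```python
# Illustration of gate reuse under serialization: one XOR "gate" applied
# once per clock cycle as the state streams through. A w-bit datapath
# would use w XORs per cycle; the 1-bit case is the extreme that
# minimizes gate count at the cost of latency.
def serial_add_round_key(state_bits, key_bits):
    out = []
    for s, k in zip(state_bits, key_bits):  # one bit pair per clock cycle
        out.append(s ^ k)                   # the single shared XOR
    return out

# 128 state bits take 128 cycles but require only one XOR gate
state = [i & 1 for i in range(128)]
key = [1] * 128
assert len(serial_add_round_key(state, key)) == 128
```

The same reuse argument applies to the S-box and the linear layers, which is where the bulk of the area saving in bit-serial circuits comes from.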
With that said, let me come back to the implementation by Jean et al. from 2017, the 1-bit serial pipelined implementation of AES. In this slide what you are looking at is 128 flip-flops organized in the form of a pipeline, so that any bit that enters the pipeline, from here, moves by being shifted towards the left and then in the upward direction. It roughly spends 168 clock cycles until a full round is completed over the 128 bits. This pipeline is surrounded by additional circuitry performing the AddRoundKey, SubBytes, ShiftRows and MixColumns operations. Our attention is particularly on the blue arrows here, denoting the ports for the ShiftRows operation. As you can see, these blue arrows are organized in such a fashion that the ShiftRows operation can be performed over multiple clock cycles. Later, when this ShiftRows operation is completed over the 128 bits of data, it is followed by the MixColumns operation. This figure was taken directly from the paper by Jean et al., and the main novel contribution there is how to execute the MixColumns operation by reading 8 bits of input and then feeding 4 bits back into the pipeline. But we felt there were a couple of things missing here, and I am going to summarize them in the coming slides. So, to summarize what could be improved in this implementation: first, the full round can be executed much faster. In the figure you have seen, the designers require 168 clock cycles per round: 128 clock cycles are spent for the addition of the round key, 8 clock cycles for the ShiftRows operation, and 32 clock cycles for the MixColumns operation, because these three groups of operations are performed in a strict sequence.
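Since the three phases run strictly one after another, the per-round cycle counts simply add up, which is easy to check:

```python
# Cycle budget of one bit-sliding AES round, per the breakdown above:
# the three phases run in strict sequence, so their latencies add.
addroundkey_cycles = 128  # one key bit XORed per cycle over the full state
shiftrows_cycles = 8      # dedicated ShiftRows ports
mixcolumns_cycles = 32    # 8 bits read, 4 bits fed back per step
total = addroundkey_cycles + shiftrows_cycles + mixcolumns_cycles
print(total)  # 168 cycles per round
```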
Moreover, some small rotation ports are added on the pipeline, which means extra gates on top of the pipeline, so that the implementation can avoid freezing some of the content stored in the registers. Furthermore, the permutation could have been done with scan flip-flops, which is the contribution from the other paper I mentioned, so there are a couple of improvements awaiting us on the data pipeline. On the key pipeline, the main drawback of this design is that the pipeline must be frozen for 40 clock cycles, because it finishes its key schedule execution after 128 clock cycles while the round takes 168, which means there is a surplus of 40 clock cycles. Now coming back to the second paper, which was inspiring for us. This paper from Banik et al. looked at the implementation of the PRESENT and GIFT-64 block ciphers, whose blocks are 64 bits. Here the main idea was to use the swap operation. A swap operation is a pair of flip-flops interconnected in such a way that if the swap is activated in a clock cycle, the bits stored within are exchanged; if it is disabled, the pipeline moves in the regular fashion. By using swap operations, the designers show that you can execute permutation layers which look rather arbitrary, by carefully choosing the swap locations. This means the implementation becomes much smaller if you use swap operations. Furthermore, it can actually execute the permutation faster, because the swap operations being used can run in parallel with the other operations, namely AddRoundKey and SubBytes. And the main challenge in such a design, again, is first to express the permutation in terms of conveniently chosen swap locations.
The secondary challenge is to interleave the operations so that there is no data dependency between the operations running at the same time. By doing so, the authors were able to complete the full round update in 64 clock cycles, without using additional clock cycles for the permutation operations. This is how the pipelines eventually look, where the pairs of colors denote the specific locations of the swaps, which are of course hard-coded in the circuit. The positions of the swaps are carefully chosen, as I said before, and depend very much on the definition of the permutation of the block cipher. So it does not naturally follow that we can directly do the same thing for AES, SKINNY or other block ciphers; it means we have to spend extra effort to also come up with those convenient swap sequences. To come to our contribution in this paper, let me use the 1-bit serial AES implementation as an example. In this implementation, which looks quite similar to what we have seen in Jean et al.'s implementation, we again have a pipeline consisting of 128 flip-flops. The first main difference is how we execute the ShiftRows operation. In Jean et al.'s implementation, ShiftRows was done in a natural fashion, by connecting 32 flip-flops at a time for each of the rows, whereas in our case we only use three conveniently placed swap locations in the circuit. So the first main challenge that we resolved was expressing the ShiftRows operation in terms of swaps. The second, and harder, challenge was to accommodate all of these operations in a way that everything can be done in exactly 128 clock cycles.
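The "permutation through swaps" question can be phrased very simply in software. Here is a hedged toy check of my own, with made-up indices rather than the actual swap schedule of any of these ciphers: does a chosen list of transpositions, applied in order, compose to the target permutation?

```python
# Toy model of scheduling a permutation as swaps: apply a list of
# transpositions in order and compare the result to the target mapping.
def apply_swaps(n, swaps):
    perm = list(range(n))          # perm[pos] = element currently at pos
    for i, j in swaps:
        perm[i], perm[j] = perm[j], perm[i]
    return perm

# Example: realize a rotation of a 4-element state with 3 adjacent swaps,
# the kind of decomposition the hardware schedule has to find for the
# cipher's real permutation layer.
target = [1, 2, 3, 0]
assert apply_swaps(4, [(0, 1), (1, 2), (2, 3)]) == target
```

In the hardware setting the extra difficulty, as the talk explains, is that the swaps must also be scheduled in time so they do not collide with the key addition and S-box operations running on the same pipeline.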
So imagine the journey a single bit goes through. The round key addition is first performed on this bit at the entrance of the pipeline (not shown in this slide), and then it is fed into the pipeline. When the bit reaches the convenient position, the S-box operation is performed and the result is loaded. Later, this bit goes through the particular ports of the swap gates, and if the swap needs to relocate this bit to its new, correct location implied by the ShiftRows operation, it arrives there through the swap gates. Then, later, the MixColumns operation is performed. But what is really different, and I think novel, in this particular implementation of ours is that while this bit is going on its own journey, possibly through the MixColumns gates, there can be other bits from the next round which are just being updated by the S-box, which means that every bit is basically treated separately. Every bit that enters the pipeline now exits the pipeline in 128 clock cycles. That means we can completely avoid the freezing idea, we can avoid things like clock gating, and we can streamline all of the operations. That became the major challenge we resolved in this paper, because if you compare it with the previous implementation of PRESENT and GIFT, the paper that introduced the idea of swaps, there they had to accommodate only three layers of operations, because there was no MixColumns operation, whereas in this case you have four layers which have to continuously update the bits that are coming in and going out of the pipeline. Furthermore, this implementation strategy has been realized on top of the SKINNY and GIFT 128-bit variants as well.
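The continuous-execution idea can be captured in a small timing sketch. This is my own illustrative model, under the assumption stated above that one bit enters per cycle and every bit exits exactly 128 cycles later: the exit of a bit of round r then coincides with the entry of the same bit position for round r+1, so the pipeline never has to freeze.

```python
# Toy timeline of the continuous pipeline: with a 128-stage 1-bit
# pipeline and one bit entering per cycle, bit i of round r enters at
# cycle r*128 + i and exits 128 cycles later.
def entry_cycle(round_idx, bit_idx, block=128):
    return round_idx * block + bit_idx

def exit_cycle(round_idx, bit_idx, block=128):
    return entry_cycle(round_idx, bit_idx, block) + block

# A bit's exit in round r lines up exactly with its entry in round r+1,
# so there are no idle cycles between rounds:
assert exit_cycle(0, 5) == entry_cycle(1, 5)
assert exit_cycle(3, 127) == entry_cycle(4, 127)
```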
So we also provide implementations for these two popular block ciphers among the NIST LWC candidates. We do not conclude our work at 1-bit serial implementations only; we also port our ideas to 8-bit serial implementations. The 8-bit serial implementations can be useful if the designer is willing to spend a bit more in terms of silicon area to gain a lot of throughput and minimize the latency even further. In the case of AES, we provide the very first 8-bit serial implementation that can execute a single round in exactly 16 clock cycles. But this comes with a drawback: we have to use two S-boxes, because the key schedule and the round function operations both use the S-box, which means a single S-box cannot be shared while maintaining this minimal latency of 16 clock cycles per round. The second thing about this AES implementation is that it is slightly larger, a consequence of using two S-boxes, because the AES S-box is particularly large. Then, again, we resolve the issues around the interleaving of the operations. In the case of SKINNY, we do not have any major conflicts when we upgrade our implementation to 8-bit serialization; it just requires duplicating one gate by a factor of eight to realize the swap gate. GIFT-128 was implemented in a 4-bit serial fashion only, because its S-box is 4-bit, and here we resort back to using the fully muxed pipeline. But of course the main message here is that we are able to maintain 32 clock cycles for executing a full round. And you might wonder, how does it really fare in comparison to state-of-the-art implementations? Here you can find the direct comparison. In the case of AES, the size of the circuit is rather similar: it is slightly smaller in the NanGate 45 nm library, but the difference is so small that we would consider them equal.
But in the case of latency you can actually see the main gain: from 168 clock cycles down to 128 clock cycles per round, and this is the minimum you can achieve for a 1-bit serialization anyway. This of course directly translates into the energy consumption as well, because now we can avoid these extra clock cycles, as well as the unnecessary rotations which were only there to keep the pipeline synchronized. The same type of result applies to SKINNY as well: you see roughly the same size in comparison with the previous work of Jean et al., and here again the same gain in terms of latency as well as energy consumption. In the case of GIFT it is slightly different: in one particular case we do not end up with the smaller power consumption. That is only because now, when we try to make everything complete in 128 clock cycles, we have to make major improvements on the key pipeline, and the key pipeline becomes a major problem if you want to run it faster. Similarly, we can give a comparison with the state-of-the-art 8-bit implementations. Unfortunately, for some of the comparisons we do not have implementations in the same common library, so we cannot provide a direct comparison, especially in the case of AES. For example, if you take the numbers as they are, it looks as if our implementation is slightly smaller than the one from the previous work, but the reality is that the STM library typically yields smaller gate-equivalent numbers, and our AES implementation is probably slightly larger, because although we make gains in other parts of the circuit, we use two S-boxes in the end. In the case of latency, you can see the gain as clearly as possible, because latency does not depend on the technology library at all; it is a feature of the implementation.
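The clock-cycle latencies compared throughout this section all follow from a simple relation: with a w-bit serial datapath and a b-bit block, a round can finish no faster than b/w cycles. A quick sanity check of the numbers quoted above:

```python
# Minimum cycles per round for a b-bit block with a w-bit serial datapath.
def min_cycles_per_round(block_bits, width):
    assert block_bits % width == 0
    return block_bits // width

assert min_cycles_per_round(128, 1) == 128  # 1-bit serial, 128-bit block
assert min_cycles_per_round(128, 8) == 16   # 8-bit serial, 128-bit block
assert min_cycles_per_round(128, 4) == 32   # 4-bit serial, 128-bit block
```

Hitting these lower bounds exactly is what the continuous-execution pipeline achieves, and it is why the latency gain holds regardless of the technology library.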
Again, for the case of energy, it is better not to attempt a comparison, because the libraries are different. In the case of SKINNY we can actually provide some comparison, but not with the work from Jean et al., again for lack of a common library. It is clear to see that, latency-wise, our implementations provide the smallest, the minimum, number of clock cycles, and typically lower energy consumption compared with the previous work of Banik et al. here. In the case of GIFT, again, we make some gains in latency and energy consumption as well, but it must be noted that the implementation we did here is a different variant of GIFT, which assumes a different input and output ordering for convenience; especially in software implementations, this variant becomes very convenient and cheap to implement. At the beginning of my presentation I talked about design paradigms, and about the mode of operation as one particular design paradigm which provided a lot of schemes in the LWC, especially in the second round. Therefore it is natural to say that block ciphers are not end-user primitives, so we also need to consider the modes of operation. Here, just to show you that the mode-of-operation circuitry can also be built quite cheaply, and that it is possible to execute these operations in a 1-bit, 4-bit or 8-bit serialization, we provide implementations for the authenticated encryption primitives as well. In particular, I think SUNDAE-GIFT is very impressive, because it is quite small and it is a full end-user primitive. So in this paper we provided one candidate implementation covering at least one mode for each of the block ciphers we implemented. With that, let me conclude my presentation and summarize the contributions of this work.
Again, we provide open-source implementations, accessible through the CHES artifact: 1-bit and 4/8-bit serial implementations of AES-128 and the three SKINNY-128 block-size variants, as well as implementations of GIFT-128 and of the other variant of GIFT which assumes a different input and output ordering. The main novel technique we introduce is the one where the pipelines are continuously active and the operations are interleaved in such a fashion that while some of the bits are going through operations like AddRoundKey or SubBytes, other bits are being processed by the MixColumns or ShiftRows operations. The natural outcome of such an approach is that we achieve the minimum latency possible at this level of serialization for the given block ciphers. On top of this, the implementations we provide always respect the standard ordering: one of the very convenient things that comes with this swap technique is that we can follow the standard bit ordering, and the swaps are very useful for handling the movement of the bits while they go through the pipeline. With that, I thank you for your time and attention.