on the last day. My name is Aksir Poshman and I'm chairing this session on hardware implementations. Without further ado, I'd like to introduce Arash, who is giving a talk on smashing the implementation records for the AES S-box.

Okay, thank you. Good afternoon, everyone. This is joint work with my postdoc Mostafa Taha and with Doaa Ashmawy at Western University in Canada. Let me give a brief outline of my talk. I start with a short introduction, and then I'll talk about our proposed architecture. Within this architecture, we use a number of new things. First, we propose new logic minimization algorithms. Then I talk about our new composite field inversion, which includes the exponentiation, the subfield inversion, and the output multipliers, and I will explain each of them. At the end, I give the comparison and the concluding remarks.

So this shows a very brief history of the S-box, which was introduced in 1998 and became a standard in 2001. There is a large number of implementations of the S-box. The first one using the tower field, I would say, is the one by Satoh in 2001. There are different targets for implementing the S-box. One is targeting small area: the most compact S-box, I would say, was designed by Canright in 2005. The gate count of Canright's design was reduced to 115 gates in 2010, and then to 113 in 2016. On the other hand, there are a number of implementations which target small delay or high efficiency, and these two are, I would say, the best ones. In this paper, we target both of these areas, low area and low delay. We propose the most compact S-box to date, and we propose another design as the most efficient S-box to date.

So let me briefly explain the implementation pitfalls that we try to avoid in our implementation. The first one is using AND gates.
When we teach undergraduate courses, we tell students that if you want to implement an AND gate in CMOS technology, you have to use a NAND gate and connect it to a NOT gate. As a result, an AND gate in hardware, in an ASIC implementation, is slower than a NAND gate. NAND gates are cheaper and faster, so it's better to use NAND gates instead of AND gates. The second pitfall that we try to avoid is using only simple gates, when compound gates such as AND-OR-Invert may be more efficient in different technologies.

So in this paper we improved the previous designs, which use AND gates, into designs using NAND and NOT gates. In our improved designs we use Boolean algebra to make the new design equivalent to the original one in terms of functionality. This table shows the implementations in the CMOS 65 nm technology that was available at Western University, and it shows the different designs in terms of area and delay. Area is measured in gate equivalents, that is, two-input NAND gate equivalents. As you see, the improved designs have lower area and are faster than their original counterparts. If you want to know which one is the smallest: these two are the smallest originals, this one is the fastest original, and the smallest and fastest improved designs are shown here. At the end of this presentation we compare our S-boxes with the improved ones, and the formulations of the improved designs are included in the paper.

So let's talk about the S-box. The S-box consists of the multiplicative inverse over the binary field GF(2^8), followed by the affine transformation. The affine transformation is multiplication by a constant matrix followed by addition of a constant vector. When you want to implement this S-box for low area, or even for speed, you need to use a composite field.
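As an aside, the definition just stated, field inverse followed by the affine transformation, is easy to check in software. The following Python sketch is my own illustration of that definition (not the hardware formulation from the talk), using the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 and a brute-force inverse:

```python
AES_POLY = 0x11B   # x^8 + x^4 + x^3 + x + 1, the AES reduction polynomial

def gf_mul(a, b):
    """Carry-less multiply in GF(2^8), reducing modulo AES_POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return r

def gf_inv(a):
    """Multiplicative inverse in GF(2^8) by brute force (0 maps to 0)."""
    if a == 0:
        return 0
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

def affine(x):
    """The AES affine map: b'_i = b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i."""
    y = 0
    for i in range(8):
        bit = (x >> i) & 1
        for k in (4, 5, 6, 7):
            bit ^= (x >> ((i + k) % 8)) & 1
        y |= bit << i
    return y ^ 0x63    # the constant vector {63}

def sbox(x):
    return affine(gf_inv(x))

assert sbox(0x53) == 0xED   # the worked S-box example in FIPS-197
```

The brute-force inverse is of course nothing like the composite field circuits discussed in the talk; it only pins down the function the circuits must compute.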
In order to use a composite field, what we do is use a transformation matrix which converts the 8-bit input, then perform the inversion in the composite field, and at the end we map the composite field representation back to the binary field and apply the affine transformation there. This part is our input transformation, and you can combine these three blocks into one block as the T-out, or output transformation.

So this is our proposed architecture, which has the input transformation that maps the 8 input bits to 20 bits. There are 12 additional terms compared to the original S-box that you saw before, and these terms are shared between the exponentiation and the multipliers. For this, we propose new logic minimization algorithms for the input and output transformations, and we propose new formulations for the exponentiation, the subfield inverter, and the output multipliers. After we design each block, we try to use all the available resources: we optimize by hand, we also try to optimize with CAD tools, and then we put the optimized block into the entire S-box and see what happens.

So now I'm going to talk about the logic minimization algorithms that we propose. Before getting to our algorithms, let me define the problem: a logic minimization algorithm implements the isomorphic transformation matrices using a small number of gates, and this is a hard problem. We have two transformation matrices, one at the input and one at the output, and this figure shows only the input transformation matrix. These are the optimum results for the input transformation matrix that we used: it has 8 inputs and 20 outputs, and the 12 additional outputs are shown here, where each a_ij is the XOR of a_i and a_j, and similarly for each b_ij. There are some previous works, which include the cancellation-free heuristics.
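To make the role of such GF(2) transformation matrices concrete, here is a small Python helper of my own (not the paper's optimized circuit), where each matrix row is stored as a bitmask and each output bit is the parity of the masked input; the 3x3 matrix shown is hypothetical, purely for illustration:

```python
def apply_gf2_matrix(rows, x):
    """Multiply a GF(2) matrix by the bit-vector x.

    rows[i] is a bitmask of the input bits that are XORed together to
    form output bit i, so output bit i is the parity of (rows[i] & x).
    """
    y = 0
    for i, row in enumerate(rows):
        y |= (bin(row & x).count("1") & 1) << i
    return y

# A toy 3x3 transformation matrix (hypothetical, for illustration only):
T = [0b011,   # y0 = x0 XOR x1
     0b110,   # y1 = x1 XOR x2
     0b101]   # y2 = x0 XOR x2
```

A naive implementation like this spends one XOR gate per set matrix bit beyond the first in each row; the logic minimization algorithms discussed next exist precisely to share intermediate XORs across rows.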
In the cancellation-free approach, gates are never used to cancel out common terms; then there are heuristics with cancellation, and let me briefly explain that algorithm. If you want to optimize the multiplication of a matrix by a vector, your inputs would be the g signals and your outputs would be on the right side. In this algorithm, we test adding one gate, say g4 XOR g5, and then in the next step we compute the Hamming distance to each target, select the gate leading to the minimum average distance, add the selected gate, and redo the process.

In this paper, we propose three logic minimization algorithms. One is an improved version of the previous one, which tests all the ties and monitors the progress of the delay; the two others are shortest-distance-first and focused search. I'm not going to explain these two algorithms, but they are shown briefly here and explained in the paper.

So let me show you the results of our algorithms. We studied these two matrices, T-in and T-out, for all possible isomorphic transformations, a total of 96 matrices, and the proposed algorithms consistently lead to equal or better implementations. These are the results for the lightweight implementation in terms of the number of XOR gates: these three are our proposed algorithms, and this is the original one. We also tried the available Synopsys CAD tool to see what Synopsys would produce. What we did is code the matrix-by-vector multiplication using behavioral VHDL and let the CAD tool optimize it for us, and that is the result we got from the CAD tool; you can see it is not optimized. And then you can see our result, which is the one actually used in our design.
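The greedy distance-based search described above can be sketched in a few lines of Python. This is my own simplified illustration of the general "heuristic with cancellation" idea, not the paper's exact algorithm (it ignores ties and delay, which the paper's improved version handles): signals and targets are bitmasks over the input variables, and the distance function is a brute-force count of how many further XORs are still needed.

```python
from itertools import combinations

def dist(target, signals):
    """Fewest extra XORs of known signals needed to produce `target`."""
    if target in signals:
        return 0
    for k in range(2, len(signals) + 1):
        for combo in combinations(signals, k):
            acc = 0
            for s in combo:
                acc ^= s
            if acc == target:
                return k - 1     # XORing k signals costs k - 1 gates
    return float("inf")

def greedy_slp(targets, n_inputs):
    """Greedily add XOR gates until every target row is computed.

    Returns the gates as (i, j) index pairs into the growing signal list.
    """
    signals = [1 << i for i in range(n_inputs)]    # the inputs themselves
    gates = []
    remaining = [t for t in targets if t not in signals]
    while remaining:
        best = None
        for i, j in combinations(range(len(signals)), 2):
            cand = signals[i] ^ signals[j]
            if cand in signals:
                continue
            # Total (hence average) distance to the remaining targets if
            # this candidate gate were added.
            total = sum(dist(t, signals + [cand]) for t in remaining)
            if best is None or total < best[0]:
                best = (total, i, j, cand)
        _, i, j, cand = best
        signals.append(cand)
        gates.append((i, j))
        remaining = [t for t in remaining if t != cand]
    return gates

# Toy matrix: y0 = x0^x1 and y1 = x0^x1^x2 — two shared gates suffice.
assert len(greedy_slp([0b011, 0b111], 3)) == 2
```

The brute-force distance makes this exponential in the worst case; it is only meant to show the select-the-gate-with-minimum-average-distance loop.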
The actual formulations for the lightweight design are provided in the paper, so you can see them there. Our second target was the fast implementation, and these are the results in terms of the number of gates and the gate delay that we obtained.

So next, I'm going to talk about our composite field inversion. The new exponentiation stage, which is shown here: this block basically combines these sub-blocks into one block. We made two designs, one lightweight and one fast, and we optimized them by hand; we also used the CAD tool to give us an optimized design. The results are shown here: we implemented them in ASIC, the lightweight and fast designs optimized by hand are shown here, and this is the one optimized by the CAD tool. In our own design we did not consider three-input XOR gates, we treated a three-input XOR as just two cascaded two-input XOR gates, but the CAD tool gave us a design using three-input XOR gates. You can see this one is used for the lightweight implementation, and this one is used for the fast implementation.

Similarly, we applied the same method to the subfield inversion, and we considered many different cases; only this one is shown here. We have one section in the paper, I believe it is Section 3, where we explain every detail of what we did, but that is not part of my presentation. Okay, so the results of the implementations are shown here. For one bit of the subfield inversion there are two designs, one optimized by hand and one optimized by the CAD tool.
So the subfield inversion has a 4-bit input and a 4-bit output, and this is only for one bit; the three other bits are cyclic versions of this implementation, because we are using a normal basis, and that's the beauty of the normal basis. You can see here that for the subfield inversion we are using the one optimized by the CAD tool, which has lower area and lower delay, and the reason is that it uses a compound OR-AND-Invert gate that we did not consider in our design.

So the last block is the output multipliers. There are two multipliers with a common input E; each multiplier has two 4-bit inputs, and it has five outputs, and the reduction from five bits to four bits is part of the output transformation matrix. In the previous designs, they considered different numbers of inputs and outputs, as you see here. In our design, we focus on the combined cost of the two multipliers and try to share as much as possible in order to make it compact. In order to share terms, we have to use a multiplier which has XOR gates in its first layer, because that gives us the flexibility to share. The multiplier has three layers: the first layer is XOR gates, the second layer is NAND gates, and then XOR gates. The two XOR blocks here are part of the input transformation matrix, and the second XOR layer we have to implement and then share between the two multipliers. Some multipliers do not allow sharing. If you want to compare the complexity of only one multiplier, this table shows the multipliers that are used in the different S-boxes.
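The "beauty of the normal basis" mentioned above, that one output bit's circuit serves for all four bits via rotation, rests on the fact that squaring in a normal basis is a cyclic shift of the coordinates. Here is a small Python check of that fact for GF(2^4); this is my own toy construction, using the polynomial x^4 + x + 1 and a brute-force search for a normal element, so the basis it finds need not match the one in the paper:

```python
POLY4 = 0b10011                       # x^4 + x + 1, irreducible over GF(2)

def mul4(a, b):
    """Multiply in GF(2^4) modulo x^4 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= POLY4
        b >>= 1
    return r

def pow4(a, n):
    r = 1
    while n:
        if n & 1:
            r = mul4(r, a)
        a = mul4(a, a)
        n >>= 1
    return r

def combine(mask, basis):
    """XOR together the basis elements selected by mask."""
    acc = 0
    for i in range(4):
        if (mask >> i) & 1:
            acc ^= basis[i]
    return acc

def find_normal_basis():
    """Find beta so that {beta, beta^2, beta^4, beta^8} spans GF(2^4)."""
    for beta in range(2, 16):
        basis = [pow4(beta, 1 << i) for i in range(4)]
        if len({combine(m, basis) for m in range(16)}) == 16:
            return basis
    return None

def coords(e, basis):
    """Coordinates of e in the normal basis, as a 4-bit mask."""
    for m in range(16):
        if combine(m, basis) == e:
            return m

basis = find_normal_basis()
# Squaring maps sum(c_i * beta^(2^i)) to sum(c_i * beta^(2^(i+1))),
# i.e. it rotates the coordinate mask left by one (since beta^16 == beta).
for e in range(16):
    m = coords(e, basis)
    assert coords(mul4(e, e), basis) == ((m << 1) | (m >> 3)) & 0xF
```

The same rotation symmetry is what lets the hardware compute one output bit of the inverter and obtain the other three by cyclically permuting the inputs.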
Actually, that was my starting point in designing this S-box: we considered many different multipliers, we ended up with this multiplier, and that was the starting point for designing the other parts. As you see, this multiplier has a lower number of gates, and it has the lowest delay, where d_X is the delay of an XOR gate and d_NAND is the delay of a NAND gate.

So when we consider the two multipliers together, this is the block diagram of the two multipliers combined. The two XOR blocks are not shown here, because they are part of the input transformation matrix. We implemented these, optimizing by hand and also with the CAD tool, and as you see, the hand optimization is better in terms of both area and delay.

So now I'm going to compare our results. We use the best version of each block, and our target lightweight implementations are shown here. As you see, our lightweight design is the smallest, fastest, and most efficient lightweight S-box. Then we target the fast implementation, and you can see our design is the most efficient design, where efficiency is measured by the area-time product. In this slide we compare only against the improved versions, and we coded a large number of VHDL designs in order to obtain the best one; the 46, I think, is not the exact number, but anyway, that's what we did. There could be an effect of the target technology, because we used one technology, but if you use industrial technology libraries like the one we used, the results would be the same.
So in the lightweight design, it uses three-input XOR gates and OR-AND-Invert gates with three and two inputs, and it ends up with 182.25 gate equivalents; the fast one uses NAND3, which is the three-input NAND gate. If you use just the free libraries which are available — what we used is the NANGATE 45 nm technology — it doesn't actually have three-input XOR gates, and it doesn't have that specific compound gate, but it has other kinds of compound gates. For the lightweight design, the gate equivalents increased slightly, but we get the same gate equivalents for the fast implementation. If you don't want to use any compound gates at all, then your result would be 191 gate equivalents, where the best previous result was 194, and the fast design gives us this one. As you see, the proposed designs are superior under any restriction imposed by the target library.

So let me conclude my talk. In this paper we propose two new designs for the AES S-box, one lightweight and one fast, and for each of the blocks we propose new components, which include new logic minimization algorithms, new formulations for the sub-blocks, and the output multipliers. More importantly, we propose a design methodology for an optimum synergy between theoretical analysis and technology-dependent CAD tools. Thank you for listening; I'm happy to take any questions.

Thank you for the work, it's impressive to see that after 20 years of research there are still substantial gains to be made. Actually, we don't have time for any questions; if you do have a question, please see the author in the coffee break. Let's thank the speaker again.