minimization techniques for smaller, faster AES S-boxes, and this is joint work by Alexander Maximov and Patrik Ekdahl, and Patrik is going to give the presentation. Thank you very much for the introduction. So this is all about trying to minimize the AES S-box in an ASIC hardware implementation. Now, minimization could mean many things. You can minimize the area, you can minimize the critical path and thereby increase the maximum clock frequency of your circuit, or you can minimize, for instance, the power, and so forth. Typically, if you want to maximize the clock frequency of the circuit, you would use some kind of lookup table: just let the synthesis tool do the work for you and implement the big table. The problem here is that the area of that lookup table is going to be quite big. So if you are in a position where you need both quite a small area and quite a fast circuit, then you will need to implement the S-box with some kind of gate logic in the ASIC. That is what we have tried to do: minimize the area, subject to some delay constraints on the circuit as well. If you fall asleep right after this slide, I have put the highlights here of what to remember. In this paper we have some new, improved methods for circuit minimization, and we also have a new architecture for the S-box which shortens the critical path and thereby increases the possible clock speed. So, typically, this is the flow of the AES S-box: you have the inversion in the field GF(2^8), then an affine transformation, and then you get your output R. The problem is that a direct implementation of this inversion in the Rijndael field is pretty complex. So already in 2000, Rijmen proposed, based on work by Itoh and Tsujii, to use a composite field instead and do the inversion in the field GF(2^4).
So if you do that, you typically get... let me see, you have one of these... oh, never mind. You get this structure here, where you have a base conversion matrix to start with, then your nonlinear parts, then a base back-conversion matrix, and then the affine part of the S-box. Satoh et al. in 2001 reduced this inversion further, down to the field GF(2^2). And maybe the most cited work in this field is Canright from 2005, who investigated the importance of the subfield representation, because the circuit will look a bit different depending on what base you convert to. So he investigated a lot of different subfield representations there. That is for low area. If we go for low depth, or fast circuits, the main work, or maybe the most famous work, is by Boyar and Peralta in a series of papers. They used a normal base representation for the elements and a somewhat different structure for the inversion circuit, where they first raise the element to the power 17 and then do the inversion; because of the normal base, raising to a power of two is basically just a shift of the coefficients. But in all the basics it is the same: you have the base conversion matrix, the nonlinear parts, and then the back base conversion matrix. Several people followed this result: Nogami et al. looked at mixed bases, Ueno et al. in 2015 looked at redundant bases, last year at CHES Reyhani-Masoleh et al. improved the Boyar-Peralta algorithm for searching for good circuits for the base conversion matrices, and recently this year Lee et al. also incorporated some other aspects regarding the depth into the BP algorithm, which I will talk about later in this presentation. So what you typically do now, given this structure, is take all the linear parts from the first part and push them into the base conversion matrix, and take all the linear parts after the inversion and push them into the back base matrix.
And then you end up with sort of an architectural starting point, where you have this top linear matrix taking the input and producing 22 bits of output for the different things in the middle part; then the middle part, where the inversion happens in the field GF(2^4); and then a bottom linear matrix, which does the back base conversion together with the affine transformation of the AES S-box. So the problem we are trying to solve now is how to minimize these top and bottom linear matrices. The basic problem is: given a binary matrix M and a certain maximum allowed depth, find a circuit of depth D less than or equal to this maximum depth, with a minimum number of XOR gates, that computes Y = M times X. Here is just a small example of what the matrix looks like; basically you have only 1s and 0s in the matrix. What we have done in this paper, in addition to that statement, is add two extra requirements. One is the additional input requirement, which means that input signals may arrive with different delays: think of the bottom matrix, where the nonlinear part feeds into it, so the signals can arrive into that matrix with different delays, and you can take advantage of that. The other is the additional output requirement: output signals from the top linear matrix may need to arrive earlier than certain others, because they are part of a critical path. So we can put additional constraints on the individual signals in these two matrices. The contributions of the paper are as follows. We have some new techniques for minimizing these top and bottom matrices. There is a famous cancellation-free algorithm by Paar from 1997; we introduce a probabilistic, heuristic approach to that algorithm, which gives somewhat better results. And we also have a cancellation-allowed algorithm, which is an exhaustive search algorithm.
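As a minimal illustration of the minimization target (using a toy matrix, not the real AES conversion matrices), the problem is to compute Y = M·X over GF(2); a naive circuit spends (row weight minus 1) XOR gates per output, and the algorithms in the talk try to share intermediate XORs to beat that baseline under a depth bound:

```python
def gf2_matvec(M, x):
    """Multiply binary matrix M (list of 0/1 rows) by bit vector x over GF(2)."""
    y = []
    for row in M:
        bit = 0
        for m, xi in zip(row, x):
            bit ^= m & xi  # XOR-accumulate the selected inputs
        y.append(bit)
    return y

def naive_xor_count(M):
    """XOR gates needed when each output row is computed independently."""
    return sum(max(sum(row) - 1, 0) for row in M)

# Toy 3x4 example, not an AES matrix
M = [[1, 1, 0, 1],
     [0, 1, 1, 0],
     [1, 0, 1, 1]]
print(gf2_matvec(M, [1, 0, 1, 1]))  # -> [0, 1, 1]
print(naive_xor_count(M))           # -> 5 (rows of weight 3, 2, 3)
```

Sharing the common term x2 ^ x3 between rows, for example, is exactly the kind of saving the search algorithms look for.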
That one, though, can only be applied to matrices with few enough inputs: we can apply it to the top matrix, which has 8 bits of input, but not to the bottom matrix, which has 18 bits of input. I will not talk about those two algorithms here in the presentation. What I would like to talk about instead is the floating multiplexers and the generalization of the Boyar-Peralta algorithm to other types of gates, which gave some really good results for the combined S-box. Then we have the new architecture that removes the bottom matrix and reduces the overall depth of the circuit, new circuits for the inversion operation, and some additional things that we call the initial transformation matrices, which also gave very good results. So, the combined S-box. Depending on what mode of operation you aim to run your AES in, you might need both the forward and the inverse version of the S-box; in some cases you only need the forward one, but in some cases you need both. The typical way to implement this is to have two different top matrices, then a multiplexer, then the common part shared by both, since the forward and inverse S-box both use the field inversion, then two different bottom matrices as well, and finally you multiplex the outputs together. Now, if you look at the top part here, you have two different Ys, one for the forward and one for the inverse direction. Maybe the expression for the forward one is x0 + x1, and for the inverse it is x0 + x2. You would then do a multiplexer selection between these two signals, x0 + x1 and x0 + x2. But if you instead replace this with a multiplexer of x1 and x2 followed by the XOR with x0, you save one gate: in the first case you need one multiplexer and two XORs, whereas in the second case you only need one multiplexer and one XOR gate.
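The gate saving from the floating-multiplexer rewrite can be checked exhaustively; this small sketch verifies that MUX(s, x0^x1, x0^x2) equals x0 ^ MUX(s, x1, x2) on all inputs, which is what lets one XOR be shared:

```python
def mux(s, a, b):
    """2-to-1 multiplexer: returns a when s == 0, b when s == 1."""
    return b if s else a

# Exhaustive check of the rewrite over all 16 input combinations
for s in (0, 1):
    for x0 in (0, 1):
        for x1 in (0, 1):
            for x2 in (0, 1):
                lhs = mux(s, x0 ^ x1, x0 ^ x2)  # 1 MUX + 2 XOR gates
                rhs = x0 ^ mux(s, x1, x2)       # 1 MUX + 1 XOR gate
                assert lhs == rhs
print("floating-MUX identity holds on all inputs")
```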
So in this case we are floating the multiplexers up into the linear matrices to take advantage of that. In general, you can replace any expression of the form A + MUX(B, C) with A + delta + MUX(B + delta, C + delta), where delta is some linear combination of the input signals. In the paper we have algorithms for how to search for these deltas to get better circuits. But we can do more: we can take this famous Boyar-Peralta algorithm and try to extend it. This is the basic Boyar-Peralta algorithm, and it uses a few notions. There is the notion of a point; in the original algorithm this is just a linear combination of input signals. There is a set of gates that you can use; in this case only the XOR gate. Then we have a base set of known points, which initially is just the input signals, and in this case they all arrive at the same time, so this algorithm does not take the additional input requirement, with different delays, into account. Then you have a set of target points which you want to reach; basically these are the rows of the matrix M, the Ys that we want to find a good circuit for. And you have a metric, using a distance function, and this is the delta here: given a certain set of known points, how many XORs do we need to implement Y from the set of known signals? Initially this is just the Hamming weight minus one, the number of XORs that you need for each row. Then you have a set of candidates at each step. The algorithm basically runs like this: you try all pairs s_i, s_j in the set of known points, and you form a candidate c by combining these two points with the available gates. In this case there is only an XOR, so the candidate is s_i + s_j.
Now you calculate the new distance vector delta based on the joint set, the known signals plus the candidate. We save the candidate that gives the lowest distance, and then we start all over again, repeating until the distance vector is all zeros; then we have found a circuit. Now, what we do in this paper is extend this algorithm with different gates. In order to float these multiplexers high up into the top matrix, instead of using only XORs we say that, okay, we can use multiplexers as well. And since multiplexers do not commute, we have to consider both MUX(v, w) and MUX(w, v), of course, so there are six gate combinations to consider. In this case a point is not just a linear combination of the input: we have a pair (F, I) for the forward and the inverse case, where F and I are each a linear combination of the input signals, and this translates into an expression using a multiplexer selecting between F·x and I·x. The inputs and target points are basically the same as in the original algorithm, but the metric, and the way to calculate delta, gets more complex this time, so we have a lot of new ideas in the paper for speeding up this delta computation as well. And at each step of the algorithm, we track the additional input and output requirements, to check that we fulfill them at all times, and to take advantage of them where we can. Now, for the full affine transformation you need a somewhat more complex expression than what I wrote up there, but this should give you the idea of what we do in the paper. You can take this even further and define a Boyar-Peralta algorithm for basically any nonlinear circuit. Here we consider all kinds of gates in the set G: different numbers of inputs, linear, nonlinear, whatever. But now a point is the truth table of a Boolean function.
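The greedy loop just described can be sketched in a few lines. This is a toy version under simplifying assumptions: XOR gates only, no depth tracking, and the distance function is brute-forced over subsets (fine for tiny instances, not for the real 8- and 18-input matrices):

```python
from itertools import combinations

def dist(base, target):
    """Fewest XORs to build target from known points: (min subset size
    whose XOR equals target) - 1.  Brute force, toy sizes only."""
    if target in base:
        return 0
    for k in range(2, len(base) + 1):
        for combo in combinations(base, k):
            acc = 0
            for p in combo:
                acc ^= p
            if acc == target:
                return k - 1

def greedy_bp(n_inputs, targets):
    base = [1 << i for i in range(n_inputs)]   # input signals x0..x_{n-1} as bitmasks
    gates = []                                 # each gate: (operand_a, operand_b, result)
    while any(t not in base for t in targets):
        best = None
        for i in range(len(base)):             # try all pairs of known points
            for j in range(i + 1, len(base)):
                cand = base[i] ^ base[j]
                if cand in base:
                    continue
                # total distance if we were to add this candidate
                score = sum(dist(base + [cand], t) for t in targets)
                if best is None or score < best[0]:
                    best = (score, cand, base[i], base[j])
        _, cand, a, b = best                   # keep the candidate with lowest distance
        gates.append((a, b, cand))
        base.append(cand)
    return gates

# Targets as bitmasks: y0 = x0^x1^x2 (0b111), y1 = x1^x2 (0b110)
circuit = greedy_bp(3, [0b111, 0b110])
print(len(circuit))  # -> 2 XOR gates (x1^x2 is shared)
```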
We combine points when we construct a candidate by combining their truth tables through the gate functionality. So it is a very raw way of approaching this. The target points, again, are the truth tables of the signals that we want to achieve. Now, since this is going to be very, very expensive computationally, we can only do this for circuits with four or five input signals. But the number of output signals is not limited; it has no bearing on the cost, actually. We used this algorithm to derive a smaller inversion circuit over the field GF(2^4), and this is, I think, the main result of the paper, the most interesting of the circuit minimization techniques. On top of this Boyar-Peralta algorithm we also implemented a stack-algorithm approach with a search tree, where we do not just save one candidate per step: for each node we enter, we derive 20 to 50 different children and branch out. At each stage we are tracking about 400 different children, and we also have a depth bound there; as we reach it, we trace back, see which node has the best metric, save that one, and keep the tree grown from that node. The whole idea is to keep leaves from as many branches as possible, because it seems that at each stage you cannot just take the candidate with the best metric; sometimes you get a better circuit in the end if at some point you accept a slightly worse metric. This algorithm works really, really nicely, and we have some really good results there. Now, for the new architecture with lower depth: if you look at the picture there, the top part is the classical architecture with the top linear and the bottom linear matrix. If you look at this bottom linear matrix, it only depends on the multiplication of the 4-bit signal Y with some linear combination of the input signal U.
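The truth-table representation of points can be illustrated concretely. In this sketch (my own encoding, assuming 4 inputs as in the GF(2^4) inversion), a point is the full 16-row truth table packed into an integer bitmask, and combining two points through a gate is just a bitwise operation on their tables:

```python
N = 4                   # number of circuit inputs
SIZE = 1 << N           # 16 rows in each truth table
FULL = (1 << SIZE) - 1  # all-ones table, for complementing

# Truth tables of the input variables x0..x3: bit `row` of table i is
# set iff input i is 1 in input pattern `row`.
inputs = []
for i in range(N):
    t = 0
    for row in range(SIZE):
        if (row >> i) & 1:
            t |= 1 << row
    inputs.append(t)

# Gates act directly on packed truth tables
def gate_xor(a, b):    return a ^ b
def gate_and(a, b):    return a & b
def gate_nand(a, b):   return FULL & ~(a & b)
def gate_mux(s, a, b): return (s & b) | (FULL & ~s & a)  # b if s else a

x0, x1, x2, x3 = inputs
# Example candidate point: NAND(x0, x1 XOR x2), built purely from tables
t = gate_nand(x0, gate_xor(x1, x2))
```

A target point is then just another such integer, and "distance zero" means a constructed table equals the target table exactly.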
So basically, we can write the result as R = Y0 plus a matrix M times the input vector U, where M is an 8x8 matrix whose entries get scaled (ANDed) with the nonlinear Y bits. If you look at this, then you can say: why don't we calculate M in parallel, in the top matrix? Then we only need 56 gates at the end, at very low depth, to implement the equation that you see there. That architecture we call architecture D, for depth, and it makes your critical path shrink very much, as you will see in the results; this gave very, very fast circuits. I also mentioned the new circuit for the inversion. Here is the equation that Boyar and Peralta used in their original paper; they found a circuit of 17 gates at depth 4, using only two kinds of basic gates, AND and XOR. We applied this extended Boyar-Peralta algorithm for general nonlinear circuits, and we achieved an inversion circuit with 9 gates at depth 3. You see that we use the multiplexer gate in a slightly unconventional way here, but it works; we get good results from it. And if you don't like it, we also have a conventional circuit of 15 gates and depth 3 in the paper. The last improvement is what we call the additional transformation matrices. If we look at the S-box and exclude the final constant of the affine transformation, we can write the S-box as the inversion followed by the linear part of the affine transformation. Now, in any field of characteristic two, squaring, square root, and multiplication by a constant are all linear functions. So we can actually rewrite the S-box as done below here, using some alpha and some beta.
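A hedged sketch of how I read the architecture-D bottom stage: the four per-bit linear vectors L_i = M_i·U are precomputed in parallel with the inversion, and the final output is assembled as r_j = XOR over i of (y_i AND L_i[j]), which costs 32 AND-type gates plus 24 XORs, i.e. 56 gates at very low depth. The matrices behind L here are placeholders, not the real AES constants:

```python
def assemble_output(y, L):
    """Assemble the 8-bit S-box output from the nonlinear bits.

    y: the 4 nonlinear bits from the GF(2^4) inversion.
    L: four 8-bit vectors, each already computed as a linear function
       of the input U in the top matrix (placeholder values here).
    """
    r = [0] * 8
    for j in range(8):          # 8 output bits
        for i in range(4):      # 4 nonlinear bits
            r[j] ^= y[i] & L[i][j]  # 32 ANDs total, XOR-accumulated (24 XORs)
    return r
```

The point of the restructuring is that each output bit is a depth-2 tree (one AND level, two XOR levels) after the inversion, instead of going through a full bottom matrix.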
And then we can explore these different values of alpha and beta to see what kind of transformation matrix we should use for the top and bottom matrices to get good results. For the forward-only S-box we tried all of them; for the combined S-box we first ran a heuristic algorithm and then the full generalized multiplexer algorithm to try to get the absolute best. There was recently a similar proposal, but they only consider the multiplication in this step. Looking at the results to the right, you see the synthesis graphs: the x-axis is the clock period and the y-axis is the area, and you want to be as far into the lower-left corner as possible. You see that our results are significantly better. The smallest previous implementation was 130 gates; we are down to 102. And the critical path goes from 15, this year's recent result, down to 12 using this new architecture. For the combined S-box, we have gone from 149 down to 127, and the critical path from 17 down to 14, which is about a 20% improvement. Alexander has also applied this to the AES MixColumns circuit, and you can see a recent result of his on ePrint, getting down to 92 XORs at depth 6 for that circuit, using the same kind of algorithms that we have presented in this paper. So thank you very much for your attention. Are there any questions?

Okay, maybe a very quick question.

Hi, I was wondering: you mentioned on a slide that you reduce the depth of the inversion in the GF(2^4) field. And on another slide you were talking about the inversion in the full S-box, in GF(2^8), and you said that you reduced that to 32 NAND gates. I was wondering if you also knew the AND depth for that minimized S-box, because I think right now the best is six or something like that.
I'm sorry, I didn't quite catch that.

For the 32 NAND gate inversion.

You mean this one?

Yes.

Oh, that's just how you collect the results to implement this equation R here.

So the 32 NANDs and 24 XORs, is that the size of the full S-box?

No, no, no, that's just for the assembling of these results in our R equation there.

Oh, okay. I'm sorry, I was confusing you. Thank you.

All right. So let's thank the speaker again.