So, the final talk of this session on hardware implementation is from Lauren De Meyer, Amir Moradi and Felix Wegener, and Felix is giving the talk. It's called Spin Me Right Round: Rotational Symmetry for FPGA-Specific AES. Yeah, thank you for the kind introduction. So FPGA-specific, what does that mean? Let's start with mapping a Boolean function to an ASIC. What you would usually consider is what it looks like in terms of gates; the gates have a specific function and they have very few inputs. Now, for FPGA devices, it's a completely different story. What you have here is lookup tables; specifically in the Xilinx 6- and 7-series devices we looked at, you have LUT6 elements. That means an arbitrary Boolean function that depends on 6 input bits and has one output bit can be represented in just one LUT. So if you try to do area optimization on that, of course you have to think about very different things. You have to break down your function into such components, and it matters a great deal how many input bits something depends on. It does not matter that much what the algebraic degree is, for example, or the multiplicative complexity of something. And you should of course not waste the inputs you have; you should most definitely always use all 6 input bits if you can, to hide a complex function in a lookup table. Moving to more complex blocks: an FPGA is further structured into slices, and a slice contains 4 LUT6 elements and some additional multiplexers. Also some registers, though I chose not to illustrate them here, because I want to get one very crucial point across on this slide. And this is that you can think of one slice as a bigger lookup table that depends on 8 bits. So basically just keep in mind for the rest of the talk: with only one slice, you can realize an arbitrary Boolean function that depends on 8 input bits and has one output bit. So enough of the basics, now let's apply that. We are looking at the AES S-box.
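As a rough software analogy of the point above (any 6-input, 1-output Boolean function fits in a single LUT6, regardless of its degree), here is a minimal Python sketch; the function `example` is a made-up illustration, not anything from the talk:

```python
def make_lut6(f):
    """Precompute the 64-entry truth table of a 6-bit -> 1-bit function.
    This is exactly what a LUT6 stores: evaluation is a single lookup,
    no matter how complex the function's algebraic form is."""
    return [f(x) & 1 for x in range(64)]

def example(x):
    # Hypothetical degree-6 function: AND of all six inputs, XOR their parity
    bits = [(x >> i) & 1 for i in range(6)]
    prod = bits[0] & bits[1] & bits[2] & bits[3] & bits[4] & bits[5]
    return prod ^ (sum(bits) & 1)

lut = make_lut6(example)
```

The cost is the same 64 stored bits whether the function is linear or has degree 6, which is why input count, not algebraic degree, drives FPGA area.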
So to recall, as you already saw in a previous talk, we're thinking about the inversion in a finite field here. We can also represent it as a power map with the exponent 254. Next, there's also an affine function, but please just disregard that for now, because we can easily integrate it somewhere in our hardware structure; I will show you where exactly we do that. So let's just think of the power map itself as the AES S-box. Now we want to realize that in an FPGA device. How are we going to do that? The naive idea would be to just put each of the 8 output components (because we map 8 bits to 8 bits) in a separate slice. Of course, that is rather naive; you can directly implement it in a few minutes. Maybe there's a more elaborate approach. If you look at the algebraic degree, it's very, very high, so it's not really clear how to do better. If you look at approaches we already know, for example the Canright S-box, which represents the inversion in a tower field based on a normal basis, they don't really suit the structure of an FPGA, for the reason I told you on the first slide: you just waste input bits. In a LUT6, you should have a function depending on 6 bits, but in a tower field you have functions depending on 4 bits or on 2 bits, depending on how deep you go in the tower field. So none of that really works, and so far most implementations just use the naive one. Now our contribution in this talk is how to reduce that by 50%, from 8 slices to only 4 slices. How are we going to do that? We need some help from the theory perspective. So one piece of notation first. We consider the inversion represented as a power map. We will have a conversion function to the normal basis; the normal basis is just a different basis, similar to the polynomial basis but with different kinds of properties. And we need the rotation of bits; you can think of the simple rotation in a register, just a usual rotate.
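To make the power-map view concrete, here is a small Python check (my own sketch, using the standard AES reduction polynomial x^8 + x^4 + x^3 + x + 1) that x^254 really is the field inversion, with 0 mapping to 0:

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a = ((a << 1) ^ (0x1B if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return p

def gf_pow(a, e):
    """Square-and-multiply exponentiation in GF(2^8)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

# x * x^254 = x^255 = 1 for every nonzero x, so x^254 is the inverse;
# the power map conveniently also sends 0 to 0, as the S-box requires.
```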
And with those terms, we can borrow a theorem from Rijmen et al. stating that if you have a power map and you represent it in the normal basis, calling this new function s, then this function has a certain property, called rotational symmetry. It means that if you rotate the input bits, the output bits are also rotated. I will now show you how to use that for hardware implementation. Let's first illustrate it a bit further. Consider this s*, which realizes just one component of the S-box we target. We have all the input bits. Now we rotate the input bits by one position and get the next output bit; by another position, and get the next output bit. We can do that all the way to get all the output bits. We gained quite a lot from this, because we only realized the LSB of this S-box as a circuit, and not all eight coordinates separately. So now let's look at the real circuit, how to realize that in hardware. Our AES S-box proposal takes a byte. First, it converts it from the standard polynomial basis to the normal basis. Then it sits in a rotational register, and the function s* is evaluated on it in each cycle. The result is always shifted into a second register, where it is buffered until it is complete, and then it is converted back from the normal basis to the polynomial basis. And if by now you're still worried about where the affine function is: it can be hidden here in the N2P conversion. That is a linear function by itself, so you can easily integrate the affine part of the AES S-box into it. And we actually found a normal basis that lets us realize this whole construction in only eight cycles and only four slices. So this is only an S-box. What we now want to do is, of course, build the full AES and show that this is actually worthwhile to use. We look at the former record by Sasdrich et al.
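The theorem can be checked exhaustively in software. The sketch below is my own construction (the talk's actual choice of normal basis is different and optimized for area): it brute-forces a normal element theta of GF(2^8), builds the basis {theta, theta^2, ..., theta^128}, and verifies that rotating the normal-basis coordinates of the input rotates the coordinates of the output of x -> x^254. The key fact is that squaring is a coordinate rotation in a normal basis and commutes with any power map.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a = ((a << 1) ^ (0x1B if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return p

def gf_pow(a, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def rank(vectors):
    """Rank of a set of 8-bit vectors over GF(2), by Gaussian elimination."""
    vs, r = list(vectors), 0
    for bit in range(7, -1, -1):
        piv = next((v for v in vs if (v >> bit) & 1), None)
        if piv is None:
            continue
        vs.remove(piv)
        vs = [v ^ piv if (v >> bit) & 1 else v for v in vs]
        r += 1
    return r

# Brute-force a normal element: theta such that {theta^(2^i)} is a basis.
theta = next(t for t in range(2, 256)
             if rank(gf_pow(t, 1 << i) for i in range(8)) == 8)
basis = [gf_pow(theta, 1 << i) for i in range(8)]

def n2p(c):
    """Normal-basis coordinates (bit i = coeff of theta^(2^i)) -> poly basis."""
    x = 0
    for i in range(8):
        if (c >> i) & 1:
            x ^= basis[i]
    return x

p2n = {n2p(c): c for c in range(256)}  # n2p is linear and bijective

def rotl(c, k):
    return ((c << k) | (c >> (8 - k))) & 0xFF

def s(c):
    """x -> x^254, expressed entirely in normal-basis coordinates."""
    return p2n[gf_pow(n2p(c), 254)]
```

The commuting argument in one line: the input rotated by k positions represents x^(2^k), and (x^(2^k))^254 = (x^254)^(2^k), which is exactly the output rotated by k positions.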
They managed to implement AES on a Xilinx Spartan-6 in only 21 slices. They used some interesting tricks. For example, to represent the key and the state, they used distributed RAM. They used a very unconventional way to compute MixColumns so they could reduce the memory footprint. This is all really good, and there is a huge reduction in the number of slices. The only problem is that they use the naive S-box, which is good for us in the sense that we can just replace it with our S-box, and the reduction I just showed translates to the whole design. So due to the reduced S-box, we get a total design that is minimal in the number of slices, with only 17 slices. OK, next I want to present another design, because maybe it is also worthwhile not to start with an FPGA design, but to start with the smallest ASIC design, port it to FPGAs, and then apply our rotational symmetry. As you may remember from last year's CHES, this is the bit-sliding design presented there. What we're now going to do is look at its components, and I will tell you how to optimize them for FPGA implementations. The first thing to look at is the S-box. It uses only a 1-bit data path, whereas we use an 8-bit data path, so we have to slightly adapt it. OK, so our fully bit-serial S-box is only slightly different. The main difference is that the input x is shifted in bit-serially and is first buffered in a register. After all the bits are there, the transformation to the normal basis happens and is written back into the register; then, as you already know, s* is evaluated on it cycle-wise, written to the other register, and converted back. Unsurprisingly, this of course has a higher latency: we need 16 cycles here, because we need 8 extra cycles to shift the 8-bit operand in. But again, here we found a normal basis that is very suitable for area minimization, and with that, we can actually realize it in four slices.
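The cycle-by-cycle behavior of the rotational datapath can be sketched as follows. Note that the `s_star_toy` used here is a toy rotation-symmetric function I made up purely for illustration; the real s* is the LSB of the inversion in the chosen normal basis, and the real bit-serial design additionally spends 8 cycles shifting the operand in.

```python
def rotr8(x, k=1):
    """Rotate an 8-bit value right by k positions."""
    return ((x >> k) | (x << (8 - k))) & 0xFF

def sbox_rotational(x_normal, s_star):
    """Evaluate a rotation-symmetric S-box with a single 8-to-1 function:
    each cycle, s_star produces one output bit, which is shifted into the
    output register, and the input register rotates by one position."""
    reg, out = x_normal, 0
    for _ in range(8):
        out = (out >> 1) | (s_star(reg) << 7)  # shift next output bit in
        reg = rotr8(reg)                       # rotate the input register
    return out

# Toy rotation-symmetric S-box: output bit i = (x_i AND x_{i+1}) XOR x_{i+2},
# indices mod 8; its LSB component only needs bits 0..2 of the rotated input.
def s_star_toy(x):
    return ((x & (x >> 1)) ^ (x >> 2)) & 1
```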
Now, the other parts of the design are the state and MixColumns. Most interesting is the state, which is here separated into 8-bit chunks using scan flip-flops. We do not have those on FPGA devices, so we have to solve that differently. What we chose to do instead is represent the state in 32-bit registers, but not naively: we used a lookup table as a 32-bit shift register. This allows us to realize it in a really minimal space of only one slice, but the latency is of course increased, because now ShiftRows has to actually shift through the entire 32 bits. The other thing is the MixColumns operation, which, due to some optimizations we did, also fits very neatly into only six lookup tables and four flip-flops. Now, if you take everything I just said together, you get an optimized design which is also area-optimal, but here not in terms of slices, rather in terms of lookup tables. So this is the first comparison in this talk. The first row shows the former record by Sasdrich et al. On the second row you can see our design optimized for the number of slices: 17 slices. And on row three, our design with the minimal number of lookup tables. In terms of clock cycles, of course, the latency is increased. But this is not exactly about an area-latency trade-off; it is just about providing a new area record for FPGA devices. Now let's go further and talk about side-channel protection, because it's nice to have a small unprotected design, but it would be even nicer to have a really small protected design. We are going to continue from the last design I presented, the bit-sliding design. And we need some more mathematics here. First of all, you still remember that we considered a power function. We're now going to decompose this power function into two separate power functions, with exponents 26 and 49. This was first suggested by Moradi.
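The state-storage trick, configuring one LUT as a 32-bit shift register (on Xilinx parts this primitive is the SRLC32E), behaves like a 32-stage FIFO with no random access, which is exactly why ShiftRows must clock through all 32 bits. A minimal behavioral model, as a sketch:

```python
class SRL32:
    """Behavioral model of a LUT used as a 32-bit shift register:
    one bit in per cycle, the 32-cycle-old bit out, no random access."""
    def __init__(self):
        self.bits = [0] * 32

    def shift(self, bit_in):
        bit_out = self.bits[-1]
        self.bits = [bit_in & 1] + self.bits[:-1]
        return bit_out

# Moving a 32-bit word through end to end takes 32 shift cycles,
# which is where the extra ShiftRows latency comes from.
```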
Because we still have power functions (this is not just some arbitrary decomposition into bijective functions; it's very important to still harness the power-map properties), we can still apply the theorem by Rijmen et al. you saw before, applying it individually to both parts of this decomposition. This allows us to mask functions of lower degree, because masking the inversion in one step, with algebraic degree 7, is very, very difficult: you would need a lot of output shares. But masking two cubic functions requires far fewer output shares. So if we are going for first-order security here, and you know the consolidated masking scheme, then you know that we need two input shares. The minimal number of output shares we can hope for is 8, but it is not really guaranteed, and there's also not really a constructive way to find such a sharing. There are some suggestions, but what we did is develop a completely new heuristic to do that. Very important in our heuristic is that we actually separate the function: instead of sharing one cubic function in one go, we split it into parts, and for each part it is guaranteed that the minimal bound holds, so that each of those parts can be shared with the minimum number of 8 output shares. If you want to know how it works, please look in the paper, because I don't think I will find the time to explain it here. But let's look at the end result this heuristic produces. This is a circuit for the function block f*, g*. What you can see here is that each of the functions, g and f, is split into three blocks, and in each block you can see basically the consolidated masking scheme to evaluate it. So you take in a shared representation, and you also take in three bits of randomness each, making 18 bits of randomness in total.
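The decomposition and the degree claim are easy to sanity-check in software. In this sketch of mine, exponent arithmetic works modulo 255 because x^255 = 1 for nonzero x, and the algebraic degree of a power map x^d equals the Hamming weight of d:

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a = ((a << 1) ^ (0x1B if a & 0x80 else 0)) & 0xFF
        b >>= 1
    return p

def gf_pow(a, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

# x^254 = (x^26)^49, since 26 * 49 = 1274 = 4 * 255 + 254,
# and both factors are cubic (Hamming weight 3) versus degree 7 for x^254.
```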
Then, to prevent glitches, you store each 8-bit output in a register, and then you compress it back to two shares. This has quite nice area properties. One thing I really want you to notice here is the number of input bits. You can see that 16 input bits are provided at the start, because there are two shares of 8 bits each, but we only need 14 in each component. This is a really crucial part of the heuristic that gives us a lot of area reduction: we find a split such that each of the components does not depend on all variables. Of course, this is a huge area reduction if you think in terms of lookup tables, where area means how many variables something depends on. With that, we can realize the function in 144 lookup tables. And now we're actually going to integrate that into the full S-box. You already saw the bit-serial design earlier, so you already know we shift in the bits first, convert with the P2N function, and then have the correct representation in the upper register. The only difference now is that we're not evaluating just once: we evaluate f* on it over eight cycles, put the result in the lower register, write it back in parallel, and then in the next eight cycles we evaluate the function g* on it, so that the final result is stored in register two; then we convert it back and we have the actual end result. In total, this S-box has an area of 182 LUTs and a latency of 26 cycles. From a security perspective, I want you to know that it's really important here to protect against transitional leakage, because the f*, g* block gets the same input variables, just rotated. So it's very important to add a clearing register beforehand that clears on the negative clock edge, to prevent any kind of leakage due to transitions between two clock cycles.
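To give a flavor of what "shared evaluation, register, then compress" means without reproducing the paper's heuristic, here is a generic first-order, d+1-share (DOM-style) masked AND gate in Python. This is an illustrative textbook construction, not the sharing from the talk; the four cross products correspond to the values that must be held in registers before compression to stop glitch propagation.

```python
import random

def share(x):
    """Split a bit into 2 shares: x = s0 XOR s1."""
    m = random.getrandbits(1)
    return (x ^ m, m)

def masked_and(a, b, r):
    """First-order masked AND: 2 shares per input, 2 shares out, 1 fresh
    random bit r. In hardware, the four cross products below are stored
    in registers before the final XOR compression, blocking glitches."""
    a0, a1 = a
    b0, b1 = b
    t00 = a0 & b0        # 'inner' product, needs no refresh
    t01 = (a0 & b1) ^ r  # cross products are refreshed with r
    t10 = (a1 & b0) ^ r
    t11 = a1 & b1
    return (t00 ^ t01, t10 ^ t11)  # compress back to 2 shares

def unshare(s):
    return s[0] ^ s[1]
```

Correctness follows because the two r terms cancel under XOR and the four products expand (a0 ^ a1)(b0 ^ b1) = ab.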
Okay, if we put that into the full bit-sliding design with two shares, we of course want to evaluate it, and we did moments-correlating DPA; I refer to the talk yesterday given by Thomas if you want to know why we did a moments-correlating DPA instead of a t-test. We implemented it on a SAKURA-G board, and you can first see the evaluation that our setup is sound: if we don't use randomness, it leaks with only a few traces, as it should. And if we then show you the whole evaluation, you see that there's no exploitable leakage in the first order with 10 million traces. But of course, in the second order, as this is only a first-order secure design, you can see leakage. We briefly compare that to other state-of-the-art first-order secure implementations, but I have to say that of course they did not target any FPGA-specific improvements; they are only mapped to FPGAs starting from an ASIC design. We can see that we have the smallest area regardless of the area metric you want to consider, whether in terms of lookup tables, flip-flops or slices, and we use a competitive number of random bits per S-box evaluation. The only thing that is of course far higher here is the number of cycles needed for evaluation. But again, we focused on area in this paper. So to summarize briefly: I presented three designs. Two of them are area records for unprotected designs specific to FPGAs. The other one is the smallest first-order secure AES design on FPGA devices. There are more contributions we made, for example latency optimizations of the original design by Sasdrich et al. And we also provided a new heuristic for finding a consolidated masking scheme, or sharing, with d+1 shares; you can find them in the paper. Thank you very much. Thank you for the talk. We have time for plenty of questions. If there are none, I'll take mine. So I have a question. You are using a normal basis.
So did you consider fault tolerance, because it probably comes with some inherent fault tolerance? Did you consider protections against fault attacks? Not yet, but that seems to be an interesting idea for follow-up work. Thank you. So, for that decomposition, are you aware of any decomposition that works with only quadratic permutations rather than cubic permutations? Do you mean for power functions? Yes, yes. Is it possible? It's provably not possible. So it's impossible? Yes. Okay, thank you. Just to clarify, that statement only holds for power maps. In general, it might be possible to find bijections that are quadratic, but again, they will not be power maps. Other questions? I have a final question then. Are you planning to also do higher-order protection; what's the future work for this? I mean, the general scheme we provide is applicable to d+1 shares, so you can easily use our heuristic for higher-order countermeasures as well. More specifically about this design, you could also try to address the higher orders with some noise generation, and there are known methods for that. All right, let's thank all three speakers of the session again and enjoy the coffee break. Thank you.