Hi everyone, I'm Alexandre Adomnicai, and in this video I will present our paper "Fixslicing AES-like Ciphers". This is joint work with Thomas Peyrin, from back when I was working at NTU Singapore.

As you probably guessed, our work is about AES implementations. Nowadays AES runs on a wide range of platforms, from resource-constrained devices to high-end servers, and many embedded devices do not enjoy hardware AES engines, relying on software implementations instead. We know that AES can be implemented efficiently in software using precomputed lookup tables, the so-called T-table implementation, but we also know that this implies key-dependent memory accesses, which can lead to cache-timing attacks. One can prevent such attacks by using a constant-time implementation variant, like bitslicing for instance, and that is what our work is about: it improves bitsliced AES implementations for processors that do not have vector or SIMD instructions. Our paper focuses on 32-bit embedded platforms, especially ARM Cortex-M and 32-bit RISC-V processors.

If we look at the results previously reported in the literature for those platforms, we see that AES-128 runs at around a hundred cycles per byte. And if we take a deeper look at how the cycles are spent within the cipher, we see that many of them go to the ShiftRows operation, which is somewhat counter-intuitive, because ShiftRows is maybe the simplest operation within AES: it is just a byte reordering within the state. So let's first try to understand why it is so costly in a bitsliced setting.

Here is the 16-byte AES internal state, but because we are interested in bitslicing we will consider the 128-bit view. If we consider a 16-bit slice, reordered in a row-wise manner so that it can be stored in a 16-bit register, and we try to apply ShiftRows, then for the first row it is trivial, we have nothing to do; but for the second row, we see that it translates to a nibble-wise rotation, and it will be
the same for the two remaining rows, but with different rotation indices.

Now, if we go back to the 32-bit architecture, we see that we still have 16 bits available within our registers, so we can process a second block for free. This can actually be useful for modes of operation that allow parallelization. It can also be useful if we want to integrate countermeasures against fault attacks, because it provides redundant computation for free. Anyway, in the end, this is what we have to compute to apply ShiftRows: we now have byte-wise rotations instead of nibble-wise rotations. The previous work did it that way, so it has to deal with a lot of bit masks and bitwise operations. Actually, this can be slightly improved, as highlighted by Peter Dettman when we first published this paper: one can use the SWAPMOVE technique, which is a little more efficient. But what we have to keep in mind is that we have to compute these operations for each slice independently, so we have to compute this eight times per round. The ShiftRows therefore remains costly, and the goal of our paper is to investigate how we can mitigate this cost to improve bitsliced AES performance.

First, we investigated whether it would be interesting to have another representation, another way to pack the bits within the registers. Consider a quarter of a slice, for eight blocks now: we need one 32-bit register to store a quarter of a slice, and thus four 32-bit registers to store a single slice. But what is nice now is that ShiftRows can be computed using 32-bit rotations instead of byte-wise rotations. That is why we call it the barrel-shiftrows representation: on ARM processors, for instance, 32-bit rotations come for free.
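As an aside, here is a minimal C sketch of the SWAPMOVE technique just mentioned: it exchanges the bit groups selected by a mask between two words, at a given offset, in only four operations. The mask and offset below are only an illustration, not the exact constants used in the AES implementation.

```c
#include <stdint.h>

/* SWAPMOVE: swap the bits of 'b' selected by 'mask' with the bits of 'a'
 * sitting 'n' positions to their left. Four operations, versus the naive
 * mask/shift/or sequence. */
static void swapmove(uint32_t *a, uint32_t *b, uint32_t mask, int n) {
    uint32_t t = (*b ^ (*a >> n)) & mask;
    *b ^= t;
    *a ^= t << n;
}
```

For example, `swapmove(&a, &b, 0x0000FFFF, 16)` exchanges the high halfword of `a` with the low halfword of `b`.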
Thanks to the barrel shifter, using this representation on ARM processors, we should be able to pay nothing for ShiftRows. In the end we have three 32-bit rotations per slice per round, which means the ShiftRows operation requires 24 32-bit rotations per round.

On the downside, it requires processing eight blocks in parallel, which can be quite inappropriate for embedded devices that usually deal with small payloads. It also requires 32 32-bit registers to store the 1024-bit internal state, which can be troublesome because on ARM processors only 14 such registers are available, so we would need a lot of loads and stores to the stack, and that clearly has an impact on performance. Last but not least, it also increases RAM consumption by a factor of 4 to store the round keys, because for the AddRoundKey operation we no longer have eight registers to consider, but 32 instead.

So although the barrel-shiftrows representation comes with a lot of drawbacks, we considered it for our benchmarks; I will present the results at the end of this talk. But we also investigated whether we could find another optimization path. To do so, we had a look at the fixslicing implementation strategy, which we initially introduced as a new representation for the GIFT block cipher, where it allowed us to boost performance on 32-bit platforms. The idea of fixslicing is to fix one slice so that it never moves, and to adjust the order of the remaining operations accordingly. In the case of GIFT we only have to care about the S-box, but we thought it could also be of interest for other ciphers, because a lot of them need to move bits around within the registers at some point.
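To illustrate the barrel-shiftrows counting above, here is a hedged C sketch of ShiftRows on one bit-plane. The layout (register `r[j]` packing row `j` of one bit-plane for eight blocks, four bits per block) and the rotation direction depend on the chosen bit ordering, so take this as illustrative rather than as our exact code; the point is that the whole layer reduces to three rotations per slice, which an ARM barrel shifter can fold into other instructions.

```c
#include <stdint.h>

/* Rotate right by n bits, n in 1..31. On ARM this folds into the barrel
 * shifter of a data-processing instruction, i.e. it costs nothing extra. */
static inline uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

/* ShiftRows on one bit-plane in a barrel-shiftrows-style layout:
 * shifting row j by j columns becomes a rotation by 8*j bits.
 * Row 0 is untouched, hence 3 rotations per slice, 24 per round. */
static void barrel_shiftrows_slice(uint32_t r[4]) {
    r[1] = rotr32(r[1], 8);
    r[2] = rotr32(r[2], 16);
    r[3] = rotr32(r[3], 24);
}
```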
So coming up with an alternative representation can help boost performance. What about AES, then? In the case of AES, ShiftRows moves the bits within the registers, and it does so in the same way for all the registers, so we cannot fix a single slice: if we did, the bits would no longer be properly aligned for the S-box operation. It means that fixslicing AES means fixing all the slices, which means that, in the end, we simply omit the ShiftRows operation. This is not an issue for the S-box layer, as I just mentioned, because the bits within the bytes remain aligned. On the other hand, we need to adapt the MixColumns operation. Also note that we resynchronize with the classical representation every four rounds, since four applications of ShiftRows lead back to the original positions.

To understand our adjustment of MixColumns, I will briefly recall how it can be computed efficiently in a bitsliced manner; this was introduced a decade ago by Käsper and Schwabe. Here is the AES MixColumns operation. What we can remark is that each byte within the column is multiplied by two, added to its adjacent byte multiplied by three, and added to the two remaining bytes, and it is the same for every byte within the column. The multiplication by two corresponds to a left shift plus a conditional XOR with 0x1B, depending on the value of the discarded bit, and the multiplication by three can simply be seen as a multiplication by two followed by an XOR.

Let's see how this translates to the bitsliced setting. For the multiplication by two, we do not have to compute the shift, because the corresponding bit is simply in another register.
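To make these multiplications concrete, here is a small C sketch of the byte-level doubling (the classic xtime from the AES specification) and of its bitsliced counterpart, where the shift becomes a register renaming and the conditional reduction becomes unconditional XORs of the MSB plane. This is a sketch of the principle, not our exact code.

```c
#include <stdint.h>

/* Multiplication by 2 in AES's field GF(2^8): shift left, and if the
 * discarded top bit was set, reduce by XORing 0x1B. Branch-free. */
static uint8_t xtime(uint8_t x) {
    return (uint8_t)((x << 1) ^ ((x >> 7) * 0x1B));
}

/* Multiplication by 3 = multiplication by 2 followed by an XOR. */
static uint8_t xtimes3(uint8_t x) {
    return (uint8_t)(xtime(x) ^ x);
}

/* The same doubling, bitsliced: b[i] holds bit plane i of the bytes
 * (b[0] = LSBs, b[7] = MSBs). The left shift is just a renaming of the
 * planes, and since 0x1B has bits 0, 1, 3 and 4 set, the conditional
 * reduction becomes XORing the MSB plane into planes 0, 1, 3 and 4.
 * No data-dependent branch or memory access, hence constant time. */
static void bitsliced_double(const uint32_t b[8], uint32_t d[8]) {
    d[0] = b[7];
    d[1] = b[0] ^ b[7];
    d[2] = b[1];
    d[3] = b[2] ^ b[7];
    d[4] = b[3] ^ b[7];
    d[5] = b[4];
    d[6] = b[5];
    d[7] = b[6];
}
```

As a sanity check, xtime(0x57) = 0xAE and xtime(0x80) = 0x1B, matching the worked example in the AES specification.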
So we just have to consider the proper register, and what is nice is that, for the conditional XOR, we can just add the register that contains all the most significant bits at the proper places, and that's it: we have our multiplication by two. For the multiplication by three of the adjacent byte, we can see that we get the adjacent byte by a rotation to the right by eight; then, as for the multiplication by two, we consider the right register instead of performing a left shift. For the remaining bytes we do the same, we just have to compute rotations to the right, and that's it: we have our MixColumns in the bitsliced setting. Actually, it can be factorized, so common terms appear, and it is quite efficient: it can be computed with 27 XORs and 16 rotations in total.

Now let's see how this combines with our fixsliced representation. In the remaining slides I will only consider the first register, for the sake of simplicity, but the same applies to all the other ones. Normally, this is what we should get, but in our case, because we did not compute ShiftRows, we have to compute the byte-wise rotations for each register during MixColumns. So at first glance it is not very clear what the interest of doing this is.
It does not seem to clearly improve performance, since we are computing these byte-wise rotations anyway. But the thing is that we can do some factorization: instead of computing a byte-wise rotation with a different index for each register, this can be transformed into this, and the interest is that common terms appear, which gives us something more efficient than the classical representation. It actually saves 56 logical operations and 16 logical shifts. It is even more significant in the following rounds, because in the second round the ShiftRows is now simpler to compute: the byte-wise rotations use the same indices for all the registers, so compared to the previous round we actually save that byte-wise rotation. During the second round this saves 80 logical operations and 32 logical shifts compared to the classical representation. The third round is actually similar to the first one, which is obvious because the rotation indices are simply swapped between this row and this row: instead of a byte-wise rotation by six we have one by two, but otherwise it is exactly the same, so we get the same result. Last but not least, what is nice for the fourth round is that we are now synchronized with the classical representation, so there is nothing special to do at all: we just implement the classical bitsliced MixColumns. So, essentially, we save the entire ShiftRows cost every four rounds.

To sum it up, omitting ShiftRows speeds up the linear layer on 32-bit platforms. Also, compared to the barrel-shiftrows representation,
we only process two blocks at a time, which is more appropriate for embedded devices. On the other hand, it requires four different implementations of MixColumns, so it has a slight impact on code size. I also did not mention it, but it requires adapting the round keys accordingly, so it has a small impact on the key schedule as well.

What is interesting is that we can come up with several trade-offs. For instance, if code size really matters, we can define a semi-fixsliced representation where we compute ShiftRows every two rounds only. The idea behind this is that computing ShiftRows after two rounds is much more efficient, as we have just seen, because we have the same rotation indices for the two rows to consider. So we can come up with different trade-offs.

This table summarizes the number of operations required for the linear layer over four rounds. We tried to distinguish the logical operations from the logical shifts and rotations because, as I mentioned, the latter can be computed for free on ARM thanks to the barrel shifter, so on ARM the logical operations are what really matters. For the fully fixsliced implementation, we see almost a factor of two between the numbers of operations, so that is a significant improvement. Obviously it is a little less efficient in the semi-fixsliced variant, since that is a trade-off between code size and efficiency.

Here are the results of our benchmarks on an ARM Cortex-M4 processor. The previous result by Stoffelen and Schwabe runs at around 100 cycles per byte, while our fully fixsliced implementation runs at around 80 cycles per byte. Note that we have almost the same results for the fully fixsliced and the barrel-shiftrows representations.
Indeed, as expected, the barrel-shiftrows representation is not that efficient, because of all the loads and stores we have to do on the stack, since we lack registers to hold the entire internal state. Overall, with fixslicing we get roughly a 20% speed improvement over previous results.

Things are a little different on RV32I, the 32-bit RISC-V base architecture, where the barrel-shiftrows representation is clearly more interesting. This comes from the fact that on RV32I we have 32 32-bit registers, so there is less overhead due to loads and stores on the stack, and the barrel-shiftrows version is clearly the most efficient. Still, the fully fixsliced version is also very efficient, with approximately 30% improvement over previous work. I did not mention it, but as you can see, we have the same cycle count for the S-box, because we directly reuse the S-box implementation from Stoffelen and Schwabe; only the linear layer implementation changes here.

All right, to sum it up: fixslicing AES allows us to outperform previous bitsliced results by 21% and 30% on ARM Cortex-M4 and RV32I, respectively. In particular, the barrel-shiftrows representation fits the RV32I architecture well, and it could be significantly enhanced thanks to the Bitmanip extension, because we actually spend many cycles on the rotations; if we could do a rotation in a single cycle, it would clearly run faster. So that may be an interesting direction for further work.

Note that our work also directly improves masked AES implementations that are based on bitslicing; in our paper,
we also improve the performance of a first-order masked AES implementation; you can have a look if you are interested. It would also be interesting to assess the gain of fixslicing for higher-order masking schemes.

What is also interesting is that our technique applies to other AES-like ciphers. In our paper we also briefly mention an application to SKINNY-128, which led to improvements of up to a factor of four compared to the previous results reported in the literature for this cipher.

All our code is available on GitHub, so please feel free to have a look and to use it. Actually, it has already been integrated into two projects. In the RustCrypto packages, it brought a significant improvement, about 2.5 times faster; if I remember correctly, it is a 64-bit implementation, so fixslicing is not only of interest for 32-bit platforms, it can also be quite interesting for larger platforms, as long as we do not consider vector or SIMD instructions. It has also been integrated into pqm4, which is used to benchmark post-quantum cryptographic algorithms.

Here are the references I talked about during this presentation. Thank you for your attention, and please feel free to contact us with any questions or remarks at the following email addresses. Thanks, see you.