Okay, the next talk is titled "Efficient Side-Channel Protections of ARX Ciphers" by Bernhard Jungk, Richard Petri and Marc Stöttinger, and Richard is on the stage.

Yeah, thank you for the introduction. As said, we are talking about efficient side-channel protections of ARX ciphers. ARX ciphers, as a short introduction, consist of addition, rotation and exclusive or as their basic building blocks. The thing about these ciphers is that they are quite easily protected against timing side-channel attacks, but quite hard to protect against power or EM leakage attacks. The usual culprit is the modular addition, as seen in previously published attacks. Early work on this by Goubin suggested using Boolean masking to protect the rotation and the XOR, and then converting to arithmetic masking to secure the modular addition. This is quite costly: it has a cost of O(k), k being the bit width of the addition. This was later lowered to O(log k) with the suggestion to apply Boolean masking directly to an addition algorithm, implemented directly in software. Previous talks already mentioned threshold implementations, so I'm not going into too much detail on those. They were mainly of interest for hardware implementations, since they need three shares for first-order security, but recent developments have reduced this to two shares by introducing a register stage. We use threshold implementations to implement a masked addition algorithm, and we introduce some optimizations. The first one: we introduce masked versions of combined gates, or combined gadgets, as they are often called, which perform a shift and an AND, and optionally also an XOR, in one step. These combined gates allow us to use the flexible second operand of the ARM architecture, which performs an operation including a shift in a single instruction, usually in a single cycle.
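As a minimal sketch of the point made above, why rotation and XOR are essentially free under Boolean masking while modular addition is not, here is a small illustration (the function names are ours, not from the paper):

```c
#include <assert.h>
#include <stdint.h>

/* Rotate-left, the "R" in ARX (addition, rotation, XOR). */
static uint32_t rotl32(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32u - n));
}

/* First-order Boolean masking: x is represented by two shares
 * x0, x1 with x == x0 ^ x1. Rotation and XOR distribute over the
 * shares, so they can be computed share-wise without ever
 * recombining the secret; modular addition does not distribute,
 * which is why it needs a dedicated masked addition algorithm. */
uint32_t demo_masked_rot_xor(uint32_t x, uint32_t y, uint32_t m) {
    uint32_t x0 = x ^ m, x1 = m;          /* split x with mask m  */
    uint32_t z0 = rotl32(x0, 7) ^ y;      /* work on share 0 only */
    uint32_t z1 = rotl32(x1, 7);          /* work on share 1 only */
    return z0 ^ z1;                       /* unmask: rotl(x,7)^y  */
}
```

Whatever the mask m is, the recombined result is the same, which is what makes the linear ARX operations cheap to protect.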
And we introduce a method to reduce the number of necessary remasking steps, which reduces the amount of entropy required by the implementation. Not in this presentation, but we also introduce a simpler algorithm for modular subtraction, which is needed by some ciphers, like the Speck cipher in its decryption step.

As the addition algorithm we use the Kogge-Stone adder, here shown as a simplified circuit diagram. This adder is a parallel-prefix adder, which means it processes all bits in parallel, unlike a ripple-carry adder, which processes the bits sequentially. This is quite optimal for software, because each software instruction processes n bits in parallel. So we use this, and what I want to highlight are these blue gates, which combine a shift and an XOR, and which we will use to speed up the operation.

All right, first I'm going to introduce the AND-XOR gate. As shown before, the goal is to split the operation into multiple shares, in this case two shares, and this would be a quite straightforward approach to implement a secure AND gate. The problem is that one of the properties of threshold implementations is violated, namely uniformity. To regain uniformity we need to introduce a guard share m. This is where we stepped in to optimize a bit: we developed a method to reuse one of the inputs as a guard share. Because, as said, all bits are processed in parallel, we can use one of the inputs shifted to the left, and we essentially only need to shift in a single random bit. So the secure AND gate just needs a single fresh bit, and in the case of an AND combined with an XOR we don't need a refresh at all; we simply replace m with the shares of the XOR operand. We can then build a shifted AND-XOR gate, which is used quite heavily by the Kogge-Stone adder, and this lends itself well to ARM's flexible second operand, which can perform one of these calculations in a single instruction.
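For reference, the unmasked Kogge-Stone structure the talk builds on can be sketched as follows; this is a textbook parallel-prefix formulation, not the paper's masked version, and it shows why the shifted AND and shifted AND-XOR steps dominate:

```c
#include <assert.h>
#include <stdint.h>

/* Unmasked 32-bit Kogge-Stone addition. log2(32) = 5 parallel
 * prefix rounds, each built from a shifted AND-XOR (generate
 * update) and a shifted AND (propagate update). These are exactly
 * the gates the masked variant replaces with secured gadgets. */
uint32_t kogge_stone_add(uint32_t a, uint32_t b) {
    uint32_t g = a & b;                /* generate bits  */
    uint32_t p = a ^ b;                /* propagate bits */
    for (unsigned i = 1; i < 32; i <<= 1) {
        g = g ^ (p & (g << i));        /* shift, AND, XOR in one step */
        p = p & (p << i);              /* shift, AND                  */
    }
    return (a ^ b) ^ (g << 1);         /* sum = propagate ^ carries   */
}
```

On ARM, each `p & (g << i)` maps to a single instruction via the flexible second operand, which is the optimization opportunity described above.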
And of course we can also combine this with an XOR, so we don't need a refresh mask at all. Now, to protect the Kogge-Stone adder, this is the algorithm we use to perform the modular addition: we simply replace all the gates, the AND gate, the XOR gate, the special shift-AND-XOR gate and the plain shift-AND gate, with the secured variants, like this. What I want to highlight here is this part: as said, we reuse a single bit of one of the input shares, for example, before the addition, the first bit of one of the XOR shares, and in each iteration the first bit of one of the propagate shares, which we reuse in the next iteration. To show why this works, here is the simplified diagram again: during the iterations, the first bit is never used, so we can reuse it as a guard share. This allows us to implement a modular addition that requires just a single random bit, the u, to remask the entire addition.

Further optimization is possible: previous work introduced a special AND gate that doesn't require any remasking at all, and we can combine this with our approach of the combined AND-shift and AND-shift-XOR gates, which we call the secure-AND-optimized and secure-AND-shift-optimized gates. Comparing with previous work: of course we are a lot faster for the secure XOR gate and the secure shift gates, which we don't need at all. We are not much faster with our secure AND gate, but what is notable are the combined gates, the shift variants and the optimized versions, where we only need eight instructions in the ARM case, and we can lower this to even six instructions. Even more notable, previous work always requires a refresh mask for an AND gate, whereas we only need a single bit, and generating this single bit costs us only three instructions.
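A generic two-share masked AND of the kind discussed above can be sketched as follows. This is a Trichina-style gadget with a fresh guard mask m; the talk's optimization replaces that fresh mask with a shifted input share, which this sketch deliberately does not implement:

```c
#include <assert.h>
#include <stdint.h>

/* First-order masked AND over two Boolean shares.
 * Inputs:  x = x0 ^ x1, y = y0 ^ y1, m = guard mask.
 * Outputs: z0 ^ z1 == x & y, with z1 = m.
 * Each partial product is folded into the accumulator that
 * already carries m, so no intermediate ever equals x & y. */
void masked_and(uint32_t x0, uint32_t x1,
                uint32_t y0, uint32_t y1,
                uint32_t m, uint32_t *z0, uint32_t *z1) {
    uint32_t t = m;
    t ^= x0 & y0;   /* cross products of the input shares, */
    t ^= x0 & y1;   /* masked by m before any pair of      */
    t ^= x1 & y0;   /* shares is effectively recombined    */
    t ^= x1 & y1;
    *z0 = t;
    *z1 = m;
}
```

The order of the XORs matters in practice: m must enter the accumulator first, otherwise an unmasked partial result of x & y appears as an intermediate value.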
We also did an instruction-count measurement for the complete masked 32-bit modular addition on an ARM platform, and we were able to improve by almost 31 percent when combining this with previous work. Our subtraction algorithm, which I haven't covered here, is a lot faster still. To reiterate: our complete addition algorithm only needs a single random bit, and it also outputs a single random bit, so we can implement, for example, the ChaCha cipher such that each modular addition in ChaCha produces the next random bit for the next addition.

We measured the performance for the ChaCha cipher as well, and as you can see, the overhead of the masked implementation is quite large: with our reference implementation needing just below two thousand cycles, the overhead is quite immense. We compared this to previous results, of which we didn't find many, but ours is less, and if we use the optimized gates we can shave off a bit more. We see that masking the addition is the driving factor in the cost of protecting ChaCha. Just note that the cycle counts aren't quite comparable, because we measured on multiple platforms and the results varied a bit, we think due to differences in the memory architecture.

We also simulated our ChaCha implementation with the micro-architectural power simulator called MAPS, which we extended by 11 instructions. The simulator samples the Hamming distance for each register assignment, and performing a t-test on a fixed-versus-random setup with 100,000 noise-free traces does not show any leakage. But we think noise-amplification methods like shuffling should still be used, because it is just a simulator. That's it, thank you for listening.

Thank you. Any questions?

Okay, can you go back to one of the source code slides you showed, where you use, for instance, one of
this? Can you explain more about this secure AND-shift, or secure AND-shift-XOR? Are you following a particular concept here?

You mean, do we refer to threshold implementations for this?

Yeah. Can you explain more about what you do here, and what the difference is between this shift-XOR and shift-AND? I imagine the XOR is helping you because you don't need more fresh randomness.

We went into this with this part: this would be a basic implementation without the XOR and without our mask-reuse scheme, so we compute a cross product of the two shares for the AND gate. The threshold implementation work we are basing this on, which allows us to reduce this to two shares, requires that the shares cannot be recombined without registering them first. But in software implementations everything is registered anyway, so we don't need to add that. If we want to recombine the result later, we need to put in a refresh mask.

Now my question is: do you completely ignore the architectural issues of the microcontroller? Meaning, there is a pipeline, for instance, and you are not completely aware of the details of the pipeline, so consecutive instructions in the pipeline may have an effect on each other.

We know that the pipeline has an effect on this. We did measurements on a real platform, and we needed to add a few instructions to flush the pipeline in some cases.

Is that some kind of ad hoc approach, where you identified in which parts you have to introduce new instructions, or is there a systematic way to say: if you want to run this secure AND-XOR, then you have to put those extra instructions between two consecutive instructions?

This is just a baseline for a protected implementation. Any microarchitecture will introduce some leakage which we don't know about, for
example, pipeline leakages. So you need to add some instructions; for example, in the ARM case we added some XOR instructions which don't process any valuable data, which flush the pipeline and remove that additional leakage.

You know what I'm concerned about: if you look at the first two instructions on the left, if they are in the pipeline and they go over the same bus, the XOR between them will actually be leaking, right? For instance, the XOR between s0 and s2 in the middle may lead to direct leakage.

Yeah, it might, yeah.

Okay, good. But the simulation you have done doesn't consider such an issue; you simulated just the values which are processed?

Exactly. It just shows that the Hamming distance of the register values during assignment won't leak. The simulator also considers two pipeline registers, so we also show that these won't leak. But of course, the shift especially might leak due to glitches; we're not sure.

Thank you.

Okay, thank you. Any more questions? Okay, then let's thank the speaker again.
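As an aside, the Hamming-distance leakage model that the simulator discussion above refers to can be sketched in a few lines (this is the standard register-transition model, not MAPS's actual implementation):

```c
#include <assert.h>
#include <stdint.h>

/* Hamming-distance leakage model used by register-transition
 * simulators: the modelled power sample for a register update is
 * the number of bits that toggle between the old and new value. */
unsigned hamming_distance(uint32_t old_val, uint32_t new_val) {
    uint32_t toggled = old_val ^ new_val;
    unsigned hd = 0;
    while (toggled) {            /* count set bits (Kernighan's trick) */
        toggled &= toggled - 1;
        hd++;
    }
    return hd;
}
```

A t-test over many such modelled samples, fixed input versus random input, is then used to flag statistically detectable leakage.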