Hello everyone and welcome to my presentation about secure and efficient software masking on superscalar pipelined processors. My name is Barbara and this work is joint work by my colleagues and me from Graz University of Technology. First of all, in the setting of physical side-channel attacks, we have a specific device, for example a credit card, a SIM card, or a government ID, and this device holds a certain asset like a cryptographic key. On the other hand, we have an attacker who has physical access to this device. This means that the attacker can observe certain properties of the device. For example, the attacker might observe the power consumption of a microprocessor which executes cryptographic software. What can the attacker then do with this information? The power consumption of a CPU depends on two things: first, the instructions which are being executed by the CPU, and second, the data which is involved in these instructions, which in the case of an AES implementation might be the key processed by the implementation. In order to prevent power analysis attacks, we have to break the dependency between these things and the power consumption. This can be done by applying a countermeasure which is called masking. Masking is a secret-sharing technique where we split our sensitive value, so the value we want to protect, into multiple random shares. If the attacker can now observe up to d of these d+1 shares, it will not reveal any information about the sensitive value. Here I have one example. This could be the attacker observing the power consumption of a masked AES implementation where we have split our key into three parts, K1, K2 and K3. Then the power consumption at each point in time will only depend on one part of the key, but never on the unshared key itself. So masking is very nice, but it has several problems.
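To make the secret-sharing idea concrete, here is a minimal sketch of Boolean masking in Python. The function names are illustrative, not from the talk; the scheme itself (XOR-sharing into d+1 shares) is the standard one described above.

```python
import secrets

def mask(value: int, d: int, bits: int = 8) -> list[int]:
    """Split a sensitive value into d+1 random shares.

    The first d shares are uniformly random; the last one is chosen so
    that the XOR of all shares equals the original value. Any subset of
    up to d shares is uniformly distributed and reveals nothing.
    """
    shares = [secrets.randbits(bits) for _ in range(d)]
    last = value
    for s in shares:
        last ^= s
    return shares + [last]

def unmask(shares: list[int]) -> int:
    """Recombine all d+1 shares by XOR to recover the value."""
    value = 0
    for s in shares:
        value ^= s
    return value

# Example: split an AES-like key byte into three parts K1, K2, K3
# (second-order masking, d = 2).
key_byte = 0x2B
k1, k2, k3 = mask(key_byte, d=2)
assert (k1 ^ k2 ^ k3) == key_byte
```

Each share on its own, and even any pair of shares, is just uniform randomness; only the XOR of all three reconstructs the key byte.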
One of them is that a masked software implementation assumes that independent computations in the software result in independent leakage. Let me give you an example. We have a microprocessor which executes a certain sequence of instructions, and the assumption here would be that each instruction leads to independent leakage, so that it only causes leakage of the data which is processed by that instruction. This is unfortunately not the case on some microprocessors. How can we fix this? First, we can adapt our masked software to the microprocessor itself. We can do that if we know several things about our microarchitecture: if we know the leakage behavior of the microarchitecture, then we can fine-tune our software such that no leakage is caused during execution. Second, we can apply a lazy engineering approach. This means that if we do not know that much about our microarchitecture, we simply use a protection order which is higher than theoretically required, accept a certain leakage and a certain loss of protection orders, and simply apply a masking scheme with a higher order than actually required. These are two approaches, but what they have in common is that the runtime of the masked software is significantly increased by applying either approach. The second point is that they still require manual leakage assessments, which are done in order to make sure that there is really no exploitable leakage, so that our fixes have really prevented the problem. Therefore, in our work we want to evaluate the security of masked software on complex processors, because manual leakage assessment is not so easy there anymore. For example, if you apply lazy engineering or fine-tune your masked software, you might still see leakage in your manual assessments, and you will not be able to find out easily where this leakage comes from and how to fix it. Therefore, we want to focus in our work on the security of masked software on such complex processors.
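The lazy engineering idea can be sketched with a small helper. Assuming, as in the classic transition-leakage setting, that a single observation combines at most pairs of shares, a d-th order scheme degrades to roughly order d/2, so one implements twice the target order. The function names and the pairwise-combination assumption are mine, for illustration only; the talk later derives a more precise factor for pipelined cores.

```python
def effective_order(d: int, combine: int = 2) -> int:
    """Worst-case effective protection order of a d-th order masked
    implementation when the hardware may combine up to `combine` shares
    per observation (e.g. transition leakage combines pairs)."""
    return d // combine

def required_order(target: int, combine: int = 2) -> int:
    """Masking order to implement so that `target` order survives,
    following the lazy engineering approach."""
    return target * combine

# To keep first-order security under pairwise combinations,
# lazily implement a second-order scheme.
assert required_order(1) == 2
assert effective_order(2) == 1
```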
By complex processors I mean processors with multiple pipeline stages, with forwarding logic between the stages, which are maybe superscalar and even have data caches. As already said, the analysis of these processors can barely be done manually anymore. Therefore we want to stick to a formal approach. In our case study we focus on the RISC-V SweRV core, which is exactly one core you would consider more complex, and we focus on the following questions. First of all, which CPU components are the components which will cause problems in the context of masking, and how can we deal with these problems? Are there things we need to change in our software, and which things do we need to change? Are there general rules which we can apply to our software to get secure software in the end on such complex cores? And last, how can we still design efficient masked software? These rules of course add a certain overhead to the masked software, but is there a way to keep this overhead low? As already said, we consider the SweRV core as the target processor platform for our analysis. The SweRV core is an open-source RISC-V core which was designed by Western Digital. The core is applied in, I would say, data-intensive fields of application, for example in storage controllers. It is comparable to an ARM Cortex-A15. The core is an in-order core and features a dual-issue pipeline. This means that it can execute not just one but two instructions per clock cycle. And it has some load/store buffers, some of which can be compared to a small data cache. The SweRV core has nine pipeline stages, which you can see here in the figure. The first three pipeline stages are responsible for fetching instructions from memory. Then these instructions are decoded and sent to one of the two ALUs, for example.
And of course there is a certain part which handles forwarding between these pipeline stages, such that we can for example forward a result from the sixth pipeline stage back to the input of the ALU. The last two stages are responsible for commit and writeback. Our goal in our analysis is to investigate the security of masked software when executed on the SweRV core by using formal methods. When it comes to such an analysis, one has to think about the attacker's abilities, and we do that by using a certain probing model. The classical probing model, which you apply for hardware because a CPU is hardware, equips the attacker with d probes, and the attacker can distribute these probes as they like in the hardware circuit. Each probe will deliver the value of a specific gate or wire back to the attacker. This is good because by using the classical probing model for hardware we can capture side effects like glitches and transitions, but it is actually not that suitable for masked software because the attacker is too powerful. For example, if the attacker places one of the probes on the output of the register file of the CPU, the probe will deliver every value which is ever contained in any register and which is read by an instruction. So this is immediately broken, and we cannot really design masked software with that probing model. Therefore we decided to stick to the time-constrained probing model, where the attacker can use the d probes, just as in the classical probing model, to measure a specific gate or wire, but only for the duration of one clock cycle. What is also important is that the attacker can distribute these probes as they like over multiple clock cycles and multiple wires and gates. The time-constrained probing model was already applied in previous work, the work about the Coco verification tool, which we will also use for our analysis.
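The difference between the two probing models can be illustrated with a toy execution trace. The wire names and values below are invented for illustration; the point is only how much of the trace a single probe captures in each model.

```python
# Toy execution trace: for each wire, the value it carries in each clock cycle.
# "regfile_out" carries every share that is ever read from the register file.
trace = {
    "regfile_out": [0x11, 0x22, 0x33],  # shares k1, k2, k3 read in cycles 0..2
    "alu_result":  [0x00, 0x5A, 0x00],
}

def classical_probe(trace, wire):
    """Classical hardware probe: observes a wire for the whole execution."""
    return trace[wire]

def time_constrained_probe(trace, wire, cycle):
    """Time-constrained probe: observes a wire for a single clock cycle."""
    return trace[wire][cycle]

# A single classical probe on the register-file output already sees all
# three shares, so any masked software is trivially broken in that model.
assert classical_probe(trace, "regfile_out") == [0x11, 0x22, 0x33]

# A time-constrained probe yields one value per probe, so an attacker with
# d probes learns at most d shares, matching the software masking view.
assert time_constrained_probe(trace, "regfile_out", 1) == 0x22
```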
So the Coco verification tool verifies that a certain piece of masked software, when executed on a specific CPU netlist, is secure in the time-constrained probing model. As you can see here in the figure, the Coco tool takes as input the masked software, the CPU netlist, and also a certain piece of background information, for example the location of the shares when the execution starts, so which share is in which register and which share is in which memory location, and so on. You give that information to the verifier, and the verifier will check for each gate in the CPU netlist, for each cycle in the execution, whether an attacker can measure some information about any native, unshared value there. If the verification is successful, the verifier will output that the software is secure; otherwise it will say it is not secure and give us the cycle and the gate which cause the leak in the implementation. The work about Coco also included a case study of the RISC-V Ibex core. The Ibex core is a simpler and smaller core, but it already contains hardware components which are problematic, for example the register file. The register file can actually be a threat to the security of masked software because there are parts which can cause glitches and transitions and which will lead to leaks. The work also suggested some modifications to the hardware, so to the Ibex core, in order to obtain a secured Ibex core. The secured Ibex core then allows the execution of masked software without any leaks, as long as certain software constraints are followed. So the initial work about the RISC-V Ibex core suggested, first, fixes which can be applied to the hardware, so to your microprocessor, and second, constraints which have to be met by your software, such that you can guarantee that the execution of the software is secure.
Now we do not want to consider the Ibex core but the more complex SweRV core, and our initial analysis with Coco shows that the SweRV has similar problems: one problematic component there is also the register file. We found out that we can simply map the hardware fixes which were suggested in this work to the SweRV core to obtain a secured SweRV core, so to say, and the secured SweRV core will be the baseline for all our further experiments. So now let's start with the actual formal analysis. We have our secured SweRV core and now we want to verify something, and we chose to verify software generated by Tornado as a starting point. Tornado is a nice tool which will generate masked C implementations based on unmasked high-level descriptions of ciphers, and not only that, it will also give you a security proof in the register probing model. The register probing model is a probing model which is often chosen for masked software, in which an attacker can place a probe on a specific register for one cycle. In our experiment we generate several masked Keccak S-box implementations with Tornado; more precisely, we generate four different implementations, where each implementation refers to one masking order, and then we use Coco to verify the execution of this software on the secured SweRV core. The result of this verification is that the implementations lose their protection orders, because there are certain components in the SweRV core which cause, first of all, big problems, by which we mean components combining more than two shares, and also smaller problems, so components which combine up to two shares. Let me give you one example of such a big problem. Here we try to visualize the execution of software which contains ten shares, and the shares are in the pipeline at the same time, while the masking itself is correct on the algorithmic level.
Then we perform a gate-level timing simulation of the SweRV core to visualize whether glitches and transitions on a specific wire in the processor's forwarding logic can lead to any leaks. For this experiment we use a specific cell library, which maps timing and area constraints to each gate used in the SweRV netlist. This is the result of the analysis. The question was, again: an attacker probes a wire in the forwarding logic of the SweRV core for the duration of one clock cycle, so this is important; what can the attacker see? We visualized what the attacker could see here in this timing diagram. First, for example, there is the first share, then we have a combination of two shares, then we even see combinations of up to three shares, and the wire keeps switching around before finally stabilizing to the value the wire should have. In the end we found out that the attacker can observe combinations of up to five shares when this specific cell library and these concrete timings are used, so this would be a big problem. We saw this in the processor's forwarding logic, and then we used Coco to analyze what the exact problem is there. Here you see a diagram of the forwarding logic of the SweRV core. You see each pipeline stage, so here is the decode stage, the ALU stage, and the further execution stages, and here we have a multiplexer which forwards the data from the correct pipeline stage to the ALU.
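Why such a forwarding multiplexer is dangerous can be captured in a small toy model: if its select signal glitches through several values before it settles, the output wire transiently exposes the values of several pipeline stages within a single clock cycle. The concrete glitch sequence and share values below are invented for illustration.

```python
def mux(select, inputs):
    """Forwarding multiplexer: passes the value of the selected stage."""
    return inputs[select]

# Shares of the same secret sitting in different pipeline stages at once.
pipeline_stages = [0x11, 0x22, 0x33]  # e.g. shares in decode/EX1/EX2

def glitchy_mux_outputs(select_glitches, inputs):
    """Values visible on the mux output wire during one clock cycle,
    assuming the select signal glitches through several values before
    stabilizing (the last element is the settled value)."""
    return [mux(sel, inputs) for sel in select_glitches]

# Suppose the select signal transiently evaluates to 0, then 2,
# before settling at 1.
observed = glitchy_mux_outputs([0, 2, 1], pipeline_stages)

# A probe on this wire for one clock cycle sees all three shares, even
# though logically only one stage's value should have been forwarded.
assert observed == [0x11, 0x33, 0x22]
```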
The multiplexer has a select signal which is called m1_select, and the select signal is computed by combinational logic, which means that it might glitch. If a glitch happens, the output of the multiplexer will also glitch. If you now imagine software where we have multiple shares of the same secret in each pipeline stage, then the forwarding logic will forward each share to the ALU while m1_select glitches, and then we combine multiple shares, and this is what we call a big problem. Now the question is how we can fix that. First, we thought about fixing it in hardware, similar to what was suggested in the work about the Ibex core, but that would require gating each pipeline register with a bit indicating whether the value of the register should be forwarded to the ALU or not. This could be done, but the gating bits need to be glitch-free, which is not that easy to achieve and in the end requires a very large latency overhead, which is impractical. Instead, we need to find a solution in software. Basically, the software solution, or the software constraint as we call it, needs to ensure that at no time there are multiple shares of the same native value in the pipeline. How can we do that? We need to make sure that the distance between two instructions which process shares of the same native value is large enough, so that there are enough unrelated instructions between them. What do I mean by unrelated instructions? In the basic case this is a NOP operation, but it can also be an instruction which processes a share of another secret, or an ALU operation on non-secret data, like incrementing a loop counter. We performed further analysis, and it turns out that there are a lot of other components which cause leaks in the SweRV core; if you are interested, you can have a look at the paper. I will now only give you one more example, which concerns the management components of the data memory. For example, there is a component which is called
the LSU bus buffer. This is similar to a small data cache, and a leak might happen if the buffer contains a share and the share is overwritten by its counterpart when performing another load or store instruction. This is really bad, and again the hardware solutions turned out to be impractical, so we need more software constraints. For example, for the LSU bus buffer the software constraint would be to flush the buffer between loading two shares of the same native value. Okay, so these were the results we had for the SweRV core, and now we try to derive some generic rules from that. Our analysis clearly shows that software constraints are necessary, even though we already work with a secured or hardened version of the SweRV core. One effective software constraint turned out to be the insertion of unrelated instructions between two instructions, and the question is now how many instructions need to be inserted there. This number can actually be expressed in terms of the length of the pipeline of the core and the number of execution units. Here I divide the number of pipeline stages P into P_I and P_D: P_I is the number of stages which deal with fetching the instructions, so no data is involved there, and P_D is the number of stages which actually process data. It turns out that you need P_D times E plus one unrelated instructions between two dangerous instructions, where E is the number of execution units. We also tried to formulate a factor for the order reduction: if we want to apply the lazy engineering approach, how many more orders do I need in my masking scheme to still be secure on such a processor? The factor we computed here is the reduction factor of a lazily engineered masked software implementation when executed on such a complex pipelined processor. Okay, so now I have told you that we need a lot of software constraints even on a secured SweRV core, and if one adopts these rules or
constraints strictly, you can imagine that the overhead is really huge. Here I have a table which summarizes this. You see a certain set of example programs; we have here, for example, a DOM AND gate. Then we compare the number of cycles and the number of instructions required by the software when we implement it without constraints and when we implement it with constraints, so by applying the rules strictly, so to say. For the version with constraints I also give you the number of total instructions and the number of NOPs, where the NOPs are the unrelated instructions which had to be mapped to NOPs. As you see here, for example, for the DOM AND we need 33 cycles instead of only 10 when we have no constraints, and if we look at higher-order implementations it is even worse: there we need 33 cycles without constraints but 250 cycles with constraints, and considering the number of instructions we need almost 300 instructions, the major share of which is NOPs, so unrelated instructions where we cannot really do anything useful. Can we change that? Yes, the answer is yes: if we apply the right implementation techniques, we can reduce this overhead. Fortunately, one of these implementation techniques is to stick to parallel implementations instead of serial implementations. Let me explain that based on an example. Assume we have a Keccak S-box, and the state of this S-box consists of five lanes; each lane itself is again split into d shares. In a serial implementation we would take the d shares of three lanes, process them, and store the output lane; then we would again take the d shares of the next three lanes, process them, and store them in the output lane. Of course, lots of unrelated instructions would be needed to separate the processing of the d shares of the same native value. In the case of parallel implementations, on the other hand, instead of the NOPs which we would use in serial implementations we use
computations of shares of other lanes, so we kind of mix the computation of the lanes. Here you can see one example: a serial implementation of a DOM Keccak S-box compared to the parallel implementation. If we now compare the overhead which the constraints introduce, for the serial implementation we have 80 cycles compared to 240, but if we do it in a parallel way the overhead is much smaller: we have 36 cycles compared to 81 cycles. Also, here we have almost a hundred instructions compared to 400, with a lot of the 400 being NOPs, while here only 79 out of the 144 instructions are NOPs. Okay, so let's have a look at another technique, which is called threshold implementations. Threshold implementations are a masking technique based on the property of non-complete component functions. Non-complete means that I can compute each component function in such a masking scheme using all of the shares except one, so the computation is independent of at least one of its input shares. For the TI Keccak S-box this means that the linear layer can still be done in sequence for each share, but when it comes to the non-linear layer, where we require multiple shares, we do it in sequence for each component function. This means, on the other hand, that we can ignore the smaller problems, so the problems where we combine up to two shares, because we have component functions which, in the first-order case, only ever compute on two shares, and we can therefore ignore these smaller problems. Of course, the downside is that we need three shares for first-order security. However, the results were really promising: here we have a TI Keccak S-box, and we see that 66 cycles are required without constraints, but with constraints only 72 cycles are required, and of the unrelated instructions only 15 are NOPs. For the Ascon implementation, which we also did, we have 721 cycles compared to almost 1700 cycles. Here the overhead is mainly due to register spilling and also due to memory overhead, because the Ascon state is much bigger than the Keccak state, and therefore we cannot hold all our shares in the register file all the time. We have to load and store the shares, and this introduces a lot of overhead on the SweRV core because we have to flush the load/store buffer. This actually already leads me to the end of my presentation. We have discussed that architectural side effects of complex CPUs can reduce the security of masked software by multiple orders, and this is due to problematic components which cause big or small problems, so the combination of more than two or up to two shares. These components are mostly pipeline and memory management components. However, we showed that it is still possible to have secure and efficient masking when we carefully consider both hardware and software. So thank you all for listening, and thank you for your attention.