Hello, welcome to my talk. My name is Si Gao and I'm presenting joint work with Ben, Dan and Elisabeth. I guess most of the audience at CHES is already familiar with the concept of side-channel analysis: unlike traditional cryptanalysis, side-channel analysis takes advantage of the information leaking from the cipher's execution. Depending on the specific application scenario, the attacker might be able to recover the secret key within a few minutes. To lighten up this talk a little, I made up a hypothetical conversation between an industry engineer and an academic researcher. The industry engineer learns about the threat of side-channel analysis and asks: oh no, what should I do? The researcher says: well, we have provided a lot of countermeasures. If you are into hardware masking, we have threshold implementation; we also have domain-oriented masking. We provide various schemes, basically. But then the industry engineer might say: well, we are a smaller company. We are using other people's general-purpose cores; we cannot build all the hardware from scratch. Besides, all of our devices are already on the market, so I can update the code through a firmware update, but there is no way I can recall all the devices out there right now. The researcher might say: well, that is trickier, because it forces us to use software masking, but it is still doable. If you have a lot of memory resources, then perhaps you can use the lookup-table-based approaches. For example, if you have a huge memory space, you can perhaps store the whole shared operation in one huge table. The overall masking scheme might be quite efficient, but at the cost of a lot of RAM. On the other hand, if that is not good enough for you, you can perhaps trade some memory space for execution time.
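The table-based idea the researcher mentions can be sketched in a few lines of Python. This is a purely illustrative toy, assuming a 4-bit S-box (the PRESENT S-box) and single Boolean input/output masks; it is not the scheme from any particular paper:

```python
import secrets

# 4-bit PRESENT S-box, used here purely as a small example.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def build_masked_table(m_in, m_out):
    """Precompute T so that T[x ^ m_in] == SBOX[x] ^ m_out for all x."""
    return [SBOX[i ^ m_in] ^ m_out for i in range(16)]

# One masked lookup: the secret nibble never appears unmasked in memory.
m_in, m_out = secrets.randbelow(16), secrets.randbelow(16)
T = build_masked_table(m_in, m_out)
x = 0x7                       # secret value
masked_y = T[x ^ m_in]        # lookup indexed by the masked input
assert masked_y ^ m_out == SBOX[x]
```

The trade-off the researcher describes is visible here: the masked table costs one full copy of the S-box in RAM (and must be recomputed whenever the masks are refreshed), in exchange for a fast masked lookup.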
So if you recompute some of the shared tables online, you can have a smaller memory footprint, but at the cost of longer execution time. But we know there are a lot of applications that are quite memory-tight, so in that case our industry engineer asks for other sensible solutions. Well, we also have bit-sliced masking. With bit-sliced masking, you construct small secure gadgets, like a 2-bit AND gate. By 2-bit AND gate, what I really mean is a small piece of code that runs in software but provides exactly the same functionality as a hardware AND gate; the only difference is that it has some security properties. Because this is a bit-sliced solution, each bit is stored in a separate register as a separate variable, which means the memory cost is slightly larger. But in the end, the overall solution is quite flexible, because once you have these small secure gadgets you can build whatever circuit you want with them. The other drawback is that chaining modes become a little difficult. This is because in a bit-sliced implementation you get the best throughput if you can utilize the whole bit width the processor provides, nowadays usually 32 or 64 bits. To fill the whole register, you might need several concurrent encryption blocks; if that is not possible, your efficiency is compromised. CBC encryption mode is one of those cases: in order to encrypt the next plaintext block, you have to know the result of the current encryption. In that case you cannot really fill the whole bit width, which gives you lower efficiency. Well, from the industry point of view, although you want something that is good in every aspect, we all know that is too good to be true. So our industry engineer simply says: well, that's fair enough. I will take it.
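The secure AND gadget described above can be sketched as a first-order (2-share) ISW-style AND, bit-sliced over 32-bit words so that each of the 32 bit positions carries an independent slice. This is a minimal illustration under those assumptions, not the exact gadget of any cited scheme:

```python
import secrets

MASK32 = 0xFFFFFFFF

def sec_and(a0, a1, b0, b1):
    """2-share ISW-style AND, bit-sliced over 32-bit words.

    Inputs a = a0 ^ a1 and b = b0 ^ b1; output c = c0 ^ c1 = a & b.
    Fresh randomness r refreshes the cross terms; the bracketing below
    fixes the order of XORs so no intermediate depends on both shares.
    """
    r = secrets.randbits(32)                        # fresh randomness per call
    c0 = (a0 & b0) ^ r
    c1 = (a1 & b1) ^ ((r ^ (a0 & b1)) ^ (a1 & b0))
    return c0 & MASK32, c1 & MASK32

# Functional check: recombined output equals the plain AND of the inputs.
a, b = 0xDEADBEEF, 0x0F0F1234
a1 = secrets.randbits(32); a0 = a ^ a1
b1 = secrets.randbits(32); b0 = b ^ b1
c0, c1 = sec_and(a0, a1, b0, b1)
assert (c0 ^ c1) == (a & b)
```

Since XOR and NOT are free to mask (share-wise operations), this one gadget is enough to build any circuit, which is exactly the flexibility argument made above.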
Let's do it. Then we have several schemes available: you have the ISW multiplication, and you also have the multiplication in the bounded moment model. Both of them have security proofs. On top of that, in the last few years a lot of researchers have been publishing their code on GitHub, and we also have performance evaluations on ARM processors. OK, says the engineer, brilliant; I'll implement one of these. But then our researcher says: well, you probably need to be careful with your implementation. Judging from our previous experiments, we know for sure that pitfalls are quite common in masked implementations. For example, if you have some bad randomness, no matter how secure your scheme is in your security model, it is going to be quite bad in practice. We also know that the security model does not really comply with practice all the time. For example, there is the order reduction theorem, which suggests that if you have a d-th order secure scheme, in practice the security order might be halved. On the code level, that means if you have a d-share scheme, which in theory can be (d-1)-th order secure, in practice it is seldom (d-1)-th order secure. It is still possible to get there, but doing it that way means you have to go through the full diagnose-and-cure cycle, which, based on my previous experience, I can guarantee is quite a devastating and time-consuming task. So few researchers really have the motivation or incentive to actually do so. And even if you finish the whole cycle and get your scheme to be (d-1)-th order secure, unless your d is large, it is still quite weak protection after all. Our industry engineer simply says: all right, that sounds like a lot of work, but I'll keep it in mind. Then a few days later, our engineer says: oh, professor, I've implemented my four-share secure AES. Specifically, I found this secure multiplication which works in parallel.
It actually fits quite well into my software development framework, and because it operates on all the shares in parallel, it is quite efficient. In order to do so, I have to store all the shares within one register, which will be called share slicing in this talk. The first thing the researcher says on hearing all of that is: OK, are you sure your scheme is working properly? Well, it should be OK, I guess, says the engineer: I am using a four-share scheme, but I am only claiming first-order security. Although that is quite limited, according to the order reduction theorem it should be fine. Also, a previous study suggests that using this scheme in a software environment is not such a big deal; it is not as devastating as the physical coupling effects seen in hardware. OK, so if we search through the knowledge memorized in our minds, we would probably say: OK, maybe you're right. But this talk is basically about why that is not really right. I think it might be difficult to discuss whether those statements are completely correct in general, but it is relatively easy to evaluate the security of that specific implementation in practice, as we already have the code. So let's get started. We have an experimental setup with ARM Cortex-M0 and Cortex-M3 cores, both from NXP. Our cores run at 12 MHz, and our scope samples at 250 megasamples per second. All of our target code is written in Thumb assembly, so it works on both the M0 and the M3. One of the things I would like to stress here is that in a share-slicing scheme, if you are using d shares, say d equals four, only four bits out of the bit width are actually defined. All the other 28 bits, if you are using a 32-bit processor, are completely undefined, so it is completely up to the engineer to decide what to put in there.
Of course, the trivial way is to set them to some constant, say all zeros. That is the trivial way of doing it, but it is quite a waste, because all those unused bits are not providing any useful computation. You can fill them with random numbers instead; that also creates some noise for the attacker, but if the randomness comes from a random number generator, it could be quite costly as well. And the last option I present here is repeat: you can repeat the four shares and fill the whole register with copies of them. It might sound quite absurd at first: why would I repeat that? But if you think about it, in a real-life application the other 28 bits will most likely hold concurrent encryption blocks. So if the attacker can get some control over the plaintexts fed into the scheme, he or she may be able to send eight plaintexts that are exactly the same. Then, although the shares may be different, the attacker knows the register holds eight groups of four shares that all recombine to the same unshared value, and that might give the attacker some benefit. So what we are testing is a secure AND originally written in Thumb assembly. Originally we found a lot of leakage, so I tried to analyze where the leakage was coming from by commenting out some of the instructions; I commented them out one by one, and in the end I had commented out almost all the instructions. Of course, most of this leakage would be transition-based leakage, but while trying to minimize the transition-based leakage I basically commented everything else out, and in the end only the first shift instruction was left. So what we see here is a single shift instruction, the first shift instruction in the algorithm, padded with two NOPs.
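The three fill strategies above can be sketched as follows, assuming four one-bit Boolean shares packed into the low nibble of a 32-bit word; the function names are mine and purely illustrative:

```python
import secrets

def share_bit(x):
    """Split a single secret bit into 4 Boolean shares (XOR recombines to x)."""
    s = [secrets.randbits(1) for _ in range(3)]
    s.append(x ^ s[0] ^ s[1] ^ s[2])
    return s

def pack(shares, fill="zeros"):
    """Pack 4 one-bit shares into the low nibble of a 32-bit word.

    The remaining 28 bits are filled according to the chosen strategy:
    constant zeros, fresh randomness, or repeated copies of the share group.
    """
    low = sum(s << i for i, s in enumerate(shares))
    if fill == "zeros":
        return low
    if fill == "random":
        return low | (secrets.randbits(28) << 4)
    if fill == "repeat":
        # 8 copies of the 4-share group fill the whole 32-bit register.
        return sum(low << (4 * k) for k in range(8))
    raise ValueError(fill)

shares = share_bit(1)
word = pack(shares, fill="repeat")
# Every nibble holds the same 4-share group, mimicking 8 identical blocks.
assert all((word >> (4 * k)) & 0xF == word & 0xF for k in range(8))
```

The "repeat" case models the attacker scenario above: eight identical plaintexts mean eight share groups in one register that all recombine to the same value.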
So there shouldn't be any transition-based leakage anyway, but even in this case we find exploitable leakage that can actually lead to successful attacks. More specifically, we use a two-share version where all the other 30 bits are set to random; so basically, this is the worst case. In the following graphs you will see the correct key guess as the red line and all the wrong key guesses as green lines. We can see that both the first-order attack and the second-order attack succeed. The second-order attack, because this is a two-share version, is basically allowed by your security proof. The first-order attack does not really seem to be too big a deal, because the second-order attack appears to be more efficient. So we would say we do have first-order leakage, or interactions that contradict the model, but that does not give you a security flaw or anything beneficial for the attacker in practice. But if we move on to the four-share version, we see two surprising points. The first one is that we still see some first-order leakage, which basically goes against the order reduction theorem. The other is that the second-order attack is almost as efficient as, or even better than, the fourth-order attack, which means this might actually lead to a practical security flaw. So our industry engineer will then ask: well, how can that be? Where is it going wrong? Well, the first thing that comes to my mind is: have you ever checked your model assumptions? The engineer says: I have read your assumptions, but I'm not really sure I understand them completely; I did check the implementation details section, the discussion on implementations. But most of that comes from the hardware perspective, which is not really relevant in my case, because I'm doing software development.
Then what does your assumption actually mean if I'm using a software development framework? Well, I think for most of us the answer is: we might need to think about that. OK, let's think about it. What does the independence assumption really mean in practice? In theory it literally means that each share should leak independently. Each share can have its own leakage function, it does not matter what form it takes, but there should not be any interaction or crosstalk. Originally, in the hardware masking setup, this was sort of guaranteed by some architectural requirements. For example, if you think about threshold implementations, the feature was motivated by MPC, multi-party computation: the overall circuit can be divided into several parallel but separate sub-circuits, and each sub-circuit represents one of the computing parties. Until the next stage, they do not really communicate with each other, so in that sense there will not be logical crosstalk between the sub-circuits. And they also explicitly ask for the synthesis option called keep hierarchy to be turned on, so that the synthesizer, when synthesizing your whole circuit, will not add any additional crosstalk between those sub-circuits. If we think about what that sort of architecture-level support means in software development, it basically means, if you follow the same level of scrutiny, that each gate in the ALU should connect to only one bit of the register. The question is whether that is possible. Here is an ARM diagram of one of their cores. Of course, when ARM designed this core, it did not have the independence assumption in mind; but could it really be true that this core supports our independence assumption? Well, if you zoom in on the components within the core, you will get a lot of headaches.
The first one is the shifter, because in theory an arbitrary shifter connects each input bit to each output bit, which means each output bit connects to all the input bits; that is already a contradiction of the independence assumption. If you think about the other parts of the ALU, there are other contributors as well. For example, the adder: you know there is a long carry chain there, which basically connects various bits of your register. In that sense, we can actually test the instructions alone, and there are quite a few that do not comply with the independence assumption. Here I test the shift instructions alone. The blue line stands for the first-order attack and the red line stands for the second-order attack. If the first-order line stands above the second-order line, then we say this is not only interaction leakage, but leakage that will affect your security in practice. We see that it affects security in practice for the left shift on the M3 and the right shift on the M0. For the other two cases, the interaction leakage still exists, but it does not necessarily affect your security in practice. OK, then our industry engineer might ask: but didn't the previous study already verify this assumption? Well, let's read past the headline and see what is really happening in the technical sections of the previous paper. That paper actually used TVLA on a specific implementation instance. It is important to remember that this is not about the assumption itself. In that specific instance, only two or four bits, if you are talking about the two-share or four-share version, are actually used; all the other bits are set to constant zero.
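For readers unfamiliar with TVLA, the fixed-vs-random test mentioned here boils down to Welch's t-test between two groups of traces. Below is a minimal sketch on simulated Hamming-weight leakage of an unmasked byte; the leakage model, noise level, and threshold usage are illustrative only, not the setup of the paper under discussion:

```python
import random
import statistics

def hw(x):
    """Hamming weight of an integer."""
    return bin(x).count("1")

def welch_t(g0, g1):
    """Welch's t-statistic between two trace groups at one sample point."""
    m0, m1 = statistics.fmean(g0), statistics.fmean(g1)
    v0, v1 = statistics.variance(g0), statistics.variance(g1)
    return (m0 - m1) / ((v0 / len(g0) + v1 / len(g1)) ** 0.5)

random.seed(0)
N = 2000
# Simulated noisy Hamming-weight leakage: fixed input vs. random inputs.
fixed = [hw(0xAB) + random.gauss(0, 1) for _ in range(N)]
rnd = [hw(random.randrange(256)) + random.gauss(0, 1) for _ in range(N)]
t = welch_t(fixed, rnd)
# |t| > 4.5 is the conventional TVLA threshold for flagging a leak.
assert abs(t) > 4.5
```

The caveat made in the talk applies directly: such a test only speaks for the specific implementation instance (and register contents) it was run on, not for the independence assumption itself.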
If you are using exactly that in your implementation, not using all the other 30 or 28 bits, then that is perhaps fine; but if you are using the other bits for concurrent encryption blocks, then that is a completely different story. And also, if you read the security estimation section that follows, they take a rather conservative interpretation: they have a 32-share masking which in theory provides 31st-order security, but they leave quite a large security margin by basing their estimate on only 15th-order security. So altogether it is quite fair for their purpose, but if you take their statements out of context, you are basically misleading yourself. OK, then how about the order reduction theorem? Doesn't that protect my implementation? Well, if we go back to the order reduction theorem, it is basically talking about security reduction for transition-based leakage. At that time, there were already many implementations that store all the shares within one register, yet if you read the proof, this theorem is talking about different shares stored in different registers. So in the first place, you should not really apply this theorem here; it does not apply to any case with share slicing. Interestingly, this point has already been addressed in a previous publication; it has been mentioned before, but in a completely different tone: they basically said this theorem does not directly apply here, and whether that makes things more secure or less secure is a completely different story. OK, so as a conclusion, I think it is safe to say from our results that the independence assumption should not be taken for granted, especially on software platforms. And, following many previous publications, we wrote down a long list of possible ways of misinterpreting our results.
I think this is a good habit, actually, to avoid further confusion or myths in our community. I would like to remind our readers that our results do not mean share slicing should be forbidden. You can switch to a much weaker assumption, saying: well, I have bit interactions, but the interactions are quite weak in magnitude, so I do not really have to care whether this affects my security in practice. That is a fair argument, as long as your security evaluation uses exactly the right implementation. You cannot use one type of implementation, say four bits with the other 28 bits set to zero, as your argument and evaluation, and then deploy an implementation with all the bits used for concurrent encryption blocks. It also means that a security proof does not guarantee everything: it guarantees security in that model, but outside that model it is hard to say how far it goes. Also, remember that our results, like all the previous evaluation results, are quite platform-dependent. Since for the independence assumption on software there is, as I said, no architecture-level protection, you basically have to re-evaluate every time you switch from one platform to another. And we are not claiming the shifter is the only source of interaction here; we are not even claiming it is exactly the source of the interaction we observe in this paper. There are various components that can contribute, including the adder, as I said. Also, statistically there is no way to locate the exact source unless we know the exact design details of the CPU, which is not likely to happen within a few years. I think what is really important here is: what do our model assumptions really mean in practice?
In academia we often propose schemes in a security model; we have our model assumptions, and we understand them within that security model. But what does that mean in practice, in industry? We need a connection from our security model to practice. I am not suggesting who should be doing what, whether industry should take this as their responsibility or we researchers should take it as ours, but we need someone to stand in the middle as an interpreter, who can convert all the wisdom produced in our research community into industry practice without losing a lot of the security guarantees. OK, that concludes my talk. Thanks a lot for listening.