Hello again. If you heard my last talk, I talked about how you can construct complete leakage models, and how most of the existing tools are very far from a complete leakage model. In this talk I'm going to talk about how you can reverse engineer the micro-architectural features of a Cortex-M3 core and make your leakage model a bit better, a bit more complete, compared to all the existing tools. The speaker is still me, Si Gao, and this is still joint work with Elisabeth, but this time I also got my previous colleague from Bristol, Dan, on board to guide us through all the micro-architectural mysteries.
All right, we already talked about leakage simulators in my last talk. If you missed it: leakage simulators are early-stage feedback tools which help developers avoid using a real-life oscilloscope or waiting for the response of a certification centre. Right after they have finished writing their code, they can check whether their implementation is okay, and the tool will also tell them exactly what caused a problem and how to fix it.
Although this is a really attractive idea, most of the current simulators take one of two routes. One is the grey-box route, which you often see on ARM processors. For example, take the ELMO family: whether you are using the original ELMO or the extended ELMO called ELMO*, you are targeting the Cortex-M0. The entire ELMO family relies on an instruction simulator called Thumbulator, which emulates the Thumb instructions, so all the knowledge is actually based on this. Your leakage model is trained from the profiling traces you got from the M0 core you have, and more specifically, the model focuses on the ALU leakage of the STM32F0. There are a few extensions: for example, there are extensions to leakage on the memory bus.
There are extensions that extend this to another version of the M0, for example an M0 manufactured by NXP, and there are also extensions to the Cortex-M3.
You can also take the white-box route. For example, what was laid out in MAPS in 2018 takes the academic version of the RTL code. In this case you actually know what's happening in the micro-architecture; you know everything. But they chose not to include everything in their model: they only focus on register bit flips, so they only capture the register Hamming distance. The good thing is that you don't need to guess about the micro-architecture, and you don't really need any measurements here.
But, as a bit of a recap from my last talk, when you verify those models, both of them are really far from what's observed on the traces. So both of them are really far from ideal, and the reason is that both models are relatively simple. If you think about ELMO and ELMO*, they focus on the ALU part. For three-stage pipelines like the Cortex-M0 or Cortex-M3, the ALU lies in only one of the stages, the execute stage; the other two stages are basically ignored. Also, the model is built only for the ALU buses, which I have marked as lines here. Neither of them is at the architecture level; both lie in the micro-architecture. That means you might not really know what's happening on those buses. For example, if we have an add instruction adding R0 to R1, you don't necessarily know whether R0 goes to bus A or to bus B. So what happens in ELMO represents the authors' guess, or more specifically, David's guess.
In MAPS the situation is quite different. MAPS has access to the white-box code, the RTL code, so there's no guessing involved. But the question is: is it really the same as the product on the market?
You don't necessarily know whether ARM provides the same version, whether the academic version and the industrial version are exactly the same, or whether the manufacturer might have made their own revision. There is some previous work in this direction which found that the leakage behaviour is not entirely the same. We don't really know exactly why, but there are some differences. Also, as stated in the original MAPS paper, the leakage traces track only the registers. So they only care about the registers; if you've got some leakage that is not from the registers, for example from the ALU, then that's not covered.
If you listened to my last talk: sorry, this is not exactly the same ISW multiplication. This is another version; we actually worked on several versions of ISW multiplication here. But this one helps us explain what's happening in all those existing tools and what their shortcomings are. This version is again written in Thumb assembly. The realistic one is tested on an ARM Cortex-M3 from NXP, and for ELMO, I have extended it to an M3 model. If you look at this, it is still about ten instructions, but here we have two cycles leaking, cycle 9 and cycle 15; this is the result on the realistic device. We don't really know why yet, but if you take a look at the ELMO results, ELMO misses both of the leaks. ELMO* not only misses both of the leaks, it also produces a false positive here. MAPS captures one of them, but the other one is missing. And you might wonder why.
In general, that means your leakage model is overly simplified. It doesn't capture everything, especially the micro-architectural features of your circuit, of your realistic core. This motivates reverse engineering the micro-architectural features. By reverse engineering, I should mention, I mean leakage-wise reverse engineering.
That is, we only care about the micro-architectural features that affect the leakage and can be observed from the leakage. If something is not leakage-relevant, we don't care about it. This is clearly not the same fine-grained analysis as reverse engineering by disassembling binary code. And our final goal is building a micro-architecture-enhanced leakage simulator.
Okay, let's start our reverse engineering journey now. Our starting point is, of course, the public information from ARM. We know the Cortex-M3 is a three-stage pipeline core. The three stages are usually called fetch, decode, and execute, as this figure shows. The only really interesting thing in this figure is in the decode stage: it says it does not only instruction decode but also register read. This means the decode stage will not only decode the instruction, but also prefetch the necessary operands for the execute stage. And because there are pipeline registers between pipeline stages, you need some registers temporarily storing the prefetched operands. There are at least two of them, because most instructions have at least two operands. As for the register file, because you need to fetch at least two operands simultaneously, you need at least two read ports.
Let's now take the three pipeline stages one by one. First, the fetch stage, which fetches instructions from memory into the instruction register. The entire stage is driven by the PC, which provides the instruction addresses. If you take a look at this picture, for most of the wires, the buses here, we know what's happening on them; we know the values on them. Perhaps more importantly, most of them are not even data dependent. We only care about data dependency. If something is not data dependent but a branch-related issue, I personally believe there are better solutions for it; we don't really have to handle it with leakage analysis.
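As an editorial aside, the public pipeline picture described above (decode also performs the register read and latches the prefetched operands into at least two pipeline registers) can be made concrete with a small Python sketch. Everything here is a toy model and not the Cortex-M3 itself: the class name `DecodeStage`, the register names `reg1` and `reg2`, and in particular the fixed rule that the first operand goes to `reg1` are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")


@dataclass
class DecodeStage:
    """Toy decode stage: two read ports feeding two operand pipeline registers.

    Assumption for illustration only: the first operand always lands in
    reg1 and the second in reg2. On a real core this mapping is unknown.
    """
    reg1: int = 0
    reg2: int = 0

    def issue(self, op1: int, op2: int) -> Tuple[int, int]:
        """Latch two prefetched operands and return the bit flips per register."""
        flips = (hamming_distance(self.reg1, op1),
                 hamming_distance(self.reg2, op2))
        self.reg1, self.reg2 = op1, op2
        return flips


stage = DecodeStage()
stage.issue(0x0F, 0xF0)          # first instruction: operands a, b
flips = stage.issue(0x0E, 0xF0)  # second instruction: operands c, d
print(flips)                     # bit flips caused in (reg1, reg2)
```

In this toy model, operands of consecutive instructions only interact when they pass through the same register, which is why the operand-to-register mapping matters for the leakage.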
And for the decode stage, everything before the register file is still not data dependent. But after the register file, we have several read ports. The obvious question is, for each instruction, which operand goes to which read port, which will further affect what kind of leakage one might see. We have to test this, because there's no way we can get it from the assembly code.
So we test with this customised code, which is quite simple. With the first XOR, we send A and B into the micro-architecture. Then, in the next instruction, the target instruction, I send, say, C and D. Then I would like to observe whether I can see an interaction, or you could say bit flips, between A and C or between B and C. If we find something between A and C, it probably means A and C share the same read port; if we find something between B and C, that means B and C share the same read port.
Below we have four graphs. In each of them, if you find something above the dashed line, that means we have observed a significant contribution from this interaction, whether it's A-C or B-D: A-C is the blue line and B-D is the red line. For both the two-operand addition and the multiplication, we see both A-C and B-D, which is quite normal. If you have a one-operand addition, for example with one immediate number, then you only see A-C; there's no B-D. But for three-register instructions, such as additions with three registers, what we find is that three of them are well below the line and only B-D is a clear interaction. In this case we cannot necessarily tell which operand goes through which port; we assume A and E share the same port, and there is still something that needs to be fetched from another port, so we assume there is a third port. This might not be correct, it might be something caused by physical effects such as glitches, but it is the best guess we can get. There are also some instructions that don't load anything.
For example, there are load instructions. In our tests they just don't show any interaction, because they don't really load any operand.
Okay, the next step is from decode to execute. We know that after prefetching the operands, we send them to the pipeline registers. We assume there are two pipeline registers here, say RS1 and RS2. The next question is, first of all, which operand goes where? It could go from here to here, or from here to here, so it could go either way. There is also another issue, because we are targeting registers here: the control signal can tell the register, please don't update your value, please reject whatever comes to your door and keep your previous value. This is not possible with buses, but it is possible with registers. So we would also like to know whether RS1 and RS2 are updated or not. I'll skip all the technical details and directly tell you our results. This is our previous result on which data goes to which read port, and this is our current result on which data enters which register. A sign like this means the register will not be updated; it will keep its previous value.
The last part of our analysis is the memory subsystem, which is kind of a headache. This part is often ignored by most existing tools, but actually for a fair reason. If you think about it, the memory system, although it contributes a lot to the leakage, actually lies a bit far away from the core. This is a graph from ARM. Everything we have analysed so far lies within this blue block; it's only a part of this block, and this is the core. And where is the memory? The memory is not this memory protection unit. The memory is usually connected through here, something like this. So the memory actually lies far away from the core. And to make it worse, the memory is kind of self-timed.
So it's not that the core says, please fetch me this operand, and the memory responds in constant time. The memory can say, sorry, please wait for me. That means how the memory responds cannot be predicted by a simple instruction emulator; you might also need a memory emulator to know what's actually happening there. This might be a problem for our completeness test, because our completeness test relies on synchronising what's happening on your trace with what's actually executing in your micro-architecture. If it's impossible to synchronise the timing, if the memory can say wait for me one cycle here and wait for me five cycles there, then it is an obvious problem. So here we did pretty much what the previous works have been doing: relying on the existing knowledge. For example, memory accesses are always word-wise, like I said in my last talk, and some specifications suggest that perhaps there is one data bus which is shared between read and write, there is a shared address bus, and there is also an additional write buffer. This is of course far from ideal.
Now let's see how we should build the leakage models. We already know how the data flows in the micro-architecture. So first of all, how should we generally build leakage models for a circuit? In a circuit we have several components, for example buses and registers. Usually we assume a bus leaks its current value, so usually we assume Hamming weight. And when there are bit flips, we assume it might also have Hamming distance leakage; especially for registers, we always believe that a flip causes Hamming distance leakage. What's happening here in the graph is: if the previous value is a' and now we have a new value a, then we assume the leakage can come from a' flipping to a. And we take a conservative approach, assuming a' and a leak jointly.
So it leaks both of them, and it can include any sort of interaction, including the Hamming distance, but not restricted to the Hamming distance. For any bus or register, we always assume that the leakage is the previous value times the current value, so jointly leaking. But for combinational logic, this is much more complicated, because you have multiple inputs, and this also creates all sorts of glitches, which are quite difficult to predict. We again go for conservative modelling: we assume the leakage depends on all of its previous inputs times all of its current inputs, all of them jointly leaking.
Again, for fetch, we already said everything here is not really data dependent, so we ignore it. But for decode, after the register file, everything is data dependent, so we have to consider it. We already know what's on D.5 to D.7; all of them are buses, three buses. So we just say that for each of them, the previous value times the current value is jointly leaking. For the execute stage, we know everything here is one big combinational circuit, so there is no way to restrict what the leakage might look like. We just allow all the previous inputs times the current inputs. The memory has three parts: the address bus, the read/write bus, and the additional write buffer. We assume each of them leaks jointly on its own. All together, we add all three of them together, assume this is our overall model, and verify its quality with the completeness test from my last talk.
We have six instructions here. If something is above the dashed line, that means you're missing something. For most instructions, we get something okay. But for the specific addition instructions, we are missing something.
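As an editorial aside, the conservative "previous value times current value" assumption described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model from the paper: the helper names (`joint_state`, `joint_state_count`) and the 2-bit word size are hypothetical choices made to keep the table small. The sketch shows that Hamming distance and Hamming weight are just two particular ways of filling a table indexed by the joint state, and that the joint state space grows quickly once a component has several inputs.

```python
from itertools import product


def hamming_weight(x: int) -> int:
    """Number of set bits in x."""
    return bin(x).count("1")


def hamming_distance(prev: int, cur: int) -> int:
    """Number of bits that flip when prev is overwritten by cur."""
    return hamming_weight(prev ^ cur)


def joint_state(prev: int, cur: int, width: int) -> int:
    """Encode the pair (prev, cur) as one index: the 'prev times cur' state."""
    return (prev << width) | cur


def joint_state_count(width: int, n_inputs: int = 1) -> int:
    """Number of joint states for n_inputs values of `width` bits each,
    counting both the previous and the current copy of every input."""
    return 2 ** (2 * width * n_inputs)


WIDTH = 2  # hypothetical tiny word size so the full table stays readable
values = range(1 << WIDTH)

# A joint model is any table over all (prev, cur) pairs.
# Hamming distance and Hamming weight are two ways to fill that table.
hd_table = {joint_state(p, c, WIDTH): hamming_distance(p, c)
            for p, c in product(values, repeat=2)}
hw_table = {joint_state(p, c, WIDTH): hamming_weight(c)
            for p, c in product(values, repeat=2)}

print(len(hd_table))                # one entry per (prev, cur) pair
print(joint_state_count(WIDTH, 2))  # joint states of a two-input component
```

The design choice here is deliberately pessimistic: a table over the joint state can represent any interaction between the old and new value, at the cost of a state space that is exponential in the number of bits involved.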
I will skip all the technical details here and directly tell you that this is what I would call a glitchy register access. It means that glitches in your decoder cause an incorrect register access, which shouldn't really happen according to the functionality. But it happens and causes some leakage. If you add this in, you will find this line below the threshold.
All right, let's now go back to our example from the beginning. Now we have reverse engineered how data flows through the micro-architecture of your core and how the leakage behaves. So let's go back and see, for cycle 9 and cycle 15, why they leak and why ELMO or MAPS fails on one or both of them. For cycle 9, our explanation is that it leaks the Hamming distance of the ALU output. This is a bus Hamming distance; it's not really a register. Of course MAPS won't find this, because MAPS doesn't care about buses. As for ELMO, ELMO's model only takes the two input operand buses and there's no ALU output bus, so this is also not in ELMO. For cycle 15, this is the Hamming distance of the pipeline registers. As we said before, MAPS takes care of all the pipeline registers, so of course MAPS finds this, but ELMO couldn't.
Okay, let's briefly summarise what we have achieved in this paper. We have successfully reverse engineered the micro-architecture of an M3 core. Again, this is a leakage-wise equivalent version; it's not really comparable to a core reverse engineered from binary code or from a netlist. We built a micro-architecture-enhanced leakage model and showed its impact on various masking implementations. We only talked about one of them, but there are more in our paper. Currently there are a few things we haven't worked through, a few things we have touched on but that are not yet at a mature stage. For example, the cycle-accurate memory emulator: we hope maybe the memory manufacturer can help us with this.
We can also use the reverse engineering information to exploit more subtle micro-architectural leaks; this is ongoing work. I have done some higher-order testing, but it is not really at a mature stage; almost all of the experiments in this talk are basically first order. We are also working on a flexible framework that works not only for the ARM architecture but perhaps, in the future, also for the RISC-V architecture. And last but not least, the leakage model we get here works not only for leakage detection but also for formal verification, but we haven't gone very far in that direction.
All right, that concludes my talk. If you have any questions, please ask me during the Eurocrypt live session. Thank you for listening.