He will talk about on-device power analysis across hardware security domains; this is joint work with Alex Dewar. Thank you very much.

So thank you for coming to the first talk, and for finding the new venue after the event last night. I'm going to be talking about work I did with Alex Dewar at Dalhousie University, which, if you remember, is where SAC 2020 is being held, so consider it. In the next 17 minutes I'm not talking about time travel, unfortunately, or anything that interesting. I'm going to introduce a bit of remote and cross-domain attacks and how they apply to embedded systems. I'm going to look at TrustZone-M, which is a new variant of TrustZone, and the attacker model that I'm considering. We're then going to attack a specific implementation, so we need to move to hardware; there was one recent device that had come out at the time, and that's the one I selected for the TrustZone-M work. And we'll look at some of the attacks that are possible once you're doing this on-device power analysis, and how we can cross hardware security boundaries with it.

The basic idea of on-device power analysis, which you probably saw yesterday actually, is that many of these devices have an ADC or some sort of analog measurement circuitry (there are quite a few variants of how this is done), and this onboard analog measurement circuitry can be used to actually measure power as the device is running crypto. Or, as always, the Simpsons did it first: stop hitting yourself.

So TrustZone-M: you've probably heard of TrustZone, which normally refers to TrustZone-A, the Cortex-A version. TrustZone-M is another hardware security boundary mechanism that targets really low-cost devices, Cortex-M type devices. And the basic idea, right, is similar to the other TrustZones: you have stuff like protocol processing, you have third-party libraries.
Something that's pretty untrustworthy, where you're worried about getting remote code execution. What you want to stop is someone who achieves remote code execution on these very complicated stacks from moving into the secure space. So you have a very straightforward interface from the non-secure to the secure space, you can validate very well what the secure code is doing, and it helps add another layer of defense. It's quite useful when we start talking about real devices, where we don't have the ability to fully validate all of the non-secure code, for cost reasons or similar.

What this looks like in practice: what the code itself is doing in this example is very basic. This little block here would be in the non-secure space. It calls into the secure space, so we have this non-secure callable function, which calls the secure function, in this case just a simple crypto call. And during the crypto call, we're running an ADC, which we can then sample afterwards. So that's the very basic idea.

The specific implementation I used: unfortunately you need to choose some device, and thank you Microchip for sponsoring CHES, but that had nothing to do with the selection here. The SAM L11 was one of the first Cortex-M23 cores available on the market; this was in June 2018, so it was the obvious device to select when this work started. The other thing that made it an interesting target originally is that some of the original datasheets had claims, or talked about, side-channel and fault-injection protection. That's since been changed, but at the time it also made it an interesting device to look at.

The product usage, so what's the exact threat model here? When I started this there was nothing, because it was a brand-new device; nothing on the market that I could find anywhere was using the SAM L11.
So I looked at this generic product design, and this is backed up by datasheet examples; this is what I showed back here, this split between secure and non-secure. As an example, the datasheet has a note that a configuration may make the ADC, the timers, and the event system available to the non-secure application. So it's not unreasonable to assume, you know, that the ADC is something that may be available to an attacker in the non-secure space.

It's important to note that for all of this to be valid, an attacker has to have previously achieved something in the non-secure space. So we assume code execution in that non-secure space, which you would not normally have. When we talk about these cross-domain attacks, this is why I'm calling them cross-domain: you have to have some exploit into the non-secure space for this to be possible.

I'm also going to emphasize that we're using a lot of data to do this; we'll see it's something like five gigabytes of data that ends up being used in the analysis, or 160 million encryptions. So you may have the question of whether this is remote or not. That's why, when I talk about the threat model, I'm also going to introduce the idea of something a little more like a quasi-remote attack. An example of where that's a threat, right, is something like unlocking an ECU, where there's a commercial incentive to do so. But the question is whether that commercial incentive holds if someone has to do an actual DPA attack on every single unit. If there are unique keys per device, and we're talking about doing DPA attacks on a per-unit basis, this is quite expensive. But if we talk about plugging in a box that has debugger access, or a serial port, or Ethernet access to the device, and there's an exploit to get execution in the non-secure space, it does become pretty reasonable, I think, to consider this a commodification of a side-channel attack that becomes relevant to people.

All right, so what does this build on?
To begin with, there are quite a few things when we talk about cross-domain attacks on TrustZone-A and similar previous hardware security domain splits. Cache timing attacks and general timing attacks have been used extensively in this area, and there's also the idea of remote fault attacks that are available on TrustZone-A, or at least have been demonstrated on TrustZone-A. One of the issues with Cortex-M specifically is that these devices often lack a true cache, so the cache timing attacks can be difficult; they may have a data cache, but they often lack a true instruction cache. So you can't execute the same types of attacks on Cortex-M or TrustZone-M devices.

There is previous work on side-channel power analysis with this remote threat model: there are several works on shared FPGA fabric, where an on-board voltage monitor circuit is built, and you would have seen yesterday as well this general idea of using the onboard ADC of a microcontroller to run this type of attack. As a note, again, it may require a very large amount of data to be transferred out to actually execute this power analysis attack.

There's also related work on nearby side-channel attacks. In this case we have some sort of analog measurement, but it's not at the board level. Again, this is interesting for the sort of threat model this whole framework fits into. We've had, for quite some time, the idea of measuring the power on an IO pin. This leaks information, and the IO pin is something you could likely access from outside the device. There are various works on band-limited attacks, either on RSA or other asymmetric crypto, and we also have some examples on AES specifically, both through a radio receiver as well as going through a switch-mode power supply. Both of these are looking at cases where you don't have as much information available as in classic onboard attacks.
All right, so to see what the effect of the onboard attack is, we begin with a classic side-channel setup. We have the micro that we're using, we have a shunt resistor, and we're just measuring the power consumption across it. Down here we're just using a programmer, at this point only for reprogramming the device. And you get the expected power trace out of this. This is the AES hardware accelerator running, and you can see some functions are accelerated and supported, probably by some sort of ROM code. You get a leakage of the last-round S-box output; it's a Hamming-weight leakage towards the last round. This is for each of the bytes, and it's basically what you expect, right, in terms of the correlation output, and it roughly falls at the end of that section I showed you.

When we do this onboard, we're going to have much more limited data available to us than with the external device. So I took these traces and then investigated how much data we could remove and still have a successful attack. The first thing I compared on this specific device is the bit depth. It's expected that the onboard ADC is going to have a very, very limited number of bits of actually useful data. So we can take the external data, reduce the number of bits present in it, re-perform the attack, and see how that works. What I have written here is that it's a 10-bit ADC that I'm using externally. There's an effective bit depth, because the scaling isn't perfect: for example, when I go down to making it a 2-bit ADC, it's in fact an effective bit depth of around 1.6, because I haven't scaled it to reach the full input range exactly. So that's where the effective bit depth comes from. The results I'm using here show partial guessing entropy (PGE), averaged over several trials per byte, so you'll see the 16 bytes here.
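As a rough illustration of that bit-depth-reduction step, here is a minimal numpy sketch (my own, not the talk's code) that requantizes captured 10-bit traces down to an n-bit ADC; the function and variable names are mine:

```python
import numpy as np

def quantize(traces, n_bits):
    """Requantize captured traces to an n-bit ADC.

    traces: 2-D array (n_traces x n_samples) of raw 10-bit samples.
    Values are mapped onto 2**n_bits levels spanning the observed range;
    if the traces don't span the full ADC input range, the effective bit
    depth is slightly lower than n_bits (e.g. ~1.6 bits for n_bits=2,
    as mentioned in the talk).
    """
    lo, hi = traces.min(), traces.max()
    levels = 2 ** n_bits
    # Scale to [0, levels), truncate to integer codes, clip the top value
    codes = ((traces - lo) / (hi - lo) * levels).astype(int)
    return np.clip(codes, 0, levels - 1)

# Example: fake 10-bit traces (5 traces of 100 samples)
rng = np.random.default_rng(0)
raw = rng.integers(0, 1024, size=(5, 100))
q4 = quantize(raw, 4)
assert q4.min() == 0 and q4.max() == 15
```

The reduced-depth traces would then simply be fed back through the same CPA attack to see how the guessing entropy degrades.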
And with the original, or this is already showing 8 bits, so reduced from 10 to 8 bits, you can see that after about 5,000 traces it's basically recovering the key. If we go down to a 4-bit ADC here, it's a little worse, but you're still getting quite good results after 5,000 traces. Once we get to the 3-bit range, you can see it's about 25,000 traces before it recovers all the information, but still reasonably plausible. And effectively at 2 bits, we've lost the information available to us.

The other issue is going to be the sample-rate reduction. The internal ADC at maximum runs at 1/26 of the clock, because, as you can see, it samples the input and then resolves each of the bits (it's a 10-bit ADC), and then repeats. So we're going to have a massive reduction in our sampling rate. It's important to note that because the onboard ADC clock is derived from the main clock, we still have it synchronized. Yes, it's running much slower, but it's not the same as if I used a standard external scope at 1/26 of the clock rate. The basic reason is that even though it's only taking one sample every 26 cycles, the ADC samples are still synchronized to the internal clock, the clock driving the AES core. If we had an asynchronous clock, the sample point would slowly shift over time, so I wouldn't expect remotely as good results in that case.

So if I take this data and reduce the sample rate, what does it look like? Initially I'm oversampling: the external ADC we're using for the measurement runs four times faster than the clock, and you see in that case, with the 10-bit ADC, it's 1,000 or 2,000 traces that it can recover the key in. If we go to 1-in-4 downsampling, so it's now running four times slower than the internal clock, you can see it fully recovers in about 10,000 traces. And we can keep going down.
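The synchronized downsampling described above can be sketched in a couple of lines; this is a toy reconstruction with names I've made up, but it captures the key point that the kept sample always lands on the same phase of the core clock:

```python
import numpy as np

def downsample_sync(trace, factor=26, offset=0):
    """Keep one sample every `factor` clock cycles, starting at `offset`.

    Because the real ADC clock is derived from the core clock, the kept
    sample always falls on the same phase of the clock driving the AES
    core; with an asynchronous clock the sample point would drift and
    the attack would degrade badly.
    """
    return trace[offset::factor]

trace = np.arange(260)                 # stand-in for one captured power trace
sub = downsample_sync(trace, 26, offset=3)
assert len(sub) == 10
assert sub[0] == 3 and sub[1] == 29    # samples 26 cycles apart, same phase
```

Sweeping `offset` from 0 to 25 is exactly the cycle-offset experiment mentioned later in the talk.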
So even down to 1/26, which is how fast the internal ADC is running, you can see it's still recovering information. Even though it takes quite a few more traces, it seems plausible.

Part two: once we knew it seemed possible, we moved to an actual onboard attack. In this case there's no external measurement. We're using a Segger for the JTAG data transfer to get 1,000 or so traces per second out: recording with the internal ADC and downloading to the computer. This was done on four test boards that, as the original hypothesis, were expected to go from least difficult to most difficult. The least difficult one has an external amplifier measuring the signal and feeding it to the internal ADC. As a note, you'll see the shunt resistor out here. This works because there's an internal regulator that's hard-connected to the digital logic. These regulators don't react quickly to transients, though, so they normally have an external decoupling capacitor. Because of that, we can put a shunt in line with that external decoupling capacitor and still get a fairly good signal. There's a bit of noise, of course, because the regulator is going to be recharging that capacitor, so this obviously adds some noise to the system, but it's still fairly effective for the analysis.

And you can see here, now with that 1/26 clock, it's very similar to the external ADC data that I had downsampled: about 200,000 traces to fully recover the key, and the PGE reduction is pretty significant much earlier than that. We did look quickly at using a cycle offset: you can imagine we take every 26th sample, so what happens if we move that offset through? As we move it through, you basically see some changes in which bytes are recovered. At the end of the day, though, it didn't really help.
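For reference, "fully recovering the key" here is measured by partial guessing entropy: the rank of the correct key byte in the CPA scoring, using the last-round S-box Hamming-weight model from earlier. Below is my own minimal reconstruction of that metric for a single key byte; none of this is the talk's actual code, and the S-box is built programmatically rather than hard-coded:

```python
import numpy as np

# Build the AES S-box (GF(2^8) inverse + affine map), then invert it.
def _gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B          # AES reduction polynomial x^8+x^4+x^3+x+1
        b >>= 1
    return r

def _gf_inv(a):
    # Brute-force inverse in GF(2^8); fine for a one-off table build
    return next(b for b in range(256) if _gf_mul(a, b) == 1) if a else 0

SBOX = np.zeros(256, dtype=np.uint8)
for x in range(256):
    v = _gf_inv(x)
    b = 0x63                    # affine constant
    for i in range(8):
        bit = ((v >> i) ^ (v >> ((i + 4) % 8)) ^ (v >> ((i + 5) % 8))
               ^ (v >> ((i + 6) % 8)) ^ (v >> ((i + 7) % 8))) & 1
        b ^= bit << i
    SBOX[x] = b
INV_SBOX = np.argsort(SBOX)     # inverse permutation of the S-box

HW = np.array([bin(v).count("1") for v in range(256)])

def rank_key_byte(traces, ct_bytes, true_key):
    """CPA on one key byte: correlate HW(InvSbox(ct ^ guess)) against every
    sample point and return the rank (PGE) of the true key byte; 0 means
    the correct byte scored highest (no additional guessing needed)."""
    scores = np.zeros(256)
    t = traces - traces.mean(axis=0)
    for guess in range(256):
        model = HW[INV_SBOX[ct_bytes ^ guess]].astype(float)
        m = model - model.mean()
        corr = m @ t / (np.linalg.norm(m) * np.linalg.norm(t, axis=0) + 1e-12)
        scores[guess] = np.abs(corr).max()
    return int(np.argsort(scores)[::-1].tolist().index(true_key))
```

Averaging this rank over the 16 key bytes (and several trials) gives the PGE curves shown on the slides.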
You're better off just recording 400,000 traces at one offset than taking 200,000 at one offset and 200,000 at another, or something like that. But that was one hypothesis that might have mattered.

Board B removes the amplifier but still has the external shunt, and we basically see it now takes more traces because we don't have the amplifier. Boards C and D didn't show any different results from the development kit, so I'll summarize them as the development-kit attack. That attack basically looked like this: the board had JTAG connected to it, and we were doing all the measurements off the standard development board. What you can see is that to fully recover the key, with a PGE of zero, so no additional guessing, takes basically around 160 million traces recorded to the computer.

As a note, you might wonder: what about TVLA testing, shouldn't you have done that first? Yes, and you do see TVLA leakage. We were initially a little worried that, due to the really strong downsampling, we couldn't correctly focus on the middle third of the algorithm, and that this leakage was coming not from the crypto core but just from handling the output. But the t-test peak did align with where the actual CPA peak was in the data.

As future work, there's a switch-mode power supply that you can turn on. These are three overlapped traces here, and you can see the really strong ramp from the switch-mode power supply. To get rid of that, we used a high-pass filter. The high-pass filter did give a t-test result that looked promising, but we weren't able to recover the full key within about 240 million traces, which is where we stopped the acquisition. You can divide the trace count by the 1,000 traces per second; it's something like 40 hours to record this data, for example.

All right, so: cross-domain attacks are very applicable to the real world.
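The talk doesn't specify which high-pass filter was used; as a minimal numpy-only sketch of the idea, one can subtract a moving-average baseline (a crude high-pass) to remove the slow switch-mode ramp before running the t-test or CPA. The window size here is a guess of mine, not the talk's setting:

```python
import numpy as np

def remove_ramp(trace, window=101):
    """Crude high-pass: subtract a moving-average baseline so that slow
    components like a switch-mode power supply ramp are removed, while
    the fast cycle-to-cycle leakage is preserved."""
    kernel = np.ones(window) / window
    baseline = np.convolve(trace, kernel, mode="same")
    return trace - baseline

# Toy example: a slow ramp plus a small fast "leakage" component
t = np.arange(2000)
trace = 0.01 * t + np.sin(2 * np.pi * t / 8)
flat = remove_ramp(trace)
# The ramp is largely gone: the filtered trace sits near zero on average
assert abs(flat.mean()) < abs(trace.mean())
```

A proper implementation would more likely use a designed IIR/FIR high-pass applied to every trace, but the preprocessing role in the attack pipeline is the same.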
They do require a lot of data, though. As countermeasures, you can move the ADC to the secure world, or you can do environment validation, some of which is available. So with that, I'll take any questions now. As a reference too, if you would like any of the data, there are about 285 gigabytes of data files available for you as well. Thank you very much.

Thank you. Are there questions? Yes.

Hi Colin, thanks for the talk. Just a quick one: the key you're exploiting, is this a user-supplied key, or are you breaking one of the keys, for example a device-unique key used to derive all other keys, or some pre-shared kind of secret?

So this device doesn't, and I could be wrong, I'm going by memory here. We were basically using a key that was loaded in the secure space, so it would be up to whoever's using that secure code. I don't think there's a root key programmed into this specific device by Atmel or anything like that.

More questions? I have a question, actually. Can't you just isolate the two power domains from each other, the secure zone and the insecure zone?

So everything on this device is on the one power supply. TrustZone-M doesn't have separate secure and non-secure cores; it just switches the same core between a secure and a non-secure mode.

Would that be a valid countermeasure?

Well, no. The issue is that both are always supplied by the same power, so very fundamentally in the design they're connected together. I mean, of course, if it were designed differently then yes, but in this case you can't.

If there are no further questions, then let's thank Colin again.