Hello, fellow creatures. Welcome. And I want to start with a question: who do we trust? Do we trust the trust zones on our smartphones? Well, Keegan Ryan, we're really fortunate to have him here. And he was inspired by another talk from the CCC before, I think it was 29C3. And his research on smartphones and the systems on a chip used in smartphones will answer this question of whether you can trust those trusted execution environments. Please give a warm round of applause to Keegan and enjoy. Hi, thank you. So I'm Keegan Ryan. I'm a consultant with NCC Group. And this is Microarchitectural Attacks on Trusted Execution Environments. So in order to understand what a trusted execution environment is, we need to go back into processor security, specifically on x86. As many of you are probably aware, there are a couple of different modes under which we can execute code on x86 processors. That includes ring three, which is the user code and the applications, and also ring zero, which is the kernel code. Now, there are also a ring one and a ring two that are supposedly used for drivers or guest operating systems. But really, it just boils down to ring zero and ring three. And in this diagram we have here, we see that privilege increases as we go up the diagram. So ring zero is the most privileged ring, and ring three is the least privileged ring. So all of our secrets, all of our sensitive information, all of the attacker's goals are in ring zero. And the attacker is trying to access those from the unprivileged world of ring three. Now, you may have a question: what if I want to add a processor feature that I don't want ring zero to be able to access? Well, then you add ring minus one, which is often used for a hypervisor. Now, the hypervisor has all the secrets, and the hypervisor can manage different guest operating systems. And each of those guest operating systems can execute in ring zero without having any idea of the other operating systems.
So this way, now the secrets are all in ring minus one. So now the attacker's goals have shifted from ring zero to ring minus one. And the attacker has to attack ring minus one from a less privileged ring and try to access those secrets. But what if you want to add a processor feature that you don't want ring minus one to be able to access? So you add ring minus two, which is system management mode. And that's capable of monitoring power, directly interfacing with firmware and other chips on the motherboard. And it's able to access and do a lot of things that the hypervisor is not able to. And now all of your secrets and all of your attacker goals are in ring minus two. And the attacker has to attack those from a less privileged ring. Now maybe you want to add something to your processor that you don't want ring minus two to be able to access. So you add ring minus three. And I think you get the picture now. We just keep on adding more and more privileged rings and keep putting our secrets and our attacker's goals in these higher and higher privileged rings. But what if we're thinking about it wrong? What if instead we want to put all the secrets in the least privileged ring? So this is sort of the idea behind SGX. And it's useful for things like DRM, where you want to run ring three code but have sensitive secrets or other signing capabilities protected while still running in ring three. But this picture is getting a little bit complicated, so let's simplify the diagram a little bit. We'll only be looking at ring zero through ring three, which is the kernel, the user land, and the SGX enclave, which also executes in ring three. Now, when you're executing code in the SGX enclave, you first load the code into the enclave. And then from that point on, you trust the execution of whatever is going on in that enclave. You trust that the other elements, the kernel, the user land, the other rings, are not going to be able to access what's in that enclave.
So you've made your trusted execution environment. This is a bit of a weird model, because now your attacker is in the ring zero kernel. And your target victim here is in ring three. So instead of the attacker trying to move up the privilege chain, the attacker is trying to move down, which is pretty strange. And you might have some questions, like under this model, who handles memory management? Because traditionally that's something that ring zero would manage, and ring zero would be responsible for paging memory in and out for different processes and different code that's executing in ring three. But on the other hand, you don't want that to happen with the SGX enclave, because what if the malicious ring zero adds a page to the enclave that the enclave doesn't expect? So in order to solve this problem, SGX does allow ring zero to handle page faults. But simultaneously and in parallel, it verifies every memory load to make sure that no access violations are made, so that all of the SGX memory is safe. So it allows ring zero to do its job, but it sort of watches over it at the same time to make sure that nothing is messed up. So it's a bit of a weird, convoluted solution to a strange inverted problem, but it works. And that's essentially how SGX works and the idea behind SGX. Now, we can look at x86, and we can see that ARMv8 is constructed in a similar way. But it improves on x86 in a couple of key ways. So first of all, ARMv8 gets rid of ring one and ring two, so you don't have to worry about those. It just has different privilege levels for user land and the kernel. And these different privilege levels are called exception levels in the ARM terminology. And the second thing that ARM gets right, compared to x86, is that instead of starting at three and counting down as privilege goes up, ARM starts at zero and counts up. So we don't have to worry about negative numbers anymore.
Now, when we add the next privilege level, the hypervisor, we call it exception level two. And the next one after that is the monitor in exception level three. So at this point, we still want to have the ability to run trusted code in exception level zero, the least privileged level of the ARMv8 processor. So in order to support this, we need to separate this diagram into two different sections. And in ARMv8, these are called the secure world and the non-secure world. So we have the non-secure world on the left in blue that consists of the user land, the kernel, and the hypervisor. And we have the secure world on the right, which consists of the monitor in exception level three, a trusted operating system in exception level one, and trusted applications in exception level zero. So the idea is that if you run anything in the secure world, it should not be accessible or modifiable by anything in the non-secure world. So that's what our attacker is trying to access. The attacker has access to the non-secure kernel, which is often Linux, and they're trying to go after the trusted apps. So once again, we have this weird inversion, where we're trying to go from a more privileged level to a less privileged level and trying to extract secrets in that way. So the question that arises when using these trusted execution environments, as implemented in SGX and in TrustZone on ARM, is: can we use these privileged modes and our privileged access in order to attack these trusted execution environments? Now to answer that question, we can start looking at a few different research papers. The first one that I want to go into is one called CLKSCREW, and it's an attack on TrustZone. So throughout this presentation, I'm going to go through a few different papers. And just to make it clear which papers have already been published and which work is new, I'll include the citations in the upper right hand corner, so that way you can tell what's old and what's new.
And as far as papers go, this CLKSCREW paper is relatively new. It was released in 2017. And the way CLKSCREW works is it takes advantage of the energy management features of a processor. So a non-secure operating system has the ability to manage the energy consumption of the different cores. So if a certain target core doesn't have much scheduled to do, then the operating system is able to scale back the voltage or dial down the frequency on that core. So that core uses less energy, which is a great thing for performance. It really extends battery life. It makes the cores last longer. And it gives better performance overall. But the problem here is, what if you have two separate cores and one of your cores is running this non-trusted operating system? And the other core is running code in the secure world. It's running that trusted code, those trusted applications. So that non-secure operating system can still dial down that voltage. And it can still change that frequency. And those changes will affect the secure world code. So what the CLKSCREW attack does is the non-secure operating system core will dial down the voltage and overclock the frequency on the target secure world core in order to induce faults, to make the computation on that core fail in some way. And when that computation fails, you get certain cryptographic errors that the attacker can use to infer things like secret AES keys, and to bypass code signing implemented in the secure world. So it's a very powerful attack that's made possible because the non-secure operating system is privileged enough to use these energy management features. Now CLKSCREW is an example of an active attack, where the attacker is actively changing the outcome of the victim code, of that code in the secure world. But what about passive attacks? So in a passive attack, the attacker does not modify the actual outcome of the process.
The attacker just tries to monitor that process and infer what's going on. And that is the sort of attack that we'll be considering for the rest of the presentation. So in a lot of SGX and TrustZone implementations, the trusted and the non-trusted code both share the same hardware. And this shared hardware could be a shared cache. It could be a branch predictor. It could be a TLB. The point is that they share the same hardware, so the changes made by the secure code may be reflected in the behavior of the non-secure code. So the trusted code might execute and change the state of that shared cache, for example. And then the untrusted code may be able to go in, see the changes in that cache, and infer information about the behavior of the secure code. So that's essentially how our side channel attacks are going to work: the non-secure code is going to monitor these shared hardware resources for state changes that reflect the behavior of the secure code. Now we've already talked about how Intel SGX addresses the problem of memory management and who's responsible for making sure that those attacks don't work on SGX. So what do they have to say on how they protect against these side channel attacks and attacks on this shared cache hardware? They don't, at all. They essentially say that they do not consider this part of their threat model, and that it is up to the developer to implement the protections needed to protect against these side channel attacks. Which is great news for us, because these side channel attacks can be very powerful, and if there aren't any hardware features that are necessarily stopping us from being able to accomplish our goal, it makes us that much more likely to succeed. So with that, we can take a step back from TrustZone and SGX and just take a look at cache attacks in general, to make sure that we all have the same understanding of how the cache attacks will be applied to these trusted execution environments.
And to start that, let's go over a brief recap of how a cache works. So caches are necessary in processors because accessing the main memory is slow. When you try to access something from the main memory, it takes a while to be read into the processor. So the cache exists as sort of a layer to remember what that information is. So if the processor ever needs information from that same address, it just reloads it from the cache, and that access is going to be fast. So it really speeds up the memory access for repeated accesses to the same address. And then if we try to access a different address, then that will also be read into the cache, slowly at first, but then quickly for repeated accesses, and so on and so forth. Now, as you can probably tell, for all of these examples the memory blocks have been moving horizontally. They've always been staying in the same row. And that is reflective of the idea of sets in a cache. So there are a number of different set IDs, and those correspond to the different rows in this diagram. So for our example, there are four different set IDs, and each address in the main memory maps to a particular set ID. So that address in main memory will only go into a location in the cache with the same set ID. So we only travel along those rows. So that means if you have two different blocks of memory that map to two different set IDs, they're not going to interfere with each other in the cache. But that raises the question: what about two memory blocks that do map to the same set ID? Well, if there's room in the cache, then the same thing will happen as before. That memory address and those memory contents will be loaded into the cache and then retrieved from the cache for future accesses. And the number of possible entries for a particular set ID within a cache is called the associativity. And on this diagram, that's represented by the number of columns in the cache. So we will call our cache in this example a two-way set associative cache.
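To make the set and associativity picture concrete, here is a minimal sketch of the four-set, two-way cache from the example. This is an illustrative model, not real hardware; eviction from a full set is modeled as random, as the talk assumes.

```python
# Minimal sketch of a 2-way set-associative cache with 4 sets,
# matching the talk's example. An address maps to set (addr % 4),
# and a full set evicts a random way on a miss.
import random

class SetAssocCache:
    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # each set holds up to `ways` cached addresses
        self.sets = [[] for _ in range(num_sets)]

    def access(self, addr):
        """Return 'hit' or 'miss' and update the cache state."""
        s = self.sets[addr % self.num_sets]
        if addr in s:
            return "hit"            # fast path: already cached
        if len(s) == self.ways:     # set is full: evict a random way
            s.remove(random.choice(s))
        s.append(addr)              # slow path: load from main memory
        return "miss"

cache = SetAssocCache()
print(cache.access(5))   # → miss: first access is slow
print(cache.access(5))   # → hit: repeated access is fast
print(cache.access(1))   # → miss: same set ID (1 % 4 == 5 % 4), fills the second way
print(cache.access(9))   # → miss: set is full, so 5 or 1 gets evicted
```

Addresses 5, 1, and 9 all land in set 1, so the fourth access forces an eviction, which is exactly the mechanism the prime and probe attack later exploits.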
Now, the next question is what happens if you try to read a memory address that maps to the same set ID, but all of the entries for that set ID within the cache are full? Well, one of those entries is chosen, it's evicted from the cache, the new memory is read in, and then that's fed to the processor. So it doesn't really matter how the cache entry that you're evicting is chosen. For the purpose of this presentation, you can just assume that it's random. But the important thing is that if you try to access that same memory that was evicted before, you're going to have to wait for that time penalty for it to be reloaded into the cache and read into the processor. So those are caches in a nutshell, in particular set-associative caches. Now we can begin looking at the different types of cache attacks. So for a cache attack, we have two different processes. We have an attacker process and a victim process. And for this type of attack that we're considering, both of them share the same underlying code. So they're trying to access the same resources, which could be the case if you have page deduplication in virtual machines, or if you have copy-on-write mechanisms for shared code and shared libraries. But the point is that they share the same underlying memory. Now, the flush and reload attack works in two stages for the attacker. The attacker first starts by flushing out the cache. They flush each and every address that's in the cache, so the cache is just empty. Then the attacker lets the victim execute for a small amount of time, so the victim might read an address from main memory, loading that into the cache. And then the second stage of the attack is the reload phase. So in the reload phase, the attacker tries to load different memory addresses from main memory and sees if those entries are in the cache or not.
So here the attacker will first try to load address zero and see that, because it takes a long time to read the contents of address zero, the attacker can infer that address zero was not part of the cache, which makes sense, because the attacker flushed it from the cache in the first stage. The attacker then tries to read the memory at address one and sees that this operation is fast. So the attacker infers that the contents of address one are in the cache. And because the attacker flushed everything from the cache before the victim executed, the attacker then concludes that the victim is responsible for bringing address one into the cache. So this flush and reload attack reveals which memory addresses the victim accessed during that small slice of time. And then after that reload phase, the attacker repeats. So the attacker flushes again, lets the victim execute, reloads again, and so on. There's also a variant on the flush and reload attack that's called the flush and flush attack, which I'm not going to go into the details of, but essentially it's the same idea, except instead of using load instructions to determine whether or not a piece of memory is in the cache, it uses flush instructions, because flush instructions will take longer if something is in the cache already. So the important thing is that both the flush and reload attack and the flush and flush attack rely on the attacker and the victim sharing the same memory. But this isn't always the case, so we need to consider what happens when the attacker and the victim do not share memory. And for this we have the prime and probe attack. So the prime and probe attack, once again, works in two separate stages. In the first stage, the attacker primes the cache by reading all the attacker memory into the cache, and then the attacker lets the victim execute for a small amount of time.
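As an aside, the flush and reload loop just described can be modeled with a toy cache, where a load is fast only if the address is already cached. The cycle counts here are made up purely for illustration.

```python
# Hedged sketch of the flush+reload pattern against a simulated cache:
# attacker and victim share addresses, and load time reveals cache state.
cache = set()           # addresses currently in the cache
SLOW, FAST = 100, 10    # illustrative cycle counts, not real timings

def load(addr):
    """Simulated load: fast if cached, slow otherwise; caches addr."""
    t = FAST if addr in cache else SLOW
    cache.add(addr)
    return t

def flush(addr):
    """Simulated flush instruction: remove addr from the cache."""
    cache.discard(addr)

shared_addrs = [0, 1, 2, 3]

# Stage 1: flush every shared address out of the cache
for a in shared_addrs:
    flush(a)

# The victim runs briefly and touches address 1 (a secret-dependent access)
load(1)

# Stage 2: reload each address and time it; fast means the victim touched it
accessed = [a for a in shared_addrs if load(a) == FAST]
print(accessed)   # → [1]: the attacker learns which address the victim used
```

The attacker then flushes again and repeats, building up one column of the trace per iteration.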
So no matter what the victim accesses from main memory, since the cache is full of attacker data, one of those attacker entries will be replaced by a victim entry. Then in the second phase of the attack, during the probe phase, the attacker checks the different cache entries for particular set IDs and sees if all of the attacker entries are still in the cache. So maybe our attacker is curious about the last set ID, the bottom row, so the attacker first tries to load the memory at address three. And because this operation is fast, the attacker knows that address three is in the cache. The attacker tries the same thing with address seven, sees that this operation is slow, and infers that at some point, address seven was evicted from the cache. So the attacker knows that something had to evict it from the cache, and it had to be the victim. So the attacker concludes that the victim accessed something in that last set ID, that bottom row. The attacker doesn't know if it was the contents of address 11 or the contents of address 15, or even what those contents are, but the attacker has a good idea of which set ID it was. So the important things to remember about cache attacks: caches are very important, and they're crucial for performance on processors. They give a huge speed boost, and there's a huge time difference between having a cache and not having a cache for your executables. But the downside is that this big time difference also allows the attacker to infer information about how the victim is using the cache. We're able to use these cache attacks in two different scenarios: where memory is shared, in the case of the flush and reload and flush and flush attacks, and where memory is not shared, in the case of the prime and probe attack. And finally, the important thing to keep in mind is that for these cache attacks, we know where the victim is looking, but we don't know what they see.
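The prime and probe steps above can likewise be sketched against a simulated two-way set-associative cache. The set count, way count, and random eviction are assumptions matching the earlier example, not a real microarchitecture.

```python
# Sketch of prime+probe on one set of a simulated 2-way cache.
# The attacker fills a set with its own lines, lets the victim run,
# then re-loads its lines: a miss on re-load means the victim evicted one.
import random

NUM_SETS, WAYS = 4, 2
sets = [[] for _ in range(NUM_SETS)]   # cached addresses per set

def access(addr):
    """Return True on a hit; on a miss, evict a random way if full."""
    s = sets[addr % NUM_SETS]
    if addr in s:
        return True
    if len(s) == WAYS:
        s.remove(random.choice(s))
    s.append(addr)
    return False

# Prime: the attacker fills set 3 with its addresses 3 and 7
access(3)
access(7)

# The victim runs and touches address 11 (11 % 4 == 3, the same set)
access(11)

# Probe: re-load 3 and 7; at least one probe now misses, so the
# attacker learns the victim touched set 3 (but not which address)
evicted = [a for a in (3, 7) if not access(a)]
print(len(evicted) >= 1)   # → True
```

Note the attacker never learns whether the victim touched address 11 or 15, only the set ID, which matches the "we know where they look, not what they see" point above.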
So we don't know the contents of the memory that the victim is accessing. We just know the location and the addresses. So what does an example trace of these attacks look like? Well, there's an easy way to represent these as two-dimensional images. So in this image, we have our horizontal axis as time, so each column in this image represents a different time slice, a different iteration of the prime, measure, and probe. Then we also have the vertical axis, which is the different set IDs, which is the location that's accessed by the victim process. And then here, a pixel is white if the victim accessed that set ID during that time slice. So as you look from left to right, as time moves forward, you can sort of see the changes in the patterns of the memory accesses made by the victim process. Now, for this particular example, the trace is captured on an execution of AES repeated several times, an AES encryption repeated about 20 times. And you can tell that this is a repeated action because you see the same repeated memory access patterns in the data. You see the same structures repeated over and over. So you know that this is reflecting what's going on throughout time, but what does it have to do with AES itself? Well, if we take the same trace with the same settings, but a different key, we see that there's a different memory access pattern, with different repetition within the trace. So only the key changed, the code didn't change. So even though we're not able to read the contents of the key directly using this cache attack, we know that the key is changing these memory access patterns. And if we can see these memory access patterns, then we can infer the key. So that's the essential idea. We want to make these images as clear as possible and as descriptive as possible, so that we have the best chance of learning what those secrets are. And we can define the metrics for what makes these cache attacks powerful in a few different ways.
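Before moving on to the metrics, the two-dimensional image representation of a trace can be sketched like this. The per-slice hit sets are made-up data standing in for real measurements.

```python
# Sketch of rendering prime+probe results as a 2D trace: rows are
# set IDs, columns are time slices, and a "white" pixel (1) marks a
# victim access. The slice data below is invented for illustration.
slices = [
    {0, 2},   # time slice 0: victim touched sets 0 and 2
    {1},      # time slice 1: victim touched set 1
    {0, 2},   # time slice 2: the pattern repeats
    {1},      # time slice 3
]

image = [[1 if s in hit_sets else 0 for hit_sets in slices]
         for s in range(4)]   # one row per set ID

for row in image:
    print("".join("#" if px else "." for px in row))
# Repeating columns make the victim's repeated access pattern visible,
# which is exactly what the AES traces in the talk show.
```

With a real trace, each repetition of the pattern corresponds to one AES encryption, and a different key produces visibly different columns.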
So the three ways we'll be looking at are spatial resolution, temporal resolution, and noise. So spatial resolution refers to how accurately we can determine the where. If we know that the victim accessed a memory address within 1,000 bytes, that's obviously not as powerful as knowing where they accessed within 512 bytes. Temporal resolution is similar, where we want to know the order of the accesses the victim made. So if that time slice during our attack is one millisecond, we're going to get much better ordering information on those memory accesses than we would get if we only saw all the memory accesses over the course of one second. So the shorter that time slice, the better the temporal resolution, the longer our picture will be on the horizontal axis, and the clearer an image of the cache we'll see. And the last metric to evaluate our attacks on is noise. And that reflects how accurately our measurements reflect the true state of the cache. So right now we've been using timing data to infer whether or not an item was in the cache. But this is a little bit noisy. It's possible that we'll have false positives or false negatives. So we want to keep that in mind as we look at the different attacks. So that's essentially cache attacks in a nutshell. And that's all you really need to understand in order to understand these attacks as they've been implemented on trusted execution environments. And the first particular attack that we're going to be looking at is called a controlled channel attack on SGX. Now this attack isn't necessarily a cache attack, but we can analyze it in the same way that we analyze the cache attacks. So it's still useful to look at. Now, if you remember how memory management occurs with SGX, we know that if a page fault occurs during SGX enclave code execution, that page fault is handled by the kernel. So the kernel has to know which page needs to be paged in for the enclave.
So the kernel already gets some information about what the enclave is looking at. Now in the controlled channel attack, what the attacker does from the non-trusted OS is page almost every other page of the enclave out of memory. So no matter what page that enclave tries to access, it's very likely to cause a page fault, which will be redirected to the non-trusted OS, where the non-trusted OS can record it, page out any other pages, and continue execution. So the OS essentially gets a list of sequential page accesses made by the SGX enclave, all by capturing page faults in the fault handler. This is a very general attack. You don't need to know what's going on in the enclave in order to pull this off. You just load up an arbitrary enclave, and you're able to see which pages that enclave is trying to access. So how does it do on our metrics? First of all, the spatial resolution is not great. We can only see where the victim is accessing within 4,096 bytes, or the size of a full page, because SGX obscures the offset into the page where the page fault occurs. The temporal resolution is good, but not great, because even though we're able to see any sequential accesses to different pages, we're not able to see sequential accesses to the same page, because we need to keep that same page paged in while we let our SGX enclave run for that small time slice. So temporal resolution is good, but not perfect. And as for noise, there is no noise in this attack, because no matter where the page fault occurs, the untrusted operating system is going to capture that page fault and is going to handle it. So it's very low noise, not great spatial resolution, but overall still a powerful attack. But we still want to improve on that spatial resolution. We want to be able to see what the enclave is doing at better than a resolution of one page, of four kilobytes. So that's exactly what the CacheZoom paper does.
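As an aside, the page-fault channel just described can be sketched as a toy model in which the malicious OS keeps only one page resident at a time. The addresses are hypothetical, and the 4,096-byte page size comes from the talk.

```python
# Toy model of the controlled-channel attack: the untrusted OS keeps
# only the current enclave page resident, so every access to a
# *different* page faults, and the fault handler logs a page trace.
PAGE = 4096          # spatial resolution is one full page
resident = set()     # pages currently paged in
fault_log = []       # the sequence of page numbers the OS observes

def enclave_access(addr):
    """Simulate one enclave memory access under the malicious OS."""
    page = addr // PAGE
    if page not in resident:
        fault_log.append(page)   # the fault handler records the page...
        resident.clear()         # ...pages everything else out...
        resident.add(page)       # ...and maps in only this page
    # note: the offset within the page is never visible to the OS

# Hypothetical secret-dependent accesses made by the enclave
for addr in [0x1040, 0x2080, 0x20FF, 0x5000]:
    enclave_access(addr)

print(fault_log)   # → [1, 2, 5]
```

The two back-to-back accesses to page 2 show up only once in the log, which illustrates both limitations from the talk: page-granularity spatial resolution, and no visibility into sequential accesses to the same page.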
And instead of interrupting the SGX enclave execution with page faults, it uses timer interrupts, because the untrusted operating system is able to schedule when timer interrupts occur. So it's able to schedule them at very tight intervals, and it's able to get that small and tight temporal resolution. And essentially what happens is this: the timer interrupt fires, the untrusted operating system runs the prime and probe attack code, resumes execution of the enclave process, and this repeats. This is a prime and probe attack on the L1 data cache. So this attack lets you see what data the enclave is looking at. Now this attack could be easily modified to use the L1 instruction cache. So in that case, you learn which instructions the enclave is executing. And overall, this is an even more powerful attack than the controlled channel attack. If we look at the metrics, we can see that the spatial resolution is a lot better. Now we're looking at a spatial resolution of 64 bytes, or the size of an individual cache line. The temporal resolution is very good. It's "almost unlimited," to quote the paper, because the untrusted operating system has the privilege to keep scheduling those timer interrupts closer and closer together, until it's able to capture very small time slices of the victim process. And the noise itself is low. We're still using a cycle counter to measure the time it takes to load memory in and out of the cache, but it's usable: the chances of having a false positive or a false negative are low. So the noise is low as well. Now we can also look at TrustZone attacks, because so far the passive attacks that we've looked at have been against SGX. And those attacks on SGX have been pretty powerful. So what are the published attacks on TrustZone? Well, there's one called TruSpy, which is kind of similar in concept to the CacheZoom attack that we just looked at on SGX. It's once again a prime and probe style attack on the L1 data cache.
And the difference here is that instead of interrupting the victim code execution multiple times, the TruSpy attack does the prime step, lets the full AES encryption run, and then does the probe step. And the reason they do this is because, as they say, the secure world is protected and is not interruptible in the same way that SGX is interruptible. But even despite this, just having one measurement per execution, the TruSpy authors were able to use some statistics to still recover the AES key despite the noise. And their methods were so powerful, they were able to do this from an unprivileged application in user land. So they don't even need to be running within the kernel in order to be able to pull off this attack. So how does this attack measure up? The spatial resolution is once again 64 bytes, because that's the size of a cache line on this processor. And the temporal resolution is pretty poor here, because we only get one measurement per execution of the AES encryption. This is also a particularly noisy attack, because we're making the measurements from user land. But even if we make the measurements from the kernel, we're still going to have the same issues of false positives and false negatives associated with using a cycle counter to measure membership in a cache. So we'd like to improve this a little bit. We'd like to improve the temporal resolution, so that the power of the cache attack on TrustZone is a little bit closer to what it is on SGX. So let's dig into that statement a little bit, that the secure world is protected and not interruptible. And to do this, we go back to this diagram of ARMv8 and how TrustZone is set up. So it is true that when an interrupt occurs, it is directed to the monitor. And because the monitor operates in the secure world, if we interrupt secure code that's running at exception level zero, we're just going to end up running secure code at exception level three.
So this doesn't necessarily get us anything. I think that's what the authors mean by saying that it's protected: just by sending an interrupt, we don't have a way to redirect our control flow to the non-trusted code. At least, that's how it works in theory. In practice, the Linux operating system running in exception level one in the non-secure world kind of needs interrupts in order to be able to work. So if an interrupt occurs and it's being sent to the monitor, the monitor will just forward it right to the non-secure operating system. So we have interrupts just the same way as we did in CacheZoom. And we can improve the TrustZone attacks by using this idea. We have two cores, where one core is running the secure code and the other core is running the non-secure code. And the non-secure code is sending interrupts to the secure world core. And that will give us that interleaving of attacker process and victim process that will allow us to have a powerful prime and probe attack. So what does this look like? We have the attack core and the victim core. The attack core sends an interrupt to the victim core. This interrupt is captured by the monitor, which passes it to the non-secure operating system. The non-secure operating system transfers this to our attack code, which runs the prime and probe attack. Then we return from the interrupt, execution of the victim code in the secure world resumes, and we just repeat this over and over. So now we have that interleaving of the attacker process and the victim process. So now, instead of having a temporal resolution of one measurement per execution, we once again have almost unlimited temporal resolution, because we can just schedule when we send those interrupts from the attacker core. Now, we'd also like to improve on the noise of the measurements. Because if we can reduce the noise, we'll get clearer pictures, and we'll be able to infer those secrets more clearly.
So we can get some improvement by switching the measurements from user land and starting to do those in the kernel, but we still have the cycle counters. So what if, instead of using the cycle counter to measure whether or not something is in the cache, we use the other performance counters? Because on ARMv8 platforms, there's a way to use performance counters to measure different events, such as cache hits and cache misses. Now, these events and these performance monitors require privileged access in order to use, which for this attack we do have. In a typical cache attack scenario, we won't have access to these performance monitors, which is why they haven't really been explored before. But in this weird scenario, where we're attacking the less privileged code from the more privileged code, we do have access to these performance monitors. And we can use these monitors during the probe step to get a very accurate count of whether or not a certain memory load caused a cache miss or a cache hit. So we're able to essentially get rid of the different levels of noise. Now, one thing to point out is that maybe we'd like to use these ARMv8 performance counters in order to count the different events that are occurring in the secure world code. So maybe we start the performance counters from the non-secure world, let the secure world run, and then when the secure world exits, we use the non-secure world to read these performance counters. And maybe we'd like to see how many instructions the secure world executed, or how many branch instructions, or how many arithmetic instructions, or how many cache misses there were. But unfortunately, ARMv8 took this into account, and by default, performance counters that are started in the non-secure world will not measure events that happen in the secure world, which is smart, which is how it should be. And the only reason I bring this up is because that's not how it is in ARMv7.
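To illustrate why exact event counts beat cycle timing for the probe step, here is a toy comparison of a jittery simulated timer against a noiseless simulated miss counter. All numbers are invented for illustration; this is not real PMU code.

```python
# Toy comparison of the two probe measurements discussed above:
# a noisy cycle counter vs an exact cache-miss event count.
import random
random.seed(1)   # deterministic for the demo

def timed_probe(is_hit):
    # cycle counter: hit ~40 cycles, miss ~100, plus jitter that
    # can push a hit past the threshold (a false positive)
    base = 40 if is_hit else 100
    return base + random.randint(0, 80)

def counter_probe(is_hit):
    # PMU-style event count: 0 misses on a hit, 1 on a miss, no noise
    return 0 if is_hit else 1

truth = [random.choice([True, False]) for _ in range(1000)]
THRESHOLD = 70   # classify "miss" when the probe is slow

# count classification errors: predicted miss != actual miss
timer_errors = sum((timed_probe(h) >= THRESHOLD) != (not h) for h in truth)
counter_errors = sum((counter_probe(h) == 1) != (not h) for h in truth)

print(counter_errors)     # → 0: the event counter never misclassifies
print(timer_errors > 0)   # → True: the jittery timer sometimes does
```

The jitter magnitude here is exaggerated to make the point visually; the talk's claim is simply that asking the processor directly removes this entire class of error.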
So we could go into a whole different talk with that, just exploring the implications of what that means. But I want to focus on ARMv8, because that's the newest of the new. So we'll keep looking at that. So we instrument the prime and probe attack to use these performance counters, so we can get a clear picture of what is and what is not in the cache. And instead of having noisy measurements based on time, we have virtually no noise at all, because we get the truth straight from the processor itself: whether or not we experienced a cache miss. So how do we implement these attacks? Where do we go from here? We have all these ideas, we have ways to make these TrustZone attacks more powerful, but that's not worthwhile unless we actually implement them. So the goal here is to implement these attacks on TrustZone. And since the non-secure world operating system is typically based on Linux, we'll take that into account when making our implementation. So we'll write a kernel module that uses these performance counters and these inter-processor interrupts in order to actually accomplish these attacks, and we'll write it in such a way that it's very generalizable. So you can take this kernel module that was written for one device (in my case, I did most of my testing on the Nexus 5X), and it's very easy to transfer this module to any other Linux based device that has TrustZone and these shared caches. So it should be very easy to port this over and to perform these same powerful cache attacks on different platforms. We can also do clever things based on the Linux operating system so that we limit the collection window to just when we're executing within the secure world. So we can align our traces a lot more easily that way. And the end result is having a synchronized trace for each of the different attacks, because since we've written it in a modular way, we're able to run different attacks simultaneously. 
So maybe we're running one prime and probe attack on the L1 data cache to learn where the victim is accessing memory. And we're simultaneously running an attack on the L1 instruction cache so we can see what instructions the victim is executing. And these can be aligned. So the tool that I've written is a combination of a kernel module, which actually performs this attack, a user land binary, which schedules these processes to different cores, and a GUI that will allow you to interact with this kernel module and rapidly start doing these cache attacks for yourself, and perform them against different processes and secure world code. So the intention behind this tool is to be very generalizable, to make it very easy to use this platform on different devices, and to allow people a way to, once again, quickly develop these attacks, and also to see if their own code is vulnerable to these cache attacks, to see if their code has these secret dependent memory accesses. So can we get even better spatial resolution? Right now, we're down to 64 bytes, and that's the size of a cache line, which is the size of our shared hardware. And on SGX, we actually can get better than 64 bytes, based on something called a branch shadowing attack. So a branch shadowing attack takes advantage of something called the branch target buffer, and the branch target buffer is a structure that's used for branch prediction. It's similar to a cache, but there's a key difference: the branch target buffer doesn't compare the full address when checking whether something is already stored in it or not. It doesn't compare all of the upper address bits. So that means that it's possible that two different addresses will experience a collision, and the same entry from that BTB cache will be read out for the wrong address. Now, since this is just for branch prediction, the worst that can happen is you'll get a misprediction and a small time penalty, but that's about it. 
The idea behind the branch shadowing attack is leveraging this small difference, this overlapping and collision of addresses, in order to execute a flush and reload style attack on the branch target buffer. So here what goes on is: during the attack, the attacker modifies the SGX enclave to make sure that the branches that are within the enclave will collide with branches that are not in the enclave. The attacker executes the enclave code, and then the attacker executes their own code, and based on the outcome of the victim's branch and the state of that cache, the attacker code may or may not experience a branch misprediction. So the attacker is able to tell the outcome of a branch because of this overlap and this collision, just like in a flush and reload attack, where the memory overlaps between the attacker and the victim. So here, our spatial resolution is fantastic. We can tell down to individual branch instructions in SGX. We can tell exactly which branches were executed and, in the case of conditional branches, which directions they were taken. The temporal resolution is also, once again, almost unlimited, because we can use the same timer interrupts in order to schedule our attacker process. And the noise is, once again, very low, because we can use the branch misprediction counters that exist in the Intel world in order to eliminate this noise. So does any of that apply to TrustZone attacks? Well, in this case, the victim and the attacker don't share entries in the branch target buffer, because the attacker is not able to map the virtual address of the victim process. But this is kind of reminiscent of our earlier cache attacks. Our flush and reload attack only worked when the attacker and the victim shared memory, but we still have the prime and probe attack for when they don't. So what if we use a prime and probe style attack on the branch target buffer cache in ARM processors? 
So essentially what we do here is: we prime the branch target buffer by executing many attacker branches to fill up this BTB cache with the attacker's branch prediction data. We let the victim execute a branch, which will evict an attacker BTB entry. And then we have the attacker re-execute those branches and see if there have been any mispredictions. Now, the cool thing about this attack is that the structure of the BTB cache is different from that of the L1 caches. Instead of having 256 different sets like the L1 cache, the BTB cache has 2048 different sets. So we can tell which branch the victim executed based on which one of 2048 different set IDs it falls into. And even more than that, on the ARM platform, at least on the Nexus 5X that I was working with, the granularity is no longer 64 bytes, the size of a cache line; it's now 16 bytes. So we can see which branches the trusted code within TrustZone is executing, to within 16 bytes. So what does this look like? Previously, with the TruSpy attack, this is sort of the outcome of our prime and probe attack: we get one measurement for those 256 different set IDs. When we added those interrupts, we were able to get that time resolution, and it looks something like this. Now, maybe you can see a little bit at the top of the screen how there are these repeated sections of little white blocks. And you can use that to infer that maybe the same cache lines of instructions are being called over and over. So just looking at this L1 instruction cache attack, you can tell some information about how the process ran. Now let's compare that to the BTB attack. And I don't know if you can see it too clearly; it's a bit too high of a resolution right now. So let's focus in on one small part of this overall trace. And this is what it looks like. Each of those white pixels represents a branch that was taken by that secure world code. And we can see repeated patterns. 
We can see maybe different functions that were called. We can see different loops. And just by looking at this one trace, we can infer a lot of information about how that secure world code executed. So it's incredibly powerful. And all of those secrets are just waiting to be uncovered using these new tools. So where do we go from here? What sort of countermeasures do we have? Well, first of all, I think the long-term solution is going to be moving to no more shared hardware. We need to have separate hardware and no more shared caches in order to fully get rid of these different cache attacks. And we've already seen this trend in different cell phones. So for example, in Apple SoCs for a long time now, I think since the Apple A7, the secure enclave, which runs the secure code, has its own cache. So these cache attacks can't be accomplished from code outside of that secure enclave. Just by using that separate hardware, it knocks out a whole class of potential side channel and microarchitectural attacks. And just recently, the Pixel 2 is moving in the same direction. The Pixel 2 now includes a hardware security module that performs cryptographic operations. And that chip also has its own memory and its own caches. So we can no longer use this attack to extract information about what's going on in this external hardware security module. But even then, using this separate hardware doesn't solve all of our problems. Because we still have the question of what to include in this separate hardware. On the one hand, we want to include more code in that separate hardware, so we're less vulnerable to these side channel attacks. But on the other hand, we don't want to expand the attack surface any more. Because the more code we include in these secure environments, the more likely it is that a vulnerability will be found, and the attacker will be able to get a foothold within this secure trusted environment. 
So there's going to be a balance between what you choose to include in this separate hardware and what you don't. Do you include DRM code? Do you include cryptographic code? It's still an open question. And that's sort of the long-term approach. In the short term, you just kind of have to write side-channel-free software. Be very careful about what your process does: whether there are any secret-dependent memory accesses, secret-dependent branches, or secret-dependent function calls. Because any of those can leak the secrets out of your trusted execution environment. So here are the things that I want you to keep in mind if you are a developer of trusted execution environment code. First of all, performance is very often at odds with security. We've seen over and over that the performance enhancements to these processors open up the ability for these microarchitectural attacks to be more efficient. Additionally, these trusted execution environments don't protect against everything. There are still these side channel attacks and these microarchitectural attacks that these systems are vulnerable to. These attacks are very powerful, and they can be accomplished simply. And with the publication of the code that I've written, it should be very simple to get set up and to analyze your own code to see: am I vulnerable? Do I expose information in the same way? And lastly, it only takes one small error, one tiny leak from your trusted and secure code, to extract the entire secret and bring the whole thing down. So what I want to leave you with is this: remember that you are responsible for making sure that your program is not vulnerable to these microarchitectural attacks. Because if you do not take responsibility for this, who will? Thank you. Thank you very much. Please, if you want to leave the hall, please do it quietly, take all your belongings with you, and respect the speaker. We have plenty of time, 17 minutes, for Q&A. 
So please line up at the microphones. No questions from the signal angel. All right, so we can start with microphone six, please. OK. There was a symbol for secure OSes in the ARM TrustZone diagram. What is the idea behind them? If the non-secure OS gets all the interrupts, what is the secure OS for? Yeah, so on ARMv8, there are a couple of different kinds of interrupts. I think, if I'm remembering the terminology correctly, there's an IRQ and an FIQ interrupt. The non-secure mode handles the IRQ interrupts, and the secure mode handles the FIQ interrupts. So which one you send determines which direction the monitor will route that interrupt. Thank you. OK, thank you. Microphone number seven, please. Do any of your presented attacks on TrustZone also apply to the AMD implementation of TrustZone, or are you looking into it? I haven't looked into AMD too much, because as far as I can tell, that's not used as commonly. But there are many different types of trusted execution environments. The two that I focused on were SGX and TrustZone, because those are the most common examples that I've seen. Thank you. Microphone number eight, please. When TrustZone is moved to dedicated hardware and dedicated memory, couldn't you replicate the user space attacks by loading your own trusted user space app and using it as an oracle of some sort? If you can load your own trusted code, then yes, you could do that. But in many of the models I've seen today, that's not possible. That's why you have things like code signing, which prevents an arbitrary user from running their own code in the trusted OS or in the trusted environment. All right, microphone number one. Some of these attacks are more powerful against code that's running in trusted execution environments than similar attacks would be against ring three code or untrusted code in general. Does that mean that trusted execution environments are basically an attractive nuisance that we shouldn't use? 
There's still a large benefit to using these trusted execution environments. The point I want to get across is that although they add a lot of features, they don't protect against everything. So you should keep in mind that these side channel attacks do still exist, and you still need to protect against them. But overall, these are beneficial and worthwhile to include. Thank you. Microphone number one again, please. So AMD is doing something with encrypting memory. And I'm not sure if they encrypt addresses too. But would that be a defense against such attacks? So I'm not too familiar with AMD, but SGX also encrypts memory, and it encrypts it between the lowest level cache and the main memory. But that doesn't really have an impact on the actual operation of these attacks, because the memory is encrypted at the cache line level. And as the attacker, we don't care what the data is within that cache line. We only care which cache line is being accessed. If you encrypt addresses, wouldn't that help against that? I'm not sure how you would encrypt the addresses yourself. As long as those addresses map into the same set IDs that the victim can map into, then the attacker could still pull off the same style of attacks. Great. We have a question from the internet, please. The question is: does the secure enclave on the Samsung Exynos distinguish the receiver of the message, so that if a user application asks it to decode an AES message, can one sniff the value that the secure enclave returns? So that sounds like it's asking about the TruSpy style attack, where it's calling into the secure world to encrypt something with AES. I think that would all depend on the particular implementation. As long as it's encrypting with a certain key and it's able to do that repeatedly, then the attack, assuming a vulnerable AES implementation, would be able to extract that key. Cool. Microphone number two, please. 
Do you recommend a reference to understand how these cache line attacks and branch oracles actually lead to key recovery? Yeah. So I will flip through these pages, which include a lot of the references for the attacks that I've mentioned. So if you're watching the video, you can see these right away, or just access the slides. A lot of these contain good starting points. I didn't go into a lot of the details on how, for example, the TruSpy attack recovered that AES key, but that paper has a lot of good links for how those leaks can lead to key recovery. Same thing with the CLKSCREW attack, and how fault injection can lead to key recovery. Yeah. Microphone number six, please. I think my question might have been almost the same thing. How hard is it actually to recover the keys? Is this like a massive machine learning problem, or is this something that you can do practically on a single machine? It varies entirely by the implementation. For any of these attacks to work, you need to have some sort of vulnerable implementation. And some implementations leak more data than others. In the case of a lot of the AES attacks, where you're doing the passive attacks, those are very easy to do on just your own computer. For the AES fault injection attack in the CLKSCREW paper, I think that one required more brute force. So that one required more computing resources, but it was still entirely practical to do in a realistic setting. Cool, thank you. So we have one more, microphone number one, please. So I hope it's not a too naive question, but I was wondering: since all these attacks are based on cache hits and misses, is it possible to forcibly flush or invalidate or insert noise into the cache after each operation in the trusted environment, in order to mess up the guesswork of the attacker? And so discard optimization and performance for additional security benefits. Yeah, and that is absolutely possible, and you are absolutely right. 
It does lead to a performance degradation, because if you always flush the entire cache every time you do a context switch, that will be a huge performance hit. So again, that comes down to the question of the performance versus security trade-off: which one do you end up going with? And it seems historically the choice has been more in the direction of performance. Thank you. Yeah. Well, we have one more, microphone number one, please. So I have more of a question of how well we should really protect against attacks which need some ring zero cooperation. Because when we use TrustZone for a purpose we would see as clear, like protecting the browser from the outside world, then we are basically using the safe execution environment for sandboxing the process. But once the attacks need some cooperation from the kernel, they in fact empower the user instead of the hardware producer. Yeah. And you're right, it depends entirely on what your application is and what your threat model is. So if you're using these trusted execution environments for DRM, for example, then maybe you would be worried about that ring zero attacker, that privileged attacker who has their phone rooted and is trying to recover these media encryption keys from the execution environment. But maybe there are other scenarios where you're not as worried about an attacker with a compromised ring zero. So it entirely depends on context. All right, thank you. So we have one more, microphone number one again. Hey there. Great talk. Thank you very much. Just a short question. Do you have any success stories about attacking TrustZone and the different implementations of TEEs with some vendors, like, I don't know, some OEMs creating phones and stuff? Not that I can talk about at this time. Okay. Thank you. Thanks. Thank you very much. Please, again, a warm round of applause for Keegan.