Okay, so before we start with the technical parts of the talk, let's talk about some philosophy, something fun. We have a major problem in our world's economy today. Yuval and I have identified a major underutilization of juicers. We all have a juicer in our house, and we use it to make juice. Except there's one problem: we don't use our juicers very much. We wake up in the morning, we make juice, we drink a cup of juice, we go to work, and the juicer stands there, idle, doing nothing. Then we come back from work, the juicer is happy, we make juice once again, we drink our single cup of juice, we go to sleep, and the thing waits, idle, alone in the dark. So we're not using our juicers very much. So how about we rectify the situation: we put the juicer on the cloud, we have a common juicer for everybody to use, and we get better utilization from our juicers. Now, of course, when you do this and you make a cloud juicer for everybody to use, and it's going to be a big juicer, then there are concerns. When you have a public infrastructure, two people will want to make juice at the same time, and we absolutely must require a complete separation of juices, because frankly I really, really hate tomato juice. It gets worse, because when you make such an infrastructure publicly available, then some people like me, we are the creative bunch, and assuming this thing is made out of chocolate and not another thing, then we expect something like that to come out from the other thread while it's making orange juice. So, who thinks this is a good idea, by a show of hands? Who wants to fund this thing? Maybe I'll go talk to the blockchain guys, they'll pretty much fund anything. Maybe I'm just in the wrong room, can you make them hear me? But who thinks it's a good idea? Damn it, no takers. Well, if this is not a good idea for juicers, why are we making our computer architects do just that with cloud systems?
Well, the answer is economics, of course, because computers are more expensive than juicers. But if we're going to make our architects do that, then we need to understand that sharing is not caring when it comes to security; sharing is downright hating. If we are doing this, then something will always leak, and we need a new notion of acceptable leakage, and that's traditionally a hard problem because we need to balance it against performance. Okay, now having said that, let's talk about computer architecture. The way they teach us computer architecture, the first class looks something like this. We have an input, we have an output, we have a central processing unit that has some control logic, some ALUs, some memory, and then we spend our first year looking at a computer that looks something like this, right? In my first architecture class they showed me this, I went like that, and I quickly came back to here. This is where I spent all of my CS degree. And when you look at it, this is how we train people. We take the architecture class and we systematically wipe the knowledge of architecture from computer science graduates. We teach them high-level programming languages, we teach them that they should look at asymptotics rather than concrete behavior. Architecture is a minor topic for the chosen few that like to hit themselves on the head. We focus on abstraction and not on concrete details; it's in the interest of the CPU vendor for you not to know what's under the hood. And lo and behold, you're back to the good old model, the Turing machine, and you should compute on that. A concrete example of how this can go wrong: we're all used to thinking about a dedicated piece of hardware that has uniform memory access, that executes programs serially, and that is always correct.
In fact, what they actually sold us is a shared piece of hardware with non-uniform memory access that does things in parallel, and sometimes out of order, and most of the time incorrectly. Oh wait, that's called speculative execution; that's a title for a different talk. And well, to quote Thomas Dullien, insecurity lives in the gaps and cracks between abstraction layers. Pretty much that's where Yuval and I exist, okay? And we are out there to get you. So with that introduction, let's talk about one of this year's computer architecture failures, RAMBleed. How can we read bits from memory without directly accessing them? Some background about DRAM first. DRAM, the computer's main memory, is arranged in rows of cells; a row is made up of multiple cells, and a cell is the basic storage block in memory, basically a bit. Now, a row is quite wide: it's eight kilobytes. And when you're trying to access a bit in a given row, there's no such thing; you have to access the entire row. You have to activate the row, and then the control logic takes the data in the row and brings it down to the row buffer, and there you can read your row. Even if you want only one bit out of that row, you'll still have activated the entire row. That's basically how DRAM works. Tough love: if you want a bit, you're gonna get eight kilobytes. Now, another property of DRAM cells is that they lose charge over time, well, because that's how they're designed. It's volatile memory: if you cut power, it loses its charge. And if you don't refresh the cells, if you don't keep rewriting the same values to memory over and over again, they're gonna slowly fade away and become corrupted. So the memory system keeps rewriting the same values into the cells, time and time again; the standard refresh interval is 64 milliseconds. Now, here's where this comes back to bite us.
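To make the row-activation story concrete, here is a toy Python model of the mechanism just described. The 8 KiB row width is from the talk; the `ToyDRAM` class, its field names, and the three-row layout are illustrative assumptions, not any real memory controller:

```python
# A toy model of DRAM reads, illustrating that reading any single bit
# requires activating its entire 8 KiB row into the row buffer.
ROW_BYTES = 8 * 1024  # 8 KiB per row, as described in the talk

class ToyDRAM:
    def __init__(self, rows):
        self.rows = rows                # list of bytearrays, one per row
        self.row_buffer = None          # index of the currently open row
        self.activations = 0            # how many full-row activations happened

    def read_bit(self, row, byte, bit):
        # Any access first activates the whole row into the row buffer.
        if self.row_buffer != row:
            self.activations += 1
            self.row_buffer = row
        return (self.rows[row][byte] >> bit) & 1

dram = ToyDRAM([bytearray(ROW_BYTES) for _ in range(3)])
dram.rows[1][0] = 0b00000100

print(dram.read_bit(1, 0, 2))   # -> 1: one bit cost a full 8 KiB activation
print(dram.activations)         # -> 1
print(dram.read_bit(1, 5, 0))   # -> 0: same open row, no new activation
print(dram.activations)         # -> 1
```

The point to carry forward is the activation counter: every access to a closed row activates all 8 KiB of it, which is exactly the behavior the attack below abuses.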
It turns out that repeated activations of rows cause bit flips in nearby rows. What does that mean? If I repeatedly access row one and row three, constantly hammering them, I'm gonna get a bit flip in row two. Why? Because memory cannot handle the maximum workload the CPU can issue. The reason we don't see this workload often is caches, and therefore, according to computer architects, this workload doesn't really exist, so we can just ignore it. Why would somebody intentionally cause cache misses so that everything goes to DRAM? That behavior makes no sense, right? You're gonna lose performance. So that workload was deemed so rare that they didn't care about it. And the fact that, I think, 20% of all DDR3 systems have bit flips if you access memory in a funny enough way is not something they cared about before Rowhammer came in. This attack is known as a Rowhammer attack: if you keep on hammering rows, you're gonna get bit flips. If we look at the data under those flips, then here's the Mona Lisa, and here's a hammered version. You flip bits in somebody else's memory, basically, okay? And it's well known and commonly used; it's known as a write channel. You get to write to other people's bits, even though they're not your bits. And now it's a question of which bits you overwrote, and what the implications of those bits are, because not all bits are born equal. If you flip bits in a page table, you get privilege escalation. If you flip bits in somebody else's public key, you can subsequently factorize that public key and, say, get them to install a package of your own. You can destroy somebody's control flow, because a branch-equal and a branch-not-equal are just one bit apart in the Intel architecture, okay? So we're literally one bit away from disaster. But that's Rowhammer: you can go ahead and write into somebody's memory and flip some bits there.
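The double-sided hammering pattern above (repeatedly activating rows one and three to flip a bit in row two) can be sketched as a simulation. The flip threshold and the per-activation disturbance are made-up illustrative numbers; real flip rates depend heavily on the DIMM:

```python
# Toy simulation of double-sided Rowhammer: repeatedly activating the
# two aggressor rows disturbs the cells of the victim row between them.
FLIP_THRESHOLD = 100_000  # illustrative activation count, not a real figure

class Cell:
    def __init__(self, value):
        self.value = value
        self.disturbance = 0

def hammer(aggressor_rows, victim_row, iterations):
    for _ in range(iterations):
        for _row in aggressor_rows:
            # Each aggressor activation leaks a little charge from
            # the neighbouring victim cells.
            for cell in victim_row:
                cell.disturbance += 1
                if cell.disturbance >= FLIP_THRESHOLD:
                    cell.value ^= 1          # the bit flips
                    cell.disturbance = 0

victim = [Cell(0) for _ in range(8)]
hammer(aggressor_rows=[1, 3], victim_row=victim, iterations=50_000)
print([c.value for c in victim])  # -> [1, 1, 1, 1, 1, 1, 1, 1]
```

A real exploit would of course hammer physical rows through cache-bypassing accesses rather than Python objects; the simulation only captures the cause-and-effect of hammering.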
Which bits you can flip depends on what you get. But okay, let's say we don't care about any of that, and we are mainly worried about confidentiality rather than integrity. So I can write to other people's bits, but I cannot read them, okay? Let's say I just want to read somebody's memory. Here's a little-known observation about Rowhammer, and this goes back to the original Rowhammer paper of Kim et al. in 2014: striped patterns result in more bit flips than non-striped patterns. Basically, if I have this pattern of one, zero, one, and I keep hammering the red rows, then the green zero in the middle wants to become a one. And the same is true for the opposite pattern: zero, one, zero results in the one wanting to become a zero. The quick way to remember this is that bits want to become like the bits above them and below them, okay? They don't want to stand out. Now, what does this mean when it comes to cryptography? It means that a bit flip leaks information about the bits above it and below it, right? If it flipped, it probably means that the bits above it and below it hold the opposite charge relative to it. Okay, that's something to work with. And that's fundamentally what we exploit: the fact that the distribution of bit flips changes based on the data above and below, which lets us recover the bits above and below at a rate of 0.3 bits per second. That's pretty much it. So you can go and read some bits of DRAM at a third of a bit per second, okay? It works even better when you have error-correcting codes, which are the things that are supposed to protect DRAM against bit flips; they make everything infinitely worse. Which is always a good idea: a countermeasure that just makes things worse. And with both of these, we were able to read out an RSA private key from OpenSSH, directly from DRAM, in a few hours. Okay, so with that in mind, let's start hammering.
So here's a very problematic memory layout for a poor victim. We have three DRAM rows, and each row is eight kilobytes wide; that's gonna be important later. And I have somehow managed to get my poor victim to allocate data here and here: two identical copies of the secret key. I'm the attacker; I control everything except these two memory pages, okay? Now I'm gonna show you how I can use this layout to go ahead and read this zero bit there. So everything except the red is mine, and I wanna read the red. I cannot just go ahead and access the red, because it's not my memory. But what can I do? I can write in the green region. I can write the bit opposite to the bit I think is there; bear in mind there are only two options for a bit. So let's just guess that the bit I'm trying to read is a zero, and I'm gonna write a one in the same location in my page, my sampling page, okay? Then I need a Rowhammer effect. I need a way to get the victim to repeatedly access its secrets, right? Because I wanna hammer the two secrets to get them to flip the bit in the middle. Except, well, how do I get a victim to access its secrets fast enough that you get cache misses? I'm sure there are ways, but it would be infinitely easier if we just didn't have to do this. So let's just not do this. Remember that the DRAM row buffer is eight kilobytes long? Well, if it's eight kilobytes long, this means that a row holds more than one page: next to the victim's page, in the same row, there is another page. And if that other page happens to be mine, these are the row activation pages, then me accessing my own memory that shares the row buffer with the victim is equivalent to the victim accessing its memory. Okay, so no victim interaction needed. Good, great, done. Just go ahead and hammer your own memory here, which is equivalent to the victim accessing here, which will create a Rowhammer effect here.
And now, if I guessed correctly, then the bit in my sampling page should go from one to zero, right? Because I wrote the opposite of the value of the secret I'm trying to read. But wait, I just hammered myself, so it's my memory, so I can go ahead and check it and see if it actually flipped. If it flipped, then here we go, I just guessed the bit. And to quote Shafi Goldwasser, the difference between a trick and a method is that a method is a trick that works more than once. This thing works one bit at a time across our key bits, so just repeat for the next bit until you've got all of them out. So what do we need? We need flippable bits, flippy bits in flippy DIMMs. How flippy are the DIMMs? Well, it depends on how good your DIMMs are and how old your DIMMs are. For DDR3, for example, the rumor is that 20% of systems have bit flips at certain rates. For us, it was super easy to find a DDR3 system that had bit flips: I simply went to the university's junkyard, pulled out a few computers, and some of them had bit flips. As easy as that. So we need to profile memory and find all the flippy bits; not all bits are as flippy as others, so you need to know your DIMMs. But again, this is easy, and I can do it from user space: I'm just going to allocate some memory and hammer myself to find the locations of the flippy bits. Now that I know the locations of the flippy bits, I have another problem: I need to get my memory into this layout. How would I do that? I don't have any control over the victim, and I want the victim to allocate two copies of the secret right above and below my sampling page, and right next to my activation pages. So how would I do that? Well, it turns out that the security of the Linux allocator is not very good, in the sense that it's not designed to have security, so it doesn't have any.
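The guess-write-hammer-check loop just described can be sketched end to end. This uses the same idealized flip rule as the toy data-dependence model (a hammered bit flips exactly when both vertical neighbours disagree with it); the function names are illustrative, and a real attack would observe probabilistic flips over many hammering rounds:

```python
# Sketch of the RAMBleed read primitive: the secret occupies the rows
# above and below our sampling page; we never read the secret itself,
# only our own bit, and deduce the secret from whether our bit flipped.
def hammer_cell(above, victim, below):
    # Idealized flip rule: flip iff both vertical neighbours disagree.
    if above == below and above != victim:
        return above
    return victim

def read_secret_bit(secret_bit):
    # Two identical copies of the secret sit above and below our cell.
    guess = 0                      # guess the secret bit is 0
    sampling_bit = 1 - guess       # write the opposite of our guess
    after = hammer_cell(secret_bit, sampling_bit, secret_bit)
    flipped = (after != sampling_bit)
    # If our bit flipped, the neighbours disagreed with it, i.e. the
    # guess was right; otherwise the secret is the other value.
    return guess if flipped else 1 - guess

secret = [1, 0, 1, 1, 0, 0, 1, 0]
recovered = [read_secret_bit(b) for b in secret]
print(recovered == secret)        # -> True, one bit at a time
```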
And there is yet another memory-access dance you can dance, with strange allocation and deallocation patterns, that leaves nicely fitting holes in the memory system, such that when the victim wants to allocate, well, it's just going to allocate there, because the data it's about to allocate fits nicely into the hole I left for it ahead of time. A nice example: OpenSSH likes to allocate the secret key after 31 page allocations. Why? Because I ran a debugger and saw it. So I'm going to allocate 32 pages, then deallocate 31 of them, so the victim puts its junk data there, and then, lo and behold, "accidentally", I'm going to deallocate the page that has the flippy bit, so it puts the secret key right on the flippy bit, which is exactly what I wanted. And once we have that allocation, just go: hammer away, read the sampling page, get the bit, move on to the next bit. Now, there are other headaches to sort out about obtaining contiguous memory, because we are doing all of this from userland, and if you're trying to do this from userland, you need a nice contiguous block of physical memory, except, because of previous attacks, the translation between virtual and physical memory is not something that is accessible to users. You need to be administrator, and we don't want to be administrator, because otherwise you could just read out the key. So what do we do? I'm going to refer you to the paper, but the bottom line is that there is a way to trigger the allocator into giving you two megabytes of consecutive memory using just a weird pattern of allocations and deallocations; eventually it's just going to give you a nice block of two megabytes that is physically contiguous. Then, after you've got your block, you need to figure out its physical address.
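The hole-carving dance above can be modeled with a toy allocator. The LIFO free list is an assumption that loosely mirrors how OS page allocators tend to reuse recently freed frames; real allocators are much messier, and the "key lands on the 32nd allocation" figure is the debugger observation from the talk:

```python
# Toy model of allocator massaging: by freeing pages in a chosen order,
# the attacker steers which physical frame the victim's allocations get.
class ToyAllocator:
    def __init__(self):
        self.free_list = []            # LIFO stack of free page frames

    def alloc(self):
        return self.free_list.pop()    # victim and attacker share this pool

    def free(self, frame):
        self.free_list.append(frame)

allocator = ToyAllocator()
for frame in range(100):
    allocator.free(frame)

# Attacker: grab a batch of pages; say the first one holds a flippy bit.
pages = [allocator.alloc() for _ in range(32)]
flippy = pages[0]

# Free the flippy page FIRST, then 31 junk pages on top of it, so the
# victim's first 31 allocations soak up the junk and its 32nd
# allocation (where the key lands) receives the flippy frame.
allocator.free(flippy)
for junk in pages[1:]:
    allocator.free(junk)

victim_pages = [allocator.alloc() for _ in range(32)]
print(victim_pages[31] == flippy)      # -> True: the key sits on the flippy bit
```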
Again, I'm going to refer you to the paper; it can be done, because memory allocators were not designed with security in mind. So now we have the attacker arranging almost everything in this layout; we just need to deallocate the two purple pages just in time, when the victim wants to allocate its secrets into them. So you need to monitor the behavior of the victim, you need to know when exactly it wants to allocate, and then deallocate just before. It can be done. So on our DIMMs, we found vulnerable bits at a rate of 41 bits per minute, just by scanning the DIMMs. Yes, these are really, really bad DIMMs; then again, 20% of systems were made like that. We can read bits from a test client at three bits per second with 90% accuracy, and this is just a test client that simply allocates a secret every time it gets an IP packet. Why is this useful? Well, this is the behavior of OpenSSH, because every time you try to log in to somebody's computer, even if you don't know the keys or passwords, OpenSSH will allocate the RSA private key that it uses to sign its identity. So every time we try to log in, it allocates the thing. We have an attacker running on the same hardware that, just ahead of time, deallocated in the right pattern, so the secret signing key lands right on the flippy bits and we can leak it, okay? With OpenSSH, with all the noise and clutter this thing produces when it handles an incoming connection, we lose an order of magnitude in speed and 20% in accuracy. So now we are at 0.3 bits per second with 72% accuracy. A rate of 0.3 bits per second on an RSA key that takes 2048 bits, okay, fine: wait a while and you'll get it done. Except here's the problem.
The 72% accuracy is actually unpleasant, because 72% accuracy across 2048 bits is not something you can fix, at least not by brute forcing. But we have the awesome Heninger-Shacham algorithm, presented in 2009, to thank: we built yet another variant of Heninger-Shacham that can handle this type of flip at this accuracy, and thanks to the nice mathematical properties of RSA, we have our key, okay? Basically, RSA key extraction within a few hours, where the few hours go mostly to computing time. And now for the bonus feature: error-correcting codes. So, about the horror story I just told you: what do you mean, bits flip at random? What do you mean, you can flip a bit in somebody's memory? Well, yes, you can. On a client PC, you can just go ahead and flip a bit, nobody is watching, and whatever that bit does, that's the extent of your problem, okay? But on a server, not quite. Why? Because we build servers better than we build client computers. On a server, we really, really don't want bit flips, because servers run all the time and they get hit by cosmic rays, at least those are the stories I've been told. So we have error-correcting codes: there is ECC directly on the DRAM, and every time you access DRAM it checks the ECC, and if a bit flipped, the ECC will unflip it. Okay, that's the story for one bit. What happens if you flip two bits? Well, then it's gonna crash. And what happens if you flip three bits at once? Then it's gonna freak out and error-correct the wrong way, because it cannot distinguish the one-bit case from the three-bit case, and at that point it just doesn't know that anything even happened. And that's exactly the basis of a recent paper called ECCploit, where they found a way to flip three bits at once instead of one bit at a time, reverse engineered the error-correcting codes, and voila, Rowhammer strikes at servers.
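To show why partial, noisy knowledge of an RSA key is fixable, here is a miniature branch-and-prune reconstruction in the spirit of the Heninger-Shacham idea: rebuild p and q from the least significant bit up, keeping only partial candidates consistent with p * q = N modulo a growing power of two. This toy handles only erasures (unknown bits), not flipped bits; the RAMBleed variant additionally has to tolerate errors:

```python
# Toy Heninger-Shacham-style reconstruction of RSA primes from a
# fraction of their bits, pruning with the relation p * q = N.
def reconstruct(N, p_bits, q_bits, nbits):
    """p_bits / q_bits map bit index -> known bit value (partial knowledge)."""
    candidates = [(1, 1)]                      # p and q are odd, so bit 0 is 1
    for i in range(1, nbits):
        mod = 1 << (i + 1)
        nxt = []
        for p, q in candidates:
            for pb in ((p_bits[i],) if i in p_bits else (0, 1)):
                for qb in ((q_bits[i],) if i in q_bits else (0, 1)):
                    cp, cq = p | (pb << i), q | (qb << i)
                    # prune: the product must agree with N on the low i+1 bits
                    if (cp * cq) % mod == N % mod:
                        nxt.append((cp, cq))
        candidates = nxt
    return [(p, q) for p, q in candidates if p * q == N]

# Toy example: N = 251 * 241, knowing fewer than half the bits of each prime.
N = 60491
p_known = {3: 1, 5: 1, 7: 1}   # some bits of p = 0b11111011
q_known = {2: 0, 4: 1, 6: 1}   # some bits of q = 0b11110001
print(reconstruct(N, p_known, q_known, 8))   # -> [(251, 241)]
```

The pruning keeps the candidate set small because each new bit of p and q must satisfy a linear constraint modulo 2, which is the "nice mathematical property of RSA" doing the heavy lifting.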
But let's say you have really, really great DRAM that simply doesn't have three-bit flips; that event just doesn't happen. Are we done? Two bits is gonna crash, one bit is gonna get corrected. Seems secure, right? Every time you flip a bit, they unflip it. So we decided to take a look at how long it takes to unflip a bit, and this is what we got. Okay, I don't know why it takes 10 million cycles to unflip a bit. I really don't. I've asked a few people at various vendors, and they didn't know either, including those that sell the memory. An error-correcting code is a matrix; the codes they use are linear. We have a 64-bit word and a 64-by-64 generator matrix, and we need to multiply one by the other over bits. Why does this take 10 million cycles? Beats me. I don't know whom they hired to do that. But the bottom line is that you can see bits as they're being unflipped. And why would a bit be unflipped? Because I flipped it, okay? And all I need for RAMBleed is the ability to see whether the bit was flipped and unflipped. So, fine: that's my oracle right there. How long does it take to access memory? You can spot a difference of 10 million cycles with a stopwatch. And if you've ever tried flipping a bit on a server, it's awesome to see the entire machine just halt while it's trying to recover from what was done to it. It still recovers, but it takes a long time to do that. So, in that case, we can read bits at 0.64 bits per second on a server machine. And this is where ECC becomes a security problem instead of a security feature. And then ECC subsequently corrects all the flips, so nothing to see here: no flips ever happened, and if you look at the data values, no data was ever computed incorrectly. Yay. Are you trying to prove that the computation was correct? Yes, it was: no bits were flipped, because ECC unflipped them. So that's RAMBleed.
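The ECC timing oracle above reduces to a threshold test on access time. The cycle counts below are illustrative stand-ins for real measurements (the talk only says a correction costs on the order of 10 million cycles versus an ordinary access):

```python
# Toy model of the ECC timing oracle: a silently corrected access takes
# millions of cycles longer than a clean one, so a simple threshold on
# access time tells the attacker whether its sampling bit was flipped.
import random

CLEAN_CYCLES = 300                 # rough cost of an uncached DRAM read
CORRECTION_PENALTY = 10_000_000    # the mysterious ECC correction cost

def timed_read(bit_was_flipped):
    noise = random.randint(0, 200)
    if bit_was_flipped:
        return CLEAN_CYCLES + CORRECTION_PENALTY + noise
    return CLEAN_CYCLES + noise

def flip_oracle(cycles, threshold=1_000_000):
    # 10 million cycles vs a few hundred: visible "with a stopwatch"
    return cycles > threshold

trials = [random.random() < 0.5 for _ in range(1000)]
observed = [flip_oracle(timed_read(flipped)) for flipped in trials]
print(observed == trials)          # -> True: timing reveals every flip
```

Composed with the read primitive from earlier, this restores the flipped/not-flipped signal that ECC was supposed to hide, which is how the attack still works on ECC servers.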