All right, let's get started. It's time, so hopefully you're ready for some more mysteries. We're gonna pick up on that and conclude the lecture with three mysteries today, and then some lecture and course logistics in the last 15 minutes or so, depending on how fast we finish the mysteries. Okay, let's hopefully remember the first mystery. I'm not gonna ask you to recite it, but I'll make you recall it. We were talking about a multi-core system, and you're gonna see a lot of these components in this lecture. Again, if this scares you right now, don't worry. One of the goals of this lecture is that by the end of the class, not this particular lecture, but by the end of the class, you will understand all of these mysteries and hopefully how to solve them. I'll give you a glimpse of why these mysteries are interesting, or why they may be interesting to solve, and glimpses of the solutions and of how things work underneath. You may not be able to perfectly understand them just from this lecture, because you're just learning this, but hopefully this will motivate you to understand them and to see the importance of some of these things. Okay, these chips are everywhere, basically. This is a picture from 2006, but multi-core chips are growing, as we discussed yesterday, and we're gonna look at the different components. And I've shown you this picture. You don't need to know all of these things, but basically, you have many, many cores today. Actually, this is an old NVIDIA GPU that has 448 cores, if you will; everybody defines cores in a different way. These are really, really simple things. But right now, I think NVIDIA chips are around 1500 cores. So on a single processor, you get 1500 to 2000. Maybe I'm even behind; 1500 is probably a low number by now. Okay, recall that when we add more cores to the system, we want to get more performance, right? Ideally, that's what we'd like. But I showed you this picture, based on an experiment that we did at Microsoft Research about 10 years ago, actually more than 10 years ago now, that showed that if you run these two applications together, one application doesn't slow down that much, whereas the other application slows down by almost 3x. Remember, I gave you the example. This one takes 100 seconds to run on a single processor, whereas when you run it on the multi-core system together with GCC, it takes about 107 seconds. So a slowdown of 1.07; that's how you calculate it. On the other hand, GCC, which is a compiler, is commonly used, and MATLAB is also commonly used for many, many things in the world, not only mathematics. Have you ever used MATLAB? Okay, excellent. So I don't need to explain that. Have you ever used GCC? Oh, okay. Good. When I came from high school, I didn't even know what GCC or MATLAB was. That was a long time ago, but basically GCC, if it runs alone, takes, let's say, 100 seconds on the multi-core system, without any other application. If you run MATLAB next to it, now its runtime goes up to 300 seconds, 304 seconds. That's how you get the 3.04x slowdown. Now, I also told you that this is worse than running the two applications serially, right? You run MATLAB first and then GCC next. That's 200 seconds, but here you get 400 seconds. It doesn't make sense, right? Well, it may make sense, actually, later on, once you know what's going on underneath. So you can actually go into the operating system, and we will talk about that later on, and change the priorities of the applications.
Tell the operating system this MATLAB is low priority, I don't care about it, whereas this compilation job that I'm doing is very high priority, I care a lot about it, and set the priorities accordingly, and nothing changes in the system. You still get the same result. Because it turns out the operating system actually puts those two jobs, two applications, on two different cores, and doesn't deal with anything else. Okay, we called this a memory performance hog, and I asked you to think about three questions, right? Basically, why do the applications slow down at all? Why is there a disparity in the slowdowns? And can you fix the problem? And can you answer all of these questions without knowing the underlying system and how it works? These were rhetorical questions, basically. I don't think you can answer them without knowing how the system works underneath. But this is actually motivated by real problems. People in data centers, or with virtual machines, when they put those together, they saw these slowdowns, and they couldn't explain why these slowdowns happened. So the explanation, well, again, the same questions. And the last question is, I'll get to the explanation, the last question is, how can we solve the problem if we do not want that disparity in the slowdowns? Then there's another question over here, right? What do we want the system to provide? That's always an important question to ask. But let's defer that for a while. Why is this important? I'm going to skip the slide because we talked about this yesterday. Basically, we want to put many applications on the system together, and we want the system to be controllable as well as high performance at the same time. We don't want this sort of performance loss happening. Okay, I'll give you the answer, unless somebody has the answer based on what we discussed yesterday. Anybody want to venture? Some people ventured privately. They all had good points, but it was not the right answer for this system. Yes. That's right, yes. Yeah, that's actually the reason. I'll explain exactly why it is the reason in a little bit, but it's because of basically what we call resource sharing. We added more cores into the system, and maybe we added more caches. Caches are where you store the data that you recently accessed, and we're going to see them a lot later on. But this part of the system is still shared between the two cores. This part is memory in this case; DRAM is Dynamic Random Access Memory, which is the memory you have in almost all your devices, the external memory. These cores, if they access memory, can collide with each other, if you will, and if the memory controller is not doing a good job of scheduling their requests, it may actually be causing some unfairness. Let's take a look at it, basically. We're running MATLAB and GCC on the two cores. They send some requests, and it turns out MATLAB is a very intensive application. It basically sends lots of requests at the same time. Why? It may be demanding a lot of data, right? Maybe you've written a huge matrix in MATLAB, and you're doing matrix multiplication across two matrices, and you're doing a lot of operations, but also a lot of data movement to actually get those matrices to be multiplied inside the processor over here. So it's requesting a lot of these little things, which are memory requests. It's requesting a lot. On the other hand, this poor GCC is only once in a while requesting things. Maybe it's not that intensive.
It gets the data, and it does a lot of operations on a single piece of data, right? Or maybe two pieces of data. And then it writes it back. And then it gets another piece of data, does lots of operations, and then writes it back, instead of requiring a lot of data like MATLAB. Now, it's good so far. The memory controller receives these requests from both cores. Remember, this is parallel processing. Both of them are executing at the same time. Now, what does the memory controller do? It turns out it prioritizes MATLAB's requests. Another magic, right? It turns out the memory controller keeps prioritizing MATLAB's requests and ignoring GCC's requests for a long time. You can see that GCC is getting denied service. So clearly, there is an unfairness going on in the memory controller. In this particular case, the problem is really the unfairness that the memory controller is putting into the system. It's unfair against an application like GCC. Okay, we've found the problem. Can you tell me why it's unfair? This is actually the behavior of many memory controllers, circa 2006, and a lot of memory controllers until now, though they've been changing quite a bit. That's a tough one, because you need to know exactly what's going on underneath. Yes, across banks. That's certainly true at some points during the execution, but that's not really the reason for the unfairness. Okay, I don't expect you to answer this, but okay, one more shot. Go. I see. So that's a good argument. That would be a bad design for the memory controller, but it's not doing that, yes. The memory controller designs were a little bit smarter. But I agree that that could be the reason for this; basically, if you're always prioritizing core one and not core two, you could get this behavior, but that was not the reason in this case. Okay, good. You guys are thinking. Thinking is the most important part. That's the most important point of this lecture, actually: thinking critically. You may not understand exactly what's going on here, but you will at the end of the class. Yes, one more. That's a very good point also, actually. If you didn't hear what your colleague said: basically, there might be a queue in the memory controller that's of finite size, and because MATLAB is generating requests at such a high rate, it fills up that queue, and GCC is not able to enter that queue. That's actually partially the reason, but it's not the full reason, because eventually GCC actually does get something into the queue, and the problem is actually much worse than just the queue size. There's something else going on that makes it worse. But this is good, you guys are thinking. Shall I give you, okay, one more. Big blocks. Yeah, so this is a good point actually. It's possible, but it doesn't happen in processors, because at least in general purpose processors, all of the requests come at the cache block size, so they're all equal in terms of size. Whereas if you look at your hard disks, for example, that's very different. In hard disks you can request very, very large blocks, whereas some other applications may request smaller blocks, and as a result you might get unfairness. Okay, let me move on. Let's think a little bit deeper. I'm gonna scare you a little bit more. Let's understand what's going on in a single bank. For now I'm going to focus on a single bank, because that's one of the big causes of unfairness in this controller. And hopefully this will foreshadow what's coming later in the lectures and why understanding this is important.
So a single bank, you can abstract it as a two-dimensional array of rows and columns, rows and columns of memory cells. Imagine every single cell storing one bit. Of course this is a cartoonish picture, so I'm not showing you the exact size. So this is really an abstraction. Internally a bank consists of many cells, transistors and capacitors, which we're going to talk about also in the second mystery, and other structures that enable access to the cells. But for now we'll keep this high level abstraction. In order to access this bank, you need to bring a row into what is called a row buffer. Again, this is an abstraction. There's a reason why this exists. At least in DRAM you need this row buffer, and you have to go through this row buffer because when you access data, you need to sense the data. So it's very low level circuits that we will build up from. You need to sense the data, and the job of the row buffer is to sense and amplify the data such that you can actually read it with the memory controller. But for every piece of data that you need to access, you first need to bring the row that contains the data into the row buffer. So let's take a look at it. Initially this row buffer is empty. There's nothing there. Let's say we want to access an address that's at row zero and column zero. What do you need to do? Basically, you will see these structures later on. First of all, the address needs to go through a row decoder. You need to decode the address to figure out which row really needs to be activated, or accessed if you will. You will build these things, by the way, in this course. But for now I'll keep the abstraction at this level. Row address zero goes in, and the decoder is combinational logic, which you will see in the next lectures, that basically evaluates to this row being activated. This word line over here becomes a one, and all of the other word lines become zero, in terms of voltage levels. And basically you activate all of these cells, which means that the data gets brought, after some time, over these bit lines, which I don't show over here, basically wires, all the way into the row buffer. Now we have row zero in the row buffer. Then the next thing the memory controller needs to do is send the column address to the row buffer. Basically it sends column address zero, and it sends a command saying I want column address zero from this row buffer. And what internally happens is you have a column MUX over here. You guys know about MUXes? How many of you do? Okay, good. If you don't know, you'll learn. A MUX, it's called a multiplexer, is basically a selector, right? You have all these different one, two, three, four, five, six, seven, let's say columns over here. Actually it's much more than seven. Let's assume 128. What the MUX does is, and you will see the implementation of the MUX, you will implement it yourselves in this course, what the MUX does is it takes all of those, it takes an address, and it gives you the column that's at that address. In this case it basically selects from the 128 items that are in the row. It selects item zero and it gives you item zero. It's basically a selector. So it's the opposite of the decoder, if you will. With the decoder, you get the full address and you figure out where to go. With the MUX, you get the address and you pick out the thing that you need from among some number of items. Does that make sense? At the abstract level, that's the operation. We'll go down to the logic level and you'll implement a lot of these.
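To make this abstraction concrete, here is a minimal C sketch of a bank at this level: the row address selects exactly one row to bring into the row buffer, and the column address selects one element out of that buffer, like a multiplexer. The sizes and names are my own illustrative choices, not real DRAM geometry.

```c
#include <stdint.h>
#include <string.h>

#define NUM_ROWS 128   /* illustrative sizes, not a real DRAM's geometry */
#define NUM_COLS 128

static uint8_t bank[NUM_ROWS][NUM_COLS]; /* the 2D array of cells */
static uint8_t row_buffer[NUM_COLS];     /* the sense amplifiers / row buffer */

/* "Row decoder": the row address selects exactly one row (one word line goes high),
   and that whole row is brought into the row buffer. */
void activate(int row_addr) {
    memcpy(row_buffer, bank[row_addr], NUM_COLS);
}

/* "Column MUX": the column address selects one element out of the row buffer. */
uint8_t read_column(int col_addr) {
    return row_buffer[col_addr];
}
```

So accessing row zero, column zero is activate(0) followed by read_column(0), and a later access to row zero, column one only needs read_column(1), which is exactly the row buffer hit we look at next.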
Okay, basically that MUX brings out the data, and now you have the data that you need. Let's see what happens next. Now the interesting thing is the row buffer now has row zero in it. It buffers it. If you want to access the same row but another column, it's already there in the row buffer of the bank. So what the memory controller can do, and the memory controller needs to keep track of this of course, is simply say, oh, I already know that row zero is here. I don't need to activate it again. I don't need to go through the whole row decoder again. I just need to send the column address, because it's a hit in the row buffer. This is another term you should get used to. You just hit in the row buffer because the data is there. It's a cache hit, or a row buffer hit in this case. So I just need to send the column address to get the second data item, and the MUX will give me the second data item. Makes sense? Because the data is already in the row buffer. Now this is fun. Which also means that this second access is much faster than the first access, right? For the first access we need to activate the row and then send the column address. For the second access we don't need to activate the row, which actually takes a long time; we just need to send the column address, okay? Now let's take a look at one more example. Again the processor requests, and think of this as MATLAB, by the way. It's always good to think about that. MATLAB does exactly this, except it doesn't jump to 85. It says zero, one, two, three, four, five, six, seven, eight, nine, 10, dot, dot, dot, 128. That's what we call a streaming application. It streams through memory. It requests accesses sequentially. Now let's say we jump to 85. Again, the memory controller sees, oh, row zero is already in the row buffer. So I'm not gonna activate it, because it's already there. It's a row buffer hit. So I'm gonna send column address 85 to this MUX over here, to the DRAM chip. And the DRAM chip sends the data back. Okay, now let's see what happens if there's another request that doesn't follow this nice row pattern. Let's say we get an access to row one and column zero. Now the memory controller sees this. The memory controller says, oh, row zero is here, but somebody's requesting row one. So I should probably do something with this row zero first, and that something is, basically, it's a conflict. What you want is not in the row buffer, but something else is in the row buffer. It's called a row buffer conflict. You'll see this term later on also with cache conflicts, for example. And the memory controller needs to basically first write the data back. This is an abstraction again; at the circuit level it works differently. But think of the abstraction as: write the data back into the bank, which takes time. The memory controller needs to activate row address one, which takes time. Bring the data all the way into the row buffer, which takes time. And only after that can it issue the column address for that row. It wants column zero. Basically now you know how it works, right? You get the data muxed out, multiplexed out, of the buffer. So you've just seen that this row buffer conflict access is much, much more expensive than the row buffer hit access. It's actually 3x; in some DRAMs it could be 4x. Makes sense? Is that clear at some level? Okay, excellent. So you've just seen that basically the conflict access is much, much more expensive.
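Continuing that little sketch, here is how the controller's decision might look, with completely made-up latency numbers that just preserve the roughly 3x hit-versus-conflict gap mentioned above.

```c
/* Illustrative latencies in nanoseconds; real timings are DRAM-specific. */
enum { COL_LATENCY = 15, ACTIVATE_LATENCY = 15, PRECHARGE_LATENCY = 15 };

static int open_row = -1;  /* which row is currently in the row buffer, -1 = none */

/* Returns the latency of servicing a request to (row, col). */
int access(int row, int col) {
    (void)col;  /* the column only selects data out of the row buffer; it doesn't change the latency here */
    if (row == open_row) {
        /* Row buffer hit: just send the column address. */
        return COL_LATENCY;
    }
    int latency = 0;
    if (open_row != -1) {
        /* Row buffer conflict: close (write back / precharge) the currently open row first. */
        latency += PRECHARGE_LATENCY;
    }
    /* Activate the requested row, bringing it into the row buffer. */
    latency += ACTIVATE_LATENCY;
    open_row = row;
    /* Then send the column address. */
    latency += COL_LATENCY;
    return latency;
}
```

With these made-up numbers a hit costs 15 ns and a conflict costs 45 ns.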
So what do the DRAM controllers do? They basically try to optimize for row buffer hits. They want to maximize the row buffer hits, right? Meaning that if they get a lot of accesses that keep hitting in the row buffer, they prioritize those accesses. Whereas if they get accesses that don't hit in the row buffer, that are not to the row that's open, they just say, oh wait, I'm not gonna service these unless there's no access to the row that's already open in the row buffer. Makes sense? So now you understand, perhaps, the unfairness that's coming. So basically, a row conflict memory access takes significantly longer than a row hit access. And current controllers, as I just said, take advantage of this fact. It's a very commonly used scheduling policy. It's actually older than this 2000 paper. It's called FR-FCFS. You don't need to worry about the acronym here. But basically, they prioritize row hit memory accesses first. If an access is to a location that's already in the row buffer, it's prioritized immediately. If there is no such access, or if there are multiple such accesses, the second prioritization rule is oldest first: pick the oldest one among those accesses that are row buffer hits. And if there is no row buffer hit access, pick the oldest one among all those accesses that are row buffer conflicts. And this is implemented in a lot of your processors today. And the goal, again, is to minimize the latency, or maximize the DRAM throughput. Throughput is the rate at which you can get data from any kind of memory. How do you maximize throughput? You basically prioritize the accesses that take a shorter time. This is actually very well known in computing theory, or scheduling theory in general: shortest job first. If you do shortest-job-first scheduling, you maximize the throughput from a single server. Okay, so that's the idea. So what is the problem? This sounds good, right? We're maximizing the DRAM throughput, maximizing the rate at which we're getting data out of memory. The problem is, this may be good if you have a single core. If you have a single core, maybe that's what you would like to do. But if you have multiple cores, they share this DRAM controller, and now you can be unfair to some applications, right? Basically, if you keep doing row hit first, you can unfairly prioritize applications that have high row buffer locality. If an application keeps hitting in the row buffer, it's always getting prioritized now, right? Whereas this poor GCC that you saw, which unfortunately keeps missing in the row buffer, gets deprioritized. That sucks for GCC, right? It's good for MATLAB, probably. But unless your goal is to always maximize the performance of MATLAB, or always maximize the performance of whatever application keeps hitting in the row buffer, this is not a good goal. This is not a good design, basically. And the second rule, you might say, okay, let's get rid of row hit first, I don't care about it, and let's just do oldest first. This is also an unfair policy, actually, because of the reason one of you stated over there. If you do oldest-first scheduling, an application that's very, very intensive, that generates lots of requests to memory, will queue up its requests in the buffers, so its requests will appear older than those of another application, another poor application that doesn't generate a lot of requests, right? So the application that's not very intensive gets deprioritized unfairly, whereas an application that's very, very intensive gets prioritized. So both of these rules are unfair, actually.
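Here is a minimal sketch of that two-rule prioritization; the request fields and the open-row bookkeeping are illustrative, not how the real hardware is structured.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    int      row;           /* DRAM row this request targets */
    uint64_t arrival_time;  /* when the request entered the request buffer */
    int      valid;
} Request;

/* FR-FCFS-style pick: among valid requests, prefer row buffer hits,
   and break ties (or pick among non-hits) by oldest arrival time. */
int pick_next(const Request *buf, size_t n, int open_row) {
    int best = -1;
    int best_is_hit = 0;
    for (size_t i = 0; i < n; i++) {
        if (!buf[i].valid)
            continue;
        int is_hit = (buf[i].row == open_row);
        if (best == -1 ||
            (is_hit && !best_is_hit) ||               /* rule 1: row hit first */
            (is_hit == best_is_hit &&                 /* rule 2: oldest first  */
             buf[i].arrival_time < buf[best].arrival_time)) {
            best = (int)i;
            best_is_hit = is_hit;
        }
    }
    return best;  /* index of the request to service next, or -1 if none */
}
```

You can already see the unfairness in this little function: as long as the intensive application keeps a request to the open row in the buffer, an older request to a different row never gets picked.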
And as a result of this, what's worse? Now the DRAM controller is vulnerable to denial of service attacks, which means that you can write a really simple program to exploit the unfairness and deny service to other programs that you don't like. Yeah, right? I don't know if you want to do that, but that's what we did basically 10 years ago. Not for profit, actually, but to show that there are problems in the design and that we could actually design things better, for various purposes. Basically, you can write such programs, and I'll show you an example of such a program. This is very simple. If you've ever written a program, you can appreciate this. Basically it's a streaming application. Streaming means you always access sequential locations. You basically initialize two large arrays A and B, and copy each element of B to the corresponding element of A in a sequential manner, and you do it forever. Let's say you pick a very large N over here, such that your caches are ineffective because the arrays are too large to fit in them. So this is a streaming application. It's sequential memory access. It has very high row buffer locality as a result, basically a very high hit rate, and it's very memory intensive. You can ensure that it's memory intensive by writing the program such that every access goes to main memory. Now let's take the opposite application, which is random. This is exactly the same program, except the copying happens randomly. Basically, you don't copy index i of array B to index i of array A; you kind of shuffle, if you will. You pick the index that you copy in a random manner. You don't access things sequentially, you do it randomly. And it turns out this has random memory access, very low row buffer locality, and it's similarly memory intensive, okay? Makes sense? Basically my point is this: the first one is the memory performance hog, and the second one is an application that's basically similar to it, but with exactly the opposite locality characteristics. So what you do in your program actually matters significantly, as you can see. If you do the copying randomly, you'll get much worse behavior than with sequential access patterns. Okay, now let's take a look at the memory controller. I'll show you an example with stream and random; think of them as MATLAB and GCC for now. Actually, I would prefer to call these MATLAB and GCC and change them from stream and random. Okay, rename them in your head to MATLAB and GCC. Let's take a look at what's happening. So MATLAB initially generates, and MATLAB generates lots of requests, sequential requests, while GCC generates few requests that are not sequential. So the memory controller basically opens row zero in the row buffer, accesses it, and then another request comes, from GCC, sorry. And then another request comes from MATLAB, to row zero again. The memory controller has a prioritization decision to make in the request buffer, right? It basically says, I'm going to prioritize this blue request over here because it's going to row zero, whereas this red request is not going to row zero, so the blue request will be serviced much faster. And it prioritizes it. And then GCC keeps sending requests. And then another request comes to row zero. The memory controller again has a prioritization decision when it's done servicing the previous request. Again, it prioritizes the request to row zero. And again. And again. And again. And again. And this poor GCC is waiting, doing nothing, right? And you're hoping your compilation results are coming soon. Well, good luck.
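Going back to the stream and random programs described a moment ago, here is a minimal C sketch of the two; the array size and the way the random index is picked are my own illustrative choices, not the exact code from the original study.

```c
#include <stddef.h>
#include <stdint.h>

#define N (64u * 1024 * 1024)   /* large enough that the arrays do not fit in the caches */

static char A[N], B[N];

/* "stream" (think MATLAB): sequential copy, very high row buffer locality, memory intensive */
void stream(void) {
    for (;;) {
        for (size_t i = 0; i < N; i++)
            A[i] = B[i];
    }
}

/* Tiny xorshift generator, just to pick indices without worrying about rand()'s range. */
static size_t next_index(void) {
    static uint64_t s = 88172645463325252ULL;
    s ^= s << 13; s ^= s >> 7; s ^= s << 17;
    return (size_t)(s % N);
}

/* "random" (think GCC in this comparison): the same copy, but to random indices,
   so it has very low row buffer locality while still being memory intensive */
void rand_copy(void) {
    for (;;) {
        for (size_t i = 0; i < N; i++) {
            size_t j = next_index();
            A[j] = B[j];
        }
    }
}
```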
So this might happen in the cloud, for example, right? If you actually send your compilation or simulation to the cloud, and the company decides to run your application together with MATLAB, let's say, well, good luck getting results. Okay, so you can do the calculations here. Basically, with a row size of eight kilobytes, which is essentially what row sizes are like today, and a request size of 64 bytes, and you will see that 64 bytes is a popular cache line size, cache block size, but you don't need to understand this exactly, you just need to be able to divide eight kilobytes by 64 bytes. Basically, with these parameters, 128 requests of stream are serviced before a single request of random. Now we can calculate the delay, right? 128 times, let's say, 100 nanoseconds to access memory. That's a long time. Does that make sense? Yes. Yeah, so there are some things that I abstracted away over here, because it turns out when you switch between these row buffers, there is an additional penalty. So you get a lot more delay than you would otherwise get if you were running alone. That's the reason. That's a very good question, though. It's a good catch. But it's also dependent on the parameters I chose. This is just to give you, well, it's good critical thinking. Basically, the numbers don't exactly work out if I say they're both taking 100 seconds. So if you actually change those numbers, you'll see a different story. Okay, now, do we know what goes on underneath, at least at some abstract level? Maybe not perfectly, because you don't know exactly how these circuits work, right? But at least you know that there's a resource sharing problem, and the memory controller is picking requests in an unfair manner; it's doing it to do something good, to minimize the latency, but it's resulting in unfairness. So if you know what's going on, how would you solve the problem? What's the right place to solve the problem? I think it's always good to think about that. You have all of these choices, right? Probably electrons are not the right place in this case, but micro-architecture, changing the memory controller, could be the right place. Maybe changing the operating system could be the right place: if you're running a very high priority application, the operating system says this is the only application that I'm going to run on these cores, right? Maybe not very efficient, but if you really care about that application, and if you cannot change the hardware, that might be a good solution, right? It depends on your constraints. Punting to the programmer, saying, programmer, don't write code that's like GCC, is that a good solution? Probably not, right? Programmers will write any kind of code in the end, and the hardware needs to execute that code. You cannot unfairly penalize some code because it's written in some way. Okay, system software we just discussed, but think about this, basically, and I'm not expecting you to come up with solutions. Do you change the memory controller? Actually, I'll talk about the memory controller very briefly. Do you change the circuits? Well, maybe, right? How would you? Maybe you change the circuits such that this latency differential does not exist, but then you may lose performance. There's always a trade-off.
You can say, oh, I'm going to change the DRAM circuit such that everything takes exactly the same amount of time, and then you eliminate this source of unfairness at least, but that equal amount of time may be very long, because now you need to ensure that all of the requests complete within that amount of time. Okay, so solutions come with different trade-offs. Okay, so two other goals of this course are to enable you to think critically, I said that yesterday, and to think broadly also. So there's no single solution at a single layer. There might be solutions at different layers depending on your design point, which we will talk about, which is your goal, like what is your goal in the design, and the solutions that are available to you. So, for example, as I said, if these are already existing systems, tough luck, right? You need to find a solution on an existing system, probably a software solution, if you are not able to change the controller. But if you're designing future systems, maybe you design the controller such that it doesn't have this problem. And that's exactly what other people have done. So this is, I'm gonna put up some readings like this. These are really for the really interested. I'm not recommending you to read these papers. This is for completeness, okay? Your readings, you know which readings you have to do, right? We already assigned those yesterday. I'm gonna briefly talk about that at the end of the lecture also. But basically, after we discovered this, we actually wrote a bunch of works that talk about how to fix this problem. Some of those were picked up by industry, so some of the principles that are described over here are actually in Samsung's SoCs, Systems on Chips. And later on, there's been a bunch of work. So it doesn't work exactly the way I described right now, in case you want to do this experiment, but there are still some cases where you can launch these attacks, if you will, these denial of service attacks. So if you're interested, you can try to do that, maybe at the end of this course. This course will be hard enough that you won't want any extra assignments, if you will. Okay, so the takeaway, basically: breaking the abstraction layers between the components and the transformation hierarchy levels, in this case it's really between the components, understanding how they behave, and knowing what is underneath, enables you both to understand and to solve the problems. I cannot imagine how to solve this problem without understanding what's going on at the very low levels of the hardware. Shall I give you another mystery? Is this fun? Yeah, okay, good. Am I going slowly or too fast? That's fine, okay. You can shout at me if I'm going too fast. Okay, let's look at another example of a mystery. This is one of my favorite topics. Have you guys heard of DRAM refresh? Who's heard of DRAM refresh? Okay, not that many, that's good. So basically, even without knowing anything about how DRAM works: DRAM forgets. DRAM is dynamic random access memory. If you store some data into it and you don't refresh it, if you don't basically access the data and restore it, which is called refresh, right, refreshing your memory, you just lose it, and you lose it fast. In modern DRAM, you need to refresh it. According to the standard, you need to refresh every single DRAM cell every 64 milliseconds. So this phone right now is refreshing its memory. That's why it's losing battery life.
And that's one of the main reasons why it's losing battery life at this point in time. Because I'm not using it, so all it's doing is refreshing its memory. It sounds stupid, right? It is stupid, but that's the technology we have. Well, that's the technology we have for other reasons; we can talk about why. It's very cheap, right? DRAM is a very cheap technology. But it has this problem, which means that the controller needs to send commands to refresh every single row every 64 milliseconds. It's not that you send one refresh command and everything gets refreshed right away; you actually need to send a refresh command to refresh every single row. Well, DRAM in the system is over here, as we've said. Now let's take a look at this problem. Why is this happening? So DRAM consists of what are called DRAM cells. This is basically the storage element. You can store one bit of data in this capacitor over here, and this is the access transistor. One bit is stored inside this thing, if you will. And basically the charge state of this capacitor indicates whether you store a one or a zero. Let's say charged corresponds to one, discharged corresponds to zero. And you have many, many of these cells connected together in a very dense array. So this is one of the rows that I showed you earlier, right? You have rows and columns. Basically, this is a row over here. And if you wanna access an entire row, you need to enable this word line, if you will. Again, you don't need to remember the exact terminology over here. As long as you can reason about things, that's fine. You will see these later on. Basically, because this is capacitor-based and access-transistor-based, it turns out that even if you don't do anything to this, if you don't touch it, the charge on this capacitor leaks through this wire. You have some electrons over here, and you cannot keep the electrons over there. They just escape, right? Because there is resistance and capacitance. There's an RC path, if you will, a resistance-capacitance path through this wire, where the charge keeps leaking. As a result, you need to refresh that charge before it leaks too much, right? Okay? So the DRAM capacitor charge leaks over time. So far so good, right? It's a good enough abstraction level. You don't need to know more, I think. You don't need to know how electrons actually move from one place to another. Okay, so the memory controller needs to refresh each row periodically to restore the charge. This means you need to activate each row every N milliseconds. Remember the activate in the previous mystery? You basically need to activate. We're gonna go into how that activate works later on, in the third mystery, actually. You actually need to apply a high voltage to the word line. But it's not relevant at this point in time. We'll keep the abstraction level. A typical N today is 64 milliseconds. So there are many downsides to this. It sounds terrible, right? Energy consumption, basically. Every 64 milliseconds, you're consuming refresh energy for each refresh. You get performance degradation: when you want to access DRAM, too bad, you cannot access it from this core. MATLAB cannot access it because it's being refreshed, right? Sounds bad. There's a quality of service impact: similar to the performance degradation, you may have long pause times during refresh. While you're refreshing memory, you cannot access it. And it turns out, and I'll give you some results, the refresh rate limits DRAM capacity scaling.
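Just to make the controller's obligation concrete before we look at the scaling numbers, here is a minimal sketch of the baseline policy; the row count and the stubbed-out command functions are illustrative assumptions, not a real controller design.

```c
#include <stdint.h>

#define NUM_ROWS           (1u << 16)     /* illustrative number of rows in a device */
#define REFRESH_WINDOW_NS  64000000ULL    /* the 64 ms retention window from the standard */

/* Stubs standing in for the real DRAM command interface. */
static void activate_and_precharge(uint32_t row) { (void)row; /* one refresh = one activate + precharge */ }
static void wait_ns(uint64_t ns) { (void)ns; /* placeholder for a timer */ }

/* Baseline refresh: every row is refreshed once per 64 ms window, forever,
   with the refreshes spread evenly across the window. */
void refresh_all_rows_forever(void) {
    for (;;) {
        for (uint32_t row = 0; row < NUM_ROWS; row++) {
            activate_and_precharge(row);
            wait_ns(REFRESH_WINDOW_NS / NUM_ROWS);
        }
    }
}
```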
If you go to a DRAM designer, let's say Samsung, Hynix, Micron, the people who make DRAMs, and you ask them what's keeping them up at night, they'll tell you refresh. I don't want to have refresh. Because this is one of the major, major problems in scaling DRAM into smaller technology nodes. So why is this important? Again, DRAM has been very successful. It was invented in the late 1960s at IBM by Robert Dennard; he has a patent from 1968. And since then, it's been one of the most successful technologies, basically. Think about it: for almost 50 years, we've been using DRAM in our computers. Why? Because when we reduced the size of the circuit with smaller technology nodes, we were able to do it well. We were able to pack more cells into the same area. This is called technology scaling. Now, this was going very well, but recently things have become so small. Basically the size of a DRAM cell, the feature size of this DRAM cell that I showed you, is today around 18 nanometers. Very small, right? And cells are so small that this refresh is one of the major bottlenecks. They need to be refreshed more as you scale them to smaller sizes. But let's do some analysis. Hopefully that gives you an idea of why the problem is important. So the problem is, let's say at some point we cannot have DRAM that we can refresh at a reasonable rate, so we're not gonna scale DRAM, we're not gonna have large memories. Now think about the implications of this, especially given that we have so much hunger for data in many, many, many, many applications today, right? We don't have large memories, so what do we do? Applications cannot scale anymore either, okay? But let's do some analysis first, because I like this analysis, it's fun. Imagine a system with one exabyte of DRAM. Does the system exist? It will soon. That's an exabyte; actually I always have to consult somewhere to figure out what an exa is, but it's two to the 60, basically. It's two to the 60 bytes. For reference, one of the largest supercomputers, I think it's Tianhe-2, not anymore probably, had almost a petabyte of DRAM. Peta is two to the 50, I think, right? And Tera is two to the 40. Okay, I think I got it right. If I'm not right, tell me. But two to the 60 bytes of DRAM, and assume a row size of eight kilobytes. Now let's do some calculations. How many rows are there? It's gonna be simple math. You just divide the top one by the bottom one, right? Two to the 60 divided by two to the 13, which is two to the 47, which is still a very large number. It's 128 Tera rows, basically. How many refreshes happen in 64 milliseconds? Basically, it's two to the 47 refreshes every 64 milliseconds, right? Every 64 milliseconds you're doing these two to the 47 refreshes. That sounds bad. Now a harder question, and I'm not gonna give you the answer: what is the total power consumption of DRAM refresh in this system? Well, to be able to get this, you need to take the two to the 47, multiply it by the energy it takes to do a single row refresh, and divide by the 64 milliseconds. And you can find that energy number somewhere. And that, if you're really curious, you'll do this assignment. And what is the total energy consumption of DRAM refresh during a day? Now, once you have the power over here, you multiply it by the length of a day, right? You get the joules after that, okay? Because energy is power times time, right? You know that from your physics classes. So it's a good exercise. I'm not gonna give you the numbers.
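Written out, the structure of that exercise looks like this, with the energy of a single row refresh, call it E_row, left as the number you have to look up somewhere:

$$
N_\text{rows} = \frac{2^{60}\ \text{bytes}}{2^{13}\ \text{bytes/row}} = 2^{47}\ \text{rows},\qquad
P_\text{refresh} \approx \frac{N_\text{rows}\cdot E_\text{row}}{64\ \text{ms}},\qquad
E_\text{day} \approx P_\text{refresh}\times 86{,}400\ \text{s}
$$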
And you'll get brownie points from me if you do this exercise. But I think the numbers will tell you the scale of this, right? And you can do this for your cell phone, for your computer. Maybe the numbers are not as dramatic, but they're still important, especially given that this problem is going to grow. And these numbers are not unimaginable. They're going to happen very soon in the highest performance supercomputers of the world, and there are many of those. And if you think about data centers, there will be a lot more also. So we've done one study: one of my students went on an internship to Facebook, and he did a study of the memory errors in Facebook's entire fleet of servers. Now, unfortunately, I cannot tell you how many servers and how much DRAM Facebook has, because they say if we write that in a paper, it affects their stock price, because they have so much. So it's not only supercomputers, but also big data companies like Facebook, Amazon, Microsoft, and so on. So we're consuming a lot of power on refresh, maybe unnecessarily. Okay, let's look at this. So why is this a scaling problem? Why is this a scaling limit? One of my students actually did this exercise. Basically, he looked at the device capacity. You don't need to know exactly how these numbers get calculated. But basically today we have eight gigabit devices. That's the size of a DRAM device. And this is the percentage of time we spend refreshing the device, the percentage of time the device is not available because it's doing refresh. Around 10%, let's say. Not bad, not that great. But if we keep business as usual, if we keep the same scaling methods that we've used to scale DRAM, into the future, a 64 gigabit device, which we really want in order to store more data, will spend about 46% of its time refreshing. That sounds bad, right? You produce this memory, and 50% of the time it's unavailable because it needs to refresh. If you plot the percentage of DRAM energy spent refreshing, again, today it's still reasonable, like about 20%. Especially when the device is idle, almost all the energy that's spent is due to refresh, actually, and some leakage power. But in the future it's going to get much worse. Basically, if the device capacity is 64 gigabits, half of the energy is going to be spent on refresh. So that sounds bad, basically. So the goal of a system designer is to figure out how to fix this. And there may be many solutions. Again, I'll challenge you with how we solve the problem, but I'm probably going to give you the answers, because I'd like to finish this mystery before we take a break. Is that okay? Okay, and then after the break, the next mystery. Okay, the observation is that today all DRAM rows, all of the rows that we have, are refreshed every 64 milliseconds. Now, a critical thinker would ask the question, do we really need this, right? I'd like to teach you critical thinking. When you see a statement like this, always question it. Do we really need to refresh everything every 64 milliseconds? Now, once you ask this question, there may be many, many solutions, right? Somebody may say, oh, I don't need my data to last more than 64 milliseconds; if I can tell you that, you don't need to refresh that data, right? Or this memory over here is not even allocated; why are we refreshing memory that doesn't even hold data we're interested in? Or, my data may be okay with errors once in a while, right? There is data like that; some video is like that.
There's a lot of data that's gathered across the world, and you run some machine learning algorithms on top of that, and these algorithms are by nature statistical, and they can tolerate errors. Maybe 10 bits going bad is nothing to these algorithms. Of course, you need to do it in an intelligent fashion, but once you ask this question, you can come up with a lot of potential solutions to the problem. But I'll give you one other solution that comes from knowing what's underneath. So I've given you a lot of potential solutions. What if we knew what happens underneath and exposed that information to the upper layers? So let's look at what's underneath. This is DRAM, two to the 60 bytes, and you go and measure every single row in DRAM and figure out how long it can retain data. What do you think this DRAM would look like? It turns out it looks like this. Basically, most of DRAM, an overwhelming majority of the rows, are okay with being refreshed every 256 milliseconds, not 64 milliseconds. There is only a very, very small fraction of rows, and I'll give you the exact number: in a 32 gigabyte system, only 30 rows, 30 times eight kilobytes, 240 kilobytes, need to be refreshed every 64 milliseconds. Everything else, and ignore this part for now, everything else you can refresh every 256 milliseconds. Which means that if you can somehow figure this out and expose it, you can cut your refresh rate by almost 4x. You can eliminate 75% of the refreshes, right? In fact, you can go even finer grained over here. Some rows can retain data for seconds, actually. In fact, people have proposed attacking a system by exploiting the fact that you can read those rows seconds and seconds later, by stealing someone's computer and keeping the memory intact. Okay? So basically, this is the observation. It's fascinating, right? But we're doing this refresh every 64 milliseconds. So why do we have such a profile? I'd like to give this as an aside, because you always want to understand what's going on underneath. Let's go down one more abstraction level. Why do we see this phenomenon? This is basically heterogeneity in the retention time, the data retention time, of DRAM. Basically, manufacturing is not perfect. Not all DRAM cells are exactly the same. Some are more leaky than others, it turns out. Some are really small, some are really large. The small ones leak a lot. Or maybe it turns out that the wiring is not good in one of the DRAM cells, because manufacturing is not perfect, and as a result it leaks much faster, right? And this is called manufacturing process variation. Basically, there's variation across the chip in terms of the process, and as a result you get these different cells. There's also another kind of variation, which is temperature variation. At high temperatures, cells leak much faster. This is true of everything. At high temperatures, humans also sweat much faster, right? You're leaking much faster, in a sense. You're no different from DRAM cells. But at low temperature, cells don't leak as fast. So if you're operating things at low temperature, that profile gets even more skewed: you find even fewer cells that need to be refreshed very often. Okay, that sounds fun, right? So, opportunity: how do we take advantage of this profile? Assume we know the retention time of each row exactly. What can we do with this information? Where do we expose this information? How much information do we expose? It affects a lot of things over here.
Hardware and software overhead, power consumption, verification complexity; I'm not gonna go into all of this. How we determine this profile information is very interesting, actually. It turns out it's not that easy, but I'm not gonna go into the details of that either. Okay, so let me give you this. Basically, let me repeat the observation, because I'm gonna give you a very simple solution. You don't need to understand a lot of the details of it. Basically, we saw that the overwhelming majority of DRAM rows can be refreshed much less often without losing data. And these are real numbers from Samsung. You have this refresh interval. The axis is not that great, but this is 64 milliseconds. This is the number of cells in a 32 gigabyte DRAM that fail if you move the refresh interval to 128 milliseconds: about 30 cells. If you move the refresh interval to 256 milliseconds, you get failures in only about 1,000 cells. Not a whole lot, right? So if you know this information, one solution you can propose is: oh, I'm gonna identify those cells and map them out. I'm not even gonna use them, right? So let's say cells and rows map one to one, because the distribution is random, it turns out. So you have 1,000 cells, which means 1,000 rows, 8 kilobytes each, and 8 kilobytes times 1,000 is 8 megabytes. You basically put 8 megabytes of your memory out of commission and don't use it. That's a very simple solution, right? And you still have 32 gigabytes minus 8 megabytes of memory, but you're refreshing it every 256 milliseconds instead of every 64 milliseconds. So in the end, I would argue that's a gain. And in fact, our results suggest that it's a gain. Well, I guess I gave you the answer to the question, right? Can we exploit this distribution to reduce refresh operations at low cost? And I've given you this, basically: only about 1,000 rows in a 32 gigabyte DRAM need to be refreshed every 64 milliseconds, but we refresh all rows every 64 milliseconds. I gave you one answer. Another answer is: you refresh the weak rows more frequently and all other rows less frequently. So if you want to keep that 8 megabytes, and there are some systems that want that, you basically have a heterogeneous refresh rate. For rows that are weak, you use the more aggressive 64 millisecond refresh rate. For rows that are not weak, that are strong, you use the less aggressive 256 millisecond refresh rate, with much less energy. So you can read about this. I'm not going to go into it in detail, except that maybe, if we have time later on, I'm going to introduce Bloom filters to you; at least I'll drop the name, because it's good for you to know it in the future. So how does this work? You need to identify the retention time of all DRAM rows; that's called profiling. You need to store rows into bins by retention time. You can imagine how to do that. You'll see a lot of data structures. You can do it with data structures, but the key is you need to do it in hardware, assuming you're doing this in the memory controller, and you'll see the difficulty of hardware design across this course. You need to be very efficient; you cannot afford a lot of storage. So let's assume that you have two refresh rates, 64 milliseconds and 256 milliseconds. Let's say one bit indicates which refresh rate you use. If that bit is zero, you use 64 milliseconds. If that bit is one, you use 256 milliseconds. That requires one bit per DRAM row. So if you have two to the 47 DRAM rows, like in our exabyte example, that's a lot of bits. How can you reduce that storage cost?
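Before getting to how you might shrink that storage, here is a minimal sketch of the naive bit-per-row bookkeeping just described, in the same spirit as the earlier baseline refresh loop; the row count and the stubbed refresh function are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_ROWS (1u << 22)   /* illustrative: ~4 million rows, e.g. a 32 GB system with 8 KB rows */

/* One bit per row: 0 = weak row, refresh every 64 ms; 1 = strong row, refresh every 256 ms.
   Filled in by a profiling step that is not shown here. */
static uint8_t strong_row[NUM_ROWS / 8];

static bool is_strong(uint32_t row) {
    return (strong_row[row / 8] >> (row % 8)) & 1;
}

static void refresh_row(uint32_t row) { (void)row; /* stub: activate + precharge the row */ }

/* Called once per 64 ms window; 'window' counts how many windows have elapsed.
   Weak rows are refreshed every window, strong rows only every 4th window (256 ms). */
void refresh_window(uint64_t window) {
    for (uint32_t row = 0; row < NUM_ROWS; row++) {
        if (!is_strong(row) || (window % 4) == 0)
            refresh_row(row);
    }
}
```

Even in this 32 gigabyte example, the bit array is already about 512 kilobytes of storage sitting in the memory controller, which is exactly the cost the next idea attacks.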
Bloom filters are one answer, but I'm not gonna go into this. It's a very cool idea. I'll give an example of it later on in some of the lectures. But basically, you use some mechanism to distinguish between these, and you can have very little storage cost. And then the memory controller, when it needs to refresh something, asks, should I refresh this row every 64 milliseconds or every 256 milliseconds? It looks at that bit for that row. If the bit is one, it says, oh, 256 milliseconds. If the bit is zero, it refreshes every 64 milliseconds. Okay, you can think about how to design that circuit later on. But let me motivate this with some results, because in the end all of this affects the bottom line that you get from a computer system, right? Let's assume a system with 32 gigabytes of DRAM, eight cores, and various workloads. I'm not gonna go into the workloads. Workloads are applications, basically. The hardware cost is not that bad. The refresh reduction is almost 75%, as you would expect, right? Because there are very few weak rows, you can actually reduce the refresh rate by 4x, and there are only very few rows that you need to refresh at the original refresh rate, 64 milliseconds. Now, after simulations, this translates to a 16% energy reduction, dynamic energy reduction, in DRAM, and a 20% idle power reduction in DRAM. So you could save 20% power here, which is not unreasonable, right? And a performance improvement of about 9%. Not bad, right? For a single mechanism, it's not bad. It turns out you need a lot of mechanisms that build up like this to actually get a better computer system. Okay, and what's also important is that the benefits increase as DRAM scales in density. These are the benefits for a smaller device, a four gigabit device. But if you look at a 64 gigabit device, which is the future, and you should always think of the future, being an architect, right? The present is the present, it's already gone. But the future, like 10 years later, five years later, that's what you're really architecting for. Remember the buildings yesterday? Some of those buildings are still the most beautiful buildings in the world because they were not architected based on precedent, right? And not based only on the present either. They were architected based on principle. So I think this is one example where a principle is in action. The principle is: don't optimize for the worst case. The worst case is that you need to refresh the worst-case cell every 64 milliseconds, but that's not always true, right? There are many, many cases that are much better than the worst case. So actually identify the differences, put heterogeneity into the system, and optimize for the common case, if you will. The common case is that many, many cells can be refreshed every 256 milliseconds. Okay, so let me give you one more result, and then we're gonna take a break. Basically, for the future, if you have a DRAM chip that has 64 gigabits of cells, this mechanism reduces the energy by 50%, energy per access, and that's a lot. And this is dynamic energy; idle energy reduces even more. And the performance, ignore what this says over here, you don't need to know everything on these slides, but basically the performance improves by 108%. Basically, you double the performance, because refresh becomes a much, much bigger problem in future technology nodes. And again, remember, you're architecting for the future, not for today. Okay, and there are readings here for the really interested, and over here for the really, really interested.
So basically, the second takeaway, I'll finish with this: breaking the abstraction layers between components and transformation hierarchy levels, and knowing what's underneath, knowing the fact that cells have different retention times, data retention times, enables you to solve key problems and design better future systems. And I encourage you to think about that going forward. And there's another takeaway here: here you have cooperation between multiple components and layers, right? Because underneath, the device can give you information saying, oh, these cells need to be refreshed every 64 milliseconds and these other cells need to be refreshed every 256 milliseconds. But what you do with it depends on you at the higher level, right? You could use it in the memory controller, you could use it in the operating system. If you expose it to the programmer, God forbid, probably don't do that because this is too much low-level information for the programmer, then the programmer can potentially take advantage of it also. Okay, I've kept you, let's see. Okay, we'll restart at 14:28. I'll give you 15 minutes. Sure, yeah. I appreciate that, it's better. Okay, I guess I don't have a strong enough voice to reach everyone then, I've figured that out. Okay, we're gonna cover one last mystery and then we'll introduce the course. I guess you've already been introduced to the course, so we don't really need it, but we'll go over some basics. So hopefully we'll have enough time. Were these two mysteries interesting? Yeah, you learned something? Okay, good. Again, don't worry if you don't understand everything. Sometimes I question whether even I understand everything underneath, because you always get fascinated with what happens at the lower levels and it surprises you, but these abstractions are good for you to understand. Okay, let's go through one more example. Has anybody heard of DRAM RowHammer? Only one of you? Oh, it's not that famous yet. Maybe somebody's hammering your computer right now, so you should be careful. Okay, I'll cover this because I think it's really fascinating and fun, potentially even more fascinating than what we've discussed so far. Basically, this is an example of how a simple hardware failure mechanism can create a widespread system security vulnerability. A little bit flip that happens in your memory affects everything else in system security. And if you haven't read it, you can do a search; this is one search that I did, and Wired magazine came up: "Forget Software. Now Hackers Are Exploiting Physics." And that essentially describes it at some abstract level. But basically I'll tell you, again at an abstract level, what happens in modern DRAM chips. Remember, this is like the bank that we've discussed, a bunch of rows. This is another abstraction of it. For a row, the wire that connects across the entire row is called a word line, and you have a bunch of rows, and you have a lot more information in our paper, which you don't have to read. You can stay at this abstraction. So if you want to activate a row, remember the activation that we did in the first mystery, you need to apply a high voltage to that row: activate. And if you want to do something else, you need to remove that high voltage. That's called precharge in DRAM, but you don't need to know about that. Basically think about it as applying high voltage, then low voltage. Activate, close. Activate, close. Activate, close. Activate, close. Activate, close. Activate, close. Activate, close.
If you keep doing that enough times, in probably your laptop or any other computer system that you have with a DRAM chip, if you can do it fast enough, before the cells get refreshed, it turns out that in many DRAM chips you can find errors in adjacent rows, physically adjacent rows. That's called a bit flip, basically. Some bits over here flip: instead of storing one, they now store zero. And you didn't even access them. All you did was access this other row that's next to them. Activate, precharge. Activate, precharge. Activate, precharge. Which is essentially an access. Not even a write. You're not even writing to it, you're reading from it. Activate is essentially a read, as we've seen. So we called this the hammered row, and when we discovered the problem we called these the victim rows. Basically, you're victimizing some rows in DRAM. It turns out that repeatedly opening and closing a row enough times within a refresh interval induces these sorts of errors. They're disturbance errors, and we'll talk about why they happen in a little bit. In adjacent rows, in most real DRAM chips you can buy today. So what does "most" mean? When my students did the experiments, they found that, across chips from three different manufacturers, more than 80% of the chips are vulnerable to this sort of error. And you can read the paper. So why is this interesting? Does newer DRAM turn out to be more vulnerable? We'll get to why that is the case: because the cells are much closer to each other, I'll give you that basically. So you test chips from 2008: you don't see the errors. This axis is when the module or chip was manufactured, and this is the error rate per 10 to the nine cells. You get zeros until 2010. In 2010 you start seeing the first appearance of these errors in DRAM chips. And if you look at DRAM chips from 2012 to 2013, all of them are vulnerable. Basically, all of the chips will have these errors. Sounds bad, right? Well, maybe it's okay if you can tolerate those errors. But maybe not. So why is this happening? That's one of the first questions; there are many directions you can take this once you observe it, right? I guess I've already given you the answer: DRAM cells are too close to each other. And why is this a problem? Because now they're not electrically isolated from each other, right? We've broken the electrical isolation. Whenever you access one cell, you're affecting some other cell a little bit in some way, because the circuits are coupled to each other. So an access to one cell affects the value in nearby cells due to electrical interference between two things: the cells themselves, which is called cell-to-cell coupling, and the wires used for accessing the cells. Remember the word line? The word lines are also coupled to each other. When you apply a high voltage to a word line, another word line gets a little bit of that voltage, because it's too close. So you're inadvertently applying a little bit of voltage to a word line that you're not reading, a row that you're not reading, because things just happen to be so close. That's not what is supposed to happen, but it is what's happening, because things are too close to each other, because we've scaled technology so much. As a result, you're opening that row next to the row you're really accessing, just by a little. And if you do it enough times, you're opening it many, many times. And if there are vulnerable cells inside that row, they leak a little every single time you open it. And if they leak enough times before they get refreshed, you've lost the charge, right? That's the idea.
That's also called cell-to-cell coupling. I guess I've already given you the punch line. When you activate, when you apply high voltage to a row, an adjacent row gets slightly activated as well. Slightly: a little bit of voltage goes there, unfortunately. And vulnerable cells in that slightly activated row lose a little bit of charge. And with RowHammer you keep doing it many, many times, so such cells get drained of their charge. And if the cells are not refreshed before that happens, tough luck. You've lost the data in those vulnerable cells. It doesn't happen to every cell; the cell has to be vulnerable. Now what does that mean? Remember the manufacturing variation that we talked about? Not all cells are equal, because the process is not perfect. Not every DRAM cell is exactly the same as every other. You'll find this many, many times in anything you design, actually. Not everything is exactly the same; things are not absolutely homogeneous. There's heterogeneity. So some cells are more vulnerable to RowHammer effects like this. Some cells are not vulnerable at all. They're very strong, they won't leak charge. Maybe you could turn off your DRAM and still read their data 200 seconds later, maybe even a day later. Some cells are that strong, but there are many cells that are not as strong. Okay, so is this clear at the abstraction level? Does it sound fascinating to you? Yeah, it's pretty fascinating, I think. This shouldn't happen, right? But going forward in technology, that's one of the reasons why we're in these exciting times, going forward in technology we'll see such effects more and more. So knowing what's going on underneath is going to be a whole lot more important in the future, as people have figured out with this RowHammer effect. So basically it turns out a simple circuit-level failure mechanism has enormous implications on the upper layers of the system stack, of the transformation hierarchy. Remember the transformation hierarchy? We're looking at problems really around here, right? Devices and logic; it's really a combination of both. And it's already affecting the users, actually. So let's go a little bit higher in the transformation hierarchy. I've said that you activate and you get this, but can you do this with real programs? Well, it turns out you can. You can actually download this from my group's GitHub. It's a very small piece of assembly code. You don't need to understand everything that's going on here; again, by the end of the course hopefully you will be able to. But you can download this, and what this program does is it basically ensures that accesses to X and Y go to two different rows, and you keep activating them in an interleaved manner. And also, of course, the program is constructed such that the data you bring in from these rows is not served from the cache again. Because if you keep hitting in the cache, for example, you're not able to induce RowHammer, right? RowHammer happens inside the DRAM, so basically you need to bypass the caches. You don't need to understand all of this, but if you're interested, you should think about it. Basically what this simple program does is ping-pong activations between X and Y, and I could do this forever, I guess, maybe not as fast as RowHammer needs to happen. And if the chip is vulnerable, you get errors like this. Okay, sounds fun, right? You could actually do it on your own system.
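The actual code on the GitHub page is x86 assembly, but here is a minimal C sketch of the same idea, just to show the shape of the loop. It assumes addr_x and addr_y have already been chosen so that they map to two different rows in the same DRAM bank, which is the hard part and is not shown here; the flushes are there to force every read to go all the way to DRAM instead of hitting in the cache.

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (x86 SSE2 intrinsics) */
#include <stdint.h>

/* Repeatedly read two addresses in different rows of the same bank,
 * flushing them from the cache so that each read becomes a DRAM activation. */
void hammer(volatile uint8_t *addr_x, volatile uint8_t *addr_y, long iterations)
{
    for (long i = 0; i < iterations; i++) {
        (void)*addr_x;                       /* read -> activate row X */
        (void)*addr_y;                       /* read -> activate row Y */
        _mm_clflush((const void *)addr_x);   /* evict X so the next read misses */
        _mm_clflush((const void *)addr_y);   /* evict Y so the next read misses */
        _mm_mfence();                        /* keep the flushes ordered before the next reads */
    }
}
```

After enough iterations you would scan the neighboring rows and check whether any bits changed; on a chip that is not vulnerable, nothing happens.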
I would recommend, if you're curious, trying out this program. Actually, Google took our program and made it better, so you're probably better off with Google's program; I'll give a link to that later on. So you can observe errors on real systems, and these are some of the errors that my students reported. And it depends on how fast you can access memory. This poor processor over here doesn't access memory as fast, for various reasons, and as a result it gets fewer errors. That's not necessarily a good thing, but you get fewer errors, right? Some other processors can access memory much faster. And this is not specific to Intel and AMD; it is observed on any processor that can access memory fast enough. Okay, so what's even worse is that when this simple thing gets exposed to the higher levels, you can take over an otherwise completely secure system. You do all of the things in your software to secure your system, but if you have this little flaw, someone can take over and get root privileges on your system. Let's talk about how that can be done. I'm not gonna read exactly how it's done. But when my students and I wrote this paper, we said in the first sentence: memory isolation is a key property of a reliable and secure computing system; an access to one memory address should not have unintended side effects on data stored in other addresses. And I think this is a very, very fundamental principle, actually. If you violate this principle, all hell breaks loose. And you can see that all hell did break loose. This is a blog post, not a paper, by Mark Seaborn at Google. About nine months after we wrote the paper, I think, they put up a blog post where they exploited what they called the bug. I wouldn't call it a bug, I would call it a failure mechanism. But basically they exploited this to gain kernel privileges. It's a really fascinating blog post, and they have a Black Hat presentation; if you don't like reading the blog post, you can watch the presentation online. This is copied directly from their blog post over here. They tested a selection of laptops and found that a subset of them exhibited this problem, this RowHammer. They built two working privilege escalation exploits that use this effect, and you can read the blog post, which is really fascinating. One exploit uses these RowHammer-induced bit flips to gain kernel privileges on x86-64 Linux when run as an unprivileged userland process. Which sounds like fun; you can try this also, if you want. And how did they do it? Again, I don't expect you to have the background to figure this out, but I'll give you the high level of what they did. There's something called virtual memory in existing systems, which basically decides whether a program is able to access a given portion of memory, is able to write to that portion of memory. That's called access protection. The operating system, the supervisor, decides this. For example, I cannot access the system's memory as a user, because whenever I try to do a load to a location that's in system memory, the processor says, oh, you don't have access, because there's a bit over there that says you don't have access to this location. Makes sense? Similarly, there's a write enable bit, a read enable bit, dot, dot, dot. So they were able to flip one of those bits, which enabled them to write to a system location, which in turn enabled them to write to any part of memory.
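To make that "bit over there" a little more concrete, here is a rough sketch of a few permission bits in an x86-64 page-table entry; the helper function is purely illustrative, not an OS interface, and the real Google exploit actually flipped address bits so that a page-table entry pointed at a page table, but the principle is the same:

```c
#include <stdbool.h>
#include <stdint.h>

/* A few of the well-known permission bits in an x86-64 page-table entry. */
#define PTE_PRESENT   (1ULL << 0)   /* the page is mapped                 */
#define PTE_WRITABLE  (1ULL << 1)   /* writes to the page are allowed     */
#define PTE_USER      (1ULL << 2)   /* user-mode code may access the page */

/* Illustrative check: the hardware page walk enforces something like this
 * on every access. Flip one of these bits in DRAM and the outcome changes. */
bool user_can_write(uint64_t pte)
{
    return (pte & PTE_PRESENT) && (pte & PTE_WRITABLE) && (pte & PTE_USER);
}
```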
Once you can write to any part of memory as a user, you can do anything to the system, because you can go and change the operating system. Of course, you need to know the structure of how things work, but that's essentially it. So the attack is really clever, because you cannot easily do this: you need to induce these bit flips at exactly those locations where, if the bit flips, you get the access privilege. How they did it is actually amazing security and systems engineering, and a lot of creativity also. Maybe you won't be able to understand their lecture or their presentation right now, but hopefully by the end of this course you'll have a much stronger understanding of it. They had a great understanding of not only the hardware, but also the operating system, all the access privileges, and how things get mapped in the hardware, plus a lot of creativity and probability. Because it's a probabilistic attack, it's not deterministic; you don't always succeed. First of all, you don't always succeed because the machine may not have DRAM that's vulnerable. Second, you may not be able to discover the vulnerable cells. Third, things may not align well enough for you to get exactly that bit to flip. Okay, I already said this basically: they were able to gain write access to their own page table, which enables you to access any part of memory. If you don't know what a page table is, don't worry, it'll come. So later, people thought of this as the RowHammer vulnerability and started drawing pictures like this. One of my favorite things, actually, is this; it's from Twitter, I think. Basically someone said, I'll demonstrate it: if you're locked in here and you want to open that door to escape, you keep banging on this wall over here, and the perturbations caused by that banging eventually, magically, open up that door. I think that's a beautiful description of the problem. And people actually later developed things like getting root privileges remotely on some other system, and getting root privileges on Android, and you can read the papers that people have written; people keep doing this over and over. It's a hard attack to exploit, but it's possible to do. And this one here is potentially another security exploit, or maybe it's the best prevention for RowHammer. That's possible. Okay, so this is the infrastructure that we worked on, where all of this happened. The reason I'm showing this picture to you is because you're gonna be dealing with FPGA boards in this course. You may discover some other things like this in the future if you know what to do with those FPGA boards. Basically, the way we tested for RowHammer was that my students designed a memory controller inside this FPGA board that's able to test memory, and by having many FPGA boards, test lots of memory and figure out its characteristics, including RowHammer characteristics, but also retention time characteristics, as we talked about in the second mystery. So hopefully it'll be fun. I'm showing this so you see that, by knowing these FPGA boards really well and what to do with them, you can actually do a lot going into the future. And one more thing, an aside, not related to RowHammer: these FPGA boards are becoming very, very popular. They're being used in data centers right now. Microsoft, for example, is deploying them in its Azure servers.
You can actually use the FPGAs to accelerate computation. They're using FPGAs to accelerate encryption tasks, system tasks, and also search tasks, because these FPGAs, field-programmable gate arrays, are very good at accelerating particular tasks. Now, you're gonna use FPGAs not for accelerating tasks, but for prototyping your processor, a MIPS processor. But it all starts from there. Okay, now that we know of the RowHammer problem, how do we fix it? Assuming we want to fix it, and I believe many of you want to fix this, right? Some of you will want to exploit it, which is good, and some of you will want to fix it, which is also good, because exploiting it eventually leads to fixing it, right? That's always the game in security. It's also a good way of designing systems: whenever you want to improve a system, the first thing you should do is attack it. What's wrong with the system today? What can I exploit? Okay, let me give you some examples. We're not gonna go over these in detail. Make better DRAM chips: tough luck, right? This is happening because things are too close to each other. You could punt the problem and say the DRAM manufacturers should be fixing it, but it's actually not that easy. Refresh more frequently: actually, that's a good idea, because the problem happens when you do enough activates before things get refreshed. So if you increase the refresh rate, you avoid the problem. But as we've seen in the second mystery, increasing the refresh rate sounds like a bad idea, right? It's a good idea for fixing the reliability problem; it's a bad idea because it costs a lot of energy and power and hurts performance. So now you see a trade-off here, right? To get reliability, you increase your energy and reduce your performance. And that's a very classic trade-off, actually: an energy-reliability trade-off, or a performance-reliability trade-off. Ideally, you don't want to make the trade-off that way. But refreshing more frequently is a real solution, and it may be the only realistic solution that can be employed in systems already in the field today, where all you can really change is the memory controller. I'll give you one example of that. Error correction: commodity DRAM doesn't have error correction, so whenever a bit flip happens, it doesn't get corrected there. But remember the Hamming codes that we discussed yesterday? You could apply Hamming codes to DRAM. And this is actually one of the solution directions that the manufacturers are following. They're basically saying, oh, these issues are too hard for us to deal with otherwise; we're gonna add error correcting codes into the DRAM chip, so that whenever we read something, we also look up this redundant code that tells us whether we need to correct something, whether we have an error in that part. And access counters: another solution is to keep a count of how many times you access a row, and if you do it too much, say, oh, I'm not gonna let that row be activated again for a while. It's possible, and people have proposed that solution, but it turns out to be hard to build, because you need to keep a lot of storage, a lot of memory, to track those counters. So there are downsides to all of these, but again, it depends on what trade-off you want to make. We'll get back to refresh in a little bit. Actually, let's get back to refresh now.
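Before moving on to the refresh-based patch, here is a tiny sketch of what that access-counter idea might look like from the memory controller's point of view; the table size, the threshold, and the function names are all illustrative assumptions, not a real controller interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_ROWS   65536     /* rows tracked in one bank (illustrative size)        */
#define THRESHOLD  50000     /* assumed maximum activations per refresh window      */

static uint32_t act_count[NUM_ROWS];

/* Called before activating a row: returns false if the row has already been
 * hammered too much in this window, meaning the controller should delay it. */
bool may_activate(uint32_t row)
{
    if (act_count[row] >= THRESHOLD)
        return false;
    act_count[row]++;
    return true;
}

/* Called once per refresh window: everything has been refreshed,
 * so the counters can start over. */
void refresh_window_elapsed(void)
{
    for (uint32_t r = 0; r < NUM_ROWS; r++)
        act_count[r] = 0;
}
```

Even this toy version needs one counter per row, which is exactly where the storage overhead mentioned above comes from.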
So Apple's patch, along with others', to the RowHammer problem is, basically, I'll read it over here: a disturbance error, also known as RowHammer, exists with some DDR3 RAM and could have led to memory corruption. They're very careful about the wording, I guess. This issue was mitigated by increasing the memory refresh rates. They could do this because they can release a software patch that changes the refresh rate in the memory controller. If they had hard-wired the refresh rate to be 64 milliseconds in that memory controller, they wouldn't have been able to do this. So this also shows you a point: in designing a system, it's good to have some flexibility. You never know when it will be useful. Because they have a configurable refresh rate in their memory controller, they could release a software patch that changes the microcode in their system such that the refresh rate changes on systems that are already out in the field. Okay. And there are many other patches that were released as well. So let me give you a cheaper solution that my students proposed, which I like. It's called probabilistic adjacent row activation. The idea is very simple: after accessing a row, after closing it, let's say, you activate or refresh its neighbors, its physically adjacent neighbors, with very, very low probability. You can understand this, right? At the high level, this is very simple. And it turns out this gives you a very good reliability guarantee: assuming you set the probability to five out of a thousand, which is not bad in terms of performance overhead, you get really good reliability, much better than hard disks, it turns out. And you can actually change your reliability guarantee by adjusting the value of P. Of course, again, this becomes a trade-off lever between reliability and performance loss, right? You're refreshing more if you increase the probability, and you're refreshing less if you reduce the probability, but you're becoming more reliable if you increase the probability, right? That sounds good, right? And the performance overhead is very low over here. Now, it turns out this is not easy to do in existing systems; going into the future, it actually makes a lot of sense, and some manufacturers are probably implementing it. So RowHammer is interesting because it affects both existing systems and future systems, and the solutions you employ for those would be different. If you know what's going on in the entire system, you can employ these different solutions. Okay, some thoughts. Basically, this may be the first real example of how a simple hardware failure mechanism can create a widespread system security vulnerability. People have long looked for this connection between reliability and security, and there was always interest in it, but this may be the most widespread and practical demonstration connecting the two. And how to exploit the vulnerability and how to fix it both require a strong understanding across the transformation layers. The solution I showed a moment ago is not implementable in systems already in the field; it requires different interfaces between the memory controller and the DRAM. So in the future, it's going to be different. You need a strong understanding of which tools are available to you; for existing systems, can you change the refresh rate? And fixing needs to happen for two types of chips, as I described: existing chips already in the field, and future chips, and the solution mechanisms are different between the two.
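Coming back to that probabilistic scheme for a second, here is a minimal sketch of the idea as a memory controller might implement it; the rand()-based coin flip and the refresh_row call are illustrative stand-ins, since a real controller would use a hardware random source and its own refresh commands.

```c
#include <stdint.h>
#include <stdlib.h>

#define PARA_PROBABILITY 0.005   /* 5 out of 1000 row closes, as in the lecture */

/* Illustrative stand-in for refreshing (activating and precharging) one row. */
static void refresh_row(uint32_t row) { (void)row; }

/* Called every time a row is closed (precharged): with a small probability,
 * also refresh its two physically adjacent neighbors. */
void on_row_close(uint32_t row, uint32_t rows_in_bank)
{
    if ((double)rand() / (double)RAND_MAX < PARA_PROBABILITY) {
        if (row > 0)                refresh_row(row - 1);
        if (row + 1 < rows_in_bank) refresh_row(row + 1);
    }
}
```

Because the coin flip does not depend on how many times a row has been hammered, there are no counters to store, which is exactly why it is cheap; the price is that the guarantee is probabilistic rather than absolute.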
Okay, is this fascinating? Okay. Yes? That's a good question. I'm not sure that fixes the problem, but I like your idea. I think what you're suggesting is, if your data is really critical, protect it by having two buffer rows around it, if you will. It turns out that's difficult to do for page tables, because you can circumvent that effect, and if you read the Black Hat presentation, you'll see that, I think. But I like the way you think; it's good, yeah. Yeah. So that's also possible. I think you're developing good defenses here, and I'm sure attackers will go one step beyond and try to figure out how to get around them. Security is always like that, right? You develop a defense, then an attack. But I like the way you're thinking: spread the page table out such that it becomes more correctable with error correcting codes, right? That assumes you have error correcting codes in the system, but yes, that's good. Okay, let me move on. We don't have a whole lot of time, but we can always talk about these in a recitation session. Let me give an aside and introduce the term Byzantine failures. This goes to the heart of distributed computing theory, which you might see at some point. This class of failures, like RowHammer, is also known as Byzantine failures. I'm not gonna go into detail on why they're called Byzantine failures, but there's this famous Byzantine generals problem that was introduced by Leslie Lamport. They're characterized by undetected erroneous computation. This is the opposite of fail-fast. When you design a system, you really want it, if it's going to fail, to fail fast, and you want to know about it: basically, either it fails with an error and says, oh, I failed and here's the error, or it fails and produces no result, so you can do something about it. Whereas RowHammer gives you undetected erroneous computation, right? That's why all hell breaks loose in the end. And erroneous can actually be malicious; the only distinction is really intent and the mechanisms you use to exploit it. And you can see that this is a reliability problem, but also a security problem, right? It's very difficult to detect and confine Byzantine failures. So when you design a system, you've got to make sure these things don't exist; you do all you can to avoid them. You will potentially be designing dependable systems in some form: do anything you can to avoid Byzantine failures. And this is a human failure mode also, actually. You can apply a lot of concepts from computing to humans as well. If you have humans that trust each other, but one bit flips and that trust is broken, then you have a problem; that's a Byzantine failure, right? And if you're really interested, you can read the Lamport paper on the Byzantine generals problem. Okay, so if you're really interested, this is the paper where we introduced RowHammer for the first time, this is our source code, and this is Google's source code and attack, so you can look at those. Google's source code is actually very interesting because it does double-sided RowHammer: instead of single-sided RowHammer, it hammers a victim row from both sides. Okay, some takeaways before I introduce the course in the last six minutes. There were multiple reasons why I put these three mysteries together, but I think the most important one is that it's really an exciting time to be understanding and designing computing platforms today, much more so than perhaps any time in history, except for the beginning of computing platforms.
There are many challenging, exciting problems in platform design that no one has really tackled or thought about before. RowHammer is one example. Refresh is another example: people have thought about it before, but not as much, because it was not a critical problem. And there are other things. The scale of data is another example: we've never reached the scales of data that we have in systems today. That can have a huge impact on the world's future and human life, dot, dot, dot. And this is driven by the applications on top, the problems on top, a huge hunger for data, big data. I don't like the word, but it exists. Basically this can enable many, many things, right? New applications, solutions to many, many problems we have, and even greater realism: with video, for example, you could perhaps be present at multiple places at once, right? Today we can easily collect more data than we can analyze and understand. Genomics is a very interesting field, for example: today we can sequence so many genomes that we cannot even store them all at this point, let alone analyze them. Platform design is also driven from the bottom: there are significant difficulties in keeping up with that hunger at the technology layer. Basically there are three walls: energy, reliability, and complexity. Performance is always a problem, of course, but energy, reliability, and complexity are also big problems today. And you've seen that, I think, in all three things that we've discussed. The first problem was a quality-of-service problem caused by the complexity of the system. And there are increasingly demanding applications. I'm not going to list all of the applications, but I'd like people to be dreamers, basically: dream, and they will come. For example, in the 1980s there were processors like the MIPS R2000, and people said, why would we want any processor faster than the MIPS R2000? I don't remember the exact frequency of the MIPS R2000, but it was much less than 500 megahertz; it was basically a very simple processor. Clearly we've come a long way, right? This is almost 40 years later, and we still want more powerful processors and more powerful platforms. Because we can dream, and if we have that power, things will come. Machine learning wouldn't have been possible if we didn't have processors more powerful than the MIPS R2000. The artificial intelligence mechanisms that are being deployed in a lot of places wouldn't have been possible. And there are increasingly diverging and complex trade-offs. I'll give you one example of a trade-off that has changed a lot over the course of computing. In the past, like 40 years ago, computation was expensive: if you wanted to do a double-precision floating-point computation, it was perhaps the most expensive thing in the system. Today, the tables have turned a little bit. This is a slide from Bill Dally, who is the chief scientist of NVIDIA right now and also a professor at Stanford. He shows the energy cost of doing things on a chip. A 64-bit double-precision floating-point operation: 20 picojoules. A DRAM access, an external memory access: 16 nanojoules. Now if you do the calculation, pico versus nano, that's roughly three orders of magnitude difference. The memory access is about three orders of magnitude more expensive.
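Just to make that ratio concrete with the numbers from the slide:

$$\frac{E_{\text{DRAM access}}}{E_{\text{FP64 operation}}} = \frac{16\ \text{nJ}}{20\ \text{pJ}} = \frac{16{,}000\ \text{pJ}}{20\ \text{pJ}} = 800 \approx 10^{3}$$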
So if you want to get two pieces of data to do a double-precision floating-point computation, bring them all the way from DRAM, and write the result back to DRAM, does it really make sense to do three DRAM accesses just to perform a 20-picojoule operation, and spend three times something that is three orders of magnitude more expensive in energy? So think about this. This trade-off has changed a lot over the course of computing. And you can see the other numbers over here too, but nothing is as dramatic as an external memory access. Okay, past systems looked like this; this is my cartoonish picture. And we have increasingly complex systems. In modern systems today, we have different types of memories in the system, we'll talk about that later, and some of it is persistent. And we have not only a single processor. We clearly have multiple processors, but we also have heterogeneous processors, different types of accelerators in the system: video engines, audio engines, different kinds of engines. FPGAs, as I mentioned, are being integrated closely, maybe on the same chip. GPUs are already on the same chip with conventional cores. So systems are becoming increasingly complex. How to take advantage of all this is becoming interesting, and also how to connect these things in a much better manner is becoming interesting. So I'll recap quickly. Some goals of this course are to teach, enable, empower you to understand how a computing platform works; hopefully we'll see parts of this, but not all of the complexity. To implement a simple microprocessor from scratch on an FPGA. To understand how decisions made in hardware affect the software. To think critically, which is perhaps the most important part in solving problems. To think broadly across the levels of transformation, and to understand how to analyze and make trade-offs in design. So let me take a few minutes to introduce the teaching staff. Can I borrow two minutes of your time? Okay, I'll introduce myself and everybody else over here. I was hoping this would be shorter, but you know me. I've been here officially since September 2015, but I joined in May. I was at Carnegie Mellon before, I got my PhD from UT Austin, and I worked at multiple places, which I can tell you about. I do research in computer architecture, systems, and bioinformatics, and you can see that over here. Professor Srdjan Capkun, who will take over some of the lectures, is here also; he's from EPFL, and he's actually a co-founder of multiple startups in his area, which is really computer security, different types of security, so you can ask him about that. We will have guest lecturers next week. As Darion mentioned yesterday, he is the head lab teaching assistant; next week, he's going to talk about the labs. So definitely attend the lecture on Thursday, because you need to learn about the labs to be successful. And we'll have Frank Gürkaynak, who will be teaching some of the lectures because he's taught this course in the past, and I think you'll enjoy his teaching. And Darion is responsible for the labs. Don't email him directly all the time, because you may not get a response quickly, but we have a lot of lab assistants; these are tentative, you can look at the slides. So if you need help, write an email to this list, or write directly to us, but again, if you write to the list, it'll be much faster. Okay, I'll stop here. Enjoy your weekend.