For those of you who are not here, since there seems to be an increasing number of such people, I just want to remind anyone who might be experiencing this through some kind of telepresence, or who might hear things from other people, that this material will be on the exam. So you're responsible for the content of these papers. And the slide decks that I've written for these papers aren't necessarily particularly complete, because we're going through the papers in class. So if you want to learn about these papers, if you want to know about the types of things that I might ask about on the final, I would suggest that you actually come to class. Or watch the videos, I guess, but if you're not here, you can't ask questions. Okay, so let's talk about An Analysis of Linux Scalability to Many Cores. So let's start with our announcements. The 100-100 Club is still only one group. We're waiting for new members. People are getting close; there are a couple of 90-point-plus submissions, so those people seem like they're on track to finish up in the next couple of weeks. But you have just over two weeks now, two weeks from Friday. Okay. So this paper has, you know, not a particularly long list of authors for a systems paper. The final three authors, Frans Kaashoek, Robert Morris, and Nickolai Zeldovich, are professors at MIT who are variously famous for doing work in this area on a variety of different topics. The primary author of this paper, Silas Boyd-Wickizer, is a shadowy, mysterious figure whom I've met a few times. He's a very interesting person. I was just trying to figure out what he was doing, because I know he graduated a few years ago. So I'm going to use this as an opportunity to call out to him in cyberspace and say, Silas, where are you? What are you doing? Update your little MIT webpage so that I can find out what you're up to, because I'm sure it's something interesting. So he's a really interesting guy. Just to try to lure him out of hiding, I'll tell a little story about him. So when you write a paper like this and you publish it somewhere, you end up having to go somewhere and talk to other smart people about the work that you've done. And this is sort of a nerve-racking experience. You're getting up in front of a room full of famous people who have done all sorts of work and may not agree with you, may think that your paper is crappy, and may not like your conclusions. And so frequently people are just wildly nervous about this opportunity. And that's particularly true when it comes to the question-and-answer session, because unfortunately systems people can be blunt, shall we say. There are less nice words for it, particularly when they're asking each other questions about their work. And so there's a habit people get into: when you're asked a question in a talk, you have some preplanned, canned set of responses that you use. You take the question, you try to map it onto a question that you're prepared to answer, and then you just start answering that question right away. And that can be effective. Silas doesn't do this, right? When I've seen Silas give talks, people ask him a question and he'll stand there for a minute sort of thinking about it, and there's this awkward silence. And then he'll say something like, no. And it's awesome, because you can actually tell that he's thinking about it. He's not just giving you some sort of pre-prepared, canned response; there's actual thinking going on, and you can sort of see it happening.
Anyway, he's a great guy. So this is a very, very cool paper. I like teaching this paper, mainly because, well, let's start out. What kind of paper is this? The paper falls into a couple of different categories. But how would you categorize this paper? You don't have to use the taxonomy that we've established before. Maybe this is a new kind of paper. But what kind of paper is this? You want to take a guess? You guys read it, you have some idea of what it's trying to do. What kind of paper is this? I'm going to provoke my own Silas-like awkward silences today, yeah. Yeah, so one way of thinking about this is that it's a performance and benchmarking paper. So we've been talking about performance and benchmarking. On Monday we presented all those hints for system design. And I wanted to show you a relatively modern example, this is from 2010, but a relatively modern example of what I think is just a fantastic performance and benchmarking paper. It's people finding problems in systems and fixing them. So if you look at how this paper works, they describe the process, and we'll go through it: we find a bottleneck, we fix it, and we iterate again. So it's a really nice example of choosing benchmarks, and there's discussion of benchmark selection in the paper, and then really they're just turning the crank. There's nothing fancy here. But this is also a big-idea paper. Why? So what's the big idea here? Or you could also sort of describe this as a wrong-way paper. Because this paper feels normal. It just looks like, oh, I just did the normal thing, I just applied very standard, classic performance benchmarking and improvement techniques to Linux. But why is this sort of a radical idea? Or what's the somewhat controversial claim that they're trying to make here? Or the claim that they're trying to rebut? Is it good? Yeah, so, and this is really interesting: one of the things that gives this paper credibility is that the authors are some of the people who had been proposing and arguing in other papers that traditional operating system designs were not scalable. So remember, this is one of the big shifts in computing that's happened over the last decade. We had years and years and years of single-core Moore's law scaling. Single processors got faster and faster and faster and faster. And then at some point, essentially, we just ran out of density. And there are a lot of power issues associated with extremely small transistor sizes. And around that time, and again, this is sort of within your lifetimes, maybe you guys didn't really notice this, but core counts on commodity machines started to go up. So for years and years and years, it was really weird to find a commodity machine that had more than one core in it. Now you can't: your smartphone has four cores. Your desktop has four cores, eight cores, 16 cores. Everything has got multiple cores. And so this was a big shift in how operating systems were designed. And in this case, they're talking about real high-core-count machines. So this analysis was done on a 48-core machine. We've got 64-core servers in the lab. In fact, one of them is running the VMs that you guys use to test your OS 161 assignments. So when that shift took place, there were a lot of people who started looking at some of the same data that they present, showing that traditional workloads are not scaling. So we want, if I buy twice as many cores, I want things to go twice as fast. If they don't, I'm mad. Cores are expensive.
And so there were a lot of people, including some of these authors. Silas wrote another paper on a system called Corey, which was an operating system designed to scale well to multiple cores. But this paper takes a very different approach, which is to say, look, maybe there's nothing wrong with traditional operating system designs at all. Maybe we just need to apply classic performance analysis and performance improvement techniques to these existing systems, and we'll be able to remove the scalability bottlenecks. So I think it's just a really, really interesting argument. And again, the paper is just extremely well thought out, right? So that's really the question, right? Will traditional OS design scale to many cores? So this is from the introduction. There's a sense in the community that traditional kernel designs won't scale well on multi-core processors, that applications will spend an increasing fraction of their time in the kernel as the number of cores increases. What does that mean? Remember, the kernel to an application is just overhead. So spending an increasing amount of time in the kernel: not good. Not what I want. That doesn't mean I'm doing more work, doing more system calls. No, it means that the system calls I'm doing, the times when I need the OS to help me, are taking longer and longer as the number of cores increases. Prominent researchers, such as the authors, have advocated rethinking the operating system, and new kernel designs intended to allow scalability have been proposed. This paper asks whether traditional kernel designs can be used and implemented in a way that allows applications to scale. And then the paper has a certain amount of humility to it, which is also appreciated. They say that they attempt to shed a small amount of light on this question. I was at a meeting where Silas was doing a practice talk; he was presenting this paper at this conference four years ago. And I think somebody asked him something like, well, you've done all this work on 48-core scalability, what can you say about 64-core machines? And Robert Morris, who's one of the co-authors, said, we don't know anything about 49-core machines. So there was this really nice sense that this is the data we have, this is the machine we used, we draw our conclusions, but those conclusions are inherently very limited. I'm not even going to try to predict. Now, a 49-core machine would be weird. Is it like 7 by 7 or something strange? But anyway, the point is that they weren't willing to say anything about even one more core, much less 16 more cores. So what's the approach that is used in this paper? This is a little bit of review; this is something that we talked about just last week. Yeah, Yusuf? Run some benchmarks. OK, so first step: run some benchmarks. These benchmarks are designed to expose scalability problems with traditional kernel designs. Then what do I do? I run the benchmarks. What's the first thing I have to decide after I run a benchmark on this type of system? What's the first branch point here? I run the benchmark. Now, I'm going to run it in a configuration so that I can see how well it scales. So I'll run it on one core, on two cores, on four cores, on eight cores, on 16 cores. What's the first thing that I have to decide? Yeah, does it work? Is it scaling? So they find one benchmark that actually scales well, and it's both interesting and unsurprising which benchmark that is. We'll talk about that in a few minutes. So: run the benchmarks.
Find the ones that have scalability problems. Then what? This is the easy part. What do I do now? I have a benchmark. I've identified that it doesn't scale well. What's the next step in the process? I didn't say it's easy, I said it should be easy to remember. Yeah, that's it. Identify the problem. So understand the application behavior and its use of kernel resources, which allows me to identify the problem. And then the fourth step: fix it. Fix the bottleneck. And then what do they do at the end? They're very careful to say that once I identify a scaling bottleneck and correct it, what do I now need to do before I go identify other problems? Rerun the benchmark. So: I benchmark Linux using applications that should scale well. This is important because I want to identify kernel bottlenecks, not application bottlenecks. There would be no point running bash; bash doesn't take advantage of 64 cores, or even 48. Identify and fix the scalability bottlenecks, and repeat: rerun the benchmark. So in their words, first we measure scalability of the MOSBENCH applications (we'll come back to what those are) on a recent Linux kernel. This was 2.6; this is 2010. 48 cores. And, as a common technique when you're running benchmarks that are designed to focus on one aspect of system performance, they use an in-memory file system in order to make sure that the bottlenecks they're identifying are not in the disk. It's possible that there are file system bottlenecks or disk issues that would prevent scalability, and I don't want to pollute my data with that. So what they did is, all these benchmarks require files, but they mounted a tmpfs file system, which you can do on your own machine, in RAM. It's a way to make a portion of RAM look like a file system, and it's useful in certain cases. So gmake scales well, but the other applications scale poorly, performing much less work per core with 48 cores than with one core. Perfect scalability would mean that however much work I got done on one core, I would get 48 times more work done on 48 cores. And what they see is that I get much less work done per core when I start to increase the core count. Next, we attempt to understand and fix the scalability problems by modifying either the applications or the Linux kernel. In many cases, if you look at their description of the benchmark suite, they modify the applications, or use the applications in a way, that's designed to actually expose kernel bottlenecks. So we'll talk about this a little bit when we talk about the benchmarks. And we then iterate, because once I've fixed one scalability bottleneck, another one emerges, a new one that I hadn't necessarily seen before. So there's a very funny, well, funny, I don't know, sort of obvious, somewhat ironic comment in the paper, I don't know if you guys saw it, about how the best way to get good scalability is just to have a really badly written application, because a badly written application spends a lot of time doing unnecessary stuff in between its calls to the kernel. So if I take your application that doesn't scale very well and make it twice as slow, it may scale a lot better to 48 cores. Of course, the problem is it's still twice as slow. So, removing bottlenecks: if I remove a bottleneck from one part of my system, it's possible that something else quickly becomes a bottleneck, because the performance loss caused by the bottleneck I removed was hiding the other scalability problem. So it's really important that I iterate at each step.
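Just to make that measurement loop concrete, here is a minimal sketch of the kind of experiment they're describing. This is my own toy harness, not MOSBENCH: it pins one worker process per core, has each worker hammer a simple kernel operation (repeatedly opening and closing a file, ideally one that lives on a tmpfs mount), and reports throughput. The file path and the operation count are made up.

```c
/* Toy scalability harness (a sketch, not MOSBENCH): run with 1, 2, 4, ...
 * workers and watch whether the work done per core stays flat. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define OPS_PER_WORKER 200000

static void worker(int core, const char *path) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);     /* pin this process to one core */
    for (int i = 0; i < OPS_PER_WORKER; i++) {
        int fd = open(path, O_RDONLY);           /* kernel work: name lookup + fd allocation */
        if (fd >= 0)
            close(fd);
    }
}

int main(int argc, char **argv) {
    int ncores = (argc > 1) ? atoi(argv[1]) : 1;
    const char *path = (argc > 2) ? argv[2] : "/tmp/scaletest";  /* hypothetical test file */
    close(open(path, O_CREAT | O_WRONLY, 0644)); /* make sure the file exists */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int c = 0; c < ncores; c++) {
        if (fork() == 0) {
            worker(c, path);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;                                        /* wait for all workers to finish */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d cores: %.0f opens/sec total, %.0f opens/sec per core\n",
           ncores, ncores * (double)OPS_PER_WORKER / secs,
           (double)OPS_PER_WORKER / secs);
    return 0;
}
```

Running it as ./scale 1, ./scale 4, ./scale 48 and comparing the per-core numbers is the experiment: a drop in the per-core number as the core count rises is exactly the symptom the paper is chasing.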
So when they're finished, the end result is, well, they repeat this process until probably they ran out of time; I'm sure they could have kept going. But what they say is, for each bottleneck, sorry, for each application, they get to one of two different states. Either the application scales acceptably, or they get to the point where they have a hard problem to solve. So they make the case in this paper that the modifications they make to Linux are fairly straightforward. And again, that's what I really like about this paper: there's no magic. There are a couple of cool ideas in the paper that we'll talk about when we look through their solutions, but a lot of this is just standard, good application and kernel design. And they make the case that their changes are not huge design changes that would involve rewriting large portions of the operating system. They are small, localized changes that have big impacts on performance. And then the analysis is about whether the kernel design is compatible with scaling. So remember, this is the question in the paper: can a traditional OS design scale to 48 cores? That question depends on the results of these experiments. So if a bunch of the benchmarks had gotten to the point where they hit a really, really hard problem, one whose solution would have required redesigning the entire kernel, that would have been considered a negative result. So what did they actually find? What's their claim in this paper? It's a performance and benchmarking paper, but they do have a result. What did they find? I mean, the core question of the paper is, can traditional OS design scale to 48 cores? What do they consider the answer to be for these benchmarks? Fixing these bottlenecks allows the kernel to scale. So their claim is that there is no immediate scalability reason to give up on traditional kernel designs. So this is a really interesting result. They're essentially saying, look, and here's the thing. If you're an academic researcher or you're a graduate student, it is fun... okay, I should back up a minute. If you're a computer systems academic researcher or a computer systems hacker graduate student, there's a certain type of person for whom there's probably nothing they'd rather hear than someone telling them to go design a new operating system. They're like, oh yes, how cool is that? You get to design a new operating system from scratch. Now, there were years in the research community where, if you look at the proceedings of the top conferences like OSDI, where this paper was published, and SOSP, there were a lot of new OS designs. Every year we'd have a couple of papers about a new operating system. And recently that trend has slowed quite a bit. So there are many, many fewer new research operating systems being developed. Now, of course, part of that is because maybe we've solved some of the problems that those operating systems were trying to solve, but I think it's also maybe because it's a huge amount of time and energy that goes into building an operating system from scratch. So you have to have a really interesting idea that you want to test, a really interesting hypothesis, a really interesting new design that you want to experiment with, to justify taking a bunch of really smart people and having them spend years, because it takes that long, building an operating system. However, there are people who want to do that. That's the thing they'd love to do: okay, I'll do it. So I think part of this paper is pushing back on that.
I mean, Linux is a mature system, and maybe it's not as exciting to hack on Linux because a lot of the problems have already been solved for you. But from a practical perspective, if what we care about is performance rather than giving people the chance to write new operating systems, which doesn't necessarily have an enormous amount of social good to it, traditional operating system designs may still be the way to go. Okay. So what's MOSBENCH? Like any performance paper, what do I need to start with? What is MOSBENCH? The answer is somewhat embedded in the name; there's a big hint. Yeah. So it is a benchmark. Or, properly speaking, it's what you would refer to as a benchmark suite. And it's really an application benchmark suite. Remember, these are not micro-benchmarks; they're not stressing little pieces of the kernel. MOSBENCH comprises a set of applications that should be able to take advantage of multiple cores. So MOSBENCH includes things like a mail server. The mail server should be able to take advantage of multiple cores because each time a message is received, it spawns off a new process to handle that message, which involves some file system operations and parsing, figuring out what to do with it, and eventually it lands in some mail queue somewhere on the system. The mail server they use is Exim. I'm curious, how many people have heard of Exim? Okay. What about memcached? Anyone heard of that? Oh okay, that's probably because Facebook is involved with developing it. So memcached is a tool that is used by a lot of dynamic websites; it actually is used for all sorts of things. It's very simple: it's an in-memory key-value store. So memcached allows you to store values and look them up. It's not persistent; if you reboot the system, all the data is lost, so it's really only useful as a cache. But a lot of websites and other types of tools use it to cache data, because it has a very nice interface, and because it's in memory, it's very fast. Now, the interesting thing: they point out that memcached has this inherent scaling problem, which is that there's a global lock that's acquired when it's accessing data. And so what they did, remember in certain cases they had to sort of fix or modify the benchmarks in their suite to get them to expose kernel scalability problems, is they just ran one server per core, and then they had some scheme for load-balancing requests to each server. And this is actually something that memcached supports. If you run memcached, you can run a bunch of different instances of memcached on different machines and have them provide sort of a unified namespace for clients to use. A web server, one of our classic examples going all the way back to the beginning of the class; they configured Apache to use one process per core. A database server, Postgres, which starts one process per connection. So when you open a connection to the database, and for a lot of websites these are long-lived persistent connections that they use to handle multiple requests, Postgres creates a new process. And of course Postgres has a lot of potential places where it can bottleneck inside the operating system, particularly because Postgres databases are stored as files. A parallel build, gmake; this is essentially make. The g is for GNU make and mostly serves to distinguish it from other makes like BSD's bmake, but this is the default make on Linux systems. And they were building Linux on multiple cores. A file indexer, psearchy; this is one I hadn't heard of. Has anyone heard of psearchy? Anyway, I don't know.
I don't index my system, so I don't care. I'm old-fashioned, I use hierarchical directories. And then a MapReduce library called Metis. So those were the ones I hadn't heard of. So do you guys consider these to be good benchmarks for this purpose? Or, put another way, what would make these good benchmarks? What am I trying to expose here? And again, they've made some changes in various places. Yeah, what's that? Well, right, so what I'm trying to do is measure the scalability of Linux on multiple cores. But what's important about these applications? Where do their bottlenecks, where should their bottlenecks, be? So for example, if I had just run memcached and configured it to try to use multiple cores, what would the bottleneck to scalability have been that memcached would have experienced? They mention this in the paper and I said it two minutes ago. Yeah, so memcached has this inherent scalability problem at the application level. I don't care about that; they don't care about that. The paper is not about how to write applications that scale well to multiple cores. The paper is about how to write kernels that scale well to multiple cores. And so the goal here is to identify kernel bottlenecks. And the goal was to pick applications that have done a reasonably good job of trying to scale themselves, so that the kernel is the remaining potential bottleneck. So database servers have always been inherently designed around server-class machines, and so database servers are not new to the idea of trying to scale. The database application logic itself has a lot of tricks it plays to try to avoid cross-core bottlenecks. So what are left are bottlenecks in the kernel. So, when they talk about each one of these bottlenecks, they have a statistic that they refer to repeatedly that, to them, identifies bad kernel scalability. So what number, how do I identify bad kernel scalability from the application perspective? I take the same app, I run it on two cores and then I run it on 24 cores. What's one number that will tell you that the kernel is not scaling well for this particular application? Yeah. Yeah, the percentage of time that the application spends running in the kernel. So remember, I'm running the same applications. They're doing the same things. This benchmark suite is deterministic, so I run it multiple times and it does the same stuff. Maybe that's important to point out; maybe that's not obvious. It's obvious to me, sorry. A benchmark suite, you run it twice, it does the same thing. So for example, for memcached, they would start up a bunch of memcached servers and then do a bunch of requests that would be evenly distributed across all of them, and then they would see how long that took on four cores, and then they would run the same thing again on eight cores. You want the benchmarks to be deterministic, so the system calls that are being made are identical. And if those identical system calls are taking up a larger percentage of my runtime as the core count increases, what that implies is that the system calls themselves are slowing down. They're hitting bottlenecks in the kernel that are causing name lookups or other operations to take longer and longer as they add more cores.
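That "time in the kernel" number is easy to get at. As a rough illustration, and this is my own sketch rather than how MOSBENCH instruments things, you can pull user versus system CPU time out of getrusage() after running a workload; the stat() loop below is just a stand-in workload that makes a lot of system calls.

```c
/* Sketch: what fraction of this process's CPU time was spent in the kernel? */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/time.h>

int main(void) {
    struct stat st;
    for (int i = 0; i < 1000000; i++)
        stat("/tmp", &st);                      /* stand-in workload: lots of system calls */

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);                /* user vs. system time for this process */
    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    printf("user %.2fs  system %.2fs  -> %.0f%% of CPU time in the kernel\n",
           user, sys, 100.0 * sys / (user + sys));
    return 0;
}
```

You can get the same split for a whole run with the time command: the "sys" number growing relative to "user" as you add cores is the symptom of a kernel scalability problem.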
Okay, so I thought this was funny. And again, something that's very interesting, and this has been well known in the Linux community for a long time: why is gmake the only benchmark that scales well? Of all their benchmarks, gmake is the only one where they said, forget it, it works. Why? What's that? Well, again, all these applications have parallelism. I'm starting N different memcacheds; I'm running N instances of memcached, and that's actually N different applications, so there's a great deal of parallelism there. Now, I would agree that there is a lot of inherent parallelism in building things, because there are a lot of different objects I'm creating and so on, but you would think that about a lot of these other applications too. But gmake is the one that scales well. Why? Yeah. No, gmake hits the kernel all the time, right? Opening files, right? Opening files, directory lookups, stuff like that. So gmake, yeah, gmake does a lot of system calls. It's not that. Why? There's a note about this in the paper. Because kernel developers use it all the time, the people that work on Linux. So if you work on Linux, what do you do a lot? What do you guys do a lot? You're working on OS 161. You run bmake, right? You run it over and over and over again. Now, on your systems, that completes really fast. If it doesn't, maybe you're on Timberlake, and I'm sorry. But on your own machines, it should complete very fast; you're building a small kernel. Building Linux can take a long time: minutes now, and it used to be hours to do a full Linux build. And so you have these very impatient kernel developers who are trying to make changes and evaluate them, and they hit this bottleneck all the time. So what do you think they fix? These bottlenecks. In fact, they actually point out that there are a bunch of Linux patches that talk about how this change speeds up compiling the kernel, right? So, you know, again, a little bit of internal frustration: kernel developers fixing their problems instead of your problems. Now, whatever, they work for free to some degree, so they're not paid to fix your problems. But anyway, it's interesting how a little bit of myopia sneaks in here. I mean, I build the kernel a lot, I have to wait for the kernel to build a lot, so what scalability bottleneck do I fix? gmake. Okay. So let's talk about some of the problems. And I think, to the degree that there are real design lessons to take away from this and deep insights about how systems work, and again these are not new, this is a really fantastic distillation of them. Here's one of them. So they say these are all classic considerations in parallel programming. Here are some of the bottlenecks. As they put it: serializing interactions, interactions within the kernel that force serialization of things that should be able to happen in parallel. Okay. Here you go. Classic: locking a shared data structure. And this is interesting, and they point out in various places that with locked data structures, this bottleneck can appear at a variety of different core counts. So, to some degree, imagine I have a lock that I hold very briefly. And actually there's one place in the paper where they fix a spot where a spin lock was being held just to copy some values out of a structure; that turned out to be a bottleneck. Why is this suddenly a bottleneck? Linux up to this point had probably been extremely well tested on two cores, on four cores, on eight cores. Why do some of these locks suddenly become bottlenecks? Or why do locks emerge as bottlenecks at different points when you look at scalability?
Yeah, that's it. Well, sure, right? But you can kind of think of it this way: when I try to acquire a lock on any core, there's some probability that that lock is held by another core. Now, as the number of cores goes up, that probability increases. But that probability also depends on how large the critical section is. So a critical section that's so small that it doesn't cause any scalability problems on a four-core machine can suddenly be a big problem on a 16-core machine, because now that probability has gone up a little bit, and now it's to the point where it's actually slowing things down. So these bottlenecks occur at different points. But this is a classic, classic problem. So here's the second one: tasks may write a shared memory location. Then increasing the number of cores increases the time spent waiting for the cache coherence protocol to fetch the cache line in exclusive mode. This problem can occur even in lock-free data structures. So now I've gotten clever, I've gotten rid of the lock, and I still have the scalability problem. How many people understand what this means? Just raise your hand. I'm going to talk to the person who teaches computer architecture. Okay, let me try to briefly explain it. The machine model they're talking about is what's called a shared-memory multiprocessor. So it's a machine that has multiple cores and provides what's called cache coherence: it guarantees those cores that they will always see the same view of memory. And of course there are ordering questions involved with interleaving memory accesses, but the goal of these systems is to provide each processor with a coherent view of the entire memory space at all times. Why is this hard? Why is this hard when I have more than one core? And why does this get harder and harder when I have more cores? This seems simple: I write a memory location and then the other processor reads it, it gets the same value. Yeah? So that's a fantastic answer. It's actually true for really large systems, where I start to actually care about the speed of light through copper. So he's saying the cores are just different distances away from memory. But that's not the issue on a machine with only 64 cores. What makes this hard? Yeah. No, but that global data structure is called memory. No problem. Why is this hard? If it was just the cores and memory, it wouldn't be that hard. But there's another part of the system you guys are forgetting about. What's in between the core and memory? Who's ever heard of an L1 cache? Don't you guys look at this when you buy machines? L1 cache size, I want it, right? And so there are many levels of cache, and the cache hierarchy can be arbitrarily complicated, but it doesn't matter, okay? For the sake of this argument, you can just consider each core to have a local cache. Now, that cache caches memory. The cache is closer to the core, it's smaller, and it's faster than main memory. You thought main memory was the only thing you used as a cache? No way. Remember, the system is a series of caches. The first cache is the registers; registers are caching memory values. The second cache is the L1 cache. So the L1 cache, which for our purposes is the only cache you have to worry about, is caching memory. Now, the L1 cache is different: every processor has its own cache. So now that you know a little bit about this, what is cache coherence? What does this require? So again, I don't just have memory.
If I just had memory, I'd just worry about reads and writes interleaving in memory, and that would be the end of the story. But because I have caches, what do I need to do? Yeah, yeah, oh my gosh. So you guys may be wondering, why don't we have, like, 1,024-core machines right now? Cores are cheap, right? This is the problem that's actually causing problems for hardware scalability to really, really high core counts. So let's say that all 64 cores are caching. Now, a cache is usually organized into what are called cache lines. A cache line could be, like, 32 bytes; it's usually not organized byte by byte. And we'll come back to this in a minute when we talk about one of the other bottlenecks. So when I do a read from memory, I get the 32 bytes around that read and I load it into my processor cache. So far so good. So let's say that 64 processors have read the same 32 bytes of memory, and that 32 bytes of memory, that cache line, is stored in every cache on every processor on the system. So what happens when one processor modifies its cache? What do I have to do? Yeah. I have to write it back to memory, but what else do I have to do? I have to update every other cache. And this is not simple, right? There has to be hardware to do this. First of all, these cores now suddenly have to communicate with each other. So core B needs to get a message from core A saying, by the way, my core just wrote this value to memory; if you have it in your cache, you might want to update your cache. Cache coherence protocols do not scale. They're terrible. And so what happens here is that if I have sharing, so if two cores are using the same memory, and by that I mean the same cache line, at the same time, that is a scalability bottleneck, and it's certainly a scalability bottleneck on large core-count machines. And on something like a 1,024-core machine, you just never get it to work. You'd spend so long waiting for your cache coherence protocols to finish that you'd never get any useful work done. And so this is one of the reasons why we've gone to things like MapReduce: because we can't build machines with that many cores. We can't get the cache coherence protocols to work. Okay, does this make sense? How many people now think that this makes sense, that sharing memory can cause bottlenecks because of the cache coherence protocol? Okay. It gets worse than that. Okay, so now here's something else. Remember, the cache is a lot smaller than memory. So my tasks may compete for space in a limited-size shared hardware cache, and if increasing the number of cores increases the cache miss rate, this problem can occur even if the tasks never share memory. And this, I think, is even more of a problem when I realize that in certain cases, multiple hardware cores can share the same higher-level cache. So each core may have its own private cache that only that core gets to use, but it may also have a slightly slower but larger second-level cache that maybe four cores get to use. Okay, so this is just getting more and more terrible, right? So here we had a lock, and you could look at it and be like, that's bad, that's a lock, that's bad: get rid of the lock, something's wrong. Then I have shared memory. Okay, and I could say, okay, I know more about caches now; this is a problem. Now I don't even have to share memory; I just have to have a workload that doesn't quite fit in the cache. So tasks may compete for other shared hardware resources. And this just goes down the line, right?
The inter-core interconnect, sending messages between cores, or how about the DRAM interface? I actually may be competing with other cores for the bus to get to memory. So that's another place where I can have scalability problems. And all these things can result in additional cores spending more time waiting for shared resources and less time computing. And really, on some level, all of this is about competing for access to shared resources; it just involves some shared resources you guys didn't know about until now. Okay, and finally, the last thing that can happen, although this isn't that common, is that there may be too few tasks to keep all the cores busy. This is not a problem for the bottlenecks that they identify. I want to say one more thing about cache lines, because I think this is important for you guys to understand. Remember that caches cache chunks of memory, 32 bytes or 64 bytes or whatever. That's a cache line. And cache coherence protocols operate on cache lines, not on bytes. So if any part of the cache line is updated, all the other cores have to refresh that cache line. So what other problem can this cause? There's one case where that cache line contains some piece of a shared data structure that a bunch of cores are trying to use. And in that case, this is sort of unavoidable, right? In order to provide cache coherence, a coherent view of the memory space, I have to do this; there's no alternative. But there's also something that they refer to in the paper as false sharing. What's that? Yeah, so there's something called false sharing, which is when two cores are competing for the same cache line, but the variables that are causing the cache line to be updated are totally unrelated to each other. So imagine I have foo, which is eight bytes, and I have bar, which is eight bytes, and they're both in the same 32-byte cache line. Core A is using foo, core B is using bar. Foo and bar have nothing to do with each other. They're totally separate variables. They're private variables, maybe; they're part of the stack. That's not what would happen, of course, because they'd be farther apart in memory, but whatever. They're variables that the two cores are using, but there's no actual overlap between them. But the cache coherence protocol is going to force updates each time either one of them is modified. So essentially, you can think about it this way: when core B modifies bar, core A is going to have to refresh the cache line, even though those refreshes never change foo, which is the only thing that the process running on core A cares about. Does this make sense? This is something you guys will read about if you read more hardware papers. It's called false sharing, and I'll show you a tiny example of it in a second. From the perspective of the cache coherence protocol, it thinks that there's sharing going on, but there's actually no sharing going on at all. Yeah, so caches are fast. Yeah, you don't want to stop using caches, because caches are really fast, right? So eventually, there are research groups working on massive multi-core machines, like 10,000 cores. The thing you have to do to get those machines to work is give up on shared memory, the illusion of shared memory. So you end up with a system that feels a lot more like a network system, where there's a portion of memory that isn't updated automatically for me. And I may have to do special things to update my view of that memory.
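Before we go on, here's that false-sharing example I promised a minute ago. This is just a toy user-space sketch of mine, not anything from the paper; it assumes a 64-byte cache line, and the iteration count is arbitrary. Two threads bump two logically unrelated counters: in the first run the counters share a cache line, in the second each gets its own, and on a multi-core machine the first run is typically much slower even though no data is ever actually shared.

```c
/* Toy false-sharing demo: the only difference between the two runs is where
 * the counters sit in memory relative to (assumed 64-byte) cache lines. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

/* Two logically unrelated counters that happen to sit in the same cache line. */
static struct { volatile unsigned long foo, bar; } same_line;

/* The same two counters, each padded out to its own line. */
static struct { volatile unsigned long val; char pad[56]; } own_line[2];

static void *bump(void *arg) {
    volatile unsigned long *v = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        (*v)++;                                  /* each increment dirties the cache line */
    return NULL;
}

static double run_pair(volatile unsigned long *a, volatile unsigned long *b) {
    struct timespec t0, t1;
    pthread_t x, y;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&x, NULL, bump, (void *)a);
    pthread_create(&y, NULL, bump, (void *)b);
    pthread_join(x, NULL);
    pthread_join(y, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("counters sharing a line:    %.2fs\n",
           run_pair(&same_line.foo, &same_line.bar));
    printf("counters on separate lines: %.2fs\n",
           run_pair(&own_line[0].val, &own_line[1].val));
    return 0;
}
```

Compile it with something like gcc -O2 -pthread; neither thread ever reads the other's counter, yet the unpadded version pays for cache-line ping-pong on every increment.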
Anyway, back to those massive multi-core machines: I may have to communicate with cores that have a better view of that memory than I do. But this illusion that all the cores see a consistent hardware memory address space, one that's always kept consistent despite all the caches, you just have to abandon that, because it just doesn't scale. Essentially, you can think of cache coherence protocols as scaling with something like the number of cores squared; most of them have some sort of superlinear scaling behavior, and that's why you just run out of performance at some point. Yeah, I don't think GPUs have the same problem. I think GPUs must not be fighting over some sort of global coherent cache. Or the model... I mean, remember, if every core just had its own little private memory space that it was using, I wouldn't have this problem. The problem is created by wanting to give cores a continually coherent view of shared memory. That's the problem. And to some degree, if every core has its own private view of memory, what I have is now a data center, and we know how to program those; it's just a totally different set of tools. Okay, I am almost out of time. And let's see here, zoom in here. You know what, here's what I'm gonna do. I wanna quickly talk about a couple of their solutions, because they're kinda nice, okay? Oh, no, wrong thing. I wish I had done a few more slides for this, but I didn't. Okay, so let's talk about a couple of their solutions that are kind of neat. So there's this great table, and I want to encourage you guys to go through it, because you guys can understand this stuff, right? There's a really neat table where they talk about every bottleneck they found for every application and what they did to fix it. But let's talk about some of the changes they made to Linux. So one of them, briefly, is this idea of multi-core network packet processing. A bunch of these benchmarks use sockets or process network data in some way: the email server, the web server, things like this. Postgres might have clients connecting over sockets. And just to summarize, what they say is that Linux already has a system for establishing per-core packet processing pipelines. So here's what I want. Let's say I have a packet of data to transmit over the network. Ideally, I want the core that receives that data through the send system call to handle that buffer all the way down to the hardware. And they actually talk about the fact that modern network interface cards have support for concurrency at the card level. They have multiple queues that different cores can be using simultaneously to transmit data to the card. Clearly, at some point it needs to be serialized before it's sent out over the ethernet link, because the packets are going to get transmitted in some order, but the cards now actually have some support for this. So here, Linux already has a lot of the support. The problem they identified is that there were cases where, for various reasons, packets end up being handed from one core to another core. So core one gets the send call, but then at some point, for some reason, core five ends up transmitting the packet. And this creates a lot of overhead, because maybe the packet data fit in core one's cache, and now that data has to be flushed and core five has to reload it. And so what they did here is they just made some small changes that made it more likely that the thing they wanted to happen happens, which is that the same core gets the packet and sends it all the way down.
Or, ideally, when I receive a packet: remember, some core talks to the network device and receives the data, and that core should be the core that actually ends up delivering the packet to the application when it calls receive. So that's the goal: I just want packets to move up and down the network stack on the same core. The next idea, I think, is a little cooler, and I'll finish up today with this. It's called sloppy counters. I love that name. So the point here is very simple. Linux, like the operating systems you guys have built, uses reference counts to protect shared resources. Same thing you guys do; these are ideas that our operating systems use as well. When the reference count falls to zero, I know that it's safe to destroy the object. And of course, what they notice is that reference counts create memory hotspots, because every time I acquire a reference or release a reference to the object, it's possible that a bunch of cores now have to communicate about the same memory address. The reference count is stored somewhere, and even if I have some sort of hardware operation that I can use to do an atomic increment or decrement, which I might, so that I don't have to grab a lock, I still have a shared memory access that's hot, because a bunch of cores are accessing it concurrently. And the way they fixed this is pretty clever: each core caches a couple of spare references to popular objects. So in the case where I see a bottleneck caused by the reference counter for a particular object, I create a sloppy counter. The sloppy counter still has one global counter that's used, but what cores do is keep a couple of references around locally as spares. And so the hope is that when multiple tasks are accessing the same resource, they can get a reference from the core they're running on without having to communicate with the shared memory location. Now, and you guys should read this section, it's not very long and it's not that hard to understand, they talk about the fact that if you add up all the counters across all the cores, the number is still the same. But the difference is that I'm avoiding accessing a shared memory location that could potentially become a bottleneck, because of the cache coherence reasons we talked about, just by being a little bit less precise about things. Now, of course, there's a thing that I still need to be precise about. If I'm going to do something like this with a reference count, what do I need to make sure is still true? There's one invariant that I have to preserve if I'm going to do something like this to a reference count. What is that invariant? What is that property? I mean, what's the point of the reference count? The reference count exists so I know when I can safely do what? Deallocate something, right? Maybe I'm cleaning up a cache and I want to know if a particular cache entry is free to be removed. So when I deallocate it, I still need to collect all the sloppy counters together and make sure that they add up to zero. But in the cases where they're using this, they're talking about objects that are used often enough that that's rare. So a lot of this goes back to one of the hints from Butler's paper, Butler Lampson's paper (I'm not on a first-name basis with Butler Lampson), one of the hints that we didn't really dwell on: this idea of separating the average case and the worst case. So in this case, the average case is that my core already has a spare reference for the object I need; here's a rough sketch of the idea.
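This is just my own user-space cartoon of the mechanism, not the kernel code. NCORES and SPARE_BATCH are made-up parameters, and a real implementation would need locking (or disabled preemption) around the central counter and the per-core updates.

```c
/* Sketch of a sloppy counter: the shared "central" count is touched rarely;
 * the common get/put path only touches the calling core's own cache line. */
#include <stdbool.h>

#define NCORES      48   /* made-up machine size */
#define SPARE_BATCH 4    /* made-up batch size for moving spares around */

struct sloppy_counter {
    long central;                                            /* the one shared counter */
    struct { long spares; char pad[56]; } percore[NCORES];   /* padded to avoid false sharing */
};

/* Take a reference on core c.  Average case: consume a core-local spare. */
static void sloppy_get(struct sloppy_counter *sc, int c) {
    if (sc->percore[c].spares > 0) {
        sc->percore[c].spares--;                  /* touches only this core's line */
    } else {
        sc->central += SPARE_BATCH;               /* worst case: refill from the shared counter */
        sc->percore[c].spares = SPARE_BATCH - 1;  /* one reference goes to the caller */
    }
}

/* Drop a reference on core c: grow the local spare pool, and only
 * occasionally hand surplus spares back to the shared counter. */
static void sloppy_put(struct sloppy_counter *sc, int c) {
    sc->percore[c].spares++;
    if (sc->percore[c].spares > 2 * SPARE_BATCH) {
        sc->central -= SPARE_BATCH;
        sc->percore[c].spares -= SPARE_BATCH;
    }
}

/* The true count is the central value minus all the spares squirreled away
 * on the cores; the object may only be freed when that total reaches zero. */
static bool sloppy_is_zero(const struct sloppy_counter *sc) {
    long total = sc->central;
    for (int c = 0; c < NCORES; c++)
        total -= sc->percore[c].spares;
    return total == 0;
}

int main(void) {
    struct sloppy_counter sc = { .central = 1 };  /* object starts with one reference */
    sloppy_get(&sc, 3);                           /* a task on core 3 takes a reference */
    sloppy_put(&sc, 3);                           /* ...and drops it */
    sloppy_put(&sc, 3);                           /* the original reference is dropped too */
    return sloppy_is_zero(&sc) ? 0 : 1;           /* now it would be safe to free the object */
}
```

The thing to notice is that sloppy_get and sloppy_put usually touch only the calling core's own padded slot; the central counter, the shared hot spot, is touched only on a refill, on a flush of surplus spares, or when we need the true total to decide whether the object can be freed.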
The worst case, on the other hand, is either that the core doesn't have a spare reference and has to go get more references for the sloppy counter, or that I'm in the process of deallocating the object. But both of those are cases that don't happen very often, because of how they use this. All right, I think that's where we will stop today. So on Friday, I would encourage you guys to look at the rest of this; there are some other nice things that they do and some nice techniques. On Friday and Monday, we will do some lectures on virtualization, and I will see you then.