All right, it's Friday. People are ready for the weekend, I suppose? OK, so today we're going to wrap up the week on performance. I'll talk about a pretty cool paper. I don't even want to know how few of you have read any of these papers. I get it, it's late in the semester, whatever. But this is a good one. I tried to pick these carefully, so there are no real losers in this bunch. We're going to spend most of today talking about how you scale Linux. It's a couple-year-old paper from the quite famous systems group at MIT, and the goal of the paper was to examine whether traditional operating system architectures could support concurrent workloads. There's an interesting history behind this, but we'll talk about that in a sec.

All right, you guys are currently at 45%. I don't know what's taking you so long. Please do the evaluation. It's OK to say mean things about me. It's OK to say you hate the class, whatever. You don't have to say those things, but you do have to submit the evaluation to get one of the exam questions released. I mean, I want to release the exam questions; let me be blunt about that, because then I don't have to write as many. And it's not just to provide feedback for us, although, like I said, we did some new things this year. Please do the course evaluation. Please. Thank you.

OK. So before we go on, I just wanted to pause again and talk about this slide. I think you could actually teach an entire course, a whole semester, just based on this slide. This is from the Butler Lampson Hints paper that we talked about a little bit last time, that we got to at the end of class. If you look at the original paper, the examples are not necessarily all that great. I would look through this; I would cut it out, tape it onto my laptop, and make sure I understand all of these suggestions. You don't have to necessarily understand his examples, because his examples are drawn from the distant past, and they may involve systems that you don't really understand. But all of these ideas and slogans are things that you can apply to modern systems and learn from.

So who wants to talk about one of these? Pick anything. Last time we had an example of separating the normal and worst case. Anyone want to pick another one out of the bag and talk about it? Yeah. Keep basic interfaces stable. So what does that mean? What do you guys think that means? Keep basic interfaces stable. What section is this in? This is in the section where we're talking about the system's interface. So we broke the discussion down along three axes. Completeness: does the system accomplish everything it's supposed to accomplish? Interface: what is the interface that the system provides? Useful systems provide an interface. A system that doesn't provide an interface is a closed world, inaccessible to the outside. So all the interesting systems you guys are going to build in the future, whether it's a little standalone application or something much, much bigger and grander, will provide an interface. They have to; otherwise, they don't do anything. How do I use the system? How does someone else interact with it? That's the interface. And then implementation: how is the system actually built internally?
And then there were three things that we were interested in. Functionality: does it accomplish what we want to accomplish? Speed: does it perform well? And fault tolerance: does it keep working?

So keeping basic interfaces stable, what does that mean? Anyone want to take a guess? Yeah? OK, that's good. What are the dependencies in this case? Yeah. What's that? OK, so that's interesting. So now we're talking about human-computer interfaces. That's even more interesting. You're right: when I said every system has an interface, maybe that interface is a graphical interface. But how confused would people get, how confused would your parents and grandparents get, if suddenly Windows moved everything all over the place? Which they kind of did, right, in newer versions of Windows? They had this really scary UI, all these boxes in all different sizes, and I'll never understand why any box is the size that it is. Does anyone know? Really? Come on, you guys use Windows more than I do. Why are the boxes different sizes? The tiles, sorry, I'm using the wrong word. It looks like a box to me; it has four corners and a square-ish shape. So why are the tiles, or whatever Windows calls them, different sizes? Can you change the sizes? Oh, OK, that's nice to know. Live tiles. I like this. Live tiles, wow. So the tiles will, like, do things. Anyway, I shouldn't mock Windows.

But the point is, for a while they did have a start menu. You can still get there, right? Isn't there something that's kind of equivalent to the start menu? If I poke through the tiles in the right way, it'll go away. What's that? I think Windows 10 broke that. Oh, OK, yeah, it did. Anyway, the point is that for at least a long period of time, until maybe some recent missteps, Windows had this consistent user interface: down in the left corner, and even I know this, and I don't use Windows, there's this button called Start that does all these other things. So if I get confused and lost, I can always hit that button and have some familiar menu show up that maybe I can use to do what I'm trying to do. So that's a great point.

What about programmatic interfaces? When Windows messes with the start menu, a lot of people get mad because they can't figure out how to use the computer. What's the analogous situation? Yeah, so OK, I have a story about this. Years ago, we had a group that was working on the assignments in this class. And they were very, very diligent. They were good students. They worked hard. They learned a lot. They also did some dumb things. One of the things they did was decide to change the order of the arguments argc and argv: somehow they decided it would be better to pass them in the other registers. And so they broke all of the user-space code, because all the user-space code expects argc and then argv. And so they "fixed" it: they changed the whole C library to expect this new ordering. It was like, why did you do that? Anyway. So that's an example of a stupid change to an interface. But yeah, if I make changes to basic system interfaces that a lot of other code uses, that code stops working. And this is unfortunate, and you guys may be bitten by it in the future. I don't even want to speculate about how much time and energy in the software development world goes into just trying to maintain old interfaces so they still work.
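To make the argc/argv story concrete, here's the contract that group broke. Nothing in this sketch is exotic; it's just the conventional C entry point, the ordering that every piece of user-space code is written against:

```c
#include <stdio.h>

/* The interface every C program depends on: the runtime passes the
 * argument count first and the argument vector second. Swap the two
 * and every existing program misreads its own command line. */
int main(int argc, char **argv)
{
    for (int i = 0; i < argc; i++)
        printf("argv[%d] = %s\n", i, argv[i]);
    return 0;
}
```

The point is that the ordering itself is the interface, and everyone's code depends on it staying put.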
The best story I've ever heard about this, and this is slightly apocryphal, but somebody can certainly try to confirm it using the internet, was that apparently in early versions of Windows there was a bug in the memory manager where it was OK to free things twice in certain cases. So I could free the same object multiple times, and that would be OK. And then they fixed this problem. So when they fixed the problem, what do you think happened? This is clearly a bug, a bug in their malloc implementation; malloc should not allow you to free things twice. However, when they fixed the bug, what happened? What's that? Yeah, a bunch of things broke. And I wouldn't even call it code that was depending on that; these were software applications that hadn't realized they had a bug where they were freeing things twice, because the library was allowing them to do it.

So what do you think Microsoft did? Put the bug back in? Yeah, I like that. Just, hey, that's how malloc works now. Call free as many times as you want. It's more emphatic to call free twice: it's like, free it right now, please. No, you want people to write correct software, so they fixed the bug. But what else can I do? What's that? Oh, no, no, no, I'm not going to help someone else fix their code. That's terrible. And who knows how much of this is out there. What they did is put a special case in: they would look at the name of the executable. This is where it gets really sketchy to me, but I kind of believe that this happened. It was like SimCity or something. So if you ran Sim2000.exe, they would automatically emulate the old behavior in the new allocator. Awesome.

So yeah, you get stuck with things like this. If you build a bad interface and people start to use it, you'd better be prepared to support it forever, because somebody will complain. If these companies try to remove things, like if Chrome, and I'm going to pick on Google a little just so people don't think I'm biased, if Chrome tries to remove some menu that you're used to using, people will complain. And so even if they think there's a better way to do things, it ends up that the old stuff lives on for a long, long time. That's not necessarily a bad thing, and it's not so much a suggestion by Butler Lampson as just a fact. Yeah. What's that? Interfaces are always kind of changing? Do you know how long it's taken us to migrate to IPv6? We're still going. We've known about this problem for decades: there are not enough IP addresses on the internet. Do you know how far we've come in solving it? Not very far. Your own campus network does not support IPv6. We know how to solve the problem. There is a solution. It is accepted by the standards community. Someday, someday, you guys will live to see this, I think: we will have IPv6. But this stuff happens really slowly. Really slowly. Actually, part of what's driving IPv6 right now are the phone carriers; I think Verizon is starting to support it internally for all of their devices. It's going to happen, but it has taken decades. And this is a tiny, tiny little change. Well, look, it's pretty tiny. And it's also totally required. We only have a certain number of IP addresses, and that number is smaller than the number of people on Earth, and that is not a good thing. Someday most people on the planet will be connected to the internet, probably with more than one device, and not having enough addresses for them all is a problem.
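By the way, coming back to the malloc story for a second, here's that bug class in miniature. A minimal sketch; what the second free does is undefined, which is exactly why a forgiving allocator could shrug it off while a fixed one blows up:

```c
#include <stdlib.h>

int main(void)
{
    char *p = malloc(64);
    free(p);    /* fine */
    free(p);    /* undefined behavior: a lenient allocator may tolerate
                   this, a strict one may abort or corrupt its heap */
    return 0;
}
```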
Anyway, even solving a simple, well-understood problem like IPv6 is taking a long time. So yeah, be prepared to be stuck with whatever you decide for a while. You guys should think about that as you're starting startups and building new software and things like that.

All right, any other hints we want to talk about? Josh, pick one. Steve? Use a good idea again. Yeah, that's an implementation hint, right? What does this mean, use a good idea again? Yeah, try to see opportunities to use similar ideas in your code. If you've written and tested one piece of code, it works, it's solving a particular problem, and then you encounter a similar problem later, you have two choices. Option A, I can write a completely new piece of code to solve that similar problem. That piece of code will probably be very similar to the other piece of code. I can even cut and paste, right? Because that's an awesome way to write software. Or what's option B? I can make my existing solution a little bit more general, and then I can use it to solve both problems. So it turns out, I might have mentioned this before, the assignment two solution, I noticed this year, does something very clever: it uses one function to do both read and write. You guys may have noticed as you wrote read and write that they are essentially the same function, with a couple of places where there's a little bit of a difference. So the solution said, hey, rather than having two nearly identical pieces of code that I have to maintain, I'll write one thing that does both read and write, and it takes a flag that tells it whether to read or write. 90% of that code is the same; in a couple of places, the flag controls what to do, but not very often. So yeah, try to see opportunities to reuse code that you already have, even if it requires making the code a little bit more general. The function that does both read and write is more complicated than a function that would do either one alone, but it's less complicated than both.

All right, one more here. Actually, Steve, you had one. Keep secrets. Keep secrets, yeah, I like this. Where is that one? That's a good one. Implementation, yeah. What does that mean? Steve, I like that. Keep secrets from yourself: write really complex code so no one understands it. No, this is not about obfuscated code, although I like that. What does this mean, keep secrets? What's that? It's connected to abstraction, right? So it's related to interfaces. Yeah, I like that. And frequently, that information is about the implementation. So we've talked about this a few times: a good interface does two things well. It tells you what the code does, and it does not tell you how it accomplishes those things, particularly details that you don't need. Because the more the interface tells you about how the code works, the more likely it is that people are going to write code that exploits those observations, even if I want to change those things later. And that means I get locked into doing things a certain way. So if I give you information, say that a particular piece of code works in a certain way, and that's not supposed to be part of the interface, then the interface doesn't guarantee that it's true.
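Going back to "use a good idea again" for a second, here's roughly what that unified read/write shape looks like. This is a hypothetical sketch, not the actual assignment solution; the struct, the names, and the flag are invented for illustration:

```c
#include <stddef.h>

/* Hypothetical sketch: one transfer function handles both directions,
 * selected by a flag, so the 90% of shared logic lives in one place. */
enum xfer_dir { XFER_READ, XFER_WRITE };

struct file {
    char   data[4096];   /* backing store */
    size_t pos;          /* current offset */
};

/* All the shared validation and bookkeeping lives here; the flag is
 * consulted only at the single point where the directions differ. */
static size_t do_transfer(struct file *f, char *buf, size_t n, enum xfer_dir dir)
{
    if (f->pos + n > sizeof f->data)        /* shared bounds check */
        n = sizeof f->data - f->pos;
    for (size_t i = 0; i < n; i++) {
        if (dir == XFER_READ)
            buf[i] = f->data[f->pos + i];
        else
            f->data[f->pos + i] = buf[i];
    }
    f->pos += n;                            /* shared bookkeeping */
    return n;
}

size_t file_read(struct file *f, char *buf, size_t n)
{
    return do_transfer(f, buf, n, XFER_READ);
}

size_t file_write(struct file *f, char *buf, size_t n)
{
    return do_transfer(f, buf, n, XFER_WRITE);
}
```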
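And here's a concrete keep-secrets trap, one we'll come back to in a minute: sort stability. C's qsort interface promises a sorted result and says nothing about the order of equal elements. If today's implementation happens to preserve that order and your code relies on it, you're relying on a secret. A sketch of the safe alternative, enforcing the order yourself by breaking ties on an index you control:

```c
#include <stdlib.h>

struct rec { int key; int seq; };   /* seq: original position, assigned by us */

/* qsort promises a sorted result; it promises nothing about ties.
 * Breaking ties on seq gives us stability we control, instead of
 * trusting a secret of today's implementation. */
static int cmp_stable(const void *a, const void *b)
{
    const struct rec *x = a, *y = b;
    if (x->key != y->key)
        return (x->key > y->key) - (x->key < y->key);
    return (x->seq > y->seq) - (x->seq < y->seq);
}

void sort_records(struct rec *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        r[i].seq = (int)i;
    qsort(r, n, sizeof r[0], cmp_stable);
}
```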
Then when I come along later and re-implement this function, because I want to improve performance or fix it or just change it in some other way, your code breaks, because your code is making assumptions about things that it's not supposed to. So writing good interfaces involves not exposing too much information to the user. Yeah. Well, remember, there's a difference between the source code and the interface. If you're reading the source code of something, you probably are coming away with secrets that you're not supposed to know. And that's why it's important to actually read and understand the interface descriptions themselves. In C, this is your header files; in other code bases, it's the documentation or the description online or whatever. So make sure you understand what the interface guarantees, not just how the code works, because how the code works, someone could come in a day later and decide to change.

Simple example: sorting can be either stable or unstable. The current implementation of a particular function may sort something and not guarantee anything about the stability of that sort. So if you want the sort to be stable, you have to make sure of that yourself. If you notice that the sort happens to be stable and rely on that, then when someone comes along later and changes it, you're in trouble, right?

All right, any other questions about this? We'll move on, because I like the multi-core paper, but I just wanted to spend a little more time with this. Look at these, think about them. This is one of the best compilations of wisdom about how to build software ever put together. What I really wish someone would do, and maybe you guys can do this and get Butler Lampson to sign off on it, is write the modern version of this paper: take all these ideas, find modern examples of them that you guys can connect with, based on systems you're familiar with, and illustrate these points again, because they're all still true. But the examples in the paper are sort of dated. All right, any questions about hints? One, all right, oh, sorry.

Okay, so the scaling paper. This is an interesting paper. This group had been working in this area for years, building a bunch of different systems. On some level, this paper kind of looks like a benchmarking paper, and we haven't looked at one of those yet: they propose a benchmark, and they use the benchmark results in a bunch of places to motivate changes to the system. But what's interesting, and what makes this paper more exciting than a lot of benchmark papers, is that it's also, in kind of a sneaky way, a big-idea paper. Because previous to this, so this is the Parallel and Distributed Operating Systems group at MIT, PDOS. One of the big trends that's been interesting in computing over the last 20 years is what people call the end of Moore's law scaling. How many people have heard of this before? Maybe from me, someone else, okay. Because we're having a harder and harder time, for a variety of different reasons, packing more and more transistors into the same amount of area, how are we making computers quote-unquote faster? What have we been substituting?
For years, you had more and more transistors and faster and faster clock speeds, and then that all stopped, and now we have what? More cores. Not just distributed systems, although that's an interesting perspective. Up until the P4, these were single-core systems. And then suddenly, and this has been kind of a slow-motion thing that's really changing computing, we just couldn't do that anymore. I think we've talked about this before, but you remember those old P4 heat sinks that were like 12 inches tall? That just got to be unsustainable. So now you have 32-core machines, and your smartphone has four cores.

What's fundamentally different about a multi-core system from the earlier systems that some of these operating systems were designed around? Yeah, you have to be able to exploit concurrency. So for example, if your program only has one thread, only does one thing at a time, it gets no benefit from a multi-core system. Zero. It's impossible, because it's acting like it's on a single-core system. So now all of these applications and operating systems have to contend with multi-core machines.

Where this paper becomes interesting is that leading up to it, there was a sense within the systems community that we needed new operating system architectures to deal with this problem. The way we've been doing things is wrong; Linux and all these systems that came out of the single-core era are no longer sustainable; we need radically different architectures. And these guys had proposed some of those architectures. Silas Boyd-Wickizer, his name is a bit of a tongue twister for me, had written papers about these new multi-core architectures before. And so it's funny that a couple of years later, they write this paper, because one of its big ideas is that it's actually not true that we need these radical new architectures. We can achieve good scalability from existing operating system designs if we make some simple changes.

So in a lot of ways, what I love about this paper is that they took a classic benchmarking-based approach. They came up with the benchmark, they ran it, they found problems, they fixed them, they ran it again, and they continued doing this until they achieved acceptable scalability. I think this was a 48-core machine they were using, and at the time that was a pretty high core count; it's still a pretty high core count. And what they were able to show was, hey, it turns out that rather than contemplating a complete rewrite of these entire legacy operating systems, all we have to do is fix some of the scalability problems. And we know how to do it; we have existing tools and techniques, and some of these problems are solvable in easy ways. So that's the big idea here.

So this is how they positioned the paper. There's a sense in the community that traditional kernel designs won't scale on multi-core processors: applications will spend an increasing fraction of their time in the kernel as the number of cores increases. Why is this a problem? Why is that an indictment of existing kernel designs? What's wrong with applications spending more and more of their time in the kernel? What does that represent from an application perspective? Josh? What's that?
No, but again: if I'm running some profile and my application spends more and more time in the kernel, this is bad because... because kernel time is wasted time. That's just overhead, exactly. From an application perspective, I don't want to be in the kernel. I asked the kernel to do something for me; I want it done as fast as possible. So any time spent in the kernel is just overhead from the application's perspective. It is not useful time. And if that time is growing, it means that more and more of what I'm doing is overhead, and that's bad. So that was the problem. Prominent researchers have advocated rethinking operating systems, and new kernel designs intended to allow scalability have been proposed; Corey was work by this same group. This paper asks whether traditional kernel designs can be used and implemented in a way that allows applications to scale.

So this is the goal, and one of the really nice things about this paper is that it has a really, really clear objective: as the number of cores goes up, application performance should scale linearly. I want the applications to get faster and faster as the number of cores goes up. There's also a certain amount of humility in this paper that I've always liked: "we attempt to shed a small amount of light" on the question. One of the co-authors of this paper is a guy named Robert Morris, and I was watching Silas give a practice talk about this paper. I can't remember exactly, I think they had maybe run experiments up to 32 cores at that point. And somebody asked, well, what do you think would happen at 33 cores? And Robert said, we have no idea. We only ran experiments at 32 cores; we're not going to speculate about 33. Now, you could speculate about 33, but this is a very empirical study: if we didn't measure it, we have no idea. Who knows, maybe the computer melts down and turns to ash at 33 cores. 33 is a weird number of cores anyway, right? I wouldn't put 33 cores on a machine. Bad luck or something.

All right, so what do they do? What's the approach they take here? We've already sort of hinted at this. Why are we talking about this paper this week? If you're trying to fix scalability in Linux, what's a reasonable approach, based on what we've been discussing this week? Nobody wants to guess? I should use this as an exam question. So that's the goal: to see these linear increases in performance as I increase the number of cores. Here's what they do. They take applications that should scale well, and they use them to benchmark scaling on Linux. Then they identify and fix the scalability bottlenecks, and then they run the applications again. And they keep doing this until they reach reasonably good scalability.

Again, one of the nice things about this paper is that it's very easy to describe what I want: a line that goes up and to the right, which is performance versus number of cores. What I typically get is a line that starts drooping at the end, because as I add more cores to the system, I get less and less of an increase in performance per core, because of these bottlenecks. Yeah. All the time. Yeah, these guys are certainly familiar with that view. All right, so this is what they do. They measure scalability, and they came up with this benchmark suite, which we'll talk about in a minute.
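Before the setup details, it helps to pin down what measuring scalability means operationally. Here's a sketch of the shape of such a harness, mine, not theirs: run a fixed chunk of per-thread work on k threads for increasing k, and report throughput. Perfect scaling means throughput grows linearly in k; drooping throughput per core is the bottleneck signature they're hunting:

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define WORK_ITEMS 10000000UL   /* fixed work per thread */

static void *worker(void *arg)
{
    volatile unsigned long sum = 0;         /* stand-in for real work */
    for (unsigned long i = 0; i < WORK_ITEMS; i++)
        sum += i;
    return NULL;
}

int main(void)
{
    for (int k = 1; k <= 8; k++) {          /* thread counts under test */
        pthread_t t[8];
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < k; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < k; i++)
            pthread_join(t[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);
        double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        /* with perfect scaling, items/sec grows linearly with k */
        printf("%d threads: %.0f items/sec\n", k, k * (double)WORK_ITEMS / secs);
    }
    return 0;
}
```

Compile with -pthread. This toy worker shares nothing, so it should scale almost perfectly; the applications in the paper are interesting precisely because they don't.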
They have a particular version of Linux; I think it's still more recent than what's running on Timberlake, but I'm not sure. It's from July 2010. 48 cores, sorry, okay, 48 cores. They use the in-memory tmpfs file system, because the point here is to benchmark core kernel performance; they don't want to introduce the disk as a bottleneck, so all the benchmarks run in memory. Has anyone ever used tmpfs before? Well, you should, it's awesome, if you have stuff that wants to act like a file but you want it to be really fast. Actually, the whole website that we use for the class is hosted on tmpfs. Why not? It's fast, it's memory.

So gmake scales well, but the other applications scale poorly, performing much less work per core with 48 cores than with one core. What I want is that as I add a core to the system, say I'm at two cores and I add a third, I should get a 50% performance improvement, and so on. What I get is less. And frequently, as I add more and more cores, it actually gets worse and worse. So there's a sense of diminishing returns: going from one core to two cores might get me a 90% speedup, whereas farther out, the speedup I'm getting falls further and further short of what I want. And then they attempt to understand and fix the scalability bottlenecks by modifying either the applications or the Linux kernel; we'll come back to that in a sec. And as the corollary to Amdahl's law would indicate, once we fix one problem, we see the others more clearly, and they become the bottleneck. So once we fix one problem with the system, we repeat and go on to the next thing.

All right. And so what do they accomplish? One of two things. Either good scalability on Linux up to 48 cores, and nothing past that, no idea what would happen at 49. Or a sense that there is something really wrong. So this is a nice way to approach the problem, and I don't want to presume that the authors had any idea whether it would work out or not. It's entirely possible they could have done the same thing and not been able to fix the scalability bottlenecks. Let's say they did this with 10 applications and were only able to fix the scalability bottleneck for one of them. What would that have indicated? Go back to the introduction of the paper: what's the big question this paper is considering? If I really can't get applications to scale well on Linux, then, exactly, I really do need these radical new designs. That's the point. Do I need the radical new designs or not? If I can fix the scalability problems for most applications, maybe the answer is no. If I can't, then maybe the answer is yes. It's an analysis of whether the kernel design is compatible with scaling, and so on. So how we answer this question depends on two things: first, is it possible, and second, how hard was it? How difficult was it to fix these scalability problems? Because they only handle a certain number of bottlenecks. Their final result suggests that there is no immediate scalability reason to give up on traditional OS designs. That's pretty cool, right? Look, everyone likes to build new things, but think about all the time and energy and blood and sweat and tears that have been invested by probably hundreds of thousands of people in Linux.
And now you're saying we have to give up on Linux because it doesn't scale well? That would be kind of sad. I mean, if we have to do it, we have to do it; the future is the future. But it turns out this works pretty well, so that's nice to know.

Okay, so what is MOSBENCH? What does it sound like, given its name? It's a benchmark; it's actually a benchmarking suite. MOSBENCH consists of a bunch of different applications that should scale well. And this is actually pretty important: one of the things they talk about in the paper is that they chose applications where the application writers have done a lot of work to make them scalable. For example, what type of application wouldn't be in this suite? Well, an application you guys use every day where I wouldn't expect to see better performance as the number of cores increases. How about bash? It's kind of a dumb example, but there are certainly programs that either shouldn't scale at all, because who cares, bash doesn't do anything in the background, or where the maintainers haven't done the work necessary to make the application scale. If the application itself doesn't scale well, forget about the kernel; the kernel's not the problem. So what they did is pick applications that were mature enough, where the maintainers had done enough work to get them to scale well.

So: Exim is a mail server. Has anyone ever set up Exim before? You should do this stuff, it's fun. An in-memory key-value store, wow, is that actually a sentence, an in-memory key-value-store object cache? There we go, got it: memcached. Has anyone ever heard of memcached? Yeah, these are super useful tools. This one is an interesting choice, because here they're acknowledging there's a problem with memcached: it doesn't scale well by itself. So what they do is start a new server for every core, with clients connecting to all of them in parallel. That's how they achieve good scaling on this particular piece of software. Memcached, last time I checked, was used very heavily internally by companies like Facebook to do caching. A web server, Apache; these web servers are supposed to scale pretty well. A database server, Postgres. Has anyone used Postgres before? Okay, thankfully a few more people. Postgres creates a new process per connection: for every independent connection, Postgres forks off a new process. So it should scale fairly well, naturally. Parallel build: gmake, building Linux. How many people have used make before? Okay, thank you, yeah, just make. Make runs things in parallel. If you don't use make in parallel and you have more than one core, please start, because you will waste less of your life that you can then use for other profitable activities. Then there's this file indexer called psearchy; I'd never heard of it before, but it's a tool used to build file indexes. Again, something you can imagine is very easy to scale: I just divide the file system into different parts and use multiple processes or multiple threads to index them in parallel. And a MapReduce library called Metis.

The goal here was to identify problems with kernel scalability, and the way they measure this is what we talked about before.
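To make that measurement concrete: on Linux you can split a process's CPU time into user time, your code, and system time, the kernel working on your behalf. This is the same split the time(1) command reports as user versus sys. A minimal sketch, not the paper's actual methodology:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Report this process's CPU time split into user time (our code) and
 * system time (kernel work done on our behalf). For an application
 * that scales, the system-time share shouldn't grow with core count. */
int main(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("user:   %ld.%06ld s\n", (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("system: %ld.%06ld s\n", (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}
```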
If the application spends more and more of its time in the kernel as the number of cores rises, that's an indication of bad scalability, because that time is overhead, and it's reducing the performance of an application that should scale. So this is interesting. What they point out is that all of these applications scale really badly on multiple cores, this is before they started fixing Linux, except for one. The one that worked well was gmake. Does anyone have any idea why? Think about this application suite, and think about the people who work on Linux. Why would gmake scale really well on Linux? Want to guess, Steve? The kernel developers use gmake all the time. That's probably the tool they run most often; every time they change something, they're rebuilding. It's weird, and this is one of these cases of complete community myopia, but gmake has become, they point out, almost an informal benchmark in the Linux community, just because people use it so much. People go, oh man, that version of Linux made gmake really slow. Well, you know what, the fact is that nobody except you cares about gmake. What they care about is, did my animated cat GIF speed up or not? So it's a weird benchmark, but they point out that many Linux patches include comments like "this speeds up compiling the kernel." So improvements to Linux get aimed at building Linux. It's a little circular, but that's how it is.

All right. For the rest of the applications, the ones that don't scale well, what they found are some pretty classic problems. Again, actually go back and look at this paper, because these are really interesting things to understand from a systems and architecture perspective, and I'm not sure you guys have been exposed to some of them before. Here's one you have been exposed to: overhead due to locking. When multiple threads or multiple tasks on different CPUs are locking a shared data structure, that creates overhead. The locking is serializing: those operations would be able to happen in parallel, except for the fact that they have to grab this lock. Once they have to grab a lock, I'm serializing them. That's what a lock does if I have a critical section. This, I think, you guys understand. Your VM implementation may be slowing down because of this, because of unnecessary or coarse locking.

Here's one that's more fun: writing to shared memory locations. We've touched on this a little bit. Did they teach you guys about cache coherence? How many people have learned about cache coherence? That's very sad; cache coherence is very interesting. I think we touched on this earlier in the semester. Every processor, every core, has its own local cache of memory, sometimes multiple caches; cache architectures vary quite a bit. But definitely every processor has a local cache, sometimes referred to as an L1 cache, that caches memory. Why? Why do I have that? Speed, it's way faster. Remember, caches are one of the classic ways to improve performance. The L1 cache is a lot smaller than memory, but it's a lot faster. Some of that is just because it's literally closer to the processor, but the way I access it is also a little different. So when I try to read or write a memory location, the hardware will look in the local cache first.
If the byte I'm trying to access is not in the local cache, what do you think happens? How do caches work? I try to read a byte of memory, I look in the cache, it's not there, so what do I do? I go to memory to find out what the byte is, but in the process of reading the byte from main memory, what else am I going to do? Put it in the cache. Now it turns out, and maybe some of you know this, I don't just put that byte in the cache. What else do I put in the cache? What's that? Yeah, there's this idea, not that different from a block or a page, called a cache line. A cache line is the granularity at which these caches work. They don't just cache individual bytes; that would kind of be dumb. It turns out memory access has locality, just like page access did, just like file access does. So if I'm going to go get one byte, I get the whole cache line; a cache line might be, say, 32 bytes long. I take the whole cache line and move all of that memory into the cache. Does that make sense? So I'm caching at the level of an object that's a little bit bigger than a byte.

Now, the problem is, once it's in the cache, let's say I write to it. What problems does this create? First of all, what can happen here? Let's say I have four cores and they're all running along, all running the same application, the same code. Over time, what could happen based on the reads generated by that application? What's that? Well, no, I'm just reading right now. Imagine I run the same piece of code on all four cores simultaneously. What's going to happen in the caches? When that completes, what will the caches look like on each core? Yeah, they'll be identical. The objects will be in all the caches. There'll be some cache line that is in every cache; in fact, in the example I just gave, the caches will have basically identical contents.

So what happens when any core tries to write to something that's in its cache? What does the hardware have to do? Yeah: I have to modify the local cache, I have to modify memory, and I was going to have to do that anyway, but I also have to find that cache line in the other caches and deal with it as well. This is one of the reasons that these systems, what we refer to as shared-memory systems, don't scale very well. You might wonder why we don't have 256, 512, 1024-core machines; this is part of the reason. A lot of these cache coherence protocols have N-squared behavior: as I increase the number of cores, I'm doing way more work just to support the coherence protocol. So this is bad. What this can mean is that even if I'm not explicitly grabbing a lock, if I have a bunch of cores modifying the same part of memory, the whole system slows down, because every one of those writes causes the processor to stall until it's sure the update is reflected in every cache. And sometimes it's worse than that. Sometimes I don't modify the object in the other caches; what's something else I could do instead, which produces even worse performance? So: processor A updates something that's in processor B, C, and D's caches.
Option one is I can go find that line in the other caches and fix it. Or option two is I can do what? I can force all of those processors to flush that line from their caches. And then what do they have to do when they read it again? Go all the way to memory. So that's how some of these protocols work, and it's even worse.

Okay, so here's another aspect of scalability; I'm not going to talk as much about this one. There can be a shared hardware cache between the cores. Again, cache architectures vary quite a bit, but imagine each core has its own private cache, and then groups of cores, or the whole socket, share another cache, typically something like an L3. That cache is shared by all the cores. So one scalability problem could be that even if there's no fine-grained sharing of memory between tasks running on different cores, the total amount of memory they're using gets too big for that last-level cache, and it starts to spill into memory. What I see is thrashing in that last-level cache. And it turns out there are plenty of other shared hardware resources to compete over. I can start competing for the memory bus: when I go to memory and actually need some data, there's a bus, there's a protocol, I have to ask for it, there's shared stuff on the way to memory, and that can become another bottleneck. Here's another potential problem: there's just not enough work. There might be too few tasks to keep all the cores busy, so as I increase the number of cores, the program just isn't generating enough work to use them.

Okay, so let me go back and talk a little bit more about some of what they did. Oh, I've got four minutes, okay. Let's talk about thrashing the cache. This is interesting. If I have a case where all the cores are genuinely trying to modify the same byte in memory, there's no way around this thrashing. It's inevitable; it's just a function of the architecture, because, and this is what shared-memory architectures do, I have to make sure that as soon as one core modifies a byte, every other core sees the modification. Does that make sense? However, remember, cache lines are big; say 32 or 64 bytes. So what else can happen? What's a case where I could have a bad interaction between the operations the application is doing and the coherence protocol, even though it's not actually sharing anything? Because remember, any time a cache line is modified, I have to fix that line in every other core. Can you guys construct an example where I'm paying for a lot of unnecessary synchronization? Oh, I thought he was going to raise his hand. Remember, if cores are modifying the same byte of memory, I have to do this; there's no way around it. In that case the cache coherence protocol just becomes a bottleneck. But what's a case where I'm not modifying the same byte of memory, and yet I'd still see the same pattern? Consider the following example: I've got four cores, and each core is modifying a private byte of memory. Each core only ever changes its own byte; no cores share memory between them. All I'm doing is reading and writing a byte associated with my own core, so there's no memory sharing.
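Before I give you the answer out loud, here's the setup in code, and, in the padding, a spoiler for the fix we're about to arrive at. A sketch assuming 64-byte cache lines; the aligned attribute is GCC/Clang syntax:

```c
#include <pthread.h>

/* Each core gets its own counter. With the padding below, every counter
 * sits alone on a 64-byte cache line and the threads scale cleanly.
 * Delete the pad and the alignment and all four counters land on one
 * line, so every increment invalidates the other cores' copies, even
 * though no data is actually shared. */
struct padded_counter {
    unsigned long count;
    char pad[64 - sizeof(unsigned long)];   /* fill out the cache line */
} __attribute__((aligned(64)));

static struct padded_counter counters[4];

static void *worker(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < 100000000L; i++)
        counters[id].count++;   /* private byte, private cache line */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

Compile with -pthread; timing the padded and unpadded layouts against each other makes the effect very visible.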
How can this still happen? Yeah. Yeah: if those four bytes are on the same cache line, every time any core writes to its private byte, what's going to happen? The coherence protocol considers the line modified and invalidates everybody else's copy of that cache line. This is what's called false sharing. There's actually no memory sharing going on, and yet this drives the cache coherence protocol crazy keeping all of these cache lines up to date, because it's not smart enough to know that these updates don't conflict. All it knows is that this cache line is invalid and it has to go do something about it.

So what's the solution? Simple solution, and they apply it in a few places in the paper. I want each core to be able to modify its own private byte of memory without colliding with other cores through the cache coherence protocol. How do I make sure that happens? What was required for this problem to happen? Yeah, Steve, do you have an answer? No, I can't do that, that's way too inefficient. Remember, what's the problem here? No, that doesn't help. The address space is irrelevant here; these could be four tasks that are all part of the same process, all in the same address space. But what was the root of the problem? Where are these four bytes of memory? For this problem to happen, the four bytes have to be in the same what? Cache line. So how do I fix the problem? Move them to different cache lines.

And in certain cases this is really interesting, right? They found data structures where there were private pieces being accessed by different cores, but because of how the C compiler had laid out the data structures in memory, those fields were right next to each other, and the cache coherence protocol was being driven crazy by it. So yeah, put some padding in. There are ways to use this fine-grained information about memory layout to make dramatic improvements in performance, because if I can get those bytes of memory onto separate cache lines, I get perfect scalability; the coherence protocol no longer gets in the way. Any questions about that? Yeah? No, because remember, there are going to be other things on the cache line, right? The point is, I want to make sure the cache lines don't contain several pieces of memory that are logically private but are going to be accessed in parallel. If I move things around, it's not that I'm wasting memory; it just means that that byte is now next to other stuff. Does that make sense? Yeah, just pieces of some other data structures. Yeah, Steve? What do you mean? Yeah? No, you rewrite the code. This is done by rewriting the code itself, making changes to the data structures. In certain cases, there are also ways to get C, the compiler, to behave a little bit differently. And no, this is not done at runtime; it's done at compile time, a one-time change. What they did is find the data structures that were causing the problem and rewrite them to reflect how they were actually shared.

All right, I might use some of the other examples from this paper on the exam, because there are really some cool things here. I will see you guys on Monday. Have a great weekend. If you have ideas about what you want to talk about on Monday, please post them online.
I didn't get very many hits on that, but I have nothing on the calendar. So, we'll see.