Thank you, Marisa, and thank you, Shuaq. As we said, we're going to unravel some RCU usage mysteries, and to get started, this is kind of a roadmap of what we're going to be going through. First off, a "you are here" kind of thing. We're going to be talking about the green boxes, and the white ones as well, on this chart. This is kind of a map of the use cases. We'll do the quasi-reader-writer lock first. If we have time, we'll get to the phased state change. And as a pre-pitch, we expect to have the rest of these in a presentation in February.

Let's look at a use case, a multi-threaded use case. What we have is something more like configuration information, variables a and b. They're just integers, and they're updated occasionally. So there are some external inputs that cause a new a and b to be calculated every once in a while. But a given reader needs to have consistent values. It's okay for it to get something a little bit old, you know, maybe a second or two old, maybe a few hundred milliseconds. But it's really bad if a reader gets an old a and a new b. It has to have the corresponding a and b together. The other piece is that there are a lot of readers, they happen really often, and we need this to be very fast. It's on a fast path. So think of this as a problem you're trying to solve, and of course one approach is to use reader-writer locking. After all, we're mostly reading, writing occasionally. That's what it's designed for.

Here's one way you could solve that. This is using Linux-kernel syntax, give or take. You define a reader-writer lock; its name is my_rwlock. There it is. And we have a couple of global integers, a and b, and those are the current configuration values. We have a get function. You pass it the place to put a and b: cur_a and cur_b are both pointers to places to store the current configuration values. We do a read_lock() on my_rwlock. And once we've got that lock, we pick up a and put it through the specified pointer, then we pick up b and put it through the specified pointer. Because we hold the lock in read mode, we know that no writes can happen. Therefore, we're guaranteed we're getting consistent values of a and b. We release the lock, updates can happen, and life is good. Plus we can, in theory anyway, have concurrent readers with this.

Okay, so let's look at the updater. Straightforward again: we have a set function. It gets just the integers, no pointers needed this time. We have a new a and a new b. We write-acquire the lock, we apply the new values to the a and b variables, and we release the lock. Pretty straightforward. So what's not to like? The semantics are great. This is time-based. We have time going from top to bottom, which will happen in most of these charts; there are exceptions, too. And we have three processes: a pair of readers at the extreme right and left, and a writer in the middle. The reader does its read_lock(), picks up the values, and releases, just as it did before. The writer attempts to do a write_lock(), but has to wait. That's the triple dots there. Once the reader releases the lock, it can go ahead and do the update. Another reader, the one on the right, is now trying to read_lock(), but can't. It has to wait for the unlock. And once that happens, everything's consistent, it can pick up the values, and everybody's happy. And if we look at this, what's happening is that the reader-writer lock is doing two things.
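(For reference, here is a sketch of the get/set pair just described, in Linux-kernel style. The identifier names, my_rwlock, cfg_a, cfg_b, get_config(), and set_config(), are my reconstruction, not necessarily the slide's exact names.)

    #include <linux/spinlock.h>

    static DEFINE_RWLOCK(my_rwlock);
    static int cfg_a;	/* current configuration values, */
    static int cfg_b;	/* guarded by my_rwlock */

    /* Reader: fetch a consistent (a, b) pair. */
    static void get_config(int *cur_a, int *cur_b)
    {
    	read_lock(&my_rwlock);	/* exclude writers, allow other readers */
    	*cur_a = cfg_a;
    	*cur_b = cfg_b;
    	read_unlock(&my_rwlock);
    }

    /* Writer: install a new (a, b) pair. */
    static void set_config(int new_a, int new_b)
    {
    	write_lock(&my_rwlock);	/* exclude readers and other writers */
    	cfg_a = new_a;
    	cfg_b = new_b;
    	write_unlock(&my_rwlock);
    }

With that code in mind, back to the two things the reader-writer lock is doing.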
It's waiting for prior conflicting operations, and we saw the write lock in the middle waiting for the reader to get done. And it blocks subsequent conflicting operations, and we saw the write lock blocking the reader on the right. Okay, so pretty straightforward semantics. And we can apply time, and we've got time in chunks here. So we have time-based mutual exclusion. We have a time for readers, we have a time for writers, and then we have a time for readers again as we advance from top to bottom. We've got a time for a lot of other things, according to a popular song when I was a kid, but that's probably best forgotten. Apologies for not having forgotten it.

Okay, what's not to like? Well, THAT is not to like, okay? Look at that. We've got up to 192 CPUs here, and this is log scale on both axes, so a little bit up at the top is a big chunk. It looks okay to start with: we're somewhere between 10 and 20 nanoseconds per operation. And then as we add threads, it just gets worse and worse. At the end there, with 192 threads, we're looking at somewhere between 7,000 and 8,000 nanoseconds. That is between 7 and 8 microseconds, and that's a lot of clock cycles with today's CPUs. I mean, when I was a kid, that would have been pretty good with the computers we had then, but I'm sorry, this is 2021, not 1973 when I started. Now, this is the worst case in the sense that these are just the lock operations themselves with no critical section: we acquire the lock and immediately release it. But on the other hand, it's read-only. We're doing read acquisitions and read releases. That's the part that's supposed to be fast, okay? And you can see it's not. It's slow and it's also very unscalable. And I don't know about you guys, maybe it's just part of being old, but I'm the kind of guy that, when I add another CPU, wants to get another CPU's worth of throughput out of it. I don't like performance results that scale that badly.

And this is not just lock contention. We know it's not lock contention because it's a reader-writer lock, read-side only, so in theory there's no contention whatsoever. But we have hardware latency, and this graph shows that latency. We have a number of different operations, ranging up to 440 CPUs, which is how many there are on this system. The reason the previous graphs only went up to 192 was that I was running a Linux kernel to get those measurements, as a guest OS, and there are some limitations in what KVM and QEMU allow you to do in terms of the number of CPUs.

Okay, so we've got a bunch of things. We've got a compare-exchange, and we've got a blind compare-exchange. What's the difference? With the plain compare-exchange, we load the value, and we hand that value (with some work in the middle) as the old value to the compare-exchange. With the blind compare-exchange, we hand it a constant. We say, for example, if it was zero, make it be one. And that means that the compare-exchange at the top, the green one, the one that's up at about a microsecond per little sequence of instructions, is loading and then compare-exchanging. So on this particular system (and the hardware guys can do various tricks, and sometimes do, but on this particular system) that's two trips across the bus. In the blind case, we are just doing a compare-exchange instruction in a loop, handing it a constant, and so we have just the one instruction.
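(Roughly, the two measured operations look like the following. This is my sketch using the kernel's cmpxchg() and READ_ONCE(), since the actual benchmark loops aren't on the slide; do_work() is a hypothetical stand-in for the work in the middle.)

    static unsigned long x;	/* shared variable being hammered */

    /* Blind compare-exchange: old and new values are constants, one bus trip. */
    while (cmpxchg(&x, 0, 1) != 0)
    	continue;

    /* Loaded compare-exchange: load first, then cmpxchg(), two bus trips. */
    unsigned long oldval, newval;

    do {
    	oldval = READ_ONCE(x);
    	newval = do_work(oldval);	/* hypothetical work in the middle */
    } while (cmpxchg(&x, oldval, newval) != oldval);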
But even so, we're up pushing 400 nanoseconds for a single instruction in a loop. In contrast, if we're doing local operations, a local lock or a local compare-exchange, that's the green line almost at the bottom, just sort of touching the little ticks on the x-axis. So the shared memory is causing us quite a bit of trouble. And by the way, each of these points is not all of the threads jumping on the same memory location. This is CPU 0 talking to one other CPU using these operations. So why does it vary across the graph? The difference is that this is a system with multiple sockets and multiple cores per socket, with four sockets on one side of the interconnect and four sockets on the other, plus hyperthreading. All of that put together gives you the situation you see there, where different CPUs are more or less able to communicate with CPU 0. The worst case is where it has to go across the interconnect to the other set of four sockets, and that's the big tall part nearing one microsecond, between about 128 and 224 CPUs there. So we have a problem here with reader-writer lock. We'd like to go faster, and we're going to have to do something different to make that happen.

So the key point here is that global agreement is expensive. And that's what's happening with reader-writer lock. When that writer lets go, globally, everybody has to agree that the writer's time is over and now we're over to readers. And if there's anybody that has a different idea, we break. We have a bug. Similarly, when the last reader lets go and the writer takes over, that point in time has to be globally agreed to. And this is very expensive. In fact, in this case we aren't even having global agreement among all the CPUs: this horrible graph is local agreement between a pair of CPUs. So when I say expensive, this graph doesn't really communicate just how expensive it is. For that, you'd have to go back to the previous graph showing the reader-writer lock.

And this is because of the laws of physics. The speed of light is finite. I'm sorry, it might seem fast, but it's finite. And you know something? Atoms are of nonzero size. And this is a problem with today's computers. Now, my 50-years-ago self, back in 1971, would have been surprised that this would be a problem for computers. Interstellar travel, maybe sure, but computers? So what we're going to do is take a leaf from Admiral Grace Hopper's book, and note that a nanosecond is about that long. So if you have a system with a 1 GHz clock and you have something that has to get across the whole system, because you need global agreement, light goes about that far in a vacuum in one clock cycle. Except that in today's computers you don't just send the information in one direction. Normally, you make a request and get a response back. You try to load some memory and you eventually get the memory back. So in that case, we're talking about half that distance in a vacuum. And that's for a 1 GHz CPU. We've got 4 GHz CPUs, so it could be like that, to get there and back in one clock cycle, in a vacuum. But it gets worse yet. Computers aren't vacuums; they're solid state, in fact. Now, my dad got his EE degree in the '50s and worked with vacuum tubes, but I'm here to tell you the vacuum-tube computers were even slower than today's computers. The vacuum isn't helping much.
So chips use conductors made out of copper these days, mostly, and that gets you down to about here, between my thumbnails there. And that's copper; that's for conducting things long distances. Inside the transistors, that's silicon, and you're talking about another factor of ten, so you're down about here. And you probably can't see the difference there, so I'll just tell you: it's about one twentieth of an inch, or if you prefer, about 1.2 millimeters. And computer systems are a lot bigger than that. So you don't get all the way across the computer and back in a cycle. You don't even come close.

Furthermore, that's just the laws of physics. There are other things running against you. You have mathematics: you have cache-coherence protocols where things have to work out properly to keep things consistent, and there are some additional mathematical constraints saying that more communication has to happen than just a straight over-and-back. In addition, you have electronic issues. The whole system doesn't run at the same clock rate; CPU cores run way faster than the rest of the system, and getting signals reliably across a change in clock rate requires some interesting electronics and incurs a penalty. If you're talking to some types of SSDs, you have phase changes, so you have chemistry slowing you up too.

All right, so global agreement is really expensive, and this is not something that's going to go away easily. Now, we could ask: can the hardware guys do better? And to some extent they can. For example, if you have a chip, and you were to slice it twice and stack the four pieces on top of each other, which they're starting to do to some extent, you've reduced the diameter of that piece of the system by a factor of two. But that's just the chip. You have the rest of the system as well; your memory normally is not on the CPU chip and you have to talk to it, for example. Also, there are heat-dissipation and power-distribution issues with that sort of thing that can cause other problems. There have been some famous cases, and I won't name names or go into them, where people did a lot of work to speed things up in the hardware, but the thermal dissipation caused them to throttle down to where they got no benefit out of it. So you can have that kind of problem; it would be good to have all those things pulling in the same direction.

Back many decades ago, a guy named H. T. Kung came up with the idea of systolic arrays, and potentially modern hardware accelerators might get the same benefit, where instead of sending requests out and getting data back, you just push the data through the system. The historic problem with those is that they have been very specialized, and you have to have an application that really has a lot of instances in order to justify the investment of making it real and developing it. Finally, there have been some cool hardware developments over the past couple of decades; my favorite is vacuum-gap transistors. It's kind of like a vacuum tube on silicon. The cool thing about it is that at those really tiny scales, the atmosphere is, for all intents and purposes, a vacuum.
The problem with it is that, to make a working one, you have to kind of hand-place the atoms, which isn't exactly consistent with the billion-device wafers that we need these days. Well, actually the wafers are worse; I mean billion-device chips. So we're kind of between a rock and a hard place. The hardware guys and us are on the same side, trying to make this go fast, and we're up against the laws of physics. So let's ask what we can do to help them. We've gone through this a little bit; if you want the exact numbers, there you are. We've talked about that. What we're going to use for this is read-only replication. If you're reading a lot, you don't have to wait: you wait once, and if you need it again, it's already there. That's the trick we're going to use here.

The way we're going to do that is with lighter-weight semantics. So this is back to what we saw before for the reader-writer lock. We have time-based mutual exclusion; we have all this waiting that happens at each transition. I've drawn thin lines at those transitions. That's a lie. Each of those lines is really, really fat, because we have to take into account not just the electrical diameter of the system but the information-flow diameter of the system, and that takes a lot of time, as we've seen. So let's try just weakening things. We don't want that global agreement; let's get rid of it. Let's wait for prior conflicting operations only, as the updater, and only before freeing. And let's just forget about blocking subsequent conflicting operations. Let's just not block them, with the possible exception of updaters interacting with each other. Let's not block the readers. Of course, that does pose some challenges. As you can see here, we've got this funny area where we've got readers and writers, and we no longer have the nice temporal separation between the readers and the writers. We've got this zone of confusion. What do we do about that? And that's what RCU is designed to handle.

We're going to look at a very small subset of the API, kind of the core; the URL at the bottom will let you look at the whole thing, and we'll get the slides posted, as Marisa said, later on. So we've got rcu_read_lock() and rcu_read_unlock(), and those mark the beginning and end of a reader, just like read_lock() and read_unlock() mark the beginning and end of a reader-writer-lock reader. We have synchronize_rcu(), which waits for pre-existing readers. That's important. With a write lock, we wait for all the readers or block them, one of the two: it's either wait for a reader to get done or block a reader from starting. synchronize_rcu() does half of that. It waits for the pre-existing readers; it doesn't worry at all about readers that come later. And this is kind of sort of like a write_lock() immediately followed by a write_unlock(), but not quite. Still, that can be helpful in thinking about it. If it helps, use it. If it causes confusion, pretend I didn't say it. call_rcu() is the same thing as synchronize_rcu(), except that it is asynchronous. Whereas synchronize_rcu() blocks waiting for the pre-existing readers, call_rcu() instead returns almost immediately, but it registers a function that gets invoked after the pre-existing readers complete. Sort of a fire-and-forget asynchronous operation. The remaining three APIs here are for managing data. rcu_dereference() loads an RCU-protected pointer. It's essentially just a load.
You could, instead of saying p = rcu_dereference(something), just say p = something, except please don't do that, because the compilers can really mess that up if they want to, and some CPUs (DEC Alpha, we're talking about you) can also mess it up. So rcu_dereference() is just a load, but it's a load that is sane as far as RCU is concerned. rcu_dereference_protected() is for the update side, where you hold the update-side lock. It is essentially just a load, but it interacts with lockdep checking in a way so that lockdep doesn't yell at you, and also with sparse. rcu_assign_pointer() publishes a pointer to a new RCU-protected data structure, and it ensures that the readers won't see pre-initialization garbage. Again, it's just an assignment. You say rcu_assign_pointer(p, v), and it's just like p = v, except that it's making sure that the concurrency works out. If you just do p = v, the compiler can really, really mess you up, and some hardware can as well. So please use rcu_assign_pointer().

Okay, let's restate the semantics. The first bullet is the semantics we saw a couple of slides ago, exactly the same thing. We weakened the semantics, so we've got to do something, and the something we do is the second bullet: we compensate for the weakened semantics by adding restrictions and also spatial semantics. We'll see how that works over the next little while. And I'm going to restate the semantics for RCU. That is: if synchronize_rcu() cannot prove, prove beyond a shadow of a doubt, that it started before a given rcu_read_lock(), then it must not return to its caller until the matching rcu_read_unlock() completes. So if there's any doubt in synchronize_rcu()'s mind about whether an rcu_read_lock() started before or after it, it has to wait for the corresponding rcu_read_unlock().

And here are the restrictions, and these are rules of thumb, really. There are some surprising things, I guess. If we start in the upper right there: if the situation is read-mostly, for example configuration variables, and stale and inconsistent data is okay, then RCU works great. For our example, the stale part is okay (it's okay to get an old a and b), but the inconsistent part is wrong: if we get an a, we have to get the corresponding b. Things like network routing algorithms show up in the blue chunk. Our example is in the green parallelogram there: read-mostly, but we need consistent data. Things like the pathname walk in the Linux kernel, using the dentry cache, are in the yellow area, where we don't write but we need consistent data. Some time ago I said, if you're in the red area, forget about RCU, but the Linux kernel community being who they are, they came up with a couple of exceptions, and they're listed down below. But still, the key point is that RCU is a specialized mechanism. You want to use the right tool for the job, and this figure gives you some guidance, rules of thumb, as to where you should use it and where you should not. The other thing is that RCU is most frequently used for linked data structures, and that allows us to pull some tricks. You can use RCU without linking, and we'll see an example of that later on, but the common case in the Linux kernel and elsewhere is linked data structures.

All right, so let's look at the RCU semantics from another viewpoint: show me the code. And again, this is just the same application we saw before.
We've got our variables a and b, infrequently updated. The reader accesses need consistent values: two readers at the same time might see different a's and different b's, but each of them has to see an a that goes with the b that it gets. A little bit of staleness is okay, and we have very frequent reader accesses again. So how do we code this up? The overall approach is that we put a and b into a structure to obtain the consistency. A lot of people have said it since David Wheeler, but he was apparently the first: all problems in computer science may be solved by another level of indirection, and RCU does that for most of its use cases. So what we're going to do for an update is allocate a new structure, update a pointer to reference it, and free the old structure when it's safe to do so.

And coming back, here's our API again. We're not going to use call_rcu() this time, but we're going to use the other ones. The reader is going to use rcu_read_lock() and rcu_read_unlock(), and it's going to use rcu_dereference() to load the structure's pointer and get the right thing. The updater is going to use synchronize_rcu(), rcu_dereference_protected(), and rcu_assign_pointer(). Let's see how it looks.

For the reader, we've got our structure, and we have a global pointer of that structure's type. To make things easier and fit on a slide, I'm going to assume this is initialized to something reasonable to begin with. I don't show that here; I want it to fit on a slide, and I want to avoid the checking for NULL. If you're doing real code, of course, please check for NULL when appropriate; otherwise you get an oops if you try that in the kernel, hopefully. So we've got the same get API as we saw before. We've got a couple of pointers to the places to put the current values. We now add a struct pointer. We do an rcu_read_lock() where there used to be a read_lock() in the old case. We pick up that curconfig pointer we have up there at the top with rcu_dereference(), again to keep the compiler, and in the case of DEC Alpha the CPU, at bay. Then we just pull the two entries out of that structure. We have just the one structure, so they're consistent with each other. And then we do the rcu_read_unlock() to tell any updaters around that we're done and they can proceed.

Okay, let's run through how that might work. I said it was initialized, so let's say it initially points to this structure over here that has 37 for a and 46 for b. We do rcu_read_lock(). We do rcu_dereference() and get a pointer to that structure. Then we pick up a, and it's 37; it's right there. But then somebody else does an update. Now curconfig points to this other thing, which has 39 and 44. But that's okay, because we still have a pointer to the old structure. So we get 46 for the value of b, despite the fact that the current value of b is 44. But that's what we want. We're a little bit out of date, but we're consistent. That's what this application needs. Then we do the read unlock, and despite the change, we got consistent values, which is what we wanted.

Let's look at a writer. For the writers, we're going to have a spinlock. The readers don't care about the spinlock; they're using RCU. But if we have a pair of writers running concurrently, we have to have some way to exclude them. Let's start with a spinlock: it's easy, it's familiar. We have the same set API we had for the reader-writer lock.
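(Here is a sketch of the reader just described, along with the updater walked through next, both in Linux-kernel style. The names struct myconfig, curconfig, and mcp are my reconstruction of the slide's identifiers, and the BUG_ON() stands in for the textbook-grade error handling mentioned in the talk.)

    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct myconfig {
    	int a;
    	int b;
    };

    static struct myconfig __rcu *curconfig;	/* assumed initialized non-NULL */
    static DEFINE_SPINLOCK(my_lock);		/* serializes updaters only */

    /* Reader: a consistent, possibly slightly stale, (a, b) pair. */
    static void get_config(int *cur_a, int *cur_b)
    {
    	struct myconfig *mcp;

    	rcu_read_lock();
    	mcp = rcu_dereference(curconfig);	/* keeps compiler and CPU at bay */
    	*cur_a = mcp->a;		/* both fields from one structure, */
    	*cur_b = mcp->b;		/* hence consistent with each other */
    	rcu_read_unlock();		/* updaters may now free what we saw */
    }

    /* Updater, spinlock version. */
    static void set_config(int new_a, int new_b)
    {
    	struct myconfig *mcp;
    	struct myconfig *old_mcp;

    	mcp = kmalloc(sizeof(*mcp), GFP_KERNEL);
    	BUG_ON(!mcp);		/* textbook code: be graceful in real code */
    	mcp->a = new_a;
    	mcp->b = new_b;
    	spin_lock(&my_lock);
    	old_mcp = rcu_dereference_protected(curconfig,
    					    lockdep_is_held(&my_lock));
    	rcu_assign_pointer(curconfig, mcp);	/* publish the new values */
    	spin_unlock(&my_lock);
    	synchronize_rcu();	/* wait for pre-existing readers to finish */
    	kfree(old_mcp);		/* no reader can still reference old_mcp */
    }

    /* Updater, atomic-exchange version: the lock, load, store, and unlock
     * collapse into a single xchg().  (sparse would grumble about the __rcu
     * annotation here; elided for clarity.) */
    static void set_config_xchg(int new_a, int new_b)
    {
    	struct myconfig *mcp;

    	mcp = kmalloc(sizeof(*mcp), GFP_KERNEL);
    	BUG_ON(!mcp);
    	mcp->a = new_a;
    	mcp->b = new_b;
    	mcp = xchg(&curconfig, mcp);	/* mcp now points to the old structure */
    	synchronize_rcu();
    	kfree(mcp);
    }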
We've got the two values coming in, and we're going to allocate some memory. Let's not worry about what we put in the allocation call, dot dot dot, because we want it to fit on a slide later on. And we have a variable to hold the old value. We just BUG_ON() if we get an allocation failure. Again, please be graceful in real code, but this is textbook code here. We initialize the structure we just allocated with the new a and b. We acquire the lock. We pick up the old pointer with rcu_dereference_protected(), and we say lockdep_is_held(&my_lock). That's just shorthand for saying "the way I'm synchronizing is with my_lock; give me the value," and lockdep does check and verify that you do, in fact, hold the lock. We then do rcu_assign_pointer() to make the pointer reference the newly initialized data structure. So from here on out, if a reader comes in, they'll see the new values that were passed into the set function. We then release the lock. We do synchronize_rcu() to wait for all pre-existing readers. Once synchronize_rcu() returns, any reader that might be holding a pointer to the old structure, the pointer we just overwrote, must have finished. And at that point, it's safe for us to free it.

The thing is, as I did with the reader-writer lock, I'm going to have to put two readers and a writer on the slide, and this isn't going to fit. Plus, RCU doesn't care how the updaters synchronize with each other, as long as they do. And if you look at this thing: the spin_lock(), this load, this store, and the spin_unlock(), that's just an atomic exchange instruction. So let's use that. If we do this, we get rid of the lock and just use an exchange. What xchg() is going to do is take the old value of curconfig and put it in mcp, and take the old value of mcp and put it in curconfig. So when we get done, curconfig is going to point to the new structure and mcp is going to point to the old structure, and we can just up and free it there. And on the slide, we're going to use those blue background areas to represent one writer, again just to make it fit.

All right, so let's revisit the RCU semantics. This is the same English we saw before, but let's switch to a graphical representation; we'll take a look at this from different views, just to have different ways for people to get at it. So we've got four different situations, and in each of the quadrants, time goes from top to bottom. In the upper-left quadrant, we have the case where the rcu_read_lock() happened before our synchronize_rcu() started. In this case, the reader might have a pointer to the old state from before the removal, and so synchronize_rcu() had better wait for that reader to get done before it returns. Otherwise, we could possibly free some memory that the reader is still referencing, and a use-after-free bug is just as bad with RCU as it is anywhere else. So in the upper left, synchronize_rcu() has to wait for that reader to get done. In contrast, in the upper right, synchronize_rcu() started first. That means there's no way that reader could see whatever it was we removed. Since that reader can't possibly access the memory we removed, it's okay to free that memory before that reader gets done. So that's the upper right there.
In the lower left is our belt-and-suspenders combination, where the reader doesn't have access to the thing we removed, because it started later; and not only that, but it's not around after we free it. So we're doubly protected in that case. The lower right is a bug in RCU. If this happens, the RCU implementation is buggy and I need to fix the bug, or it's somebody else's RCU and they need to fix the bug. In this case, the reader might have access to the thing that was removed and freed before the reader got done. So that's bad. If you're writing RCU, don't let the lower-right thing happen; the whole purpose of RCU is to make the lower-right thing not happen. Okay, and I'll point this out: this is kind of important conceptually, but we're not going to worry about it too much for this presentation. When I say "time" here, it's really ordering, because "what time is it?" is kind of a slippery concept on a large multiprocessor, as many people who have tried to implement time on large systems have learned the hard way, including me. So it's really ordering, but thinking about time is going to be okay for this presentation.

So, another view of the semantics. We've shown you the code; let's exercise it. We have our same setup as before: we have a reader on each side, right and left, and we have an updater in the middle, and they're all abbreviated. Initially, curconfig points to a = 37 and b = 46, as before. As we start executing, the rcu_read_lock() happens, and that's going to hold off synchronize_rcu(). We pick up the value into mcp (I've abbreviated because I want to fit it on a slide), which means we get a pointer to that blue 37, 46 structure. And of course, when we go forward and pick up a, we get 37. At this point, the updater might happen. It's going to do an allocation and initialization; that's what that "alloc" all stands for there. And that initialization is going to give it the green box with a = 39 and b = 44. It then does the exchange instruction: it picks up curconfig, and it updates curconfig to point to the new value, like that. So now mcp points to the blue thing, but curconfig points to the green thing, and all future readers are going to see the 39, 44 as they go forward. So the update, at this point, has kind of happened, except that we need to clean up; we don't like memory leaks. So we do synchronize_rcu(), which notes: hey, there's an rcu_read_lock() that started before I did; I need to wait for it.

At the same time, the last reader starts, the one on the right, and it starts after the synchronize_rcu(), as the dotted arrow indicates. So it picks up mcp and gets the green one. It can then run through the rest of its critical section, and it sees 39, 44, which is a consistent pair of a and b. And notice that it's gotten done before the other guy has finished: the guy on the right has picked up all his values before the guy on the left has, and that's okay. Again, remember, the important thing is that the a and b be consistent with each other; it's okay for them to be a little out of date. You can imagine this being for something doing environmental monitoring: maybe the temperature and pressure need to be consistent with each other, but if we're a little bit out of date, they don't change that fast, for example. Okay, the guy on the left now picks up b and gets 46, and synchronize_rcu() is still having to wait; the triple dots there again.
Now the reader does its rcu_read_unlock(), and synchronize_rcu(), which has been waiting, sees that the rcu_read_lock() section has completed, so it can move forward and do its kfree(). Once the kfree() gets done, that old structure is all gone, but that's okay, because the guy on the left is done with it. He's out of his reader; he's not going to use that pointer anymore. The fact that it's now freed is just fine. Finally, the guy on the right does its rcu_read_unlock(), and we're done. So we've had the concurrent update running while the readers are running; they're all making progress at the same time. But nevertheless, we get consistent results for both readers: different results, because they got different versions, but each reader sees a consistent view of the world, as needed.

So let's split that out like we did the readers, based on time. We've got old readers, in blue there, running up to the end of the grace period. The grace period is the time that synchronize_rcu() has to wait, the time from when you start waiting until all the old readers get done. But there can be new readers, green ones in this case, starting all the way back at the start of that grace period. So we have this area in the middle, this area of confusion, where it could be either way. And the way we resolve that confusion is using the address space. Where reader-writer lock just used temporal synchronization, RCU is using a combination of temporal and spatial synchronization, to allow the readers to just go plowing through as if nothing was happening, but still get reasonable results. And my major professor, Jon Walpole, and two of his students were the ones who pointed out the fact that time and space are interlinked here. I had been using it for almost 20 years before they really articulated it for me, but that's what happens. So we're using time and space in order to resolve that area of temporal confusion. The confusion is strictly temporal; if you put the space in there with it, there's no confusion at all. Everything works out.

Okay, so we've got space-time synchronization. And one question is, how does this relate to the API? And here we are; it relates pretty straightforwardly. The blue API members are temporal synchronization and the green ones are spatial synchronization. rcu_read_lock() starts a reader at a point in time; rcu_read_unlock() ends a reader at a point in time. synchronize_rcu() waits for a period of time, until the pre-existing readers are done. call_rcu() doesn't wait, but it causes a function to be invoked at some point in time in the future, when all the pre-existing readers are done. Then rcu_dereference() and rcu_dereference_protected() fetch the current version of the pointer; that's the spatial part. rcu_assign_pointer() is the update side changing the space that you should be looking at. And in our example, of course, we're using the atomic exchange function to substitute for those last two. So that's how time and space fit into the RCU API; there's a very clear separation. And we can look at that also in terms of the readers and the updaters, because we have read-side access and update-side access. This is the same thing, but split up by readers and updaters. Readers do temporal synchronization with rcu_read_lock() and rcu_read_unlock(), and spatial synchronization with rcu_dereference(). Updaters have a more elaborate interleaving of time and space.
They initialize, then use rcu_assign_pointer(), spatial synchronization, for adds. Deletes remove something; that's spatial synchronization. Then they wait for a grace period; that's temporal synchronization, waiting for the pre-existing readers. And then they do a free, which cleans up after the remove: another example of spatial synchronization. So what's happening is that RCU is not really doing ownership. I mean, it kind of is, and you can think of it that way, but what it's really providing is existence guarantees. If you pick something up as a reader, if you're between rcu_read_lock() and rcu_read_unlock() and you pick up a pointer with rcu_dereference(), you are guaranteed that whatever that points to will stay around until you get to the matching rcu_read_unlock(). So what we're guaranteeing with RCU is existence, rather than ownership, although you can use RCU in ways to get to ownership, and that's a topic for later on.

Now let's go in the right direction here; sorry, I was going backwards. Let's map this onto the reader itself. It's the same thing we saw on the last slide, except we have it in the code for the reader, and then we can talk about the purpose. The first temporal synchronization starts the read-side critical section, the last one ends it, and the spatial synchronization gets the current version. So there you can see it in the context of the get function. Similarly for the update: we've got spatial synchronization being the exchange, which changes what space you're supposed to be looking at, and then the synchronize_rcu() is the temporal synchronization that makes sure anybody using the old space is done before we blow it up with kfree().

Okay, so RCU is kind of unusual in having exploited temporal and spatial synchronization together for decades; it's almost three decades since Jack Slingwine and I came up with this, back a long time ago. So one question is, who the heck does spatial synchronization, other than crazies like myself anyway? And the answer is: quite a few people. If you do any kind of concurrent programming, you use spatial synchronization. You write a C function, and it's got an auto variable. That means you can have several tasks, or several CPUs, depending on what you're doing, running that function at the same time, but they end up with different stack locations, and you thus have spatial synchronization between all the different instances of that function running at the same time. Per-CPU and per-thread variables: same deal. Hash tables with per-bucket locks: again, you do synchronization on a given bucket only, and therefore you're doing synchronization across space, one bucket at a time. And that applies to sharding in general; we used to call this data locking back in the day, but sharding appears to be the name for it now. There are a bunch of other things that do this. Hazard pointers: several people invented them independently at about the same time, and Maged Michael has probably done the most to make them used in practice. And there are other deferred-reclamation techniques as well; reference counts, if nothing else. So the answer is that pretty much everybody uses spatial synchronization. The thing about RCU isn't that it's unusual in using spatial synchronization. It's that it uses a combination of temporal and spatial synchronization, in a fine-grained way, at the same time.
That's what's special about it. Okay, so everybody uses spatial synchronization; you've used it, probably, if you've done any concurrent programming at all. The trick, again, is that RCU combines temporal and spatial synchronization to get better scalability and so on. Well, okay, I said it'll get better scalability; maybe I should prove that. And by the way, this RCU is not the CONFIG_PREEMPT=n RCU. It's not the fast one; it's the slower one, the one you get if you have preemption on, CONFIG_PREEMPT=y, in the Linux kernel. That means the readers are actually executing real instructions, as opposed to the server-class builds, where rcu_read_lock() is just a barrier() call, which is just a compiler constraint: there are no actual instructions emitted explicitly for it. As you can see, we're well over an order of magnitude better, even on a single CPU. So RCU, even in the slower variant used in real-time kernels, is going to be quite a bit better than reader-writer lock on a single CPU. And as you add more CPUs, RCU really doesn't get that much slower.

Now, you may notice there's a lot of fuzziness as you get out beyond a couple of CPUs; in fact, even at the second CPU, you're starting to see some fuzziness there. The solid line is the median and the dashed lines are the extremes. It's not a standard deviation: the low dashed line is the very lowest sample, and the high dashed line is the maximum value. What's happening here is that this is run inside a guest OS, and that means that CPU number three, well, who knows what it means, because KVM and QEMU, and the host's scheduler, decide what it means exactly. And the thing is that RCU on this workload can use more than half of a core with a single thread in that core. So one aspect of the uncertainty is that we may or may not have some number of the threads sharing the same core. If they do share the same core, they get a little bit slower, because of the fact that a single RCU instance can use more than half of the core's resources. Still, it's flat pretty much out to 192 CPUs. And we have done tests, haven't published them, but done tests in previous lives, going up above a thousand CPUs, and it still runs flat. So we're doing quite well on performance.

However, this is empty critical sections, and who uses empty critical sections? Come on, right? But if we have something in the critical section, the more we have in there, the less RCU's advantage. Also, the more CPUs we have, the less RCU's advantage. So here I'm starting at 100 nanoseconds for the critical-section duration. I could take it down lower, but the graph gets messy-looking. And that's a pretty long duration: the thing we had before, where we're just loading a and b, if they're hot in the cache, is going to be way less than 100 nanoseconds. But still. As you can see, the more CPUs we have, the farther up you have to go before the difference gets down into the noise. But please note, this is a log-log scale, so fairly small-looking differences are largish, tens of percent, up there. Still, if you're up at 20 microseconds, you're not getting much from RCU, really. The reader-writer-lock difference, even with 100 CPUs, is pretty much down in the noise for this particular hardware; your hardware may be different. Still, RCU does the best with small critical sections and large numbers of CPUs, compared to reader-writer lock.
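(Jumping ahead slightly: the next topic is the higher-level list primitives, so here is a sketch of how they fit together. The element type struct myelem, the rcu_head field for kfree_rcu(), and the spinlock serializing updaters are my assumptions, not from the slides.)

    #include <linux/list.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct myelem {
    	struct list_head list;
    	struct rcu_head rh;	/* for kfree_rcu() */
    	int key;
    };

    static LIST_HEAD(my_list);
    static DEFINE_SPINLOCK(my_list_lock);	/* serializes updaters only */

    /* Updater: readers never see a half-inserted element. */
    static void add_elem(struct myelem *ep)
    {
    	spin_lock(&my_list_lock);
    	list_add_rcu(&ep->list, &my_list);
    	spin_unlock(&my_list_lock);
    }

    /* Updater: unlink now, free asynchronously after a grace period. */
    static void del_elem(struct myelem *ep)
    {
    	spin_lock(&my_list_lock);
    	list_del_rcu(&ep->list);
    	spin_unlock(&my_list_lock);
    	kfree_rcu(ep, rh);	/* asynchronous deferred free */
    }

    /* Reader: lockless traversal. */
    static bool find_key(int key)
    {
    	struct myelem *ep;
    	bool found = false;

    	rcu_read_lock();
    	list_for_each_entry_rcu(ep, &my_list, list)
    		if (ep->key == key) {
    			found = true;
    			break;
    		}
    	rcu_read_unlock();
    	return found;
    }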
And usually you use higher-level things. We've been talking about rcu_dereference(), which just picks up a pointer, and rcu_assign_pointer(), which changes a pointer. But you're better off, if you're doing lists, using the list primitives, which are right here. list_add_rcu() adds an element to a list in a way that keeps RCU readers from getting messed up. list_del_rcu() does the same for removing something from a list, again without the readers getting messed up. And if you're traversing a list, use the list_for_each_entry_rcu() iterator to make it easy. There's also an asynchronous kfree(), namely kfree_rcu(): you pass it the pointer, and it just gets freed later. There are a couple of different variants of it, but I'm not going to get into that in this presentation.

Paul, I have a hand up.

Yes, go for it.

Kiran Puttapatil, would you like to ask your question now?

Hi, Paul. I have a question on slide 91.

So, slide 91; let's go back there. Thank you. Go ahead.

So, here the question: I think you briefly mentioned the jump at CPU two or three, where there is a spike, and that this is because this test was run within a VM. If the test is run on the host, have you seen a similar observation, or does it not happen?

Oh, that's an excellent question: what happens if I run it on a host? And I have some runs where I've done that. I'm sorry I don't have any in this presentation, but if you look at perfbook, there are plenty of them, and there will be a URL for that at the end of the slide deck. What happens there is that you have control of the CPUs, right? And the way Intel CPU numbering works is that CPU 0 is hyperthread 0 on core 0, CPU 1 is hyperthread 0 on core 1, and so on. So if you do the normal thing of just saying, okay, one CPU, I'll use CPU 0; two CPUs, that's 0 and 1; and so on, what happens is you end up with a line that kind of slopes up, and then, when you get to where you're starting to use the second hyperthread of each core, the slope decreases. So it's the same effect, but it looks different. And the reason it looks different is that you have control, whereas in this case there's no control, so we get variation. Does that make sense?

Yeah. Yes. Thank you.

Okay. So yeah, great question, and you do notice it. And that's for the CPUs I've run on. There might be some CPU somewhere where RCU wasn't able to use its share of a core, and it might scale linearly across the cores. I don't know. If you find one, let me know; I'd be kind of curious, right?

There is another question.

Oh, thank you. Go ahead.

There is another question, it looks like. Afath, would you like to ask your question now?

Yes, please. So, thanks a lot, Paul. My question is just to understand the difference between temporal and spatial synchronization. Is the following statement correct? The statement is: because sequence counters and sequential locks only use temporal synchronization, we cannot protect data structures with pointers in them using sequence counters and sequential locks.

Okay. So sequence counters, sequence locks: how do those relate to temporal and spatial synchronization? That's actually very interesting, and quite a challenge, actually, for some programming languages, including C++ and Rust, as it turns out, although I think they're coming up with solutions. There are solutions; they just have interesting properties. But what happens is that there's not really spatial synchronization.
So you can get kind of a fuzzy answer. But if you do, then at the end of the reader (and I'm sorry, I'd have to look up the name; it's read_seqretry() or something like that), what will happen is it'll tell you: hey, you've got to retry. So what it does is allow the executions to overlap, which avoids global agreement, and it converts the global agreement into a local agreement: did I run, just me, without an update happening? And if I did, then great, I'm good, I go. So it takes a different approach. Instead of, like RCU, using a combination of temporal and spatial synchronization, it reduces the reader-writer-lock global-agreement problem to a pairwise-agreement problem, which is much cheaper. What that means is that you end up with heavy memory barriers on the sequence-lock readers, but you don't have that atomic-instruction, make-sure-everybody-agrees rigmarole that you have with reader-writer lock. Instead, you just have a simple check: did an updater happen while I was reading? If so, try again. Does that make sense?

Yes. Thank you.

Good questions, thank you. Any more questions, or should I plow ahead here?

Go ahead, Paul. I don't see anyone. Go ahead.

Okay. So we've been through this, and basically the thing here: there are also RCU-protected hash tables and RCU-protected trees. If you have a higher-level data structure that works for you, use it. That's the general rule here.

Okay, so let's talk about RCU as a quasi-reader-writer lock. We demonstrated the code; now we're going to look at it from a use-case viewpoint. What RCU provides is publish-subscribe, which is the spatial synchronization, and wait-to-finish, which is the temporal synchronization. That's all. But we had to add things to that. We had to add a heap-allocated linked structure, and that was so we could pick up a pointer and get our consistent values in that structure. We also had to add deferred reclamation, and we did that manually, by saying synchronize_rcu() and then kfree(). We could have used kfree_rcu() or call_rcu(), which are other ways to do it. Either way, you have to defer memory freeing, memory reclamation, until after the readers that might be referencing that memory are done. We don't like use-after-free problems, thank you. And once we do that, we have RCU readers acting kind of sort of like a read-held reader-writer lock. And it does say quasi-reader-writer lock there; we mean quasi. It's not quite a reader-writer lock, but it can be used in some cases where a reader-writer lock could be used; that's a way to think of it. In doing this, we're using both spatial and temporal synchronization together, in an intertwined fashion.

Okay, so that's our quasi-reader-writer lock. Let's go to the next green box we had there, which was phased state change. This use case is quite a bit rarer, but it's a much more primitive, if you will, use of RCU; it's using RCU much more directly than the quasi-reader-writer lock does. So we have a different application here. We have a multi-threaded application again, but we have some operation that, in the common case, must be fast. So we've got readers, sort of. I mean, they're kind of reading the state, but they're doing something, and normally they want to just do it fast and get it over with.
But if we have a maintenance operation going on, so occasionally, maybe every night or every weekend, we do some kind of maintenance thing, then during that time the common-case operation needs to be careful, because stuff is changing out from under it. So we're going to use a flag that says: hey, be careful. But if you've tried that, you know it's really painful; reality beats you over the head pretty badly if you're not careful here. We need to reliably synchronize with that flag. And the way we do that is, again, we get rid of global agreement. We do that by noting that we only really need to be careful during the maintenance operation, but if we're careful a little bit before and a little bit afterwards, that's okay, right? As long as, once the maintenance operation's time has passed, we're quick, and if it's a long time before one, we're quick. But if we're coming up on one, it's okay to be unnecessarily careful, and if we just got done with one, it's okay to be unnecessarily careful for a little bit. So the key trick, in all these use cases, is to turn global agreement into something weaker. As we saw from the question earlier, one way is to turn the global agreement into pairwise agreement; that's sequence locks and sequence counters. And what we're doing with RCU, usually (not always, but usually), is turning global point-in-time agreement into agreement that's fuzzed out over time. And that's what we're doing here.

So let's look at it graphically. And this is the exception: because the slides are wider than they are tall, I'm sorry, time has to go left to right here, rather than top to bottom. So we have normal operation for a while, and we're doing things quickly. And we come to a point in time where I say: hey, we're going to start maintenance pretty soon, so start doing things carefully. During that transition, if they do things quickly, fine; if they do things carefully, fine; we don't care. But once maintenance starts, they had better be doing things carefully, or things will break. Once maintenance gets done, if they do things carefully still for a little while, okay, fine, no problem. But after a certain time, we'd better be doing things quickly, or the system's not going to meet its throughput requirements; it's not going to be fast enough. In the old days, when you didn't have global operations, the idea would be that you'd do the maintenance at some time when the load was low, and thus you could afford the common-case operation going more slowly; today, with global operations, you have different data centers in different time zones, which makes that harder. Okay, so that's what we're trying to do.

Now we can look at the code, and the read side is pretty simple. We're not going to actually worry about exactly what the maintenance is, or what we're doing to be careful around the maintenance; we're just going to have function names. We have this boolean, be_careful. That's the flag that says whether or not we need to be careful. It's initialized to false; by C's default initialization rules, it gets set to zero. And we have this common-case-operation function, cco(). What it does is, it has a reader, from the rcu_read_lock() down to the rcu_read_unlock() at the bottom. Hopefully you can guess what it's going to do, but we're going to go through it carefully over the next few slides.
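(Here is a sketch of the code about to be walked through. The names cco(), maint(), cco_carefully(), cco_quickly(), and do_maint() follow the talk; the helper bodies are deliberately left out, just as on the slides.)

    #include <linux/rcupdate.h>
    #include <linux/types.h>

    static bool be_careful;		/* C default initialization: false */

    static void cco_carefully(void);	/* tolerates concurrent maintenance */
    static void cco_quickly(void);	/* fast path */
    static void do_maint(void);		/* the actual maintenance work */

    /* Common-case operation: must normally be fast. */
    static void cco(void)
    {
    	rcu_read_lock();
    	if (READ_ONCE(be_careful))
    		cco_carefully();	/* maintenance might be in progress */
    	else
    		cco_quickly();		/* no maintenance now or soon */
    	rcu_read_unlock();
    }

    /* Occasional maintenance operation, hopefully run when load is low. */
    static void maint(void)
    {
    	WRITE_ONCE(be_careful, true);
    	synchronize_rcu();	/* wait out readers that might still be quick */
    	do_maint();		/* every common-case operation is now careful */
    	synchronize_rcu();	/* exercise: why is this second wait needed? */
    	WRITE_ONCE(be_careful, false);
    }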
And what we're going to do here is a READ_ONCE() of be_careful. That keeps the compiler from messing us up. It says: just load this thing once and give me the value. Don't do anything fancy, don't read it a bit at a time, don't decide to read it twice, don't decide to read it a little while early so you don't need to read it again. Read the thing, now, once, carefully. That's what it means. And if we're supposed to be careful, we do cco_carefully(), some function that does things carefully; we don't really care exactly what. And on the else clause, we have cco_quickly(), which says: just get it done. There's no maintenance operation, there's not going to be one, just get it over with.

Okay, now let's take a look at the updater. We have this maintenance operation that's called occasionally, hopefully when the load is low, so that we can get away with the common-case operation being slow, because that's what's going to happen. The first thing we do is a WRITE_ONCE() of true to be_careful. This is the counterpart to READ_ONCE(). It's essentially just an assignment, be_careful = true, except that it keeps the compiler from messing things up. It just does the write in one instruction. It doesn't mess around slicing it up. It doesn't do things like say: ah, there's this write that has to happen; I'll wait a while and then do it. Or: hey, you know, I wrote true a while back, so it should still be true; I'll just let it still be true. And the compilers will do all those things for you, by the way, or against you, as the case may be. Once we've done that write, we do the synchronize_rcu(). synchronize_rcu() waits until all pre-existing readers get done. Let's go back to the previous slide: there's the reader. Any reader that is in that cco_quickly() right now, down there at the bottom, must have started before that synchronize_rcu(). Therefore, going back to the update-side slide again, that synchronize_rcu() cannot return until all the readers that might have seen be_careful as false have completed. So once that synchronize_rcu() returns, all the common-case operations are doing things carefully. At that point, it's safe to do the maintenance operation; we have this do_maint() function that does that. And once we get through that, we do the synchronize_rcu() again. I don't want answers right now, but if you feel like this is a good day and you want to think about it hard: why do you need that second synchronize_rcu()? Why can't you just do the WRITE_ONCE() straightaway? But in any case, once we get done with the synchronize_rcu(), we do a WRITE_ONCE() of false to be_careful. It'll take a while for the readers to stop seeing true, but once all those readers get done, the common-case operations after that will see be_careful as false, and they'll do things quickly again.

Paul, there is a question in the Q&A. I'll read it out to you: in cco() and maint(), why not use smp_load_acquire() and smp_store_release()?

Okay. So tell me... smp_load_acquire(): I'm going to go back to the reader slide. So presumably you'd like to make the read of be_careful an smp_load_acquire(), instead of READ_ONCE(), is that correct?

Yes.

Okay, great. We've got a couple of WRITE_ONCE()s here. Which one do you want to be smp_store_release()? Why, and how does it help? Levi, would you like to just turn your microphone on and speak?

Yes. Yes. Actually, I just don't know.
The READ_ONCE() and WRITE_ONCE() are just for preventing optimizations: to stop the compiler from caching the value somewhere instead of going to memory to see the current value. But in this case, because of the problem of synchronizing, and getting all the cores to see the correct value of be_careful, I think it is better to use smp_load_acquire() and smp_store_release(). But I don't know.

Okay, hold on a minute. Excuse me, I asked you a question; could you please answer the question? The question, to repeat it again, is: which of those WRITE_ONCE()s do you want to be smp_store_release()? And how does it help? Which one?

That's right. smp_store_release() is much better to use.

That's not the question. I'm sorry, that's not the question I asked you. Let me ask the question again: which of those WRITE_ONCE()s do you want to be an smp_store_release(), and how does it help? You're actually onto something important, but you keep taking a turn before you get there. So I need you to think about it. If you'd rather take time and think about it, that's fine; I can continue and we can come back later. But which one of those WRITE_ONCE()s do you want to be an smp_store_release(), and what does it do for you?

Let me tell you what the concern is. If we go back to the reader here and we make that READ_ONCE() an smp_load_acquire(), then on weakly ordered systems, for example ARM, that's going to slow things down. And remember that the common-case operation is, in fact, a common-case operation; that slowdown may be really, really bad during the times when we're not actually doing maintenance. So we're paying a penalty for smp_load_acquire(). So my question to you is (and maybe it's worth the penalty, okay, it might be): if we're going to pay the penalty of making that READ_ONCE() be an smp_load_acquire(), I need you to tell me again which of those WRITE_ONCE()s, and maybe it's both of them, just tell me both if that's the case, wants to be an smp_store_release(), and how it helps. Because we need a benefit that outweighs the common-case memory barrier we're taking on some of the weakly ordered architectures. But you're onto something important, okay? If you take the right steps, you'll answer the question of why this is needed. That should be a hint for you.

I'm just thinking about the weakly ordered case, so I think it's much better to use the smp_load_acquire()...

Okay, I'm going to stop you right there. I'm sorry, I have to stop you, because we have to go on with the presentation. Again, you're kind of taking a turn away from the answer. There really is a situation where what you're suggesting might be helpful, but you need to think hard about the two questions I asked you. The two questions are: which of those WRITE_ONCE()s do you want to turn into an smp_store_release(), and how does it help? Because we've got all the synchronization we need: the synchronize_rcu() will make sure that the readers are careful when they need to be careful, and that they're quick when we know it's possible to be quick. You're not going to add any more synchronization that way. Okay.
And on the read side, we have minimal overhead: READ_ONCE() is just a load. On x86, okay, perhaps it doesn't matter so much, but smp_load_acquire() does imply a barrier. It does prohibit reordering, whereas READ_ONCE(), not so much. And on the weakly ordered architectures, you're getting extra instructions from smp_load_acquire() that you don't get from READ_ONCE(). Anyway, to not belabor this: you actually started in a great direction, the right direction, but every time I asked you a question, you kind of veered off and tried to argue philosophy. You need to think about what's happening in the machine. I'm sorry for that. Shall we go ahead?

Sorry, there is another question in the question-and-answer box.

Go for it.

Has the code not devolved back to a reader-writer lock? First, the writer waits for all quick readers, then for all careful readers. And finally, you still have a shared variable that's going to hit shared cache lines.

Yeah. You can think about this in a number of ways. We could think of this as a reader-writer lock, although the writer isn't really excluding other writers, so that's kind of a stretch, right? And the thing about the readers is that this isn't excluding the writers: the readers just execute either way. So even if the writer is in the middle of do_maintenance(), the reader still makes forward progress, which would not be the case for a reader-writer lock. So there's a resemblance, you're right, but there are a lot of differences as well. Does that make sense? Or did I miss the point of the question?

Ilan, does that answer your question? Yes. Okay, that's good. And there are a couple of other questions, Paul. Would you like to take them now or later?

Why not? Okay. First, there is a hand up. Ahmad, would you like to ask your question?

Ah, yes, please. So, Paul, please, could you go back to the reader slide?

There we are. The reader slide.

My question is, and I'm sure my understanding is wrong, but what I thought was that since rcu_read_lock() and rcu_read_unlock() already hint to the compiler not to move things before or after them, the READ_ONCE() would not be necessary. Can you explain a little bit more, please, why the READ_ONCE() is needed even though we have rcu_read_lock() before it, and those markers for the critical section?

Sure. So if you look at a specific implementation with a specific compiler, it might be that you're fine. And some people would argue that because be_careful is a boolean, it can only be one or zero, so the compiler can't really slice and dice it, reading it a bit at a time, in a way that would matter. Yes. But if we just wrote "if (be_careful)" by itself, the compiler would be within its rights to turn the else into a reload of be_careful. It would be a stupid thing for the compiler to do, but the standard would allow it to say "if (be_careful) cco_carefully();" and then "if (!be_careful) cco_quickly();", which would allow both cco_carefully() and cco_quickly() to be executed on a single pass. That is probably not what you want. Does that make sense?

I'm thinking it through in my head, but please continue.

Okay. The thing is, if you don't put the READ_ONCE() there, the compiler is allowed to assume the variable does not change. And that means that it could do strange things.
For example, let's say you had a funny machine that had a conditional-call instruction. You give the instruction the memory address of the variable it's supposed to check, and you give it the function to call. If you didn't have the READ_ONCE(), the compiler could say: oh, okay, great, we'll use this conditional-call instruction to check be_careful and call cco_carefully() if it's true, and then we'll use that instruction again to call cco_quickly() if be_careful is false. That would result in two loads of be_careful. You see what I'm saying?

Yes.

Okay. So we have one more question, and then I'm going to push forward.

So, thanks a lot.

Thank you, good question. All the questions were good, actually. Go ahead, next question.

There is no question.

Oh, okay. Well, in that case, we'll save it for later. Okay. So, we've been through this. And here I'm splatting it onto the same time-and-space diagram, except the space here is value space. And there's a trick there: is it really value space? I'll let you think about that. What's happening is we have the reader on the right, the reader on the left, and the updater down the middle. The writer is going to change be_careful, but hasn't quite gotten to that point yet. In the meantime, the first reader, the one on the left, has checked be_careful and said: oh, I can go quickly. So it does, and now it's in the middle of doing cco_quickly(). Then the writer switches the variable. (Yeah, that's interesting; I clearly didn't do this slide very well, and I apologize for that.) What happens is that the writer is going to have to wait for that old reader to get done, which means do_maintenance() can't happen until that reader's rcu_read_unlock() gets done. And on the other side, the other reader is going to see that the value is true, and therefore it's going to do it carefully. The writer will set the value back to false after the maintenance gets done, but that's okay, because it's fine for cco_carefully() to keep running after that happens. And the gentleman who asked why we can't use smp_load_acquire() and smp_store_release() is onto something. The trick is, how do you make that beneficial in some cases?

Okay, so what did we do to get from RCU to this phased state change? We took wait-for-readers-to-finish, and only wait-to-finish, and we added a state variable to check. That is, we used RCU's temporal synchronization and added a state variable.

So how is this used in the Linux kernel? I'm not going to go through this in detail. We've got a bunch of different...

Go ahead. Before we move on from this area, there is another, related question: how is READ_ONCE() different from volatile?

Not very much, in execution. In fact, if you look at READ_ONCE(), it essentially is a volatile access with a bunch of funny stuff around it to keep the compiler happy in a lot of situations. If you think of READ_ONCE() as a volatile read, you won't be very far wrong. The thing you have to be careful of is that it's okay to do READ_ONCE() of a really big structure, and if you do, it'll still be volatile, but if the CPU doesn't have an instruction big enough to load the whole thing, it'll do it a piece at a time. But if you do READ_ONCE() of a machine-word-sized, aligned thing, it'll do a single instruction to load it, just as it would for a volatile read.
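Putting the compiler discussion above into code form, here is a sketch of the transformation the C standard would permit; this is illustrative, not something any particular compiler is known to do:

    /* What the reader wrote, minus the READ_ONCE(): */
    if (be_careful)
            cco_carefully();
    else
            cco_quickly();

    /* Because the compiler may assume be_careful never changes, it is
     * permitted to reload the variable, as if you had written: */
    if (be_careful)
            cco_carefully();
    if (!be_careful)
            cco_quickly();      /* both branches can now run in one pass */

    /* READ_ONCE() forces a single load with volatile semantics, roughly: */
    bool careful = *(volatile bool *)&be_careful;

    if (careful)
            cco_carefully();
    else
            cco_quickly();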
Another question. If we didn't have the second synchronize_rcu(), then either the CPU or, to some extent, the compiler could reorder the write of false up into the do_maintenance() function. On weakly ordered architectures, do_maintenance() does a bunch of memory accesses, and if we don't have some kind of synchronization, in this case the synchronize_rcu(), that WRITE_ONCE() can be pushed up into do_maintenance(). And that might mean that some poor reader picks up the value of false that was written prematurely, sees that be_careful is false, and goes quickly, even though there are maintenance operations still in flight. Does that make sense?

A follow-on question: could we swap the synchronize_rcu() for some kind of memory barrier?

Well, that's an interesting question, and it relates to the question asked earlier about why we don't use smp_load_acquire() and smp_store_release(). I'll give you a little bit to see if you can tell me how that's done, given that as a hint. Can you tell me how that's done?

I'm not sure what you mean; how which is done?

So you asked: can we get rid of the second synchronize_rcu() and use a memory barrier instead?

Right. My understanding is that that would prevent the WRITE_ONCE() of false from rising up into the do_maintenance() function.

It would. But what other problem could we run into if we did that? I mean, we know the WRITE_ONCE() can be reordered into do_maintenance(). What else might be reordered? This is a really tricky one. Go ahead.

I don't fully follow what the memory barrier prevents from being reordered.

Well, that's one process, the one doing do_maintenance(). There's another process, the readers. What reordering might happen there? This is kind of a trick question: to get the right answer, you have to understand memory models. So, unless you object, I'm going to give you the answer. If you'd rather go think about it, I'll hold off and let you think about it.

What's that? I do not object.

Okay. So the same thing can happen for the reader. You've got a READ_ONCE() there. Now, intuitively, we had to check the variable, and only then did we do the then-clause or the else-clause; the else-clause is the hard one in this case. So we had to have done the read before we did cco_quickly(), right? Wrong. On weakly ordered CPUs, that's true for the stores that happen in cco_quickly(), but not for the loads. So we can get reordering of that READ_ONCE(), the load, with any of the loads down in either cco_carefully() or cco_quickly(). For cco_carefully(), we probably don't care; for cco_quickly(), we might. A way of thinking about it is that CPUs do speculative execution. The CPU might say: I think this value is false, so I'll do cco_quickly(). Then it confirms that, but the write happened somewhere in between. So the reader could do some loads that aren't safe, because do_maintenance() was still sort of in flight, thanks to reordering on the read side. That's why the careful phase extends a bit before and after the maintenance, as a buffer. So what you have to do is put ordering on both sides. And this is the question we had earlier, and here is the answer to it. If you want to get rid of that second synchronize_rcu(), and you can't get rid of the first one, what you can do is get rid of just the second synchronize_rcu() and replace the WRITE_ONCE() with an smp_store_release(). But only if you also replace the READ_ONCE() with an smp_load_acquire().
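In code, the variant just described looks something like this sketch, using the same hypothetical names as before:

    /* Updater: the release store replaces the second grace period. */
    WRITE_ONCE(be_careful, true);
    synchronize_rcu();                      /* still required: waits out
                                             * pre-existing quick readers */
    do_maintenance();
    smp_store_release(&be_careful, false);  /* orders do_maintenance()'s
                                             * accesses before the store */

    /* Reader: must pair with that release via an acquire load. */
    rcu_read_lock();
    if (smp_load_acquire(&be_careful))      /* extra barrier on weakly
                                             * ordered CPUs */
            cco_carefully();
    else
            cco_quickly();                  /* its loads can no longer float
                                             * above the load of be_careful */
    rcu_read_unlock();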
Because that way, if the smp_load_acquire() sees a false, we know that all the previous work in do_maintenance() has become visible to us; there is ordering there. And the benefit is that we don't have that extra synchronize_rcu() after do_maintenance(), which might or might not be important. The price we've paid for that extra update-side performance is that the reader, on weakly ordered systems like ARM, is going to be using more expensive instructions to do that read. Does that make sense?

Yes. Thank you.

Okay. Those questions aren't easy, so, you know, I didn't expect us to get through them, and I'm glad to see we did. Okay, let's see here. So, we talked about this Linux-kernel usage. We talked about that. Actually, I don't think we did the previous slide yet, if I remember correctly. This slide. Yes. Okay. This is just a list of some data structures in the Linux kernel that are RCU-protected. I'm not going to go into detail on any of them, but there they are. And this is an example of something that uses phased state change to make a reader-writer lock that has RCU-reader-like performance when there are no writers, but if a writer shows up, the readers are still safe. That is the per-CPU reference count, the percpu_ref functionality, in the Linux kernel. So that's just a mention to show that phased state change, strange as it looks, does have real application.

Okay. This is just usage. The subsystem here is just the top-level directory name, and these are ordered by intensity of RCU use: the more RCU API uses you have per thousand lines of code, the farther up this list your directory is. As you can see, ipc is at almost one percent; virt is a little bit less than that. net has the most total uses, about 7,000, and it's also the subsystem that has used RCU the longest, give or take modules. The interesting one is drivers, which has one of the smallest intensities of use, well below average, but drivers has so much code in the Linux kernel that it has the second-largest number of RCU API calls. So again, this is a specialized thing: roughly 18,000 uses in 27 million lines of code. If you randomly pick a line, you've got a fairly small probability of hitting an RCU call, but it's used throughout pretty much the whole kernel. And over time: there was a period, up to about 2014, when the growth looked kind of exponential, but at this point the use of RCU simply increases as the size of the Linux kernel increases. So we're pretty much at steady state there.

Now we'll go quickly through some of the things that help with debugging Linux-kernel RCU code. If you're making use of RCU, these are some things that help keep you out of trouble. We'll talk on separate slides about lockdep, sparse, and KCSAN. One thing is, if you're writing an algorithm, it can be very helpful to prototype it with userspace RCU. That lets you use a real debugger in user space, and it can make you much more productive than booting a kernel every time you get something wrong. And the last item is something a lot of people have a love-hate relationship with: RCU CPU stall warnings. But again, these things can keep you out of trouble too.
It can be painful sometimes, being kept out of trouble. After all, running with scissors is great fun until somebody gets hurt. Okay, so let's talk about lockdep assertions. One of the things that can happen in the Linux kernel is that you have a reader, but that reader is a few function-call levels down from the rcu_read_lock(). It might be called from multiple places, but it might really, really need to be protected by RCU. So you use these lockdep assertions. (Strictly speaking, they're not assertions; they're predicates you can test and warn on.) And rcu_read_lock_held() is what it says: yes, you're in an RCU reader right now. The BH and sched variants are just different flavors of RCU on the read side: the first one means you have bottom halves disabled, and the next one down means you have preemption disabled, one way or another. And the "any" variant means that one of those three is held and you don't care which. Then rcu_dereference_check(), with its _bh_check() and _sched_check() variants, lets you check your updater, to make sure you hold the right things, and lets you have common code shared between your reader and your updater, called from both: the check is satisfied if either you're in an RCU reader or you hold the specified lock.

With sparse, you can take a pointer and mark it __rcu. Then sparse will yell at you if you use normal C-language loads and stores to access it, which helps in cases where you're converting something over and missed a spot. And the KCSAN assertions can be very helpful: they let you check for unexpected concurrent accesses. If you're at a point where you expect nobody else to be writing to something, you can say ASSERT_EXCLUSIVE_WRITER(), and KCSAN will yell at you if a write happens concurrently, at least in the tests you managed to run.

Stall warnings, to talk about those a little bit. With normal RCU, you've got CPU 0 and CPU 1, and they have readers. The readers end at some point. CPU 0 happens to go into idle; CPU 1 goes into user-space execution. At some point, somebody wants a grace period and checks the CPUs, and once both CPUs are out of their readers, once the readers have ended, it says: great, this grace period is done. (These get more complex on preemptible kernels, because then readers can block, but let's not worry about that right now.) Now, let's say that CPU 0 starts an RCU reader with interrupts disabled, and it persists for 21 seconds, or in some distro kernels, 60 seconds. I hope you all agree that having interrupts disabled on a CPU for tens of seconds is a really bad thing, especially if you have any kind of response-time requirements, like most Internet data centers do, or, worse yet, real-time requirements. So RCU will complain if a CPU has not responded for too long. This normally indicates a problem outside of RCU, though of course I've written bugs in RCU itself that result in stall warnings. They can be a little irritating when they happen, but again, they're helping to keep you out of trouble. Still, there are some cases where you don't want stall warnings, for any number of reasons. I'm not going to go through these in detail, but there are some Kconfig variables and some kernel boot parameters that let you control how long it takes for a stall warning to appear, and whether they appear at all.
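Going back to those assertions for a moment, here is a rough sketch of how they get used; foo, foo_ptr, foo_lock, foo_state, and the function names are made up for illustration:

    static DEFINE_SPINLOCK(foo_lock);
    static struct foo __rcu *foo_ptr;   /* __rcu: sparse checks accesses */
    static int foo_state;

    /* Must be called within an RCU read-side critical section, even
     * though the rcu_read_lock() may be several call levels up. */
    static struct foo *get_foo(void)
    {
            RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
                             "get_foo() needs rcu_read_lock()");
            return rcu_dereference(foo_ptr);
    }

    /* Common code shared by readers and the lock-holding updater:
     * satisfied by being in an RCU reader OR holding foo_lock. */
    static struct foo *get_foo_any(void)
    {
            return rcu_dereference_check(foo_ptr,
                                         lockdep_is_held(&foo_lock));
    }

    /* KCSAN: yell if anyone else writes foo_state concurrently. */
    static void set_foo_state(int new_state)
    {
            ASSERT_EXCLUSIVE_WRITER(foo_state);
            foo_state = new_state;
    }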
Tuning stall warnings can be useful in certain circumstances. Say you have a slow embedded system: you tested on a fast development system, found that everything worked fine there, and you don't want stall warnings firing on the slow embedded system. Again, though, if you have any kind of response-time requirement, even an Internet-data-center, hundreds-of-milliseconds-style one, and you're getting stall warnings, you've got a problem. So they do find real bugs. There's more information out there: a 2018 conference presentation, with video and slides available, and also a file in the kernel source tree (Documentation/RCU/stallwarn.rst in current trees) that covers the various causes of stall warnings and how to interpret them.

Again, I want to revisit the point that RCU is a specialized mechanism. There are places to use it and places you probably shouldn't, and it's important to use the right tool for the job. I might have a warm spot in my heart for RCU, but I have an even warmer spot in my heart for using the right tool for the job.

So, summing up: RCU synchronizes in time as well as in space, and the time and space aspects are deeply intertwined, unlike traditional approaches where they're kind of separated out. This allows near-zero-cost read-side synchronization and really great scalability. We've gone through a couple of example use cases, and we'll have some more in February. And I'd like to share RCU's dirty little secret: it's dead simple. You wait for pre-existing readers. But in order to make good use of RCU, you have to change the way you think about your problem. That's the hard part. Hopefully this presentation has helped by looking at RCU's semantics from a bunch of different directions, so you can think about your problem in a way that lets you use RCU where you need it. The other thing is that, hopefully, if you've listened to this and gone through it, you're at the I-see-and-I-remember stage, so that if you see RCU in the Linux kernel, it won't be foreign to you: you'll be able to see what the code is doing and how it's being used. If you really want to understand it, you need to get to the I-do-and-I-understand stage, and that means you should play with it. Maybe grab userspace RCU and write some little programs. There's example code in the book "Is Parallel Programming Hard, And, If So, What Can You Do About It?"; there'll be a pointer on another slide at the end. Or you can just play with code in the Linux kernel and see what happens. If you really want to thoroughly understand something, you need to play with it and use it. Then again, that might not be your goal. Your goal might just be to understand code you're reading, in which case hopefully this presentation suffices.

Again, you are here: we looked at the green boxes, the quasi-reader-writer-lock and phased-state-change use cases. And of course the basics of RCU: waiting for things to finish, and publish/subscribe. In February, we'll be looking at the remaining blue boxes, so if you're interested in those, we've got more coming for you. At that point, here's a place to look for more information. Hopefully this has been helpful. If there are more questions, we can take them for a little bit. In any case, I really appreciate your time and attention. Go ahead.

There is one question in the question-and-answer box: is it possible to use RCU, I'm guessing that's RCU, in a user-space application or program, and harness the benefits that come with it?
Yes, you can. An example program that uses userspace RCU is QEMU. When QEMU started using RCU, it was able to handle up to 64 CPUs at a time when a lot of the other hypervisor players were only at something like 16. So that was the benefit: better scalability. There's a DNS implementation that uses it, and there are quite a few other uses in user space, actually. It works a little bit differently than in the kernel, but the material in this presentation still applies: you still have readers, and you wait for the old readers, not for the new ones. We're also working on getting RCU into the C++ standard. It has gotten into a prototype technical specification, and hopefully in a few years it will make it into the international standard. And of course, C++ is primarily used in user space, so that's another example of where things are going.

The thing that confuses people is that the in-kernel RCU implementation makes really heavy use of the fact that it's in a kernel. It does heuristics and optimizations that are set up for the kernel. In user space, the same semantics are there, but the optimizations and heuristics are quite different. Let's see: there's a C-language userspace RCU library, and the Folly library has an RCU implementation, so you can look at those. And in the pile of references at the end, there's a 2012 article in IEEE Transactions on Parallel and Distributed Systems that talks about userspace RCU and how it's implemented. Or send me an email and I'll send you pointers to the specific things. Hope that helps. So yes, you can use it in user space, and people do.

Please, one more. I don't know if it's a question or more of a statement; I'll read it out to you, Paul. The problem with user-space RCU, or for that matter RCU in fully preemptible kernels, I don't know about Linux, is that you need to prove that a quiescent state will be reached within a finite time. Otherwise, either the writer can be blocked indefinitely or memory can accumulate.

That's true. But you know, the same is true of locking: if you acquire a lock, you have to prove that you will get to the unlock in a finite time, or you hang the system just as thoroughly as you would with an OOM, an out-of-memory event. So you're right, but that's true of many other synchronization primitives as well. What it comes down to is that if you're in a situation where you could be using reader-writer locking, you're already making sure your readers don't mess up your latency, and in that case it should be okay to wait for them, not just for the writer, but for the reclamation stage as well. It's the same situation. And how it works depends on how you want to do it. Userspace RCU allows multiple approaches. One way is to have rcu_read_lock() and rcu_read_unlock() explicitly note the beginning and end of the read-side critical section, just as happens in the Linux kernel if you build a preemptible kernel; then there's something called sys_membarrier() that allows the update side to efficiently check who is actually in a read-side critical section right now. Alternatively, some applications are structured so that there is a natural quiescent state, and that quiescent state happens sufficiently often. In that case, there is another variant of the userspace RCU implementation that lets you call into it during that quiescent state, and it synchronizes more the way a non-preemptible kernel would.
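As a sketch of what that looks like with the C-language liburcu library just mentioned, using its default flavor; struct cfg and the function names are made up, and each thread must call rcu_register_thread() before its first read-side critical section:

    #include <stdlib.h>
    #include <urcu.h>           /* liburcu; link with -lurcu */

    struct cfg { int a; int b; };
    static struct cfg *global_cfg;      /* RCU-protected pointer */

    static void read_cfg(int *a, int *b)
    {
            struct cfg *c;

            rcu_read_lock();
            c = rcu_dereference(global_cfg);
            *a = c->a;
            *b = c->b;
            rcu_read_unlock();
    }

    /* Assumes a single updater thread for simplicity. */
    static void update_cfg(struct cfg *newc)
    {
            struct cfg *old = global_cfg;

            rcu_assign_pointer(global_cfg, newc);
            synchronize_rcu();          /* wait for pre-existing readers */
            free(old);
    }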
But yes, if you're using a blocking operation, you have to bound that blocking, and that applies not just to RCU but to all the locks, and to seqlock and seqcount as well.

Thank you, Paul. That makes sense, even if I missed a point in the statement. Gilad, does that answer your question? Marisa, how are we doing on time?

We have, I see, three hands up, and then a question in the question-and-answer box.

I can answer some more questions, so go ahead. If you guys have time, I have time. It's okay. I'm going to take down the presentation, though, because it's...

Okay, that sounds good. Gilad, it looks like you have a follow-on. Would you like to ask that, and then I'll move to the other questions in just a bit?

Okay, so Gilad says he's going to bug me later, which is fine. Or if he wants to bug me now, that's fine too.

Yes, that'd be fine. Ahmed, would you like to ask your question?

Yes, please. So thanks a lot, Paul, again, as usual. My question: you mentioned the call sites that make the most use of RCU. So the question is, what were the most common cases you saw where RCU was actually abused?

So, the common abuse cases, as opposed to the common use cases?

Yeah, exactly.

Oh, there have been a bunch over time. In fact, those have driven the debug features. One thing that would happen is that people would simply forget an rcu_read_unlock(), just as they forget normal unlock primitives, and that's where stall warnings came from, although we've had all sorts of other things cause those to fire. Another thing that has caused confusion in the past is that people have decided that an RCU read-side critical section somehow excludes the update, not just the synchronize_rcu(), but the update itself. That's why I do presentations like this. It's also why we have the Kernel Concurrency Sanitizer, which will normally detect that sort of problem. Another thing that happens is people will forget the rcu_read_lock(). That sounds kind of stupid, but what happens is that you have a common function, called from a bunch of places, and callers are required to be in an RCU read-side critical section. Somebody calls it but misses the comment, so they forget to have a reader there, and then things go bad. And that's why we have the lockdep annotations that let that common function say: hey, you'd better be an RCU reader if you're calling me. So those are some of the things. There are other situations where people use RCU where they should have just used a lock; those get caught, and people go back to using a lock in those cases. But it's like anything else: if you're not being abused, you're probably not being used either, right?

Thanks a lot.

Kieran, it looks like you're next.

Yes, no, it's not a question. I'm just wondering, will this recording be shared with us? Will it be available?

Yes, it will be made available, along with the slides. It'll be uploaded. I did add a slide at the end of the deck, so I'll need to send another copy to you guys, but that's fine; it's not that big of a difference.

Okay, thank you.

There is one in the question-and-answer box, looks like two now. Is rcu_barrier() similar to synchronize_rcu()? That's the first one.

Kind of, but no. synchronize_rcu() and rcu_barrier() are not in any way interchangeable; they wait for different things. synchronize_rcu() waits for a grace period to complete, but it's not guaranteed to wait for all callbacks to be invoked.
That's not its purpose. Similarly, rcu_barrier() will wait for all callbacks, the things registered by call_rcu(), to be invoked, but it won't necessarily wait for a grace period. For example, suppose you invoke rcu_barrier() when there are no callbacks in the system: nobody has invoked call_rcu() at all, ever, since the system booted. There are no callbacks, so rcu_barrier() is within its rights to return immediately. You said to wait for all the callbacks; there aren't any callbacks to wait for; great, I'm done. Alternatively, if there are callbacks but they were all queued a long time ago, so they're just waiting for the last little bit of a grace period, then you'll have waited for the callbacks, but you might not have waited for a full grace period, only part of one. So they are two really different things. Please do not interchange them. Now, there have been special cases: for example, userspace RCU got rcu_barrier() late in the game, so there was a habit of invoking synchronize_rcu() multiple times instead. Also, if you have a single callback queue, then in some implementations they end up being the same. But in general, they're very different. Does that help? Your question worries me: what are you trying to do? I'm not sure; I'll be watching the typing on this one.

There is another question: would it be bad practice to take rcu_read_lock() recursively?

No. I mean, you could have an implementation that didn't like it, but in the kernel you can nest them as much as you want. In the case of non-preemptible kernels, the nesting is pretty much indefinite: you can nest a whole pile of them, and there's no real limit. For preemptible kernels, the nesting depth is limited to the value that can be represented by an integer, so if you nest more than a couple billion deep, you might get into trouble. But they can be nested, and that's really important, because there are cases where you have a function that can be called either directly or from within another reader, and you want the nesting so that the software engineering works out, to allow more sharing of code. Good question.

So I have a question of my own, I guess. rcu_barrier() and synchronize_rcu(): to be safe, do both need to be called?

It depends on what you're doing. The original reason rcu_barrier() was created was in response to, of all things, ReiserFS. So I guess you could say it was a killer requirement. Sorry, I couldn't resist. What happened was that ReiserFS would use call_rcu() to wait for readers to get done in order to do certain cleanup operations. But there was also unmount, and the problem was that on unmount, you got rid of everything. That was a problem, because a callback, the thing that gets executed after you do a call_rcu(), might run quite a bit later. You're guaranteed that the function gets called after a following grace period; you are not guaranteed that it gets called in a timely fashion after that grace period. So they didn't have a way to synchronize getting rid of the super block, because those callbacks needed access to the super block. We created rcu_barrier() as a way to make removing the super block safe. The way they used it: okay, we're shutting down, we're unmounting this file system, so we stop doing operations on the file system. Now we do an rcu_barrier() to make sure that everything that was waiting to do cleanup operations has cleaned up. Now we can remove the super block.
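A sketch of that shutdown ordering; the names here are illustrative, not the actual ReiserFS code:

    static void shutdown_my_fs(struct my_sb *sb)
    {
            /* 1. Stop new operations, so no further call_rcu()
             *    callbacks referencing this superblock get posted. */
            stop_new_operations(sb);

            /* 2. Wait for every already-posted callback to be invoked.
             *    synchronize_rcu() would NOT suffice: it waits for a
             *    grace period, not for callback invocation. */
            rcu_barrier();

            /* 3. Nothing can still be touching the superblock. */
            kfree(sb);
    }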
The same thing is done for module unload. If you have a module that uses call_rcu() and you unload that module, something needs to do an rcu_barrier() to make sure that all the callbacks that invoke that module's functions have been invoked before the module's code is removed.

Right. So those are good examples. Yeah, thank you.

Ahmed has his hand up, or maybe didn't drop it down, but I'll take another question... Okay, I think it disappeared now. Oh, we got a question: can you talk about how preemptible kernels ensure that a preempted RCU reader doesn't kill performance?

Okay, let's look at that from several viewpoints. First, from the viewpoint of the readers: the fact that one reader has been preempted doesn't affect the other readers. They still come in, read, and leave, and everybody's happy. Where a problem comes in is if somebody wants to do a synchronize_rcu(): that preempted reader is going to prevent the synchronize_rcu() from returning. Now, the only time a reader gets preempted is when it is not the highest-priority thing in the system, so this is only an issue for lower-than-highest-priority processes anyway. And what we have for that is something called RCU priority boosting: if a reader has been preempted and is taking too long, its priority gets boosted to a preselected level, which should cause it to run again and complete, and that allows the grace period to complete. Now, on real-time systems, this is a debugging tool, because real-time systems are supposed to keep the load down to a level where things get done, for all sorts of reasons: if there's a whole pile of work queued up and everything is running all the time, that's going to hurt real-time response. So if you have a real-time system, presumably there is idle time on the CPUs, which should allow that preempted reader to run and complete in a timely fashion. But if you fail at that, you'd like RCU priority boosting to kick in, and to get an error message, and to have the system stay usable so you can debug it, instead of just getting an out-of-memory event.

Okay, we have another one here: someone is looking at a debugging problem where rcu_barrier() takes 28 milliseconds in some cases and 68 milliseconds in other cases. Well, there are no real guarantees: rcu_barrier() doesn't have a time guarantee. What can happen is that you have a huge number of callbacks piled up. For example, suppose you untar a tarball and then do an rm -rf on the result, and when I say a tarball, I mean one with millions and millions and millions of files, a huge thing. You can end up with a lot of callbacks in that case, and if you have a lot of callbacks, your rcu_barrier() is going to have to wait for them all. On the other hand, on a system where not much is happening, rcu_barrier() might finish a lot faster. If rcu_barrier() taking 68 milliseconds is a problem, the thing to do is to look at how many callbacks you have; you may need to keep your callbacks down to a dull roar if you need rcu_barrier() to go fast. The other question is: why do you need rcu_barrier() to go fast? Please feel free to email me, and we can discuss what we might do in your situation.

I'm not seeing any typing or hearing a response, so we're done there. Any others? Oh, we have one right here: with increased kfree_rcu() usage, are there any implications for CPU cache performance when using RCU?
And the answer is: there can be. This is one of the reasons I had that diagram with the red, yellow, green, and blue: those implications are more severe at the red end of that bar than at the blue end. If you're read-mostly, you aren't doing a whole lot of allocating and freeing, because you're mostly just reading the stuff that's already there. For example, RCU is used heavily to keep track of the hardware and software configuration in Linux: security policies, what hardware is present, do you have a memory stick there or don't you, what devices are around, that sort of thing. Those don't change often. Yes, you'll have some allocation and freeing, but it doesn't happen very often, so you don't care. On the other hand, if you're using RCU and updating like crazy, and each of those updates does a kmalloc() and a kfree(), then what's happening, compared to not using RCU, is that there's a delay between removing the thing and freeing it. And that allows it to go cache-cold. That's one reason to use kfree_rcu(): kfree_rcu() makes that cache-coldness a bit less of a problem. I'm not going to go into the details of how that works; it's some cool stuff that Uladzislau Rezki did. But it can be a limiting factor, a reason not to use RCU if you're doing huge numbers of updates.

Thank you, Paul. I think we can make that the last question.

Okay. Well, thank you very much for your time and attention. I hope this was helpful, and I hope you all have as much fun with RCU as I've had. And I'll await Gilad's email.

Wonderful. Thank you so much, Paul and Shuaq, for your time today. And thank you, everyone, for joining us. Just as a quick reminder, this recording will be on the Linux Foundation's YouTube page later today, and a copy of the presentation slides will be added to the Linux Foundation website. We hope you'll be able to join us for future mentorship sessions. Have a wonderful day. Thank you.