[Session chair] Though my watch is on Canberra time, which got confusing — who is it? Sorry, last conversation. Paul, yes. Sorry, we just had Mark and he was cool too, but you're the cool presentation after. Just to make it difficult for me. So this is Paul. He's talking about cyclones — no, he's talking about something else, I'm sorry. Have fun. "High-Performance and Scalable Updates" — sounds cool. Enjoy your presentation. A big round of applause for Paul.

Thank you, Rachel. All kidding aside, I've been doing the read-mostly thing for a long time, and a series of rather strange accidents got me into the update side — a little bit, at least. And here we have the real presenter walking in. We're doing the younger crowd — getting them started in the kernel really early, we figure.

Anyway, I'm going to talk a little bit about what led up to the challenge, the challenge itself, why parallel updates aren't — or aren't always — a solved problem, and one solution. I should warn you ahead of time: this is a case of conference-driven development. That means I work on it for a few days before each conference, so it isn't quite as pristine as some of the stuff I might do in the Linux kernel, where I have to get it right before Ingo Molnar and then Linus Torvalds will deign to accept it. But you've got to have some fun, so here we are.

So, before this quote-unquote challenge: it seems I have an interesting relationship with the transactional memory guys. It's not that I'm opposed to transactional memory; it's just that we don't necessarily see eye to eye technically. This was especially the case around 2005, when they would write papers saying things like "you can't make a double-ended queue that uses just locking and get concurrency" — meaning, so that somebody can enqueue and dequeue at both ends at the same time. And they had a good rationale for saying that; it's not like they were being totally stupid. Are you guys reading something into that statement? I just said what I said. Okay, okay — as long as you're being friendly, it's okay.

Anyway, the thing is, look at the top there: if you have four elements in this double-ended queue, clearly somebody can enqueue or dequeue at both ends without any interference whatsoever. One of them is messing with the right-hand header and D, and the other with the left-hand header and A. There's no overlap, so you don't even need any mutual exclusion; it just works. Deletion, same deal: one is looking at A, the other at D — fine. In fact, with three elements it still works. The reason it works when you're deleting is a little trickier: you presumably have a doubly linked list, and they're both going to hit B, but they hit different pointers in B, so that works too.

But as soon as you get down to two elements, life gets hard. If you're kind of stupid about it, they'll both grab A and B, but they'll link the things to each other and it'll just be a mess: the guy removing B says, all right, I'm going to make the right-hand header point to A — except the other guy removed A at the same time, so his left-hand header points to B, and life gets very bad very quickly. Worse, if you've only got one element and two guys try to dequeue it without synchronization between them, they might both get A, or both decide the list was empty, both of which are grossly illegal.

Now, another caveat about this talk.
In my entire forty-year career, I have never once had any use for a double-ended queue. Okay, I've heard people say they have. Dave, is it? Okay, there you are. You had some race conditions to deal with, deciding when the one guy wins at ten elements — but we'll take that offline; this could be a while. In any case, it turns out this slide is wrong: apparently at least one person in this room has had some use for this data structure at some point in their career.

But it turns out there is a simple solution, and I'll come back to it at the end. The transactional memory guys were actually really cool about it. They wrote a couple of papers where they cited it, included it in their performance results, and said very nice things about it, and it did quite well.
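As a concrete illustration of the simple solution teased here — revealed at the end of the talk as "you just have two of them" — here is a minimal sketch of a compound deque in that spirit. This is a reconstruction, not the code from his papers or slides; the names and the steal-from-the-other-side policy are illustrative:

```c
/* cdeq.c: compile with -pthread.  Mutex initialization omitted. */
#include <pthread.h>
#include <stddef.h>

struct elem { struct elem *next, *prev; };

struct ideq {                      /* one lock-protected inner deque */
	pthread_mutex_t lock;
	struct elem *head, *tail;  /* head: left end, tail: right end */
};

struct cdeq { struct ideq d[2]; }; /* d[0] serves the left end, d[1] the right */

static void ideq_push_head(struct ideq *q, struct elem *e)
{
	e->prev = NULL;
	e->next = q->head;
	if (q->head)
		q->head->prev = e;
	else
		q->tail = e;
	q->head = e;
}

static struct elem *ideq_pop_head(struct ideq *q)
{
	struct elem *e = q->head;

	if (e) {
		q->head = e->next;
		if (q->head)
			q->head->prev = NULL;
		else
			q->tail = NULL;
	}
	return e;
}

/* Left-end enqueue touches only the left lock, so it cannot contend
 * with concurrent right-end operations. */
void cdeq_push_left(struct cdeq *q, struct elem *e)
{
	pthread_mutex_lock(&q->d[0].lock);
	ideq_push_head(&q->d[0], e);
	pthread_mutex_unlock(&q->d[0].lock);
}

/* Left-end dequeue: the fast path takes only the left lock.  Only
 * when the left side is empty do we take both locks -- always d[0]
 * before d[1], from either end, so two rescuers can't deadlock --
 * and steal the leftmost element of the right side. */
struct elem *cdeq_pop_left(struct cdeq *q)
{
	struct elem *e;

	pthread_mutex_lock(&q->d[0].lock);
	e = ideq_pop_head(&q->d[0]);
	pthread_mutex_unlock(&q->d[0].lock);
	if (e)
		return e;

	pthread_mutex_lock(&q->d[0].lock);
	pthread_mutex_lock(&q->d[1].lock);
	e = ideq_pop_head(&q->d[0]);       /* recheck: it may have refilled */
	if (!e)
		e = ideq_pop_head(&q->d[1]);
	pthread_mutex_unlock(&q->d[1].lock);
	pthread_mutex_unlock(&q->d[0].lock);
	return e;
}

/* The right-end operations mirror these using d[1], tail insertion,
 * and tail removal, with the slow path still locking d[0] first. */
```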
But there's always another shoe that drops, right? That was one shoe; where's the other? So fast-forward to early this year, 2014, to the C++ standards committee meeting in Issaquah, Washington, which is essentially a suburb of Seattle. They were trying to standardize transactional memory and add it to C and C++ — not initially to the standard itself, but to what they call a technical report or technical specification, which sits off to the side of the standard. It isn't part of the standard, but it lets the people working on it collaborate and debug that part before it gets added. I'm actually participating in that group. I'm not a particular fan or detractor of transactional memory — it's there — but if they're going to add it, I'd rather they didn't disadvantage my employer's hardware, so my participation has been more... participation than you might expect.

Anyway, they presented a thing saying, okay, here are the reasons we need transactional memory. One example was atomically moving something from one tree to another, and possibly back, and they had a bunch of other examples. They were doing a dry run for the presentation, it was late in the evening, and I was having a heck of a time keeping my tongue bit, because I'm supposed to be the loyal opposition, right? It was getting hard to stay loyal, because for every one of them I was thinking: there's a way of doing this. But I didn't want to mess things up.

And there were some other conditions: you have to be able to do these moves without contention between the two operations. In other words, you should be able to move element four from the right tree to the left without interfering with the movement of element one from the left to the right — without any coordination, or cache-line sharing, or anything. And what that means, of course, is that the normal solutions — locking one tree, or having a single lock covering both trees, or anything like that — just don't work. They don't meet that requirement.

At the end of the spiel I was going: oh my god, it's finally done, thank you. And the guy leading it said, okay, I think we've got a good case and we can present this, but we're missing the actual algorithms using locking. "McKenney, would you like to do that?" Of course, there's a subtext: okay, McKenney, if you think you're so smart, let's see you do this. But after biting my tongue for a few hours, it was a great relief to say yes.

Anyway, one thing you might say is: geez, we've been doing parallel updates for longer than I've been alive — and that's saying something — so why is this a problem? I mean, use a hash table, right? What could be simpler? You can even map a tree into a hash table if you really want to; some databases do it. You partition it up, you have a lock on each chain, the hashing spreads the accesses across the buckets, and you get perfect scalability and stunning performance — in theory. On recent machines you might see something like this graph instead.

The top lines: the very top one is the ideal — you just take the no-locks case and extrapolate. RCU and hazard pointers are pretty close to it. I'm not going to worry about hazard pointers here, but it's a well-known technique that does roughly what RCU does, with different properties. Below those are the per-bucket locks. This is a read-only workload: all we're doing is acquiring the locks; everything's random and spread out, so it should be perfect. Instead, what we get goes up to eight CPUs and then drops — and this is a log scale, so it drops a long way. This really is bad. Of course, we knew the global lock would be in trouble, and it is. So we've got this thing where we're not even doing updates — we're just acquiring the locks that would protect the reads — and it falls off a cliff. We don't have hot spots, and we don't have false sharing.

Okay, the usual response is: fine, you don't have enough buckets. That would be the first thing you'd suspect, and it's actually true in this case. We have 1,024 buckets; going to 2,048 it gets a lot better, and a little better again beyond that, but between 8,192 and 16,384 we don't see much change. Still, even at 16,000-odd buckets it drops off a cliff — this one's a linear scale, and the cliff is a lot shorter than with 1,024 buckets, but there's still a cliff. This is x86 hardware, a couple of generations old, but still. So we got some improvement, but the basic problem of it not scaling is still there.
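The benchmarked structure is presumably along these lines — a minimal sketch of a bucket-locked hash table, with illustrative names, showing how even a pure lookup acquires a lock whose cache line probably lives on another socket:

```c
#include <pthread.h>

#define NBUCKETS 1024              /* the talk's starting point */

struct hnode { struct hnode *next; unsigned long key; };

struct bucket {
	pthread_mutex_t lock;
	struct hnode *head;
} __attribute__((aligned(64)));    /* a cache line per bucket: no false sharing */

static struct bucket table[NBUCKETS];

static void table_init(void)
{
	for (int i = 0; i < NBUCKETS; i++)
		pthread_mutex_init(&table[i].lock, NULL);
}

/* Even this read-only lookup takes the bucket lock.  With random
 * keys, the lock was probably last held on another socket, so the
 * acquisition itself is a cross-socket cache miss -- the cliff in
 * the graph, with zero actual lock contention. */
static struct hnode *table_lookup(unsigned long key)
{
	struct bucket *b = &table[key % NBUCKETS];
	struct hnode *p;

	pthread_mutex_lock(&b->lock);
	for (p = b->head; p; p = p->next)
		if (p->key == key)
			break;
	pthread_mutex_unlock(&b->lock);
	return p;
}
```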
The problem is that it's a four-socket machine. With eight CPUs, the way Intel in their infinite wisdom numbers their CPUs, those are each on their own core but all confined to one socket. Add a ninth CPU and we go off onto another socket, and suddenly our cache misses get way more expensive. The workload randomly hits the different hash buckets, which means the lock you acquire was probably last acquired by somebody else. Therefore, even though there's no contention, the data structure containing that lock is in the other guy's cache, and we have to go out across the interconnect — a long way — to get it.

And you know, the speed of light is pretty fast, but you're trying to outrun it. This is a two-gigahertz machine: in one clock period, light in a vacuum goes about that far, over and back. Not very far — but the chip is smaller than that, so what's the problem? Except that I don't know of any computers that use vacuum light guides — waveguides or whatever — to move data across the chip. It's all silicon and copper and aluminum. In copper you might get 30% of the speed of light. In silicon — which, by the way, is what the transistors are — you're doing really well to get 3% of the speed of light. So instead of a distance like that, we're talking smaller than the chip, for a big high-performance chip. So we have a problem, and that's what was happening to those hash buckets.

In fact, we have two problems, and according to a guy named Stephen Hawking, they're fundamental. He visited Intel Research and actually came up with these two, so I'm bumming them off of him. The first is that the speed of light is finite, as we saw on the previous slide, and we haven't figured out a way to make information go faster than light — there was a false alarm about neutrinos a couple of years ago, but as far as we know, we can only go so fast. The second is that everything we build is made out of atoms, and atoms only get so small. About seven years ago I saw a scanning electron micrograph of a cross-section of a transistor, and there were about this many atoms across the base — the base being the layer in the middle that controls how fast the transistor switches. They've actually made research prototypes with a one-atom-thick layer for the base. [Audience question] I'm sorry, I don't know — let me know what you find out when you check into that. Yeah, we take atoms for granted; they may be bitching at us the whole time and we wouldn't know. I don't know how thin they are in production, but it's definitely fewer than five atoms, so there are limits on how far we can go.

One of the cool things about read-mostly — and I've been doing it the easy way all my time with RCU and the like — is that read-only data is replicated across the caches of all the CPUs that use it. That's the little black square you can see in each of the little green areas. If all the CPUs are reading a given value, it appears in all their caches, with a very small distance to travel, so access is very high speed. It works out really nicely — until the first time you update it. Let's say the CPU down here updates that same variable: all of a sudden, bang, it gets yanked out of all the other CPUs' caches and appears only in this CPU's cache. First off, the update itself is a slow operation, because it has to go talk to all the CPUs to rip that cache line out of their caches. And then all the other CPUs, the next time they read it, have to go talk to this CPU — a long way away — to get it back. That's the pain of doing updates.

Well, you know: "Doctor, it hurts when I do updates." What does the doctor say?
Exactly — you guys are with it — exactly that: don't do updates. Well, the reply back, of course, is: if I don't do updates... I mean, some algorithms keep everything in machine registers, and they're in great shape with this advice, but the stuff I work with doesn't fit in machine registers, sorry. As a result, we really don't have any choice but to do updates at some point or another. What we do have to do is be very careful about how we do them.

And there are some ways of dealing with this. There are special cases where you can do updates at full rate — the full machine rate — and they're used really, really heavily in the Linux kernel and other places. I first came across this in the early '90s, and it was considered an old technique even then. I'm talking about per-CPU things like split counters: you just put a counter on each CPU, or on each thread, and go from there. There's also read-only traversal to the location being updated. The lesson there is: when you're doing an update, update only what you have to update; don't touch stuff that isn't involved in the update — we'll look more at that. And of course there's that trivial lock-based concurrent deque I teased you with earlier in the presentation.

So, split counters. How many people have used split counters — per-CPU counters or per-thread counters? Yeah, good, we've got some people doing it. It's straightforward. The slide is a little misleading: you would not put the counters right next to each other; you'd have at least a cache line's worth of space between them. In fact, normally you'd have all the per-CPU variables for one CPU in one clump and those for another CPU somewhere else. Christoph has an allocator that does something about that, so if you want to allocate per-CPU variables in the latest kernels, talk to Christoph. I even use it, I think — or do I? I can't remember. Oh, SRCU uses it.

In any case, what happens is that when you do an increment — an update — you increment your counter and only your counter, which means it stays in your cache, and you get full speed. Now, if somebody wants to read this thing, they have to do more work: they have to sum up all the counters, which is slow. But if you're, say, in networking — yes, we have networking people here — you'll have packets coming in. What's the biggest packets-per-second rate you've had on a machine? It's millions, I remember that, but not how many. Fourteen million, okay. And what are you at now? Nine million — okay, so he's got a machine that's going to increment its counter nine million times a second, plus a goal, otherwise he wouldn't have any work left to do. Fifty-six million? Okay — if you're doing 56 million increments per second this way, you're not going to take a huge number of cache misses on that counter. In fact, hardly any at all.
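Here's a minimal user-space sketch of the split-counter idea, using per-thread counters (the kernel uses per-CPU variables instead; the names here are illustrative):

```c
#define NR_THREADS 64
#define CACHE_LINE 64

/* One counter per thread, padded to its own cache line, so an
 * increment never bounces a cache line between CPUs. */
static struct {
	unsigned long v;
	char pad[CACHE_LINE - sizeof(unsigned long)];
} counter[NR_THREADS] __attribute__((aligned(CACHE_LINE)));

static __thread int my_id;         /* assumed assigned at thread start */

/* The nine-million-times-a-second path: touches only this thread's
 * own counter, which stays hot in the local cache.  Only the owner
 * ever writes it, so no atomic read-modify-write is needed. */
static inline void count_inc(void)
{
	unsigned long v = __atomic_load_n(&counter[my_id].v, __ATOMIC_RELAXED);

	__atomic_store_n(&counter[my_id].v, v + 1, __ATOMIC_RELAXED);
}

/* The rare read side pays instead: one cache miss per thread, and
 * the result may lag reality slightly -- fine for a statistics
 * query every few seconds. */
static unsigned long count_read(void)
{
	unsigned long sum = 0;

	for (int i = 0; i < NR_THREADS; i++)
		sum += __atomic_load_n(&counter[i].v, __ATOMIC_RELAXED);
	return sum;
}
```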
The thing is, he's got this machine doing about nine million packets per second, but reading the count is a system-administration kind of thing — somebody looking to see how many packets this machine has sent — and maybe you do that every five seconds if you're really worried about it. So the read is very rare and the update very common, and this is a great way to optimize very heavily for the update.

So updates are not necessarily guaranteed to slow us down. At least in some special cases, we can use tricks to make the updates fast, possibly at the expense of the reads. And it turns out there are a whole bunch of ways of skinning that counting cat — I've got a book with a whole chapter on the different things you can do. You can even make both the reads and the updates fast, but then the changes in value propagate fairly slowly from the updaters to the readers — and there are a bunch of other tricks you can play.

Okay, so counting is great, and sometimes counting is the whole problem, but most of the time you're doing something and merely counting it as you go. So we need something else besides counters, at least for some software.

One fairly common pattern: you've got a fairly large data structure, and you're updating part of it — potentially any part, but each individual update is small. For example, you might add or delete something in a binary search tree, which is the problem I was handed in Issaquah last February. A hash table has the same shape; graphs, any number of things are like that: you traverse something and then do an update in one location.

Now, the classic methodology isn't all that smart about this, with one exception we'll get to, which has to do with tree balancing — which I was allowed to ignore, because they can't do balanced trees fast with transactional memory, so I don't have to either. I get to pre-balance and not worry. Actually, balancing algorithms for concurrent binary trees are a very heavy-duty research topic right now; a lot of people are trying to figure out ways of making that work. But I figured I'd stick with the challenge as given, because I'm cowardly and all that.

The classic approach, as shown in the sketch after this paragraph: you lock the root, use repeated key comparisons to select the next descendant, lock the descendant, release the lock on the root, and keep going. That's been around for a long time — we've been doing it forever in databases and elsewhere. But the lock contention on the root is going to be a real problem; that's just not going to be fun at all. And even if you did this, it clearly doesn't qualify as a solution to the Issaquah challenge, because there would be contention between two different updates to the same tree — so this need not apply.
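A minimal sketch of that classic hand-over-hand (lock-coupling) scheme, with illustrative types — note how every single search serializes on the root's lock:

```c
#include <pthread.h>
#include <stddef.h>

struct tnode {
	pthread_mutex_t lock;
	unsigned long key;
	struct tnode *child[2];    /* [0]: left, [1]: right */
};

/* Lock the root, compare keys to pick a child, lock the child, and
 * only then release the parent; repeat.  Returns the target node
 * still locked (caller updates, then unlocks), or NULL.  Correct,
 * but every traversal hammers the root's lock. */
struct tnode *hoh_search(struct tnode *root, unsigned long key)
{
	struct tnode *p = root, *next;

	pthread_mutex_lock(&p->lock);
	while (p->key != key) {
		next = p->child[key > p->key];
		if (!next) {
			pthread_mutex_unlock(&p->lock);
			return NULL;
		}
		pthread_mutex_lock(&next->lock);   /* child first...    */
		pthread_mutex_unlock(&p->lock);    /* ...then parent:   */
		p = next;                          /* hand over hand    */
	}
	return p;
}
```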
And that's why we have RCU. I'm going to give a really quick run through RCU — I have a guest lecture I do at universities from time to time that runs between one and two hours, and I'm not doing that here. The slides will be available after this presentation, and there are a bunch of places listed in them where you can find more information. This is just a quick, glossy, high-speed fire-hose overview.

So, the problem we have is these expensive operations: we had that hash table a few slides back with per-bucket locks, and acquiring those locks was really expensive when there was no locality. So, back to the "it hurts when I do that, doctor" approach: let's just not use expensive operations at all. I'm going to argue that these are the lightest-weight operations possible: just "#define rcu_read_lock()" — newline. If you look in the Linux kernel, there'd be more stuff there, but it's debug code, so it doesn't do anything; it disappears in a production build. And "#define rcu_read_unlock()": nothing.

This may seem a little extreme. If you can do better — which would require negative overhead — I want to know about it; that would be really cool. But until somebody actually shows me something like that, I'm claiming this is the best you can do. And if it seems a bit extreme, I'm reminded of a sign that a guy I worked with in the early '90s — big white beard, long white hair — had in his cube: "Only those who have gone too far can possibly tell you how far you can go." So let's go all the way and see what happens.

Now, I assert this gives you the best possible performance, scalability, real-time response, wait freedom, and energy efficiency. So we've got some benefits. But there's one question, if you haven't dealt with RCU before — and maybe even if you have: these things clearly aren't affecting machine state. In fact, they're not even making it to the back end of the compiler. Not only do they emit no instructions; the back end of the compiler doesn't know they exist. So a reasonable question at this point is: if it doesn't affect machine state, how the heck do you use it as a synchronization primitive? That's a valid question.
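For reference, the read-side primitives under discussion really are this empty as the talk presents them — in a non-preemptible production build, the kernel's versions add only debug checks and a compiler barrier, and emit no instructions:

```c
#define rcu_read_lock()
#define rcu_read_unlock()
```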
We'll get there. But let's start by taking it as given that the readers don't affect machine state, and say we want to add something to this pointer. We've got a sequence of four states with three transitions, time advancing from left to right. We start with cptr, the pointer, and we're going to make it point to something — in a way that lets the readers come flying in any time they want. Clearly, if a reader starts with rcu_read_lock() being nothing, there's nothing the updater can do to block it; there isn't even a way to tell whether a reader is there or not. So if we can publish the structure without that knowledge, we've got at least part of the problem solved.

So what we do is allocate the data structure, with a temporary pointer — presumably on the stack of the guy doing the update — pointing at it, and its contents are garbage. Then we initialize it, so its fields hold the values we intend. And then we do rcu_assign_pointer(), which you can think of as an assignment statement: we assign tmp to cptr, so cptr now points to the structure. But it also prevents the CPU and the compiler from doing nasty things to us, which they otherwise would. And the readers come in with rcu_dereference() — again, think of it as an assignment statement that prevents CPUs and compilers from doing nasty things to us.

What happens here is that the assignment is atomic in the sense that if a reader concurrently picks up that pointer, the reader will see either the old NULL value or the new pointer to this data structure — it's not going to see some mush of the two values. That means readers will either see the structure being there or they won't, but either way they see something valid: a valid NULL pointer, or a valid pointer to a properly initialized structure. So even with this non-existent read-side primitive, we can add things to a data structure safely, and that's wonderful.
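Here is that add-to-pointer sequence as kernel-style code — a minimal sketch assuming the two-field structure and global cptr from the slides:

```c
#include <linux/errno.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo { int a; int b; };

struct foo __rcu *cptr;            /* starts out NULL */

/* Updater: allocate (garbage), initialize, then publish.
 * rcu_assign_pointer() keeps the compiler and CPU from reordering
 * the initialization after the pointer store. */
int add_foo(void)
{
	struct foo *tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);

	if (!tmp)
		return -ENOMEM;
	tmp->a = 1;
	tmp->b = 2;
	rcu_assign_pointer(cptr, tmp); /* now readers can see it */
	return 0;
}

/* Reader: sees either NULL or a pointer to the fully initialized
 * structure, never a mush of the two. */
int read_foo(void)
{
	struct foo *p;
	int ret = -1;

	rcu_read_lock();
	p = rcu_dereference(cptr);
	if (p)
		ret = p->a;
	rcu_read_unlock();
	return ret;
}
```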
I mean only has to wait for the readers that Got to the pointer before it got changed Because the new readers at this point don't have any way to get to the cat So you only have to wait for the old readers But if we do wait for the old readers somehow and all those old readers get done and leave the data structure Then nobody can be except the updater can be referring to the cat That's why it's color green. It was yellow when only the old people get at it Now it's green because nobody except the updater has a reference to it And then we can do whatever we want with it including just freeing it, which is what we do here All right So if we have a magic operation That waits for the old readers and only the old readers they can wait longer if it wants to but it has to wait for all the old readers then we can safely Yank that thing out of the data structure and just destroy it and Again, this is kind of a brief overview. I'll have some some places you can look for more information later But to give you sort of a flavor for what's happening And it turns out you actually can't implement this if you have a non-pramptive environment an example non-pramptive environment Is the Linux kernel when it's built with config preempt equals M Now if you build with if you have a non-prample environment And you have a pure spin lock which the Linux kernel has you say spin lock the lock you say spin unlock to release it One of the rules is that you do not block while holding that spin lock Regional block is let's say you have two CPUs one CPU grabs a spin lock and then blocks The other CPU says okay. I want that seat spin lock Well, it spins because it can't have it and it's a spin lock. So just keep spinning till it does get it Let's say the first CPU run some other task I mean the other guy blocked right and that wants to spin lock too. So it spins waiting for the lock Now we're deadlocked The two guys spinning on the CPU can't get the lock until the guy releases it the guy who holds a lock can't Possibly release it until it gets a CPU and the guy spinning the lock aren't gonna let go their CPU until they get the lock So, you know, we're done So therefore if you're in a non-pramptive environment, and you have a pure spin lock you do not block while holding spin lock We apply that same rule to RC your readers If you've done RC read lock, you are not allowed to block until you do the corresponding RC read unlock What that means is that if people are following the rules You have to follow the rules if they are Then as soon as CPU zero does a context switch We know all the readers are done on that CPU because they aren't allowed to block See that blue thing there that started with an RC read lock and ended with an RC read unlock If that that guy's not allowed to block inside of there So there can't be any context which which is inside that RC reads our critical section So once we see that CPU context switching Then we know great that all the previous readers on that CPU were done and Then we got one immediately with synchronized RC because of VD blocks and then finally when the CPU one does its context switch At that point all the CPUs have done a context switch and we know that in each of those context which has been after we remove the cat Therefore, we know there are no readers reference to the cat So at this point the grace period is ended. We can free the cat. 
It's a little bit weird — one of those things that's simple, but you have to think about it kind of upside down. So if it doesn't make sense on the first pass, don't worry; try it again later. I had a heck of a time with recursion once upon a time, and people have a similar problem with RCU, so I guess I'm getting payback for my tortures in school.

[Audience] Can you give a quick example of a blocking operation that's not allowed? Waiting for a network packet to come in; a timed wait — in user space that would be a sleep(5) or the like, and in the kernel schedule_timeout_interruptible() or some such, for however many jiffies — anything like that. Taking a sleeping lock; acquiring a semaphore. A lock that blocks is another good one.

Okay, let me make sure I understand the next question. If I have it right: we've got CPU 1, and it's got this reader here — the first one, which might see the cat, because it started before the cat was removed — but these other two readers can't possibly get to the cat, because they started after it was removed. That's true. What we're doing is being more conservative than we have to be. Looking at the diagram, we could let go right here, because we know all the readers that had access were done at this point. But that would require a non-empty rcu_read_lock() and rcu_read_unlock(). Doing it this way, we're a little more conservative — we take more time than strictly necessary — but the return we get is extremely lightweight, as in zero-cost, read-side operations. That's right: with this algorithm, we wait for some readers we don't strictly need to wait for.

[Audience] Wouldn't waiting for all the CPUs to do a context switch, so you know they're all done, require hitting some global memory? It does, and the nasty trick we use is that context switches are already fairly expensive, and the extra instrumentation RCU puts on them is insignificant by comparison. So yes, there is some cost; what we've done is piggyback on an expensive operation and bury the cost there.

Okay, so that's a really quick overview of RCU. Let's go back to this business of synchronizing without changing machine state, because remember, one of the objections — and I've had people get right in my face and scream at me about this — is: you've got these rcu_read_lock() calls that don't do anything; how can they possibly be involved in synchronization? They don't change the machine state. Well, the trick is, they don't have to. They act not on the machine hardware but on the developer: the developer has to follow the rule that you're not allowed to block inside a read-side critical section. RCU is therefore an example of synchronization via social engineering. That's one of the reasons it's weird to think about, and why people have a hard time with it: it isn't working on the machine, and people are used to synchronization being purely about the machine. That's not the way RCU works.
Yes — let me repeat the question, or rather the objection, which was: given that there's always debug code in the Linux kernel, doesn't that mean synchronization via social engineering doesn't work? What I'm going to do is defer to Dave Chinner's talk yesterday, which is that people cannot program. But yes, we do have debugging code, because people aren't always good at following the rules — including me. [Audience] You can be multiple levels down a call chain and have absolutely no idea that the code you just called actually blocks — it's a trap. Yes, and there are other things like that we'll keep finding.

Anyway, the thing is, every other synchronization mechanism also has a social-engineering component. If you use locking, it's your responsibility to make sure you access the protected variables only while holding the lock; the lock is not going to enforce that — you have to. What's different about RCU isn't that it has social engineering, but that it's only social engineering: it has no mechanical component at all. Although, again, in other environments — a preemptible kernel, or some of the user-space variants — there is a little bit of work that rcu_read_lock() and rcu_read_unlock() have to do. But this pure variant, the first one we came up with long, long ago, is nothing but social engineering.

Okay. So, given RCU, we can do a better job of the read-only traversal to the update location. Remember, before, we were grabbing the lock on the root, which is a really bad idea: everybody hits the root, there's massive contention, it goes slow, it doesn't scale. What we do instead is enter a read-side critical section and start at the root of the tree, but we don't lock it. We traverse the tree without locking until we find the node we want to update, and then we acquire the locks. Of course, several other people may be doing the same thing at the same time with the same nodes, so after we acquire the locks, we have to do some consistency checks, because the world may have changed. Those two nodes may have been deleted from the tree entirely — they can't have been freed, because we're still in our RCU read-side critical section, but they might have been removed. So we do some checking to make sure the tree is still in a shape that allows our update; if it isn't, we retry, starting over at the root. If it's all consistent, we carry out the update and get back out. And that means we have no contention on the root node unless we're updating the root itself, which I avoid in my scheme here.

So what you do is put "removed" flags in the data elements, for example. When you remove an element, you hold the lock on it, you set a flag inside saying "hey, this has been deleted," and then you release the lock. When somebody else comes along having found the same element — because they got there before you actually deleted it — they look and see it's a deleted element.
They say: okay, I'll pretend I didn't find this, and start over. I'm not going to go through the code in detail, but that's what it does: it leaves state behind saying "this thing has been removed," so the people who come along afterwards can fix up and get out. The idea is to focus the contention on only the part of the data structure that's actually being updated. The rest of the data structure you just traverse inside a read-side critical section, and you let RCU keep the memory from being sent off to the freelist underneath you. This has been around for a long time in one form or another; I won't go through the history in detail — again, the slides will be up at some point and you can look at them.

And we can use this as the basis for one solution to this Issaquah problem. What we're going to do is have the tree protected by different synchronization primitives in different parts of the tree. It's a little more complicated if the tree is unbalanced, but if it's perfectly balanced, everything except the bottom two levels of the tree is protected by RCU; below that point, you use locking for the updates. You also use RCU to find the nodes, and then we use some tricks to move elements back and forth. What we end up with is a standard tree algorithm with a few changes in a few places to do the existence checks and the consistency checks.

So what bad things can happen while you're acquiring the locks? You go down, you find the nodes you want, you acquire their locks — and meanwhile somebody else has been messing with you; they got the locks first. One thing that can happen is that both nodes have disappeared, in which case both will have their deleted flags set; you look at them, go "okay, forget this," and start over. It's possible the parent was deleted and the child moved somewhere else — same deal, you see the deleted flag. The child might have been deleted, or the parent moved somewhere else. And one I found out about the hard way: they're both still there, but something got inserted between them. That's pretty bad if you're assuming they still point to each other — you'll mess the tree up quite thoroughly, and I had a little fun debugging that. So obviously I was not programming to Dave's point from yesterday.
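A minimal sketch of that pattern — lockless RCU traversal, lock only the target, recheck a deleted flag, retry on failure. The types and field names are illustrative, not the prototype's actual API:

```c
#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct tnode {
	spinlock_t lock;
	bool deleted;                  /* set under ->lock when unlinked */
	unsigned long key;
	int val;
	struct tnode __rcu *child[2];  /* [0]: left, [1]: right */
};

/* Update the value of the node with the given key, if present.
 * The traversal takes no locks; only the node being changed is
 * locked, so contention is confined to the actual update location. */
static bool tree_update(struct tnode __rcu **rootp,
			unsigned long key, int newval)
{
	struct tnode *p;

retry:
	rcu_read_lock();
	p = rcu_dereference(*rootp);
	while (p && p->key != key)
		p = rcu_dereference(p->child[key > p->key]);
	if (!p) {
		rcu_read_unlock();
		return false;          /* no such key */
	}
	spin_lock(&p->lock);
	if (p->deleted) {              /* consistency check failed: */
		spin_unlock(&p->lock); /* we lost a race, so pretend */
		rcu_read_unlock();     /* we never found the node    */
		goto retry;
	}
	p->val = newval;               /* the actual update, under lock */
	spin_unlock(&p->lock);
	rcu_read_unlock();
	return true;
}
```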
Now, it's not enough to pull stuff into and out of the tree; the move has to be atomic. And we're going to solve yet another computer science problem by adding another level of indirection — what else would you do, right? There may be better ways to do this — we don't know yet what's better or worse — and there are ways involving fewer levels of indirection, which we'll mention shortly.

We add a pointer to each element in the data structure. Normally it's NULL — we'll get to that in a moment — but while the element is being moved, it points to an existence structure, which has an offset, and all the existence structures point to this yellow box, the existence switch. The offsets may differ; here we just have zero and one. We only have two states here, but you could have more if you wanted. The existence switch then points you at either the array on the top or the array on the bottom.

So what happens is this: we come in, we see the element's existence pointer is non-NULL, so we follow it, note the offset — zero — and follow on to the switch, which currently points at the top array. Entry zero of that array is one; therefore this element exists. Meanwhile, the guy on the bottom: offset one, the switch points at the top array, entry one is zero — this element does not exist. The nice thing is that we can do a single store to that yellow box and suddenly be pointing at the bottom array, with the opposite outcomes for those checks. So we can have an arbitrary group of elements at the top with offset zero, and another arbitrary group pointing to the bottom with offset one, and with one single store to the existence switch, atomically make the whole lot on the top disappear and — atomically with that — make the whole lot on the bottom appear. That's the basic trick we're going to use. And we can animate it, right? The guys come in; they're gone; they're in; they're gone. The dottedness on the arrows there is a little subtle. One single store, and an arbitrarily large group of changes appears atomically.

This is the C code that goes with the existence part — I'm not going to go through it in great detail. I'm going to demonstrate with a tree, but I'm not a good enough artist to draw one of those existence structures at every node; it would fill up the slide and be a mess. So I'll abbreviate it like this: a pointer to the little box with the one stands for a pointer to the bottom existence array, and a pointer to the zero stands for a pointer to the top. This picture is after the switch has been flipped, and the red and blue are essentially a color hint to go along with the zero and the one.

Now, you might object that I added three levels of indirection, not just one, and even by computer-science standards that might be considered excessive. On the other hand, most of the time elements aren't being moved, and during those times you can use a NULL pointer. NULL pointers are cheap: you just say, oh, it's NULL. The confusing part is that a NULL existence pointer means the element does exist — and I'll guarantee you, I got that backwards every single time I tried to code it directly. Therefore, we have an API for it. Only in the uncommon case where the element is actually in the process of being atomically moved somewhere is there a chain of pointers to go through. And there's no free lunch: that chain is genuinely expensive — multiple cache misses — but if a given element isn't being moved very often, it should be perfectly acceptable. We do have to use more expensive operations to traverse those pointers, because we want the whole thing to appear atomic: something like smp_load_acquire(), rather than READ_ONCE()/ACCESS_ONCE() or rcu_dereference(), unfortunately.

However, we can reduce the number of levels of indirection from three to one using a trick that Dmitry Vyukov pointed out: have the pointer coming out of the data structure carry the offset in its low-order bits, and put the state directly in the switch instead of a pointer to an array. Then you have one level of indirection. Of course, that means giving up the bottom bits of your pointer — pick your poison.
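Here is one way the full three-level layout just described might look in code; the bit-stealing variant would collapse the chain into one tagged pointer. This is a reconstruction from the description — the names and exact layout are illustrative, not the prototype's API:

```c
#include <linux/rcupdate.h>
#include <linux/types.h>

/* Outcome arrays, indexed by an element's offset.  Before the flip
 * the switch points at "outgoing" (offset 0 exists, offset 1 does
 * not); the single committing store repoints it at "incoming". */
static const int outgoing[2] = { 1, 0 };
static const int incoming[2] = { 0, 1 };

struct existence_switch {
	const int *state;              /* -> outgoing, then -> incoming */
};

struct existence {
	struct existence_switch *sw;   /* the shared "yellow box" */
	int offset;                    /* 0: old copy, 1: new copy */
};

struct enode {
	struct existence __rcu *exist; /* NULL: the cheap "I exist" */
	/* ... key, value, children, lock, deleted flag ... */
};

/* Does this element exist right now?  NULL is the common, cheap
 * case; the chain of dereferences is paid only mid-move, using
 * smp_load_acquire() so the check appears atomic with the flip. */
static bool node_exists(struct enode *p)
{
	struct existence *e = rcu_dereference(p->exist);
	const int *state;

	if (!e)
		return true;
	state = smp_load_acquire(&e->sw->state);
	return state[e->offset];
}

/* The commit: this one store atomically flips every element hooked
 * to this switch -- old copies vanish, new copies appear. */
static void existence_flip(struct existence_switch *sw)
{
	smp_store_release(&sw->state, incoming);
}
```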
Ten years ago I might have been yelled at for stealing pointer bits like that, but no longer, right? Dave's point was that that trick is used all over the place in the kernel.

Anyway, I think I have enough time to go through an animation of this and then the performance data; we'll skip these other slides. We start off with elements one, two, and three in the tree on the left, and two, three, and four in the tree on the right. The existence pointers of all the nodes are NULL: each of these nodes simply exists.

First we create the existence structures. We take the old four and the old one and point them at the "exists" side — they still exist. We allocate the new four and the new one and point them at the "doesn't exist" side. Now, we've made a change, non-atomically, to the original element one and the original element four — and that's okay, because the answer is the same: the element exists. It's just suddenly more complicated and expensive to compute. Before we changed that NULL pointer the element existed; it exists still. So we can take our time populating these existence structures throughout however many data structures of whatever types we want.

Now we insert the new elements one and four into the trees. We do that non-atomically too — using just whatever protection is needed to keep the internal integrity of each data structure — and that's okay: they didn't exist before, because there wasn't a pointer to them, and they don't exist now, because the existence structures say they don't. No big deal. Now we flip that existence switch, and suddenly the old stuff doesn't exist anymore and the new stuff does. That's the part we have to be careful about; that's the step that must be atomic.

Then all that's left is cleanup. We NULL out the existence pointers of the new one and the new four — they existed before the cleanup and they exist now, just for different reasons. And we disconnect the old ones: they didn't exist after the flip, because the existence structures said so, and they don't exist now because there's no path to them. Then we free things up.

You don't have to restrict this to two trees, by the way. You could potentially move something from a tree to a hash table or whatnot — I'm not sure why you would, but you could have a whole pile of different data structures and make their elements swap back and forth. And you don't have to have just two states; you could have a whole pile of them and make a sort of animated GIF of a data structure that just throws things around. I'm not sure why that would be useful either, but you could do it. Anyway, I'm not going to go through the API, aside from saying it's there.
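The whole move, in code form, might look like this. Every helper named here (alloc_switch(), dup_node(), set_existence(), tree_insert(), and so on) is a hypothetical stand-in for the prototype's API — the point is just to pin down the order of operations from the animation:

```c
/* Atomically move elements "one" and "four" between two trees,
 * using the existence types from the sketch above. */
static void move_one_and_four(struct tree *left, struct tree *right,
			      struct enode *one, struct enode *four)
{
	struct existence_switch *sw = alloc_switch(); /* -> outgoing */
	struct enode *new_one = dup_node(one);
	struct enode *new_four = dup_node(four);

	/* 1. Hook everything to the switch, non-atomically.  Old
	 *    copies get offset 0 (still exist), new copies offset 1
	 *    (don't exist yet).  Nothing observable changes. */
	set_existence(one, sw, 0);
	set_existence(four, sw, 0);
	set_existence(new_one, sw, 1);
	set_existence(new_four, sw, 1);

	/* 2. Insert the new copies, again non-atomically; readers
	 *    can find them, but node_exists() says "not here". */
	tree_insert(right, new_one);
	tree_insert(left, new_four);

	/* 3. The one atomic step: old copies vanish and new copies
	 *    appear, everywhere at once. */
	existence_flip(sw);

	/* 4. Leisurely cleanup: the new copies now exist the cheap
	 *    NULL-pointer way, the old copies become unreachable,
	 *    and nothing is freed until the old readers are done. */
	set_existence(new_one, NULL, 0);
	set_existence(new_four, NULL, 0);
	tree_remove(left, one);
	tree_remove(right, four);
	synchronize_rcu();
	free_node(one);
	free_node(four);
	free_switch(sw);
}
```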
For read-only work, of course, the performance is what you'd expect: very good. We get super-linear speedups, and the reason is that with this big tree, more CPUs means each one holds a smaller piece of the tree in its cache, so its accesses are faster — standard stuff. With a mix of operations — taken from a paper in an ACM publication on transactional memory — it doesn't scale quite as well. It scales okay, but at about 32 CPUs it levels off, so we get 40x out of 60 CPUs, and that's up from a bit less a while back. What's happened here is that I haven't really increased the scalability; I've increased the performance slightly.

If moving elements is the only operation we do on the tree — 100% moves — life isn't quite as good. This graph only goes up to eight CPUs: we're at about 7.1x now; we were at 6.4 in September and 3.7 back in May. This is, again, conference-driven development — those were the conferences. If we go to 60 CPUs, things keel over at about 32, and the job now is to find out what that bottleneck is. Maybe it's hardware, in which case I'm stuck; or maybe it's another stupid thing I need to deal with — this slide lists all the stupid things I've had to deal with so far.

What is this actually useful for in practice? We don't know. I was given a challenge, right? Maybe it's useful, I don't know. Best guess: it fits when you have a large data structure and you're making small changes to it. But maybe it's just something that exists to say, yes, I can do that, damn it.

[Audience comments] Okay — so nothing new under the sun, as they say. Christoph mentioned something, I think it was something else, but Christoph Hellwig mentioned a similar thing. And what Dave said was that XFS has had something that looks like this, although it involves the disk. But still, I guess it does have some use.

All right, we're kind of out of time, so I'll just mention that right now this is an R&D prototype. If you use it in production, fine, but don't come crying to me when it breaks. And anyway, that's how you solve the deque problem from the beginning of the talk: you just have two of them. It's easy. And the real question is, why are you doing this in the first place? If you have a parallel program pushing all of its state through one double-ended queue, you've got a design problem, my friend.

Anyway: no silver bullet, but it might be useful — we don't know. Here are some things to look at if you want to see more about RCU or this stuff, and this is the usual slide required by IBM Legal. I don't know if we have time for more questions — I know we've run over a little bit. Is it break time? Tea time? Okay — I'll be here for a little while if you want to ask questions individually, and I'll hand it back to Rachel.

[Session chair] All right, we have run over slightly, so a big round of applause for Paul — and we've got a small gift on behalf of LCA. Thank you, Rachel, and LCA. So, a big round of applause, go enjoy lunch, and thank you all for your time and attention.