Hi folks, welcome back to another live coding stream. We haven't done one of these in a little while, and part of that is because it's really hard to find good topics for these long-form live coding streams. It has to be something that I think is interesting and care about, because some of these end up being like three videos of six hours each, and that's a lot of time to stay focused. But recently someone sent me a paper called "A Practical Wait-Free Simulation for Lock-Free Data Structures", and we'll go through what that actually means. It piqued my interest; I sort of got nerd-sniped into "this is really interesting, I wanna read the paper." And while reading it, I thought this could be really interesting to try to encode in Rust. So that's what we're gonna do today: walk through this paper, try to understand what it's proposing, and then try to implement at least a skeleton of it in Rust. I don't know exactly how far we'll get, and there are some holes that I can already see will be present in the code that we write. But I think it's gonna be a really interesting adventure in translating academic work, and in particular algorithms and data structures work that wasn't written with Rust in mind, and that in many cases we don't even have the code for, into actual real Rust code. I wanna preface this with the fact that this is academic research, and one of the artifacts of that is that there are open questions. There are things here that just won't necessarily have an obvious solution. There might be places where we get stuck. There might be places where we write an initial implementation and it turns out to be wrong, or subtly incorrect, or it doesn't scale the way we want it to. And that's okay.
We're doing this a bit as an exercise in trying to understand this academic work, and in understanding how to model the abstraction that they're presenting in the paper in the context of the Rust type system. Similar to how most of the other live coding streams work, my guess is that this stream will probably be around four to five hours. If you're watching this after the fact, you can see how long it is at the bottom, but that's for those who are live in chat. And I'm guessing we're gonna do a couple of videos on this, because I could totally imagine that once we have the architecture up and running, we might then go and implement multiple different data structures on top of it, and do some benchmarking and debugging and whatnot. I also already have another stream in mind that I think we're actually gonna end up needing fairly soon for implementing this paper, and that's implementing something called hazard pointers. I wanna see if I can dig up that paper; I don't have it open here. But Facebook has an open source C++ library called Folly where they've implemented a really neat variation of this concept of hazard pointers, and I wanna try to port that to Rust. That might be the next stream. So it might be that this stream is gonna be about getting the initial model set up for the algorithmic abstraction that the paper lays out, the next stream is gonna be implementing hazard pointers, which is a completely separate topic, and then the third stream is gonna be using hazard pointers in the context of this paper, because I think the two are gonna tie together. But we won't actually talk too much about hazard pointers today. I just want to give that as context for something that's gonna come up later in the stream and again in future streams. All right, so back to the paper.
There's a link in chat now, and I'll also put it in the video description. This is the same one that I tweeted out. There's a shorter version of the paper as well for those who don't wanna look through everything around it, but I'll be working from the one that's labeled the full version. This is the extended technical version that has a bit more detail, a bit more performance analysis, and an appendix that includes some code. Before I try to explain what this paper actually is and what the words mean: are there any meta questions about what we're about to embark on? I know that many of the people watching are probably coming from the Crust of Rust streams, and this sort of live coding stuff might be a little different, or you might just have questions about how going from an academic work into code is gonna work. Let's handle those meta questions at the top. Wait-free, yeah, I'm gonna talk about what wait-free means, what lock-free means, what simulation means, and what practical means. The title, I promise you, we're gonna go through in a lot of detail. Where's Chai? Chai is downstairs staring out the window looking for suspicious things on the street. Let's see. Will you give a tutorial on how to read academic papers along the way? That's a good question.
The way that we're gonna approach this paper is a little different from how I normally read academic papers, because here our goal is to implement it. I mean, we are trying to understand what they're doing, but we're not really trying to understand the work in the sense of why its performance characteristics are what they are, which is what I often look for when I read a paper, where I'm trying to understand the approach and evaluate the paper more formally. Here we're taking a much more practical approach: we're sort of assuming that this paper is a good idea, and now we're gonna go and implement it. So it's a little bit of a different type of reading, but I'll give some pointers about how I generally go through these kinds of papers. Is there any specific background knowledge we need to be able to understand the stream? In general, with these live coding streams, I assume that people are familiar with Rust beyond just having read the book. I expect that you know how the Rust language works and have written some Rust code that wasn't just a hello world example, but I don't expect that you necessarily know how concurrency and atomics work in great detail. Much of that we'll go through as I walk through the paper and also as we start writing the code. I won't spend as much time explaining exactly how things work as I do in Crust of Rust. Instead, when we get to a topic that there's already a video on, I'll probably say: if you're feeling stuck at this point, go watch this video and then come back later, because I'll upload the video at the end. The llama supervising, that's true. Do we just blindly believe it? For the purposes of this stream, I'm not really looking at whether this is a good idea, like whether all your production systems should be using this approach. And there are a couple of reasons for that. One is that this is very much academic work.
It's not clear that it's something you would use in production in its current state. It might be at some point, but we'll talk a little bit about what the advantages and drawbacks of this kind of algorithm are. We don't necessarily care about the performance results for this particular paper, or the higher-level argument of whether you should prefer wait-free or lock-free or lock-based or obstruction-free data structures and algorithms. We won't really get into that detail here, because what this stream is really about is teaching you one way to encode academic papers into Rust code, more so than evaluating whether this particular algorithm is one that you should care about and use. It will be on the test, yep. I will push this to a GitHub repo, both the code that we end up writing and some of the source code that I managed to get from the authors of this paper, which we'll be using for reference. Great, all right. I think that's enough prefacing. So let's first start out with just the title of the paper. The title very often gives you a good summary of the paper. Part of the reason for this is that the title has to be short. Well, it doesn't have to be, but in practice paper titles are short. And this means that as authors of papers, you spend a lot of time thinking about exactly the words to use in the title, so in general no given word is unnecessary. So we'll start out the stream by understanding: what does this title mean? Why were all these words chosen? If you've read the paper or some part of it as a result of me tweeting about it earlier, you might have read the abstract and some of the introduction of the paper, which goes into a little bit of detail.
But at a high level, data structures that are concurrent, that is, data structures that allow concurrent access from multiple different threads, can generally have one of several different properties. The weakest of these is lock-based concurrent data structures. These are the ones you're probably most familiar with, where you have something like a standard library hash map and you just put it behind a lock, so there is only one thread accessing the data structure at any given point in time. You can use any data structure you really want, and the mutex, the lock, is what ensures that you don't have problems where two threads both try to, say, remove a key at the same time. There's a variation on lock-based data structures that uses what's often referred to as fine-grained locking, where there's not necessarily a lock over the entire data structure, but sub-parts of the data structure are protected by locks. An example of this would be a hash map where every bucket is a linked list, and the buckets are individually protected by a lock. So if two threads try to access different keys, they don't take the same lock; they take different locks because they're in different buckets, and therefore they don't contend with one another and can both proceed in parallel. That's the advantage of fine-grained locking. But ultimately, these have a couple of challenges. There are a couple of reasons why you might want to avoid lock-based data structures, even the ones that are fine-grained. One of them is that they are by nature blocking. That is, if one thread is doing something and another thread tries to touch the same resource, whether that's the whole data structure or, say, a hash map bucket, no other thread can make progress on that resource at that time. And this has some implications.
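To make the fine-grained locking idea concrete, here's a minimal Rust sketch of a toy map with one `Mutex` per bucket. The `ShardedMap` name, the fixed shard count, and the `Vec`-of-pairs bucket representation are all just illustrative choices, not anything from the paper:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Toy fine-grained-locking map: each bucket has its own lock, so threads
// whose keys hash to different buckets take different locks and don't contend.
struct ShardedMap {
    buckets: Vec<Mutex<Vec<(String, i32)>>>,
}

impl ShardedMap {
    fn new(shards: usize) -> Self {
        ShardedMap {
            buckets: (0..shards).map(|_| Mutex::new(Vec::new())).collect(),
        }
    }

    fn bucket_index(&self, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.buckets.len()
    }

    fn insert(&self, key: &str, value: i32) {
        // Only this one bucket is locked; every other bucket stays available.
        let mut bucket = self.buckets[self.bucket_index(key)].lock().unwrap();
        if let Some(entry) = bucket.iter_mut().find(|entry| entry.0 == key) {
            entry.1 = value;
        } else {
            bucket.push((key.to_string(), value));
        }
    }

    fn get(&self, key: &str) -> Option<i32> {
        let bucket = self.buckets[self.bucket_index(key)].lock().unwrap();
        bucket.iter().find(|entry| entry.0 == key).map(|entry| entry.1)
    }
}
```

Note that both the whole-map lock and this per-bucket scheme are still blocking: whoever holds a bucket's lock stalls every other thread interested in that bucket.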
So first and foremost, this means that if some thread panics, like it crashes and doesn't release the lock, or it hangs, like it deadlocks while still holding the lock, or it goes into some kind of infinite loop, or it issues a syscall where it waits for, say, a TCP packet that might not arrive for a long time, then in all of these cases, the thread that has the resource locked is blocking all other threads from making progress. There's a more extreme version of this if you're doing slightly lower-level programming, where you might be dealing with stuff like CPU exceptions and interrupts, or, slightly higher level, signal handling. These primitives can interrupt the execution of a given thread, right? So imagine you have a single-threaded program that's running through its code, and at some point it takes a lock, and then it's in the critical section of using that lock, and then an interrupt or a signal comes in that causes the thread to jump to executing some entirely different code without finishing up the work it was doing. So one thread stopped in the middle of a critical section and jumped to some other code. If that other code tries to access the same resource, you now have a deadlock within one thread, because the thread is holding the resource, but it's also trying to acquire the resource, and it can't release the resource because it's blocked trying to acquire the resource. So this is another example of where lock-based data structures have some issues. That's not to say they're never useful; in fact, they're quite useful in many cases. Very often you know that you don't have a signal handler. You know that you don't have interrupts or exceptions coming in that you have to deal with. You know that none of your threads issue blocking system calls while they're holding a lock. And so you can work your way into system settings where it's okay to be using something lock-based.
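The single-thread self-deadlock scenario above can be demonstrated in a few lines. This sketch uses `try_lock` to observe the situation safely; a blocking `lock()` in the same spot would simply hang the thread against itself:

```rust
use std::sync::{Mutex, TryLockError};

// Demonstrates the "deadlock within one thread" scenario: while a thread
// holds a lock, any attempt by that same thread (say, from a signal handler
// or interrupt) to take the lock again cannot succeed. Returns true if the
// second acquisition would have succeeded.
fn reacquire_while_held(resource: &Mutex<i32>) -> bool {
    let _guard = resource.lock().unwrap(); // "critical section" begins
    // Simulated interrupt/signal handler running on this same thread:
    match resource.try_lock() {
        // Lock unavailable: a blocking lock() here would deadlock the thread.
        Err(TryLockError::WouldBlock) => false,
        _ => true,
    }
    // _guard is dropped here, so the resource becomes available again.
}
```

The point is that the mutex has no idea the second acquisition comes from the very thread that already holds it; it just sees a held lock and blocks.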
And the advantage, if you get into that space, is that you can design algorithms that are simpler, because locks in general allow you to write simpler code, and in some cases more advanced data structures that have better properties. In general, when you try to design things without locks, it constrains the designs that you can use, and so you might end up with data structures that don't have as many optimizations in them, because the optimizations are just too hard to make lock-free, whereas in the lock-based approach you might be able to have more optimized data structures. So there's a trade-off space there. So then we get to the distinction between things that are blocking, which is lock-based programming, and data structures that are non-blocking. Non-blocking algorithms include things that are wait-free, things that are lock-free, and things that are obstruction-free, and these all differ slightly in terms of exactly what they guarantee. The weakest one is something called obstruction-freedom, and this was actually introduced fairly recently; I think that paper is from 2003 or 2004. Obstruction-freedom just means that no thread can prevent other threads from making progress. That is, if some thread gets stuck, it doesn't prevent other threads from making progress. That doesn't mean that other threads will make progress, and this is where the distinction is important. So obstruction-freedom guarantees correctness but doesn't guarantee liveness, is one way to maybe think about it. Obstruction-freedom dictates that something else has to make sure that the other threads make progress. An example would be to say that this data structure makes progress as long as every thread gets to execute some instructions; basically, you assume a fair scheduler, is one way to think about it.
That's an assumption that's not generally okay in something like lock-free or wait-free data structures. But in obstruction-free data structures, you say that we don't guarantee progress, we just guarantee that no thread can prevent progress, and then you make assumptions that enable progress to happen anyway. Lock-free is a slightly stronger guarantee that dictates that at any given point in time, one thread can make progress, or rather, some thread can make progress. This could mean that you end up with starvation: one thread is just doing work over and over and over and making progress, doing everything it wants, but all the other threads, you can think of it as, they try an operation, realize that the thread that keeps spinning and doing things has changed something in a way that forces them to retry, they retry, and the same thing happens again. So ultimately only one thread ends up making progress. And then you have wait-free, which is the extreme case of this. In a wait-free data structure, you have to guarantee that every thread makes progress in a finite number of steps. That is, you have to design the data structure in such a way that even if one thread dies and goes away and never does anything again, it can't prevent other threads from making progress. So this is sort of like lock-freedom, where you have to guarantee that at any point in time some thread is making progress, but even more strongly, every thread has to be making progress at all times. You can think of this as: you can't say that, oh, this thread will get to make progress when it next gets scheduled, for example.
That's not good enough for wait-freedom, because that assumes a fair scheduler, which you might not have; the operating system might not implement a fair scheduler, and now suddenly your algorithm doesn't actually guarantee that everyone makes progress. So generally what you see with wait-free data structures is that there's some notion of helping, where if a thread can't make progress, it stores the fact that it can't make progress, and then the other threads are required to come help it. That way, as long as any thread makes progress, all threads make progress eventually. Now, these are obviously not the formal definitions, but they should give you a sense of what these different categories are and what guarantees they provide. I'm gonna pause there, because I just threw a bunch of algorithm and data structure terminology at you. Does this roughly make sense? Does the distinction between these make sense, and to some extent, does it make sense when you might want each of them? I'll add that in general, the stronger the guarantees you want, the more complex the data structure is likely to be. Lock-based data structures can be very simple: very often it's just a mutex, which is a fairly simple data structure in and of itself, plus an existing data structure that is not concurrent, and you just stick them together and then it works. As you go towards wait-freedom, you end up with stronger guarantees about progress, but you also end up with much more complicated data structures and algorithms, and generally worse performance. And performance here is weird, right? The performance that gets worse is sort of the single-threaded performance, is one way you can think about it: there's more stuff every thread needs to do for a given operation, so the operation is more costly, but also more threads can make progress, so you might scale better with, say, the number of cores. That's often the trade-off you make.
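To give a flavor of the helping idea just described, here's a deliberately simplified, hypothetical sketch. This is not the paper's construction, just a toy: a shared counter where each thread announces the amount it wants added in a per-thread slot, and any thread that gets to run applies every pending announcement.

```rust
use std::array;
use std::sync::atomic::{AtomicU64, Ordering};

const MAX_THREADS: usize = 4;

// Toy illustration of the "helping" pattern: each thread announces the
// amount it wants added in its own slot, and any running thread applies
// *all* pending announcements. Even if the announcing thread never runs
// again, its operation still completes.
struct HelpedCounter {
    value: AtomicU64,
    // One announcement slot per thread; 0 means "no request pending",
    // so requested amounts must be nonzero in this simplified sketch.
    pending: [AtomicU64; MAX_THREADS],
}

impl HelpedCounter {
    fn new() -> Self {
        HelpedCounter {
            value: AtomicU64::new(0),
            pending: array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    // Publish the request, then help everyone (including ourselves).
    // Bounded work: one store plus MAX_THREADS swaps. No unbounded retry.
    fn add(&self, thread_id: usize, amount: u64) {
        self.pending[thread_id].store(amount, Ordering::SeqCst);
        self.help_all();
    }

    // swap(0) hands each announced amount to exactly one helper, even when
    // several threads scan the slots concurrently.
    fn help_all(&self) {
        for slot in &self.pending {
            let amount = slot.swap(0, Ordering::SeqCst);
            if amount != 0 {
                self.value.fetch_add(amount, Ordering::SeqCst);
            }
        }
    }

    fn read(&self) -> u64 {
        self.value.load(Ordering::SeqCst)
    }
}
```

The key property: even if the thread that announced a request is never scheduled again, its operation is completed by whoever runs `help_all` next, and every call does a bounded amount of work. Real wait-free constructions are much more involved than this, but the announce-then-help shape is the same.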
All right, let's do some questions on that, because that was a lot of stuff. Are there domains where lock-free and wait-free are commonly used? So lock-free constructions you will usually see in cases where you really want scalability to many cores. An example of this could be GPU programming, although I don't know that they use a lot of lock-free data structures there, but it might be a good case. The other one is when you're doing very large computations where you want to make sure that your workload can scale nicely to, say, 64 cores or 128 cores, or the thousand-core processors that Intel or whatnot are developing, right? Once you get to the point where you actually want a thousand cores to make progress at the same time, a lock could be really problematic, because even if you have fine-grained locking, the chance that two threads are actually trying to access the same fine-grained lock is decently high. And so you end up with your concurrency being limited by basically the number of hot locks. That's the maximum parallelism, or concurrency rather, that you can achieve, as opposed to the number of cores. Whereas a lock-free implementation will generally give you the guarantee, this is not quite true, but will generally give you the guarantee, that your performance scales with the number of cores, not just with the number of non-shared resources. Wait-free takes this a step further and says that even in the case where you have, say, uneven resource use, like one thread is spinning and other threads are trying to make progress every now and again, the spinning thread can't hold up the work that you need the other threads to achieve. This comes up particularly in things like real-time programming and in some cases embedded devices, but generally any time you need to give a strong bound on how long a given operation can take, you might start to tend towards wait-free things, because with lock-free you can end up with starvation.
And so you might end up with some thread seeing, I don't know, a hash map insert take a very long time. You end up basically with very high tail latency, because that thread might be stuck trying to make progress while some other thread just keeps getting in front of it. So that's sort of the scale. What is the difference between non-blocking and wait-free? So wait-free data structures are also non-blocking. Wait-free is a strictly stronger guarantee than lock-free. Traditionally, I think lock-free was called non-blocking, and now we're sort of favoring the term lock-free over non-blocking, because non-blocking covers everything from obstruction-freedom onwards. Like, obstruction-freedom is also non-blocking, but it's sort of only non-blocking. Let's see, how can you have a wait-free data structure in an OS environment where you can't guarantee your whole program is going to progress, never mind all the threads? So the basic idea with a wait-free data structure is that as long as some thread progresses, that thread is going to be forced to help the other threads, and that's how you guarantee progress. As long as something gets to execute, progress will be made on the tasks of every thread. So even if a given thread doesn't execute, other threads are gonna execute on its behalf. That's where you get this helping mechanic that's often needed for wait-free stuff. Would Left-Right be classified as lock-free? Yes, I think that's true. Well, it's lock-free for reads. In fact, I think it's even wait-free for reads, and it is maybe lock-free for writes; it's unclear. Actually, no: a writer can be held up forever by a reader who doesn't release, so I don't think it's lock-free. I think it's actually not non-blocking for writes. Maybe it's obstruction-free, but even that's unclear. Like, imagine in Left-Right, right?
If a reader crashes after having incremented the epoch once but not twice, the writer will never make progress. So Left-Right is blocking for writes, but it is, I think, even wait-free for reads. Are atomics allowed in wait-free? They are; you basically need atomics for anything that doesn't use locks. The question is more how you use them, right? With something that's wait-free, you can't have one thread spin on, like, a compare-and-set loop, because that compare-and-set might never succeed if other threads keep running, and then you don't get wait-freedom. So you need some additional mechanism, and we'll actually look at how that works in the context of this paper. Is work stealing a form of wait-freedom? Not really, sort of. Work stealing does have this property that if I can't make progress, someone else will make progress on my behalf, but I don't think it's generally wait-free in the formal sense. All right, so let's try to move forward a little bit. We now have this rough notion at least of what wait-free and lock-free mean in this context. Data structure, I sort of assume you know what that is. And then we have this word simulation. So what does it mean that this is a wait-free simulation for lock-free data structures? This is something that the paper talks a decent amount about in its abstract and introduction, which is basically that it's actually fairly hard to design wait-free algorithms. Designing lock-free algorithms is complicated; designing wait-free ones is even more complicated. This paper is from 2017, although I think the original version came out in 2014, and I think it was only a few years before that that we started seeing, like, a binary tree implementation that was wait-free. It's complicated to get here. Wait-freedom is hard to achieve.
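Here's the compare-and-set point from above in code: a minimal sketch contrasting a lock-free CAS retry loop, which can in principle retry forever, with a single fetch-and-add, which completes in a bounded number of steps. Both functions do the same thing to the counter; only the progress guarantee differs.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Lock-free flavor: a compare-and-swap retry loop. Some thread's CAS always
// succeeds, but any particular thread can in principle lose the race every
// time (starvation), so this is lock-free but not wait-free.
fn increment_lock_free(counter: &AtomicU64) {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(
            current,
            current + 1,
            Ordering::Relaxed,
            Ordering::Relaxed,
        ) {
            Ok(_) => return,                     // our CAS won
            Err(observed) => current = observed, // lost the race; try again
        }
    }
}

// Wait-free flavor: a single hardware fetch-and-add completes in a bounded
// number of steps. (Caveat: on LL/SC architectures fetch_add may itself
// compile to a retry loop, so the wait-freedom here is hardware-dependent.)
fn increment_wait_free(counter: &AtomicU64) {
    counter.fetch_add(1, Ordering::Relaxed);
}
```

For richer operations than a counter, there usually is no single hardware instruction to lean on, which is exactly why generic wait-free constructions are so hard and why this paper is interesting.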
And what this paper manages to do is take a lock-free data structure, or a lock-free algorithm, call it; they're sort of synonymous terms in this context, and run that lock-free algorithm using machinery that is itself wait-free, in such a way that the overall combined thing is guaranteed to also be wait-free. That's sort of the magic of this paper. And this ties into the word at the beginning, which is practical. There was actually a paper many years ago, I think it's linked down here somewhere, yeah, by Herlihy, and that paper is sort of a classic in the literature. The basic idea there is that there's such a thing as a universal simulation. The paper has a lot more details, but basically there's been a proof that you can take any lock-free algorithm and run it in a wait-free way. What that means is that any of the operations of the lock-free data structure complete in a finite number of steps for any given thread, which is all that's required for wait-freedom: that every thread makes progress. The problem with this universal simulation is that while it is a finite number of steps, it's a really large finite number, right? Like, if it took years for any operation to finish, but every operation does finish, even if the thread that started it never gets to run, then the algorithm is wait-free, but that doesn't really help you. You don't actually want to use this for anything, because the overhead of the simulation is way too high. And that's why this paper has the word practical in it, because what they came up with was a way to simulate certain lock-free data structures in a way that's wait-free and also practical.
That is, the finite number of steps that's the upper bound for how long any given operation will take to complete for any given thread is not just finite, but also reasonably low. In fact, let's scroll down to the performance evaluation part of this paper. If you look at this, for example, let me zoom in a little. This should probably say top and bottom. Oh, left and right, yeah, that's fine. So if you look at this, the red line is the wait-free version, the blue line is the lock-free version that we're simulating, the y-axis here is in seconds, and the x-axis, I think, is the number of cores, if I remember correctly. And what you see is that the wait-free version is actually slower: it takes longer for any given number of cores. That is, again, because it has to do more things, but you see that the difference is still fairly small. So the observation here is that the wait-free implementation is slower, but it provides stronger guarantees about how long operations will take. One thing that would be interesting, and I'm actually a little surprised the paper doesn't include this, is a histogram of the time per operation. What I would expect to see is that for the lock-free algorithm, some operations take a much longer time because they get stuck behind other threads that are making progress, so you end up with: most things are fairly fast, but some things are really slow. And in the wait-free case, you'd see that whole curve move a little: all operations are slower, but you don't have a tail, or the tail is bounded, which is what wait-freedom guarantees. Stream confirmed to be wait-free because we're trying to move forward. That's funny. Is it just about risk, or will wait-free actually catch up for some very high core count? This one's hard to answer, because I don't think we really know.
I don't think we expect wait-free to outperform lock-free for any given core count in average performance. And part of the reason for that is that the operating system actually gives fairly strong guarantees around things like fairness: it's not gonna leave a thread not running for a long time; it tries really hard to avoid that. But when it does happen, that's when you see really poor performance. In practice, I think where this matters more is that real programs are a little different from benchmarks, in that a benchmark usually spins up the same amount of work per thread, and every thread just spins doing an operation, so all of the threads do the same amount of work. In a more realistic situation, you might have one thread that's spinning and doing a lot of operations, and one that is time-sensitive, like maybe a user clicked a button and the other stuff is all automated. And the time it takes for that one user button click to happen actually matters a lot more than the average time it takes for the automated operations to go through. That's where wait-freedom matters more, because if you look at the average, it doesn't make a difference, but the actual customer experience, if you will, depends on the tail latency, not the mean, and wait-free algorithms tend to do better in the tail. There's an interesting paper, I think it was by Google, called The Tail at Scale, which looks a little bit at this effect as well. The same thing comes up in things like real-time operating systems, where you have some operations that are important, they're high priority, and they need to finish in a certain amount of time. Like, imagine you're flying a rocket and you need to make sure that you control the pressure in the fuel tank, and that can't be held up for a second. That's not okay.
Even if it happens once, the whole rocket might blow up. This is a big problem. And that's the kind of case where this theoretical concern, that some things are just gonna be slow and it's fine, becomes a very real concern, because you actually have hard deadlines you need to hit. A less extreme example is something like gaming, where your time budget is until the next frame draw, and it's not okay if you have an operation that happens to take longer because of contention, right? But of course, the flip side of this is that even a wait-free algorithm just says that there's a finite number of steps to make progress for any given thread; you might not necessarily know what that finite number is, what the upper bound is. If the upper bound is really large, it might still be too slow. The hope, of course, with this paper is that the actual upper bound they achieve is low enough that it meaningfully reduces the tail. So one of the things that I think we'll do, probably a couple of streams down the road, is benchmark this and look at the histograms of latencies for operations, look at the distribution, and see: does it meaningfully reduce the tail? Or is the upper bound set by the wait-free algorithm so high that it doesn't meaningfully impact the tail? Phrased differently: if the slowest the lock-free version ever is is still below the upper bound set by the wait-free implementation, then the wait-free version doesn't buy you anything. Are wait-free algorithms used for things like machinery that can't risk hitting a worst-case scenario or latency spikes in operations? Yeah, so that would be an example. Another example you could tie in here is something like garbage collection, where a garbage collection pause doesn't matter for things like means, but it does matter for tail latency. In some sense it's similar, right?
Like a garbage collection pause is sort of similar to encountering contention during concurrency. They're a little bit different in that garbage collection tends to sort of stop the world — it doesn't have to, there are concurrent garbage collectors — but they're a little different in how they actually impact the performance of your program. And in general, you have more control over the garbage collector than you do over the concurrency, right? Like you can just allocate less memory, that's like the way to fix something that ends up incurring a lot of garbage collection pauses, or you sort of invoke the garbage collector at times when you know that you have some budget to spare, whereas in the lock-free case, you have very little control. Like if you have lots of threads that need to do stuff, the contention is sort of determined by factors that are timing-based that you can't necessarily control in a good way. The tree use case seems to hit on some kind of worst case scenario. So the tree use case is actually really interesting, because what you see there is the wait-free version is much slower than the lock-free version. It's not that it's better, it's worse. This isn't performance for a given number of cores, it's seconds, so higher is worse. So the tree example is one where the lock-free version does much better than the wait-free version, or phrased differently, the wait-free version introduces a bunch of overhead. And I think they talk about that. Let's see, where is the discussion of figure nine? This is most vividly displayed in the tree data structure on the AMD. About one in 64 operations asks for help and completes in the slow path. This means that roughly during half of the execution time, there's an operation running in the slow path. We'll talk a little bit about this notion of a slow path, but the basic idea is that in the tree case, you much more often end up with threads having to help each other. 
On the flip side, like, yeah, and you see, as a result, all threads help this operation, sacrificing scalability for this time. And so you get worse performance, but again, my guess is that you would see this actually reflected in the tail latency. Then for the lock-free one, you end up with some operations just taking a lot longer, but it sort of gets amortized, if you will, in the measurement, because the thread that wins still gets to measure a low time. Doesn't wait freedom still imply that all threads will get a CPU slice eventually? No, so that's one of the reasons why wait freedom is so cool: even if a thread doesn't get to execute any steps, other threads are obliged to do work on its behalf. And that's how you guarantee that in a finite number of steps of the machine, every thread makes progress. Great, all right, I think we now understand the title of the paper, which I think is an achievement in and of itself. So let's then start to talk about what we're actually gonna implement. So let me find where they start talking about this here. So recently we've seen some progress with respect to practical wait-free data structures. This is what we talked about, that many wait-free data structures weren't even known until fairly recently before this paper was published. And in general, there have been some practical designs of wait-free queues that rely on this compare and swap operation that is very common in atomic and concurrent algorithms. There have been independent constructions of wait-free stacks and queues, linked lists, and a red-black tree, which is sort of the most recent wait-free implementation, at least prior to this paper. And one of the techniques that was employed in much of this work is this idea of the fast path, slow path method. The basic idea is to have a fast path, which is like, if I'm not contending with anyone, if I can make progress without having to retry, then I should just do that. 
And because that should be the common case, we're gonna make that case as efficient as possible. And then what you do is you have a slow path, where if you detect that there's contention, basically if you're like, I have to retry, then you take a much slower execution path where you carefully coordinate with the other threads so that all of you end up making progress. Think of it as like, you, I don't know, what's a good example of this? Okay, imagine you're a bunch of people standing with basketballs in front of a basketball net. And your goal is to get the most balls through per second. If every now and again someone throws a ball — there are few balls going around, basically — then the moment you get a ball, you throw the ball. And chances are no one else is throwing a ball at the same time, because there aren't that many balls going around. So this is an instance where you can take the fast path, which is: get the ball, throw the ball. It's a very simple instruction. If there are, say, as many balls as there are people, chances are that if everyone follows that instruction, everyone throws the ball roughly at the same time, or at least in a very short time period. And all the balls just bounce off each other and you don't actually get any balls through. And then all the balls roll back, and everyone goes from fetching the ball to throwing it again, and you get the same problem once again. And no one actually makes progress. And so the way that you can try to get around this is, if you detect that there's contention, for example if you see that your ball bounces off of someone else's, your instruction now changes. It moves from the fast path, get the ball, throw the ball, to the slow path, which is a more carefully considered approach to throwing a ball. You could imagine the slow path being something like: look around. 
If anyone else is currently holding a ball, then shout out to them and say, I throw on one, you throw on two, and then count one, two, right? So this would be an example of the slow path being definitely slower, but it ensures that you get more balls through the hoop in any given period of time, because you're coordinating with the others. And to take this to sort of a wait-free approach, imagine that someone drops dead so they can't throw the ball they were holding. Then, I mean, this is where this gets a little edgy, but if someone drops dead with a ball, you're actually gonna walk over, grab their ball, and throw it for them, to ensure that everyone keeps making progress. Let's say they fall unconscious for a second. So rather than that ball just lying on the floor, you're actually gonna help them throw it. And that way they make progress even though they're unconscious. This would be sort of an example of the slow path being a thing that ensures that everyone is making progress, that is, the balls continue to go through, or rather, every ball is going through the hoop at some rate, is the other way to think about this. Like imagine all the balls are different colors. You want to make sure that not only is some ball going through the hoop frequently, but every color ball is going through the hoop over time. All right, I don't know if that analogy made sense, but that's the kind of thing we're going for, where the fast path is: don't think, just do. And then if you notice things being weird, you go to the slow path, where everyone tries to work together. Yeah, it's a brutal game. Death is no excuse. The game must go on. What if everyone is dead? Well, then I don't think it matters whether you make progress. Yeah, don't play basketball with John. Me and my 31 friends frequently throw multicolored balls. That sounds about right. But so that's the kind of thing that we're going for. 
Or rather, that's what most of the actual wait-free algorithms that people had implemented before this paper were implemented in this way where there was a fast path, which was basically sort of the lock-free version, right? Of just do this thing, and if it doesn't work, retry. And then the wait-free version becomes, if it doesn't work, then coordinate with everyone else and then make progress. And what the authors of this paper realized was that there's sort of a pattern here where as long as the algorithm has a particular structure, which is what they call the normalized form in this paper, as long as it has this particular structure, we can actually say, you just give us sort of the definition of the fast path, a definition of the slow path, and a definition of how to detect when you need to move from one to the other. And then we can implement, we can sort of take that description of a lock-free version and simulate it efficiently using a wait-free implementation. Think of this as like, you can sort of define a trait for a lock-free, a simulatable lock-free algorithm. And then you can write an implementation that is wait-free that uses the implementation of those methods to run the lock-free algorithm in a wait-free manner. And that's basically what we'll be encoding today. It's a long path to get there, but. And as you can see, the authors sort of get at this by saying, the process of designing a fast wait-free data structure for an abstract data type is complex, difficult, and error-prone. And as they say, one approach to designing them is to start with a lock-free data structure, work possibly hard, which I think is like understatement of the paper to construct a correct wait-free data structure by adding this helping mechanism that we've talked about, right? 
So you add a helping mechanism, and then you work again to design a correct and efficient fast path, slow path combination of the lock-free and wait-free version of the original algorithm where lock-free here is sort of code for the fast path and wait-free is sort of code for the slow path. And again, understatement, designing a slow path, fast path data structure is non-trivial. And yeah, it's really hard. And so in this work, what they're looking at is basically, can you mechanize this process of saying, as long as you give us a description of the algorithm in some sufficiently normalized way, we can sort of standardize the reasoning for how to run it wait-free. And as they say, we present an automatic transformation that takes a linearizable lock-free data structure in a normalized representation and produces a practical wait-free data structure from it. And the resulting data structure is almost as efficient as the original lock-free one. And this paragraph is really like the thing that made me excited about this paper because I haven't had a need for wait-free data structures. I think lock-free data structures are cool. I was like, this is a really cool transformation if they can actually deliver on the promise from this paragraph, like this is really neat. And when I started reading this and when I went through the sort of normalized data structure, I realized that this automatic transformation that they talked about and the sort of normalized representation is something that I think we can represent in the Rust type system in such a way that you can have a wait-free simulator type that is generic over a normalized lock-free algorithm type or trait rather, and then just have that work. And that would be really cool if we can get to that. And that's what we'll try over the course of the stream to get to. 
And they sort of make this argument that their normalized representation is one in which you can actually represent a lot of different lock-free algorithms and data structures. And therefore, this is not just like, we did this for these three data structures and it worked fine. It's actually a more general thing, which makes it all the more attractive to try to encode it in Rust. So that in theory, you could sort of take other lock-free data structures you can think of, use this basically as a library, and say, now give me the wait-free version. And if we can get there, that's really neat. In the context of this paper, they transform a linked list, a skip list, and a binary search tree from lock-free to wait-free using this sort of algorithm, this mechanized transformation. I think, just because there's a lot for us to cover, we won't actually implement any of these three today. Instead, what we'll do is implement all of the mechanization, all the automation, today, and sort of the modeling of the problem and of this normalized representation and stuff. And then next time we'll deal, probably, with memory reclamation, which will be the sort of hazard pointer stream. And then the stream after that, we'll actually try to implement the linked list, the skip list, maybe we even get to the binary search tree, on top of the mechanization that we built today. Wait, I have a cat meowing at the door, which sounds like good timing. She's very concerned about wait-free data structures. Hi, do you want to come in? You can come in. Hi, do you want to come say hi? Yeah, say hi. Where are you going? Where are you going? You're going. What? You're gonna go eat, Jay? Well, we have a cat on stream now. I'm sorry, did you not want to be up there? All right, she's leaving. 
So this is not a way to implement any lock-free algorithm in a wait-free way, but a way to design a lock-free algorithm in a way that makes it possible to run it wait-free at some performance cost. Sort of. So they're not claiming that this is a way to take any lock-free data structure and make it wait-free. They're claiming that if you can structure your lock-free algorithm in this way, that is, if you can encode it in their normalized representation, or rather, if you can represent your algorithm in their normalized representation, then they can make it wait-free at minimal cost. Yeah, derive wait-free is sort of the coolest thing we could get to. Can't this be described as: they invented a way to create wait-free algorithms that can be made just lock-free if you don't run the simulation? Well, I mean, if you have a wait-free algorithm, it's already lock-free. Like, there's no simulation that's needed there. The hard part is going the other way. Chai is not wait-free. Chai is lock-based. She definitely blocks progress sometimes. All right, so, hi, what's up? I'm sorry, are you suffering? Wow, she just collapsed on the floor asking for scratches. What? It's okay. There you go, there you go. You can refill on scratches. She's like grumbling. She's a very vocal cat, grumbling about not getting enough pets. Right, so let's take a look through the transformation overview, which is sort of where we'll start getting into how we actually encode this in Rust. So the move from the lock-free implementation to the wait-free one is executed by simulating the lock-free algorithm in a wait-free manner. Simulation starts by simply running the original lock-free operation with minor modifications that we'll discuss. A normalized lock-free implementation has some mechanism for detecting failure to make progress due to contention, right? So this would be the example of all the basketballs colliding. Clearly we're not making progress. 
So for any given lock-free data structure there has to be some mechanism to detect that there is contention. Basically something to inform us that we need to move from the fast path to the slow path. When an operation fails to make progress it asks for help from the rest of the threads. A thread asks for help by enqueuing a succinct description of its current computation state on a wait-free queue. So we can already see that the transformation, the mechanization, the automation is going to require a wait-free queue sort of at the bottom. And so this is something we're going to have to implement. It's going to be, it's not that we're gonna simulate a wait-free queue, it's that we're actually gonna implement a wait-free queue that is then gonna be used in the mechanization. And so this might even be what we get to sort of second in today's stream. The first one being like sort of defining the trait because I think it's useful to have that mental model of what we're trying to get at, but then we're gonna have to implement this wait-free queue, because it's sort of the building block for the whole system. So yeah, so when some thread detects that it's not making progress, it has to sort of put a description of what it failed to make progress on into this wait-free queue. And the fact that it's a wait-free queue means that the enqueuing operation is wait-free, which is sort of helping it to make the overall thing wait-free. And then one modification to the fast lock-free execution is that, so this is to the fast path, is that each thread checks once in a while, which is sort of undefined. I'm guessing we're gonna run into some weirdness here of what exactly does once in a while mean. And you can imagine this once in a while sort of is the sort of scaling factor for how much, like how finite is the finite sequence, right? So we're saying every wait-free operation has to terminate in some finite number of steps. 
Well, this once in a while is gonna dictate how many steps that actually is, because if a thread won't make progress until someone helps it, then how long it takes until some thread tries to help dictates the overall time. So in the fast path, we're gonna have each thread check once in a while whether a help request is enqueued on the help queue. If it notices an enqueued request for help, then it moves over from sort of executing the fast path to helping that single operation on the top of the queue. And that includes reading the computation state of the operation to be helped, and then continuing the computation from that point forward. So this is the idea of we're trying to help this other thread that is currently stuck make progress. So eventually, like all, if one thread is sort of stuck and all the other threads sort of notice that it gets stuck, they're gonna start trying to help it. And eventually maybe even all the threads start helping. And at that point, there should be no contention and therefore progress will be made. But even if just one thread helps, that might be enough to make progress. But ultimately, what these other threads are gonna do is sort of look at what did the thread that enqueued this request for help? What was it trying to do in the first place? And then go, oh, let me go help. And then pick up the computation that was enqueued and it will also try to do that same computation. And so ultimately what you'll end up with is if all the threads end up helping, all of the threads are trying to execute the same operation and so someone must succeed because they're all trying to do the same thing. And so yeah, continue the computation at that point until the operation completes and its result is reported. The major challenges are in obtaining a succinct description of the computation state. 
So this is basically: if a thread can't make progress, how does it tell everyone else what it needs help doing, without having to write gigabytes of memory to somewhere that's shared? You want a concise thing saying, I got stuck on this, please help. And also, how do you achieve the proper synchronization between the potentially multiple concurrent helping threads, so that they don't just all contend with each other and therefore no progress is made there either? Like again, with the basketball metaphor, if all of the people who are not unconscious walk over and all try to grab the ball from the unconscious person at the same time, no one is gonna get the ball and therefore no progress is gonna be made. So you need to ensure that the helping mechanism itself is wait-free. And this is in all likelihood where the wait-free queue is gonna come into play. And then also in the synchronization between helping threads and threads executing other operations on the fast lock-free path. So this is sort of the jumping back and forth between I'm gonna execute my operations in the fast path sort of sequence, and I'm gonna go help other threads. And there has to be a balance between the two, and sort of a way to determine whether to do one or the other. And so, as they say, the normalized representation is enforced in order to allow this succinct representation, to ensure that we can detect when a thread isn't making progress, and to minimize the synchronization between the two. Hello? Do you wanna smell the microphone? She's walking across my keyboard. You can see the tail right here. No, don't drink my tea please. Okay. Oh, you're scared of the Roomba. Why did the Roomba start? There's an enemy downstairs. It's the Roomba, it's driving around in her space. Let me stop that. She's funny. Pause. There we go. Could a thread that needs help just pin up a closure of what it wanted to do with the lock? So there isn't really necessarily a lock here. 
It's more that, and you don't really want it to be a closure, because we need some amount of control in order to guarantee wait freedom. It's not like it can be any operation that someone else can help with. If it was a closure, you can imagine that you encode all sorts of weird stuff in that closure. Instead, you want something where the other threads can help knowing that they themselves won't also be blocked by executing that code. So we probably want something a little bit more principled than just a closure. What does a thread do after it detected that it can't make progress, just spin? No, so if a thread detects that there's contention and therefore its operation failed, it's going to retry. And the sort of key insight here is that as long as other threads start helping, eventually the retry will succeed. In a lock-free system where there is no notion of this kind of helping, a thread might retry forever. And that's bad. That means it's not making progress, or at least retrying for a very long time. If you know that help is gonna come in some finite amount of time, you know that you'll also make progress in some finite amount of time. But isn't the original waiting thread completely dead at this point? Will the original waiting thread be killed after some other thread succeeds with this helping operation? No, there's no thread killing involved here. Like when I say that an operation fails — the most trivial example of an operation failing, and of you detecting contention that way, is imagine that you're trying to append to a linked list, right? The atomic operation that you do is you try to do a compare and swap of the head pointer of the list to a new node that points to the immediate successor, like the first node in the list, right? And that is a compare and swap. That compare and swap will fail if the head has changed since you last read it, right? And so your compare and swap failed. 
Now, the way that you sort of recover from that, the way that you retry, is you read the head again, you construct a new successor node that's gonna be injected, and then you retry the compare and swap. So there's no death of a thread because it failed. It's just the operation that failed, because of contention. And so it's going to eventually succeed as long as you sort of redo the stuff that got you to that operation in the first place. How can a non-progressing thread ask for help? Ah, so the point here is that the thread sort of enqueues its need for help when it detects that it has to retry. So when we say that every thread has to make progress, that doesn't mean that if a thread actually died, like it panicked or something, there's anything more for it to do; it doesn't have to make progress. Basically, think of it as: all live threads have to make progress. Yeah, this will probably make more sense when you start seeing the Rust code. This is currently somewhat abstract. Why are you eating my glasses case? Weirdo, she's a weirdo. All right, so I know we're reading through the text here. We're not gonna do that for the rest of the stream, but it is useful, particularly for the transformation overview, because that's basically what we're gonna be encoding. And so it's worth looking at this in some detail. So the helping threads — this is: if you detect this contention, you go to the slow path, you try to help some other thread. The helping threads synchronize during the execution of an operation at critical points, which occur just before and just after a modification of the data structure. So assume that modifications of the shared data structure occur using a compare and swap primitive. So this is very common, right? So the linked list example I gave, where you do a compare and swap of the head of the list. 
This is very common for lock-free data structures: basically, the way that you ultimately sort of commit an operation, like actually change the data structure, is you do a bunch of work, and ultimately there's a compare and swap that commits the change. So you sort of stage the change and then you commit the change, and that commit is usually atomic in the sense that not just that it's a compare and swap, but that it's a single operation most of the time. And the idea, right, is that if it wasn't a single operation, then you would need some larger mechanism for arranging between multiple threads, where, imagine you're doing two compare and swaps to commit, or you're doing two operations — then what if a thread sees one of them but not the other? What does that mean? And so generally lock-free data structures are gonna have this structure where they do one compare and swap, and then that is the commit point. So, a helping thread runs the operation and attempts to help independently until reaching a CAS instruction that modifies the shared structure. So the idea here is there's a bunch of work that you do up until the commit point. This is like staging the commit. This might be, again in the context of a linked list, constructing the node that you're about to inject, right? And that work can happen concurrently on multiple threads, because it's really: read the head pointer, construct the thing, and now you're ready to do the actual commit point, the actual compare and swap. So the helping threads all execute that pre-compare-and-swap work, the stuff that is entirely independent and sort of isolated and doesn't need to be synchronized. And then when a thread gets to, I wanna execute this compare and swap, it coordinates with all helping threads on which CAS should be executed. 
So before executing the actual compare and swap instruction, the helping threads jointly agree on what the CAS parameters should be. So remember, a compare and swap is like: at this address, change from this value to this value. And the operation fails if the current value no longer matches the expected, old value that you pass in. And so if all the threads end up agreeing on what operation to do, then as long as they're all doing the same compare and swap, there can't really be a failure, because no one else is changing the value that's in place, and therefore the CAS should succeed. Again, and I think this will be clearer once we start looking at the code, but trying to convey the theory is pretty important. And the simulation ensures that the CAS is executed exactly once, so that we don't get weird properties where, because of helping, you inject the node twice into the linked list or something. And then each thread continues independently until reaching the next CAS operation. So the idea here is, if there was a lock-free data structure that required multiple compare and swaps to actually commit, what you do is all the threads do the stuff that they can do independently before the CAS, then they agree on which CAS to execute, and then they all do the steps until the next CAS, which they can do independently because there's no atomic operation in there, and then they all agree on what the next CAS should be. So there's sort of this: execute in parallel, agree, execute in parallel, agree. And ultimately that guarantees that you make progress, because they all agree on the CAS, and so they will succeed with the CAS. And then upon completing the operation — so this is after all the CASes have completed — the operation's result is written back into the computation state, it's removed from the helping queue, and the owner thread, the one that put the help request in there, can then return from whatever operation it inserted, right? 
So to sort of give the full overview here, and maybe I should actually try to draw this. Let me see what drawing this might look like. So let's do like light blue is probably good. So what we have is what's a good way to draw this. We're gonna have some thread over here. This is the owner thread in like turquoise. So the owner thread is let's say trying to do an insert into a linked list, right? And as part of that, it does like a, it sort of constructs the new node. Yes, new, right? It constructs a new node and then it does a CAS to try to inject it into like the, into the head of the linked list, right? So that's how it's gonna try to make progress. If that fails because say some other thread, thread two, also updated the head pointer at the same time, if there was like a conflict there, then this CAS is gonna fail because of that conflict. And at this point, what the original thread is gonna do is it's gonna say, it's gonna sort of encode the information that is needed to do this operation. It's gonna stick that into this shared queue of help, right? Help. So it's gonna sort of stick its insert in there. It won't actually be the insert, but we'll look at the sort of how to make a succinct representation to stick in that queue, but you can sort of think of it as the insert. And then at some point, the other threads in the system, so let's say there are three other threads, right? Because we said that on every fast path, every thread is gonna check the help list every now and again, so this thread is gonna sort of see this. And eventually this thread is gonna see this, right? And whenever they see it, they're gonna also try to do the sort of insert, which is gonna be a new node. And they're gonna run it in parallel, right? When they see it, they're gonna do new node. And in fact, the original thread, this thread, after it enqueues, so it does, I guess, one, two, three is when it observes the failure, four is when it enqueues the help, right? 
And then five, it's going to retry the operation as well. Like, it's gonna try to help itself, is one way to think about it. So notice that at this point in time, this thread and this thread and this thread are all executing in parallel. They're all constructing a new node for this one same insert. So they're all sort of doing the same work. And then all of them are gonna try to do this compare and swap in order to insert it into the head. And because there are no other threads in the system, the operation has to succeed, because a CAS only fails if the expected value is different, but the expected value would be the same, because no other thread is updating it. So we're gonna have some mechanism for them to sort of coordinate on the CAS, and therefore this CAS is gonna succeed. And then regardless of which thread succeeded with the CAS, the owning thread is the one that's sort of ultimately going to now return from the insert, right? And these other threads are gonna continue doing what else they were doing at this point, because now they've done their one help per fast path, or whatever the actual ratio ends up being, and now they get to execute their own instructions, but they did help the blue thread here make progress, which is what is required. So yeah, you can sort of think of the slow path as consensus in a sense, but it's among threads, not over a network, so you have some slightly stronger guarantees. This sounds like it would degrade to single threaded execution all the time. So the hope is that it doesn't, right? So this is why you have this fast path, slow path mechanism, where if there isn't any contention, then threads can just proceed in parallel. But okay, take the linked list example, right? If you have multiple threads that are all trying to append to the list, the best you can sort of do is like one enqueue per cycle. 
Like you can't have multiple enqueues per cycle because they're all updating the head pointer. This is like a limitation of linked lists, right? That there is only one head pointer. And so if you can achieve one insert per cycle, depending on however you define how long a cycle should be, then that's great. Even though that is, you're right, single threaded execution, the alternative is that no thread makes progress. That would be way worse. In the linked list example, it's hard for that to actually happen; a linked list is actually fairly easy to make, not wait-free, but lock-free. But you sort of want it to degrade to single threaded performance if the alternative is that no one would make progress, because what you can do with the single threaded execution here is make sure that every thread makes progress. Because as long as you do the help list in order, then if I try to do some operation and stick it in the help queue, I know that it will get executed in some finite amount of time. Do people who come up with these algorithms emulate a pretend machine that simulates all sorts of horrible scenarios, or do they typically test on real hardware? It's a bit of a combination. Some of the work recently I think has been looking at sort of model checking and formal verification and stuff. In this case, I think that there's a formalized argument a little bit later in the paper that actually goes through, like, here's the actual formal reason for why we can guarantee that everything terminates in a finite number of steps. In practice, you also do benchmarks. But as you saw from the paper, wait-free is a little weird in that your performance results will, at least naively, not generally reflect the benefits of wait freedom. Which might make you think maybe there are no benefits to wait freedom, and there are people who argue that it's just never worthwhile.
In fact, if you read the paper on obstruction freedom, that's basically what they're saying: that you should just write things that are obstruction free because it's good enough in practice. And this is where you sort of need to develop more advanced benchmarks to figure out whether it's actually worthwhile. I'm hoping we can look at that a little bit later too. How is this better than just a normal CAS? I thought the point of atomic operations was that the OS would make sure the operation was synced on all threads. So then what's the point of this transformation? The point of the transformation here is that if you didn't have all this machinery, then what does a thread do if its CAS fails? It retries, right? But you could imagine that it has to retry forever, because every time it does a CAS, some other thread wins the race with it. Like some other thread is doing an enqueue and it happened just like a microsecond sooner. And so every CAS you do fails, no matter how many times you retry. What this algorithm does is it ensures that your operation will eventually succeed when you retry, and not just eventually, but with an even stronger requirement: that in some finite number of steps, some upper bound, your operation will succeed. I think in practice, the way the finite bound is recovered is that the number of steps is like linear in the number of cores and entries in the data structure. I forget exactly what the paper says, but basically the number of steps is finite as in it's some known limited quantity. That doesn't necessarily mean that it's like three seconds. Like it's not quite that finite. It's not that all threads go to the slow path if there's any contention. That's not quite right either. It's that if a thread gets stuck, then it adds the operation it got stuck on to the help queue, and then it tries to help itself. It might succeed before any thread even tries to help it.
All we're requiring is that threads that are going through the fast path occasionally check the help list and then try to help the top thing in the help list. And it might be that the moment any thread tries to help, progress is made, and so the next time some thread is executing a fast path, there's nothing on the help list anymore because it succeeded. And so it's not necessarily all threads. It's not like if anyone has contention, they start like a blocking consensus algorithm. It's more the sort of dynamic detection of: if someone needs help, I'll help; otherwise I'll do my own thing. Aren't you effectively wasting a bunch of time because surely only one thread's CAS will actually be committed? So what you're trying to do with this algorithm is to ensure that it's not just some CAS that succeeds, it's sort of the CAS that's been waiting for the longest time. That's not quite accurate either, but basically you don't want the scheduler to determine who wins the race, because that can easily lead to something like starvation, which is where you get the tail latency from, and that wouldn't give the guarantees of wait freedom. If there's other stuff in the queue at five, does the blue thread only help itself? We'll look at that. The paper talks a little bit about this too. In general, threads that are executing on the fast path only help one operation and then proceed with the fast path. Threads that are themselves stuck will help the entire queue in order until their own operation has completed. Is the help queue a solved problem? How does it not also cause contention? So this is where, I think this was in the text too, the help queue is a wait-free queue. So enqueuing and dequeuing from the help queue is itself wait-free. So this is where we're sort of building this tower of things that are wait-free all the way from the bottom to ensure that the overall thing is also wait-free.
And this is why I said, we're probably gonna have to implement the wait-free queue fairly early on, because it is such a fundamental building block. Who clears the insert from the help queue? They actually all try to clear the insert from the help queue once they realize that it's finished. We'll look at that in a second too. Yeah, so we will implement the wait-free queue in this stream. If you wanna look at it, it is appendix A of the paper that I just had open and sent the link to in chat. So, for wait-free, instead of exponentially backing off CAS attempts in case of failure, we pass our work to other threads so that we get the work done by another thread? That's almost right. So in the case of this wait-free simulation, what we're doing is: if a CAS fails, we ask for help from other threads, and then we keep retrying. And because threads will eventually help, there's a guarantee that even if it keeps failing, at some point every single thread must be trying to help. And therefore the operation must succeed, because there is no contention anymore. In general, the operation should succeed far before every thread comes to help. But there's a guarantee that there's a finite upper bound. So what if the first thread dies before it puts insert in the help queue? How does that work? If a thread dies, then it's gone. There's no requirement for you to make progress on it anymore. What you need to guarantee is that every thread that has more operations to run gets to run those operations. You're not required to resurrect threads; that's not a thing. You can think of it as: if a thread crashes, it has no more operations, because its execution sequence has ended. Why can't the other threads just do nothing until the original thread has helped itself? I mean, they could do that. You could imagine that helping is just: do nothing. It requires the same amount of synchronization, so I don't know that it really benefits you.
And also it's not clear how you do nothing. Is that a spin loop? Because if you do a spin loop, then you might be taking up the core that the other thread could execute on to make progress. Imagine you have one core, but a thousand threads, right? The owner thread, the one that needs help, enqueues the help operation. And then all these other threads start getting scheduled on that core. And they all see that some thread is trying to make progress, so: I'm gonna wait, I'm gonna just do nothing. It can go to sleep, but then you need to specify how long it sleeps for. And now you might have a bunch of threads going to sleep when the operation actually finished like a second ago. So that's not good. So the alternative is that you spin waiting for it to finish. But if you spin, you're taking up the current core, and there's only one core, so the thread that could actually make progress never gets to run. And you'd need to wait, in a fair scheduler, for the other 999 threads to finish spinning before you can actually make progress. Whereas in the help case, the first thread that comes along on that core sees the help operation, executes it, succeeds, and can now go on with its own thing. So waiting wouldn't actually benefit you. It would hurt you. This applies even if there are multiple cores. When do helper threads actually try to help? Do they have to help every time after the first fast path operation, or is it heuristic based, like help another thread for every N fast path operations? This is where the paper says once in a while; they have some more details later in the paper that we'll get to once we actually do the implementation. Great. So I think now we have a sense for what this helping business is doing and sort of what the algorithm is getting at.
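To recap that flow in code form before moving on, here's a single-threaded toy of steps three through five. All the names here (HelpRecord, enqueue_help, try_help) are invented for illustration, and a VecDeque stands in for what is really a wait-free concurrent queue shared between threads:

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};

// A succinct description of the stuck operation: enough for any thread
// to attempt the commit CAS on the owner's behalf.
struct HelpRecord<'a> {
    target: &'a AtomicUsize,
    expected: usize,
    new: usize,
}

// Steps three and four: the owner's CAS failed, so instead of spinning it
// describes the operation and sticks it on the shared help queue.
fn enqueue_help<'a>(queue: &mut VecDeque<HelpRecord<'a>>, rec: HelpRecord<'a>) {
    queue.push_back(rec);
}

// A thread on its fast path occasionally checks the queue and tries one
// help. All helpers attempt the same CAS, so exactly one succeeds and the
// rest fail harmlessly; once the new value is visible, the record is done.
fn try_help(queue: &mut VecDeque<HelpRecord<'_>>) {
    let done = match queue.front() {
        Some(rec) => {
            rec.target
                .compare_exchange(rec.expected, rec.new, Ordering::SeqCst, Ordering::SeqCst)
                .is_ok()
                || rec.target.load(Ordering::SeqCst) == rec.new
        }
        None => false,
    };
    if done {
        // Everyone who observes completion tries to clear the record.
        queue.pop_front();
    }
}
```

Again, this elides all the hard parts (the queue itself, the consensus around who won the CAS), but it's the shape of the fast path checking the help list.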
So there's this notion of the lock free algorithm has some notion of things you do before the CAS, the CAS, which is the actual commit operation that you might contend on, and then maybe stuff you do after the CAS. But the CAS is sort of the point of contention. And that's basically what their normalized representation is. We'll look at this in a second. There's the stuff before the CAS, the CAS, the stuff after the CAS. So yeah. So this is where upon completing the operation, so that is when all the CASs in a given help have executed, the result is written in the computation state. It's removed from the helping queue. And then the owner thread, the one that initiated the operation can finally return whenever it gets to execute next. So its operation got to complete, even though it didn't actually run. And as they say, there are naturally many missing details in the above simplistic description, but for now we'll mention two major problems and these are kind of relevant for the implementation we'll do. The first is that the synchronization that we need for the helping threads before each CAS and after each CAS so that they all learn whether one of them succeeded on the CAS is like fairly complicated. There's a lot of, it's basically consensus among the threads that decided to help around that CAS. And consensus is hard. And there's like ABA problems and versioning. We'll get into all of this when we get to the actual implementation. And the second problem is, as we mentioned earlier, like how do you represent, like in the drawing, right? I have the insert operation in the help queue, but in reality, you don't want the help queue to actually contain like all of the parameters to the function that originally came in. Like that's probably overkill. And the observation they make is that for a lock free algorithm, there is a relatively lightweight representation of its computational state. Let's see, so what are they getting at here? 
This is because by definition, if at any point during the run, a thread stops responding, the remaining threads must be able to continue to run as usual. This implies that if a thread modifies the data structure, leaving it in an intermediate state during the computation, then other threads must be able to restore it to a normal state. So the idea here is that we're basing this simulation on an algorithm that is lock free. And that means that the algorithm as it already exists must be able to cope with some thread executing and then dying, and that thread dying is not allowed to prevent us from making progress. It's one of the guarantees of lock freedom. Which means that there has to be some concise description in the algorithm already for how we undo what that other thread that didn't finish did, or how we help it finish. Like, we need to restore the data structure in order to keep operating. So some notion of this compact state must already exist. The information required to do so must be found on the shared data structure and not solely in the inner thread state, because otherwise the dead thread would be required to talk to the living threads to tell them what it failed to do. And that can't happen because the thread is dead. So basically what they're arguing here is that for any lock free algorithm you come up with, there must already be some succinct description of the operations of the algorithm stored in the data structure somewhere, somewhere shared. And so we can just reuse that representation for what goes in the help queue. Again, we'll see a more practical example of this once we get into the code. Yeah, so section five has this normalized representation. We'll look at that when we start writing the trait. Yeah, and this also, this is not terribly important in our case, but basically you might have lock free algorithms that have multiple CASs. In general, think of something like a linked list.
The way that removal usually works in a lock free linked list is that you don't actually free the item the moment you remove it from the linked list, because if you did, there might be other readers who are currently accessing that thing, and that's not okay. So generally there's some two-phase process where you mark the node. Actually, it's not even about freeing; it's about: if I try to remove a node and someone else tries to add a node before it at the same time, you end up with a broken linked list, and that's bad. But the observation is that during, say, an insert operation, what most of the lock free algorithms do is there's an important CAS, which is the one where you actually insert the node that the user gave you. And then there are unimportant CASs, which are: you sort of walk the list a bit and look for nodes that have been marked as to-be-removed, and then you actually remove them. But those auxiliary CASs, as they call them, are ones that any thread can do. And in general, they're also idempotent, but that's not terribly important in this context. Generally these are just CASs that anyone can do at any time; they're not something that you have to synchronize on, because if you fail, it's not the end of the world, you could always just do it again later. And therefore these can be run by the helping threads without any synchronization. And so in general, the assumption here is that any lock free algorithm has some notion of the critical compare-and-swaps, and those are the ones we have to build consensus around, but the other ones we don't. So again, for insertions, the critical one is the insert; the auxiliary ones are, like, actually doing cleanup from removals, for example. All right, so we now have... Do I wanna talk about allocators yet? No, I don't think so.
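As a rough sketch of that critical-versus-auxiliary distinction, here's the classic mark-bit trick in miniature. The Link type and method names are made up; real implementations pack the mark into the low bit of a tagged pointer, while this toy just uses word-sized values:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Low bit of the packed word marks a node as logically deleted.
const MARK: usize = 1;

// A toy "next pointer": a word whose low bit is the deletion mark.
struct Link(AtomicUsize);

impl Link {
    fn new(ptr: usize) -> Self {
        Link(AtomicUsize::new(ptr))
    }

    // The critical CAS: logically delete by setting the mark bit. Only one
    // thread can win this, so it's the kind of CAS helpers must agree on.
    fn mark(&self, ptr: usize) -> bool {
        self.0
            .compare_exchange(ptr, ptr | MARK, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }

    // An auxiliary CAS: swing the link past a marked node. Any thread may
    // try this at any time; if it fails, someone else already unlinked it,
    // which is fine, so no consensus is needed.
    fn unlink(&self, ptr: usize, next: usize) -> bool {
        self.0
            .compare_exchange(ptr | MARK, next, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }
}
```

The point is just the asymmetry: mark is a commit point, unlink is cleanup that failure doesn't hurt.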
Okay, so section three is just talking about the sort of assumptions of the computational model, like basically what do they assume about the machine and the context in which this code is running? This is important because once you start to give a formal argument for why something is or is not wait-free, you have to rely on assumptions about how the underlying machinery works: how the CPU works, how the instructions work, how the scheduler works, how memory reclamation works. That's not terribly important for our case because we're just trying to implement running code. We don't necessarily care about the formal model. One thing that is important here is that they assume automatic garbage collection is available, and it's not in Rust. This is going to be something we're gonna run into in our implementation, where there's just not a mechanism in here for how to deallocate memory safely. We have some options there. We could use something like epoch-based memory reclamation, which we talked about back when we did Flurry, the port of the Java concurrent hash map. Or there's something called hazard pointers, which is a very common implementation. And in fact, something that they also talk about, you see here, they refer to section 11.1. And in that section, they talk about, oh, you can just do it with hazard pointers and it's fine. Hazard pointers are not entirely trivial. And as far as I know, there aren't any very reliable Rust implementations. There are some, but they're either a little simplistic or haven't been maintained for a long time. And so this is why I'm thinking, for the next stream, we might actually implement those, so that we can use that to implement memory reclamation in what we built today. And then here, they talk about sort of some of the typical lock-free algorithms. So let's see, this is some definitional stuff. What they do here is they talk about existing data structures.
So this is stuff like Harris's linked list, which is a lock-free linked list. Someone asked in chat, like, can you give a brief introduction? The brief introduction is basically this paragraph and what follows, which is sort of: how does a lock-free linked list work? How's that implemented? And they talk about this notion of deleting nodes, what are auxiliary CASs? This section is worth reading just because it gives you some intuition for how the existing lock-free algorithms work, and how that ties into the model that they're trying to establish for their normalized form. I'm not gonna go through this in detail. This is something that I recommend you read on your own time if you want to get a better understanding for how these fit together. One of the things that they talk about in here, though, which is worth talking about, is this idea of counting the number of failed CASs as a way to indicate contention. So the idea here is we need to have some mechanism, as we talked about earlier on, for detecting when there is contention, which in turn informs when we move from the fast path to the slow path. And what they're saying is one good way to measure contention is: if a CAS fails, add one to the contention counter. There are more complex cases; if you look at appendix B, they talk about cases where you can have contention that doesn't manifest in the form of a failed CAS, and it's important for the wait freedom of the overall algorithm to include those. And so that's something that we're not gonna worry too much about for our implementation here, but it's something that does become relevant down the line and is very important for their formal reasoning for why this is always wait-free. We're not gonna talk too much about it. The other thing they introduce in this section is the notion of a CAS description.
So this is sort of a description of a CAS that you want someone else to help you execute. And it's just the address of the thing to do the CAS on, the expected value, and the new value; these are the arguments to a CAS. And we'll see this come up in a bunch of algorithms later on. And now finally, we get to this notion of the normalized data structure. And this is where we'll try to encode this in basically a Rust trait and sort of a structure around it. Let's pause there though for a second, because again, I've been talking a bunch sort of getting to this point. And if it's unclear how we got to this point, or if anything is unclear about what I've said so far, now's the time to sort of re-synchronize, no pun intended, before we dive into the actual code. And I know that some of you are like, get to the code, and we're about to, I promise. Let's not use blockchain for consensus. No, please no. If this is based on CAS, does this still apply on ARM? That's a good question. So, you can emulate compare-and-swap with load-linked/store-conditional on ARM. So that's one way to get at this. There definitely are sort of assumptions in here that you have access to compare-and-swap. I think it would still be wait-free if you used LL/SC, the ARM equivalent, to implement the CAS. It might be a little bit less efficient. Like you could imagine that there's a slightly optimized ARM version of this encapsulation. I think they talk about this a little earlier in the paper too, of how this translates to the load-linked construction, but it's not something we'll talk about too much here. I don't think they assume sequential consistency. I think acquire/release is sufficient for what they need, but I forget.
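Concretely, that CAS description boils down to something like this. CasDescriptor is our working name for it, and sketching it over word-sized atomics with SeqCst is an assumption; the real thing will need to be generic over what it points at:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// The three arguments to a CAS, bundled up so that any helping thread
// can attempt the operation on the owner's behalf.
pub struct CasDescriptor<'a> {
    target: &'a AtomicUsize,
    expected: usize,
    new: usize,
}

impl<'a> CasDescriptor<'a> {
    // Attempt the described CAS. On failure, report the value actually
    // found, which callers can use to decide whether someone else already
    // completed the operation.
    pub fn execute(&self) -> Result<(), usize> {
        self.target
            .compare_exchange(self.expected, self.new, Ordering::SeqCst, Ordering::SeqCst)
            .map(|_| ())
    }
}
```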
This is also, so, the actual code they implemented for this algorithm is written in Java, and Java has its own sort of special memory model. And so some of that is probably implicitly encoded in some of the algorithm. I think what we'll probably do here is start with just writing it using sequentially consistent ordering. And that should be okay, because sequentially consistent ordering shouldn't remove the guarantees of wait freedom, because really, the sequentially consistent thing does two things. It requires a bit more synchronization among the processors for deciding who gets to go next. Like they need to mediate the access a bit more, but that should still be a finite number of steps. Don't quote me on that, I could be wrong, but I don't think that matters. And the other is you lose out on some pipelining and compiler optimizations. That too shouldn't impact the algorithmic complexity, or whether the number of steps is finite, even though it might increase the absolute quantity of steps. Two hours and still no code. I know, I know, I know. But it's important. I think if we dove straight into the code, I don't think we'd get very far. I think we'd have to refer back to the paper a lot. Let's see, are there bathroom breaks? Yes, in fact, I think this is a great bathroom break right now. In fact, I'm gonna go take one right now. I'm gonna use my new pause screen. I guess it's not new; some of you have seen it before. There should be a pause screen on screen now. I'll be back in like two seconds. So we could take a short bathroom break before we actually dive into the code here. The topic is wait-free, but we still have to wait. That's funny. Let's see, let me bring this back here. Programming is more reading code and less writing code. Yeah, it's true. I mean, I think you're also right that if we dove, just checking that you can still see my screen, I think I just pressed the wrong keys.
But yeah, I think it's important to go through this paper in part because what we're actually encoding and modeling is fairly complex. And I think if we dove straight into the code, I think we'd basically have to refer back to the paper a lot anyway. And I mean, we still will, but at least now we have this sort of shared understanding before we go into the code, which I think is gonna make that a lot nicer. Yeah, the bathroom break does have a guaranteed upper bound, actually. I know, it's amazing that I can guarantee that. Wouldn't it be really funny if someone else just sat down and finished it? It's true. I'm gonna get rid of the pause screen. Go help John, that's very funny. All right, so what I actually wanna do here is, instead of starting to go through the normalized data structure chapter and then write the code, I'm gonna try to go the other way around. I've read this section like once before, but what I wanna do is start writing what I think this trait will look like based on what we've discussed so far in the stream, and then check it against what they actually describe in this chapter five. And the reason I wanna do that is because doing it that way, I've found, is a great way to check your own understanding. It, like, hi cat. It forces you to face your own misunderstandings. If you make a prediction and then the prediction is wrong, this is like science, right? You make a prediction, and if the prediction is wrong, you learn something. Whereas if you just sort of follow the stuff and then write the thing, you might not explicitly realize that you have the wrong mental model. And so I think this is gonna be interesting. Okay, so we have the biggest problem of the stream, which is: what do we name this? We're gonna create a new library, and this is gonna be important, because this is gonna be a library that everyone's gonna be using for simulating lock free algorithms using a wait-free implementation.
Basketball is pretty funny, but bystander is good. Ooh, I like bystander. It's very subtle. Bystander is a subtle name, but it is pretty nice. Hi cat. Do you have a suggested name? Chai, what do you think? Here, let me give her the mic. Chai, what name do you wanna give? Why are you climbing on the printer? Climbing on the printer to be next to the microphone and smelling it. Dodgeball, non-stop, also pretty funny. Hurry, meow, meow. Queueless is cute, but the problem is it's false. Samaritan, also pretty funny. Unstoppable, also very good. Oh, freeway, that's too silly. Okay, I think it's between bystander and unstoppable. All right, what do we think? Bystander or unstoppable? I have some plus ones for bystander. I have one plus one for unstoppable. All right, it seems like bystander is the winner here. All right, let's go with bystander. I like that. Just imagine in 10 years from now when you can tell your grandkids about how you were there for the naming of this legendary library that is used everywhere now. All right, we don't need a test yet, or we don't have anything to test yet. So we're gonna make a trait, and this trait is going to be called... let's start out with the boring name and then we can come up with a fancy name for it later. So this is NormalizedLockFree. This is not generally the way that we name traits in Rust. Why are you eating my pen? Chai, excuse me, excuse me. You're looking very cute in there, but what? Those are my keys, thank you. Are you bored or something? Come here. Hi. Chai, what do you think it should be called? Huh? What do you think the trait should be called? Okay. Normally in Rust, traits are named by verbs. And they're sort of named by what the thing enables you to do. Most of the time, at least. There are some exceptions. Like Extend is a good example, right? If something is Extend, it allows you to extend the thing that implements Extend.
Iterator is a little weird because it allows you to iterate, but it's not called Iterate. Append has a normal name because it allows you to append to things. Future is weird because it doesn't allow you to future; that's not really a verb. So there's like a general rule and there are a bunch of exceptions. In this case, I do think we want a more descriptive or more Rusty name than NormalizedLockFree. The reason I chose this name for now is because I think it'll help put in context all the things that we're gonna put inside it. And then we can figure out a better name for the trait afterwards. So the reason I don't wanna call this, someone suggested, like, TransformToWaitFree, is because the trait itself is not a transformation to wait-free. Basically what we're gonna have is sort of a struct that's gonna be WaitFreeSimulator, and it's gonna take a NormalizedLockFree. This is the sort of situation that we're gonna run into: we're gonna have a WaitFreeSimulator that is gonna be generic over the lock free algorithm that it's going to simulate, as long as the lock free algorithm is in this normalized form. So this is sort of the construction that I'm imagining. And so the idea would be that if you wanna create, say, a wait free linked list of type T, then what I'm imagining is that this internally really just contains a simulator, which is gonna be a WaitFreeSimulator over a LockFreeLinkedList over T. And then this is just sort of pseudocode at this point, but I'm imagining that there are methods on this, like this has an insert method, which takes not a mutable self, because it's gonna be concurrent, and some T. I guess push_front would be a more idiomatic name. And that's gonna do like self.simulator... I don't even know what this is gonna end up looking like. I don't know what the API of the simulator itself is gonna look like yet, but think of this as like enqueue(Insert(t)), probably something along those lines.
That's gonna be like let i equal something, and then we're gonna do something like self.simulator.wait_for(i). I don't know yet. Well, we'll see how this turns out. But yeah, for the lock free linked list, we're gonna implement NormalizedLockFree. So let's say we have a struct LockFreeLinkedList over T, and we're gonna do something like impl NormalizedLockFree for LockFreeLinkedList<T>. All right, so this is the kind of structure that I'm imagining, and I don't quite know yet what methods we'll have on the simulator and what methods we'll have on the trait. That's sort of what we're trying to figure out in how to model this, but this is the construction that I'm imagining, right? The actual public type, I guess this is gonna be in bystander, right? The trait is gonna be in bystander, the simulator is gonna be in bystander. The wait free linked list is gonna be in a consuming crate, right? So we provide this, which is basically everything that's in the paper, and then for each data structure, we implement a wrapper around the simulator and we implement the inner lock free type, and the simulation between the two is provided by the bystander library. Does that construction make sense? Oh, someone said, wouldn't it be try_push_front, fast path? Joe, I'm imagining that actually, Joe, in this consuming crate, this will be public, but the wait free simulator is not public. The lock free linked list, the inner implementation, is not public. So it's only really this API that's public, and behind the scenes is sort of an enqueue operation. And maybe there's no enqueue, maybe it's just run, right? Or maybe it's just that, I don't know yet, but run is going to first try the fast path for the inner lock free implementation, detect contention, and fall back to the slow path, all internally in the simulator.
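So the layering just described, written out as skeleton types. Everything here is a sketch of the construction, not working code: the fields are placeholders, none of the names are final, and the simulator's run method is still hypothetical, so it's only a comment:

```rust
// Provided by the bystander library: generic over any normalized
// lock-free algorithm T. Fields are placeholders for now.
struct WaitFreeSimulator<T> {
    algorithm: T,
}

// The inner lock-free implementation; private to the consuming crate.
// A real one would hold an atomic head pointer, not an Option<Box<T>>.
struct LockFreeLinkedList<T> {
    head: Option<Box<T>>,
}

// The only public type: a standard-looking linked list interface that
// hides the simulator entirely.
pub struct WaitFreeLinkedList<T> {
    simulator: WaitFreeSimulator<LockFreeLinkedList<T>>,
}

impl<T> WaitFreeLinkedList<T> {
    // Takes &self, not &mut self, because it's concurrent.
    pub fn push_front(&self, t: T) {
        // Hypothetically: self.simulator.run(Insert(t)), where run tries
        // the fast path and falls back to the help queue on contention.
        let _ = t;
    }
}
```

The design choice here is that wait-freedom is purely an implementation detail: callers see push_front, never the simulator.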
That won't be something that's visible to the caller; the public signature of whatever uses bystander internally should be sort of a standard interface to, say, a linked list. As for it being derivable, I don't think the inner type is derivable, because that's really encoding a lock free algorithm, which you can't derive. But you could totally imagine deriving the wrapper type; it's sort of trivially derivable. Like, you can imagine having a derive wait free on this that generates this type and also these wrapper methods. There might have to be some annotation on which operations you can do; maybe it's not quite a derive but some other attribute, like wait free, and then like push_front equals insert. You could imagine something like this that generates this. So I think you could conceivably have a wrapper. I don't know that the wrapper adds too much. Like, I don't know that a procedural macro will really benefit you very much here, but I think you could still imagine something like that. Someone asked in chat: yeah, the streams are archived. I upload all the videos to YouTube afterwards. So we still have this question of, sort of fundamentally, what is in the NormalizedLockFree trait? And the way that we've described it so far, right, there's like a pre-CAS. I guess all of these probably take self. There's a help-CAS, there are probably more arguments here, and there's a post-CAS, right? So we talked about how there's a sort of commit step, and that's where people help. There's the stuff that can be run independently leading up to the CAS, and then the stuff after the CAS that all the helping threads can also execute concurrently. And so the lock free algorithm needs to dictate all of these, because that separation is sort of a property of the algorithm itself. And there probably has to be something else: this probably has to return like a CAS descriptor.
So remember we talked about this in the model, but the cast descriptor probably varies per data structure. Right, so maybe there's a cast descriptor associated type. And the reason this is an associated type is because I don't want to get warnings for this quite yet, thank you. In fact, let's just get rid of this stuff all together. For now, I don't care about these warnings yet. And let's make this pub. And let's make this pub. Just stop yelling at me compiler, thank you. That's fine, great. Yeah, so precast is gonna have to generate a cast descriptor and in fact we know that they sort of mentioned that there might be multiple sort of commit castes even though there's usually just one. So here I could totally actually imagine using something like arrayVec, which is a crate where it's an array up to a certain number of values and beyond that it's a vector. So I could imagine using something like an arrayVec of descriptor of one. And then helpCast probably takes in like a, maybe like a slice of cast descriptors and maybe even which one to help with. Maybe something like that. So like these are all the castes that came out of the precast. And I want you to help with this one. I guess we don't have an arrayVec here yet. So let's just say that this is an array of length one for now. And we can always optimize it later. This to me is like, at least a start. Oh, cast descriptor probably has to be executable in some meaningful way, but might not be something we have to do in the simulator. There also has to be a contention failure counter here, right? Like we have to have some way, actually maybe this is not help. Maybe this is execute cast. And maybe this is not even pre, maybe it's just, maybe it's just pre-executed in post. So maybe this is prepare, this is execute and this is cleanup. Oh, they're all the same length. Very nice. 
Yeah, so remember there has to be some mechanism for say like detecting contention so that we know when to switch from the slow, the fast path to the slow path. So maybe there's like, this is also given a contention, which is gonna be a contention measure or something. And so we're gonna have a pub struct contention measure, which is gonna have some API that I don't quite know what it looks like yet, but maybe just detected. And maybe this is just a use size. And all that does is self.zero plus equals one. Hi, what's up? So we're gonna pass this into the execute so that we have some way of learning about the amount of contention that's going on in the system. Oh, yeah, tiny back is maybe the one I'm thinking about. It's not terribly important. What's up? Hi. What, do you want lunch? It's not lunch time for you yet. Well, she disagrees. No, it's fine. She'll survive. Yeah, so I feel like something like this is roughly what we're looking at. And then the other question becomes like, how do you actually like enqueue an operation? Like there has to be a sort of operation here, right? In fact, maybe operation is an argument to the trait. The reason I make this a type parameter to the trait itself is because you could totally imagine that some algorithms, some lock free algorithm is able to be sort of normalized in multiple different ways, depending on the operation. Like the trivial example of this is if the, whoop, shy, I'm sorry, are you feeling neglected? Yeah. Yeah, so for example, for our linked list, right? It sort of wants to implement the trait for each T. Maybe. I don't know yet. Let's leave it as an associated type. In general, I think the recommendation is like, if it can be an associated type, then make it be an associated type and only sort of promote it to a generic type parameter of the trait, if that ends up being necessary. And I'm not sure whether it needs to be necessary yet. 
And I think the idea is like, prepare is gonna take a description of an operation, maybe a self operation. Right. So that's sort of what I'm thinking. And then for info, wait for simulator. This is probably just gonna have like a run method that takes self and an operation and returns, I mean, who even knows? Maybe there's a result as well. Maybe it's input and output. And maybe it returns LF output. Cause you can totally imagine, like if you want to do a get operation, it has to be possible to get an actual return value out of the operation, right? This isn't done yet, of course. And I'm imagining that run does something like self. I guess this is like algo or inner or yeah, algorithm. Algorithm.prepare the op. So that's gonna be like, this is gonna be essentially encoding the fast path, right? So this is gonna be CAS and then we're gonna do self.algorithm.execute. I guess the CAS is, let's leave out the I for now. Doesn't seem immediately necessary. We're gonna need a sort of contention, which is gonna be a contention measure starting at zero. We're gonna give it in like, I guess this has to be a mutable reference to one. So we're gonna have this pass in a mute to contention. And I guess execute probably needs to have some way to say that it failed, right? So maybe this returns a result cause we need to know whether we succeeded or whether we failed. And if so, there's probably some kind of contention. So either it returns nothing or it returns a use size maybe, which is which one failed. So we're gonna match on this. If it's okay, then we sort of succeeded on the fast path. If it returned an error of I, then this is gonna be the slow path, right? So this is gonna be ask for help. And the fast path we just do self.algorithm.cleanup. I don't know, maybe cleanup needs to take in the CASA somehow or take in the operation. I feel like probably not. And who knows what the slow path has to do yet or how we even get the result. 
But I think this is the sort of structure I'm imagining that this ends up having. Mm, yeah. And for the slow path, I guess is gonna be something like, well, so what I drew, right? It's gonna be something like this has to have like a help which is gonna be a weight free queue. Or I guess it's gonna be a help queue. We don't actually have help queue yet. We're gonna have to implement that. And I don't know what it's gonna hold yet, but help queue, it's gonna be a weight free queue. And it's gonna have a sort of add. It's gonna be weight free, so it's gonna be concurrent. It's gonna take immutable references self. And it's also gonna take, I guess, a description of help. I don't know what help looks like yet. It's gonna have this. And we're gonna have to figure out what does help look like. We don't know yet. And it's gonna have a sort of, I guess, a peak. It's just gonna return an option help. And we're gonna have a, I guess, pop front. Or remove front. Right, so I think the idea here is that for the help queue, right, you want to, if you're trying to, if you're in the fast path, for example, you're gonna look at the head of the queue. If there's nothing there, you keep going. If there is something at the front of the queue, then you try to help this operation. And then at the end, if some to be helped operation completes, then you like sort of, I guess try remove front, and you pass in the help that you tried to remove, maybe? And it does something. I don't know exactly how this is gonna pan out yet. I'm imagining something kind of like this for the help queue. And I guess, like if false, currently we're not being very helpful, but if sort of, okay, what was the phrasing in the paper? Once in a while, which we're gonna define as false, then we're gonna, if let some help is self.help.peak, then do something to help, I guess, help, help make progress. If, and then there's gonna be something there to sort of remove it if we actually ended up completing the operation. 
And the slow path here, I guess, is gonna be self.help.add a help. I don't know what the help actually looks like yet. And then I guess while, actually, maybe what we do is we create a help. We add like a reference to the help, maybe. Maybe we stick on here as like a, not a static, but it's like a raw pointer to the help. What this gives you back is a raw pointer to the help. This is also a raw pointer to the help. And then what we'll do is like, we're gonna pass in, we're gonna stick the help on the queue and then while like help.help.help.help completed. And we don't know how completed it's gonna look yet, but you can imagine that whichever threads end up like helping it pass the finish line is gonna set this completed flag. And so while it hasn't completed yet, then we do like help, which I guess maybe we factor out help into its own method. And that's gonna be this like whatever we do to help help make progress. And maybe this takes like a one. All right, I guess maybe it's always just one. So this one is gonna do self.help, self-help, nice. And this one is gonna do while my help hasn't completed, I help out whoever is at the head. And I guess this means that help is gonna have to have something along the lines of completed, which is probably gonna have to be an atomic bool, I'm guessing because it might be modified by multiple threads. So we'll need standard, standard sync, atomic, atomic bool. Probably gonna need a bunch of things from atomic, but that's sort of the start. So our help here is gonna have completed be an atomic bool of false, because it hasn't completed yet. This needs an ordering. So we're gonna bring in ordering. And notice that this is still just sort of based on our rough understanding of like the organization of the code, like what has to happen when. I haven't actually figured out like what does help do? How do we describe operations like that? 
I think it's gonna be described much more in the paper, but it's more like the outline of the algorithm. All right, so we don't know how we help something yet. This is gonna help until our help has completed, which is all we really care about. And I guess the help is probably gonna have to say like at, which is like, where are we helping from? And that's probably gonna have to be an atomic use size because we're gonna make progress on it over time. Right, so initially we need help from the very beginning of the list of like CASAs, but as we make, as we succeed on some of the CASAs we're going along, we no longer need help with the earlier one because they've already completed. So maybe this has to be like an atomic use size that initially starts out at the place where we started failing. And that might be the zero of CAS, like imagine that for a linked list, right, there's only one commit CAS. So if it failed, I will always be zero, it'll be the first CAS. And so we'll ask for help starting at index zero of the CAS list. Yeah, that seems about right. And we still don't really know how we get the output from this. We don't really know how the input ends up going into the algorithm or into the help. So there's definitely some stuff we're still figuring out, but I guess this is at least a start. Maybe help just has a pointer to the T or to the input. I don't know yet. All right, the structure that we've set up so far makes sense before we start looking at the paper and seeing all the things that we got wrong. Is it safe to peek if the queue is constantly changing? Yeah, so this is one of the things that we need to figure out, right? Which is it should be safe. This is one of the places where memory, like freeing memory becomes a challenge, right? Because peak is always gonna return you, like think of it as a pointer to the help at the head of the list. And that pointer, like if someone frees it, it's a problem because our pointer is no longer valid. 
And this is basically the sort of heart of concurrent memory reclamation. There needs to be something that ensures that the help doesn't go away. I think what we'll actually do in the context of this stream is just not worry too much about memory reclamation and just be willing to leak memory. In which case, if nothing's being freed, all the pointers remain valid. And then we'll deal with the actual memory reclamation issue probably in a later stream. The formatting of the false is really confusing. Yeah, you're right. Help just look like a real word anymore. Can you talk a bit more about our associated types and their intended usage? Yeah, so the associated types here are the cast descriptor is gonna describe which castes still need to be done. And maybe this isn't an associated type. Like maybe this is just a type that we define. I'm not sure yet. It's a good question. This might have to do with versioning. Might be the reason they have to be different. But that's sort of getting ahead of ourselves. It might be that the cast descriptor doesn't need to be something that is sort of provided by the implementation. It might just be the same for every implementation. I'm not sure yet. The input type is like, what is the operational type for this data structure? All right, like a linked list has the operational types like insert and I mean, insert, remove, lookup. Those are the inputs. And so we need to be able to represent that somehow. In practice, the input type is probably always going to be an enum. And I don't think we can get away from that. Like it would be nice if there was a way to not have it be an enum. But I think ultimately we need a single type to represent the operation that is to be executed because we're gonna have to stick that operation maybe in the help struct for example. And so it can't have, we can't stick multiple types in there. It has to be one type, which is then probably gonna be an enum. 
And similarly, the output is going to be the sort of operational, the output of the given operation. So the output for say, a lookup is gonna be like a reference to the inner T or something similar. The output for a remove might be the element that was removed. In practice, I think there's an argument for input being, yeah, I mean, it's a good question. Like output here will also be an enum in the way that we've currently structured it. And that's a little weird because it means that illegal states are possible to represent, which is generally a bad thing, right? So imagine like if input and output are both enums, you can imagine input being like the lookup variant and the output enum being like the remove return type variant, that would be bad. We might be able to tidy this up a little by having like input needing to implement some other trait operation. Maybe that has an associated type output. And so if input implements operation, then like whatever is using the output is actually gonna be LF input output as operation output, maybe, like something like that so that we, but even then that doesn't really get you very far because the input type is an enum, right? So the output type here is still an enum. So this still doesn't give us the sort of type safety that we want, but I don't have a good way for us to represent that. Maybe that will come to me later. But I think for now we're just gonna keep it sort of simple and say input and output are both gonna be enums. Where are we using raw pointers here? Well, the reason I'm using raw pointers for help is because what else would I put in here? I don't wanna put the help itself because I still need to check whether my help has completed. So I need to retain a pointer to it. I could put an immutable reference there, but if this is an immutable reference, right, what is its lifetime? If it's lifetime is tick A, if it's lifetime was static, then we would have to leak every help. We could never free it, which is not okay. 
If it's just like a tick A, that tick A is gonna be this lifetime, but that's gonna differ for every operation. And so what is the lifetime of the help queue that holds many of these? There's no well-defined lifetime. So there's no statically known lifetime. And so that's why I'm making this a raw pointer. And the idea here with peak is that if we know that the pointer will only be freed after the thing has already been removed from the queue, then we know that if we peak that it'll still be valid. It's not quite true either, but yeah. The reason, technically, these will probably be something like guard of help, right? Where guard is something that ensures that the help won't go out of scope, that won't be dropped while we still have a handle, a reference to it. And this is where something like hazard pointers is gonna come into play. I think for now, we're just gonna have it be raw pointers. And I suppose, yeah, it could be non-null, technically. But I'm gonna have it be const now because ultimately this is not the real type it's gonna be. But you're right, it should be non-null given that we know that these are non-null. If input was a generic type rather than associated, we may be avoid the enum. So the problem here, right, if we did this, is that it would mean that you would have a, where's our little code down here. When you implement normalized lock free for lock free linked list, what you would end up with is multiple different implementations of the trait one for each input type, which means that you can't really construct this sort of thing that just holds a lock free linked list because it would have to also specify the input type, which is a little weird. Maybe we can have a second trait here where like, yeah, I'm not sure. There might be a way to express it. I think what I want to do here is start with having it just be the sort of enum construction and then see if we can make it more typesafe afterwards. 
The reason I want to do that is because I don't have a good sense for the actual code is going to look like here. So I don't know what the best way to represent it in a typesafe way will be. And so I think for now we're going to stick with it like this. The lifetime problem is not solved by existential lifetimes because the lifetimes are actually different. It's not just one lifetime. It's different lifetimes. They're distinct. They can't be RC because, well, basically because concurrent memory reclamation is hard, if it's an arc, then imagine that we try to remove the arc from the head of the linked list. At the same time, someone else reads the arc from the linked list. The reference count is still one until the person who peeked clones it and increments the reference count. But imagine they peek, we remove, we drop the thing we removed, the reference count now goes to zero. So we drop it and then the other threads gets to execute, which is after it peeked. So it still has the arc, but its reference count is now zero and it's been dropped. So it's illegal memory. So it can't increment the reference count. There's a race condition there. And this is why reference counting is actually, it's actually kind of hard to get reference counting to solve the concurrent memory reclamation problem. Not impossible, but it's not quite as straightforward as just making this be arc help, sadly. All right. Yeah, so I think we have like a sort of structure that represents at least kind of what we want or what we imagine this protocol or this algorithm to look like. So it's now, weak references have the same problem. Dropping the last weak reference drops the, there's still a heap allocation that holds the weak reference count. And so you end up racing on dropping that. It doesn't solve the problem, sadly. Weak references count would only help solve cycles. It doesn't help solve concurrent reference counting. Unfortunately. 
This is where you basically need something like hazard pointers. Okay. Oh, sorry. I should have warned that this is a bright screen. So now let's look at what the paper actually says for what this normalized structure looks like. And you see that this section specifies the sort of normalized structure. And then the, there's like a little bit later in the paper where they show, given a normalized lock free algorithm. So given some, in our terminology, given something that implements the trait, how do you simulate it in a wait-free manner automatically? Even if the peak clones, you have the same race, the race is just inside peak. A normalized lock free data structure is one for which each operation can be presented in three stages, such that the middle stage executes the owner CASs. The first is a preparatory stage and the last is a post-execution step. Using the linked list example, the delete operation runs a first stage that finds the location to mark a node as deleted while sniping out the list, out of the list all nodes were previously marked as deleted. By the end of the search, we can determine the main CAS operation, the one that marks the node as deleted. Yeah, so the pre for delete in this linked list is walk the list until you find the thing that you want to delete. And you do a bunch of auxiliary CASs as you go to remove any things that were marked deleted previously. But these multiple threads can do them in parallel. There's no problem if it fails. It doesn't prevent you progress because it's not the main operation you care about. And ultimately you get to, okay, this is the node I'm actually going to delete. And then the CAS descriptor that you output is going to be mark this node as deleted, which is going to be a compare and swap. Now comes the middle stage where this CAS is executed, which logically deletes the node from the list. 
And then finally, in a post processing stage, we attempt to snip out the marked node from the list and make it unreachable from the list head. Right, so the post CAS operation is now that it's marked as deleted, try to actually remove it, like unlink the next pointer of the previous node to point to the next node. But that one, multiple threads can do in parallel too. That's not a problem because if they fail, it probably just means someone else did it already. And so that's not actually a challenge. The only problem is if multiple threads try to update a given node at the same time. All right, in a normalized lock-free data structure, we require that any access to the data structure is executed using a read or a CAS. The first and last stages must be parallelizable. That is they can be executed with parallelizable methods. And each of the CAS primitives of the second stage be protected by versioning. All right, so this versioning is something that's gonna come back to bite us later. It's like I have a, I know this from having read other bits that this will be a problem later, but we're gonna ignore it a little bit for now. Yeah, let's discuss more later. That's fine. Okay, so the idea is that the pre and post methods are gonna be executed in parallel by multiple threads, whereas the main CAS operation is not. A lock-free data structure is provided in a normalized representation if any modification of the shared memory is executed using a CAS. Every operation of the data structure consists of executing three methods, one after the other, and which have the following formats. There is a CAS generator, whose input is the operations input, and its output is a list of CAS descriptors. The CAS generator method may optionally output additional data to be used in the wrap-up method. Okay, so that sort of matches the prepare we had except they call it generator. And because it has to be executed in parallel, I assume that it's fine for it to take a reference to the input. 
It doesn't actually want to consume the input. It returns a list of CAS descriptors and may optionally output additional data to be used in the wrap-up method. So there's some like additional, right, where this returns this and additional. Now, at this point, I would probably simplify this a little and say that there's like an associated descriptor and we're gonna require that the descriptor implements a slice index of u-size. We're gonna say that this has to return self descriptor and exact size iterator. No, we don't care about that. Yeah. And I guess we're gonna have to use standard ops slice index. Is that not where it lives? So we're actually gonna just, all we really care about is that we can like look at the individual elements of this and maybe some notion of like a length maybe. And then this is gonna take a descriptor and this is also going to be given the descriptor performed. Something like this. So the implementation can choose what it's like descriptor representation is because all we care about is the fact that we know which like operations we can do inside of it that we can slice into it. And maybe something like the output of this has to be a CAS descriptor that we can execute. Not quite sure yet. They call this the executor and this the wrap-up. So let's call this wrap-up as well. This is sort of have a consistent phrasing. CAS executor which is a fixed method common to all data structure implementations. Okay, so CAS executor. Ah, so CAS executor is actually not a thing that is implemented by the algorithm. It's one that's provided. So we could have a default implementation but it sounds like we actually don't want this to even get overridden. It sounds like the CAS executor is actually something that is part of the simulation. Its input is a list of CAS descriptors output by the generator method. The CAS executor method attempts to execute the CAS' in its input one by one until the first one fails or until all CAS is complete. 
Its output is the index of the CAS that failed. All right, so and it says its output is the index of the CAS that failed which is minus one if none failed. That's our representation here. It's either a result where we succeeded or I guess this is a U size which is the index of the one that failed. So there we're kind of right except that it's not a property of the trait. So I guess really what we have is there's gonna be a CAS executor here which is gonna return this sort of, which is gonna be that common implementation. I guess to do is better. And it's going to take, what do they say, the list of CAS descriptors. So it's going to take the LF descriptor and it's gonna run them start to finish and return the first one that failed. So that's then not actually a part of the API. That's interesting. This is not a website, it's a PDF. So dark mode wouldn't help. And also I don't like dark mode. I only use it for streams. Yeah, Sly's index might not be right here especially now that the executor is a part of the simulator. I think we might actually want to have like a pubtrate CAS descriptor which has a LEN. So this is gonna, we're gonna require that this implements CAS descriptor. It's gonna have to implement, it's gonna have to be able to tell us the length. It's gonna have to be able to tell us the nth, which is, I mean, this is arguably like, maybe not Sly's index, it's just, maybe it's just index really. Like maybe what we really want here is just standard ops index. What's the definition actually of the index trait? The index trait. So we want where self as index use size output. In fact, I don't think that's even what we want. I think what we want is output equals CAS descriptor. So maybe this is descriptors. So this is CAS descriptors. And then there is a struct CAS descriptor which we don't really know what is yet. 
So what we require of whatever it uses as its descriptor of the operations is going to be like, we can find out how many there are and we can index into it to get CAS descriptors. And each CAS descriptor is then something that we can execute in the CAS executor. I already have the PDF open in the browser. I don't think dark mode readers change the PDF that gets rendered. Shouldn't CAS descriptor be the same for all implementations. I'm also sort of imagining that being the case. And in fact, that's sort of what we're saying, right? By saying that the CAS descriptor, like any individual descriptor is the same across all implementations. The reason I want this to be generic is because for like the linked list, there's only, you only want to execute one CAS as the commit operation. But there are other, like for the binary tree, I think they say, you actually need two CASs. And so by having this generic, we can actually have like when you implement normalized lock free for linked list, you can have type descriptor be just CAS descriptor, or if you will, this, right? But when you implement it for BST, you can have it be this. Does that make sense? So rather than us sort of dictating here that you're gonna use Tinyvec or something like that, the implementation can just say, for me, the descriptor is two descriptors. And I think that's a nice, nice way to be able to delegate that to the implementation to choose how it represents its collection of descriptors. So they say, back to the paper, the CAS executor method attempts to execute the CASs in its input one by one until the first one fails or until all CAS is complete. Okay, so the CAS executor is gonna do like Len is gonna be descriptors.len and then we're gonna do for i in zero to Len. I guess we're gonna do like we're gonna execute a CAS which means that we're gonna do a, I guess really what this means is that the CAS descriptor needs to hold like a reference to some kind of atomic. Let's say atomic pointer for now. 
See, this is where it gets weird, right? Because the descriptor might be the commit point for a given algorithm might be different. Like some might be changing an atomic pointer, some might be changing atomic Boolean, some might be changing something completely different. And so maybe it's weird. Maybe it really does need to be generic because the actual underlying descriptors is gonna be different. Now, so maybe CAS descriptor really needs to be a trait as well that has to like have an execute, I guess, and it's gonna return a result of nothing. And so here what we're gonna require is that where self as index use size output implements CAS descriptor, right? Where does this error come from? I will see about that later. So here, I guess we're gonna do something like if descriptors i.execute.isError, then we return error of i. Otherwise we return, okay. So that's sort of the CAS executor loop. And my guess is that here maybe we want to like record contention. Yeah, so the other reason why I want descriptor to be generic here is so that we can, the generator can generate arbitrarily complicated extra data that then gets passed to wrap up. That's a good point. What am I doing wrong here? I'm doing something silly. I guess this actually needs to be generic over the output type. I wish this was a, I guess really what we're saying is you might have a collection type that, like the collection type and the descriptor types are different. You could have two different users of this that use the same collection type with different descriptor types or you could have them use different descriptor types with the same, different collection type with the same descriptor type. And so therefore this needs to be generic over the descriptor. This is gonna be output equals D and this is gonna be D implements CAS descriptor. And see this is not pretty. I wish this didn't have to look like this. I think there are ways for us to clean this up. 
I'm just choosing to not do it quite yet because I still feel like there's still a lot of motion in these traits. And so I want to get the sort of stupid version working first and then we can get at the sort of nicer tidier version of this. And so the idea is that for any given normalized lock free algorithm or data structure, it has an input type which is the operation you wanna do. It has an output type which is what those operations return. It has a CAS type which is what kind of CAS operations can it execute. And it has a CAS this operation which is a wrapper around those CAS types. So that is like, for example, an array of length one. And those are all sort of distinct types that are associated with the particular implementation we're using. Why does this complain? Because it needs to take CASs. Normalized lock free. Why is it complaining about this? Have I written something silly somewhere? I've definitely written something, oh, down here. Right, there's still a bunch we haven't, right. So this is gonna be generator. This is gonna be self.cas execute. And I guess that means CAS executor is also gonna take our contention measure, right? So our contention detected. Something like that. So now this does CAS execute. Does that know what we called it? Oh, CAS, let's make it CAS execute. Reads nicer. And so the generators should return CAS. See, CAS reads as cases which is its own kind of problem. I think this should take probably an immutable reference to that. We'll see how that's to rate it later. And then this should call wrap up with CASs, right? Because wrap up, let's have this be this too. Wrap up. All right, so now it's happy again. It could require an iterator instead of index and then the for loop run over the iterator. We could do that. The reason I'm using index is because I suspect that there'll be like a resume from this index, which I mean, you could do with iterator skip, I suppose. Yeah. 
You could also do this with, like, there's a len on ExactSizeIterator. This just feels like the slightly nicer way. I understand your point. I honestly don't quite know yet. I think it depends on how helping ends up working. All right, so let's head back to the doc. Wrap up, whose input is the output of the CAS executor method, plus the list of CAS descriptors output by the CAS generator, plus optionally any additional data output by the CAS generator method to be used by the wrap-up method. Okay, so there's output from the CAS executor, and the output from the CAS executor is the index of the CAS that failed. Oh, I see, so wrap up gets called regardless, and so it also gets a sort of executed, which is gonna be a Result of nothing and usize. So it sort of gets the outcome of the execution. So wrap up runs regardless of whether we succeeded or not. So that's interesting. So I guess that means this is gonna be like result is this, and then we're gonna wrap up with the result, and then we're gonna do if let Err(i) = result. I guess we'll see exactly how this pans out. I'm not sure yet. But so we do the cas_execute, and whatever result we get back, whether that is we succeeded at all the CASes or we failed at some point, we call wrap up with that information and the CAS descriptors. Its output is either the operation's result, which is returned to the owner thread, or an indication that the operation should be restarted from scratch, from the generator method. Okay, so wrap up's output is the operation result. So wrap up is the thing that should return Self::Output or an indication that it should be restarted. So it's a Result of that and unit. Great. And so then down here, what we're actually gonna do is we're gonna, I guess, like, this is really gonna be a loop, right? Where in fact, maybe this is just a for retry in zero and onwards. And if retry is zero and once in a while.
So if it's the first time, then we're on the fast path, and once in a while, which is, I guess, help is gonna be once_in_a_while, false. So if retry is zero and help, then we're gonna do self., sorry, I know that syntax highlighting gets in the way here. It's gonna be better in a second. Why does it not? There are too many things. This is gonna make it happy. Great. So if we're on the first iteration through the loop, so I guess now we can get rid of this thing. If we're on the first iteration through the loop and we decide to help, then we do help. If we're not on the first time through the loop, then there's probably gonna be something else around helping, right? Cause this is the case where we failed once and now we want to help. We want to help more aggressively maybe. I don't know yet. So we're gonna match on the wrap up, and if it's Ok with the result, if it's Ok with, I guess, the outcome, then I guess we're gonna break with the outcome. We're gonna break out of the retry loop, and if it's an Err, then that says we should retry from the generator, which I guess just means continue. And I still don't know where this ask-for-help business has come in yet. I guess that's something we're about to find out. All right, fine. I guess loop, and let mut fast = true. If fast, help more, fast = false. And now maybe this can go away. Now, yeah, this looks like the kind of structure we're after. The generator and the wrap-up methods are parallelizable, and they have an associated contention failure counter. Interesting. So the contention failure counter actually then sounds like it's passed to the generator and also to the wrap up, which I guess is so that they can also detect contention. Like it might be that you can detect contention without necessarily doing the sort of owner CAS, as they call it, the commit point. And we wanna be able to detect that too. So this is gonna take our contention counter, and so is this.
And I'm guessing there's gonna be an if the contention is big enough somewhere down here, then we're gonna try to get help. Finally, we require that the CASes of the generator method outputs be for fields that employ versioning. That is, a counter is associated with the field to avoid an ABA problem. The version number in the expected value field of a CAS of the generator method outputs cannot be greater than the version number currently stored in the target address. This requirement guarantees that if the target address is modified after the generator method is complete, then the CAS will fail. Okay, so this is kind of subtle. And how we're gonna implement it is sort of an open question for now. But the basic idea here is that we want to make sure that, what's the best way to describe this. Let's say here that, in fact, let me explain this with a code comment because that seems nice. Okay, so, oh, stop trying to be helpful, editor, please. Thank you. All right, so let's say that the head pointer of the linked list points to some node A. And it's at, let's say, pointer address like 0x1, doesn't really matter, just some pointer address. And what we're trying to do here is a CAS where we're trying to insert B, which is at 0x2, right? And so what we're gonna do is we're gonna do a CAS of head from 0x1 to 0x2, right? So here, I guess B.next is 0x1, right? So B.next points to A. So we're gonna insert B at the head of the list and have its successor pointer point to A. And that's the way you do a push to a linked list, right? And what we're gonna do is we're gonna CAS the head pointer from 0x1, which currently points to A, and make it point to 0x2 instead, assuming that the head pointer is still 0x1. And we want this to succeed if A is still at the head, and we want it to fail if A is no longer at the head. And that is because B.next would need to be different, right? We need to update B.next if A is no longer at the head.
Otherwise, what's gonna happen if we do the swap, right? So we stick B at the front, then now if this was no longer what was at the head of the list, like let's say someone inserted a C, we would basically be overwriting C with B. Like the head of the list would now be B, B's successor pointer is A, and C is nowhere to be seen. And that's the problem, right? So that's sort of the basic idea of CAS. But there's another problem that arises, which is a little awkward. Imagine someone else CASes meanwhile, very ominous. There's an insert C, where let's say C is gonna be at 0x3. And C.next is correctly gonna be 0x1. That operation is gonna CAS head from 0x1 to 0x3, so far, so good. If that succeeds, then this CAS will fail because 0x1 is not equal to 0x3. And so that's all fine. Now let's imagine someone removes A, right? So they do a CAS of C.next from 0x1 to a null pointer to indicate that this is the tail end of the list. That's all fine. Now let's say someone inserts D, and D is assigned the address of the old A, right? Because we removed A, so let's say A got dropped, right? Like now that memory is reusable. So now there's a 0x1 here that's actually D, and D.next is gonna be 0x3, right? And now that operation is gonna CAS head from 0x3, which is C, to 0x1. That succeeds because C is at the head. I'm still imagining that this original CAS is like being delayed or whatever, right? This operation succeeds. So now the list we have is D, which is at 0x1, which points to C, which is at 0x3, which points to nothing. And now, right? And now the conclusion: this CAS, which was originally the insert of B, continues, right? So it tries to do this CAS again, and the head of the list is now 0x1. Like the values are the same, and so this CAS is going to succeed and set the head pointer to B.
So now what we have is we have a head, which points to B, which is at 0x2, which points to, well, which points to A at 0x1, but that's actually D because that's also at 0x1, which points to C at 0x3. Now in this case, it actually sort of works out because the addresses happen to be the same, but imagine that B also encoded, say, what's a good example of this, like let's say that this was a doubly linked list or something, or better yet, let's say that B included like the value of the next item, and it would encode A, even though the next one is technically D, right? So basically we had a CAS operation that succeeded because the values happened to be the same, but the things that we're pointing to aren't really the same. One was A and the other is D, and even though they happen to be at the same address, they're not the same. And so we really wanted this CAS to have failed. And this is known as the ABA problem: a value was A, it became B, and then it became A again, but from the point of view of this other insert, that should be treated as the value having changed, so the operation should fail. And so what they're getting at here, sorry for the lack of warning, is that in order to solve the ABA problem, we want a counter associated with any given field to ensure that the ABA problem can't happen. Basically, if the A allocation gets reused, then it must be given a different value. So even though the pointer is the same, like D gets allocated again, we require that the CAS fails if ABA has happened. There are a couple of ways to do this. You could imagine that you manipulate some bits in the pointer to include a counter there. So even though, if we go back to the terminal for a sec, like you could imagine that even though this is 0x1, maybe it's like this is 0x11 and A was actually like 0x01, right? So notice this doesn't have a one here. And then when the allocation got reused, we also set this extra little counter bit in the pointer.
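As an aside, that counter-in-the-low-bits idea can be sketched quickly. This is purely illustrative, not necessarily what we'll end up doing: on a 64-bit target, a pointer to an 8-byte-aligned allocation always has its three low bits zero, so you can hide a tiny version counter there. The names here are made up for the example:

```rust
// Number of low bits we can steal from an 8-byte-aligned pointer.
const TAG_BITS: u32 = 3;
const TAG_MASK: usize = (1 << TAG_BITS) - 1;

/// Pack a small version counter into the low bits of an aligned address.
fn pack(addr: usize, version: usize) -> usize {
    debug_assert_eq!(addr & TAG_MASK, 0, "address must be 8-byte aligned");
    addr | (version & TAG_MASK)
}

/// Recover the real address and the version from a tagged value.
fn unpack(tagged: usize) -> (usize, usize) {
    (tagged & !TAG_MASK, tagged & TAG_MASK)
}
```

Now the CAS compares the whole tagged value, so a reused allocation at the same address but with a bumped version no longer compares equal, and the ABA'd CAS fails like we want it to.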
So that way, this is actually a swap from 0x00001, which is not the same as this one. And then we have some logic for figuring out, like, this value, if you wanna use it as a pointer, just disregard the first few bits. So that's one way to go about it; there are other ways, and we'll look at it when we get to this. But basically the problem we're trying to prevent here, and what they're getting at in the description, is: you need to guard against the ABA problem. And we are going to do that in some way in order to follow this requirement. Okay, so this is their sort of definition of the normalized representation. And that is now what we reflect as well. They haven't actually said how the simulation of this is gonna work, right? As we saw at the top of this section, they say we later show how to simulate this. And that's sort of what we're running into in our code, right? Which is, we now have the trait, right? This is the trait that they're telling us that we need to use. Like they're not necessarily dictating we use these associated types and stuff, but this is sort of the structure of it. And then we've sort of made some assumptions about what that simulation is gonna look like. Like we think it's probably gonna have a loop like this and some helping, but they haven't told us yet. And I do find it really useful to sort of write out what you think it's gonna be before they tell you, and then try to adapt your understanding based on that. Okay, I talked for a while. Let's do some questions for a sec before we continue. Shouldn't the contention counter not be mut, as the methods are parallelizable? So I think, I could be wrong about this, but I think the contention counter is a local measure of contention. So multiple threads might execute the generator at the same time, but each thread is gonna have its own contention counter. Does that make sense?
So that's why it takes an immutable reference to self, a shared reference to self: because multiple threads are gonna call cas_execute at once, but they're allowed to have different contention counters. And that's why they take a mutable reference to basically this thread-local value. Why do we count the contention failures for generators and wrap-ups together? I think the idea here is that you want to count anything that resembles contention, and you might not necessarily know in advance where that's gonna be. Like for some algorithms, maybe contention can only be detected by the owner CAS and cleanup. Sometimes it can be detected before you do the owner CAS too. And so you want to give the maximum flexibility to the algorithm to say whether or not there's contention. Yeah, debugging ABA bugs is a huge pain. It's super painful. Yeah, it's generational pointers; generational indices is one way to think about this. What would you see in the outputs that would indicate that you have an ABA issue? One example of this is imagine that you have a linked list of boxed dyn values. And so even though the pointer is the same, the vtable is different. They're just like different objects. Although I guess it would still be fine, the vtables would still be in the right place, so that would work. Yeah, the problem occurs when what is encoded in the node whose CAS should have failed includes information about the successor, like the target of the CAS. So in the linked list it's sort of stupid because the node we insert, all it cares about is the address of the successor. And so it being reused isn't really a problem. And in fact, this is something that designers of linked lists have taken advantage of: in a linked list you don't need to care about this. But there are other cases where you do need to care about this, which is like in a, let's go to an example of this.
In a binary tree, for example, for the successor, like, you have a left and a right pointer, and even though the pointer of the thing that is next in your tree is the same, its contents, like its left and right, might be different because it switched from like A to D, for example. And depending on those, you might actually have to be in a different place in the tree; you might have to do like a rotation or something in the tree. And so you really need to like reconsider whether this is even the CAS you want to do. And so you actually need to go back to the generator; that's sort of the architecture that they're laying out for us: if this CAS fails, the CAS is the wrong one. Maybe you need to insert somewhere else in the tree instead, but you don't learn that that is the case because the CAS seems to have succeeded, or the CAS did succeed, it just shouldn't have. The ABA problem would not occur on ARM because it uses LL/SC. I think that's true. Ooh, is that true? No, it's... I'm not sure. So the biggest difference between load-linked/store-conditional on ARM and compare-and-swap is that the conditional store is allowed to fail spuriously. Spuriously in the sense that the expected value might still be the same, but it rejects the store because, for example, some other core has tried to modify the value, but it modified it into the same thing. So in that sense, it would deal with the ABA problem. I don't know whether a store-conditional will fail spuriously if the current core is the one that did the change. So imagine that two threads are running on one core on ARM. One does a load-linked, and then it gets scheduled out; a different thread runs on the same core, does a load-linked and a store-conditional on that value, setting it to, let's say the value is A, setting it to B. Some third thread runs on the same core, load-linked, store-conditional, stores A again.
The original thread comes back and now gets to execute its store-conditional. Does the store-conditional fail? I honestly don't know. I think it only fails if some other core changed the value, but I'm not sure. I think the ABA problem still applies here. Oh, YouTube broke, that's sad, but Twitch is still here. But yeah, basically I think this problem still applies on ARM. Okay, all right, let's go back and see how they actually implement the actual simulation code. I need some nuts first. All lock-free data structures we are aware of today can easily be converted into this form. That's good. Yeah, in fact they show all abstract data types can be implemented as a normalized lock-free data structure, but the universal construction is likely to be inefficient. This is similar to the argument for the universal simulation that we talked about in the beginning: you can make any abstract data structure wait-free this way, but the finite number of steps might be a very large number, and you might not actually want to do it. Intuitively, the generator method can be thought of as running the algorithm without performing the owner CASes, the actual commit points. It just makes a list of those to be performed by the executor method, and it may execute some auxiliary CASes to help previous operations complete. Yeah, so they give the example again of the linked list where the generator is gonna be the one that does the search of the linked list, and while searching, it might also remove nodes that have been marked as deleted in the past. Ultimately, the generator returns the node to be deleted. And then the CAS that gets emitted is the one to mark that node as deleted. So the generator doesn't execute the owner CAS, but outputs it as a descriptor to the CAS executor. And if there's no node to be removed, there are no CASes to be executed and the generator just returns an empty list of CASes.
This is another reason why it's a good idea for the container to be generic: in the linked list case, it sounds like, for example, you might actually want the Cases type to be an Option of a CAS descriptor, because it could be None if you don't find a node to remove. The CAS executor method attempts to execute all the owner CASes. In Harris's linked list, like in most known algorithms, there's only one owner CAS. So basically, like, you shouldn't need a vector of CAS descriptors or anything, but if you do, our interface supports it. The CAS executor method attempts the owner CAS, or the multiple owner CASes one by one, until completing them all or until one of them fails. So that's the sort of, it returns a Result where the error case is a usize, which is the index of the descriptor that failed. After the CAS executor method is done, the operation might already be over, or it might need to start from scratch, typically if a CAS failed, or some other auxiliary CASes should be executed before exiting. The decision on whether to complete or start again, and possibly further execution of auxiliary CASes, is done in the wrap-up method. In Harris's linked list example, if the generator method outputted no CASes, then it means that no node with the required key exists, and the wrap-up method should return with failure. If a single CAS was outputted by the generator but its execution failed in the executor, then the operation should be restarted from scratch. Yeah, so this is like, if there's contention on the head or the ABA problem happened, then we wanna run the generator again and then try the CAS a second time. Finally, if a single CAS was outputted by the generator and it was successfully executed by the executor, then the wrap-up method still needs to physically remove the node from the list, which is the auxiliary CAS, and then return with success. We note that, that's not super important. That's fine.
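So the executor part we've been circling around boils down to something like the loop below. To be clear, cas_execute and the UsizeCas toy descriptor are our own names and a real descriptor would also carry the version information discussed earlier; this is just a sketch of the attempt-until-first-failure shape:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// One pending compare-and-swap.
pub trait CasDescriptor {
    fn execute(&self) -> Result<(), ()>;
}

/// Attempt the owner CASes one by one; on the first failure, return the
/// index of the descriptor that failed so wrap-up can decide what to do.
pub fn cas_execute<D: CasDescriptor>(descriptors: &[D]) -> Result<(), usize> {
    for (i, d) in descriptors.iter().enumerate() {
        if d.execute().is_err() {
            return Err(i);
        }
    }
    Ok(())
}

/// A toy descriptor over an AtomicUsize, just to exercise the loop.
pub struct UsizeCas<'a> {
    pub target: &'a AtomicUsize,
    pub expected: usize,
    pub new: usize,
}

impl<'a> CasDescriptor for UsizeCas<'a> {
    fn execute(&self) -> Result<(), ()> {
        self.target
            .compare_exchange(self.expected, self.new, Ordering::SeqCst, Ordering::SeqCst)
            .map(drop)
            .map_err(drop)
    }
}
```

Since most algorithms emit exactly one owner CAS, the slice will usually have length one, but the interface doesn't care.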
All right, we're now up to the transformation details, which is basically an explanation of the algorithm itself, like how does the simulator work? All right, so since we're on a new chapter, let's do a quick bathroom break here too before we start getting into the generator itself. So don't block on me, but you may have to wait. I'm back. Mission wait-free failed, it's true. Yeah, isn't it such a cool research paper? Like, we're very in the weeds of it, right? But if you take a step back, it's really cool what we're doing, right? We're just taking algorithms that already exist that are lock-free and just like magically making them wait-free. It's really neat. Can a custom fat pointer be the solution to the versioning? Unfortunately not. The big problem you run into with a fat pointer is you can't CAS it. So you either need to do bit fiddling, which is tricky because there's like a limit to how many bits you can actually tweak in a pointer before it gets a different meaning. So I've sort of cheated a bit and looked at the Java implementation of this, and what they actually do is a double indirection. So you basically have an object that contains the pointer you're gonna CAS and, when you do the CAS, you have like a, it's weird, you have this intermediate object that contains the additional information that you want to mark and points to the thing you actually wanna CAS. And you do a CAS of the pointer to the intermediate object, and that gives you this versioning. All right. You can see my screen again. Take in just this first sentence. It's like so neat. They must have been very happy when they wrote this first sentence. When they wrote it, they must have been like, yeah, we did something cool. To execute an operation, a thread starts by executing the normalized lock-free algorithm with a contention failure counter checked occasionally, what does occasionally mean?
To see if contention has exceeded a predetermined limit. To obtain non-starvation, we make the thread check its contention failure counter periodically, for example, on each function call and each backward jump. If the operation completes, then we are done. Otherwise, the contention failure counter eventually exceeds its threshold and the slow path must be taken. Okay. So that's interesting. So that suggests that the contention measure is constructed at the top, and we check it like, I guess, so use_slow_path. So it suggests that we should do this for each function call. So that would be something like this. And on continue, which would be here. And I guess we don't actually know what that's gonna do yet, but use_slow_path. And I guess that's just gonna be if self.0 is more than the threshold. I wonder if different thresholds might make sense. Let's set it to two, I don't know. If different thresholds might make sense for different data structures, it could be that this is a good place for something like const generics. Where like, maybe we should have the wait-free simulator be const generic over the threshold of the contention counter. That would be pretty damn cool. But that's sort of a consideration for a different day. That would be really cool. That's totally what we should do. Yeah. So I guess the idea is that we're gonna just check the contention counter every time we go somewhere interesting. I guess it doesn't really need to be here. It could be just after every method call. And that would probably be good enough. Because the top here is the same as the bottom, right? Cause it's a loop. So if the operation completes, then we are done. Otherwise the slow path must be taken. So we don't actually know what the slow path looks like yet. There's also a possibility that the contention failure counter never reaches the predetermined limit for any execution of a single operation.
But the wrap-up method constantly indicates that the operation should be restarted afresh. This must also be the result of contention, because if an operation is executed alone in the lock-free algorithm, it must complete. Thus, the thread also keeps track of the number of times the operation is restarted, and if this number reaches the predetermined limit, the slow path is taken as well. Oh, interesting. So there's actually two values here. So this one is internal, per loop, that's the contention counter. And then it separately says to track the number of retries, for retry in. And this is going to return the outcome. And this is unreachable, which is interesting. I wish it could just figure that out for us. And I guess now fast is just retry equals zero. It might not even make sense for this to be a separate case. I don't know yet. Yeah, so it sounds like the contention counter is one thing, and then there's also: if retry is more than the retry threshold, then also take the slow path. And retry_threshold, and I guess contention_threshold. All right, the key point, and this is something we've talked about in the past, the key point is that an operation cannot execute infinitely many steps in the fast path. Eventually it will move to the slow path. Okay, so maybe what we do here is that we say that the slow path is after the loop. Maybe that's what we do. So maybe this is the fast path, and then use_slow_path is just gonna be like break. Yeah, so I guess this means that the algorithm is set up so that you can retry the fast path, but at some point you move to the slow path. And that's based on contention, or if you just retry many times, which is also really contention. Which I guess means that this code that we had for the slow path is gonna go down here somewhere. I mean, we don't actually know what this looks like yet, but I guess I don't know what i is gonna be yet either. So there's still something missing down here. I guess let i = 0 because I don't know yet.
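Putting those two limits together, my guess at the fast-path skeleton looks something like this. ContentionMeasure, both threshold values, and fast_path are all placeholder names, and the real version would run generator/executor/wrap-up instead of a closure:

```rust
/// A thread-local contention counter; `detected` is bumped wherever the
/// algorithm notices interference, and `use_slow_path` is what we check
/// on each function call and backward jump.
pub struct ContentionMeasure(usize);

impl ContentionMeasure {
    pub const THRESHOLD: usize = 2; // guessed value; maybe a const generic later
    pub fn new() -> Self {
        ContentionMeasure(0)
    }
    pub fn detected(&mut self) {
        self.0 += 1;
    }
    pub fn use_slow_path(&self) -> bool {
        self.0 > Self::THRESHOLD
    }
}

const RETRY_THRESHOLD: usize = 2; // also a guess

/// Retry `attempt` on the fast path; bail to the slow path (the Err case)
/// once either the contention counter or the retry count exceeds its limit.
pub fn fast_path<T>(
    mut attempt: impl FnMut(&mut ContentionMeasure) -> Result<T, ()>,
) -> Result<T, ()> {
    let mut contention = ContentionMeasure::new();
    for retry in 0.. {
        if contention.use_slow_path() || retry > RETRY_THRESHOLD {
            // In the real simulator: enqueue an operation record and help.
            return Err(());
        }
        if let Ok(out) = attempt(&mut contention) {
            return Ok(out);
        }
        // A wrap-up that says "restart" is itself evidence of contention.
        contention.detected();
    }
    unreachable!()
}
```

The key property from the paper survives this sketch: no execution can stay on the fast path forever, because every failed attempt moves one of the two counters toward its limit.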
The slow path begins by creating an operation record object that describes the operation it is executing. A pointer to this operation record is then enqueued in a wait-free queue called the help queue. Okay, so they call this an operation record. The thing that we called help, they call an operation record. I'm just gonna use the same name for now. I think we can always change it later, but I've found that while you're going through and doing this sort of porting from the English version to the code version, using the same names helps a lot in ensuring that you're using the correct abstraction in the right place. And then we can always change this later on to be something more rustic, or something that feels better for documentation or something, but at least we get this direct sort of translation, or closer correspondence, between the text and the actual code. All right, so we're gonna create an operation record that describes the operation it's executing, which we don't quite know what is yet. A pointer to this operation record is then enqueued in the wait-free queue called the help queue. So that's, we create the operation record and we add a pointer to that to the help queue. Next, the thread helps operations on the help queue one by one, according to their order in the queue, until its own operation is completed. So I guess, let's say this is a little clearer if we call this help_first, right? So it says until its own operation is completed, we're gonna help according to the order in the queue. So while we haven't completed, we're gonna help the first thing in the queue. And then we're just gonna go around in a loop like that. Threads in the fast path that notice a non-empty help queue provide help as well before starting their own fast path execution. So that sounds like they always help. Which also seems a little surprising. But I guess the help queue could be empty, right?
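So the slow-path shape, as far as we can tell at this point, is just: enqueue your record, then keep helping the head of the queue until your own operation is done. A minimal sketch, with placeholder closures standing in for the real peek-and-help machinery:

```rust
/// Keep helping whatever is at the front of the help queue until our own
/// operation record reports completion. `is_completed` and `help_first`
/// stand in for the real queue operations we haven't built yet.
pub fn help_until_done(
    mut is_completed: impl FnMut() -> bool,
    mut help_first: impl FnMut(),
) {
    while !is_completed() {
        // Help the operation at the head of the queue (possibly our own).
        help_first();
    }
}
```

The interesting guarantee, which we'll have to earn later, is that helping the head in queue order bounds how long any one record can sit there.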
Like maybe most of the time when you go through, the help queue is empty, so you just sort of check and there's nothing there, so you keep going. Great. We still haven't figured out how we, like, let's say that our thing is completed, how do we actually get out the return value? Cause maybe that's executed by some other thread. It sounds like the sort of operation record needs to include some information about like the current state of the helping and potentially the operation's outcome. The help queue and the operation record. Yeah, so the wait-free queue is the one that we're going to have to implement. I guess what we should probably do, instead of calling this help queue, we should call it like wait-free queue. Let's keep help queue, that's fine. It says the queue, I guess in this reference, supports the standard enqueue and dequeue operations. We slightly modified it to support three operations: enqueue, peek, and conditionally-remove-head. So enqueue, peek, and try_remove_front. I think we got pretty close, that's pretty good. And I guess we need to make this add down here be enqueue, just to sort of match. You'll notice I didn't rename try_remove_front. That's mostly just because we already have Rust parlance for these terms, so it seems fine to reuse the same one. Enqueue operations enqueue a value at the tail of the queue as usual. The new peek operation returns the current head of the queue without removing it. And finally, the conditionally-remove-head operation receives a value it expects to find at the head of the queue and removes it only if this value is found at the head. In this case it returns true; otherwise it does nothing and returns false. Okay, so this should return bool. So this should probably be maybe_remove_front. Because try is usually associated with the Result type. I mean, actually I feel like try_remove is still the right one, and then we can just have it be a Result of nothing and nothing. Like that is an encoding of a boolean, right?
And it either removes the thing you told it to remove or it doesn't; like, it either succeeded at removing or failed at removing. I feel like that's the right sort of encoding here. This queue is in fact simpler to design than the original queue, because dequeue is not needed, because peek requires a single read, and conditionally-remove-head can be executed as a single CAS. Oh, so this is important. Peek requires only a single read. We haven't looked at the implementation of the wait-free queue yet, but this is important because, remember, in the fast path we're gonna call peek basically on every fast path. And so if peek was expensive, we would be making the fast path a lot more expensive by having to sort of check the head of the help queue every time. But they're saying that peek is a single read. So this sort of suggests that if the help queue is empty, the fast paths will remain fast. And conditionally-remove-head can be executed as a single CAS. Therefore conditionally-remove can easily be written in a wait-free manner. So that's nice. Remember, we need the help queue to be wait-free, otherwise the whole sort of tower collapses. Some care is needed because of the interaction between enqueue and conditionally-remove-head, but a similar mechanism already appears. The Java implementation for our variation of the queue is given in Appendix A. Okay, so basically we're gonna have to implement this, I guess, TODO: implement based on Appendix A. So that's a pretty big to-do we're gonna have to do. We use this queue as the help queue. If a thread fails to complete an operation due to contention, it asks for help by enqueuing a request on the help queue. This request is in fact a pointer to a small object that is unique to the operation and identifies it. It is only reclaimed when the operation is complete.
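So the interface we need from the help queue, in Rust terms, might look like the trait below. The MockQueue is a lock-based stand-in purely so the interface can be exercised; the real one obviously has to be wait-free and built from Appendix A:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// The three operations the paper's modified queue supports.
pub trait HelpQueue<T> {
    /// Enqueue a pointer to an operation record at the tail.
    fn enqueue(&self, help: *const T);
    /// Return the current head without removing it (a single read).
    fn peek(&self) -> Option<*const T>;
    /// Remove `expected` only if it is currently at the head
    /// (a single CAS in the real implementation).
    fn try_remove_front(&self, expected: *const T) -> Result<(), ()>;
}

/// Lock-based stand-in for testing the interface; not remotely wait-free.
pub struct MockQueue<T>(Mutex<VecDeque<*const T>>);

impl<T> MockQueue<T> {
    pub fn new() -> Self {
        MockQueue(Mutex::new(VecDeque::new()))
    }
}

impl<T> HelpQueue<T> for MockQueue<T> {
    fn enqueue(&self, help: *const T) {
        self.0.lock().unwrap().push_back(help);
    }
    fn peek(&self) -> Option<*const T> {
        self.0.lock().unwrap().front().copied()
    }
    fn try_remove_front(&self, expected: *const T) -> Result<(), ()> {
        let mut q = self.0.lock().unwrap();
        if q.front() == Some(&expected) {
            q.pop_front();
            Ok(())
        } else {
            Err(())
        }
    }
}
```

Note the Result<(), ()> on try_remove_front: that's the encoding of the paper's boolean we just talked about.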
In this operation record box object there's a pointer to the operation record itself, and this pointer is modified by a CAS when the operation status needs to be updated. All right, so that's a little involved. So they have an operation record box that has a value that is a pointer to an operation record. Why? Ah, I think I see why. So really what this is saying is that there's an operation record box, which is gonna be sort of a, let's use the same words as they're using. So an atomic pointer to an operation record. And then there's an operation record, which is the thing that actually holds the stuff. And these don't actually need to be pub, I don't think. We need AtomicPtr. So the operation record box doesn't actually need to be a Box. Like this is one place where we're gonna want to change the name. It's totally fine for this thing to be on the stack of the owner, for example. What's important is that you put this pointer on the help queue, and that gives all of the threads access to the same atomic pointer that they can use to update whenever the sort of status of the operation changes. And in fact, this operation record thing, my guess is this is basically gonna be like RCU. If you're not familiar with RCU, RCU is read-copy-update. So the idea is that if you have multiple threads contending on some resource, one way to get them to operate concurrently is that they both read the current state. They both copy the current state. They both update the copy they have of the state, and then they both try to sort of atomically swap the updated state back in. So I'm guessing we're gonna end up doing the same thing here. That if two threads both try to help, they're gonna read out the current pointer to the operation record. Then they're both gonna do whatever the next operation is and update their operation record.
They're gonna take it out of here, clone it, modify it to represent them having executed the operation, and then try to atomically swap the updated value back here in the box. So over time, the box will eventually point to an operation record that has completed set. So rather than all of these threads updating these in place, my guess is this is actually gonna end up being a bool and this is gonna be a usize, because it's gonna be read-copy-update. So every thread that decides to help is gonna clone this thing and write back a pointer to its updated clone. Deallocation is gonna be a pain for this, and it's also a little unfortunate that we're gonna end up with more allocations, but maybe that's okay for now. I don't wanna deviate too much from the design, and part of the reason for that is the design is already complicated enough. I don't think we should try to optimize the design while trying to implement it. I think we should implement it first, get it right, and then we can optimize if we see that this becomes a problem. All right, so down here, we're actually gonna construct an operation record box. It's just gonna have a val, which is gonna be an AtomicPtr::new of a Box::into_raw of a Box::new of an operation record with completed: false. This is gonna be an ORB, and we're gonna enqueue the pointer to the record box. And that also means that our help queue is gonna operate on boxes rather than on records, an ORB. And so what we're gonna end up with here, right, is a load of the pointer that's currently in the box, because it's gonna keep updating to point to, basically think of them as new versions of the operation record as it makes progress. So we're gonna load the latest version and look at its completed status.
So this is gonna be an unsafe dereference, because the load of the atomic pointer just returns a raw pointer: orb.val.load. And safety here is gonna be, who knows yet; we don't have a safety guarantee for this yet because we don't know where these operation records are gonna come from. I think what we'll end up with, in this stream or whenever we finish up this part of it, is just gonna be: safety is, we never drop anything. That's often a good first approximation: get the algorithm correct, then do the memory reclamation. And then ultimately the safety argument here is gonna be that because we're using hazard pointers, we know this pointer is still safe to access. But because we don't have hazard pointers yet, we can't actually get to that. All right, so let's see what they say. The operation record, and let's just copy this wholesale for now, they say holds an owner tid, which is a thread identifier. We have such a thing. An operation, which I guess is gonna be, so the operation record is gonna be generic over O, and O is ultimately gonna be the operation. So the help queue is gonna be generic over O, and this is gonna be generic over O, and the help queue is gonna hold the LF::Input. We don't actually know what this is gonna hold yet, so we're just gonna say it holds an O for now. Where's the place where we construct one? All right, what other fields are there in our operation record? Input parameters for the operation. All right, so here you can actually see that this was done in Java, because they say there's the op type and then there's the input parameters for the operation, but in Rust we have algebraic data types, so these can combine into one thing, which is the input. So the input is gonna be O, which is the operation.
So we could call it operation input, it doesn't really matter. And then there's state, which I guess is gonna be an enum, OperationState, which is gonna be PreCas, ExecuteCas, PostCas, and Completed. So state is gonna be an OperationState, and here we might also be able to do something because we have algebraic data types. I guess this is gonna be owner, and because we have algebraic data types, maybe we can wrap up some other state in this. Result is the operation result when completed, so that's gonna be a part of this one: Completed is just gonna hold a T. And this is where it's a little weird, right? Because we have the input type and the output type. So for now, let's make them separate. I think in reality we can probably tie these together with a trait that says the input type dictates the output type. I don't know if we necessarily need to, but let's start with having them be separate for now. So that means the help queue is also gonna be generic over the input type and the output type. So it's gonna take this, and all of this is gonna be generic over both: the LF input and the LF output. Like so. What else? The CAS list. There's gonna be a list of CAS descriptors. So the CASes, which is gonna be a CasDescriptors. Oh, which is gonna have to be, it's almost like this should just be generic over LF, which is gonna be, I guess, a NormalizedLockFree, because then this can just be LF::Input, this can be OperationState with LF::Output, and this can be LF::Cases. And now we can simplify this to say it's gonna be generic over LF, a NormalizedLockFree. This is gonna be generic over LF, and so is this, and so is this. And that translates all the way up here and all the way up here. And this is probably gonna be a PhantomData<LF>.
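Here's a sketch of how these types might hang together, roughly where the design lands later in the stream. The trait name `NormalizedLockFree` and the associated type names `Input`, `Output`, and `Cases` follow the stream's working names, not the paper's; treat the exact shape as an assumption.

```rust
use std::thread::ThreadId;

// Assumed shape of the trait the stream is building toward.
trait NormalizedLockFree {
    type Input: Clone;
    type Output: Clone;
    type Cases: Clone;
}

// Because Rust has algebraic data types, the paper's separate op-type,
// cas-list, and result fields can live inside the variant that needs them.
enum OperationState<LF: NormalizedLockFree> {
    PreCas,
    ExecuteCas(LF::Cases),
    PostCas(LF::Cases, Result<(), usize>),
    Completed(LF::Output),
}

struct OperationRecord<LF: NormalizedLockFree> {
    owner: ThreadId,
    input: LF::Input,
    state: OperationState<LF>,
}

impl<LF: NormalizedLockFree> OperationState<LF> {
    fn is_completed(&self) -> bool {
        matches!(self, OperationState::Completed(..))
    }
}
```

Note that the state machine's data rides along in the enum variants, so there is no "null CAS list" state to represent.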
So we're gonna bring that in just because we need to use the type, otherwise Rust complains. We actually need to say that we're gonna require that. All right. And now, right, so now it's gonna complain at us and say that when we instantiate the operation record, we need to actually give the right fields. So let's go ahead and do that. The owner is gonna be thread::current().id(). I don't know if the current thread ID is really that relevant here, but might as well give it. Input is gonna be the operation. State, I guess, is initially gonna be OperationState::PreCas, I assume. And the CASes, what are the CASes even gonna be? Because we don't know what they are yet. I feel like there's something missing here, which is I think the CASes are actually a property of the state. We'll see about that in a second. And then this is gonna be state.is_completed, or is_final maybe, but completed. So we'll have a method on OperationState, is_completed, which returns a bool and is a matches! on self against Self::Completed. Nice. Yes, we're gonna have to figure out what this CAS list should actually contain, because we don't actually have one here yet. I guess there's something missing here about when the generator runs in relation to this: how do we generate the list of CASes? What does it mean that we're pre-CAS? We're about to find out, I suppose. "Cas list is definitely better than CASes." Yeah, you're not wrong. In fact, we're gonna go through and clean up a bunch of these names after a while. There are a bunch of them I'm not happy with: Cas, CasDescriptor, CasDescriptors. I'm not happy with input and output. ContentionMeasure is a little weird. I don't like this OperationRecordBox syntax, but for now let's just stick with what we have so far. I suppose we could at least make this cas_list to match the paper, ooh, what did I do? I did something weird, to match the paper like I said we were gonna do.
So what do they say? A CAS descriptor, they say, has a target, an expected value, a new value, and a state. So the CAS list is actually a list of these CAS descriptors. Interesting, so this is not something they're generic over, which I think is kind of interesting. Like, this doesn't include the stuff that gets passed to wrap-up, for example. And maybe that's intentional. So maybe the operation record shouldn't be generic over LF. Maybe it should be over I and O. Because it sounds like this should be a CAS descriptor as an actual concrete type. So like, if we have a struct Cas here that has a target, which is gonna be a, oh right, this is why we did it, because there's not a, that's why. Yeah, so this really is an LF. Ah, here's what we're gonna do. This really holds the CAS type, but many of them. We don't actually want the full descriptor, because the various metadata that might get generated to be passed to wrap-up we don't necessarily want to store here. I mean, maybe we do, maybe it doesn't matter, but I feel like we might want to go a way like this. Let's see if we can pass the full CASes and get away with it, just to avoid defining our own array wrapper here. All right, it doesn't really say much about it; I'm imagining this CasDescriptors is something that is then implemented by each user type. And you could totally imagine that we provide a sort of standard implementation of CasDescriptor for, like, AtomicPtr CASes and AtomicUsize CASes, and then people just use those. Giving help. When a thread T starts executing a new operation, it first peeks at the head of the help queue. So we have that already. If it is a non-null value, then T helps the enqueued operation before executing its own operation. After helping to complete one operation, T proceeds to execute its own operation even if there are more help requests pending on the queue.
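The "standard implementation for AtomicUsize CASes" idea floated above could look like this minimal sketch. The struct and method names are ours, and the paper's per-descriptor state field is omitted for brevity.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// A minimal concrete CAS descriptor over AtomicUsize; the paper's descriptor
// also carries a state field, which this sketch leaves out.
struct CasDescriptor<'a> {
    target: &'a AtomicUsize,
    expected: usize,
    new: usize,
}

impl CasDescriptor<'_> {
    // Attempt the described compare-and-swap once; true on success.
    fn execute(&self) -> bool {
        self.target
            .compare_exchange(self.expected, self.new, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}
```

The CAS executor would walk a slice of these, returning the index of the first descriptor that fails.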
All right, so this is the fast-path thing that we already have, right? Which is this help_first. And it's a little unclear whether this should go inside the loop or outside the loop. Like, if we retry, should we try to help again? It sort of sounds like we shouldn't; this should go here. But I'm not sure. To participate in helping an operation, a thread calls the help method. Is that what we called it? We called it help_first, I guess. Telling it whether it is on the fast path, and so willing to help a single operation, or on the slow path, in which case it also provides a pointer to its own operation record box. In the latter case, the thread is willing to help all operations up to its own operation. The help method will peek at the head of the help queue, and if it is a non-null operation record box, it will invoke the help_op method. A null value means the help queue is empty, and so no further help is needed. And that's basically what we have encoded already, although they're suggesting that the pointer check, whether it's a while loop or just a single help, should go inside the help method. I don't think that matters. The help_op method, invoked by the help method, helps a specific operation O until it is completed. Its input is O's operation record box. The box may either be the current head of the help queue, or it's an operation that has been completed and is no longer in the help queue. Right, so you might peek, then the operation finishes, and then you try to help the operation anyway. As long as the operation is not complete. Okay, so let's actually write this method. So help_op takes a &self and an operation record box. And then what they're saying is, as long as the operation is not complete, help_op calls one of the three methods as determined by the operation record. Okay, so we're gonna read the, so let's see, this is the ORB. So the OR is gonna be orb.load. And this is another instance of the unsafety, right?
Where we need to make sure that the target doesn't go away in between. This is where hazard pointers or something like them are gonna be necessary. All right, this is gonna be a load of .val. So now, depending on the state of the OR, we're gonna do different things. If the operation state is Completed, help_op attempts to remove it from the queue using conditionally-remove. When the help_op method returns, it's guaranteed that the operation record box in its input represents a complete operation, because it's no longer in the help queue. Okay, so if it's completed, then we're gonna do self.help.try_remove_front of the ORB. I don't think it actually cares about the return value; I don't think it cares whether it actually removed it or not. Ultimately, the promise of help_op is that on return, the ORB is no longer in the help queue. That's the guarantee it provides. So even if the try_remove_front fails, that just means that someone else must have removed it for us, because the only thing that happens at the front are removes. If it's no longer at the front, then it's already been removed. Nice. All right, what else do we have? What happens to the value in here, though? That's what I wanna find out. We don't take the value, which is kind of interesting. I guess, oh, I see what happens. So this could be executed by some thread that's not the one that's trying to return. And so even though we are the ones who remove it, like, we observe that it's completed, we remove it from the head of the help queue, we don't deallocate the operation record box. Ultimately, the owner of the record box, remember, down here has this while loop on when it's completed, they're done. And so here, I guess what they're gonna do down here is extract the value from Completed and then return.
In fact, this could probably be a loop with a match, which should be pretty nice: loop, well, it's gonna be a little bit of a pain here, but basically the OR is gonna be this. And then we're gonna say, if the operation state is Completed and we have some ref t here, then we're gonna return the t. You'll see why we can't do this in a second. Otherwise, we're gonna continue helping. And I guess this can be a break t. The problem here, of course, is that OR here is a reference to the operation record, and so we can't move the t out of the OR's state. And then the other problem is we have no guarantee that no one else owns it. You could sort of think of this as: if this is the case, we wanna say let or = Box::from_raw of, I guess, or, right? But this is unsafe. How do we guarantee that no one else holds a reference to the OR? So there's a memory reclamation problem here as well. I think what we'll do for now is just leave it like this, and then eventually the borrow checker's gonna yell at us, which is always a good problem to get into. Right, and then we have the other cases up here: if we're in any of the other states, we're gonna need to implement the other ones too. As long as the operation is not complete, help_op calls one of the three methods as determined by the operation record, okay? The preCASes method invokes the CAS-generator method of the normalized lock-free algorithm. Ooh, look, there's even a thing. You see, this is somewhat similar to what we have. We take advantage of the Option type, so we don't have to do this null check. And you see, this is where they encode this: loop if you're being helped and waiting for your operation to finish, and don't loop if you're just on the fast path and being helpful.
And I think we just want to not encode that in the help function, and instead have the caller either do an if let Some or a while. That's okay, I think. And so here, this is where this gets interesting. So help_op reads out the value from the box to get the record. That's the same thing we do. Extract the state, and then I guess this is really just a match on the state which says, okay, if it's PreCas, then it does preCASes, which is: execute the generator and then stick the CAS list into the state, trying to make it visible by basically sticking it into the operation record. So what we want here, then, is in here, what we're going to do is self.algorithm.generator. Actually, so this is interesting. We're going to do or = or.clone(), because we're going to create our own copy of this operation record that we can then write back, right? This is the RCU part of things. And so the generator is going to be given or.input. And in fact, because we clone it, arguably we could just give it the owned version, but that's fine for now. The generator also takes contention. So this is where I wonder whether we should actually pass in the contention counter here. I feel like probably not. So maybe this is just a contention measure that is not being used. What else do we have to pass the generator? Right, so this is going to be the CAS list. It's going to be this. And then what we're going to do is, I suppose, or.cas_list. So this is really or.cas_list = this. And then we're going to do orb.val.store. Or actually, this is going to be a compare-and-swap, or compare_exchange really. compare_exchange_weak is fine, but let's do compare_exchange. And it's going to change it from or to, I guess, the updated or.
And we're going to change it from or to updated_or with a compare-and-swap, because someone else might have helped this operation make progress anyway. And here, actually, if we fail, that's fine; we're just going to go around again. I think there's an argument for this counting as detected contention, but for now we're just going to ignore that. We don't care whether this succeeded or failed, we're just going to try. I guess actually there is one case here, which is if this is an Err, then we can free the thing that we allocated. So this is going to have to be a Box::new of this. That's going to be necessary. And then this is going to be updated_or, which is going to be Box::into_raw, because it's an AtomicPtr, so we have to actually pass it raw pointers. And then if it failed, then we can do Box::from_raw of the updated_or, because it didn't get written anywhere, so it never got shared, so it's safe to drop. Ooh, what did I do? I did something silly here. Oh, this is a reference to a Box. That's not what I want. OperationRecord. Why does this clone the reference? That's really weird. Am I being silly? So, I mean, I guess I am. That will do it. It's just weird. This should be a Box<OperationRecord>. Why does that not have a cas_list? Oh. impl Clone for, because it doesn't implement Clone, is why. For LF where LF::Input implements Clone, where LF::Output implements Clone, and where LF::Cases implements Clone. We can't use derive here, because that would require LF itself to implement Clone, and that's not a thing we really want to require. So we're just gonna have to write it out. state.clone, and OperationState can totally be derive(Clone). It's a little sad that this thing is gonna end up being cloned a bunch. Even though it doesn't contain a T, you still need to do the memcpys. But I don't think that's terribly important. And cas_list is gonna be self.cas_list.clone().
And this is gonna be NormalizedLockFree. And so now, down here, we're actually gonna require where OperationRecord<LF> implements Clone. And so now this doesn't need this anymore, because it implements Clone. That's great. And this is gonna be, I thought this should be automatically coerced, but I guess not. Why does this need to be mut? OperationRecord, OperationRecord. Just to be explicit about the cast. Great. I actually don't like these annotations, but they sort of complicate this. Maybe I need to tune the coloring of them. But basically what we're doing is the RCU part of this, right? We take the current operation record, we clone it, we run the generator, and we try to write back the result of running the generator so that another thread could pick up and run from here. So we try to compare_exchange the updated operation record back in, in place of the previous one. And if we fail, that's fine; we just go around the loop and try to help more. And my guess is, and here we can free it, because we never actually gave it out. If we succeeded, we can't free it, right? If we succeeded here with this CAS, then that operation record has been given out. And it's the same thing here: if we succeeded, we can't free the old one, because other threads might still hold it. So this is another example of where we might need hazard pointers, to actually free the operation records that do end up getting shared; we don't have a mechanism for that. Isn't casting *const to *mut undefined behavior? No, *const to *mut is not undefined behavior. It's only going to a mutable reference that is undefined behavior, and we're not actually doing that here. In fact, it's unclear whether there's even a difference between *const and *mut at the moment. There's some debate about whether there should be, but there isn't currently. All right, so we did that.
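The hand-written Clone impl being described can be sketched in isolation. The point is that #[derive(Clone)] on a type generic over LF would demand `LF: Clone`, even though no LF value is ever stored, so the impl is written manually with bounds on the associated types only. Names here follow the stream's working vocabulary.

```rust
trait NormalizedLockFree {
    type Input;
}

struct OperationRecord<LF: NormalizedLockFree> {
    input: LF::Input,
}

// Hand-written so the bound lands on LF::Input rather than on LF itself;
// derive(Clone) would have required LF: Clone.
impl<LF: NormalizedLockFree> Clone for OperationRecord<LF>
where
    LF::Input: Clone,
{
    fn clone(&self) -> Self {
        OperationRecord {
            input: self.input.clone(),
        }
    }
}
```

With this in place, callers can simply require `OperationRecord<LF>: Clone` and let the where-clauses propagate.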
If it's ExecuteCas, then failed index, I guess that just means try to execute the list. This, to me, really looks like the CAS list is a part of the ExecuteCas and PostCas states. So it's not a field. That's what I think. I think ExecuteCas here holds an LF::Cases, and I think this holds an LF::Cases too. And I think there's even an argument that we don't need to clone it first; we can just construct a new one. So I don't think I want this to be cloned. I think the OperationState, I mean, maybe this should just take LF actually, but let's say it takes LF::Cases and LF::Output, like this. And so, yeah, if we're in PreCas, there is no existing list. In fact, I don't think we want to allocate it at that point. I think we want to say let cas_list = this, which is going to be given or.input. And then what we're going to do is say updated_or is going to be Box::new of an OperationRecord where the owner is just going to be or.owner.clone(), and the state is going to be OperationState::ExecuteCas of the cas_list we just generated. What else is even in there anymore? The input. I don't think the input is necessary either. I think the input is only relevant for the first step. So I think this is what holds the input. So at this point, this really should just be generic over LF. So in the PreCas step, we hold on to the input; in the ExecuteCas step, we hold on to the LF::Cases, and in the PostCas one we do as well; and here it's the output. That's much nicer. We're going to require that this is NormalizedLockFree. That's fine. This is going to take LF. I think we don't even need this helper anymore. Now this can just be LF. Yeah, yeah, now we're talking. So PreCas is given the input, and we can do input here. And then that just gets passed here to get the CAS list.
We're going to match on a reference to it, because that makes it nicer. The input, then we allocate a new operation record where the state is ExecuteCas. And now there's no longer an input here; it's just the owner and the state. Yeah, owner and the state. And now we have our updated_or. That's much, much nicer. And I guess let updated_or is going to be Box::into_raw, so that we get a raw pointer we can use with the CAS. And now in ExecuteCas, we're given the CAS list, and ExecuteCas is going to be self.algorithm, or in fact, it's not even on algorithm: cas_execute of the CAS list. And what else does cas_execute take? Contention, which we don't care about here. So this is going to be the result of running the CAS list. And then it says record.failed_cas_index = failed_index and record.state = postCASes. So I think what that really means is the same thing here. I think what we're going to do is: updated_or is going to be this. This is just going to return, because it doesn't need to do anything more. Now this is just going to be a nice state machine transition, is what this is going to be, because all the stuff for actually updating the thing can happen regardless of what the actual type is. So we can hoist that out there. And now we can say here, we're also going to return an operation record where the owner is still going to be the same thing, but the state here is going to be PostCas of the CAS list, but .clone, I suppose. And I think the result has to be passed along here too. So I think PostCas also needs to take a Result<(), usize>, which is the outcome of cas_execute, right? Yeah. And then postCASes is given the CAS list and the outcome, if you will. So let's call this outcome. And what's it going to do with the outcome? It's going to do postCASes, which I think is just the wrap-up, right?
algorithm.wrap_up, which is supposed to be given the outcome and the CAS list and a contention counter, which we don't care about. And what is it supposed to do? We're going to match. Wrap_up returns a Result of Self::Output, or nothing. So, in fact, what's that even going to do? Is it even going to update it? So this one is actually not going to update. This is going to be: if let Ok(result) = wrap_up, then we're going to return a Box::new of an OperationRecord, right? So this is transitioning the state machine forward to the Completed state with the result. Or, in the else case, we need to restart from the generator, which I guess means we're back at PreCas, which needs the input. So that's the reason why the input should actually be a part of the whole thing rather than a property of PreCas. But that's fine. Okay. So basically what we have here is: we're moving our way through a state machine, but all of the threads are trying to move through the state machine cooperatively. Interesting. Okay. And the input is going to have to be carried along with each one. So PreCas holds nothing, and input is LF::Input, because that needs to be preserved even across the retry loop, which means this will actually be or.input, and input will be or.input.clone(). So we're actually going to end up requiring that Input is Clone and Cases is Clone. Yeah. And so we're going to bring the input forward in each new box. And now PreCas here is just this. Okay. So now we have the state machine for actually driving helping forward. I know I talked a lot without pausing there, but the rough architecture makes sense, and you can see how this ends up having everyone cooperate on the state machine for execution. Five-D chess. Nice. And we still have this guarantee that on return, the ORB is no longer in the help queue.
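Boiled down, the cooperative helping described above is this shape of loop. Everything here is a stub: `ToyState` stands in for the real OperationState, and `transition` stands in for the generator, CAS executor, and wrap-up steps. In the real code each arm builds a new operation record and compare_exchanges it into the shared box instead of mutating locally.

```rust
// Toy state type standing in for OperationState; data fields stripped out.
#[derive(Clone, Debug, PartialEq)]
enum ToyState {
    PreCas,
    ExecuteCas,
    PostCas,
    Completed(u32),
}

// One step forward through the machine.
fn transition(s: &ToyState) -> ToyState {
    match s {
        ToyState::PreCas => ToyState::ExecuteCas, // ran the CAS generator
        ToyState::ExecuteCas => ToyState::PostCas, // executed the CAS list
        ToyState::PostCas => ToyState::Completed(7), // wrap-up produced output
        done @ ToyState::Completed(_) => done.clone(),
    }
}

// Any helper simply drives the machine until it observes Completed.
fn help_op(mut s: ToyState) -> u32 {
    loop {
        if let ToyState::Completed(v) = s {
            return v;
        }
        s = transition(&s);
    }
}
```

Because every helper runs the same transitions against the same shared record, it doesn't matter which thread lands each step; the machine only ever moves forward.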
We're going to have that guarantee by saying that this all happens in a loop, right? So we're going to keep looping until we observe that it's in the Completed state, and once it is, we just return. Yeah. It definitely needs to not be called Cases. It's bad. I guess cas_list is right. The reason cas_list is a little misleading in this context is because it really is a descriptor for all the CASes, right? It's a list of CASes and also information for wrap-up. So it's not strictly speaking just a CAS list; it's like computational state or something. "Three hours ago, we were all like, seeing the code will help. There's help. Isn't it better when there's code?" Yeah. All right. So I guess what we want then is that this now is just self.help_op of help. Nice. Which really means this is another instance where hazard pointers will help us, we hope. And I'm going to remove these safety comments to deal with them separately. And now there's no cas_list here. So that one is done. This. Yeah. And now we finally get to the borrow checker error, which tells us we can't move the t out of the borrowed or.state. There are a couple of ways we can deal with this. We could require the result to also be Clone. I think ideally we wouldn't have to do that, but we're going to start easy and go the Java way and say we're going to break with t.clone(). Oh, public interface. Why is it in the public interface? Private type OperationRecord in public interface. I disagree. Ah, the help queue is not public. Try that on for size. And of course, we haven't actually implemented the help queue, so none of this will actually work yet. Why does it now complain about private type OperationRecord in, oh, this isn't required anymore, so that's fine. What other warnings have you got for me, compiler? Huh? Bring it on. Bring it on. I can deal with your warnings. AtomicBool not used: boom, get rid of it.
AtomicUsize not used: boom, done. Two warnings. Help not used. That's right, it's a todo!. Can I pass things to todo!? Fine. No warnings. It compiles. Ship it. It's done, right? Yeah, I mean, it would be nice to pass things to todo!, but this is fine. It's going to be pretty obvious to us when we try to implement this that we're going to need this value, and same with this. So I'm not too concerned about that. Ooh. All right. I think, so we're now at, what, the four-and-a-half-hour mark? I think we're probably going to end it there. The reason I say that is because, oh, there's a monitored-run business here. This is probably related to preCASes. Does postCASes have the same? Oh, postCASes; executeCASes and postCASes have a bunch more things. So there's definitely stuff we're missing here in the state machine. So let's try to do a little bit of what it's telling us to do here. It says preCASes does a monitored run. What is a monitored run? It runs a monitored version of the generator, which occasionally checks the contention failure counter in order to guarantee this method will not run forever. All right, interesting. Interesting. If the CAS list is not null. I see. Basically what they're saying here, right, is that the generator is supposed to give up if it runs for too long. And then the CAS-list-not-equal-to-null check here is: if the generator runs for too long, like if it encounters too much contention, then instead of generating a list of CASes, it's actually going to return and go, I can't do it. So think of it more as returning an error. And in that case, preCASes actually doesn't do anything else and just loops. So I think what we do here is, how are we gonna have a monitored version of the generator? That's what I wanna know.
Which occasionally checks the contention failure counter in order to guarantee this method will not run forever. You know, here's a cool way we can do this. Let's say this is probably the last thing we do: detected is gonna return a Result<(), Contention>, and we're gonna have a struct Contention that doesn't have any fields. If self.0 is less than the contention threshold, we're fine; otherwise, we're gonna return a Contention error. The reason we do this is because now what we can say is that if you're writing an implementation of generator, it's actually gonna return a Result of either a Cases or a Contention. So you can use the question-mark operator in the generator on the contention object's detected method, and it will make sure to return early if you've detected too much contention. That's really cool. And so now what we can do down here is match on this and say that if this is Ok(cas_list), then it's a CAS list. If it's an Err(Contention), then we do the same thing they do, which is have preCASes do nothing, and preCASes doing nothing just means we loop. So we loop. So this is basically a monitored run. That's really neat. And then I'm guessing for cas_execute there's something similar, and for post-execute there's something similar. So let's see if we can find postCASes. Yeah, so postCASes is also a monitored run, on wrap_up. So wrap_up has the same property of, ooh, what should that even do? So wrap_up can already fail because the overall operation hasn't completed and should be retried, but it can also fail because of contention. And I guess that's fine. I think we can wrap those both up and say those are both because of contention. What we're gonna do in the PostCas case is: operation result is null, that's equivalent to the monitored run failing. There was contention.
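The "monitored run via `?`" trick sketched above could look like this. The struct names, the counter semantics, and the threshold value are all working assumptions from the stream's design, not the paper's code.

```rust
// Sketch of the contention-counter-as-Result trick; threshold is arbitrary.
struct Contention;

struct ContentionMeasure(usize);

const CONTENTION_THRESHOLD: usize = 3;

impl ContentionMeasure {
    // Record one contention event; errors once the threshold is crossed so
    // callers can bail out with `?`.
    fn detected(&mut self) -> Result<(), Contention> {
        self.0 += 1;
        if self.0 < CONTENTION_THRESHOLD {
            Ok(())
        } else {
            Err(Contention)
        }
    }
}

// A stand-in generator where every optimistic attempt "fails", so the
// monitored run gives up instead of spinning forever.
fn generator(contention: &mut ContentionMeasure) -> Result<Vec<usize>, Contention> {
    loop {
        // imagine an optimistic read or CAS attempt failing here
        contention.detected()?;
    }
}
```

This is what makes the generator a monitored run: the check is woven into the user's retry loop with `?` instead of requiring a separate watchdog.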
Oh, okay, so it's a little different whether wrap-up encountered contention, versus wrap-up finishing without contention but the operation not succeeding. Those are different things, so we do actually need to represent them differently. So maybe this is `Option<Contention>`, right? Either wrap-up succeeds, or it fails, in which case maybe there's contention. And I think the question mark should still work there. Definitely a little weird of a return type, but I think it's warranted here. So what we do is match on this and say: if it's `Ok(result)`, then this; if it's an error with no contention, then we need to restart the generator; and if it exited because there was contention, then basically it's not up to us to restart, and so we continue. All right, so that's the "if operation result is null" case: we didn't succeed in wrapping up because there was contention, and so we return, which is the same as going around the loop again. Otherwise we need to figure out whether we should restart, and "should restart" is the same as us not being complete. I guess, okay, maybe the better way to represent this is actually as a `Result<Option<Output>, Contention>`. I think that's the better way to say it: it errors if there's contention because wrap-up didn't finish, or wrap-up succeeded but there's no result. That's a nicer way to do it. So either we got `Ok(Some(result))`, or we got `Ok(None)`, or we got a `Contention` error, in which case it's not up to us to restart because we didn't even finish our wrap-up. Nice. And I think that is, it says, okay: we keep the owner tid, we keep the operation, we keep the input. If we're restarting, then we set the state to preCASes. So we're in postCASes; if we're restarting, we set it to preCASes, and the nulls don't matter because we have algebraic data types. If it's not "should restart" and the operation result is not null, that's the same as an `Ok(Some(..))`.
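The three-way outcome being worked out above can be made concrete with a small sketch. The types and names here (`NextStep`, `after_wrap_up`) are our own for illustration, not the stream's code; the point is that `Result<Option<Output>, Contention>` distinguishes all three cases the paper needs:

```rust
#[derive(Debug, PartialEq)]
pub struct Contention;

/// What the helping loop should do after a wrap-up attempt.
#[derive(Debug, PartialEq)]
enum NextStep<T> {
    /// Ok(Some(output)): the operation finished; store the result in Completed.
    Completed(T),
    /// Ok(None): wrap-up finished but the operation didn't; go back to preCASes.
    RestartFromStart,
    /// Err(Contention): wrap-up itself was aborted; not our job to restart, just loop.
    KeepLooping,
}

fn after_wrap_up<T>(res: Result<Option<T>, Contention>) -> NextStep<T> {
    match res {
        Ok(Some(output)) => NextStep::Completed(output),
        Ok(None) => NextStep::RestartFromStart,
        Err(Contention) => NextStep::KeepLooping,
    }
}
```

Because the state is an enum, the restart case just constructs the preCASes variant with whatever fields it actually carries; there are no dummy nulls to drag along.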
And so then we keep all the same fields, we set the state to Completed, and we include the operation result, which is what we do here: we just stick the result in Completed. And in either case, we do a compare-and-swap of the value for the record. So that's the same thing as up in preCASes, where we try to update the thing, but notice that we don't actually check the return value, because it doesn't matter. Nice. And in fact, I think this swap down here can be `compare_exchange_weak`. We don't need it to retry internally, and that's okay. Nice. So really executeCASes, I think, is the one that still needs a bit more fidelity, because our implementation of it is probably not quite right. Right, so we see it's like, we loop through the descriptors. Oh, and it checks whether the CAS has already succeeded, because some other thread may have completed it. So there's some stuff missing here, like "check if already completed". So really this is "implement Figure 5". But that looks like it's a little bit more involved, and I don't think I want to get into that today. I think we'll probably do executeCASes and all of the wait-free queue stuff later on, in some subsequent stream. I think that's a good place for us to stop. We're now at the heart of execution, and that's where we're gonna need the wait-free queue anyway, and we're not realistically gonna do that today. So that's where we're gonna stop. At the tail end, though, let's see if there are questions on, like, how did we get here? What's the current state of affairs? Anything weird about what we've done so far? And also, thoughts about what you wanna see next. So I think there are two paths to go down. Ultimately we're gonna have to do both of them to actually get to a finished implementation.
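The `compare_exchange_weak` point above is worth a tiny standalone illustration. This is a generic sketch, not the stream's record-update code: when the CAS already sits inside a retry loop, a spurious failure of the weak variant just sends us around the loop again, exactly like a real conflict would, so the stronger (and on some architectures more expensive) `compare_exchange` buys nothing.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Increment a counter with a weak CAS in a retry loop.
fn bump(counter: &AtomicUsize) -> usize {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(
            current,
            current + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return current + 1,
            // On failure, spurious or real, retry with the value we observed.
            Err(observed) => current = observed,
        }
    }
}
```

In the postCASes case it's even simpler: the return value isn't checked at all, because if the CAS on the record failed, some other helper already installed an equivalent state and the loop will observe it.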
But our options are: either next stream we finish up executeCASes, the wait-free queue, and tie it all together, or next time we do hazard pointers and memory reclamation. Ultimately they're both needed, but we can do them in either order. So thoughts on that would be interesting too. Yeah, the algorithm translates really nicely into Rust. Part of that is just because it's a state machine, and state machines are really great with algebraic enums. Yeah, the cat wants attention. She's gonna eat J. "Can you explain why the weak CAS failing spuriously isn't a problem?" The swap down here: it doesn't matter if the compare-exchange fails, because we're gonna loop anyway. In general, you only want the non-weak compare-exchange if spurious failures won't be retried but need to be. In this case they will be retried, in the sense that we'll run the loop again. So it seems like it's about 50-50: some people are like "hazard pointers sound cool", and the other half is like "let's deal with memory reclamation later, let's just get it working". Yeah, I'll do some thinking and see where I go next. I honestly don't know yet. They're both gonna be really interesting. The hazard pointers just nerd sniped me really hard, and I started looking at the code and was like, that's very exciting. All right, well, I think we're gonna end the stream there then. I don't know when I'll end up doing the next stream. These are fairly long. I'm guessing in like two, three weeks is my hope, but we'll see. Thank you all for coming out, thank you for watching if you're watching this on demand afterwards, and I'll see you next time. So long, farewell. Bye.