Hello folks, welcome back to yet another Crust of Rust video. In this one I'm gonna tackle a topic that keeps being suggested over and over, and that is atomics and memory ordering. If you don't know what those are, that's fine; I'm gonna go through them in a decent amount of detail over the course of this. But this is a topic that I've been hesitant to go into, partially because it's a topic where there isn't really great documentation for how stuff works. So it's very easy for me to explain something and then get it wrong, which I don't like to do. Not because I'm worried about being wrong, but more because I'm worried about putting content out there that people then rely on, and that content is wrong. The other reason I haven't tackled this is that I felt like I was still grappling with some of the concepts myself, and so I was worried that might translate into a poor explanation of some kind. I feel like now I'm at a point where I've read through enough of the documentation and worked with this material enough that I can give both a correct and an understandable explanation of what's going on. So that's what we're gonna try to do today.

The core of the video today is gonna be the Rust atomic types and the memory ordering that's observed in Rust, so the stream will be Rust specific. But at the same time, most of this translates to basically any other language that has atomics in this sense: C, C++, to some extent Java (although the memory model is a little different), and same with Go. So hopefully you should be able to take some of the things you learn here and apply them to other languages as well; the underlying memory ordering stuff is useful to know regardless of what language you're working in.

Most of what we're gonna be talking about today is the std::sync::atomic module from the Rust standard library. This module
has pretty good docs on the high-level ideas of why we have these types and what they're used for, but there's just a lot of detail there that makes it hard to get things right and to understand all the subtleties of the interactions between the different types.

What we're gonna start out with is just: why do we need atomics in the first place? You see here that the documentation lists a bunch of types like AtomicBool, AtomicIsize, AtomicUsize, AtomicI8, etc., and you might wonder: why not just use bool or isize or usize? Why not just use the primitive types? There are a couple of reasons for that. The primary one is that if you have a primitive type like a bool or a usize that's shared across thread boundaries, there are only certain ways to interact with that value that are safe. And when I say safe here, I mean it in the sense that data races are undefined behavior: if you don't control how different threads operate on the same memory location, you run into undefined behavior, which, as we talked about in some of the previous streams, is just bad. The other reason is that it makes sense to have slightly different APIs for these types, because what you're doing when you operate on the atomic versions of these types is really issuing different instructions to the underlying CPU and placing different limitations on what code the compiler is allowed to generate. We'll get into what those differences are and why they're important. But the core of it is that if you have shared access to some memory value, you need additional information about that access to let the CPU know: when should different threads see the operations that other threads do? How do they have to synchronize? If one thread writes to a value and another one reads it,
what are the guarantees about which values the reader will read? Will it always read exactly the latest one? What does "the latest one" even mean? And also, for other reads and writes in the program, in both of these threads, which of those are visible to this other thread? In general, if you had something like a usize, and let's say you found a sound way to do data racing — so you had one thread that just wrote to this value and another thread that just read from this shared value, and it was a standard usize — there is actually no guarantee that the reading thread is ever going to see the value that was stored. In practice, the CPU and the memory system will usually make that be the case, but you're not guaranteed that it will be; there's nothing in the specification that says it should be. This is where atomics come into play. They let you place limitations and rules on the use of this type and the value, and on which values can be exposed when.
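To make that kind of cross-thread sharing concrete, here's a minimal sketch using an atomic type so the access is actually well-defined. It's a hypothetical example, not from the stream: SeqCst (the strongest ordering) is used as a placeholder since orderings haven't been discussed yet, and `Box::leak` is used just as a quick way to get a `'static` reference that both threads can use. The visibility guarantee here actually comes from `join`, which synchronizes the writer thread with the reader.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// One thread stores, another loads the same shared value. With AtomicUsize
// this is well-defined; with a plain usize it would be a data race (UB).
fn share() -> usize {
    // Box::leak gives a &'static AtomicUsize we can move into the thread.
    let v: &'static AtomicUsize = Box::leak(Box::new(AtomicUsize::new(0)));
    let writer = thread::spawn(move || v.store(42, Ordering::SeqCst));
    // Joining the writer guarantees its store is visible to us afterwards.
    writer.join().unwrap();
    v.load(Ordering::SeqCst)
}

fn main() {
    assert_eq!(share(), 42);
}
```

Note that without the `join`, there would be no guarantee about *when* the reader observes the store — only that observing it is not undefined behavior.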
Now, when I say "the specification", or the rules here for the compiler and such, what I'm really talking about is the language's memory model, and in particular the Rust Reference. You have the Rust documentation — the standard library documentation — you have all the RFCs, and then you have the Rust Reference. The idea, at least, is that the Reference should fully specify the language, such that if someone else came along and wanted to implement a compiler for the Rust language, they would know exactly what to implement and how those things should behave. It has a section on the memory model, and it says: Rust does not yet have a defined memory model; various academics and industry professionals are working on various proposals, but for now this is an undefined place in the language. Which, you know, is kind of unhelpful if you're trying to write anything that uses atomics and concurrency. In practice it's not actually that bad, partially because Rust relies on LLVM: Rust generally follows the C memory model, and in particular the C11 memory model, which I think is the most recent variant. So what we're gonna be following here is actually the memory ordering documentation from C++ — not because C++ is exactly what we match (and sorry for the bright screen; this one doesn't have a dark mode) — but because it has fairly good documentation for what the different memory orderings mean, and good examples of what goes wrong. In general we're not gonna be reading this page too much; it's more that this is where much of the explanations I'm gonna give, the examples I'm gonna use, and many of the guarantees I'll be talking about come from. Just so you're aware.

So let's start out with a type like AtomicUsize.
This is the atomic equivalent of the usize type. If you were to look at its in-memory representation, it's exactly the same as a usize. The only real difference between an AtomicUsize and a usize is that it has different methods on it, which we'll see in a second, and you can only access the value through those methods — you can't get at the usize without calling one of them; it's not trivially castable into a usize, for example. It's the same size as a usize, too. The only real difference is what instructions get generated when you access the value.

If we look at the methods on AtomicUsize, you see that there's a constructor that gives you a new one. AtomicUsize — and this applies to all the atomic types — is not inherently shared; these are values that are placed on the stack. So if you want to share them across thread boundaries, you can't just create a single AtomicUsize and give out — I mean, you can give out shared references to those threads, but shared references to something on the stack won't be 'static, and that usually gets you into a pain. So generally, what you do with one of these atomic types is stick it on the heap using something like a Box, or more frequently an Arc, and that will allow you to share a pointer or reference (depending on how you want to use the language here) to that shared atomic value, which you can then update.

The reason this works — and the main differentiator between the AtomicUsize operations and usize operations — is that you can operate on an AtomicUsize through a shared reference to self. Notice that for a normal usize, you need an exclusive reference to the usize in order to modify it. That is not the case with AtomicUsize; a shared reference is sufficient. The reason for that is that the compiler generates special CPU instructions that make it safe for multiple threads to
access this value at the same time.

Okay, so let's first go through what the methods are. We'll go into what each of them means and how you might use them in a second, but just to get a survey of what kind of stuff we have to deal with. First and foremost, there are load and store, and they do basically what you would expect: load will load out the value that's stored in the AtomicUsize — so in this case it returns a usize — and store takes a usize and stores it into the AtomicUsize. Similarly, there's swap, which does both. You'll notice that these take an additional Ordering, and we will talk plenty about ordering. If you look at Ordering, it's just an enum that has these different variants. What these mean — what kind of semantics they establish — is part of what this video will be about, but fundamentally, what Ordering does is tell the compiler which set of guarantees you expect for this particular memory access with respect to things that might be happening in other threads at the same time. The other methods on this are compare_and_swap, compare_exchange, and compare_exchange_weak.
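As a concrete sketch of load, store, and swap (a hypothetical example, not from the stream — Relaxed is used purely as a placeholder; which ordering is appropriate is exactly what the rest of the video is about):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Note that store and swap take &self, not &mut self: a shared reference
// is enough to modify an atomic value.
fn demo(x: &AtomicUsize) -> (usize, usize, usize) {
    let initial = x.load(Ordering::Relaxed); // read the current value
    x.store(initial + 1, Ordering::Relaxed); // overwrite it
    // swap writes a new value and returns the old one in a single atomic step
    let old = x.swap(100, Ordering::Relaxed);
    (initial, old, x.load(Ordering::Relaxed))
}

fn main() {
    let x = AtomicUsize::new(5);
    assert_eq!(demo(&x), (5, 6, 100));
}
```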
We'll talk about the differences between those later on in the stream. These are basically ways of reading a value and also swapping out its value, but doing so conditionally, and in one atomic step. What I mean by "one atomic step" I'm gonna show you in a little bit. But basically, if you do a load and then a store, there's a chance that some other thread comes along and does something in the meantime — it gets to run between your load and your store — whereas a compare-and-swap is a single operation where no thread can get in between.

There are also a bunch of these fetch methods: fetch_add, fetch_sub, fetch_and, fetch_nand. These are operations that similarly avoid anything happening between when you load the value and when you change it. fetch_add, for example, will atomically — as a single step — load the current value and add a value to it, without it being possible for any thread to execute in between, or to modify or read the value in between those operations. We'll talk about why that's useful, and why these are separate from the compare_exchange methods, in a second.

Before we dive into actually using one of these, I'm gonna take some questions on what I've talked about so far, because I went through a bunch of things fairly rapidly here, and I think it's useful to make sure we have a shared foundation.

"Don't u64s on x86 have atomic access, or am I confusing it with something else?" So, on some platforms the non-atomic types have additional guarantees. For example, I think on Intel x86, if you have a 64-bit value, any access to it is automatically atomic, assuming it doesn't cross a cache line boundary. And that means that, in theory, if you're down in the raw assembly and you know you're on Intel x86-64, you can do this without AtomicUsize. But realistically, in the standard library, what we want to
do is expose a common interface that will always work. This is why we can't really expose "just use usize through shared references on x86-64"; instead, everything is modeled through these types. It's just that on a target platform and architecture like that, AtomicU64, for example, will generally be free. (That's not quite true either, but close enough.)

"Why is it non-exhaustive?" Yeah, so Ordering, if you look here, is #[non_exhaustive]. This is a great attribute if you haven't used it before; it basically says that no code is allowed to assume that these are all the variants there will ever be in Ordering. So if you match on an Ordering, for example, you always need to include an underscore branch — an "otherwise" branch — because the standard library wants to be able to add additional orderings later if necessary. The biggest candidate, I think, is the consume ordering, which is something C++ has but Rust currently doesn't. I don't know whether that will be added, but the idea is that if we want the ability to add things later, it needs to be marked as non-exhaustive.

"Is the Ordering enum related to the different memory models of the architectures, like x86 or ARM?" It's not that they're related to the memory architecture; they're related to what guarantees the operation gives you. How different architectures implement those guarantees will vary from architecture to architecture, as we'll see in a second, too.

"What's the difference between an atomic and a mutex?" With an atomic, there's no locking; multiple threads can operate on the value at the same time in some reasonably well-defined way. It's not like a mutex, where one thread gets to access the value at a time and all other threads need to wait. And the mutex usually guards a larger section of code, right?
It says: I grab the mutex, I run this code, and then I release the mutex, and no other thread is allowed to execute anything under that mutex while I'm in this "critical section", as it's called. With an atomic, there's nothing really like that. You can maybe think of it as a mutex that guards just a single memory access, if you want to, but it's much more efficient than that.

"Do those atomic operations then not block other threads, potentially?" No, all atomic operations are lock-free. They're not necessarily wait-free: on certain architectures that don't have it, fetch_add, for example, is implemented in terms of compare-and-swap, so you may have to wait for other threads. But they're considered lock-free — there's no mutex in there.

The atomic operations are not just per CPU architecture, either; it's not that they only modify the instructions. They also change the compiler's semantics, as we'll see in a second when we start looking at the different orderings. So they limit both what the CPU can do and what the compiler can do around a given memory access. The point here is that even if you're on something like Intel x86-64, you might still want to use the atomic types, because you need guarantees from the compiler as well.

"store, swap, and friends all take an immutable reference to self, not &mut self. I guess that means they rely on UnsafeCell?" You know, that's a good question. I think you're right. We can check that pretty easily. Of course, it's a macro. Let's see what we can find here. Oh, there are a lot of macros in here. Let's see if we can find the struct definition for this. I see a self.v.get(), which usually means an UnsafeCell. So my guess is that all of these atomic types contain an UnsafeCell that they then call get() on to get the pointer to the value, and then they issue the actual atomic instructions on that.
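That guess can be sketched as a toy model before we confirm it in the source. This is purely illustrative (the name `ToyAtomicUsize` and its methods are hypothetical): just an UnsafeCell around the raw value, where a real implementation would issue atomic CPU instructions on the pointer returned by `get()` rather than a plain dereference.

```rust
use std::cell::UnsafeCell;

// A toy model (hypothetical, for illustration) of what an atomic type
// plausibly looks like inside: an UnsafeCell wrapping the raw value.
pub struct ToyAtomicUsize {
    v: UnsafeCell<usize>,
}

impl ToyAtomicUsize {
    pub const fn new(v: usize) -> Self {
        Self { v: UnsafeCell::new(v) }
    }

    pub fn load_unsynchronized(&self) -> usize {
        // NOT atomic: the real AtomicUsize issues an atomic load on the
        // pointer from get() instead of this plain read.
        unsafe { *self.v.get() }
    }
}

fn main() {
    let x = ToyAtomicUsize::new(7);
    assert_eq!(x.load_unsynchronized(), 7);
}
```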
In fact, if we go all the way to the top of this macro, we should see the definition. Yeah, so here: pub struct $atomic_type — this is what the macro expands to. It just has a v, which holds an UnsafeCell of the inner type, and you see that this really is just the inner type. So if you have an AtomicUsize, it's really just an UnsafeCell<usize>, and then it uses the special instructions to access the value behind it. This ties back to the older unsafety stream, where we talked about interior mutability: UnsafeCell is, at the fundamental level, the only way to get mutable access through a shared reference.

"You said atomics are generally shared via Arc rather than Box, but the only reason for that is that Arc is Send + Sync while Box is not, so it's simply more convenient, right?" That's not quite true. If I Box a value and place it on the heap, then I have an owned value. But if I spawn two threads, those two threads both generally require that the closure you pass in has a 'static lifetime. If you pass a reference to the Box to both threads, that reference does not have a 'static lifetime; it's tied to the stack lifetime of the Box. With an Arc, you can clone the Arc and give each thread its own individually owned — and therefore 'static — Arc to the AtomicUsize. That's why Arc is usually used over Box. You don't have to do it that way, though: for example, there's Box::leak. Box::leak will leak a value on the heap, so its destructor is never called, which gives you back a 'static reference that you can then pass to both threads.

Great. Okay, so now that we have a shared foundational understanding of why these types exist, let's go into what you might actually use them for, and in particular what this memory ordering business is. Here we're going to do a cargo new --lib, we're going to make it a binary instead, and we're going to call it... well, not "hang".
That's from a previous stream. We're going to call it "atomics", because I'm feeling boring today. All right. Fine — just to make it stop complaining. Watch this now break my install or something; that'd be great.

Okay, so let's say we want to try to implement our own mutex type. I'm going to define a Mutex<T>, and a Mutex<T> is going to hold an UnsafeCell<T>, right, which is the value we're going to be giving out. So we need UnsafeCell, and from std::sync::atomic we're going to need AtomicBool and Ordering. A mutex is a combination of a boolean that marks whether or not someone currently holds the lock, and an UnsafeCell of the inner type, so that if we hold the lock we can give out a mutable reference. I'm not going to implement the guard mechanism, because that's not that interesting for this particular stream. Instead I'm going to do a with_lock, which takes an impl FnOnce(&mut T) -> R — basically a closure that it's going to call once it has the lock. For now, let's just make the body call f. That's not actually going to work, but for now.

And I want to caveat this: what we're going to implement here is known as a spinlock, because if the lock is currently held, we're going to spin until it's no longer held. We might yield or something, but fundamentally it's a spin. Don't use spinlocks — you almost never want a spinlock, and you almost never want to implement your own mutex. There's a great article by matklad, "Spinlocks Considered Harmful"; I'll post it in chat here. It's a great article — read it, don't implement your own spinlocks, and probably don't use a spinlock in the first place. But we're going to do it because it's a good exercise in understanding what all of this is for.

So, how is this going to work?
Well, I guess we need a new method here, too. Let's make these pub. Wow, I can't spell today. pub fn new, which takes the value to initially create the mutex with. It creates a Self with locked: AtomicBool::new(...) — and I'm going to have some constants here just to make the code a little nicer: const LOCKED: bool = true; const UNLOCKED: bool = false;. So it's going to start out UNLOCKED, and the value is going to be UnsafeCell::new(t). So far so good.

Now we have a constructed mutex, and the question becomes: how do we take the mutex? How do we lock it? Let's start out with the naive way to do this: while self.locked.load(...) != UNLOCKED, spin. Then self.locked.store(LOCKED, ...), call f, capture its return value, set the flag back to UNLOCKED, and return ret. The boolean here is whether the mutex is locked: true means locked, false means unlocked.

All right, so this is the naive implementation: while the lock is held, we wait. Oh yeah, you're right, this should be self.v.get() here, and it's going to be unsafe, because what we're asserting here — the safety argument — is: we hold the lock, therefore we can create a mutable reference.

So great, now we have something that seems reasonable, right? We wait for the lock to become unlocked, then we store the fact that it's locked so that no one else can take the lock, then we call the function with a mutable reference — and we can create the mutable reference because no other thread can be in the critical section at the same time. Trust me, I will get to compare-and-swap; I know this is wrong, but just hang with me for a second. Just trust me on this, please. And then we store that it's now unlocked, so that other threads can access the value. Great, so far so good.

So, this code has some problems, and the first is that it doesn't compile. Why doesn't it compile?
Well, it doesn't compile because we need an argument here, right? This method, as we saw in the documentation, requires an Ordering, and it's not immediately clear what this ordering should be, because we don't know what orderings are yet — I haven't told you. Maybe you already know, in which case that's great, but for now, who knows. We're just gonna use Ordering::Relaxed, because we're relaxed people, and it seems fine for this to just be Relaxed; we don't need to be super strict about things. We like anarchy, apparently.

All right, so we're just going to do a relaxed load — whatever that means — and see if it's unlocked. Great, and now it compiles, and in fact, if we try to run this, it would probably work. So here's what I'm gonna do. I'm gonna create a lock: Mutex::new(...), and do a little bit of ugliness, because why not. For the inner value we're just gonna use 1, and this is going to be a &'static to the type we implemented. And then we're just gonna spawn some threads. This is gonna be a sort of silly test: lots of threads, all trying to modify this value at the same time through the lock, and we'll see whether it ends up doing the right thing — because we totally got this right, right? Each such thread is gonna do l.with_lock(...), and in there they take the value, and all they really do is v += 1, in a for loop — they're gonna do this a hundred times. Let's do ten threads, each doing this a hundred times. And I guess we're gonna have to wait for these threads, so handles is a Vec.
In fact, let's do this the hardcore Rust way: we're going to collect all the thread handles, just so that we can wait for them afterwards. So we spin up ten threads, and each thread increments the value a hundred times. What am I doing here? Right — I'll get to that in a second; that's worth explaining. And then, at the end, we do for handle in handles { handle.join() }, because we wait for each thread to exit. Once all of the threads have exited, surely we can assert that l.with_lock(|v| *v) — just reading out the value from behind the lock — is now 10 * 100. Right? Because we have a lock, all of these increments should happen exactly; there should be no bad stuff going on here, right?

And... this complains that UnsafeCell cannot be shared between threads safely, which is correct: UnsafeCell in general does not implement Send or Sync, because it cannot safely be shared between threads — UnsafeCell is inherently unsafe, so we generally want people to opt in to Send and Sync for their type. In this case, we know that Mutex is in fact Sync: unsafe impl<T> Sync for Mutex<T> where T: Send — and this is an important one. We implement Sync for Mutex so that you can concurrently access it from multiple threads, as long as the value T is Send. The reason we need the T: Send bound is that the lock can be taken from multiple different threads, and each of those threads might take the value. In fact, I think this needs to be Send + Sync, because it might access the value as well... no, it does not — we never concurrently access the inner value from multiple threads at the same time.

All right, so now we have code that surely works, right? Someone pointed out that the collect isn't needed. The collect is needed — it's true that if you didn't have the collect here:
This would still be an iterator, but you would join each thread the moment you spawned it, so you would only ever have one thread running at a time. "Initial value's one?" Oh, you're right, this should be zero. Good call.

All right, great, so we have perfect code, right? If I now — what do I call this? atomics — if I now run this (with --release, to make it fast)... great, so it works. Woo, our lock is perfect. All right, what if we increase the concurrency here? Let's do this a lot more times: not a hundred, but a thousand. Nice, works so well. Great. So clearly this code is entirely right. Fantastic stuff, right? It works, ship it, nothing more to worry about — except it turns out there are some problems with this; they're just really hard to reproduce.

In particular, remember how I talked about one of the reasons we need atomics and operations like compare-and-swap? It's because here we're doing a load, and then we're doing a store, but in between, maybe another thread runs. Imagine two threads call with_lock concurrently, and the mutex is currently unlocked. Both of those threads see that the mutex is unlocked, so both of them leave the while loop. Both of them get down here; they both store LOCKED into the value, but neither sees that the other has stored LOCKED, because they've already left the loop. Then they both create a mutable reference to the same value — which is undefined behavior — they both call f with it, and then they both unlock the mutex. So clearly it's possible for that to happen; it's just that generally the computer is so fast that it doesn't. And we can do things to encourage it to happen: for example, we could add a yield_now here, just to make it more likely that some other thread comes in in between. If we now run this, you see that it panics.
We didn't get to the final value; we got some other value that's slightly smaller, and the reason is that two threads ended up executing the closure we passed to with_lock at the same time. They both read the same old value, and they both wrote back that old value plus one, but because they ran at the same time, we lose some of the increments — they get overwritten by another thread running concurrently. This is also just undefined behavior: the compiler is allowed to generate garbage here. It happens to do something somewhat reasonable, but you can't generally rely on that.

You might say: okay, but clearly there's no yield_now in the real code, so why would some other thread get to run? There are a couple of reasons. One is that there might be multiple cores on your computer — this computer has 16 cores or something — which means that if two threads are running on different cores, you can't control the relative timing of when they do things. So it could totally be that two threads are on different cores, they're both in the while loop at the same moment, they both see it being unlocked, and then they both proceed. The other reason is that even on a single core, the operating system will generally limit how long a given thread gets to run and then do something called preemption, which is basically forcibly stopping the program in the middle because it's been running for too long, just to ensure that all the threads on your system get to run. When it does that, you have no control over where — so you might be preempted at any point in time, including between this load and the store. Now, this is unlikely.
It's unlikely that the preemption happens exactly there, but it can, and that's what we're modeling with the yield_now: we're pretending that the thread gets preempted at that point.

Okay, so before I go into how we solve this problem, let's talk about what just happened here and why it's wrong, and make sure everyone understands that.

"Would sleeping a bit while doing the += 1 create concurrency issues?" If we slept in here, that would probably not increase the concurrency issues, because if you're in here, you've already taken the lock. It's taking the lock that's racy, not what you do under the lock. So a sleep here would not have the same effect as the yield_now I inserted.

"Isn't the compiler already predetermining the sum?" No, the compiler is not precomputing the sum here, because this runs on different threads. Even though technically this could be computed statically, the compiler doesn't do that across thread boundaries.

"Would thread sanitizer catch this?" We'll talk about thread sanitizer later in the stream, but it would catch something like this, because you have two threads concurrently writing to one memory location — a write-write conflict — which is the kind of thing thread sanitizer generally catches. We will also talk about loom, I promise.
Please trust me on this.

"It seems like sleeping there would result in more lock contention and thus more races." No — actually, you have more lock contention if the critical section is shorter, so a sleep would make it less contentious. Well, that's not quite true either; I don't think the sleep is all that important, and it's not necessary to demonstrate the problem.

"Are atomics related to multi-threading or concurrency?" Atomics themselves are useful for concurrency, and multi-threading is a form of concurrency. If you don't have multiple threads, it's unlikely you will need atomics. If you have multiple threads, atomics are necessary even if you only have one core.

"Isn't sequential consistency the default in Rust?" No, there's no default of sequential consistency in Rust. It is true that for code that doesn't use atomics — if you're operating on a single thread, for example — you don't have to worry about these problems, and the reason is that, if you look at the memory model, any sequence of operations within a given thread is guaranteed to be observed in order. They might execute out of order, but the effects will appear as though the program ran top to bottom, in sequenced order. That might be what you're referring to, but there's no default of sequential consistency — this is why all these methods take an Ordering that you have to explicitly pass in.

All right, so it's now clear that something here is broken, right?
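To pin down exactly which two instructions race, here is the naive acquire isolated into a hypothetical helper (illustration only — the comments mark the window; the `main` below is only a single-threaded smoke test, since the race itself is timing-dependent and would be undefined behavior to trigger on purpose):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

const LOCKED: bool = true;
const UNLOCKED: bool = false;

// The naive acquire, spelled out as two separate steps.
fn naive_acquire(lock: &AtomicBool) {
    while lock.load(Ordering::Relaxed) != UNLOCKED {}
    // <-- window: another thread's load can also observe UNLOCKED here,
    //     before either store lands, so both threads "acquire" the lock.
    lock.store(LOCKED, Ordering::Relaxed);
}

fn main() {
    let l = AtomicBool::new(UNLOCKED);
    naive_acquire(&l);
    assert_eq!(l.load(Ordering::Relaxed), LOCKED);
}
```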
And one of the reasons it's broken is that there's a race between this load and this store, where multiple threads might win that race at the same time. So we need to find some way to avoid this race condition of two threads observing the mutex being unlocked at the same time, and the way we do that is using the compare_exchange operation. You'll notice that there's also compare_and_swap; "compare-and-swap" is often what the CPU instruction is referred to as. In general, you want to use compare_exchange rather than compare_and_swap — in fact, compare_and_swap is deprecated for exactly this reason. The reason you want compare_exchange over compare_and_swap is that compare_exchange is strictly more powerful; in fact, I think the implementation of compare_and_swap used to just call compare_exchange. It's more powerful because compare_exchange lets you specify the memory ordering differently depending on whether the operation succeeded or failed, whereas compare_and_swap does not. We'll talk about what that means too. compare_exchange_weak is interesting, and we'll talk about it in a second. Okay, so if we look at compare_exchange, we see that it takes the current value, it takes the new value, and then it takes two orderings — we'll talk about why there are two orderings and what they mean in a second. And the first line of the documentation is pretty helpful: "Stores a value into the atomic integer" — or in our case, boolean — "if the current value is the same as the `current` value" (the argument). Right, so let's see what that would look like: compare_exchange, we're gonna go from UNLOCKED to LOCKED, still just Ordering::Relaxed, because we don't know what this means yet. And in fact, now the store goes away. I realize this is maybe hard to see — there we go. I wish I would format this differently.
In fact, maybe I'll do that here, just so it's easier to follow what's going on. Okay, so compare_exchange takes what the current value should be for us to update it, and what it should be set to if the current value is what the first argument was. So what will happen here is the CPU is going to go look at the value — the AtomicBool here — and see if it is unlocked, that is, if it is currently false. Then and only then will it set it to true, and do that in such a way that no other thread gets to modify the value in between when we look at it and when we change it. So compare_exchange is a single operation that is both the read and the write. And notice we can do this in a loop, right? Because if the current value is LOCKED, then the value will not be updated, because the current value is not UNLOCKED, and so compare_exchange will return an Err, and so we loop and try again. compare_exchange returns an Err if the value was not updated, and it returns Ok if the value was updated, and in either case the value contained in the Ok or the Err is the current value — whatever it was at the time the CPU went to that memory location. So if you get an Err, it's going to tell you, "here's what the value was when it wasn't equal to what you passed me." In our case, we don't need the actual value; all we care about is whether it was updated, hence the call to is_err here. But if you have something like an AtomicUsize, it might actually matter what the old value was. Imagine you're updating a counter and trying to increment it by one: the next time around the loop, you want to use the updated value to do the increment, rather than the value you read previously. All right, so does this make sense? Does it make sense why compare_exchange solves this particular problem?
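A sketch of what the locking loop looks like at this point in the stream — still with Relaxed everywhere, since the orderings get fixed later; the free functions here are just an illustration, not the stream's exact `Mutex` methods:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

const UNLOCKED: bool = false;
const LOCKED: bool = true;

fn lock(locked: &AtomicBool) {
    // The read and the write are one atomic operation: only one thread can
    // win the transition from UNLOCKED to LOCKED. Everyone else gets an Err
    // (carrying the value that was actually there) and spins to retry.
    while locked
        .compare_exchange(UNLOCKED, LOCKED, Ordering::Relaxed, Ordering::Relaxed)
        .is_err()
    {}
}

fn unlock(locked: &AtomicBool) {
    locked.store(UNLOCKED, Ordering::Relaxed);
}

fn main() {
    let l = AtomicBool::new(UNLOCKED);
    lock(&l);
    assert!(l.load(Ordering::Relaxed)); // we now hold the lock
    unlock(&l);
    assert!(!l.load(Ordering::Relaxed)); // released again
}
```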
Right, because here there's no space in between the load and the store; they're just one atomic operation performed on the memory location we're operating on. Now, in practice you don't actually want to do it this way, and there are two reasons for that. First, compare_exchange is a fairly expensive operation, because if you think about it, if every CPU is spinning on compare_exchange, they're all going to try to get exclusive access to the underlying memory location. In practice, what that means is: imagine you have eight cores, and one of them is currently holding the lock. All the other threads are trying to get exclusive access to the value that holds the true or false, and so each core is going to say, "give me exclusive access to this value," which is a coordination effort, right? It needs to coordinate with all the other cores to say, "I now own this memory location," to make sure no one else is writing to it at the same time. Then it's going to look at the value and go, "oh, it's not the expected value," and then some other core is going to say, "now give it to me," and the ownership of that memory location is just going to bounce between the cores. This is very inefficient. It's not that CPUs aren't great at it; it's just inherently an expensive proposition to coordinate exclusive access among all these cores. If you're curious about this, I highly recommend you look up the MESI protocol, which explains a lot about how this actually works at the hardware level. It's a super nice protocol to know about if you want to understand performance implications at this kind of low level. So the MESI protocol basically says that a given location in memory — it's actually talking about cache coherence and
cache lines, but I'm going to refer to it as a location in memory — can either be shared or exclusive. There are some other states too, but basically shared or exclusive. In compare_exchange, the CPU requires exclusive access to that location in memory, which requires coordinating with everyone else. Alternatively, a given location in memory can be marked as shared, and multiple cores are allowed to have the value in the shared state at the same time. So in general, if we can have the value stay in the shared state while the lock is still held, that's going to avoid all this ownership bouncing. And so what you'll see in certain spin lock implementations is another, inner loop, which does this. Notice that this doesn't actually change anything — the behavior is still the same — it's just that if we fail to take the lock, then we just read the value. Notice that this doesn't do a compare-and-swap; it doesn't require exclusive access, it's read-only access. So if we fail to get the lock, then we spin and just do reads, which allows the value to stay in the shared state, which means we don't have all this ownership bouncing. Then the moment it changes — because some core takes exclusive access to it, probably because it's doing the unlock store — only then do we go back to doing the expensive compare_exchange, where we try to get exclusive access. If we fail to get the lock again, we fall back to doing this read-only loop. So this is actually a fair amount more efficient when you have high contention. But again: don't use a spin lock. "Do you think the performance degradation would be visible if you just redid your test?" No — for these optimizations you're talking nanoseconds. They only really matter if you have a lot of contention on a particular value, with a lot of threads and cores participating in that contention. You're usually going to see these kinds of things show
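The inner read-only loop being described (sometimes called test-and-test-and-set) might be sketched like this — again a reconstruction, with Relaxed orderings matching where the stream's code currently stands:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

const UNLOCKED: bool = false;
const LOCKED: bool = true;

fn lock(locked: &AtomicBool) {
    while locked
        .compare_exchange(UNLOCKED, LOCKED, Ordering::Relaxed, Ordering::Relaxed)
        .is_err()
    {
        // We lost the race: spin with plain loads instead of immediately
        // retrying compare_exchange. Loads only need the cache line in the
        // MESI *shared* state, so the line doesn't bounce between cores.
        // Only once we observe the lock become free do we go back to the
        // expensive exclusive-access compare_exchange above.
        while locked.load(Ordering::Relaxed) == LOCKED {
            std::hint::spin_loop();
        }
    }
}

fn main() {
    let l = AtomicBool::new(UNLOCKED);
    lock(&l); // uncontended: succeeds on the first compare_exchange
    assert!(l.load(Ordering::Relaxed));
}
```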
up when you plot performance — throughput, or goodput — against the number of cores in a graph. What you'll see is: if you have a lot of contention, everyone trying to get exclusive access, then as the number of cores goes up, the goodput starts to flatten out or even go down, because the cores are spending all of their time fighting over who gets exclusive access to the value. Whereas if you do something like this, you avoid that collapse. You might still not see linear growth — where doubling the number of cores doubles the goodput — but you will usually see a better curve, because you avoid some of this performance collapse. "Is compare_exchange then much faster than locking a mutex?" It's hard to say; a single compare_exchange is not that expensive. The biggest difference between compare_exchange and a mutex is that a mutex has to wait. A single compare_exchange call will never wait: it's gonna go and try to do the operation you told it, and then it's gonna say either "I succeeded" or "I failed." If it failed, it doesn't block the thread or anything — you can choose to do that, like in this case, where we spin, so we are holding up and waiting for it to succeed. A mutex, on the other hand: if you lock it, your thread will be blocked until you have the lock. So that's the difference; think of compare_exchange as a much more primitive operation. In general, what you'll see is that compare_exchange is often, but not always, used in a loop — keep retrying with some updated current value until you succeed. Very often, although not always: there are some algorithms where, if you fail the compare_exchange, there's some other useful work you can do, so you go do that and then try the compare_exchange again at some later point in time. And this is where weak comes in. So, remember, we looked at the documentation:
there's compare_exchange, and then there's compare_exchange_weak. The difference between these is fairly subtle, but basically it comes down to this: compare_exchange is only allowed to fail if the current value did not match the value you passed in; it is not allowed to fail under any other conditions. compare_exchange_weak is allowed to fail spuriously — that is, even if the current value is what you passed in, it's allowed to fail. It usually won't, but it's allowed to. The reason this matters comes down to what operations the CPU supports. On x86 — Intel, and AMD too — there is a compare-and-swap operation. It's not technically called that, but effectively there's an instruction that implements compare_exchange; it does exactly that, as one atomic instruction. On ARM, though, you don't usually have a compare-and-swap operation. Instead, what you have is LDREX and STREX — load exclusive and store exclusive. So you have two operations: load exclusive, which you can think of as taking exclusive ownership of the location in memory and loading the value to you; and store exclusive, which says, "only if I still have exclusive access to that location — that is, no one else has taken it away from me — will I store, and otherwise I'll fail." You could imagine that some other thread took ownership of the value, for example just to read it, or to overwrite it with the same value that's already there. In that case, the STREX — the store exclusive operation on ARM — will fail, even though the value might still be the same. But it will fail nonetheless. The upside of doing it with these two operations is that the STREX is really cheap — you don't have to go and grab exclusive access again — but it might mean that you fail the operation without needing to. So on ARM, compare_exchange is actually implemented using a
loop of LDREX and STREX, because it needs to implement the semantics of compare_exchange, which is "only fail if the current value changed." And this means that on ARM processors, this loop ends up being a nested loop, and nested loops aren't inherently a problem, but they tend to generate less efficient code — more register pressure and such. So this is why we have compare_exchange_weak, which is allowed to fail spuriously and so can be implemented directly using LDREX and STREX. And of course, on x86_64, compare_exchange_weak is just a compare-and-swap, so it just always works; it doesn't generate spurious failures. The idea is that if you're already calling it in a loop, you should use compare_exchange_weak; if you want it to fail only when the current value has changed, then you use compare_exchange. So in this case, because we're calling it in a loop, this should be weak. All right, did that make sense before we continue? "It's called load-linked and store-conditional" — yes, that's the generic name for these operations, good call. "ARM64 does have compare-and-swap" — true, but there are a bunch of other ARM variants that do not. Let's see. Okay, so that seems like it was fairly clear as to the difference between compare_exchange and compare_exchange_weak. Okay, so I guess let's leave that comment up here — "stay in S when locked" — just to keep the code commented. "Is that all the conditions when weak is allowed to fail?" Weak is allowed to fail spuriously, so it can fail for all sorts of reasons. It will only succeed under the one condition you expect it to, but it can fail for whatever reason it feels like, whereas the non-weak version can only fail if the current value changed. All right. So far, so good.
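The classic "CAS loop" where compare_exchange_weak is the right choice looks something like this counter increment — a hypothetical helper, not code from the stream; the point is that since we retry on failure anyway, a spurious failure (possible on LL/SC architectures like ARM) just costs one extra iteration:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Increment via a CAS loop, returning the previous value.
fn increment(n: &AtomicUsize) -> usize {
    let mut cur = n.load(Ordering::Relaxed);
    loop {
        // Both Ok and Err hand back the value that was actually in memory,
        // so on failure we retry with the *updated* current value rather
        // than the one we read originally.
        match n.compare_exchange_weak(cur, cur + 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(prev) => return prev,
            Err(actual) => cur = actual,
        }
    }
}

fn main() {
    let n = AtomicUsize::new(0);
    assert_eq!(increment(&n), 0); // returns the previous value
    assert_eq!(increment(&n), 1);
    assert_eq!(n.load(Ordering::Relaxed), 2);
}
```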
We're now at the point where we have something that looks like it can't have this problem, and in order to test that, I guess we could put a thread::yield_now in here, and maybe one in here too, and then try to run it again. All right, so nothing fails. Okay, so now clearly our code is correct, right? Ship it. It's all done; we fixed the race condition; there are no more problems. Of course not — I wish that were true, but it is not, and the reason comes down to this ordering business. Let's see, do we want to do ordering first or fetch first? Let's do ordering first. Okay, so as we discussed, there is this Ordering enum, and you see that there are a couple of different variants here: Relaxed, Release, Acquire, AcqRel (acquire-release), and SeqCst (sequentially consistent). As I described before, these basically dictate the allowed observable behavior when you have multiple threads that interact at some memory location — what is allowed to happen when you run this code.
Okay, so Ordering::Relaxed means that there are no guarantees. And when I say no guarantees, that's not quite true: you're still guaranteed that the operation happens atomically. If you do a load, you can't see some bits that were written by one store and some bits that were written by another store — the operation is still atomic — but that's the only thing that's guaranteed. And to demonstrate just how extreme that is, I'm gonna give you a little fun test case. Let's call it too_relaxed. So here's what's gonna happen: I'm gonna spawn two threads. I'm gonna say x is Box::leak(Box::new(AtomicUsize::new(0))). The reason I specifically give the type here is that Box::leak returns a 'static mutable reference, which I couldn't move into two threads, so this is my way of casting it, basically: sync::atomic::AtomicUsize. I guess I can make it a test; it doesn't really matter. So I'm gonna have an x and a y that are both numbers, and I'm gonna have two threads. Thread one is gonna read y with Relaxed ordering, and then it's gonna store that value into x, and I guess we can have it return r1. Thread two is gonna load x, and then it's gonna store 42 into y. And then we're gonna do r1 is t1.join().unwrap(), and the same for t2 — we join both threads so that we have the values. All right, so we have two threads: one thread is gonna read y and store to x; the other one is gonna read x and store to y. And the surprising bit that I'm gonna tell you here is that it is possible to have r1 == r2 == 42. This is a possible execution of this program. Let's see why that's weird. So here, r2 loads x, and then the thread stores 42 into y — that's what thread two does — and I'm telling you it's possible for r2 to be 42, even though the store of 42 happens after r2 is read. This should be extremely surprising to you, but it is possible with Ordering::Relaxed.
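Written out cleanly, the test being dictated looks roughly like this (a reconstruction of the stream's example):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn main() {
    // Box::leak turns the Box into a &'static reference we can move into
    // both threads (the &'static mut it returns coerces to a shared &).
    let x: &'static AtomicUsize = Box::leak(Box::new(AtomicUsize::new(0)));
    let y: &'static AtomicUsize = Box::leak(Box::new(AtomicUsize::new(0)));
    let t1 = thread::spawn(move || {
        let r1 = y.load(Ordering::Relaxed);
        x.store(r1, Ordering::Relaxed);
        r1
    });
    let t2 = thread::spawn(move || {
        let r2 = x.load(Ordering::Relaxed);
        y.store(42, Ordering::Relaxed);
        r2
    });
    let r1 = t1.join().unwrap();
    let r2 = t2.join().unwrap();
    // Under Relaxed, r1 == r2 == 42 is a *permitted* outcome, even though
    // t2's load of x comes before its store of 42 in program order. You are
    // unlikely to ever observe it on x86, but the memory model allows it.
    assert!(r1 == 0 || r1 == 42);
    assert!(r2 == 0 || r2 == 42);
}
```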
And the reason is that when you have multiple threads executing concurrently, by default there are basically no guarantees — that's not quite true, but close enough — about what values a thread can read from something another thread wrote, under Ordering::Relaxed. So even though you might think, "well, this happens before this, so how is it possible for them to happen the other way around?" — the answer is that for atomic operations, there is a modification order that is maintained per value. So there's the modification order for y, and there's the modification order for x. For y, the modification order is that it is 0 and then it's 42; for x, it's that it's 0 and then it's 42. When you load a value with Ordering::Relaxed, you can see any value written by any thread to that location. There's no restriction on when that write has to have happened relative to you; the only restriction is if there's a synchronization point between the two of you. The spec talks about this in terms of happens-before relationships — we'll get into those in a second — but for Ordering::Relaxed, there is no guarantee that just because you read or write the same value,
there's no guarantee that you happen after or before anything, relative to some memory ordering. So in particular, this load of x is allowed to see any value ever stored to x. That includes 42, because it's in the modification order of x, and there are no constraints on which range of that modification order is visible to this thread. It might seem like time travel. The other way to think about this — maybe an easier way — is that the compiler is totally allowed to reorder these two operations. Similarly, the CPU is allowed to execute them out of order, just for optimization purposes. The reason it's allowed to do that is that there's no sequencing between the two: this operation does not depend on anything from this operation. It doesn't use r2, it doesn't access x; this one doesn't write into y. So there's nothing that strictly requires them to happen in this order. You might wonder, well, why don't the CPU and compiler just run them in order? The reason is that very often there are significant performance gains from rejiggering the execution of a program: if you run the operations slightly out of order, you might get much better performance, you might utilize memory better. And so under either of these conditions, the reversed execution might happen, and if the code looks like this, it's totally obvious why r2 might be 42, right? And with Ordering::Relaxed, there's nothing that guarantees this won't happen. Think of it as: Relaxed does not establish an ordering between these two operations, or between this thread and that thread, and therefore r2 is allowed to see 42. Does that make sense? It's weird; it almost doesn't make sense.
I know, it's really weird. But if you think about it again, this reordering makes a lot of sense. Imagine you were a programmer working in a huge code base, and you didn't know about these other threads — you're only looking at this in a 10,000-line-long file. Would you ever think twice about swapping these two lines? They don't depend on each other, so why not just swap them? Is there a downside? Probably not, right? But of course, the observable execution outcomes are different — or, in the case of Relaxed, they're not, because these are interchangeable as far as the memory ordering specification is concerned. Yeah, another way to think about this is that it's a bit like speculative execution: the CPU is allowed to run this operation before it runs this operation, speculatively, because it knows it's about to be executed. In this case there are no branches or anything; speculative execution usually comes up when there's a question of whether something might be executed at all. So if there were an if here on something unrelated to r2 or x — like z == 3 — that's where you'd run into speculative execution: the CPU doesn't know whether the condition is true yet, but it still executes this operation, in a way that it can later undo, just so that if the condition is true, it has already started the operation. That's where Spectre and Meltdown come in. But in this case, it's not really speculative execution; it's just out-of-order execution, which today's CPUs are very good at, because it's necessary — it makes programs much faster. "Wait, so when I write code, the CPU can take any instruction that does not depend on others and start with that?"
Yes — sort of. There are a bunch of constraints on what reordering the CPU and compiler are allowed to do and what they're not allowed to do, but in general, they're allowed to reorder anything that doesn't have a clear happens-before relationship, which is the formal way to specify this. Think of your program as a graph of dependencies: if A depends on B, you're not allowed to reorder them, because A depends on B. But if there's no dependence relation between A and B — if there's nothing that says B has to happen first — then they're interchangeable, and that's what the specification gives you. And this is also where it becomes important that memory ordering and memory semantics are not just about the CPU or the architecture; they're also about what the compiler is allowed to do. It could be that the compiler reorders these, or it could be that the CPU executes them out of order, and those are equivalent: as far as we're concerned, all that matters is whether it's legal for them to be reordered. And one of the reasons this might happen is that we might already have y in the exclusive state — we have exclusive access to y already — but we don't even have shared access to x, so the load from x is gonna have to wait for the memory system to do something. But we can do the store right now, because we already have it exclusive, so we do that first, because we can do it efficiently. Okay, so this is an example of the kind of shenanigans that can happen when you're using Relaxed, because Relaxed just doesn't imply any ordering; it doesn't imply that anything happens before anything else. And why does this matter?
Well, here, remember, we're taking the lock with Relaxed, and here's an example of something that might happen. Let's imagine that instead of f, this just did the increment directly — it can't actually do that, because we would have to specialize to usize, but let's imagine that it directly did the += 1. Great. Okay, well, this is Relaxed, so this can just be moved up here: this operates on self.locked, and this operates on self.v. They're different locations in memory, so there's nothing that says this reordering is not allowed. We know it's not okay: if these execute out of order, this is really bad. Now this operation is executing while some other thread might be holding the lock, and we're violating the exclusivity of a mutable reference. This is bad — not okay, compiler; not okay, CPU — but based on the ordering we gave, this is fine. Okay, so how do we fix this? Well, this is where we have to use a different memory ordering than Relaxed; Relaxed is too weak. So let's look at the alternatives. If we go back to Ordering, we see that the next things up from Relaxed are Release and Acquire, and these sound a lot like the terms we use for locks: you acquire a lock and you release a lock. That's no accident: these memory orderings are basically designed for being used in the context of acquiring or releasing a resource. Okay, so let's leave the compare_exchange_weak for a second and first talk about the store. Here's another example of a reordering that's valid with Relaxed: this can just move down there. Why not? Totally fine, all good, no fires here, right? But of course not — this is not fine; we don't want to allow this. So it's not just the acquiring of the lock where we need to ensure that things don't move above it; it's also the release of the lock:
we want to make sure that nothing moves below it. And this is where the Release ordering comes in. So let's switch back to the spec — this is gonna be a little bit bright; actually, let's use this instead, because it's dark mode. So, for the Release ordering, when coupled with a store: "all previous operations become ordered before any load of this value with Acquire (or stronger) ordering. In particular, all previous writes become visible to all threads that perform an Acquire (or stronger) load of this value." And this ordering is only applicable for operations that can perform a store. There's a lot of language there, and all of it is important, but basically what this means is that if we do this store with Release, then any load of the same value that uses the Acquire ordering must see all operations that happened before the store as having happened before the store. There's an additional restriction if we go back to the C++ memory_order page — this will be bright. If we look at memory_order_release: "a store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store." This is an additional sentence that's actually not in the Rust version, which seems unfortunate — it should be in there. But basically, the second part we saw is: nothing can be reordered to after a Release store. Great, so that's solved. The other thing is that, as you noticed, the documentation said all previous operations become ordered before any load of this value with Acquire. So the load that happens as part of this compare_exchange, if it uses Acquire ordering, must see any operations that we did in here. This might sound weird — why does that part matter? Well, if this is Relaxed, then the next thread to take the lock is not guaranteed to see the modifications we made to memory here. This comes back to the stuff we talked about down here with Relaxed, right?
This operation might see stale values. In fact, if we reorder them, this is allowed to see zero, for obvious reasons: we do a store to y, but we load from x, and this thread might not have run in the meantime. But there are more extreme cases too, where this thread just will not see 42 in the time that it executes — even if we looked at the wall-clock time and saw that t1 ran after t2, it might still see zero, because that's all Ordering::Relaxed guarantees. And so, in our case with the mutex, if we keep this as Relaxed, then the modifications we made to memory here may not be visible to the next thread that takes the lock, which is a huge problem. This is not okay. So this needs to be Release, and — for now, let's say — this has to be Acquire. Now we have the two guarantees: this cannot be reordered after this, and anything we do in here must be visible after an Acquire load of the same value. So whoever next takes the lock must see anything that happened before this store. The way to think about this is that the Acquire/Release pair establishes a happens-before relationship between the thread that previously released the lock and the next thread that takes the lock, and that happens-before relationship establishes that anything that happened before the store also happens before anything that happens after the load. And there's an additional part to this: if we go back to the bright spec again, Acquire says "a load operation with this memory order performs the acquire operation on the affected memory location: no reads or writes in the current thread can be reordered before this load." So this cannot be reordered before this load, which is the other property we needed. Nice. So this store has to perform Release.
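The happens-before guarantee the Release/Acquire pair gives you is exactly the classic message-passing pattern. A minimal sketch — a hypothetical flag-and-data example, not the stream's mutex code:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(123, Ordering::Relaxed); // the "protected" write
        // Release store: everything before it happens-before any Acquire
        // load that observes this `true`.
        READY.store(true, Ordering::Release);
    });
    let consumer = thread::spawn(|| {
        // Acquire load: once we see `true`, the DATA write above is
        // guaranteed visible. With Relaxed here, reading 0 would be allowed.
        while !READY.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        DATA.load(Ordering::Relaxed)
    });
    producer.join().unwrap();
    assert_eq!(consumer.join().unwrap(), 123); // guaranteed by Release/Acquire
}
```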
And this load has to perform Acquire. There's also AcqRel — acquire-release. When you pass AcqRel into an operation that does both a read and a write, like compare_exchange — it reads the value, but it also writes the value — AcqRel says: do the load with acquire semantics and the store with release semantics. In our case, the question becomes: do we care whether this store is Release or not? The store in this case is storing that we now hold the lock, and we don't actually need that to have release semantics; the release semantics are established down here. AcqRel is more commonly used when you do a fetch_add, for example — I will get to what those mean in a second — but AcqRel is usually used when you're doing a single modification operation where there's no critical section: you're not going to do a later store or anything; you're just doing one operation that you want to synchronize with other threads. So in our case, this can be Acquire. All right, does the stuff so far make sense? I'll get to this last argument in a second. Let's see. "Is it only Acquire ordering, or Acquire and stronger?" It's Acquire and stronger — SeqCst, sequentially consistent ordering, is Acquire plus Release and more. Great, so it seems like what we went through so far makes sense. So the question then becomes: what is this extra parameter to compare_exchange and compare_exchange_weak? The extra parameter is for when compare_exchange looks at the memory and goes, "this value has changed, so I didn't do the store." This ordering is what ordering the load should have if the load indicates that you shouldn't store. In our case, when taking the lock, you can think of this as: what is the ordering of failing to take the lock?
You could imagine cases where, even if you fail to take the lock, you still want to establish a happens-before relationship. The cases where you need this are pretty rare, but they do exist. In our case, we don't need that: if you fail to take the lock, that doesn't mean you have to synchronize with the thread that last released it. That's not important; all that matters is that the moment you do get the lock, you establish a happens-before relationship between the thread that last held the lock and yourself, because you're about to operate on the stuff in there. So here we can keep this Relaxed. The reason this matters is that if you fail to take the lock, you don't want to then have to do coordination with the other threads — or the other cores, rather — to gain something like exclusive access; it's just not important. Great, okay, so hopefully this now makes sense. Now we have a working example of a spin lock — spin locks never work, but it is a working example of a spin lock. This, in fact, I will tell you, is, I believe, a sound and correct implementation. Not one you should use, but a correct and sound implementation. Now you might wonder: why didn't these problems show up when we ran this with Relaxed? When this was Relaxed and this was Relaxed, we ran the test, we ran the binary, we did lots of iterations with lots of threads and lots of cores. Why didn't anything fail? Why did we still
observe the correct output? The reason is that my machine is an x86_64 machine, and on x86_64 the architecture basically guarantees Acquire/Release semantics for all operations. It gives fairly strong consistency guarantees by default, and you can't opt out of them: the CPU architecture is such that all operations are guaranteed to be Acquire/Release. That's not true for all platforms. On ARM, for example, it is not generally true: if you specify Ordering::Relaxed, you will get relaxed memory semantics. And this is one of the problems with trying to test concurrent code by just running it lots of times: you're still only testing it on your current hardware with your current compiler. Sort of like with undefined behavior, which we talked about in past streams, you can't rely on the current state of affairs being representative of future executions. It gives you some idea (our previous implementation, the obviously wrong one with the load and store separate, did break), but just running it lots of times is not sufficient for testing this kind of concurrent code. We'll talk about how you might do something much better a little later in the stream. "Can this load stay Relaxed?" Yeah, because here, too, we don't need to establish any happens-before relationship, since we haven't taken the lock yet. So this can still be Relaxed. In general, the cases where you want to use Relaxed are when it doesn't matter what each thread sees. For example, if you maintain a counter for statistics or something, you can generally have it be Relaxed: if one thread doesn't see the increment of a different thread, it doesn't really matter, and neither does the counter getting updated slightly earlier or slightly later relative to execution order.
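As a sketch of that statistics-counter case (this uses fetch_add, which comes up properly in a minute; the function and all names here are mine):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Spawn `threads` threads that each bump a shared counter `per` times with
// a Relaxed fetch_add, then return the final total.
fn count(threads: usize, per: usize) -> usize {
    let hits = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let hits = Arc::clone(&hits);
            thread::spawn(move || {
                for _ in 0..per {
                    // Relaxed: no thread needs to observe other threads'
                    // increments in any particular order, and no increment
                    // is ever lost; only the final total matters.
                    hits.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // After joining every thread (which synchronizes), the total is exact.
    hits.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(count(4, 1_000), 4_000);
}
```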
It doesn't really matter, and so there Relaxed is great, because it imposes the least amount of restrictions on the CPU and compiler: the compiler can generate more efficient code, and the CPU can execute the instructions more efficiently. But for anything where the ordering between threads and the relative ordering of instructions matters, you might have to look at stronger orderings. Okay. So now that we've talked about Acquire/Release, let's look, in combination, at the fetch operations and sequentially consistent ordering. Let's first look at the fetch operations. Okay, so if you look down here, you'll see that there are a bunch of fetch operations in addition to load, store, and compare_exchange: fetch_add, fetch_sub, fetch_and, fetch_nand, fetch_or, etc. If you think of load and store as the bread and butter, the nuts and bolts, the very low-level primitives you can build things with, then you can think of compare_exchange as a sledgehammer: either set this value to that value, or tell me that the value changed, and do it atomically. The fetch operations are gentler versions of a sledgehammer. A mallet, or a rubber mallet. More concretely, and perhaps more helpfully: instead of saying "if the current value is this, set it to that", you tell the CPU how to compute the new value. Imagine you have a counter: you read the current counter value, and then you do a compare_exchange of the current value with the current value plus one, which will fail if the counter has changed in the meantime. Instead, you can do a fetch_add, which will never fail. For the counter example, with fetch_add you say: do a fetch_add on
this value, and add one. The CPU is going to get exclusive access to the value, read the current value, and, regardless of what the current value is, add one to it and store it back. This means fetch_add just doesn't fail: it doesn't matter what the current value is, the new value will just be set to one plus whatever it was. The reason it's called fetch_add (or fetch_sub, etc.) is that it also tells you what the value was when it incremented it. This again gets at the race condition between doing the load and the store: fetch_add assumes that you care about what the current value was. You can throw that value away, that's fine, but otherwise there's no way you could learn which value you just incremented from. If you just had an atomic add operation, you couldn't combine it with a load to figure out what the value was that you incremented to or from: if you did a load and then an atomic increment, there's a slight window in between where some other thread could sneak in, and similarly, if you did the add and then a load, there's also a window in between. So fetch_add is a single atomic instruction, but instead of dictating what the updated value should be, you tell it what the operation should be. This is perhaps best illustrated by the fetch_update method, which is a little bit of an ugly duckling. fetch_update takes a closure that is given the current value and should return the new value, so think of it as the generic version of fetch_add and fetch_sub and friends: you could imagine fetch_add being fetch_update with a closure that adds one to its argument.
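A quick usage sketch of fetch_update on an AtomicUsize, assuming the standard library's signature (ordering for the store, ordering for the loads, then the closure):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let a = AtomicUsize::new(41);

    // The closure is handed the current value and returns Some(new value),
    // or None to abort. Like the other fetch methods, on success it hands
    // back the *previous* value.
    let prev = a.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |x| Some(x + 1));
    assert_eq!(prev, Ok(41));
    assert_eq!(a.load(Ordering::Relaxed), 42);

    // Returning None makes it fail with the value it observed.
    let res = a.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |_| None);
    assert_eq!(res, Err(42));
}
```

So fetch_update with `|x| Some(x + 1)` behaves like fetch_add(1, ...), just expressed as an arbitrary closure.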
The reason I say fetch_update is a little bit ugly and weird is that the CPU has built-in support for doing addition atomically, subtraction atomically, and bitwise and atomically; it doesn't have that for an arbitrary closure. So what fetch_update really is is a compare-exchange loop that's implemented for you. If we look at the source for fetch_update: it internally loads the current value and then does compare_exchange in a loop. It really is doing the compare-exchange loop for you. And you'll notice that they correctly use compare_exchange_weak, because it's in a loop already, right? This ties back to what we talked about earlier on the difference between weak and non-weak compare_exchange. But this is why I say fetch_update isn't really the same as the others: in general, a fetch_add will be a single atomic instruction, whereas fetch_update is actually a compare-exchange loop. The reason fetch_update still fits here is that if we go up and look at the documentation for atomic, you'll see under "Portability" that all atomic types in this module are guaranteed to be lock-free if they're available. So, for example, if the platform supports atomic access to u64s, the AtomicU64 type will be available; if it doesn't, it won't be. Lock-free means they don't internally acquire a global mutex. But atomic types and operations are not guaranteed to be wait-free, which means that operations like fetch_or or fetch_add may be implemented with a compare-and-swap loop. The idea is that if the architecture you're compiling for doesn't support an atomic increment, for example, then the standard library on that architecture implements fetch_add with a compare-and-swap loop, a compare-exchange loop. So in other words, if you call fetch_add because you want
to avoid compare_exchange, that is the right thing to do, because it means you get to take advantage of the atomic increment operation if it exists on your architecture; there might just be architectures where it ends up being a compare-exchange loop anyway. Okay, so now hopefully you understand what the fetch operations are for. In general, these are used for things like handing out unique sequence numbers to a bunch of operations that happen concurrently. Rather than having a lock that guards the next sequence number, where you take the lock, read the sequence number, increment it, and release the lock, you just do a fetch_add on an AtomicUsize instead. That guarantees that every call to get a sequence number will get a distinct one, because every increment will happen, and the fetch ensures that you read the value that was there when you updated it. No thread will read the same value, if every thread increments by one, for example. And it will generally be a fair amount more efficient than a mutex, certainly, but also more efficient than a compare-exchange loop on platforms that support it. Okay. That then brings us, finally, to sequentially consistent ordering, and this is going to be another instance of head explosions, maybe. Here we're going to need a different example, because our lock is now correct. The example I'm going to give you is as follows. This is mutex-test. So here's what I'm going to do (sorry while I move around): std::sync::atomic again. x and y are going to be AtomicBools, they're going to start out false, and z is going to be an AtomicUsize. And watch this; this is fairly involved, to demonstrate the difference between Acquire/Release and sequentially consistent.
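Here is my transcription of the program being set up, wrapped in a function so the experiment can be repeated; the variable names follow the video, but the exact layout is my reconstruction:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// One round of the experiment; returns the final value of z.
fn run() -> usize {
    let x = Arc::new(AtomicBool::new(false));
    let y = Arc::new(AtomicBool::new(false));
    let z = Arc::new(AtomicUsize::new(0));

    let tx = {
        let x = x.clone();
        thread::spawn(move || x.store(true, Ordering::Release))
    };
    let ty = {
        let y = y.clone();
        thread::spawn(move || y.store(true, Ordering::Release))
    };
    let t1 = {
        let (x, y, z) = (x.clone(), y.clone(), z.clone());
        thread::spawn(move || {
            while !x.load(Ordering::Acquire) {} // spin until x is true
            if y.load(Ordering::Acquire) {
                z.fetch_add(1, Ordering::Relaxed);
            }
        })
    };
    let t2 = {
        let (x, y, z) = (x.clone(), y.clone(), z.clone());
        thread::spawn(move || {
            while !y.load(Ordering::Acquire) {} // same thing, reverse order
            if x.load(Ordering::Acquire) {
                z.fetch_add(1, Ordering::Relaxed);
            }
        })
    };

    for t in [tx, ty, t1, t2] {
        t.join().unwrap();
    }
    z.load(Ordering::SeqCst)
}

fn main() {
    // With Release stores and Acquire loads as written, z may be 0, 1, or 2;
    // the discussion that follows is about which of those are really possible.
    assert!(run() <= 2, "more than two increments is impossible");
}
```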
This is also the same example that's used on the C++ memory_order page, except translated into Rust. We're going to have one thread that stores true to x with Ordering::Release (remember, Release is for stores and Acquire is for loads), and another thread that does a Release store to y. Then we're going to have one thread, t1, that loops while !x.load(Acquire), so it acquires in a loop, and then checks y, again with Acquire; if y is true, it fetch_adds one to z, and that fetch_add is going to be Relaxed. Then we're going to have a t2 that does the same thing except in reverse order: t2 spins until y becomes true, and then, if x is true, it adds one to z. Down here we wait on t1, and we wait on t2, and then z is going to be z.load. Here I'm just going to use sequentially consistent, for uninteresting reasons; I just want to read whatever the final value of z is, so I'm going to do it this way. This ordering you can ignore. And the question is: what are the possible values for z? Which essentially boils down to: is zero possible? Is one possible? Is two possible? More than two should not be possible, because there are only two increments; that would be insane, so we're going to assume that's not possible. Let me zoom out a little so we can fit the whole thing on screen. All right, I'll give you a second to consider this code, and I'll check chat in the meantime. "If some things are implemented under the hood with compare_exchange, why doesn't everything return a Result?"
Because fetch_add always succeeds; the fetch methods always succeed. This is why, if it's implemented with compare_exchange, it does compare_exchange_weak in a loop until it succeeds. So fetch_add will never fail, and therefore it doesn't need to return a Result. Okay, so let's work through this, and I'm going to do these from last to first: I'm going to start with whether two is possible and then go back up. "Is the mutex panic-safe?" That mutex is panic-safe, but it won't propagate panics. All right. So two is clearly possible, and just to demonstrate why, let's call these threads tx and ty, because they respectively store true to x and y. Two is possible under the following execution: tx, ty, t1, t2. Imagine we just had one core and the threads ran one at a time. This is possible, right? tx runs and sets x to true. ty runs and sets y to true. t1 runs, immediately sees that x is true, and, after observing that x is true, sees that y is true, so it increments z; z is now one. Then t2 runs: it immediately sees that y is true, then sees that x is true (both because tx and ty already ran), and so it increments z by one. Now we get z equals two. Okay, so this is trivially possible. Okay, is one possible? One should also be possible: say we run tx, then t1, then ty, then t2. With this kind of execution, when t1 runs, x is true but y is false. t1 waits for x to become true, which it is immediately because tx already ran. It tries to load from y, but ty has not run yet, so y is false, so it does not increment z. Then ty runs, and ty stores true to y. Then t2 runs: t2 observes that y is true, so it immediately exits the while loop; then it checks whether x is true, and x is true, so it increments z by one. Now z is one. Great.
So one is a possible outcome for z. Zero is more complicated. Let's try to work through it the same way we did one and two: basically, can we find some execution of threads where the outcome zero is possible? Well, we have a couple of restrictions. We know that t1 must run after tx. The reason we know that is that if you run t1 first, t1 is going to spin until tx runs; it has the spin loop. So even if t1 is scheduled first, it's just going to wait for tx anyway: it has to execute after tx in order to complete. Similarly, we know that t2 must run after ty, and that must be the case for the same reason: if ty hasn't run and t2 tries to run, it's just going to sit in a loop, and at some point it gets preempted, or ty gets to run on another core; then t2 observes that y is now true, exits the while loop, and completes. So we have these two restrictions. Given that, what execution would allow z to be zero? Well, let's just arbitrarily pick: tx goes first, and at some point later t1 goes. Great. So we have this execution where we don't know where the other threads run, but we know this much must be the case. Okay, now let's try to place t2 in different locations. If t2 goes first, then we know that ty must go before it, and in this execution t1 will increment z, because t1 runs after x and y have both been set to true. Okay, back to the drawing board; let's try another one. Say t2 goes after tx. Well, ty still has to go before t2 for t2 to get to do anything useful. So whether ty goes here or there,
it might be either of those, but in either case t1 and t2 will both increment z, because they both go after the things that set x and y. Okay, what about if we place t2 at the end? Well, if we place t2 at the end, ty could go just before it, but then tx has already run, so t2 will increment z. If ty went earlier, t2 would still increment z. So given these restrictions, there's no place we can put t2 in this kind of execution order such that one of t1 or t2 won't increment z. So it seems impossible to have a thread schedule where z equals zero. Does the walkthrough make sense so far? It seems impossible. Okay, and now this is where it gets super weird: the way we've currently written this, zero is possible. This may not come as a huge shock, but it is possible, and here's why. It is true that there's no thread schedule that would allow z to be zero. But we're not bound by thinking about thread schedules; thread schedules are just the human desire to put things in order. In reality, computers don't have to have a single order that things run in: you have multiple cores, and those cores can show old values and new values. All we're subject to are the exact rules of Acquire and Release semantics, which is what we've given here. So go back to the modification order of x. We talked about how there's a modification order for each value: the modification order for x is false, then true, and the modification order for y is similarly false, then true. So let's look at what happens when t1 runs. t1 runs, and it observes that x is true. Here's a valid execution: t1 observes true from the modification order of x. That tells us a couple of things. Remember, the documentation for Acquire and Release said that if you observe a value
with an Acquire load, you will see all operations that happened before the corresponding Release store. Well, the corresponding Release store is here, in the tx thread. There are no operations prior to that store, but if there were, we would be guaranteed to see them here, because we're synchronizing with tx. But there are none. When we then get to the load of y, this Acquire synchronizes with whichever store produced the value it gets. There's no requirement that that's any particular store of y. If there had been a store of y in this thread, if tx did a y.store, we would have to observe that y.store, because it happened strictly before this store, which happened before this load, because of Acquire/Release. So if there were a store here, we would have to see it. But there isn't, so we're allowed to see any value for y: we're allowed to see this initial y, or we're allowed to see the y that ty stored, regardless of whether ty has run or not. Even if ty has already run, strictly speaking in wall-clock time, it doesn't matter: the memory system is allowed to still show us false. That is, t1 is bound to get true from the modification order of x, but it's allowed to see either value of y, regardless of whether ty has run. And the same thing applies to t2: this load synchronizes with this store, so after this load we must see all memory operations that happened before the store to y. There aren't any; great, we're done. There's no requirement that we see anything that happened over in tx. There's no relationship to that thread. Well, technically, I'm lying a little bit here.
When you spawn a thread, these threads all happen after the main thread that spawned them, so we must actually see this initial false. It's not like we could see a value written somewhere else independently: we must see this false, or anything that happens later. Imagine that the main thread did an x.store(true) before spawning: the loads down here would have to see that store, because it happened before them, because this thread spawned these. So when t2 runs, even though it will synchronize with the ty thread, and therefore must see any previous writes there, there's no requirement that it sees any particular operation that happened to x, because there's no happens-before relationship between the store in tx and the load down in t2. There just isn't any. And this is not about reordering. Remember how an Acquire load says you're not allowed to move any operation from below it to above it? The compiler and CPU are not allowed to reorder these; that's guaranteed by the Acquire semantics. So that's not what's going on. It's just that this load is allowed to see any previous value for x, subject to happens-before, which does not include the operation in tx. So t2 must see a value from x's modification order, but it can see either of them, and t1 can see either value of y, and that's still valid. And if that happens: you could imagine that both tx and ty run, then t1 runs. It observes x being true. It does not observe y being true, even though ty has run, because there's no happens-before relationship there. Imagine the false is already in a cache in the CPU or something: it just uses that value; it doesn't bother to check again, because it's allowed not to. So the value it sees is false, even though ty ran, and therefore it does not increment z. t2, meanwhile, spins until y is true.
Say it observes that y is true; great. There's no requirement that it observes that x is true, even though tx has run, so it does not increment z. Therefore z is zero. So this is weird, right? This is a really weird way to think, but the way to get at it with your brain is that Acquire/Release, and in general all memory ordering, is about which concurrent things happen before which other concurrent things, and if there's not a happens-before relationship between two operations, then it's up in the air whether one sees the other. ("The 'seems' was a giveaway." Yeah, you're right.) Okay, so you might wonder: why is this allowed? Why do the designers of languages, memory systems, CPUs, and compilers have this be possible? And the answer is: because if you look at something like a mutex, this is fine. It doesn't cause any problems, and it gives the CPU and the compiler more leeway to rearrange your code. Imagine, for example, that x and y are just independent variables. Why should we establish an arbitrary connection between them when there really isn't one? If we did, then suddenly we're enforcing that the CPU and the compiler must execute things in order, must get exclusive access for certain operations, and it's just technically not necessary. So it would be imposing a cost that you can avoid. Think of it as: if you want stronger guarantees, you have to opt into them, because by opting into them
you're also opting into the cost of them. If every operation enforced sequential consistency, you would have no way to opt out of it. Instead, you can imagine that you have a default that you can override, and that's exactly what Rust gives you: every operation takes an ordering, you must provide an ordering, and if you don't want to think about these problems, you can just always provide sequentially consistent, Ordering::SeqCst. You can always do that. But if you do, you're also incurring the cost that comes with it. All right. So the question now becomes: how does sequentially consistent ordering change this? Well, if we go back to the memory-ordering documentation, let's see what the Rust docs say. SeqCst (I don't know how to pronounce the abbreviation; "seq-cst", maybe) is like Acquire and Release and AcqRel, with the additional guarantee that all threads see all sequentially consistent operations in the same order. And notice that this says all sequentially consistent operations. It's not all sequentially consistent operations on this memory location.
It's all of them, everywhere. So if we go back and change this program to be sequentially consistent for all of these (we can leave the counter as Relaxed, because it doesn't matter), if we make these all sequentially consistent, zero is no longer possible. And the reason is that if this thread observes that x is true and then y is true, it establishes an ordering between this load and this load: some thread observed that x was set to true and then y was set to true, and that means no thread is allowed to see x being false after y being true, because some thread saw the opposite ordering. That's what the text is trying to get at: there must exist some single order of the sequentially consistent operations that's consistent across all of the threads. If some thread sees that x happened and then y happened, then no thread is allowed to see x not happen even though y has happened. So in this case, if t1 here ends up with x is true and then y is true, then over here, if y is true, x must be true: t2 is not allowed to see x being false, because that would be inconsistent with the ordering that these sequentially consistent operations saw. Notice, though, that sequentially consistent ordering only holds in relation to other sequentially consistent operations. So if, say, this store were Release and this store were Release, that would not give us the guarantee we needed (actually, this one might; but if this load were Acquire, it would not), because then we're allowed to see x being true here and x being false there, with no ordering relation between them: no sequentially consistent ordering has been seen by a thread with x true and then y true. So this is subtle stuff. But in general, sequentially consistent only really interacts with other sequentially consistent operations. Acquire/Release does interact with sequentially consistent in the sense that sequentially consistent is always stronger than Acquire/Release: if
you have a store that's Release and a load that's sequentially consistent, then the sequentially consistent load will indeed happen after the Release store. All right, that was fairly involved, but basically: sequentially consistent is Acquire/Release with the additional guarantee that all sequentially consistent operations must be seen as happening in the same order on all threads. And therefore, in this particular example, if all of these are sequentially consistent, then z equals zero is not possible. All right. Nice, we did that on exactly the two-hour mark. Very good. I'm going to talk a little bit about testing shortly (like, how do you avoid making these mistakes?), but I'm very happy with the timing. Okay, so as we've seen, memory ordering is really subtle. It's just hard to make sure you got it right. Or rather, I think it comes down to the human brain being really bad at emulating all of the legal executions. And as we talked about, you can test this by running your code lots of times in a loop, or by making your computer really busy with other tasks to cause more weird interleavings to happen, but that's not a very reliable way to do this kind of testing, because it might depend on the architecture, it might depend on the operating system scheduler, it might depend on the compiler, it might depend on which optimizations you have enabled for that compiler. It might just depend on how likely the thing is to happen.
Maybe you need to execute your code billions and billions of times in order to get that one execution where the code is wrong. And so surely, you think, there must be a better way. The other thing to keep in mind is that there might be a problem in your code that doesn't cause a panic. Think about the counter we had early on with our mutex, where the values didn't add up to a hundred thousand or whatever the total was; it added up to something slightly less. The only reason the program crashed, the only reason we noticed, is that we had an assertion that checked the value was right at the end. But if you have a program that just does a bunch of operations, takes locks, and assumes that everything is right, it might not be obvious that something went wrong. If the lock had this kind of bug in it and two things ran the critical section at the same time, maybe you write a log entry twice, maybe the log gets truncated, maybe your backup ends up empty. Who knows what happens when a mutex just isn't a mutex? It's very hard to predict, and it might not crash your program; it might not be detectable immediately. And that's scary. It also means that even if you ran it for ten years, on lots of cores, on lots of different hardware, your code might hit the bug, but nothing notices that you hit it. These two problems, how you make sure that you test all the possible legal executions, and how you know if something bad happened, are sort of separate problems, and they're often handled a little differently. In general, for the second one: how do you detect these bugs?
When writing concurrent code like this with atomics, you want to stick in a lot of assertions, just to make sure you did the right thing. You can make them debug assertions if you want, and then run your test suite in release mode but with debug assertions turned on, or something like that, so they don't impact release builds too much. But you do need ways to detect these problems. There are also some automated systems. For example, Google has a bunch of different sanitizers that are now built into a lot of compilers (I think both Clang and GCC, and probably others), and there's a nightly flag to use them in Rust. One of the sanitizers is ThreadSanitizer. ThreadSanitizer runs your code in an instrumented way, where every load and store in your program, every memory operation, gets tracked: special instructions are added to track what it modifies and when. Then, as your program runs, ThreadSanitizer detects in the background if you ever have two threads that write to the same memory location, or a thread that writes to a memory location after another thread has read from it, with no happens-before relationship between them. So it basically tries to detect when you have unsynchronized operations on a single memory location. It can't detect all problems; there might be logic problems you don't detect this way. And if you look at the documentation for the ThreadSanitizer algorithm, towards the bottom they give the state machine and the assertions they check, so the algorithm is specified, but there are things that aren't detected yet. There are problems
you can't detect this way. Also, ThreadSanitizer works by running your code, and it detects whether that particular execution of the code did anything bad. It might be that this particular execution was fine, but you still have a concurrency bug. And this comes back to: how do you make sure that you exercise every legal execution of your program? That's much harder to solve, because you need to figure out how to explore all the possible behaviors on all possible compilers, CPUs, and architectures, according to the spec. Now, the best tool I know of for this is a tool called Loom. Loom is a Rust project that implements a paper, the CDSChecker paper, which was written for C and C++ atomics but translates to Rust pretty well. Basically, what that paper outlines and what Loom implements is a strategy for taking a concurrent program and instrumenting it. And this is not automatic instrumentation: it's "use Loom's synchronization and atomic types instead of the ones from the standard library."
So: use the Loom AtomicUsize, the Loom Mutex, the Loom channels, and whatnot. Use those instead of the standard library ones. Then what Loom does is: you pass it a test closure, and it runs that closure multiple times. Every time you do a load through one of the Loom data structures, it feeds you back one of the possible legal values; when you do a store, it keeps track of all the values that have been stored, so that on other loads it can expose your execution to one of those possible values. Every execution of the closure exposes you to a different possible execution. So as Loom executes, it will run all possible thread interleavings, all possible memory orderings. Remember when we talked about the modification order for each variable, where t1 was allowed to see either value for y and t2 was allowed to see either value for x? Loom will make sure that there's one execution where t2 sees the false value for x, and one execution where t2 sees the true value for x. And in general it will do this for all of the operations in your program. If we look through the example here, it's going to be pretty illustrative of how this works. You see here that we use a bunch of the atomic types, but we use them out of Loom instead of out of the standard library. The idea is that when you're running your Loom tests, you use the Loom primitives, and when you're building your code for release, or for a standard test suite, you use the standard library primitives. There's a bunch of explanation in the documentation of how you go about doing this, but the basic idea is that if you write a Loom test, everything in your test (that includes inside the library, and inside any libraries that the library uses, all the way down) should use the Loom primitives instead.
Then, inside your test, you call loom::model. loom::model is the function that tries all these different execution paths: you pass in a closure, and inside that closure you write the code that you want to test. Loom will then call that closure over and over and over again. For example, down here, this thread does a load with Acquire; every time the closure executes, this load might get a different value. Or you see that there are lots of threads: each time through, which threads execute in which order will differ, and loom will make sure that it totally explores all the possible legal executions. Of course, the downside of this is that there might be an insane number of such possible executions. Imagine you have 10 threads that each do 10 stores and 10 loads; the number of legal interleavings under the memory specification you're using grows combinatorially, and you end up testing every possible one, which can be insane.
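As a rough sketch of the shape of such a test (this assumes the external `loom` crate is available as a dev-dependency; the module paths mirror std, but the exact API surface may differ between loom versions):

```rust
// A minimal loom test sketch. Requires the external `loom` crate;
// the types mirror std::sync, but come from loom instead.
use loom::sync::atomic::{AtomicUsize, Ordering};
use loom::sync::Arc;
use loom::thread;

#[test]
fn concurrent_increments() {
    // loom::model runs this closure once per possible legal execution.
    loom::model(|| {
        let v = Arc::new(AtomicUsize::new(0));
        let v2 = Arc::clone(&v);
        let t = thread::spawn(move || {
            v2.fetch_add(1, Ordering::SeqCst);
        });
        v.fetch_add(1, Ordering::SeqCst);
        t.join().unwrap();
        // This must hold in every interleaving loom explores.
        assert_eq!(v.load(Ordering::SeqCst), 2);
    });
}
```

Running this under `cargo test` has loom drive the closure through each possible interleaving of the two `fetch_add`s, and the assertion is checked in every one of them.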
So loom is somewhat limited in that respect; not because of the implementation, just because of the fundamentals: there are that many possible legal executions. Loom implements a bunch of techniques from the paper we just saw that look at reducing that number. For example, imagine that a thread does two loads of two different variables. Even though the memory model allows those loads to be reordered, the correctness of your program is unlikely to depend on whether they happen in order, so you only need to run one of the two orderings. (This is a bad example, because in the case we just looked at the load ordering might have mattered, but you get the idea.) The paper basically has a complete specification of which possible executions cannot differ from one another and can therefore be eliminated, so you only need to run some of them, and loom implements a bunch of those reductions to try to help. But ultimately there's a limit to how complex your test cases can be when you run them under loom: you can only have so many modifications to a given atomic variable, so many threads, so many instructions. Even so, loom is the best tool I know of to try to make sure that your concurrent code is actually correct under only the assumptions that the memory model gives you, rather than the assumptions that the current compiler, optimization level, or CPU might give you. All right, let's do some questions before I do more on loom. "Are there any toy programs one can think of to try and drill this into my head? The spin lock was a pretty good example, but this last example seems crazy sauce." Yes; so for the example distinguishing acquire/release from sequentially consistent, I don't have a less contrived example for you. My recommendation would be to look at some papers that implement concurrent data structures, and notice where they use sequentially consistent ordering as opposed to other orderings, and then
generally the paper will explain why. I think it's hard to come up with a simple motivating example for one; the one I just showed is the simplest I can come up with, and even though there's little code, the thinking and reasoning is still complex. I don't have a good concrete example where sequentially consistent is necessary. Let's see... yeah, someone pointed out that the name "loom" actually makes a lot of sense: it spins threads, in many ways. So, loom has some limitations. Some of them are known problems; some of them are more fundamental problems with the approach, or are just impossible to model. For example, Relaxed ordering is so relaxed that you can't fully model all the possible executions, even with something like loom. One example of this is in the Relaxed example; let me pull that up. Remember here, where this load might see 42, which is stored by this store? Imagine how you would try to emulate that. Imagine you pass this into loom. This method call occurs first, so if this is a load from a loom AtomicUsize, loom has to produce some value for that load. But it doesn't know about 42 yet, because that store method call hasn't happened yet. The ordering Relaxed allows 42 to be returned from this load, but loom doesn't know that 42 is even a possible value yet.
So how can it return it from the load? Now, if you go read the paper (and it's a fascinating paper), there are ways to do this. Because you know that you're executing the closure many times over, you can execute the closure once and remember all the relaxed stores you saw; the next time through, you return from a load a value that will be returned from a later store, continue executing, and see if that store still stores that value. If it does, you've successfully modeled the sort of reverse dependency; if it doesn't, you discard that whole execution, because that race no longer happened. It's real weird; read the paper, it's fascinating. But fundamentally, without some really fancy tricks, modeling Relaxed properly is just very difficult. Loom is a really cool and helpful project, and I know there's a lot more we could do with it. I know that the project is looking for contributors, so if you're interested in this concurrency stuff, please go contribute to loom; I would be so happy to see that project do even better than what it does now. There's another example of where this kind of contribution could be helpful: loom currently doesn't really model the sequentially consistent ordering, in that it won't enforce... let me restart that sentence. As we saw, sequentially consistent ordering is acquire/release plus some more guarantees. Loom doesn't currently model those additional guarantees; it runs sequentially consistent as if it were acquire/release. The reason for this is partially that implementing the additional guarantees of sequentially consistent ordering is nontrivial.
It's actually fairly complicated. The paper talks a little bit about this too, but the translation and implementation is fairly complex. What loom does is downgrade every sequentially consistent operation to acquire/release, and that means loom won't miss any bugs, because acquire/release is weaker than sequentially consistent, but it might tell you that your code has a bug when it doesn't. So it won't give you a false negative, but it might give you a false positive, which is probably what you want, but it's also unfortunate. There's an open issue for it; there's even a test case for it in the code, but it's currently ignored. This is something I know is high on the priority list to fix, because if you have an algorithm that depends on sequentially consistent ordering to be correct, loom can't currently tell you that it's correct, since it'll downgrade that ordering. It's something very worth fixing. But what's important is that loom does model acquire/release correctly, which is generally the thing you're not able to test well on something like an x86. So, in the case where this was still Release, Release, Acquire, Acquire: let's say we kept this code the way it was and added an assert that z != 0. Remember, z == 0 is a possible outcome with acquire/release, but it's not a possible outcome with sequentially consistent. If you ran this through loom, loom would correctly cause that assertion to panic; that is, it would actually explore this possible case with acquire/release. So in other words, loom will catch real errors, and it will catch some of the ones that are the hardest to actually reproduce. There are some cases where it will give you a false positive, but hopefully those will be fixed. And similarly for Relaxed ordering, there are some orderings
it just cannot model. So arguably the answer is that you shouldn't be using Relaxed for anything critical in a data structure anyway; but if you do, loom doesn't quite capture it. I actually recently pushed a large documentation revamp to loom, so much of what I'm talking about now is actually written in the loom documentation. I highly recommend you read it: it explains how loom works, how you use it, what its limitations are, and how you can work around those limitations. I think the documentation should be fairly good at going through all of this. And if it inspires you enough to go contribute to loom, please do, especially things like documentation; now that you've watched the stream, you should be in a really good position to document a lot of the things that loom exposes. And yeah, someone asked in chat whether Tokio uses loom and whether it has caught serious issues, and it really has. I know that Tokio uses loom especially for the scheduler, which does a lot of lock-free business, and it's caught several critical bugs there, which luckily, as far as I'm aware, never made it into production.
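Going back to the z == 0 discussion for a moment, here's a self-contained std sketch of that kind of test (the variable names are mine, not from the stream). With SeqCst on all the flag operations, there is a single total order over them, so at least one of the two reading threads must see the other flag set; if the stores and loads were Acquire/Release instead, both readers could see false and z == 0 would become a legal outcome.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);
static Z: AtomicUsize = AtomicUsize::new(0);

fn run() -> usize {
    let threads = vec![
        thread::spawn(|| X.store(true, Ordering::SeqCst)),
        thread::spawn(|| Y.store(true, Ordering::SeqCst)),
        thread::spawn(|| {
            // Wait until X is set, then check Y.
            while !X.load(Ordering::SeqCst) {}
            if Y.load(Ordering::SeqCst) {
                Z.fetch_add(1, Ordering::Relaxed);
            }
        }),
        thread::spawn(|| {
            // Wait until Y is set, then check X.
            while !Y.load(Ordering::SeqCst) {}
            if X.load(Ordering::SeqCst) {
                Z.fetch_add(1, Ordering::Relaxed);
            }
        }),
    ];
    for t in threads {
        t.join().unwrap();
    }
    Z.load(Ordering::SeqCst)
}

fn main() {
    // Under SeqCst there is one total order over the flag operations, so at
    // least one reader must observe the other flag as true: z cannot be 0.
    assert!(run() >= 1);
}
```

Because loom currently downgrades SeqCst to Acquire/Release, a loom version of this test could report the z == 0 outcome even though real SeqCst forbids it, which is exactly the false positive described above.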
It's hard to say for sure, but yeah, loom has absolutely caught real issues. One thing you run into with loom is that the test cases have to have a state space that's reasonable to explore, just because it tries every possible execution. You sometimes have to run them on limited use cases, and sometimes there are more intricate use cases that could still trigger a bug. But I've found that generally it can explore a state space that's large enough to still catch the interesting cases, and the worst ones. Loom also requires that execution is deterministic: if you did a syscall or something in the closure, loom would re-execute the syscall each time, so you may have to do some mocking, just like with normal tests really, in order to be able to use loom efficiently. But apart from that, I've been very happy with loom for concurrency. I think that's all I wanted to go through; let me see if there's anything more in loom that you need to know about... yeah, see, here the Relaxed ordering example is even given as an example of a limitation, something loom can't model. Sweet, I think that's all I wanted to cover. Now we're sort of at the tail end of the stream: are there any questions, now that we've been through all the weirdness and hopefully all the explanations? Is there anything that's still unclear that I can try to help with more explanation on? Oh, memory barriers, that's a good one, good call. So let's go to the atomic module. If we look at the atomic module in the standard library, you see that there are all these atomic types, there's the Ordering enum, and there are constants. These constants are from before the `new` method on the atomic types was const; back then, they were the only way you could get a const atomic. That's the other way to share one between threads, right?
Instead of sticking it in an Arc or leaking a Box, you can just make a constant. The module also has three functions. There's spin_loop_hint, which probably should never have been in the atomics module at all, which is why it's now deprecated and has been moved to the hint module. It's not really an atomic instruction; it was only there because it's often used in atomic contexts. In our spin lock, for example, you might want to call it in the while loop. I'm not going to go into exactly what it means; it's not terribly relevant. Then there's compiler_fence, which is weird. A compiler fence is a way to ensure that the compiler won't move a given load or store; or rather, not just a load or store: you place a fence, and the compiler is not allowed to move operations from below the fence to above it, or from above the fence to below it, within that thread's execution. This is only for the compiler; the CPU can still execute things out of order. So you very rarely need compiler_fence specifically; it's mostly used for preventing a thread from racing with itself, which can really only happen when you're using signal handlers. In general, you won't need it. And then finally, there's fence. Fence is important.
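As a quick aside before getting to fence: here is roughly where the hint goes in a spin lock like the one from earlier in the stream (this is my sketch, not the stream's exact code):

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        // Acquire on success pairs with the Release store in unlock().
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            // Not an atomic operation at all: it just tells the CPU we're
            // busy-waiting, which is why it now lives in std::hint rather
            // than in sync::atomic.
            hint::spin_loop();
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}

fn main() {
    let l = SpinLock::new();
    l.lock();
    l.unlock();
    l.lock(); // relocking after unlock succeeds immediately
}
```

The hint doesn't affect correctness at all; it only lets the CPU back off while spinning, for example via the PAUSE instruction on x86.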
Fence is basically an atomic operation that establishes a happens-before relationship between two threads, but without talking about a particular memory location. Remember that when we talked about load and store with acquire and release, those synchronize on a particular memory location: a load-acquire happens after a store-release of the same memory location. A fence is a way to say: synchronize with all other threads that are doing a fence. So if we go back to the memory ordering document, which has really detailed instructions on what exactly these happens-before dependencies mean... if we go down to happens-before... where is the release/acquire section... oh yeah, here's also the note that on certain systems, release/acquire ordering is automatic for the majority of operations; only weakly-ordered systems like ARM need special instructions for it. Let me see if I can find fences... I feel like there used to be a section on fences here. I guess there isn't; that's too bad, but the documentation on fence itself is pretty good too. Basically, a fence synchronizes with another fence if there are memory operations before and after the fences that happen to synchronize. So a fence is basically a way to say: synchronize a little bit later, or a little bit earlier, without doing the actual load and store there. Again, I recommend that you actually read through the documentation here to understand what it's for, but essentially a fence is a way to establish a happens-before relationship between two threads where the happens-before is moved a little bit relative to the actual memory access. You're right, the atomic has to be a static, not a const; you can only assign const values to statics, is what I meant. Oh, volatile. Okay, so someone asked about volatile.
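Before getting to volatile, here's a sketch of that fence-to-fence pattern (the names are mine). Neither relaxed operation on the flag orders anything by itself; it's the pair of fences, synchronized through that flag, that makes the earlier store visible:

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn run() -> usize {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // Release fence: everything above it happens-before anything a
        // thread sees after an acquire fence that synchronizes with it.
        fence(Ordering::Release);
        READY.store(true, Ordering::Relaxed);
    });
    // Spin until the flag is observed.
    while !READY.load(Ordering::Relaxed) {}
    // This acquire fence synchronizes with the release fence above,
    // because the relaxed load just observed the relaxed store after it.
    fence(Ordering::Acquire);
    let v = DATA.load(Ordering::Relaxed);
    producer.join().unwrap();
    v
}

fn main() {
    // Guaranteed, not just likely: the fences establish happens-before.
    assert_eq!(run(), 42);
}
```

This is equivalent to making the flag store a Release and the flag load an Acquire; the fences just hoist the ordering out of the individual accesses, which can, for example, cover several relaxed operations at once.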
Okay, so notice that the atomic module has no mention of volatile. There's nothing volatile in here, and that's for a good reason: volatile is just unrelated to atomics. But since it's a common question: there are the operations read_volatile and write_volatile, which exist in the pointer module in Rust. A volatile read sounds like it has to do with atomics, because it's often explained as ensuring that you go to memory. So people think: maybe the reason concurrent operations might race is that one CPU does an operation in a register rather than in memory, but if it goes to memory, surely other threads must see it, and therefore I'm going to use volatile. That's not really what volatile is for, and volatile also doesn't establish happens-before relationships across thread boundaries. What volatile is for is when you're interacting with memory-mapped devices. Imagine, say, your network card, which has to send and receive packets. It maps itself into a particular region of memory, at a particular range of addresses, and let's say the packet queue is mapped into that region. It might be that reading from a part of that memory ends up modifying the memory. An example of this: many memory-mapped devices have regions of memory where there's a ring buffer, with a pointer to the head and a pointer to the tail of the ring buffer. If you don't know what ring buffers are, just bear with me for the next three or four minutes.
Usually the head and the tail dictate where new writes are added and where the reader is reading from, and these might be different threads: one operates only on the head, one operates only on the tail. And often these device-mapped memory regions have the side effect that if you read from the tail, you also reset the tail, or there's some other side effect of reading. In those cases, imagine that you read twice from a given variable that lives in memory-mapped memory. If you didn't use read_volatile, the compiler would probably do the first read, realize that reads are read-only, and cache the first read into a register, so the second read just reads from the register. But in reality you need both reads to go to memory in order to have the side effect on the device-mapped memory. That is what read_volatile is for: it's a way to tell the compiler that the access must go to memory, and that the operation cannot be reordered relative to other volatile operations. A volatile operation cannot be moved up or down by the compiler, for example, because it might actually have side effects. There's a section on this (sorry, bright screen again) at the bottom of the memory ordering documentation, which talks about the relationship between atomics, volatile, and memory ordering, and the answer is basically: they don't interact. They're not for the same purpose, even though sometimes it might seem that way. Okay. Great, I think that's all I wanted to cover on atomics. Is there anything that still feels unclear, anything you'd like me to go over one more time, anything you feel like I haven't touched on? Now's your time. Can you tell I'm going a little hoarse?
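To ground that: since there's no portable device register to poke at in an example, here's the read_volatile API demonstrated on ordinary memory, where volatile reads behave like plain reads (the "register" value here is made up):

```rust
use std::ptr;

fn main() {
    // On ordinary memory, volatile reads return the same values plain
    // reads would. The difference is what the compiler may NOT do: elide
    // either read or fold the second into the first, which is exactly
    // what a device tail pointer with read side effects relies on.
    let fake_register: u32 = 0xDEAD_BEEF;
    let first = unsafe { ptr::read_volatile(&fake_register) };
    let second = unsafe { ptr::read_volatile(&fake_register) };
    assert_eq!(first, 0xDEAD_BEEF);
    assert_eq!(second, first);
}
```

With plain reads, the compiler is free to keep the first value in a register and reuse it; with read_volatile, both reads must actually be performed.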
That happens when I talk for like two and a half hours. I have to wait about ten seconds for YouTube comments to come in, in case anyone has questions from there. So for those of you watching this on video on demand, it might seem weird when I ask "are there any questions" and then just sit there for twenty seconds; it's the delay for YouTube comments coming back to me. "Anything particular about AtomicPtr?" Okay, so AtomicPtr might seem like it's a little different from these other types. The other types are sort of primitive types; they seem simple, they're numbers and booleans. But AtomicPtr is not really special: you can think of it as an AtomicUsize where the methods are specialized to the underlying type being a pointer. In fact, my guess is if we look at the source for this... oh, it does actually store a pointer now, nice. But you see that AtomicPtr doesn't have fetch_add, for example, because that's not an operation you... well, technically you can do a fetch_add on a pointer using assembly, but it's probably not going to do what you want; you run into all sorts of undefined behavior, and you lose pointer provenance if you start thinking about those things. It's complicated. But basically it's a specialized version of AtomicUsize that operates on pointers and keeps pointer provenance. You see it provides load, it provides store, and it operates on pointers, which are represented sort of as usizes in memory, if you will. It also does compare_exchange, and it has fetch_update. Remember, fetch_update is really just a compare_exchange loop; it's sort of a nicer interface to doing a compare_exchange loop. So yeah, there's really nothing that special about AtomicPtr, and it still requires unsafe to turn that pointer, because it's a raw pointer, into a reference. There's one more thing actually that I can mention while I'm at this. Most of the atomic types, you will see,
have a get_mut, which takes a mutable reference to self and gives you a mutable reference to the inner type. This is safe because if you have an exclusive reference to the atomic itself, then no one else has a reference to that atomic, so you don't need any special memory ordering or atomic operations on it. That's what get_mut does. In fact, AtomicUsize could implement DerefMut; it just can't implement Deref, which is kind of interesting. I wonder if it implements AsMut... no. DerefMut extends Deref, so you can't implement DerefMut without Deref; so even though technically this type could provide that kind of access, it doesn't. "How do atomics interact with regular stores? If I cast an atomic to a mutable pointer to u32 and pass it to C?" Well, in C you also have the option of doing atomic operations using memory_order, in which case it would be subject to the same rules, although I don't actually know how the compiler's reasoning carries across the FFI boundary. If you did an atomic operation in C, like a store with memory_order_seq_cst, that ordering would still be enforced on that operation. Think of it this way: it's not that the value is special, it's that the compiler knows to generate special instructions when you're using an AtomicUsize or AtomicU32. If you pass it as a raw pointer to C and then just dereference it with star, that's even weaker than a relaxed load of that value. It's the same as if you cast it to a raw pointer to a usize in Rust and then dereferenced it, which I don't think is even sound; you might just run into undefined behavior. "What is consume ordering?" So, Rust doesn't currently implement consume ordering.
I don't know that it's used very often in C and C++ either. Sorry, I should have warned you about the bright screen. So, release/consume I haven't looked at too carefully, but it looks like the rules for what things are visible to the other thread change a little. My guess is that it's more specialized, for cases where you don't need full release/acquire; or maybe it's the other way around, I'm not quite sure. It's not available in Rust currently anyway, probably for good reason. "It's really hard to define consume ordering." Right, yeah; looking at the description of release/consume (I forgot to warn you about the bright screen again), it looks like it's just annoying. And if you look here: since C++17, the specification of release-consume ordering is being revised, and the use of memory_order_consume is temporarily discouraged. So I think it's just not worth trying to learn it at the moment, and even if you learn it, chances are that information will be outdated. Whew. All right, with that, I think we're gonna call it a stream. Thank you everyone for coming. I hope that was interesting; I hope you learned something.
I hope this does not leave you more confused than you were. I wish you luck in writing lock-free atomic code. Just keep in mind that, in general, you shouldn't write lock-free code unless you have to. I want to end on this note: if you can just use a Mutex, and most of the time you can, do that; it will cause you much less headache. If you really need to get into atomics: use loom, use ThreadSanitizer, run it through Miri (which I think now has some concurrency support), get other people to vet the code, find a paper that describes the algorithm you're trying to implement. Do your best to make really sure you get it right, because there's a lot of subtlety, as we've seen. And the best way to avoid the complexity of atomics is to not use atomics; so in particular, don't use them unless you need to. And with that, I will see you next time. So long, and, well, I hope it was an educational experience.