Welcome. This is a talk on memory barriers in practice, for software engineers. Memory barriers: they are these scary, advanced multiprocessor synchronization things, and if you don't get them right, your code doesn't really work. Maybe. Sometimes it does work, right up until you run it on a new CPU with a different microarchitecture, and then it starts to behave in very weird ways. There's a lot of lore around memory barriers, and you'd kind of rather not encounter them: leave them out and your code is broken, put in too many and they slow your code down. This talk presents a simpler view of what you need to know about memory barriers for software engineering. This is not a mathematical talk about formal memory models; you can find that somewhere else. This is a talk about how to do engineering with memory barriers.

Now, there's a TL;DR. Memory barriers are always there to order four different memory operations. You have some memory operations in some section A, then a memory barrier, and then a store, where you write to some memory location P, on one CPU. On another CPU, you load from that same memory location, then there's another memory barrier, and then there are some other memory operations in B, using the value you read. So there are four things going on here: the operations in A, the store to location P, the load from location P, and the operations in B. Four different things on two different CPUs. Memory barriers always come in pairs. When you put the memory barriers in pairs like this, so that one barrier ordering some operations on one CPU matches up with the other barrier ordering the operations on the other CPU, then you're guaranteed some very simple reasoning: it's as if the logic in A happened before the logic in B on a traditional sequential CPU, where you don't have to think about multiprocessing at all. It's as if you just went step by step in program order. And that's the main thing you need to know about memory barriers: they order four different operations. So if you're reviewing code that has barriers in it and you see a barrier somewhere, the first thing you should think about is: where is the corresponding barrier, the one that another part of the code (maybe the same barrier, but executed by another CPU at some point) pairs up with, and what are the four memory operations that need to be ordered for the protocol to work out? The second line of the TL;DR is a simpler version of the same thing: on some architectures it's a little faster to use a combined store-release and load-acquire. And then there's a little optimization for read-only consumers that I'll get to later. The real red flag to watch out for, the one that indicates something is probably incorrect, is a store-before-load barrier. That only appears in really obscure protocols like Dekker's algorithm, which mostly exists to scare undergraduates in university away from looking at multiprocessor synchronization. You don't ever see Dekker's algorithm in the real world, for the most part.

This entire talk is at the URL here. Unfortunately, I don't know how to put a QR code into an Emacs buffer. And this is a very code-heavy talk, which is why I'm presenting it in Emacs as a live editor session. So if you want to grab this code, grab this slide to follow along at your own pace, you're welcome to do so. Except I'm going to move to the next slide now, so it's going to go away.
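To make the TL;DR concrete, here is a minimal sketch of the pattern in C. This is my own illustration, not code from the slides; it assumes NetBSD's membar_release() and membar_acquire() from <sys/atomic.h>, and the variable and function names (producer, consumer, ready_p) are hypothetical.

```c
#include <sys/atomic.h>		/* NetBSD membar_ops(3) */

struct msg { int payload; };

static struct msg data;			/* written by section A          */
static struct msg *volatile ready_p;	/* the shared location P         */

void
producer(void)				/* runs on CPU 0 */
{
	data.payload = 42;		/* memory operations in A        */
	membar_release();		/* A is ordered before ...       */
	ready_p = &data;		/* ... the store to location P   */
}

void
consumer(void)				/* runs on CPU 1 */
{
	struct msg *m = ready_p;	/* the load from location P ...  */

	if (m == NULL)
		return;			/* not published yet             */
	membar_acquire();		/* ... is ordered before B       */
	(void)m->payload;		/* memory operations in B        */
}
```

The pairing is the whole point: the membar_release() on CPU 0 does nothing useful without the membar_acquire() on CPU 1, and vice versa.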
OK, so, basic multiprocessor business. Let's suppose we have a counter, C, and a function count which adds 1 to C. You call count 1,000 times, and C goes from 0 to 1,000. Simple, simple sequential machine logic. Of course, if you run this on a multiprocessor system, things will be a little different. But let's first break it down into three separate steps, as if we were on a RISC architecture with loads and stores. In this count function, we first load C into a temporary variable T, then compute one more than C; this computation, T += 1, is purely local, inside registers, and doesn't touch memory; and then we write T back to C. When I say load and read, those are synonymous. When I say write and store, those mean the same thing. So load, store, read, write: same deal, just different words, because different cultures merged in computing history.

Obviously, if you've ever done any multiprocessor computation: if you call count 1,000 times on each of two different CPUs, the result C might not be 2,000. It might be 1,972. Because when you step through this program, the execution might be interleaved. CPU 1 might load the value of C, let's say 7,982. CPU 2 also loads 7,982 at the same time. They both increment, and both write back 7,983. Now two different calls to count have only counted once. So we need a mutex to ensure that this is ordered: when we are running the count function and using this global state, we only want one CPU at a time to be able to touch it. Of course, on a multiprocessor system we want to do many things at once, and ideally we have fine-grained locks so that one CPU can be working on one counter while another CPU works on another. But if they happen to be using the same one, then we need a mutex to ensure that the count function runs as if it were on a sequential machine.

Here is a very simple-minded way to do a mutex. We'll use the atomic exchange instruction on x86 for this test-and-set function. It returns whatever value is at the memory location, our lock L here, and at the same time it stores 1 at that location. The way we use this: L is 0 when the lock is free and 1 when it's locked. So this is a very simple spin lock, very, very simple-minded. Nothing exciting going on here, and it's not going to perform very well, because if two CPUs want the lock at the same time there's no exponential backoff or anything to keep performance from collapsing. This is for illustration. So when we want to count up, we repeatedly test-and-set, checking whether the lock was unlocked while trying to lock it. If it was unlocked, great, we have it; if it was already locked, we try again. Then we go through the critical section to increment the counter as in the sequential model, as on a straight-line serial machine. And when we're done, we unlock the mutex. Ideally, if we did this right, then if two CPUs each called count 1,000 times, C at the end would be 2,000. Unfortunately, some things in this computer, this demonic shimmering silicon crystal, are conspiring to screw us up, so it's not going to come out 2,000 all the time. It's not going to work out sequentially. One problem is that the compiler generally assumes that, without special instructions, it can reorder sequential statements as if they were written in a different order, as long as that doesn't change how the program would run on a sequential machine.
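Here is a rough reconstruction of the counter and the naive test-and-set spin lock being described; it's a sketch, not the talk's exact code, and I'm assuming NetBSD's atomic_swap_uint(3) as the spelling of the atomic exchange.

```c
#include <sys/atomic.h>

static unsigned int C;		/* the shared counter            */
static unsigned int L;		/* the lock: 0 = free, 1 = held  */

/*
 * Atomically store 1 at *l and return the old value; on x86 this
 * compiles to the XCHG (atomic exchange) instruction.
 */
static unsigned int
test_and_set(unsigned int *l)
{
	return atomic_swap_uint(l, 1);
}

void
count(void)
{
	while (test_and_set(&L) != 0)	/* spin until we saw it unlocked  */
		continue;

	unsigned int T = C;		/* load C into a temporary        */
	T += 1;				/* increment, purely in registers */
	C = T;				/* write it back                  */

	L = 0;				/* release the lock (still buggy) */
}
```

As the next couple of slides show, this version is not yet correct: nothing stops the compiler, or the CPU, from moving the L = 0 store earlier.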
So the compiler can take this L = 0 and do it earlier, because nothing would change if you ran this program on a single CPU. But if it does that, then the critical section is no longer locked, and we're back to the same problem of two CPUs stomping on each other's reads and writes at the same location. Going back to the code, we need to fix this so the compiler doesn't do that. Oh, sorry, this slide is to illustrate that the compiler will actually do this in practice. I didn't use x86 here; this is ARM assembly, but I really did put this code into GCC, and it actually did reorder the stores so that the store to L that unlocks the mutex comes out first, before we've updated the counter. GCC very helpfully optimized the code for us in a way that broke it. Of course, that's because we didn't tell GCC we want to run this on a multiprocessor machine, so GCC was well within its rights to do this.

To fix the compiler part of things, before we get to the other part of this demonic shimmering silicon system, we can put in instruction barriers. The problem up here was that when GCC was issuing load, store, and add instructions, it simply reordered the instructions it generated for the machine. We can ask GCC to please not reorder those by putting in an asm volatile block. This is magic that says: GCC, you cannot assume that anything in memory is as it was before the asm volatile block, and you can't delete the block either. So when GCC sees memory operations, a load and a store and a store, it can't reorder a memory operation from before the block with a memory operation after it in the code that it generates. (A sketch of this construct appears at the end of this slide.) And it comes out much better: we wind up with the critical section as we intended, load, add, store of C, and only then do we release the lock. So, great: we have a simple way to make the compiler stop doing things the wrong way behind our back.

However, the compiler is only part of the system. In a sense there's a pipeline here. You feed programs into your compiler, the compiler transforms them into machine instructions, and those are then fed into CPUs. The CPUs are wired to a system bus with a shared memory: multiple little CPU cores all spinning around doing computation, all wired to a shared memory. And the CPUs transform those machine instructions into memory transactions on the bus. So there's a whole pipeline, and we've found a way to prevent the compiler from reordering machine instructions; now we need to make sure the CPUs do the right thing as well. Actually, the story is a little more complicated, because there are CPU cores with cache interconnects and different levels of caches, and it's all very messy. But we're not going to go into those details. This talk is about how to do engineering with memory barriers as components; I'm not going to go into how the messy cache-coherence protocol works. That's all at a lower level; this is about how to write the software. OK, so the basic problem is that we don't want this interleaved execution. We want to make sure that if we're running on two different CPUs, this situation here is not allowed to happen, because it flies in the face of the sequential logic we've written in our program. We want it to be serialized.
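Here is, roughly, the compiler-barrier construct mentioned above, written for GCC-style C and reusing C, L, and test_and_set from the earlier spin-lock sketch. This is a sketch under the assumption that the talk's asm volatile block carries a "memory" clobber, which is what makes GCC treat it as touching all of memory.

```c
/*
 * Compiler barrier: emits no machine code itself, but the "memory"
 * clobber forbids GCC from caching memory values across this point or
 * reordering memory accesses around it; volatile forbids deleting it.
 */
#define compiler_barrier()	__asm__ __volatile__("" ::: "memory")

void
count(void)
{
	while (test_and_set(&L) != 0)
		continue;
	compiler_barrier();	/* keep the critical section below the lock */

	unsigned int T = C;
	T += 1;
	C = T;

	compiler_barrier();	/* keep the unlock store below the update   */
	L = 0;
}
```

This constrains the compiler only. On x86, with its strong ordering, that happens to be enough in practice; on ARM and other weakly ordered CPUs the hardware can still reorder the memory transactions, which is where the real memory barriers below come in.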
Well, not necessarily serialized; we want to be able to reason about it as if it were serialized, in this example of a counter. Later on we'll want to avoid serialization, because we want multiple CPUs for performance. But for this example, we're just looking at serialization: running as if in serial on one CPU. So, to recap, we have three parts here. There's the lock acquisition: we take the lock, and once we have taken it, we expect that we have exclusive access to the counter memory. Nobody else can be touching it right now. No touchy. Then we can do our business as if we're the only CPU in town, with nobody else trying to interfere with us. And then we release the lock so that another CPU can take over and continue counting. So we have these three parts: lock acquisition, critical section, and lock release. One view of things is that between these sections we need to ensure some kind of ordering, and that is to put barriers, we'll call them an acquire barrier and a release barrier, between those sections. But this isn't quite the view that's important when you're looking at the implementation of a mutex. I mean, when you're using a mutex, you just want to say mutex enter, mutex exit, or whatever. But when you're looking at how it works under the hood, or if you're building a synchronization primitive yourself, it's tempting to say that this acquire should be paired across the critical section with this release. That's not the view you need, because you actually need to think about where two different threads, or two different CPUs, are coordinating. This is a single-thread view of what one thread is doing at a time; you need a view of the two different threads that are coordinating.

So on CPU zero, or thread zero (when I say CPU or thread, it's the same deal: a parallel thread of execution of some sort), we've been working through the critical section: we loaded a counter value, we incremented it, we wrote it back, or maybe there's some other data structure we're updating, whatnot. In order to keep the sequential idea of programming, in order to do sequential reasoning about the program, we want to make sure that the critical section on CPU zero all happens before, as if it had been in sequence on one CPU, the corresponding critical section on CPU one. So when we're ordering access to the memory, we have four memory operations here that need to be ordered, four or more: the memory operations in critical section A on CPU zero; the store that releases the lock; the test-and-set on CPU one, which observes that the lock value is now zero; and the memory operations in critical section B on CPU one. If that release is witnessed by the other CPU, by CPU one, then there's a synchronization event between them, and the barriers guarantee that everything in critical section A on CPU zero has happened before everything in critical section B on CPU one. And once you have that, anything that you would have written on a single, sequential machine, any reasoning about data structure invariants that critical section A guarantees, is also guaranteed when critical section B begins. So we want A to happen before B.
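A sketch of where the barriers go in the naive spin lock under that two-thread view; this is my reconstruction, again using NetBSD's membar names and the atomic_swap_uint spelling from earlier.

```c
#include <sys/atomic.h>

static unsigned int L;		/* 0 = free, 1 = locked */

void
lock(void)
{
	while (atomic_swap_uint(&L, 1) != 0)	/* the load half of test-and-set */
		continue;
	membar_acquire();	/* critical section stays after that load   */
}

void
unlock(void)
{
	membar_release();	/* critical section stays before this store */
	L = 0;			/* the store that releases the lock         */
}
```

The four things being ordered are the operations in critical section A, the store L = 0 in unlock() on CPU zero, the load of L inside atomic_swap_uint() in lock() on CPU one, and the operations in critical section B. In C11 you could express the same thing with atomic_exchange_explicit(&L, 1, memory_order_acquire) and atomic_store_explicit(&L, 0, memory_order_release).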
And that is what the pairing of the release barrier and store on CPU zero with the corresponding memory operation and acquire barrier on CPU one does: it guarantees that everything in A happens before everything in B. The way we do this is with CPU barriers that make sure everything in critical section A on CPU zero has completed, as far as any other CPU can witness, before we do the store, and then the corresponding thing, the acquire barrier, on the other CPU. In NetBSD, we spell these membar_release and membar_acquire. They're actually relatively new; there used to be a different set of membars in NetBSD, and we recently changed them to better match the literature and what is actually useful in most software. The old membar_exit and membar_enter are now deprecated; I'll mention them later, but you can forget about them. Release and acquire are the important ones; for the most part, they're the only two barriers you ever need in realistic, real-world code. Now, if you look at the membar_release and the store, on some CPUs this can be combined more efficiently into a single store-release instruction; on ARM, STLR is a store-release instruction. Similarly, an atomic operation can sometimes imply an acquire barrier, so the other side can be done as one atomic-swap-acquire operation. We don't actually have that name in NetBSD just yet; you can spell it in C11. But in principle these are the same thing: release then store, or a combined store-release; atomic operation then acquire, or load then acquire. The combined operations are sometimes clearer and sometimes a little more efficient. You can write it both ways; either way works. Well, in NetBSD right now you'd have to write an atomic swap followed by membar_acquire, but we might change that eventually. In C11 you can use the combined operation.

OK, so I've been talking about mutexes and about serialization, and serialization is great, but we have multiple CPUs because we want stuff to happen in parallel. We don't want to serialize everything; if we serialize everything under a kernel lock, it's very slow. Fine-grained locking can help, but there are also times, for instance in a packet-processing path, when there are data structures we really don't want memory contention over. We just want a bunch of CPUs to be able to load from them, read-only, and use them, without having to issue writes that other CPUs must witness, because that kind of coordination between CPUs is slow and it eliminates parallelism. One example is table lookup. Say you have a table of routes, or a table of user records, or something. Inserting or deleting entries in this table may be an expensive operation; you don't care, it doesn't happen very often. Looking things up in the table needs to be fast, needs to be cheap. So you don't want to serialize insertion against lookup, because that will slow you down. You maybe serialize insertion against insertion, but you want lookups to happen in parallel on as many CPUs as you want. So here's a naive way to try to do unlocked table lookup, maybe the first way we'd try it. Insertion into the table is serialized: we lock the table, allocate a record, and initialize the record a little bit. Maybe each record has its own fine-grained lock, so that when you're using two different records on two different CPUs, they don't have to wait for each other; they can work independently.
But only one CPU at a time can touch that record; this is just an example. Then we fill in some data, and once the record in the table is ready, we mark it occupied: occupied = true. The idea of this approach is that we can then do an unlocked lookup: we don't have to take any locks at all to look anything up in the table, so any CPU in the machine can do lookups. We check: is it occupied? If not, go to fail, forget it, nothing is here. If we try again, assuming the software runs more or less in the sequence it's written in, then after occupied = true, we look it up again, and this time it is, in fact, occupied. So we get the record, we can read the tag, we can take the lock on the contents of the record, and we can use some data in the record, whatever; do some logic here.

Unfortunately, again, the compiler might reorder things for us; we didn't put in any barriers to prevent it. So where the compiler saw initialization of the tag, then the lock, then some data, then occupied, it might just reorder those into some other order, because it found a better way to optimize the store instructions on the ARM CPU, which has fancy immediate values you can use to store bit patterns in special ways, and compilers can find clever ways to take advantage of that. So it might come out with occupied = true set first. That rather puts a damper on our scheme of letting CPUs use the record as soon as it is marked occupied, because it hasn't been initialized yet: a lookup is going to find uninitialized data. That's no good. Similarly, on the lookup side, the compiler might see that we're going to load the tag of this record (whatever a tag is; maybe a packet flow for a route or something) and decide it might as well get a start on that early, moving the load to before we've even checked whether the record is occupied. Again, we might read uninitialized garbage. That's no good either.

To make this work, we need barriers: barriers to make sure that when you're looking up the data, everything you use in the data has already been initialized. The initialization, setting the tag, initializing the lock, setting the data, all has to happen before the critical section on the user side. So there is something like a critical section here, but it's unlike a lock, where you have lock acquire, critical section, lock release. The critical section isn't really delimited that way on the two sides: it just ends on the creation side, and it begins on the lookup side. It's not delimited as one CPU starts and ends and then the other CPU starts and ends, the way the mutex was. So it's a little bit different, and that's why the multithreaded view of even a mutex is important to think about, rather than just the single-threaded view I brought up earlier. Again, you can use a store-release and load-acquire instead of a store with membar_release and a load with membar_acquire. Same thing, makes no difference. It can be a little more obvious with a store-release that the synchronization is happening at the point of the occupied variable, but both work. I personally recommend using store-release and load-acquire; it makes it a little clearer, easier to audit.
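Here is a sketch of the publish-and-lookup pattern just described, using C11 atomics for the store-release and load-acquire on the occupied flag. The struct layout and function names are hypothetical, invented for illustration; in NetBSD kernel style you would instead write a plain store preceded by membar_release() and a plain load followed by membar_acquire(), which the talk says is equivalent.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <pthread.h>

struct record {
	atomic_bool	occupied;	/* publication flag              */
	int		tag;
	pthread_mutex_t	lock;
	int		data;
};

/* Insertion side: assumed to be serialized by a table lock elsewhere. */
void
publish(struct record *r, int tag, int data)
{
	r->tag = tag;			/* initialize the record ...     */
	pthread_mutex_init(&r->lock, NULL);
	r->data = data;
	/* ... and only then publish it: everything above is ordered
	 * before the store that a lookup's load-acquire can observe. */
	atomic_store_explicit(&r->occupied, true, memory_order_release);
}

/* Lookup side: no table lock at all. */
bool
lookup(struct record *r, int *datap)
{
	if (!atomic_load_explicit(&r->occupied, memory_order_acquire))
		return false;		/* nothing here (yet)            */
	/* The acquire guarantees the fields below are fully initialized. */
	pthread_mutex_lock(&r->lock);
	*datap = r->data;
	pthread_mutex_unlock(&r->lock);
	return true;
}
```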
But sometimes it's not convenient, for one reason or another. OK, so another issue that requires barriers is reference counts. The idea of a reference count: a lot of people are in a big room, they're trickling out, the lights are on, and the last one out hits the lights. You keep a count of how many users there are, and on the very last reference you destroy the data and free the resource. Hit the lights, turn the lights out. But maybe a user is going to grab some data out of the object, out of the resource you're trying to release. So we grab some data out, and then we release the reference count. Let's pretend for the moment that -- is atomic in C; the atomicity question is not the relevant point here, the point is going to be about memory ordering. I could replace it with atomic_dec_uint_nv or something, but let's just say -- for now to keep it simple. And maybe I don't actually memset here; maybe I have an explicit memset, or maybe free does that internally as a diagnostic thing. The point is that something is going to overwrite the object with garbage as soon as we free it. But once there are no more references, that should be OK, because we're the last one holding a reference to this object, so nobody else can be touching it. Certainly.

Well, unfortunately, that's not right, because again the compiler or the CPU could conspire to reorder the code in a way that makes no difference whatsoever to sequential execution but poses a problem for us: we might load the data after we've decremented the reference count. What that means is that with two CPUs, the one that is not the last user might decrement the count first. So the first CPU decrements the reference count; it goes down to one, so we don't take the branch that frees the object. The other CPU then decrements the reference count to zero and destroys the object. But the first CPU hasn't yet loaded the data, so now it loads the garbage that was written by the CPU that thought it was the last user. Except it wasn't really the last user; the CPU and the compiler were working against us to screw this up, and we end up reading garbage. CPU one eventually frees the object and might return the right data, but CPU zero has returned garbage: a use-after-free. Of course, this is often very difficult to diagnose, because this kind of use-after-free usually happens only in very tight race conditions in high-performance paths, where just one packet has two bytes of garbage in it or something, and it's very hard to track down once you hit it in the wild.

So we can use memory barriers to help us here. And this time we're not just taking a multithreaded view, the way we did with the mutex; we're also using release first and then acquire, which is maybe counterintuitive. The reason is that we need to make sure that everything a CPU did with the object prior to decrementing the reference count happens before everything that runs after the decrement that brings the count to zero.
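A minimal sketch of the fixed reference-count release. This is my reconstruction of the pattern, with a hypothetical struct obj; it uses NetBSD's atomic_dec_uint_nv(3) for the decrement, since in real code -- is not atomic.

```c
#include <sys/atomic.h>		/* NetBSD atomic_ops(3)/membar_ops(3) */
#include <stdlib.h>
#include <string.h>

struct obj {
	volatile unsigned int	refcnt;
	int			data;
};

int
obj_release(struct obj *o)
{
	int d = o->data;		/* grab some data out of the object  */

	membar_release();		/* our use happens before the decrement */
	if (atomic_dec_uint_nv(&o->refcnt) == 0) {
		/* Last reference: pairs with every other CPU's release. */
		membar_acquire();
		memset(o, 0x5a, sizeof(*o));	/* scribble over it          */
		free(o);
	}
	return d;
}
```

The membar_release() ensures this CPU's reads and writes of the object are visible before its decrement; the membar_acquire() on whichever CPU sees the count hit zero ensures it does not scribble over the object until after all of those uses.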
Putting the memory barriers in like this rules out that ordering possibility, so that every CPU which has ever been using the object has completed its memory operations on the object before another CPU can witness the reference count go down to zero and do anything after the membar_acquire. The membar_release on one CPU guarantees that the membar_acquire on the last CPU, the other CPU, will observe all of those memory operations as having happened first. Again, this is a little counterintuitive; I should have illustrated it with another step-by-step walkthrough, but apparently I didn't. Sorry. I fixed a large class of bugs like this in NetBSD this year, a few months ago, I don't remember exactly when, involving all kinds of reference counts. We should probably consolidate some of that into a simpler abstraction so we don't have copies of the membar calls all over the place. But at least a large class of bugs was fixed just by inserting release and acquire around the reference-count decrement. And of course, for real code you need to use an atomic operation for the decrement; you can't use --.

Now, as an optimization, consider the case where the producer is write-only and the consumer is read-only. For instance, you want to read out a set of event counters, but you want a consistent snapshot of that set, exactly as it was written at one point in time, not partially updated, and there are way too many counters to read with a single atomic operation. Maybe on some CPUs you even have load-linked/store-conditional that can work on up to a cache line at a time, so you can write 64 bytes atomically, but this is way too much even for that. So you need an atomic snapshot, and you want the snapshot to be cheap: you don't want to have to wait for anything while taking it, or at least not unless it's actually being updated in the middle of your read. And there's nothing transferring data from the consumer back to the producer. In this case, instead of membar_acquire and membar_release, you can use membar_producer and membar_consumer, which are slightly cheaper on some CPUs. membar_producer just orders writes, just stores; membar_consumer just orders loads; and that's it. This is called a seqlock in Linux, and there's also a Lamport variant of it. With the Linux seqlock, you set a bit to indicate that there's an update in progress, then you update the data, then you increment the version. On the reader side, you check whether there's an update in progress and wait until there isn't, then read the data; and if the version changed, you go back and start over. Lamport's lock is a similar thing, just with two version numbers instead of a bit and a version number. Both can take advantage of membar_producer and membar_consumer being very slightly cheaper: on ARM CPUs, for example, membar_producer is, or at least can be, very slightly cheaper than membar_release, and the same deal on RISC-V. But it's a micro-optimization; you can ignore producer and consumer, just use acquire and release, and that'll be fine. (A sketch of the seqlock reader and writer follows at the end of this slide.)

The red flag is store-before-load. In NetBSD, in Solaris, and in OpenBSD, there's this membar_sync.
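Here is the seqlock sketch promised above, my own illustration of the pattern the talk describes, using NetBSD's membar_producer()/membar_consumer() and the common single-counter formulation in which an odd sequence number means an update is in progress. The counter names and sizes are hypothetical, and a single serialized writer is assumed.

```c
#include <sys/atomic.h>		/* membar_producer()/membar_consumer() */
#include <stdint.h>

#define NCOUNTERS	8

static volatile unsigned int	seq;	/* even = stable, odd = update in progress */
static volatile uint64_t	counters[NCOUNTERS];

/* Writer: single producer, serialized elsewhere. */
void
counters_update(unsigned int i)
{
	seq++;				/* odd: update in progress          */
	membar_producer();		/* ... ordered before the data stores */
	counters[i]++;
	membar_producer();		/* data stores ordered before ...   */
	seq++;				/* even: update complete            */
}

/* Reader: take a consistent snapshot without blocking the writer. */
void
counters_snapshot(uint64_t snap[NCOUNTERS])
{
	unsigned int s1, s2;

	do {
		while ((s1 = seq) & 1)	/* wait out an update in progress   */
			continue;
		membar_consumer();	/* seq load before the data loads   */
		for (unsigned int i = 0; i < NCOUNTERS; i++)
			snap[i] = counters[i];
		membar_consumer();	/* data loads before rereading seq  */
		s2 = seq;
	} while (s1 != s2);		/* version changed: start over      */
}
```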
membar_sync orders every possible combination of memory operations. I haven't really gone into the details of load-before-load and load-before-store, because I think that's much more difficult to reason about than the four memory operations with an acquire barrier on one side and a release barrier on the other. But membar_sync orders all pairs of load and store operations, including store-before-load. Now, store-before-load is necessary in Dekker's algorithm, which is a kind of clunky scheme for achieving mutual exclusion on a machine that doesn't have atomic operations. Most machines these days have atomic operations; in fact, it's been a long time since any serious multiprocessor machine lacked them, and this has pretty much only been useful for theoretical machines designed for academic purposes in the 1970s. For Dekker's algorithm to work, your machine needs to have no atomic operations but it does need memory barriers, and there are even fewer machines, I think, that have memory barriers but no atomic operations. So it's a very, very theoretical device, and it still needs an acquire barrier and a release barrier to bracket the critical section. So this is not useful. If you see membar_sync, that's a red flag: something is probably going wrong here. Either the code has not been audited against an actual protocol, with no clear idea of which things must happen before which other things, or it's a very bizarre protocol that you should maybe rethink, because Dekker-style reasoning is hard to think about and usually slow.

The thing is, the store-before-load barrier usually has to stall the entire CPU pipeline. With the other barriers, load-before-load or the acquire and release barriers, an acquire barrier will stall the loads and a release barrier will usually stall the stores, but aside from that the CPU can keep executing, keep reordering, and so on. membar_sync is extremely expensive in contrast. It's so expensive that even on x86, which has total store order, where you can often write entire multiprocessor systems without even thinking about barriers because the memory ordering is so strong, membar_sync still implies an actual barrier, because x86 may take a store and defer it until after a later load. x86 doesn't reorder anything else, only store-before-load. So membar_sync is the most expensive and least useful barrier, and it indicates there's probably something screwy: maybe the code hasn't been audited, maybe it's doing something it shouldn't be. It's a red flag. In NetBSD we also have membar_enter and membar_exit. membar_enter is just store-before-load and store-before-store, but on all CPUs, except possibly RISC-V, and I'm not even sure about that, it is just as expensive as membar_sync, so there's no performance advantage, and it raises all the same red flags. membar_exit is a legacy alias that paired up with the name membar_enter; enter and exit are nice and memorable, but it's the same thing as membar_release.

Then we also have a few other barriers for other purposes, out of scope for this talk, that I'll just mention briefly. bus_space_barrier is for write-combining or prefetchable memory types; x86 requires it, with SFENCE and LFENCE, but for the most part you don't have to worry about it,
unless you're running a graphics driver, in the kernel graphics stack and framebuffer business; that's usually what it's used for. And then bus_dmamap_sync: this is for when you're doing DMA with a device driver. Again, I'm not going to go into details, but device drivers need to use the dmamap sync operations to ensure ordering between memory operations by a DMA engine and memory operations by the CPU. The membar functions, membar_acquire and membar_release, do not coordinate with I/O devices, so you need bus_dmamap_sync when you're doing I/O; the membar functions in NetBSD are only for CPU-to-CPU synchronization. One final note before I wrap up and take questions: most of these barriers are going to live inside the implementations of synchronization abstractions. You don't usually use them in regular driver code. If you're writing a mutex, or a reader/writer lock, or something like that, you'll probably use barriers. If you're writing a multithreaded table that allows concurrent lookups and concurrent inserts, you might use barriers or load-acquire and store-release operations. But in driver code and in most application code, you don't use them directly. Still, it's sometimes important to understand how to audit them, how to identify what looks sensible and what looks nonsensical, how to tell whether the code is correct when you do encounter them. So that's memory barriers. I'm happy to take questions; I don't even know how much time I have left, I've been rambling for a while. Do we have a question? Oh, all right. Benny, are you slacking? OK, yes.

Question from the audience, on a very similar thing to the previous point: for bus_dma mappings, the sync is sometimes effectively a CPU-to-CPU barrier. If it's MMIO, sometimes you need some kind of barrier or cache flush and sometimes you don't; but sometimes the bus_dma memory is actually ordinary RAM, some kind of ring, and it's effectively a CPU in the device that you're cooperating with. Yes. So when you issue a bus_dmamap_sync, it's usually more expensive than a membar, because it has to work for more cases: bus_dmamap_sync deals with bounce buffers and also with different memory types, like a write-combining buffer, which requires a barrier on x86 and ARM and everything else. So yes, bus_dmamap_sync is usually more expensive, and also even more necessary, even more important for a driver, because the DMA engine just doesn't work if the right flushing doesn't happen; sometimes you need an explicit cache-flush instruction. So yes, bus_dmamap_sync often has the effect of a CPU barrier. The point I wanted to make is that the membar functions are not enough for device drivers to coordinate with DMA engines; you need bus_dmamap_sync. The membar functions are only for CPU-to-CPU synchronization, and they're often weaker than bus_dmamap_sync.

Next question. Can you speak up a little bit? Oh, yes, there's a microphone, amazing; I won't even have to repeat the question for the stream audience. The question: lots of the problems you've raised are related to code being reordered, either by the CPU or the compiler, and it seems like a lot of this comes from the special requirements we have when working within the kernel, and from things being a bit underspecified in the C specification.
Is there anything we could do, or should be doing, to prevent unsafe compiler optimizations in the kernel? Well, compiler optimizations are only part of the story. The CPU will reorder memory operations even if the compiler generates machine instructions in exactly the order we wanted. We can insert these instruction barriers, and on x86 that would actually be enough, but on ARM it would not be. So you asked what we can do to mitigate unsafe compiler, or for that matter CPU, optimizations, and the general answer is that for classes of applications we can create abstractions, like mutexes: the kernel mutex API, instead of having that logic copied into every application. There are other high-level abstractions we can use. In some cases the right thing is just to make the software single-threaded with event loops; in some cases the right approach is something else. For the most part, when you're writing a driver or an application, you won't want to reach for membars. If you do want to reach for membars, you probably want to isolate them inside an abstraction that has properties that are easy to reason about, like mutexes, and then use that abstraction in the rest of your code. Sometimes maybe you want goroutines in Go, or Concurrent ML, or Concurrent Haskell, high-level abstractions like that. Or in Unix, maybe you just want processes strung together with pipes, and then you don't have to worry about membars at all in that kind of application. So generally: isolate the difficult things into abstractions, and make sure the abstractions have contracts that enable reasoning. But it's a big subject, so I'm not sure I can say anything more specific than that.

Yes, the question down in the third row? When you're writing synchronization primitives that rely on memory barriers for correctness, what kind of techniques do you use to convince yourself the code is right, beyond just testing it? Because obviously that doesn't really... Yeah. So there are frameworks for formal verification. The technique I mostly use is to make sure the logic is simple enough that I'm confident in it. If it's too complicated, then I conclude: nope, this is unverifiable, which of course is what an automated prover will do too; it will say, well, if this proof is too complicated for me to find, then it's unverifiable. I think that's usually a pretty good heuristic. Something I often do is annotate with a comment, any time there's a membar, what the corresponding matching membar is, and that really helps with auditing. It's not formal verification; there are frameworks for that, but they can get very complicated and hairy, there's lots of academic research on them, and I'm not the person to talk about that. This talk is more about how to do practical engineering without getting tripped up in all the academic details. It'd be great if those tools were more readily usable for real-world applications, but until they're ready to be integrated into Clang or something, I don't know; they're a little more specialist. Other questions? From the audience: I'll point out, if no one else has anything, that there is a cool suite of tools called herd tools. Called what? Herd, H-E-R-D, tools.
It's a framework for dissecting memory models, written in OCaml, with a bunch of command-line tools. One thing you can do, and I only found out about it recently, which is why I asked my earlier question, is write little snippets of assembly, say ARM64 assembly, and write assertions about the results of parallel executions of different snippets of code, and it will actually produce all the possible interleavings; it knows enough about the memory model of the CPUs to do that. So I guess I was wondering whether you'd had any experience applying that to kernel code or anything like that.

Yeah, I have not applied that sort of thing to NetBSD development, but if you're interested in doing that, we can talk. You can use that kind of tool for initial development, and that's great; for an OS kernel, if you really want to make good use of it, you also want it to be part of the automated tool chain that's used all the time. If we can do that, cool, that's great, but there's also a lot of engineering, and likely academic work, to make it a reliable thing you can integrate into a continuous build system and so on. It'd be great if we can do that.

From another audience member: I know of one similar effort, although I don't think it would be practical for us to use. Some folks I know at Cambridge have a project, I think it was called REMS, with a formal execution model for ARM and a couple of other architectures. You can basically make a small ELF and run it, and it'll do a similar thing where it explores all the paths. I think they actually found an upstream bug in Linux's mutex implementation on ARMv8 that way. Nice. And I've talked to that professor in person, because I work with him. But it would be a clunky tool to automate; I don't know of anything there that you could fit into a tool chain or CI in a useful way. Yeah. Any other questions? OK, let's thank the speaker. Thank you.