Hi, welcome everyone. I'm Jonas Oberhauser from the Huawei Dresden Research Center, and I'm talking today about the Linux kernel memory model. When the Linux kernel memory model was published as a scientific paper, the authors titled it "frightening small children and disconcerting grown-ups", which reminds me of some old fairy tales I had to listen to as a young German child. But after getting to know the Linux kernel memory model a little better, I felt it's actually more like some cool futuristic thing than a frightening, scary fairy tale. I want to tell you a few things about the Linux kernel memory model to share that view, and at the same time answer a lot of questions I keep getting about weak memory models whenever I talk to people in industry about them.

The first of these questions is: why do weak memory models exist at all? Why can't we just ignore the whole thing? Historically, we all know Moore's law, and some people say it's slowing down, but if you look at the actual transistor count, it is still growing roughly exponentially. It's just that single-threaded performance isn't keeping up with that development; instead, what the higher transistor count gives us is more processor cores. So on the one hand, this multi-core parallelism drives performance. On the other hand, if you look at the ratio between the performance of a CPU core and the performance of memory, in terms of bandwidth and latency, you see that memory latency is falling exponentially behind. So latency is getting much worse from the viewpoint of the processor, while overall performance is still getting better. How does that make sense? The answer is that in order to keep throughput high, we need to execute memory operations out of order rather than wait out that long latency.

So if we look, for example, at this instruction stream, which is the sequence of instructions the processor sees in this order, we might have some stores, some loads, maybe some arithmetic and some branch conditions in there. When they actually execute on the hardware, they are not executed in this order at all; they are executed in a kind of parallel way. What we get is more like a two-dimensional picture of time and instructions, where operations that start quite early and appear early in the instruction stream actually complete much later. And this gives some interesting effects. For example, say we store to some variable and later load from the same variable: we are able to see the value of that load before the store that provides the value has completed. This is called forwarding. Another interesting effect is that if a branch condition is evaluated based on an earlier load, we don't have to wait for that load to complete before executing the code that comes after the branch. That's speculation. These effects can hide the out-of-order parallelism inside the CPU from the programmer, as long as the programmer writes only single-threaded code. And I think that's why most people might not be thinking about this at all.

But what happens if you have a multiprocessor program? For example, here with two CPUs, one writing first to y and then to x, and the other one reading first from x and then from y, and doing the read of x in a loop to make sure that by the time it starts reading the y value, the x value has already been updated.
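Roughly, the pattern just described looks like this; a minimal sketch with illustrative names, written with plain C accesses exactly as a programmer might naively put them down:

```c
int x = 0, y = 0;       /* both initially zero */

/* CPU 1: write the data, then set the flag. */
void cpu1(void)
{
        y = 1;          /* the data */
        x = 1;          /* the flag: "y is ready" */
}

/* CPU 2: wait for the flag, then read the data. */
void cpu2(void)
{
        while (x == 0)
                ;       /* spin until CPU 1 has set the flag */
        int r = y;      /* the programmer expects r == 1 here */
}
```

With plain accesses like these, nothing yet tells either the compiler or the processor in which order the two stores, or the two loads, have to become visible.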
And maybe the programmer thinks that if he writes the code like this, he's guaranteed that the execution looks something like this. First, the two CPUs execute their first instruction at the same time: one of them writes to y, the other one loads from x, and because x hasn't been updated yet, it loads the value zero. Then the processors execute the next operation in parallel: one writes to x, the other evaluates the branch condition, which of course fails because it read x equal to zero in the first iteration. Then the second processor starts another iteration of the loop, this time reads one because the value of x has been updated, completes the loop, and finally reads the value of y as one. And maybe programmers think it should always be like this. On some processors, like x86, you will indeed only see this kind of behavior. But on weak memory architectures like ARM, Power and RISC-V, the answer can be quite different. The reason is that the store to y can be reordered with the store to x, and the load from x can be reordered with the load from y, in the sense that the processor speculatively guesses that the value is going to be one, already executes the loads that come after the loop, and then sees an out-of-date value.

The only way to prevent this is by adding so-called barriers. The two most important barriers are store release and load acquire, which impose an ordering constraint both on the processor and on the compiler, to make sure that the stores of the first thread and the loads of the second thread happen in order here. When you add these barriers, you can be sure that you will always read y equal to one and never y equal to zero.

Then people might say: okay, this weak memory model exists, but why do we need a model? Why do we need to think about it in some form of abstraction? We could just do it like Calvin's dad suggests in this awesome comic, where Calvin asks: how do they know the load limit on bridges? And Calvin's dad says: well, they just drive bigger and bigger trucks over the bridge until it breaks, right? And this is how a lot of software is developed. So why don't we just do the same thing for this kind of concurrent software? Of course, that's "works on my machine"-ism, and we can see how that works out with one very concrete example from the Linux kernel.

The code here, which is too small for you to read, is the qspinlock code from 2015. This code was written by complete experts, very, very smart people. If you run this code on x86 processors, it will always work. If you run it on RISC-V processors from 2015, it will always work. If you run it on ARM processors, well, as long as you're not using the big server machines, I think it will always work. But if you run it on PowerPC, it will work most of the time, but sometimes it just hangs. So now we have this extremely complicated concurrent code, with lots of crazy stuff happening in it, and we have to find the hang. And because most programmers don't even have access to this kind of PowerPC machine to see the strange behavior for themselves, they might not even realize that the bug is in the code. As a result, the code was not fixed for three years. This hang was in the Linux kernel for three years, inside the lock itself. So now let's try to debug this code. And I've already done the most important part of the work for you.
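Before walking through it, here is a rough reconstruction of that fragment in kernel-style C; the node layout, the names, and the xchg-based tail update are simplified stand-ins for illustration, not the verbatim qspinlock source:

```c
/*
 * Illustrative reconstruction only -- in the real qspinlock the tail is
 * encoded inside the lock word and the helpers look different.
 */
struct node {
        struct node *next;
        int          locked;
};

struct node *tail;                        /* stand-in for the lock's tail field */

/* Thread 1: joins the queue behind the current tail (thread 2's node n2). */
void lock_slowpath(struct node *n1)
{
        struct node *prev;

        n1->locked = 0;                   /* plain, unmarked initialisation */
        n1->next   = NULL;

        prev = xchg(&tail, n1);           /* atomic exchange: returns n2 */
        WRITE_ONCE(prev->next, n1);       /* tell n2 that we are next in line */

        while (!READ_ONCE(n1->locked))
                ;                         /* wait until the previous owner lets us in */
}

/* Thread 2: the current owner, handing the lock over when it is done. */
void unlock_handover(struct node *n2)
{
        struct node *next;

        while (!(next = READ_ONCE(n2->next)))
                ;                         /* wait for the successor to link itself */

        WRITE_ONCE(next->locked, 1);      /* successor may now enter the critical section */
}
```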
I've cut it down to a very small part of the code where the problem is. And the way we debug code is: we maybe have a watchdog that triggers when the kernel hangs without doing anything, the watchdog creates a core dump, and we look at the core dump and try to guess backwards what happened.

Let's go through the code first. We have two threads here. One of them is the current lock holder, that's thread two, and one of them is trying to acquire the lock, that's thread one. The value of the lock variable is essentially a tail pointer to a queue of the threads that want to enter the critical section, and at the start the last thread in that queue is thread two, so the current tail points to thread two's node N2. What thread one wants to do is insert its own node N1 into that queue, using this atomic exchange, which reads the current value, N2, and stores its own value, N1, into it. Then it goes to the previous owner, N2, and tells it: hey, by the way, the next one to enter the critical section is me, so please, once you're done with the critical section, let me go in. It does this by setting N2's next pointer to its own node, N1. Initially that next pointer is null, and thread two waits until the next pointer is set; then it will see the next pointer N1. Thread one then simply waits until thread two says: okay, I'm done with the critical section, you're good to go. It does that by looking at its own locked bit. As long as the locked bit is zero, it means someone else is currently in the critical section; when the locked bit is set to one, it means: okay, it's my turn to go into the critical section. And that's what thread two does: after it sees the next pointer set to N1, it stores one to the locked bit, and that allows thread one to enter the critical section.

Now, if we look at the core dump, we see a couple of things happened. Thread one, as expected, set the lock value to N1. It set N2's next pointer to itself. Thread two read that and left its code completely, so it already executed the store setting the locked bit to one. But thread one is somehow still stuck in its loop, waiting for the locked bit to become one, and it's somehow zero. And now I can ask the audience: does anyone have an idea where I need to insert a barrier to fix this code? Okay, someone has an idea; you can shout it and I'll try to repeat. Which one? The one on the right? You want to change this part; no, that doesn't help. Actually, where we need to insert the barrier is here: this WRITE_ONCE needs to become a store release. And you can see this is almost impossible to guess, so you made a good effort; I have never had anyone guess correctly where the barrier needs to go. We change this WRITE_ONCE to a store release and the bug disappears, on the Kunpeng server, for example.

But what does that tell us? Did we really fix the bug? Or did we just add some complicated code there that makes it slow enough that it happens to work? Is it correct for all ARM chips, or just the one I tested on? We don't know. And that's why we need a memory model. Maybe the better way for the dad to answer his son's question here is to give an answer like this: the way we know the load limit is that we have a model of the bridge, and we use that model to do some kind of stress analysis and so on.
And what is this kind of model? Well, it's an abstraction of the bridge, and it helps us make predictions. In the case of the bridge, the model abstracts away from intermolecular forces and so on, and allows us to predict what we in the industry call inconvenient user experience. For memory models, we want to abstract away from the concrete gate-level design of the architecture and from all the crazy optimizations GCC can do to memory operations, and we also want to predict some inconvenient user experience.

And because of that, a group of very, very cool people sat together and said: okay, let's make a Linux kernel memory model. Basically, they tried to create a unified abstraction of many different architectures, because Linux has to run pretty much everywhere, and of compiler optimizations that do crazy things to your code. And of course they had to combine two worlds: the mathematical point of view, where the model should be really correct, reasonable and meaningful, and the systems point of view, where the model should actually help us predict real things and not just be abstract nonsense that nobody can use. To make sure that this marriage really works out, they created thousands of test cases, I think over 5,000, that highlight different usage scenarios, and made sure that the model predicts the right thing on all of them.

Now I want to show a little bit of what that model looks like. When I created these slides, I realized I had made a big blunder by trying to cram two weeks of memory model lectures into 10 minutes of a talk. I cut most of it out, but it's still a little bumpy. Let me start by reminding ourselves of the sequential model of execution that most of us probably have in mind for concurrent software. In that model you have the different threads, and each thread has a position in the code where it's currently executing, indicated by this arrow here. Initially they are of course at the start of the code, and then in some order they execute operations. For example, maybe thread one first executes its first operation and sets the locked bit to zero; then it does this exchange-tail operation, which is an interlocked operation consisting of a read that reads the current value and a store that updates the value to its own node, and the two are interlocked, so nothing is allowed to happen between them. Then it writes the next pointer, and then it starts spinning, but of course, because thread two hasn't set the locked bit yet, it's just stuck there. Then thread two leaves its loop, reading the updated value N1 that was written by thread one; thread one maybe spins a little more; then thread two sets the locked bit to one, and thread one can leave. So this is one possible execution, written here as a sequence of memory operations, and there are of course many other possible executions of this code, each with a different sequence. For example, here's a sequence where thread two starts spinning a little earlier and sees a null pointer for a while. Or here's one where thread two starts reading the next pointer immediately, reads the N1 that will only be written later, and then performs its store release; then thread one sets the locked bit to zero, does the exchange, and so on, and in the end it keeps spinning forever.
And I'm happy that many of you have very concerned faces, because what I just said makes no sense: it's not actually possible for this READ_ONCE to read the value N1 that is only written much later. So if we want to give a general definition of what a sequentially consistent execution is, we could say it's a sequence of memory operations with some conditions, because not every sequence makes sense. First, each read has to see the current value; that is the condition that was violated in the nonsensical execution I just showed. Second, the operations of each thread have to follow the program order. And third, interlocked operations are atomic. And now here's some lingo that people in the formal methods community use for this kind of thing: they call these conditions axioms. It's just a word you have to remember; whenever you have these kinds of conditions, they call them axioms.

Knowing this, we can also start defining what an LKMM execution is, what the LKMM considers an execution. The main point is that instead of looking at an execution as a sequence of memory operations, the LKMM looks at an execution as a graph of memory operations; I'll go through an example later. And then there are some axioms as well. For example, each read has to read from a write to the same location, so in this graph you cannot have some read of x reading from a write that wrote to y. And then there are axioms that read like: the happens-before relation is acyclic, the propagates-before relation is acyclic, whatever those mean; we don't know what that means right now, and there are lots of these axioms. And maybe you didn't notice, because I very skillfully hid it between these Egyptian hieroglyphs, but what you have there is not some message written by the aliens that built the pyramids: this is actually part of the definition of the Linux kernel memory model. There you can see they define what these relations mean, using some strange syntax that I'll try to explain very briefly in a second. They say, for example, hb is defined as this and this, and hb is the happens-before relation, and then they say "acyclic hb", which means the happens-before relation is acyclic; that would be the second axiom here.

Now let's try to understand these strange writings. As I said, a graph is the model of an execution, and in the graph we have nodes and edges. The nodes are the memory operations, so here are all the nodes of one particular execution, and the edges between these nodes are relations between the events. For example, there's the int relation; int means internal, and it relates any two events that come from the same thread. In this case, for example, this exchange-tail operation and this load-acquire operation came from the same thread, so there is an edge, an arrow, going from one to the other, and that arrow is labeled int. And of course this exchange-tail operation comes from the same thread as itself, so there's an arrow going to itself, and so on. You can see there are actually a lot of them; it looks a bit like a Christmas tree, and because it looks so confusing, when we draw these graphs we usually don't draw all of these edges, we just keep them in our head. Then there's another one, external, which relates all the events of different threads. And there's something called program order, which is really, really important: program order relates the events of the same thread in the order in which they appear in the instruction stream, which may not be the order in which they actually execute. And here we usually draw only the edge to the directly following event.

The next important class of relations, or edges between these events, are the ones that connect events to the same location, and there are three. The first one is reads-from; it starts from the write that provides the value and goes to the read that reads that value. So here we read the one, which means we read from this store, so we have a reads-from edge like this, and there are lots of them, of course, because there are lots of reads. Then there's coherence, which tells us the order in which writes are overwritten: we have the initial write, which is overwritten like this, which is overwritten like this, so we get all of these arrows. And then we have from-reads, which tells us when a read is reading some more out-of-date value. You don't have to remember all of these right now; all you have to remember is that these orange-reddish ones basically give us the order of the events to the same location, in a kind of logical order.

The next kind of edge that is really relevant for understanding these models are the so-called event types or tags, and they are basically self-loops on memory operations that have a certain kind of property or form. So, for example, this one is a read, so there will be the R tag going in a loop from itself to itself, and there are a couple of these: reads, writes, I won't go through all of them. Another very important one, though, is the Marked tag, which differentiates between accesses that use LKMM primitives, like xchg_tail, WRITE_ONCE, READ_ONCE, and the plain accesses that you write in a normal C program: something like n1->locked = 0 is just a plain access, not marked with any LKMM primitive. This is important because these LKMM primitives, even if they do nothing else, prevent a lot of compiler optimizations and therefore allow the LKMM to be defined very precisely, whereas for the unmarked accesses all kinds of crazy compiler optimizations are possible, and therefore the LKMM makes very few predictions about them. Another important type are the release operations, which provide some kind of ordering with respect to the preceding operations.

Then there's something like a regular-expression language used to define further kinds of edges. For example, there's "&", which defines edges wherever both sub-relations hold; there's the pipe "|", which means one relation or the other; and there's the semicolon ";", which means first one relation and then the other. You don't need to remember any of these in detail; I just wanted to give you an idea of what the memory model definitions look like, so that if you ever want to read the model, or someone talks about it, you have an idea in your head of what these things mean.

And then we can look at the example from before, the hb definition in the LKMM. What it means is: you have a marked access, followed by one of three options: either something called ppo, preserved program order, which covers anything that is ordered by a barrier; or a reads-from edge between different threads; or a complicated thing that we won't talk about today. And the second line here defines one of the axioms, saying that hb has to be acyclic. Let's go through an example of that, a little more slowly.
First, we have a marked access here, because it uses the WRITE_ONCE LKMM primitive and not just a plain access. Then we have a reads-from external edge, because it's a reads-from edge between different threads. Then we have another Marked loop here, because the access that reads from it is also marked, using the READ_ONCE LKMM primitive. Because of that, we have this hb edge, and with the same kind of reasoning we can show that there's an hb edge here and an hb edge here. So we have these three hb edges, and we can follow them and ask: do we ever get back to the starting point? That is what acyclic means: we can never get back to the starting point. The way to understand the axiom is this: if we can ever get back to the starting point by following these hb edges, it means the graph cannot happen, which means that any bug that occurs in this graph cannot happen according to the LKMM. So this kind of axiom really gives us a prediction of the Linux kernel memory model, and all the other axioms can be understood in the same way.

And now we get to the part that I think more people should pay attention to. If we want to use the LKMM, one way to do it is this: we think of an inconvenient scenario, like for example an alien attack, and we construct all the graphs in which this inconvenient scenario would occur, and then we try to find the hb edges in these graphs by manually reading through the huge definitions in the Linux kernel memory model. When we find these edges, we check whether we can ever get back to the same point, and if we can, then we know that this scenario cannot happen in reality. That is the case for all of these tiny examples here: these blue edges are hb edges, and each of them forms a cycle, so all of these graphs cannot happen in reality, and as a conclusion we know the bug can never happen.

Here's the example from the qspinlock. The bad behavior we're worried about is that the thread is stuck forever here, reading zero. So we try to construct all the graphs; we construct everything and end up with this, and the only open question is: what is the order between the two stores to the locked bit? There's one here initializing it to zero, and another one which is supposed to set it to one. We figure out that if the one setting it to one comes later, then we will definitely see that update eventually, and we will not be stuck in this loop forever, because we will eventually see the one. So the only case in which we can be stuck here forever is the one in which the locked bit is overwritten by this initialization store, so this coherence edge has to go in this direction, saying that this is the logical order and this store came last. And now we look and look and look, and we find all the hb edges, and we see: oh, that's quite bad, there's no cycle here whatsoever. I've put all the hb edges in here, and we can see we can never reach the same node again; the longest path goes from here to here, and that's it. So according to the LKMM this behavior can happen, and so we've found our bug.

But that doesn't immediately help us fix it. What does help us fix it is the observation that there's another kind of cycle in here: if we follow the program order here, then we have reads-from, program order and coherence, and we're back at this point. And this kind of cycle is always forbidden by the sequential model. We can use this knowledge, because it means that whenever we have such a cycle, if we add just the right barriers along the events on that cycle, we can make the cycle forbidden in the LKMM as well, and so prevent the bug in the LKMM. In this case we know that either this, this, this or this operation needs to be strengthened with some barrier, and then the bug will disappear. And in this case we already know: if we put the release store here, then the bug will disappear, and that is because we get a new hb edge going from here to here. Why? Well, we need to read the model to know that. And then we get this cycle, and the cycle says: okay, the LKMM says this behavior can no longer happen.

But of course that's completely insane: you have to read this huge model, you have to look at the graphs, manually inspect everything. So what do we do instead? In practice, in the company where I work, we use some electronic helpers to do all of that work for us automatically. There are two main tools that are now quite ready for use on real code: one of them is called GenMC and the other one is called Dartagnan, and I'll try to give a short demo of the Dartagnan tool. I'm running it now directly on the real qspinlock code, not on the tiny example I showed before, using the Linux kernel memory model that is specified here, LKMM, and now it starts running. What it will do is generate all the graphs that could create a hang or some other kind of bug in this qspinlock. Each iteration here is one graph where bad behavior could happen, and each time it finds that the behavior cannot happen, except on the 34th one, where it actually says there is a bug. And this is the current version of qspinlock in the current repository, not the old version. So we know there is some safety violation according to the LKMM on the real qspinlock in the Linux code right now, and you saw it took just a couple of seconds to find, not three years. That's pretty cool. But it just tells us that it fails, right? No, that's not true: it also gives us the graph in which the problem happens. That graph is quite huge, but if you have the time you can go through it and figure out where the problem is, or you can use another tool to do that automatically.

So what happened here? We have the current qspinlock code, and according to the tools there's a bug in it. There are basically three possibilities. Either there's really a bug in the qspinlock code; luckily we have analyzed it, and on the hardware models, if you compile it down, there is no problem, so that issue we can clear up. The second possibility is that the tool implements the model incorrectly, and even though the model says there's no problem, the tool has a bug and falsely reports one; we can also rule this out. Or the model may be a little too conservative, in order to protect people from doing stupid things, and really does say that there's a bug here, and this is the case. So depending on where the issue is, you have to adjust either the model, or the tool, or maybe the code, which is the most likely thing to happen in this case.
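As a small aside on what these mechanical checks look like at the smallest scale: the LKMM comes with thousands of little litmus tests (the 5,000-plus test cases mentioned earlier), written in a C-like format. The message-passing pattern from the start of the talk, with the release and acquire barriers added, would look roughly like this; the test name and variable names are illustrative:

```c
C MP+release+acquire

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*y, 1);            /* data */
	smp_store_release(x, 1);      /* flag */
}

P1(int *x, int *y)
{
	int r0;
	int r1;

	r0 = smp_load_acquire(x);     /* flag */
	r1 = READ_ONCE(*y);           /* data */
}

exists (1:r0=1 /\ 1:r1=0)
```

A simulator such as herd7, which the LKMM definition is written for, can then report whether the "exists" clause is ever reachable; for this test it should answer never, and it becomes reachable again if the release and acquire are weakened to plain accesses. (In the kernel's tools/memory-model directory, the invocation is something like: herd7 -conf linux-kernel.cfg MP+release+acquire.litmus.)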
And that brings me towards the end of the talk; I just want to show a few more interesting examples of what can happen according to the Linux kernel memory model. The first one is what I call a relativistic frame of reference. We have here this code with four threads: the first one writes to x, the last one writes to y, and the two in the middle read from x and y. Now we will look at one particular execution, where CPU 2 reads the update from CPU 1 and then executes this rmb, a read memory barrier, which provides ordering between these reads; so we expect that these reads are definitely executed in the order they're written in the program, and we actually get this hb edge. Then CPU 2 executes its READ_ONCE of y, but it reads the old value of y: it reads zero, from the initial store, not the more up-to-date one. So the logical order here is that CPU 2's READ_ONCE of y occurred before CPU 4's store, and CPU 4's store in turn logically occurred before CPU 3's READ_ONCE, because that one sees the value one. Then CPU 3 executes a read memory barrier, again ensuring that the order between its two reads is exactly as written in the program. And then we can ask: what do we know about CPU 3's READ_ONCE of x in relation to the WRITE_ONCE of x? To summarize: logically, the WRITE_ONCE of x happened before this read; this read definitely happened before this read, because of the read memory barrier; this read of y logically happened before CPU 4's write, because it read the old value; CPU 4's write logically happened before CPU 3's read of y, which in turn happened before CPU 3's other read, because of the read memory barrier. So shouldn't that mean that we will definitely see the up-to-date value of x here? And that's not the case; yes, I see people shaking their heads, very nice. It is not the case: according to the LKMM, and on actual hardware, it is possible for the READ_ONCE of x to still see the outdated value. The way to make sense of this, from my point of view, is to say that each CPU has a different view of the order of the writes to x and y: CPU 2's view is that the write to x happened before the write to y, and CPU 3's view is that the write to y happened before the write to x. So it's a little confusing in the beginning, but that's memory models for you.
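For reference, the scenario just described looks roughly like this in the same litmus style (again with illustrative names); under the LKMM the listed outcome is allowed, and it only becomes forbidden if the two smp_rmb() barriers are strengthened to full smp_mb() barriers:

```c
C IRIW+rmb

{}

P0(int *x)
{
	WRITE_ONCE(*x, 1);
}

P1(int *x, int *y)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*x);
	smp_rmb();
	r1 = READ_ONCE(*y);
}

P2(int *x, int *y)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*y);
	smp_rmb();
	r1 = READ_ONCE(*x);
}

P3(int *y)
{
	WRITE_ONCE(*y, 1);
}

exists (1:r0=1 /\ 1:r1=0 /\ 2:r0=1 /\ 2:r1=0)
```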
Okay, now the next interesting example can be explained with the help of Futurama, which is another very great show. In that show there's a young, handsome guy called Fry, and he travels back in time and meets a young lady, and they have a son together, and that son meets another young lady, and they also have a son, who happens to be Fry. So, in short, Fry happens to be his own grandfather. And the same kind of thing can actually happen in the LKMM. Here's the example; it's a little complicated, so we'll just focus on the core part in the middle for now. There we basically have CPU 1 copying a value from a into b, and CPU 2 copying a value from b into a. One way to see this is that the b value stored by CPU 1 will somehow be 42, and it travels back in time and meets a beautiful read here, and that read has a store using its value, and that store gets read by another read, which has another store using its value, which is 42. So we have this value b equal 42, which appears nowhere in the program, just being generated by this store to b, copied over and over, and eventually ending up back at itself, which is how the store of b equal 42 came to be in the first place.

One way to see why this can happen is that we're using unmarked accesses here, and as I mentioned before, unmarked accesses in the LKMM have very, very few guarantees. In fact, what the compiler is allowed to do to this b = a assignment is to insert a junk store b = 42 right before it. Why would the compiler do that? In reality there's no good reason, but in theory you could imagine that it wants to prefetch the cache line of b in exclusive mode, so it stores some junk value in order to parallelize the load from a with fetching the cache line of b; then, when the load of a is done, it can immediately store to that cache line because it already holds it in exclusive mode. The reason that doesn't happen in practice is that we have prefetch instructions that do this for us without having to store, but in theory, on an architecture without such an instruction, something like this could happen. So you can see that on unmarked accesses the compiler can really introduce some crazy things and make crazy things happen. The takeaway, for basically everyone in the audience, is: don't try to be smart by using plain stores when you have racy accesses, because there are already LKMM primitives which are marked, which guarantee that no such thing happens, and which, if everything goes right, generate exactly the same code as a plain copy. A plain copy that the compiler optimizes in some crazy way, which would be completely okay for sequential code, will, for racy concurrent code, just bring you a world of trouble.
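As a minimal illustration of that takeaway, assuming a and b are shared, racy variables:

```c
/* Racy copy between threads -- a plain access lets the compiler invent,
 * tear, or repeat the load and the store: */
b = a;

/* The marked LKMM primitives forbid that, and in the common case they
 * generate the same single load and single store: */
WRITE_ONCE(b, READ_ONCE(a));
```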
Okay, and that brings me to the conclusion of the talk. As I mentioned, the LKMM is a graph-based abstraction that predicts whether bad things can happen. Modern platforms have lots of crazy behaviors that we cannot easily predict manually; they are hard to predict even for experts. So the right way to write any kind of concurrent code that doesn't just use locks is to use the tools, because they are now powerful enough to run on real code like the qspinlock. That's it, thank you so much for your attention. I'm very happy to take any questions. Yes, there's a microphone there.

All right. A couple of years ago I read a paper that said that everything weaker than sequentially consistent is in principle undecidable; perhaps I read it wrong. So my question is: is this model theoretically completely sound, or is it just a very best-effort approach that should work well in practice?

Okay, so your question is that since everything weaker than sequential consistency is undecidable, whatever that means exactly... Proving that the program is correct? Yes, the notion in that paper was that if you want to prove that a program is correct, and the program runs on a weak memory model, then in principle it's an undecidable problem, like the halting problem.

So let me try to repeat your question for the online audience and then give an answer from my point of view. Your question is: since sequential consistency is the only model under which we can decide certain things, without clarifying what those certain things are, are these weak memory models really sound, or are they just best effort? The thing is, I think there are several levels of abstraction being mixed here. One of them is: what exactly is the problem that is undecidable? We already know that program verification in general is undecidable, even under sequential consistency; the halting problem and so on. Then there's a simpler problem: if you look only at finite-state programs instead of full Turing machines, then for sequential consistency you get decidability, and indeed for many weak memory models you don't. But that is luckily not the problem we have to solve, because we don't have to look at arbitrary finite-state programs or arbitrary Turing machine programs; we have to look at very practical programs. And even though in general it may be impossible to decide the correctness of certain programs on weak memory models, or even under sequential consistency, for practical programs, and for the practical correctness properties we care about, we can very well decide it. You could see that we ran the tool on the qspinlock and it decided within about 20 seconds that there's a bug.

Now the next question is whether the weak memory models themselves are a kind of best-effort thing or not, and the answer is that we can actually do weak memory model proofs all the way down to the architecture. I have done this very recently: some proofs at the gate-level construction, for example for the BOOM open-source load-store unit, which implements the weak memory model of the RISC-V architecture in the Berkeley Out-of-Order Machine. You can prove that that processor correctly implements the weak memory model, which means that the weak memory model written in the manual is sound, at least for that processor. And then you can build a kind of tower: we have the architecture level, and one level on top is the LKMM, and we have the mapping from the LKMM to the lower level, using some compiler intrinsics and some hand-crafted assembly code, and we can prove that that level is also sound, so the compiler mappings are correct. Actually, in the LKMM, Viktor Vafeiadis recently found an unsoundness, but only one, in the mapping to Power, and you can fix it and then prove that the fix is sound. So the model itself is definitely not just a best-effort kind of thing. The tools, I would say, are a best-effort kind of thing, but usually in practice the best effort is good enough. I hope that answered your question. Any more questions? Yes, please.

I'd like my question to be simpler, but I'm not sure it's possible to answer it. I was working for a company producing ARMv8 chips, and they had some internal benchmarks showing that the weak memory model is in some cases worth the effort. But there are very few strong arguments, backed by numbers, that show that the whole issue is really worth the effort. What's your opinion on that?

This is a really, really great question. It is a great pain, right, for programmers and for me; we had the same issue, like here with the Linux kernel, where we found bugs in years-old code. So your question is: are weak memory models really worth the effort in terms of performance? There are two parts to the answer. One of them is that we have some concurrent accesses in the code, like these WRITE_ONCE, READ_ONCE, smp_store_release and so on, but they actually make up a really tiny fraction of the code we run; 99% of the time you run plain accesses in some kind of local logic. And that part is really what the weak memory model is for: the weak memory model isn't really about those few concurrent accesses, it's what lets us execute all the other accesses out of order, and there are real performance benefits from that. The company I work for also produces ARM chips, and we also have benchmarks showing that there are benefits, especially if you have long pipelines and huge out-of-order speculation windows; for that kind of local code you get a benefit.

Now the other thing is that we have done some previous work on automatically choosing the optimal barriers for this less-than-1% of concurrent accesses; that was the VSync work we published some years ago. What we found is that even though we could get really nice performance benefits in some micro-benchmarks that really stress-test a concurrent data structure, if you put it in a real product, then you're optimizing less than 1% of the runtime and it doesn't matter anymore. So my feeling now is that in reality weak memory models are very important for optimizing all the plain accesses, but the barriers you choose for the marked accesses almost never matter, and there you can almost always just go with sequentially consistent accesses. And that brings us back to the previous question: if you really only use sequentially consistent accesses for those parts, then you get something called the DRF-SC theorem, which says that your whole program will behave as if it were sequentially consistent, and then in theory you could use sequential-consistency tools to verify it. But we have not really found a good need for that in our company.

Any more questions? All right, thanks so much; I hope I didn't put everyone to sleep.