Right, so I'm Will Deacon, I work for ARM, and I'm here to talk about memory barriers in Linux. I'm really surprised how many people showed up, because people don't like talking about this stuff that much. So thanks for turning up, and try to stay for the whole talk.

Memory ordering is a really complicated topic, and I'm not an expert on it; I'm just interested in it, and I'm forced to work with it. It's lovely because it's different for every architecture, and within a given architecture you can implement different levels of it: you can be architecturally compliant by being stricter than what you're allowed to do. Most people don't get it, because it's really hard, and if you get it wrong you get subtle, non-repeatable software bugs, which are the worst kind. But, particularly on ARM, it's a key contributor to overall system performance, so if you get it right, good. I'm going to focus on ARMv7 in this talk, which is 32-bit; there's some new stuff in v8, which you can ask me about in the questions if we have time. This is all from a software perspective - I'm not a hardware guy. There's one in the audience, so I might deflect questions to him if they come up. And the ARM ARM remains authoritative, because this is very much informal: I'm just trying to help people understand this, at least as far as I understand it.

So we start from the academic point of view, which is this thing called sequential consistency that some of you may have heard of. Leslie Lamport - another guy you might have heard of - defined this before I was born: some stuff about multiprocessors and sequential consistency. I'm not going to go through the definition, you can read the slides; it's much easier done with a picture. In this picture you've got your program up here, and I've split it into three chunks: chunk A, chunk B, and chunk C. They're parts of your program, and they run consecutively, in what we call program order. Now, for sequential consistency, you split that up into three parts: one on processor zero, one on processor one, one on processor two. These actually run in parallel, despite the fact that I've staggered them - I just wanted to show where they came from. They run in parallel, and sequential consistency says there's an equivalent sequential execution of those instructions which gives the same result, okay? You can think of it like this: if you've got chunks running in parallel, you get some arbitrary interleaving, which you'd expect. If you write a multi-threaded program and spawn two threads, you think: well, they're running at the same time, and if I haven't got explicit synchronization, I get some weird interleaving. But within, say, processor one, within chunk B, you want those instructions to execute in order. You don't want that processor to start executing things out of order, because they're all on the same processor. Sequential consistency gives you that guarantee.

And the hardware guys hate it, because for years on uniprocessor systems they played all sorts of tricks on the software guys which we couldn't detect - we didn't know they were doing this to us. They did out-of-order execution, speculation. You've got store buffers, so you can hit in your own store buffer, or you can bypass the store buffer, so your reads and writes go out of order. You can't detect any of this until you throw another processor in there, which can sit back and go: you're doing all this out of order - what do you think you're doing?
And once you add that second processor, software can detect this. But the hardware guys aren't going to fix it for us, because that would be back to square one with memory latency, unless you throw a lot of hardware resources at the problem. All of this was done to try and hide the memory latency, because as processors get faster, memory doesn't keep up. So what we do instead is define a memory consistency model for each architecture, which says: well, we're not quite sequentially consistent - or, on some architectures, we're nowhere near sequentially consistent - but here's what we do. We define what we're allowed to do, basically relaxing things from program order.

I've got a simple example here. Whenever you see any of these examples, it's always initially A = B = 0 - you always have that, but I wrote it in anyway. So things are zero to start with. You've got two processors, P0 and P1. P0 does A = 2 and then B = 1; P1 does C = B and then D = A. The interesting thing is what P1 sees, and I've enumerated some of the results there. For example, if P1 runs first - going back to the interleaving idea - P1 could do C, D and then P0 could do A, B, and you get 0, 0: you've read those locations before they've been written, fine. But look at the last case: C == 1, D == 0. To get that result, one of those guys, or both of them, has to execute its instructions backwards. That can happen on ARM, and it's not sequentially consistent - all the other outcomes are. I've put the orderings you'd need on the right. The issue with the last one is that you've done D before C, and that's not allowed under sequential consistency, because it's not program order on that processor, okay? Things like pthreads and Java give you, I think, something approximating sequential consistency, so normally you don't need to worry about this. But in the kernel, you do.

So how do you get around it? The architecture offers you these things called safety nets - fences, barriers, whatever you want to call them - and you can use them to enforce ordering when you need it, because most of the time you don't care about ordering; it's only in specific cases. And as well as barrier instructions, there are defined dependencies. For example, if you load from the same address twice, those loads happen in order - not the case on Itanium, I think, which is really weird. So there are dependencies, and then there are different types of memory, which I don't have time to talk about, but come and find me or shoot me an email.
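If you want to play with that A/B example, here's roughly what it looks like as a C program with two threads. This is my sketch, not code from the slides; whether you can actually observe the non-sequentially-consistent c == 1, d == 0 outcome depends on the hardware, and you'd normally run it in a loop to catch it.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared locations, initially zero, as on the slide. */
    volatile int A, B;
    int c, d;

    static void *p0(void *arg)
    {
            A = 2;          /* these two stores may be reordered by the CPU... */
            B = 1;
            return NULL;
    }

    static void *p1(void *arg)
    {
            c = B;          /* ...and so may these two loads */
            d = A;
            return NULL;
    }

    int main(void)
    {
            pthread_t t0, t1;

            pthread_create(&t0, NULL, p0, NULL);
            pthread_create(&t1, NULL, p1, NULL);
            pthread_join(t0, NULL);
            pthread_join(t1, NULL);

            /* c == 1 && d == 0 is the outcome no interleaving can explain. */
            printf("c=%d d=%d\n", c, d);
            return 0;
    }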
So, with that out of the way, I'll talk a bit about what ARM does, then a bit about what Linux does, and then some stuff I wrote which makes all this harder to use but more performant. To talk about ARM, we need to talk about observers. An observer is something that can master memory: somebody who can read from memory or write to memory. It's not a slave device - if you're writing to a slave interface on a peripheral, that slave does not count as an observer. And a CPU actually contains multiple observers: the instruction fetch, the D-side, the table walker - they're all separate observers. Each observer sits in something called a shareability domain. This is ARM terminology, right? So we'll do a bit of that, and then I'll show you some pictures, because it's much easier with pictures, but we have to do this bit first.

We have - whoa, how many does that one do? - four shareability domains in ARM: non-shareable, inner-shareable, outer-shareable, and full system. You can use shareability domains to limit the scope of things like cache maintenance, but they're also a fundamental part of how your system is integrated. You can't change these domains: someone gives you a system and says, this is how the domains are laid out; you have these observers, and they sit in these domains. You can have multiple instances of each kind. And although the layout is system-specific - defined by your SoC, as I said - there are architectural and Linux expectations. This is a quote from the ARM ARM: ARMv7 is written with the expectation that all the processors under the same OS are in the same inner-shareable shareability domain. We really need that, because when we do cache maintenance, we broadcast it to the inner-shareable domain, okay? So you could build something where that's not the case, but you wouldn't be able to run a single Linux instance on it.

I've got some pictures to show this. Here we've got four processors - A, B, C, and D - I've drawn some memory, and there's a DMA controller at the bottom, so that's two, three, four, five observers, each in its own non-shareable domain. Then I did something wacky - please don't build this - I drew two inner-shareable domains, because I basically want to show you what an outer-shareable domain is. It's a bit convoluted, but it makes the point. So here we have two inner-shareable domains: A and B in one, C and D in the other. You couldn't run a quad-core Linux on this, right? You could run one instance of Linux there and another one there, or some other OS, and maybe they could use message passing or something, but it's two separate domains. The reason I did that is so I could put them both in the same outer-shareable domain, to draw the distinction. Maybe some day people will start using outer-shareable a bit more, but at the moment it isn't really used. And then finally you have the system domain, which wraps everything up - the whole thing, whatever's left.

So we've got observers, and the reason we defined observers and their shareability domains is so I can throw this waffle from the ARM ARM at you, which I'm not going to recite. I'll try and explain it, because it's actually fairly intuitive - it doesn't look it, but it is. First of all, we have to think about a write: what does it mean to observe somebody else's write? When an observer does a write, another observer goes: aha, I'll observe that write. It means that if the guy who observed the write does a read, it gets the new value; and if it does a write itself, it overwrites the write it just observed. It's actually quite intuitive - it's basically some kind of total order. Essentially: I can read it back, or I can overwrite it. Those are the two halves of it. The bit that's a little harder to get is that you can also observe reads. That took me a while to get my head around, because a read doesn't really have a side effect, at least not on normal memory. So what does it mean to observe a read? It means that if I do a subsequent write to that same location, I'm not going to affect the value that you read.
Whereas if I can still get a write in there - maybe your read is held back somewhere, and I can get my write into some buffer so that your read returns my value - then I haven't yet observed your read, because I can still change the result of it.

Nearly done with the terminology. We then have global observability and completion. A normal memory access is globally observed for a shareability domain when it has been observed by all the observers in that domain. So in our previous example, with the inner-shareable domain containing A and B, something is globally observed in there once both A and B have observed it. Completion brings in things like table walks, but for the purposes of this talk you can probably treat global observability and completion as the same kind of thing. There are rules for maintenance operations as well, which I don't really have time to talk about.

Now we're on to the diagrams; hopefully these help a bit. This isn't supposed to be a circuit diagram - it's not topology, it's not micro-architecture - it's just a silly diagram I came up with that I think is helpful for describing this. We have four processors - observers, masters, whatever - A, B, C, and D. They each have a read and a write channel, and it's all one inner-shareable domain. There are these three things which are kind of like buffers, but I don't want to say the word "buffer" because, again, it's just a silly picture. What could happen is: CPU A issues a read and a write - the little A just means it came from CPU A; it's got nothing to do with the address, and everything here is to a different address - and maybe that read overtakes that write, okay? And when the write hits that buffer, C will have observed the write, because if C does a read it gets the value, and if C does a write it overwrites the value. And this write from D here could overtake the write from B, okay? So hopefully you can see where I'm going with this: the dots can all overtake each other.

They can't overtake each other if there's a dependency between them - or at least, a specific type of dependency. So I'm going to describe three dependencies and the effects they have. This is good, because you can use dependencies instead of memory barriers, and they tend to be cheaper. The first one is the address dependency, which is where the value returned by a read is used to compute the address of a subsequent access. Makes sense. That's like your OO language: you load a base pointer for an object, and then you load relative to that pointer to get a member out of the object. There's an address dependency there, so the two loads have to happen in order. As long as the creation of the object was ordered, you don't need a barrier on the read side - unless you're Alpha. We're not Alpha, so that's okay. Then there are control dependencies, which is where you do a read, there's a conditional on the result of that read, and then maybe there's an access after the conditional, okay? That's basically an if statement: you load something, check it, and then go and do another access. And then there's this last one, data. It is actually in the ARM ARM, but it's not called a data dependency there - it's just a bullet point - so I call it a data dependency. That's where you read a value and then write it somewhere else, maybe after doing some arithmetic on it.
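As a rough C rendering of those three kinds of dependency (my sketch - the variable and type names are made up, and the assembly versions are on the next slide):

    struct obj { int member; };

    int flag, in, out;

    void examples(struct obj *ptr)
    {
            /* Address dependency: the value loaded from 'ptr' feeds the
               address of the next load, so the loads happen in order
               (except on Alpha). */
            struct obj *p = ptr;
            int m = p->member;

            /* Control dependency: a load feeds a conditional branch. This
               is NOT enough to order a load after the branch - the CPU can
               speculate it early. */
            if (flag)
                    m = out;

            /* Data dependency: a loaded value, plus some arithmetic, feeds
               a store. The load and the store happen in order. */
            out = in + 5;
            (void)m;
    }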
There are a few other rules: we don't speculate stores, and, again, reads to the same address are ordered - unlike on Alpha, which I mentioned a minute ago.

So here are examples of all three. This is ARM assembly - I've kind of assumed you have some familiarity with it, but it's not too complicated. Address: here's your address, you load it, mask out some bits, and then load based on the result. That's guaranteed to be ordered. Control: we load, do a compare, conditionally manipulate a value, and then do a load - that is NOT guaranteed to be ordered. The second load can be speculated before the first one, so a control dependency is not enough; if you're relying on control dependencies to order your loads, it won't work. I've worked on cores that didn't speculate, but that won't hold any more. Data: you load, add 5 to the value, and store it back. That's guaranteed to be ordered.

So when you can't use dependencies, you have to use memory barriers. We have three types on ARM: ISB, DMB, and DSB. I'm not really going to talk much about ISB - it can be subtle, but it's not the most interesting one. We'll talk about DMB and DSB, mainly because they take these options, and the work I've done is based on the options. A DMB ensures the ordering of memory accesses; a DSB ensures the completion of memory accesses. If you stick a DMB in the instruction stream, the instruction stream can continue to be processed - we basically insert a flag and make sure things can't overtake each other. With a DSB we stall: we have to wait for everything to finish before we can proceed with any instruction. I'll show you some pictures in a minute.

The option specifies the shareability domain: non-shareable, inner-shareable, outer-shareable, or full system, with their three-letter mnemonics. Our architect was reading NSH as "the NHS" the other day - he wanted to get rid of the NHS. That's the health system, right? You can also give an access type, which is stores: it's either all accesses or stores only, and if you give no option at all, you get full system, all accesses. Then you mix these together, so you get things like ISHST for inner-shareable, stores only, which maybe is a little hard to get your head around. I think gas even accepts some of these the other way round, and sometimes there's different syntax for them too, so you can write some really unpronounceable things. But these are the ones we'll stick to.

So there's one up there, B1: a DMB ISHST, and we'll come to it in a second. Say B writes data = 42. You can see B0 there: it's a write from B, it's travelling along, great. Next - oh, it's a barrier, and it's drawn as a sort of squashed rectangle, which on these slides means a DMB. So, DMB ISHST: it's on the write side, it's stores only, it's inner-shareable, and this is the inner-shareable domain. So there it sits. The second write comes through, and it is not allowed to overtake: they have to remain ordered, and that's the key part. As long as that write can't overtake that barrier, good - and B can continue processing in the meantime. Here the two have actually switched round - a bit of an exaggeration, but it's because they're leaving the inner-shareable domain; I just wanted to emphasize that the barrier was inner-shareable only.

So let's go back to where we just were. We've issued our data = 42, we've got our DMB ISHST, and we've got our flag = valid. So that can't overtake: B2 cannot overtake B0.
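In code, that publish sequence is roughly the following (a sketch using GCC inline asm, which only builds for ARMv7; "data" and "flag" are the names from the slide):

    int data, flag;

    void publish(void)
    {
            data = 42;              /* B0: the payload */

            /* Store-store barrier, inner-shareable domain: the flag write
               below cannot overtake the data write above. */
            __asm__ volatile("dmb ishst" ::: "memory");

            flag = 1;               /* B2: flag = valid */
    }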
Now we do a DSB ISHST, and the dreaded egg timer comes out, right? We see this diamond, which on these slides means a DSB, and it means processor B is now stuck: it can't do anything. The next instruction is an SEV, a send-event. That's not a memory access, but B can't execute it until all the pending stores have been observed in the inner-shareable domain. So it's still stuck... now they've been observed, so it can go and do the SEV. I hope you all followed that. You can see that a DSB is much more expensive, because it stalls the CPU.

We also use barriers for maintenance. I didn't have this slide originally, but I figured it might be useful if you go back and look at the slides later. I don't have time to talk about it, so we'll stick to memory accesses, but for things like cache maintenance you need barriers to ensure completion. PowerPC has separate barriers for that, I believe.

So that's the basic ARM barriers out of the way. What does Linux do? Linux has loads of barriers, actually. I've bracketed - there are lots of brackets on the slides, it's not too obvious - the ones we don't need on ARM. So we don't need read_barrier_depends(), and we don't need mmiowb(), but nobody uses that one anyway. We have a compiler barrier, which doesn't expand to any code at all; it just tells GCC not to reorder things, because GCC can reorder your instructions as well. We have the mandatory barriers: mb(), memory barrier; rmb(), read memory barrier; and wmb(), write memory barrier. Mandatory because they always expand to the relevant barrier instruction, so you can use them for IO. And then there are the SMP conditional barriers, which are basically just for ordering between CPUs: if you haven't got an SMP kernel, they compile away. They have read and write flavours too. On ARM, as you've probably noticed by now, we don't actually have a read-only barrier, so we have mb() and wmb(), and rmb() is just the same as mb(). And there are implicit barriers in some of our constructs. I really recommend you go and read that file. You need to read it about 60 times before you think you understand it, but it's still worth a look: it describes a lot of good stuff, but it's very low-level, and it doesn't have any nice pictures with read and write channels in different colours.

For ARM - the code under arch/arm - we implement the Linux barriers using the low-level barriers. Like I just said, the SMP barriers go to DMB, rmb() goes to DSB, wmb() goes to DSB, et cetera. And there are low-level barrier macros: if you want a DMB in some ARM code, you call dmb() like a function and you get a DMB; call dsb() and you get a DSB. But there's a problem here, which you might have spotted: we don't use any of the options. Everything is all accesses, full system, which is the worst kind - it means we end up waiting for a lot more than we need to. We kind of got away with it, because the CPUs didn't actually do anything with the options either, so it didn't really matter. Until now, when it turns out the hints can be quite useful, and I'll show you some hackbench results from changes I made that made a difference. I think we should code to the architecture; then people are more likely to implement the options in hardware too, because if they look at Linux and see it isn't even using this stuff, what's the point in implementing it? It works both ways.
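As an aside, since I've just thrown mb(), wmb(), rmb() and the smp_* variants at you: the classic way the SMP barriers get paired up is something like this. This is a sketch in kernel-ish C - the structure is made up, and it's the generic publish/consume idiom rather than code from any particular driver.

    #include <linux/kernel.h>

    struct msg {
            int data;
            int ready;
    };

    static struct msg m;

    static void producer(void)              /* runs on one CPU */
    {
            m.data = 42;
            smp_wmb();      /* order the data write before the ready write */
            m.ready = 1;
    }

    static int consumer(void)               /* runs on another CPU */
    {
            while (!ACCESS_ONCE(m.ready))
                    cpu_relax();
            smp_rmb();      /* order the ready read before the data read */
            return m.data;
    }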
So for 3.12, you can now write code as monstrous as this: you can call dsb() and pass in nshst - not as a string, just a token - and it gives you a DSB NSHST. This is some code I added to the TLB flush path; I haven't talked about maintenance, so it won't make much sense, but you can see there's a non-shareable, stores-only DSB, and later a non-shareable, all-accesses one. Now, you might be thinking: I don't really know which barrier to use, which silly option to pass. It's not too bad, because the high-level macros will do the right thing for you - and they're inner-shareable now as well. You do need a recent binutils, though. I got a lot of flak on the list, because if your binutils is more than three years old you can't build a v7 kernel. But we don't care - tough, just upgrade. Or patch it out. It's your loss.

Here's an example: the spin unlock code. The spin unlock code is a bit special, because it uses the low-level barriers - it's a special hack, it's not using the high-level barriers - so you have to go and fix this one manually, which I've done. Here's the code, in the weird kernel dialect of C. Before this point you've got your critical section, so we need an smp_mb() to make sure no accesses leak out of the critical section: you don't want to unlock and then find some of your accesses happen afterwards, because that's not safe. It's a barrier for both reads and writes, because that's what you do to your shared data. Then we release the lock - we use ticket locks, so you just increment the owner ticket, which is a 16-bit field. And then we wake up the waiting CPUs. That means you need a DSB, to make sure the unlock becomes visible right now, before we go and wake anybody up; otherwise we might wake a CPU which looks, can't see the unlock, and goes back to sleep.

For 3.11, that expands to this: a DMB SY; the load, increment, store back; a full DSB, all accesses; then an SEV. The load and the store are ordered anyway, because there's a data dependency between them. And that's pretty rubbish code, because the first barrier could be inner-shareable, and the second could be inner-shareable and stores only - given the data dependency between the two accesses, that's sufficient. So for 3.12, we get that code instead, and it made hackbench go 5% faster on my development board. Now, I've picked the example which made the huge difference, right - otherwise this presentation would have sucked. I changed lots of other things which didn't make any measurable difference, but this was a cool one. It's a very hot path: unlocking a spinlock is pure overhead - you're not in the critical section any more, you just need to release the thing - but considering hackbench is a userspace benchmark, albeit one that spends most of its time in the kernel, I think that's a big difference.
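For reference, the shape of that 3.12 unlock path is roughly this - my sketch with the barriers written out as inline asm, not the actual arch/arm source, and with a simplified lock structure:

    struct ticket_lock {
            unsigned short owner;
            unsigned short next;
    };

    static void ticket_unlock(struct ticket_lock *lock)
    {
            /* smp_mb(): keep the critical section's accesses behind us. */
            __asm__ volatile("dmb ish" ::: "memory");

            lock->owner++;          /* release: bump the 16-bit owner ticket */

            /* The unlock must be observable before we wake anyone up;
               stores only, inner-shareable is sufficient here. */
            __asm__ volatile("dsb ishst" ::: "memory");
            __asm__ volatile("sev");        /* wake the waiting CPUs */
    }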
So now you're all going to go away and... sorry, Paul? "Can I ask a question - why weren't they all inner-shareable to begin with? Not in this path, but in general, why were they all full system?" In general... well, you know my comment about binutils? binutils didn't used to accept the options, so I guess whoever wrote the original code went: oh, great, binutils won't let me do it. And then, weirdly, I think they implemented one option first - it was outer-shareable or something weird like that - so you still couldn't use anything useful. So it's basically just legacy. And also, I suppose, partly lack of understanding: people didn't want to worry about this. It was bad enough having to go through and add the barriers in the first place, right, without having to worry about the options too.

So, before you all go away and turn everything into non-shareable, stores-only DMBs, or just delete the barriers altogether because you think dependencies are enough, let me give you some horror stories about IO. For IO you pretty much always need the strong barriers, and if you get it wrong it's horrible, because you get random DMA corruption, which sucks.

So here we have DMA to a device. Up at the top, the dotted line is supposed to be an inner-shareable cluster of CPUs, with the read and write channels coming down, and over here we have a DMA controller - let's say it's in the system domain. The DMA controller is drawn in two halves, because the first half is the slave interface, and that's not an observer, as I said. You write to the DMA controller to say "please do a DMA" - which is what we're doing on line A2, in this kind of weird ARM assembly I made up - and that causes the master half to go and perform accesses. I've drawn it as a single line because it's maybe some slow slave bus; it's probably strictly ordered, everything just gets funnelled down in order. So: you store some data to memory, then you store to the controller to say "please go and read what I just wrote". You want a barrier in there, because you don't want the write to the controller to happen before the data has been written to memory, okay? So, does anyone have any idea which barrier you'd want?

Okay. The first thing you might think is: I need ST, which is correct, because it's two stores and we're ordering stores. So say we try a DMB ST. We can end up in this scenario: our first store to memory gets to here - past this buffer-cum-bridge-cum-whatever I've decided it is - and it's on its way to memory, fine. But the DMA master hasn't observed it yet. And we've got our DMB ST, and we've got our write to the control register. The interesting thing is that the barrier forks off down both paths - well, I've drawn it being propagated down both, though you wouldn't need that on the ordered interconnect. And the reason all this is allowed, the reason the DMB doesn't have to block anything, is that the DMA master can't observe A2 - can't see it - and neither can the DMA controller, because its slave half is not an observer. So now you've got a race: A0 gets held up in a buffer for some reason, A2 reaches the controller, and the master goes and reads from memory before your write has got there.

So what you do instead is a DSB ST, which I've handily drawn here. (In practice: just use wmb() whenever you're doing IO writes like this, and it will do the right thing.) With the DSB we have to stall the CPU, because we have to wait for A0 to be observable in the full system domain, which includes the DMA master. So we stay stuck until the master is able to observe it, which means the write has to propagate all the way down to memory before we can issue - whoops, I don't have a slide for it - before we can issue the write to the controller. So the DSB works.
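In driver terms, the safe version of that kick sequence comes out as something like this. A sketch in kernel C; the register offset and the "go" value are invented for illustration, and a plain writel() would give you the barrier implicitly.

    #include <linux/io.h>

    #define DMA_GO_OFFSET   0x10    /* hypothetical doorbell register */

    static void start_dma(void __iomem *regs, u32 *buf)
    {
            buf[0] = 0xdeadbeef;    /* A0: store the data to memory */

            wmb();                  /* DSB ST on ARM: wait until the store
                                       is observable by the DMA master,
                                       not merely ordered */

            /* A2: kick the controller. */
            writel_relaxed(1, regs + DMA_GO_OFFSET);
    }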
This one's even worse, I think. This is when you want to read back from the device - you're doing DMA from the device. There's a bit more code in this case. We load from the controller, and on line A1 we compare: is the DMA done yet? If it's not done, we go and load from the controller again - we're just polling it. At some point it says: yes, I'm done. Then we have a barrier, and then we load from memory to see the new data. Now, if you remember, a control dependency does not enforce ordering of loads, so the control dependency here does nothing for us: the second load, from memory, can be speculated. And the only observers involved here, you might think, are the CPUs; it's loads, so stores-only is no use, we need the all-accesses type. So let's try a DMB - surely this time it'll work. And you know what's coming.

So here's the load from the controller, and the CPU goes: hey, let's do some speculation. It doesn't have the loaded value back yet, so it can't resolve the compare, but it can make a guess and predict the branch: I reckon we're not going around the loop again - I think the DMA is done. Okay, next there's a DMB - well, we can speculatively execute past that too, because as long as the speculative accesses stay in order, that's fine. So we issue A4 as well. Now, this is starting to look pretty similar to what we had before, which we know doesn't work. And, what a surprise, we're screwed: we've got the load from the controller outstanding, we've got a load from memory in flight, and the DMA is just about to finish its write. You can get a horrible interleaving where you read stale data from memory, the DMA's write completes, the controller decides "yes, I'm done", and then the status load hits the controller. It takes some unlucky timing for that to happen, but if you're doing DMA like this - and I know many people are, on many devices - you'll hit this problem, and the end of your DMA buffer will be corrupt.

There's also the other case, where the guess was wrong: we do all that, we read the wrong value, but then the controller says "no, it wasn't done". Then we discard everything, because we take the backwards branch, do another load, and the speculation gets thrown away. So the CPU is fine in that case; it's only the case where the DMA is just finishing that bites you. So the solution? Yes: it's a DSB, which is what the rmb() macro gives you. Know your barriers when you're doing this kind of stuff. The DSB is enough, because the CPU has to wait for the read from the controller to come back before it does anything else.
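That polling sequence, written as a driver would write it, looks roughly like this. Again a sketch in kernel C; the status register offset and the done bit are invented.

    #include <linux/io.h>

    #define DMA_STATUS_OFFSET       0x14    /* hypothetical status register */
    #define DMA_DONE                0x1

    static u32 read_dma_result(void __iomem *regs, u32 *buf)
    {
            /* A0/A1: poll the controller until it says the DMA is done. */
            while (!(readl_relaxed(regs + DMA_STATUS_OFFSET) & DMA_DONE))
                    cpu_relax();

            rmb();          /* DSB on ARM: the control dependency above is
                               NOT enough - without this, the load below can
                               be speculated before the DMA write lands */

            return buf[0];  /* A4: now it's safe to read the new data */
    }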
So now I've probably confused you to the point where you're thinking: well, which one should I use? It was bad enough when there was just a handful of barriers, and now you're giving me these unpronounceable things that sort of merge with the barriers we already have - ugh, which one? It's not too bad. I've shown you the really low-level stuff just so you know it's there: if you're writing really low-level code, that's the terminology to use, and you can always email the lists. Have a think about whether you can use a restricted variant. But in general, first think: do I need a barrier at all? Maybe dependencies are enough. A lot of the time you want barriers is when you're publishing to, and consuming from, other observers. As I said earlier, you could be initialising an object and then publishing it - a nice, simple version of the problem which you can think about and then apply to lots of other things. You need a barrier on the writer side, between the two writes, and on the read side you've got an address dependency.

If you only care about CPUs, use the SMP versions. If you only care about reads or about writes, use the specific read or write version. The low-level barriers I've been describing are rarely needed: really you only need them for non-shareable, outer-shareable, and maintenance, because the higher-level versions do the inner-shareable ones for you. Non-shareable we use in one, maybe two places in Linux - it's not that useful for us. Outer-shareable we don't use at all. And the maintenance code I've already fixed, so unless you're adding more maintenance, you should be okay.

One interesting thing is the IO accessors. If you go back to the DMA slides, you need these horrible barriers - wmb()s and rmb()s everywhere. Actually, if you use the MMIO accessors readl and writel, they already contain barriers, because Torvalds got cross and insisted that every architecture has to look like x86, which orders device accesses. The downside is that performance sucks on ARM and Power. So we thought: okay, we'll make relaxed versions. So you have readl_relaxed and writel_relaxed, which don't have the barriers. The problem is that Alpha, Itanium, PowerPC, MIPS, I think, also added relaxed accessors, and they all do subtly different things. So if you want to write a driver that works on more than one of these architectures, and you want some performance, it's going to behave ever so slightly differently on each of them. I'm really trying to get that fixed, but no one's interested apart from Ben H, the PowerPC guy - though I think between us we can deflect enough abuse that we might get something fixed.

So, that's it really - any questions? There's cake, because it's my birthday. There might be a microphone somewhere; I don't know if it works. Yeah, okay, if anyone wants to ask questions... Can everyone hear Thomas or not? That's all right, you can ask a stupid question into the mic. "Since we need to read the memory barriers document 60 times, can we have your talk two times in a row? Can you start again?" I could do the slides out of order. "Oh, and by the way, happy birthday." Cheers. The slides are available - actually, the version I published is in a slightly different order.

A question from Wolfram: "Do you have experience of the difference writel and writel_relaxed make in certain kinds of drivers? I recently had the issue that I accepted a patch which used writel_relaxed, and later I got a follow-up patch switching it to writel, because the driver was also used on PowerPC. So do you have a rough guideline for when it pays off to use writel_relaxed, and when to stay safe with writel?" Yeah, sure. For the moment - and I really don't want to say this, but I think I have to be fair - the relaxed accessors aren't even implemented on all arches; there's no asm-generic relaxed accessor. So if you see someone adding a relaxed accessor to a driver, your best bet, and the easiest thing to do, is to ask: is this driver used by more than one architecture? If it is, push back. That's the simple solution.
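To make the trade-off concrete, the kind of pattern where the relaxed accessors pay off looks something like this. A sketch in kernel C; the device and its register offsets are invented.

    #include <linux/io.h>

    static void program_device(void __iomem *regs, u32 dma_addr, u32 len)
    {
            /* These writes stay ordered with respect to each other (same
               device), but not against normal memory - fine here, because
               we're only programming registers. */
            writel_relaxed(dma_addr, regs + 0x00);
            writel_relaxed(len,      regs + 0x04);
            writel_relaxed(0,        regs + 0x08);

            /* The final kick uses plain writel(): its implicit barrier
               orders any DMA buffer contents before the doorbell write. */
            writel(1, regs + 0x0c);
    }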
I'm trying to fix this. I plan to find, first of all, someone from x86 who cares enough about this stuff to respond to my email; then I have to find someone who cares about Alpha and get them to agree to the change; then make the change across all those architectures. At that point you'll be able to use the relaxed variants with a well-defined semantic, which will hopefully be documented under Documentation. At the moment there are two different definitions of "relaxed" in there, and one of them is completely insane and the other one's really vague.

"For now I tend to be conservative, and think that writel_relaxed should be reserved for a really hot path, because even if the driver's not used on PowerPC now, it might be in the future. Until this is sorted out, that seems the safer way to do it. Is being conservative okay - do you agree?" By conservative, do you mean accepting or rejecting the patch? "Only accepting writel_relaxed in hot paths, where I can really see that performance is critical." Again, it's tricky, because my last look at the x86 code was that the relaxed version doesn't even have a compiler barrier. So if you have two writel_relaxed calls to different addresses, GCC could swap them round on x86 - that was my last reading of the code, it may have changed - and that's different from every other architecture, which keeps accesses to the same device, at different addresses, in order. That per-device ordering is normally what you want; but if, meanwhile, you're populating some DMA buffer, like in my examples, relaxed accesses wouldn't be ordered against that. So there's a distinction: we want to maintain ordering to the device with a relaxed accessor, but we don't care about ordering with respect to random normal cacheable buffers, because normally you don't - you just want the device accesses in order with each other. And you might not even care about that, but I think it's something you usually want. For that, the relaxed accessor would be enough on everything apart from x86, from what I can tell. So it's really your call. I'd say that if the driver is going to run on more than one architecture, then at least question the use of it, because people might not realize - they might have just added it, found it went faster, and not checked that they didn't need the ordering.

Do you want to pass the mic forward to Paul? "Thank you. For the DMA-from-device slide - I wonder if you could bring that one up. I think that, at least for many SoCs, it might actually..." Which one do you want? "The last DMA-from-device one, I think. Yeah, this one. It might actually also need an MMIO barrier in terms of the interconnect - like, it might need a read-back from that DMA controller. It would on OMAP, I think." It depends on what you want from it. We do have - not in our rmb(), but in our wmb() - things like poking the outer cache controller on A9, because barriers aren't propagated there, things like that. That's the reason, actually, to try and use the accessors, I think. "But this is even in terms of the interconnect." That's beyond the scope of what the ARM architecture has control over, basically. How do you do DMA on OMAP? "Slowly."
"I'm not sure about this particular case, but I know that for many of our device drivers we just used __raw_readl() and __raw_writel(), and what we often had to do, particularly for interrupt handling, was read-backs to avoid spurious IRQs and things like that." That's one of the things with the ARM architecture: we have the architecture, but people can bolt on other things which behave weirdly, or differently from what you'd expect. This is a good way of thinking about the problem, though - we could, for example, take what you just described and add it to this diagram somehow. I think it's bad enough as it is. Question at the back? Oh, okay, you come forward.

"For your DMA examples: if you use coherent DMA, which only handles the caches, you have to enforce the ordering anyway, using these barriers. If you use the streaming DMA API, and you call dma_sync_single_for_device() and so on, does that include these barriers, or do you have to add them anyway?" Do you remember... I'm just trying to have a think. The streaming DMA API will do the right thing. "So you don't need the barriers?" No - if you're using the streaming API... I mean, nobody writes DMA in assembly for a start, so you wouldn't literally write what's on the slide. You'd use the streaming DMA API, and within the DMA API we do the right thing. I think it's good to have an appreciation of how it works underneath, though - particularly if anyone here, after this talk, is thinking: you know what I'm going to do? I'm going to go back and hack this all to be non-shareable. Please don't. It's like that for a reason, okay?

No more questions? Well, if you do think of anything, or you want to have a look at the slides, they're on the web, and you can email me or come on the list and shout. Thank you very much.