I think we're about ready to get started. So it's nearly 4 o'clock and we're stuck in the dungeon. So we're going to talk about memory ordering. I might fall asleep in my own talk. If I do, come kick me or something and we'll see if we can get to the end of it. Who am I? Apparently, when I present, I do this. So if I start doing this, hopefully I'm not asleep. I co-maintain a bunch of stuff: the arm64 architecture with Catalin, and then some other bits and pieces including the atomics in Linux, the locking interfaces, the memory model, and recently TLB invalidation as well. So it's kind of grotty concurrent stuff. I do all that in the open source software group, where I have a close working relationship with the architecture and technology group, who produce all the sort of whiz-bang features that get implemented in CPUs. And as part of the relationship I've got with them, I helped put together the Armv8 memory model for the Arm architecture, and the formalism of that as well. And if that wasn't enough, I also do the C++ memory model. Well, that's a big committee thing; I have a minor role there. And last time I spoke at ELC, which was here, five years ago for me, I also spoke about memory ordering. So I'm a little bit of a stuck record, perhaps a one-trick pony, but I've modified it a bit to talk about I/O ordering this time, which is gonna be about things like DMA and stuff like that.

So to set the scene a little bit, what's my idea of paradise? This place looks pretty good: a tropical desert island. But there's one thing you can do to improve this tropical desert island immensely, and that's to make it a uniprocessor tropical desert island. And it's also an Alpha, so it's kind of even better. Don't tell my employer. That's about as good as my Photoshop skills get; the shadow is not quite the right angle, but it's great, I'll take it. And the reason I like this so much is because the complexity and the burden that I spend all of my working hours worrying about just goes away. There's no peripherals either, right? The grim reality is that we've crammed thousands of these poor CPUs, with no natural light, all on a network in these air-conditioned warehouses, and the island dream is gone. It does not exist. I really wish it did. But this is what we're dealing with: loads of instances of Linux, and within each instance of Linux you've got coherency and SMP all over the place. So we have to deal with concurrency; we can't have what we wanted.

So even with a single coherent shared memory, which is basically what you expect when you're writing concurrent code for CPUs (GPU people are a bit more out there with what they have to deal with), you have one copy of a variable and you're modifying that mutable state. Even in that case, concurrency is really difficult. One reason it's difficult is that you can't write your program and then reason about it executing in steps, because, as I'll show you later, you can have outcomes which don't correspond to a stepwise execution of the program; your memory accesses can be reordered, for example. And when your program goes wrong, which is probably rare, but often enough that you get told off or you crash something, you put some instrumentation in and it starts working, which is the worst kind of bug, a Heisenbug.
So you strive for this balance between performance and correctness, because you could put a whopping great big lock around your program and it'll work, but it's probably not fast enough. And when you go to look at tools to reason about or to validate your code, there's not really a concurrent GDB, in the sense that you can just connect a debugger and it says, oh, your race is here. There are things coming along, but it's not the same interactive, instant-response kind of stuff that you're used to with a single processor. So the CPU's basically not doing what you asked it to do. Can it get worse than this? Well, of course it can, otherwise you wouldn't have a talk. It gets much worse than this, but we'll go through this bit first, otherwise we're all gonna just give up. And I forgot when I did this talk that I can't just go straight to I/O ordering; we're gonna have to do memory ordering first, and so I've had to cram it into five minutes, but let's see how we do. The end of the talk is examples, and if we don't get through all of those, it's fine. If you get stuck here, please stop me and I'll do my best.

So here's an example called store buffering, and we'll see why it's called that in a minute. There are two CPUs and this is sort of kernel-code-ish. We've got two shared variables, X and Y; they're both zero in memory initially, and then we've got two local register variables, foo and bar. CPU 0 writes one to X and then it reads Y into foo, and CPU 1 does sort of the opposite: it writes one to Y and then reads X into bar. And the million dollar question is: what are the permissible values for foo and bar? Because you could run this program many times, right? And depending on which CPU goes a bit quicker than the other, or the order in which things propagate, you may get different outcomes. So does anyone wanna have a guess at a permissible outcome? All of them. Yes, congratulations. That's why that's a good question, right? Because whatever you said, I could have said yes. And all production architectures will permit the perhaps counter-intuitive result here, which is that foo and bar can both be zero.

And I can show you that if you want; we've got a little bit of time. Ignore this; this is cryptic gobbledygook from academics. So there's the test in x86 assembly. You can see there's only four instructions, and with this memory model toolkit I can actually run that on my laptop a million times. Okay, and for some reason, whoop, whoop, whoop, let's go back to that. For some reason it prints out CPU info at the end, so we just scroll up past all of that; it's always the same. But hey, okay, so now you can see what we saw. We ran it 10 million times and there are the four outcomes, and we did actually see all four on my laptop just then, while I was running it. So yes, you were correct, and my laptop agrees with you. Let's try and get back to the slides. There we are, good. So now you believe me, I'm not making it up. Anyone got an idea of how? The clue's in the name of the test. Well done, so it's store buffering. The thing is that these writes of one basically sit in the local store buffers, and the local store buffers are not snooped by other CPUs. So when the other CPU does its corresponding read, that variable has not yet been updated, in the sense that it's not been made visible to other people. And that looks a bit like the read and the write get reordered on each core.
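(For reference, here's the store-buffering shape written out as kernel-style C. This is my sketch, not the code from the slide, and the names are made up.)

    /* Store buffering (SB): x and y are shared, both initially zero. */
    static int x, y;

    static void cpu0(int *foo)
    {
            WRITE_ONCE(x, 1);
            *foo = READ_ONCE(y);    /* can be satisfied before the store
                                     * has left CPU 0's store buffer */
    }

    static void cpu1(int *bar)
    {
            WRITE_ONCE(y, 1);
            *bar = READ_ONCE(x);
    }

    /*
     * foo == 0 && bar == 0 is a permitted outcome on all production
     * architectures. Forbidding it needs a full barrier, smp_mb(),
     * between the store and the load on both CPUs.
     */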
And the reason this is counter-intuitive, at least to some people, is because there's this memory model called sequential consistency, which says that your program has to run stepwise, so that the concurrent execution of the program looks like an interleaving of the threads. I've labeled the four statements here A, B, C and D, and what you're allowed to do, basically, is come up with interleavings as long as program order is preserved. So A always has to be before B, and C always has to be before D, and there are some example interleavings. And that's easy to reason about, quote unquote. There are lots of concurrency modeling toolkits which are built upon SC. And the big problem with SC, which is sequential consistency, is that it forbids the zero-zero outcome in the previous example. So it gives you this concurrent toolkit and a memory model that's quote unquote easy to reason about, but it's just not applicable to the real world, because nobody builds SC machines, or at least there are no SC architectures out there, and you saw that on my laptop. So that's a problem.

Some people have realized this is an issue, and they've come up with ways to reason about weak memory behaviour, that is, non-SC behaviours, i.e. the zero-zero case that we've been talking about. And one way you can talk about those is with these very cryptic things called litmus tests, which also have cryptic names. So I'm going to explain the structure of the litmus test so you get an idea of what they are. This one is called MP+popl+po. Ignore that bit; we just call it MP. That stands for message passing. Okay. Ignore this next line. It's AArch64 assembly, and this part here between the curly braces is setting up our shared memory. So everything is zero at the start of time; that's just an implicit assumption of a litmus test. And we have two variables, X and Y. It says on CPU 0 (that's what that "0:" is) register X1 is X, and what that actually means is it's a pointer to X, but the syntax here doesn't have the ampersand. So there's a pointer to X in register X1, and there's a pointer to Y in register X3. And over here, on CPU 1, X1 has a pointer to Y and X3 has a pointer to X. So we've got our registers, and they've got pointers to the shared variables.

And then you have the test, the program, here. You've got P0, processor zero, and P1, processor one. They execute concurrently, at the same time, and they run these instructions. So this guy moves one into this register, W0, and then stores it to [X1], which holds the pointer to X. So actually, if you draw this, it looks a bit like this. Thread zero, which is P0 here, does write X = 1, then write-release Y = 1 (because this is a special kind of order-inducing store). And this guy over here reads Y and then reads X here. And the constraint at the bottom constrains the values that these registers can hold. X0 is the same as W0 for the purposes of this example; it's just the 64-bit view of that register. And with that knowledge, with that exists clause, you can create these reads-from arrows and this thing called a from-read arrow, which I don't have time to go into. But what you've got is: you write X, you write Y; this guy reads Y; is it allowed to read X equals zero? So if you think of this as data and flag: you write the data, you set the flag; this guy sees the updated flag. Does it see the old data? That's kind of a weak memory question which you could ask, and you can run that through a tool and it will tell you the answer.
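(In kernel-style C, that message-passing shape looks roughly like this. A sketch of mine, not the litmus test itself.)

    /* Message passing (MP): data and flag, both initially zero. */
    static int data, flag;

    static void writer(void)
    {
            WRITE_ONCE(data, 1);            /* write the data...           */
            smp_store_release(&flag, 1);    /* ...then set the flag with an
                                             * order-inducing store        */
    }

    static void reader(void)
    {
            int f = READ_ONCE(flag);        /* plain load, as in the test */
            int d = READ_ONCE(data);

            /*
             * f == 1 && d == 0 is the cycle in question: seeing the new
             * flag but the old data. The release alone doesn't forbid it;
             * the reader would need smp_load_acquire(&flag), or a barrier
             * or dependency, between the two loads.
             */
    }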
The thing to remember, and this is basically the takeaway that I want you to have, is that you've got a cycle here, and cycles are bad, and whether or not you need to worry about a cycle is determined by the memory model. And that's kind of memory models in five minutes. And that's the easy case. Don't worry, I don't think I've got any more of these in this talk; I haven't checked, but I don't think I have.

So we can take this beyond shared memory communication. Most of the research out there for concurrency is dealing with sequential consistency, there's a whole bunch of people doing weak memory consistency, and I don't really know of anything going on in this next area, which is unfortunate for an operating system. Not all communication between observers (threads, CPUs) is via explicit access to shared memory. So imagine this case here, which is a slight change from the previous one. We write X = 1, and then instead of writing to Y, we write to an interrupt controller, which triggers an interrupt on thread one, and then the handler reads the old X. Okay, now there's no cycle here, so classical memory model logic says, well, that's fine. Linux may not like this. It would probably be quite bad. So we need a way of talking about these kinds of tests, which doesn't really exist. And it's not just interrupts; there's a whole bunch of other ways you can do it. It could be DMA from a peripheral, it could be updates to a page table. You can even have weird things like regulators, where you power on a regulator and then try to access the thing that's been powered on; you would need to make sure that happens in the right order. And most people go, that's all out of scope, no one does this. Of course, Linux does this all the time. But it's difficult, and it's already hard enough, so I think people don't wanna go that extra mile.

So what I want you to think about is that we can generalize the idea of shared coherent memory just a little bit, so that inter-processor communication is considered as accesses to endpoints. And I should say, this is something I've come up with for this talk; there are other ways you can look at this, this is just how I think about it. So: an access is an event targeting a specific endpoint, which may cause it to change state. Why do I say it may cause it to change state? Because a read probably doesn't, for memory. And an endpoint is a piece of hardware with mutable state which can respond to accesses, and then maybe it can also generate other accesses, like accessing a DMA engine which then accesses memory. Here are some example endpoints.

For us, for Linux, we don't quite need this generalization as far as I'm explaining it here. We really just need to care about memory and MMIO. And when I say MMIO, I mean __iomem endpoints; stuff that you get back from ioremap() is what we're considering. So it's that, and it's memory. And we'll just consider all accesses to be load/store operations via the appropriate accessors in Linux, so things like readl() and writel(), which I'm gonna talk more about. There are also peripherals that have perhaps funny system register interfaces, but we're not gonna go into that either. So we're limiting the scope a bit for this talk, but the same kind of idea applies.
And then once you've got that in your head, we need to distinguish one other thing, which is ordering versus completion. The way I like to think of this is: ordering requires that two accesses to the same endpoint remain in order on their way to that endpoint. So in this funny picture here, you've got CPU 0, which has done access A and then access B, and these are heading towards this endpoint here. In this case, perhaps they're ordered, and that means that everyone is always going to see them in order. As they propagate down here towards the endpoint, B is not allowed to overtake A; they have to remain in order. And that might be because you put a barrier between them, or it might be because of the memory type, or whatever. And CPU 1 will see them both in order because of that, because nothing holds anything up, right? It just punts them out and they propagate in order.

Completion, on the other hand, requires a prior access to reach a certain point before a later access can be initiated. For reads, we say that reads complete when they have their data, so they appear to complete at the endpoint. If you're reading from memory or a peripheral, you can't satisfy the read until it's completed; it's going to go somewhere, it's going to pick up its value, and then we say it's completed. And only once that read is complete can we do something else. Writes, though, can actually be buffered, even merged, and they can complete early: for example, a posted write. So here, if A was a write, it might complete at this buffer, and now we can do B. So in this case, we're saying we have to complete A and then we can do B. Completion sort of implies ordering, but you can also use completion to achieve the effects of ordering to different endpoints. And that's what we're going to talk about for the Linux I/O accessors.

So yeah, let's go on to the API. One of the big problems with I/O ordering is that it really is a melting pot of lots of different memory models. You might have the CPU memory model, which is perhaps the architecture memory model, and then that might interface to an interconnect which has an internal memory model which isn't programmer-visible, and then someone has to bridge that to another bus like PCI, which does have a programmer-visible memory model. And because all of these things have their own memory model, and a lot of it is at the mercy of the hardware integrator to get right, it can be really complicated. And it means you can build systems that are broken. You could, in theory, integrate a system where you cannot interact with the PCI memory model from a software point of view, because the thing in the middle just doesn't play ball. So Linux kind of has to assume some basic sanity here, and yeah, correct bridging is crucial.

So the two things we really need to consider are DMA buffers, which are allocated via dma_alloc_coherent() or mapped using the streaming DMA API, and MMIO regions. Linux has that DMA interface and it makes an assumption about the coherence of devices: a device is either DMA-coherent or it's not. We don't have a middle ground where, oh, it's coherent for these types of accesses up to this point. It's coherent or it's not coherent. And MMIO regions are mapped using ioremap(). That requires aligned accesses when you make them, and it gives you some access guarantees; it guarantees, for example, that accesses won't be speculated.
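(To make the two cases concrete, here's a sketch of the interfaces involved; dev, res and the sizes here are placeholders of mine.)

    #include <linux/dma-mapping.h>
    #include <linux/io.h>

    /* A coherent DMA buffer: CPU and device agree on the contents. */
    dma_addr_t dma;
    void *ring = dma_alloc_coherent(dev, SZ_4K, &dma, GFP_KERNEL);

    /* A streaming mapping: ownership passes between CPU and device. */
    dma_addr_t buf = dma_map_single(dev, data, len, DMA_TO_DEVICE);

    /* An MMIO region: only touch it via the accessors discussed below. */
    void __iomem *regs = ioremap(res->start, resource_size(res));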
There are some funny versions of ioremap(), like ioremap_wc(), which is weaker, and ioremap_nocache(), which is stronger. The semantics of these are pretty vague; they're very driven by x86. If you're gonna start using ioremap_wc() or ioremap_nocache() in your driver, watch out, because there's a good chance you won't have portable correctness, or rather, you might not be correct across different architectures in those cases. But we're gonna just concentrate on plain ioremap() for this.

So here are the default I/O accessors. Any of you who've written a kernel driver have probably used at least some of these. You've got your void __iomem pointer back from ioremap() and you want to dereference it. Well, you can't just blindly dereference it, because everything will go wrong; the compiler won't realize what it is, and who knows what will happen. You might break the machine, or at least crash. So you use a special accessor. There are these inX/outX ones, so inb() for reading; that's a legacy x86 port I/O access instruction, but we have an API for it on other architectures. It will be a memory-mapped thing under the hood, but that's what they are. We also have readX and writeX, so readl(), writel(), things like that, for accessing explicitly mapped MMIO. And then you can use ioread32(), iowrite32(), those kinds of things, which expand to the correct accessor based on what the device is under the hood.

And these are the default accessors. They pretty much always do what you want, but they can be quite expensive, and I'll talk about that in a minute. They're little-endian by default, and they're ordered against other accesses to the same endpoint. So if you do two writel()s to a given MMIO region that you ioremapped, they will be ordered with respect to that I/O region, which is often a desirable property. You can push writes by doing a read-back: if you writel() and then readl() back, the writel() is gonna have to reach the endpoint before your read can come back. And here's the bit which makes it really, really expensive on non-x86: the write accessor initiates only after completing prior memory writes. If you remember what completion means, that means the memory writes that come before the write accessor have to go all the way out to memory and be visible. Depending on your coherency, that could be a different point, but you've gotta push everything out; you're at least gonna have to drain your store buffer. And only then can you initiate the write access to the device; you can't do it earlier. And similarly for reads: the read accessor completes before initiating later memory reads, and it's also ordered with respect to later delay loops as well. Not currently on arm64, actually; I'm fixing that, which was a patch that came out of writing this presentation.

The reason it's designed like this is because if you're interacting with a device that does DMA, it kind of makes sense, right? You want to transmit some data, so you write into the DMA buffer and then you write to the device saying, please DMA my buffer, the stuff I just wrote to memory. Well, your writel() is gonna give you that guarantee, because it's ordered with respect to those memory writes; it completes them, I should say. You can do some crazy stuff with spinlocks using this; just try not to do it, it's very complicated. On non-x86, this is really expensive, I keep saying that.
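(That transmit pattern, sketched; the buffer, register offset and flag names here are invented.)

    /* Fill the DMA buffer with ordinary memory writes. */
    memcpy(txbuf, data, len);

    /*
     * Kick the device. writel() completes the prior memory writes
     * before the MMIO write is initiated, so the device cannot see
     * the doorbell before the buffer contents are visible to it.
     */
    writel(TX_KICK, regs + DOORBELL);

    /* Optionally force the posted write out by reading it back. */
    readl(regs + DOORBELL);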
So what do we do in the cases where you don't care about ordering against memory accesses and you don't want to pay this price? Well, you can chill out: we've got relaxed accessors. They're actually quite heavily used, because that performance burden is so high. We don't have them for everything; we have them for the readl()/writel()-style accessors, so writel_relaxed() and readl_relaxed(), which are just for those accesses, and also for the string variants, the ones with the trailing 's' or '_rep'. The idea of the string accessors is for when you're reading from a memory-mapped FIFO, something like that, where you don't actually care about memory accesses at all, it's just endpoint accesses, so you don't need ordering against memory there, and you don't need completion guarantees. So these don't provide any completion guarantees, but what they do provide is that they remain ordered to the same endpoint, and people forget that. I think there are quite a lot of cases where you're not worried about DMA, but you want to make sure your accesses arrive at the endpoint in order; that's still guaranteed for the relaxed accessors. In hindsight, maybe "relaxed" wasn't such a good name, because it really makes them sound like you can't rely on anything. You can: you've still got that ordering to the same endpoint. Practically, they probably also work with spinlocks, but yeah, if your machine crashes, don't blame me.

So, mandatory barriers. If you need to be explicit about the ordering and the fences that you want, you can use the mandatory barriers, and you can even use them in conjunction with the relaxed accessors. I don't know why you'd do that; maybe it's useful sometimes. We've got three of those, and they're a bit like the smp_* barriers without the prefix: mb() completes prior reads and writes before it initiates later reads and writes, and this is for all the different endpoints and memory. Then rmb() is just reads against reads, and wmb() is writes against writes. They're fairly straightforward. If you ever add one of these, please put a comment there, because it's impossible to read code that just has these willy-nilly; you just can't remove them later. If you use them in conjunction with the relaxed accessors, for example, writel() basically behaves like a wmb() followed by a writel_relaxed(), and that's pretty much how we implement it on arm64, and that wmb() is expensive. And here's just another example: if you wanted to order, or complete, a write before initiating a read, you'd have to use an mb(), because writel() doesn't do that; that's the only way you can have write-to-read ordering here. But if you're doing normal DMA with the same device, you generally don't need these; you can just use the default accessors. Good.

So another type of ordering mechanism we have: the DMA barriers. Now, these are actually quite a lot different. They're only intended to be used with dma_alloc_coherent() allocations, and they only provide ordering for memory accesses. And they're much, much, much cheaper than all of the other fences. A common use for these, in fact the only use I've really seen, is where you have a coherent descriptor ring, and maybe a device is doing DMA of descriptors which have a payload and a valid flag, and you wanna read the valid flag and then read the payload. You can use, for example, a dma_rmb() to order those reads. They have no effect on __iomem accesses, and they're relatively cheap. So yeah, if you're writing the read side of a coherent DMA ring, just use dma_rmb(), don't use rmb(); dma_rmb() is much, much quicker, and I think it's quicker on x86 as well.
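(The read side of such a ring, sketched; the descriptor layout here is made up.)

    struct desc {
            u32 flags;      /* device sets DESC_VALID last */
            u32 len;
            u8  payload[];
    };

    static bool poll_one(struct desc *d)    /* d is in coherent DMA memory */
    {
            if (!(READ_ONCE(d->flags) & DESC_VALID))
                    return false;

            /*
             * Order the valid check against the payload reads. This is
             * coherent memory only, so dma_rmb() is sufficient and much
             * cheaper than rmb().
             */
            dma_rmb();

            process(d->payload, d->len);
            return true;
    }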
So let's go through some examples. I went through the kernel source code to try and find examples of uses of these. The slides are available if you ever wanna go back and have a look. And actually, in doing this, I found bugs in code I'd written. So I don't know what that says about the API, but at least I've now documented it and I have patches.

So here's an example; this was one case where I actually got it right. My favourite driver, because I wrote it: submitting a command to the SMMU. The SMMU is Arm's IOMMU, and it has a memory-mapped queue. So we've got this queue right here, and we're basically just copying from something in memory to a DMA buffer; this is a dma_alloc_coherent() memory buffer. So we copy the stuff in, which is a bunch of stores, then we do this little bit of maths because it's a queue, and then a writel() to update the memory-mapped register (it's probably here) to say, hey, there's some new stuff in the queue. So you need to make sure that what you wrote to the queue in memory is completed before you write to the device, which is why you need a writel(). So that's a classic example of where you need to use writel().

So here's another example. I don't know much about networking, and this didn't all fit on a slide, so I had to trim things down a bit, but we have this Ethernet driver here, reading RX data. There's a function here called mvreg_read(), which is basically just a readl(); I think it might be #defined to readl(). So just treat that as a readl(); I ran out of slide. We call this mvneta_rxq_busy_desc_num_get(), which is here. So that's our readl() to find out how much stuff is there. How much stuff is there in my receive queue? How many descriptors do I have? And once we know, we basically get something out of there, and then we need to do a memcpy here. And this memcpy has to be ordered, because it's reading from the queue, with respect to when we find out how many descriptors there are. So we've got to complete these guys. Now, you might think, well, hang on a minute, there's sort of a dependency here in some sense, because this rx_todo forms part of the condition of the while loop, so maybe that's enough to give us ordering. It's actually not, because the CPU can, in theory, speculate these loads here in the memcpy, so you could still have an issue. So you do need to use readl() there; you can't use readl_relaxed(). It's like the opposite of the previous example: we're doing a read of the register and then a read of memory.

So I've just given you two examples where you should never use the relaxed accessors, which is not what I want. Let's move on to the relaxed accessors. Here is a driver; again, I've had to trim it down a bit to keep it on the slide. Basically what it's doing is setting up the parameters for a DMA in MMIO registers. So there's a whole bunch of stuff; it's a display driver, which is another thing I know nothing about. But it's got this thing which needs to be started, and then an address and a pitch, and then there's this GMC setting; that's the thing that actually says, right, go and do the DMA using the parameters I just gave you. For the parameters themselves, we don't care; we can use relaxed accessors here, because we just need to make sure they're ordered with respect to this guy, and if you remember, relaxed accessors are ordered to the same endpoint. But the final write, the one which triggers the DMA, needs to be ordered with respect to prior memory writes. Or it needs to complete them, I should say; I keep getting the terminology wrong.
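(The shape of it, sketched with invented register names.)

    /*
     * Program the transfer parameters. Relaxed accessors are fine
     * here: they stay ordered against each other and against the
     * final write, because they all target the same endpoint.
     */
    writel_relaxed(lower_32_bits(addr), mmio + REG_DMA_ADDR);
    writel_relaxed(pitch, mmio + REG_DMA_PITCH);

    /*
     * The write that triggers the DMA must also complete any prior
     * writes to the memory that the device is about to read, so it
     * has to be a full writel().
     */
    writel(DMA_GO, mmio + REG_DMA_CTRL);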
People tend to get this wrong, and you'll see wmb()s littered all over the place, which is probably worse than just using writel() for everything, because it makes the code more confusing. You don't need wmb()s in this case. This is perfectly fine; this will work. Okay.

You might not recognize this file, because it's not in mainline. Unfortunately, it is in the Ubuntu kernel package, because they apparently ship a special kernel for this SoC, which is suboptimal. And this file brings up the L2 and initializes a snoop control unit, which is questionable whether Linux should be doing at all. So take a deep breath, because this is a marvel of code. Yeah. I'm sure these are all documented and we could probably just figure out where the bug is. But there is a bug here, and it's actually quite a cryptic bug. And I do need to fix Linux as well; that's another thing, the arm64 port doesn't quite get this right either. The issue is that we're doing a writel_relaxed() to apparently de-assert this reset with these magic values. Then there's an mb(), and we have to wait for 54 microseconds and then turn this thing on. And presumably the delay is critical, otherwise this somehow doesn't take effect; I have no idea. But the problem is, this write can complete early, as I showed in those diagrams earlier on. There's no guarantee here that it has actually had an effect; it might be buffered. So during this udelay() here, the CPU can wait 54 microseconds and the device might still not have seen the write. And then this second write could go and join it in the buffer, and then the device can see them both at the same time. So there's no guarantee that the device is gonna see the delay here. Anyone got an idea how you would fix this? Exactly. So you put a read-back here, which is gonna force that guy to come out and go all the way to the endpoint; then you can do your wait, and then you can do this. I'm gonna fix it so that this will also work on arm64; it currently does work on Power and it does work on x86. And it seems that this is actually quite a common pattern: people do an access to MMIO and then spin for a bit. But like I said, don't worry, it's not in mainline, so I'm sure this is done by some reliable firmware instead.
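(For that broken reset sequence, the fix looks something like this sketch; the register names are invented.)

    writel_relaxed(magic, l2_base + RESET_CTRL);    /* de-assert reset */

    /*
     * Read back from the device: the read cannot return its data until
     * the posted write above has actually reached the endpoint, so the
     * delay below now really starts after the reset is de-asserted.
     */
    readl(l2_base + RESET_CTRL);

    udelay(54);

    writel_relaxed(enable, l2_base + POWER_CTRL);   /* now turn it on */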
So, another thing: use of the DMA barriers. This one is great, because they put a comment in. So whoever wrote this code, thank you very much; this was a joy to read, and I found it very easily. It's in an InfiniBand driver, in a tasklet, polling this notification queue, and there's a budget, basically, of how many things we're gonna try and pull out of this queue. And there are two bits to it. We grab something out of the queue and check, is it valid? If it's not valid, we just skip, we're done. And if it is valid, we go, oh great, it was valid, so now we can go and actually access the stuff in the buffer, the payload part, I should say. And we need to make sure that we don't speculatively load the payload before the thing becomes valid and then think that we saw a valid payload. We need ordering between checking the valid flag and reading the payload, and you can do that with a dma_rmb(), and this is a good example of that.

So this is another one of my favourite drivers. Of course, I'm lying. This is smc911x.c, and if you haven't encountered this driver, you haven't lived. So this has some nice wrappers for these functions. We've got a receive and a send, which for some reason have different styles of naming, but they expand to, well, they use, these macros. So there's a pull-data and a push-data, which is another way of saying reading and writing data. And you can see how these macros are then defined in terms of insl() and outsl(), which eventually go to iowrite32_rep() and ioread32_rep(). They're using the string accessors. And this particular piece of hardware can be built, and very often is built, without any DMA capability. So the way that you get data out, or get data in, I should say, is that there's just an MMIO register: you read it and you get 32 bits of packet, then you read it again and you get the next 32 bits, and you just keep reading. And if you can't read it fast enough, it all goes wrong; and it's similar for sending stuff out. So that's what these macros are doing. And in that case, you really don't care about any memory ordering, as long as you're getting the stuff out of the FIFO in order, and that's guaranteed by the string accessors. So that's why we use them there. Once you untangle all of the preprocessor, it's actually a reasonable example.
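(Stripped of the preprocessor, the pattern is roughly this sketch; the names are mine.)

    /*
     * Drain the RX data FIFO: every read of the same MMIO register
     * returns the next 32 bits of packet. No completion or memory
     * ordering guarantees are needed, only that the reads hit the
     * endpoint in order, which the string accessors guarantee.
     */
    ioread32_rep(ioaddr + RX_DATA_FIFO, buf, words);

    /* And the transmit direction: */
    iowrite32_rep(ioaddr + TX_DATA_FIFO, buf, words);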
So I'm pretending to offer a 100-pound reward, because there is a rumour that some Adaptec card does DMA on a read. So you read a register and then it does a DMA, which is a bit weird. It doesn't make much sense why you would build hardware like this, because reads are quite expensive, since they actually have to complete at the endpoint. I couldn't find anything in the tree, but there's a lot of code in the tree, and all I have is "some Adaptec card"; I don't think it's a very new card. So in theory, to fix this and make sure it works, you'd require an explicit mb() before the MMIO read. So if you find it, you can come and claim your 100-pound reward, and I'll probably not give it to you. Have a look, it'd be great. Let me know if you find any examples, because this will not work without the mb(). So maybe your reward is actually that you have to write a patch to add the mb().

Okay, any questions? Oh, and thanks to Arnd and Ben for helping me reverse engineer the semantics of these APIs. That's it, questions. And if you have a question, you have to come up to the front; there's a mic there and a mic there.

[Audience] If you want your tropical island as a non-SMP platform, you might not want to bring an Alpha of all things, right?

That's true. Alpha, it started off well.

[Audience] And another comment, on the whole L2 cache config thing: maybe you don't want to bring that up in Linux. I know it's a complicated subject, but at least we see the mess now. The code would be the same in firmware; it just wouldn't be fixed.

That's true. There's quite a lot going on, obviously, to have open source firmware now, but the incentive is perhaps not as strong as having open source operating systems; there's not as much to get back from it. So you're right: at least we can see the code and we can reason about it. And as I say, even if they fixed that code, there's actually an underlying bug in our backend which I'm now going to fix, so it was useful for that, you're right. Yep.

[Audience] Thanks for your talk. Could you show slide number 25 once again, please?

Oh, blimey, what did I do? This one?

[Audience] Yeah. If I understood you right, you said that the mb() here is not enough, or is it a bug?

Yeah, so let's have a look here. So the problem is that the mb() is going to force this guy to complete, but if you remember the completion diagram I showed, writes can complete early, because of a posted write, for example. So when you complete this write here, the device, this l2_base thing, might still not see it; there could be a buffer, a hardware buffer, before that device. So all it means is that this guy is guaranteed to have got into that buffer, but at that point it's not able to effect any state change on the L2, this power control here. So then you wait the 54 microseconds, then this guy can also go and sit in the buffer, and then perhaps they both go out together, and you see one cycle at the endpoint between the two writes, which probably isn't what they were going for here. I mean, it also might work fine, and given that this is SoC-specific, it might work fine on that SoC. I'm just trying to use it as an example; this was the only one I found, and I found it by chance when I was looking to see what this file was. And it's, yeah, not the correct use of the API. This code has probably never had many people looking at it.

[Audience] And in that example, is there actually anything preventing us from speculating the udelay() even before the write starts?

So it depends. If the udelay() is backed by a memory-mapped timer, then the mb() will hold it up. If it's backed by a system register read, then actually probably not.

[Audience] I was thinking of the classic loop-a-million-times implementation.

Yeah, and if that doesn't contain any loads or stores... basically, any memory access is going to be held up by that mb().

[Audience] But udelay() doesn't do any memory access in the trivial implementation, so we could just do the udelay() even before we do the write and the read-back and the memory barrier.

You might run into trouble with... because you'll get a... let me think about this; it's quite complicated. The backwards branch of the control loop might stop that. I'd have to think about it, but it certainly can go wrong. You need that read-back, and we have to put some magic in the read-back to make sure that we hold stuff up, probably using a fake control dependency, which is based on a trick from PowerPC, actually.

[Audience] Another question, not about the slide. Does the access size affect when writes complete or not? For example, in all your examples you use 32-bit writes, right? And if you, for example, interleave them with 16-bit or byte writes, does that affect anything?

So you mean if this was like a series of writeb()s or something?

[Audience] Not a series, but interleaved. For example, if you go to that slide where you have three writel_relaxed()s and one writel() at the end. Yeah, I think it was probably just the wrong... this one, yeah. So say the second one were a writeb().

Okay, so that's still okay. Actually, if you look at these, whilst they're using the same base, these are all different offsets. So all you need is to make sure that they're accessing the same device, and then you'll have order. The actual underlying addresses do not need to overlap, and you'll still get the order there.

Okay, I think I'm about done. So, cheers. Oh, no, there's one more. How can I find you?