Hello, thanks for possibly listening to me today. Excuse me, is that a little better? Before I even talk about the problem, let's start with how you can reproduce it. The best way to contact us is the IRC channel for Compute Express Link on OFTC. There's a really cool script maintained by Vishal Verma, who's in our group. You need to download an out-of-tree QEMU for it right now, but if you build that, you can easily run all of the workloads that we're going to talk about today. And finally, the patches are on that list, along with any other work-in-progress stuff. Any questions about that?

Okay. Since I'm new to this domain, I wanted to spend a couple of slides making sure I understand everything correctly. So if something sounds wrong in the next two slides, please yell really loud. At the top level there are these things called struct resource, and they can represent several different things. I guess I'll stand here. There's one important instance called iomem_resource, which represents all of the host physical address space in a system. And that's kind of cool. Everything seems to work okay today. In general, resources form a tree-like structure. If you want new space, you allocate out of a parent resource. Resources can have siblings, and they can have children. So hopefully my understanding there is correct.

The lifetime of resources is generally this: you have some sort of boot firmware that builds up a memory map. It has intimate knowledge of what ranges do what in a system, and it passes that along in some platform-specific way. EFI is sort of my, well, I guess the e820 is where I learned about this stuff. But you can understand that BIOS knows something and has some mechanism to pass it along. There's some handshake with the OS, and the OS pulls it out.
So you typically have some architecture-specific code that's able to read this thing. It says, oh, I'm an EFI kind of system, I see these ranges, let me go and insert these resources. You might have platform-specific drivers. One good example that's relevant to this conversation is the NFIT driver. It's given a bunch of information via ACPI, and it says, hey, this range is persistent, let's go and add that. You might have bus drivers like PCI that say, I'm a bridge or whatever, here's my physical address window, and devices underneath me may allocate out of that. And then ultimately something consumes them. Being an x86 person, I don't really know many use cases where the x86 architecture code itself starts allocating these resources, but I've seen request_resource calls in some of the other architectures, and generally drivers will consume them. So you can imagine, for instance, a PCI driver wanting to pull out some of the memory that the PCI bus driver surfaced.

So let me say what I expect from BIOS, and this is not necessarily the same thing as what BIOS can and can't do. In general, I expect BIOS to say, I have these volatile capacities of DDR, and these are their physical address ranges. Additionally, I expect BIOS to say the exact same thing for CXL memory capacities, except there's some complexity around interleaving. You might have platform-specific ranges that are reserved, and you can't touch those. And the main thing that we don't expect is for BIOS to map persistent capacities. Not that it can't, but in general the OS is the primary actor that maps those capacities. So if you're looking at the EFI memory map you get from BIOS, and you have persistent memory that's not like the older NVDIMM stuff, I would not expect to see host physical address space allocated for it.
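The tree behaviour described so far (allocate out of a parent, siblings must not overlap, children nest inside their parent) can be sketched in a few lines. This is only a toy model in Python, not the kernel's struct resource implementation: the `Resource` class, its `request` method and all of the addresses are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """Toy stand-in for a resource node: a named, inclusive address range."""
    name: str
    start: int
    end: int  # inclusive end address
    children: List["Resource"] = field(default_factory=list)

    def request(self, name: str, start: int, end: int) -> Optional["Resource"]:
        """Claim [start, end] as a child of this resource; None on conflict."""
        if start > end or start < self.start or end > self.end:
            return None  # the range does not fit inside the parent
        for child in self.children:
            if start <= child.end and child.start <= end:
                return None  # the range overlaps an existing sibling
        new = Resource(name, start, end)
        self.children.append(new)
        self.children.sort(key=lambda r: r.start)
        return new


# Hypothetical layout: a root covering everything, firmware-described RAM,
# a bus driver's window, and a device driver allocating out of that window.
iomem = Resource("iomem", 0, 2**64 - 1)
ram = iomem.request("System RAM", 0x0, 0x7FFF_FFFF)
bridge = iomem.request("PCI bridge window", 0x8000_0000, 0xBFFF_FFFF)
bar = bridge.request("device BAR", 0x8000_0000, 0x8000_FFFF)
clash = iomem.request("bogus", 0x7000_0000, 0x8FFF_FFFF)  # straddles two siblings
```

Here `clash` comes back as `None` because the requested range overlaps both existing top-level entries, which is the property the rest of the discussion relies on.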
So hopefully that was clear. I don't really know what the level of understanding of CXL in the room is, so please feel free to stop me if anyone has questions.

On the left, you can see this running in QEMU today. The cxl tool is updated. You can see there are two host physical address ranges that BIOS has conveyed. In this case it's not really BIOS doing this, it's some QEMU code. I don't want to call it hacks, but maybe not the most elegant things. You can see that there's one range starting at B9 bajillion, it's four gigs, and it can do persistent and volatile. The number of targets has to do with interleaving, and you can ignore it for this conversation. And then there's another range at C9 bajillion, also four gigs, and it can do all that stuff too.

I think of the CFMWS as a sort of sideband memory map. Instead of going through the EFI mechanisms, it conveys the same kind of information in an ACPI-specific way. CFMWS is the CXL Fixed Memory Window Structure. There are n of these for a given platform, and they're passed along in an ACPI table. It's actually a sub-component of a table called the CEDT. The thing I'll point out for this conversation is that hotplug doesn't do anything to these. They are fixed once BIOS boots. So if you want to hotplug, BIOS had better have given you enough space for the memory that you're going to want later. And if you don't have enough, that's too bad, you can't do anything. So even though they come in after the normal BIOS memory map handoff, they are fixed throughout a boot.

We have a driver now called cxl_acpi which reads this thing, and the topic we're going to get into is what it should do when it reads it. We have code today that's doing something, but I don't know what the right thing is. So that's the other point I'll make. I don't have an answer to this.
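To make the "fixed for the whole boot" point concrete, here is a minimal Python sketch of two windows like the ones on the slide. The `Window` class is hypothetical, the base addresses are invented (the real ones were not given), and regions are simply packed from the start of the window, which is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field

GIB = 1 << 30


@dataclass
class Window:
    """Toy model of one fixed memory window: a base, a size, and what it allows."""
    base: int
    size: int
    volatile: bool
    persistent: bool
    regions: list = field(default_factory=list)  # (start, size) tuples

    def free_bytes(self) -> int:
        return self.size - sum(size for _, size in self.regions)

    def create_region(self, size: int, persistent: bool):
        """Return the start of a new region, or None if the window cannot hold it."""
        type_ok = self.persistent if persistent else self.volatile
        if not type_ok or size > self.free_bytes():
            return None  # the window never grows after boot, so this is final
        start = self.base + (self.size - self.free_bytes())
        self.regions.append((start, size))
        return start


# Two 4 GiB windows that both allow volatile and persistent use (made-up bases).
windows = [
    Window(0x10_0000_0000, 4 * GIB, volatile=True, persistent=True),
    Window(0x20_0000_0000, 4 * GIB, volatile=True, persistent=True),
]
first = windows[0].create_region(3 * GIB, persistent=True)
too_big = windows[0].create_region(2 * GIB, persistent=False)  # only 1 GiB left
elsewhere = windows[1].create_region(2 * GIB, persistent=False)
```

The second request fails even though the platform as a whole has room, because the space has to come out of a window that firmware sized ahead of time.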
This is just a problem, not an answer.

Okay. When I first started working on this and was doing the QEMU code, I put the QEMU memory windows above the System RAM ranges. So everything played really nicely, and I didn't think there would be a problem. In this example, on the left, you have your traditional system stuff. I'm ignoring PCI here because you can think of it as the same as System RAM for this case. You have chunks of memory that may be busy or not, you have stuff that's reserved, and then you might have free host physical address space for whatever. In addition, you have these windows that BIOS passed to us, and you can go and create regions out of those. That's the easiest way to think about it. So in this case, like you just saw, there are two windows, and we created regions in one. In the other one we didn't create any.

This all works just fine, because in this case our CXL driver can manage that host physical address space and never has to mess with the rest of the host physical memory. So iomem_resource gets to keep doing whatever it was doing before, provided nothing comes along and does an insert_resource that stomps on our CFMWS host physical address space. As far as I can tell, that's not an issue in practice.

As it turns out, that's not how things are likely to look. Well, actually, let me step back. If you had a platform that only had CXL persistent capacities on it, it might actually look like this. It wouldn't be unheard of to have DDR in your low host physical address space and have BIOS make a bunch of these windows for your future persistent capacities. Cool. I went into this with a real persistent memory bent, so this made sense to me. But in reality you'll have volatile capacities, and this is more what you might expect to see.
So in this case, let's say you have a CFMWS that goes from host physical address zero to whatever. This little gap right here, maybe that's DDR, I should have put a box. And then maybe you have some CXL volatile memory. So in this case you actually have CXL host physical address space, allocated as a region, that now intersects with the address space described by iomem_resource. BIOS has passed this thing to us as memory. It just looks like System RAM to us. We'll get into the problem with that in a little bit. Let's say right here you wanted to create a new region of volatile capacity. You could certainly do that. And one might ask, okay, what should we do with iomem_resource in that case? That's really the question we'll get to.

Just to expand on that a little bit: this first layout might be fairly sane for persistent only, and this second one, where let's say this part is a persistent capacity, is something you might actually see in a real system. But this third one is also possible. You really can get anything from BIOS, and there are no constraints on what they tell us. So in this case, for some reason, BIOS decided to give us this range of memory that the OS incorporated, but it also mapped some part of that as a region and used up the entire window for it.

Okay, so hopefully that was enough background to discuss the problem. And again, I don't have a solution. Well, I guess we did post something, but I don't have a real solution. Oh, we already have hands. Okay, one second. You good?

Yeah, I'm good. The question is, why would you want the CXL address space to be a part of iomem_resource?

I don't. However, if you go back to this kind of situation, I don't really see how you can avoid it.

So part of the System RAM actually lives in the CXL-backed memory?
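The intersection being described can be checked mechanically. The following Python sketch uses invented addresses and a hypothetical `conflicts` helper: one window starts at zero and therefore covers ranges firmware already reported as System RAM, while a second window sits above everything and collides with nothing.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


def conflicts(window, resources):
    """Names of already-registered resources that intersect a window."""
    return [name for name, start, end in resources
            if overlaps(window, (start, end))]


# Hypothetical entries already present from the firmware memory map.
iomem_entries = [
    ("System RAM (DDR)", 0x0, 0x7FFF_FFFF),
    ("System RAM (CXL volatile)", 0x1_0000_0000, 0x1_7FFF_FFFF),
]

window0 = (0x0, 0x1_FFFF_FFFF)               # starts at zero, spans both entries
window1 = (0x10_0000_0000, 0x10_FFFF_FFFF)   # sits above everything else

low_hits = conflicts(window0, iomem_entries)
high_hits = conflicts(window1, iomem_entries)
```

`high_hits` is empty, which is the easy persistent-only layout. `low_hits` names both RAM ranges, which is the case the rest of the discussion is about.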
Yeah, so with BIOS, we don't know. When the OS boots, you don't know. All we know is that BIOS gave us an EFI memory map entry that says, this is System RAM.

Looks a bit weird, but yeah.

I mean, I think the point that one needs to incorporate, not you personally, is that BIOS has the ability to do anything it wants in terms of what it tells us. When it's all persistent, I think it's a really easy story. Once BIOS starts doing things with volatile capacities and just telling us this is regular memory, it gets more complicated, especially because it has to live in these windows.

And don't forget here that people tend to think of CXL in this model of an add-in card, something that you add on to a system later and that's not really part of the core system. But there are other people talking about the only RAM in your system being CXL attached. In that case there wouldn't be any possible way to even boot the kernel without having iomem_resource cover the CXL area, because there isn't any RAM other than CXL. So we need to stop thinking of CXL as some other thing that gets added to the system. It really could be the memory in the system.

Yeah, thanks, Dave. Yeah, David.

So in my humble opinion, iomem_resource would describe everything, and you would actually want some kind of layout where you have the devices on the first layer and then some kind of split of the System RAM region. So if your BIOS gave you a System RAM region that is bigger than the device, then you have to split it and reorganize, but you can always punch in these details later. Essentially, what I've been doing for virtio-mem is that I have a System RAM region that's not busy, and that's just the container for everything that might be in that device range.
And when I actually hotplug memory, I insert a System RAM region in that container. So you have a topology that says, this is my physical address space, my device lives here, but within that device this portion, for example, is System RAM.

So the main issue, let's see, did I have a slide that represents that? Well, not really. But the main issue that started the conversation was that you can do that, and then the second bullet is the problem. There are now these device private memory APIs that try to walk through the host physical address space.

Yeah, that's just broken.

Well, so that's where...

I mean, I already had nightmares about how that would work with virtio-mem, because you cannot just take anything from your address space where something else lives and try to reuse it. That's just broken.

So we have a proposal on the list where basically we do what you're saying, and we kind of have an upper maximum. If you take all the CFMWS address spaces, we say, okay, if anything is scanning through host physical address space, it can't take any of this. If you want anything above that, go ahead. There's a long discussion there. I actually haven't read the whole thread. Dan Williams has been a lot of it, and Jason. So if you want to chime in on what the feedback has been on that, go ahead, because I'm of the same opinion, but it seems like the community doesn't agree.

Yeah, I mean, I think we all agree that we definitely don't want device private memory to allocate inside of CFMWS. The question is how we do it. I think everybody also agrees that it would be really nice if iomem_resource told the whole truth and there wasn't something on the sideband. But I might want to talk to you, David, more about how it gets represented internally. Maybe I'm wrong about this.
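The "upper maximum" idea can be illustrated with a small Python sketch. It is a simplification under stated assumptions: `find_free_range` is a hypothetical helper that scans upward for the lowest free range, which is not necessarily how the kernel's device private allocator searches, and all addresses are invented. The point is only that a scanner which looks at busy ranges alone can land inside an empty window, while a floor above every window avoids that.

```python
GIB = 1 << 30
LIMIT = (1 << 46) - 1  # assumed highest usable physical address


def find_free_range(size, busy, floor, limit=LIMIT):
    """Lowest inclusive range of `size` bytes at or above `floor` avoiding `busy`."""
    start = floor
    for b_start, b_end in sorted(busy):
        if b_end < start:
            continue                  # busy range lies entirely below the cursor
        if b_start - start >= size:
            break                     # the gap before this busy range is big enough
        start = b_end + 1             # otherwise skip past the busy range
    if start + size - 1 > limit:
        return None
    return (start, start + size - 1)


busy = [(0x0, 0xFFFF_FFFF)]                       # System RAM
windows = [(0x1_0000_0000, 0x1_FFFF_FFFF),        # empty windows: reserved by
           (0x2_0000_0000, 0x2_FFFF_FFFF)]        # firmware but not in `busy`

# A scanner that only knows about busy ranges picks space inside window 0.
naive = find_free_range(GIB, busy, floor=0)

# With the floor set above every window, the windows are left alone.
floor = max(end for _, end in windows) + 1
safe = find_free_range(GIB, busy, floor=floor)
```

The alternative raised in the discussion, registering the windows in the resource tree so that they simply show up as occupied, would amount to adding `windows` to `busy` instead of passing a floor.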
I thought there were places in the kernel that depended on System RAM being the top-level resource, so that if System RAM is three levels down, it causes problems.

I think I fixed most of these by now. Initially, only System RAM that lived on the top layer would be considered, for example, for kexec, but I think we fixed most of these. I think what we would actually want is some kind of placeholder that says, this is a device, and unless you're using a special API function to request memory within such a region, you're not going to get it. So if you have device private scanning, you would stumble over a device region and you would say, well, I can't take anything of that, it's already reserved by somebody else. It's not busy, but it's reserved in the address space. That's essentially what virtio-mem tries to do, but it's also not complete with respect to, for example, device private.

So like a flag, right? I tried to do something like this. It gets a little bit messy if you look at this case, because you want to say, hey, I have this CFMWS thing and it goes from zero to whatever, but there already exist resources in that range. So it's not trivial. You can try to re-parent things. And when I looked at re-parenting... Well, let me start off, for people in the room who aren't aware: struct resource has existed forever. It's like older than me. Well, not that old, but it's almost older than me. And I've never seen anything re-parent resources. So there was definitely a concern for me that maybe you can't do that.

Somebody says that's not true. There's insert_resource, which... So what we had on PA-RISC...

It can consume one.

It can consume one. You're right. It will grow it. So in this example, and I'm sorry to cut you off, you could do it. However, if you had multiple...
So if I had another System RAM here or whatever, it wouldn't work, because these are two separate resources. And at least as it works today, it can't manage more than one.

Okay, so let's fix that. The original problem that we had was that we were discovering things in the wrong order. This was on PA-RISC, so you don't even care about the original problem that we were having. But basically, we were discovering device A and devices C and D, and then we were discovering device B, which was the parent of C and D. So I thought we actually did re-parent more than one. But if you say we don't have that code, then we don't have that code. It's kind of legitimate, though, right? Maybe firmware gives us devices in the wrong order and we find A, C, D, B. Then we have to insert B between A and C, D. C and D get found fine because they're both descendants of A. It's just that they're grandchildren of A, not actually children of A, and we need to insert a layer in between them.

So I mean, I can go back and double-check. But just to make sure we agree here: this thing would become the top-level resource. Well, I can't edit the slide, but you'd have the CFMWS grow to its full size, and iomem_resource would sit below that. Is that generally...?

I thought iomem_resource was the root of the tree.

Yeah, yeah, so flip it. You grow iomem_resource and it sits above CFMWS0.

Okay. And then System RAM, all the entities that are currently children of iomem_resource, become children of the CFMWS resource. And there's a flag there that tells... Well, somehow we still need to shut device private memory out, but I think you can just add the flag for that and be fine.

Yeah, no, I think device private literally looks for anything that has no entry, which was always dodgy. It always made everybody's skin crawl. And now CFMWS finally breaks it.
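The out-of-order discovery example (find A, then C and D, then their real parent B) can be sketched as an insert that adopts every existing child the new range fully covers. This is a Python toy with hypothetical names and addresses; it shows the behaviour being asked for and makes no claim about what the kernel's insert_resource does today.

```python
class Resource:
    def __init__(self, name, start, end):
        self.name, self.start, self.end = name, start, end  # inclusive range
        self.children = []

    def contains(self, other):
        return self.start <= other.start and other.end <= self.end


def insert_adopting(parent, new):
    """Insert `new` under `parent`, re-parenting every child that `new` covers."""
    for child in parent.children:
        if child.contains(new):
            return insert_adopting(child, new)   # belongs deeper in the tree
    covered = [c for c in parent.children if new.contains(c)]
    partial = [c for c in parent.children
               if c not in covered and c.start <= new.end and new.start <= c.end]
    if partial:
        raise ValueError(f"{new.name} partially overlaps {partial[0].name}")
    new.children = covered                        # adopt all covered children
    remaining = [c for c in parent.children if c not in covered]
    parent.children = sorted(remaining + [new], key=lambda r: r.start)
    return new


# Discovery order A, C, D, B, where B is really the parent of C and D.
a = Resource("A", 0x0000, 0xFFFF)
c = insert_adopting(a, Resource("C", 0x1000, 0x1FFF))
d = insert_adopting(a, Resource("D", 0x2000, 0x2FFF))
b = insert_adopting(a, Resource("B", 0x1000, 0x2FFF))
```

After the last call, A has the single child B, and B has the children C and D. A CFMWS window inserted over several existing System RAM entries is the same shape of problem.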
But I think as long as we have that range in iomem_resource, device private memory will leave it alone. It will still magically think that anything else that's not in there is okay, which is still dodgy, but at least it doesn't conflict with CXL.

There are constraints that the spec imposes where you need to allocate host physical address space above a certain amount, and I'm almost positive the resource API supports this today. I'm pretty sure that's there. You can give it a start for that. But yeah, I mean, that's fine with me. I don't know if there were any arguments on the bridge about that, or if anybody wants to take it on. The struct resources themselves, I don't have the history, but they're kind of wonky. I mean, it has its own embedded linked list implementation for the siblings. So I'd really prefer not to touch it. If anybody else wanted to do that, that'd be cool.

So I have two things. One is that I just want to make sure I'm clear on this. What we're saying is that we want to change things a little bit so that iomem_resource is the one source of truth for reserving things and figuring out ranges, right? That sounds good to me, if that's the case. That makes a lot of sense, because otherwise you have two competing systems that don't talk to each other and you're doomed. Okay, so if you do that, that's great. The other one is an idea, a little bit fuzzy here. The ZONE_DEVICE thing started off as a hack and it still is. I know iomem_resource is supposed to be something you figure out early, but could we make ZONE_DEVICE integrate with it, so that what it takes out becomes an iomem resource? No? Okay.

Well, I mean, ZONE_DEVICE is just a hack to get struct page. So it's taking any random iomem resource and annotating it with struct page. I guess, what are you wanting that's not there today?
Yeah, what I want is for ZONE_DEVICE to take out an iomem resource that matches the range that it wants, so that it fits into this system. It would have to move things around to do that.

No, all the ZONE_DEVICE users today do a request region. So for anybody using ZONE_DEVICE, you'll see a new entry pop into iomem_resource later.

But...

Yeah, they should be doing that.

I don't know. Struct resource is a very old data structure, and we have people in the room who love to add new data structures. Is it time for a new data structure besides struct resource, or do we just keep hacking on it?

I don't want to. I started trying to do this.

It's a maple tree, right? It's range-based?

Consider the amount that struct resource has changed in the last 20 years. And Willie was saying he has a patch right here. He said it's a 20-year-old patch that's outstanding. Well, he shouldn't have said anything if he didn't want me to bring it up. Sorry.

For those on the stream, Andrew Morton has been carrying this patch for 20 years, and I was hoping Ben wouldn't bring it up, because Andrew might hear this and say, oh yeah, I should just drop that patch, it's been 20 years, we didn't need to take it upstream.

So it's changed surprisingly little. And as someone who's new to MM, I did not want to touch it. So I guess if everybody's like, yeah, go ahead and break it, let's do it.

No, you're going to break my 20-year-old patch.

No, it is kind of a mess. At least for people unfamiliar with the code, I think it doesn't match a lot of the other data structures in the kernel.

Liam, for the record, whispered to me as I was walking up that it does sound like a maple tree. I just want to get that on the record.

So I'm going to agree that the data structure is horrible, and especially traversing the tree. I think you essentially do a depth-first search or something like that. Just traversing the tree is absolutely insane. You just want to have some range walked.
You traverse almost the complete tree. So any work on that would be highly appreciated. And I mean, don't be scared to break it. We're going to fix it.

Well, maybe now is not the time to say this, but I'm taking a new job in a couple of weeks.

So please tell your new employer that you want to fix that.

Yeah. Okay. I mean, I agree. Does anybody see a path to solve my problem without all the hard work? At least for a kernel release? I promise I'll fix it later.

No, I think, like Willie was saying, if insert_resource does the wrong thing and gets confused, you might just need to do insert_resource at smaller granularities to match the current constraints, as a temporary hack. Instead of one big CFMWS, you'd end up with a series of CFMWS entries in there.

I mean, our cxl_acpi driver could say, oh, well, this already exists at this point, so we can just create a sub-CFMWS region. I don't remember if I tried that or not.

I don't think it'd be a sub. It'd be an artificial sibling.

Yeah. The thing you'd run into, though, is that you could potentially hot unplug, or otherwise remove, this region later, and then you end up in this weird case where we weren't actually tracking that this was a CFMWS-allocated area when you try to add it back later. I don't think it'll be an issue in practice, but it doesn't tell the whole truth at that point.

Yeah, sorry, this all seems really complicated. I wasn't paying attention to the first half of this talk because I was trying to deal with all the mshare comments that are coming in.

I appreciate your honesty.

You know, it just wasn't that important to me. But it seems to me like the whole problem is that iomem_resource is the wrong size.
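The stopgap suggested in this exchange, registering a window as several smaller pieces that sit beside whatever already exists in its range, comes down to computing the uncovered gaps. A minimal Python sketch, with a hypothetical `window_gaps` helper and made-up addresses:

```python
def window_gaps(window, existing):
    """Inclusive sub-ranges of `window` not covered by any `existing` range."""
    w_start, w_end = window
    gaps, cursor = [], w_start
    for start, end in sorted(existing):
        if end < cursor or start > w_end:
            continue                              # irrelevant to what is left
        if start > cursor:
            gaps.append((cursor, start - 1))      # uncovered piece before it
        cursor = max(cursor, end + 1)
    if cursor <= w_end:
        gaps.append((cursor, w_end))              # uncovered tail of the window
    return gaps


window = (0x0, 0x1_FFFF_FFFF)
existing = [
    (0x0, 0x7FFF_FFFF),                 # DDR reported as System RAM
    (0x1_0000_0000, 0x1_3FFF_FFFF),     # CXL volatile memory reported as RAM
]
siblings = window_gaps(window, existing)
```

Each entry in `siblings` would become its own "artificial sibling" entry. As noted in the discussion, the ranges that were already occupied are not recorded as belonging to the window, so the tree would not tell the whole truth if one of them were removed later.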
It ends too soon. I don't remember where we set up iomem_resource or why, but at the point where we discover CFMWS zero, surely that is our cue to say, oh, hey, iomem_resource is too small and we need to just enlarge it.

Yeah. How we got here, and maybe this was a mistake or something we could revisit, was wanting to allow it to be discovered late. Because otherwise, yeah, we can certainly go in before we parse the memory map, so that the first thing you do is parse this other ACPI table and insert that first, and then insert the rest. Then it would work.

Yeah, but then we're touching the boot code, which we can do. But it seemed to be a nice property that modules can come along later and say, oh, actually, I have more information about the address map now, let me go update it. That seems like a nice thing for Linux to have in general.

Yeah. It also makes backporting easier, but nobody, sorry, I said the B word. But yeah, that was the original intent. At least I thought the most value was that you can modprobe cxl_acpi at some later point on whatever system, right? And all of a sudden this all appears. And I'm totally cool with that. So yeah, we can go in and resize iomem_resource at that time. That's cool with me.

All right. Well, if nobody's disagreeing on the bridge, then I can give you back one minute.