Hi, everyone. So this is mostly a recap session; I'm trying to get this done in a few minutes, a recap of what we talked about last year. I think everything I proposed last year is working as intended. Yes, I will document it: I'll write something up, put it up for review, then stick it up on ozlabs.org, and I'll link to it from all those irritating emails my scripts send out.

I don't think the mm-stable concept, giving people a non-rebasing stable tree to develop against, has been as successful as I hoped, because it just takes so long to get stuff stabilized into the stable tree. If you're working against mm-stable, it's generally two or three weeks out of date. But that's fine; if people want to work against a non-rebasing tree, please go ahead and develop against mm-stable, it should be a good target. Please at least do a test merge with mm-unstable before you send it out, just so I find out how much shit is about to hit my fan.

And I am getting increasingly active about hurrying people along. Please try and get this stuff finished; don't leave it sitting in an unstable state for three weeks. I'm also getting a bit more active about when people send out a patch set and I'm not sure about it, or I'm not sure about the person. Rather than just letting it go, nowadays I'll often send people a private email saying: I'm not going to merge this at this time, please let me know a week from now if you believe there's some action I should take. People seem to be appreciating that. I don't like hearing about maintainers where you just send something and nothing happens; I'd like to provide people some feedback.

And fourthly, I'll be looking at dragging the memblock and slab trees at least into mm-unstable, so that the other MM developers get additional visibility into upcoming MM work.

Anybody? Well, I'm here to serve you guys, and you know my email address. Don't be afraid.

No, I'll keep it as it is, but I want to hurry things out of unstable into stable more rapidly than we are doing, so that the mm-stable tree is more up to date. I mean, that's what goes into the next merge window, so I'd like to fill it up more quickly. Getting it from three weeks down to two weeks would be great.

I do not have concerns. I would just like to give the feedback that I really like how the process has changed; it is much more transparent, and I think that was a great step in the right direction. And just a question: can you clarify what you expect, or what you think is the ideal timeline, for patches to stay in unstable before they go to stable?

Well, in the worst case it's up to three weeks nowadays, and sometimes that's necessary. But often I'll see a patch series that has some unaddressed review comment, or some guy said "yeah, I'll send a v3" and then nothing happens for two weeks. Please don't do that; that's why I'll be sending people nagging emails. It just holds up the whole pipeline, and I end up crunching a whole bunch of stuff into stable in the last week. I don't think that's good; I'd like to get things flowing more.

Is the implication that if you pull something into unstable and somebody doesn't think they're going to update it for the next couple of weeks, they should ask you to pull it back out?

If it's causing a problem, yes. If it's causing problems, I'll drop it out and take the next version. I'll generally hold version two while waiting for version three if it's not causing any problems. Just attempting to keep pushing things forward. Yeah.
So if you didn't get the implication, this was just to get you in the room to yell at each other. But what I want to do is serve the role of mediating between the MM community and the SMDK proposal; that's kind of the position this started from. And if you think about it, it's actually not... I mean, the idea here is that CXL needs new GFP flags and new memory zones. And of course, last time we added a new memory type, we didn't do any of this, right? Oh, we did: we did add two new GFP flags and we added a new zone. But that was part of the PMEM work, and it was a hard-fought fight; I basically risked my career to get that stuff into the kernel. It was not something I would wish on somebody else to go through. So we don't take adding flags lightly.

And I think for this discussion to be productive, it really needs to leave the solutions at the door. We need to talk about problems; problems first. The format I like for these kinds of questions is to not assume the solution. Hal and I actually had a conversation where he came up with a solution to something I thought was a problem that couldn't be solved; he's like, oh no, what if we do this? So I really believe we have a lot of bright people in the room, and I think we can talk through whether we need these things or not.

But I want to quickly walk through how CXL looks at the system, so we're all on the same page; I felt like there were some misconceptions about how we identify these things. These window ranges at the top are static ranges that your platform vendor puts in an ACPI table. They're static, they never change, and they tell you the possible places your CXL devices could be mapped. That's a platform designer decision, and the OS has no say in it. What you typically see is a platform with multiple host bridges: just like your server system today has multiple PCI Express host bridges, you'll have multiple CXL host bridges. They're actually the same thing, because PCIe and CXL are the same electrically.

In this case we have four windows and three host bridges, and this first window is interleaving accesses across these three host bridges. The host bridges can have, for now we'll assume, an arbitrary number of CXL devices beneath them. The other windows are not interleaved: this address range will only map something that's coming from beneath host bridge zero, this window is only for one, and two, and that kind of thing. So in this case the operating system would be responsible for picking: hey, if I want max performance, max bandwidth, I want to use window zero, because that is interleaved across multiple lanes and multiple CXL devices, whereas these other ones are just single pipes.

And you ask yourself, why would I want that? Well, if you want to maximize bandwidth, you pick this window, max interleave. If you want to maximize recoverability, maybe you want to be able to lose a device: if we're interleaved and I lose one of the devices, it's just like losing a disk out of your RAID 0; the whole thing is gone. But if you're using these non-interleaved ones, you could conceivably lose a device and the others would keep working. So that's a choice the platform gives you. But we do have the concept of multiple memory types, and I'm going to call them QoS classes, or performance classes, or properties.
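For reference, the per-node performance properties being described do already surface to user space: when the platform publishes HMAT data, the kernel exposes per-node latency and bandwidth numbers under sysfs (Documentation/admin-guide/mm/numaperf.rst). A minimal survey sketch; nothing in it is CXL-specific, the attributes are simply absent on nodes without HMAT data, and the mapping from these numbers to CXL QoS class IDs is not something the kernel exports directly:

    /* node_perf.c: survey per-NUMA-node performance from the HMAT-derived
     * sysfs attributes.  Latency is in nanoseconds, bandwidth in MB/s.
     * Build: cc node_perf.c */
    #include <stdio.h>

    static long read_attr(int node, const char *attr)
    {
        char path[128];
        long val;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/access0/initiators/%s",
                 node, attr);
        f = fopen(path, "r");
        if (!f)
            return -1;              /* no HMAT data for this node */
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
        return val;
    }

    int main(void)
    {
        /* 16 is an arbitrary scan limit for the sketch */
        for (int node = 0; node < 16; node++) {
            long lat = read_attr(node, "read_latency");
            long bw = read_attr(node, "read_bandwidth");

            if (lat < 0 && bw < 0)
                continue;
            printf("node%d: read latency %ld ns, bandwidth %ld MB/s\n",
                   node, lat, bw);
        }
        return 0;
    }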
So you could have something like this, where there's a window zero that's interleaved across all these host bridges, and let's just say these are your volatile memory devices. And then the platform also gives you another window with the same interleave, and we'll call these your persistent memory (rest in peace) devices. CXL does define persistent memory, so even though we're in a post-Optane world, I still think somebody's going to put a battery on a CXL device and call it persistent. But the reason the platform gives you two different windows is for the case where this memory might be slower to respond than the volatile memory. Go ahead.

So all the devices in the same window will have the same properties? Is that a guarantee, or?

Well, there are hard guarantees for things like persistent versus volatile; those actually are different. But, for instance, CXL zero and CXL four may have different latency, right. So the properties on the window that the platform tells you are either volatile versus persistent, or a QoS class ID: it'll say this is class zero, one, two, three. The numbers don't mean anything; they're just different QoS classes. But the reason to separate them is that if these devices are slower, storage-class memory, you don't want head-of-line blocking issues while you're trying to talk to the fast devices. So you give them two different traffic lanes; the fast stuff can keep talking while the slow stuff takes its time, and they don't collide with each other.

I think the point, though, is that there's nothing in the spec preventing a bad BIOS from taking slow memory and interleaving it with fast memory. So ideally smart decisions are made in interleaving, but there's no guarantee. And especially with the performance classes: say window four is all full when you hot-added some device; the operating system could put it in window zero. It probably wouldn't be the optimal situation, but it's not mandated.

So what Linux does today is just the simple thing. Everything we do, we try to do the simple thing first, because there's so much complexity you could do. The simple thing is to just assign a NUMA node per performance class. So even though we have multiple devices, and this could be 32 devices, they all get mapped to the same NUMA node.

So what kind of problems does this impose? I think these are the main ones, but our next speaker will come up after me and we can have a discussion if he wants to identify more. I think most applications won't care; they just say, give me memory. But I think administrators, or the cloud orchestrator people, will go along and say: hey, these people are paying for coach class, I want to bind them to the coach-class memory; and these people are paying through the nose for the high-class stuff, so I'm going to put them in the first-class memory. I think that's going to be a policy imposed by the cloud vendor. So I think we need bind by performance class.

I think we might need avoid by performance class, because it seems like memories of lots of different capabilities are going to be put behind CXL, and something might be too slow for some applications; that might be something the kernel wants to avoid. So I think the kernel might need avoid by performance class, and maybe migrate by performance class: like, they're in coach and there's a seat available in first class, and they want to upgrade.
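In today's API, bind-by-performance-class reduces to building a node mask of the nodes in that class and handing it to mbind() or set_mempolicy(). A minimal sketch, assuming nodes 2 and 3 are the coach-class CXL nodes; the node numbers are illustrative and would come from enumeration in practice:

    /* Bind a buffer to the "coach class" nodes with MPOL_BIND.
     * Build: cc bind.c -lnuma */
    #include <numaif.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        unsigned long coach = (1UL << 2) | (1UL << 3);  /* illustrative */
        size_t len = 1UL << 21;                         /* 2 MiB */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
            return 1;
        /* maxnode is a bit count, at least one past the highest node used */
        if (mbind(buf, len, MPOL_BIND, &coach, 8 * sizeof(coach), 0)) {
            perror("mbind");
            return 1;
        }
        printf("buffer will only be backed by nodes 2-3\n");
        return 0;
    }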
Bind by performance class was the one where I thought: oh, we can't solve that with NUMA nodes. Because let's say you're in coach, you're running in that class, and then more memory gets added. If you've bound yourself to the coach-class nodes that were available at the time, and a new node comes online that's also in coach class, how does your application know that it can now use this one as well? But this is when Michal said: just bind to the possible nodes (there's a sketch of this below). It was obvious to other people in the room, not me, but I'm not that smart. But that's an example of: hey, we didn't need to invent a new node for this. We just set a node mask; we classify nodes through some mechanism, people learn about their performance class, and then they bind to the nodes they care about.

So I'm curious how static your configuration is. You say a node comes online; do you know upfront the properties of that node, its characteristics? Or can some crazy CXL switchy fabric that I don't get add something else, replace something, upgrade the performance class, I don't know? Or is this completely static?

It's essentially static. Yes, you could put devices of any quality in here, Joe's backyard CXL device, and CXL switches that have crazy latencies, but at the end of the day they're going to map to either window 0 or window 1; the degrees of freedom are always limited. And the way this works in the driver is: the driver asks the device how fast it is, calculates the performance from the top of the system down through the host bridges and all the links, and then says, hey platform, here's the bandwidth and latency of the memory I want to map; and the BIOS gives you back the QoS class ID. And that's a static set.

Okay, that's perfect. So we're not talking about hotplug of CXL devices yet?

We are talking about... these could be dynamic.

Yeah, dynamic, but you're not adding completely new devices with new characteristics, where new NUMA nodes pop up. That doesn't happen; you know exactly what you have. And, as Michal suggested, you bind to the possible ones, because you can identify the characteristics.

Right. So let's say there were zero devices in the system at the beginning of time, and then we plug all six in. The driver handles it, and then you ask: hey, which windows are available, and it maps them. So we handle dynamically hotplugged devices the same way. For the NUMA nodes: we don't have dynamic NUMA nodes in Linux. We just statically say, okay, there are the SRAT nodes, and then we add one more for each performance class, and we pad out the possible nodes with those. And that's it.

Makes sense to me. Thanks.

I mean, okay, so the configuration can be made after OS boot, at driver time, or at BIOS time?

All of the above. The way the driver is designed: let's say today's BIOSes don't support any level of switching, so the BIOS might only map the things that are directly attached to the host bridge. The Linux driver will say: okay, the BIOS did those, but I'll do the rest. So right now the driver will do everything the BIOS won't do.
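Michal's "just bind to the possible nodes" suggestion is expressible with today's syscalls: build the mask from /sys/devices/system/node/possible, so nodes that come online later are already covered by the policy. A minimal sketch, assuming at most 64 possible nodes; filtering the mask down to a single performance class is left out:

    /* Bind to every possible node, online or not.
     * /sys/devices/system/node/possible holds a range list such as "0-5".
     * Build: cc possible.c -lnuma */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        unsigned long mask = 0;
        char buf[256];
        FILE *f = fopen("/sys/devices/system/node/possible", "r");

        if (!f || !fgets(buf, sizeof(buf), f))
            return 1;
        fclose(f);

        /* parse comma-separated ranges, e.g. "0-3,8" */
        for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
            int lo, hi;

            if (sscanf(tok, "%d-%d", &lo, &hi) != 2)
                hi = lo = atoi(tok);
            for (int n = lo; n <= hi; n++)
                mask |= 1UL << n;
        }

        if (set_mempolicy(MPOL_BIND, &mask, 8 * sizeof(mask))) {
            perror("set_mempolicy");
            return 1;
        }
        printf("policy now covers every possible node\n");
        return 0;
    }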
Okay, then. So I think this gets at the requirement that I want to address: if the configuration can be done after OS boot via the driver, in a software way, flexibly, without cold-booting the system, then I think this makes sense.

But even if you're dynamically doing it, the only place you can put it is into one of the nodes that was already there. You can't add... we're not supporting adding new nodes. We're just saying: here are all the possible performance classes. And there's also a hardware resource bound on the number of these windows, because they actually take up decoder registers (they're called SAD and TAD decoder registers), and those are fixed resources in the memory controller. When a device sends a DMA and it goes to CXL, it goes all the way up, and the system decoder says: oh, this was headed to the CXL space, not the DDR space, and then it sends it back down on the CXL side. The logic to do that routing is these decoder registers, and that's a fixed resource. All that to say: we're not going to have a whole ton of these. We're not talking about blowing out our 1024-node limit; I think it's going to be maybe 10 or so windows that we have to deal with on a system.

Okay. For me it looks like the mechanism you explained can be smoothly integrated with the zone approach, but yeah, let me save that for my presentation. Yeah, okay.

Okay. So, bind by performance class: I feel like that's, get a set of nodes that are going to be in that classification. And if you plug in a device that's way below that, because they have to map at some point, if you plug in a device that's way too low performance, it's still going to map to the same node. And that's basically your fault, as the person who put in a low-quality device.

Avoid by performance class: this is kind of back to our discussion of, do we need some overrides to tell the kernel, take these nodes out of your fallback list, don't even bother. Or: yes, you can use these nodes for allocations, but the application has to bind to them; you can't get this by accident, you have to opt in. But that seems like something we can communicate with a node mask.

Yeah, that's something we do not allow right now, because zonelists are covering all the existing nodes, or zones to be more specific. We do not have a way, as far as I remember, to essentially put a node outside of the zonelists. So maybe we want to enhance the hotplug API and say that this should be an outside node. And essentially, when you bind to that node, you might still have the fallback list all the way around to the other nodes, so if you start there, you eventually get memory; because otherwise we would have terrible problems, like OOMs and stuff like that, that cannot really be resolved. So you start at the slow node, and even if there is nothing there because you haven't hotplugged anything there yet, you still get your normal memory. But when you start with normal memory, you do not end up on that slow thing and get surprised. Because otherwise you would have to use mbind just to avoid that, and that's really clumsy, because it's hard to configure properly.
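Until something like that zonelist/hotplug extension exists, this is the clumsy status quo being described: every task that must never land on the slow nodes has to mask them out itself. A minimal sketch, assuming nodes 0-1 are DRAM and nodes 2-3 are the slow CXL nodes (illustrative numbers):

    /* Opt out of the slow nodes by hand with MPOL_BIND.  Nothing stops
     * an unmodified task from falling back onto them; each task needs
     * to do this itself, which is exactly the clumsiness at issue.
     * Build: cc avoid.c -lnuma */
    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long dram_only = (1UL << 0) | (1UL << 1);

        if (set_mempolicy(MPOL_BIND, &dram_only, 8 * sizeof(dram_only))) {
            perror("set_mempolicy");
            return 1;
        }
        printf("allocations restricted to DRAM nodes 0-1\n");
        return 0;
    }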
Yeah, I think this has to be something that the driver either does automatically, or some other enumeration says: okay, the performance here is just not something that's Linux-worthy, not something the core MM should have to worry about, so it needs to be opt-in.

Yeah, but for that we would need an extension to the hotplug code. It wouldn't be hard to do, but we just have to think about how to communicate that to the hotplug code.

Right, yeah. So the idea here is that we're not trying to avoid doing kernel changes; but in terms of whether we'd rather have a new zone type or these kernel changes for these mechanisms, I haven't come across why we'd make that switch. Migrate by performance class seems pretty straightforward: just migrate pages; if it's nodes, you just migrate between nodes. I don't see a problem.

Just on binding to a performance class: there is a nice-to-have use case of binding to multiple different performance classes at the same time, but with an interleave ratio input by the users, so that you fully utilize the total bandwidth between the DRAM on the board and the CXL device.

That's true, yeah. So there's a mechanism that got added recently called MPOL_PREFERRED_MANY. The idea would be that you can opt into all the performance classes that are okay for you, or bind to only the ones you want. But yeah, you could do an interleave policy in software, not just in the CXL hardware.
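For reference, both of those are a single set_mempolicy() call today; MPOL_PREFERRED_MANY arrived in Linux 5.15. A minimal sketch with illustrative node numbers. Note that MPOL_INTERLEAVE only gives a fixed 1:1 stripe across the mask, so the user-supplied ratio mentioned above would need a new mechanism:

    /* Two policies over a DRAM node plus a CXL node.
     * Build: cc policy.c -lnuma */
    #include <numaif.h>
    #include <stdio.h>

    #ifndef MPOL_PREFERRED_MANY
    #define MPOL_PREFERRED_MANY 5   /* added in Linux 5.15; older headers
                                     * may not define it */
    #endif

    int main(void)
    {
        unsigned long both = (1UL << 0) | (1UL << 2);  /* illustrative */

        /* prefer the set, but fall back elsewhere rather than OOM */
        if (set_mempolicy(MPOL_PREFERRED_MANY, &both, 8 * sizeof(both)))
            perror("MPOL_PREFERRED_MANY (needs Linux >= 5.15)");

        /* or: software interleave to add the two bandwidths together
         * (this replaces the policy set above) */
        if (set_mempolicy(MPOL_INTERLEAVE, &both, 8 * sizeof(both)))
            perror("MPOL_INTERLEAVE");

        return 0;
    }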
One quick comment. Since you mentioned this is a performance class, I assume it's no longer just a binary value; it could be multiple different values, right? You mentioned it could be up to 10, whatever.

Yeah, like a proximity domain number. It has no meaning; it's just an integer.

Yeah, that's fine. My point is that it's not a single value. So if we want to map performance classes to zones, that basically means we're going to introduce multiple zones, not just a single zone. In this particular example, probably two different types of zones, right?

Right, right. So think of it like a NUMA node per window, because each window has one.

Yeah, yeah. I think a NUMA node is a very natural concept. Since you asked at the beginning, should we introduce a new zone for this kind of thing: my comment is that if we introduce a zone, it should not be a single zone, it should be a list of zones, one corresponding to each performance class. And I think that might be a complexity.

Right. No, I take your point. A single zone can't represent the degrees of freedom we want here, because we want to be able to have gradations of CXL, and NUMA node numbers give us that gradation.

So I just have a comment, or not really an idea, just a question to the audience. If we start talking about multiple nodes, would it make sense to have some kind of MPOL policy that tells me: give me the slowest memory possible, here's a set of nodes; or give me the fastest? Or is it really just, I'm going to bind to, let's say, two classes, and whatever I get is what I get? Is that rather what we want?

Well, the part I'm not saying is that these classes don't have any meaning. I mean, we have ways to tell user space what they are, but I'm really hesitant to have an API, or any kind of way, for somebody to say "give me fastest", because everybody's going to say "give me fastest".

No, no, I expressed it the wrong way. It's more like: right now when you define a mempolicy, it's like preferred-many, so you prefer multiple nodes, but you can't define a hierarchy, like first try node three, then node two, then node whatever. I'm just wondering whether it could be desirable to express something like: I'm really interested in the slowest possible memory, but if that one node is depleted... and maybe, isn't that just the fallback order? But with mbind, you can actually run out of memory if one of the nodes is depleted, or not?

Yeah, not with MPOL_PREFERRED_MANY. So essentially you can just say, I want all eight of them, or whatever the number will be, and you start somewhere and then go through the zonelist order. Yeah. You want to add something?

I just wanted to dig into what was talked about on hotplug of different performance classes: the comment about, can you have new CXL devices show up of different performance classes?

Oh yeah. I mean, eventually that would be a desire, right? But let's say all these eight windows are mapped to CXL devices, except window four has one little bit of space. And then CXL six comes in, we ask it its performance, and it doesn't match the performance class of window four. We have a choice of leaving it unmapped or just mapping it at the wrong performance class. And right now my policy is: just map it at the wrong performance class until somebody screams. I think having capacity is better than... having less performance is better than zero performance. So I'm basically putting the onus on system designers: make sure you have enough window capacity, make sure you plug in devices with compatible properties. But if you don't, the kernel will do best effort, and we won't be strict about it, because being strict is more difficult.

So say they've got a system running on a bunch of switch-attached DDR4-based modules, and over time they're upgrading to DDR5 because the DDR4 is dying or whatever. Is it practical for them to understand what happens when they hotplug? What's the onlining process? How do they make sure it shows up in the right window? Like, running with DDR4 and then swapping those devices out for DDR5?

I mean, yeah, I think that's fine. You unplug these, the capacity gets freed up out of the window, you bring up the new devices, ask which window has capacity for them, and pick that one as your preference. And if it's not available, just pick any other window, just to get it mapped. But yeah, that should be fine.

Okay. I agree with the architectural way to use it. But why does it become a new zone? The reason we selected the zone instead of the node was, you know, the zone actually implements the actual memory management algorithms, like compaction, or reclaim, watermarks, or migration, or anti-fragmentation, things like that. There is the node-specific node-reclaim algorithm, but the rest operate on zones. So what if we make it a new zone?
You're saying that you want to have reclaim policy, like node reclaim policy?

Zone reclaim, or compaction, or migration, or anti-fragmentation; those algorithms are currently applied at the zone level. And I think, for the CXL needs, we probably need to revisit those algorithms for CXL memory. If it becomes a node, well, in the current implementation the algorithms are not applied on nodes, but on zones. So probably it is, yeah, good to try.

Can anybody else help me out with that question about core MM migration and compaction policy, and whether that would need to grow to become node-aware, or is it zone-aware? Anybody?

Zones are an internal memory management abstraction that shouldn't leak outside of the core memory management. So if there are any API limitations where parts of the kernel would like to work with something and they are forced to use struct zone as an argument, that's a good reason to change that to be node-based.

Yeah, yeah. There is one reason that we chose the zone. Because the zone, as you said, is an internal algorithm, an internal implementation, while the node is widely coupled with other kernel subsystems and user-space subsystems. And as you said, even with the performance windows there will probably be 5 or 10 nodes system-wide. If we make it a single node, then probably the kernel side could make an abstraction layer to keep it a single node, but then the kernel modification needed is at the node level, which is widely connected to other subsystems. If we choose the zone level, it is connected to fewer kernel subsystems.

So, I always keep coming back to the observation that node numbers and zone numbers are in the same bit field of struct page, of page->flags. And any operation that you'd want to run on a zone, if your zone is the boundary of memory of a certain physical address range, you can just convert that one number into a node mask. If you want to say, I want to migrate all CXL, you just create a node mask of all the CXL nodes and then do your operation (there's a sketch of this at the end). So I'm missing the part where we can't do it with a node mask. I feel like a new zone and a node mask are basically analogous; largely the same thing.

Yeah, and realistically speaking, with this kind of memory we are talking about ZONE_NORMAL or ZONE_MOVABLE, and most likely ZONE_MOVABLE in 90% of cases, because you simply do not want to have a lot of your kernel metadata sitting there without any actual intention of allocating from it. So you might have both of those zones, but you define that by how you allocate, right?

So, yeah. And right now we're really piggybacking on a lot of David's work in virtio-mem, on hotplug policy and reserving a little bit: if you're adding a terabyte, maybe you need some percentage of it to be ZONE_NORMAL and the other percentage to be ZONE_MOVABLE. So in all these kinds of questions about zone partitioning, I'm like: let's do the exact same thing that virtio-mem is doing. I get to see a smile.
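To make the node-mask point above concrete, here's a minimal sketch of "migrate all CXL" as a single migrate_pages(2) call, old node mask to new node mask. It assumes nodes 2-3 are CXL and nodes 0-1 are DRAM (illustrative numbers):

    /* Migrate a process's pages off the CXL nodes onto DRAM.
     * Build: cc migrate.c -lnuma; run: ./migrate <pid> */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        unsigned long cxl = (1UL << 2) | (1UL << 3);
        unsigned long dram = (1UL << 0) | (1UL << 1);
        int pid = argc > 1 ? atoi(argv[1]) : 0;  /* 0 = calling process */

        if (migrate_pages(pid, 8 * sizeof(unsigned long), &cxl, &dram) < 0) {
            perror("migrate_pages");
            return 1;
        }
        printf("pages moved off nodes 2-3 onto nodes 0-1\n");
        return 0;
    }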