So, welcome back to the room, and let me introduce you to our first remote presenter, Hong-Qiang Fan. You can start.

Hi, everyone. My name is Hong-Qiang Fan, and this is my first time attending this Linux summit. You can see I'm very excited, and also a little surprised that I was selected for the agenda. I'm fairly new to Linux, so if I make mistakes or errors that are obvious to you, please be nice to me. I'm working on some CXL-related stuff. CXL is also a new concept, and as many of you know, one of the potential uses for CXL is memory expansion. This is still at the concept level, because we don't have any real CXL device to work on yet, but here is some thinking about how we could use CXL memory together with container software. So let's get started.

First, why are we talking about a container memory interface? Currently there is a kind of device-driver layer for containers called the Container Storage Interface. It's a middle layer between containers and the many storage systems, like cloud, NAS, fabric, and persistent memory, so the whole system can work together better. For memory, we are thinking about a similar idea, but memory is so different from storage that we're thinking about what the changes would be and how to make that happen in an easy way. As the slide shows, the Container Storage Interface only covers file storage, but if we want containers to better utilize memory, we probably need some APIs between the system and the container software, something people might call a memory interface. There are some possibilities: currently persistent memory can be used as a direct-access (DAX) device, and a CXL device could probably be treated the same way, but is there a performance difference? An ideal CXL memory device can be used as system memory, with very little change required for the system to use it.
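No such memory interface exists today, so purely as a hypothetical sketch, modeled loosely on how CSI separates the orchestrator from storage drivers, the shape of an API like this might look as follows. All names here (MemoryRequest, ContainerMemoryInterface) are invented for illustration:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch only: no such interface exists; all names are invented.

@dataclass
class MemoryRequest:
    size_bytes: int                  # how much memory the container wants
    numa_node: Optional[int] = None  # preferred NUMA node, or None for "don't care"
    tier: str = "any"                # e.g. "dram", "cxl", or "any"

class ContainerMemoryInterface(ABC):
    """Analogous to CSI's split between orchestrator and driver, but for memory."""

    @abstractmethod
    def get_capacity(self, numa_node: int) -> int:
        """Bytes still available on a given NUMA node."""

    @abstractmethod
    def allocate(self, container_id: str, req: MemoryRequest) -> None:
        """Reserve memory for a container (e.g. by writing cgroup limits)."""
```

A real design would also have to answer the questions raised later in the session, such as who runs the tiering engine and how fabric-attached capacity is accounted.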
And what benefit could a memory interface give us? One thing, I think, is that it provides an opportunity for a tiering system, even though how to implement it, and where the best location for the tiering system is, are separate questions people may have a lot of discussion about. The other is that the memory interface could actually be a fabric-management layer that controls the resources on a large, scalable memory system.

Here are some very basic scenarios where I think the memory interface could apply. The first scenario is very basic: nothing needs to change, and the container uses whatever the system provides to it. In the second scenario there's a little more: the container itself wants to implement a tiering engine, and that engine needs to know which pieces of the physical memory space come from DRAM and which come from CXL. And if there are different NUMA nodes, is NUMA the correct way to use them, or is direct access to the physical space better? In the third, the container itself doesn't care about tiering; the tiering work is done by the memory interface itself. The last scenario is about expanding scalability: different servers could connect to the same memory appliances, and each server can request memory based on its needs.

That's pretty much what I think about the usage for a container memory interface. As I mentioned, this is a very early stage of the work and there is no real hardware for us to experiment with yet; things are early. What we would like to know, or discuss, is whether it's possible to have a common standard, rather than each company making its own standard for the interface. And if so, what kind of parameters, APIs, or data structures does the kernel need to provide to the software?
The other thing, which I myself find very interesting, is: could we host a container on CXL memory only, a CXL-only container? And where should the memory tiering engine go? Is it better inside the container, or on the interface, or implemented in the kernel so the kernel handles everything? The other one is about fabric management: if many servers are connected to a memory appliance, what would people like to do? As I mentioned, it will be heavily involved with the design of the tiered-memory engine. I think the rest of the time should be open for discussion. I'm not very experienced myself yet; I'm still learning and trying to get more involved with the memory-management side. So I would like to listen to the experts here in the meeting room and the other folks online. Any questions?

So I had a question. When I think of containers, is this a cgroup problem? Could we manage this at the cgroup level and just carve out per-NUMA-node allocations? Why do you need the CMI interface? That's just my thought.

Yes, that's a good question. I think cgroups are probably the best, or simplest, place to start. A cgroup can now set a maximum amount of memory, but I don't know if there's a way, or whether it would be easy to implement, to say: I want to give so much memory from this NUMA node and so much from another one.

I don't think we have any interface like that. What we have is cpusets, which can bind you to a particular set of NUMA nodes, which can help. But I don't think we have any quality of service for, let's say, defining how you want to spread your memory across those specific NUMA nodes. So is that something that would really be necessary? Or can you work with just defining the list of NUMA nodes you want to operate on, and use the memory cgroup to limit the total amount of memory?

Yeah, I can see that cgroups can control the total amount of memory at this point.
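For reference, the cgroup-v2 knobs being discussed are memory.max (total cap) and cpuset.mems (allowed NUMA node list); there is no per-node limit today. The sketch below only formats those control-file contents, assuming a v2 hierarchy mounted at /sys/fs/cgroup; actually writing them requires privileges and an existing cgroup directory:

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup"  # assumes a cgroup-v2 hierarchy mounted here

def cgroup_memory_settings(name, mem_bytes, numa_nodes):
    """Return the cgroup-v2 control files and values that bound a
    container's memory: a total cap plus an allowed-node set.
    Note this is total-only; there is no per-NUMA-node limit knob."""
    base = os.path.join(CGROUP_ROOT, name)
    return {
        os.path.join(base, "memory.max"): str(mem_bytes),
        # cpuset.mems takes a node list such as "0-1" or "0,2"
        os.path.join(base, "cpuset.mems"): ",".join(map(str, sorted(numa_nodes))),
    }

def apply(settings):
    # Needs root and an existing cgroup directory; not run in this sketch.
    for path, value in settings.items():
        with open(path, "w") as f:
            f.write(value)
```

This makes the limitation discussed above concrete: you can cap the total and pin the node set, but you cannot say "2 GB from node 0 and 2 GB from node 1".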
The problem is that right now nobody knows; at least there's no real hardware for this, and no real application scenarios. All of this is still imagined. So I would like to propose at least one use case: say a hypervisor is running maybe ten different containers, and I want to keep their memory performance similar, so I want to give them exactly the same amounts of DRAM and CXL memory. If that's the case, and cgroups were able to limit memory size per NUMA node, that would be a way to accomplish it. But whether that's really necessary, or how much real usage it would see, I don't know.

I can answer that. There are people who want a capability like this, because on systems where persistent memory is used in a volatile mode, which we can do in mainline today, they want to be able to set things up and say: hey, we have a container somebody paid the cheap price for, so we're going to give them a lot of slower-tier memory; and we have another one somebody paid top dollar for, so we're going to give them lots of regular RAM. So people definitely want to do this, and it matters even before CXL is in the picture. Systems doing this with Intel Optane today would like to do something like this. And I did post a URL in the chat to some patches that start down this direction. So there are people who have the same problem today.

So is there any way to solve this with NUMA control now?

No, not easily. With NUMA control, what you can do is lock somebody away and say: via the allowed nodes, you're not allowed access to a DRAM node, and that puts you entirely on a slower tier. Or you can say you're only allowed access to the DRAM node, and then you're in a fast tier. But it's all or nothing on a node basis, and there are problems there. For instance, people sometimes have the persistent-memory node set up entirely as ZONE_MOVABLE.
And if you restrict somebody to allocate only from a zone-movable-only node, strange stuff happens. So there are some not-great side effects from what happens now. So no, NUMA control isn't good enough today. People really do want to be able to mix and match these things a little. And as was mentioned, this is a quality-of-service thing. Right now it's all or nothing, and people want that to be more of a dial where you can select something in the middle.

So within a NUMA node, we need a way to identify the memory and then a way to attach to it.

Well, remember that one of the prerequisites of all of this is that the memory you want to manage, every different class of memory, every set of memory with a location and a set of performance characteristics, is already in its own NUMA node. We're not talking about subdividing existing NUMA nodes. Everything is already in its own node, and the firmware told us to place it like this. So this starts from a system that already has all of its memory tiers divided into different nodes.

So I know the current work is really focused on DRAM and something slower than DRAM. But one of the challenges CXL is going to pose is all of the above. It's going to be a spectrum of vendors and a spectrum of performance targets and configurations; you're going to have four or five different... I mean, in a pathological case, you're going to have more than just two performance classes. I don't know if we need to start thinking about that kind of stuff now.

At least I think we need to get good at two tiers first before we start worrying about five tiers. But I notice the patches you posted manage between the top tier and everything else.

I think there may be a difference between the kind of system a hobbyist might assemble and something an enterprise cloud vendor would put together.
I don't know that an enterprise cloud vendor is going to put together a system with so many different NUMA nodes, but I don't think you can count on that.

Yeah, I'd say it's safe not to worry.

Oh, I would just second that. I think we should start with use cases that are well defined, right? Everyone states their vision of what it looks like, between the cloud people and the hardware people, and we all agree that two tiers is where we start, and here's where we're at. My worry with all of this is that it gets out of control really fast. So I would go for the slow approach with tangible use cases.

Yeah, the only thing I'd say is that I think there's actually a very sane three-tier case: there's CXL memory, there's regular DRAM, and there's HBM. I think that's a system that makes a lot of sense to put together. I haven't heard of one where four tiers makes sense; maybe it exists, but I want someone to tell me what it is.

Well, I think CXL isn't a monolithic thing either. You could have DRAM attached via CXL, which is going to be slower than something directly attached to the CPU. And then you could have, for instance, Optane attached via CXL. So you could have multiple CXL-attached tiers. And since it's an open standard and anybody can build a CXL card, I would very much expect a lot of weird CXL devices out there, and a pretty diverse group of devices. So rather than CXL devices being one tier, I think there are going to be a bunch of different kinds of CXL devices. People have even talked about what happens if a system doesn't have any direct-attached DRAM; CXL might be the only way you attach normal RAM to a system at all.

And one proposal I saw recently on the list was to default all memory to the top tier and just let the driver decide what should be stored where. Yeah.
That actually seemed like a pretty sane way to do it up front, to me. You've got to start somewhere.

So actually, another question or idea I really want to propose: Linux needs to know the performance of the device. Whoever builds the system should have some idea of how a CXL device performs relative to DRAM and to the other CXL devices being plugged into the system. So should they provide some configuration and let Linux handle the rest, or is Linux able to auto-balance the performance on its own?

Well, there is quite a bit of standards work in this area. ACPI has a longstanding way of enumerating the latency between different NUMA nodes, so that's been there for a long, long time. There are also some newer things happening in ACPI around this; I know not everything is ACPI, but I'll just talk about the ACPI world for a sec. There's something called the HMAT, the Heterogeneous Memory Attribute Table. That will actually give you individual read and write latency and bandwidth for every individual proximity domain in ACPI, which is roughly a NUMA node in Linux. So we have that in the standards world today. And then for CXL there's also the CDAT table, and I have no idea what that even stands for, but CDAT. That provides basically the same information as the HMAT does, but for CXL-attached memory. And I think Dan's in the room if I missed anything.

It's the Coherent Device Attribute Table. But I just wanted to point out that everybody's using the term "CXL device". Excuse me. You really need to keep in mind that these things participate in interleave sets. The way CXL works, your memory regions are no longer just "here's a device". So you have to be really careful when you start talking about NUMA nodes and mapping them to a device.
Because I think you're moving into an area where that no longer applies. So yeah, just for those in the room who maybe weren't familiar with that, it's an important aspect of the spec.

I was also going to say that in Linux today we're still very married to the idea that NUMA nodes have distances, and that's how we organize and prioritize them. As Dave mentioned, we have the HMAT; other firmware might have something similar. But we just publish that information; it's not really plumbed in any sane way into the kernel's memory-policy decisions yet. I think that's coming, though, because the CDAT is a per-vendor published table that tells you nominal read/write bandwidth and read/write latency. Then, unless the BIOS mapped it, the driver has to go parse that, figure out how many devices are there and the switch routing to them, and tell the kernel: hey, this NUMA node has about this performance. But right now we have to boil that down to a distance number, which I think is OK to start. But yeah, it's very small baby steps.

Just because CXL opens up, I think, new classes of media, right? So there could be read/write asymmetries. Me personally, looking at PMEM before, I think the NUMA distance didn't work, because there was a read/write asymmetry. We need to advertise that up to applications so they know and can act on this information.

Yeah. So as Dan said, it is actually in a place right now where applications can read it; it's dumped out in the NUMA sysfs directories and such. So people can actually figure this out, if an app knows where to look. But I think the harder question is how the kernel makes decisions around this, because right now all of our fallbacks for NUMA node allocations are based on that one distance number Dan was talking about.
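As an aside, the attributes just mentioned are visible in sysfs on kernels that plumb HMAT through: each node publishes read/write bandwidth (MB/s) and latency (ns) under nodeN/access0/initiators/. A small sketch of reading them, with the sysfs root as a parameter so it degrades gracefully on systems without the files:

```python
import os

# Attribute names per the kernel's NUMA performance reporting
# (/sys/devices/system/node/nodeN/access0/initiators/).
ATTRS = ("read_bandwidth", "write_bandwidth", "read_latency", "write_latency")

def node_performance(node, sysfs="/sys/devices/system/node"):
    """Read the HMAT-derived performance attributes the kernel publishes
    for one NUMA node. Returns {} on kernels or firmware that don't
    provide them, so callers can fall back to plain NUMA distance."""
    base = os.path.join(sysfs, "node%d" % node, "access0", "initiators")
    perf = {}
    for attr in ATTRS:
        path = os.path.join(base, attr)
        if os.path.exists(path):
            with open(path) as f:
                perf[attr] = int(f.read().strip())
    return perf
```

This is the "apps can figure it out if they know where to look" path; what the kernel should do with the same numbers is the open question.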
And so, not that we want to do this, but say today somebody had a really write-heavy workload and somebody else had a really read-heavy workload. They might have different preferences about which NUMA nodes they want to fall back to, because the one doing a lot of writes wants to stay away from media with really slow write performance. Right now they have to figure that out themselves, based on what they're going to do, and map it onto a NUMA policy on their own. There's currently no implementation in the kernel that uses any of this information to do anything smart, to say: oh, this application is doing a lot of writes, maybe I don't want to put their allocations there. I'm not saying we should do that, but there's a lot of opportunity there. There's a lot of data we're not consuming in the kernel at all; maybe we don't want to, but there are at least a lot of options. We're leaving a lot on the table today.

So I want to get back to the presenter's question about where to place a tiered-memory engine. Certainly the kernel is working on making this as automatic as possible for applications that don't want to care about NUMA, which is probably most of them; most people don't want to worry about NUMA, that's the kernel's problem. So when I see things like a tiered-memory engine, I worry about the kernel and the engine basically getting into fights about who is doing the tiering and when. At least with the device-DAX interface, you explicitly take the memory out of the kernel's control and do it yourself. But I'm not sure we want a hinting-based daemon in user space instructing the kernel's tiering, versus it being either all kernel or all user space with the MM not involved.

Yeah, but that seems like a future decision.
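To make the read-heavy versus write-heavy point concrete, here is a toy illustration, not kernel code: given per-node bandwidth numbers of the kind HMAT describes, the preferred fallback order differs depending on which metric you rank by, whereas a single distance number forces one order on every workload. The numbers below are invented:

```python
def fallback_order(perf_by_node, metric):
    """Order candidate NUMA nodes best-first by one bandwidth metric.
    perf_by_node: {node_id: {"read_bandwidth": MB/s, "write_bandwidth": MB/s}}"""
    return sorted(perf_by_node, key=lambda n: perf_by_node[n][metric], reverse=True)

# Toy numbers: node 1 is a fast-read/slow-write device, node 2 the reverse.
perf = {
    0: {"read_bandwidth": 100_000, "write_bandwidth": 100_000},  # DRAM
    1: {"read_bandwidth": 30_000,  "write_bandwidth": 3_000},
    2: {"read_bandwidth": 10_000,  "write_bandwidth": 9_000},
}
```

A read-heavy workload would prefer to fall back 0, 1, 2; a write-heavy one 0, 2, 1. Collapsing both to one distance-based order loses exactly this distinction.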
I do wonder what folks think about that, though, because in theory we could already have fights between AutoNUMA and NUMA policies. I think we fixed that today by saying that AutoNUMA, essentially the kernel enforcing some kind of NUMA placement or NUMA migration, completely stays away if it sees any user-space policies set. So I don't know whether we need something similar here, or whether, once we've started doing smarter things in the kernel, we need the kernel to stay involved. AutoNUMA isn't super-duper widely used, and I wonder, if these configurations become really common and the kernel is doing these migrations on a more frequent basis across the ecosystem, whether we'll need something a little smarter than what AutoNUMA does, which is just: turn myself off whenever I see user space expressing any kind of intent.

So I want to ask our presenter, Hong-Qiang: do you see deficiencies in the current NUMA APIs for expressing what you need to express, or were you seeing some of the tiering problems we've just been discussing?

Yes. What I'd really like to achieve is a way to set a limit on memory based on either the NUMA node or on a physical address range. Well, that second one sounds a little scary, directly telling the application: you have physical memory you can use from here to there. That seems a bit much. After considering it, a new knob in NUMA control or cgroups is probably the better way to achieve what I'd like.

One thing I just want to let you know about, in terms of not having hardware: I'm in a position to have hardware, which is nice, but I know a lot of people initially won't. So my shameless plug is to try to push QEMU forward as well.
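For anyone wanting to fabricate such a topology, QEMU can already present a CPU-less, slower memory node with HMAT hints. The sketch below only builds the command line rather than running it; the flags follow QEMU's -machine and -numa documentation, but exact syntax varies by QEMU version, so treat it as a starting point:

```python
def qemu_two_tier_args(latency_ns=300):
    """Build (but don't run) a QEMU command line faking a two-tier
    topology: node 0 with CPUs and DRAM, node 1 CPU-less and "slow",
    with an HMAT access-latency hint. Sizes are fixed so that -m
    matches the sum of the memory backends (4G + 4G = 8G)."""
    return [
        "qemu-system-x86_64",
        "-machine", "pc,hmat=on",
        "-smp", "2",
        "-m", "8G",
        "-object", "memory-backend-ram,id=m0,size=4G",
        "-object", "memory-backend-ram,id=m1,size=4G",
        "-numa", "node,nodeid=0,memdev=m0,cpus=0-1",
        # CPU-less node standing in for a CXL-like device; initiator=0
        # says node 0's CPUs are its nearest initiators.
        "-numa", "node,nodeid=1,memdev=m1,initiator=0",
        "-numa", ("hmat-lb,initiator=0,target=1,hierarchy=memory,"
                  "data-type=access-latency,latency=%d" % latency_ns),
    ]
```

A complete invocation would also want bandwidth entries (data-type=access-bandwidth) and hmat-lb rows for the local node; this shows the shape of the knobs, not a booting configuration.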
I'd like to bring that up here so people can experiment with this too, right, and start working on these interfaces. My suggestion is to take a look at those patches, review them, start working with them, and then you can move the kernel forward with QEMU. Just to put that out there.

The problem with that is that it's limited for testing. You can create the different topologies of interest and experiment with them, but when you want real numbers, you're going to come up lacking.

My idea for doing that is not about performance numbers right now, right? If we want to talk about interfaces, and this seems to be an interface problem, why is that not good enough to start? Really experiment and say why the interface isn't good enough yet.

You can do that right now without... I mean, I'm all for the CXL support in QEMU. I'm just saying that even for testing interfaces, I don't see it as that necessary, because most of the examples I've been seeing are just fabricated with different NUMA topologies in QEMU right now.

I guess that's the fun part of QEMU, right? Fabricating all these different topologies and then seeing, from the kernel side, whether I can control the policies. But that's the way I see it.

But I think what Dave's saying is that at the end of the day, all that driver work ends up as just a NUMA node number. So for testing the interfaces for better NUMA handling, we can do that without fully emulating CXL. It only gets interesting when we're trying to do the algorithms, like translating CDAT. But yeah.

And there is some more fine-grained stuff we have to go do at some point, which is to say: when is it worth moving one of these things, and how much does it actually cost to move a page around, do all the TLB shootdowns, and all that stuff? So there are ways to do some of that today.
You can get a persistent-memory machine and treat that like another tier, because it will act differently than DRAM. The other thing we've done in testing is take a two-NUMA-node system, a system with two sockets, essentially treat the second node as far-away memory, and run it as a single-socket system. Again, that's not a CXL system, of course, but it kind of looks like one if you squint at it funny. So there are ways to play with things that resemble future systems on today's hardware. You can do it without real hardware, is what I'm saying. It's of course a heck of a lot more work, but if you're actually worried about performance and how some of these things will look on a real system, that's as close as you can get today, and you can go buy one.

Yeah, that's really good. Okay, Hong-Qiang, you can go ahead and wrap up with the summary.

Yes, so that's the last page of my slides. This was very good information, and hopefully there will be a lot more usage for CXL. I think there are two more sessions talking about the tiering engine that will get more involved with CXL and a lot of new application cases. Okay, I think that's it for me today.

Okay, thank you. Yeah, welcome to the MM community.

Thank you.