Hi, everybody. My name is John Tranum. This is my second Linux conference. When I submitted this topic, I was really hoping that Ben or Dan from Intel would get up and talk about it so I could ask questions. Disclaimer: I'm not an expert in this area. I work on hardware design, and I work with emerging memory suppliers.

I didn't put a "what is CXL?" slide in my presentation because I didn't want to insult the audience, but then at lunchtime somebody asked, "well, what's CXL?", and they're in this room. So I should briefly say: for us, CXL is a way to attach memory devices that couldn't otherwise attach through something like the DDR DRAM bus. There are other aspects to CXL, ways to do cache coherency and accelerators and that sort of thing. That's all good, but it's a little aside from what I'm interested in.

We're doing a lot of work looking at emerging memories. It's very expensive for an emerging memory company (a lot of these are start-ups) to put a DDR DRAM interface on their parts, and it may not even work from a timing perspective. CXL solves that problem. What's more, some of these memories are persistent, and DDR DRAM (DDR5, say) really wasn't designed around persistence, so that makes things interesting as well.

So I'm going to talk through some failure modes and things I'm worried about in trying to set up a system. Maybe these things have already been thought through, and maybe the kernel already accommodates them. If so, I apologize and I'll be happy. But if not, maybe we should talk through them.

The first thing is that the memory devices we're going to attach to these systems are going to be a little different from traditional DRAM. The performance is different; we've already talked about the mechanisms for describing different latency and throughput, and there are ways of informing the system about that. Some devices have persistence.
Some are nonvolatile and others are volatile. Endurance and wear-out, failure modes, and error behavior are different, and the reliability of the parts is different. We're looking at things as diverse as ferroelectric memories, carbon nanotubes, and resistive RAM, just to name a few. There's already been a lot of work with phase-change memory; Intel's Optane product is a good example.

When we're thinking about failure modes and reporting, CXL introduces many different ways of informing systems. There are multiple paths through CXL for telling the system, the kernel, that things are going downhill or going bad. There's the CDAT, the Coherent Device Attribute Table, which is basically a way for the CXL device to say: this is my performance, this is my throughput, my latency. CDAT also allows a device whose performance is degrading to say: here are my new numbers. There are ways to give updates through the CDAT. There are event records: the device can post an event record saying it needs maintenance or its performance is going downhill, so that's another path. There are messages, and there are other mechanisms in the protocol that allow for this sort of reporting. I haven't seen an implementer's guide to inform somebody like me, trying to put together a system, that they must do this, this, and this, or that this way is preferred. So that's a challenge here.

Oh yeah, I should also mention that devices can report poison or errors on individual reads. If you read a 64-byte cache line with ECC and it comes back bad, the device can poison the read. And if the whole device goes belly up, it can go viral.

Then there are switches. CXL has different versions, and the 2.0 version enabled switches. So here we have our emerging memory with a device controller, and this device controller can either plug directly into the computer or go through a switch.
When you talk about switches, you're now talking about possibly having the CXL memory devices in a different row in the server room, and so it takes away a physical barrier. I know you can offline DRAM and then pull it; I haven't tried it, but it's probably a lot more likely to happen with CXL if it's in a 1U or whatever and a person comes by and pulls it. That's a possible issue. So I have to think about what happens if one of these fails, or if the switch sees that somebody's pulled it: what's the preferred way for the device to inform the kernel that that's happened?

There's also the notion of dynamic reconfiguration, for which the buzzword is composability. Basically, if you have multiple microprocessors attached, imagine you're at a large CSP and some company doesn't pay its bill, or another company wants more memory; you can potentially dynamically give more memory to one processor than another. You may take some away or allocate more. CXL provides multiple ways of setting up these switches. One way, which we think is probably the most likely, is an out-of-band fabric manager that configures the switches, allocates memory, and potentially changes it on the fly. But that fabric manager is likely out of band, and that gets to the question: if the fabric manager is allocating more or less memory to a particular processor, how does it do that? If it's doing it gracefully, it would say "I want to offline this memory," the kernel would then unmap those lines and say "okay, you're good to go," and then the fabric manager would do its reconfiguration. There are mechanisms for that communication, but I don't know how it actually works.
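On the graceful-offline handshake, one kernel-side piece already exists: the memory-hotplug sysfs ABI, which user space (or a fabric-manager agent) can drive. A minimal sketch, assuming the agent already knows which memory block backs the capacity being reclaimed; the block number 32 here is a made-up example:

```python
# Hedged sketch: ask the kernel to offline one memory block before a
# fabric manager reclaims the capacity.  /sys/devices/system/memory/
# memoryN/state is the kernel's standard memory-hotplug ABI; the block
# number is illustrative only.
from pathlib import Path

def request_offline(block: int) -> str:
    state = Path(f"/sys/devices/system/memory/memory{block}/state")
    try:
        state.write_text("offline")
        return "offlined"
    except OSError as e:
        # Unmovable (busy) pages, a missing block, or lack of privilege
        # all land here; the fabric manager must not yank the device yet.
        return f"not offlined: {e.strerror}"

print(request_offline(32))
```

Only after the write succeeds, and the block reads back as offline, would the fabric manager be safe to unmap the capacity behind the switch.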
And then, whether this is a generic device driver or a custom device driver, or something that wants to happen in the kernel, what's the answer there? Or is it up in user mode?

In that fabric manager example, is it making it look like a PCI device removal and add, to simulate the memory add?

It might, but it could also be a shrink or grow. It may not just be an outright removal; it may be a shrink, where I'm taking away some memory. I'll go to my next slide here. This is kind of a worst case, and I'm sure, to Adam's point, it's going to be a lot more generic than this to start with. But here you have multiple hosts coming in, and hosts can interleave the memory, which is great for performance in general: if you're doing a page flush, you want sequential accesses to go to multiple devices; that's motherhood and apple pie. But then there are two aspects. If I want to shrink this guy, or even without interleaving, shrink all of them, I have to do that communication back to the host system. And if a device fails, I may have interleaved data, and now you have to punch holes in it, essentially, in the page table, and deal with that failure mode. I don't know if the kernel has any mechanisms for deallocating interleaved memory or dealing with errors there. And this interleaving can be as fine-grained as 64 bytes, so it's going to get tricky. Matthew?

I think our usual method for dealing with that kind of situation is crashing.

Yeah, thank you. Halt and catch fire. I've seen a lot of that trying to get this stuff working.

One other aspect I wanted to mention is security. CXL 2.0 enables security.
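To make the interleave worry concrete, here is an illustrative sketch of round-robin interleave math (my own simplification, not the CXL decoder algorithm): with 64-byte granularity across four devices, every 4K page contains lines from every device, so a single device failure leaves holes in every page of the set.

```python
# Sketch of round-robin interleave math (illustrative, not the CXL
# spec's decoder algorithm): which device serves a host physical
# address, given an interleave granularity and number of ways.
GRANULARITY = 64          # bytes per interleave chunk
WAYS = 4                  # devices in the interleave set
PAGE_SIZE = 4096

def target_device(hpa: int) -> int:
    """Device index serving this host physical address."""
    return (hpa // GRANULARITY) % WAYS

# Every 4K page touches every device: if one device dies,
# no page in the interleave set survives intact.
page_devices = {target_device(a) for a in range(0, PAGE_SIZE, GRANULARITY)}
print(sorted(page_devices))   # -> [0, 1, 2, 3]
```

That last line is the crux: at 64-byte granularity there is no page-sized region that avoids the failed device, which is why per-page error handling degenerates into losing the whole interleave set.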
So there are keys, and the keys can be individualized to individual memory devices. So if you lose a device, there's a whole aspect of key management: how do you deal with loss, how do you deal with migration, and all that. There are some religious, philosophical arguments, and I don't know if this stuff is solved.

I'm curious to hear feedback from you, Dan, and others as far as what forum there is. You've got CXL, which is great; it gives you all these different ways of doing things. You've got the kernel community. But where the rubber hits the road of trying to put these two together, how does that get settled? Do we agree that this is the way you do things, and get that published? How is that going to happen? Certainly some of this can wait until these things prove viable and valuable in the marketplace, but it's generally preferable if we can at least give some guidance up front. So in general, it's a call for help: what do we do where, how do we standardize this, and how do we take it to the next level? With that, I'd like to hear your feedback on how we go on that. Adam?

I would love to be able to write LWN articles telling vendors what to do. That sounds great.

Yeah, so on the last slide you were talking about security. My understanding of the current definition of those security commands is that they're only for persistent memory. You're not thinking about security for volatile memory, right?

We've gotten feedback from customers that they want that link encryption even for volatile memory.

Link encryption? Okay, you're talking about the IDE stuff, not data at rest. Data at rest is separate from the link. I'm talking about the 2.0 mechanism. Yeah. We have this massive pile of specs to get through to implement link encryption, wrestling with PCI Express as well.
So I'd like to put that aside. So would I.

How to deal with loss: you're talking about device loss? Device loss, yeah. I think that's the halt-and-catch-fire situation, unless you can... Yeah.

So say they're just running containers or something; the kernel is all in DRAM, and somebody's just got a bunch of containers. Is there a mechanism then to go and error out those sorts of things?

I think the simplest way to think about this is to pretend it's a DIMM. If you pull out a DIMM, you're dead.

Does it have to be that way?

Well, at least in the current version of the spec, even the electricals don't support it; I think there's actual data loss if you just yank the device. So yeah, I think it does with CXL 2.0. Okay. We've got a comment on the line, then we'll go over here.

I will say, for things like interleave and trying to recover from errors there, I can't imagine we'd ever be able to do something like recovering on a 64-byte boundary. I would take a really close look at the current error handling we have in the kernel. In the end, the hardware is probably focused at the 64-byte level as well, but we only handle it at page granularity: even if memory fails at a finer granularity, we start axing pages out of existence. So I would first look at the existing mechanisms and make sure you actually need something different. I know CXL has all this crazy stuff it can do, but the real question is how many of those capabilities the kernel really cares about and can really leverage. For the most part, it just follows normal DDR error handling: something machine-checks, pages get offlined.
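That page-granularity point can be sketched as follows; this is a simplified model of the behavior around the kernel's memory_failure() path, not the actual machine-check code. Poison arrives per 64-byte cache line, but the unit the kernel throws away is the page:

```python
# Sketch: hardware reports poison at 64-byte cache-line granularity,
# but the kernel offlines whole pages (cf. memory_failure()).
# A simplified model, not the real machine-check flow.
PAGE_SIZE = 4096

def page_of(addr: int) -> int:
    """Round a poisoned cache-line address down to its page frame."""
    return addr & ~(PAGE_SIZE - 1)

# Three poisoned cache lines, two of them in the same page.
poisoned_lines = [0x1000_0040, 0x1000_0080, 0x1000_3000]
pages_to_offline = sorted({page_of(a) for a in poisoned_lines})
print([hex(p) for p in pages_to_offline])   # -> ['0x10000000', '0x10003000']
```

So 192 bytes of actual poison still costs 8K of address space: the kernel's unit of loss is the page, which is the existing mechanism the comment suggests leaning on before inventing anything finer-grained.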
I think the new use case that CXL brings is that somebody was sent out there to pull out card A and they pulled out card B. That will be a massive memory-failure event, and all we can do is hope that we can log things before the kernel crashes in those situations. But if you look at the persistent memory side, at least there is an error model on top: you can tell the file system to give up, you can tell people to give up, and the page might not be mapped into something critical where the whole kernel comes down. You might lose the file system, but you won't lose everything. That's the main thing from the loss side: just being able to communicate that 16 terabytes are now gone, and start logging that range, if the log makes it out. There'll be other alarms going off when that happens.

Just one other comment. Oh, I'm sorry. I think those massive loss events are going to be relatively rare, and I agree that the best thing is to just crash the system, dynamically reconfigure, reboot, and carry on, right? If it was persistent memory and you expected data to be retained out there, that's sort of a separate problem that something out of band has to deal with, recovering that level of stuff: either because it has multiple failure domains that it can recalculate and reconstitute from, or it's backed up on storage or whatever. So again, I wouldn't worry about any of those big-level issues anywhere in the near term, perhaps ever.

Yeah, the only thing I was going to add is that, from a recovery standpoint, if you do have backups, you will be able to tell which device was interleaved where with the current interfaces, and so you could rebuild without too much trouble.
And I don't know, Dan, if you've thought about the user-space tooling for that, but it shouldn't be tremendously difficult. So even though you just blew up in a fire, you can recover; it's not like you're just lost at that point.

On the composability side, I'm curious about shrinks, for example: is there support for that sort of thing, especially if it's interleaved?

I think one of the observations is that CXL basically gives us enough rope to turn bare-metal systems into virtual machines, in terms of ballooning: the same kind of model where you want to inject memory into a guest and get memory back out of a guest. Now it's a bare-metal problem, because we want to inject memory into a CXL domain and pull memory out of a CXL domain. So I'm hoping we use the same interfaces; we mostly just build some infrastructure to glue them together, but for the most part it looks the same from the kernel side: memory showing up and disappearing.

Yeah, so that would mostly rely on memory hotplug, which, for increasing, is the kind of thing that works. Shrinking is in a much worse position, because if you want to shrink, you essentially reduce the usability of that memory: it can only be used for movable allocations, so that you can migrate it away. And then we're back to the question of how much movable memory you've got on your system, because then we're back to highmem systems, where essentially a large part of your memory cannot be used by the kernel for anything, and to metadata blowing up in the regular memory, and all those problems. So yeah, shrinking can get pretty complex.

So when you said, I think, movable memory, does that mean non-pinnable?

Yeah, so it's only for user space, and only if that user space behaves and doesn't pin that memory by some other means.
So quite a lot of restrictions right there.

Yeah, and as much as the device-dax access interface is kind of weird, it's there to bridge that gap: you can add it to the MM, and you might be able to get it back. It's a maybe, not a guarantee. If you need strict guarantees, then I feel like you're in the device-dax access mode, where I can rip away your device and shut down everything using it, and I don't have any concerns about the kernel putting anything there that it needs for its own survival.

Yeah, so shrinking works as long as... but when you ask application developers, they don't want to map a device; they want to do a malloc. So they kind of want to have their cake and eat it too: "I want to be able to malloc" and "I want to be able to rip it away from the kernel whenever I want to," and those things conflict.

You can focus on the room here. Randy? Yes.

So one of the things hardware folks can do that would be really helpful here is to help the kernel's hot-remove mechanisms. The kernel is OK at evicting a memory area so that we can yank the memory out. But if the hardware vendors could, for instance, give us more flexibility to say, hey, we can get all of this one gigabyte of memory removed except for this one 4K page; if you could let us leave one straggler in an area, then yank the rest out, and with your CXL magic map things underneath, that makes our job a lot easier. That makes it much more likely we can deliver success. Or, as an example, maybe you say, hey, we need to remove some of your physical address space, but the kernel gets to pick which physical address space goes away: maybe we start evicting things and say, oh, we just happened to be able to evict this one, but not that one.
And then you unmap underneath, in your magic switch layers. If there are things the hardware can do that let the kernel be more fallible, that would be really, really nice. Right now, what happens for hot remove is basically something says, hey, this exact physical range is going away; we need you to evict all of your use from every last byte of it, and you have to be perfect. That's really, really hard for the kernel. But if you can come to us and say, hey, can you evict most of this, and you can screw it up a little bit, we're better at that.

I wanted to comment on your ask about how to coordinate. In some ways I have the same problem too: I see patches up on the mailing list, and they kind of just show up. What I'd really like to see is, if there's any way, a call, as a group, to discuss what's going on in terms of CXL patches, what vendors are thinking, what consumers are thinking, and kind of all get on the same call and constrain the problem space.

Yeah. Not to put Willy on the spot, but I know you run a periodic call. On the logistics of setting that up, and its success: what would you do differently, or anything?

Yeah, it was actually LSF/MM in Puerto Rico where we decided that there's a whole bunch of us working on transparent huge page kind of stuff, and that became the THP cabal. I think we grew to like 16 people in our bigger Zoom meetings. It was invaluable, utterly invaluable, for everyone to know that every couple of weeks they get to say their thing about what's going on. We made it happen; it worked. I think things would have been going a lot better with some in-person meetings, because two years of that was COVID, right? We didn't have this conference for two years straight.
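That "fallible hot remove" idea could look something like this from user space. A hedged sketch, using the real memory-hotplug sysfs files; the notion that hardware would then remap around the stragglers is the ask from this discussion, not an existing kernel or CXL capability, and the block range is made up:

```python
# Hedged sketch of best-effort hot remove: try to offline every block in
# a range, but report stragglers instead of failing the whole operation.
# Remapping around the survivors would be the (hypothetical) hardware's job.
from pathlib import Path

def offline_range(blocks):
    """Offline what we can; return the blocks that would not budge."""
    stragglers = []
    for b in blocks:
        state = Path(f"/sys/devices/system/memory/memory{b}/state")
        try:
            state.write_text("offline")
        except OSError:
            stragglers.append(b)   # e.g. one unmovable 4K page pins the block
    return stragglers

# With "perfect" semantics any straggler aborts the removal; with
# fallible semantics we hand this list back and keep the rest.
left = offline_range(range(32, 40))
print(f"{len(left)} blocks could not be offlined")
```

Today's semantics require that list to come back empty; the request to hardware vendors is to make a non-empty list survivable.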
And that sucks, and it definitely delayed what was happening. But having the every-two-weeks meeting, if nothing else as a status update (what have you done in the last two weeks?), is kind of an important check-in: how's everything going? We settled on two weeks as a good cadence, but you might feel differently. So yeah, we just have a Zoom meeting, and anyone can do that. I just book it for an hour, but if everyone has said their piece and there isn't anything more to say, then we hang up after 20 minutes.

Yeah, so I've been interested in coordinating something like that. The ground rules would be that we only talk about things in terms of Linux, open source, and open public specifications. Nobody wants to hear anybody else's proprietary development plans, but there's a lot of common stuff that we can talk about.

Yeah, I'm bought into that. You can sign Samsung up.

Thanks, Adam. Good suggestion. Thanks, everybody. Yeah, thank you. Thank you.