So people have been teasing me the last couple of days about how you spell physr, or phyr, or whatever we're calling it. So Matthew, which was your preferred embodiment? It was your name? So, this spelling. OK, so I got it right this time. I think I made up "physr" by accident at one point.

OK, so this has been kind of a percolating issue for quite a while. We have some hacky solutions to some of these problems in some places in the kernel. This is an effort to give us a solid footing, where we have solutions to these problems that we can actually build on and make the world better, instead of abusing scatterlist. And scatterlist has been the bane of a lot of these sorts of things.

So Matthew kicked this off with number one here. He was interested in speeding up get user pages, right? If you've got a 1GB or a 2MB page or whatever, you still fragment it into 4K pages and return them in a linear array. And that's just awful. It really, really is.

I'm interested in number two. I would like peer-to-peer DMA to work. Stephen Bates talked about this on Monday, and his version of it uses struct pages. And there's another version of this that we're very interested in, especially for VFIO, that just uses raw PFNs. We had a little side panel on the mailing list about this, and it basically turns out that VFIO runs in so many environments that you're not going to get struct pages for it. It's not going to work on s390. It's not going to work on RISC-V. It's too far away. So we need a non-struct-page solution for that.

But number two is just PFNs? No struct pages, no metadata? No struct pages, that's the problem. No, but I mean, is the solution you're going for just PFNs, or are you looking for a little bit of metadata? There will have to be some metadata, I think, because that's the way the DMA API works. The DMA API needs a little bit of metadata to know where the page originates, so that it can compute the correct PCI offset when it does mapping on strange embedded ARM platforms. Don't ask, we are not going to go into that. It's an issue. We need a little bit of data. And you can see that data already in the struct page environment, because when you create the pgmap for P2P DMA pages, that little bit of data is in there.

And then number three, this was talked about by Keith. I would like to make the block API run faster, because I've got this silly thing where I take a bio and then I just memcpy it slowly into a scatterlist, after allocating the scatterlist. And then I do a map, and I get the same numbers back, and it's all kind of dumb.

And in RDMA land, we have these crazy users that would like to pin memory, quite a lot of it, like hundreds of gigabytes. That's their application: they pin 100 gigabytes of memory and they do IO to it forever. Currently, we store all of this in a scatterlist, and we use 28 bytes per chunk. And you waste a healthy amount of memory in your system just keeping track of the stuff so that you can unmap it and then unpin it. It's kind of useless.

And this is so far down the list here, but the IOMMU driver interface under the DMA API is just horrible. It's really slow. For every single page, it walks the page tables and carefully inserts it; then it goes to the next contiguous one, walks the page tables again, and inserts that. So that could be improved too.
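To put rough numbers on that 28-bytes-per-chunk point: a scatterlist entry on a 64-bit kernel has roughly the shape sketched below (simplified from include/linux/scatterlist.h), while a bare physical range only needs an address and a length. The struct phyr name here is just the spelling being joked about, used for illustration, and the arithmetic assumes 4K chunks versus 2MB ranges.

```c
#include <linux/scatterlist.h>
#include <linux/types.h>

/*
 * Today's per-chunk bookkeeping, simplified: about 28 bytes of
 * fields, padded to 32 bytes per entry on 64-bit.
 */
struct scatterlist_shape {
	unsigned long	page_link;	/* page pointer + chain/end bits */
	unsigned int	offset;
	unsigned int	length;
	dma_addr_t	dma_address;
	unsigned int	dma_length;
};

/* A hypothetical 16-byte physical range ("phyr"): just (address, length). */
struct phyr {
	phys_addr_t	paddr;
	size_t		len;
};

/*
 * The pin-100GB-forever RDMA case:
 *   scatterlist, 4K chunks:  26,214,400 entries * 32B ~= 800MB of metadata
 *   ranges, 2MB chunks:          51,200 entries * 16B ~= 800KB
 */
```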
Yeah, Matthew. If I can channel Christoph and add number six, cleanliness: Christoph really, really hates it that we have the scatter... oh, I've got a whole slide about that.

So one of the challenges in this area has been the scatterlist. Christoph really hates the scatterlist. I think we all hate the scatterlist. But the problem with the scatterlist is that it's everywhere. It's absolutely everywhere. And every leaky abstraction that it has has been abused and misused someplace. It's just being used wrong, and what do you do with that? I know Logan tried to tackle some of this in a very narrow way, and it was hopeless; there are just too many drivers. So any hope of doing something better inside scatterlist is gone, I think. I think it's a write-off. I don't think anybody wants to see that. I'm bringing this up mostly to share it.

But yeah, scatterlist is what we've had, and if you're not familiar with scatterlist, it's weirdly designed for the world we now live in. Scatterlist was based on the idea that you would allocate a linear bit of memory, and you would have some goofy ARM or something that had a GART, and it might translate only the pages that were above 32 bits, because you were running on old hardware that didn't support 64-bit addressing. So you'd have a mishmash of translated and untranslated addresses. But we don't live in that world anymore. After you do a DMA map, you either get the CPU addresses you passed in, maybe plus an offset, or you get a single chunk of linear IOVA. So we do not need all this complicated per-page generality. We still need it in special cases, but in the fast path, like Keith's fast path, we don't need this stuff. You can get away without it. So you can optimize towards the fast path, where you can say my DMA map is the same as my CPU addresses, or my DMA map is described by 16 bytes. So we can significantly reduce the amount of memory we use and the number of cache lines we dirty, everything. We can do a lot better.

So I got into this because I wanted to do something with VFIO. I've been working on VFIO for a while. I replaced the management of the IOMMU with something else, and in the process I removed the VFIO support for peer-to-peer DMA that had been hacked in and had a kind of a security issue. I took it out because I didn't want to put the security issue back in. And now I would like to bring it back, and what I want to do is create a DMA-buf that is the handle for your peer-to-peer memory that doesn't have struct pages. I would like to pass that to the IOMMU stuff, and I would like to map it into the IOMMU and hold the reference count, so I don't have a security problem. And currently DMA-buf is designed around scatterlists. So I did the obvious thing that DRM has done, I did what DRM did, and Christoph didn't approve.

So we're going to try to make an improvement to the DMA API that will allow us to do all these things. I had some time with the DMA API maintainer earlier today, and maybe we have some agreement on what it will look like. It's probably not going to be quite like this, but this is the big picture. You have these new concepts where you do something without a scatterlist: you DMA map this new thing, you go in, and you get back a DMA list. And we can introduce the optimizations I talked about, where we can optimize for IOMMUs, where you only have the 16 bytes of storage, and you can optimize for identity. You can make peer-to-peer work. Peer-to-peer doesn't really work today unless you have struct pages, which is kind of an annoying hassle.
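To make the "identity, or a single linear IOVA" observation concrete, here is what it looks like with today's scatterlist-based interface. The wrapper function is invented for illustration, but dma_map_sgtable() and friends are the real current API.

```c
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * On modern systems the DMA side of a mapped scatterlist collapses to
 * one of two shapes, even when the CPU side has many entries.
 */
static int map_and_inspect(struct device *dev, struct sg_table *sgt)
{
	struct scatterlist *sg;
	int i, ret;

	ret = dma_map_sgtable(dev, sgt, DMA_TO_DEVICE, 0);
	if (ret)
		return ret;

	/*
	 * dma-direct (identity): each sg_dma_address() is the CPU
	 * physical address, possibly plus a constant offset.
	 *
	 * dma-iommu: the common IOMMU code typically coalesces everything
	 * into one contiguous IOVA range, so sgt->nents is often 1 --
	 * a single (addr, len) pair, i.e. 16 bytes of useful output.
	 */
	for_each_sgtable_dma_sg(sgt, sg, i)
		dev_info(dev, "dma range %d: %pad + %u\n",
			 i, &sg_dma_address(sg), sg_dma_len(sg));

	dma_unmap_sgtable(dev, sgt, DMA_TO_DEVICE, 0);
	return 0;
}
```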
And from there, we can talk about maybe using the same structures, maybe the same API, for pin user pages; maybe we can return folios from there. It's kind of an adventure. And as I started looking at this, I realized that the possibility of doing this as a one-shot conversion is basically nil. There's too much code; there's too much entanglement with scatterlist. Even RDMA: I can do most of RDMA in one shot. But somebody added DMA-buf to RDMA. That might have been me. And so then I have to convert DMA-buf as well, and that's too big. I can't do all that at once. So I want to see this kind of incremental world, where we can still take a scatterlist from DMA-buf, feed it into an API adapter, and use it in the new world after I convert RDMA. And then I'll go on and fix DMA-buf. But I can't do it all at once.

So my idea has been to create, effectively, a non-leaky API for iteration over these concepts. You can iterate over your CPU pages, you can iterate over your physical PFNs, you can iterate over your DMA-buf. And under that non-leaky abstraction, I will hide the details. And we will make it so that one of the things you can iterate over is a scatterlist. So I can take in a scatterlist, feed it to new code, and then fix things piece by piece, slowly, slowly, slowly.

So it's a little bit different, because scatterlist was micro-optimized for a minimal number of CPU instructions everywhere, and this is the opposite. We have made an API that is really solid, where we can put behind the scenes whatever we want, but you're going to pay a penalty. You're going to do more branches, you're going to do more function calls to get there. So, yes, it would be... More instructions per range, fewer instructions per page, right? Yes.

Well, this gets into the interesting question. A lot of the users at the hardware level, once you get these things, need to program hardware. And there are two popular ways to program hardware: you're either programming range lists, like what's typically called an SGL or an SGE or something, or you're programming page lists. And I think something like NVMe has both, and something like RDMA has both. So if you're programming a page list, you need to take your range, break it up into pages, and iterate over it; maybe my hardware only does 4K pages, so I have to break it up. In RDMA, we've already built abstractions and APIs to do this, and part of what I drafted is that I took those and put them in common code, so anybody could use them. You can take your ranges and it breaks them up into 4K or 8K or whatever you want. It just gives you what your hardware asks for. You tell it what the hardware wants, and it gives it to you. It saves lots of code.
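The RDMA abstraction being referred to is the block iterator. A sketch of how a driver uses it today to fill a hardware page list: the fill_hw_page_list() wrapper is invented for illustration (and assumes the caller sized page_list correctly), but rdma_umem_for_each_dma_block() and rdma_block_iter_dma_address() are the existing helpers that the draft lifts into common code.

```c
#include <rdma/ib_umem.h>
#include <rdma/ib_verbs.h>

/*
 * Take arbitrary DMA ranges held in a umem and hand back fixed-size
 * blocks matching what the hardware's page list wants: large ranges are
 * split into pgsz pieces, and blocks never cross a range boundary.
 */
static void fill_hw_page_list(struct ib_umem *umem, u64 *page_list,
			      unsigned long pgsz)
{
	struct ib_block_iter biter;
	int n = 0;

	rdma_umem_for_each_dma_block(umem, &biter, pgsz)
		page_list[n++] = rdma_block_iter_dma_address(&biter);
}
```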
So when I started working on this, I thought, this doesn't sound too bad. And I started at the logical point, because the DMA API is kind of the dividing line. To do anything useful, the DMA API demands a scatterlist. So if I want to do something else, I need to get the DMA API to do something different. And this is where I quickly got an education in why this is so freaking hard. We have 23 of these DMA API implementations, the DMA mapping ops, dma_map_ops. And that's a lot. And right now they all implement weird, old-fashioned IOMMUs, often GARTs, often weird things. And they've all been micro-optimized for their weird little architectures, in the time period they were written for.

So I really don't want to touch them. In the modern world, we're encouraging everyone to use the IOMMU layer and something called dma-iommu.c, which is the common code for operating an IOMMU. So you either have identity mode, which is common code, or you have this common IOMMU code. And that covers all the modern, cool hardware, except for PowerPC and some ARM32 stuff, which hopefully can be fixed. Oh, and s390, which has been fixed; it just needs to be merged. So we're doing pretty well. But there's going to be some kind of problem with the DMA API, and my feeling is we're leaning towards number three on this list. Number three is: we convert the modern world, ARM32, s390; I don't know about PowerPC. It would be nice to convert PowerPC, but I don't know. And then we just provide some backwards compatibility for the old stuff. It'll still work. It'll be very slow, but it'll still work. And we don't try to do number four, with the scream face there, which would be trying to add this to everything, everywhere, perfectly. Damn, I was hoping to ship you the contents of my basement. I don't want your basement, Matthew. I don't have room for your basement. So I don't want to touch these old architectures. I think they're fine the way they are, and they're never going to run RDMA or NVMe anyway. So why do we need to fix them? OK, does anybody disagree? OK, that's a consensus. I'm moving on.

Now, the other challenge I've been looking at: that was the DMA API, and I think we've got some ideas on how to get there. The other side is the gup side. And gup today is used in so many different places, and some of them are really performance-critical, like futexes. We want to do a page walk for a futex, and we want to do that really fast. How do I make it return something different than a page list? I want a function call that appends to this abstract API that I built: go append this chunk that I've read from the page table, and something else will deal with it. So what are we comfortable with here? Are we comfortable with adding some slowness to gup to add this indirection? Maybe some static branches. Maybe some indirect branches. I don't know. But what is the output of this going to be? Is it going to be like a folio but not a folio? Is it going to be a netmem? Is it going to be something else? We need lists. Realistically, we need lists, right? The use case that everybody has is they want to take a chunk of user VA and get a list of what that is. If it's a 2MB folio, or a couple of 2MB folios and a 4K folio, they want the list.

So my reading of this is that we do have two different kinds of users. You're right, we have the futex case, and I haven't really been thinking too hard about the futex case. But I think the futex case can be get-user-page. Lorenzo's work has been moving, I think, in that direction nicely, where we will have a get-user-page that's really good for that. And then the other use case is basically get-user-range, where you want physical addresses for this range: you want a vector of physical ranges. So you have a physical start and you have a length, and you have tuples of that. Yes, I think that's likely the case.
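A minimal sketch of that get-user-range output, assuming the vector-of-ranges shape just described. struct phys_range and pages_to_ranges() are hypothetical names; the loop just coalesces a pin_user_pages() result into physically contiguous ranges.

```c
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/mm.h>

/* Hypothetical: one entry in the vector of physical ranges. */
struct phys_range {
	phys_addr_t	start;
	size_t		len;
};

/*
 * Collapse a pin_user_pages() result into physical ranges, merging
 * physically contiguous pages. Invented helper, for illustration only.
 */
static int pages_to_ranges(struct page **pages, int npages,
			   struct phys_range *vec, int max_ranges)
{
	int n = 0;

	for (int i = 0; i < npages; i++) {
		phys_addr_t p = page_to_phys(pages[i]);

		/* Extend the previous range if this page follows it. */
		if (n && vec[n - 1].start + vec[n - 1].len == p) {
			vec[n - 1].len += PAGE_SIZE;
			continue;
		}
		if (n == max_ranges)
			return -ENOSPC;
		vec[n].start = p;
		vec[n].len = PAGE_SIZE;
		n++;
	}
	return n;	/* number of ranges produced */
}
```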
It's a little bit complicated, because in various places in all of this we do sometimes need to know if the struct page is there, and we do need to pull information out of the struct page. As counterintuitive as that is, the DMA API actually pulls information out of the struct page when it's working, which is annoying, but that's how it's been built. I mean, we can convert back from physical to struct page. If you know that the struct page is there. Right, and so then you have to handle the cases when it's not; but today that fails, right? That's an EFAULT. If you call get user pages and there is no struct page, you get an EFAULT.

Right, but what I'm looking at is the whole picture. I would like to go with a similar language, a similar API, from get user pages through DMA map to hardware. I would like to solve the case where get user pages always returns folios, and I would like to solve the DMA-buf case, where I don't always get folios; I get PFNs as well. So you want a range, and you want a bit that says this is struct-page memory, this is not struct-page memory. That's the generalization of the concept. And then, even further, to be really efficient, what you want is: here's my handle for my non-struct-page memory, which tells me where it came from, what it's for, how I DMA map it, which is that other little piece of metadata. Now, we could still use just the one bit, and use an interval tree or something to get back to the handle by looking up the PFNs; it just depends how efficient you want this to be. It doesn't really matter, I think, for my use cases anyway.

But for the futexes and things like that, how do we structure gup so that everybody's happy? If we have a gup-get-single-page, you would still have to substantially duplicate all of gup to do all the page table walks, all the fast stuff, all the slow stuff. Is that what we want to do? We already have a couple of copies of gup in the tree. We have pagewalk.c, we have gup, and we have a couple of places that just open-code it; you can see it sometimes, the PMD, PGD walks, there are a lot of them. Do we want to do something here? I know Matthew had talked about making a page table iterator thingy at one point, and I thought that was neat and terrifying. Who said no? No? That's too bad. Just do it and call it something else, Matthew says. OK.

Jason, one question. You mentioned that you want something that doesn't have a struct page, or to return something like that, but how would you actually protect it from going away or getting reused? Or is that another concern? I mean, if you have a page, you can pin it, you increase the refcount and so on, but how would that be handled? This is what the DMA-buf is for. The DMA-buf is the handle that protects the memory, and when I ask the DMA-buf to give me a list of the memory held inside it, I promise not to let go of the DMA-buf until I also let go of that list. So it moves the reference count from being a per-PFN thing to being, effectively, a per-file thing, because the DMA-buf is a file. It would be like ref-counting pages by ref-counting the VMA, which we don't do in the mm, but we can do it for these special cases.
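That per-file lifetime rule is today's DMA-buf model. Here is a sketch using the existing API: use_dmabuf() is an invented consumer, and the comment marks where the proposal would swap the scatterlist for the new range list under the same lifetime rule.

```c
#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>

/* The refcount is per file (the dma_buf), never per page or per PFN. */
static int use_dmabuf(struct device *dev, int fd)
{
	struct dma_buf *dmabuf;
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
	int ret = 0;

	dmabuf = dma_buf_get(fd);	/* take one reference on the file */
	if (IS_ERR(dmabuf))
		return PTR_ERR(dmabuf);

	attach = dma_buf_attach(dmabuf, dev);
	if (IS_ERR(attach)) {
		ret = PTR_ERR(attach);
		goto out_put;
	}

	/*
	 * Today this hands back a scatterlist; in the proposed world it
	 * would be the new range list instead, under the same rule: the
	 * list is valid only while the attachment and the dma_buf
	 * reference are held.
	 */
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		ret = PTR_ERR(sgt);
		goto out_detach;
	}

	/* ... program hardware from sgt here ... */

	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
out_detach:
	dma_buf_detach(dmabuf, attach);
out_put:
	dma_buf_put(dmabuf);		/* drop the file reference last */
	return ret;
}
```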
OK, and the other question I have is: I think right now we disallow get user pages when we have a PFN map. We do, I think. And would you have to change that, or does it not apply? Because otherwise you always have a struct page, and you can do your ordinary PFN-to-page and PFN-to-folio, et cetera. I don't propose to change gup at all. I think gup should stay as it is, where you always get struct pages back, because there's no other way to get an object and hold a reference count on it. At least, it's very hard. I mean, maybe we could... It's hard. I've thought about it. It's kind of hard. What I want is for some pages to come from gup and some pages to come from DMA-buf; that's what would be nice. And, you know, maybe at the extreme, gup would fail, and then you'd probe the VMA and say, oh, this is a DMA-buf VMA; I'm just going to get the DMA-buf, stash it away, hold a refcount, then get the pages out of it and append them to my list, so I can handle mixed maps and other weird things. And the places that care about this, and I think there's really only one place that would want something so complicated, namely VFIO, can do that extra work. I don't need to put it in the core code.

So couldn't you implement it such that you have your new fancy function, and it calls old-style get user pages, and if that fails or whatever, you just convert from whatever array of pages you had, or folios ideally, or whatever we're able to come up with internally, into your new representation, so that you don't have to duplicate each and every gup function just to handle both types? Well, it's not about duplicating a gup function. When you go to get the pages out of the DMA-buf, it gets them from somewhere else. Yeah, right, but I mean, it's just one part of the implementation, if I get it right: you try get user pages first, and otherwise you fall back to your DMA-buf. Well, in a lot of cases the DMA-buf things are passed in by file descriptor, so you never call gup anyway; you already have a file descriptor. The only kind of wonky case is that, historically, VFIO has used the VMA as the handle for these things. So, for compatibility, it would be very nice to be able to go from a VMA to the DMA-buf file descriptor, and then just do the DMA-buf special stuff as though you had been passed the file descriptor. And I think that's good enough. I don't know that we need to go deeper into that rat hole with gup. Gup is already kind of difficult, so I'm more for leaving it the same.

Yeah, just on the structure of gup.c, or wherever this ends up: it occurs to me that you want to put your new API in there and then just opportunistically factor stuff out. And I think one of the big opportunities is the page table walker, because, I don't care what you call it, that's where the duplication is. Yeah, that's right. But I don't think you should call it gup v2, exactly. It's more like you've got this additional thing, get-physr-pages or whatever, and then you opportunistically use whatever it provides, factor out the common stuff, and call it good. So there are two approaches I've thought about. One, we could do roughly that: we could put in if statements, like, if new gup mode, call a function; if old gup mode, append to the array; and then maybe you refactor from there. And gup v2 would be more like I take gup.c and compile it twice, maybe. If we really, really cared about performance, we could pull a stunt like that. Then the fast version stays fast, because it doesn't change, and the slow version is slow. But then you've got two copies of the thing, which is not appealing. It depends how much we care about the futex case and other really performance-sensitive cases. But I thought your new thing was for users that understand the new thing. Why do you care about old users calling the old thing?
Well, because we have about a thousand lines of page-walking code in gup that is exactly the same for the old thing and the new thing. And for the old thing, people care about performance a lot. So how do I do that? How do I take a thousand lines of code that I just need to change a little bit, and make little tiny changes without substantially affecting performance? Inject a BPF thing. I'm moving on now.

So I think I already talked about these pieces on the other slides. But yeah, I've realized that even for NVMe, which has page lists, it would be really helpful to have the segmentation approach that RDMA's got. I know DRM has the same issue. I see this in other places. And DMA-buf gets kind of fixed.

So this is a distillation of Keith's slides. Today it goes through this adventure. What I've been sketching is that we take the bio and we can convert it, like I said, through an adapter for an iterator, and then we can feed it through the new API. What Christoph has sort of proposed is that we change the way we view the DMA ops to be more modern. Instead of having the DMA ops actually process a bio or something, we expose more of the inner workings. We say the DMA ops are more like: you allocate an IOVA space, and you map something into that IOVA space, like the IOMMU lets you do. And then we let NVMe deal with the bio and use these primitives, so it can be super efficient. So it's decomposing the problem. It's moving the code we have to different places. I think it could work out. It needs a little work in the IOMMU layer to factor it, but that's OK. But I think the general dream is that you get to a world where you can take the bio, do whatever the new DMA ops are, get your answer back, and program your hardware.
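A rough sketch of what that decomposed interface might look like. None of these names exist; this is only the shape of the idea, allocate the IOVA window once, then link physical chunks into it as the driver walks the bio (reusing the hypothetical struct phys_range from the earlier sketch).

```c
#include <linux/device.h>
#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical decomposed DMA API -- every name here is invented. */
struct dma_iova_space;	/* an allocated IOVA window for one I/O */

struct dma_iova_space *dma_iova_alloc(struct device *dev, size_t size);
int dma_iova_link(struct dma_iova_space *space, size_t offset,
		  phys_addr_t paddr, size_t len);
dma_addr_t dma_iova_addr(struct dma_iova_space *space, size_t offset);
void dma_iova_free(struct dma_iova_space *space);

/*
 * NVMe-ish usage: the driver walks the bio itself and links each
 * physical chunk straight into the window, then programs its PRPs/SGLs
 * from dma_iova_addr() -- no scatterlist is ever materialized.
 */
static int map_bio_chunks(struct device *dev,
			  const struct phys_range *chunks, int nchunks,
			  size_t total)
{
	struct dma_iova_space *space = dma_iova_alloc(dev, total);
	size_t off = 0;

	if (!space)
		return -ENOMEM;

	for (int i = 0; i < nchunks; i++) {
		int ret = dma_iova_link(space, off, chunks[i].start,
					chunks[i].len);
		if (ret) {
			dma_iova_free(space);
			return ret;
		}
		off += chunks[i].len;
	}
	/* Hardware sees one contiguous range at dma_iova_addr(space, 0). */
	return 0;
}
```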
When I look at this from a reusability perspective, especially if you don't care about micro-optimizing for, what was it, 12 million IOPS per core, then you probably want some helpers. You probably want something like scatterlist, where you can just stick stuff into it, call a function, get something out of it, and write your simple driver that way. So I have been sketching this rlist kind of API, which is exactly that. It lets you have lists of CPU stuff; it lets you have lists of DMA stuff. One of the cardinal sins of scatterlist is that it didn't have type safety, so let's put the type safety in. And it gives you a set of helpers to work with in your respective worlds that are useful for drivers and things. So I'll negotiate with Christoph. He seems quite interested.

So that's the end of what I had prepared for slides. Oh, I'm right on time. Feedback? I've posted a couple of things about this, and there's been devastating silence. So sorry. Oh, that's in general. Yeah, I didn't understand it then. Now I understand it better, so I can make some comments. Actually, a question: the scatterlist is still going to be around, so if this is useful for the new world order, is converting scatterlist users just for completeness, or do you really see the need to go back and rewrite all the scatterlist users? No, I don't want to do that. Good Lord, no. I mean, it's like Matthew's XArray conversion, right? It would be super cool if people could use the same new, easy-to-use API, and that would be great, but realistically, it's even worse than the XArray, even worse than the radix tree. There are ten times the users the radix tree had. So it is.

Also, so your scatterlist conversion is more about people who want to do it the scatterlist way, but in a type-safe, new way? It's for new scatterlist users? Yeah, yeah, new scatterlist users. So there'd be a new API that you could use if you wanted to. And my thinking, again, is sort of based around this very abstracted iteration thing, and you could put a scatterlist inside it. So when I'm doing conversions, I can do them piecemeal. I can say: this part of the world is still scatterlist, DMA-buf is still scatterlist, this new part is the new API. And I can bridge them, at least for this time, for this kernel release, and I can remove the bridge in the next kernel release. And I can do things incrementally, because even the sketch I've got so far is like 20,000 lines or something. It's completely nuts. I need it to be smaller. I can't get it merged in that form.

So when I was looking at this, and I probably overlooked something major that you can just explain to me, I was thinking that we would just leave scatterlist the hell alone, introduce the physr-to-scatterlist conversions, and then, once every single user of sg_page and friends was converted, we could actually remove that from the scatterlist. And then drivers would just be unconverted and still work; you know, we'd just have shrunk the scatterlist. What did I get wrong? What did I miss? That you could ever do that, right? There are so many sg_page callers all over the place. I swear there were like five when I looked. Oh, maybe I'm thinking of... sg_page? sg_page, yeah. No. I missed something there. Oh, OK. OK, I see.

So the scatterlist is two things, right? It's a list of CPU things and a list of DMA things, and they're sort of inseparable. They become unified at the DMA API level. So a lot of places, there are not that many... a lot of places, oh, I have to stop. But in a lot of places, the subsystem creates a scatterlist for you and feeds it into a driver. So the subsystem does the sg_page bit, and the driver gets it pre-populated. But the driver is still mucking about with it. So you're still stuck with scatterlist. Oh, well, thank you.
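To close the loop on the bridging idea discussed above: a minimal sketch of the scatterlist-backed iterator. rlist_iter and its functions are invented names; sg_phys() and sg_next() are the real helpers hiding underneath.

```c
#include <linux/scatterlist.h>
#include <linux/types.h>

/*
 * Hypothetical bridge: the new iterator can be backed by a legacy
 * scatterlist, so converted code can consume unconverted producers.
 */
struct rlist_iter {
	struct scatterlist	*sg;		/* legacy backing, if any */
	unsigned int		sg_left;	/* entries remaining */
	phys_addr_t		addr;		/* current range start */
	size_t			len;		/* current range length */
};

static void rlist_iter_init_sg(struct rlist_iter *it,
			       struct scatterlist *sgl, unsigned int nents)
{
	it->sg = sgl;
	it->sg_left = nents;
}

/* Advance to the next range; the consumer only sees (addr, len) pairs,
 * never the leaky scatterlist internals. */
static bool rlist_iter_next(struct rlist_iter *it)
{
	if (!it->sg_left)
		return false;

	it->addr = sg_phys(it->sg);
	it->len = it->sg->length;
	it->sg = sg_next(it->sg);
	it->sg_left--;
	return true;
}
```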