FYI, Tritonia is here, so he's the one that asked for it. Oh, we're gone. Come on, please. Do you want to borrow the laptop? If you share the slides. Your display's actually going to sleep here. If Linux ends up working better than Mac, we win. I just wonder if it's because you're not plugged into power. So right arrow should go forward? Yeah. OK. Hopefully this stays up long enough.

Just a quick overview of what we're doing here. Let's say you have an application that really only wants 64 bytes of your... OK, we're just going to do it without slides, sorry. I'll show you later.

So we have an application that wants to read 64 bytes of data. It's on the hard drive, on your storage, and it doesn't want any more than that. That's an example. What it has to do instead of reading that directly is allocate some kind of a bounce buffer, transfer the data into the bounce buffer, copy out the data it actually wants, and then discard the bounce buffer. This adds latency, and when you're reading something so small, you probably have a very latency-sensitive application. You're also using wire bandwidth that could be better used elsewhere.

To address this, storage protocols have added support for things like this. In NVMe they call these bit buckets. Basically, it's a way the driver can add a scatter-gather element to its descriptor that says: I don't want this data, do not transfer it to me, I have no memory for you to put it in. As for protocol support, I don't know of any protocol besides NVMe that supports something like that, but there might be.

To support this in Linux, it would only work through O_DIRECT, obviously. The descriptors we get in from user space already carry byte-granular alignments and byte-granular transfer lengths, so user space can already describe this. The block layer just won't let you do it; it'll give you an error instead.
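To make the bounce-buffer pattern described above concrete, here is a minimal userspace sketch: to get 64 bytes out of a 512-byte-sector device, the application reads the whole sector into an aligned scratch buffer and copies out the part it wants. The `disk` array stands in for the device, and all names and sizes are illustrative, not taken from the actual proposal.

```c
#include <stdlib.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Read `len` bytes at absolute byte `offset` from a sector-granular
 * "device" into `dst`, via a temporary bounce buffer. Assumes the
 * requested range does not cross a sector boundary. The aligned
 * allocation mirrors the alignment a real O_DIRECT read would need. */
static int bounced_read(const unsigned char *disk, size_t offset,
                        size_t len, void *dst)
{
    void *bounce;

    if (posix_memalign(&bounce, SECTOR_SIZE, SECTOR_SIZE))
        return -1;

    /* One full-sector transfer just to extract `len` bytes of it. */
    memcpy(bounce, disk + (offset / SECTOR_SIZE) * SECTOR_SIZE, SECTOR_SIZE);
    memcpy(dst, (unsigned char *)bounce + offset % SECTOR_SIZE, len);
    free(bounce);
    return 0;
}
```

The wasted work is the full-sector copy plus the allocation and free, which is exactly what the bit bucket path avoids.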
That's because the structure the drivers get is a bio, and bios describe everything in terms of sector_t, which is in 512-byte units. So I prototyped some stuff with a little bit of help from Jens. The proposal I came up with is that we add special pages that fill in those gaps. You can request your 64 bytes right in the middle of a sector, and we'll append bvecs on either side of it pointing at the special page, then pass that down to the low-level driver. The driver will see that this is a special page and create bit bucket descriptors for you. Then you can transfer your 64 bytes, or whatever it is, from the drive directly into your user space application. No bounce buffering, no copying, and it should be nice and fast.

There are a few other issues that came along with this. If you do a full-sector read without bit buckets, it takes only one scatter-gather element, so we don't need to allocate anything else, because it fits inside the command. When you use bit buckets, you need an extra descriptor on either side, so you went from one element to three. You can't fit that in the command, so you have to allocate from the DMA pool. The DMA pool is actually slower than we'd like it to be, and allocating and freeing out of it negates any of the performance benefits we would have gotten.

Those extra scatter-gather elements, are you talking about the special pages you put in the bio, or is that at the protocol level?

This is in the protocol descriptor. The allocation from the DMA pool is for a device-specific descriptor. That's because the command is still block-based while the SGLs are memory-based, so you have to describe all of the bytes of the block and where they have to go. You have to have the one SGL entry that says, I want these 16 bytes.
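The one-versus-three element accounting described above can be sketched as follows. A sub-sector read of `len` bytes at byte `offset` within a sector becomes up to three SGL elements: a leading bit bucket, the data itself, and a trailing bit bucket. The struct and function names here are hypothetical, not the actual NVMe driver's.

```c
#include <stddef.h>

#define SECTOR_SIZE 512

struct sgl_elem {
    size_t len;
    int    is_bit_bucket;  /* 1: discard these bytes, no memory backs them */
};

/* Build the SGL for a sub-sector read. Assumes the range fits inside
 * one sector. Returns how many elements were filled in (1 to 3);
 * a full-sector read needs just the single data element. */
static int build_subsector_sgl(size_t offset, size_t len,
                               struct sgl_elem out[3])
{
    size_t head = offset % SECTOR_SIZE;
    size_t tail = SECTOR_SIZE - head - len;
    int n = 0;

    if (head)
        out[n++] = (struct sgl_elem){ .len = head, .is_bit_bucket = 1 };
    out[n++] = (struct sgl_elem){ .len = len, .is_bit_bucket = 0 };
    if (tail)
        out[n++] = (struct sgl_elem){ .len = tail, .is_bit_bucket = 1 };
    return n;
}
```

Note how the element lengths always sum to the sector size: the SGL must account for every byte of the block, even the ones being thrown away.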
Well, then you've got the other, what is it, 496 or whatever: if you've got a 512-byte sector, you've got to describe those other bytes too. So you say: throw those away, keep these, and throw those away. Yeah, the SGLs still have to describe the whole sector, just where to put all the bytes, throwing some away and keeping some. Thanks, Fred.

So I did recently send a proposal for the DMA pool to make it faster. It's about twice as fast with some of the patches I sent. They're still under review; I do need to send a v2 after some feedback from Willy, so I'm probably going to do that this week. I also have the RFC proposal for the bit buckets that I need to send out, because it didn't hit the list when I sent it last week for some reason, so I'll fix that.

A couple of other issues. This doesn't work with io_uring's pre-registered buffers. The reason is that those are already bvecs, and you can't append pages once you create the bio from them, so we can't append the special pages on both sides. I'm not sure.

There are inline bvecs that you can use without allocating anything on bios, right? Oh yeah, inline bvecs. There are up to four of them, I think, I can't remember. That should be enough for describing the sub-sector read. Yes, there is. I think what Keith is saying is that for the registered buffers for io_uring, those are organized as bvecs, and part of how they're passed down is that you don't modify them, you just attach them directly. You init the bio from the bvecs. Yeah, you just attach that bvec array to the bio, and once you do that, it's fixed; you can't append pages to the front or the back. But the point you're bringing up is that each bio will have some inline bvecs. So you could just copy the bvecs over for a fixed buffer and add your pad, right? So it'll either be one or two, right?
You pad on either side or just on one side, depending on the size. When you have the bio, you do have four inline bvecs, or maybe more, I forget, I think it's four available, the ones at the end of the bio. So you could make it work with the registered or fixed buffers. Okay, maybe for v2 we'll look into that.

The only other question I have is where this should work. Right now it only works through raw block O_DIRECT. Should it work through filesystem O_DIRECT? I think it probably shouldn't, but maybe it could; there's no reason it couldn't. It would be fun to get that to work with buffered I/O. It also doesn't stack, so maybe it should, but probably not. DM is going to be fun with that too. DM would be really fun, yeah, but at least this initial proposal doesn't touch it.

So I would assume we need to flag the queue as supporting bit buckets or whatever. Absolutely, yeah; otherwise you don't know if the driver is even going to check for the special page. It's not going to work if it doesn't have that feature.

I don't see a lot of use for this on filesystems, to be honest, unless the block size gets really big. Part of the reason is that the main object here is saving bus bandwidth, right? If you need 64 bytes, don't transfer 512; now you've saved 87.5 percent of the bus bandwidth. But there are also latency concerns, and at least for now, O_DIRECT on a filesystem still kind of sucks compared to raw block devices. Fair enough.

Yep, that is the downside: instead of using PRPs, now you're using scatter lists. But at least on the stuff we tested, the controller is suitably efficient at SGLs, so having three SGL elements versus PRPs has not been a concern. The overhead is actually on the software side.

Right, PCI only. I suppose it could be implemented on fabrics, but... I mean, it makes sense there too. Yeah, it's all about saving your link bandwidth, right?
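The bus-bandwidth arithmetic above can be checked with a one-line helper: if you want only `want` bytes but a full-sector read moves `xfer` bytes, the fraction of link bandwidth saved is (xfer - want) / xfer. The helper name is illustrative.

```c
#include <stddef.h>

/* Fraction of link bandwidth saved by transferring only `want` bytes
 * instead of the full `xfer`-byte unit. For 64 of 512 bytes this is
 * 448/512 = 0.875, the 87.5 percent quoted above. */
static double link_bandwidth_saved(size_t want, size_t xfer)
{
    return (double)(xfer - want) / (double)xfer;
}
```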
So if it matters on PCI, it could matter on TCP too.

Forgive me, because I don't know: is this both reads and writes? The protocol does not support writes, and I'm not aware of one that does. Fred might know a little better if there is something coming down the pipeline for writes. If we could, that would actually be really cool to add as well. Well, I don't think it would gain much, because it would be a read-modify-write for the device anyway. Not necessarily for all of them; there are some, like persistent-memory-backed devices, that don't care and can do sub-sector writes without a problem. There are no active proposals doing write bit buckets. That answers that. So reads only, as far as we know.

I'm sorry the slides didn't work. Some of those graphics were really, really cool, but take my word for it.

So maybe I have a question for Jens, for you too. What about extending the block layer to support more block sizes besides what we do today, from 512 to 4096, which is essentially all that it does?

Wait, before we go down that route: how is this different from the proposal we had for DIF/DIX, which wanted 520-byte sectors and was eventually solved with dual scatter lists? Why couldn't we reuse the same machinery for this?

I think the main thing is that this stuff is much simpler. If you look at the code changes Keith sent out, well, I guess they didn't hit the list, but it's pretty trivial: you just pad the SGLs. Outside of conceptually sharing something, I don't think there's a lot of overlap. DIF/DIX adds a bunch of stuff in a bunch of different places, the bio included. And I don't believe, even at Facebook, that we ever turned it on, because we don't use it.
Okay, but conceptually it's still two scatter-gather lists, one of which essentially adds decorators to the other, which is just what you're doing. Yeah, I'm not disagreeing with that. But there's a lot of code associated with DIF/DIX that this one doesn't really have; this is just a driver setting up that SGL, because it's all in NVMe. Maybe if you had support for other devices beyond NVMe, it would make sense to do it generically somehow, but there doesn't seem to be any. So I don't think it makes sense to turn it into a subsystem feature just for NVMe, not with the amount of code it is. It's literally tens of lines of stuff; it's not massive.

Regarding what David said, I think this approach could be very interesting for zoned storage, if we can make the block size as large as the zone size. If you have a zone that is one sector, that's an interesting zoned-storage device, I guess. No, I was actually asking about supporting different block sizes. Think, for example, of an SMR drive where you can read by LBAs but you have to write per physical sector size, which is not necessarily the same. And with large-sector ECC and that kind of stuff they're trying to work on, it's not a stretch to imagine that we may have to support 64K sector sizes where you can read anywhere within that sector without having to read the entire sector, but you have to write in units of 64K. That would be the same thing, not just at the same scale, so eventually this may get generalized, maybe.

Yeah, I agree, that's totally the same issue, just a different scale but the exact same thing. I would imagine for the large block sizes you'd have hardware support for just reading it, so you wouldn't need to do the padding and all that stuff. It's more of an LBA/PBA type thing, yeah. You would just be able to ask for a subset of that sector.
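The asymmetric granularity discussed above (an SMR or large-ECC drive, say) amounts to one alignment check applied with two different unit sizes: reads may be issued at logical-block granularity while writes must cover whole physical sectors. The 512/64K sizes and the helper name below are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

#define LOGICAL_BLOCK   512u     /* read granularity */
#define PHYSICAL_BLOCK  65536u   /* write granularity, e.g. a 64K ECC unit */

/* True if an I/O of `len` bytes at byte `offset` is aligned to the
 * granularity `gran`. A read passes with gran = LOGICAL_BLOCK; the
 * same I/O as a write must pass with gran = PHYSICAL_BLOCK. */
static bool io_aligned(size_t offset, size_t len, size_t gran)
{
    return len > 0 && offset % gran == 0 && len % gran == 0;
}
```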
But I'm thinking more about going lower than 512 bytes, where the sector_t addressing would be the main issue to solve: the bio offset being a sector_t while the length is actually in bytes. The memory offset is in bytes too. Oh, well, I'm talking about the memory offset; you're talking about the sector_t in the bio, yeah. So maybe unifying things, making sector_t something different or going to bytes, might simplify things going forward? Yeah, we already have a little bit of a mix-up of that stuff when you have things that are not sector-driven I/O, like passthrough commands and whatnot, which is why most of it is in bytes. The size of the bio is in bytes, but the offset is not. So yeah, maybe. Partial sectors, it all makes sense.

No more questions? Okay, thanks, everyone. All right, we have a 30-minute break, well, 45 minutes now. So we'll be back at 3:30 for NVMe passthrough.