Hello everyone, I'm Anitesh from Samsung, and this is Chaitanya from NVIDIA. We are about to present copy offload, a long-standing patch series which I think many of the folks here have worked on earlier. The idea is that you issue a copy-offload command to the device and it performs the copy inside the device, sometimes across namespaces as well. The advantage we see is mainly a reduction in CPU and PCIe bandwidth, and in the fabrics case, network bandwidth.

The first effort was by Martin in 2014. It was an ioctl-based approach where you build a payload and submit it, but that had issues, especially with stacked devices, where it was not scalable. Nicholas then came up with the two-bios approach, somewhat similar to what we have at present, which is compatible with the DM layer. After that there was not much traction in the community and somehow it didn't make it in.

We started again with Simple Copy once the spec was ratified. We initially pushed a patch with a similar ioctl approach, and in a 2021 conference call we agreed that we need to support DM again and that the two-bios design is a must. At the previous LSF the main complaint was the lack of test infrastructure: you might have the whole patch series, but there was no way to test it. We addressed that as well, and the series is now at around v10 and stuck there. So in this session I mainly want to know what is blocking it and what I can do to get it reviewed; whether the present state is fine or whether I need to do anything extra.

The present state: the user interface is the existing copy_file_range. For direct I/O we go down the copy-offload path, and if the data is cached
again, we fall back to the generic copy_file_range path, so there is not much change; from the user's perspective, copy_file_range is the one interface.

In the block layer there are two bios. First we issue a read bio carrying the source information (the sectors and length) along with a token; this reaches the driver layer, where we just fake a completion and come back. Then from the block layer we issue a write bio, and once it reaches the driver, after trickling down through DM and so on, we actually form a copy command and send it. That's the design. We also have emulation for cases where offload is not present, and this is especially beneficial in the fabrics case: even if the backing device doesn't support copy offload, the host can still send a copy command and the copy is completed through emulation. We saw around a 30-40x improvement in a desktop environment, and even in a server environment it was 30-40x better. At present, architecture-wise, the block layer is capable of supporting XCOPY and ODX, and if NVMe copy across namespaces comes in, that should also be addressable with what we have.

As for testing, the previous concern was about how we can test this. QEMU support was there earlier as well, and this time we have null_blk, and fabrics loopback is there. For loop we are working on it: we have a POC-level model, but we need to test it more, so I might add that going forward. From the user-space perspective, for fio, Vincent is maintaining his own repo where he keeps updating the changes corresponding to the kernel, and I also have a couple of blktests for block and NVMe / NVMe over fabrics; if this series goes in I will add loop and null_blk blktests as well. So that was the present status; next, the upstream plan.
To have default block-device operations, offload and emulation from the block layer, and basic dm-linear support: that's what we want. The dm-linear part doesn't do much; whenever there is a split we just bail out, so it is basic infrastructure for DM. Support-wise we have a queue flag, and we support only dm-linear at present because we felt it's the simpler use case; the plan going forward is to start expanding to other DM targets. What we are hoping for is at least some Reviewed-bys, or to hear whether this is going in the right or wrong direction; at present there isn't much clarity on whether anything more is required.

Sorry, I didn't look at your last drop of this, so I'm surprised: why dm-linear? If you have the bdev copy_file_range, why do you need anything special in dm-linear? Sorry, could you repeat that? I don't understand why you're trying to push dm-linear; what's special about dm-linear? Nothing, nothing. It's just to show that people can use copy offload there; dm-linear is one of the targets we have tested. It's more that we don't want to enable copy offload for everything and have it break. For other targets the copy still happens, but through emulation: even if the underlying device supports copy offload, we are not exposing it there. We don't want to break existing setups, so as we test more and feel more confident it works fine, we can start enabling more targets.

I took another look this morning, as I promised, and you've basically done everything that I asked for, so I can't just go out and say no, it's never going to go in; I think it's fine. I have, not objections, but two questions. Question number one: halfway through copy offload not going in, the main use case was being able to do garbage collection on stuff like zoned devices or that kind of
thing, i.e. a multiple-source, single-destination copy. Where are we at? I haven't heard any desire for that capability in a long time, so is this still a target? Oh yeah, we want it: Btrfs, dm-zoned, there are plenty of places we're going to use it. I think it's useful, but I strongly believe it should not be in the first round. Okay, so we can start there. We have debated this whole thing for quite a long time, for years literally, always going back and forth: do we have an implementation? Not really. Do we have a use case? Not really. Do we have hardware? Not really. So we were always going back and forth; we probably would need to get one or the other, and we have always been waiting for each other to do something. Well, the problem is it has also been a moving target, right? When we started on copy offload it was about provisioning VMs from a golden image. Yeah, sure. And then it changed to "oh my god, the whole world is going zoned, we need this for garbage collection", which is a very different use case from the VM provisioning thing. So the target moved to a different model, and now we're sort of gravitating towards the original approach again, which is why this series works, and I think it looks fine.

So, my second point, again not an objection: in SCSI we ran into the issue that we had no way to establish ahead of time whether a given device A and device B can talk to each other. They may both report that they support copy offload, but we don't know if they support copy offload between each other. We spent work in SCSI to try and formalize that, and the same thing is happening now in NVMe. In the current code your checks are "does this device support copy offload" and "does that device support copy offload", but we should make that a little bit more sophisticated. In the NVMe case, initially, we can ask "is it the same block device": that would be a fairly good heuristic
for whether we should go down the copy-offload path. And for the SCSI stuff, I'll wire it up for both the token-based and the extended copy; I'll rebase my patches on top of your series, and I don't have any objections at this point. Yeah, I agree: I think we definitely should allow copy only on the same device for now, because the checks are obvious and simple, and they can be revisited later if we want to enable copies between multiple devices. I mean, we don't want to go down the copy-offload path if we think it's going to fail, because it's not free, right? So the more sanity checking we can do up front, "is this even going to work", the better. I think we need some more flags, or maybe some metadata. I mean, you have it, I forget what you called it, blk_queue_copy or whatever. Yeah, queue flags, plus a check whether the two bdevs are the same; that would be a good start. For SCSI we'd need it to be slightly more sophisticated than that; I can wire that up, it's not a big deal. I'm just saying that the fact that a device has a copy-offload capability doesn't mean that this is going to work, so we need to establish a better heuristic for when we should attempt to use the copy offload. That might be an optimization, but I see one issue, especially with DM devices: maybe down the stack it might be a single device, even if it doesn't look like it from the top. No, I think for the initial rev you should not even consider that case. Yeah, sure, you may have two dm-linear block devices on the same physical device and a copy between them would work, but don't even consider it; just "is the bdev pointer the same", or something simple like that, for now. For SCSI we can actually go in and check the SCSI name, and what I did in my second-generation patches was validate that a cookie, essentially, was the same between the two bdevs. So I had that copy identifier in the bdev, and I would only attempt to do
the copy offload if that cookie was the same; and in practice that cookie was reported by the SCSI device. For NVMe, for now, the bdev being the same is a good start; once we go to multi-namespace or whatever, we'll refine it. We'll have the reachability architecture in NVMe that expresses this, again, ahead of time: can you copy between namespace A and namespace B, are they even able to talk to each other? Would you need a user-space interface, so user space can determine whether it's going to work and not waste time in case it's not? If the copy cannot be offloaded, the code has the emulation: the block layer is going to do the bio read-and-write copy emulation, so the user should not be concerned with checking whether it's going to work. As far as the user is concerned it should always work, unless there's an error, of course; only whether the copy is offloaded by hardware or not depends on the setup.

So, I noticed "NVMe fabrics" on the slide. I have no idea if there are products, and I have no idea if anyone in this room actually cares, but at least in theory you could have two completely separate devices that could talk to one another over the fabric, and it would work. Now, how we would actually make that determination I don't even want to think about, but I'm curious: is that in fact something people care about today? I hope the answer is no. So again, both SCSI and, at least soon, NVMe have the ability to express this kind of relationship between devices. Even if, in theory, the protocol only describes this somewhere in the back end: it doesn't have to define how they communicate, just that it is possible for these two seemingly distinct devices to communicate with each other out of band. Whether or not it ever gets implemented is another story. Right, and that's the question
I was asking: is that something we need to worry about? No. The reason fabrics is there is not so that you communicate between targets over the fabric; it's that you can offload to a fabric target and do the copying without going through the network. So it's not two targets; a single target offloads the copy. That's the reason SCSI has it defined in the protocol that any two devices anywhere in the world have the ability to describe each other and perform copy operations; I'm not aware of anyone that has implemented it. NVMe did not put that into its protocol yet: it is simply namespace to namespace within the same subsystem. There's no way in the protocol to express a relationship between two different namespaces in two different subsystems. Somebody has expressed an interest in coming forward with that in the future; I don't worry too much about those kinds of statements. If it shows up, then we'll worry about it; it's not there now. Anyway, for all the cases we would realistically want to care about, we have a way of establishing this ahead of time. So what fabrics means here is a fabric device that has multiple namespaces: you copy between the namespaces within that subsystem out on the fabric, and the subsystem does the copy for you. The data never comes into the host and never goes back out to the destination; it just saves you a lot of transfer. That doesn't mean it goes faster than a host copy. Oh, but at least our results showed it was faster than the emulation, for whatever test cases we had. There were at least two implementations in storage arrays, and it never met the expectations of the people who used it. No, I take the point, but the emulation layer is more like you do read and write, so maybe copy offload might be off, but at that time you can just... this is a good conversation for a beer, not right now. Yeah: some implementations do read-writes,
and sometimes the read-writes in the storage array can be slower than the read-writes on the host. But those implementations that are able to do snapshots or clones or something like that can do very fast copies in those specific instances. So copy offload has this property that there's no way to really quantify what the performance is; it has its own version of write amplification with a vengeance, and it's one command, one IOP, with a cost that really can't be measured.

Yeah, so the blocker still remains, I guess: we agree, but we see a lack of Reviewed-by comments. I'll commit to officially adding Reviewed-bys, and I commit to wiring up the SCSI pieces for both token-based and extended copy. Sure, that would be helpful. And Christian is not here? Okay, so I just want to check whether the copy_file_range plumbing question is there. Hi, have you had any look at the copy_file_range plumbing for copy offload? Sorry, did you get any time to look at the latest series? I worked on your comments, especially the negative values coming back from copy_file_range; that I fixed. I haven't taken a look recently; my last status is from when I pointed out those issues. Yeah, I need to take a look. Sure. Apart from that, do you see any issue with the current plumbing? Give me time to review it again. Sure. I think that's it: mainly we wanted to get an initial cut in, and on top of that we have plans for subsequent additions, maybe garbage collection, multi-range, and so on. So I hope I'll get Reviewed-bys. Sure, thank you.