Hello everyone. This talk is about a long-standing problem in the block layer storage area: copy offload. It has been around for almost six to eight years, and there have been several attempts at solving it, so this is just an update on what has been going on and where we are right now.

The objective of this talk is to find a block layer interface that allows us to pass a copy offload operation down the stack, whether from file systems, from user space tools through an ioctl interface, or potentially from io_uring. It has several constraints: it must work with the underlying major storage protocols such as SCSI and NVMe, it must fit into block layer semantics, and it must work with stacking drivers. These are the major areas of challenge we have been discussing across the implementations posted by Martin, Mikulas, and Samsung.

A little bit of history of this work: originally Martin implemented a single-request, single-opcode copy operation, but we then learned that it has several issues with bio splitting and so on. Later Mikulas came up with a two-request, two-opcode copy operation, which helps us implement copy offload for DM and deal with the challenges we faced with the single opcode. I think at this point everyone is aware of this history; it has been seven to eight years and it is all over the mailing lists, so I will skip this part.

Right now the preferred implementation is token based, which allows mixing copy operations with read and write workloads and avoids the denial-of-service concerns. This is the direction, as opposed to the single extended-copy command that was the first implementation. It addresses the problems of integrating with DM and simplifies multiplexing a single command into many bios; it also simplifies the multi-bio cases, and that is what Mikulas's approach does.

The way it works is that we submit the "copy in" request op first, and the low-level device driver, NVMe or SCSI, stores the ranges. Upon successful submission of the first request, we submit the "copy out" request op to execute the operation, with a reference to the previously submitted "copy in" request. That gets to the controller, the actual copy command is issued, and the copy happens on the controller.

The current state of the work is the patch series recently posted by Samsung. It has the generic block layer copy offload code with a multi-source and multi-destination interface. It has emulation as a fallback when the device doesn't natively support copy offload. It has support for dm-linear in the cases that don't require any splitting. It also adds new ioctls and in-kernel users such as zonefs, copy_file_range, and kcopyd.

With that background, we want to understand exactly which pieces are missing to move this work forward. What block layer plumbing is missing or should be added? Are we OK with the token-based approach, where two opcodes are used to execute the copy offload command? Are there outstanding issues with the DM implementation? We have so far selected a minimal subset of DM support — is that OK, or do we need to extend it? Do we need buy-in from Jens? And I don't know whether you saw Dave Chinner's email this morning explaining his view on how to test this. That was one of the things I commented on the series: there is nothing to test any of this right now, so we need something.
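To make the two-phase flow described above concrete, here is a minimal sketch in kernel-style C of what a submitter could look like. The opcodes REQ_OP_COPY_IN/REQ_OP_COPY_OUT and the structs copy_range and copy_token are hypothetical names for illustration only, not taken from any posted series; only the bio helpers (bio_alloc, bio_add_page, submit_bio_wait) are existing block layer APIs.

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/slab.h>

    /* Hypothetical payload: one source range plus a handle the LLD can cache. */
    struct copy_range {
        sector_t sector;
        sector_t nr_sects;
    };

    struct copy_token {
        struct block_device *bdev;
        struct copy_range range;
    };

    static int issue_token_copy(struct block_device *bdev,
                                struct copy_range *src, sector_t dst_sector)
    {
        struct copy_token *tok;
        struct bio *bio;
        int ret;

        tok = kzalloc(sizeof(*tok), GFP_KERNEL);
        if (!tok)
            return -ENOMEM;
        tok->bdev = bdev;
        tok->range = *src;

        /* Phase 1: hand the source range to the low-level driver, which
         * stores it (per namespace/LUN for NVMe/SCSI). */
        bio = bio_alloc(bdev, 1, REQ_OP_COPY_IN, GFP_KERNEL);
        bio_add_page(bio, virt_to_page(tok), sizeof(*tok), offset_in_page(tok));
        ret = submit_bio_wait(bio);
        bio_put(bio);

        /* Phase 2: on success, execute the copy with a reference to the
         * stored ranges; only now does a copy command reach the controller. */
        if (!ret) {
            bio = bio_alloc(bdev, 1, REQ_OP_COPY_OUT, GFP_KERNEL);
            bio->bi_iter.bi_sector = dst_sector;
            bio_add_page(bio, virt_to_page(tok), sizeof(*tok),
                         offset_in_page(tok));
            ret = submit_bio_wait(bio);
            bio_put(bio);
        }

        kfree(tok);
        return ret;
    }

The property that matters for the DM discussion below is that the ranges ride in the bio's data payload rather than in bi_iter.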
I suggested null_blk, but Dave Chinner actually has a more advanced and useful idea, which is to use loopback, so that you can rely on the underlying file system doing a copy_file_range to implement the copy offload. We need something to test, first of all, otherwise this just can't move.

Yes, I completely agree with you. We need something to test that is an in-kernel driver and not on the QEMU side — for example null_blk or any other driver that we can use.

Read Dave Chinner's email. He is suggesting, again, using loopback so that you can rely on something that is already well tested, namely the copy_file_range of the file system that implements the storage for the loopback device. With that you essentially have an almost-ready, well-tested copy offload driver that we can rely on, and with that we can better test the plumbing and the API of everything on top.

I'm not sure that's the best approach. I still think that implementing copy offload in null_blk will make it easier to write test scripts and to integrate those test scripts into the existing blktests.

Of course, and we definitely need to go there, but Dave Chinner's point was that if we do that now and test everything with it, we essentially also end up testing the null_blk implementation itself. Having something that is exactly like copy offload and is already well tested is better as a first approach for testing the plumbing in the block layer, DM, and so on — but I totally agree with you, we need null_blk too.

Yeah, so I think we need a loop copy offload implementation and a null_blk copy offload implementation, with memory backing supported for null_blk, so that we can also add blktests around that. OK.

I'm more interested in understanding the DM side.

But again, once we have the loopback plumbing in place you can actually better test, and so evaluate, how well the plumbing in the block layer works.

From the patches I had one concern about DM: I didn't see anywhere that the copy ranges were being remapped to match what dm-linear, or whatever the DM target is doing, would require. That's something I didn't see — though I didn't really look in detail, so I may have missed it.

Mike is on Zoom and he has some comments — can we get him on?

Hey guys, can you hear me?

Yes, we can.

OK. So I've looked at the initial — well, there have been a couple of series of patches, but the one that effectively imposes the constraint that splitting is not allowed, and so forth, actually doesn't go far enough, because there are targets that make use of the dm_accept_partial_bio interface, which allows the target to impose a split even if DM core didn't do a split up front. So there are some DM quirks there that need to be worked through in terms of that initial constraint on splitting. I'd like to better understand whether that is just a pragmatic initial implementation where you'd like to sidestep dealing with it — it's an annoyance to have to deal with it in a serious way — because honestly it just feels like a toy implementation at this point as it relates to device mapper. You make mention of dm-kcopyd, but it's unclear to me where you'd like to see this go; if you'd like to see a full, robust implementation that has support for all the various DM targets, then what has been proposed is obviously fairly lacking, in my view.
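To make the remapping concern raised above concrete, here is a rough sketch of the extra step dm-linear would need for a copy bio, assuming (purely as an illustration) that the ranges travel in the bio payload and are reachable through a hypothetical bio_copy_payload() helper. struct linear_c, dm_target_offset() and DM_MAPIO_REMAPPED are the real dm-linear/DM-core pieces; the payload handling is an assumption about the series.

    /* Sketch only: blk_copy_range and bio_copy_payload() are placeholders for
     * however the posted series actually encodes ranges in the payload. */
    static int linear_map_copy(struct dm_target *ti, struct bio *bio)
    {
        struct linear_c *lc = ti->private;
        struct blk_copy_range *r = bio_copy_payload(bio);

        /* Whatever ranges ride in the payload need the same start-offset
         * translation that linear_map() applies to bi_sector; this is the
         * piece that did not appear to be present in the patches. */
        r->sector = lc->start + dm_target_offset(ti, r->sector);

        bio_set_dev(bio, lc->dev->bdev);
        bio->bi_iter.bi_sector =
            lc->start + dm_target_offset(ti, bio->bi_iter.bi_sector);

        return DM_MAPIO_REMAPPED;
    }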
There's also some work needed there — take dm-crypt, because of the per-sector key thing; I don't see how that can work with dm-crypt.

The per-sector key thing — I guess I'm missing where that relates to the copy offload.

You're moving data from one place to another without the data going through the host, so you're not going to be able to decrypt and re-encrypt using the new location. The seed for the key in dm-crypt is the sector, right? So you're going to copy encrypted data to a different location and you're not going to be able to decrypt it later; that's what you end up with on dm-crypt if you do copy offload, unless I'm totally missing something about how dm-crypt works. That was essentially the issue we had with zone append, because we always specify the same sector for writing, and we had to go through emulation for dm-crypt, otherwise it doesn't work. So, per target, we will use that flag that says whether this one can do copy offload or not, and each target can decide whether it makes sense to let those copy commands go through. And I suspect dm-crypt is not going to be one that will work with copy offload.

Yeah, you're probably right. I'd have to dig in and look closer at it. dm-crypt aside, though — and I don't want to dwell on this point — maybe the authors of the changes could speak to their vision of where this will go, because it seems to me that DM hasn't been fully accounted for, or stacking in general, for arbitrary remapping of I/O as it goes through the I/O stack. Where is it that you see the implementation going — do you see issues, or do you need me to point them out in a more painful way?

What I mentioned earlier, before you joined, is that dm-linear has been enabled in the latest series, but I didn't see where they do the remapping of the copy ranges, so that's one point that probably needs to be addressed. And splitting — yes, I think it's possible to actually split those commands, but if it's done using two commands, a read and a write, that may be very tricky; I'm not sure, I haven't thought about it.

If I could make a comment here: the current DM plumbing actually looks very similar to what was tried when this was attempted for SCSI XCOPY. dm-linear has the support at this point, but DM core has also been modified, and it is probably possible to enable the same flag in targets other than dm-linear and use what is being done in DM core. The limitation currently is that whenever we have to split, it is going to fail — it is going to say that the copy is not supported — and that is going to happen more with other targets and less with dm-linear. So this is how it currently works, and one way to see it is that either we support it or we don't; we cannot say in the middle of the operation that we cannot do it. Once we wire up the changes and expose the copy sectors through the queue/bdev limits, we cannot just fail with "the bio got split, so we cannot do it". So probably one way to ensure that is to have a fallback — we have to have the fallback to emulation.
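A minimal sketch of the "offload where possible, emulate otherwise" policy just described. All three helpers named here (bdev_max_copy_sectors(), blkdev_copy_offload(), blkdev_copy_emulate()) are assumptions standing in for whatever the series actually provides, not confirmed interfaces:

    static int blkdev_copy(struct block_device *bdev, sector_t src,
                           sector_t dst, sector_t nr_sects, gfp_t gfp)
    {
        int ret = -EOPNOTSUPP;

        /* Only try the native offload if the queue advertises a copy limit. */
        if (bdev_max_copy_sectors(bdev))
            ret = blkdev_copy_offload(bdev, src, dst, nr_sects, gfp);

        /* If the device, or a stacking layer that would have had to split,
         * rejected the offload, complete the copy through emulation instead
         * of failing the operation part-way through. */
        if (ret == -EOPNOTSUPP)
            ret = blkdev_copy_emulate(bdev, src, dst, nr_sects, gfp);

        return ret;
    }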
So maybe for a target — take dm-linear: if it is created on a plain device, without any fancy configuration, it is going to work fine; but if it is based on two physical devices and a dm-linear is created on top of them, then for a particular I/O the copy offload is going to fail, it takes the emulation path, and the copy is completed that way. That's the current approach. Mike, do you have any comments on this approach?

I would like it to not punt to the fallback trivially. I need to review the patches closer, obviously — I'll commit to doing that; unfortunately I didn't give an exhaustive review prior to this discussion, so I have more work to do here and I'll do it. But I think I need to better understand the inability to cope with splitting. To me it implies a certain amount of rigidity in the setup and preparation of the copy, a priori, that maybe I just don't fully appreciate at this point; it seems as though we should be able to be a bit more nimble and reactive to the remapping that a particular layer would be looking to impose. But I don't want to hold you guys back with the discussion here; I have more learning and review to do.

So the splitting problem — I think that was described by Martin in his slides, the "here be dragons" part. He says it's an M-by-N mapping, right? If we can do something about that part it would be great, but the current choice is that if that happens, the copy offload basically turns into copy emulation.

Regarding splitting and copy offload: the last time I checked the copy offload patches, I noticed that the offset and length are not encoded in the traditional fields of the bio, but rather in the data payload of the bio. I think that's why the standard bio_split doesn't work for these copy offload requests. I've been wondering whether bio_split could be modified to inspect the data payload of the bio and do the proper job of splitting these copy offload bios (see the sketch after this exchange for what that would involve). One more question I have: what should be done if the device has a scheduler — should we allow copy offload, should we not configure copy offload, or how should we deal with that scenario?

Can I get your question again — the device has what?

If the block device has an I/O scheduler, should we configure copy offload on it, or how should we deal with that scenario?

I probably can't picture the problem, but I'm thinking the scheduler would still be seeing the read commands and write commands, and the copy flag would probably be ignored by it. We can choose to do nothing — currently we choose to do nothing about that — and whatever the scheduler does for reads and writes, we can be fine with that policy. But I probably haven't really imagined the problem you have in your head; could you elaborate?

Yeah — you can change the scheduler dynamically, between none and whatever else, so you don't want to also have to flip the flag that enables copy offload just because you changed the scheduler. You have to deal with it, same as DM splitting and whatnot: whatever sits between the user and the device has to be dealt with. And if that's a problem, there is always a solution: bypass the scheduler — when you issue the commands you can bypass it, yes, with a direct insert.
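Returning to the bio_split() question above: because the offset and length live in the bio's data payload rather than in bi_iter, a split helper would have to rewrite that payload instead of just advancing the iterator. A toy sketch of that, where struct blk_copy_payload and bio_copy_payload() are hypothetical placeholders for the series' actual encoding:

    struct blk_copy_payload {
        sector_t src_sector;
        sector_t dst_sector;
        sector_t nr_sects;
    };

    static void blk_copy_split_payload(struct bio *orig, struct bio *split,
                                       sector_t split_sectors)
    {
        struct blk_copy_payload *p  = bio_copy_payload(orig);
        struct blk_copy_payload *sp = bio_copy_payload(split);

        /* The front part covers the first split_sectors of the range... */
        *sp = *p;
        sp->nr_sects = split_sectors;

        /* ...and the remainder is advanced past them by hand, where for a
         * normal read/write bio_advance() on bi_iter would have sufficed. */
        p->src_sector += split_sectors;
        p->dst_sector += split_sectors;
        p->nr_sects   -= split_sectors;
    }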
If I may add a point for discussion here: I think what Damien mentioned before, with respect to copy_file_range and the comments that came from Dave — if I got that right, he is at least talking about a reflink-based copy, and a reflink-based copy probably wouldn't require any driver-level implementation, right?

The storage backend of a loopback device is a file system, right? It's a file. So take any file system that has copy_file_range, and that is, in a sense, your device-level implementation of the copy offload. You just reuse it: you need to code the copy bio support in the loopback driver and hook that up to copy_file_range, and you're done — you have your device implementation to test everything on top. That was Dave's point.

Yeah, it sounded good to me; I just wanted to confirm that it would not really translate into a device-driver-level implementation as such.

Well, loopback is a block device, right? It's just one implementation, to be able to test whatever plumbing you implement on top. Again: you're sending patches, but we have no way to test anything — we need something.

So the point here is that we need loopback support, we need memory-backed null_blk support, and we need to see blktests for it — the same point as before, just to be clear. While this would be a good way to see what is happening in the block layer, as far as the NVMe plumbing is concerned, if I'm not wrong, that path wouldn't be tested by this, right? Or have I got it wrong?

No — sending the NVMe plumbing is fine, but who has a drive that has copy offload support today? Maybe you do; I don't. So how do I test that? I need loopback or null_blk or whatever. And remember, blktests is integrated into most of the distros and they run blktests on pretty much every release.

I agree — device-independent testing is extremely important; that's why we have so many test cases and so many reports coming out of blktests.

So in order to keep doing that, before this sort of bigger change goes into the block layer, we need loop and null_blk blktests.

I'd like to remind you that the in-kernel NVMe target loopback setup would probably get you the full, holistic stack testing that you want, if you use that, and you can implement it with copy_file_range — use the file-backed NVMe target loop, that's the one to use.

Yeah, that's similar to a loopback device, because you could actually call copy_file_range in the target driver.

Exactly — and we've done this sort of thing and it works fine. That would give you the full stack, with a virtual NVMe target backed by a file.

But you still need a backend block device to configure the target which supports the copy, right? No? Or will it just emulate on the target?

No, no — that's the point, you don't. The backend implementation is copy_file_range on the target. The only thing you need to do is hook up the command parsing for the copy command in the target and then call copy_file_range, and that's it — it's dead simple to implement.
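A rough sketch of the file-backed NVMe target loop idea just discussed: a handler for a copy command on a file-backed namespace can service the whole thing with a single vfs_copy_file_range() call on the backing file. The handler name and the command decoding are assumptions; ns->file, blksize_shift and nvmet_req_complete() are the existing nvmet file-backend pieces such a handler would hook into.

    /* Sketch only: command decoding is elided; the point is that the data
     * path is one vfs_copy_file_range() call, with no hardware involved. */
    static void nvmet_file_execute_copy(struct nvmet_req *req)
    {
        struct nvmet_ns *ns = req->ns;
        u64 slba = 0, dlba = 0, nlb = 0;    /* decoded from the copy command */
        loff_t src, dst;
        size_t len;
        ssize_t ret;

        src = (loff_t)slba << ns->blksize_shift;
        dst = (loff_t)dlba << ns->blksize_shift;
        len = (size_t)nlb << ns->blksize_shift;

        /* The backing file is both source and destination: pure software
         * emulation, no host data movement or real device required. */
        ret = vfs_copy_file_range(ns->file, src, ns->file, dst, len, 0);

        nvmet_req_complete(req, ret == len ? NVME_SC_SUCCESS : NVME_SC_INTERNAL);
    }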
Yeah, but if the underlying target backend doesn't support copy offload, then it will just fall back to the emulation.

But it does support it — that's what I'm saying.

So you need a device...

No, you don't. You're in the target: if you set up a Linux target, implement the proper command parsing for the copy command, and hook the command up to just call copy_file_range...

Yeah, for a file-backed target, right?

Yes, for a file-backed target, exactly. So there is no hardware dependency on that one; it's pure software emulation.

OK. And I think one of the patches is probably already doing that: if it is a file-backed target it falls back to copy_file_range, and if it happens to be block-based then it uses the block layer copy offload helper, in the current series.

We'll see — I'd like to see the patches, and then we can continue. Any more questions or concerns?

I just looked at the patch set as it relates to DM, and the implementation seems confined to kcopyd. I'm not seeing anything else in DM core other than the initial setup and the mechanics of setting the supported flag in the linear target and that kind of thing. So Mikulas's approach may have been factored in in spirit; I need to look closer at the block code and the other changes to fully appreciate the design of what is being done here. But again, on entry into DM core, just through submit_bio, that bio — whatever target it is destined for — could need to be split or constrained. Think of thin provisioning: it imposes a particular chunk size, the thin-pool block size, and it seems to me that you should be able to cope with reducing the payload, as implied by a split. I completely understand that an arbitrary remapping to different queues would negate the whole point of copy offload, and there is the constraint of having the source and destination queue be the same and that kind of thing, so I'm not looking to make this artificially complex. It's just that by only implementing kcopyd, and only supporting it if absolutely no splitting is done, it is very much constrained as it relates to DM. I'd like to see that improved, but we can take that offline and deal with it on the list.

And we're out of time — yeah, sorry. So we can talk about blktests and fstests later. Thanks, thanks.