Hey, Ming. Hey, how are you? Yeah, I can share something. I posted the io_uring-based userspace block driver, yeah. It's actually very simple. It's based on the io_uring command patches, but those patches aren't merged yet. Once the io_uring command passthrough support is merged, we can post the driver patches to the mailing list. But the code has already been shared.

The driver is really doing just two simple things. The first is the communication between userspace and the driver. It's a blk-mq based driver: it forwards each request to userspace via an io_uring command, because that's very efficient. So far no polling is used; the requests are just sent over io_uring. And one thing I found is that io_uring batching makes a big difference on performance.

The other thing the driver does is the data copy. As we discussed on the mailing list, there's no zero copy so far; the memory management doesn't support zero copy. I saw the guys from Alibaba posted some patches for that, but they still don't work; I tried to use those patches, but they didn't work. I think those are basically the two main things done by the driver. The driver knows nothing about the actual block device logic; all the logic is done in userspace. In the posted patches, I just implemented a loop target, like the kernel loop driver. From a simple test, this userspace driver's performance isn't better than the kernel loop driver with direct IO enabled.

So, Ming, I've been benchmarking your code in comparison with NBD, and I'm seeing quite impressive speedups over NBD. In particular, I'm also using the server that you provided, which, as I understand it, is single-threaded, and it seems to be quite an improvement over multi-threaded NBD simply by dropping all the overhead from the network. So I wanted to ask you, maybe I'm jumping the gun, but regarding multi-queue support: obviously multi-queue on the block layer side is implemented, but do you see supporting multiple queues in the driver as a requirement for getting it upstream? And do you have any ideas on the design to implement that?

I think it shouldn't be hard to support multi-queue in the driver. So far we use one dedicated daemon thread for handling the requests from the one queue. If multi-queue is supported, we can create one thread per queue and use that thread for handling the requests from its queue. That's the basic idea. But in userspace the requests are handled by io_uring anyway, and I think most of the time one thread with one io_uring instance, or two or three, is enough; we don't need many queues. Because it's async IO over io_uring, one queue can usually saturate the backing device. But it's easy to do, it shouldn't be hard if you need it. I think it will be needed for drivers that take more CPU in userspace.

I can definitely see the benefits of multi-queue, but from the tests that we've seen, I want to separate out the basics of the driver, multi-queue, and zero copy.
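(For reference, a minimal sketch of the one-thread-per-queue model described above: each daemon owns a private io_uring instance and batches all of its fetch commands into a single submit. UBD_CMD_FETCH_REQ, struct ubd_queue, and ubd_handle_io() are hypothetical placeholders, not the real driver ABI, and IORING_OP_URING_CMD assumes a kernel with io_uring command passthrough support.)

```c
/* Per-queue daemon sketch (hypothetical names, not the real ABI). */
#include <liburing.h>
#include <string.h>

#define QUEUE_DEPTH        128   /* matches the per-queue request limit */
#define UBD_CMD_FETCH_REQ  0x01  /* placeholder opcode, not a real UAPI value */

struct ubd_queue {
	int             char_fd;  /* per-device char node fd (assumed) */
	int             q_id;
	struct io_uring ring;     /* private ring per queue: nothing shared, no locks */
};

/* Placeholder for the real per-request work (data copy, backing-file IO). */
static void ubd_handle_io(struct ubd_queue *q, __u64 tag)
{
	(void)q;
	(void)tag;
}

/* One of these runs per hardware queue, e.g. spawned via pthread_create(). */
static void *ubd_queue_daemon(void *arg)
{
	struct ubd_queue *q = arg;
	struct io_uring_cqe *cqe;

	io_uring_queue_init(QUEUE_DEPTH, &q->ring, 0);

	/*
	 * Queue one fetch command per request slot, then push them all to the
	 * kernel with a single submit: this batching is a large part of the
	 * win over per-request socket round trips as with NBD.
	 */
	for (__u64 tag = 0; tag < QUEUE_DEPTH; tag++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&q->ring);

		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode    = IORING_OP_URING_CMD; /* needs uring_cmd support */
		sqe->fd        = q->char_fd;
		sqe->cmd_op    = UBD_CMD_FETCH_REQ;
		sqe->user_data = tag;
	}
	io_uring_submit(&q->ring);

	for (;;) {
		io_uring_wait_cqe(&q->ring, &cqe);
		ubd_handle_io(q, cqe->user_data);  /* request 'tag' arrived */
		io_uring_cqe_seen(&q->ring, cqe);
	}
	return NULL;
}
```

Because each queue's ring, buffers, and tags are private to its thread, scaling to multiple queues would not add any cross-thread synchronization.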
Because I think even without multi-queue or zero copy, there are still a lot of CPU savings, and I think there were also some latency savings over something like NBD or other network loopback device drivers. So I'd really be interested in seeing if we can get a base version in, and then there are obvious enhancements that we can work on in the future or in parallel. We've been using something similar, but based on a legacy userspace block device, and we've seen the same savings over something like NBD or other loopback devices involving the network stack, because there's just a lot of CPU saved doing the processing. So that's my take.

Yeah, I think NBD takes the extra CPU in the socket communication, and io_uring is much more efficient than the socket. Actually, I'm also looking at the QEMU copy-on-write (qcow2) image. I tried to support that in the userspace driver; I'm still working on that, yeah. Once that work is done, we can compare the userspace driver with qemu-nbd, because those two drivers do the same things. I also tried to test qemu-nbd with its io_uring option, and the performance still isn't good. Maybe qemu-nbd can be improved; it has supported an io_uring option already.

Are they doing the actual IO transport over something like io_uring, or is that an interface to just talk to the NBD device itself? No, I mean, I think they use io_uring to submit IOs to the host image. Oh, I see. Yeah, there is an option for that on qemu-nbd.

But I think once the io_uring command patches are merged and the patches are ready to post, we should focus on the interface between the kernel and userspace, because this part is the ABI. We need to solidify this part so we can support lots of drivers in userspace. Also, I think there are lots of things to do in userspace.

So regarding the interface with userspace, I guess the big part missing there, beyond an implementation for zero copy, would be the multi-queue support. Is there anything else on the interface that still needs to be discussed, from what you're proposing in the patches right now?

I think the interface isn't related to the zero copy or the data copy, because both are done inside the driver; the copy is actually invisible to userspace. I mean the communication part between the driver and userspace, because there are the definitions of the data structures and the commands. This part is hard: once it's merged, it can't be changed, it can only be extended. Maybe we need to provide some version, the driver's version or the userspace server's version, this kind of thing. It's not related to zero copy or data copy.

When a request is delivered to userspace, userspace can just use the buffer: for a write, the data is already in userspace memory; for a read, the driver needs to copy the data back to the request memory, and each buffer is associated with one request. Because the requests are per-queue in the driver and their number is limited to the queue depth, and so far the driver just supports 128 requests, we don't need any lock inside the driver or inside userspace. Once a request comes, it can be handled without any synchronization, yeah.
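(A hypothetical sketch of the kind of kernel/userspace ABI concern raised above: a fixed-size, versioned, extend-only set of definitions with one lockless slot per tag. None of these names or layouts come from the actual patches; they only illustrate the versioning, the per-tag buffer, and the 128-request queue-depth limit mentioned in the discussion.)

```c
/* Hypothetical ABI sketch: none of these names come from the actual patches. */
#include <linux/types.h>

#define UBD_ABI_VERSION      1
#define UBD_MAX_QUEUE_DEPTH  128   /* the current per-queue request limit */

/* Per-device parameters negotiated at setup time; extend-only once merged. */
struct ubd_dev_info {
	__u16 abi_version;   /* bumped when fields are appended, never reshuffled */
	__u16 nr_queues;     /* 1 today; >1 once multi-queue is supported */
	__u16 queue_depth;   /* <= UBD_MAX_QUEUE_DEPTH */
	__u16 flags;         /* room for feature bits, e.g. a future zero-copy flag */
	__u64 dev_size;      /* device capacity in bytes */
};

/*
 * One descriptor per in-flight request.  The tag indexes both the
 * descriptor and its dedicated data buffer, so neither side needs a lock:
 * for a write, the driver fills the buffer before completing the fetch;
 * for a read, the driver copies data back out of it at commit time.
 */
struct ubd_io_desc {
	__u64 start_sector;
	__u64 buf_addr;      /* userspace address of this tag's buffer */
	__u32 tag;           /* 0 .. queue_depth - 1 */
	__u32 nr_sectors;
	__u8  op;            /* read / write / flush / discard */
	__u8  pad[7];
};
```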
You also mentioned batch submission and completion. Are you suggesting that we could have a single command completing multiple requests, or how would that look? So you mean the batching? Yeah, batch submission and completion. I saw your latest patch that moves the data copy to the task context, but are you planning anything else for batch submission and completion that we could assist with? Sorry, I'm not sure I understand your point. I'm saying that right now we need two io_uring commands per request, right? We need a fetch command, which can be done to get the data inline in the task context, and then you need a second io_uring command from the UBD server to complete the request. I'm wondering if we would be able to complete multiple requests in a single completion event, or if that is not necessary since the cost of executing io_uring commands is much smaller than executing syscalls.

Of course, it can be done in this way, because actually, if you look at the latest tree, I have used the task work to do that. With this, the driver can add lots of requests, and once the userspace UBD server wakes up, it can get lots of requests, and then these requests can be handled by io_uring from userspace. Once they are all completed, and I think most of the time they will be completed in a batch, the results will be handed back by another command (I forgot the name) that is sent to the kernel as a batch. If you trace it, you will find that most of the time they are handled in a batch. We could add some statistics inside userspace to observe this batching; yeah, I think we can add that debug statistic in the future.

I saw your latest patch moving the data copy to the task context, without the get-data command, but I haven't been able to benchmark that yet, and I want to take a look at the new patches. I saw you pushed the branch this week; I want to take a look at that and collect more data on it. Is there anything else that we should discuss? I think we're running out of time. I would like to continue this discussion with you; I don't know if maybe we can schedule more time.

I think I can answer that: you mentioned that the write-side copy moved to the task context. The reason is that it is very efficient to pin pages in the current task's context. You can see that change in the driver; I looked at it, and it saves a lot of CPU this way. Also, recently I used task_work_add: before, we needed another command, the get-data command, and that command isn't needed anymore; we can do the same with task_work_add. The one thing is that task_work_add isn't an exported symbol, so it has to be used in a driver that is built into the kernel. But I think that isn't a big deal, because this driver will be a very simple driver: it just does very simple things, and it can't get very big. We won't move the device logic into the driver; it just does the communication and the data copy, so the driver can stay very simple and small. So I think it should be fine to build it into the kernel so we can use task_work_add, and the communication becomes a bit simpler too.
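(A minimal sketch of the completion-batching pattern discussed above: reap a whole batch of CQEs at once, queue one commit command per finished request, and flush them back with a single submit, plus a simple counter as the kind of batching statistic mentioned. UBD_CMD_COMMIT_REQ, ubd_commit_prep(), and carrying the result in sqe->len are hypothetical, not the real command ABI.)

```c
/* Completion-batching sketch (hypothetical command names, not the real ABI). */
#include <liburing.h>
#include <string.h>

#define BATCH               32
#define UBD_CMD_COMMIT_REQ  0x02  /* placeholder opcode, not a real UAPI value */

static unsigned long batched_cqes;  /* the kind of batching statistic mentioned */

/* Placeholder: prepare one commit command for a finished request. */
static void ubd_commit_prep(struct io_uring_sqe *sqe, int char_fd,
			    __u64 tag, __s32 res)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode    = IORING_OP_URING_CMD;
	sqe->fd        = char_fd;
	sqe->cmd_op    = UBD_CMD_COMMIT_REQ;
	sqe->user_data = tag;
	sqe->len       = (__u32)res;  /* hypothetical: carry the IO result */
}

/*
 * Reap a batch of CQEs at once, queue one commit per finished request,
 * then flush them all back to the kernel with a single submit, so N
 * completions cost one syscall instead of N.
 */
void ubd_reap_and_commit(struct io_uring *ring, int char_fd)
{
	struct io_uring_cqe *cqes[BATCH];
	unsigned int n = io_uring_peek_batch_cqe(ring, cqes, BATCH);

	batched_cqes += n;  /* tracing this should show n > 1 most of the time */

	for (unsigned int i = 0; i < n; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		ubd_commit_prep(sqe, char_fd, cqes[i]->user_data, cqes[i]->res);
	}
	io_uring_cq_advance(ring, n);
	io_uring_submit(ring);
}
```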
I think we are at the top of the hour. So do we have more time, Mora? We're out of time, but we have plenty of time on the schedule later this week. So if you guys take it offline and can't sort anything out, we can add another slot for it, probably tomorrow. That sounds good. Okay. Okay, sounds perfect. Thank you a lot, Nick, for the work you're doing. I think it's been quite interesting to see exactly what we were looking to implement. Okay, so if we need any further discussion, we can add a new slot in the meeting tomorrow. Yeah. Just email me to figure it out. Now that we have the... Okay. So I haven't been able to get a hold of Tim.