Let me introduce myself: I'm Bernd, and I work for DDN Storage. We have a couple of storage products; typically we are in big data centers, university data centers, HPC, and we are moving a bit into enterprise now. We use FUSE for two products, an older one and a new one, and for both of them FUSE is a bottleneck. These are network file systems, so the overlay approach doesn't apply for us, because we really need access to the data buffer, which we then send out over the network. Ming, in the neighboring session, is also going to talk about zero-copy, I think; that approach doesn't work for us either, because, again, we need the data buffer. I will come back to that later.

In the past there was an approach from Boaz Harrosh (correct me on the pronunciation): in 2018, I think, he proposed ZUFS, a zero-copy user-space file system, in a session here. There were concerns that, since FUSE already exists, do we really need another user-space file system, and then Miklos started to work on fuse2. I looked a bit at both file systems, but I didn't give official reviews. To be honest, Miklos's patch was hard to review because it was a full copy of FUSE rather than an incremental patch, and it contained multiple approaches, multiple options, so it was hard to see which were the important code parts. At some point I ran out of time and stopped; I was doing this on weekends, and it didn't work out.

Last year we were working on atomic open; patches for that are going to follow, I hope this week, maybe next, but not later. Miklos asked us for benchmarks, and Dharmendra, who was working on the patches, ran them, and the results were really confusing. Atomic open basically saves one lookup call: every time you do an open, plain open or open-create (create is already somewhat optimized), there is always a lookup sent first, and then the open. A real atomic open, as Lustre and, I think, NFS do it, has everything in one call: lookup, create, and stat in one. FUSE always sends the lookup first, and over the network that is really painful. So we did the atomic open patch, Miklos asked for a benchmark, and with the lookup call removed and a multi-threaded FUSE daemon we got lower performance. Dharmendra was sitting there, we were discussing it, and I said, okay, let me look, let's start without multi-threading. Without multi-threading it worked perfectly; with multi-threading it didn't.

One thing I noticed: there was a bug in libfuse at the time where it would needlessly create and destroy threads. That is fixed in libfuse now. But the other problem is really the way libfuse, that is, user space, communicates with the kernel. While investigating that, I worked on a patch that does spinning. When a thread issues the read request... let me go to the next slide, do I have a pointer? Okay, not so important. You have your thread pool here: libfuse sets up the read on /dev/fuse, the threads issue it and go into the wait queue. Then your application comes, its request goes over /dev/fuse, and the kernel wakes up one of these threads, randomly.
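To make that round trip concrete, here is a minimal sketch of the classic worker loop, written against the libfuse 3 low-level API; this is a simplification, not the exact libfuse internals:

    #include <errno.h>
    #include <stdlib.h>
    #include <fuse_lowlevel.h>

    /* Classic /dev/fuse worker: every thread blocks in read(2) on
     * /dev/fuse, and the kernel wakes one sleeping worker per request. */
    static void *fuse_worker(void *arg)
    {
        struct fuse_session *se = arg;
        struct fuse_buf buf = { .mem = NULL };

        while (!fuse_session_exited(se)) {
            /* syscall #1: read the next request from /dev/fuse */
            int res = fuse_session_receive_buf(se, &buf);
            if (res == -EINTR)
                continue;
            if (res <= 0)
                break;
            /* dispatch; the reply is syscall #2, a write to /dev/fuse */
            fuse_session_process_buf(se, &buf);
        }
        free(buf.mem);
        return NULL;
    }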
You don't have control over which thread gets woken up. What my spinning patch does: instead of going into the wait queue directly, the read goes into spinning first. That gave much, much better performance. My issue was that I had no control over it, so it was not just one thread spinning; because of the random wake-up, multiple threads were spinning at the same time. And I started thinking: the whole interface is kind of broken, why am I doing this at all?

My background here is actually not that I was familiar with io_uring, but I had worked at some point on our own user-space NVMe driver, which we have in our product. NVMe works with rings, and I thought: why are we not using rings to user space? Oh, there's io_uring, it's doing exactly that, so why don't we use it? Except that for us it's the other way around, because we are sending requests from the kernel to user space. And at the same time, while we were discussing this internally, Ming brought up the user-space block device driver, ublk. That is the io_uring approach, and ublk works rather similarly to what I need; I took quite some ideas from it. User space submits an SQE carrying IORING_OP_URING_CMD, which is a forwarded command: the command is not handled by io_uring in the kernel itself, it is forwarded to a custom handler. That is what ublk does, and that is what I implemented in FUSE.

In the default mode, the way you should set up io_uring with FUSE is one queue and one thread per core. With libfuse in the current /dev/fuse mode you get some number of threads; you don't know how many, and you don't know which cores they run on. Well, you could pin them, but the kernel still wouldn't know what runs where. With the io_uring mode, you set up one core-bound thread per core; each thread has a queue, and each queue has a number of ring entries. You submit these ring entries to the kernel, and there they wait to be completed.

Every time the application sends a request, let's say a sync request, the kernel right now still does a request allocation, then goes to a bitmap, searches for a free ring entry, takes the first one available, and copies the data into it. The design would let us remove the memory allocation here, and in my initial patches I did that, but I ran into issues with daemon/server kill; the shutdown path is a really complicated path, and when I tested it I hit corner cases. Besides, for async or background requests I need to do the allocation anyway. So for now I kept the memory allocation, to avoid making the patches too complicated; we can remove it later.

So basically: the application comes, the kernel takes a ring entry and sends the request to user space. The interesting part is what happens when user space is done: it sends the reply back, again as an SQE, saying "I have a reply for you, this command carries the answer", and that same command is immediately registered for the next request. With normal /dev/fuse you write your reply to /dev/fuse, that write returns and says it was successful, and then you read again. So you have two system calls, one to commit your result and one to fetch the next request.
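As an illustration of how such a ring entry might be registered from user space with liburing: the command opcode FUSE_URING_REQ_FETCH and struct fuse_uring_cmd_req below are stand-ins for whatever the patch series actually defines, not the real ABI:

    #include <liburing.h>

    #define FUSE_URING_REQ_FETCH 1          /* placeholder opcode, assumed */

    /* Hypothetical payload carried in the SQE's inline cmd area
     * (set up the ring with IORING_SETUP_SQE128 for more room). */
    struct fuse_uring_cmd_req {
        __u32 qid;                          /* per-core queue */
        __u32 tag;                          /* slot within that queue */
    };

    static int fuse_uring_queue_fetch(struct io_uring *ring, int fuse_fd,
                                      unsigned qid, unsigned tag)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe)
            return -ENOMEM;

        /* IORING_OP_URING_CMD is not handled by io_uring itself; it is
         * forwarded to the file's ->uring_cmd() handler, fuse here,
         * exactly like ublk does it. */
        io_uring_prep_rw(IORING_OP_URING_CMD, sqe, fuse_fd, NULL, 0, 0);
        sqe->cmd_op = FUSE_URING_REQ_FETCH;

        struct fuse_uring_cmd_req *req = (void *)sqe->cmd;
        req->qid = qid;
        req->tag = tag;

        /* The command now waits in the kernel; its completion (CQE)
         * means "here is a request for you". */
        return io_uring_submit(ring);
    }

The reply path reuses the same mechanism: the daemon resubmits the command with the answer attached, so one SQE both commits the result and fetches the next request.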
Here you remove at minimum one system call per request, and you could theoretically avoid all of these calls altogether.

Audience: Do users need to be aware of this, or can it be plugged into libfuse?

It's plugged into libfuse; that was my goal. I wanted to get it in so that libfuse just hooks into it and every FUSE program can immediately benefit from it.

I'll come to another slide, because we recently had an issue in libfuse; let me find it, I'm moving a little ahead. This is the struct for the ring request. We recently had an issue in libfuse where copying the FUSE header caused an incompatibility; you all remember it. When I wrote the code, I initially had all of this in one struct: these are the in/out headers with the basic information, the per-command information is here, and that is very similar to what the current /dev/fuse protocol does. And then there is the I/O buffer; it could be a one-megabyte buffer. With the ring we could split this into two pieces, although the interface then becomes a little less compatible: right now I can hook into the libfuse low-level interface and reuse all the functions there; I only added one function for the ring interface plus a couple of other ring-specific functions. My suggestion would be that we do split it, and that for the per-command header we add another struct, a dedicated fixed-size buffer. It could be 4 KiB, it could be 100 bytes or 128 bytes. If we make it sufficiently large, then extending the struct never brings back the incompatibility we had before, except if you exceed the buffer, which is extremely unlikely.

Here I wrote it in text form. Right now everything is set up over a shared memory buffer. I wanted to avoid kmap, because my memory was that it's expensive; now there is kmap_local, so I'm not entirely sure we would really need the shared buffer, but on the other hand it's basically for free.

The expensive part, and here I actually expect review comments on whether it can be fixed or done better: I wanted the memory map to be NUMA-aware, and so user space sends the queue ID (QID) as the mmap offset parameter, which is a bit annoying. What I would really like is for the mmap function in FUSE to know which buffer pointer user space is going to get. And since we have all the file system people here, and maybe the memory-management people on the other side: I need an interface where, in FUSE's mmap function, I see exactly the address that user space gets, because what I return from that function gets changed further down the stack. Then I could drop the QID parameter. I hope this isn't too confusing; in short, I need the correlation somewhere. Either I take the QID, handed in through the offset parameter, because that is the only parameter I have in mmap, or, the better choice, I know in kernel space which address user space gets; then I could keep a map of the user-space and kernel-side buffers and correlate them later on.
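For illustration, the user-space side of the current scheme could look like this; the exact offset encoding is an assumption, the point is only that the queue ID travels in mmap's offset argument because that is the only free parameter:

    #include <sys/mman.h>
    #include <unistd.h>

    /* Map the shared request/payload buffer of one per-core queue.
     * Encoding the queue ID in the page offset is assumed here, not
     * the actual ABI of the patch series. */
    static void *fuse_uring_mmap_queue(int fuse_fd, size_t queue_buf_size,
                                       unsigned qid)
    {
        off_t off = (off_t)qid * (off_t)sysconf(_SC_PAGESIZE);

        return mmap(NULL, queue_buf_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fuse_fd, off);
    }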
Right now the correlation is done via the QID, which is a bit annoying.

Audience: Okay, so you need the kernel to know what user space is going to get when it calls mmap, so you know which QID to use?

Yes.

Audience: I can't think of a way to do this that is relatively safe. The only thing I can think of is either the offset parameter or something BPF-based. Couldn't you track the QIDs for all your requests?

Well, I don't even need the QID in this case. I just need the buffers correlated: if it's a memory-mapped buffer, the kernel side writes into the buffer, and it needs to be exactly the same buffer user space is using. I don't really care about the QID; I care that the kernel side doesn't write to the wrong buffer.

Audience: I see. So it's really that the kernel doesn't know what user space gets?

At least in my mmap function. I looked into the code, and the result the file system's mmap function returns is apparently changed further down the stack, probably for security reasons, so that user space doesn't get exactly what the kernel calculated. I guess; I don't know.

Audience: So this is related to address-space randomization, you say? Or...

I don't know what the reason is. I just looked at the code and noticed that for some reason it changes what user space gets; what I calculate in my file system's mmap function is changed down the stack.

Audience: So you have your mmap function, and from there you call mmap_pgoff or however it is...

I call all of these things and return the result to the caller, to the mm subsystem. And the mm subsystem, further down, before returning to user space, basically changes the address. I can't know what it calculated, because that happens below me in the stack.

Audience: I understand now. So when we are returning, the mm can still modify the virtual address we hand back to user space. I can check, I'm not sure. But you could also do it by attaching... oh, but then you need to somehow share the QID between user space and the kernel, if I understand?

No: if I know what user space gets, then for the memory buffer I don't care about the QID at all. The QID is just my helper to correlate the two sides. It basically says to the kernel: I want the address for this QID. If I knew what user space gets, user space could later send its buffer address back, and I could look up in the map which kernel-side buffer belongs to it.

Audience: It seems to me that user space knows the virtual address, it tells it to the kernel, and in the kernel you need to find the buffer that corresponds to it. If you look up the virtual address, you get to the VMA, which is the kernel description of that address range.

I didn't find the function to do that yet.

Audience: It's very simple. There is vma_lookup(), which is used on the page-fault path; it transforms a virtual address into a VMA and, possibly, an offset. That's basically how you are supposed to transform a virtual address as seen by user space into the kernel representation of the mapping.

Okay, maybe you can point me to it later.

Audience: Sure, we can take this offline.
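A sketch of the direction being suggested here: given the user virtual address the daemon sends back, the kernel finds the VMA and, through it, the queue. Stashing the queue in vm_private_data and the fuse_uring_vm_ops name are my assumptions, not code from the series:

    #include <linux/mm.h>
    #include <linux/sched.h>

    struct fuse_ring_queue;                                      /* hypothetical */
    extern const struct vm_operations_struct fuse_uring_vm_ops;  /* hypothetical */

    /* Translate a user-space virtual address back to "our" mapping. */
    static struct fuse_ring_queue *fuse_uring_queue_from_uaddr(unsigned long uaddr)
    {
        struct mm_struct *mm = current->mm;
        struct vm_area_struct *vma;
        struct fuse_ring_queue *queue = NULL;

        mmap_read_lock(mm);
        vma = vma_lookup(mm, uaddr);          /* uaddr -> containing VMA */
        /* only trust VMAs that fuse's own ->mmap() created */
        if (vma && vma->vm_ops == &fuse_uring_vm_ops)
            queue = vma->vm_private_data;
        mmap_read_unlock(mm);

        return queue;
    }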
The other part: when user space sends me the command, this IORING_OP_URING_CMD, I return to the caller that it is queued; the command is put on hold, and it must later be completed with io_uring_cmd_done(). That's the important part, because if I don't complete it and the ring gets shut down, I get worker threads stuck in D state, complaining at intervals that the completion hasn't happened yet (I'll show a rough code sketch of this pattern below). This relates to Miklos's comment on the patch series asking why I have such a complicated shutdown handler for when the process gets killed. I hope I can avoid some of it; of the whole patch series, I spent most of the time on daemon kill. It was really annoying.

Another part that's important for reviewers: FUSE has foreground queues and background queues, and it moves requests from background to foreground, presumably so the daemon side doesn't fill up with only background requests. Say you are writing to or reading pages from a file and there is one metadata request in between; the file reading and writing shouldn't hurt interactivity. With the io_uring approach we basically have only one queue per core, and I solved this with credits: user space tells me how many sync and async slots there are. I call them sync and async here because they are only loosely bound to foreground and background; in patch series one, the one I already sent to the list, they are still called foreground and background, and in patch series two, not sent yet, this gets updated to sync and async. User space could then also decide to handle async requests at a lower priority and serve sync requests first. More credit types will probably be needed, but not for now.

Now maybe you wonder why I made this part here green. Hmm, that's not very well visible; I see it on my laptop but not on the slide, a shame. When I test something like this, I want to see metadata performance, and I use a very simple Bonnie run that creates files with zero file size: it creates a file, does a read, but it's a zero-size read, and then unlinks the file. If you run that Bonnie over normal /dev/fuse, you see in htop that Bonnie is spread randomly over all the cores; it's not running on one core, it runs on all of them, and all the cores sit at a rather low CPU frequency, though I don't think they can go into C-state sleep. If you instead run it with the io_uring mode and with the scheduler fixes, you get one core that is really busy and the other cores can sleep. That's why I made it green: I think this is eco.
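Coming back to the queued-command pattern from a moment ago, in code terms it looks roughly like this; the handler shape mirrors what ublk-style uring_cmd drivers do, and the fuse function names here are placeholders, not the actual patch code:

    #include <linux/errno.h>
    #include <linux/io_uring.h>   /* <linux/io_uring/cmd.h> on newer kernels */

    /* ->uring_cmd() handler: park the command instead of completing it. */
    static int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
    {
        /* ... validate, stash cmd in the ring entry it registers ... */
        return -EIOCBQUEUED;      /* tell io_uring the command stays queued */
    }

    /* Later, when a FUSE request lands in that entry, post the CQE.
     * Skipping this is what leaves workers stuck in D state at shutdown. */
    static void fuse_uring_send_to_daemon(struct io_uring_cmd *cmd)
    {
        /* exact io_uring_cmd_done() signature varies across kernel versions */
        io_uring_cmd_done(cmd, 0, 0, 0);
    }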
So, work in progress: I have been fighting with the scheduler for the last four weeks or so, because of what happens back on the earlier slide. Say the daemon committed something to a ring. The scheduler sees the ring thread running on this core, decides this ring thread is disturbing, in this case, my Bonnie process, and migrates the Bonnie process away from the current core. That is not what I want, because once Bonnie has moved, its next request arrives on the next core, and I get exactly the spreading effect again. The ring thread is core-bound, which is not the case with the normal /dev/fuse interface, but the application process is typically not core-bound; I mean, who of you pins the process every time you copy something? You don't. So the scheduler needs to be aware that these processes are not hurting each other.

Let me go to the performance slides now. Here are some performance graphs; let's look at the direct-I/O reads. We have /dev/fuse with CPU migration on and with migration off: there is a pair of functions in the kernel, migrate_disable()/migrate_enable(), and I disabled migration for /dev/fuse before and after the wait queue (sketched below). For /dev/fuse it doesn't matter, it doesn't help. For io_uring, this is migration on, the default behavior: performance improves a little here, but not much. And here migration is disabled, and you see a huge improvement: here in the middle it goes from around 700 to 22,000 megabytes per second. That's quite an improvement, I think, and it applies to everything here. You really need the scheduler not to move your process around. I'm not entirely happy yet; this one, I guess, is mmap mode, and for mmap it's much the same story: with migration enabled it's bad, it doesn't work well; with migration disabled it works really well.
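The migrate-off experiment has roughly the following shape; where exactly the calls sit relative to the wait is the experiment itself, and pinning the task here is shown for illustration, not as the proposed fix (struct fuse_req, req->waitq and FR_FINISHED are from fs/fuse/fuse_i.h):

    #include <linux/preempt.h>
    #include <linux/wait.h>

    /* Keep the waiting task on its CPU across the request round trip so
     * the reply, and the next submit, stay on the same core.
     * migrate_disable() still allows sleeping, unlike preempt_disable(). */
    static void fuse_request_wait_pinned(struct fuse_req *req)
    {
        migrate_disable();        /* the scheduler may not move us now */
        wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
        migrate_enable();
    }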
Audience: We actually see this all the time in our performance testing with I/O: the scheduler tends to migrate more than we would like, exactly as you said. And it interacts in an interesting way with CPU frequency scaling: if you migrate to a new CPU, you start at a low frequency, it takes time before the scheduler decides to ramp the frequency up, and that initially costs you a noticeable amount of latency. Often, before the CPU manages to ramp up to full frequency, the scheduler has already decided to move the task to another CPU. That said, there is another side to this: the scheduler needs to spread load, and there are workloads that genuinely benefit from that behavior. We have been fighting with this in our kernels for a couple of years, and we carry some patches in our distribution kernels that tweak this behavior and the CPU frequency scaling behavior. But the truth is that different workloads want different things, and that is very difficult for the scheduler: sometimes you want processes on the same CPU so they share caches, when they are in a producer-consumer kind of relationship; on the other hand, you want to spread them when they need more than one CPU's worth of time, because at some point you simply have to spread the load to be able to scale.

With other file systems it's not so bad, but with FUSE you completely offload what your application does to another process; the other process does all the work. And when it comes back, you want to continue on the current CPU, especially for sync requests. For async it's actually the other way around: in the next patch series (sorry, Miklos) I actually put requests on another core when they are async and above a certain I/O size. I still need to work on finding the right core; I'm effectively doing scheduling inside FUSE now. Because the daemon processes the ring in user space, walking the ring without going into the kernel and already submitting requests to the submission queue, it does its work on one CPU, and in mmap mode the page reads basically do their work on another one. So the scheduler really comes into play here.

Audience: All I want to say is that this is not a simple problem, and the scheduler simply doesn't have enough information to make a good decision here.

Well, there are patches. There is basically a patch to wake sync waiters on the same core, except that it's not working perfectly yet. And there is the CISO patch, where you tell your processes: hey, we belong together, I'm a CISO process, I don't want to be moved away.

Audience: The scheduler people are also working on approaches to this problem; they know about it. So talk to them; they may already have other ideas on how to do this and how to propagate the information towards the CPUs.

Yeah, I'm on the way; I'm in the loop with them. So that's where I got to. The patch to wake on the same core isn't working perfectly yet. The CISO patch, I think, would have side effects: besides extending struct task_struct, which is used for everything, by one byte right now, I think it would have behavioral side effects, so I don't want to go that route. I tried putting in more information so the side effects go away, but then the patch didn't work, and then I realized we could actually do this with the wake-on-the-same-core wakeup, except that that one doesn't work perfectly either. So my next step is to investigate myself why it isn't working.

We are officially out of time. Are there more questions? After this there are just more performance slides; that's my data. I believe the performance can get even better with the scheduler fixed and with the ring. Miklos, do you have anything you want to add? Thank you.