Okay, they call me Pradeep Surisetty. I work for Red Hat Performance Engineering; I joined about seven months ago. I work closely with our performance team and our KVM development team, Karen, Kevin, Paolo, and Stefan, and I feel privileged for that. I'll be covering three things today: what the motivation for this work was, what problem we were trying to solve, and some of the performance results. There are a lot of performance results; I don't think we'll be able to go through all of them, but we'll be sharing them. And this is one of the recent performance briefs, which was released, I think, two or three days ago.

Before jumping in, I'll spend a couple of minutes on KVM I/O basics. This is our typical architecture. Each KVM guest has dedicated vCPU threads which talk to the KVM kernel module; my guest code is run by the vCPU threads, which talk to the KVM module. And there is a separate I/O thread, using a select loop, which handles all the events. So this is precisely what our KVM I/O architecture looks like.

Before moving on, it's good to know what was happening earlier, to appreciate what we have today. Back then, with full virtualization, we were running unmodified guests on KVM with emulated IDE, SATA, or SCSI devices, and as we know, we had a lot of performance issues with that, because we had to trap and emulate each and every operation. Compared to today, we probably had around a 40% performance drop back then. So then slowly we moved on.
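The select-loop I/O thread just described can be sketched as a toy model in Python. This is an illustration only, not QEMU's actual event loop (which is written in C): a selector waits on event sources, here one end of a socketpair standing in for an eventfd, and dispatches a handler when the "guest" side signals pending work.

```python
import selectors
import socket

# Toy model of the QEMU-style I/O thread: a single select() loop that
# waits on event sources and dispatches handlers. The socketpair stands
# in for an eventfd/ioeventfd; all names here are illustrative.
sel = selectors.DefaultSelector()
guest_side, iothread_side = socket.socketpair()

events_handled = []

def handle_notify(conn):
    data = conn.recv(64)          # drain the notification
    events_handled.append(data)   # a real loop would process I/O here

sel.register(iothread_side, selectors.EVENT_READ, handle_notify)

guest_side.send(b"kick")          # the "guest" signals pending work

# One iteration of the select loop: wait for events, call handlers.
for key, _mask in sel.select(timeout=1):
    key.data(key.fileobj)

sel.close()
guest_side.close()
iothread_side.close()
print(events_handled)             # [b'kick']
```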
And this is paravirtualization. With this came virtio-blk and virtio-scsi, and later many more, and we have effective guest-and-host communication. Instead of trapping on every operation, we collect requests in a ring buffer, merge them, whether reads, writes, or random reads, and submit them together. With this, yes, we improved performance compared to IDE. But we still had a couple of issues: for example, in the AIO case we have to take the global mutex, which I'll talk about later.

The third option is PCI passthrough. Performance-wise it's really good, but it has its own limitations: I have a limited number of PCI devices, it's a little expensive, and it's hard for live migration. So some people use it, but mostly people tend to use virtio.

Now, back to the whole problem. As I was saying, full virtualization means trap and emulation, and paravirtualization means a modified guest with the virtio drivers installed. With virtio we have this vring buffer, and each virtio device has a separate virtqueue. This is before data plane; I'm talking about before data plane came into the picture. So each device has a separate virtqueue, and all of the requests sit in the vring. Once the guest places its requests, it sends a kick: it kicks the queue, saying, boss, I have so many requests, you go ahead and submit them to the device. Okay.
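The virtqueue flow just described, queue several request descriptors, then issue one kick so the host drains everything pending, can be sketched as a toy model. Real vrings are shared-memory descriptor rings inside virtio, so this Python class is purely illustrative:

```python
from collections import deque

# Toy model of the virtio virtqueue flow: the guest driver enqueues
# request descriptors into a ring, then issues a single "kick" so the
# host side drains everything available. Illustration only; real vrings
# use shared-memory descriptor tables, not a Python deque.
class ToyVirtqueue:
    def __init__(self):
        self.ring = deque()
        self.completed = []

    def add_request(self, req):   # guest side: enqueue, no VM exit yet
        self.ring.append(req)

    def kick(self):               # guest side: one notification
        self.host_drain()

    def host_drain(self):         # host side: process all pending requests
        while self.ring:
            req = self.ring.popleft()
            self.completed.append(("done", req))

vq = ToyVirtqueue()
for i in range(3):
    vq.add_request(("read", i * 4096, 4096))   # (op, offset, length)
vq.kick()                                       # one exit covers 3 requests
print(len(vq.completed))                        # 3
```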
So, with this kick, QEMU can go ahead and submit the requests. After PCI passthrough we were still back at the problem, because of the limited number of PCI devices, and so we come to data plane. Data plane is what we have today; it tries to solve most of the problems we already spoke about, and with data plane there is no big QEMU lock in the I/O path. We never know, Stefan and Kevin might come up with something else later, but for now this is what we have. My experiments mostly covered virtio-blk, because that's what is used in OpenStack; I haven't covered data plane performance results, which I'll be reporting down the line.

And this is the storage configuration, which I tried to cover most of. From the host to the VM I attached LVM volumes, or file-based or block-based disks; both of them are covered, which I'll be talking about.

Okay, the problem. When I joined Red Hat, this was the mandate given to me. You know, in OpenStack we've got AIO modes for these devices. There are two modes, native and threads. Native, as you know, is asynchronous: it submits I/O with the io_submit system call, meaning requests from the VM are submitted using io_submit, and since it is asynchronous, once the read or write is done, QEMU is notified through eventfds. The synchronous mode is aio=threads; it is done with our regular pread64 and pwrite64 system calls. The question is, theoretically, when we ask someone which one should perform better, asynchronous or synchronous?
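The two aio modes just described can be sketched roughly as follows. Since Linux AIO's io_submit has no Python stdlib binding, only the aio=threads idea is modeled here: plain synchronous pread calls issued from a pool of worker threads so the submitter doesn't block. All file names and helpers are illustrative, not QEMU code:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Sketch of the aio=threads idea: blocking pread() calls, but issued
# from worker threads so the submitter never waits. aio=native instead
# batches requests into the kernel with io_submit() and collects
# completions via eventfd; that path needs libaio and is not shown.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * 4096 + b"B" * 4096)
    path = f.name

fd = os.open(path, os.O_RDONLY)

def sync_read(offset, length):
    return os.pread(fd, length, offset)   # pread64 under the hood

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(sync_read, off, 4096) for off in (0, 4096)]
    results = [fut.result() for fut in futures]

os.close(fd)
os.unlink(path)
print(results[0][:1], results[1][:1])     # b'A' b'B'
```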
Generally, the answer is asynchronous, right? But in OpenStack, aio=threads is the default choice, which is synchronous, based on previous performance results; this is the reference link for that. We just wanted to figure out, with the latest QEMU, the one we had in RHEL 7.2, I don't recall the exact version, whether that still holds across different combinations, and whether there is something we can do to improve.

This is one of the examples of how I attach my disks, either with plain KVM or with OpenStack. From the same storage I attach two devices, one with aio=native and one with aio=threads, mounted under XFS or other file systems.

Another problem with threads versus native: as you see here, when I started these experiments, this is the CPU usage with aio=native when I'm submitting multiple requests from around 16 VMs concurrently; it's not much. With threads, compared to native, it's on the higher side; CPU usage is more. So if we can improve native performance, we can save enough CPU that we can probably run some more VMs on the host, submitting I/O concurrently.

Okay, so after all this, here is the list of experiments we covered. First, hardware: SSD and HDD, mostly. Then LVM, ext4, or XFS, and NFS with ext4 or XFS,
so I thought I covered most of them. Then qcow2. There is qcow2 with falloc: while creating your image with qemu-img you have preallocation options, preallocation=metadata, preallocation=falloc, preallocation=full, and I tried falloc. The other one is a plain qcow2 image preallocated with fallocate, the Linux command; with that you preallocate your VM image. Both of these are being used in the enterprise OpenStack community. The reason is that qcow2 with preallocation=falloc is not supported with a backing file, and typically you keep creating VMs that have a backing file, so since falloc isn't supported there as of today, people use plain qcow2 with fallocate instead. And then raw preallocated.

So those are the experiments. The jobs were sequential read, sequential write, random read, random write, and random read-write; block sizes 4K, 16K, and 256K; and a single VM versus 16 VMs, with the 16 VMs reading and writing concurrently. I thought, okay, I'm good here, I've covered most of them; then yesterday, while speaking to Kevin, I realized I missed GlusterFS and Ceph, which I haven't covered; down the line I have plans to cover them. Kevin was still missing a couple of things, and I was like, man, okay, after covering these many, I'll cover some more.

Okay, this is the test environment I had: one machine with 256 GB of RAM, and on the host I had one SSD and one HDD device, since I was covering both SSD and HDD. So, I'm slightly
digressing now into which tools we used while running these benchmarks. One is pbench, which was recently open-sourced by our performance team; I like it. What it does is provide easy access to benchmarking and performance tools: all the debugging tools you have are tightly coupled with pbench, and you can also run your benchmarks through pbench, like fio and netperf; most of the performance tools are integrated. Before running my benchmark it clears all your disk caches and makes sure it registers all the configuration. It's fully automated: it starts the benchmark, and while the benchmark is running it captures your debug logs at frequent intervals, sar, strace, blktrace, you name it, it's there. Then once the job is done, it stops all the recording tools, all the debugging tools I was speaking about, and moves the results to a third machine, where we can get a nice graphical view.

I'll show you this graphical view. For example, as I was showing earlier, we get this sar output: for each CPU, how much utilization there is, recorded at that frequency. I've captured only a limited number of graphs, but there are any number of things here: with iostat, for each disk, for example this disk, home, how much it is being used, and there are many more visualizations. So this is what it looks like, and this is how we can run it, just as an example:
pbench-fio with this configuration and these targets, and you can run it concurrently on all of the machines. This is how I was able to run concurrently on multiple machines: on the same host I had 16 VMs, and on all of them I was running concurrent random reads and sequential reads. You can specify various configuration options and test types, read, random read, block size, everything. And as we know, I used fio, one of the most popular tools for I/O benchmarking; I was changing the I/O engine, I/O depth, and I/O type, and there are multiple options, right?

Okay, so back to the problem. While running this with pbench, for example 4K sequential read and random read, the performance difference we see is that, as I was mentioning, native performance was almost 20% less than threads. So there was a valid reason for the OpenStack team to use threads as the default option: before solving the problem, there was around a 20 to 25% performance degradation.

But then we went back and checked; as I mentioned, pbench captured all the results. As you see here, on the guest, iostat shows the disk 100% utilized, consistently, meaning we are constantly writing to the disk. But on the host, the same device shows only 50% utilization. That means on the guest I have enough requests to submit, but somehow they are not reaching the disk at full rate; see here, 50%. So we went back and ran blktrace, and
we realized what was happening. I thought I'd refer back to this slide once again: this is the vring buffer into which the guest's virtio device submits all the requests, and from here we send a kick, right? The reason I had to include this again is the problem we were seeing: on the guest, 100% utilization in iostat; on the host, only 50%. The reason is that we had multiple requests in the virtqueue, but QEMU was submitting only one request at a time. Imagine a separate io_submit call for each 4 KB request; as I mentioned, with native we submit requests using the io_submit system call, and we were issuing a separate call for each 4 KB request. This was one of the bottlenecks for us.

And then it was just a few lines of code change: Stefan made these changes to merge all of them and submit at once. Again, this is without data plane; with data plane, and with virtio-scsi, we already had this, but with virtio-blk we didn't, so with reads we had this problem. If you submit all of them at a time, you get again a 20% performance boost, and you are more or less equal to threads, and in some cases better than threads.

So, what is I/O batch submission? Handle more requests in one single io_submit, one syscall, so that multiple requests can be submitted through one io_submit call; that means we drastically reduce the number of io_submit calls. And this is the patch.
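The batching idea can be sketched as a toy model: instead of one submit call per 4 KB request, queued requests are flushed to the kernel in a single batch. The counter here stands in for the number of io_submit syscalls; this is an illustration of the concept, not the actual QEMU patch, which does this in C inside the virtio-blk request path:

```python
# Toy model of I/O batch submission: accumulate ("plug") requests drained
# from the vring, then flush them with one submit call instead of one
# call per request. Class and method names are illustrative only.
class ToyAioContext:
    def __init__(self):
        self.pending = []
        self.submit_calls = 0

    def queue(self, req):
        self.pending.append(req)      # plug: accumulate, don't submit yet

    def flush(self):
        if self.pending:
            self.submit_calls += 1    # one io_submit for the whole batch
            batch, self.pending = self.pending, []
            return batch
        return []

ctx = ToyAioContext()
for i in range(32):                   # 32 x 4 KB requests from the vring
    ctx.queue(("write", i * 4096, 4096))
done = ctx.flush()

print(ctx.submit_calls, len(done))    # 1 32  (vs. 32 calls unbatched)
```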
We went back to the OpenStack team, ran all the benchmarks again, and OpenStack made the default aio mode native, with a couple of constraints; in some cases OpenStack still uses aio=threads and in some cases aio=native, and eventually enterprise OpenStack and some of the other management tools followed.

Okay, so these are the performance results I had: test specifications, single VM and multi-VM. In this case I used LVM, and raw images, preallocated, as I was saying. We have all the details here: in which case, as I mentioned, I was running 4K, 16K, 64K, 256K, and which one is performing better. There are some variations there, and there is still a lot to improve as of today. I'll be sharing these results; there are multiple things here, and I don't think I'll be able to go through all of them now.

Okay, so this is how the performance graphs look. Can you see this? No? Okay, there's no way I can increase the size, so I'll be sharing it. As you see here, this is the comparison for every test. This one is random read: the first bar is native with LVM and the second is threads with LVM; there is color coding so we can refer to it easily. Then ext4: the third one native, the fourth one threads. And here, this one, XFS, yeah? Yeah.
[Audience] What I'm seeing is that the first three, random read 4K, seem to be basically all the same. I'm trying to understand why including the file system in the mix doesn't change the performance; is that what I'm reading?

Yeah, correct. As you say, LVM should be performing better than ext4 and XFS, because with the file systems we have the VFS layer in between. These are the results I've got; probably I'll go back and investigate it.

[Audience] File systems do a great job, so you don't even notice them, but still, that's surprising; they are basically the same.

Yeah, more or less.

[Audience] It depends how you preallocate them, right?

No, this is not fallocate. As I said, it's LVM, and the raw image is preallocated with dd; all of them preallocated, not fallocate. I'll come to fallocate later.

[Audience] You wrote zeroes.

I wrote zeroes on all of them. Okay, thanks, Kevin. So these are for random write at 4K; you can compare them, and the same way for 16K. Here, as you see, native is performing better than threads. No, no, I'm just giving an example here; with this, we can take it further, okay? But in case someone wants to use preallocation, this is the better way for them; I'll touch upon that in the conclusions.
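Coming back to preallocation: the variants discussed earlier, a sparse file, fallocate-style preallocation, and full preallocation by writing zeroes (the dd approach), can be sketched on Linux with stdlib calls. `st_blocks` shows how much space is actually backed; file names are illustrative:

```python
import os
import tempfile

# Sketch of the preallocation variants, assuming Linux: sparse (logical
# size only), fallocate()-style reservation via os.posix_fallocate, and
# "full" preallocation by actually writing zeroes, like dd would.
size = 1 << 20  # 1 MiB
tmpdir = tempfile.mkdtemp()

def blocks(path):
    return os.stat(path).st_blocks    # 512-byte blocks actually allocated

sparse = os.path.join(tmpdir, "sparse.img")
with open(sparse, "wb") as f:
    f.truncate(size)                  # sparse: no blocks allocated

falloc = os.path.join(tmpdir, "falloc.img")
fd = os.open(falloc, os.O_CREAT | os.O_WRONLY, 0o600)
os.posix_fallocate(fd, 0, size)      # blocks reserved, not written
os.close(fd)

full = os.path.join(tmpdir, "full.img")
with open(full, "wb") as f:
    f.write(b"\0" * size)            # zeroes actually written to disk

print(blocks(sparse), blocks(falloc), blocks(full))
```

This is also why the conclusion later distinguishes fully preallocated from sparse images: a write into the sparse file must allocate blocks first, while the fully written file already has them.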
And random read-write: here again it's native, threads, native, threads, native, threads, with LVM and with raw preallocated. And these are sequential read and sequential write, and these are with HDD; as I was mentioning, the earlier ones were with 16 VMs, and now again HDD with 16 VMs; we can go back and check these results. This is with a single VM; now back to SSD with a single VM.

Okay, with qcow2 there were multiple options we had to cover, as I mentioned: the plain qcow2 image, and qcow2 with falloc, with ext4 and XFS. There are multiple options here and it'll be difficult to go through all of them; there is a lot of data, and it would take a lot of time to cover, so I'll be sharing these slides and we can go through them. And SSD with qcow2, the same thing with SSD multi-VM, 16 VMs.

[Audience] Is that with the deadline scheduler? Is the scheduler important?

With SSD there's a huge performance improvement; I don't have a comparison now of the scheduler on SSD versus HDD, but the problem is the same, so maybe there's a better improvement there. And this is with NFS.

Okay, so this is what I was mentioning in OpenStack: here the default has been changed to aio=native, but as I said, it's not for all formats. And this is the performance brief; we can refer to it. And the conclusion.

[Audience] So throughput increased a lot because the I/O path takes fewer CPU cycles? Sorry, can you say more?
As I was mentioning, with threads we tend to use more CPU cycles, so if we can improve native even further, that would be even better; we'll have to see down the line how it goes. And native uses the minimum number of io_submit calls to complete the task.

So, this is the conclusion: writes to sparsely allocated files are more likely to block than writes to fully preallocated files. That means native is preferable if your image is fully preallocated; if not, still go ahead with threads. That's how OpenStack is using it: in OpenStack, if someone wants to use a preallocated image, native is the way to go, and threads continues to be there for sparse images. We'll have to see how it goes further. These are the references.

I was working with Andrew and Stefan; the three of us used to work at IBM, then we all moved to Red Hat, and we all worked on this. It was nice. Yeah, that's pretty much it. Any questions? As I said, there are a lot of graphs, and this presentation is being shared, so you can probably go through that.