Hello everyone, thanks for coming. So we're here to talk a little bit about KVM and QEMU internals, but before we start on that, a little bit about who I am, my background, and why you might want to listen to me. I'm an architect at Red Hat. I mostly work with key customers and partners to develop performance and sizing guides and reference architectures for Ceph storage clusters. Prior to Red Hat, I worked at Inktank, where I did a lot of the same thing, working with customers and building large storage systems. And prior to that, I was in operations and architecture at DreamHost, where I worked on Ceph clusters as well, bringing some of the first few into production.

So why did I fall down this rabbit hole of really trying to understand the internals of QEMU and the effect they have on performance? Well, I'm heavily involved in Ceph, and Ceph has been the number one Cinder driver in production OpenStack environments for a number of years. Talking with a lot of our customers as they moved their workloads onto their OpenStack clouds, they began approaching us and saying: we want to continue to use Ceph, but we want to use it for our high-performance storage too, not just as a capacity tier or for ephemeral booting of images. And if you look at the OpenStack user survey, 70% of the application frameworks being deployed in OpenStack environments are LAMP-based, so making sure that MySQL performs well is very, very important. So we did a bunch of work with Percona and Supermicro and developed a reference architecture for running MySQL on top of RBD block devices; if you're interested in more on that, there's a short link to the reference architecture paper there. But that work led me into really trying to understand all the different components that make up the QEMU subsystem.

It starts with the distinction between full virtualization and paravirtualization. Originally, full virtualization was nice because it had the highest compatibility: you didn't have to modify the guest images. The downside was that it was really slow, both because all the hardware, including the CPUs, was being emulated, and because it was very trap-heavy. So there was a better way to approach this. The QEMU architecture was very simple at the time, and it wasn't very high performance. Not until the advent of virtio and paravirtualized drivers placed inside the virtual machines did we see significantly better I/O in the guest operating systems. So now we can actually see relatively good block performance. The way this works is that there's a virtio bus interface on the hypervisor side, and there are virtio drivers inside the kernel of the virtual machine. There are a couple of different methods right now: the virtio-blk paravirtualized driver and the virtio-scsi paravirtualized driver, and you can choose between the two. The way you do that is by setting properties on your Glance image that say, for this particular image, any virtual machine instantiated from it should use virtio-scsi, for example.
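To make that concrete, here is a rough sketch of how those image properties are typically set from the CLI; the property keys are the standard Glance/Nova image metadata, but the image name is just a placeholder:

```bash
# Tag a Glance image so that instances booted from it attach their disks
# through a virtio-scsi controller instead of the default virtio-blk.
openstack image set \
  --property hw_scsi_model=virtio-scsi \
  --property hw_disk_bus=scsi \
  my-rhel7-image        # image name is illustrative
```

Leaving the properties off gives you virtio-blk, which shows up in the guest as /dev/vda rather than /dev/sda.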
So as we were doing a lot of this performance testing, we were testing on some really slick hardware: brand new NVMe servers from Supermicro with really fast Intel P3700 NVMe drives. And we kept running into a bottleneck, and we were trying to understand what it was. If you look back at an older version of QEMU, the architecture looked something like this: while you had multiple threads for the vCPUs, for actually executing the guest code natively on the hardware, your I/O was still being processed by a single event loop, and that event loop could block depending on the type of I/O that was issued. To improve on things, this event loop was broken up into multiple components with a separate AIO context, so that certain I/Os wouldn't have to lock the whole event loop while waiting for completions. That worked pretty well and improved things, until a better method came along, and that's the data plane. With the data plane, you're able to have multiple threads, each with its own AIO context, and if you're doing a lot of asynchronous I/O that works pretty well.

So we were testing with all these different methods, with virtio-blk, with virtio-scsi, playing with the different knobs that are available in QEMU itself, just QEMU and RBD, to see what the best configurations were even outside of OpenStack; then we would come back to OpenStack and look at how Cinder, Nova, and Glance interact in order to program the data plane and configure QEMU to be as performant as we could make it.

So, bridging with OpenStack: Cinder doesn't quite yet have support for the virtio-blk data plane. There's a blueprint up, so that's a work in progress. I was also unable to find a way, through Cinder, to use the queues and vectors support that's available with virtio-scsi, which lets you have multiple I/O queues. And while it could be interesting to add these things to Cinder, one problem you might run into is that if these separate I/O threads are using a lot of CPU to process I/O, that's going to manifest as CPU steal if it's not accounted for somehow. So if we were to add support to Cinder for assigning multiple I/O threads to different volumes using virtio-scsi, then we might need Cinder and Nova to work together to account for them, and maybe assign fewer vCPUs when you're effectively dedicating an execution thread to I/O. Primitively, I suppose you could reserve some cores on the hypervisor, but really you would probably need a more holistic approach that does some sort of accounting.

It turns out there are also multiple I/O modes in QEMU. As of the last summit this was hard-coded: you could only use the 'threads' I/O mode. The threads mode is fairly safe; QEMU has its own userspace thread pool and just issues pread64 and pwrite64 calls. That was hard-coded in Nova; now it's the default. I believe it was in Mitaka that it changed from being hard-coded to being a default that you can override to use aio=native. The difference with aio=native is that instead of a userland thread pool, it uses kernel AIO and the io_submit system call, and that can help: it can bring CPU usage down a little in some cases, but it's complicated. We ran tests with both, and for our particular workload we didn't see that one performed particularly better than the other.
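To make the data-plane and I/O-mode discussion concrete, here is a rough sketch of what the same knobs look like when you drive QEMU by hand instead of through Nova; the RBD pool/image names and object IDs are illustrative:

```bash
# A dedicated iothread (data plane) handling a virtio-blk device backed by RBD.
qemu-system-x86_64 \
  -object iothread,id=iothread0 \
  -drive file=rbd:volumes/test-vol,format=raw,if=none,id=drive0,cache=none \
  -device virtio-blk-pci,drive=drive0,iothread=iothread0

# The threads-vs-native choice discussed above is the aio= option on the
# drive, and it applies to local files and block devices:
#   aio=threads -> QEMU's userspace thread pool doing pread64/pwrite64
#   aio=native  -> kernel AIO via io_submit
#   -drive file=/dev/sdb,format=raw,if=none,id=drive1,cache=none,aio=native
```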
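And the virtio-scsi queues and vectors support I mentioned looks roughly like this at the QEMU level, which is exactly what I couldn't find a way to request through Cinder; num_queues is usually matched to the guest's vCPU count, and all the names here are placeholders:

```bash
# virtio-scsi controller with multiple request queues, pinned to its own iothread.
qemu-system-x86_64 \
  -object iothread,id=iothread1 \
  -device virtio-scsi-pci,id=scsi0,num_queues=4,iothread=iothread1 \
  -drive file=rbd:volumes/test-vol,format=raw,if=none,id=drive1,cache=none \
  -device scsi-hd,drive=drive1,bus=scsi0.0
```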
That said, we did see, with some ways of benchmarking, that io_submit could block, and that could cause the VM to have jitter. So if you were running a workload in the VM and pushing it really, really hard, this would manifest as, say, you're running dstat and you see missed ticks, and then it comes back. That was a little worrisome for us, especially because we were relying on those timers to keep track of how many I/Os completed in a given period.

There are also a lot of different caching modes available in KVM and QEMU, and understanding how these map, whether you're using local LVM, local drives, or RBD, can be a little confusing. So I've put together, or rather expanded, a table to include the RBD components. The way this works is that you set a cache mode in Nova: in nova.conf you set your cache mode to writeback, and that forces the cache behavior of all your VMs and Cinder volume types to use that particular mode, which is kind of strange, at least in my opinion, from a user-experience point of view, because sometimes I don't want to use caching at all. From the benchmarks I've done, not having caching turned on with an all-SSD volume type performs better than having the cache turned on at all. But for a magnetic pool, it makes a lot of sense to use writeback. And writeback is safe: if you have an application that needs to guarantee that data has been persisted to disk, it should be issuing fsyncs, and it should be using a file system that supports barriers; if that's the case, the flush gets passed through, the RBD cache or the page cache on the host gets flushed, and the data is on persistent media. So it's still beneficial in that it reduces the number of write I/Os going to the disk, which leaves those seeks available for faster reads. In an ideal world, I think you should be able to set different cache policies for different volume types. When you're an administrator setting up your volume types, I think it would be really sweet if I could say: for my gold-level tier that's based on all-flash media, I want directsync, but for my spinning-media-backed storage type, I want writeback. QEMU lets you do that; it's just not exposed through the control plane, so there's no way to push that logic down.

I get asked about QoS a lot. Ceph itself doesn't have a native QoS capability; it's something being worked on, but it's hard when you're doing distributed storage. But I feel you can achieve adequate QoS through appropriate capacity planning. If you know the throughput of a particular storage node and you know its capacity, and you collapse those two constraints onto each other, similar to what the public clouds do, then your capacity planning becomes: you add more nodes because your tenants are provisioning either spatial capacity or IOPS, and you don't care which; you just need to know that you need more. By collapsing those two constraints into one, you don't have to independently track your IOPS and your capacity, because as an operator that's just making things more complicated than they need to be.
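For context, this is the single global knob in question; shown here with crudini purely for convenience, and the values are examples ('network' is the bucket that covers RBD-backed disks):

```bash
# nova.conf on the compute node: one cache mode for every RBD-backed disk,
# regardless of which Cinder volume type the disk came from.
crudini --set /etc/nova/nova.conf libvirt images_type rbd
crudini --set /etc/nova/nova.conf libvirt disk_cachemodes "network=writeback"

# For an all-flash pool, the numbers later in the talk favour turning it off:
#   crudini --set /etc/nova/nova.conf libvirt disk_cachemodes "network=none"
```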
So, in the testing I've done comparing hard-drive-based pool performance and SSD-based pool performance: if anyone is familiar with the work of Neil Gunther, he wrote a great book called Guerrilla Capacity Planning, and in it he describes, in great detail, a model called the Universal Scalability Law. There are actually tools you can use: by taking measurements at different throughput levels and different client counts, you can use R to project confidence bands for how much throughput you should expect at given client levels. It turns out that part of the Universal Scalability Law is a term called the coherency delay. The law was originally applied to databases and NUMA-type systems, but it turns out that the coherency delay maps really well onto seek latency. And so client scaling, when you have SSDs, kind of plateaus, because the seek latency is fixed; it's not like spinning media, where once you reach a certain number of threads you get severe retrograde performance.

What that leads to is that you can do the sort of provisioning we see in the public clouds. Being able to have that in Cinder, being able to set capacity-derived limits for your SSD pools as a ratio, so you can say: for my volume type, I want to give 30 IOPS per gigabyte of storage that the tenant provisions, or 3 IOPS per gigabyte if you want something closer to a general-purpose SSD tier, makes a lot of sense for all-flash pools (a sketch of what that could look like follows in a moment). For spinning media, I think static limits are great; static limits and volume quotas are perfectly sufficient for spinning clusters. But for SSD, I really think a ratio approach works well. And then, like I said, by collapsing the two constraints together, you don't care whether your tenants are using up capacity or IOPS; you just know that you need more. And if you've done testing beforehand on how much throughput you get from a given system, you know whether or not you can hit those targets. In the reference architecture paper I mentioned earlier, we actually did this, so you can deliver deterministic performance.

Our benchmark environment, as I mentioned, was Supermicro NVMe servers: two fairly recent 2650 Haswell CPUs each, two Intel P3700 800 GB NVMe drives, and dual 10 GbE networking. We were running RHEL 7.2 and the previous release of Red Hat Ceph Storage (not the most recent one, because this work was done earlier this year), which is based on the Hammer release, so this is before Jewel. We did do some tuning, though. We actually ran four OSDs per NVMe device; those Intel NVMe devices are pretty dang fast and you really need to fill the queue depth, so we ran four OSDs per device. We tuned TCMalloc to use a larger thread cache; that's the default now, at least in the downstream versions of Ceph. Another thing that was important was TCP_NODELAY on the messenger, basically disabling Nagle's algorithm so it doesn't add latency. And then, really critical when you're running any sort of NVMe, making sure your kernel supports blk-mq, so that you can have multiple queues into the NVMe device.
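A rough sketch of that host-side tuning; the option and variable names are as I recall them from the Hammer era, and the values are illustrative, so verify them against your own release:

```bash
# Larger TCMalloc thread cache for the Ceph OSDs (set in the OSD environment,
# e.g. /etc/sysconfig/ceph on RHEL-family hosts; 128 MB shown here).
echo 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728' >> /etc/sysconfig/ceph

# ceph.conf: keep messenger latency down (this is the default in current
# releases, shown for completeness):
#   [global]
#   ms_tcp_nodelay = true

# Sanity-check that the multi-queue block layer is in use for the NVMe devices:
ls /sys/block/nvme0n1/mq/ && echo "blk-mq enabled for nvme0n1"
```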
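And coming back to the capacity-derived limits idea: a hedged sketch of what that could look like with Cinder QoS specs, assuming a release that has the per-GB QoS keys; the spec and volume type names are made up:

```bash
# Front-end (hypervisor-enforced) QoS where IOPS scale with provisioned
# capacity: a 100 GB volume on this type would be capped at 3,000 IOPS.
openstack volume qos create ssd-ratio-qos \
  --consumer front-end \
  --property total_iops_sec_per_gb=30
openstack volume qos associate ssd-ratio-qos gold-ssd    # volume type name is illustrative

# For spinning media, a plain static cap plus volume quotas is usually enough:
openstack volume qos create hdd-static-qos \
  --consumer front-end \
  --property total_iops_sec=500
```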
So at first, when we were doing the testing in these KVM instances, we were using fio with O_DIRECT, fsyncing after each write, using the libaio engine, and using ext4 inside the guest. And this is when we encountered the issues with jitter and the missed ticks. We weren't super confident in the IOPS figures, because we didn't know if the stalls caused by the jitter were affecting the calculation of how many IOPS were being done per second. In retrospect, and after learning a little bit more, it might be due to using O_DIRECT together with aio=native, which happens not to be the default in OpenStack, so maybe if we had been running in an OpenStack environment and not playing around with things we might not have seen this, plus potentially io_submit blocking. But regardless, that was the lower-level testing.

The eventual goal was to do MySQL benchmarking, so we brushed off sysbench. We set up a config with a 50 GB data set and an 8 GB InnoDB buffer pool, and made sure it was a fully ACID-compliant configuration: InnoDB with O_DIRECT, flushing after every transaction. The guest file system was XFS, mounted with noatime and nodiratime. We accidentally used nobarrier, which is a terrible idea, so don't do that; as far as the effect on the results goes, it would probably only matter for the runs where the writeback cache was enabled, and we actually had better performance when it wasn't enabled. So again, do not use nobarrier, ever; that was just a mistake. We reloaded the data before each test, and then we ran three different tests: a 100% read workload, just doing SELECTs; a 100% write workload, just doing UPDATEs; and a blended workload doing 70/30 reads and writes. We made sure we were using a uniform distribution rather than a Pareto distribution, because we wanted a good amount of the I/O hitting the disk; we didn't want everything coming out of the buffer pool, because that's not really testing the storage. We ran each test for 20 minutes.
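For reference, the low-level fio runs looked roughly like this; the engine, O_DIRECT, and per-write fsync options match what I just described, while the block size, queue depth, file size, runtime, and mount point are placeholders:

```bash
# Random-write fio job inside the guest: O_DIRECT, libaio, fsync after each
# write, against the ext4 file system on the RBD-backed virtual disk.
fio --name=randwrite-test \
    --directory=/mnt/datavol \
    --size=10G \
    --rw=randwrite --bs=4k --iodepth=32 \
    --ioengine=libaio --direct=1 --fsync=1 \
    --runtime=1200 --time_based
```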
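And the sysbench runs looked roughly like this; this is sysbench 0.5-era syntax, and the script path, table size, thread count, and connection details are placeholders:

```bash
# Blended OLTP workload with a uniform key distribution so a good fraction of
# the I/O misses the InnoDB buffer pool and actually hits the disk.
# (The pure-read and pure-write runs were shaped with the oltp-* query knobs,
# e.g. --oltp-read-only=on for the SELECT-only case.)
sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua \
  --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=secret \
  --oltp-table-size=10000000 \
  --rand-type=uniform \
  --num-threads=16 \
  --max-time=1200 --max-requests=0 \
  run
```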
So what were the results? Well, as you can see, the original default, fully virtualized, is very slow; that's all the way on the left here. With the defaults, just out of the box, we were doing pretty well on reads, just shy of 20,000 IOPS for that particular instance, while the write and mixed read/write workloads were over 5,000. When we set the cache to none, in the final two charts here, with threads and with native, you can see there wasn't much of a performance difference; if we had factored error bars into these graphs, the difference would probably be statistically insignificant. As far as cache modes go, you can see the highest performance was with the cache turned off: with the cache turned off on an all-SSD-based cluster, you saw a lot better performance. So if you were already running a cluster and had gone into your nova.conf and set the caching mode to writeback because your initial deployment was a bunch of spinning media, and then you added a high-performance tier, then because of that inability to set a different cache configuration for a different volume type, you would be sacrificing a decent amount of performance. We didn't see much of an increase from using dedicated dispatch threads, which is where you can have a completely separate I/O thread.

And I think, as I run back through these, you're probably seeing a common theme: reads are really stellar, but writes are low. More interesting is the fact that our mixed read/write workload is almost always the same as our write workload. This is where we began to suspect the big QEMU lock. When we have synchronous writes blocking the I/O thread, the subsequent reads are just waiting for them to come back. And any realistic workload is going to be mixed: when you're doing all those little InnoDB log updates and synchronously issuing fsyncs for them, and at the same time doing asynchronous reads to pull in InnoDB pages to answer SELECTs, those reads are going to be held up. This was surprising to us; we didn't think this was something we would see. We even tried increasing the number of queues by using virtio-scsi, and we did see much better performance: in fact, we were able to get over 35,000 IOPS from an RBD inside a guest, but that's a 100% read workload. So having additional I/O queues and being able to issue reads asynchronously did scale. But still, as soon as we had a mixed workload with synchronous writes going on, that was the limiting factor. So we went back and ran the same tests on bare metal with the kernel RBD driver, and then we took kernel RBD, put a file system on it, and passed it through to a container. And you'll see something interesting: the mixed read/write workload all of a sudden is not getting blocked, and the only difference is that we're not going through the QEMU I/O subsystem.
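For the curious, the bare-metal and container comparison amounted to roughly this; the pool, image, size, mount point, and container image are all made up:

```bash
# Map the RBD image with the kernel driver, put a file system on it, and hand
# it to a container, so QEMU's I/O subsystem is out of the data path entirely.
rbd create volumes/sysbench-vol --size 102400        # size in MB, i.e. 100 GB
rbd map volumes/sysbench-vol                         # shows up as e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mkdir -p /mnt/sysbench && mount -o noatime,nodiratime /dev/rbd0 /mnt/sysbench
docker run -d -v /mnt/sysbench:/var/lib/mysql mysql:5.6   # container just sees an XFS mount
```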
So, some conclusions. Performance is pretty good, right? There's been a lot of work put into QEMU to make it perform really well. At least on a 100% read workload, we're able to hit caps similar to what the public clouds impose; the public clouds cap their instances at 20,000 or 25,000 IOPS, you can't do more than that, and in some cases we're demonstrating over 30,000, close to 40,000 IOPS, at least on reads. That's still pretty impressive for a single instance. I think the UX is kind of weird for paravirtualized devices, in that you have to set properties on your Glance images. If you look at how Google Compute Engine works, they only do virtio-scsi. I think it would be really neat if I could say: always use virtio-scsi for all the instances booted in this cloud, and have that just be an operator decision. I don't think many tenants really understand the difference between virtio-blk and virtio-scsi and which one is more appropriate for them, or that they would have to choose between them and maintain separate Glance images. Am I really going to keep a RHEL virtio-blk image and a RHEL virtio-scsi image, separate Glance images with different properties? I'm probably just going to use one. So I thought that was a little awkward. The OpenStack cache configuration is not very flexible right now, and the impact on multiple volume types doesn't seem to have been considered; I think that's just the organic way it's grown. It used to be perfectly sensible when it was Nova and just nova-volume, and we've outgrown that, and we probably just need to adjust. The big QEMU lock limits mixed workloads where you have O_DIRECT and synchronous writes. This is particularly evident in InnoDB-type workloads like the ones we're showing here with sysbench. The AIO and data-plane multi-queue work probably helps other workloads a lot, particularly if they're mostly asynchronous, but in the case of databases I think we're being held back a little bit.

So what do I want to test next? I want to try virtio-scsi with blk-mq: there's a way, when you're booting the instance, to tell the guest to use blk-mq for the virtio-scsi device, and then, if you have multiple I/O queues or I/O threads and vectors for your virtio-scsi device, we might see better parallelization there. One thing I recently found out about is vhost-scsi, and what it does is, instead of using QEMU's I/O subsystem, it goes back to LIO in the kernel and uses the kernel for processing the I/O. It turns out there's some work being done on integrating krbd with LIO, so there may be a way to leverage that work and vhost-scsi to bypass the QEMU backend with its userland librbd and just use the kernel krbd implementation, without having to do it inside the guest, because we don't want to expose the storage network to the tenant networks. Maybe that way we could get the performance levels we saw here on bare metal and in the containers inside QEMU. And then, if this does work, what are the impacts on live migration? And because the kernel is going to be consuming CPU with all these kernel threads, does that show up somewhere? If there's a lot of guest I/O processing, does it manifest as CPU steal for the guests? And then finally, something pretty cool that was done this summer: there's MyRocks, a version of MySQL from Facebook that uses RocksDB instead of InnoDB as the backend. An intern at Red Hat over the summer made it so that RocksDB can store its write-ahead log and its pages directly in RADOS. So you could potentially run a MySQL database without a block device or a file system at all; it would just be talking to a mutable object store. And you could completely eliminate the doublewrite buffer, because RADOS is already ensuring atomic updates, since it has to be able to support rollbacks anyway. So that could be interesting.

So that's all I have for today. Thank you everybody for coming. And if anyone has any questions? Yeah, the question was: when you want to have different volume types with different cache modes, is there a way to do it with different host groups or regions? That's a good question, I'm not sure. Yes, sir. When it comes to virtio-scsi or virtio-blk, is there any guest driver support or any setting that needs to be done in the guest OS, or is the change completely invisible to the guest, for typical guest OS distributions, not only Red Hat but others? That's a good question. Yeah, I'm not sure. I think most guests support virtio-scsi. VDA, yeah, VDA changes to SDA. SDA, but SDA is supposed to be the normal SCSI, not virtio SCSI. Okay, but maybe you have tested: is it true that the device name at least changes? I thought it would remain the same, because SDA is bare-metal SCSI, not virtio. Okay, thanks. Great, thank you.