All right. Hello everybody. My name is John Kohler, I'm a principal architect at Nutanix, and I'm going to talk about optimizing KVM performance for electronic health record systems. Although the talk really isn't health-record specific, so a lot of the methodologies and items we'll be talking about are valid for everybody.

Quick agenda: we're going to talk about what EHR systems are, just for context setting so everybody gets the acronym, and then dive into a technical example. This is a technical conference, this is going to be a highly technical talk, and I'll try to pace it out where possible. The practical example will focus on core KVM, and then we've also got some related ecosystem enhancements that we'll be talking about as well.

So what's an EHR system? EHR stands for electronic health record. If you've ever been to a doctor and seen them type up your record digitally, they're probably running a system like the ones on the left here. From the IT infrastructure side, there are a few things to know. One is that the ISV, the software vendor, mandates the application architecture. So as good an architect as any of us might be, it flows downhill: they tell you what the system needs to be and what the core system's behavior is going to be. A couple of key tenets here: the provisioning and scalability of these systems is inelastic. They rarely get smaller, and they live on forever, so mistakes tend to have a long half-life. The other thing, just to set your mind in the right place, is that people touch these systems 24/7, and some of these use cases are life-critical. So it goes without saying that tolerance for the system running slow, or the wheel spinning, is pretty much zero. In that kind of clinical setting, you often can't just come back to the computer later when the system comes back up. When you translate all of that down to the infrastructure level, it means you've really got to pay attention to the details.

So the moniker here is: it takes a village. When we talk about optimizing performance for these platforms on top of KVM, we have to look at both KVM and all the other components that plug into KVM. From the Nutanix perspective, we're just going to be focusing on our hypervisor, which is powered by KVM.

On to the practical example. Maybe six months ago, I got a report from one of our EHR partners that one of the benchmarks they were running on top of our platform was slower than on another platform. Typical A/B testing: the hardware is the same, the processors are the same, it's just not as quick on our platform. These are the types of things we do with those vendors to make sure we figure this stuff out at an engineering level before customers come and yell at us. The benchmark we're going to be talking about here is a pretty typical LoadRunner-style benchmark, where you have a system — this one happens to be running Windows — and it loads up synthetic users, does some work with the application, and measures the response time as it ramps up.

From the core systems perspective, there are a lot of different ways to measure what's going on under the covers. I personally am very visual: I love flame graphs and Linux perf. If anybody in the audience or online is new to flame graphs, I have a copy-paste example of how to generate this exact type of graph in the appendix.
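The gist of that appendix recipe is just a couple of commands. A minimal sketch, assuming Brendan Gregg's FlameGraph scripts are cloned locally; the sampling frequency and 30-second capture window are illustrative, not the exact appendix settings:

```sh
# System-wide, on-CPU profiling with call graphs for ~30 seconds
perf record -a -g -F 99 -- sleep 30

# Fold the stacks and render an interactive SVG flame graph
git clone https://github.com/brendangregg/FlameGraph
perf script | ./FlameGraph/stackcollapse-perf.pl \
            | ./FlameGraph/flamegraph.pl > kvm-flamegraph.svg
```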
It's dead simple, so you can get it from the slide deck. So the simple question, which is rhetorical: what's wrong with this picture? There are four key things wrong with it.

One is that the stack showing the IPI delivery path call chain is not taking any fast paths. You can see it's actually occurring outside of the vcpu_run loop, which is not great. And we can infer that any slowness we talk about in the next couple of points is being incurred on that path as well, meaning IPI delivery is slower than it should be.

The second piece: when you're looking at a flame graph, any time you see a wide swath of a tabletop like this one, your alarm bells should go off, because it means that function is spending a lot of time in its own code or in inlined functions, rather than going through the rest of the call chain. So each pass of vcpu_run is expensive, which is not expected.

You can't see the name in here, but if you go to the raw flame graph you can click, highlight, and look at it: about 1% of all samples in this trace are due to the xsave guest and host load paths, which is odd.

And the last one is this absolutely, hilariously large flat top for speculation control. It's just ridiculous. So we're going to get into where that comes from and how we can whack it.

Let's dive in a little bit; I'll build this out. For the first issue — again, remember this benchmark is running on Windows — the IPI delivery path is different for Windows than for Linux. Without going into the details (you can go see the many talks by Vitaly on Windows enlightenments), one of the Windows enlightenments for emulating Hyper-V on top of KVM is called SynIC, the Synthetic Interrupt Controller, and its default enables something called Auto-EOI (auto end-of-interrupt) paravirtualization, which is a mouthful, and it disables hardware acceleration like Intel APICv. That's what we had configured here. It works all right — it's better than not having it enabled — but it's still less than ideal.

In 5.15, support was added for an enlightenment called hv-avic, which enables hardware acceleration. However, for small Windows guests under 240 vCPUs, the exit reason is actually APIC_WRITE instead of MSR_WRITE, because they don't use the x2APIC ICR MSRs, so they don't get a fast path. That means any time an IPI is issued, even though the APIC IPI writes are trap-like, it has to go all the way out of the run loop and through the full handler, leading to further delays. So the fix here, for this specific benchmark, is to switch over to hardware-accelerated IPIs in Windows — the enlightenment configuration is sketched below — and we've also introduced a new APIC-write fast-path handler, which I have the code for in the appendix. I've still got to send it upstream, but it's incredibly simple: it just looks at the exit reason and handles it faster.

For the rest of these items, I'm going to call out how many samples they are in the graph itself. The IPI overhead isn't really a sample overhead from a flame graph perspective; it's just coming out of the wrong place. The rest of these are pure overheads.
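For context, the knobs in question are QEMU's Hyper-V enlightenment CPU flags. A minimal sketch of a Windows guest configured along these lines, assuming a QEMU/kernel combination new enough to have hv-avic; the exact flag set is illustrative, not our shipping configuration:

```sh
# Hyper-V enlightenments on the guest CPU model; hv-avic advertises the
# AutoEOI-deprecation bit so Windows keeps hardware APIC virtualization
# (APICv/AVIC) instead of falling back to the SynIC Auto-EOI path.
qemu-system-x86_64 \
  -machine q35,accel=kvm \
  -cpu host,hv-relaxed,hv-vapic,hv-time,hv-synic,hv-stimer,hv-avic
  # plus the usual memory/disk/NIC options
```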
So that vcpu_run overhead we described in the graph — the long tabletop — is multi-fold. One: keep in mind this is Ice Lake-based hardware, which from a mitigations perspective has enhanced IBRS (eIBRS), a set-once-and-forget-it type of configuration. Long story short, the way that enablement works in the kernel today, when the guest enables it, we disable the MSR bitmap interception, so every single time the kernel exits it has to issue an expensive RDMSR on SPEC_CTRL, because the guest may have changed it — even though the guest will never change it. There's also a regression here in the DEBUGCTL restore mechanism. Architecturally, the DEBUGCTL MSR is zeroed on VM exit, so if you had set it on entry, you have to restore it on exit to get back to the host state. This regressed in 5.17 and below, where it's setting it even though we at the host level didn't actually set it. So there are two fixes here: one is fixing the enablement and interception path for eIBRS, and the second is reverting the offending commit. I've got the details for every single one of these, including commits, in the appendix; I'll keep it at a summary level for the sake of time.

So we've got a bunch of cool stuff to talk about. The third bit is the xsave overhead. This one's kind of interesting — a bit of a mystery, and it took us a minute to figure out. If you look at the assembly — or I should say the disassembly, which you can do with perf top — you can see that the cost of this overhead is exclusively limited to entry and exit doing an xsetbv on the XCR0 feature mask every single time. That's because at Nutanix, the control plane we have will automatically mask out the MPX and PKU features from the guest, which conveniently are XSAVE-managed features, which means the host view of XCR0 and the guest view of XCR0 are different. Therefore, you've got to flip them back and forth constantly. Unfortunately, it's not enough to just compile the features out via a kernel config change; there's some early initialization code we've got to mess with to get this to work properly, so that the host mask actually matches our guest mask. We can't change the automatic masking itself, because we're not going to expose MPX or PKU to our guests — we're just not going to do it. So we've got to fix the early initialization code, which we've got a sample for; a quick way to see the symptom is sketched below.

And lastly, the fourth fix, which is actually the easiest of all — it gets me so giddy, because the upside is so large — is essentially another guest-versus-host mismatch. The issue comes from a subtle nuance: the way mitigations are set up in kernels below 5.16, combined with QEMU 2.11 and higher. QEMU 2.11 and higher turns on seccomp sandboxing by default, and below kernel 5.16, any seccomp jail automatically gets over-pessimized from a mitigations perspective. So from the host's perspective it looks like you have mitigations cranked all the way up, and even though the guest might have eIBRS or whatever configured, there will always be a mismatch, and you will always have to do a WRMSR. The really devious thing is that a WRMSR to SPEC_CTRL completely stalls the CPU pipeline: it has to flush everything, all the speculative state, and it's really expensive. (You can check whether a given process is in that state — see the second sketch below.)
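To make the XCR0 mismatch concrete: masking XSAVE-managed features off the guest CPU model is what splits the two views. A hypothetical setup that reproduces the shape of the problem — the QEMU flags are real CPU properties, but this invocation is illustrative, not our control plane:

```sh
# Guest CPU model with MPX and PKU masked off, while the host still
# enables them in XCR0 -- this mismatch forces an xsetbv swap of XCR0
# on every VM entry and exit.
qemu-system-x86_64 -machine q35,accel=kvm -cpu host,-mpx,-pku

# The symptom: hot xsetbv instructions in the annotated disassembly of
# the xsave load/restore helpers (function names as of recent kernels):
perf top
#  -> annotate kvm_load_guest_xsave_state / kvm_load_host_xsave_state
```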
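And a quick way to see whether a seccomp jail has dragged your QEMU process into the always-mitigate state is to read its speculation fields out of /proc; a minimal sketch, assuming a running QEMU process (field names as found on recent kernels):

```sh
# Host-wide Spectre v2 mitigation status
cat /sys/devices/system/cpu/vulnerabilities/spectre_v2

# Per-process speculation state; on pre-5.16 kernels a seccomp jail
# reports these as force disabled, i.e. mitigations pinned on, which
# guarantees the SPEC_CTRL mismatch and the WRMSR on every exit
grep Speculation /proc/"$(pgrep -of qemu)"/status
```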
So thanks to our friends at Red Hat, who changed the default in 5.16 and above to de-pessimize seccomp jails, because really, that behavior was all window dressing. The commit message for that is really detailed, and it's in the appendix.

On to the spoiler results. After these optimizations, the flame graph looks a lot better. We've got a much shorter tabletop in vcpu_run, and the call chain for handle_apic_write doing the IPI delivery is in the vcpu_run loop itself, which is great — it means IPI delivery is happening faster. Building this out a little: as described, we switched over to hardware-accelerated IPIs and added the new fast-path handler, and we suppressed the expensive RDMSRs on SPEC_CTRL and reverted the offending commit to shorten the vcpu_run tabletop. The xsave overhead is still there — it's just a more nuanced fix — but when we get that done, we'll whack another 1% off and get some additional headroom. And you can see that after the backport, the overhead for speculation control is completely gone, which is great.

On to the actual results from the application. I've anonymized this to protect the innocent. The reddish-orange line on top — whatever it's coming across as — is the baseline, and the bluish-green one is the after. A couple of key things to note: the little dashed line is the SLA ceiling, a fixed SLA threshold; the y-axis is response time, the x-axis is users per core. If you measure where each line ends up hitting the SLA ceiling, we have roughly 14% better density with these improvements. So from the user's perspective: just apply a hypervisor patch, and boom, you've got a 14% better system. The other thing I like to look at is that the steepness of the line is quite a bit less, so as we degrade past the SLA line, response time climbs more gently. If you were to compare, say, 10 users per core, the gap between the two at the previous rate would be astronomical. So it's quite a bit better response time and, frankly, better tail latency. That's a good way to put it.

But as I mentioned at the beginning of the talk, I realize not everybody has to deal with healthcare applications, so let's talk about this in a more generic way, using a benchmark from LoginVSI, which emulates virtual desktops. Same load-runner idea: you've got a bunch of VMs on a host running Office on a Windows desktop and so on, and it measures opening Excel, opening Internet Explorer, doing different operations, and records the response time. We ran this on an Ice Lake-based system as well. We're comparing our previous release, which is 5.4-based, against our current release, which is 5.10-based, with hardware acceleration for the IPIs left out, because there are some enablement issues there that we'll talk about. And you can see this really awesome graph from LoginVSI — I swear I can't change the color scheme, so it's blue and blue; that's just the way it outputs. The top line, which climbs up into the ceiling, is the previous release. The red line is the maximum ceiling for response time. And the bottom line is our current release.
So you'll note very clearly that at 320 VMs per host, we don't hit the SLA ceiling at all, meaning that from a customer perspective, if you had 300-something VMs per host, you'd get an even better experience — better tail latency at full load — which is great.

On to these nice colorful graphs, whose colors I can control. We can see that what's called VSIbase, which is a measurement of the lowest single-session response time, is quite a bit lower: about 9%. That means that even if you're not fully loaded, these optimizations give you about 9% better user experience, which is awesome. The other thing we looked at: let's take away some of the automated ceiling logic this benchmark uses and apply a fixed SLA, like the EHR benchmark does. So instead of the SLA sitting up at around 1800 milliseconds, we set it at 1200 milliseconds for this benchmark's response time, and you can see quite a difference. Looking at the x-axis, which is active sessions per host, one line is below 256 and the other is far above 256. The actual numbers, using the fixed threshold, are 298 sessions — almost 300 — with the optimized kernel and 245 sessions with the unoptimized kernel. That's a 22% increase in session density at a fixed response-time SLA, which is pretty good.

Zooming out to some related ecosystem enhancements. As I mentioned, it takes a village. In our hypervisor we run Open vSwitch for networking, and there are both Windows and Linux guests, so I want to talk through some findings and methodologies we've seen in these areas. The first one, in Open vSwitch, is a thundering herd problem. I'll try not to read through this — I put this big dump of detail up really for slide consumers — but the short of it is that OVS has a thundering-herd wakeup problem. The challenge for CPU-sensitive workloads like EHR — but it could be anything, SAP HANA or other database workloads — is that when that thundering herd blasts off, if you have anything in a halt-polling loop, one of the exit conditions for KVM halt polling is single_task_running(). So you'd have stuff in your run queue, and you could get less-than-ideal polling, which is what we saw. The root cause was a change back in the day to optimize the wakeup pattern for OVS; it was all well-intentioned, but it actually turned the wakeups into a thundering herd.

We found this in an interesting way: using Google SchedViz. If you've never used SchedViz — I know there are plenty of Googlers here, so they may have — it is really, really quite slick. Visualizing things like wakeups, migrations, and the interactions of the low-level scheduler itself is incredibly hard. If you've ever tried to ftrace it, you get massive output, and it's really difficult to visualize and understand what's doing what. SchedViz does all of that for you; there's a sketch of collecting a comparable trace below. So here we've taken a SchedViz trace from a host. The details aren't super important, but there's a little search command; we search for the handler threads — those are the OVS handler threads — visualize them with this color-coded rainbow doodad on the left, and bang: there's a thundering herd. All of these threads wake up at exactly the same time and kick all of the applications off the cores, and that's bad.
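If you want to reproduce this kind of analysis, SchedViz builds on standard ftrace scheduler events. A minimal sketch of grabbing those events with trace-cmd, plus the KVM halt-polling knob involved in the exit condition; the 10-second window is illustrative:

```sh
# KVM's halt-polling window in nanoseconds; this is the polling that
# the thundering herd disturbs (0 disables polling entirely)
cat /sys/module/kvm/parameters/halt_poll_ns

# Capture scheduler switches, wakeups, and migrations for ~10 seconds
trace-cmd record -e sched:sched_switch -e sched:sched_wakeup \
                 -e sched:sched_migrate_task sleep 10
trace-cmd report | head
```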
Now, that is fixable, with both a kernel fix and an OVS fix. It's one of those things that has to be done on both sides, or it doesn't work. I have the links to the commit series in the previous slide. But the net of it is a 28x — 28x! — reduction in wakeups, which is awesome for latency-sensitive workloads. So if you're running the affected kernel and OVS versions, you want to get this patch on board ASAP.

Another issue, which I upstreamed — it actually just got committed yesterday, conveniently — is that OVS communicates with the kernel over netlink to gather some statistics; actually, excuse me, it communicates with the kernel over netlink to do all sorts of different things. One feature of netlink is that it automatically gathers all these statistics, which is all well and good, but on really big systems — like the quad-socket or eight-socket systems you might find running SAP HANA — some of those calls in the kernel are linear in complexity in the number of cores, and they're really expensive. To the point where ovs-vswitchd just sits on the CPU constantly and eats a core, and when you have CPU-sensitive applications like EHR, the loss of a core is expensive, especially when you're thinking about it at scale. So there are two key improvements here. One is the change we just committed to OVS yesterday to align some of the stats gathering, which reduces CPU in ovs-vswitchd; and there's also a kernel commit to inline some of the expensive stats-gathering calls, which helps but doesn't completely eliminate the issue. It's another one of those cases where you need both to actually whack this on large systems. And in true form — I love my flame graphs — when you take a flame graph, this issue stands out like a sore thumb. You can see that OVS's bridge_run function here is doing all sorts of whatever, and it's spending all this time in inet_fill_link_af, getting stats. But the thing is, it doesn't actually read any of those stats when the call returns, so it's completely pure overhead. We get rid of that with the commits mentioned, and wouldn't you know it, the little purple search field in the flame graph goes down from 42.2% of samples to 9.4%. A big-time tax cut.

Last two bits here. One of them is Linux virtio-related. This isn't throwing tomatoes at our friends at Red Hat, but prior to RHEL 7.8, multi-queue didn't actually work for virtio-scsi: if you had a multi-queue-enabled backend, it would just funnel everything to one queue. You can see here that we use vhost-user-scsi at Nutanix to do out-of-tree SCSI management. That process happens to be called Frodo — Frodo is the keeper of the rings, yes, that's where the name comes from. And you can see in the top trace that, very simply, one thread is doing all the work, and that's ridiculous. So we reported that up to Red Hat, and thankfully it was just a simple cherry-pick on their side. So RHEL 7.8 and above now works much better, which is great; there's a quick in-guest check sketched below.
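A quick way to sanity-check whether a virtio-scsi guest is actually spreading work across queues is to look at the per-queue interrupt counters inside the guest; a minimal sketch, noting that device and disk naming varies by distro and driver version:

```sh
# One interrupt line per virtio-scsi request queue; all of the
# activity landing on a single row is the "one thread does all the
# work" symptom from the trace
grep virtio /proc/interrupts

# Block-layer view of how many hardware contexts the device exposes
ls /sys/block/sda/mq/
```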
The last bit I have, before I get into some of the appendix material — if I still have time, and I'm running short on it — is large IO on Windows. This is one subject I'm incredibly passionate about. One of the benchmarks that one of our EHR partners has basically goes through and does restore, backup, some ETL stuff, and emulates a database workload. And large IOs in the backup and restore sections didn't work very well. We ran into some issues getting them to work cleanly and actually arrive at our storage backend as a single one-megabyte IO instead of a concatenation of 256 KB IOs. The fix there was multi-fold: there are three different commits that have taken a whack at making large IOs work in the driver, which is great. The last one, which went upstream earlier this year, is one we put up to fix the maximum transfer length handling, so we can cleanly get a one-megabyte IO, not a one-megabyte-plus-512-KB IO or something.

And the money slide is this one: the requirement for this application is to be able to back up at line speed, and now we can back up out of this application at line rate on a 100-gig Mellanox CX-6. The benchmark comes out nice, clean, and flat with 512 KB write/restore IOs, and it's hitting line limitations there at a "paltry" 12 to 13 gigabytes per second. Blistering. The last thing I'll bring up — I know I've just got a minute — is that technically this actually breaks the virtio spec, because of the way the Windows driver uses indirect descriptors; long story short, you're not supposed to be able to do that. So we've got a PR upstream to fix that in the virtio spec, and we'll route that back and make it work. But I think that's the last of it.

So I've probably got maybe 30 seconds left, if anybody wants to do a speed-run question, or catch me afterwards. Yes, sir. [Audience question.] Yes, so — I don't have the slides in here, but we have an extensive performance CI at Nutanix that looks for these types of things, and the EHR benchmarks are a part of it. But it's one of those things where benchmark results might look good in hindsight without also having CI across competitive platforms, so some of these late-binding things can still come in. The question was, do you have a CI for that, and the answer is yes, we do. Well, I think we hit it exactly on the mark — 25 minutes. So that's all I've got. There are appendix slides with all the details, blah, blah, blah, but you can get those later. Cool.