Hi everyone, my name is David Vernet, and I'm joined by Song Liu. Today we're going to be talking about kernel live patching at Meta: what is this awesome feature, kernel live patching, that's available in the Linux kernel, and how and why do we use it at a company like Meta? So let's go ahead and get started.

The talk will go roughly like this. We'll start with a history of how fixes have been applied to the kernel in the past, essentially the evolution of this over the lifetime of the kernel. We'll talk about KLP, of course: how it works, and what its pros and cons are. Then we'll move on to the ecosystem around livepatch, meaning the tools available in industry to make live patching easier and to assist you as a developer in creating and using live patches. We'll talk about our experience using livepatch at scale at Meta, including how we use it and what it's enabled for us. We'll then go over some of the challenges we've encountered and some of the fixes we've added upstream. And finally, we'll talk about ongoing and future work on livepatch, which of course anybody else is welcome to join in on as well.

With that said, let's get started. The canonical way to apply a fix, or to upgrade the kernel in general, is with a full reboot. We've all done this: you install a new kernel and do a full reboot of the machine. The kernel is loaded via the bootloader, UEFI firmware, whatever. The kernel then does a full boot: it initializes all of its subsystems, powers on the devices on the host, allocates memory, and does whatever else it needs to do. Then the system starts to run.

Okay, so obviously there are pros and cons to this. On the pro side, a full reboot is simple. The system is shut down, and when you reboot, everything starts from a fresh state. There's no possible memory corruption, and there are no running tasks or concurrency issues to worry about. All the devices start from a powered-off state. The hardware also runs POST (power-on self-test) checks, so the hardware is validated, at least at some level, before the kernel starts to run, verifying that memory isn't corrupted or anything like that.

Now, while this is simple, and it's the safest method, so to speak, of applying fixes or upgrading your kernel, the con is that it's really, really slow. You have to migrate all workloads off the host; the host is going down, so you can't have any traffic going to it. This requires downtime. POST checks can be really slow, especially on large machines with tons of memory. And once the machine and the kernel have booted, you have to re-warm your caches and ramp traffic back up slowly but surely. It's slow, and it requires a lot of time.

So, to address the slowness of things like power-on self-test checks, kexec was added to the kernel. The way kexec works is that, just as with any other kernel upgrade, you install a new kernel, but rather than rebooting the host, you kexec into the new kernel. Tasks are killed on the system, devices are powered down, and the new kernel image is loaded into memory.
That new kernel starts executing and boots as though the host itself had been rebooted: devices are powered back on, the kernel allocates memory, and it does everything it would do in a normal boot cycle. But you skip the power-on self-test checks and, in general, the latency of rebooting. Now of course this is faster than a reboot, which is a big plus: as I said, you skip the POST checks, and it's still relatively simple, since the kernel does a full boot and no tasks are running while that boot happens. The cons are that it's still a lot more complicated than a full reboot. You have to make sure the devices on the system support going through a full lifecycle (powering off and powering back on), and that can require a lot of work with vendors to make sure it's supported and done correctly. There is still downtime with kexec as well: the host isn't going down, but the kernel is shutting down, so you have to migrate your workloads off, and in that regard you have the same latencies as you did with the full upgrade and reboot.

Okay, so these approaches work, but they're slow. What can we do about the slowness? We can use KLP. So what exactly is KLP? Let's talk about that before we get to its advantages and disadvantages. If you have a kernel function, the first instruction in that function is going to be a nop, a no-op instruction. When the CPU encounters a nop, it just skips over it and moves on to the next instruction; it's useless on its own, just a placeholder for doing things like this. Using ftrace, that first instruction can be rewritten to be a call out to an ftrace handler, and that handler can in turn call out to a replacement function that you implement as a live patch. So basically you jump over to an ftrace handler, and then, once it's safe to transition to the new function (which we'll talk about more in a second), ftrace will call out to your replacement function and completely circumvent the original function's logic. So you've patched the kernel by having a new function that you call out to with ftrace. Great.

This is obviously a totally different paradigm for kernel upgrades compared to kexec or rebooting. It happens at runtime, in place: no tasks are shut down, and no reboot has to happen. So what are live patches? Technically, they're special kinds of kernel modules: when you apply a live patch, what you're really doing is loading a kernel module, and the module's init hook is where the patch gets applied. And we mentioned ftrace; this is indeed how it's done. The IPMODIFY ftrace flag is used to replace the existing kernel function in the way we just showed.

Okay, let's talk about the pros and the cons. The pros are that it's super fast. At Meta, applying a live patch usually takes on the order of one to two seconds per host. That's for a single host, obviously, not the whole fleet of servers, but one to two seconds for a host is really, really fast compared to even kexec. It doesn't require any downtime or workload migration: you just apply the live patch and it's live. It's lightweight enough that no perturbation is detected on the system. We'll talk in a little bit about how there actually can be some perturbation, but those are bugs that need to be fixed.
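To make that concrete, here is a minimal live patch module, following the upstream samples/livepatch/livepatch-sample.c example almost verbatim (it replaces cmdline_proc_show() in vmlinux, which backs /proc/cmdline; that target function is the upstream sample's choice, not something from Meta's production patches):

```c
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/livepatch.h>
#include <linux/seq_file.h>

/* The replacement function: once the patch is applied, the ftrace
 * handler redirects calls to cmdline_proc_show() here instead. */
static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
{
	seq_printf(m, "%s\n", "this has been live patched");
	return 0;
}

static struct klp_func funcs[] = {
	{
		.old_name = "cmdline_proc_show",
		.new_func = livepatch_cmdline_proc_show,
	}, { }
};

static struct klp_object objs[] = {
	{
		/* name == NULL means the patched function is in vmlinux */
		.funcs = funcs,
	}, { }
};

static struct klp_patch patch = {
	.mod = THIS_MODULE,
	.objs = objs,
};

/* The module init hook is where the patch is applied. */
static int livepatch_init(void)
{
	return klp_enable_patch(&patch);
}

static void livepatch_exit(void)
{
}

module_init(livepatch_init);
module_exit(livepatch_exit);
MODULE_LICENSE("GPL");
MODULE_INFO(livepatch, "Y");
```

Loading this module with insmod applies the patch; the MODULE_INFO(livepatch, "Y") tag is what marks the module as a live patch to the kernel.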
So on the cons side, there are limitations to what can be patched with livepatch. Live patching patches text, i.e. executable instructions in the kernel. You cannot patch data, so if a fix requires changing the layout of a struct or something like that, you can't live patch it directly. What you'd have to do is something like jump to a different function that does, say, a lazy allocation of a different type of struct, and have callers use that pointer instead. That's not something you'd upstream, but you can sometimes get creative with how you apply live patches. If you're just missing a NULL check, on the other hand, that's purely live patchable; that's the canonical use case (there's a sketch of it after the ecosystem overview below). The same goes for a missing memset(), say if you forgot to zero out a stack variable. Another con is that, in addition to these limitations, extra engineering work is usually required. It's not as simple as compiling a live patch and knowing it'll be safe to apply. These are kernel modules: you can break things if you're not careful, and there are no guarantees that the patch itself is correct. So, as we'll discuss when we get to the livepatch ecosystem, you need a kernel professional who understands everything very well to look at the patch, look at the instructions, and really verify that it's correct and nothing is going to break. It is a lot of overhead, but the pros are significant as well.

So we already looked at this: the nop instruction that we patch to call out to ftrace. This slide shows the instructions without the live patch, with the nop instruction here; with the live patch, we overwrite it with a call instruction that calls out to the ftrace handler. Pretty simple, super useful stuff.

Livepatch can also patch multiple functions atomically, per task. Each task has a state recording whether it has been patched for a given instance of applying a live patch, and what this means is that the livepatch subsystem guarantees a patch is applied atomically to each task. For example, you don't want a task calling the patched version of one function while still using the unpatched version of another; maybe you're changing how locks are acquired across those functions, and you could break things if the switch weren't atomic. So livepatch iterates over every task in the system and walks its call stack to detect whether it's safe to transition that task. And if it fails to transition a task, the livepatch subsystem, like any good kernel subsystem would, rolls everything back and guarantees the state from before the patch was applied. Which is pretty cool, because it's a pretty complex operation. Okay, so that's KLP. Now I'll hand it off to Song to talk about the ecosystem around livepatch.

Thanks, David. KLP is actually pretty common among enterprise distros. The primary goal for enterprise users is to fix kernel bugs without rebooting the system. The most popular solutions are kpatch from Red Hat, kGraft from SUSE, and Ksplice from Oracle. Here I want to thank the Red Hat folks for sharing and supporting the kpatch tools, which are great. There are also third-party solution providers who build patches for CVEs. To our knowledge, all of these solutions have a similar internal mechanism but different tooling.
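Before moving on to how we use KLP at scale, here's the sketch promised earlier of the canonical live-patchable fix, a missing NULL check. Everything here is hypothetical and just for illustration: buggy_lookup(), struct item, and the field name are made up, and a real patch would carry a full copy of the original function body with the one-line fix added:

```c
#include <linux/livepatch.h>

/* Hypothetical struct; its layout is unchanged by the patch. */
struct item {
	const char *name;
};

/*
 * Original (buggy) function, for reference:
 *
 *     const char *buggy_lookup(struct item *it)
 *     {
 *             return it->name;        // oops: crashes if it == NULL
 *     }
 */

/* Replacement: identical body plus the missing check. Only text
 * changes, no data or struct layout changes, which is why this
 * kind of fix is purely live patchable. */
static const char *livepatch_buggy_lookup(struct item *it)
{
	if (!it)
		return NULL;            /* the one-line fix */
	return it->name;
}

static struct klp_func funcs[] = {
	{
		.old_name = "buggy_lookup",
		.new_func = livepatch_buggy_lookup,
	}, { }
};
```

The funcs array would then be wired into a klp_object and klp_patch exactly as in the earlier sample module.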
So how do we use KLP at scale? First, why? At Meta, we don't have a problem rebooting any specific system; we have a problem in that kernel upgrades are not fast enough. Typically it takes us a month to upgrade the compute tiers, and potentially much longer for the storage tiers. As a result, we cannot rely on rebooting for urgent kernel fixes, and this means sometimes we have to run kernels with known bugs. Sometimes we have to debug the same issue multiple times, because it may not hit the different tiers at the same time. Occasionally we have to cut a fix-only kernel release just to fix a bug, and such releases delay the actual feature releases. The solution is KLP, which can roll out fixes much faster.

Another case where we found KLP helpful is debugging tricky issues. Modern kernels provide a lot of tracing tools, but there are still cases where tracing is not straightforward. For example, if we want to trace the exact condition in the middle of a function, it's not easy. KLP, on the other hand, can add a printk() or maybe a traceable function at the exact location we want to trace, and that makes tracing much easier. The other case where we found KLP very helpful is when an issue is hard to reproduce. When we finally get a live repro of the issue, we can use KLP to make small changes without resetting the system state. This makes the debugging much more efficient.

So how do we do KLP at Meta? We push for a homogeneous configuration, which means the kernel team owns the KLP, and users cannot decide which fixes they want or don't want. The other side of this is that we need to make sure the KLP does not introduce any new problems. The homogeneous configuration allows us to do cumulative patching, which means we combine all the fixes we need for a kernel into one KLP module. This greatly reduces the test matrix we need to cover. The kernel provides a replace flag, which allows us to attach the new KLP and detach the old KLP in one atomic KLP transition (there's a sketch of this at the end of this section). This eliminates the vulnerable window between unloading the old KLP and loading the new one.

At Meta, KLP rollout is 100% managed by automation. We package the KLP module in an RPM, something like "KLP for kernel X, hotfix Y", and we manage these RPMs just like any other RPMs. When we load the KLP, we use health checks to detect any issues. Specifically, we compare the kernel with the new KLP against the same kernel with the old KLP, and we check for issues like new crashes, increased error rates, or KLP transition failures. If any of these metrics looks bad, the automation stops the rollout and brings it to the attention of an engineer.

This figure shows what happens when we roll out a new kernel. When we start a new kernel, we always start small, with a release candidate, and we gradually increase the deployment of the kernel. Before we had KLP, if we hit any bug, we had to go back to the first release candidate. With KLP, on the other hand, we don't have to do that: if there's a bug, we load a KLP and move on. This gives us a much more predictable kernel release cadence.

To recap, KLP has helped a lot. Right now we have millions of servers running with KLP. Typically it takes us a week to roll out a new KLP, and in extreme cases we can patch the whole fleet in hours. We've found KLP essential to keeping the fleet healthy: we have eliminated most fix-only kernel releases, and we don't have to debug the same issues again.
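Here's a minimal sketch of what the cumulative "replace" patch just described looks like at the source level. It's the same struct klp_patch from the earlier sample with the replace flag set; objs stands in for the combined set of fixes for the kernel in question:

```c
static struct klp_patch patch = {
	.mod  = THIS_MODULE,
	.objs = objs,           /* all fixes for this kernel, combined */
	/*
	 * With .replace set, enabling this patch atomically replaces
	 * any previously applied live patch in a single KLP
	 * transition, so there is no window where the old patch has
	 * been removed but the new one is not yet applied.
	 */
	.replace = true,
};
```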
OK, so Song just spoke about the livepatch ecosystem and how we use KLP at Meta at large scale. Now let's talk about some of the challenges we've encountered when using livepatch, and some of the things we fixed and upstreamed as part of doing that. So yeah, livepatch is awesome, as I said, but it still has some sharp edges. There are small performance issues which, independently or at small scale, may not be a huge problem, but at larger scale, with a ton of hosts, they become serious performance problems. We've encountered conflicts with the tracing subsystem due to the use of ftrace and the IPMODIFY flag. And we've also observed that KLP sometimes fails to transition tasks, due to things like tasks not yielding the CPU, which we'll look at in a second.

To start with, I want to talk about a bug we noticed when applying a live patch to the Meta fleet. We saw a short uptick in missed I/O, and this would set off alarms all across the company. For example, one of the alarms was a higher TCP retransmit rate, which indicates that TCP packets are being dropped. Another was higher I/O and fsync latency, so we were getting alarms from some of our database workloads and the like. Now, these spikes in retransmits and fsync latencies only lasted one to two seconds, but across millions of hosts that results in alarms going off all over the place and people freaking out, and it causes problems.

So what was the problem? It turned out that the insmod task (the task in Linux that inserts a module into the system; and as we know, kernel live patches are modules, so insmod is what applies the patch) was hogging the CPU and starving ksoftirqd from being able to run. The kernel we were using was compiled without preemption, which means that a long-running task in the kernel will not be preempted by another task unless it yields the CPU. So we would load the KLP module and start doing relocations. For relocations, you have to look up symbols to figure out the target address of each relocation, and because there are a lot of symbols in the kernel, this would starve ksoftirqd. For anybody who's not aware, ksoftirqd is a kernel task that services interrupt work when there are so many interrupts on the system that they can't all be serviced in the bottom-half handlers; the leftover work is deferred to ksoftirqd, which is done to avoid starving user space by doing nothing but handling interrupts. So if ksoftirqd doesn't get scheduled, networking packets may not get serviced, and I/O completion interrupts from storage devices may not get serviced. Essentially, interrupts aren't serviced, the system slows down, and it can experience all sorts of problems.

The fix was a very simple one: adding a cond_resched() call on the symbol lookup path. This is pretty common in the kernel; any place you're doing something in a loop that could take a long time, you have to add a cond_resched() for exactly this reason. That fix was upstreamed. One fun fact: we didn't want to wait for the fix to arrive through real kernel upgrades as we upgraded the fleet, because we wanted to be able to apply more kernel fixes via live patch before that happened. So we decided to roll the fix out as a KLP. So we actually patched the livepatch subsystem with a live patch. We experienced the same retransmit bumps and ksoftirqd starvation during that rollout, but for subsequent rollouts we didn't observe them. So KLP can even patch itself, which is pretty cool.
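For reference, here's a hedged sketch of the shape of that fix; it's illustrative, not the literal upstream commit. The symbol table and lookup below are made-up stand-ins for the kallsyms walk performed while resolving livepatch relocations:

```c
#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/string.h>

/* Hypothetical symbol table, standing in for kallsyms. */
struct sym {
	const char *name;
	unsigned long addr;
};
extern struct sym symtab[];
extern unsigned long num_syms;

/* Linear scan over every kernel symbol. On a kernel built without
 * preemption, this loop can hog the CPU for a long time. */
static int find_symbol(const char *name, unsigned long *addr)
{
	unsigned long i;

	for (i = 0; i < num_syms; i++) {
		if (!strcmp(symtab[i].name, name)) {
			*addr = symtab[i].addr;
			return 0;
		}
		/*
		 * The one-line fix: offer to yield the CPU on each
		 * iteration so ksoftirqd (and everyone else) can run.
		 */
		cond_resched();
	}
	return -ENOENT;
}
```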
Okay, so now I'll let Song take over to talk about tracing in the data center.

Thanks, David. Tracing is a first-class citizen in the data center. In other words, monitoring and tracing are as important as the main service. If the main service goes down, the on-call wakes up at night and fixes it; if monitoring goes down, the on-call also wakes up at night and fixes it. So yes, they are just as important. As a result, KLP should not break any tracing users. However, there were issues.

First, KLP can break tracepoints. This is because a KLP function cannot contain jump labels, and tracepoints use jump labels. As a result, if there is a tracepoint call in a function that gets patched, the tracepoint call is removed. This issue hit us as blktrace silently missing events. Fortunately, this was partially fixed in 5.8 and newer kernels by Josh: specifically, tracepoints in vmlinux will no longer be removed by KLP. However, tracepoints in kernel modules still have this issue.

Second, KLP can conflict with tracing tools. Specifically, both KLP and BPF trampolines used the ftrace flag IPMODIFY, which means "we may modify the IP (instruction pointer) of this function." As a result, only one of the two could attach to a given kernel function, because we cannot redirect the IP of one function to two different targets, and whichever came later failed. This is fixed in 6.0 kernels: ftrace direct calls, which BPF trampolines use, no longer set IPMODIFY, and BPF trampolines were changed to share the same function with KLP.

Next, KLP transition failures can be problematic at scale. Recall that to finish the transition, each task needs to switch from KLP-unpatched to KLP-patched, and such a switch can only happen at specific points. The common points are, first, when the task exits kernel space and returns to user space, and second, when the task goes to sleep without any to-be-patched function on its stack. This means kernel threads sometimes fail the transition, because they never leave kernel space. This usually happens at a very low rate. Unfortunately, a low rate times a big fleet sometimes means many, many failures.

So here's an example. The btrfs reclaim worker may run for many seconds. It calls cond_resched() many times per second, so technically it's not a badly behaving thread. But this meant we were seeing a 10x increase in the KLP transition failure rate, because this kernel thread doesn't go to sleep often enough. Unfortunately, this issue cannot be fixed with a KLP: a KLP carrying such a fix saw a 100x higher transition failure rate, because now the thread sleeps with a to-be-patched function on its stack. We have a temporary fix for this, adding klp_try_switch_task() calls to the kernel thread, and we're also working with the upstream community on a better fix; we'll give more information about this in the later slides.
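Here's a hedged sketch of the shape of that temporary fix. Note that klp_try_switch_task() is internal to kernel/livepatch upstream, not an exported API, so treat this as an illustration of an out-of-tree band-aid under that assumption; the worker loop and do_reclaim_batch() are made-up stand-ins for the reclaim thread:

```c
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/types.h>

/* Hypothetical unit of reclaim work. */
extern void do_reclaim_batch(void);

/* Assumed to be made callable by the out-of-tree patch; upstream,
 * this function is private to kernel/livepatch/transition.c. */
bool klp_try_switch_task(struct task_struct *task);

/* A worker that stays busy for many seconds at a time. cond_resched()
 * yields the CPU but never actually sleeps, so the livepatch
 * transition never finds a safe point to switch this task. */
static int reclaim_worker(void *arg)
{
	while (!kthread_should_stop()) {
		do_reclaim_batch();
		cond_resched();

		/*
		 * Temporary fix: explicitly ask livepatch to check our
		 * (currently shallow and safe) stack and switch us to
		 * the patched state, instead of waiting for a sleep
		 * that never comes.
		 */
		klp_try_switch_task(current);
	}
	return 0;
}
```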
Next up are ongoing and future work. First, we are proactively identifying and fixing corner cases, because real corner cases in a big fleet can be a serious problem. Here's one example: if you have a kernel module with a patched function, and you unload it and then try to reload it, the reload fails. Why? Because when you load a kernel module, there are sanity checks on the relocation addresses, and KLP breaks them. This has actually been an issue for years, but it hasn't been a priority for enterprise users, because unloading and reloading a kernel module is not a common use case. However, it's potentially a big problem for a fleet managed by automation, like Meta's.

Besides fixing corner cases, we're also adding new features to the toolchain. First, we added toolchain support for building KLPs for kernels compiled with Clang PGO. In this case we need the profile data to build the kernel, and we also need the profile data to build the KLP. This work is mostly done, but it's not yet upstream because the kernel support is not upstream. We're also looking at having one KLP built for both in-tree and out-of-tree fixes. We have this requirement because we need the replace flag, so we cannot have two KLPs, one for in-tree fixes and one for out-of-tree fixes. Third, we're actively looking for solutions to reduce KLP transition failures. There's an interesting idea from Peter: specifically, we can use ftrace to attach klp_try_switch_task() to specific functions, so that pending tasks can finish the transition without going to sleep.

Okay, that's all we have for KLP at Meta. Thanks for your attention. Any questions?