Hey, hello, everyone. My name is Hao. I work for Google. This talk is a BPF use case: I'm going to share my experience of using BPF tracing to debug a kernel performance regression, which I think is a very interesting case. First there's some background and a statement of the problem we discovered, then how I root-caused it. In the second part, we're going to use BPF to do some analysis of swap page faults, with a brief introduction to that. This is a project Chris and I are working on together, using BPF to analyze latency issues that we discovered in our production systems around swap page faults. Chris will probably join me later, but I can cover his part as well.

So the background is the problem we found. At Google, we are upgrading our production kernel from 4.15 to 5.10. When we deployed this 5.10 kernel in production, we discovered some issues. Basically, the container folks told us that on the older 4.15 kernel, when we write the cpuset "cpus" file, a cgroup file, the latency is mostly within about 10 milliseconds at P99. When we upgraded to 5.10, we discovered that the majority of the latency became more like 80 or 90 milliseconds. You can see that's a big difference, and it's especially bad in the case where you have a lot of small jobs creating a lot of small containers. It's a problem, and my task was to find out why.

My best guess was that something very costly was happening in the code path of this kernel operation. I thought it might just be lock contention, because you have so many threads grabbing the lock on this path. So let's verify whether that is true: let's see if we can measure the lock contention.

Luckily, my colleague Namhyung recently introduced a couple of tracepoints that trace the beginning and the end of lock acquisition in the slow path. These tracepoints are available in 5.19, and we backported them to our old kernel and to the 5.10 kernel. With these tracepoints, we can measure the time that threads spend in the contention path, which is the slow path. When a thread acquires a lock, it first goes through the fast path: basically, if the lock is available, it just takes it. If the lock is not available, it goes into the slow path.

The profiling is scoped to a specific kernel function, the code path that I'm interested in: cpuset_write_resmask. I collect samples of the time durations spent in lock contention, export the samples through the ring buffer, and estimate quantiles to get the tails, like the P90 or the median. I also tried implementing the quantile estimator in BPF. I think it's doable; as I mentioned in yesterday's talk, it's totally possible to implement t-digest, a quantile estimator, purely in BPF. And I compared the results on 5.10 against 4.15. It turns out that on 5.10 we have much, much higher lock contention under our production workloads.
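To make the approach concrete, here is a minimal sketch of that scoped measurement as a libbpf-style BPF program. This is not the actual tool from the talk: it assumes the 5.19 contention_begin/contention_end tracepoints (or a backport of them), it assumes cpuset_write_resmask is attachable via fentry/fexit (kprobes would work the same way if it is not), and the map and program names are my own. It records slow-path wait durations only for threads currently inside cpuset_write_resmask and streams them to user space through a ring buffer, where the quantile estimation would happen.

```c
// scoped_contention.bpf.c -- a minimal sketch, assuming 5.19+ tracepoints.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32);	/* tid */
	__type(value, u64);	/* timestamp at contention_begin */
} start SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32);	/* tid currently inside cpuset_write_resmask */
	__type(value, u8);
} in_scope SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 1 << 20);
} samples SEC(".maps");

SEC("fentry/cpuset_write_resmask")
int BPF_PROG(scope_enter)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();
	u8 one = 1;

	bpf_map_update_elem(&in_scope, &tid, &one, BPF_ANY);
	return 0;
}

SEC("fexit/cpuset_write_resmask")
int BPF_PROG(scope_exit)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();

	bpf_map_delete_elem(&in_scope, &tid);
	return 0;
}

SEC("tp_btf/contention_begin")
int BPF_PROG(on_contention_begin, void *lock, unsigned int flags)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();
	u64 ts;

	if (!bpf_map_lookup_elem(&in_scope, &tid))
		return 0;	/* not in the code path we care about */
	ts = bpf_ktime_get_ns();
	bpf_map_update_elem(&start, &tid, &ts, BPF_ANY);
	return 0;
}

SEC("tp_btf/contention_end")
int BPF_PROG(on_contention_end, void *lock, int ret)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();
	u64 *tsp, *out;

	tsp = bpf_map_lookup_elem(&start, &tid);
	if (!tsp)
		return 0;
	out = bpf_ringbuf_reserve(&samples, sizeof(*out), 0);
	if (out) {
		*out = bpf_ktime_get_ns() - *tsp; /* time in the slow path */
		bpf_ringbuf_submit(out, 0);
	}
	bpf_map_delete_elem(&start, &tid);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```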
Quick question? Yeah, right here. Just out of curiosity, did you guys try enabling lock stats and using the built-in kernel mechanism for tracking contention? Or was the idea to not have to do anything and just use BPF?

To enable lock stats, you need to compile the instrumentation into the kernel. This is a pure production kernel, without any instrumentation or debug info, and without lock stat enabled. We just want to enable this at runtime, without the build-time support.

Sounds good. Do you guys think you would add what you built to BCC? Just out of curiosity; I think it would be generally extremely useful to be able to get essentially on-demand lock stats from BPF without having to enable lock stats. So, if you think there's anything generally useful there.

Yeah, I definitely think so. That's the motivation of this talk; I want to share it. Oh, that's why you're here. Thank you. Probably not many people know that there are tracepoints for monitoring the slow path of locking.

So these are the profiles I get on 4.15 and 5.10. You can see that at P90 and P95 it's super long; it's shifted. But that's not the full story. Now we know that there's lock contention, but we want to know the cause of the contention. So the next question is: how do we find the cause?

My idea was that we first need to know which lock is contended, because the code path can grab multiple locks: there's the cgroup lock, there's the threadgroup lock. My colleague Namhyung has perf lock contention, which can get this profile information, but the problem is that it's a little difficult to map the lock address back to the symbol of the lock. For global locks this is OK, because global locks have fixed kernel addresses, so we can map them back. But for many locks that's difficult to do. For example, for the locks inside superblocks, we only know the lock address; we don't know which superblock contains the lock, so it's a bit difficult. Also, a common lock that we know is heavily contended is the mm's mmap semaphore, if I remember correctly, and that's basically a per-process lock: we know the address of the lock, but we don't know the process or the mm_struct that contains it. But anyway, we can hard-code it: we can convert the lock address to a common lock type and take a guess at which common data structure contains the lock. Luckily, we identified that the lock causing the contention is the cpuset rwsem.

Another thing I did, in addition to identifying the lock, was to write a custom program that attaches to the entry of the lock contention and sets a timer, say to 32 milliseconds. When it expires, the timer callback records the state of the lock: for example, the lock owner, the state of the lock, and which threads are waiting for it. It then exports that information through the ring buffer, so user space can see: OK, I've been waiting on this lock for too long, past these 32 milliseconds, and here is its state. That also helps user space debug. Using this tool, I verified that on 4.15, the older kernel, there is no contention beyond this threshold, but on 5.10 we have many instances beyond it. So I can confirm that this is a problem.
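Since several of the questions later touch on owner tracking, here is a sketch of the owner-decoding half of that watchdog idea. Loud caveats: this is my own reconstruction, not the tool from the talk; the LCB_F_MUTEX value mirrors include/trace/events/lock.h on recent kernels and should be verified against yours; the owner encoding (task pointer with flag bits in the low bits) is an implementation detail of mutexes and rwsems, and other lock types simply store no owner; and because bpf_timer is not usable from tracing programs on many kernels, this sketch decodes the owner directly at contention_begin rather than from a 32 ms timer callback.

```c
// owner_probe.bpf.c -- a sketch of recovering the owner of a contended lock.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

/* Mirrors include/trace/events/lock.h; verify on your kernel version. */
#define LCB_F_MUTEX	(1U << 5)
/* Low bits of mutex->owner (and rwsem->owner) are flags, not pointer bits. */
#define OWNER_FLAG_MASK	0x07UL

struct event {
	u64 lock;
	u32 waiter_tid;
	u32 owner_pid;
	char owner_comm[16];
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 1 << 20);
} events SEC(".maps");

SEC("tp_btf/contention_begin")
int BPF_PROG(on_begin, void *lock, unsigned int flags)
{
	struct task_struct *owner;
	struct event *e;
	long raw;

	if (!(flags & LCB_F_MUTEX))
		return 0;	/* this sketch only decodes mutex owners */

	raw = BPF_CORE_READ((struct mutex *)lock, owner.counter);
	owner = (struct task_struct *)(raw & ~OWNER_FLAG_MASK);
	if (!owner)
		return 0;	/* uncontended or owner not recorded */

	e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
	if (!e)
		return 0;
	e->lock = (u64)lock;
	e->waiter_tid = (u32)bpf_get_current_pid_tgid();
	e->owner_pid = BPF_CORE_READ(owner, pid);
	bpf_core_read_str(e->owner_comm, sizeof(e->owner_comm), &owner->comm);
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

In the real watchdog described above, this decoding would run from the timer callback after 32 milliseconds of waiting, so only long waits generate events.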
And the problematic lock is the cpuset rwsem. So, like I said, we need to find the cause of the contention. How do we do that? One idea is to check the critical sections for this lock. Luckily, only a few kernel functions take the cpuset rwsem. So what I did was write a program that attaches to the lock acquisition and the lock release, and also scopes the profiling to the kernel functions that can possibly hold the lock. Luckily, there's just a handful of these kernel functions, so it's doable. By profiling the length of the critical sections, I discovered the kernel functions that spend too much time holding the cpuset rwsem. One thing I think is useful is that there's a pattern here: we want to scope the profiling within a single entry point, for example a single kernel function or a small set of kernel functions. I think this might be a useful helper to add to a library that could be shared by many lock-profiling tools. (A sketch of this kind of hold-time measurement appears below.) And the result is that this particular function, cpuset_write_resmask, has very long critical sections.

So what do we do? The rest is just looking at the code to see what happened. I compared the code between the old kernel and the new kernel, particularly in this function, and I noticed a particular function, rebuild_root_domains, that is conditionally called from it. What it does is iterate over all the cpuset cgroups, and for each cgroup, iterate over all the tasks within that cgroup. That has very high overhead. The intention of this traversal is to find the deadline tasks in the cgroup. But our production systems never have any deadline tasks, so we were paying a very high cost for nothing in our workload.

As a quick experiment, I just disabled this traversal under the lock and gave it a try to see if it works. This is the result I got from that experiment: you can see that without traversing all the cpuset cgroups and their tasks, I brought the P90 latency back to within 20 milliseconds. So it's much better than before.

I searched the mailing list to see if someone had found this issue, and Qais Yousef, also from Google, had independently discovered that this traversal is problematic. He reported that it increases resume time, not suspend. But the scale he reported is much lower than what I found here: I found it adds around 60 milliseconds, while in his environment, a mobile system without that many tasks, it's more like 20 milliseconds. Still a problem. So we independently found this and reported it upstream, and a fix is coming.

But one thing to note is that there are still tails even after this fix. You can see that there are still some samples between 20 milliseconds and 200 milliseconds. What's going on there? This last part is something I didn't use BPF to answer. I noticed that on the older kernel the lock was a mutex, but on the new kernel it's a percpu rwsem. With a percpu rwsem, and this is probably a less known fact, the writer path is heavily penalized: to take the lock, the writer has to call synchronize_rcu and wait. The trade-off is that readers of the percpu rwsem can take the lock much faster.
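Here is a sketch of the critical-section hold-time measurement once the lock has been identified as the global cpuset rwsem. Because the lock is global, a loader program can resolve its address once (for example from /proc/kallsyms) and pass it in; the BPF side then only times the window between the writer acquiring and releasing it. percpu_down_write() and percpu_up_write() are the real kernel entry points, but target_lock, the map names, and the log2 histogram are my own scaffolding, not from the talk.

```c
// critsec.bpf.c -- writer hold-time histogram for one specific percpu rwsem.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Set by the loader, e.g. the address of cpuset_rwsem from kallsyms. */
const volatile u64 target_lock = 0;

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, u32);	/* tid of the writer */
	__type(value, u64);	/* timestamp at acquisition */
} held_since SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 64);
	__type(key, u32);
	__type(value, u64);	/* log2 histogram of hold times, in ns */
} hist SEC(".maps");

static u32 log2_bucket(u64 v)
{
	u32 b = 0;

	while (v > 1 && b < 63) {	/* bounded loop for the verifier */
		v >>= 1;
		b++;
	}
	return b;
}

SEC("fentry/percpu_down_write")
int BPF_PROG(down_write, struct percpu_rw_semaphore *sem)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();
	u64 ts = bpf_ktime_get_ns();

	if ((u64)sem != target_lock)
		return 0;	/* not the lock we're profiling */
	bpf_map_update_elem(&held_since, &tid, &ts, BPF_ANY);
	return 0;
}

SEC("fentry/percpu_up_write")
int BPF_PROG(up_write, struct percpu_rw_semaphore *sem)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();
	u64 *tsp, *cnt;
	u32 b;

	if ((u64)sem != target_lock)
		return 0;
	tsp = bpf_map_lookup_elem(&held_since, &tid);
	if (!tsp)
		return 0;
	b = log2_bucket(bpf_ktime_get_ns() - *tsp);
	cnt = bpf_map_lookup_elem(&hist, &b);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	bpf_map_delete_elem(&held_since, &tid);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

Scoping the samples to a particular caller, such as cpuset_write_resmask, works exactly like the in_scope map in the first sketch.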
So what I tried was switching from the percpu rwsem back to a mutex. And now you can see that the tails beyond 20 milliseconds are gone. So these are, I think, two important fixes that we need in the upstream kernel. And luckily, yesterday I saw that the fix was already merged by Tejun in the cgroup branch, so I think it should be available in 6.5.

The second part is about how we want to use BPF to analyze swap page faults. This is a collaboration with my colleague Chris Li. He's probably in the MM track right now, but I can basically cover it. In our production systems, we also have some tail latency issues on swap page faults, and we don't know what's happening. The very first thing we want to know is which stage this tail latency comes from. So Chris provided me a list of kernel functions that break handle_mm_fault down into several steps, the key kernel functions, and we attached BPF programs to take a timestamp at each step and then calculate the duration of each step. (A sketch of this per-step timing follows below.)

One result Chris found, running this analysis on an experimental machine with an experimental workload, was that the main contributor to the tail latency is free_swap_slot. This is a very good indication for us: a better understanding of what happens in our production systems, where we should be looking next, and what the cause might be. This is on the 5.15 kernel. And this is really just the very first step. We want to use BPF to understand our systems, especially to capture these long tails in kernel operations like page faults, and going further, we want to develop more BPF profiling tools to analyze these swap page fault results.
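Below is a sketch of how one of those per-step timings could be wired up. The full tool instrumented a list of functions inside handle_mm_fault the same way; this sketch shows only one stage, free_swap_slot(), which is the function the 5.15 analysis pointed at. Function names in the swap path vary across kernel versions, so treat the attach point as an assumption to verify, not a given.

```c
// swap_step_lat.bpf.c -- a sketch: time one stage of the swap fault path.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32);	/* tid */
	__type(value, u64);	/* timestamp at stage entry */
} entry_ts SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 1 << 20);
} durations SEC(".maps");

SEC("fentry/free_swap_slot")
int BPF_PROG(step_enter)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();
	u64 ts = bpf_ktime_get_ns();

	bpf_map_update_elem(&entry_ts, &tid, &ts, BPF_ANY);
	return 0;
}

SEC("fexit/free_swap_slot")
int BPF_PROG(step_exit)
{
	u32 tid = (u32)bpf_get_current_pid_tgid();
	u64 *tsp, *d;

	tsp = bpf_map_lookup_elem(&entry_ts, &tid);
	if (!tsp)
		return 0;
	d = bpf_ringbuf_reserve(&durations, sizeof(*d), 0);
	if (d) {
		*d = bpf_ktime_get_ns() - *tsp; /* time spent in this stage */
		bpf_ringbuf_submit(d, 0);
	}
	bpf_map_delete_elem(&entry_ts, &tid);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

User space drains the ring buffer and estimates the tail quantiles per stage, the same way as in the lock contention measurement.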
So that's basically it; I wanted to share my experience and see if people have any comments. One thing I think may be related to a talk later: we also observed some contention coming from superblocks, but we can't associate the lock address with the superblock's address, so we don't know which file system caused the contention. We're thinking that maybe a superblock iterator would help: if we can iterate over the superblocks, we can match the lock address against each superblock's lock, and then we can determine which superblock, which file system, caused the contention. That's it. Thank you.

Super interesting talk. Please, as David said, please open source everything you can, and the stuff that you cannot, we'll take a look. Really awesome stuff. For the address of the lock, do you have a tracepoint in things like mutex init? Because I think currently there are tracepoints, at least for the spin lock; I'm not sure whether they exist for mutexes as well. If you put a tracepoint there, you can potentially just collect all of the addresses. I think my understanding of why you can't identify the dynamic addresses, the locks inside dynamically allocated structures, is that, well, they are dynamic, right? But they go through init, so you just put the tracepoint there and record all of the addresses.

We know the lock's address, but we don't know the owner's address. Is that the question you're asking?

Yeah, but I'm saying that when you record the address at the time of init, you do it once, so you know the address of what's being acquired, and you collect the stack trace, so you know where it was created and then where it's being acquired, and a human can easily identify it.

I never thought about this; it may be doable.

Even without recording the address: once you know the contention, and this is all kernel-side, user space is not involved, just collect the stack trace with line numbers. A human can just look and say, well, this is the most contended lock, and this is the stack trace from the most contended path. You just look at the source and see what it is.

Yeah, so yes, as I mentioned, perf can do this lock contention analysis. It also reports the stack that observed the contention, but it doesn't solve the whole problem. We want to know: if the lock is contended, who caused the contention? Who is holding the lock right now? That is actually the more valuable and important information, but not all locks record this owner information, only mutexes and rwsems, not all of them. Luckily, for the common lock types, we do have that information.

And the contention... I thought this is only for the contended case, right? The tracepoints are only in the slow path, like for the spin lock. Yeah. I see. So when it's getting locked, you're saying, when a spin lock, for example, is taken, there's no owner field, and there's no tracepoint there, so you cannot record it offline. Right. We don't know who holds the lock, and for some locks the acquisition is also inlined, so we can't even trace the lock acquisition function. It's very difficult to keep track of who owns the lock right now. That's the main problem. But luckily, in some common cases, we can identify the lock and the owner.

Nice talk. So recently I had to navigate the complexity around the socket lock, which is a very unique lock in the sense that it's a spin lock in the bottom half and like a mutex when you're contending with user-space code. I was wondering if you can extend this framework to inspect the socket lock.

Certainly you can: the tracepoints are already added for all the lock types in the kernel, spin locks, mutexes, rwsems, all of them. The problem is how you map the lock address into something meaningful, back to the socket or whatever contains it. Like I said, we can record the stack trace to get better knowledge, but for exactly who owns the lock, or what object contains the lock, there's largely no unified solution.

OK, but I was wondering: can you dump more context around the lock? For example, is BH disabled or not? Or is the softirq enabled, something like that? Yeah, you can. It's kernel state, right? With BPF you can inspect kernel state: you can get the preempt count, or any fields in the lock if it records them, and dump them out. It's doable. Also, I'm interested in measuring the lock contention of the BPF allocator; it has this per-CPU lock, right?

So why can't we find the owner if we can just inspect the lock struct itself? Because some locks don't record the owner. OK, so they just don't store it at all. Not stored at all, yeah. So you could have somebody else free it and it wouldn't know. Yeah. Well, and you don't want to trace the hot path, like you were saying, right? So you could record it. Sometimes you can't even trace it; it's inlined.
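Picking up the audience suggestion above about recording addresses at init time, here is a minimal sketch of what that could look like. It is my own construction, not something shown in the talk: it assumes __mutex_init(), the function behind the mutex_init() macro, is attachable (so dynamically allocated mutexes pass through it), covers only mutexes, and will miss locks initialized before the program is loaded.

```c
// lock_init_sites.bpf.c -- a sketch: map contended lock addresses back to
// the stack trace of the code that initialized them.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_STACK_TRACE);
	__uint(max_entries, 16384);
	__uint(key_size, sizeof(u32));
	__uint(value_size, 127 * sizeof(u64));
} stacks SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_LRU_HASH);
	__uint(max_entries, 1 << 20);
	__type(key, u64);	/* lock address */
	__type(value, u32);	/* stack id of the init site */
} init_site SEC(".maps");

SEC("fentry/__mutex_init")
int BPF_PROG(record_init, struct mutex *lock, const char *name,
	     struct lock_class_key *key)
{
	u64 addr = (u64)lock;
	long id = bpf_get_stackid(ctx, &stacks, 0);

	if (id >= 0) {
		u32 sid = (u32)id;
		bpf_map_update_elem(&init_site, &addr, &sid, BPF_ANY);
	}
	return 0;
}

SEC("tp_btf/contention_begin")
int BPF_PROG(lookup_init, void *lock, unsigned int flags)
{
	u64 addr = (u64)lock;
	u32 *sid = bpf_map_lookup_elem(&init_site, &addr);

	/* User space resolves the stack id back to symbols and line numbers. */
	if (sid)
		bpf_printk("contended lock %llx init stack %u", addr, *sid);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```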
Any other questions or feedback? All right, thank you very much. Yeah. That's awesome. Thank you.