G'day. A month ago I worked at Netflix, in fact very, very recently. I worked at Netflix and I've recently joined Intel; I'm on day five at Intel, in fact. I was at Netflix for eight years and a month, and I certainly felt like taking a month or two off between jobs, but LSFMM was within a week of leaving Netflix, so I had to come straight from one job to the next. That's why I'm on Intel day five and not currently on a break between jobs. I'm going to talk about two topics. The first is BPF observability tools, the sort of things I'm working on at the cutting edge, to get your feedback and help, plus some problems and a wishlist, to find out where we're at. The other topic is the guidelines, something that the BPF steering committee, well, that I've been trying to get done. The first topic, BPF observability: while I've created all these great low-level tools for SSHing onto systems and running them, that's not the number one thing people will use for BPF observability. The killer feature is flame graphs. In fact Intel just acquired a startup called Granulate that was doing eBPF flame graph profiling, and there are various other ones.
There's Optimyze with Prodfiler, and there are others as well. What's great about eBPF is that we can frequency-count the on-CPU stack traces in the kernel and then emit just the frequency counts to user space and turn those into a flame graph. That's really efficient versus using, say, perf to send all the samples down to user space and have user space do the aggregation, even though perf has all these optimizations which actually make it a lot better than it could have been. eBPF makes on-CPU profiling more efficient, and that's the product that startups are now selling. Also off-CPU profiling: eBPF makes off-CPU profiling practical at all, because without eBPF and the aggregations and things it can do, off-CPU profiling was not practical. So I've spent a lot of time on this, because this is what a lot of startups are going to adopt, and this is what a lot of people will experience as their first taste of eBPF for observability; later on they might try my tools, although usually via a UI on top of them.

One problem we had at Netflix is that with sampled stack traces, even though the kernel could put them into a map and frequency-count them, we would still have scenarios with something like 30,000 unique stack traces going to user space, where user space would do the symbol translation, and it would take a very long time to emit a flame graph. This became a bottleneck. And I noticed that we didn't really care about the top-level instruction offset. Sometimes we do care, I want to know what instruction I'm stuck on, but at the flame graph level you can't even see it, because it all gets turned into one rectangle. So my idea was to discuss: how can we chop off this bit? It's just a few characters; but you can't just use something like sed to take it off, because in kernel space it's just a list of numbers. What I'm really trying to do is take the current instruction pointer and go to the start of the function, with no symbols, for user-space code, while
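The aggregation idea can be sketched as a small simulation. Plain Python stands in for the in-kernel BPF map here; the function name is mine, not a real API:

```python
from collections import Counter

def aggregate_stacks(samples):
    """Simulate in-kernel aggregation: instead of emitting every sampled
    stack trace to user space, keep a map keyed by the stack trace and
    emit only (stack, count) pairs."""
    counts = Counter()
    for stack in samples:
        counts[tuple(stack)] += 1   # one cheap map update per sample
    return counts                   # user space reads this out once

# Three samples collapse to two unique stacks: far less data to emit,
# and far fewer symbol translations for user space to do.
profile = aggregate_stacks([
    ["main", "do_work", "read"],
    ["main", "do_work", "read"],
    ["main", "idle"],
])
```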
you're in kernel space, in BPF context. Can it be done? This is where I thought this would be a great thing to discuss. Wouldn't it be great if we could have a BPF function, a get-function-start: you give it the IP, and it says "here's the start address"? Because then I can use bpf_get_stackid(), skip the first frame, and I have what I want, the stack without the top-level offset, right? I haven't actually tried this; I'm sure this works when you skip frames. Yes, you can skip the first few frames, right? You've got a bitmask. Okay, so it sounds like this should all work; people have already added this capability, we just need to be able to do the lookup.

It will reduce the unique stacks by a factor of 12 to 15 times. I took some old profiles and summarized them, and yeah, it turns out that's roughly the average function length for the production workloads I was looking at: because you're just chopping off the offset, you're reducing all the offsets within a function to one, so instead of sampling each of those instruction offsets, you've just got zero. So I had a few ideas. Does anyone have an idea for doing this? So what I've been hacking up today: the problem with giving a talk at LSFMM and saying "here are problems I couldn't solve" is that I feel guilty and lazy. It's like I should try and solve them before I talk about them, so I've been trying to solve this, and I'll give a demonstration of it.

All right, so I'm profiling bash in one window, and in the other window I've got my tool. On the left is the raw instruction pointer, and on the right I've chopped off the offset and just turned it into the parent function, so that we can aggregate on that. And it mostly works; don't look too closely, it mostly works. How did I do this, purely in BPF, for user space?
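A helper like this hypothetical get-function-start would essentially do a lookup into a sorted table of function entry addresses. A sketch of that lookup in Python, with the helper name and table being my assumptions, not an existing kernel API:

```python
import bisect

def get_func_start(ip, func_starts):
    """Map an instruction pointer back to the entry point of the
    function containing it. func_starts is a sorted list of function
    start addresses (e.g. taken from the binary's symbol table)."""
    i = bisect.bisect_right(func_starts, ip) - 1
    return func_starts[i] if i >= 0 else None

# An IP at 0x1234 falls inside the function starting at 0x1200, so all
# samples within that function collapse to the single key 0x1200.
starts = [0x1000, 0x1200, 0x1480]
```

This is the binary search that later comes up as a concern for NMI context: it is O(log n) per frame, cheap in user space but extra work in a sampling handler.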
I'm printing out some aggregations at the bottom, and you can see how the trimmed array is smaller than the normal array, because I've got fewer unique stacks. So I'll tell you how I did it, and you can tell me if this is just wrong and working by accident. My rough proof of concept, this is all in bpftrace: I go to the RBP, then I walk backwards five bytes from the return address and have a look at the opcode. If it's 0xE8, it's a call instruction. I then read the next four bytes as the relative offset and add it to get the parent's call target, which is the start of the function for the top-level frame. So basically I'm walking down a frame, going backwards, and disassembling the call. And it's only direct calls, right, not indirect? Sure, and I broke this today while you were talking, sorry. But this can be extended: I can add a few more opcodes, I can do other architectures and instructions. But tell me if this is insane, because I will abandon it; tell me if this just seems like a bad idea. Like I said, it seems to work; I've tried it for like 30 seconds. The benefit is we can use this to do CPU flame graphs and reduce the emitted stacks fifteen-fold, which would be great. There is the instruction pointer, and you're just trying to see what is going on. It works, but only for direct calls. I know I need a table of other instructions, and there are all sorts of hairy cases, like tail calls and indirect calls; it's super fragile. As was said, there are lots of other instructions like this, different call types, indirect calls. But as an observability tool it doesn't have to be a hundred percent accurate; I just have to document somewhere that it's mostly right, and that's fine. Security tools have to be a hundred percent accurate.
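My reading of that proof of concept, sketched in Python rather than the actual bpftrace code: on x86-64 a direct near call is encoded as `e8` followed by a signed 32-bit displacement, and the return address saved on the stack points just past those five bytes, so the call target, the entry point of the function the sampled IP is inside, is the return address plus the displacement:

```python
def call_target(code: bytes, base: int, ret_addr: int):
    """Given a return address found on the stack, check whether the five
    bytes before it encode a direct near call (opcode 0xE8) and, if so,
    return its target: the start of the called function. Returns None
    for anything else (indirect calls, tail calls, other encodings...),
    which is exactly where this heuristic is fragile."""
    off = ret_addr - base - 5
    if off < 0 or code[off] != 0xE8:
        return None
    rel32 = int.from_bytes(code[off + 1:off + 5], "little", signed=True)
    return ret_addr + rel32   # target = address after the call + rel32
```

For example, a `call +0x10` at 0x400001 (bytes `e8 10 00 00 00`) has return address 0x400006 and resolves to target 0x400016.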
Anyway, that's the first topic. Any comments on this idea? The idea is interesting. Will we really have something like this, a helper that, given an address inside a function, finds the beginning, gives the first byte? We might have one; maybe not. If not, we can add it; that would definitely help. Yeah, and this is my quick tool, and it kind of works: there's _int_malloc, and I've chopped off the offset; _int_malloc there didn't work, but then it wasn't an 0xE8, so I'm decoding the wrong instruction. So I need to fill out my table, and figure out why that didn't work. Don't look too closely; there's still more code that needs to be added. But there should be no blocker: if we need to add a new helper dedicated specifically to this, it should be fine, right? That'd be great. I don't want to make you add lots of helpers, because I can turn this into a user-space macro somewhere in BCC or bpftrace. But having a helper, a bpf_get_func_start say, would let us hide all the different architectures. Do we need to do this in the NMI when we take the sample, or could it be a little bit later? Because to be accurate we're going to have to do a binary search, and that's probably a little bit expensive. Well, we're already walking stacks in the NMI if we're doing NMI-based profiling, and the Netflix production stacks go a thousand frames deep, which is why Arnaldo added the perf_event_max_stack sysctl, because our stacks are so deep. So yeah, I think it's a good concern that I'm doing too much stuff in an NMI handler, but we've already broken that. Can I do it after the NMI? It depends: if the user-space thread got rescheduled, the instruction pointer is somewhere else now. Well, yeah, I think we can definitely give it a try.
I mean, maybe we just keep the first address and do the lookup later, or maybe have a post-processor or something, I don't know. It's easy enough to give it a try, and we'll see how fast or how slow it is. Cool. Let's keep going to the next problem: missing context. Stack traces are fantastic; with flame graphs I can understand the big picture of my code. But it's missing stuff, like the database query string. The code is supposed to answer why: why am I here, why is my code on CPU, why am I calling this stuff? But it would be useful to also have the database query string, or user IDs, or whatever, like which Netflix movie title you're watching, whatever it is.

You can do this today via tracing and stashing: I trace the start of something and stash it keyed by the thread ID, then when I sample I include it, and then I clear it. But I want to do it purely with sampling; I want to make this lower overhead, I don't want to have to do tracing as well. So, things to discuss. So you understand the goal here: I want to see these stack traces, and I want to add a frame of context, like the database query string or the thing you're working on. Trace-and-stash, that's one way. Or I can do a double walk. I did prototype this in bpftrace and it works: I do a quick stack walk until I get to a frame that has the context, then I fetch the context, and then I follow it with the normal BPF stack walker. So I'm basically diving down as quickly as I can, and when I get to an instruction pointer in the range between my low and high addresses, I go do my context fetch from the arguments to the function, if they're available on the stack. That works entirely from sampling context: I just do a quick walk, grab the context, then I do the full stack walk. Today I've got it working without tracing. Sorry, can you explain: what is "context"? How do you know this IP range, and what context is there?
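The trace-and-stash pattern can be sketched as a simulation. The hook names here are hypothetical: in practice the start hook would be a uprobe on the application's query-dispatch function, the sample hook would be the perf sampling program, and `context` would be a BPF hash map keyed by thread ID:

```python
context = {}   # tid -> stashed context string (stands in for a BPF map)

def on_query_start(tid, query):
    context[tid] = query       # stash at the start of the operation

def on_query_end(tid):
    context.pop(tid, None)     # clear, so samples don't show stale context

def on_sample(tid, stack):
    # Append the stashed context as a synthetic frame, so it shows up
    # as its own rectangle in the flame graph.
    return stack + [context.get(tid, "[no context]")]
```

The cost is the tracing itself: every operation pays for the two probes, which is exactly the overhead the sampling-only double walk tries to avoid.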
MySQL is a nice example, because it's got functions like do_command and the query and join functions, and a lot of the query strings are available as char stars; so if you can access the argument to the function, you've got the query string. But the big problem is that this only sometimes works, for the registers that are spilled to the stack. If the register isn't on the stack, I can't fetch the argument; if it is on the stack, I can. So it's still missing things. So you're talking about the user process stack, right? But this walking logic has to be very specific to a particular binary, right? It has to have built-in knowledge that this is, say, potentially a MySQL stack trace: walk from wherever it stopped until some lower frame, and if that lower frame is a certain function that you know will be in that binary at that address, then you look in those addresses, etc. Sure, absolutely. So, two things. As Alexei said, this is very specific to the application. So in bpftrace or BCC or whatever, we'd have some sort of API so that you can define: here's the MySQL walker, here's the Erlang walker, here's the memcached walker, or whatever, and people can write their own little walkers and plugins to get the context. The other thing is, I want to add a syscall that says "this is my context": a context syscall which takes a char star. Then we tell application developers: before you do your slow work, not on the hot path, I'm not saying add it to the hot path, but on the slow path, do the syscall, or just save the string context at a known memory address, so that debuggers can go and fetch it. Couldn't you just use a register somewhere and save the context there? No. I mean, annotate in your application code "this is context", and have it somehow handled by the...
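The "double walk" first pass can be sketched the same way: a quick scan over the saved return addresses just looks for a frame inside the known function's address range (for MySQL, the function whose argument holds the query string); only if one is found does the context fetch and full stack walk happen. The low/high range would come from the binary's symbol table; the function name is an assumption:

```python
def find_context_frame(frames, lo, hi):
    """Quick first walk: return the index of the first frame whose
    saved return address falls inside [lo, hi), the range of the
    function known to hold the context (e.g. a query string spilled
    into its stack frame). None means no context frame was found, so
    there is nothing to fetch for this sample."""
    for i, ip in enumerate(frames):
        if lo <= ip < hi:
            return i
    return None
```

In the prototype this scan runs in the sampling program before the normal bpf_get_stack() walk, so the whole thing stays sampling-only.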
This is context and this is somehow handled by the Yeah, if we can if we can modify application code This is solved because we can do it so we can like application memory walk We could say just stick it in this memory address where we know where it is Each thread populates that memory address We just go fetch it the my SQL helper will go and fetch it from the mysql context address and we're done So syscall would require application modification. Yeah, I mean everything's good either If you get really lucky and you can just walk through memory like if has anyone done caught up analysis without registers And you'd like I don't know maybe memories over here, and then you can like figure it out Maybe like what about for key application functions you have on your slow path On the entry point of the function where you know that we're going to populate We're gonna do some heavy stuff you attach a trampoline of some sort and start saving context there rather than modifying the application Totally works totally works my skills slow enough. I'd actually just do that for my SQL I just trace like do command and then save the string. It's just more higher frequent It's like wouldn't it be great if there's a way I could sample this for higher frequency things that didn't Didn't need that so do you do you have an example like Context like the mysql is a system called like reading a file on Or sending some level traffic. There's some something could be asynchronous, right? 
It could be related to that process, but maybe it's an asynchronous code path. It could be synchronous; things like Node.js really mess things up with the event and worker threads. Things like MySQL are nice in that they'll do a thread per connection and then handle that query on it, so you can tie things to the thread ID. But if it's asynchronous, then you're going to have to find some key address to match the context with; it becomes more complicated. If it is asynchronous, for example writing to a file, or sending a message to a socket, maybe you can store something in task local storage, and then maybe we can get it from there. Yeah, if you can modify the application, put things in task local storage, and then have a well-defined macro that we publish, and we tell people: when you add your tracing probes, you should also add this context. Context isn't a probe, but when you trace it... But something will send a message or write to a file, and when you trace it, maybe at the tracepoint you should be able to get the current task. Once you have the current task, you know the PID of the task, for the MySQL case, and at that point maybe you can do something. Yeah, when you trace it you have the process ID; you can even get the current task struct at that point, and you can store something there. So when you go down the stack, if at the top of the stack there is something that you want to tag, then you can tag it there by using the current task storage. Something asynchronous would be, I don't know... But I wonder, in this context: when you need to debug some production issue, and you don't have that thing stored in the context that you wanted, that's maybe also tricky, right? I don't know.
I mean, I haven't really dealt with this problem on Linux, but on Windows with WinDbg I had to write scripts, and we used WinDbg to grab my context. It was heavily application-specific, because I was unrolling the stack of an interpreter inside a C program, and to debug that, that's what we had to do, and it required a full-blown scripting language to support it. Is it possible to use the information from DWARF, or other debug information, to work out where the context is? For a function argument, for example, you can basically use the DWARF information to locate it. I know that performance is an issue, but it's possible if we do it afterwards, at analysis time: maybe we just capture part of the stack frame, and later we use the DWARF information to parse out the context. Yeah, I mean, DWARF has lots of stuff; I just worry that the request will be done by the time we're looking it up. DWARF will help us understand... I wonder if we can use DWARF to help develop the double walker, because, like I said, on x86 the first six argument registers aren't always on the stack, but DWARF knows where they are. So maybe I can use DWARF information to work out how to fetch them, because MySQL is such a nice example: you've got the char stars as arg zero, and I'm pretty sure I can't always do this method, but DWARF could tell me where that was. I think if you DWARF-ize the walker, that would probably be appropriate. Okay, so as I think about it, that might actually be the best way to do it: you have a bpftrace or BCC way for other people to publish their own
context widgets, basically. People can write one for MySQL, and Hadoop, and Apache Spark, and its API is to populate a context variable based on the thread ID, so long as it's synchronous to the thread, and that's it. You just let people add their context walkers, and then the profiler checks whether that's populated. Maybe that's the best way to do this. Could you go forward one slide, please? So at Meta we do a lot of profiling and tracing, occasionally in combination. I'm actually bummed that Andrii had to step out, because he effectively wrote a Meta-internal version of what you want: a custom macro that the BPF program can then find very easily. Rather, a custom macro that populates the ELF, and then some code that, at profiling time or tracing time, tells the BPF program where to find the context. Sounds like that is this, yes, but I want that open source so I can use it. He's been bugging people to open source it. I believe the way it currently works is that it basically reverse-engineers thread local storage for various architectures. Okay. And we can also use that to associate which threads are doing work on behalf of a certain request, using a request ID, and then we'll occasionally use that in concert with the profiler to associate metadata that is known at the end of a request but may not be known at the beginning, for example the latency, etc. So that's something we have working internally, more or less, and we've gotten requests for the open version of this, so I think that would be useful as well. So who did it and what's it called? Andrii did the original; it's called strobemeta. We might have some self-tests that are related, but he was definitely on my case in the last year about open-sourcing it. How do you bind the request ID to the thread ID?
I missed something: the request ID to the thread ID? Well, in this scenario you don't need... well, it's not officially open source, but it is in the self-tests, in a not-even-reduced form. So if you look at it, it's doing all sorts of crazy work with different data structures, and it does look sort of like thread local storage, and it knows more or less what things are. But it's specific to the application; it's custom, so there is no completely general version. In our specific case, we have some request-context thing of our own. The application will essentially always populate these memory regions, and when the BPF program is starting up, or rather the user-space piece is starting up, it will find the running executables with the specific ELF section that this macro populates. Correct. Yep. Sounds cool. One other thing I haven't explored is: can the hardware help, like processor trace? Processor trace can trace branches, and maybe pull out arguments, if it can somehow keep in memory the first six argument registers, or addresses for them, that I can then look up on the fly from BPF. That might be another way to get at those; I haven't looked at it yet.

Just an idea for another topic on stack traces: off-CPU sampled stacks. The way Netflix is currently doing it is event tracing, using the off-CPU tools in BCC, and the problem is that scheduler tracing can be frequent: you can be doing over a million events a second, you can be doing over two million events a second on a system, and that can start to add up to 1% or 2% of CPU. If we can sample instead of trace, sample off-CPU stacks.
We could take that down to, I'm estimating, 9,000 samples a second: if, say, you're looking at one application with a thousand threads and you're sampling at 9 Hertz, something gentle, then you can really cut this down, which would be great. A problem with the stuff I've played with is that so many of these application threads are idle, doing nothing, and Netflix has stacks that can be a thousand frames deep, so you end up walking these deep stacks over and over and over when they haven't changed. So the idea is to only fetch stacks after wake-ups: find something in task_struct that will tell me whether this thread has budged since I last sampled the stack trace. I think I can use the getrusage accounting, because that gets updated; I think I've tried this before and there was some problem with it, but there's going to be some flag, or else we add a flag; there's going to be some field in task_struct that will tell me whether this thread has woken up and been scheduled since I last looked. The runnable accounting: there is accounting on the scheduler side for how much time has been spent on CPU. Yeah, that's what the getrusage stuff maps to. Yeah, I tried this once and hit a problem, but there's other stuff; there are the times, there's all sorts of stuff in task_struct to go through to find the thing that works. Hopefully it works, and then each time you look at an idle thread, you know if it's woken up or not, and if it hasn't woken up you just use the cached stack ID, and we save that, so you don't have to re-call bpf_get_stack(). Does this sound like a good idea? This is my idea for making this more efficient. Since I've been changing jobs I haven't actually coded it yet, but I was going to use bpf_get_stackid(), and for some reason I thought I could point it at other threads; I now forget why I thought that. Can anyone tell me if I'm headed in the wrong direction? I'll play with it. Last time I really dug into the weeds, I thought this should work.
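The caching idea might look like this simulation, where a context-switch counter stands in for whatever task_struct field ends up being usable as a "has this thread run?" signal; none of the names here are an existing API:

```python
cache = {}   # tid -> (ctxt_switches seen at last walk, cached stack id)

def sample_thread(tid, ctxt_switches, walk_stack):
    """Only walk the (possibly thousand-frame) stack if the thread has
    been scheduled since we last looked; otherwise reuse the cached
    stack id from the previous sample."""
    seen, stack_id = cache.get(tid, (None, None))
    if seen == ctxt_switches:
        return stack_id                 # idle since last sample: cheap path
    stack_id = walk_stack(tid)          # expensive: bpf_get_stackid() etc.
    cache[tid] = (ctxt_switches, stack_id)
    return stack_id
```

With mostly-idle threads, almost every sample takes the cheap path, so the cost per sampling pass is far below a thousand full stack walks.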
It should totally work; we don't need to change BPF. I can point it at other threads and then do off-CPU sampling, also known as wall-time profiling, and we're done, efficiently. So that's the plan. Something else in terms of BPF observability, since we meet every six months and talk about the same things: is anyone actually working on the uprobe speed-up stuff that we keep talking about? Hands up if anyone is doing the uprobe speed-ups. Nobody, okay. Because people keep emailing me saying, "Brendan, I'm trying to use uprobes in this application and it's really slow," and it's like, yes... have you volunteered to do the work? So far I haven't convinced anyone to do the work. We know we've got to get uprobes to be faster, because that opens the door to all sorts of things. The second one: is anyone working on heap tracing? If you want to put uprobes on Java or other JITted runtimes, you need to be able to point uprobes at the heap segment and trace methods there, and that currently isn't allowed, because it's not an executable segment and it doesn't have a backing store. Is anyone working on fixing that right now? Someone at Meta really wants this; when we discussed it a few days ago I poked him about it. We really want to do heap tracing into JITted code. I heard Meta is going to release this soon; great, looking forward to using it. I mean, if you're interested in this kind of thing: I did ask the Oracle JVM developers a while ago to give me the capability to pause the C2 compiler. So if we can pause the C2 compiler, then once this is done
we can then put uprobes on Java methods, and once the speed-up is done it will actually make sense, and then we can do all sorts of things: instead of using Netflix's bespoke collection of various Java tools, it can all just be BPF doing memory leak detection and profiling and all sorts of stuff. It can just become an eBPF program; those are some of the building blocks. The last thing: stack walking is great when it works. Is anyone working on an eBPF way to merge all the different stack walking methods? Not all of them, but some of them. You won't believe this, but someone from Meta, I think about a year and a half ago, did a trial of merging LBR and frame-pointer stacks, and it seemed to work pretty well. I don't know if we kept it turned on, but it was at least iterated on quite a bit, so I can check the state of that. That would be good; that'd be awesome. Go Meta. And also, why not, you're here as well, you may as well do the uprobe speed-up, right? Let Meta do all the things.

I was going to talk about tools, but from my perspective there's actually not a lot to talk about: we haven't merged that many new tools recently. That's not a bad thing. To make great tools you need great problems, and we do get various submissions where people have taken, say, some tracepoints and are formatting them in a different way, but there seems to be the absence of a real production problem to solve with it. So we've merged most of the interesting ones already, and not too many recently. I will talk about BCC; I share maintainership with Dave and Yonghong. Are there any particular issues you want to summarize? I mean, it's hard work to tell people, "your tool isn't suitable, you need to do this, you need to do that." It's just ongoing maintenance.
Are there any updates you feel we should briefly mention on the BCC and bpftrace tools? Well, selfish plug, I guess: now we have USDT support in libbpf, so it's possible to convert more tools to purely libbpf. I get so many GitHub emails about the conversions; are there still some that haven't been ported? There are about 41 or 43 tools that are converted, and about 50 that are not. Oh, I thought we were much closer; so we're about halfway through moving. If you've played with BCC, the Python interface is the old one, and we want to move to the libbpf C, BTF, and CO-RE interface. Some tools take heavy advantage of Clang compilation at runtime, so something like trace.py is, I assume, super hard to port to libbpf. Some rely on USDT, and some either no one is actively using, or maybe there are other complications we haven't hit yet. Any other main issues to summarize? Just some general comments about maintaining the tools: for a while I've wanted to find a way to get people who are experts in the specific thing a tool is tracing to take a more active role in maintainership, because, I don't know, Yonghong, if you would agree, but about 90% of the issues and pull requests we get are related to the tools. Specifically, when we get pull requests where people are adding some feature, it's generally about some tool that we're not particularly invested in, so we're inclined to just say yes and not push back, which, over time, is not going to end in a good state. Yes, so the comment was: we want to get the experts to write the tools, because those are the best tools. For any subsystem you've worked on in the kernel, think about what tool you could write for your own subsystem. Like, Daniel, you did, was it the DCTP protocol? So where are dctpstat and dctpsnoop and dctperrors? Aren't you the best person to write them, because you wrote the protocol?
You're the one running it. So, like, I worked on ZFS, so I did the ZFS tools. If you've worked on a particular subsystem, you're in the best position to write the tool, because you know the internal bits, and the bits that matter for performance and debugging. So yes, we need to get more of the actual authors of the code to instrument their own code and come up with performance tools. It's almost like, since we're doing fairly well at getting engineers to provide unit tests when they do pull requests, how about bpftrace programs as well? Yes, do the tests and the instrumentation. So maybe that's something I should start fighting for. I recall another sort of blocker, or inconvenience, for the BCC-to-libbpf conversion: any tool that makes use of stack traces, especially user-space stack traces. That's kind of hard right now, because we don't have a nice library for stack trace symbolization. Capturing stack traces is easy, that's bpf_get_stack(), right? But symbolizing them is a bit of a problem. For kernel symbols we have some helpers; for uprobes it's kind of a little bit harder, especially if you want to handle inlined functions and all that stuff. But guess what Meta is working on... I'm going as quick as I can, let me type it in: kernel and user-space symbolization. Yeah, basically a symbolization library. And you can fit things like ORC and LBR stack traces, non-frame-pointer stack traces, into a similar space, right? Like, how do we get a stack trace, and how do we make it human-comprehensible? But with this, can you define the user-space symbolization and then have the kernel do the work? No, because of DWARF. Well, let's not use DWARF; we've got BTF, we should just use BTF. What about the inlined functions, right? DWARF does a good job of representing them compactly.
It's a complicated format. I'm not an expert, but I don't think so. Okay, we have kernel folks here, right: can we even imagine having inlined-function debug information supported natively by the kernel for BPF programs? If we could come up with a good compact format now... Unlikely. What if you could just have an interface for not just kallsyms but for "this process, here are its symbol tables", something you'd do just for your number one application process, so the kernel can do its own quick lookups? A symbol table is not hard, yes, but then you have to deal with getting the strings out to user space, which is what we're doing anyway. Okay, the things Meta is working on fix the slide.