Good morning, folks, and thanks to Anand and Savita for that wonderful session on CI/CD pipelines as code. Welcome to our session on continuous profiling with Parca. In this session, we'll take a deep dive into the world of continuous profiling and discover how Parca makes getting started with it easier.

But first, let's have a look at this CPU usage metric graph. I bet it strikes a chord with a lot of us in the room today: random spikes that disrupt an otherwise calm baseline. Most of us have run into something like this once or twice in our careers. And whenever it happens, theories start to emerge within the team. One person might attribute the spikes to garbage collection runs; someone else might suggest they're triggered by a problematic code path in the admin users' tasks, and so on. But those are just theories, and what we need is data. Data like this, which gives a detailed view into what was really consuming the CPU during those spikes, down to the functions and the line numbers in the code where the resources were being spent. Imagine how much easier it would have been if we'd had this kind of solid data to back our theories in the first place. That's the power of continuous profiling: it makes performance debugging smoother and more insightful.

So, who are we? I'm Manoj, a software engineer at Polar Signals. I'm an open source enthusiast and a maintainer of Parca, which we're going to talk a lot about today. I'm also the creator of Responsibly App, which is on a totally different vertical from today's talk: it's a dev tool for front-end developers. Pass it along if you have any friends or colleagues who are into front-end development.

Hello, I'm Sumera. I stare at pretty icicle graphs all day; that's my day job. I'm an open source maintainer of the Parca Agent and other projects under parca-dev, and I've been working at Polar Signals on this product for almost two years now. I'm very excited to share what we have built with you all. Now it's over to Manoj; he'll carry on, and I'll be back in a bit.

So let's jump straight into the heart of today's discussion: profiling. Profiling has been around since the 60s; it evolved alongside modern programming languages. Let's break it down. Profiling is all about analyzing a program's execution, with an emphasis on measuring how resources are being used by the program. By resources I mean CPU, memory, I/O, and so on. It produces reports that are detailed down to the line number in the code, showing where the resources are being spent.

There are different approaches to profiling, and today we're going to focus on sampling profiling. Sampling profiling is an approach where we observe a program's execution for a fixed amount of time, say 10 seconds. During those 10 seconds, we sample the program's function call stack at a constant interval, say 100 times per second. Over the course of those 10 seconds, that gives us 1,000 samples to work with, and that's a good enough number to get an idea of what was really happening inside the program.
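As a concrete aside: Go's built-in CPU profiler is exactly this kind of sampling profiler; by default it captures stacks roughly 100 times per second. Here's a minimal, self-contained sketch of a 10-second sampling window, with busyWork standing in for real application code:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

// busyWork stands in for the application code being profiled.
func busyWork() {
	s := 0
	for i := 0; i < 1_000_000; i++ {
		s += i
	}
	_ = s
}

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// StartCPUProfile samples the program's stacks roughly
	// 100 times per second (the Go runtime's default rate).
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Observe for a fixed 10-second window:
	// ~100 samples/sec x 10 s gives roughly 1,000 stack samples.
	deadline := time.Now().Add(10 * time.Second)
	for time.Now().Before(deadline) {
		busyWork()
	}
}
```

Inspecting the resulting cpu.pprof with `go tool pprof` then shows exactly which functions those roughly 1,000 samples landed in.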
And since sampling profilers don't continuously monitor each and every change in the program, their overhead is very low, which makes them suitable for a lot of the common workloads we run in the cloud.

So why would we want to profile our applications? First, to make our applications faster. Say you run an e-commerce store. In the e-commerce world, the golden rule is pretty clear: your sales numbers are directly tied to how fast your website is. Profiling comes in and helps you identify the bottlenecks, so you can address them and potentially increase your sales numbers. Second, to cut costs on your infra bills. It's common for applications to spend a sizable share of their resources, sometimes on the order of 30%, on easily optimizable code, which we don't optimize because we have no insight into where the waste is. With continuous profiling, you get clear insight into which parts of the code are wasting resources, so you can strategically apply optimizations to those and, in turn, cut your cloud bills.

So, as we just saw, profiling is an incredible tool, but traditional profiling has its limitations. It's momentary: while a profiler is running you get samples, but once you stop it, you no longer know what's happening in the application. It's also very manual: you face an issue, you set up profilers, start collecting profiles, then stop them, and you have to repeat the whole cycle the next time you run into a performance problem. And it's not easy to get profiles from production; you have to SSH into the instance, or port-forward to your local machine, and extract the profiles from your application. All of this is both time consuming and error prone when done against production applications. Given how powerful profiling is, that developer experience is far from ideal, and we wanted to solve that problem somehow. That's where continuous profiling comes into the picture.

Continuous profiling, as the name says, is the act of continuously collecting profiles from your applications over a duration of time, or over the lifetime of the program, so that you have a constant trail of what's happening within your applications. And since, as we already mentioned, sampling profiling is very low overhead, we can afford to do this all the time. As with any other observability data, you never know when you'll need it, so it's always good to be collecting it at low cost.

So how does the whole thing work? We employ sampling profilers to continuously collect profiles from all the processes running on a node, and we tag the data with the metadata that later allows us to slice and dice it and pull out profiles for exactly the workload we need to look into.
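Here's a minimal Go sketch of that loop, an illustration of the idea only, not Parca's actual implementation. The label values are invented, and where a real agent would ship each profile to a server, this just logs it:

```go
package main

import (
	"bytes"
	"log"
	"runtime/pprof"
	"time"
)

// LabeledProfile pairs one sampled profile with the metadata that
// later lets us slice and dice by workload.
type LabeledProfile struct {
	Labels    map[string]string // e.g. node, namespace, pod, container
	Timestamp time.Time
	Data      []byte // serialized pprof-format profile
}

// collectProfile records a sampled CPU profile for one window.
func collectProfile(window time.Duration) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	time.Sleep(window) // sampling happens in the background
	pprof.StopCPUProfile()
	return buf.Bytes(), nil
}

func main() {
	// Hypothetical metadata; a real agent discovers this per process.
	labels := map[string]string{"node": "worker-1", "pod": "checkout-api-0"}
	for {
		data, err := collectProfile(10 * time.Second)
		if err != nil {
			log.Fatal(err)
		}
		p := LabeledProfile{Labels: labels, Timestamp: time.Now(), Data: data}
		// A real agent would ship p to the server here; we just log it.
		log.Printf("collected %d bytes of profile data, labels=%v", len(p.Data), p.Labels)
	}
}
```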
In addition to solving the developer experience problem of traditional profiling, continuous profiling brings a bunch of benefits on top. First, your development environment isn't production. Even though we strive so hard to make our development environments as close to production as possible, we simply can't replicate the same workloads locally, and so we miss out on the crucial data we only get from production workloads. Continuous profiling helps close that gap.

Second, data and context over time. Once you employ a continuous profiler, you have a profile trail over time, across events like rollouts and production incidents. Whenever a deployment happens and your performance numbers go south, you can compare the profiles from before and after the deployment, see exactly which part of your application has degraded, and fix it immediately, eliminating the regression against your performance goals.

The gist is: it is absolutely possible to profile your production workloads all the time. We run our continuous profiler in our own production environments, and so do our users, all the while. And by doing so, in addition to the performance insights, day to day the profile data gives us all kinds of other insights into our own code that help us solve bugs and more.

So how do you tap into these benefits? Parca is our answer. Parca is an open source continuous profiler developed by Polar Signals, and it integrates easily with Kubernetes environments. It runs an agent to collect the profiles, deployed as a DaemonSet, and it's an eBPF-based, zero-instrumentation profiler: it needs no code changes to your application. You just deploy the agent as a DaemonSet and it starts its magic. It discovers each and every process running on the node, attaches profilers to all of them, and collects samples. It also indexes the collected profiles with Kubernetes metadata, like label values, so that later, whenever there is an issue, you can query for the exact key-value pair you're looking for and extract the data for it.

Let's have a quick look at Parca's architecture. It's very simple and straightforward: the eBPF-based agent collects the profiles (a bit more on that from Sumera in a bit) and sends them to the Parca server, where we process the data, doing symbolization and other enhancements, and save it to the profile store. The store is backed by FrostDB, a custom-built, embeddable columnar database that we developed at Polar Signals, which in turn persists the data to object storage. The same path serves the UI: whenever you query something, it sends a request to the query service, which gets the necessary data from the profile store, and we render the reports in the UI.
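To see why labels plus samples fit a columnar layout, here's a rough sketch of the shape of the data: one row per stack-trace sample with dynamic label columns, plus a scan that filters by label matchers the way the UI's query selector does. Purely illustrative; this is not FrostDB's actual schema or API:

```go
package main

import "fmt"

// SampleRow sketches the shape of one row in a columnar profile
// store: a timestamp, dynamic label columns, a stack trace, and a
// value. Illustrative only, not FrostDB's actual schema.
type SampleRow struct {
	Timestamp  int64             // when the sample was taken
	Labels     map[string]string // Kubernetes metadata: namespace, pod, ...
	Stacktrace []string          // call stack, root first
	Value      int64             // e.g. CPU time observed, in nanoseconds
}

// matchRows filters rows by exact label matchers, the "slice and
// dice" that the query selector performs against the store.
func matchRows(rows []SampleRow, matchers map[string]string) []SampleRow {
	var out []SampleRow
	for _, r := range rows {
		ok := true
		for k, v := range matchers {
			if r.Labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rows := []SampleRow{
		{Timestamp: 1, Labels: map[string]string{"pod": "api-0"}, Stacktrace: []string{"main", "handler"}, Value: 125_000},
		{Timestamp: 2, Labels: map[string]string{"pod": "worker-0"}, Stacktrace: []string{"main", "job"}, Value: 98_000},
	}
	fmt.Println(matchRows(rows, map[string]string{"pod": "api-0"}))
}
```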
OK, let's take a quick look at how Parca works and how it helps us solve performance problems. If you're curious, you can go to demo.parca.dev on your own and get a hands-on feel for how it works.

This is Parca's profile explorer. At a high level, we can divide it into three parts. The first is the query selector, where you write queries to get data for a specific workload. And... OK, looks like the demo gods are not with us today. But I still have it running locally; let me pull that up.

OK, so you can see the metrics graph below the selector. Right now we see just one series in the graph, because on my local machine I'm profiling only a single process. But on your production node you'd see one line for each process that runs, so you can see details like how much CPU each process is consuming.

Below that, we have the visualization section, where we can see in more detail what was happening within the application that was taking up the CPU. This visualization is called an icicle graph, and it's the most commonly used visualization for performance data. At a high level, the horizontal space taken by each node, its width, represents how much resource it consumes. The root node represents 100% of the resource usage, since it combines everything within the process, and within it we can see each of the stack traces and how much resource each takes. Here, for example, the scrape loop has taken 16% of the CPU, the server 11%, the runtime 55%, and so on. You can also dive deeper into specific sections: say you want to know what's inside the scrape loop, you click on it and it expands the call stack within, so you can drill down, go into the I/O, and see more detail on each piece.

In the query section, you can use the Kubernetes label values to query for specific workloads. This demo just runs on my laptop, so it doesn't have a lot of Kubernetes metadata, but in production, or anywhere we enrich the processes with metadata, you can query by each and every label that's there.

Another cool thing you can do, as I was mentioning before, is compare profiles from before and after a deployment. With the compare feature you get two metrics graphs, and you select two profiles, one on the left and one on the right, and Parca compares the two. I've selected two points here, and the visualization now shows a differential view of these two profiles. The parts in green improved at the second point compared to the first, and the parts in red degraded in performance. If I compare a point with high CPU utilization against one with less, we see a lot of green, since performance improved between the compared points. And if I compare a low point against a high point, we see more red, since CPU utilization at the second point is higher than at the first.

The other thing is the Targets page, much like the targets page in Prometheus: it shows which agents are running and sending data to the Parca server.
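As an aside, the icicle graph we just walked through is essentially a trie of call stacks: every sampled stack is merged into a tree, and each node's cumulative sample count relative to the root determines how wide it is drawn. A minimal sketch, with toy stacks echoing the demo's percentages:

```go
package main

import "fmt"

// Node is one box in an icicle graph. Its cumulative sample count,
// relative to the root's, determines the width it is drawn with.
type Node struct {
	Name       string
	Cumulative int
	Children   map[string]*Node
}

func newNode(name string) *Node {
	return &Node{Name: name, Children: map[string]*Node{}}
}

// addStack merges one sampled call stack (outermost frame first)
// into the trie, adding its value to every node along the path.
func (n *Node) addStack(stack []string, value int) {
	n.Cumulative += value
	if len(stack) == 0 {
		return
	}
	child, ok := n.Children[stack[0]]
	if !ok {
		child = newNode(stack[0])
		n.Children[stack[0]] = child
	}
	child.addStack(stack[1:], value)
}

func main() {
	root := newNode("root")
	// Toy samples echoing the demo's numbers; a real profile
	// would merge thousands of stacks.
	root.addStack([]string{"scrapeLoop", "parse"}, 16)
	root.addStack([]string{"serveHTTP", "handler"}, 11)
	root.addStack([]string{"runtime.gc"}, 55)
	root.addStack([]string{"other"}, 18)

	for name, c := range root.Children {
		fmt.Printf("%-11s %d%% of root\n", name, 100*c.Cumulative/root.Cumulative)
	}
}
```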
And talking of agents, I'm going to hand over to Sumera now to walk through all the magic the agent does.

I'm back. So, you saw all those pretty stack traces Manoj was just showing you. Where does all of that information come from? That's what the agent does. For any binary you're running, it collects the metadata: the name of the process, the name of the pod, the name of your cluster, the process ID. It discovers all the binaries, collects information about them, and turns that into stack traces; that's where we use eBPF under the hood, in kernel space. Then it compresses them into a very space-efficient, optimized format and sends them to the Parca server for visualization.

But where's the magic? It's zero instrumentation. You just deploy it. All of it is automated; you don't have to do any scripting. You just run the binary with one or two flags and you're good to go. More magic: it's very low overhead. We're actually doing some very low-level things, like reading registers at some points, but we do them using eBPF, so the overhead is extremely low. It barely takes up any CPU, and it doesn't affect anything else you're running on your machine or your clusters; it doesn't interfere at all.

Next: how does target discovery work? All the information you just saw in the stack traces, the agent discovers it from the targets, the binaries, and builds a list of them. If you look at the bottom of the screen, you can see a process ID, 39174, in the Parca agent's CPU profile, and the part highlighted in green is the name of the binary: it's VS Code, which I think is something everybody is familiar with; it's a code editor. Target discovery in the agent is system-wide: if it's a process, we profile it, whether it's containerized or bare metal, and we discover all the binaries associated with it. This means you can see stack traces for everything running on your system, down to the last system call.

So, I was talking about VS Code. This is what VS Code looks like under the hood. It's intentionally an older image from a few months ago, from when I was developing support for JITted stacks. VS Code is an Electron app, so it runs the V8 engine under the hood, and we were profiling just-in-time compiled stacks and developing support for them. All of a sudden, I realized there was some WASM, WebAssembly, code running down there: a WASM wrapper and compiler kit, I think. And that was my galaxy-brain moment. I use VS Code extensively, and to see what's going on underneath and discover something cool like WASM, which I wouldn't have known about otherwise, really made an impact. It hit me that there's real-life impact to what I'm doing. That was a very aha moment for me.

So the next question is: how do you go from this VS Code binary to seeing all the functions in an icicle graph? I'm not going into the super-detailed stuff here, but: you have a binary, any application. I've used Go binaries as an example, but this extends to approximately all Linux binaries. Under the surface, binaries look a bit like this: there's executable code, and there's other information about memory addresses, functions, and memory mappings. On Linux these are known as ELF binaries, and the information for reading them, the unwind and debug information, is encoded in a format called DWARF. Suffice it to say that DWARF is a four-to-five-hundred-page format specification that I will not go into right now, but we've got your back. What the agent does is take all of this information from the binaries, do some very cool things reading and interpreting it, read the registers, and turn it all into these very handcrafted, artisanal stack traces. We really did build a lot of these unwind tables by hand: we did the calculations on paper first, and then automated them in the code. From that we get the return addresses of the functions, which gives us a call order: the first function called this one, which called the second one, and we have the memory address of every function along the way.
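For a taste of what's inside those binaries, here's a small sketch using Go's standard debug/elf package to list the relevant sections of an ELF binary (run it on Linux):

```go
package main

import (
	"debug/elf"
	"fmt"
	"log"
	"os"
)

func main() {
	// Open an ELF binary: here, the running program itself.
	f, err := elf.Open(os.Args[0])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// .text holds the executable code; .eh_frame (and the
	// .debug_* sections, when present) carry the DWARF-encoded
	// unwind and debug information an unwinder works from.
	for _, s := range f.Sections {
		switch s.Name {
		case ".text", ".eh_frame", ".debug_frame", ".debug_line", ".symtab":
			fmt.Printf("%-12s addr=%#x size=%d bytes\n", s.Name, s.Addr, s.Size)
		}
	}
}
```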
Once we have those addresses, we compress them and send them to the server side, where we attach all the function names, which we have also extracted from the binary information. It's simply less data, and more optimal, to send compact stack traces to the server and resolve the function names there. And as you can see, a stack trace is essentially just return addresses in a linked list. So that's how we do that.

OK, but next question: how often do we do this? We want to profile continuously, right? We want to see things happening dynamically, in real time, because we want all the data. We can't wait until an OOM kill happens, the entire system crashes, and everything hangs, and only then try to find out where the bug was. We want to see the bug as it's eating up resources, so that we can do something about it, or at least troubleshoot later what went wrong.

So how often do we profile? We take samples 19 times every second, so at 19 Hz. The eBPF program is attached as a perf-event hook. Perf events are hooks in the Linux kernel that let us look at what's going on inside it: all the syscalls, all the processes, basically every CPU event that has to do with performance. The eBPF program is attached to that, and 19 times per second it takes a sample, collects the data, and puts it into eBPF maps, enriching them as it goes. Then, every 10 seconds, we move whatever information is in those eBPF maps from the kernel to the Parca agent in user space. So you effectively have a continuous stream of data, all the time, and it's very optimized; we do it at very low overhead.

And there's a special reason we use 19: things just work better with prime numbers. There are CPU interrupts in kernel space, and lots of other periodic activity, and they tend to happen at scheduled intervals. If your sampling frequency is a multiple of those intervals, you get funky interactions and systematically skewed samples. A prime number is not a multiple of anything, so to minimize any lockstep with interrupts and other periodic work, we prefer a prime frequency.
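For the curious, here's roughly what asking the kernel for that 19 Hz timer looks like, via the perf_event_open wrapper in golang.org/x/sys/unix. This sketch only opens the sampling events, one per CPU; the real agent then attaches its eBPF program to each returned fd. It needs root or CAP_PERFMON to run:

```go
package main

import (
	"fmt"
	"log"
	"runtime"

	"golang.org/x/sys/unix"
)

func main() {
	// Ask the kernel for a software CPU-clock event that fires
	// 19 times per second. PerfBitFreq makes the Sample field an
	// event frequency (Hz) rather than a period.
	attr := &unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_SOFTWARE,
		Config: unix.PERF_COUNT_SW_CPU_CLOCK,
		Sample: 19,
		Bits:   unix.PerfBitFreq,
	}
	// One event per CPU, observing all processes (pid = -1).
	for cpu := 0; cpu < runtime.NumCPU(); cpu++ {
		fd, err := unix.PerfEventOpen(attr, -1, cpu, -1, unix.PERF_FLAG_FD_CLOEXEC)
		if err != nil {
			log.Fatalf("perf_event_open on cpu %d: %v", cpu, err)
		}
		fmt.Printf("cpu %d: sampling event open, fd %d, 19 Hz\n", cpu, fd)
	}
}
```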
So now, an overview of what we just saw. The agent looks at every process running on your system, every binary, be it bare metal or container, and takes information from it using eBPF in kernel space. Then it compresses that into stack traces and sends the stack traces over to Parca. Throughout this process we attach a lot of metadata: from the cgroups and your container labels, which pod, which cluster, which node; from the compiler and runtime, which language and which compiler version, like Go 1.20, and whether it's JITted or not; and then some more data from the system, like the name of the binary and where it's located on your filesystem, so it shows you directly, say, /usr/bin/ and the name you usually use to run it from your terminal. On the Parca side, we see the icicle graphs, and those show us how much CPU everything is consuming. And all of this with no code changes required: you just deploy the agent and you deploy the server. It's all of two lines, just the commands, and you have it.

So now: what compilers and runtimes do we support? This is important; it's important to know what you can profile and what you can't. We have full support for all natively compiled languages: C, C++, Rust, Go, and others. Very recently, a few months ago, we added support for just-in-time compiled languages. So with perf maps or jitdump you can do C#, and Erlang, and by Erlang I mean anything that uses BEAM underneath. There's the Java virtual machine, so anything that uses the JVM: Java, but also Clojure, say. There's Julia, so if anyone's into data science and using Julia, you can look at which functions are using how much CPU. And there's Node.js, which is how the VS Code profiling worked, and how profiling Firefox and other browsers will work too. The Parca UI you just saw, we profile that as well, JITted code and all. And very recently, my co-workers Kemal and Javier added support for Python and Ruby, so you can also profile all the fancy, exciting AI workloads that we are running today. Definitely check it out.

And we support all the architectures: x86-64 and ARM64. The agent needs Linux to run. x86 is super simple, and even though we actually use Macs at work, all we need is a Linux VM. We don't need to emulate x86 anymore; we just use the Mac's ARM hardware with a Linux virtual machine and it just works.

And we do all of this, all the compiled languages and everything, with or without frame pointers. What you need to know about frame pointers: there's an extra register at the hardware level that, by convention, tracks the current stack frame, but a lot of compilers and binaries strip that out. With frame pointers, unwinding the stack is very easy. Without them, as I mentioned with all that ELF and DWARF information from the binaries, we have to actually calculate the stacks by hand. That's not something that's easily done with low overhead and zero instrumentation; it took us some six extra months to develop. And initial support is fine, but every compiler has an edge case. So when we say we want to fully support this, we're really looking for end users and everybody in the community to try it out and tell us what the edge cases are, because it's impossible to replicate every cloud environment on our own machines. Covering every edge case for every compiler is a very high goal for us to achieve, but we've been pretty good at getting there so far, and we will keep going.
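To illustrate what frame pointers buy you, here's a toy version of the frame-pointer walk on x86-64. When compilers strip frame pointers, this chain simply doesn't exist in memory, and the agent has to recover the same return addresses from the DWARF unwind tables instead:

```go
package main

import "fmt"

// unwind performs the frame-pointer walk: on x86-64, each frame's
// saved RBP points at the caller's frame, and the return address
// sits one word (8 bytes) above it:
//
//	[rbp]   -> saved caller RBP
//	[rbp+8] -> return address into the caller
//
// A real unwinder reads these words from the traced process's
// memory; here a map stands in for that memory.
func unwind(mem map[uint64]uint64, rbp uint64) []uint64 {
	var returnAddrs []uint64
	for rbp != 0 {
		returnAddrs = append(returnAddrs, mem[rbp+8])
		rbp = mem[rbp] // hop to the caller's frame
	}
	return returnAddrs
}

func main() {
	// A fake stack with three nested frames; all addresses made up.
	mem := map[uint64]uint64{
		0x7000: 0x7100, 0x7008: 0x401a10, // leaf frame
		0x7100: 0x7200, 0x7108: 0x4015f0, // middle frame
		0x7200: 0x0, 0x7208: 0x401100, // outermost: saved RBP is 0
	}
	fmt.Printf("return addresses, leaf first: %#x\n", unwind(mem, 0x7000))
}
```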
What kernel versions do we support? Technically we also support some versions from 4.19 and above, I think, but we always prefer 5.3 and up, and even more preferable is the latest Linux version. Right now I think the latest upstream is 6.3, maybe even 6.4. Some kernel versions, like the late 5.x series through 6.2, have one or two eBPF bugs in the kernel itself. They've all been fixed upstream, but upstream moves really fast, and a lot of cloud providers and people, sometimes even me, just don't update their machines or operating systems, and a lot of distributions don't backport fixes. We have workarounds for those bugs, but it can get a bit tricky, and we want our end users to have a very smooth experience. So always update your machines, people.

I've also added some links here about how compilers actually affect the icicle graphs: edge cases around C++, Node.js, and the funky things that some virtual machines do; that's mainly what they cover. They're all blog posts, do check them out.

And this is the roadmap we have for Parca. The query language you just saw, we want to extend its capabilities with things like autocomplete. More languages: we want to support PHP; there have also been asks for Perl, so we want to support Perl, and OCaml, and basically every language we can, including Mojo, a very up-and-coming language for AI workloads. And mostly we want to build a community around this. It's been two years, but we're still very new as far as open source communities go, and we're doing this for the long run; we want to make continuous profiling a regular part of your observability stack. So please join us at the Parca office hours, join us on the Discord, try out the product and the tooling, and tell us how you feel about it and how we can make it even easier for you.

And that was my talk, thank you. I'm sorry, I went maybe five, ten, no, five minutes over time, I think. So I'm sorry for keeping you from lunch, and thanks for being very patient listeners. That was all. Thank you.