All right. Hey, everybody. Welcome to our KubeCon talk. Today we want to talk about Conprof and profiling in the cloud native era. And actually, we don't want to talk about Conprof, but more specifically about Parca, which is the next version of Conprof. I'm Matthias Loibl, a senior software engineer at Polar Signals. Next to maintaining Parca, I also work on Thanos and the Prometheus Operator, and recently started hacking on Pyrra, which is an SLO management tool for Prometheus. And with me today is Kemal.

Hello, everyone. My name is Kemal Akkoyun. I'm a senior software engineer at Polar Signals as well; I just recently joined. Previously, I was working for Reddit. Besides working on cool things at Polar Signals, I'm also a maintainer of the Thanos project, Prometheus client_golang, kube-prometheus, and now Parca.

So today we are going to talk about continuous profiling. But before that, we need to do some groundwork and talk about profiling itself. Profiling is as old as programming; according to our research, there are academic papers on it dating back to the 1970s. So what is profiling? Profiling is a form of dynamic program analysis, where we take measurements about the space and time complexity of a program and about the usage of instructions, their frequency and duration overall.

So why do we profile, and what can we do with these profiles? We can look at CPU profiles and determine where a program is spending most of its time, or look at how it uses memory: what it is allocating, how many times, and so on.

Today we focus on sampling profilers, because they are the most feasible to build upon for the processes we run today. How it works: say we want to profile a process for 10 seconds. We periodically capture snapshots of the process's state and aggregate all of that data. Since the sample snapshots are not that frequent, this has really low overhead, especially for CPU profiles. To dig deeper into this, there is also a nice paper from Google you can check out.

So today we're going to talk specifically, a little bit, about pprof. pprof is a profiler with cross-language support, and that's why we picked it. pprof is a tool from Google, and it specifies an open format in the form of Protobuf. It reads a collection of profiles from processes, either from an HTTP endpoint or from a local file in that Protobuf format. The tool can also visualize profiles, for example as graphs using graphviz dot; I'm sure you have seen a couple of demos of how to use pprof. pprof already has support in a lot of languages. Some of it is really good, like Go, because pprof also comes from Google, and some of it is just getting there. But since pprof is an open format, any language runtime can support it.

So let's dive into code and see these things in action. On the left you see a piece of code; it just does some iterations and some function calls. While this program is running, we do sampling profiling, and from the samples we end up with something like what you see on the right. The format on the right is called a folded stack trace; it comes from Brendan Gregg's flame graph tooling.
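To make the folded format concrete, here is a hypothetical example (these are invented function names, not the exact ones from the slide): each line is one unique call stack, leaf frame last, followed by how many times that exact stack was sampled.

```
main;iterate;computeA 12
main;iterate;computeB 31
main;report 5
```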
It's super easy to understand because it's just that simple: each line represents a call stack, and whenever you see that stack, you put an entry in there and count them all up. But it contains a lot of redundant information. That same folded stack trace can be represented in the pprof format as an aggregation of locations plus a count of how many times we have actually seen that stack trace. To understand the pprof format in more depth, you can check out a blog post from Matthias on the Polar Signals blog; the link is in the slides.

We already mentioned that pprof can fetch profiles from an HTTP endpoint. On this slide you can see how you can instrument your application to expose those profiling endpoints. It is super simple in Go, because it's already built into the standard library: you just import the net/http/pprof package, register those handlers, and the endpoints are exposed from your process, as shown in the snippet below. The same goes for another popular language, JavaScript: with the right libraries it's also super convenient to add. You just register an endpoint, and pprof can collect those profiles and also serve them.
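As a concrete version of the Go side of that, here is a minimal, self-contained example (the port is arbitrary; the /debug/pprof/* paths are the defaults of net/http/pprof):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side-effect import: registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// The profiling endpoints now ship alongside whatever else the process serves,
	// e.g. http://localhost:8080/debug/pprof/heap or /debug/pprof/profile.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With this running, `go tool pprof http://localhost:8080/debug/pprof/profile?seconds=10` collects a ten-second CPU profile from the live process.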
So profiling can be a really incredible tool, because we can trace pretty much any resource usage we have, but it has some problems. One of those problems is that it is momentary: you fetch a profile for a single point in time, and whatever happens to be in memory or on the call stack right then is all you get. And to do even that, you need a manual process: you need something to actually fetch the profile from the exposed endpoints. Because of those reasons, there are a lot of ad-hoc workflows; nothing is really automated, or there are some organization-specific scripts laying around and you just use them. The thing that fixes this, and the actual topic of our talk, is continuous profiling.

Yeah, and continuous profiling is the foundation for what we're doing, obviously. Why is it important? Just to give you another example: we might watch a process run, and then, looking at its memory, all of a sudden there's this drop. If you've seen this before, you know it's the infamous OOMKill, but what happened there, we don't know. So what we would really love is to use pprof to create the profile samples, to sample every so often, to do it with low overhead thanks to the sampling, as Kemal explained, and then, additionally, to get hold of the profiles from right before these OOMKills happen. And all of that should be done automatically rather than by hand; we don't want the manual toil.

So yeah, enter continuous profiling. Again, just to visualize it: we want to do this over time. We want to collect heap profiles and allocation profiles, but also, slightly more infrequently, CPU profiles. We can then pick any of these profiles at any given point in time and look at what was going on. Storing a sample is cheap, so again there's not much overhead, and by storing it we can add some more indexed metadata, for example the container identity, so we can see which data center or which container a profile came from. And then we can use a query language similar to PromQL, like Prometheus has, to fetch exactly the set of profiles we are interested in. That's why we built Parca, or even before that, Conprof. But again, Parca is the next version of Conprof, and Kemal is going to talk you through it.

Parca is a new open source project; it's a continuation of Conprof, as Matthias already mentioned. It's vendor-neutral and has an open governance model, and since we just recently built it, we're still waiting for contributions, so if you're into this, check it out; contributions are very welcome. Parca itself is heavily inspired by Prometheus. All of us come from the Prometheus community, so we carried over a couple of things that work well for Prometheus, such as shipping a single statically linked binary to ease the operational work. We carried over the multi-dimensional label model, so you can share these labels with Prometheus; Parca scrapes and populates all those labels itself as well. It uses the same service discovery, and it has a cool new built-in storage.

In addition to Parca itself, we also created a new thing called Parca Agent, and this is one of the major differences between Conprof and Parca. The agent is another way to discover the workloads running on your system. It uses eBPF: it finds the cgroups running on your system, and for those cgroups it figures out where the CPU, memory, or I/O resources are being used. It captures the current stack traces many times per second and creates an analysis from that. You don't need to change any code; you just drop Parca Agent onto your nodes, and it starts collecting the profiling data and sending it over to the Parca backend.

Here is a high-level overview of this profiler. It discovers cgroups from a target provider, which could be any container runtime, for example Docker or CRI-O. For each of those cgroups it loads a little eBPF program and starts collecting data, which lands in an eBPF map. The agent reads the data from that eBPF map, transforms it into the pprof format, extracts the symbolization information, sends the data over to the server, flushes the buffer, and starts the same cycle again. This is a low-frequency cycle, and that's why it doesn't add a lot of overhead to running systems. If you want to learn more about Parca Agent, there is another talk from Frederic today at 5:25 pm, where he goes really deep into how we actually collect the data using eBPF.
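Before the demo, here is a rough sketch of that collection cycle in Go. Everything here is hypothetical, with invented names and stubbed-out functions, purely to make the described pipeline tangible; the real implementation lives in the parca-agent repository and does its sampling in-kernel via eBPF.

```go
package main

import "time"

// Invented stand-in types for illustration only.
type cgroup struct{ id string }

// stackCounts maps a folded stack trace to the number of times it was sampled.
type stackCounts map[string]int

type profile struct{} // stand-in for a pprof-shaped profile

func discoverCgroups() []cgroup            { return nil } // from Docker, CRI-O, ...
func readStackCounts(c cgroup) stackCounts { return nil } // drain the per-cgroup eBPF map
func toPprof(s stackCounts) profile        { return profile{} }
func symbolize(p *profile)                 {} // attach symbolization information
func send(p profile)                       {} // ship to the Parca server

func main() {
	for {
		// The expensive part (sampling) happens in-kernel; this userspace loop
		// only drains maps and ships data, which keeps the overhead low.
		for _, cg := range discoverCgroups() {
			counts := readStackCounts(cg)
			p := toPprof(counts)
			symbolize(&p)
			send(p)
		}
		time.Sleep(10 * time.Second) // low-frequency cycle
	}
}
```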
So let's actually see Parca in action. Parca is running on my local machine and scraping itself; there is also a Parca Agent running on the same machine, discovering the cgroups on the system, scraping them, and sending the data to the local Parca. Let's have a look at the allocated space. Here we can see, over an hour, how Parca is allocating bytes. We can pick a profile from nearly an hour ago and see which function is allocating the most bytes. But all in all, a single profile doesn't give us a lot of information. What we can do is compare profiles: let's pick a profile from an hour ago and a more recent one, and now we can compare the two. This computes a diff of those profiles and shows how many bytes we are allocating in which function, all color coded; the redder it is, the more bytes are being allocated there. Another thing we can check out is the CPU samples from the agent itself. As you can see, the agent is scraping the cgroups running on my local machine. We can pick one of them, say the Parca Agent service itself, look at the profiles scraped for it, and merge them to see what's been going on over an hour: what you see is the actual time spent by Parca Agent over this past hour.

Okay, that was a fantastic demo. So let's talk about the storage, what really makes Parca Parca, and why we didn't continue with Conprof as a project. Previously, Conprof took the profiles as they came in, individually gzipped, and stored them in a slightly modified Prometheus time series database that could store byte slices instead of float64s. That was fine to get a proof of concept out of the door, but the compression wasn't great. During a talk at PromCon earlier this year, I explained how we improved the compression by storing the profiles uncompressed and then compressing them together in groups, which got us between 20 and 50 percent of improvement. But it wasn't that much, and it only addressed disk usage to begin with; we wanted to improve computation and throughput and everything, so we needed something better. That's why we created Parca's TSDB, written from scratch. It is still inspired by Prometheus; what was true for the service discovery and the scraper is true for the storage as well. It uses a separate metastore for the metadata, it handles stack traces in the storage as a first-class citizen, and it has different chunk encodings for the various kinds of data. To walk you through the entire architecture, we're going to look at it step by step. First of all, what happens when a write request comes in? That is what Kemal is going to talk you through.

Yes. Right now, Parca can obtain profiles either by scraping them or by receiving a write request from an agent; either way, it ingests the data. So let's see what the life cycle of a write request looks like. A write request is represented as Protobuf, as you see on the screen: we have the raw profile generated by pprof, and we attach to it the metadata, the label set from the service discovery that is already built into Parca. From that request we take the raw profile, and we parse and validate it using pprof itself; a minimal example of this parsing step follows below.
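A small, self-contained sketch of that parsing step, using the upstream library that pprof itself provides (the surrounding server plumbing is omitted, and the file name is just an example):

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	raw, err := os.ReadFile("heap.pb.gz") // a raw profile, e.g. from a write request
	if err != nil {
		log.Fatal(err)
	}

	// Parse handles both gzipped and uncompressed pprof protobuf payloads,
	// and validates the profile in the process.
	p, err := profile.Parse(bytes.NewReader(raw))
	if err != nil {
		log.Fatal(err)
	}

	// Samples reference locations, which reference functions and mappings —
	// exactly the redundant metadata that gets deduplicated into the metastore.
	for _, s := range p.Sample {
		for _, loc := range s.Location {
			for _, line := range loc.Line {
				if line.Function != nil {
					fmt.Println(line.Function.Name, s.Value)
				}
			}
		}
	}
}
```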
From that parsed data, we start building an in-memory representation inside Parca. The profile struct in memory looks like this; in this format, the samples, mappings, locations, and functions are the relevant parts we are actually looking at. When actual data arrives in this schema, it looks like this: a sample value with metadata information attached to it. From that, we build a model in memory and ingest the metadata into our freshly built metastore, and by doing so, we eliminate the redundant meta information the profiles carry. This metastore is currently implemented in SQLite, an in-memory SQLite that ships inside the single binary, but it is built to work with any SQL database.

After we strip that metadata from the pprof profile, we end up with a bunch of location IDs and the sample values corresponding to those location IDs. As you can see, those location IDs actually form a tree, and that's exactly what we do: we build a tree out of the location IDs and store the sample values as nodes in that tree. This is how we store the data in the Parca storage.

Exactly. Okay, so how do we actually append those profile trees into the Parca TSDB? First of all, we create an appender based on the label set; that label set gives us a specific series, or creates it if it didn't exist yet. Once we have that appender, we give it the profile to do its magic. Before we append the tree and the values themselves: every profile comes with a timestamp, a duration, and a period, so some metadata. The timestamp is really what we store only once, and then everything else works off of the index of that specific timestamp; that is true for the values as well, later on. Because timestamps are pretty much monotonically increasing, we can use a double-delta encoded chunk to store them efficiently, and the durations and periods are often very repetitive, so we can use run-length encoding for those, to really store as little data as we have to.

Okay, cool. Now that we've taken care of creating and getting the series, and storing the timestamp, duration, and period, what are we actually doing with the profile tree, with the individual stack traces, and how do we store them over time? As Kemal said earlier, we get this profile tree with the stack traces and the corresponding values, and we take those as a struct. These profile trees have a root, and each node has a location ID, flat values, and cumulative values. Those flat and cumulative values are tree value nodes with a specific value and a key; the key uniquely identifies the node, again via the location ID but also via the labels, which can differentiate what a value really belongs to. We can use those keys to walk the tree and ingest the values: we walk the tree, and for every location key we see, we store the value. For example, for the zero key, which is always the root, we store the cumulative value 46, and then we walk down the tree along the stack and store the other values. I'm only looking at the cumulative values here and ignoring the flat values, but it works the same for them as well.

Next, another profile tree gets created, and we want to append that. In this case we get the same profile tree again, so we just look at the individual keys and append the values to those chunks. Now it gets a bit more interesting: we get a profile tree where one node is missing. We append all the values to the keys that we've seen, and we simply don't append anything for the key that didn't show up. And another edge case, or rather another use case: we see a stack trace that didn't show up earlier. Remember, we only store the timestamps once, and we need to work off of the index for the durations, the periods, but also all these values; we can't just store the timestamps and the values together. So here we need to ingest the value 11 at index three, and we do that by first appending three zeros, so that the 11 lands exactly at index three.
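Here is a tiny, hypothetical sketch of that sparse appending. The real Parca chunks are encoded, not plain slices, and `sparseColumn` and `appendAt` are invented names; this only illustrates the indexing idea.

```go
package main

import "fmt"

// sparseColumn is a toy stand-in for a per-stack-trace value chunk:
// values are addressed by the index of the shared timestamp column.
type sparseColumn struct{ values []int64 }

// appendAt backfills zeros so that v lands at the timestamp index idx.
// On read-back, the zeros mean "no samples seen" and pprof ignores them.
func (c *sparseColumn) appendAt(idx int, v int64) {
	for len(c.values) < idx {
		c.values = append(c.values, 0)
	}
	c.values = append(c.values, v)
}

func main() {
	var c sparseColumn
	// The stack trace first shows up at the fourth scrape (index 3),
	// like the value 11 in the example above.
	c.appendAt(3, 11)
	fmt.Println(c.values) // [0 0 0 11]
}
```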
Now you might be thinking: if we're going to read this back, what about the missing values right there? That is called sparseness. We actually don't store anything for those nodes, but when reading back, we pretend they were zero, and these zeros are basically ignored by pprof; they don't do anything. So even for stack traces that we only see every now and then, we can still be smart about storing them.

To recap: the profile store itself works on top of standard pprof and creates this more efficient representation. It stores all the metadata uniquely, in the SQLite database for now, and then appends to a TSDB that is still a lot like Prometheus, except that it works on top of the profile tree, and every chunk for the different types of values also works just like a Prometheus chunk.

Okay, cool. Now that we've seen how to store profiles in Parca's TSDB, how do we actually query things? Again, we use gRPC, and we can send a query request. Next to the mode, which specifies whether we want a single profile, a diff, or a merge, we need to give it, in our case, the single profile we are interested in. The single-profile query, which you can see at the very bottom, takes a time as a timestamp and a query. So what does that look like? The time is just a Unix timestamp, and the query is almost like a PromQL query, like you know from Prometheus: you give it some label matchers, and it will match all the series you're looking for. In memory, the TSDB then creates a query that looks from five minutes before the timestamp you want to five minutes after, and selects all the series based on the selector you've given. In the first for loop, it iterates over all the series that were matched, and in the second for loop, it iterates over each series's timestamps until it finds a timestamp that is equal to or higher than the requested timestamp. Once we hit that condition, we can immediately return the profile.

Now, the Parca TSDB has become so much more efficient because we don't store the metadata next to all the actual values anymore. Because of that, diffing, which is taking two profiles and subtracting the values, has become basically just math, and once we get the results of that, with the location IDs, we can symbolize them and build what you see in Parca. The same is true for merging, which is taking two or more profiles and adding up all the values: that too has become just taking all the chunks that fall into the time frame, summing everything up, returning that, symbolizing it, and there you have your profile. That has really sped up everything, and the entire reason we built the Parca TSDB has proven valuable for exactly this.
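A rough sketch of that single-profile lookup, with invented toy types standing in for Parca's real series and iterators:

```go
package main

import "fmt"

// series is a toy stand-in: just the shared, once-stored timestamp column.
type series struct {
	timestamps []int64
}

// profileAt would reassemble a profile from all value chunks at index i;
// here it simply returns the index for illustration.
func (s series) profileAt(i int) int { return i }

// selectSingle walks every matched series and returns the first profile whose
// timestamp is equal to or higher than the requested one.
func selectSingle(matched []series, ts int64) (int, bool) {
	for _, s := range matched { // first loop: every series matched by the selector
		for i, t := range s.timestamps { // second loop: that series's timestamps
			if t >= ts {
				return s.profileAt(i), true
			}
		}
	}
	return 0, false
}

func main() {
	matched := []series{{timestamps: []int64{100, 200, 300}}}
	idx, ok := selectSingle(matched, 150)
	fmt.Println(idx, ok) // 1 true — the profile at timestamp 200
}
```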
Okay, now that we've talked all about the Parca TSDB, what's on the roadmap for Parca? We want to have persistence on disk, to be able to persist the data we hold. We want to improve querying: for example, we want to be able to query only sub-stack-traces, where we ignore everything else in the tree, and once we've done all the math, return that, symbolize it, and show it; that would be fantastic. Additionally, we can still improve the symbolization, there's lots of work to be done in the metastore, and the SQL parts are also still ripe for improvement. But most importantly, we really want you to get involved with Parca and the community: try Parca, open issues, open PRs, just contribute. We would really love to build a community of like-minded people around performance, based on Parca; that would be fantastic. So thank you, and if you have questions, feel free to ask them. We are also hiring. Thanks and take care.