Hey everybody, I'm Matthias Loibl, and today I want to talk to you about leveraging the Prometheus TSDB, the time series database, for conprof and continuous profiling. I hope you're staying safe, I hope you're doing well, and yeah, let's get into it.

Quickly about me: as I said, I'm Matthias Loibl, a senior software engineer at Polar Signals. I previously worked at Red Hat on Kubernetes. I'm an open source maintainer, working with others on conprof, Thanos, the Prometheus Operator, and kube-prometheus, among other things, and I organize the Prometheus meetup in Berlin. You can find me on social media, as written on the left-hand side, as @MetalMatze.

Cool, so let's talk about profiling. Profiling, if you're asking what it is, is a form of dynamic program analysis that measures, for example, the space a program uses (its memory), its time complexity (the CPU), the usage of particular instructions, and the frequency and duration of function calls. So it's fairly low level, but it's not all too complicated.

Being at PromCon, you might be asking: what's the difference between profiling and metrics? I tend to think that metrics surface problems. If we think about the CPU and look at a Grafana dashboard, we can see that for some reason CPU usage is higher than expected. Same for memory: we might look at the memory of a program and see that for some reason the usage is a lot higher than expected. Now, instead of looking at dashboards all the time, because sometimes we need to sleep, we want to alert on things, right? For the CPU, we have the CPUThrottlingHigh alert, which tells you when the kernel is throttling a process too much: it could run faster, but it has reached the limit of the CPU time the process gets. And for a more symptom-based approach, your service might have a P90 latency requirement, say that 90% of requests are answered within one second. If requests overall are too slow, something with the CPU might be happening, so we want to take a look at that. But we can't really see what's happening.

Then for memory, what we often see is a KubePodCrashLooping alert, which sometimes indicates that out-of-memory (OOM) kills are happening: the kernel terminates the process because it has reached its memory limit, trying to maintain overall system stability. Kubernetes is just an example here; the same applies to systemd and others. Looking at this metric, we can see that at roughly 21 minutes the process got killed and went from one gigabyte of memory all the way down to zero. We don't really know what happened; from the metrics we just see that something happened. What we really want to know is: what did the program state look like at minute 20? This is what we're trying to answer, and this, in the end, is what profiles give you. They show why something is happening in the lower levels of the program. For the CPU, a profile shows where the process spent its time, and it often points quite clearly at the function that used the most CPU time. Most of the time that's a good indicator that spending some time improving that function is worthwhile. And then for memory, we have two different profiles.
One, for example, is the allocs profile, which shows how much each function has allocated over the entire lifetime of the process — so which functions are constantly allocating memory versus others that don't really do that many allocations. Then there are heap profiles, which show what each function has currently allocated: if you use, say, one gigabyte of memory, a heap profile shows which functions hold which part of that gigabyte. That's often quite useful for troubleshooting.

One of the main projects for profiling is pprof. It came from Google, and it reads a collection of profiling samples: every few seconds it takes snapshots of the program state and stores them, or we do that manually. The format is described in a profile.proto file, and I will show you that in just a second. It uses the dot visualization tool from Graphviz to make beautiful graphs. And we can read profiles from a local disk, or we can pull them via HTTP, which is quite helpful if the process is running somewhere else — on a server, in a Kubernetes cluster, whatever you have — so we can do remote troubleshooting.

About the pprof proto, I just want to quickly demystify a couple of things. The profile at the very top has multiple samples, every sample has locations, these locations have lines, and each line has a function. To make this a bit more concrete, I actually ran a debugger and looked at a profile in memory. The 17th sample of the profile has a location, the location has a line, and the line has a function. Looking at the function, it has a couple of strings, and one of those strings is the name, which tells us that this sample points at the bytes.makeSlice function. Further down we see that this function is called on line 229 of the file buffer.go. That's really all there is to it — there's a bit more metadata, but that's the underlying format.

pprof ships as part of each Go release, so if you already have Go, it comes as the go tool pprof subcommand and you can just start using it. And many applications already expose pprof, one of which is Prometheus. For other languages, we basically get CPU profiling, which is great, and some support heap profiles, but not so much beyond that. For Go we even get goroutine profiles, and fgprof is a newer third-party profiler outside the standard library, which is really cool as well — I won't go into it right now. And if you don't have pprof, if you cannot change the program you want to instrument, there's perf, which works without any code changes, and there's the perf data converter. We also recently open sourced a project called Professor, which converts profiles from perf into pprof and ships them to a profiling backend. We still recommend using the native instrumentation via pprof whenever possible, and treating this as a last resort — but it was already quite useful and quite cool to see that it works.

Right, as I said, Prometheus is actually instrumented with pprof. So what does that look like? Prometheus exposes a pprof endpoint, with the handler located at /debug/pprof.
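That handler comes from Go's net/http/pprof package, and wiring the same endpoints into your own service is essentially one blank import. A minimal sketch (the address and port are arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// With the blank import above, the default mux now serves
	// /debug/pprof/heap, /debug/pprof/profile, /debug/pprof/goroutine, etc.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

Prometheus registers the same handlers on its own router, so the result is the same set of /debug/pprof endpoints.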
And if you take a look at that page in a browser, it just gives you a super simple index where you can click on the individual profiles, but it's just a bunch of text and you can't really do anything useful with it, right? What we actually want is to use the go tool pprof subcommand, give it that address to pull down the profile, and then type, for example, web or svg, and we get a graph, as shown in the screenshot. As you can see in this CPU profile, the promql.extrapolatedRate function shows up quite big, which is due to me running some queries in my Prometheus while pulling the profile. Shortly after, I ran go tool pprof against the heap endpoint, and unsurprisingly, right in the center the promql evaluator is taking a lot of memory — also due to me running a couple of queries, I guess.

So, to give you a high-level overview: after all, we're here for continuous profiling, and you might be asking what that is. pprof creates these samples, right — small snapshots of what the program state looks like — and what we actually want is to sample every so often. That's the continuous profiling part. And we can do that because sampling comes with little overhead: we've seen anywhere from 0.2 or 0.3% up to 3%, depending on the process you're profiling. What we really hope for is to get profiles right before OOM kills happen, so that we capture the very last bad state before the process was killed. And we want to do this automatically rather than by hand, because in an incident we might forget. We want to be sure that we have these profiles when we really need them most.

So, continuous profiling — let's take a deep dive. As I said, we want heap and allocs profiles, and some others like goroutine profiles, and we want to take these snapshots every 10 seconds, for instance, and store them. If this looks like anything you've seen before: it totally looks like time series, right? So we created conprof, and conprof stands on the shoulders of giants: on Prometheus, Thanos, and Cortex, and it's really cool to see code from all three projects come together. We heavily use the Prometheus time series database, which we will dive deeper into later. We use the Prometheus service discovery, and we use the scrape manager as well — the scraping of samples from each process is almost exactly the same code as in Prometheus. And we use the Prometheus remote-write mechanism, so whenever we scrape samples we can ship them off remotely. So yeah, lots of reusability gained from relying on these projects.

What does conprof look like? It's quite a simplistic UI at the moment. You have a query interface on the top left where you type which kind of profile you want — in this case a heap profile — and the job selector, and then you get a series of timestamps. You can click on an individual one and you get a profile. And because we use the Prometheus service discovery, querying these profiles looks almost exactly like Prometheus. On the left-hand side you see the go_memstats_heap_inuse_bytes metric, and we can literally copy its label selectors, put them into conprof, and change the metric name to heap.
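Concretely, the translation looks roughly like this (the label values are made up for the example):

```
# Prometheus: heap usage metric for one scrape target
go_memstats_heap_inuse_bytes{job="prometheus", instance="demo:9090"}

# conprof: the heap profile series for the same target,
# reusing the exact same label selectors
heap{job="prometheus", instance="demo:9090"}
```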
So with the same label selectors, we get a heap profile for that very process. And that's really powerful and super cool. Now I want to quickly give you a demo to get a real feeling for conprof.

All right, demo time. I have this example application that exposes pprof and does a bunch of bad stuff, and we want to be able to profile it with conprof. First of all, let me start the application by simply running the binary; it starts up and binds to port 8080. I can show you the scrape config I have for this application in conprof. It is a static config, so we literally just give it the localhost:8080 address, we want to scrape it every single second, and the job is called app. That's pretty much all there is to it. So let's start conprof with this config file, running all the components of conprof at the same time. It will take a second to start scraping, but we can already go to the conprof interface on port 10902. Let's search for job="app" and query it, and sure enough, we get profiles for allocs, goroutine, heap, and threadcreate.

Maybe let's look at heap first; we can even narrow the query down to only the heap profile. Let's click on the latest one — and awesome, we get the profile. We can see that in the main file there's a function called allocMem that currently holds 98% of the allocated memory. We can also look at a flame graph, which is a different way of looking at the same data: at the root we've allocated 88 megabytes, 87 of which are in the allocMem function. We can drill into different subtrees, click around, and go up again — just a different way of exploring these profiles, which oftentimes is quite nice as well.

Let's look at all profiles again. We also got CPU profiles, so let's take a look at one of those. Sure enough, we can see that calculateFib — maybe it has something to do with Fibonacci — took 99.63% of the CPU time. So yeah, something is clearly happening here. We can look at the flame graph for this one too, see that calculateFib is clearly dominating the CPU time, and look into different subgraphs of the CPU profile, which is pretty cool. Now let's take a look at the actual program we were instrumenting and open up the main.go file: right in main we call calculateFib and allocMem. We calculate the Fibonacci number of a very big number, which obviously takes a lot of CPU time, and we also allocate lots and lots of memory. So these were the two functions that showed up so prominently in our profiles. Very cool.

All right, so let's talk about the conprof time series database. Prometheus stores everything as tuples of int64 and float64 — int64 for timestamps and float64 for the values. We needed to change that: conprof has to store int64 and byte slices, because profiles are byte slices. And we ended up making a lot of little changes in very many places. To give you an example, here's the TSDB Appender interface: we had to change it from accepting an int64 and a float64 to accepting an int64 and a byte slice. A sketch of that change follows.
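This is roughly what the Appender looked like in the Prometheus TSDB at the time, next to the byte-slice variant the talk describes (simplified; the ProfileAppender name and the labels import path are just for this sketch and may differ from conprof's actual code):

```go
package tsdb

import "github.com/prometheus/prometheus/pkg/labels"

// Prometheus TSDB: the value of a sample is a float64.
type Appender interface {
	Add(l labels.Labels, t int64, v float64) (uint64, error)
	AddFast(ref uint64, t int64, v float64) error
	Commit() error
	Rollback() error
}

// conprof's variant: the value of a sample is a raw pprof profile.
type ProfileAppender interface {
	Add(l labels.Labels, t int64, v []byte) (uint64, error)
	AddFast(ref uint64, t int64, v []byte) error
	Commit() error
	Rollback() error
}
```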
The TSDB iterator interface changed in a super similar way: whenever we iterate through the samples and actually want to retrieve one, we now return a byte slice instead of a float64.

All right, so whenever conprof scrapes a pprof endpoint, we get back a gzipped profile. We uncompress that profile, we validate and parse it, and once we've done that, we gzip it again and store the individual samples. That's how we could quickly reuse the Prometheus time series database. But we wanted to improve things. As a first step, we started storing uncompressed profiles — we just had to change one little snippet of code — and what we ended up with was all the profiles stored raw. That in itself wasn't great, but it enabled the next step: improving compression by grouping samples together. I think we're at 12 samples per group right now, and we take each group and compress it as a whole. If you think back to the format, a lot of the data is actually just strings, right? So by grouping these samples together, we can compress the strings a lot more efficiently.

To make this better compression actually happen, we needed to change the underlying time series database chunks. What we had was a bytes chunk that was always appending tuples of timestamps and values. The timestamps are double-delta encoded, super similar to Prometheus, and the values are the individual samples. If you think about how to iterate over such a chunk, it's always timestamp 0, value 0, timestamp 1, value 1, and so on. But looking at our UI, oftentimes we only want to see the timestamps — at that point we don't care about the individual profiles yet — so iterating over all the samples is kind of a waste. To improve things, we split the bytes chunk into a timestamps chunk and a values chunk, essentially storing them separately, but keeping them in sync: we always append to both chunks together, and we can always iterate over both if needed. Now we can iterate over only the timestamps and ignore the values, the samples, entirely. If we have to, though, we can iterate over both timestamps and values at the same time to get to the right place and return that profile. What this gives us is that the timestamps can be iterated over individually, still double-delta encoded, while the values are now grouped without the timestamps in between, so we can compress these groups of profiles together and get a lot more compression out of this.

After lots of benchmarking, we actually chose Zstandard — zstd; I don't know if that's the right pronunciation — for compression. In my benchmarks I'm seeing up to 50% disk savings, and in other benchmarks I think I've seen up to 75%. So it's quite significant, with a little bit of overhead in memory, but I think it's totally worth it.
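To build an intuition for why grouping helps, here's a toy sketch — not conprof's actual code path — using the github.com/klauspost/compress/zstd package, comparing per-profile compression against compressing a group of twelve profiles at once:

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/klauspost/compress/zstd"
)

// fakeProfile stands in for one scraped pprof profile. Real profiles
// share most of their string tables (function names, file paths), which
// the repeated lines here imitate.
func fakeProfile(i int) []byte {
	var b bytes.Buffer
	for j := 0; j < 200; j++ {
		fmt.Fprintf(&b, "github.com/example/app.handleRequest /src/app/main.go:%d sample=%d\n", j, i)
	}
	return b.Bytes()
}

func main() {
	enc, err := zstd.NewWriter(nil) // only used via EncodeAll, so no underlying writer needed
	if err != nil {
		panic(err)
	}
	defer enc.Close()

	// Twelve profiles, matching the group size mentioned in the talk.
	profiles := make([][]byte, 12)
	for i := range profiles {
		profiles[i] = fakeProfile(i)
	}

	// Compress each profile on its own...
	individual := 0
	for _, p := range profiles {
		individual += len(enc.EncodeAll(p, nil))
	}
	// ...versus compressing the whole group at once, so zstd can
	// deduplicate strings across profiles.
	grouped := len(enc.EncodeAll(bytes.Join(profiles, nil), nil))

	fmt.Printf("individually: %d bytes, grouped: %d bytes\n", individual, grouped)
}
```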
As for the conprof roadmap: the UI is obviously very basic, and we want to revamp it — we might take UI components from our product and bring them into conprof, so that's definitely something we want to do. We also think there's more room for improvement in the storage. If you think back, a lot of the actual data is just strings that are repeated, and while we already made significant improvements by grouping samples together, I think we can still get more out of the storage. And we can also be more efficient with querying.

All right, that's everything. I hope you enjoy the rest of PromCon. We have a nice SaaS product out, the Polar Signals Continuous Profiler, which actually takes profiles and derives metrics from them. As you can see in the screenshot, you get a metric for each profile, and if something looks off, you click on the metric and you get the profile. That's pretty exciting as well, so check it out, and let us know if you have any feedback. And if you want to become part of the conprof community, reach out as well. Enjoy PromCon — thank you!