So I'd like to welcome Bryan from Grafana Labs. We're going to talk about how Prometheus halved its memory usage. A welcome applause for Bryan.

Well, hello. A little bit similar to Richie earlier, I asked an AI to illustrate my talk. I pasted the title of the talk into the AI, and this is what it came up with. I have no idea what this has got to do with Prometheus or memory, but it's a nice picture.

OK, so this is roughly the agenda. Who runs Prometheus? Oh, quite a lot of people. OK, who's got a big Prometheus, like 10 gigabytes? Yeah, you got it, you got the joke. OK, 100 gigabytes? 200? Still going? OK, I don't know, we could have a contest for who's got the biggest Prometheus. Well, this talk is about making it smaller.

Who's kind of new to this, who has no idea what Prometheus is? Yes, I thought there might be a couple. Not very many, but I drew you a picture. This is the basic idea: Prometheus is self-contained. It takes metric time series data in from a whole bunch of things; it could be your machine, your program, your database. It stores them on disk, but it doesn't rely on another database or another key-value store or whatever; it's self-contained. And then it can connect up to things that display the data, draw some kind of wiggly line, which we like to look at.

Oh yeah, who am I? My name was on the slide earlier: I'm Bryan Boreham, I work for Grafana Labs. I guess most people know about Grafana, the dashboard which draws the wiggly lines, but we also have a bunch of people working on storing data, and I work on that: storage of metrics, logs, traces, profiles. A lot of that uses Prometheus code. So I'm a Prometheus maintainer; I am the ninth biggest contributor to Prometheus, apparently, mostly in the last couple of years. But I have been working on this stuff for about six years, and I have been optimizing programs for about 40 years. So I do this a lot.

So let's get into it. Oh, by the way, this is an Observability Day talk, right? So I'm not going to talk about the code. This kind of occurred to me halfway through preparing this talk: I should do another talk, at GopherCon or something, where I talk about the code. But this is not about the code.

So we're going to start with: how do you observe memory usage? Well, metrics is one way. So which memory metrics should we use? I went and looked it up. These are all the ones you can get out of cAdvisor if you're using containers, or these come out of the kubelet if you're running Kubernetes. There are a few more you can get out of the Prometheus client library, from the operating system. Prometheus is written in Go, and there are some more you can get from Go. And I could have gone on for a couple more pages.

So who knows the right answer? There's always one person who gets all my jokes; we should meet. You want to tell me the right answer? RSS, OK. There's a vote, OK. I put this up to indicate that it's something you need to think about. I put them all on one dashboard, just in case maybe that's the right answer: look at all of them. It's a little hard to figure out what's going on. You can Google this, you can get people's opinions on the internet. So we had a vote for RSS; I actually decided to focus on working set, which is nearly the same thing. So there's a metric called container_memory_working_set_bytes. You can look at that; people will recommend it.
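(A quick aside for the Go users in the room: if you're wondering where the client-library metrics come from, here is a minimal sketch of a Go program exposing the standard process and Go runtime collectors with prometheus/client_golang. This is my own illustration, not code from Prometheus itself.)

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()
	// Operating-system view: process_resident_memory_bytes, process_virtual_memory_bytes, ...
	reg.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
	// Go runtime view: go_memstats_heap_inuse_bytes, go_memstats_next_gc_bytes, ...
	reg.MustRegister(collectors.NewGoCollector())

	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```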
And I also want to stress: why do we care about memory usage? One answer is you've got to pay for it; you've got to pay for more memory if you're using more memory. But probably the most vital reason is that if you run out, it goes bang, right? You get an out-of-memory error and your program crashes.

So we can look at the peak in this particular time period: 17.7 gigabytes. This is me running Prometheus 2.27 from two years ago. I decided, arbitrarily, that I would take that as my starting point for the halving: go back two years. It's running in a dev environment, scraping about 600 targets, running under Kubernetes, with about 2 million time series. And it peaked here at 17.7 gigabytes.

Mostly you will have a memory limit; you can set it in Kubernetes. If you change the limit, it restarts the process; that's what that kind of dip in the picture is. But, a little bit interesting: it was peaking above 17 gigabytes, I put the limit at 17 gigabytes, and it just runs smaller. So that's kind of weird.

Well, let's bring some more data into this picture. Prometheus is written in Go, and we can ask to see the Go heap memory with this metric, go_memstats_heap_inuse_bytes. This is by far the biggest component of memory usage in Prometheus, the Go heap. It's kind of waving up and down a lot on that picture, so I drew another one, a simplified view. This is kind of what's going on inside: the memory builds up, then the garbage collector runs and it comes down again, and then it builds up again. So you get this sawtooth pattern, right? And this is happening quite fast; in this particular Prometheus it's every 10 seconds or something like that. This is why, on this picture, it's kind of waggling around rapidly: it's actually faster than we're sampling the data, so it looks kind of random. Somebody should look up Nyquist or something like that; you could learn things.

Anyway, what I like to do is bring in this different metric, go_memstats_next_gc_bytes. This is essentially the top of the sawtooth, the crests of the waves. So this darker colour here is the peak of the heap usage, if you like. This is getting a bit closer. There's still a gap, though. But take it from me: if this one hits the limit, it'll go bang. So a little tip from me: if you're looking at a Go program, this number is kind of more dependable.

Let's try and understand what is in that gap. It's really hard to get into all the detail, but basically it's the kernel's cache memory that's in the gap. And I kind of appeal to you to believe me, because when you add those two up it never goes above the limit. So it's IO, right? Prometheus is writing to disk. Everything that it writes to disk then sits in the cache for a bit, everything it reads from disk sits in the cache for a bit, and that includes the memory-mapped IO that Prometheus does a lot of. If you're looking at a program that doesn't do a lot of disk IO, it won't look like this, because there's no reason for the memory to be in the cache if it's not doing a lot of disk IO. But in the case of Prometheus, the bulk of the difference between the working set and the heap memory is in cache. And because it's in cache, we don't really need it; the kernel can go back to the disk and get it.
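(If you want to see those same two numbers from inside a Go process, without going through metrics at all, here is a minimal sketch using the standard runtime package. These are the fields that back go_memstats_heap_inuse_bytes and go_memstats_next_gc_bytes.)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m) // briefly stops the world; fine for occasional inspection

	fmt.Printf("heap in use:    %d bytes\n", m.HeapInuse) // go_memstats_heap_inuse_bytes
	fmt.Printf("next GC target: %d bytes\n", m.NextGC)    // go_memstats_next_gc_bytes, the crest of the sawtooth
	fmt.Printf("GC cycles:      %d\n", m.NumGC)
}
```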
As you get closer and closer, as the limit comes down closer to what's really needed, the program will slow down a bit, but it won't crash until they actually touch.

OK, so this is my biggest takeaway. Sorry, it was a trick question: there is no single metric for memory. I interact with a lot of people on Slack or wherever, and they say, oh, my memory is this. And I always say, well, which metric are you looking at? And then the next thing I usually say is, well, you should look at these three other ones. Because one metric will never tell you the complete picture. I really want to stress that for Observability Day.

Oh yeah, so I reduced the limit. It was at 17 gigabytes; I took it down to 16 gigabytes, and it went bang. It didn't go bang straight away, but it did go bang. So, a little bit more detail on what's going on inside Prometheus here. It's building up this data structure we call the head block, which is the most recent data; by default it's two hours of data that it builds up in memory. Then it writes that to disk in a process called compaction. And because we don't want to lose two hours of data if it does crash, it also writes all the data to another thing on disk called the write-ahead log, the WAL.

So I go back to this picture. What happened here? Probably it was doing a compaction, because that uses a little bit of extra memory, and that's when it went bang. So it goes down to zero, it starts up again, it reads all the data back in from the WAL, and it gets going again. And there's a real problem here: Prometheus only tidies everything up and truncates the WAL after a compaction. So it crashed again and started up again; now the WAL is a bit bigger, because it had been running for another two or three minutes, so now it needs more memory. So it crashed again, and again, and again, and again. This is a pattern that some of you will be familiar with; I thought it'd be nice to explain it. If you get into this state, the main thing you can do is give it a lot more memory, so that it can get out of this state and recover.

But I kind of promised in the title of this talk that the memory usage had halved. And this is running 2.29 from two years ago. So I showed you that; I will come back to it. OK, I want to talk about profiles first. How did I make it smaller? How did we know where all the memory is going? Well, a profile is a good tool for that. I'll skip exactly how you get hold of a profile; Go has a built-in profiler in its runtime, and you can ask it for all the memory, and all the CPU usage as well, but we're focused on memory here. And you can visualize it in what's called a flame graph view. The width of the blocks is the proportion of how much memory is being used. What's called the root, the top of everything, that's 100%, and it's 6.7 gigabytes. This is basically the bottom of that sawtooth: the Go memory profiler reports the memory usage as of the last garbage collection, so you never see garbage in this picture. A lot of people assume their memory is probably mostly garbage they don't need to think about. This is never garbage. When you look at a profile from Go, this is the bottom of the sawtooth; this is what could not be discarded.
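(For the curious: Prometheus already serves the Go profiler's endpoints under /debug/pprof on its web port, so you can pull a heap profile straight from a running server. For your own Go program, a minimal sketch of exposing the same thing looks like this; the port and address here are just examples.)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// With the handlers registered, a heap flame graph can be viewed with, for example:
	//   go tool pprof -http=:8081 http://localhost:6060/debug/pprof/heap
	// Remember: the heap profile shows live memory as of the last GC,
	// the bottom of the sawtooth, never the garbage.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```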
Anyway, the process of making it smaller starts with asking, well, what is making it so big? And I added up these various bars. The queries I'm running in my test system turned out to be about 9%. The metadata, if you like, for the series: 14%. The samples themselves, you'd think that would be the biggest thing, but they're very, very highly compressed in memory. The metadata to do with scraping, getting the metrics in, that was pretty big. But the winner overall, and this is as of two years ago: pretty much a third of all the memory inside Prometheus was taken up in labels, so strings.

What do I mean by labels? Well, every series in Prometheus is uniquely identified by this set of name-value pairs. And if you have another series which is related, and the method here is the only difference, you actually get a whole new set of strings, and on, and on, and on. So you look at this and you say, well, that's dumb, just have one copy of the strings. But it's not as simple as that.

This is showing you the kind of data structure inside. I don't want to get deeply into the Go programming details; I put a link on there if you want to go and look. But basically, the slice header that points to all the labels is 24 bytes, and every string has a string header, which is 16 bytes: a pointer to the contents and a length. And if you add them all up, it turns out that all of these pointers in the data structure are way bigger than the strings themselves.

So I did this. This is Prometheus PR 10991. I put all the strings into a single string, and I encoded them with their lengths, so you can walk down the string and find the one that you're after. And this picture is a little bit exaggerated, because the strings are really small; but most of the strings people use in Prometheus are small. They're things like namespace and cluster and job. So there are a lot of small strings in there.

Anyway, this shipped in 2.44. It took a year. 2,500 lines of code changed, because there was a huge amount of code that just assumed it knew what that data structure looked like. There are lots of projects that use this code, which had knock-on effects; they needed basically thousand-line changes to react to this. And I also want to give credit to the Prometheus team that put up with this. I did a lot of the typing, but they were reviewing these changes and helping with testing and all kinds of things like that. So shout out to them.

Right. So, I didn't add any more memory, I started it up. The previous one was crashing at 17 gigabytes; I ran 2.47.2, which is the most recent released version of Prometheus as of today. Result: 13.1 gigabytes. It's not half. OK, carry on. Oh, I wrote this up in detail, yes. But maybe I can explain this: we added out-of-order handling of samples, we added native histograms that Richie was very pleased about earlier. Maybe they used up all that? No, they didn't really use up all that. No.

OK, so this is conference-driven development. I'm preparing for this talk, and I said it was half. Well, maybe there's another thing I can talk about. This was a bug that was fixed in 2.39, where, for really big Prometheus servers, the transaction isolation ring used to get enormous under certain conditions. I did the math. It's still not half. What are we going to do?
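(Before the next step, to make that first labels change a bit more concrete: below is a simplified sketch of the length-prefixed, all-in-one-string idea. It is my own illustration of the technique, not the actual encoding from PR 10991, which is rather more careful; single-byte lengths here are an assumption for brevity.)

```go
package main

import "fmt"

// Label is the familiar name/value pair.
type Label struct{ Name, Value string }

// packLabels concatenates all names and values into one string, each
// prefixed with its length as a single byte (assumes strings shorter
// than 256 bytes, which is fine for an illustration).
func packLabels(ls []Label) string {
	b := make([]byte, 0, 64)
	for _, l := range ls {
		b = append(b, byte(len(l.Name)))
		b = append(b, l.Name...)
		b = append(b, byte(len(l.Value)))
		b = append(b, l.Value...)
	}
	return string(b) // one string header, one allocation, however many labels
}

// get walks the packed string looking for a name; no per-label pointers needed.
func get(packed, name string) (string, bool) {
	for i := 0; i < len(packed); {
		n := int(packed[i])
		gotName := packed[i+1 : i+1+n]
		i += 1 + n
		v := int(packed[i])
		gotValue := packed[i+1 : i+1+v]
		i += 1 + v
		if gotName == name {
			return gotValue, true
		}
	}
	return "", false
}

func main() {
	packed := packLabels([]Label{
		{"__name__", "http_requests_total"},
		{"job", "api"},
		{"method", "GET"},
	})
	fmt.Println(len(packed), "bytes packed")
	fmt.Println(get(packed, "method"))
}
```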
So I went back and did the thing that everyone said I should have done in the first place, which is to have only one copy of every string. This was April; I did a version of that. It is quite complicated, for a couple of reasons. First of all, it's a lot smaller, because the only thing we're storing is indexes into a table. So we have a thing that is technically a string in Go, but it's actually just a list of numbers, and we have a table indexed by those numbers, which we need to point to. And the thing that makes it really complicated is that we don't want to have just one of those tables. If I have just one table for the entire program, it's a bottleneck: basically, the entire program becomes single-threaded on this one table where all the strings live. So it's a very intricate change, and we have multiple of these symbol tables. That took us down to 10 gigs. This was about a month ago, because I submitted that PR in April. 10 gigs. Still not half.

OK, so I carried on. I mean, basically, the process is: you get the profile, you pick the biggest number, you work on that, find some inefficiency in it if you can, and do it again. And now what was the second biggest number is the biggest number. You do that a few times, and numbers that were really not that big to begin with are now the big numbers. So it's a nice self-reinforcing process. Anyway, running low on time here: this is 2.47 plus all of those PRs on the previous page. 8.6 gigabytes. Are we at half? Nearly. I think that's pretty good, though. OK, tough audience.

One more thing. There is a parameter in the Go runtime called GOGC, and it defaults to 100. Basically, the size this sawtooth grows to is 100% above the size at the bottom of the sawtooth. In Prometheus, in my little test one here, that's six or seven gigabytes. And for those of you who have a 100-gigabyte Prometheus, it's growing by 50 gigs basically for housekeeping purposes; you don't need 50 gigs of garbage to run an effective heap. So you can tune that number. It's an environment variable you can set. The heap will grow to whatever percentage you set it to, over what it went down to as the minimum, and it will garbage collect a bit faster, so probably your CPU will go up. But like I say, if it's enormous, if it's garbage collecting really slowly because it's got to grow by 50 gigs, this can work out. And I worked on that as well: I worked on optimizing the amount of garbage being created. So anyway, I did that. This is Prometheus 2.47, plus the 8 PRs on the previous page, plus GOGC=60, and it's running with a memory limit of eight gigabytes. And I'm done.

So yeah, there are the takeaways. I really love feedback; if you go to the page for the talk, you can download the slides, and you can tell me what you thought. And I have some time for questions.

Thank you, Bryan. If you have any questions, please approach the microphone here. I have the first person.

Yeah, thanks. Great talk. I'm just curious: normally when you get that much memory improvement, you're paying for it in some other way. Do you have a summary of the CPU impact from all these changes?

Yeah, OK. So yeah, engineering is all about trade-offs. Were you over there? So it is quite likely that when we squeeze down on one thing, something else goes up. Memory is kind of an interesting one, though, because of the way your CPU works: it's working, a lot of the time, in 4K pages, so it's spending a lot of time on housekeeping. And the cache on your CPU is probably something like 10 gigs, sorry, 10 megs, and we've got 100 gigs. So basically, if you reduce the amount of memory being used by your program, it might go faster, because more of it might fit in the cache, and more of it might not cause what's called a TLB miss. I don't have time to explain that, sorry. Anyway, the very last thing that I did pushed the CPU utilization up a bit; for everything else that I talked about, it goes down.

Yes, it is nice, yes.
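(A small code footnote on the GOGC knob mentioned a moment ago: you can set it from the environment, or from inside a program with runtime/debug. This is a minimal sketch of the in-code form as an illustration; setting GOGC=60 in the environment, as described in the talk, has the same effect at startup.)

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// 60 means the heap may grow roughly 60% above the live set before the
	// next collection, instead of the default 100%: a smaller heap ceiling
	// in exchange for garbage collecting a bit more often.
	old := debug.SetGCPercent(60)
	fmt.Println("previous GC target percentage:", old)
}
```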
When evaluating the performance improvements that you went through, do you have any general framework for deciding if the development effort to implement them is worth the performance optimization?

Oh yeah, that's a good question. This is the "when do you stop?" question. Well, I recommend having a talk deadline. More seriously, it is good if you can somehow work it out. Richie was actually talking about this earlier: I work for Grafana Labs, same place, and we send out these weekly stats about what we're spending in the cloud, and memory and CPU are part of that. So you can literally put a dollar amount on it. What is hard is to predict what the improvement is going to be, right? So I think, realistically, you have to get the data, get the metrics, get the profile, and you get some kind of idea. I mean, a profile that is really attractive to optimize is one where there's one big bar. The one that I showed you had a lot of evenly sized bars; that means people have already optimized it quite a lot. Anyway, yeah, it's really hard to predict what benefit you're going to get out of it. So you kind of have to say, well, I'll give it three weeks or something like that, that's worth so much of my time, and I'm hoping to get this much of a drop in my cloud bill; but if I haven't got anywhere in three weeks, I'm just going to have to drop it, because it didn't work out. I'm sad I don't have a better story than that. Just kind of time-box it.

Yes? Hey, great talk, thanks. That resonates a lot with our experience with Prometheus, so it's great to see the memory work. My question is about the agent configuration, agent mode: does it change the situation? If you use Prometheus as an agent, is the memory usage lower?

OK, yeah. So there's a mode you can run Prometheus in called agent mode; that's what you're asking about. And essentially, it does not build the TSDB in that mode. But you still have to write the head block, right? You still have all those series. So it'll be a little bit different. But in my tests, I did have remote write turned on; that also takes a lot of memory. And I think there are some specific fixes in there. So basically, the labels-as-one-string thing was not particularly helpful for Prometheus agent; the next one, the labels-as-symbol-table, will be. Great, thanks.

Thank you so much, Bryan. I appreciate it. Thank you.