Welcome back, everyone. We are a little bit late, but it's time for our next talk. It's about how not to scale your Prometheus, from Kush and Nicola.

All right, can everybody hear me? Good. All right, so we just want to start out by asking a couple of questions. Hey everyone, I know we are at the Prometheus day, but I just want to start out by asking: how many of you are using Prometheus? Okay, that's everyone. Awesome. So how many of you self-manage your deployments? Fewer. Gotcha, gotcha. And then I think one of the big questions: how many of you have had trouble scaling those deployments, or have had trouble with their stability? Yeah, a fair-ish number. Nice.

Well, we did face scaling issues, and our Prometheus went into CrashLoopBackOff a lot of times. So worry not; we hope that after this talk you will have some useful insights into how not to scale your Prometheus, and how to go about it the right way.

So hey everyone, good afternoon. I'm Kush Srevedi, and this is my colleague Nicola Collins. We are platform engineers at DevRev, and today we plan to talk you through some of the mistakes we made while taking shortcuts with our Prometheus deployment, so that you don't have to go through them.

Yep. So as Kush mentioned, we work at DevRev and we're on the platform side of things. At DevRev we really own the infrastructure that supports our developer-oriented, API-first CRM, or "dev CRM." With that, I can talk you through a little bit of what our topology looks like. We are on Kubernetes, primarily EKS, and we initially started out with some particularly small nodes. On top of that we use Argo CD heavily, Istio for the service mesh, Kiali to visualize the mesh, and early on we used Grafana to visualize some internal metrics. When Kush and I started this, we had around 20 microservices that we were supporting; today we're up to approximately 70. And when we started all of this out, we were primarily a Datadog shop. We dumped everything, from our metrics to our tracing to our logging, straight into Datadog, and that's where all of our developers are used to working.

So as Nick said, Datadog was a one-stop observability platform, and being honest, that worked pretty well for us in terms of visualization of metrics, application performance monitoring, and correlation between multiple tools. But soon costs started increasing significantly with Datadog. Up until this point we had the luxury of dumping everything into Datadog without worrying about the cost, but now we had to become more discerning. So we decided to go ahead with a hybrid observability model, where we would have a cloud-provider setup as well as an in-cluster observability setup. That's when we decided to move some of the observability concerns around our platform and infrastructure to Prometheus. And since we were very new to Prometheus, we didn't know how to manage it very well, so we did commit a lot of mistakes along the way.

Yeah, so our first attempt, or really our first interaction with Prometheus, was indirect. It was a necessity to visualize our service mesh metrics, and we just wanted to get that rolling.
We simply installed the example Prometheus YAML that came with our Istio install. But we try our best to deploy and control everything out of Argo, so we pretty quickly moved to the most basic Prometheus Helm chart.

One of the things that we immediately noticed, and were informed of, was our Datadog cost: it essentially doubled overnight as soon as we turned on these mesh metrics. That was a pretty stark realization. So the first quick answer was: all right, let's filter those out. Our app teams, our developers, don't really need those metrics; they're more for us to debug internally. So we filtered them out and started dumping them directly into our local Prometheus instance.

That led to the first real instability in our cluster: we noticed that this Prometheus pod just kept going down, and kept going down. Initially my first question, or my first ask to Kush, was to just turn on the horizontal pod autoscaler and everything should be fine, but he quickly informed me that's not really an answer. The next thing we did was obviously just try to throw more memory at it, and with our tiny nodes we capped that out pretty quickly. Then, in a last-ditch effort, we tried the vertical pod autoscaler without really investigating or diving too deeply, and that actually led to node instability, so we backed it out almost immediately.
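For context, what we tried looked roughly like the sketch below: a VerticalPodAutoscaler pointed at the Helm chart's Prometheus server Deployment. The names, namespace, container name, and bounds are illustrative assumptions, not our actual manifest.

```yaml
# Hypothetical sketch of the VPA experiment; target name, namespace, and bounds
# are illustrative, not our real manifest.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: prometheus-server-vpa
  namespace: monitoring
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: prometheus-server          # server Deployment from the community Helm chart
  updatePolicy:
    updateMode: "Auto"               # VPA evicts and recreates the pod to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: prometheus-server
        minAllowed:
          memory: 2Gi
        maxAllowed:
          memory: 8Gi                # without a cap like this, requests can outgrow a small node
```

On small nodes, an "Auto" VPA that keeps raising the pod's requests can push it past what any node can actually schedule, which is roughly the churn we ran into.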
So our Prometheus was still crashing, and it had been like that for the past few weeks, since no one was actually looking at the Prometheus metrics. As Nick mentioned, there were no product-critical metrics in our Prometheus; we were just using them for debugging some internal issues related to networking.

But then things started changing pretty soon, when we started to explore canary deployments for our microservices. We already had the Istio and Argo CD setup, and guess what, that's all we needed to start our migration to a canary rollout strategy. We quickly installed Argo Rollouts and started exploring subset-based routing via Istio, and now we needed an in-cluster observability solution to monitor the metrics.

There were two reasons why we needed an in-cluster solution. First, for latency reasons: we didn't want to send all those canary metrics to some cloud provider and fetch them back every five seconds just to check how the canary rollout was doing. And second, to make sure we didn't slow down our deployment pipelines, we needed an in-house solution that could provide us with APIs we could query every five seconds, six seconds, whatever our frequency is.

So that's the point when we realized we needed to start working on strengthening our Prometheus setup. This was the first time that Prometheus availability directly impacted a deployment pipeline, and we needed to fix that.
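To give a sense of why the canary pipeline suddenly cared so much about Prometheus uptime: Argo Rollouts can gate each canary step on a Prometheus query, polled every few seconds. A minimal sketch is below; the metric, threshold, Prometheus address, and argument names are assumptions for illustration, not our production template.

```yaml
# Minimal sketch of an Argo Rollouts analysis backed by an in-cluster Prometheus.
# Query, success condition, and Prometheus address are illustrative assumptions.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 5s                          # polled on the same ~5 second cadence we relied on
      failureLimit: 3
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc:80
          query: |
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}",response_code!~"5.*"}[1m]))
            /
            sum(rate(istio_requests_total{destination_service_name="{{args.service-name}}"}[1m]))
```

If Prometheus is down, every analysis run like this fails or times out and the rollout stalls, so the deployment pipeline is only as available as Prometheus itself.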
Yeah, so as availability, or I guess uptime, went from annoying to critical, we took a, you might say, more focused approach to this. The next piece, based on some very loose research, was: well, if we can't optimize a single instance of Prometheus, maybe we should start partitioning or sharding our data between multiple instances. The logical grouping we went with initially was: we'll put all of our cluster metrics in one Prometheus instance, we'll put our service mesh metrics in another instance, and then we'll put all of our app metrics, everything else that our dev teams are looking at, in a third instance.

And it was at this point we realized: okay, we have two very stable Prometheus installs, but we again have a very unreliable mesh instance. So in some ways it worked well; at least we minimized the blast radius. But at the same time we essentially ended up with the same problem: we have one Prometheus instance that, every day we look at it, has run into an out-of-memory error, and that out-of-memory error leads to a CrashLoopBackOff, and during that CrashLoopBackOff our probes are timing out. So let's increase the probes, and as soon as we increase the probes we hit the out-of-memory error again, and again, and again.

So how do you fix the one Prometheus instance that is always crashing? As Nick said, two-thirds of our Prometheus setup was stable: we were getting cluster metrics, we were getting custom metrics. But again, the mesh instance was super unstable. What we did back then was just delete the PVC, restart the pods, and boom, Prometheus was back again, for two or three hours, until it reached its limit of metrics and started going into CrashLoopBackOff again. We needed to fix this.

Yeah, so let's look at all the things we had tried to make our Prometheus more stable. We tried vertical scaling; we tried putting more requests and limits on Prometheus, but again it reached all of them and things started going into CrashLoopBackOff very soon. We started using the vertical pod autoscaler, but the vertical autoscaler meant Prometheus kept increasing its requests, which in turn led to the kubelet going into a crashing state, ultimately bringing nodes down. And when the nodes were down, some of our critical services started going down, because Prometheus was scheduled alongside one of those services.

Then we tried isolation to limit the blast radius: we started scheduling Prometheus on a dedicated node, but again Prometheus soon reached those limits, and one of our nodes was always down. And trust me, nobody likes a Kubernetes cluster with one of its nodes always down.

We started throwing more resources at the problem, so we bumped the node size, and now we were at 16 gigs of memory and an eight-core CPU. During that time you might say our Prometheus was pretty stable, up until we started querying metrics from the past 48 hours, the past 36 hours. That's when the queries made Prometheus breach its limits again and things started going OOM. So we were back to zero.

Yeah, so now we started exploring what was really our last option, and Kush mentioned this operator model, the Prometheus Operator. I think the thought, at least on my side, was: well, we have this problem with instability, but if we can recover gracefully without manual intervention, we can worry about it later. So it was again a blind attempt to just install some more technology and hope that it fixed the problem. And we actually ended up with the exact same thing we had before: two very stable Prometheus instances, and one inherently unstable Prometheus instance that essentially just ends with an out-of-memory error near the end of its first retention window.

So it's really at this point that I felt we had tried everything, and I also very much appreciate that Kush took it upon himself. Before we get into that, just really quickly again: more resources didn't work; we couldn't go horizontal; isolating it at least stopped it from impacting other services; federating it gave us stability only where we didn't have issues in the first place; and the operator didn't help us particularly much. So it's at this point that Kush, I think, dug much deeper and really decided to own the problem.

So what do you do when nothing works? We were done with trials, and we were done with bumping into random GitHub issues about Prometheus going into crash loops or going OOM. That's when we started to holistically dive into the Prometheus documentation and how it works. We started reading a lot of blog posts and articles, and came to know that there were a few things fundamentally wrong with how we were installing and managing our Prometheus deployment.

We first went to Brian Brazil's Prometheus memory calculator, where we gave it all the inputs we had, and it came out that we needed 33 gigs of memory to make sure Prometheus runs in a stable state. Well, coming from t3.micro nodes, a node with 32 gigs is not something we could easily afford to start with. So that's when we started looking at the problem more closely, and we found out that our scrape interval was too small: we were scraping everything at five seconds, which was too frequent.
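For reference, the fix itself is a small prometheus.yml change; the intervals and job name below are illustrative, not the exact values we landed on.

```yaml
# Sketch of the relevant prometheus.yml settings; intervals and job name are illustrative.
global:
  scrape_interval: 30s                  # previously everything was scraped at 5s
  evaluation_interval: 30s
scrape_configs:
  - job_name: istio-proxies-example     # hypothetical job name
    scrape_interval: 15s                # individual jobs can still override the global default
    # ... kubernetes_sd_configs, relabel_configs, etc.
```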
So we bumped up our scrape interval. The second thing was that we had three million time series, and honestly, I don't know if we actually needed that many time series. That's when we stumbled upon the blog post by Thomas about how to investigate high memory usage in Prometheus. We used the TSDB analyze tool to identify which metrics had high cardinality and which metrics had the highest number of labels, and then we used Grafana's mimirtool, which told us which metrics we actually needed to keep our Grafana dashboards working correctly and to get all the observability data we need. So we started dropping all the unused metrics and reducing labels, and things started becoming more and more stable.

Then we stumbled upon the Istio documentation, where it was mentioned that if any of your calls come from outside the mesh, or go out of the mesh, Istio by default falls back to the host header to label the destination service. What that can do is result in very high-cardinality destination service metrics. So we disabled the host header fallback in our clusters, and boom, our Prometheus was way more stable than it had ever been.

Now you might ask how we know that our Prometheus is stable. Ever since we made these changes, our Prometheus average memory usage has been around 3.5 gigs. It's still scheduled on the node with 16 gigs of memory, and during querying the peak goes to about 5.5 gigs of memory usage. There was still one very manual intervention we needed: during peak times, when the HPA kicks in and the cluster goes to 450 or 500 pods, the PVC used to fill up. So we were still manually intervening with the PVCs, increasing the PVC size from 8 to 12 gigs, then 12 to 16 gigs. As of now we have around 1 million series with 3 million chunks, and Prometheus is working well for now.

And since Prometheus has been stable for us, there are a lot of different teams that are now looking at exposing custom metrics from their microservices so that they can have metrics-based monitoring.
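Since the Prometheus Operator is now in the picture, the onboarding hand-off for those teams is essentially a ServiceMonitor per service. The sketch below is illustrative: the names, labels, namespaces, and port are assumptions, not one of our real manifests.

```yaml
# Illustrative ServiceMonitor for a team exposing custom metrics; all names,
# labels, and the port here are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-service
  namespace: monitoring
  labels:
    release: prometheus              # must match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-service
  namespaceSelector:
    matchNames:
      - example-team
  endpoints:
    - port: http-metrics             # named port on the team's Service
      path: /metrics
      interval: 30s
```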
Yeah, so this is the one area, really, in our entire infrastructure where we can't rely on autoscaling. And it's not that we exclusively rely on it elsewhere; it's just that this is much more manual tuning than I would have hoped or expected. So I think it's on Kush and I, number one, to really understand the data that we're ingesting and be a little bit more proactive with our strategy. A one-day retention is not an answer, so we know we're going to have to bump that up pretty significantly here. But at least we have a baseline, and from that baseline we can build some kind of deterministic scaling model.

On top of that, we have a dev team that is pretty spoiled by having their dashboards and all of their metrics in a single place, and as we split that up we really have to have an answer from a usability standpoint. It's a lofty goal to get to what they have today, but what we really can't have is three or four or five different places to go get metrics. So that opens up a new issue for us, something that we're definitely going to have to tackle, which is this idea of instance aggregation and, let's say, longer-term retention. We've done very, very brief research, but there are things like Thanos, Cortex, and M3, and some other options out there. That's definitely next on our list.

And then another big thing for us: these metrics are particularly important to our developers, and developers are our customers at DevRev. So we've been trying to figure out, number one, the best way to use some of these metrics internally, say to kick off internal tickets or issues programmatically and feed them back into the platform. And another thing is that we have the opportunity to visualize some of your infrastructure, or some of what we call parts. We think there's definitely an opportunity to do some real-time data display by feeding that back in. But that really wraps it up for us. We're glad to answer any questions that you have, and thank you so much for taking the time to listen to us.

We have a question over here.

Hi guys. So when you said that you were aggregating all the data in Prometheus, was Prometheus the only store for metrics, or were you aggregating and pushing it somewhere else? Like when you talked about the retention, when you said 24 hours, 36 hours, or whatever that time frame is, was Prometheus the only store, or were you still pushing it elsewhere to have, like, a week store or a month store, something like that?

Purely one day at this point. We have not yet looked past that, and that's primarily because the metrics that we have today are used for our internal debugging; we don't really have that use case yet. At the moment our app and business metrics, our developer-important metrics, are in Datadog, and that's where retention is more important at the moment. But long-term we have to be able to answer this.

Any other questions?

So would you say the federation was necessary, looking back? Or would you maybe have waited, if you could have figured it out first? Would you still do it if you had to go back?

If we didn't have to, I don't think we would have, right? Like, if it could scale without me deciding arbitrary partitions in our data, I would much, much rather prefer that, personally. But yeah, I don't think we have another answer to that, at least for what we've come across. So as of now, our data is still federation-based. For the long term, we are planning to have each microservice with its own federated Prometheus, which will again be aggregated by a global Prometheus. By then we could have Thanos per cluster, which can aggregate across a multi-cluster setup so we can have long-term retention.
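For anyone unfamiliar with what that hierarchy looks like, a global Prometheus federating selected series from a downstream instance is roughly the scrape config below; the job names, matchers, and target address are illustrative assumptions.

```yaml
# Sketch of a global Prometheus pulling selected series from a downstream
# instance via its /federate endpoint; names and matchers are illustrative.
scrape_configs:
  - job_name: federate-mesh
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="istio-proxies"}'                # hypothetical downstream job
        - '{__name__=~"istio_requests_total"}'   # or whole metric families by name
    static_configs:
      - targets:
          - prometheus-mesh.monitoring.svc:9090  # the downstream mesh instance
```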
All right, we have another question over here.

So you said you had issues with high-cardinality data. What approach did you use to resolve that?

We used the mimirtool from Grafana, which told us which metrics we actually needed for our dashboards. Second, we used the TSDB analyze tool, which told us which metrics were emitting a lot of labels, and then we checked whether we even needed that many.

How did you get rid of it? Did you go back and change the metrics, like refactor the metrics that you had? How did you go about it?

Pardon?

How did you resolve it?

Right, we dropped the metrics which we didn't need.
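To make that last answer concrete: dropping a metric family, or stripping a high-cardinality label, at scrape time is a metric_relabel_configs rule along these lines. The metric and label names below are examples, not the exact series we dropped.

```yaml
# Illustrative metric_relabel_configs; the metric and label names are examples.
scrape_configs:
  - job_name: istio-proxies          # hypothetical job name
    # kubernetes_sd_configs, relabel_configs, etc. omitted
    metric_relabel_configs:
      # Drop whole metric families that no dashboard or alert uses.
      - source_labels: [__name__]
        regex: istio_request_bytes_bucket|istio_response_bytes_bucket
        action: drop
      # Strip a label that multiplies cardinality instead of dropping the metric.
      - regex: request_protocol
        action: labeldrop
```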