Hey, everyone. Welcome to KubeDay Singapore. Today we are going to talk about handling billions of metrics with Prometheus and Thanos. I'm Ravi Hari; I work as a principal software engineer at Intuit. Hey, I'm Amit, and I work as a senior software engineer at Intuit.

At Intuit, we work on an AI-driven expert platform that powers the work of all the developers at Intuit and drives modern development experiences. This platform runs on Kubernetes infrastructure, and these are some of the numbers that show the importance of the platform. Our day-to-day job is to ensure the platform scales and to solve new use cases.

Here is the quick agenda. We'll look at the evolution of metrics at scale with Prometheus and Thanos; then different metrics use cases with Argo Rollouts; then how we leverage metrics for AI Ops; then how we use metrics in the Argo CD metrics extension dashboard; and finally how we push golden-signal metrics to an event bus via a Kafka adapter.

So it all started like this. We have close to 320 Kubernetes clusters at Intuit, and the number grows day by day. Since Prometheus is the de facto standard for metrics monitoring and HPA use cases on Kubernetes, we ran Prometheus on top of Kubernetes with the Prometheus Operator. We created a Prometheus HA pair, where one instance is primary and the other is secondary, and we install ServiceMonitors that scrape metrics from services and store them in Prometheus. We stored this data for 36 hours because of various use cases, and because of that we had to run each Prometheus with an EBS volume, since each one is a standalone instance, and at that time Prometheus could not store data in an object store.

What we did then was collect the data and use a third-party tool for visualization. We had to translate the Prometheus metrics into the custom metric format that this third-party visualization tool supports, so we created additional add-ons: a storage adapter for Prometheus and an S3 adapter that stores the data in the custom metric format in an S3 bucket, which the custom visualization tool then reads.

Then the metrics started growing, and the number of metrics became overwhelming. It was difficult for developers to constantly watch these dashboards and figure out what was going wrong. So we came up with a few golden signals, leveraging the RED (rate, errors, duration) methodology: metrics such as request rate, error count, and latencies, along with CPU and memory utilization of the pods. We looked at a few critical metrics for the pods of each application, wrote our own custom Prometheus rules on top of the base metrics, and started shipping the data out to an event bus, because we had a separate pipeline that reads from this event bus, processes the data, and sends alerts on these golden signals. For that, we introduced a component called the Prometheus Kafka adapter. We looked into the open source solution and liked it, but it was missing some features, so we enhanced it; we'll cover that in a subsequent section. So on top of the golden signals, we had to write the data out from Prometheus via remote write into the event bus. This is an additional use case.
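On the Prometheus side, this pipeline boils down to a remote-write section like the following minimal sketch; the adapter endpoint and the metric-name pattern are illustrative, not our actual configuration.

```yaml
# Minimal sketch: Prometheus remote_write pointed at a Kafka adapter
# endpoint (URL and metric-name pattern are illustrative).
remote_write:
  - url: http://prometheus-kafka-adapter.monitoring.svc:8080/receive
    write_relabel_configs:
      # Ship only the golden-signal series produced by our recording
      # rules; everything else is kept out of this remote-write stream.
      - source_labels: [__name__]
        regex: "golden_signal:.*"
        action: keep
```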
What happened was we started by scraping about 300 metrics per pod, and eventually that grew to 1,200 metrics per pod. Even that was not sufficient; in some cases we went all the way up to 4,000 metrics per pod. The metrics use cases have kept growing, and so has the rate of requests for metrics: at Intuit, the volume of metrics has grown by 2x over the last couple of years. A single Prometheus HA pair was not sufficient to handle all these metrics, so we had to scale Prometheus horizontally. When we looked for solutions, we came across the Thanos project, which can scale Prometheus horizontally.

The way we have done this is to run the Thanos sidecar alongside Prometheus; the sidecar ships the data from Prometheus into the Thanos S3 bucket shown at the top. We then reduced the retention in the Prometheus instances, which shrinks them, while storing the data in the S3 bucket and retrieving it for up to 48 hours through the Thanos pods: Thanos Query is the front end, and behind it is Thanos Store, which retrieves data from the S3 bucket and sends it back to the users. So we scaled Prometheus horizontally to handle the additional metrics growth at Intuit. This is how we started getting into Thanos, and once we opened that door, additional use cases came out.

One of the challenges we faced when we scaled Prometheus horizontally was that HPA stopped working properly. The reason is that with multiple shards of Prometheus, each shard was only getting data for a subset of the pods in a given namespace, because the pod IPs are what we ultimately derive from when we scrape a service in that namespace, and each pod is hashed to a particular Prometheus instance. So the metric data gets split: even though the pods all belong to the same namespace, their distribution across the Prometheus instances is divided. When the HPA metrics were calculated via the Kubernetes API, the values were not accurate because nothing was aggregating across the multiple Prometheus instances.

To solve that, we introduced another Thanos component, Thanos Ruler. Thanos Ruler queries Thanos Query, which internally can retrieve data from Prometheus as well as from the S3 bucket. We moved the rule evaluation into Thanos Ruler and no longer evaluate those Prometheus rules in Prometheus, and this central evaluation solved the HPA use cases (a sketch of such a rule appears after this section). Once the rules are evaluated, the results are written to the Thanos Receive component, which puts the metrics back into the S3 bucket. This is how we solved the HPA issues.

More and more use cases have been evolving at Intuit. One of them is AI Ops: AI has become the norm in the last couple of years, and at Intuit we started using a project called Numaproj, which leverages metrics to solve anomaly detection for multiple applications. One such application is Argo Rollouts; we'll talk about that in detail. The additional use case we came across is Argo CD.
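Here is the rule sketch referenced above: a minimal recording rule of the kind we moved from Prometheus into Thanos Ruler for the HPA use case (rule and metric names are illustrative). Because Thanos Ruler evaluates against Thanos Query, the aggregation sees the series from every Prometheus shard.

```yaml
# Minimal sketch of a centrally evaluated recording rule; the result is
# remote-written to Thanos Receive rather than stored in any one shard.
groups:
  - name: hpa-aggregation
    interval: 30s
    rules:
      - record: namespace_app:container_cpu_usage_seconds:rate5m_sum
        expr: |
          sum by (namespace, app) (
            rate(container_cpu_usage_seconds_total{container!=""}[5m])
          )
```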
Argo CD started extending its functionality so that capabilities such as config management plugins, metrics plugins, and new dashboards for rollouts are now integrated into Argo CD. One such extension is the metrics extension. Users don't have to go to different screens: once they install Argo CD, they can view events, logs, and metrics in the same dashboard. So Argo CD also started leveraging metrics from these Prometheus instances, and we had it call the Thanos Query instance to retrieve the data.

If you look at all of these different workloads, they put pressure on Prometheus, and scaling Prometheus is costly. This was burning cost, so we started looking at how to optimize cost and avoid impacting Prometheus with all these use cases. Prometheus has remote-write functionality; could we leverage it in a way that reduces cost? What we did was separate the concerns for each of these use cases. For the regular use cases, we let cluster admins and developers take the regular query path, which queries Thanos and retrieves data from Prometheus, because those use cases are limited and the TPS is low.

Where the TPS is high, for example the AI Ops use case, we wanted to store data for eight days, and retrieving eight days of data overwhelmed the Thanos Query pods because they read a lot of data from the backend. So we added an additional component, Query Frontend, which chunks a query into the interval you want; we chunk at every two hours so the Thanos Query pods are not exhausted. The data is retrieved from Thanos Receive: the remote write here puts the data into Thanos Receive, where we store it for eight days for the AI Ops use case. The primary reason we chose eight days is that, based on our study, for anomaly detection to reach 95% confidence on an anomaly you need at least eight days of data at a 30-second scrape interval.

The other use case here is Argo CD, which shows live metrics. Developers are mostly interested in live metrics when they look at Argo CD: they want to see immediately how a deployment went and what the metrics from the new deployment look like. We didn't want to store that data for long, so in this Thanos Receive instance we keep only six hours of data. With these different use cases, we solved the problem by splitting into different pipelines, but the source of all these metrics is Prometheus, and Thanos has helped us solve these problems.

Let's look at one use case with Argo Rollouts. Argo Rollouts is a replacement for the Deployment in open source Kubernetes, which doesn't support blue-green, canary, or progressive rollouts; Argo Rollouts was created to solve that problem and is available in open source. We leverage metrics primarily to decide when to promote a deployment from canary to stable, or shift from blue to green, and so on. To make that reliable, we use analysis templates that run an analysis and qualify whether the new deployment, the new image you have deployed, is running successfully or not. A minimal sketch of such an analysis template follows.
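This is a hedged reconstruction, not the exact template from the talk: the Prometheus address, query, and threshold are placeholders consistent with the CPU-utilization check used in the demo.

```yaml
# Sketch of an Argo Rollouts AnalysisTemplate backed by Prometheus.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: metrics-analysis
spec:
  metrics:
    - name: success-rate
      interval: 30s        # evaluate every 30 seconds
      count: 1             # kept at one for the demo
      failureLimit: 1
      # Promote only while average CPU usage stays below the target.
      successCondition: result[0] < 0.8
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            avg(rate(container_cpu_usage_seconds_total{namespace="metrics-analysis",pod=~"rollouts-demo-.*"}[5m]))
```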
In a definition like this, as you can see, we use a Prometheus query to fetch the data and ensure it meets a success condition. The way it works internally is that you have a stable deployment, and once you introduce a new image, a canary ReplicaSet is created along with a canary pod. If, say, we have five pods running for the stable ReplicaSet and one pod shifts into the new ReplicaSet, that percentage of traffic gets shifted to the new ReplicaSet, and at that point we can run the analysis template, which triggers an AnalysisRun that queries data from Prometheus. Argo Rollouts supports other metric providers, but at Intuit we primarily use Prometheus as the backend for metric data. Once the AnalysisRun succeeds, we promote the rest of the pods and make the canary stable.

Let's look at a quick demo. Here you can see the rollout definition: the rollout is named rollouts-demo, it runs in the metrics-analysis namespace with five replicas, and it uses the canary strategy with different weights defined. Initially 20% of traffic shifts, then the analysis template runs and triggers an AnalysisRun; if that passes, the rest of the traffic shifts in steps, after 10 seconds each time (a sketch of such a rollout spec follows this section). We can also run the analysis template in the background so that it executes at every step and we ensure the deployment stays healthy.

Let's look at the analysis template definition we saw earlier. We call it success-rate. It runs every 30 seconds, and we can define the count, the failure limit, and other fields as required; we have kept them at one to keep the demo simple. You can also set a timeout, so if Prometheus is not responding you can time out and retry after a certain interval. Here we query the CPU utilization data in this namespace for this particular app, and we have defined the success criteria for it.

Let's go ahead: we have already created this rollout, and we'll update the image to a newer one so that the canary process starts. I've zoomed in for an easier view. As you can see, the newer image starts getting deployed and new pods are provisioned. If you look at the analysis template, this is the one we defined, and you can see a new AnalysisRun was triggered and succeeded, which is why the pods in that namespace started getting promoted. The value we got from Prometheus for this query is 0.018, which is less than our target, so the success criteria passed and promotion began; as you can see, the remaining pods shift into the new canary, which then becomes stable. That is the overall use case for leveraging metrics with Argo Rollouts. Amit will cover the other use cases.

Thanks, Ravi. I'll move forward with the next use case, AI Ops with metrics. At Intuit we use automated detection of discrepancies and issues, something called an anomaly score. It's a valuable tool for developers to find issues quicker than traditional methods, and it reduces MTTD and MTTR.
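For reference, here is the rollout sketch promised in the demo section above: five replicas, a canary strategy, a 20% initial weight, an analysis step, then 10-second pauses between the remaining weight increases. The image and labels are illustrative.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
  namespace: metrics-analysis
spec:
  replicas: 5
  selector:
    matchLabels:
      app: rollouts-demo
  template:
    metadata:
      labels:
        app: rollouts-demo
    spec:
      containers:
        - name: rollouts-demo
          image: argoproj/rollouts-demo:blue  # updating this triggers the canary
  strategy:
    canary:
      steps:
        - setWeight: 20
        - analysis:                  # triggers an AnalysisRun
            templates:
              - templateName: success-rate
        - setWeight: 40
        - pause: {duration: 10s}
        - setWeight: 60
        - pause: {duration: 10s}
        - setWeight: 80
        - pause: {duration: 10s}
```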
So at Intuit we use Argo Rollouts along with this anomaly score to automatically roll back changes if the quality of a change doesn't meet the required standard. While implementing this we faced several challenges. One of them was access to historical data: whenever the Numalogic infrastructure queried this data, we had to scale based on the load. We had the in-cluster Prometheus, but it was already doing multiple operations, scraping, compaction, serving HPA metrics, remote writing to different destinations, and rule evaluation, so we didn't want to put another read-heavy operation on it.

So we came up with an initial design where, as you can see in the highlighted sections, we push the data from Prometheus via the Thanos sidecar to an S3 bucket, and we created a dedicated pipeline for the Numalogic infrastructure to query the AI Ops data. In the S3 bucket we store the data for eight days, and we use the Query Frontend to fetch it efficiently. But there is a problem: we are storing all the data for eight days in the S3 bucket, yet for AI Ops we don't need all the metrics; there are only a few we are interested in. This huge amount of data in S3 also incurs a lot of cost, and because of the way TSDB blocks are designed, retrieving a particular metric loads the entire TSDB block into memory, which may contain unnecessary metrics you don't want.

So we improved the design a bit: we replaced the Store pods with Thanos Receive, and we now write only the required metrics from Prometheus to Thanos Receive (see the relabeling sketch after this section), keeping them in its local TSDB for eight days. Since the number of metrics is small, we can afford that. Whenever the Numalogic infrastructure queries the data, it goes by the same path, through Thanos Query Frontend and Thanos Query, but this time we fetch the data directly from Thanos Receive's EBS volumes, which is much faster, and there is a significant cost reduction as well due to the smaller volume of data we now keep.

I'll quickly show a demo of the performance of this new infrastructure; it's a recorded one, let me play it. At the top you can see this is a query-S3 instance, meaning it is connected to the S3-based pipeline, and I'll quickly show the store it is connected to: it says it is connected to a Thanos Store instance, and we can verify that using the IP. So this Thanos Query is querying the S3-backed Store instance; let's try to fetch eight days of data. As you can see, it took around four seconds to fetch that data. Now let's try the newer approach using Receive and query the same data. Before that, we quickly confirm it is connected to a Receive instance, which we can verify with this command. We fetch the same data here, click execute, and as you can see it only took around 300 to 350 milliseconds. So there is a significant performance improvement. That was the demo; let me move on to the next use case.

As Ravi mentioned, for Argo CD live metrics we have a new extension in Argo CD that allows developers to customize their key metrics and view them in the Argo CD UI itself.
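Here is the relabeling sketch referenced above: selective remote write can be expressed with write relabeling. This is a minimal example, assuming an illustrative Receive endpoint and metric names.

```yaml
# Remote-write only the series the anomaly-detection models need to
# Thanos Receive, instead of shipping every metric for eight days.
remote_write:
  - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "http_requests_total|http_request_duration_seconds.*|container_cpu_usage_seconds_total|container_memory_working_set_bytes"
        action: keep
```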
You can configure metrics like CPU, memory, and HTTP error metrics, and view them directly in the dashboard. It also helps give developers a unified view and identify issues quicker. In the diagram you can see that whenever there is a query from the Argo CD browser UI, it goes to the Argo CD server, where an extension service communicates with the different Argo CD extensions. The one we are interested in here is the metrics extension: when the query reaches the metrics extension, it queries the in-cluster Prometheus to fetch the metrics. But here we faced a similar set of challenges as before: the in-cluster Prometheus instances are already doing multiple things, and we didn't want to add another read-heavy operation to them.

So we came up with a similar approach. In the highlighted section, we created a dedicated read path for the Argo CD live metrics: we remote-write only the required metrics from Prometheus to a separate Thanos Receive TSDB instance. Whenever Argo CD queries this data, it goes via the separate read path, which can scale independently and doesn't affect the in-cluster Prometheus. This is how it looks when you configure the metrics tab in Argo CD: you can see all the metrics in the same tab as your application.

Coming to the next one, we wanted to send some golden-signal metrics to Kafka. The need was to send data to multiple Kafka topics, with support for tag-based filtering as well. The challenges: there were open source solutions available, but they didn't support multiple topics, and there was no graceful shutdown or connection handling. Also, to use those solutions we would have needed extra remote writes from Prometheus, which would add roughly a 25% memory overhead on Prometheus. So we built our own custom Prometheus Kafka adapter with better connection handling, and we added multi-topic support, which means we can send metrics to multiple topics based on something called a tag parser.

A tag parser is nothing but a collection of labels and a matcher (there is an image as well). Whenever a metric coming from Prometheus matches those tags, here foo equal to anything and bar equal to true, that metric goes to Kafka topic one, and anything that doesn't match any of the matchers is dropped (a configuration sketch follows this section).

I'll quickly show a demo on this; let me play it. We are using Telegraf to generate some mock Prometheus metrics. As you can see, there is metric one with the labels foo and bar, and metric two with baz equal to true. So ideally, metric one and metric two should go to Kafka topics one and two, and metric three should be dropped. We are also creating local Kafka instances using Docker Compose, and this is the same configuration I showed on the slide. Let's start all the components: first Telegraf, to generate the mock metrics, and as you can see it starts producing mock data. We also start Kafka; once it is up, we start the Kafka adapter, which begins accepting remote-write requests from Telegraf. As you can see, all the metrics are going to the correct topic: metric one is going to topic one.
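Here is the configuration sketch referenced above. The concrete file format of our adapter is internal, so the field names below are illustrative; the labels and topics mirror the demo.

```yaml
# Hedged sketch of a tag-parser configuration: each parser pairs a set
# of label matchers with a Kafka topic.
tagParsers:
  - topic: kafka_topic_one
    match:
      foo: ".*"     # foo may carry any value
      bar: "true"   # bar must equal "true"
  - topic: kafka_topic_two
    match:
      baz: "true"
# Metrics that match no tag parser are dropped.
```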
Similarly, all the other metrics go to the correct topics based on the configuration. That was the demo, and I'll hand over to Ravi to talk about the metrics components at Intuit.

So we have seen certain use cases that are critical for us, but in general there are many other components at Intuit that leverage metrics. Here are some 15 add-ons that we use, at a high level. Some of them were built in-house and open-sourced via the Keiko project (keikoproj), and some are consumed from open source, like Prometheus and Thanos. We have tools like Active-Monitor, which does synthetic monitoring of the add-ons and self-heals an add-on that is in a bad state; AWS metrics, to check the resource limits on the cloud provider and self-heal in case the limits are breached; kube-state-metrics and metrics-server, to fetch CPU and memory data for the pods; Alertmanager, to create alerts through GitOps; and node startup and cost monitors to keep a check on cost and node startup times. So there is a lot of depth at which we look at metrics and process them for our use cases. Thank you. Any questions? Yeah, there is a mic in the aisle.

Thanks for a very interesting talk. I have a question: I believe the sharding rule plays a crucial role in determining how effective your scaling effort is. How was it designed and configured to adapt to a dynamic environment like Kubernetes, where targets are added and removed constantly at very high speed? How do you design the sharding rule to ensure the workload is distributed evenly among the Prometheus instances?

Yeah, thank you. Today, our sharding is based on a GitOps process. We get alerted if a Prometheus instance, say the shard we started with, reaches a threshold on resources. We then add another shard through a GitOps process that creates a new shard of Prometheus, and both start processing. Essentially, if your environment spins up many pods very quickly, you might need to lower the threshold and scale accordingly. That's the solution we have gone with.

But did you evaluate whether the workload is distributed evenly, or is there any way you measure that?

Do you mean in Prometheus?

Say there's a skewed workload; can you detect that?

We haven't looked into that use case, because we haven't come across an issue where one of the Prometheus instances gets skewed. We have not seen that yet, but it would be good to look into; we'll watch out and let you know if we find anything like that.

OK, thank you. I saw in one of the other Intuit talks that you've been relying on Istio as well in your service mesh. Are you leveraging any of the generated Istio metrics for your use cases? And if so, can you elaborate? Thank you.

That's a great question. We use Istio primarily for all our service mesh use cases, and we consume a lot of the metrics Istio generates. We keep a different scrape limit for it, because Istio generally emits a lot of metrics, so the scrape limit on Istio pods is very high. And the golden signals we were talking about are one of the primary use cases for Istio, because it generates so many metrics.
We came up with a set of five golden signals built from the metrics Istio generates, consume them via the event bus, and alert the user based on anomalies and other things.

Two more follow-ups: are you using it for your Argo Rollouts? And if so, what kind of Istio metrics do you use for that analysis?

Yes, Istio is used in Argo Rollouts: we use the VirtualService to drive canary deployments with Argo.

And how do you maintain the balance on cardinality? Istio generates high-cardinality metrics.

Oh, we control it with the scrape limit in Prometheus. Thank you. I think we are at time, but we are available outside; if any of you have questions, please reach out to us. Thank you.