Hi, good morning, good afternoon, or good evening, depending on where you are. Today we're going to talk about making dashboards automatically faster, using some tools we already know about, like recording rules, and also diving into the M3 aggregation tier. My name is Shreya S. I'm a software engineer at Chronosphere, where we're building a hosted metrics and monitoring platform targeting large-scale, high-throughput use cases. I'm interested in all things observability, but primarily in making use of metrics and traces to solve interesting problems. Prior to this, I was on the observability team at Uber, building out and scaling the alerting platform there.

The agenda for today: first we set up the problem and look at high-cardinality metrics and why they're a challenge. Then we talk about aggregating metrics using recording rules and the M3 aggregation tier, and how we can use them to make dashboards faster. Then we talk about how to make all of this easy to use. Finally, we round out with a demo and Q&A.

To set up the problem for most of this talk, we're going to take cAdvisor as an example, which provides resource usage and performance metrics for running containers. The dashboard here is the cAdvisor dashboard from Grafana. As you can see, it has some aggregated information and some individual, pod-level information. It's tracking 5,000 containers. You have aggregated stats like CPU usage and network traffic, and you also have per-pod stats like CPU usage and memory usage, which are useful. The aggregate stats give you an idea that something is wrong, and the per-pod stats help you narrow down the cause. The dashboard has tons of dimensions because it wants to slice and dice the metrics in different ways.

So what do these dimensions actually cost? If you consider a simple metric like CPU usage, you end up with around 16,000 series, and it takes 20 seconds to query the information, which is really slow. That's because of all the different tags on it: there's a pod ID, there's an instance, there's the image, and so on. If you only consider the two labels we're actually interested in and aggregate the metric with a recording rule, the query becomes far more manageable: around 230 series, returning in about half a second. And in most cases, that aggregate is all the information we want. (The two queries are sketched below.)

The reason we get so many series is that even though cAdvisor is only tracking a bounded set of containers at any given moment, as pods get restarted and rescheduled, new time series show up. So more and more series get stored in the time series database; they have no current data, but when you query over 30 minutes or a few hours, you end up pulling in series that are no longer active. The result is dashboards that slowly degrade over time. As mentioned, this is because the cardinality of the dimensions keeps increasing: as you add new instances and roll out new images, we keep adding series and things keep slowing down.
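To make that comparison concrete, the two queries might look roughly like this. The metric and label names are illustrative, borrowed from cAdvisor's usual naming, and the recording-rule output name is made up for this example:

```promql
# Raw query: fans out over every per-pod / per-instance / per-image series (~16,000 here)
sum by (container_name, namespace) (rate(container_cpu_usage_seconds_total[5m]))

# Same panel against a pre-aggregated series (~230 series), e.g. the output of a recording rule
namespace_container:container_cpu_usage_seconds:sum_rate
```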
So this results in slower dashboards. Slower dashboards lead to the browser locking up and a bad user experience. When an engineer notices this, the dashboard needs to be optimized, so how do we actually go about doing that? The first step is figuring out which queries are the culprit. You could inspect the requests a dashboard makes and look for slow queries, for example in the Chrome inspector. You could use the Prometheus query log and try to associate slow queries back to the dashboard, but that's difficult: the query log has every query running in the system, and you may have many, many dashboards. Once we've figured out which queries are the culprit, we probably want to do some form of pre-aggregation of these metrics to make the queries faster. Over the next few slides, we'll talk about a few ways to pre-aggregate.

The first is recording rules. This is something Prometheus provides support for, and they're widely used for this purpose. They allow pre-computing queries and storing the aggregate time series back into the TSDB. Once you do that, instead of making a query that scans all of the series, the dashboard makes a query that picks up a single series, so it's super fast. The caveat is that the dashboard now has to point at the pre-computed time series: you need to go and change the dashboard to access the resulting series.

A recording rule looks something like the sketch below. You have a record, which is the new time series you store the result in, and an expression, which is whatever PromQL expression you want executed, so you can put anything as complicated as you like here. As mentioned, the record is essentially a new time series.

To create recording rules, you need to know what to pre-compute. So you figure out the bad queries by doing some analysis of the dashboard, you configure the recording rules, and then you change the dashboard to query the recording rule metrics instead of the underlying metrics. It's a fairly long manual process. And say you've actually done all of this: what happens when something changes, like the query behind the recording rule, or a new panel shows up? You have to repeat the process and do the same manual work again and again.

The second issue is that recording rules are expensive. They execute and pre-compute the query at regular intervals, so a query accessing many time series can get expensive very quickly, especially as the dimensionality of the underlying metrics keeps increasing. And given that they run alongside other queries in the system, like dashboards and alerts, they have the potential to overrun the query engine. Those are a couple of really bad aspects of recording rules that we need to be careful about.
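For reference, here's a minimal sketch of what such a rule file might look like for the cAdvisor example, assuming the same illustrative metric, label, and output names as before:

```yaml
# rules.yml, loaded via rule_files in prometheus.yml
groups:
  - name: cadvisor_cpu_aggregation
    interval: 60s   # how often the expression is re-evaluated
    rules:
      # "record" is the new series the result is written back to;
      # "expr" can be any PromQL expression, however complicated.
      - record: namespace_container:container_cpu_usage_seconds:sum_rate
        expr: |
          sum by (namespace, container_name) (
            rate(container_cpu_usage_seconds_total[5m])
          )
```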
But as mentioned before, sometimes we don't actually need all the dimensions of the underlying metrics. What we care about is just the aggregate; we don't really care about the underlying series themselves. It would be nice if we could get rid of them completely and not store them at all, and that's where the M3 aggregation tier comes in.

M3 is a remote storage for Prometheus, and the M3 aggregation tier allows us to move the expensive recording rule computation into streaming aggregation. When a Prometheus remote write comes into M3, the coordinator sees that some metrics need aggregation and forwards them to the M3 aggregation tier. The aggregation tier knows which rules to apply to which metrics; it applies those aggregations and then sends the results back to the coordinator to persist into long-term storage in M3DB. So the aggregator allows downsampling, dropping, or aggregating metrics before they're persisted to the time series database.

The aggregator supports two types of rules. The first is rollup rules, which aggregate metrics. The second is mapping rules, which among other things allow dropping raw metrics. We're going to talk about both, but first we're going to dive into rollup rules.

Rollup rules are a way to aggregate metrics. They consist of a series of transforms which are applied in order to generate a new metric, and the metrics they apply to are determined by a filter. A rollup rule also has a storage policy, which determines where the generated time series is stored in M3DB. Let's walk through the transforms one by one; a sketch of a full rule follows below.

The first step is what we call an increase, or delta, transform. The underlying metrics coming from Prometheus are monotonically increasing counters. Since we can't aggregate those as-is, we first need to apply the equivalent of the Prometheus rate function. That's exactly what the increase transform does: it computes the diffs between successive data points and generates a new series for the next transform to use.

The next transform is the rollup. It sums the deltas by the unique dimensions specified in the group-by. In this case we're only interested in container name and namespace, so it does the equivalent of a PromQL sum by container and namespace, and it stores the result under a new metric name. The metric name is interesting: if you're also using a mapping rule to drop the original metrics, you can store the rolled-up metric under the same name as the original. If you're not planning to drop the underlying metrics, the name has to be different, just like in the recording rule case. The advantage of dropping the raw metrics and reusing the original name is that dashboards, alerts, or anything else querying that metric now automatically queries the aggregate rather than the high-dimensional metric. So that's something to keep in mind.

Finally, after we've rolled things up and reduced the dimensionality, we want to store the result back into the TSDB. So we need to go back from deltas to monotonically increasing series, and we apply a cumulative add operation to each metric to get an aggregated, monotonically increasing time series.
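Put together, those three transforms map fairly directly onto a rollup rule in the M3 coordinator's downsample configuration. The sketch below assumes the cAdvisor metric and label names used earlier, and the filter syntax, resolution, and retention values are illustrative:

```yaml
downsample:
  rules:
    rollupRules:
      - name: "cpu usage by container and namespace"
        # applies to incoming series whose tags match this filter
        filter: "__name__:container_cpu_usage_seconds_total"
        transforms:
          - transform:
              type: "Increase"   # counter -> per-interval deltas
          - rollup:
              # reuse the original name only if a mapping rule drops the raw series
              metricName: "container_cpu_usage_seconds_total"
              groupBy: ["container_name", "namespace"]
              aggregations: ["Sum"]
          - transform:
              type: "Add"        # back to a cumulative, monotonically increasing counter
        storagePolicies:
          - resolution: 1m
            retention: 720h      # selects the M3DB namespace the output is written to
```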
That output is sent to the M3DB namespaces identified by the storage policies, and that's where it gets stored.

Now, as mentioned before, if you want to store the rollup under the same name as the original metric, you need a way to drop the raw, high-dimensional series. This is where mapping rules come into play. A mapping rule also has a filter, which says: these are the metrics I'm interested in, match them and drop them, do not store them. The rollup rules still apply to those metrics, but the raw series never get stored in the TSDB. So with the combination of rollup rules and mapping rules, we get aggregation really easily, at ingestion time. (There's a sketch of such a mapping rule a little further down.)

To summarize, the aggregation tier allows for ingestion-time streaming aggregation. Metrics can be aggregated or rolled up based on whatever rules you provide, and different aggregation functions are available. And the key thing is that there's a way to drop the raw metrics based on matching filters, which gives us the great property that dashboards, alerts, and other queries are sped up automatically.

So we really have two ways of doing things, and I want to quickly go over the pros and cons of recording rules versus rollup rules. Recording rules are general purpose and support full PromQL, so if you have a super complicated, expensive query you want to speed up, recording rules are probably the way to go. The caveat is that they're expensive, because they run against the real query engine and they affect other queries: if a recording rule aggregates across 20,000 time series, then every minute, or every interval, it queries those 20,000 series, aggregates them, and stores the result back. The third, and bigger, point is that all the data still needs to be stored, so there's a high storage cost: with a recording rule you store the high-dimensionality information and the aggregate information at the same time. It also means your dashboards, queries, and alerts need to be modified to hit the aggregated metric instead of the original metrics.

Rollup rules, because they happen at ingestion time, are much more efficient to run. We have the option of storing only the aggregates we need and dropping the other series. And if we go the route of only storing aggregates, you get automatic query speedup, because the aggregate ends up with the same metric name as the original metrics. The biggest caveat with rollup rules is that they don't support full PromQL, only specific aggregate functions. We have been adding more of these aggregate functions over time, but it's unlikely rollup rules will ever support everything recording rules do.
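Here's the mapping-rule sketch mentioned above. It would sit alongside the rollup rule from earlier in the same downsample configuration, and it assumes the same illustrative filter and the availability of a drop option on mapping rules:

```yaml
downsample:
  rules:
    mappingRules:
      - name: "drop raw per-pod cpu series"
        # matched series still flow through the rollup rule above,
        # but the raw, high-dimensional series are never written to M3DB
        filter: "__name__:container_cpu_usage_seconds_total"
        drop: true
```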
So now that we have these two ways of aggregating things, how do we make it easy to actually do these aggregations? We're going to talk about a couple of things here, starting with a tool we call the high cardinality analyzer, which makes it easy to do this analysis and create these recording rules and rollup rules. Let me share my screen again here. The tool is available on GitHub, and the link is shown in the latter part of the talk. We're going to make use of the Prometheus query logs, analyze them, and then do some operations to speed things up.

For this example, I emitted some metrics locally just to get some basic data for the purposes of the demo. It's not really high-cardinality information with lots of series, just a handful of queries. So what does the query log actually have? It has the query information: what the exact query was, when it started, and then various stats, like how long it took to evaluate, how long it took to sort, the query preparation time, the inner evaluation time, and so on.
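For reference, a Prometheus query log entry is a JSON line along these lines. The values are made up and the fields abbreviated; the exact set of fields depends on the Prometheus version:

```json
{
  "params": {
    "query": "sum by (container_name, namespace) (rate(container_cpu_usage_seconds_total[5m]))",
    "start": "2021-05-20T10:00:00.000Z",
    "end": "2021-05-20T10:30:00.000Z",
    "step": 15
  },
  "stats": {
    "timings": {
      "evalTotalTime": 20.1,
      "resultSortTime": 0.002,
      "queryPreparationTime": 0.4,
      "innerEvalTime": 19.7,
      "execTotalTime": 20.5
    }
  },
  "ts": "2021-05-20T10:30:02.000Z"
}
```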
To run it, we have the high cardinality analyzer and we point it at a sample query log. We give it some targets: I'm only interested in queries with a minimum query time of 0.01 seconds, because this is a small example, and I only want queries which have run at least two times. So the tool gives you a few options like that. The analyzer then has a mode that generates recording rules for the queries it identified, and you can see those recording rules here. It has another mode where it generates rollup rules and mapping rules instead.

So what exactly is the analyzer doing? Let's jump back to the slides and talk about that. The analyzer primarily uses the Prometheus query log, which records all of the queries run by the engine along with information about where time was spent in each query. The analyzer is an offline process, a standalone tool, for generating recording and/or rollup rules. As mentioned, it uses the Prometheus query log to find good candidates for aggregation, and it provides recommendations for recording rules, or for M3 aggregator rollup and mapping rules, to create in order to speed up expensive queries. It only provides recommendations; users then have the option of taking those recommendations, creating the actual rules, and speeding up their dashboards.

The steps look roughly like this. You start with, say, a day's worth of Prometheus query logs, so there's enough information to know which repeated, expensive queries are being run against the system; we want to find the most commonly hit expensive ones. Once we have that, we want to check whether the cost of those queries is due to the number of series. A query can be expensive for different reasons: it might be reading large chunks of data, meaning a few series over a huge span of time, or it might be touching many, many series. Even when a query touches many series, if it actually returns all of that data, you can't really speed it up. But if it reads many underlying series from the TSDB and then aggregates them down into a much smaller set of series, those are the candidate queries we're looking for. So we look at the cardinality of the queries being run and only optimize those.

Once the analyzer identifies the queries that need to be sped up, it proposes recording and rollup rules to create, and users are free to go and configure those rules as necessary. If the user creates recording rules, then the dashboards and other places where the queries run need to be changed. If we're talking about rollup rules, and the user has also said they want to drop the underlying metrics, then the queries get sped up automatically, because they now pick up the aggregated metric. If a rollup rule does not drop the underlying series, then, as with recording rules, you have to go and change the dashboards and other places where the queries run.

So we built this tool and it's available open source; the link is on the last slide. We encourage you to try it out and let us know how it works for you. Thank you so much; we're open to questions now. If you want to know more about the M3 aggregation tier, do reach out to us on the M3 Slack channel, and there's a link to the high cardinality analyzer here for you to try out. Thank you again.