Hi everybody, my name is Suresh, and today we're here to talk about how we can auto-suggest and auto-generate recording rules, with the goal of speeding up dashboards and, in general, speeding up query performance. I'm a technical lead at Chronosphere, where we're building a hosted metrics and monitoring platform targeting large-scale, high-throughput use cases. The platform is built on M3, an open source remote storage backend for Prometheus. Prior to this, I was at Uber on the Verity team, primarily working on alerting and dashboarding.

The agenda for today: first, we lay out the problem statement, talking about high cardinality use cases where aggregating metrics is useful. Then we talk about a couple of different ways we can aggregate metrics to speed up dashboards, and querying in general. We follow that up with a demo, and finally the goal is to show how we can make all of this easy to use.

So first, setting up the problem statement. High cardinality metrics are the area where we probably want aggregation, and cAdvisor metrics are a classic case of high cardinality metrics. cAdvisor is a way to get resource usage and performance metrics for running containers: CPU, memory, disk, network traffic, and all the infrastructure-level attributes of your containers. Here is a simple cAdvisor dashboard monitoring around 5,000 containers, with various metrics on it. If you look at the dashboard, there are some aggregate metrics, and there are also per-pod metrics where you want to look at top usage information. For a dashboard like this, just looking at the scale and cardinality of the data, you realize it can become really slow really quickly. This is where you can make portions of the dashboard faster by using pre-aggregation.

An example of how cAdvisor metrics get slow is something as simple as container CPU usage. With all the tags cAdvisor adds as the metrics are scraped, this one metric comes to about 16,000 series for the same environment. If you just want to display all of them, it takes about 20 seconds, so any simple operation, like a sum or a max over these series, is going to be really, really slow. But you probably don't care about every combination of these labels. So what you can do instead is take the same metric and aggregate it down to just a couple of labels; in this case, the two labels we've aggregated to are the cluster and the namespace the container is in. If you're using namespaces as a division of your services, or some form of division like that, this gives you CPU usage per customer or tenant, or some notion of that. That's probably good enough to tell you how a particular portion of your system is performing.
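To make that concrete, here's a rough sketch of the raw versus aggregated query. The metric name is cAdvisor's CPU counter, but the cluster and namespace labels are assumptions about how your environment is labeled:

```promql
# Raw metric: with all the labels cAdvisor adds, this matches
# every underlying series (~16,000 in the example environment),
# so even a simple sum or max over it is slow.
container_cpu_usage_seconds_total

# Aggregated down to two labels: returns one series per
# cluster/namespace pair, which is cheap to fetch and render.
sum by (cluster, namespace) (rate(container_cpu_usage_seconds_total[5m]))
```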
This doesn't mean we don't want the underlying series. When we actually have an issue within a particular namespace and cluster, we probably want those individual series so we can dig deep into that namespace and see what's wrong. But for an overview dashboard, or something you want to alert on, the aggregate is what you normally want to see.

So high cardinality metrics, as we've seen, have many dimensions and can be really slow to query, and the cardinality of those dimensions keeps increasing. As we add new instances and roll out new images, new instance tags and new image tags show up, and each of these results in more unique series. As new tags appear, you keep having an explosion in the number of series. And if you run queries that span across time, across hours or days in which new rollouts have happened, those queries end up capturing more and more underlying series, eventually leading to slow dashboards. What we see as users is that a dashboard which is really fast today just keeps getting slower over the next few weeks, eventually to the point of the browser locking up. Then, when engineers actually need to use these dashboards, they notice this and realize the dashboard needs to be optimized.

So how do you debug and approach this problem? The first step is to figure out which queries are the culprit. Inspecting the requests from a dashboard to look for the slow queries is a good start; you can do that in one of the browser inspectors, like the Chrome or Firefox inspect panel. But even once you know what the slow queries are, you still have to associate them back to the panels, and on a large dashboard that's not always easy. Another option is the Prometheus query log, which has information about all the queries running in the system. If you're able to parse that information, you can associate it back to a dashboard, but again, that's difficult. And once we've figured out which queries are the culprit, there's a second step: we need to decide how to pre-aggregate those metrics to actually speed the queries up.

So let's talk about the different options we have to aggregate metrics. Recording rules are probably the most obvious one. They're supported natively by Prometheus and allow pre-computing frequently needed or expensive queries, storing the aggregate time series back into the TSDB. When you actually want the information, you make simple queries that just pull back a single pre-computed time series, so the queries become really fast. Once you've done that, you can point dashboards back at the pre-computed series. The really cool thing about recording rules is that you have all of PromQL available to you, so your pre-computation can be as complex as you want. As an example, say we have two metrics: total filesystem bytes and available filesystem bytes. If you want a dashboard that just shows the percentage used, you can write a recording rule that computes total minus available, divided by total, stores that back as a new metric, and the dashboard then queries this one metric.
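As a minimal sketch, assuming the node exporter's filesystem metrics as the total and available bytes (substitute whatever metric names you actually have), the rule file could look like this:

```yaml
# rules.yml -- a sketch of the percentage-used recording rule.
groups:
  - name: filesystem
    interval: 30s                     # how often the rule is evaluated
    rules:
      - record: instance:filesystem_used:percent
        expr: |
          (node_filesystem_size_bytes - node_filesystem_avail_bytes)
            / node_filesystem_size_bytes * 100
```

The dashboard panel then just queries `instance:filesystem_used:percent` instead of re-running the arithmetic over the raw series on every refresh.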
Recording rules execute and pre-compute the query at regular intervals that you define: every minute, every 30 seconds. You can probably run them at the same interval things get scraped at, so all your metrics have the same resolution. There are a couple of downsides, though. One is that the underlying time series still need to be stored in the TSDB, and queries accessing many time series can get expensive. If we're aggregating across many, many series, the rule query itself may take tens of seconds, and that limits how frequently you can actually run your recording rules. And there are situations where we don't need the underlying metrics at all; the raw metrics could be dropped and never stored. We also may not care about the more complex queries where recording rules are really powerful.

For those cases, let's go over the M3 aggregation tier, which doesn't support all the functionality recording rules do, but does aggregations in a different style. M3 is a remote storage for Prometheus, and it moves the expensive recording-rule computation into a streaming aggregation tier. The aggregation isn't running through the query engine itself; rather, it happens in a streaming fashion before metrics get persisted. The aggregator allows downsampling, dropping, or aggregating metrics prior to persisting them to the remote storage, in this case M3DB. The aggregation tier supports two types of rules: rollup rules, which let you aggregate across metrics and apply functions like sum, max, and so on; and mapping rules, which have a couple of different purposes, but for this talk we'll just treat them as the ability to drop metrics.

Rollup rules contain a series of transforms which are applied in order, and the metrics a rule applies to depend on a filter match. A rollup rule here has three steps: the first takes a delta, the second sums by the dimensions you're interested in, and the last creates a monotonic cumulative counter again. Let's look at what each of these steps does. The first step, the Increase transform, takes the delta of the underlying series. The metrics coming in are monotonically increasing, as all Prometheus counters are, so before they can be rolled up we need the deltas, to figure out what the value is at each specific time. These deltas are then sent over to the rollup step, which sums them by the unique dimensions specified in the sum-by. For container CPU usage, we want to sum by container name and namespace, and the aggregation here is just sum. Once we have these sums of deltas, the last transform turns them back into the monotonically increasing format, to make them compatible with Prometheus again: it computes a cumulative counter out of the summed deltas, and we end up with an aggregate time series. That series is sent to M3DB, and a namespace identifies where to store it within M3DB.
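Putting the three steps together, a rollup rule along these lines is configured in the M3 coordinator roughly as follows. This is a sketch patterned on M3's documented rollup rule format; the rolled-up metric name and storage policy are placeholders, and exact field names are worth verifying against the M3 docs for your version:

```yaml
downsample:
  rules:
    rollupRules:
      - name: "container cpu by container and namespace"
        # Only metrics matching this filter go through the rule.
        filter: "__name__:container_cpu_usage_seconds_total"
        transforms:
          - transform:
              type: "Increase"        # step 1: deltas of the counter
          - rollup:
              metricName: "container_cpu_usage_seconds_total:namespace"
              groupBy: ["container_name", "namespace"]
              aggregations: ["Sum"]   # step 2: sum by the kept dimensions
          - transform:
              type: "Add"             # step 3: cumulative counter again
        storagePolicies:
          - resolution: 30s
            retention: 720h
```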
Now for the second part: we've been able to roll these metrics up, and we've done it before persisting the raw metrics to storage. Mapping rules come in here, and they allow us to drop metrics based on a particular filter. You don't have to do this, but in the case where you don't want both the original and the aggregated series, it can save a lot of storage space. If you don't need the per-container CPU information, or the other dimensions of the CPU metric, the summed-up metric has already been stored, and you can say, with a similar filter, that you want all the original series dropped; it's purely a space saving. The other neat thing about doing this is that the aggregate series can then be stored under the same name as the original metric. So depending on how the dashboard queries are set up, you can potentially speed up dashboards without making any changes to them at all.
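A sketch of the matching mapping rule; I'm assuming the drop flag from M3's mapping rule config here, so double-check the exact syntax against the docs for your version:

```yaml
downsample:
  rules:
    mappingRules:
      - name: "drop raw container cpu series"
        # Same filter as the rollup rule above: once the rolled-up
        # series is being written, the raw series can be dropped
        # instead of being persisted to M3DB.
        filter: "__name__:container_cpu_usage_seconds_total"
        drop: true
```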
A quick summary of the aggregation tier: it allows ingestion-time streaming aggregation, metrics can be aggregated and rolled up based on defined rules, and raw metrics can be dropped, based on matching filters, by using mapping rules.

A quick comparison between the two approaches. For recording rules, the biggest advantage is that they're general purpose and support full PromQL. They're a bit expensive because they run against the query engine, so they may affect other queries. Rollup rules are also expensive, but the work happens at a different stage of the pipeline, so they don't affect queries themselves. The other thing with recording rules is that all the raw data has to be stored, so there's a higher storage cost, as against rollup rules, where you don't necessarily need to store all the data. So rollup rules are in general more efficient to run: the ingestion-time aggregation is separated out into its own piece of software, and you can store only the aggregates. If you only store the aggregates, you can potentially speed up dashboards and queries without changing them, because the aggregates can have the same metric name. But the biggest downside is that rollup rules don't support full PromQL, only specific aggregation functions. If you need a sum or a max, this works really well; for more complicated PromQL, like dividing two different series as in our example earlier, they don't really work.

With either of these, though, the challenge in using them is that you need to figure out what to pre-compute. You need to find the bad queries by analyzing dashboards, then configure the aggregation rules, then change the dashboards to query the aggregated metrics. And after you've done all that, you may not have changed every panel, so what happens when a metric changes or a second panel becomes slow? You end up repeating the whole process. So: is it possible to do this automatically? We're going to show you a way to do something reasonably automatic.

To start with, we have a couple of graphs here which are continuously updating. They're generated using a tool called promremotebench, which emits synthetic metrics: one is a disk usage metric and the other is a CPU usage metric. We have two queries running, such as the max of disk usage by measurement type. Below those, we have two panels for recording rules, and we'll show how those recording rules get populated automatically. Over here we have promremotebench running, continuously emitting metrics that are scraped by the Prometheus instance we have here. And in this tab we have the analyzer that we can run. We have it set up to run in a loop: it runs the analyzer pointing at the query log Prometheus is running with, passes a couple of parameters indicating that it should generate recording rules and write them into the Prometheus rules file, then curls Prometheus to reload its config, and keeps doing that. When the analyzer runs, it looks at the Prometheus query log, figures out which queries it can actually improve, and generates recording rules based on that. If you look at what it generated, it has created a rule group, the analyzer group, with a couple of rules, one for each of the queries we have. And if we go back to the dashboard now, looking at the last five minutes, you can very soon see these recorded series show up. So what it has done is take the Prometheus query log, figure out what to do, and automatically add these recording rules.

So let's talk a bit about how this is actually possible, jumping back to the slides. Fundamentally, we're analyzing the Prometheus query log and making decisions about what to speed up. The Prometheus query log logs all queries run by the PromQL engine. It provides a bunch of timings: the total evaluation time, the amount of time spent in the execution queue, the preparation time for the query, the inner evaluation time, along with the query expression itself. So it has detailed information about all the steps of the actual query. On the right side is a simple example for a particular query, showing how these timings show up.
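For reference, the query log is turned on with the `query_log_file` setting in the Prometheus config, and each entry is a JSON line roughly like the sample below. The shape follows the Prometheus documentation; the query and timing values here are made up:

```json
{
  "params": {
    "query": "max by (measurement_type) (disk_used_bytes)",
    "start": "2021-05-01T10:00:00.000Z",
    "end": "2021-05-01T11:00:00.000Z",
    "step": 15
  },
  "stats": {
    "timings": {
      "evalTotalTime": 1.21,
      "execQueueTime": 0.00001,
      "execTotalTime": 1.22,
      "innerEvalTime": 1.19,
      "queryPreparationTime": 0.015,
      "resultSortTime": 0
    }
  },
  "ts": "2021-05-01T11:00:01.123Z"
}
```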
The analyzer, then, is an offline process that helps suggest recording or rollup rules. It uses the Prometheus query log to find candidates for aggregation, and it provides recommendations, either recording rules or M3 aggregator rollup and mapping rules, to speed up expensive queries. How does it actually do this? It goes over days of Prometheus query logs and finds the most commonly hit expensive queries. How it determines that a query is expensive is basically the evaluation time: how long it took to evaluate. If a query is expensive, the analyzer then needs to figure out whether it's a candidate for aggregation, and that depends on where the cost of the query comes from. The cost is probably dominated by the number of series, but if we're querying 10,000 series from the TSDB and returning 10,000 series, there's no worthwhile aggregation to do there. On the other hand, if the query has sum, max, or other aggregation functions, or it's doing operations like divides and multiplies, then there's additional cost beyond just retrieving the series. So the analyzer looks at what gets returned from the query versus how many underlying series the query is capturing, and based on that it determines whether the change is worthwhile. The tool also lets you pass in a couple of parameters; for example, in the demo it was looking for at least a couple of queries that had taken more than a second to run before it kicked off the optimization.

The goal for the tool is to provide proposals for recording or rollup rules, and users are then free to configure those rules as necessary. If you do end up using recording rules or rollup rules, dashboards and other places still need to be changed afterwards. The tool is a baseline; you're free to extend it and incorporate it into whatever workflows make sense for you.

So that's what the tool is. Some resources: this is a link to where the tool is hosted; feel free to take a look, give us suggestions, and help us improve it. A couple of other resources cover what recording rules are and why they're useful, and there's also a resource here on M3. Thank you, thanks a lot for your time, thanks for being here, and we're open to questions now.