Thanks everyone for being here. Rob and I are here to talk to you about streaming recording rules for Prometheus, Thanos, and Cortex using the M3 coordinator.

A little bit about us before getting started. My name is Gibbs Collin, and I'm a developer advocate at Chronosphere. During my time at Chronosphere I've worked on the M3 project as a contributor, and I'm also a member of the CNCF Observability TAG. I think I have my Twitter handle up there as well.

And I'm Rob. I'm mainly here to do the demo with Gibbs. As mentioned before, I worked on and created the M3DB open source project, and I'm also an OpenMetrics contributor, a CNCF project defining a standard for transmitting metrics based on the Prometheus exposition format.

Quickly running through the agenda: I'll start with the problem statement, then talk about aggregation using Prometheus recording rules, and then we'll get into streaming aggregation using Prometheus and the M3 coordinator, which Rob will demo, and then we'll have time for Q&A.

So, querying high cardinality metrics. Here we have an example of a cAdvisor dashboard. The cAdvisor dashboard provides resource usage and performance metrics for all of your running containers, and in this particular dashboard we're looking at all the containers, or pods, within the gateway application. It gives you a quick, 10,000-foot view of how your applications are performing across your pods.

Looking into one particular panel, CPU usage across all of your pods, let's take a deeper look at what it takes to calculate this metric and get the results you see in this overview. In this query we're looking at the CPU usage metric, and it takes quite a long time for the query results to render. That's because the query is pulling all the labels across all of your pods, which results in about 51,000 time series. With that sort of load, it takes about 20 seconds for the query results to render.

In most cases, though, you don't actually need this pod-level view of all your metrics; you probably just want an aggregate-level view of what's going on across your pods. To get that, you can do some aggregation. In this example we're aggregating on just two labels, container name and namespace. By doing this aggregation before the query runs, you cut the render time down by quite a bit: now it takes less than a second for the query results to render, mostly because you've cut it down to about 230 time series.

One of the ways you can do this aggregation ahead of the query and get these improved results is through Prometheus recording rules. Recording rules allow you to pre-compute frequently needed or computationally expensive expressions and store the results back as a new set of time series in your time series database. These rules are evaluated and pre-computed on a regular interval, as a cron-job-like process.
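As a rough sketch, a recording rule for the aggregation described above might look something like the following. The metric and label names follow cAdvisor's conventions, while the group name, record name, and rate window are illustrative assumptions rather than the exact rule from the slides:

    groups:
      - name: cadvisor-cpu-aggregation
        interval: 1m
        rules:
          # Pre-compute the expensive aggregation once per minute and store
          # it back as a new, lower-cardinality time series.
          - record: namespace_container_name:container_cpu_usage_seconds_total:sum_rate
            expr: |
              sum by (container_name, namespace) (
                rate(container_cpu_usage_seconds_total[5m])
              )

Dashboards then query the recorded series instead of the raw metric, which is how the render time drops from roughly 20 seconds to under a second.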
That makes recording rules really useful for dashboards where you need the same expression over and over again: the dashboard doesn't have to re-evaluate it every time it refreshes, because the recording rule has already done the work. And because they're supported natively by Prometheus, you have all of your PromQL functions available to you.

On this slide we're looking at an example of a recording rule. In this code we have the CPU usage metric, which we're summing at a regular interval, every minute in this recording rule, and grouping by the two labels in the config, container name and namespace. Inside the diagram you have your single Prometheus node with its rule manager, which queries the querier at those regular intervals. At each evaluation it creates a brand new aggregated time series, which is then written to your time series database. One thing to note is that this all happens in memory, in a single process.

But let's say you wanted to do aggregation outside of your Prometheus instances, or use a remote storage solution for larger-scale use cases, and aggregate at almost any level of time series cardinality without increasing the load of heavy recording rules. You can do that using M3 and the M3 coordinator, which allow for streaming aggregation.

Before going into how M3 does streaming aggregation, a quick recap of what M3 is: it's a Prometheus remote storage solution created at Uber in 2016 to manage their metrics monitoring use cases internally, and it's now used by many other large companies, including Chronosphere.

With M3, the aggregation moves from your recording rules to streaming aggregation, and we do that using what we call rollup rules. Rollup rules solve the same problem as Prometheus recording rules; we just take a slightly different approach. We do the aggregation in the M3 coordinator, and in some use cases in an aggregator tier like the one in this diagram, and we use them to reconstitute metrics, counters, gauges, and histograms, as if they had been exposed as an aggregate from a single instance of an application rather than from several instances. We also provide another way to aggregate metrics, through mapping rules, which essentially do downsampling of metrics. Either way, once the aggregations are complete, the new metrics are sent over to your time series database or remote storage solution.

Zooming in and recapping those two ways of aggregating metrics, with these visuals, which I think are helpful: rollup rules roll up, or aggregate, across multiple time series, which reduces your query-time cardinality; mapping rules do downsampling by aggregating within a single time series, which makes longer-term or larger-scale queries more efficient because the samples cover bigger intervals. Now I'm going to quickly go through an example of a rollup rule and how you would implement it.
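Since the slide itself isn't visible in the transcript, here is a rough sketch of what a rollup rule like the one walked through next might look like in the M3 coordinator's downsampling configuration. It follows the documented shape of M3 coordinator rollup rules, but the rule name, metric names, and storage policy values here are assumptions:

    downsample:
      rules:
        rollupRules:
          - name: "cpu usage rolled up by container and namespace"
            # Only metrics matching this filter are aggregated by the rule.
            filter: "__name__:container_cpu_usage_seconds_total"
            transforms:
              # Step 1: take the increases (deltas) of the monotonic counter.
              - transform:
                  type: "Increase"
              # Step 2: sum those increases across series, keeping only the
              # labels listed in groupBy, and emit a new rolled-up metric.
              - rollup:
                  metricName: "container_cpu_usage_seconds_total:rolled_up"
                  groupBy: ["container_name", "namespace"]
                  aggregations: ["Sum"]
              # Step 3: re-accumulate into a monotonic cumulative counter.
              - transform:
                  type: "Add"
            # Resolution and retention for the rolled-up series.
            storagePolicies:
              - resolution: 1m
                retention: 120h

The three transforms correspond to the three steps described next.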
On this slide we have the configuration of what a rollup rule looks like. Essentially, a rollup rule is a series of transforms that are applied in order, and there are three main steps we'll go through; in the configuration you can see the rollup rule name and the filter that selects the metrics. The first step is to take the increases, the deltas, between each sample of our monotonic counters; in the bar chart we're measuring the little increments between each counter value. The second step is to sum up those increases, creating a new rollup metric grouped by the labels in the group-by line, in this example container name and namespace. Once we've summed those increases, we can do the final transform, which is to turn this metric back into a monotonic cumulative counter that is then sent over and persisted to your time series database or remote storage. Once that's done, the storage policies set in this configuration dictate the resolution and retention periods.

So that's a quick overview of what that looks like, and now Rob is going to go through how these two approaches, streaming aggregation with M3 and regular recording rules, differ from each other. Cheers.

Thanks, Gibbs. I'm going to quickly step through a few of the differences at a high level and then give a quick demo, if the demo gods are kind today, which they may not be, but we'll see.

What's first worth talking about is a similar example to the one I used in the short keynote earlier. When we look at specific use cases that can be very high cardinality, histograms tend to be an obvious example, because the number of buckets you define, which sets the level of granularity and precision of your histogram measurements, can multiply your cardinality quite extensively. In this example we're looking at a single microservice with 200 pods and 20 endpoints, HTTP or gRPC, measuring latency broken out by status code, at a level of precision defined by a spread of 30 histogram buckets. So this is a single metric across a few endpoints that you might want to graph; you might want the p99 of all of those endpoints on one panel in Grafana, and that single panel is going to be operating on 600,000 time series. And it would be super nice if you could also slice and dice those metrics by the Git server version, or the mobile client front-end version; maybe you're running experiments and you want different metrics for the subpopulations in your experiment groups for A/B testing, and all of that only multiplies this already very high cardinality number.

This slide gives a run-through of what a recording rule execution looks like, and in a distributed mode it gets even more complicated, because the hops are over the network. When you're pulling data back from storage, if you actually want to reconstitute an aggregation across multiple Prometheus instances, whether you're using, say, Thanos with the Thanos sidecar, or Cortex, it will be from all the
Cortex storage nodes: you have to pull all of that back over the network and then write it back again, very frequently. To walk through concretely what that might look like: if we have four Prometheus instances and, again for this single metric for one service, you're trying to calculate the histogram, those four storage nodes each return 150,000 time series, all over the network, to a single Thanos query instance. That's all operating in memory, which obviously has memory constraints, and if it can't execute fast enough, the data you get back is irregular compared to when you collected it.

So M3 aggregation can supplement recording rules. As Gibbs already said, it reconstitutes the metric; it doesn't allow arbitrary PromQL. But the majority of the problem here isn't in any other part of the expression, it's in a few distinct high cardinality labels that you can aggregate away, reconstituting a new counter as if it had been aggregated in one place, while avoiding that expensive step that runs over the network. It also avoids the reverse index query pressure and frees up query bandwidth on Prometheus, or your remote storage in general, so you can separate and scale aggregation independently of the TSDB and not be bottlenecked there, or slow down dashboards or other alerts. And of course you can then pack in way more alerts and scale your dashboard traffic without having to pay the cost of recording rules at that layer. You can also use templated aggregation rules, which can be really powerful for deploying this use case across many different metric names without having to create a one-off rule for every single panel you're trying to speed up: I just want anything that matches a specific filter to exclude a certain high cardinality label that I know exists on all of my metrics, like the pod or instance label.

So let's have a look at what this looks like in reality. Bear with me, because I'm going to remove the screen mirroring here and get my terminal up on the back of my laptop, which is also low on battery. Let's have a look at the demo I'm about to run through. Basically we have a Docker Compose stack here, and you can check this out from the M3 repo; you can also get there via the docs. You go to the M3 docs, then to integrations, and instead of going to the default guide on Prometheus remote storage, aggregation, and query with M3, you go to the one on aggregation for Prometheus, Thanos, or other remote write storage. That page highlights the Docker Compose stack I'm about to run through, with a deep link to the development stack I'm going to run locally.

What's interesting in the development stack, if we look at the Docker Compose file, is that we have a coordinator which is receiving traffic from a Prometheus scraper. This Prometheus instance acts purely as a scraper and sends all of its data to the M3 coordinator through remote write. Yeah, that's a good call, thank you, I'll do that on the terminal as well. So this Prometheus scraper scrapes a whole bunch of cAdvisor metrics in this demo and sends them to the M3 coordinator via Prometheus remote write. You can use the M3 coordinator by itself to aggregate in memory, with a single replica of it running as a sidecar, and it can aggregate across those metrics and give you rolled-up metrics. It then sends data on to two different Prometheus instances.
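For reference, the only Prometheus-side change needed for the scraper in a stack like this is a remote write section pointing at the coordinator. A minimal sketch, assuming the coordinator is reachable as m3coordinator on its default port 7201 (the hostname is an assumption about this particular compose stack):

    remote_write:
      # Forward everything this Prometheus scrapes to the M3 coordinator,
      # which applies its mapping/rollup rules before writing downstream.
      - url: "http://m3coordinator:7201/api/v1/prom/remote/write"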
One is called raw, and it just stores back the samples that the scraper was sending in. The other is aggregated, and it receives both the time series that were aggregated using the rules you defined and series downsampled to, say, one-minute or five-minute resolution. That means you can get different levels of resolution, similar to how Thanos does batch downsampling, except this is all happening in real time, so a data point is available as soon as it's ready, say every five minutes.

Here's what the config looks like. This is the rollup rule, and in this specific example I'm not using the template, which is a little more complex; I'm specifically saying I want to match exactly this metric name, take the increase of it, and group by these dimensions, which are the container, the namespace, and the CPU index if you want to look at stats on a per-CPU basis but not across all the instances, and then we're building the monotonic counter again here. The storage policies tell us which of the two Prometheus instances to store it in, and those are defined up here: we have the raw Prometheus taking any unaggregated metrics, and the aggregated one receiving anything that's been defined for one-hour retention at one-minute resolution by the rules we have down here.

I've been running this for the length of the talk, which is probably why I don't have a whole lot of battery left. If we log in here, we can see the different metrics. These are the unaggregated metrics; this is the CPU usage seconds total metric, with results at two-minute intervals, and we're summing by container and namespace. This is the aggregated one, which is the exact same shape of data, but it's using the rolled-up metric name, and the only labels on this metric are the ones we defined to keep. That's why, on the right here, we're doing a count of the underlying metric name, to view the cardinality of each of these: you can see that the raw metric is in the thousands, whereas the aggregated one is just at the granularity you need to view your data in your graphs. Obviously over large periods of time, with lots of pods rather than the small demo environment I'm monitoring, this really adds up, because it's a factor of a thousand divided by twenty-three, so about forty-three times larger in terms of cardinality.

And that's pretty much what I wanted to step you through. I think we have a brief amount of time for Q&A, maybe a few minutes. It would also be great to catch up after the talk at the event at 5 p.m.
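For readers following along without the demo screen, the comparison described above is roughly along these lines; a sketch, where the rolled-up metric name is whatever the rollup rule's metricName emits (assumed here to match the earlier configuration sketch):

    # Query-time aggregation over the raw, high-cardinality series (raw Prometheus):
    sum by (container_name, namespace) (rate(container_cpu_usage_seconds_total[5m]))

    # The pre-aggregated series in the aggregated Prometheus, same shape of data:
    sum by (container_name, namespace) (rate(container_cpu_usage_seconds_total:rolled_up[5m]))

    # Cardinality check on each side:
    count(container_cpu_usage_seconds_total)            # raw: thousands of series
    count(container_cpu_usage_seconds_total:rolled_up)  # aggregated: tens of series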
We also have a booth on Wednesday if you want to come chat about open source or anything else. Thank you. Does anybody have any questions? All right, I'm going to bring this over, with some wipes, and we can hand it off.

I have a question about where this fits in with the receiver setup in Thanos. Would this be in front of the receiver, and then it would just proxy those requests back to the receiver behind it?

Yeah, that's a good question. Basically, if you're using a Thanos receiver already, instead of, say, a Thanos sidecar, then you can point the M3 coordinator to write its results to that Thanos receiver. So instead of writing directly to the Thanos receiver via remote write, you write to the coordinator via remote write, and then have the coordinator point to the Thanos receiver. The optional part of this, if you don't want to use just a single process for the M3 coordinator, is having a cluster of M3 aggregators which the coordinator can use for distributed, stateful aggregation.

Awesome, thank you.

Thanks for your talk. What are some best practices you recommend to avoid false positives on your alerts?

I heard "to avoid false positives" and then I lost the last word. In Prometheus, how can you avoid false positives and alerts firing when they shouldn't be? Interesting. Do you have a specific example, like latency going too high? What kind of false positives are you thinking of?

Yeah, like CPU spikes, HTTP requests, things like that. Would you use standard deviation?

That's a really interesting question, and just to repeat it: latency spikes, latency deviation. I think everyone struggles with this one, and everyone has their own strategies. Obviously, using a sustained high period for latency spikes can help. Standard deviation does tend to work well, but it requires a lot of care, like looking at week-over-week data using an offset of seven days and comparing that to the current data; it requires a lot of human curation. There are probably better sources than me that have written deeply about it, but the industry standard with Prometheus really tends to be standard deviation, with a lot of care and curation.

Also, did you change the Prometheus YAML file as well, or are you using your own custom one?

The main thing we altered Prometheus to do is just send remote write to the M3 coordinator, so it's adding a remote write endpoint to the Prometheus config itself; everything that it's scraping, it's sending to the M3 coordinator.

Thank you. And thank you for your talk. One of the challenges with Prometheus is the high cardinality of the metrics that you have available, and something that can help with that challenge is the recording rules which you talked about. Going a step further, M3 appears to help with the rollup and mapping rules, again to help with the high cardinality of the recording rules that you might be creating on a label-by-label or aggregation-by-aggregation basis. What are some best practices you might recommend for a new user of M3 around creating, or abusing, the sheer number and cardinality of these recording rules and mapping rules that you could be creating with M3?

Just so I understand the question: it's basically how to think about the vast number of specific aggregations that you configure? Yes, for example, you might be looking at multiple types of aggregations on
top of a single metric, such as the sum or the p99, or things like that. Why not all of them? How does that shift the problem from the high cardinality of the labels that exist to the high cardinality of the rules that you might be aggregating with?

Yeah, that's where the metric template, sorry, the aggregation template, really shines: you define that anything matching, say, _total, if you assume that to be a counter, can have a rollup of increases and deltas applied, excluding the labels you choose, then build a monotonic counter on top of those, and use the metric name as a variable for the output metric name. That way it dynamically builds, essentially, a recording rule on the fly. Again, it's not as powerful as recording rules, because it's reconstituting the metric rather than giving you arbitrary PromQL, but the templating aspect is very powerful.

Wonderful, thank you.

All right, then I guess one last question: can you share a little bit about the circumstances that caused you to design and implement this aggregator?

Oh yeah, great, thanks everyone. I think you've touched on it a little bit. It's been part of M3 since M3 was built as remote storage at Uber, and the main reason it was done was that we had all these high cardinality use cases, and it was very difficult for all the developers to go and curate, and specifically protect against cardinality explosion, in every single one of the different things they were monitoring. So the aggregator was built to give the power back to the end users, so they don't have to go and re-instrument their applications; they can just derive different pivots of the same very high cardinality data they're pumping in. At some point you can just throw money at the problem to collect the metrics, and then the dashboard spinners or the timeouts on your rule evaluations start to hit, so this was a way to get to the actual result you want without changing your code.

Very cool, thank you very much. Thanks for the talk. Thanks everyone.