Thanks for the introduction. I want to keep this short and sweet. I'm here to talk about one of the topics I'm fairly passionate about, and we'll be talking about it a little more later in the day as well, at a talk that our developer advocate Gibbs from Chronosphere is giving. My name is Rob. I'm the creator of the M3DB open source project and the CTO of Chronosphere, which is a hosted SaaS observability and Prometheus-native monitoring platform.

So this is something I assume lots of us have faced: loading up a dashboard, or trying to view some high cardinality latency data over a growing set of pods or a distributed system that you're monitoring. Over time, as you add a whole bunch of cardinality, this can very quickly become an immediate problem that impacts your visibility. So hands up, who's ever run into this kind of situation unexpectedly? Wonderful, not the only one.

So the title of this talk definitely seems, perhaps, a little bit extreme: millions of time series. Why are we talking about millions of time series here? Most of the pods that we're monitoring are in the hundreds. But I wanted to paint a picture of how you actually get to millions of time series while monitoring these pods in a cloud native world, in a way that may be unexpected to the average developer when they start deploying applications and monitoring systems in a container world.

So I'm talking here about a single metric, and we'll find very quickly that just measuring HTTP or gRPC traffic can become quite high cardinality very fast. Say we're talking about 50 microservices, with an average of 200 pods per service. That seems large, but even if you pegged one of those numbers down, you'd still be in a very high cardinality world. Then say each of these services has, on average, 20 HTTP endpoints and gRPC methods, five common status codes, and 30 histogram buckets. Multiply that out (50 × 200 × 20 × 5 × 30) and you quickly get to 30 million unique time series, which is insane. And this is a single metric we're talking about here; you have lots of things you're tracking and lots of unique metric names in an organization.

Wouldn't it be nice, as well, to slice and dice by the server's Git version, or the mobile client app version, or the web front end version? These are all things that become really unwieldy when you get to these kinds of numbers of unique time series. Multiply by two twice for those dimensions and you get to 120 million time series, again for a single metric.

If we remove the pod cardinality, this becomes much more manageable. We get down to 150,000 unique time series across all 50 microservices, and then it becomes feasible to think about keeping this data around for a long time. Multiply by two twice again to add the active server versions and the mobile client versions, and you're still below 1 million unique time series.

So how do you get there? A lot of folks already recognize this problem and deploy best practices to make dashboards faster, alerts more manageable, and keeping this data around for longer periods of time more feasible, using recording rules. Recording rules are unarguably one of the most powerful tools in the Prometheus tool set. However, at these levels of cardinality, you run into frequent problems trying to deploy them against individual metrics that are very high cardinality.
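To ground that, here is a minimal sketch of such a recording rule, one that sums the pod dimension away so queries touch the roughly 150,000 recorded series instead of the raw 30 million. The metric name http_request_duration_seconds_bucket and the label names are illustrative assumptions, not from the talk.

```yaml
groups:
  - name: service_latency
    rules:
      # Pre-aggregate the raw latency histogram, dropping the per-pod
      # dimension. Metric and label names are illustrative assumptions.
      - record: service:http_request_duration_seconds_bucket:rate5m
        expr: sum without (pod) (rate(http_request_duration_seconds_bucket[5m]))
```

Dashboards and alerts would then query the recorded series rather than the raw per-pod histogram.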
So we're talking here about deploying a rule that handles the latency across all 50 services, just as an example up front. One thing you might run into when deploying this rule is that the Prometheus instance or remote storage solution you're using, such as Thanos, Cortex, or M3, is basically unable to even evaluate the result, because it's touching 30 million time series. You'll see Prometheus rule evaluation errors in the metrics you're monitoring for your monitoring platform, and when you go look in the logs, you might see something like "query processing would load too many samples into memory". So that's the first issue you might face.

Another issue: if you reconfigure your Prometheus or your remote storage instance to bypass those limits, you'll find that, very frequently, when we're talking about 30 million time series, you're going to start missing the evaluation cycle. It takes far too long to load all of that into memory, process it, and record all the results when you've got that level of cardinality for a single rule.

And then, of course, there are the workarounds you might deploy. I already mentioned one, which is raising the maximum samples per query allowed by Prometheus (the --query.max-samples flag) or by your remote storage. The problem is that you'll probably run out of memory; we're talking about 30 million time series. And especially if you want to do some kind of SLA calculation that pulls back historical data, not just the recent 30 minute window, that's a lot of data to load and process in the query engine by itself. And even if you're able to deploy enough resources and bypass the max samples limit through config, you're still going to find that you may miss the interval.

So you might go and split out the rules, one per service. But this again has similar problems. Divide 30 million by 50, and each rule still needs to process about 600,000 unique time series, which is a huge number. You're eventually going to hit the per-rule-group timeouts with that kind of deployment. And at the end of the day, you're using a lot of resources just to aggregate things in the same system that's storing your data.

So these are the steps that actually happen. You collect the metrics and write them to disk. Then, when the recording rule is evaluated, it does a reverse index query, and if it's touching millions of time series, or very high cardinality, that can be expensive memory-wise in and of itself. You read from that storage, you evaluate the results, then you write them back to storage. On a single node, you can actually get quite far, because there are efficiencies from everything happening in a single process. But once you go to a distributed model with Thanos, Cortex, M3, or anything else, this picture looks even more expensive. Not only are you doing that frequent, expensive reverse index query, reading from storage, evaluating, and writing back, but it's all going across the network. So every 15, 30, or 60 seconds, whatever your evaluation interval is, you're pulling huge amounts of data over the network from many nodes, coordinating it in one place where you need to reserve a lot of resources, and then writing it all back, serialized as RPC or HTTP calls.

M3 aggregation is a little bit different. It's not meant to be a replacement for recording rules, but it can turn a lot of distinct counters into fewer aggregate counters. It does that by, for example, taking the derivative as the first function of each time series, then aggregating those together, and then building another monotonic counter from the input counters. As data is written to either the M3 coordinator, which can be used as a sidecar next to Prometheus, or to a scaled-up M3 aggregator instance, it essentially performs those multi-step aggregations, and at each step it can do a hop within the cluster itself, so it can spread the load for really high cardinality workloads over multiple nodes. This all happens in memory, and eventually it writes the results to storage. And as of v1.3, we can do this with M3 against any Prometheus remote write receiver. So this is not just an M3DB feature; you can combine it with any other remote write back end.
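To make that concrete, here is a sketch of what such a rollup looks like in the M3 coordinator's downsample configuration, following the rollup rule format from the M3 docs; the metric name, labels, and storage policy values are illustrative assumptions, not from the talk.

```yaml
downsample:
  rules:
    rollupRules:
      # Matches the raw histogram counters and rolls them up without the
      # per-pod dimension. Names and values are illustrative assumptions.
      - name: "http latency without pod"
        filter: "__name__:http_request_duration_seconds_bucket"
        transforms:
          # Step 1: compute the per-series increase (the "derivative" step).
          - transform:
              type: "Increase"
          # Step 2: sum those increases across pods, keeping only the
          # labels you still want to slice and dice by.
          - rollup:
              metricName: "http_request_duration_seconds_by_service_bucket"
              groupBy: ["service", "endpoint", "status_code", "le"]
              aggregations: ["Sum"]
          # Step 3: re-accumulate the sums into a new monotonic counter.
          - transform:
              type: "Add"
        storagePolicies:
          - resolution: 30s
            retention: 720h
```

Conceptually, this is the increase, sum, re-accumulate pipeline just described, executed in memory across aggregator nodes before anything is written to storage.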
So it's really meant to be a supplement to recording rules. It saves the frequent, expensive steps: the reverse index query, the read from storage, and the write back. You can scale aggregation separately from your TSDB, so your TSDB can focus on executing alerts and dashboards. And you can aggregate millions of series, because you can spread the work over a large number of distinct resources that each handle a sub-part of the space. You can also use templated aggregation rules that apply to many distinct metric names, so you don't need to write a rule every single time you want to do an aggregation.

And this was just a single metric. If you remove the pod cardinality and look at just one service, there are only 3,000 unique time series left, which is very doable to query on the fly in dashboards or alerts. If you want to learn more, our session is at 3:25 p.m. today. Thank you so much for having me here. It's been great to be back in person. Thank you. Wonderful, thank you. And I know we have a couple of questions, but those will probably have to wait for the session later.