So yeah, welcome. This is going to be Trevor and I talking about how we built and scaled the Thanos infrastructure at Reddit for the thousand-ish engineers we have. Hi, I'm Ben, I'm a backend engineer in our infrastructure department. Like it was said, I've been a long-time Prometheus maintainer. I've been working on Prometheus since it started at SoundCloud, mostly focused on the exporter side of things, but I also do the SRE side of Prometheus, scaling it with Thanos and other things and providing it as a platform for internal developers. Hi, my name's Trevor, I'm here from Minneapolis, Minnesota, also experiencing a little bit of jet lag. I've been at Reddit for about four years, probably four years in April, primarily in the infra SRE space, and then about two years ago, Ben and I spun up the observability team. We're fighting over a number of clickers. Yeah, I've got the clicker. Okay.

Also, shout out to our team. Without the rest of our infra team, we wouldn't be able to do this. Like I said, Ben and I started this; we're up to about seven people now. There are a lot of engineers at Reddit and a lot of services, so seven people feels kind of small sometimes, but we've really accomplished a lot. Our other infra teams and SRE teams also assisted us in rolling out Thanos over the last few years.

So, at Reddit, we talk a lot about "Reddit shape" and how we try to make things in kind of a cookie-cutter way, because we have a lot of engineering teams and a lot of internal services. We want to make things Reddit shaped so that it's easier for teams to onboard onto different infrastructure. In our infrastructure, we have about 25 Kubernetes clusters. They vary in size from a few thousand cores to tens of thousands of cores, and across the overall infrastructure we have hundreds of thousands of Kubernetes pods constantly being created and destroyed. Prometheus, with its dynamic service discovery, makes a nice monitoring platform for all those dynamic workloads. All the different services are managed by different teams. Like I said, we have over a thousand engineers building different services, so we don't have one central application team, and we need to make this observability platform as self-service as possible.

Before Reddit started with Prometheus and Thanos, we used a third-party software-as-a-service, and we had a lot of StatsD metrics. This was classic for the early days of Reddit: Reddit was a Python service, and apparently at the time we used Graphite, then moved to a third-party SaaS when managing Graphite became too difficult. But there were bottlenecks, and the typical problems with StatsD started to show: too many packets, dropped samples, sampling upon sampling to try and get the data down to a point where it could actually be shipped off to the SaaS. One of the things we built was that each service would have a Telegraf instance acting as an aggregating proxy, which eliminated most of the normal cardinality by dropping the typical labels you see in Prometheus, pod and instance. Those would all get thrown away and you'd only have top-level metrics for a service. This all started to show its problems, because we needed the data to find out: are we having problems with one service across the whole service, or is it one pod or one node that is causing the problem? We couldn't get that diagnostic information.
Also, while moving from Graphite into the SaaS, we were also moving into Kubernetes. Previously, like a normal setup, we had Puppet; Puppet was used to deploy all the services. There weren't as many services, there weren't as many engineers, it was a simpler infrastructure. In order to improve observability, we needed an order-of-magnitude leap in capacity, or more than that, because there was this large need for more data and more metrics, but we just couldn't afford to send it all to a SaaS. It was prohibitively expensive. At the time, we had some experience with Prometheus because we were in a Kubernetes world, so there was already some Prometheus in the infrastructure, and we did some design work and came up with the idea: hey, how about we just use Prometheus for all the monitoring, not just the Kubernetes infrastructure. So Trevor and I designed and built out the initial prototype of a monitoring SaaS for the entire organization. We followed kind of the same Telegraf idea and gave each service its own Prometheus. This allowed us to really scale out Prometheus as an underlying infrastructure for a large distributed set of services.

One of the bigger challenges in getting this service out to all these teams was: how do we get everybody to switch from a StatsD data model over to a Prometheus data model? The good news is, we have a Reddit shape for that. We have a series of Baseplate libraries, and these are basically the underlying microservice libraries that all the services are built on top of. So we could instrument all of the Baseplate code with Prometheus and a pod monitor, and then all teams had to do was upgrade their Baseplate and they would instantly get Prometheus monitoring on their service. We kind of used Prometheus and our Baseplate as an APM kind of plug-in. In some cases, there are services that didn't use Baseplate or didn't want to upgrade because their code base was in maintenance mode, abandoned mode, legacy code. For those, we used the Kubernetes sidecar model with the statsd_exporter. So, instead of sending the StatsD metrics to a Telegraf, we would send them to localhost, and then we could use that to convert the data into a Prometheus metrics endpoint. That also worked really well. We also needed to pull in data from our cloud providers, things like AWS CloudWatch. Well, how are you gonna monitor that? The old SaaS service would do that for us, so we had to implement things like the CloudWatch exporter. And now, today, we ingest a boatload of data. Maybe not as much as Cloudflare, but we're getting there. So, I'm gonna hand it over to Trevor here to talk about how we scaled things up.

Yeah, thanks, Ben. As Bartek alluded to, this talk is gonna touch a little bit on Thanos, a little bit on Prometheus, and how we have automated our Prometheus deployments. Ultimately, Thanos is the glue that holds all of this together. As Ben was saying, the challenge we were trying to solve was our lack of detailed metrics with StatsD. We aggregated out a lot of the useful bits and sent it up to our third-party provider. We had some experience running Prometheus in the past. We used it for cAdvisor metrics, container metrics, kube-state-metrics, things like that; that's how we got those into the SaaS. We had problems with these Prometheuses.
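Going back to the Baseplate idea for a second: Baseplate is Reddit's internal framework, and its actual metric names and labels aren't shown in the talk, so the following is only a minimal Go sketch of the "instrument the shared library once, every service gets metrics for free" pattern using the Prometheus client library. The metric name, label set, and handler wrapper below are hypothetical illustrations, not Reddit's code.

```go
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestDuration is registered once by the shared service library, so every
// service that upgrades the library exposes the same metric automatically.
var requestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_server_request_duration_seconds", // hypothetical name
		Help:    "HTTP request latency by endpoint and status code.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"endpoint", "code"},
)

// instrument wraps any handler a service registers through the framework.
func instrument(endpoint string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next(rec, r)
		requestDuration.
			WithLabelValues(endpoint, strconv.Itoa(rec.code)).
			Observe(time.Since(start).Seconds())
	}
}

// statusRecorder captures the response code for the metric labels.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.code = code
	s.ResponseWriter.WriteHeader(code)
}

func main() {
	http.HandleFunc("/hello", instrument("hello", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	}))
	// The framework also exposes /metrics, which the per-namespace Prometheus scrapes.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}
```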
What those problems looked like: we would run one Prometheus per cluster, no HA; the cluster would get larger, with a large scale of pods, and the Prometheus would crash. We'd be stuck waiting on the WAL to load up tons and tons of data, and we'd be down for an hour. There was a lot of pain there, and as part of this project we wanted to solve that pain as well. We've scaled up; we've got over 2,000 namespaces. Essentially, we ended up running an operator to solve this. We chose namespace as how we wanted to distribute, or how we wanted to shard out, our Prometheus deployments. So each team or service, each namespace, gets a Prometheus, or an HA pair of them. We wanted to automate this process so that we don't have to go set this up for everybody, so our solution was an operator. Essentially, we just watch for new namespaces, and when we see a new namespace, we set up a Prometheus pair. These Prometheuses have sidecars and ship data off to Thanos, we gain long-term retention in S3 through Thanos, and we're able to utilize Thanos Query to query across all of these various stores, whether it's a Thanos Store pod, a Prometheus sidecar, or a ruler.

So for the next part, we'll take a look at this operator. On top of just creating Prometheuses per namespace, the operator will pre-populate common rules for our internal service framework, Baseplate. This enables teams to move quickly: they get shared dashboards, they all utilize the same kind of rule naming nomenclature, and so we can get teams up and running fast. For scaling our Prometheuses, we use the vertical pod autoscaler. We utilize this to dynamically scale the Prometheuses up and down, well, generally up. This has actually proven to be quite a challenge. Scaling a stateful thing is hard. A lot of the workload spikes we see come from a team introducing a new metric, which introduces tons of cardinality, sometimes on purpose, sometimes by accident, and what this results in is Prometheus running out of memory. Sometimes our vertical pod autoscaler can see the gradual rise in resources and scale up before we hit this point, but many times we hit an out-of-memory error and we do have some metric downtime. We're looking at other solutions beyond the VPA; essentially, we would love to track other metrics than just CPU and memory, potentially samples per second, to try and catch these things before they cause Prometheus to crash. Other things we've put in for that are limits on each job, so scrape sample limits for Prometheus. At some point, instead of crashing, it'll just start dropping samples. Finally, we do allow configuration per namespace. Right now this is just annotations; at some point we'd like to make this our own custom resource. This allows us to limit things like the maximum CPU we'll scale up to, max memory, sample limit, et cetera. Yeah, as well as the min. We're able to set the minimum value, which is really useful when spinning up a new namespace, if we have to take into account a large service coming online. We can set the minimum so we don't have to trial-and-error our way up to the correct resources.

Much of our initial tuning was spent on Thanos Store. We actually haven't tried the time-based sharding that Tom mentioned; we kind of jumped into the hash-based sharding due to its simplicity, and actually, when I get back, I'm excited to try out the time-based sharding based on Tom's experience.
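To sketch the namespace-watching side of that operator: the rough shape is a controller that watches for new namespaces and stands up a Prometheus HA pair in each one. Here is a minimal, hedged Go sketch using a client-go informer; the createPrometheusPair helper is hypothetical, and Reddit's real operator also wires up Thanos sidecars, pre-populated rules, and the annotation-based overrides, which this sketch leaves out.

```go
package main

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// createPrometheusPair is a hypothetical helper: a real operator would create
// the workloads/config (or a Prometheus custom resource) for an HA pair of
// Prometheuses plus Thanos sidecars in the given namespace.
func createPrometheusPair(ctx context.Context, client kubernetes.Interface, namespace string) error {
	log.Printf("would create Prometheus HA pair in namespace %q", namespace)
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch namespaces; every new namespace gets its own Prometheus pair.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	nsInformer := factory.Core().V1().Namespaces().Informer()
	nsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ns := obj.(*corev1.Namespace)
			if err := createPrometheusPair(context.Background(), client, ns.Name); err != nil {
				log.Printf("failed to set up Prometheus for %s: %v", ns.Name, err)
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run forever
}
```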
Tom talked about vibe-based sizing; I feel like t-shirt sizing is essentially vibe-based sizing. You pick a few different sizes, and when it gets slow you make it a large. That's really where we're at right now with our Thanos Stores. We have our large clusters, we throw a ton of resources at them, and when our queries are feeling painfully slow, we bump it up a size. We'd love to continue iterating on that process and getting better metrics and better stats; we'll see that further in a moment. One of the metrics we track is basically just the number of probe queries that take longer than one second. We wrote some very simple range queries and instant queries, and we utilize these to just see how quick our response times are. It can be challenging to do that using user queries because we don't know what they're going to ask for. Generally the difference between each size is just the number of shards. Yeah, and we're using the hash-based sharding. Another piece we also sharded is kube-state-metrics. This is something we didn't really run into too many issues with, but we definitely had to shard it out in our larger clusters. So that was some of the scaling work we did up front.

I think I touched on a few of the challenges there; now I'm going to talk about some more of the challenges we've run into running both Prometheus and Thanos at this scale. The graph on the right is that query probe I mentioned, so that's the percentage of queries that took longer than one second. The time frame here is from February to June, and the point where it jumps up is right around April. In April we made the decision to disable partial response. We were getting some unexpected results from some of our rules, especially things like aggregations dividing by less data, because some piece of the query would fail. So we made this decision, and I think maybe about a week into it we had leadership asking, hey, can we turn that back on? Because it felt really bad having so many queries fail or take so long. And what we realized is we actually had a lot of query problems; we had a lot of queries that failed. This touches on the need for a label proxy, or a label enforcer, because every time a user would run a query, it would fan out to every single Prometheus in every single cluster, including all the stores. And if one Prometheus was having a bad time, the query would fail, not just be slow; with partial response off, it would now fail outright. These failures weren't always pretty either. Grafana renders them as something like every single Prometheus's external labels in a giant list, which is really confusing for our users. And so, adding the external label enforcer, essentially a proxy between Thanos Query Frontend and Thanos Query, enabled us to give nice error messages when a label is missing, and to enforce that a few select key external labels, ones that at least scope a query to a cluster or a specific Prometheus, are present. With the release of that, we saw our query times and query failures drop substantially.
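The label enforcer itself isn't open source yet (it comes up again later as something they'd like to upstream), so here is only a hedged Go sketch of the general idea: a small proxy in front of Thanos Query that parses the PromQL with the upstream parser and rejects queries that don't pin down at least one of a few key external labels, returning a friendly error instead of a giant fan-out failure. The required label names, error text, and upstream address are made up for illustration, and only the instant-query endpoint is shown.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"

	"github.com/prometheus/prometheus/promql/parser"
)

// requiredLabels are the external labels we insist on; hypothetical names.
var requiredLabels = []string{"cluster", "prometheus"}

// hasRequiredMatcher walks the query AST and checks whether any vector
// selector carries a matcher on one of the required external labels.
func hasRequiredMatcher(query string) (bool, error) {
	expr, err := parser.ParseExpr(query)
	if err != nil {
		return false, err
	}
	found := false
	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		if vs, ok := node.(*parser.VectorSelector); ok {
			for _, m := range vs.LabelMatchers {
				for _, req := range requiredLabels {
					if m.Name == req {
						found = true
					}
				}
			}
		}
		return nil
	})
	return found, nil
}

func main() {
	upstream, _ := url.Parse("http://thanos-query:9090") // hypothetical address
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	http.HandleFunc("/api/v1/query", func(w http.ResponseWriter, r *http.Request) {
		ok, err := hasRequiredMatcher(r.FormValue("query"))
		if err != nil {
			http.Error(w, fmt.Sprintf("invalid PromQL: %v", err), http.StatusBadRequest)
			return
		}
		if !ok {
			http.Error(w, "query must select on one of: cluster, prometheus", http.StatusBadRequest)
			return
		}
		proxy.ServeHTTP(w, r) // query is scoped, forward to Thanos Query
	})
	log.Fatal(http.ListenAndServe(":10902", nil))
}
```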
Another major challenge we have right now is rule management, and some of this is based on the tiered architecture we have. We encourage teams to run rules locally on their namespace's Prometheus if they fit into that model: if the rule runs over less than 24 hours' worth of data and only needs to query data within their Prometheus, we encourage them to run it there. As soon as they need data outside of that 24-hour window, they need to talk to Thanos Store. So we have cluster-level rulers that run within each cluster, and we've since added a global level if you want to aggregate across clusters. Part of the challenge with this is teaching our users where to run their rules, and keeping track of where rules are deployed and where they came from. This is still a problem we have today, especially when going through and trying to clean up rules; we have some origin issues. So what we'd like to work on coming up is some kind of centralized API for this, where our users can tell us a rule that they want to run and we will figure out where it needs to live, and it'll also give us more control over tracking where rules came from.

Finally, to help with alerting rules, we've created an alert routing service. This service essentially looks at things like namespace or other labels and figures out which team's pager to route to. This is really useful for things like pods hitting their CPU limits or pods restarting. These are just kind of generic infrastructure alerts, and we don't want to page our infrastructure team, we want to page the team that's responsible for the service. Part of that comes from the fact that even though we monitor each team's namespace, things like container metrics and kube-state metrics still need to be centralized in a single Prometheus per cluster, because in order to get the combination of labels to monitor both the container and kube-state and kube-pod labels, and be able to do those joins, all that data has to be in one Prometheus anyway, and that's not really shardable per team. So instead of each team writing their own container alerts, we have to centralize and standardize the container alerts, and we needed another way to key the routing, not just the external labels of the Prometheus that the data was coming from.

And finally, cardinality is an issue. I hinted at this earlier: teams will accidentally introduce a label that is keyed on user data or is a random string, and we'll see cardinality go through the roof, we'll see Prometheuses crash, we'll see queries get slow. This is just an ongoing challenge that we have to deal with. The other case is essentially histograms; histograms just generate lots of cardinality, and we have teams, services like GraphQL, that want to monitor all their operations over a ton of different buckets, and it just doesn't always scale well, so we have to work with them to figure out the cardinality of their services. These cardinality bombs generally result in Prometheus being knocked out, and not just one of the replica pair but both, which basically makes the last two hours of data not queryable until they're back up, as well as missing all new data. As I mentioned earlier, we've added sample limits, which has actually helped reduce the number of crashes.

The last challenge I want to talk about is the VPA. I touched on it earlier: scaling with the VPA is hard, and memory will balloon really fast with Prometheus. One of the huge features that Prometheus released was WAL snapshots: when the VPA triggers a restart of the pod, that pod comes up significantly faster because it doesn't actually have to read through the entire WAL. That unblocked a lot of the issues we'd been seeing with VPA scaling.
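Going back to the alert routing service for a moment: as mentioned later in the Q&A, the real tool generates Alertmanager routing trees from Reddit's internal service database, which isn't portable. The following Go sketch is therefore only a guess at the shape of the idea: take a namespace-to-owner mapping and emit an Alertmanager route tree so generic infrastructure alerts page the owning team rather than the infra team. The receiver names and the mapping source are hypothetical.

```go
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v3"
)

// Route mirrors the subset of Alertmanager's route configuration we need.
type Route struct {
	Receiver string            `yaml:"receiver"`
	Match    map[string]string `yaml:"match,omitempty"`
	Routes   []Route           `yaml:"routes,omitempty"`
}

func main() {
	// Hypothetical ownership data; in reality this would come from a service catalog.
	owners := map[string]string{
		"listings": "team-listings-pager",
		"graphql":  "team-graphql-pager",
	}

	// One child route per namespace, so alerts like container restarts or CPU
	// throttling page the owning team; the infra team's receiver is the fallback.
	root := Route{Receiver: "infra-pager"}
	for ns, receiver := range owners {
		root.Routes = append(root.Routes, Route{
			Receiver: receiver,
			Match:    map[string]string{"namespace": ns},
		})
	}

	out, err := yaml.Marshal(map[string]Route{"route": root})
	if err != nil {
		log.Fatal(err)
	}
	// The real service writes something like this into a ConfigMap for Alertmanager.
	fmt.Print(string(out))
}
```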
I'll send it back to Ben.

One of the nice things that we have at Reddit is a good open source culture, so we do work with our upstreams: we contribute to Prometheus, we try to contribute to Thanos, and we're trying to do more of that. I've been a Prometheus maintainer for a while, and I'm trying to help onboard some of the internal engineers on our observability team to be more upstream-friendly and do more upstream contributions. A bunch of the functionality we talked about, like the label enforcer, is on our rough plans to open source, and maybe some of the other internal tooling that we've built too. I was just chatting with somebody on the team this morning: we found a case where we wanted to update, all the way back through history in our S3 bucket, all of the external labels that are stored in the Thanos block metadata. There's no Thanos tool for this, so we had to write our own little internal tool, and I think that's the perfect kind of thing that we should contribute upstream. I'm trying to get more of that upstream work done, and I encourage anybody here to work upstream as much as you can, because it helps everybody.

More stuff that we're going to be doing: of course, there are still many, many, many more things that we want to fix. Claiming any of this is perfect, magic, and scalable would be kind of disingenuous; it takes a lot of work to run this stuff at scale. One of the things, like we said, is that we want to work on the recording rules and alerting and try to make that better. The nice thing is we started entirely with infrastructure as code, so we just need to work on organizing that code, whether it's coming from a service repo, or a monorepo, or automatically deployed from our instrumentation in centrally managed clusters. We also have been working on tracing for our internal use, but that's going to be a whole separate talk that Trevor is probably going to give.

One of the other fun things that we want to do: we have the Baseplate metrics, and those provide a good set of HTTP, gRPC, database, and client library instrumentation, but teams can create their own metrics, and sometimes they take what should be labels and put them into the metric name, and you end up with a metric name explosion, and that slows down Grafana's metadata JSON hugely. At some points we've had to go and notice that the metadata JSON that Grafana pulls down to do metric name auto-completion is 100 megabytes of JSON, and that just completely destroys every engineer's browser with too much data. So we need to maybe do some metric name enforcement or some other kind of automated auditing of metric names in each namespace. But at least the nice thing is, because we've sharded by namespace, when a team does blow up their Prometheus, it only blows up their Prometheus and not everybody else's, so it only takes out one team's service and doesn't interrupt any other team. They can only hurt themselves.
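For context on that external-label rewrite tool: every Thanos block carries a meta.json with the producing Prometheus's external labels under the thanos.labels key. Below is a minimal, hedged Go sketch of the rewrite idea; it operates on meta.json files already pulled down locally, decodes into a generic map to avoid depending on Thanos internals, and the label name and value are made up. The real tool would also have to handle safely uploading the modified metadata back to S3, which is omitted here.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
)

// rewriteExternalLabels loads a Thanos block meta.json, sets one external
// label under thanos.labels, and writes the file back in place.
func rewriteExternalLabels(path, labelName, newValue string) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return err
	}

	// Decode into a generic map so every other field round-trips untouched.
	var meta map[string]any
	if err := json.Unmarshal(raw, &meta); err != nil {
		return err
	}

	thanos, ok := meta["thanos"].(map[string]any)
	if !ok {
		log.Printf("%s: no thanos section, skipping", path)
		return nil
	}
	labels, ok := thanos["labels"].(map[string]any)
	if !ok {
		labels = map[string]any{}
		thanos["labels"] = labels
	}
	labels[labelName] = newValue

	out, err := json.MarshalIndent(meta, "", "\t")
	if err != nil {
		return err
	}
	return os.WriteFile(path, out, 0o644)
}

func main() {
	// Hypothetical example: fix the "cluster" external label on one block.
	if err := rewriteExternalLabels("path/to/block/meta.json", "cluster", "prod-us-east-1"); err != nil {
		log.Fatal(err)
	}
}
```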
We've also been testing out the Go memory limit (GOMEMLIMIT) feature. We just started doing this to try and soften the out-of-memory conditions. We're doing it with Prometheus right now, but we were just talking about how we need to do this in Thanos too, in the stores, so that the stores will just get up close to their memory limit and then start GCing harder to try and avoid problems there. We also have two petabytes of data, but I think we can cut that in half by using vertical compaction, so that's something we've been wanting to test and experiment with. I don't know if anybody is using vertical compaction with penalty-based deduplication, but I'd love to talk to you about that. Questions? Thank you, everyone.

We have maybe time for one or two questions, and in the meantime the next speaker can prepare. Does anyone have a question? Amazing.

When you talk about limiting the number of metrics per namespace, what mechanism is that? Is that Prometheus's built-in limits for scrapes, or...? Yes, scrape sample limits. We set a scrape sample limit, and with our operator teams can also self-service override that with a namespace annotation.

Any other question from the audience? Amazing. Hi, thank you. You talked about the Alertmanager routing service which you wrote. Do you plan on upstreaming that as well? Because this sounds really interesting to me with many clusters. Yeah, that is something we could do. It's not exactly super portable. It's actually a whole separate tool that basically generates Alertmanager routing trees and puts them into a ConfigMap, if I remember correctly. It's very keyed around our internal service database, so it's not quite a portable project. Amazing. Let's give our applause again.