My name's Danny Clark. I'm a software engineer at Google working on cloud monitoring, specifically helping to build out Google Cloud Managed Service for Prometheus, Google Cloud's managed Prometheus offering. In general, I'm an avid geek for observability tools, distributed systems, and all things cloud native. You can reach me on Twitter and GitHub at PintoHutch.

Here's an outline of what we'll be going over today. First, I'll give some background on conventional ways to scale Prometheus today. Then I'll go into the trade-offs we were weighing when analyzing those scenarios while setting out to build a managed service. I'll then go into the finer details of our operator-based approach and custom resources, and conclude with some future direction for the project. Some prerequisites for the audience: you've run and configured Prometheus before, even if only a little bit, and you have some familiarity with Kubernetes Custom Resource Definitions, or CRDs, and the operator pattern.

So what is Prometheus? Specifically, the Prometheus server. It's a metrics-based monitoring and alerting tool. What it's really good at, what it was built to do, is to be deployed as a single process, running as a single server alongside your workloads, specifically tailored to monitor live numeric metric data through polling, or scraping. And by live, I mean persistence on the order of weeks, as opposed to months or years of data. Larger instances can be scaled pretty well via fine tuning. But what Prometheus is not, at least out of the box, is scalable long-term storage. I like to think of Prometheus more like a traditional Postgres database running on a single node in your cluster, as opposed to something like Cassandra running on a fleet of nodes talking over the network. At this point, it's really become ubiquitous as metrics infrastructure in Kubernetes. In fact, Kubernetes components themselves use the Prometheus metrics format for their monitoring.

So how do we deploy Prometheus on Kubernetes today? We use Prometheus operator, the de facto standard. It's battle-tested, foundational, and was one of the first Kubernetes operators. Most people tend to deploy Prometheus operator using the kube-prometheus stack, which gives you soup-to-nuts monitoring infrastructure in your cluster: Prometheus operator, HA Prometheus, HA Alertmanager, some Grafana dashboarding, common metrics exporters like kube-state-metrics and node exporter, et cetera, as well as some recommended Prometheus recording and alerting rules.

So let's go over what this looks like from the user's perspective today. I've installed Prometheus operator on my Kubernetes cluster. I then have my Prometheus custom resource, in this case Prometheus A. I've given it a replica count of one, to have a single server, and I've used a service monitor selector to match workloads 1, 2, 3, and 4 to configure scraping. I apply this to my cluster. The operator sees that and reconciles it into a live Prometheus server running in the cluster, configured to scrape those workloads. Pretty simple.

So this works pretty well for small to mid-sized clusters. But when we talk about scaling Prometheus, we typically see that RAM becomes the bottleneck pretty quickly. Some official benchmark numbers from the website: ingesting 4 million time series at a throughput of 100,000 samples per second, we see usage somewhere in the realm of four cores and 25 gigs of RAM.
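For reference, the kind of Prometheus custom resource from that walkthrough might look roughly like this. It's a minimal sketch against the prometheus-operator API; the names, labels, and selector are illustrative, not taken from the slides.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-a
  namespace: monitoring
spec:
  replicas: 1                      # a single server
  serviceAccountName: prometheus
  serviceMonitorSelector:          # pick up the ServiceMonitors covering workloads 1-4
    matchLabels:
      team: example
```

The operator watches for resources like this and renders them into a running, configured Prometheus StatefulSet.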
So what those numbers might look like in a real production environment: say you're running a 100-node Kubernetes cluster with 30 pods per node, each pod is emitting 1,000 time series, and you're scraping every 30 seconds. Your Prometheus server might be chewing up resources in this realm. To handle that, we could dedicate a monitoring node pool to our Prometheus server, give it 32 gigs of memory, and call it a day. But what happens when we need more memory? What happens when our workloads start to grow and our metrics data starts to compound? Well, we can vertically scale, but this obviously has practical limits and requires constant observation to ensure that you're not redlining your resources.

What about horizontally scaling? How does that work? The idea with horizontally scaling Prometheus is to logically shard your scrape targets by deploying multiple Prometheus servers. Specifically, we manually split out the scrape configs to select subsets of our total target space. So on the left, I have a monolithic Prometheus server configured to scrape all workloads in the cluster. I then break that into Prometheus A and Prometheus B on the right, such that Prometheus A is scraping workloads 1 and 2, and Prometheus B workloads 3 and 4. How do I do this with Prometheus operator? Instead of one custom resource, I now have two: Prometheus A, using its service monitor selector to scrape workloads 1 and 2, and Prometheus B for workloads 3 and 4. I'm also giving each a replica count of two to demonstrate some HA. I apply these to my cluster, the operator sees them, and reconciles them into two instances of Prometheus, each configured to scrape their respective workloads.

But there's a problem here. In order to query the metric data I care about, I need to know which Prometheus server a given metric resides on. Now, this might not be a big problem if we're talking about two Prometheus servers, but when it grows beyond a handful of servers, this quickly becomes cumbersome. So it really behooves us to have a single pane of glass for querying, a single entry point to run PromQL.

What are the fixes for this? Well, we could federate. In a federation setup, we have parent Prometheus servers continuously scraping child servers in a tree-like structure against their /federate endpoints. As they scrape that data, they collect and aggregate it and move it up to the next level of the tree. You can picture this tree growing and morphing to suit your infrastructure needs. Your query entry point would then point at the root, or roots, of that tree, and you would essentially have a global view of your data. The downside here is wrangling deployments and configuration. It's not trivial to ensure that all your servers are properly talking to each other and scraping and aggregating the right data. So we'll probably want some specialists or a team to maintain this infrastructure.

You could also use remote read. In a remote-read setup, a central Prometheus server pulls in all that data at query time, as opposed to setting up a federated tree. The downside here is that there's currently no support for query pushdown, so larger queries could potentially be pulling gigs of data over the network, where you're extra susceptible to things like network failures, or maybe ingress policies that prevent you from doing that at all.
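To make the federation option concrete, a parent server's scrape config typically looks something like this. It's a minimal sketch; the child addresses and match[] selectors are illustrative, and in practice you'd federate only aggregated series rather than everything.

```yaml
scrape_configs:
  - job_name: federate
    honor_labels: true             # preserve labels exactly as the child servers expose them
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__=~"job:.*"}'   # e.g. pull only pre-aggregated recording-rule series
    static_configs:
      - targets:
          - prometheus-a.monitoring.svc:9090
          - prometheus-b.monitoring.svc:9090
```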
So again, we find ourselves in a situation where we need to dedicate SREs to the task of maintaining this infrastructure, maybe introducing a caching layer for client-side aggregation or something like that.

We could also use Thanos, and Thanos works pretty well for this; it was built to solve this exact problem of scaling Prometheus. Thanos is supported in Prometheus operator through the Prometheus custom resource, which deploys the Thanos sidecar. You could then deploy the Thanos Query component using a stack like kube-thanos, marry those two on your Kubernetes cluster, and you essentially have a functioning Thanos stack. But you can see from the architecture diagram that this isn't a trivial setup. You'll again want some people who know what they're doing, a team of specialists, to maintain this infrastructure and set you up to scale.

Finally, we have remote write. In a remote-write setup, each Prometheus server forwards its metric data to some centralized, global backend over the network. The downside here is that maintaining that persistent, scalable backend is non-trivial. Additionally, remote write uses around 25% more memory than a standard Prometheus server. Now, a fix here could be the new Prometheus agent mode, which uses far fewer resources at the trade-off of not having all the features of the traditional Prometheus server, like local storage or querying.

So there's a fundamental problem with all these approaches: the need to rebalance shards. What I mean by that is, specifically with federation and remote read, where each Prometheus server is maintaining state, once a shard contains data, it stays there. Scaling or rebalancing shards is a manual process and would require some manual orchestration, maybe some backfilling. With Thanos and remote write, where we're shipping our data off to a backend, whether it be a blob store or a database, we're still susceptible to becoming overwhelmed at scrape time, particularly in large, dynamic environments like big Kubernetes clusters, where workloads are scaling up and coming and going as workloads do.

So these are the scenarios we had in mind when setting out to build a managed service for Prometheus. We found shipping metrics off to a remote backend appealing. It allowed us to treat Prometheus in-cluster as effectively stateless. It also allowed us to separate the storage, query, and collection concerns. Additionally, at Google, we were able to leverage our planet-scale time series database, Monarch, which had the capacity we needed to offer the service, specifically being able to serve over two trillion active time series and offer long-term retention of metrics. So we could stand up a PromQL-compatible API on top of Monarch and solve our query concerns, but we still needed a stable, scalable metrics ingestion approach to offer a seamless product experience.

So what if we leverage Prometheus as a node agent? Specifically, we limit Prometheus to scraping only co-located targets on the same node. Think of Prometheus as a pure collector. This helps mitigate our scaling issues, because the number of targets and metrics is naturally constrained by the capacity of a given node. And we've solved our single-pane-of-glass querying problem, because we're forwarding all of our metric data to a remote backend.
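For reference, forwarding to a remote backend is just a remote_write block in the Prometheus configuration. This is a minimal sketch; the URL is a placeholder for whatever backend you run, and the tuning and filtering shown are optional.

```yaml
remote_write:
  - url: https://metrics.example.com/api/v1/write   # placeholder backend endpoint
    queue_config:
      max_samples_per_send: 1000                    # tune for your network and backend
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "debug_.*"
        action: drop                                # optionally filter series before sending
```

For collect-and-forward-only setups, newer Prometheus releases can also run with the agent feature flag (--enable-feature=agent), which drops local querying and rule evaluation in exchange for a much smaller footprint.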
And in terms of maintaining this Prometheus infrastructure, Kubernetes provides a built-in resource, the DaemonSet, to achieve this exact setup. How do we implement it? Well, each Prometheus server needs to know which node it's on at runtime. So we can leverage the Kubernetes downward API, where a container is exposed to meta-information about itself at runtime. We can then expose the node name as an environment variable, for example, in the container manifest, which is what's on the left, and use that in our Prometheus configuration. Now, in Prometheus, to do down-selection or filtering of targets, we use what are called relabel configs. So in this case, any incoming candidate target whose node name matches my environment variable, we keep; everything else is implicitly dropped. Okay, so that's a functioning solution.

I'd like to briefly interlude into Prometheus service discovery on Kubernetes in general. In order to discover targets to scrape, Prometheus opens up watch connections against the Kubernetes API server for various resources, whether it be nodes, pods, endpoints, et cetera. So whether you've known it or not, whenever you've installed Prometheus on your cluster, you've always given it a cluster role that looks something like this to allow it to do that. But remember, we're running Prometheus as a DaemonSet, which means we have a Prometheus server opening up these watch connections on every node in the cluster. The number of watches compounds by n, the number of nodes, and on larger clusters, where we have on the order of hundreds or thousands of nodes, this puts considerable strain on the Kubernetes API server. Every time you open up a watch connection on the API server, it spawns two goroutines; it then needs to process all changes for objects of a given kind, and serialize and send those updates back to the watch client when an event happens. Additionally, with our approach of using relabel configs, we're just doing last-mile filtering of this traffic. We're not reducing any load on the Kubernetes API server.

Okay, so instead of using relabel configs, what if we use field selectors to filter targets at discovery time? This works because the Kubernetes API server watch cache indexes pods specifically by node name, which means it only has to process changes for pods on that node, as opposed to every single pod in the cluster, and that greatly reduces the resource utilization there. In fact, the kubelet and kube-proxy components already use this pattern today to allow Kubernetes to scale effectively. So let's remove our relabel config and instead insert our environment variable in the service discovery config. This sets us up for a much nicer picture with respect to watches on the Kubernetes API server; each watch is effectively constrained.

So this deployment model is fundamentally different from anything offered by Prometheus operator today, though I do want to point out that there are discussions on GitHub about this idea. In our custom resources, we enforce the pod role and node field selectors to handle these exact scaling challenges, as well as to address some other things, specifically RBAC and tenancy, and to keep a simpler configuration surface. So we introduce PodMonitoring as a custom resource to configure scraping. It's closely modeled after the Prometheus operator ServiceMonitor or PodMonitor, with some differences. The PodMonitoring is on the left (both resources are sketched below, after a recap of the node-local discovery config).
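To recap the node-local collection mechanics in one place, here is a hedged sketch. The downward API part is standard Kubernetes; on the Prometheus side, vanilla Prometheus doesn't expand environment variables in scrape configs, so assume $(NODE_NAME) gets substituted into the rendered configuration by whatever renders it; the exact substitution mechanism is an implementation detail I'm glossing over.

```yaml
# DaemonSet pod template excerpt: expose the node name via the downward API.
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
---
# Option 1: last-mile filtering with a relabel rule. Every pod in the cluster is
# still discovered; targets on other nodes are dropped after the fact.
scrape_configs:
  - job_name: pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_node_name]
        regex: $(NODE_NAME)
        action: keep              # everything else is implicitly dropped
---
# Option 2: filter at discovery time with a field selector, so the API server
# only sends watch events for pods scheduled on this node.
scrape_configs:
  - job_name: pods
    kubernetes_sd_configs:
      - role: pod
        selectors:
          - role: pod
            field: spec.nodeName=$(NODE_NAME)
```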
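And here's a hedged reconstruction of the two resources on that slide. The field names follow the open-source prometheus-engine and prometheus-operator APIs, but the workload names, labels, and namespaces are illustrative.

```yaml
# Left: a namespaced PodMonitoring (managed collection), scoped to its own namespace.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: prom-example
  namespace: backend
spec:
  selector:
    matchLabels:
      app: prom-example
  endpoints:
    - port: metrics
      interval: 30s
---
# Right: the analogous ServiceMonitor, which can live in any namespace and uses
# a namespaceSelector to reach into the target namespace.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prom-example
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - backend
  selector:
    matchLabels:
      app: prom-example
  endpoints:
    - port: metrics
      interval: 30s
```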
You can see it's a namespaced resource, living in the namespace backend, configured to scrape our prom-example workload. The analogous ServiceMonitor, on the right, can be in any namespace you want. So we can stick it in monitoring, for example, and then use a namespace selector to tell it to go look in that namespace for targets to scrape, in this case backend. In PodMonitoring, we have strict namespace tenancy, so a custom resource in one namespace cannot scrape workloads in another. This is in contrast to the namespace selector we just talked about, and it aligns well with Kubernetes RBAC enforcement around namespaces. As a cluster administrator, I have full control over not only who can create and configure a PodMonitoring, but effectively over which service accounts can scrape which workloads. This has some nice properties; for example, it prevents accidental metric blow-up from inadvertently matching targets in other namespaces. This tenancy is enforced at persistence time using relabeling: all time series ingested via a PodMonitoring are relabeled with the namespace and cluster where that PodMonitoring resource exists.

Now, we realize that having a namespace scope for scraping isn't always convenient or appropriate. So we have a dedicated custom resource, ClusterPodMonitoring, for a cluster-wide scraping scope. This also works for certain exporters where we want to preserve the namespace label on our time series, for example kube-state-metrics. And again, because this is a dedicated custom resource, as a cluster administrator I can ensure that only certain service accounts have this cluster-wide scope for scraping. With respect to tenancy at persistence time, all time series are just labeled with the cluster where that custom resource lives, not the namespace.

It's noteworthy that PodMonitoring is limited to scraping just pods. Again, this is because we can't constrain node-local watches for things like service endpoints in a scalable way; remember, the Kubernetes API server watch cache indexes pods, and pods only, by node name. We were okay with this trade-off because a ServiceMonitor is usually used to scrape pod workloads that happen to sit behind a service anyway; it's less common to scrape pure service endpoints. With a ServiceMonitor you can even select multiple services at once to scrape from, and we found that a service-level spec adds a layer of indirection to target discovery. For example, if I have a ServiceMonitor selecting the services for workloads 1 and 2, which have these pods behind them, and then someone throws workload 7 into the mix and it happens to get selected through some bad luck with label selectors, we can have a hard time debugging what's going on, and in the worst case we get unintentional metric blow-up.

Okay, so that's scraping. Prometheus also offers recording and alerting rules. Now, our Prometheus collectors are running effectively stateless on every node, and it didn't really make sense to put the burden of evaluating recording and alerting rules on them. So we wanted to deploy a separate workload whose sole concern is enforcing recording and alerting rules. We have a workload called the rule evaluator that runs as a deployment. It takes in Prometheus recording and alerting rules, queries the remote backend, writes recording-rule results back to it, and sends any alerts to a configured Prometheus Alertmanager. It's a similar idea to the Thanos Ruler.
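As a preview of the configuration covered next, here's a hedged sketch of the kind of rule groups the rule evaluator consumes, using the familiar Prometheus rule-group format. The resource kind and API group follow the open-source prometheus-engine project; the group, expressions, and thresholds are illustrative.

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: example-rules
  namespace: backend            # rules here can only see series from this namespace
spec:
  groups:
    - name: example
      interval: 30s
      rules:
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
        - alert: HighErrorRate
          expr: sum(rate(http_requests_total{code=~"5.."}[5m])) > 1
          for: 10m
          labels:
            severity: warning
```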
To configure the rule evaluator, we have a Rules custom resource that looks basically the same as the PrometheusRule custom resource from Prometheus operator, but with some differences at runtime. Again, Rules have strict namespace tenancy: a Rules resource in namespace A cannot query or write to time series in namespace B. This has some nice properties; for example, we prevent expensive queries matching time series in other namespaces, as well as collisions when writing recording rules. Again, the tenancy is enforced at persistence time with the namespace and cluster where that Rules resource exists. Mirroring PodMonitoring, we recognize there are cases where it makes sense to allow users to query beyond a single namespace, so we offer ClusterRules as a way to do that. Finally, it's totally possible that you have multiple Kubernetes clusters all writing metric data to this backend. If you want to be able to query against all of those time series, we offer yet another scope in GlobalRules, which has no boundaries on what it can query or write to. The common theme here is giving the cluster administrator the tools they need to control which service accounts have what level of scope with regard to the metrics in your cluster.

So how do we control all these components, what we've been calling managed collection? We have yet another custom resource called the OperatorConfig. The OperatorConfig runs as a singleton in your cluster and controls everything around collection. For example, if you want to filter data before sending it over the wire, or compress it, you can do that. You can configure Prometheus Alertmanager endpoints to route alerts to. We even offer a managed Alertmanager that you can configure through this resource.

So let's walk through a bit of what this looks like. All managed components live in a dedicated system namespace alongside the operator, the idea being that users shouldn't have to modify anything in the system namespace. The OperatorConfig singleton lives in a dedicated sister namespace, a public namespace that's watched by the operator. Users are intended to modify resources in that namespace to configure managed collection. This has some nice properties; for example, we can ship our managed components with more constrained RBAC. Our operator doesn't need cluster-wide read access on secrets; it can just watch secrets in these two namespaces.

So let's walk through an example: we want to send alerts to an Alertmanager service over TLS. The conventional way to do this would be to use a Kubernetes Secret resource. You would use a secret key selector in your custom resource, which would fetch the secret and its contents; you would mount that into your workload and then use those credentials for TLS. Now, the rule evaluator runs in the system namespace, and in Kubernetes you cannot mount secrets across namespaces. Since we don't want users to modify anything in the system namespace, we propose that you create your secret in the public namespace. The operator then reconciles that and mirrors it over to the system namespace, where it's mounted by the workload for TLS.

Throughout development, we've really strived to minimize our configuration surface. I guess as developers we're always trying to do that, but with PodMonitoring and ClusterPodMonitoring specifically, we provide label selectors for target selection, basic scrape configuration, and very limited relabeling capabilities.
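As a rough sketch of that secret flow, assuming the open-source project's namespace names (the talk just calls them the system and public namespaces), a user-created secret might look like this; the secret name and contents are illustrative.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-tls
  namespace: gmp-public        # the user-facing namespace; users never touch the system namespace
type: Opaque
stringData:
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    ...placeholder...
    -----END CERTIFICATE-----
```

When the managed-collection configuration references this secret, the operator mirrors it into the system namespace so the rule evaluator can mount it for TLS, since Kubernetes won't mount a secret across namespaces.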
On relabeling specifically: Prometheus relabeling is very powerful, but it's really easy to shoot yourself in the foot and cause a metrics explosion through a bad relabeling rule, or even have collisions and drop a bunch of metrics. So we try to limit the configuration surface there to protect users when they're setting things up.

But notice there's no Prometheus custom resource; Prometheus runs as a vanilla DaemonSet. There's no ruler custom resource; the rule evaluator runs as a vanilla Deployment. We've also broken the operator out of the reconcile loop in a lot of respects: instead of reconciling the entire resource, the entire DaemonSet or Deployment, the operator is just concerned with reconciling configuration. We let Kubernetes reconcile the managed resources at the infrastructure level. So as a cluster administrator, if I have custom security profiles, or volume mounts, or certain quotas around CPU requests and limits, I'm free to alter the DaemonSet myself and the operator will not overwrite me. This allows for deep customization of these resources while keeping our configuration surface super simple. Anything on the pod template spec you can modify on the resource; it doesn't need to be part of the custom resources that the operator comes with.

Some future direction for the project. Prometheus supports authenticated scraping, that is, scraping protected metrics endpoints. With this setup, collectors will need access to secrets in the cluster. It's tempting to just say, hey, collectors, you can open a watch connection against every secret in the cluster and retrieve them as needed, but then we find ourselves back in the same scenario where we're overwhelming the API server with all these watch connections. So we're working on ways around that now. Additionally, Prometheus provides collection-side status APIs. Basically any HTTP API that's not PromQL falls into this category, and they're really helpful for debugging. If you need to know the status of some targets in your cluster, or the configured rules, or build info, things like that, these APIs are really useful. But remember, since Prometheus is deployed as a DaemonSet, this status is effectively sharded throughout the cluster; there's no central place I can go to get this information. So we're working on ways to centralize it and surface it to users conveniently.

So, in summary: scaling Prometheus can be a challenge. We found that separating collection from querying can be advantageous here. Our approach is to run Prometheus as a node agent. In this configuration it's fairly simple to maintain, but you do need to watch out for scaling. We've introduced a new operator and custom resources that emphasize tenancy around target discovery and metric scope, which we feel align well with Kubernetes RBAC best practices. This infrastructure straddles two namespaces, a system namespace and a public namespace, which let us constrain the RBAC our components need to run. Users interact with the dedicated public namespace for configuration, and we've really strived to keep our configuration surface as simple as possible, oftentimes favoring simplicity over configurability. And the operator itself is really just concerned with reconciling metrics configuration, not monitoring infrastructure. This code is all open source; you can find it on GitHub at GoogleCloudPlatform/prometheus-engine. Thanks for your time. Any questions?

Is there any limitation to the backends that you support?
Yeah, so the question was, are there any limitations to the backends that we support? The code, as it runs on that repository, just goes to Google's backend, Monarch. However, this approach would work in theory with any backend. It's really just a different way to deploy Prometheus on Kubernetes, and that's what I wanted to focus on.

Hi, what is the cost overhead we get for deploying a small Prometheus server on each node, as opposed to one large server? Yeah, the question was, what is the cost, I guess in terms of CPU? CPU, yeah. The cost is not zero, right? I mean, it's kind of like that meme, sidecars, sidecars everywhere. It really depends on your environment and the workloads and the metrics that you're producing. Prometheus will, in kind, use more resources if you have more metrics, right? I wouldn't say it's using more resources overall than a standard Prometheus server, because that's effectively sharded throughout your cluster; you're kind of dividing up those resources across every single node. But yeah, it really just depends on your environment and your workload size. Any other questions?

I just have a follow-up question about this, actually. Does Prometheus scale down very nicely? Because we go from one giant Prometheus, to 100, I forget your numbers, to one per node. And if your node is small enough, does Prometheus degrade, like, is it linear? Probably not, right? So you're asking if your node doesn't have enough resources to fit Prometheus on it? Just a small node, as opposed to a large node. Yeah, I mean, you could set up Prometheus with whatever requests and limits you think are appropriate that could fit on that node. But it just kind of depends, again, on what it's trying to scrape; how much it consumes will be related to how many metrics are on that node. But if you're asking whether Prometheus will be evicted or something by the kubelet... No, it's more a question of, let's say I do Prometheus per namespace or something of the sort. If I end up with one pod and one Prometheus, I'm guessing there's an overhead to having a single Prometheus scraping just very few metrics. Like, it doesn't scale linearly; there's probably a constant. Yeah, totally, yeah. So the question was, if you have a small cluster or a small workload, is deploying this really necessary? Is it kind of overkill in a lot of ways? And I would say, yeah. Again, we were trying to offer a managed service where we didn't have to think about the size of the cluster; it'll just work. And the trade-off there is, you know, it's all trade-offs in engineering, the trade-off is that we'd probably use more resources than just a standalone server, at the cost of being, I guess, frictionless. Other questions? Anyone else?

Yes, with the node-agent implementation, can I use Grafana and query metrics from it? How can that work? You're asking if you can use Grafana in this configuration? In that way, yeah. Yes, you can. So the question was, how can you use Grafana with this setup? Again, since all of our metrics are persisted centrally in a global backend, you would essentially point your Grafana at the URL of the backend, and then Grafana interacts with that.
The infrastructure I've described here is purely for collecting data and also doing recording and alerting rules. So Grafana is sort of concerned with the query side. And that backend has to be in Google Cloud? Can it be, like, on premises? So yeah, the idea here is that you can write to any backend you would like. The code, as it sits on GitHub, as we've been writing it, writes to Google Cloud. But you could stand up this same exact infrastructure pattern to write to any backend. Yeah, it's really just leveraging Prometheus as a stateless collector in the cluster. Thanks.

Is there an assumption that the metrics for a node are gonna fit in one Prometheus? Can you say that again? Yeah, so you've mentioned daemon sets and node agents and so on. So you have an assumption, then, that everything that runs on that node is gonna fit in one Prometheus, right? So you don't need to scale to more than one Prometheus per node. Yeah, so the question was, is it possible for a single Prometheus server on a given node to essentially become overwhelmed, because there could be a ton of targets on that node, right? And that's totally possible. It's again about mitigating the need to manually scale. We've definitely seen heterogeneous usage of resources because of exact scenarios like that, where a lot of workloads land on a given node. So in that case, what you typically do is deploy your daemon set with the high watermark of what you'd expect the usage to be. You sort of overallocate on the other nodes, at the cost of not having to think about maintaining the infrastructure. So yeah, it's totally possible and I'm sure it happens.

Are you able to speak to any customer success stories, where someone implemented this and their infrastructure, their workflow, their whole system was improved? I mean, we have a lot of customers using it that are happy with it, because they don't need to think about managing Prometheus infrastructure themselves. So the question was, do we have customer success stories? Yeah, I'm thinking about the before and after, right? Before, they had Prometheus and some issues, and then after, how did it make their life better? Yeah. I think that's elaborated on; there's a Google Cloud podcast on a particular customer who found success through adopting it, and I'd encourage you to go check that out for the exact transition that they made. Okay, great, thanks. Yeah.

Just time for one last question. Great talk, I appreciate it. It's interesting to hear a lot of different use cases with Thanos and Prometheus and, you know, your new service you're building, because we actually use Thanos for our use cases, so it's cool to hear about that. But in relation to that, I know with Thanos, with the rules deployment, it can have performance issues when you're actually trying to evaluate the alerting rules, right? Have you run into that problem when you're not evaluating them on the actual Prometheus servers themselves, but have a separate, managed component that you're running? Yeah, so the question is, have we seen our rule evaluator become overwhelmed because it's... Exactly, yeah. We haven't yet. We anticipate it will, but there hasn't been enough noise from people saying, hey, this is getting overwhelmed. Yeah, yeah. I was just curious when I heard it; I was like, oh, is that going to be an issue?
Yeah, Thanos deploys it in HA, kind of like the ruler is HA. Exactly, yeah. Thank you for your presentation. And yeah, we're right on time for the close. Cool, thanks.