Hello everyone, and thank you for joining the PromCon keynote today. I'm looking forward to all of the exciting talks that we have lined up this year. I'd like to kick off the conference by exploring the Prometheus ecosystem and showing that there is much more to it than just Prometheus. We can easily do that by going through the monitoring stack that we deploy as part of OpenShift. For those of you who are not familiar with OpenShift, you can think of it simply as a Kubernetes distribution, so everything that we build as part of the OpenShift monitoring stack translates very easily to Kubernetes itself.

Let me briefly introduce myself. My name is Philip, and I'm a member of the Red Hat OpenShift monitoring team. Besides that, I'm also fairly active in upstream projects, most notably the Prometheus Operator and kube-state-metrics.

All right, let's jump right into it. To illustrate how we build a full-blown monitoring solution by leveraging what's available in the ecosystem, let's start with a vanilla Kubernetes cluster and see what we can build. For us, at the core of the monitoring stack is, no surprise, Prometheus. We deploy Prometheus using the Prometheus Operator, and we do that because the operator does a very good job at upgrading Prometheus: whenever we need to roll a new version into the stack, we are confident that the operator is going to do the right thing and make sure the upgrade is successful. Furthermore, the operator provides self-service mechanisms that we can use to configure Prometheus in a decentralized manner, and we'll see how we use those mechanisms in just a bit.

Having a Prometheus instance lying around is not very useful without any metrics, so besides Prometheus we deploy two very important exporters: kube-state-metrics and node_exporter. kube-state-metrics gives us metrics about the state of the objects in the Kubernetes environment, whereas node_exporter gives us real-time information about resource usage at the node level, such as CPU, memory, disk, and network. With just this handful of components we have what is traditionally referred to as infrastructure monitoring, and this setup is already useful: you can query a lot of metrics, and if you have a dashboarding system you can create dashboards, recording rules, and so on.

In addition to what we have here, we also allow what we consider core OpenShift components to register their targets with this Prometheus instance. They do that using the ServiceMonitor and PodMonitor custom resources provided by the Prometheus Operator; the operator watches those resources and configures Prometheus to scrape the core OpenShift components. This setup has all of the metrics that are required for monitoring OpenShift as a platform, and that is why we call it the platform monitoring Prometheus.
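To make that self-service mechanism a bit more concrete, here is a minimal sketch of what it can look like with the Prometheus Operator. To be clear, the names, namespaces, and labels below are illustrative, not the exact resources we ship in OpenShift: a Prometheus custom resource that selects ServiceMonitors by label, and a ServiceMonitor that a component team would create to register its own scrape targets.

```yaml
# Sketch only -- names, namespaces, and labels are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: platform
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus
  # Pick up every ServiceMonitor carrying this label...
  serviceMonitorSelector:
    matchLabels:
      monitoring: platform
  # ...from any namespace.
  serviceMonitorNamespaceSelector: {}
---
# A component team registers its targets by creating a ServiceMonitor;
# the operator turns it into scrape configuration for Prometheus.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-component
  namespace: example-component
  labels:
    monitoring: platform
spec:
  selector:
    matchLabels:
      app: example-component  # matches the component's Service
  endpoints:
  - port: metrics             # named port on that Service
    interval: 30s
```

The nice property of this pattern is that configuration stays decentralized: each team owns the ServiceMonitors in its own namespaces, and nobody has to edit a central Prometheus configuration file.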
We also allow users to monitor their own applications, but we spin up a separate Prometheus instance for them. We want to separate the two failure domains, so that if a user application accidentally causes a cardinality explosion, the platform metrics are still protected and admins, or anyone with access to those metrics, can troubleshoot and see what's going on in the platform. Of course, for routing alerts we deploy Alertmanager, again through the operator. Another extremely important component for us is Thanos Query, which serves two functions. First, it provides a single query endpoint to users, so it essentially serves as the API for the monitoring stack: users don't have to think about which Prometheus to query for which metric, they simply use a single endpoint. Second, Thanos Query can join metrics across Prometheus instances, which means users can issue queries that combine metrics from both the platform and their applications.

There is one more pretty interesting aspect of the monitoring stack that I wanted to bring up: multi-tenancy. OpenShift itself is multi-tenant; this is a fundamental building block, which means we can have multiple tenants running in the same cluster at the same time, and thereby using the same monitoring stack. We want to carry that multi-tenant property over into the monitoring stack: tenants should only be able to access the metrics that belong to them. Multi-tenancy in OpenShift is defined on a namespace basis, so if tenant one has access only to namespace one, they should only be able to see metrics from namespace one. This is not natively available in Prometheus out of the box, but we again reach into the ecosystem and pull in a community project called prom-label-proxy. prom-label-proxy acts as a proxy in front of Prometheus and enforces a particular label on every query that comes in. We specifically configure it to inject the namespace label, but that is an arbitrary configuration choice specific to us. So when a tenant issues any PromQL query, for example kube_pod_info, prom-label-proxy injects the appropriate namespace label. Tenants don't actually know that this transformation is happening; they make the query as they would anywhere, and the underlying plumbing scopes it so that it only returns metrics that belong to them.
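As a rough sketch of how that enforcement can be wired up (the container layout, image tag, and ports here are assumptions, not our exact deployment), prom-label-proxy typically runs as a small sidecar in front of the query path:

```yaml
# Sketch only -- image tag, ports, and names are assumptions.
# prom-label-proxy rewrites every incoming PromQL query, e.g.
#   kube_pod_info  ->  kube_pod_info{namespace="tenant-1"}
# The label value comes from a URL parameter named after the enforced
# label ("namespace" here), which an authenticating proxy in front of
# this one would set based on the tenant's identity.
containers:
- name: prom-label-proxy
  image: quay.io/prometheuscommunity/prom-label-proxy:v0.8.0
  args:
  - --insecure-listen-address=0.0.0.0:8080
  - --upstream=http://127.0.0.1:9090  # Prometheus in the same pod
  - --label=namespace                 # the tenancy label to enforce
```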
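And going back to Thanos Query for a moment, here is a similarly hedged sketch of how a single query endpoint can fan out to both Prometheus instances. The service names and image tag are again illustrative, and it assumes the Prometheus pods run the Thanos sidecar, which exposes the gRPC StoreAPI:

```yaml
# Sketch only -- service names and image tag are illustrative.
# Thanos Query fans out to the StoreAPI of each Prometheus,
# deduplicates the results, and lets a single PromQL query combine
# platform metrics with application metrics.
containers:
- name: thanos-query
  image: quay.io/thanos/thanos:v0.34.1
  args:
  - query
  - --http-address=0.0.0.0:9090
  # Discover each Prometheus via DNS SRV lookups on its governing Service.
  - --endpoint=dnssrv+_grpc._tcp.prometheus-platform-operated.monitoring.svc
  - --endpoint=dnssrv+_grpc._tcp.prometheus-user-workload-operated.monitoring.svc
```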
This kind of setup has worked for us for a fairly long time; we have this stack deployed across probably tens of thousands of clusters at the moment. What we are seeing, though, is that some users are growing out of it, requiring either a bit more flexibility or a bit more resilience. You can see, for example, that we have two Prometheus instances, but we don't necessarily have to have exactly two; that number is fairly arbitrary and has simply worked for us until now. What we want to do next, to improve both resiliency and flexibility, is to slim down the platform monitoring Prometheus and use it only for the absolutely critical metrics needed to monitor a plain Kubernetes cluster, and then delegate the monitoring of each core component to its own monitoring stack. We as a monitoring team would build and package a well-founded, opinionated way to run a Prometheus-based stack, with high availability, multi-tenancy, and all of the best practices we've learned while running Prometheus built in, and then we would allow any core component, any team, to spin up such a stack through either a custom resource or some other configuration mechanism. And there's no reason why users shouldn't be allowed to do this too; we would also like to let users spin up such stacks with very minimal configuration.

We're still exploring this approach and are at the beginning of the project, but once we launch it, we would like to make it a fully open-source project that anyone can contribute to and anyone can use. In short, we want to provide a good way to run a Prometheus-based stack in isolation.

So, in summary, this is how we leverage everything that the community has built to compose a very robust monitoring stack. Of course, we as a monitoring team also actively contribute to almost all of the projects that we went through here and that we pull into our stack. And if you are someone who enjoys being part of this community and would like to keep it vibrant and keep contributing to it, please let us know: we're currently hiring people who are excited about open source, about monitoring, and about software development in general. Thank you for listening, and enjoy the rest of the conference.