Hello, everyone. I hope you're enjoying the conference so far. Today, I'm going to talk about how we use Prometheus and its friends for the OpenShift offering at Red Hat. I'm going to tell you about the system that we built for remote health monitoring of OpenShift clusters, and how we use our telemetry data to make data-driven decisions at Red Hat. Without further ado, let's start. I work on the OpenShift Observability and Monitoring team as a software engineer. I am also a Thanos maintainer and a Prometheus contributor. As a team, we are building a platform to collect and store observability signals. We also have SRE responsibilities, and we are on call for the internal platform that we are building. The system that we built is called Telemetry.

So what is telemetry? In simple terms, telemetry is an open source remote health monitoring system. Remote health monitoring works by sending a carefully chosen subset of data back to the Telemetry service. The data is anonymized to maintain privacy. We do not collect any identifying information, such as user names, passwords, or resource names. And the full list of the data we collect is publicly available. The primary information collected by Telemetry includes the number of updates available per cluster, the number of errors that occurred during an update, progress information for running updates, the health condition and status of the OpenShift components installed on the clusters, and the name of the platform that OpenShift is deployed on, such as AWS.

The data we collect enables us to provide benefits to the end users that would otherwise be impossible. With the help of the telemetry data, we can observe events that may seem normal to a single end user, but from the perspective of seeing those events across the fleet of users, we can provide more insights. With the collected information, we can improve the quality of the releases and react more quickly to issues found in the clusters. As a result, we provide better support. And the information allows OpenShift to release new features more rapidly and focus engineering resources where they can be most impactful to end users.

So how do we build telemetry? At Red Hat, we have an upstream-first mentality, so Telemetry is based on open source tools from the Prometheus ecosystem. We deploy and maintain a highly available Prometheus, Alertmanager, and several Thanos components in each and every cluster we run. How do we deploy? We use the Prometheus Operator to deploy Prometheus, Alertmanager, and several Thanos components into OpenShift clusters. Let me briefly explain what it is for the ones who aren't familiar with the Prometheus Operator. The Prometheus Operator provides Kubernetes-native deployment and management of Prometheus and related monitoring components. The purpose of this project is to simplify and automate the configuration of a Prometheus-based monitoring stack for Kubernetes clusters. By using the Prometheus Operator, out of the box we monitor critical cluster components and alert on the metrics we collect. Moreover, we also let users define, configure, and deploy their own monitoring stack to monitor their own workloads. In this stack, we have a Prometheus HA pair, Alertmanager, Thanos Ruler, and Thanos Querier to provide a global view over the Prometheus HA pair. So what does it look like? As I already told you, we deploy Prometheus using the Prometheus Operator, along with several other Thanos components.
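To give a feel for the format everything in this stack speaks, here is a minimal sketch, assuming a hypothetical workload, of exposing metrics in the Prometheus exposition format with the Go client library. The metric name and port are made up for illustration; the in-cluster Prometheus scrapes endpoints like this one according to the monitoring configuration.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// workloadErrors is a hypothetical counter for an example workload;
// the in-cluster Prometheus would scrape it along with everything else
// exposed on the /metrics endpoint below.
var workloadErrors = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "example_workload_errors_total",
	Help: "Total number of errors observed by the example workload.",
})

func main() {
	prometheus.MustRegister(workloadErrors)

	// Expose the standard /metrics endpoint in the Prometheus text format.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```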
For each cluster, we collect and send critical metrics, critical alerts, and information about upgrades to our Telemeter service. In the first version of the system, the in-cluster Prometheus was collecting the data from the workloads in the Prometheus format. Then a dedicated component was scraping the federate endpoint of Prometheus every four and a half minutes. That component cleaned the metrics, anonymized the data, and then sent the metrics in the Prometheus federation format to the server-side Telemeter server. On the server side, we had the Telemeter server component. It received the data in the federation format and sharded it across a hash ring of replicas. The Telemeter servers were then scraped by two replicas of Prometheus, ingesting all of the data twice. The hash ring was super primitive: all the data was kept in memory and nothing was persisted. And we used Prometheus directly to provide access to the data we collected. Ultimately, these were bottlenecks, and we struggled with the high volume of data to ingest and query. Moreover, heavy queries would eventually bring down one of the Prometheus instances, and that would prevent us from ingesting more data.

So when we hit these scalability issues, we decided to invest in a redesign of the system, and we chose Thanos to build a more scalable system upon. Thanos helped us to compose a highly available metrics system with unlimited storage capacity that can be added seamlessly on top of existing Prometheus deployments. Thanos helped us to build a cost-efficient store for historical metric data while retaining fast query latencies. We introduced a new custom central metrics collection pipeline for Telemeter using Thanos. To make it even more scalable, Red Hat invested in adding a new component to Thanos called Receive. At that point, this was a novel idea for Thanos, because it changed the Thanos model from a pull-based solution to a push-based one. This effort was started in June 2019, and we ended up building a SaaS offering for Red Hat's internal customers.

So what does the new stack look like? All the in-cluster bits remained the same, and we converted the Telemeter server into a mere authentication proxy and data transformer. For the legacy endpoint, all the uploaded data is now converted to the Prometheus remote write format and sent over to Thanos Receive. We also added a new endpoint, though. With it, we added the ability to send metrics directly from Prometheus to the Telemeter server using the remote write API. In upcoming OpenShift versions, we are planning to move completely to writing directly from Prometheus to the Telemeter servers. Also, we created a new controller for Thanos Receive to coordinate updates, adding and removing nodes from the Thanos Receive hash ring. And we utilize Thanos Store and Querier to provide access to the data for our internal customers.
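To make the push model concrete, here is a minimal sketch in Go of what a remote write request to an endpoint like Thanos Receive looks like. The endpoint URL, labels, and sample values are assumptions for illustration only, not our production setup.

```go
package main

import (
	"bytes"
	"net/http"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// Build a single time series; the metric name and labels are made up.
	req := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "example_cluster_metric"},
				{Name: "cluster", Value: "example-cluster-id"},
			},
			Samples: []prompb.Sample{{
				Value:     1,
				Timestamp: time.Now().UnixMilli(),
			}},
		}},
	}

	// Remote write payloads are protobuf-encoded and snappy-compressed.
	raw, err := proto.Marshal(req)
	if err != nil {
		panic(err)
	}
	compressed := snappy.Encode(nil, raw)

	// Placeholder URL for a Thanos Receive remote write endpoint.
	httpReq, err := http.NewRequest(http.MethodPost,
		"https://receive.example.com/api/v1/receive",
		bytes.NewReader(compressed))
	if err != nil {
		panic(err)
	}
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
}
```

In practice, Prometheus itself produces requests like this through its remote_write configuration, which is how the in-cluster Prometheus can send metrics directly to the Telemeter server.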
But what about the other signals? I mentioned that we collect them, right? This diagram shows what we have now, metrics and logs, and what we are planning to build soon, tracing and profiling. To extend the functionality of Telemetry, we decided to build yet another open source observability system. Our main goals were to provide support for multiple observability signals, correlation between signals, seamless multi-tenancy, authentication and authorization, and improved security.

Thanks to the Prometheus ecosystem and the other systems that were built in a similar way, this was a relatively easy task. We based our design on two major points: schema-less labels and object storage support. We recently started to provide a logging solution using Grafana Loki to our internal customers at Red Hat. We call this system Observatorium. You can think of Observatorium as a distribution: we have packaged a reference architecture of Thanos, Loki, and soon others, which allows easier installation, configuration, and operation of the observability systems that we found most useful and practical. And of course, all of these efforts are open source. You can check out everything we have done so far by visiting our website. Please do so.

So what did we learn while building this platform? In a big organization like ours, being able to offer an inner source project to less experienced teams is extremely useful. Moreover, flexibility of deployment and design is a must-have: requirements and priorities change overnight, and we had to support everything from single-CPU clusters to many, many huge clusters. Using schema-less labels to correlate signals paid off. An API-driven model helped us to extend the system easily. And cost matters: even at a medium scale, it takes millions of dollars to gather all the log lines, traces, or events. So focusing on actionable metrics helped us to scale seamlessly, and relying on object storage for long-term retention helped us to control the costs.

Last but not least, we are currently hiring. If you want to work with us and contribute to all these cool open source upstream projects, please let us know. You can apply by using the link in the slides. If you have any questions, please reach out to us. You can find us on the upstream projects' CNCF Slack channels. Thanks everyone for listening, and have a nice day.