So, hi everyone, so glad to see you all here, especially on the last day of KubeCon. I hope your conference went well and you were able to learn some cool stuff, visit some cool booths, and maybe grab some stickers as well. This talk will be a bit of a case study with some examples, and these examples will give you an idea of certain unique things you can do with Thanos to derive more value from it.

So, a quick show of hands: who knows what an Infinity Stone is? Okay, that's a good chunk of you. For those who are uninitiated, it's basically a stone that grants you power over some aspect of life, and it's what Thanos uses in the Marvel movies to wipe out the world. But we are going to use the Infinity Stones a bit more productively here. An Infinity Stone, for the purposes of this talk, is an innovative way to use Thanos, or the data within Thanos, that enables you to do more with it.

Before that, if you're observant, you might have noticed that there's only one speaker on the stage instead of two. Sadly, my co-speaker Daniel Moore, who is a manager for several observability teams at Red Hat, had to fly back since his child fell ill, but shout out to him for his help on the talk. In any case, you'll be hearing me for about half an hour.

So, time for a quick intro. My name is Saswata Mukherjee. I'm a software engineer at Red Hat, where I mostly work on monitoring platforms largely based around Thanos. I'm also a maintainer of Thanos and was previously a Google Summer of Code mentee under the same project about three years ago, and I help maintain certain other CNCF-adjacent Go tools and libraries. You can find me under the saswatamcode handle pretty much anywhere.

So let's start from the basics of all this, which at its core is metrics: why we need them to monitor our applications, and why we even need to monitor our applications. Whether you fancy on-prem, the cloud, or any hybrid in between, you want to make sure that whatever offering you have is actually working most of the time. And if it isn't working, you want it to call you up at 3 a.m. so that you can fix it. So you use tools like Prometheus and Alertmanager to check on your infrastructure, maybe every 30 seconds or so, record that data, and alert you when things don't look too good.

This data model of recording samples, i.e. float values with timestamps, and appending them to a time series is what we commonly call metrics. Simply put, it is an aggregation of events happening over time in your infrastructure, and it is still the cheapest signal to use and store in 2024, with very mature storage systems, query models, and providers that allow you to gain insights from it.

Over the course of time, users of tools like Prometheus want to scale their monitoring to suit the needs of ever-expanding infrastructure, often to global scale. And this is where Thanos steps in. If you haven't already heard of Thanos, it is a distributed and scalable monitoring system based on Prometheus, with highly available and scalable components that provide you with a global view of all your metrics, plus long-term storage and querying, with features like downsampling and even multi-tenancy. It can be termed a sort of distributed Prometheus++, as it takes parts of Prometheus, like the rule engine, query engine, storage engine and so on, distributes them, and makes them scalable while also enabling more functionality.
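To make that metrics data model concrete, here is a minimal sketch in Go. The types and field names here are mine for illustration, not Prometheus's actual internals:

```go
package main

import "fmt"

// Sample is one observation: a float value at a millisecond timestamp.
type Sample struct {
	TimestampMs int64
	Value       float64
}

// Series is a metric identity (its label set) plus samples appended over time.
type Series struct {
	Labels  map[string]string
	Samples []Sample
}

func main() {
	s := Series{
		Labels: map[string]string{
			"__name__": "http_requests_total",
			"job":      "api-server",
			"code":     "200",
		},
	}
	// Every scrape (e.g. every 30 seconds) appends one more sample.
	s.Samples = append(s.Samples, Sample{TimestampMs: 1716201600000, Value: 1027})
	s.Samples = append(s.Samples, Sample{TimestampMs: 1716201630000, Value: 1043})
	fmt.Printf("%d samples for %v\n", len(s.Samples), s.Labels)
}
```

The cheapness comes from this shape: one series identity stored once, and then just compressed (timestamp, float) pairs appended over time.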
Out in the wild, it is often used in one of two architectures. I won't go too in-depth into explaining what Thanos is, because the talk is about something else, so we'll just go over the two architectures.

First is the sidecar architecture. This is the architecture Thanos started with initially, and here, as you can see, we have Thanos Sidecar components attached to Prometheus instances. The sidecar reads data from Prometheus and serves it to the Querier whenever a query is fired, using the remote read API to do so. All these sidecars also ship time series database (TSDB) blocks from the Prometheus they are attached to into an object storage of your choosing. You can then run your Prometheus with very low retention and query data over a longer time range with the help of the Store Gateway and Compactor. You can also set up rules for global alerting and recording rules and so on.

The second architecture is Receive, and this was introduced to make Thanos into a sort of monitoring-as-a-service offering. Here, different Prometheus instances can remote-write their metrics, even across network boundaries, to Thanos Receivers, which are horizontally scalable TSDBs organized using hashrings. These Receivers either ingest, replicate, or forward the remote-write requests throughout the hashring to other Receive nodes, then store the metrics in their own local TSDBs and expose them for the Thanos Querier to query. The Receivers also ship blocks to object storage, which are again queryable via the Store Gateway and Querier.

So now we have some notion of what a standard setup looks like, and we can move on to refreshing our memories of what some of the traditional use cases for these metrics are, starting with debugging and SLOs. Most of you will probably already be aware of this, but the single most common use case of metrics is debugging, whether your application is running in prod or somewhere else, by collecting metrics from it and then firing ad hoc queries once you notice an issue, maybe via some firing alerts. Metrics are also used to guarantee service level objectives, as mentioned maybe in your organization's service level agreement, such as 99.9% uptime over a window of 30 days. And there are tools people use for this, such as the Thanos UI for ad hoc querying, and Alertmanager or PagerDuty to get notified when something goes wrong. For SLOs you can have either a combination of alerts and dashboards, or even sophisticated tools like Pyrra, which can keep track of them and visualize them.

Next is what we like to call dashboarding. You might argue that technically it is still debugging, but there are adjacent functionalities to it, namely converting metrics into human-friendly visualizations that allow you to communicate different information to different users. An example of this: maybe you have dashboards for your server health, which are quite useful for diagnosing issues at a single glance for SRE or developer personas; or maybe you have dashboards to represent the success of your product, like the number of users in the last 30 days; or even SLO dashboards that are shared with people higher up your chain. Tools like Grafana are crucial for things like this, and there are certain alternatives popping up, like Perses.
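Since Thanos Query exposes the standard Prometheus HTTP API, the usual Prometheus client libraries work against it unchanged, so that ad hoc and SLO querying can also be done programmatically. Here's a minimal sketch in Go; the Querier address and the metric names are placeholders for your own setup:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Thanos Query speaks the Prometheus HTTP API, so the standard client works.
	client, err := api.NewClient(api.Config{Address: "http://thanos-query.example.com:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	// An example 30-day availability expression for a 99.9% uptime SLO;
	// adjust metric names and labels to whatever your services expose.
	expr := `sum(rate(http_requests_total{code!~"5.."}[30d]))
	       / sum(rate(http_requests_total[30d]))`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	result, warnings, err := promAPI.Query(ctx, expr, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```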
And finally, a large part of why people use Thanos is that they want to store their metric data over longer periods of time and be able to measure different things over longer durations. This involves two things, storage and querying. The storage part is largely handled by shipping TSDB blocks to object storage, which is virtually unlimited. The querying part is slightly trickier, which is why features like compaction and downsampling exist in Thanos, so that you can query years of data without OOMing all your Thanos components. With features like this you might keep track of SLOs over a year, or see the number of sales over a year, and so on.

But what else can you do with metrics stored in Thanos beyond these traditional use cases? I ask this question because you are already paying quite a lot for compute, maybe not so much for storage, and you're also paying to hire people with the knowledge to keep systems like Thanos up and healthy. While good global-scale observability is enough of a motivator to continue doing so, there are things that can be done aside from the operational and SRE functions. Your long-term terabytes of metrics enable a lot of use cases, but a lot of that data just sits there. Sure, you can delete it or compact it, but your return on investment will continue to fade with time. Maybe there are ways for you to use or reuse some of the data you have stored, or to deploy it better, with a more optimized experience for your SRE or developer personas. There is even some room to automate a bunch of this. So we can optimize quite a bit here. Hence, I'll be sharing some Infinity Stones to add to your Thanos gauntlet, many of which tie in with your traditional use cases and try to enhance your monitoring setup.

Starting with the first Infinity Stone, which is telemetry. The term telemetry might sound familiar and is often used interchangeably with monitoring, but telemetry means something slightly different: it refers to the collection of data from remote sources. Now, when I mention telemetry as an Infinity Stone, it might occur to you: wait, I already scrape my applications and store the metrics in Thanos, so what more can I do here? But the important aspect of telemetry is where you get your data from. Yes, you can collect it from your own infrastructure, which is what everyone is already doing, but you can also collect it from your users, i.e. people who might be running the workloads that you offer them, or even the hardware that you built for them. This is something we've been seeing a lot of in the Thanos community, where people have created hardware or software products with inbuilt agents that remote-write metrics back to their Thanos setups, which they can then store and query. An important thing to note here is that a user should be aware of what data you are collecting from them and that you are doing so with their consent. Don't use Thanos creepily.

At Red Hat we've been doing this for a while now, actually even before Prometheus remote write was a thing. We have agents running on all OpenShift clusters out in the wild, and they report metrics back to us. We first authenticate them using a lightweight proxy called Telemeter; Telemeter validates the request, and we ingest it into the Thanos Receive hashring that we run internally.
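To give a feel for what such a telemetry agent does on the wire, here is a minimal hand-rolled remote-write sketch in Go. The endpoint, token, and metric are placeholders, and a real agent would normally reuse Prometheus's own remote-write machinery rather than building requests by hand:

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// One labeled series with one sample, in the Prometheus remote-write format.
	req := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels: []prompb.Label{
				{Name: "__name__", Value: "cluster_health_status"}, // hypothetical metric
				{Name: "cluster_id", Value: "abc-123"},             // hypothetical label
			},
			Samples: []prompb.Sample{
				{Value: 1, Timestamp: time.Now().UnixMilli()},
			},
		}},
	}

	data, err := proto.Marshal(req)
	if err != nil {
		log.Fatal(err)
	}
	compressed := snappy.Encode(nil, data)

	// Placeholder endpoint and token; in a setup like ours, a proxy such as
	// Telemeter validates this before anything reaches Thanos Receive.
	httpReq, err := http.NewRequest(http.MethodPost,
		"https://telemetry.example.com/api/v1/receive", bytes.NewReader(compressed))
	if err != nil {
		log.Fatal(err)
	}
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
	httpReq.Header.Set("Authorization", "Bearer <token>")

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}
```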
We then store this data for up to a year, and we also replicate it elsewhere, which I'll discuss in a bit. Red Hatters can use all of this data across different business units, over long ranges of time, and gain insights that go beyond simple debugging. As for how this unique data is used: you could use it for a variety of things, such as making decisions backed by actual customer data across your portfolio of products. You could conduct user experience research and improve on that, maybe even build better things for your customers. You can also do a bunch of analytics on this data, especially if you have quite long time ranges for it, like multiple years of data.

But what sort of analytics could you do? That brings me to the next Infinity Stone, which is analytics. Prometheus metrics are time series data, with each series being labeled and containing samples, where a sample is basically a combination of a float value and a timestamp. This high-quality, structured, numerical data is ideal for analytics. But what sort of analytics? Well, like most stuff in software engineering, it depends largely on your business context and on the telemetry you've gathered so far. Analytics on long-term metrics can be hugely powerful and valuable to your teams if you know what you want, or you know what your customers want.

At Red Hat, we have close collaborations with certain analytics teams, sharing some of this customer data with data scientists who can figure out how that data might be valuable to us and what we can do with it. The replication job I mentioned on earlier slides was exactly for this: it replicates data over a much longer time range into a read-only Thanos setup, and they then prepare that data for experiments using tools such as Open Data Hub, with quite a complex architecture, for the experiments they plan to run and the models they plan to train. Many of these experiments are very unique and impressive.

Now I will mention some machine learning and AI, which I'm sure a lot of you are tired of hearing about by now. Internally at Red Hat, we used telemetry data from our customers, in particular their cluster health data, and trained models on it. These models can now predict how risky it might be to upgrade a particular OpenShift cluster, and what those risk factors actually are. So in a very real way, we've given customers back a new feature that gets them more reliability, just from connecting telemetry into Thanos. This is a tool customers can now use, with inferences made from the latest telemetry data from their clusters: they can view what their upgrade risks are and have a safer upgrade experience. And there are other unique examples and experiments we've done, like correlating alerts that fire at roughly the same time and share context, to make sense of alerts firing during an incident and what users can do to fix them, basically going with the assumption that they are not a coincidence.

Now, looking at things like this, you might ask: okay, how are we actually collecting this data on OpenShift clusters to enable all this? And that brings me to my third Infinity Stone, which is single-cluster monitoring. This is a bit of a tricky one, as you'd think Thanos probably doesn't have much to do on just a single cluster, right?
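Before moving on: as a sketch of the extraction side of such an analytics pipeline, a range query against Thanos can pull a year of (downsampled) data and flatten it for offline experiments. The Querier address and the metric are placeholders here:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://thanos-query.example.com:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	// A year of data at one-day resolution; with downsampling enabled,
	// Thanos can serve a range like this without touching raw samples.
	r := v1.Range{
		Start: time.Now().AddDate(-1, 0, 0),
		End:   time.Now(),
		Step:  24 * time.Hour,
	}
	// Hypothetical telemetry metric; substitute whatever you actually collect.
	result, _, err := promAPI.QueryRange(context.Background(), `count(cluster_version)`, r)
	if err != nil {
		log.Fatal(err)
	}

	// Flatten into (timestamp, value) pairs, e.g. as CSV for a data-science pipeline.
	if matrix, ok := result.(model.Matrix); ok {
		for _, series := range matrix {
			for _, pt := range series.Values {
				fmt.Printf("%s,%f\n", pt.Timestamp.Time().Format(time.RFC3339), float64(pt.Value))
			}
		}
	}
}
```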
So I'll first explain what I think is the standard scenario, and then what we do for Red Hat OpenShift. Let's say you are a platform engineering team in charge of a few single clusters, and you have the Prometheus Operator installed on them; these clusters are then used by certain other teams to deploy their workloads. Now, you realize that your team cares about cluster health and more of the platform side of things, while the teams using your clusters only care about their workloads. So querying the same Prometheus for metrics and firing the same alerts tends to be confusing. And while you do want to correlate certain platform metrics and certain user workload metrics for some specialized applications, that is not the standard use case.

With the Prometheus Operator, what you can do is just declare a second Prometheus object, and now you have one Prometheus holding data for only the platform and another Prometheus shared by your user workloads. This is the sort of logical split that makes sense. But then you have a missing piece of the puzzle: some of your specialized applications no longer work, as they need data from both the platform metrics and the user workload metrics. So how do we leverage the powers of Thanos to solve this? It's actually kind of easy. You just need to deploy a sidecar for each of your Prometheus instances and then hook them up to a Thanos Querier. Instantly, you have a way to support all your use cases while still maintaining clear views for you and your customer teams. You can also set alerts on a Thanos Ruler if you need global alerting on certain things.

This is actually what we ship with OpenShift at Red Hat as part of the monitoring stack, with a split between platform and user workloads tied together nicely by the Thanos Querier. This logical split also allows us to ship the relevant things we need from these clusters to our own telemetry pipeline without ever touching the user workload metrics.

Now, what do you do if you have more than one cluster to monitor? This is the point to introduce the fourth Infinity Stone, which is multi-cluster use cases. What I'll share here is mostly how we do it at Red Hat, and how you can replicate a setup like ours, as nearly all of it is open source and permissively licensed. This is actually one of the most common use cases of Thanos that we've seen. Essentially, these days people seem to love following a "clusters as cattle" paradigm, where tearing down and bringing up clusters quickly, and tooling to manage all of them, is in trend. The monitoring part of this is where I think Thanos fits super well and shines, and you can easily run Thanos to support dynamic clusters like this.

Our internal monitoring service at Red Hat looks something like this. As you can see, we run a Receive-based setup, and we have some custom open source components under Observatorium. These are lightweight components we use to ensure tenancy on Thanos and easy authentication and authorization for the tenant teams that want to use us. Our tenant teams can choose to remote-write metrics from any Prometheus they want, across all of their multiple clusters and infrastructures; we just ingest it into Receive, and they then get a global view of all their infrastructure and can query metrics from us. And they only get to see their own metrics, not metrics from other teams.
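To make that single-cluster wiring concrete before we go deeper into multi-cluster: below is roughly what it looks like in terms of Thanos CLI flags, a sketch where all the addresses are placeholders for your own environment.

```
# One sidecar per Prometheus (platform and user-workload):
thanos sidecar \
  --prometheus.url=http://prometheus-platform:9090 \
  --tsdb.path=/prometheus \
  --grpc-address=0.0.0.0:10901

thanos sidecar \
  --prometheus.url=http://prometheus-user-workload:9090 \
  --tsdb.path=/prometheus \
  --grpc-address=0.0.0.0:10901

# A single Querier fanning out to both, giving one merged view:
thanos query \
  --http-address=0.0.0.0:9090 \
  --endpoint=prometheus-platform-sidecar:10901 \
  --endpoint=prometheus-user-workload-sidecar:10901
```

Each team keeps querying its own Prometheus for its clean view, while the specialized applications that need both can go through the Querier.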
They can also set their recording and alerting rules with us, which we then sync to the Thanos Ruler, firing alerts to an Alertmanager of their choice. All of this is open sourced under the Observatorium GitHub org as permissively licensed projects, so you can replicate a setup like this for yourself across all of your clusters.

We actually ship this on-prem with Red Hat as well, as something we like to call Advanced Cluster Management. This is something you use when you want to manage multiple OpenShift clusters in a hierarchical way, and it is largely based on the open source Open Cluster Management project that we have. We ship the observability part of it as a Thanos stack similar to the ones we run internally, so customers can achieve the same observability wins that we do. It is based on a hub-and-spoke architecture, where the hub cluster runs the Thanos components: Receive, Query, Store Gateway, and other crucial parts. The spokes have a collector remote-writing metrics to the Thanos Receive hashring present on the hub. By default, we only send the platform metrics, but since all of this is on the customer's cloud or the customer's on-prem, they can choose to also send their own user workload metrics to the hub cluster and get a global view of the entire fleet of clusters. And these clusters are dynamic, by the way; they can be detached, attached, re-attached, whatever.

Now you might be wondering: okay, wait, how do you ship this to customers? Because all of this is manual, right? Is there an operator that sets this up for you? And you'd be right. That brings me to my fifth Infinity Stone, which is an operator. There are a lot of operations with Thanos that are still quite manual and require lots of time and expertise to get right, especially when you have large-scale environments. So it makes sense to automate at least some of these operations, and since most of these things run in Kubernetes, why not leverage the extensibility there?

To preface this, I will say that at this point in time, there is no single perfect Thanos operator that gives you everything you need and works the way you expect it to. For some context: you can use the Prometheus Operator to manage Sidecars and Rulers, but you would still need to set up your own Query, Compactor, and Store Gateway, plus Receive if you choose to use that. So this leaves quite a lot of things to automate. For Red Hat's ACM project, internally we use something called the Observatorium operator, and this is quite a unique project, to be honest. It helps you set up Thanos in a Receive architecture, with some Observatorium add-ons to maintain tenancy, in a very self-contained way. It is based on the Locutus project, which is a paradigm for rendering jsonnet templates, with options to override those templates via CRD configs. It rolls these out on a trigger, like watching changes in a particular CRD that you configure, or even on time-based intervals. Locutus itself is a great concept, and an experiment that challenges a lot of the operators out there that could actually just be something generic, or just a kubectl apply command. And it works well for projects where you don't need to control the entire life cycle of all of your workloads.
But at Red Hat we've realized that this becomes tricky to use with things like Thanos, or things at a similar complexity level, as you have a variety of options across components that various environments might need, and often these options can influence your life cycle. Representing this in static jsonnet templates, not to mention not having control of the life cycle, and without delving into extending Locutus with custom Go code, is quite hard to do. And this paradigm is not something most people in the community are familiar with, as most people who have built operators, or who've interacted with operators, are mostly using things like Kubebuilder and Operator SDK instead of jsonnet as the medium for their rollouts. And Locutus, while a very innovative project, is still a new project, and it can break in very unexpected ways. So the Observatorium operator is functional, and we do use a fork of it for some stuff at Red Hat, but it leaves a lot to be desired and it can be better.

Thus we felt the need to start a new initiative for a Thanos operator, with a special focus on the Receive mode, where we can define special controller reconciliation loops for specific modes of operation across the various Thanos components. It should play well with the existing, beloved Prometheus Operator and extend functionality on top of that. So we've started taking some steps towards a Thanos operator project, whose repo we literally created yesterday, and it is a Kubebuilder-based operator. The roadmap for it looks something like this; again, all of this is TBD. First, Thanos components get installed with some easy mechanisms for setting up Receive hashrings, as well as reusing things such as PrometheusRule CRDs that can sync with components like the Thanos Ruler. Then, as part of future work, maybe at some point we have something like T-shirt sizing, where you pick a size in some CRD based on how many metrics you think you'll be ingesting and querying, and it sizes your Thanos accordingly without you having to do much work. It'll have a variety of options in case you need something custom, perhaps some autoscaling, and maybe even a per-component pause switch so that you can pause the operator to debug certain Thanos components if you need to. We also aim to be pragmatic about this and to only put things into the operator that actually make sense there: things that must be external to Thanos, or are Kube-specific, should be what goes into this particular operator. And to create all this, we plan to leverage the awesome Thanos community, especially those who feel the need for an operator like this to automate some of these manual operations and make managing Thanos easy, and those who would love to work with us.
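To give a sense of the shape such an operator takes, here is a minimal Kubebuilder-style reconciler skeleton in Go. The ThanosReceive CRD type and the reconciliation steps are hypothetical, sketched from the roadmap above rather than taken from the actual repo:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// ThanosReceiveReconciler reconciles a hypothetical ThanosReceive custom
// resource into the StatefulSets, Services, and hashring ConfigMap it implies.
type ThanosReceiveReconciler struct {
	client.Client
}

// Reconcile is called whenever a watched ThanosReceive object changes.
func (r *ThanosReceiveReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)
	logger.Info("reconciling", "name", req.NamespacedName)

	// Sketched steps, not real project code:
	// 1. Fetch the ThanosReceive object named by req.NamespacedName.
	// 2. If a pause field/annotation is set, return early (the "pause switch" idea).
	// 3. Render the desired StatefulSets, Services, and hashring ConfigMap,
	//    e.g. sized from a T-shirt-size field on the CRD.
	// 4. Create or update those objects and report status on the CR.
	return ctrl.Result{}, nil
}

func (r *ThanosReceiveReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&ThanosReceive{}). // hypothetical CRD type from Kubebuilder scaffolding
		Complete(r)
}
```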
And that brings me to my final Infinity Stone, which is community. It's no secret that your project will only be as great as the community behind it. People, and their intentions to build or use something great, matter a lot. To this end, one of the superpowers that I think Thanos gives you is the presence of a great community, one of the greatest ones you'll find. People are ready to help each other, people are kind, and people are ready to share their experiences with you. These people include the team members, mentees, active contributors, our users, or just anyone who is looking to learn and get involved.

The way you get involved is to just join the communication happening on our Slack channels, either #thanos or #thanos-dev, or on our GitHub issues and PRs where all the major work happens, or even decide what the future of Thanos should look like by commenting on those issues. For meeting people in the Thanos community, we try to have bi-weekly office hours, although that is hard to organize at times, and we meet here in person at KubeCon. This time we actually had our first ever ThanosCon, which was amazing: a room chock-full of people who use Thanos and just want to learn more and work with us, and we had some excellent talks from people across various large companies.

We have our mentorship programs as well, which are one of the best ways to onboard onto an OSS project, in my opinion. You pair with two of us, either from the Thanos maintainer team or from some of our coworkers who are actually using Thanos in their day job, and just learn a lot while working on a very cool project, which benefits Thanos and benefits you as well. We have regular retrospectives during these mentorships to ensure you get enough actionable feedback so that you can grow not only technically but as an engineer overall. This is literally how I got hired at Red Hat. So definitely leverage the community, and feel free to reach out to us and work with us.

As a summary, we've learned a few things: using Thanos to get telemetry from customers or other sources that make sense in your context; running analytics on the long-range data you have stored in Thanos; how Thanos can be useful in single-cluster setups; how it can be useful in multi-cluster setups, and how you can get that easily with open source components that you can contribute to; the operator initiative and why automating this makes sense; and how to leverage and get involved with the Thanos community.

So, any questions? You might need to come up to the mic.

Thank you for your presentation. We have seen that there are two main methods to aggregate Prometheus metrics: the first one is with Sidecar and the second one is with Receive and remote write. Which are the use cases? When should we use one method or the other?

So, Receive is something for environments where you have egress-only environments, or just environments where you don't really control the source of the metrics. Maybe you are a platform engineering team and you have separate teams who are dealing with a bunch of other things, but they want to use you as a metrics provider. So you let them control their own Prometheus and just set a remote-write target, and you just ingest the metrics and act as a sort of database that they can send metrics to and query. That is what I would say Receive is for. Sidecar, on the other hand, is something you would use if you have both egress and ingress, as well as more control over the actual Prometheus that's running on any infrastructure of your choice. But yeah, that would be my suggestion. Anyone else?

Thanks for your presentation. Suppose we have a lot of clusters, and all of them require robust monitoring, so we are not deploying only Prometheus on each cluster but a Thanos installation for each cluster. How can we then monitor them together?
Also, we are using some legacy tool that only talks to a Prometheus interface, because of which we had to add another custom piece of code that introduces the federation feature that is present in Prometheus but is currently not present in Thanos.

So, if I understand your question correctly, you are running a Thanos installation on every cluster; is this Thanos installation just components like Query and Sidecar, or more?

Yes, it's a full-stack Thanos installation.

Okay, so my question would be: why are you doing that? In any case, I think what would be ideal for you is to centralize this in some way. You just have Sidecars on these clusters, and then you run a global Querier which federates information from all of these Sidecars. That would be my recommendation as step one. As for your legacy tooling, I'm not really sure how you'd migrate that; I'm not familiar with it, but yeah.

Okay, thanks a lot.

Anyone else? Going once, going twice. Okay, thanks a lot for listening.