Hello everyone. We are super excited to be here at the DevCon conference, because it's actually our first one, I think. And we are here to talk about the cloud native project that we both develop and maintain in the open source space, and it's called Thanos. After this talk, I would like you to know why you need metrics, what Prometheus is and what it solves, what Thanos is and what problems it solves, and, last but not least, how to use Thanos and what design decisions we made while creating this scalable, distributed, and durable metric system.

But before that, a short introduction. My name is Bartek Plotka. I am an engineer on the Red Hat OpenShift monitoring team. I love open source and solving problems, especially in the Go programming language. I'm part of the Prometheus team, and I'm also a co-author of the Thanos project. And with me there is Lucas.

My name is Lucas Servén Marín. I'm also working on the Red Hat monitoring team and everything OpenShift, for the last two or three years. I came from the CoreOS side, and I work on everything that's monitoring, but I'm also really interested in machine learning, distributed systems, and networking.

Our job is focused on building scalable observability platforms and solutions for OpenShift. However, a major part of our work is also maintaining open source projects, including Prometheus and Thanos, on a daily basis. Those projects are focused on enabling monitoring via metrics for infrastructure, services, and applications, for example microservices running on Kubernetes.

So first, it's important to understand what monitoring is. Let's dive in. I hope I don't need to repeat this, but it's useful to reiterate the reasons behind monitoring, right? There is a saying that I learned essentially on the first day of my work as an SRE with a production cluster, in my previous role. It goes like this: running a product without any monitoring is like not running the product at all. It's kind of strict, but what I mean here is that if you want to run something, you will probably be accountable for downtime and bugs. This means you have to have reliable monitoring in order to prove, or at least be aware, whether something is running or not. There's little point in creating a system that has to run 24 hours a day if you can't prove it's working, right?

You can read the same message in the SRE book, which I really recommend if you want to run things in production. It's the SRE book written by folks at Google, and it has this Maslow-style pyramid of infrastructure needs. As you can see, monitoring is the foundation of it. It's very essential; it actually comes before building the system itself, which sits at the top. So it's really important.

Monitoring actually gives you a couple of things. It gives you alerting, you know, detection mechanisms. It gives you the ability to debug deeper if needed. You can see long-term characteristics and learn from them, so you can essentially make data-driven decisions, which is amazing. And finally, you can automate things: you can have auto-scaling or self-healing based on what is happening in your system. So monitoring is essential in this sense.

Now, as you probably know, there are different signals as well, and this is the typical way we divide monitoring.
So we have tracing, which is any bit of data or metadata that is bound to the lifecycle of a request, a transactional request, in the system. Logging, which is essentially discrete events. And finally metrics: samples over a span of time, composed into logical counters, gauges, histograms, things that you can aggregate.

Now, that's not the end; there are other signals as well. For example, continuous profiling, something our team is touching as well. As the name suggests, it allows you to continuously collect application profiles. For a Golang application, for example, you continuously collect Go pprof profiles: heap profiles, goroutine dumps, thread dumps, whatever. And this is a massive help for the developer, because when something is happening, the application is using too much memory, a CPU spike happens, and you get notified, it's actually too late. Things just happened, so it's too late to profile the application. With this method, you can retroactively look at the profiles from a couple of minutes ago to actually find the root cause. So this is amazing as well. And that's not the only one; maybe at some point we'll figure out further signals that will improve overall monitoring capabilities.

And how those signals come together is interesting as well. This is the typical journey of anyone who runs something in production and, let's say, has an incident. First of all, you get an alert, you get notified, and it's most likely triggered by your metric system. Then you go to Grafana, to static dashboards, to locate the problem, or which services are actually affected. If you want to dig more, you can always do ad hoc queries, so you can analyze the data you have, based on the metric signal as well. And to be honest, in 90% of situations that's totally enough: you already know what's going on, you can apply the fix, you can resolve the incident. But sometimes not. Sometimes you need to dig further, so you have log aggregation, number four, where you can narrow the issue a bit more. And if that's not enough, you can move to distributed tracing and check the latencies of certain operations along the path of the request, and that's useful. And then, finally, hopefully, you find the bug, fix it, and deploy the fix. So you're good.

During this talk we'll be focusing on metrics: enabling users to do those things, like alerting and analyzing your data, in the quickest possible manner, with the help of Prometheus and Thanos. Cool.
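As an aside on the continuous-profiling signal mentioned a moment ago: in Go, exposing the profiles that such a collector would scrape is nearly free. A minimal sketch, where the port is an arbitrary choice of ours:

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Heap, goroutine, CPU, and other profiles are now served under
	// /debug/pprof/, ready for a continuous-profiling collector to
	// fetch periodically and retain for later, retroactive analysis.
	http.ListenAndServe(":6060", nil)
}
```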
So, in order to be able to make use of metrics in the first place, we have to instrument our code, no matter which metric system we're using. Whether it's Prometheus or any other system, you always have to instrument your code to actually know what's going on. You have to say: what are the things that I actually care about inside my code, inside my service? What is important to me? What are the things I care about for defining my SLIs, for my SLOs? What are the things that actually matter to my customer, and to us as a team, to help us debug?

So, really quickly, let's go through an example application. Let's say we define some service, written in Golang, that's a server exposing just one single API endpoint at /api, and it does the most minimal amount of work possible. 50% of the time it'll return one status code, status OK, and the rest of the time it'll report status teapot, which is also very important.

So, how would we go and instrument this application? The first thing we'd do is say, okay, let's define some metric that actually matters to us. In this case, we're defining something called http_requests_total, and we can keep track of the total number of requests that we're serving for any single HTTP method, and also the response code we gave on that endpoint. So in this case, we should expect to see some time series that end up with labels code and method, and those values could be 200 or 418, and the methods could be GET, POST, et cetera, right?

So, now that we've defined a metric, we actually want to instrument our code to use it. This is typically pretty easy as well. The only thing we have to do is, first, define some registry and expose this new metric we defined, so that it can be scraped by our metric system. And then, whenever we get a request, we increment our counter: when I serve a 200, I increment with status 200, and I also add a label for the method. And then, once our metric system is actually scraping this application, we should expect to see a payload that looks like this: one time series for every label-value pair of code and method. So here we have code 200, code 418, method GET, and we see, for example, 50 GET requests that ended in 200s and 53 requests that ended in 418s. So that's just an example, and this is how easy it is to do with Prometheus.

When you have a more built-out application, you can also get a lot of different metrics pretty much for free from your client library. For example, if you use the Prometheus Go client library, you get all of these time series that qualify how your Go application is running, like how much time garbage collection is taking, how many goroutines are working in the background, et cetera. And you can define hundreds and hundreds of different time series that are actually useful for you when you're debugging your applications in the first place.
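Putting those pieces together, here's a minimal, self-contained sketch of that toy service using the Prometheus Go client library. The metric name and labels are the ones from the talk; the port, help string, and handler shape are our own illustrative choices:

```go
package main

import (
	"math/rand"
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// http_requests_total, partitioned by response code and HTTP method.
var httpRequestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Count of all HTTP requests served.",
	},
	[]string{"code", "method"},
)

func main() {
	// Register the metric and expose it for scraping on /metrics.
	reg := prometheus.NewRegistry()
	reg.MustRegister(httpRequestsTotal)
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))

	// The toy endpoint: 200 half the time, 418 (teapot) the other half.
	http.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
		code := http.StatusOK
		if rand.Intn(2) == 1 {
			code = http.StatusTeapot
		}
		w.WriteHeader(code)
		// Increment the counter with the code and method labels.
		httpRequestsTotal.WithLabelValues(strconv.Itoa(code), r.Method).Inc()
	})

	http.ListenAndServe(":8080", nil)
}
```

A scrape of /metrics then yields the exposition-format payload described above, one line per code/method combination.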
But before we go any deeper into instrumentation, let's talk a little bit more about Prometheus itself. So, who here has actually used Prometheus? Oh, okay, cool, all of you. That's awesome. I'm glad everyone's using it; we love it as well.

One of the things that we love about it is the fact that it's super simple. So let's cover some of the simple concepts that make Prometheus really good in the first place. Prometheus, first of all, is a super simple application, really stripped down: it's just one single binary that you can run on any machine or on your cluster, and the only thing you have to do is point it at your applications and scrape metrics, right? And besides being super simple, it's also really efficient to use. You can query Prometheus directly using Prometheus' own query language, PromQL, or you can define dashboards that are really complex and have lots of different useful queries built in. Or you can do things like writing alerting rules and recording rules. So, for example, I can get notified when my service is serving too many 500 requests over the course of the last 10 minutes; I can get paged via PagerDuty, or maybe get an SMS on my phone.

One of the things that's also particularly cool about Prometheus is the fact that it scales pretty well in itself. It can handle tens of millions of time series on just one single instance if you give it enough resources. And we're going to talk a little bit more about how this scales further on.

One thing to note about Prometheus is that it uses a pull-based model, and this differentiates it from other monitoring systems, because it also makes it easier to configure. This means that the only thing I need to get my monitoring system up and running is to run my binary and point it at the applications that are actually exposing metrics. My client libraries, for this reason, can also be really simple: they don't have to do anything like retries, backoffs, or buffering of metrics. When client libraries end up taking all of that into account, the complexity of the client library can end up being greater than the complexity of the application I'm actually monitoring. Prometheus, on the other hand, has really simple and stable client libraries that make it super easy to instrument your application. It also means the Prometheus server itself can regulate how much data it's ingesting, and it doesn't have to do anything like throttling or rate-limiting clients that might be spamming the server and overloading your metric system.

So today, let's see, Prometheus has been an open source project for about six years, and it's super well adopted. And one of the things that's beautiful about this is that there are tons and tons of different open source projects that already expose Prometheus metrics. It also means that if some open source project, Memcached for example, doesn't expose Prometheus metrics, there's probably some ecosystem project, an exporter, that can wrap that component and expose metrics for the application that you care about. And it's really easy to write exporters as well.
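As a rough illustration, the "too many 500s" alert mentioned above might look something like this as a Prometheus alerting rule. The rule name, the 5% threshold, and the labels are our own assumptions, not anything prescribed in the talk:

```yaml
# alerts.yaml, loaded via rule_files in prometheus.yml
groups:
- name: example
  rules:
  - alert: TooManyServerErrors
    # Fire if more than 5% of requests were answered with a 5xx
    # over the last 10 minutes.
    expr: |
      sum(rate(http_requests_total{code=~"5.."}[10m]))
        /
      sum(rate(http_requests_total[10m])) > 0.05
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "Service is serving too many 500s"
```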
OK, so we talked about why Prometheus is really great, why it's super cool, and how easy it is to run on just one machine. But there are also a lot of cases where running just one single Prometheus on one node isn't enough, and we have to scale it out. So let's cover why we would even need multiple Prometheus instances. We just said that one single Prometheus server can handle around 10 million time series and is super easy to configure, so why is this ever even an issue?

Let's go back to this example where we have just one single Prometheus scraping a bunch of applications. We said this scales pretty well; we can ingest a ton of different time series. But imagine that I'm actually running in real life. We work on distributed systems, we work on OpenShift every single day, and things break in distributed systems; really unexpected things happen. Imagine, for example, that I'm running on an OpenShift cluster or some Kubernetes cluster, and the node that's running Prometheus goes down or dies for any reason. Or maybe my Prometheus server itself dies for reason X. Now it can't scrape my applications, and I also can't query Prometheus. And maybe right when this node is down, I'm having an incident, and this is exactly when I need to be querying my Prometheus server to get metrics.

Or maybe something much more routine, not a failure or disaster scenario: maybe I actually just want to do a rolling upgrade of my Prometheus, switching from one Prometheus server to another, and I don't want any downtime on my Prometheus. So for these reasons, it can be super useful to run two Prometheus servers side by side. We call this running highly available pairs of Prometheus, where every Prometheus server scrapes every single application I'm running. This way, if one Prometheus server is down, the other one is likely still up. I can also do a rolling upgrade: take one down, upgrade it, then take the other down and upgrade it. And I'll always have high availability for both queries and metrics ingestion. So this is just one example of why we might want to run multiple Prometheus servers. Yep, and we will cover that in more detail. But the thing to remember here is that in this highly available scenario, both Prometheus servers are scraping all of the different applications, and they're also both evaluating the alerting rules.

So cool. But that's not the only reason you might need more than one Prometheus instance, so let's cover other use cases. Let's take the example we showed before. Each of those applications exposes a certain number of metrics. Say we have four applications; within Prometheus, we hold almost 2,000 series. This particular query gives you that information: how many series Prometheus stores for fresh metrics in something called the head block, which is essentially in memory. And this number of series is called cardinality. This is super important, because cardinality tells you how large the Prometheus instance will be, how many resources it will use. It's the main thing that affects your resource consumption.

So let's assume we have this situation. This is fine: with 2,000 series, my Prometheus will run super smoothly, almost no resources will be used, and maybe a one-gigabyte disk is totally enough. But in the real world, we have more applications. With this equation, let's say we have one node, maybe a Kubernetes node, with 30 apps on it, and each of those exposes roughly 300 metrics. Suddenly we have 9,000 series in our Prometheus instance. That's still a low number of metrics; you probably need one gigabyte of memory reserved for this Prometheus, but that's totally fine as well. However, who runs a huge cluster with just one node? That's unrealistic, right? So let's scale it up.

Now things get trickier. With, say, 100 nodes, we suddenly have around 3,000 applications to collect metrics from, so in the end Prometheus collects almost a million series. That's still manageable, totally fine; you need more resources for that, maybe tens of gigabytes of memory, but that will do just fine. But then we scale more. We suddenly have 500 nodes, which is still realistic these days for Kubernetes. That means something like 15,000 applications running, and with our equation, almost 5 million series. Now this gets kind of tricky. You need almost hundreds of gigabytes of memory, and the operational aspects get difficult: you suddenly need a beefier VM or hardware to support this instance. So it gets more pricey and scary, right? And also the disk: you need large SSDs to cover that number of series, essentially.
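For reference, the built-in metric that reports this head-block cardinality, and most likely the query shown on the slide, is a single time series you can run against any one Prometheus:

```promql
# Number of series currently held in this Prometheus's in-memory head block.
prometheus_tsdb_head_series
```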
So OK, let's scale up more. Maybe we have many, many more nodes. Or maybe another number in this equation changes: more pods per application, or a couple of applications that each expose something like 10,000 metrics. This goes beyond the limits, because as the Prometheus team we roughly set 10 million series as the threshold above which an instance stops being easy to operate. At that point, the startup of Prometheus will be slower, because it needs to replay data on startup, and crash recovery and all of that stuff gets super, super tricky. So we really recommend staying around 1 million series per instance; that's the best.

But suddenly, well, we have all those apps, and we want to collect all those metrics. So what's the solution for reducing the number of series per Prometheus, so we can run smaller VMs, right? The solution is called sharding. One form is functional sharding, which means you separate your applications per function, or maybe per namespace, per project, or per customer, and you have separate Prometheus servers scraping each group. This way we are able to limit the size of each Prometheus significantly. And there's a connected approach: if you don't have separate functions, or you don't have information about which function belongs to which application, you can use something like consistent hashing, where you just tell Prometheus how many Prometheus servers there are in your cluster, and it will automatically hash and allocate each application to one Prometheus, in a way that's evenly distributed. That's a nice feature as well, and you have it out of the box. So this is another way of sharding, right?
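The out-of-the-box mechanism alluded to here is hashmod relabeling. A sketch for one of two shards; service discovery details are elided, and each shard runs the same config with a different `regex` index:

```yaml
scrape_configs:
- job_name: apps
  # ... service discovery config here ...
  relabel_configs:
  # Hash every target address into one of 2 buckets.
  - source_labels: [__address__]
    modulus: 2
    target_label: __tmp_hash
    action: hashmod
  # Keep only the targets belonging to this shard (shard 0 of 2).
  - source_labels: [__tmp_hash]
    regex: "0"
    action: keep
```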
But what does that mean? It means, essentially, that again we have another use case for running more than one Prometheus server, and that's really realistic these days. And there's one more: we run more than one cluster, right? These days, OpenShift or any other Kubernetes cluster can be started in minutes, so it's really realistic that companies will have many, many Kubernetes clusters, or other clusters, running around the world, maybe for geolocation reasons, maybe for high availability reasons. This means you need more than one Prometheus. Well, you could say: OK, but why can't I have one Prometheus that collects the data from many clusters? This is not recommended, because your Prometheus, due to the pull model, should be in the same failure domain, the same network, as the services it's monitoring. So remember that. This is very, very important, and it's what allows Prometheus to be very reliable and trustworthy, because we limit the unknowns between the services being monitored and Prometheus. Which means, again: multiple Prometheus instances. At some point you can't escape this scenario.

OK, but Prometheus by design does not work as a distributed system. As the Prometheus team, we deliberately designed it this way: a single-server application, super simple to run, super simple to maintain and understand, a simple one-binary time series database. And to be honest, this makes the project super focused on what matters, and I love it. That's why I love to contribute to and maintain Prometheus. That's why it's super popular and, to me, one of the best projects, with one of the best and friendliest communities. This is because it prefers simplicity over over-engineering. And it allows integrations. However, it stays focused on the things that matter to the project, which means that not everything is solved out of the box. And we'll be talking about exactly this point.

So yeah, let's run through the challenges of having more than one Prometheus. First of all, let's say we want to find the number of series we currently store in each of those Prometheus servers. So we write this query: we take the number of series in the head, and then we sum them by cluster, because we want to know, per cluster, how many series we have in each of those clusters. However, to do that, we'd need to query each of those Prometheus instances separately, because these databases are not replicated in any form; they're not connected in any form. So it's tricky to calculate: we'd have to do it manually, which is not really convenient, right? So there is no global view. This is the first challenge we want to mention: the global view, meaning global aggregation on top of multiple Prometheus instances.
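The global query in question, which no single Prometheus can answer on its own, is just an aggregation of the head-series metric shown earlier:

```promql
# Current head-block cardinality, aggregated per cluster.
sum by (cluster) (prometheus_tsdb_head_series)
```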
Yeah. And then it actually gets even a little bit trickier when we start talking about highly available Prometheus pairs. So in the previous example, we had functionally sharded Prometheus instances in one cluster, this us1 Kubernetes cluster. And now here we have functionally sharded and highly available Prometheus in our Kubernetes cluster. And it gets even trickier, because you see how one of the Prometheus instances says, yeah, we have 200,000 time series, and the other one has zero. What's going on? How do we reconcile these two differences? Why does one Prometheus have so many time series and the other one none? And why did this happen in the first place?

The answer is that there are actually valid reasons why two Prometheus instances in one cluster will have different time series. For example, remember that Prometheus scrapes on an interval, right? It periodically scrapes my applications, maybe every 15 or 30 seconds. And it's totally valid that in the span between scrapes, something happened in my application; maybe the application got restarted. So one Prometheus will show one set of numbers for time series, and the other one will show a different set. One might show certain metrics that existed when it scraped the first time; the other one won't show them. Or maybe my Prometheus got restarted and it just doesn't have anything in its TSDB; I lost something on disk for whatever reason. And it's expected that our highly available Prometheus pairs are going to end up with different numbers of time series. It's simply a fact of the matter.

But how do we reconcile it? When I'm querying these different Prometheus instances, both on the same cluster, for the same time series, I can end up with one Prometheus that says, yeah, I have a bunch of data, but I'm missing it in the first time span; and the other Prometheus says, yeah, I have that data, but I'm missing it in a later time span. So again, it makes sense: maybe one Prometheus was down for some reason, or the network connection between that one node and the application was buggy for reason X during that exact scrape interval. But the question is, how can we reconcile these two graphs? It would be great if one graph could complement the graph from the other Prometheus. So the question is: how can we make our querying of Prometheus highly available? Not just the scraping, but also the querying.

Another question that comes to mind when we're running Prometheus at scale is the retention of our metrics. So normally, when we run Prometheus, we are querying metrics from the last 36 hours. These are the metrics we typically care the most about, because when I'm diagnosing my application, what I care about is: what's happening in my application right now? What is the state of this time series exactly right now? Or maybe I'm debugging an incident from yesterday: OK, what was happening yesterday at 5 AM when we got DDoSed? Or maybe it's what happened 10 minutes ago, so that I can evaluate an alerting rule and get paged or something. So my Prometheus is normally looking at really short time ranges of data. For that reason, Prometheus typically has a time retention for its database of 15 days. That means that after 15 days, the blocks in the Prometheus time series database just fall out of the database and are deleted.

This means it's difficult to look at data from last year, even though there are some cases where analyzing data from last year is really useful: maybe I want to look at long-term trends for my application, maybe I want to analyze my SLIs over a long period of time, or any other reason. But if I'm keeping all of this data on disk, I suddenly have to increase the amount of disk space available to my Prometheus. And disk is really the only external dependency of Prometheus: we want to have some persistent disk. But if I'm keeping a year's worth of data, my dependency goes from being a small disk to a really big disk. And the problem is that this isn't totally scalable: when we scale up the disk sizes, we also have to think about scaling up the operations that we do on these disks. That means backups, resizes, restores of the backups. And this makes it really difficult to manage a Prometheus server with a long retention time.

The other thing that happens is that when I query for a year's worth of data that's scraped every 15 seconds from Prometheus, that's millions and millions of data points. Imagine that I scrape an application every 15 seconds: that's four times a minute, 60 minutes an hour, 24 hours a day, 365 days a year. That's 30-million-plus data points if I'm querying for some metric over an entire year. And number one, this takes a lot of processing power to compute; but it also takes a ton of network bandwidth just to send all this data from my Prometheus server to the client that's querying it. And what's more, 30 million data points is way more than I can even render on my screen. If I had an 8K monitor, I would need 4,000 monitors side by side to render this in full resolution, which I don't have at home.
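The back-of-the-envelope arithmetic behind that figure, where the series-per-metric multiplier is our own illustrative assumption; a single series over a year comes to roughly two million points:

```text
15s scrape interval            →  4 samples per minute
4 × 60 × 24 × 365              ≈  2.1 million samples per series per year
× ~15 series for one metric    ≈  30+ million data points for one query
```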
And so what we really want is to have lower-resolution data for these really long time ranges. So, for example, can anyone point out the difference between the graph on the right in full resolution and the graph on this side in low resolution? And the answer is no, because they look exactly the same. When we have really long time ranges of data and we do downsampling correctly, we can still get really valuable insight into this data, and we can preserve the accuracy even while we're losing some resolution. We can still make the valuable inferences, but we save a ton of processing power, and it also means that we can go from responding to a query in a minute to responding to a query over a year's worth of data in under a second. So we want to have some kind of long-term retention and downsampling in Prometheus, but this doesn't exist right now.

So what have we covered so far? Number one, monitoring is super important: do monitoring. Number two, it's really easy to do with Prometheus, and you should get started with it. We often have to run more than one Prometheus for different reasons, maybe high availability, or just to manage the load, but doing so is pretty difficult and there are a lot of challenges surrounding it. And we also want to be able to keep long-term data, but this is also difficult with Prometheus alone.

Yep. And all those challenges we mentioned with running Prometheus at scale, the lack of a global view, the handling of high availability, the really hard story behind enabling long-term storage for Prometheus, those did not come out of nowhere. This was feedback from many, many users of Prometheus who wanted metrics in this form. At the same time, at my previous place, Improbable, a startup in London, we suffered from those problems as well. So that's why we teamed up with Fabian Reinartz, who was at CoreOS back then and is also a Prometheus maintainer, and together we created the Thanos project, which was fully open source from the start, because we knew everyone wanted to solve the same problems, so we wanted to collaborate on them together. The project is two years old now. We joined the Cloud Native Computing Foundation, I think in the summer, and we were quite lucky to have a really, really amazing and friendly community that helps us maintain this project. We've had that pretty much from the start, so we were super, super lucky. And over those years we managed to build quite a diverse maintainer team as well, most of them from different companies, which is pretty great. And there are many users actually using it right now; the project is free and open source. And more users means more chances for awesome contributions and feedback, so it's pretty great.

And I think the key value of Thanos, the reason it's popular, is that it has a single mission: to scale Prometheus, nothing more. Thanos is also aimed at and focused on simplicity and a low maintenance burden, like Prometheus is. So why is that? Well, we try to reuse as much as possible from the Prometheus codebase. We literally import Prometheus code for querying, for the TSDB storage format, and for many other things, like alerting. So once someone fixes something on the Prometheus side, it immediately fixes things in Thanos, which is great. We are all together in something we call the Prometheus ecosystem: Prometheus and the projects around it. And we also team up with the Prometheus project. So it's great, it's open source.

But let's focus on how to enable Thanos in our case, where we have multiple Prometheus instances. We actually have about seven steps, and I would say pretty simple ones. So let's go.
How do we transform the existing Prometheus setup to solve the global view, HA, and long-term storage? Let's say we have two clusters and three Prometheus instances overall.

Step number one is to add a sidecar. The Thanos sidecar is a small Go binary that has multiple features, but we'll cover two of them here. First of all, it translates the internal Prometheus remote-read API, which exposes the metrics we select, into the Thanos gRPC API called the Store API. It also has some other features; for example, it can reload the configuration of Prometheus, or rules, or alerts, automatically from, say, a Kubernetes ConfigMap or something like that. So to have Thanos enabled, we need the sidecar added to each Prometheus instance, and in Kubernetes that's kind of easy because we have pods.

Now, step number two is to add a stateless Thanos querier component, which essentially performs PromQL, the native Prometheus query language, evaluation at the global level instead of just at the local Prometheus level, right? It connects to Store APIs via gRPC, and it can connect to any Store API: it doesn't know whether it's a sidecar or something else; it's just an API. It can then evaluate the query by fetching the data from the leaf nodes, so it will effectively ask each Prometheus directly. And the querier exposes the same HTTP API as Prometheus does, so all the Grafana dashboards and all the tooling keep working, because they think it's just Prometheus, but it's something more, right? It also allows us to answer our earlier query, how many series there are per cluster, because it has access to the data from both clusters, and it can aggregate, in this particular example sum, all the series from each leaf and provide us with the correct result.

Cool. Now, do you remember the problem with HA across multiple replicas, and how to provide output that makes sense given two replicas that each have some gaps here and there? Well, the querier has a built-in deduplication layer. Thanks to an extra label that you configure on each of those Prometheus replicas, it is able to tell that this data should be the same, and there is a certain penalty algorithm that detects whether there is a gap and then fills the missing data from another replica if there is one. In the end, the process is transparent to users: a user doesn't know and won't even notice that there are multiple replicas running; they will see one signal, which is exactly what we need, because now we can, for example, upgrade the Prometheus version, and no one will notice, finally. So this is how we solved high availability as well as the global view.
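Concretely, the wiring for these first two steps might look roughly like this. The flags are abbreviated and the addresses are illustrative; check `thanos --help` for your version. The `replica` external label is what the querier's deduplication keys on:

```yaml
# prometheus.yml on each replica: external labels the querier can
# use to tell clusters apart and to deduplicate HA pairs.
global:
  external_labels:
    cluster: us1
    replica: A   # "B" on the second replica
```

```shell
# Sidecar next to each Prometheus:
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/var/prometheus \
  --grpc-address=0.0.0.0:10901

# One global querier, fanning out to every Store API endpoint
# and deduplicating across the "replica" label:
thanos query \
  --store=prom-us1-a:10901 \
  --store=prom-us1-b:10901 \
  --store=prom-eu1-a:10901 \
  --query.replica-label=replica
```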
But let's say you want long-term storage retention, right? So what then? In order to solve long-term storage, we introduce what is essentially the only external dependency of Thanos, and that is object storage. What we configure, pretty much, is this: you give the Thanos sidecars credentials to your object storage provider, and right now Thanos supports a ton of different object storage providers, like S3 on AWS, Azure, Google, and also some in China, like Tencent, with Ali Cloud coming. And essentially, what happens is that whenever Prometheus produces a block in its time series database format, the sidecar takes this block of data and uploads it to object storage.

Now, one thing that's important to note here is that this is not some kind of streaming API or anything like that, right? We're essentially doing a lazy upload of all our data, after two hours, when Prometheus actually produces one complete block of data. And although this is not real time, it provides other really beneficial characteristics: we don't need a super-low-latency connection between our object storage and our Prometheus. We can have them in separate regions, or in totally different places, and we just upload this chunk of data when it's ready.

And now, in order to actually make use of this data, we introduce a new Thanos component called the Thanos Store Gateway. The Store Gateway implements the same gRPC API as the sidecar, and the querier talks to it. The important thing about the Store Gateway is that, instead of reading the data it serves queries for directly from the Prometheus API, it reads it from object storage. So now, whenever I query for data that's very recent, my Thanos querier will probably get it from my sidecars, directly from the Prometheus running on the cluster; and when it's looking for data that's more long-term, maybe my year's worth of data, it'll get that data from object storage.

But there's one caveat to this: remember that we were saying that when I look at data from a year ago, I might be getting 30 million data points, and we need some way to solve that. So in order to solve this issue, we introduce one more component, called the Thanos Compactor, and it performs a couple of really important functions. Number one, it does the downsampling we were talking about. I can take really high-resolution data that's sampled every 15 seconds and downsample it to maybe one sample every minute, or even lower, like one an hour. So essentially, when my querier asks, hey, give me this time series from a year ago, I can give it much fewer samples and make the query a lot faster. But the other thing it does, which is very important, is that it combines multiple Prometheus TSDB blocks into fewer, larger TSDB blocks. So instead of having thousands of TSDB blocks that are only two hours long, I'll have fewer TSDB blocks that are two weeks long. And this is a very important optimization for speeding up our queries, because rather than having tons of different indices in every single block, and a lot of different metadata associated with them, we can essentially reduce the size of these indices by combining them into larger blocks, and in this way make our queries a lot faster.
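A sketch of this long-term-storage step, under the same caveats as before. The same bucket configuration file is handed to the sidecar, so it uploads blocks, and to the Store Gateway and Compactor, which read them back:

```yaml
# bucket.yaml -- object storage configuration (S3 shown as an example;
# the bucket name and endpoint are placeholders).
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.us-east-1.amazonaws.com
```

```shell
# The sidecar gets the same file via --objstore.config-file=bucket.yaml
# so it starts uploading completed 2h blocks.

# Store Gateway: serves the Store API out of the bucket.
thanos store --objstore.config-file=bucket.yaml

# Compactor: merges 2h blocks into larger ones and downsamples
# (by default to 5m resolution after 40h, and 1h resolution after 10 days).
thanos compact --data-dir=/tmp/thanos-compact --objstore.config-file=bucket.yaml
```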
One last component that we're going to introduce in this talk is the Thanos Ruler. The purpose of this component is to provide a global view for our alerts and our recording rules. So normally, we always advise with Prometheus that you want to keep your alerting and recording rules local to your Prometheus cluster; that means you want to evaluate the rules on the Prometheus that's running in your cluster. But there are some important cases where you actually do want a global view of your alerts. One good example: maybe you have some alerts that aggregate over multiple clusters. Another one: maybe you want some alerts that take into account a year's worth of data, very long-term metrics. In that case, you want to be able to evaluate these queries against data that's in object storage. Or maybe you want to do some kind of meta-monitoring, where you want to know: is this cluster, or this Prometheus, up or down, things like this. So this is kind of an optional component. But one important thing to note is that it serves the results of these recording rules via the same Store API that the sidecar and the Store Gateway also implement, and that the querier consumes.

So there was something common in all of these architectures, and it wasn't just the Thanos components. One thing that was common in all these architectures, and that is very important, is the Store API. We kept saying this phrase, so what is the Store API? The Store API is a gRPC API that is defined inside the Thanos project. And this, in my opinion, is really one of the key innovations that the Thanos community brings to the ecosystem. It's implemented by all of the Thanos components that allow evaluation of time series or that create time series, like the sidecar, the ruler, the Store Gateway, et cetera. But it's also implemented by some third-party projects that want to act as a bridge between some different format, like OpenTSDB, and Thanos. So now, anybody that implements the Store API can integrate with the Thanos ecosystem. This is what the Store API looks like; anybody that implements this can integrate with Thanos. I think we don't have that much time to talk about it in depth, so let's get to the more interesting things.
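For reference, the service definition looks roughly like this, abridged from the storepb protobuf definitions in the Thanos repository, with request and response fields omitted:

```protobuf
service Store {
  // Metadata about the store: label sets, time range, store type.
  rpc Info(InfoRequest) returns (InfoResponse);

  // Stream raw series (labels plus compressed chunks) matching the request.
  rpc Series(SeriesRequest) returns (stream SeriesResponse);

  // Label introspection, used e.g. for autocompletion in the UI.
  rpc LabelNames(LabelNamesRequest) returns (LabelNamesResponse);
  rpc LabelValues(LabelValuesRequest) returns (LabelValuesResponse);
}
```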
Sure. One more thing you can add as a bonus, and something we're actually still working on, so this is still in progress, is caching. Think about the things you query: maybe you run Grafana with some dashboards, and imagine there was an incident and 20 users want to see exactly what's happening. They all hit the same dashboard, which has maybe 20 different graphs, so that's 20 different API calls to your metric system, and all those users are issuing exactly the same queries. So caching the responses for those particular queries is super, super nice to have. And this is where open source is amazing, because there is a friendly project called Cortex, which is kind of similar, and we're friends with the maintainers of that project as well. They created something called the Cortex query frontend, and it works against Prometheus and Cortex, and also against Thanos. It uses shared caching, for example Memcached, or just in-memory caching, and you can put it on top of Thanos to cache the responses of evaluated queries, so that subsequent queries for exactly the same data respond very, very fast, without extra resource consumption. This is something we're still improving; for example, it doesn't support downsampled data yet, but we'll probably collaborate with that team to make something that works nicely for Thanos. Caching is essential at some point; it allows you to speed things up drastically.

And one more bonus: streaming, or rather pushing. We've been talking about the pull-based model: Prometheus pulls, collecting metrics from applications; the querier pulls data from the leaves. But at some point you need to have some kind of push method; you can't escape it in certain cases. For example, your cluster is very isolated, and you don't allow ingress traffic, only egress traffic, right? At this point, maybe you want to push the data to some centralized place. So for these kinds of, let's say, rare cases, we created another component, the receiver, which essentially exposes a Store API as well, and it integrates with remote write, the protocol that Prometheus uses, which essentially streams all the samples. This requires, again, a fairly low-latency network, and there is some delay before the metrics are visible in your centralized cluster, but this is a trade-off you can decide on when deploying Thanos.

And the key part about all of this is that you can shape this deployment however you want. You can have just the querier and sidecars, and no object storage at all. Maybe you want only certain clusters to ship to object storage, so only the data from those clusters gets long-term retention, and you run a Store Gateway for them, for example. Or maybe you have just one special cluster that requires streaming, pushing the metrics out of the cluster instead of pulling, and then you have a receiver somewhere as well, just for that purpose. So you can mix things together; you can use just one component, or three together. It's very, very flexible, and all thanks to everything speaking the same API in some form.

Cool. So instead of a live demo, we thought it would be nicer to have an interactive tutorial for you to try, in offline mode as well, so we didn't need to prepare something that would essentially go to waste. And there's a nice Katacoda platform that allows this. If you go there, there is actually only one scenario yet, but we are working on more. And if you click on that scenario, you can essentially play with Thanos, because the Katacoda platform allows you to run a Kubernetes cluster or Docker from your browser. You access everything from your browser, you don't need any hardware; it runs on some cloud provider. And there's a tutorial that explains step by step what you should do and how you can play with it. So we really recommend trying it, and we're trying to keep it up to date, so it should use the latest version. So yeah, check it out.

And let's sum up what we learned today. We explained what monitoring is, and that it's very essential, so please monitor your applications, that's for sure, and metrics help a lot; that's our opinion. We explained the cases where and why you need more than one Prometheus, and we explained why that brings some challenges. And finally, we presented Thanos and hopefully explained how, in a few steps, you can gradually transform your Prometheus setup into a Thanos one. And there's a very important note: let's say you don't have monitoring yet and you want to instrument your application. Don't start by deploying Thanos; it's a distributed system, which has some complexity. Just start with Prometheus. Learn first how to use querying, how to use alerting, how to operate it; learn Prometheus well first. And then, once you hit the scale issues and you are sure, OK, this is not enough, then you can install Thanos gradually. It's not like you're replacing Prometheus; no, you're building on top of it, adding more components, gradually installing Thanos. So it's very, very flexible, and that was our design goal behind Thanos as well.
Cool, so that's it from us. You have some links here, and we are also hiring for the monitoring team at Red Hat. Thank you for listening. I think we have time for questions.

So yeah, we support GCS, S3 on AWS, Azure, Tencent, Ali Cloud; we also support Swift and Ceph, so OpenStack Swift, and anything that exposes an S3 API, like Minio and things like that. So yeah, any S3-compatible bucket as well. And if you're missing one, just open a GitHub issue on that; there might be more people interested in it. Cool, next question.

Yeah, good point. I'll repeat the question: essentially, you're asking whether we can provide any user stories, or where you can find user stories behind the different deployment architectures, because there are many components and you can put the Store Gateway in different places, in a centralized place or maybe in the client cluster. So the question was, what's the recommendation, I guess, and where can you find and read about those? I would recommend the blog posts: if you go to our Thanos page, or the GitHub page, there are links to blog posts from different companies about how they use and architect their deployments. And on the particular question about the Store Gateway: yes, I think it's really reasonable to have a Store Gateway in some centralized place, in the same region as the bucket, because the latency might matter here. So that's our recommendation, and we run Thanos in many places in production as well; we are on call for it too. I would suggest that.

Yeah, so the question is: is there any roadmap to integrate OpenShift and Thanos, or put Thanos on OpenShift? And the answer is yes. Actually, coming in the next release of OpenShift, we're having user workload monitoring. This puts Thanos on the cluster, providing functionally sharded Prometheus instances on the cluster: one set for cluster monitoring, and then another set for user workload monitoring. And then you can still have a global view using a Thanos Querier. This means that now you can actually say: I have a bunch of applications that I want to monitor; OpenShift, can you please provide me with a Prometheus to monitor my applications, using the Prometheus Operator and all the nice ServiceMonitor CRDs and things like that. So you get that for free, and you can still use Thanos to have one single place to query everything. As far as I know, there's no roadmap to deliver Thanos itself, like via OLM or some kind of operator-installed Thanos, but the basic sidecar and querying functionality is coming in an upcoming release of OpenShift. And you have the Prometheus Operator as well, which deploys the Thanos sidecar and ruler out of the box. So it's coming soon, and it's super easy to deploy. Yeah, so that's integrated: the user workload monitoring is integrated in the cluster-monitoring-operator. But any more advanced features of Thanos are not yet on the roadmap; however, you can integrate those on your own as well.

Yeah, Alex? Yeah, so I think the question is: we've been talking about the number of series and how it affects resource consumption; what about the number of samples, and the retention we keep the data for? And that's a very valid question. The answer is that there are two dimensions: time, so the number of samples, and the number of series, the cardinality, right?
And the problem with cardinality is that this dimension is not really compressible. The samples, however, are very compressible; there are really, really nice methods to compress them. Yes, we chunk them, and we have this very special Gorilla-style compression format, so it's actually not a major addition to resource usage if you add more samples or fewer. At some point, of course, it matters whether you store, you know, one day of retention or one year of retention. But in fact, most of the time, when you add more samples, say by scraping at a higher frequency, those samples become more compressible, because they are more similar, and because of the encoding of the samples it actually compresses very well. And the thing that really becomes the limitation, the vast majority of the time in Prometheus, is not the number of samples but the number of time series. Absolutely. It's true that samples can become a limiting factor when we're talking about sending millions of samples over the wire for a long-term query; that can become a limitation for sure.

Cool, all right, I think we're done. Please find us later, we're happy to talk. Thanks!