So, hello everyone, welcome to our talk at KubeCon. It's called "Thanos: the answer to scale Prometheus and make it highly available." My name is Giedrius, and with me I have Prem. We will talk about Thanos, which is a global, scalable, highly available Prometheus solution that gives you low-overhead, long-term metric retention. First we will do a short introduction about what Prometheus is and what scaling issues it has, and then we are going to talk about what has been added to Thanos since the last presentation at KubeCon. So first, let us introduce ourselves.

Hello, my name is Prem. I am currently a software engineer intern at Red Hat, working with the observability platform team. I have been an open source contributor to Thanos for quite some time, since last year when I started my GSoC project with them. Yeah, that's all.

And my name is Giedrius. I'm a site reliability engineer at Vinted. Vinted might not be such a familiar name to you; it's essentially a second-hand fashion marketplace. I'm part of the observability team there, and I work on metrics, traces, and logging. I'm also an open source software contributor and fan. I started contributing to the Thanos project a few years ago as part of my work at my previous workplace, and ever since then I have been involved in the project. I also contribute to and maintain a Go library for interacting with Grafana, which lets you automate various Grafana actions in a type-safe manner. So if you ever want to do such a thing, go to github.com/grafana-tools/sdk. I also occasionally write about software engineering related topics on my blog.

And now we will jump to the slide about the Thanos project. To give you a short history lesson: the project was started at Improbable in November of 2017 by Bartek and Fabian, and it was open source from the get-go. Over the years, it has grown quite a lot.
Nowadays, it's a CNCF incubating project, just one step away from being a graduated project. Being an incubating project means that Thanos follows certain standards: for example, we are vendor neutral, we have a public roadmap, we review pull requests quite quickly, and so on. On top of that, we have quite a vibrant community. Since the last KubeCon presentation, we've gained about 2000 new GitHub stars and 75 new contributors, which means that we've grown about two times since 2019, which is pretty insane if you ask me. We've also gained 839 new Slack users. So if you have a question about anything related to Thanos, there is a very high chance that someone will be able to help you, and not just from the maintainers team. We have lots of maintainers from different companies, and on top of that, we have transparent governance. That means no single company can take over the whole project by pushing it in one direction that it is interested in, and it means that really anyone can come forward with any kind of idea and we will consider that idea. And what's really awesome, I think, is that we are part of the Prometheus ecosystem. We don't reimplement lots of code; we tend to reuse code from other projects in the Prometheus ecosystem. For example, we reuse the whole PromQL engine from Prometheus, and we reuse the end-to-end testing code from Cortex. And, yeah, it's just awesome. Finally, on the thanos.io webpage, we have a list of companies which have publicly announced that they are using Thanos in production. You can see that we have really lots of popular names using Thanos. So it's not just the sheer number of users in our community; companies are also publicly announcing that they're using Thanos. That's how awesome it is. So let's do a short recap on what Prometheus is,
what downsides a single-node Prometheus has, and how Thanos can solve those problems. Yeah, Thanos really loves Prometheus. This diagram shows you the different components which make up a single-node Prometheus. We have the rule and alert engine, which periodically evaluates expressions by using the query engine and the local storage, and if certain thresholds have been exceeded, the alert engine sends those alerts to Alertmanager. The query engine is, of course, responsible for the queries; Grafana queries come to it. The scrape engine scrapes different endpoints. In this case, these are services 1, 2, and 3, and they all expose a metrics endpoint in turn. And finally, you have the compactor, which makes the storage more efficient. But what happens if that Prometheus node just goes down? Well, in this case, it's a pretty bad situation to be in. And it could be even worse if the local storage of that node got destroyed; then you would just lose all of the metrics of that single Prometheus instance. How would you solve this problem in the vanilla Prometheus world? Well, you would scrape the same metrics from one more Prometheus instance. But that has its own downsides. For example, the same time series would be more or less stored on both nodes, which means duplicated data, and you would have to retrieve two times more data when querying, which is also pretty bad. And what if, for example, one of the Prometheus instances were not able to scrape one or more services anymore? Then you would be in a very bad situation. In the Prometheus world, without Thanos or Cortex, you would most likely solve this issue by having another Prometheus instance somewhere, in what's called federation mode, in which one Prometheus instance scrapes metrics from other Prometheus instances.
Then all of the metrics from the federated nodes would have to be transferred over to that one central Prometheus instance, and you would just run into the same problems, because you would have the same data two times. And if you wanted to avoid sending the same data twice, you would have to use some kind of recording rules and only send pre-aggregated data over to the Prometheus which is in federation mode, and so on. So it's kind of hard to solve those issues by just using Prometheus itself. And that's where Thanos comes in. Prem is going to tell you how Thanos solves those issues.

Yeah, so we are going to talk about some features of Thanos. Thanos allows you to have a global query view over your data, so you can run a single query which touches all of your infrastructure metrics. You can have unlimited retention with the help of object storage. Thanos exposes a Prometheus-compatible API, so you can put those endpoints anywhere a Prometheus query is supported. And Thanos also has support for downsampling and compacting your metric data. So yeah, this is what a typical high-availability setup looks like. You have multiple Prometheus instances running, which are scraping metrics from some cluster. You would run a sidecar along each of them, a component called Thanos Sidecar, and you would run a central Thanos Query. Whenever you query this Thanos Query for some data, it does a gRPC API call to all the sidecar instances to collect that data, and the Thanos sidecars proxy this to the Prometheus running there. Ultimately, all the data gets collected into Thanos Query, the query gets executed, and the result gets returned to, for example, Grafana to show those dashboards.
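As a rough, hedged sketch of the setup just described (flag names are taken from recent Thanos releases and may differ in your version; all addresses, paths, and host names here are illustrative, not from the talk):

```shell
# Each Prometheus replica carries an external label in its own config
# so the copies of a series can be told apart later, e.g.:
#   global:
#     external_labels:
#       replica: "0"     # the second replica uses replica: "1"

# Run a Thanos Sidecar next to each Prometheus; it exposes the local
# TSDB over Thanos's gRPC Store API.
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/var/prometheus \
  --grpc-address=0.0.0.0:10901

# Run one central Thanos Query that fans out to all sidecars over gRPC
# and deduplicates series that differ only in the "replica" label.
thanos query \
  --http-address=0.0.0.0:9090 \
  --store=prometheus-0:10901 \
  --store=prometheus-1:10901 \
  --query.replica-label=replica
```

Since the query API is Prometheus-compatible, Grafana can then point at Thanos Query's HTTP address exactly as it would at a plain Prometheus.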
Thanos also does deduplication on the fly, so that if two Prometheuses are scraping the same cluster, the data gets deduplicated automatically. Now, to add unlimited retention, Thanos Sidecar supports uploading Prometheus blocks to object storage: all the sidecars upload their blocks to object storage for longer retention. Moving on to the next step, we need some way to query this data from object storage. This is where the third component, Thanos Store (the store gateway), comes in. Thanos Store is a component that sits between Thanos Query and the object storage. Whenever Thanos Query requires some data, it sends a query to Thanos Store, just like it sends a query to the sidecar, and Thanos Store gets that data from object storage and returns it. There is another component called Thanos Compactor, a standalone component that runs in the background, regularly compacting and downsampling all those blocks in object storage. Compaction and downsampling improve query performance across the Thanos fleet. Now, to reduce the number of requests to object storage, we can run a Memcached server to cache those chunks and blocks. Memcached not only improves performance, it also reduces the number of requests to object storage, which in turn can reduce the cost of the whole setup. That's what a typical Thanos high-availability setup looks like. If you have any more questions, you can come to the Thanos channel on the CNCF Slack or to thanos.io. Next, we are going to talk about some new features in Thanos. The first feature we are going to talk about is support for deleting particular series. You could always delete a whole block of data by just going into object storage and deleting it, but deleting a particular series was not possible before in Thanos. So we have introduced a new tool, thanos tools bucket rewrite, which, given a block ID, can rewrite that particular block.
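As a hedged sketch of what invoking that tool might look like (flag names are from recent Thanos releases and may vary between versions; the block ID, bucket settings, and series matcher below are purely illustrative):

```shell
# bucket.yml -- object storage client config, e.g. for S3:
#   type: S3
#   config:
#     bucket: thanos-metrics
#     endpoint: s3.us-east-1.amazonaws.com
#
# deletions.yml -- which series to drop from the block:
#   - matchers: '{__name__="http_requests_total", user="alice"}'

# Rewrite a single block (identified by its ULID), dropping the matched
# series; a new block is written and the old one is marked for deletion.
thanos tools bucket rewrite \
  --objstore.config-file=bucket.yml \
  --rewrite.to-delete-config-file=deletions.yml \
  --id=01ARZ3NDEKTSV4RRFFQ69G5FAV \
  --no-dry-run
```

Without the last flag the tool runs in dry-run mode and only reports what it would delete, which is the safer first step.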
So if you want to remove a time series from a particular block, you can just rewrite that block. The rewrite tool will create a new block, since Prometheus blocks are immutable. This new block gets uploaded to object storage, and the older block gets marked as deleted, which will then get cleaned up by the compactor. This tool can help you in situations where you want to delete some data due to GDPR or similar laws. The next feature we are going to talk about is Query Frontend. As Giedrius told you before, Thanos and Cortex have a lot of collaboration going on, and Query Frontend was a component of the Cortex project which has since been contributed to Thanos as well. Query Frontend is a querying layer on top of Thanos Query; with the help of Memcached it can cache query results. Also, whenever a query comes to Query Frontend, it splits that query by time window into multiple pieces and runs those pieces in parallel to improve performance. Those pieces are then cached in Memcached to further improve query performance. And I wanted to add that it not only caches the queries, it also caches the responses for label name and label value requests. Yeah, so those label names and values get cached there as well, which improves the overall query time from graphing frontends like Grafana; you get very quick drop-downs of label name suggestions and so on. Back to you, sorry, Prem. Yeah, so those were the major features we added after the last KubeCon talk. Now we are going to talk about what's in store for the future of Thanos. Recently there have been a lot of improvements around the multi-tenancy capabilities of Thanos. This includes making the components tenant-aware, so that we can introduce features like per-tenant rate limiting into Thanos, or tenant-aware resource usage observability.
This also builds upon the existing multi-tenant capabilities of Thanos Receive, which is already multi-tenant. That brings us to the next point: we are doing some work towards maturing the Thanos Receive component. This work includes making Thanos Receive easier to operate. There has been recent work around splitting Thanos Receive into two components so that it is easier to operate, and around making it more resilient by introducing things like shuffle sharding, improving its memory footprint, etc. We are open to all your ideas as well: you can go to the Thanos GitHub repo and file a feature request. We also run Thanos contributor office hours every week. In these meetings you can come to a video call and discuss Thanos-related projects, or, if you are contributing and are blocked somewhere, you can discuss those things as well. We organize these contributor office hours in both Europe and USA time zones, so no matter where you are, you can meet us. The maintainers join these office hours regularly to get feedback and answer questions. Yeah. The next thing we are going to talk about is mentorship programs. Thanos has been continuously participating in the LFX and GSoC mentorships, and there has been a lot of cool stuff implemented by the mentees. I was also a mentee for Thanos during GSoC 2020. Some currently ongoing projects in Thanos as part of the mentorship programs are: multi-tenant instrumentation in Thanos, which involves answering questions like which tenant used how much of Thanos's resources, and which is part of the bigger multi-tenancy movement going on around Thanos. The next project is to make the ruler stateless, which would allow making the ruler leaner and easier on memory footprint. The next project is vertical block sharding.
In bigger deployments, Thanos blocks can grow a lot in size due to compaction, and in those cases we somehow need to limit the size of a block and maybe split bigger blocks into smaller ones. This is the project to do exactly that. The next project is implementing the exemplars API inside Thanos. Due to the recent addition of exemplar support in Prometheus, Thanos is now planning to extend its internal API to support exemplars as well, so we can expose the same API from Thanos. So if you are looking into using exemplars with Prometheus and you are using Thanos, you will soon be able to use exemplars with Thanos as well. So thank you, that was all. If you have any questions, you can ask, and if we are not able to answer them now, you can always go to thanos.io and come to our Slack channel to ask.