Alright. Thank you everyone for coming. We did not expect such a full room. This session will be both an introduction and a deep dive into Thanos. In the first part we're going to do an overview of Thanos, what it is and how you run it, and in the second part we'll cover some of the improvements that have been made recently, and then we'll talk about some more exciting stuff, some concrete use cases, and so on. Maybe just a quick show of hands: how many people here run Thanos in production already? Okay, that's quite a few. For you, the first part might be a bit boring, but maybe you'll also learn something new.

Before we get started, a quick round of introductions. My name is Philip. I'm a production engineer at Shopify, where I work on the infrastructure team. I'm also part of the Thanos maintainer team, and in the past I helped maintain Prometheus Operator and kube-state-metrics. And with me today is Saswata.

Yes, my name is Saswata Mukherjee. I'm a software engineer at Red Hat, where I work on a monitoring platform. I'm also a maintainer of Thanos and was previously a GSoC mentee under the same project. I also help maintain a couple of CNCF-adjacent projects like mdox and Observatorium. You can find me as saswatamcode on Twitter, GitHub, or pretty much anywhere else.

Alright. Okay, so now, getting into the good stuff. It's hard to talk about Thanos without mentioning Prometheus first, and I expect most of you to have at least heard of Prometheus. It's a standalone monitoring server, or system, that you drop into your environment very close to your applications. Prometheus scrapes metrics from applications and then stores them locally. Prometheus doesn't have any external dependencies, which means you can't offload metrics data into an external database; as a result, it has to store those metrics on local disk. It also has a very flexible query language, which we call PromQL. Using that query language we can also write alerts, which are constantly evaluated and let us know when something goes wrong: when, for example, a threshold is violated, Prometheus can fire an alert.

If we zoom into the Prometheus design and what it's composed of, we see four key modules in the Prometheus code base which are very relevant for what Thanos is and for what we'll be speaking about today. We have the rule manager, which executes alerting rules. We have the query engine, which executes PromQL. Then we have the time series database (TSDB), which Prometheus uses to store metrics on disk. And it also has something called a compactor, which optimizes the layout of metrics on disk over time.

And so if we were to use Prometheus to monitor various environments, we would have to deploy at least one Prometheus instance per environment; a single Prometheus cannot handle a large multi-environment setup. This can work: Prometheus will take care of each individual cluster, for example. But very quickly we see that we won't be able to get what we call a global view of the data. We won't be able to query metrics across environments, across Kubernetes clusters, or maybe across namespaces. We also won't be able to retain data for a long period of time because of the local storage constraints: disks can be hard to move around, and they can be expensive if we want to retain data for, say, a year.
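(Editor's note: to ground the scrape model described above, here is a minimal sketch, not from the talk, of an application exposing a /metrics endpoint for Prometheus to scrape using the standard client_golang library; the metric name and port are arbitrary.)

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter the app increments; Prometheus periodically scrapes its value.
var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "Total HTTP requests served.",
	},
	[]string{"path"},
)

func main() {
	prometheus.MustRegister(requestsTotal)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.WithLabelValues(r.URL.Path).Inc()
		w.Write([]byte("hello"))
	})

	// Prometheus is configured to scrape this endpoint on an interval.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```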
There's also the matter of resolution: Prometheus might scrape, say, every 30 seconds, and if we want to execute a query across three or six months of data, that resolution is very dense and very high, so a query over that duration is going to be very challenging to execute.

And so this is where Thanos comes in: these are the gaps that Thanos as a system tries to fill. It has features such as a global view and long-term retention. It has downsampling built in, so it can reduce the resolution of scraped samples over time. It also has some nice multi-tenancy features, and so on.

Let's start with how Thanos solves the global view problem. Before we explain how it does that, it's worth mentioning what we mean by a global view. If we have a set of Prometheus instances and we execute a query, we really want that query to be executed across the entire fleet. So if we say "give me a sum over a given metric," we want the sum across all Prometheus instances. The way Thanos does this is it basically takes the PromQL query engine module out of Prometheus and bundles it into a standalone service that can be run and scaled independently. In addition to this, we also define something that we call the Store API. As you can see here, the second RPC, the Series RPC, is what the querier uses to request time series data from any component. Since Prometheus doesn't actually understand this API, in the Thanos ecosystem we have something called the Thanos sidecar, which we deploy next to each Prometheus instance. The querier can then talk to the sidecar, and the sidecar gets the data out of Prometheus. By doing this, the querier can connect to multiple Prometheus instances. So to get a global view, we end up with something like this: multiple Prometheus instances, a sidecar for each one, and a querier connected to all of those sidecars. The querier is now responsible for executing queries.

And by having this global view over the data, we can also have global alerting and global recording rules. For example, if we want something like the global error rate across multiple environments, or we have an SLO on the P90 latency, we can do that with something we call the Thanos ruler. Yet again, we pull a component out of Prometheus: we take the rule engine, reuse a lot of its code, and package it into the Thanos ruler. The ruler is connected to the querier, and through the querier it has access to the entire dataset. So in our baby example from before, we can now deploy a Thanos ruler and connect it to the querier; and because the querier has a global view over the whole dataset, the ruler, by extension, can evaluate alerting rules across the entire dataset.

Alright. We also mentioned that it's challenging to store data on disk for longer periods of time, and that it's hard to move disks around. For this purpose, the Thanos sidecar can be configured to upload data from Prometheus into object storage. Prometheus creates a block of data on disk every two hours by default; the sidecar can upload those blocks to object storage, and then the Thanos store gateway can serve the data from object storage directly.
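(Editor's note: as a rough sketch of the Store API idea — simplified; the real definitions live in Thanos' storepb protobuf package and carry more RPCs, fields, and a gRPC streaming interface.)

```go
package store

import "context"

// Conceptual, trimmed-down stand-ins for the real storepb types.
type (
	LabelMatcher  struct{ Name, Value string } // e.g. job="api"
	InfoResponse  struct{ MinTime, MaxTime int64 }
	SeriesRequest struct {
		MinTime, MaxTime int64
		Matchers         []LabelMatcher
	}
	// Series holds one time series: its labels plus compressed chunks.
	Series struct{ /* labels, chunks */ }
)

// StoreAPI is what every Thanos data source implements (sidecar, store
// gateway, receiver, ruler), so the querier can fan out to all of them
// uniformly.
type StoreAPI interface {
	// Info advertises which time range this store can serve, letting
	// the querier skip stores that are irrelevant to a query.
	Info(ctx context.Context) (*InfoResponse, error)
	// Series streams back all series matching the request; this is the
	// "Series RPC" the querier uses to pull data from each component.
	Series(ctx context.Context, r *SeriesRequest, send func(Series) error) error
}
```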
And so both the store gateway and the sidecar implement the same Store API, so the querier can talk to both of them. Again, extending our example from before: we would configure both sidecars to upload data to, for example, an S3 or GCS bucket, and we would deploy the store gateway, which exposes the same Store API as the sidecar. The querier in this case gets the latest data from Prometheus through the sidecars, and gets historical data from the store gateway, which talks directly to object storage.

There are two important things to note here. The store gateway doesn't actually download whole blocks from object storage; it downloads a very small part of each block, which we call an index header, and then uses that index header to make further requests to object storage on demand as data is queried. So we don't need massive disks: the data stays in object storage, and we only download very small parts of it.

Also, both the store gateway and what we call the compactor, which optimizes this data over time, have a UI. If you want a visual representation of what your data looks like, you can open it. Here we see what we call blocks, each covering a time range: some are two-day blocks, others are seven-day blocks, and so on. As data comes in, the compactor merges these blocks, creating bigger blocks out of smaller ones. When the compactor is doing its job, we get to something like this; but there are cases where the compactor falls behind, and we might see a situation with a bunch of small blocks in object storage. It's worth keeping in mind that a situation like this is going to increase costs against object storage, simply because we have to make more API calls. So monitoring the compactor, making sure it's doing its job, is important both for cost control and for making sure queries perform well.

Okay, and finally, the last component that's been added to the Thanos ecosystem is something we call the Thanos receiver. The reason it exists is that the sidecar model can be problematic in certain cases. If we have, say, hundreds of clusters, we might not be able to open ports to all of those sidecars or all of those Prometheus instances running all over the place. So there are situations where a global querier at the root simply cannot connect to many, many different sidecars. For this reason, we introduced the Thanos receiver, a component that metrics can be pushed to; it's a push-based approach, essentially. What we have here on the right-hand side might be something like Prometheus agents, which are fairly lightweight: they scrape metrics and remote-write them to the receiver component. And then the querier, just like it can query the sidecar, can also query the receiver. By doing this, we go from a pull-based approach to a push-based approach. Finally, these approaches are not mutually exclusive; you can use both at the same time, pulling metrics from certain places and pushing metrics from others. And that's a fairly common scenario that people have.
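(Editor's note: for a sense of what the push model looks like on the wire, a minimal illustrative sketch of a remote-write request like the ones an agent sends to a receiver; the endpoint, port, and tenant header below reflect common Thanos defaults but should be treated as assumptions.)

```go
package main

import (
	"bytes"
	"log"
	"net/http"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	// One series with one sample, in the Prometheus remote-write format.
	wr := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels:  []prompb.Label{{Name: "__name__", Value: "demo_up"}},
			Samples: []prompb.Sample{{Value: 1, Timestamp: 1700000000000}},
		}},
	}
	raw, err := wr.Marshal() // protobuf-encode the request
	if err != nil {
		log.Fatal(err)
	}
	body := snappy.Encode(nil, raw) // remote-write bodies are snappy-compressed

	req, _ := http.NewRequest(http.MethodPost,
		"http://thanos-receive.example:19291/api/v1/receive", bytes.NewReader(body))
	req.Header.Set("Content-Encoding", "snappy")
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")
	req.Header.Set("THANOS-TENANT", "team-a") // tenant header; name is configurable

	if _, err := http.DefaultClient.Do(req); err != nil {
		log.Fatal(err)
	}
}
```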
We also have a similar setup at Shopify, where we utilize all of these components in a fairly complex deployment. Okay, so that was the overview, the introduction to Thanos, and now we'll look into some of the recent improvements.

Okay, so with that detailed introduction out of the way, let's talk a bit about the several new features, improvements, and optimizations that the awesome Thanos community has been cooking up behind the scenes. Starting with the diskless store gateway. As Philip mentioned in the introduction, the Store API acts as a sort of glue to fetch data from the various Thanos backends. The Thanos store gateway is one such backend: it basically acts as a caching layer and exposes the gRPC Store API for fetching chunks from Prometheus-format TSDB blocks in any object storage.

We use three distinct caches to make fetching and querying data from a historical TSDB much more performant. Firstly, we cache the meta.json file for each TSDB block on disk, alongside a small portion of the TSDB index file which we call the index header; the index header is basically a truncated TSDB index. Apart from that, we also maintain an index cache and a caching bucket to cache the postings, series, and chunks from the TSDB blocks during the Store API Series RPC calls.

But recently, with larger and larger Thanos installations which have hundreds of historical TSDB blocks in object storage and also need higher availability via multiple replicas, we have observed that caching the index header files on disk becomes problematic and simply unsustainable. Startup time slows down, as the gateway needs to cache all the index header files during startup, and the disk fills up very quickly. Now, you could throw money at the problem and increase the size of the disk, but that is expensive and not very practical when you need to run multiple replicas. Fortunately, there is now a solution for this, which is to disable caching the index headers on disk completely. That is, the store gateway will now be stateless: it will not cache the index headers on disk, nor will it load them from there. This doesn't change the store gateway's internal in-memory representation; it just creates it on the fly during a particular query instead of referring back to the disk. So with this new feature enabled, the store gateway becomes fully stateless, and this makes it possible to run the store gateway over hundreds of TSDB blocks in object storage without having to pay for ultra-expensive SSDs. And we still cache postings, series, and chunks as we usually do.
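(Editor's note: a purely conceptual sketch of the "stateless" idea, with invented type and function names — not the actual Thanos code. The point is that the index-header summary is built in memory on demand from the bucket, instead of being persisted to local disk at startup.)

```go
package store

import "sync"

// indexHeader stands in for the small summary of a block's TSDB index
// (symbols, posting offsets) that the store gateway needs for lookups.
type indexHeader struct{ /* symbols, posting offset table, ... */ }

// lazyHeaders builds index headers in memory, on demand, instead of
// writing them all to disk during startup.
type lazyHeaders struct {
	mu     sync.Mutex
	loaded map[string]*indexHeader // keyed by block ID
	// fetch range-reads just the needed parts of the block's index
	// from the object storage bucket.
	fetch func(blockID string) (*indexHeader, error)
}

func (l *lazyHeaders) get(blockID string) (*indexHeader, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.loaded == nil {
		l.loaded = map[string]*indexHeader{}
	}
	if h, ok := l.loaded[blockID]; ok {
		return h, nil // already built during an earlier query
	}
	h, err := l.fetch(blockID) // built on the fly, kept only in memory
	if err != nil {
		return nil, err
	}
	l.loaded[blockID] = h
	return h, nil
}
```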
The next improvement we'd like to talk about is the quality-of-service improvements for Thanos, especially for monitoring-as-a-service use cases where you are serving multiple different tenants. The meaning of quality of service in SaaS terms is simply the ability to provide different priorities to different users or data flows, and to guarantee a certain level of performance. We want to make sure that a single tenant or user does not deteriorate the service for other tenants. In the context of Thanos, there are basically two data flows for end users that we want to protect from disruption: the read path and the write path. So let's delve into them individually.

Consider a scenario in which the receive component is deployed in a hashring configuration, with multiple different tenants remote-writing metrics to it. Maybe you have scaled the receive nodes to only tolerate around 100k series from each tenant. Now, if a certain tenant starts to misbehave and sends way more metrics than that, maybe a million series, you start to face disruptions in the form of OOMs, and the receive hashring will start to lose stability. You then have a couple of options for mitigating that. You could use an HPA or VPA setup that scales your Thanos hashring up to a certain point, but if a tenant writes even more metrics than that, you would OOM and crash-loop again. You might want to scale infinitely, but that is not very practical or economical.

So the way we want to tackle this problem is via granular per-tenant remote-write limits, which allow us to keep the write path completely disruption-free by ensuring that we only ingest what we allow. With these limits, every time a protobuf-encoded remote-write request comes in, we check whether it is under our request size, request samples, and request series limits. We also ensure that the tenant the remote-write request is coming from is still below a certain number of head, or active, series; only then do we allow it to be ingested into that particular tenant's TSDB. The configuration for these limits lets you set global defaults for the limiting config and override those values per tenant, on every single receive node. For the active series limit, we directly query any Prometheus-compatible meta-monitoring endpoint to get the total number of head series for each tenant.

Now, coming to the query path: the way it works is that a user fires off a PromQL query to a querier, which then calls several Store APIs and fetches data using the Series RPC call. If the Store API happens to be a store gateway, it also downloads data from object storage to fulfill that Series RPC call. But in the case of a few excessive queries that end up selecting too much data or too many samples, we end up with a couple of different bottlenecks. The querier can start to OOM and crash if it ends up fetching too many samples and having to run PromQL functions on top of them to fulfill multiple queries. The store gateway might also start to OOM if it ends up having to download a lot of TSDB data from object storage to fulfill a single Series RPC call. And as with everything, you could throw money at the problem again and scale up vertically and horizontally, but that is not really economical or practical.

So we ended up introducing limits at the Store API level and the store gateway level that allow for more stability in these spots. With these, we can set limits on every Store API implementation, like receive, ruler, or store gateway, to cap the number of series and samples returned in a single Series RPC call. This ensures an upper limit on the total data that can be requested by a single PromQL query, and together with the already-present query concurrency limit, you can size your Thanos queriers accordingly. We also limit the downloaded bytes on the store gateway, to ensure that fetching historical data from the object storage TSDB does not end in OOMs. With these limits in place, we end up with a pretty comfortable way of operating a multi-tenant monitoring-as-a-service platform with Thanos while ensuring some level of quality of service.
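(Editor's note: to make the write-path checks concrete, a tiny conceptual sketch; the struct fields and admit function are invented for illustration — in real Thanos these limits live in a limits configuration file with global defaults and per-tenant overrides.)

```go
package receive

import "errors"

// WriteLimits is a simplified stand-in for the per-tenant write limits
// described above.
type WriteLimits struct {
	MaxSizeBytes  int64
	MaxSamples    int64
	MaxSeries     int64
	MaxHeadSeries int64 // checked against a meta-monitoring endpoint
}

// admit sketches the checks run before a remote-write request is
// ingested into a tenant's TSDB. headSeries would come from querying a
// Prometheus-compatible meta-monitoring endpoint for this tenant.
func admit(l WriteLimits, sizeBytes, samples, series, headSeries int64) error {
	if sizeBytes > l.MaxSizeBytes || samples > l.MaxSamples || series > l.MaxSeries {
		// Reject the whole request up front: nothing is ingested.
		return errors.New("remote-write request over tenant request limits")
	}
	if headSeries >= l.MaxHeadSeries {
		return errors.New("tenant over active (head) series limit")
	}
	return nil
}
```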
Now let's also talk about another feature which makes the write path of Thanos substantially more stable than before: consistent hashing on receives. This feature adds a new hashing mode to the Thanos receive component, called Ketama, which enables very stable scale-up and scale-down scenarios, and it is now our recommended way of running a Thanos receive hashring. So let's take a look at how this works. A Ketama hashring is configured the same way as our old hashmod-based hashring, using hashring.json files on every single receive node to specify the topology, and we add a flag on each receive to switch to Ketama mode. In this mode, every receive node is assigned sections to manage within the hashring, and these sections are composed of ranges of hashes. When a new remote-write request is received by a receive node, we iterate through the time series present in that request and hash the labels together with the name of the tenant the request came from. We then batch and forward these time series to the sections, that is, to the receive nodes, that should ingest them. With this, we ensure a stable ingest path that has actually consistent hashing, and hence an even distribution of data and tenants across your receive nodes. Internally at Red Hat, we also use a Kubernetes controller, open-sourced under the Observatorium organization, which monitors the StatefulSet configuration for your receives and the pod status, and simultaneously ensures that the receive nodes have the correct, updated hashring.json configs, so that their internal representation of the receive topology stays as close to the real world as possible. This makes operating and scaling such a setup much more automated.

Let's now segue into one of the greatest highlights of the past few months: the new multi-threaded PromQL query engine. To provide a little bit of context: the Prometheus engine is currently a single-threaded function that parses and traverses the query's abstract syntax tree recursively and evaluates the query result alongside it, all at once, before returning the result. This is the PromQL engine that we all know and love, but due to the way distributed systems like Thanos are set up, it becomes difficult to query large sets of data that are distributed across different networks, not to mention that being limited to a single thread causes issues. So with that in mind, the Thanos PromQL engine project was started by Philip, based on the Volcano query-engine paper, which specifies an architecture for an extensible, parallel query engine that can utilize concurrency and multiple cores to their fullest and allows space for several different optimization techniques.

The current Volcano-based PromQL engine works somewhat like this. It parses the query into an abstract syntax tree using the same upstream parser as Prometheus. It then tries to optimize the query expression by applying several logical-plan optimizers to it; once those optimizers have been applied, the query expression is somewhat simpler than it used to be. It then traverses the AST again and constructs a tree of executable operators, that is, the physical query plan. The root of this operator tree can now be executed to fetch the final result of the PromQL query, and the operators themselves can use multiple threads or even run in parallel to each other. So what do these operators look like?
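(Editor's note: in rough Go terms, something like the sketch below — simplified; the actual interface in the thanos-io/promql-engine repository differs in names and details.)

```go
package engine

import "context"

// Labels is a simplified stand-in for a series' label set.
type Labels map[string]string

// StepVector holds the samples of all series for one evaluation step.
type StepVector struct {
	T       int64     // timestamp of this step
	Samples []float64 // one value per series, aligned with Series()
}

// Operator is a node in the physical plan. Parents pull data from
// their children batch by batch instead of materializing the whole
// result at once.
type Operator interface {
	// Series returns every series this operator will ever produce,
	// so parent operators can pre-allocate the buffers they need.
	Series(ctx context.Context) ([]Labels, error)
	// Next returns the next batch of step vectors, or nil when the
	// operator is exhausted (the query has reached the leaves).
	Next(ctx context.Context) ([]StepVector, error)
}
```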
Well, essentially, every operator implements an interface with two important methods. The first is Series, which returns all the series that an operator will ever return in its lifetime; this is useful for parent, or upstream, operators to allocate the buffers they will need beforehand. The second is the Next call, which returns the vectors of samples of all series for a particular execution step. Every operator calls Next and Series on its child operators until there is no more data to be returned, that is, until the query has reached the leaves.

Now that we've mentioned operators, it's a good idea to also mention how data flows between them. As mentioned, operators are arranged in a tree-like fashion where every operator calls Next, and this model allows the samples of an individual time series to flow through each execution step, from the leftmost to the rightmost in this diagram. Since most PromQL expressions are aggregations, the samples are reduced as they are pulled in by the operators from the right, and because of this the samples can be decoded and kept in memory in batches. Aside from the operators that map one-to-one onto regular PromQL constructs, we also have some exchange operators that allow for flow control and concurrency.

So let's see how that works. We have enabled two types of parallelism. The first is inter-operator parallelism: as the operators are independent and rely on a common interface, they can run in parallel, so as soon as one operator has processed a batch of samples, it can be pulled in by the next one. The second is intra-operator parallelism, where parallelism is added within an individual operator using special coalesce operators; these are indistinguishable from regular operators, as they pass on data using the same Next call.

So now that we know how it works at a high level, let's talk about benchmarking. We have quite a lot of Go benchmarks for this project, benchmarking several different forms of queries, both instant and range queries, against the original engine. We run these on every commit to main on a CNCF-sponsored Equinix GitHub Actions runner, and we publish the results on a central web page for everyone to track.

And finally, as this is an alternative, downstream implementation of PromQL, let's talk a bit about how we maintain compatibility with the original engine. We use a compatibility fallback: while evaluating a query, we check whether the new PromQL engine supports it, and if not, we fall back to the regular Prometheus engine. While most functions have already been ported over, such an approach ensures that even if the upstream spec changes, we would still be able to support it. And nearly all of the tests in this project exercise the new engine by comparing results between the upstream and downstream engines and asserting on mismatches.

Alright, that was a lot of information. Something that we're really excited to be working on, and that's going to land in the next several months, is what we call distributed execution. To explain why it's so important in Thanos, we can visualize it with this small example. Thanos allows us to stack queriers on top of each other: we can have a cluster A and a cluster B, each running Prometheus with sidecars, each cluster can have its own querier, and then we can have a querier at the root that does the federation and provides the global view.
So today, if you execute something like a sum, an average, or a count expression, the root querier is going to pull data through the queriers on the second level; it pulls all of the relevant time series into memory and executes the query in memory. This is fairly wasteful, because the second-level queriers end up doing very little work even though they are perfectly capable of executing PromQL, and it obviously has scalability issues: as we extend the environment and add more clusters, the root querier becomes a bottleneck. What we're aiming to do with distributed execution is to have the root querier decompose a query into what we call subqueries and push them down as low as possible in the query path. In this case, the sum would become a sum over each of the two partitions, and at the root we do the sum of those sums. By doing this, every querier in the query path is utilized to its fullest, and queries will run much faster and be much more scalable. Also, the good thing is that almost any PromQL expression can be decomposed this way: for example, a count can be a sum over counts, a group can be a group over groups, a topk can be a topk over topks, and so on. There are maybe one or two aggregations in PromQL which cannot be done this way. So this is again super exciting, but still under development; we are still figuring out some edge cases.

And finally, there is something called native histograms, a very powerful feature that's landing in PromQL and in upstream Prometheus. If you're interested in what native histograms are and how they work, Ganesh and Björn from the Prometheus maintainer team have excellent talks, so make sure to check those out. This is also a feature that we're pulling into Thanos, because Thanos heavily borrows from Prometheus; we have an issue that tracks the development of that feature, and we expect it to land in the next several months as well.

Alright. People also very often have questions such as "can Thanos scale to X, Y, Z?", and it's always a different metric, with different constraints or ideas in mind of what scalability means. So in order to set some sort of bound, at least a lower bound, of where we know Thanos can perform adequately, I've taken a screenshot of one of the internal dashboards we have at Shopify for our internal monitoring. It shows what we call the head series, the cardinality of the entire data set across the entire monitoring infrastructure: that is, how many time series we have globally within the last two hours, which is what head series means in Prometheus parlance. We see that cardinality obviously varies; the top panel shows the global number, and we are basically handling anywhere between 3 billion and 6 billion time series within a two-hour window, which is fairly high. If we break that down per region (we run Thanos in different regions), the two biggest regions, which are in the US, also spike to about 1.5 billion each.

And finally, Thanos is not a done project. We are still making constant improvements to it; there's ongoing work, and there will be more work in the future. Just to illustrate that: we had an environment where the queriers were simply not updated for a while, and by updating them from a roughly six-month-old version to the latest version, we saw approximately a 50% reduction in both CPU and memory usage.
In addition to this, we also saw P90 query latency drop by about 50%, just by doing an update: we simply picked up all of the small incremental changes that had happened over a six-month period. The reason we have such stable resource usage here, by the way, is that these queriers are used to execute recording and alerting rules, so they are constantly executing queries; the load on them is about 16,000 queries per second.

So now, maybe just some final words about the Thanos community. With the intro and all the exciting new features and the work we've been doing to constantly improve the Thanos project, I also want to quickly share how you can get involved with the Thanos community and with these efforts. Our major project communications happen over the CNCF #thanos and #thanos-dev Slack channels. For code, bugs, issues, and feature requests we usually use GitHub: we discuss on the issues and PRs in the relevant repos, and you can raise GitHub discussions as well. We also have bi-weekly community office hours on Thursdays at around 2pm UTC over Zoom; we maintain a public agenda doc so that you can reference previous discussions and add in your own agenda items. We usually try to be present at KubeCon via talks, project meetings, and booths, and this time we even had a ContribFest session to help onboard newer open-source contributors.

We also mentor pretty extensively in the CNCF. We try to submit projects for the LFX mentorships, which happen nearly every quarter, as well as for GSoC, which happens once a year. So please feel free to apply if you are a student, or even if you are a full-time engineer looking to explore observability, monitoring, or the CNCF. We try to not only run technical projects but also provide some holistic guidance to help mentees become great open-source engineers who are awesome to work with. And finally, the Thanos website now has a dedicated space for technical blogs, so we'd love to hear your stories on how you are using or adopting Thanos, whether you have a nice use case for it, or even if you're using it as a dependency to build something cool. Feel free to share; it's a nice way to garner feedback from the community. Thank you. I don't know if we have any time for questions left. Yeah, there's one over there; maybe just wait for the mic.

Thank you for the presentation. I've got two questions. The first one is: if every Prometheus is down, can we still use Thanos with storage only? And the second one is: what is the order for a request? Do we request the storage first, or do we request the Thanos sidecars directly?

Yes, so if all Prometheus instances are down, you can still query data from object storage. And we usually request data from the sidecars, which then talk to the Prometheus instances; both the sidecar and object storage are on the same query path, and the data is requested in parallel from the sidecar and from object storage and joined together at query time.

You're welcome. There's a question over there.

On the new functionalities you mentioned: in which versions are they available, or when will they be?

With the exception of distributed execution and native histograms, they are available in 0.31, the latest release. You can reference the changelog for any specific feature.

Thank you.

Hi, I have one question about the RAM usage of Prometheus if I'm converting to Thanos, because currently we have a multi-cluster setup where we have two
Prometheus instances, and they have very high RAM usage because they monitor our GitHub runners. Sometimes Prometheus runs out of memory, and then we have holes in our queries because the data got lost, since nobody could save it. Do you know how this would change if we convert to Thanos sidecars and the Thanos querier? How do you think the performance of Prometheus would change?

So if you start uploading data to object storage, which is easiest to do with the sidecars, you only need a short retention in Prometheus, and that should help reduce memory. You also won't be executing queries inside Prometheus, which also drives up memory. I think with those two changes you should be able to at least bring the memory usage down a bit.

Thank you.

And distributed execution would help as well, I think.