Hi everyone, my name is Liron Cohen and today I will talk about some of the different solutions for achieving highly available, long-term, scalable Prometheus. A little bit about myself before we start. I have been in DevOps and site reliability engineering for the past seven years. Before my current job as an SRE at Riskified, I was a DevOps consultant at multiple companies. And when I'm not working or studying, I enjoy traveling, diving, and basically anything water related, as you can probably tell by my happy face in this photo.

So before we dive into the different solutions we decided to check, let's briefly talk about the issues we had. What we started with was an architecture used by a lot of people and companies: two Prometheus servers that scrape metrics from the same targets, both monitoring the same sets of jobs for high availability, where each Prometheus has its own local disk for durability. We're running on AWS, so in our case it's EBS. What are the issues with this architecture? Well, it's not scalable. It's not really highly available: if one Prometheus goes down or is in the middle of a rolling update, there will be gaps in the data we see, for example in Grafana, so we can't really load balance between the different instances. I'll show a small illustration of this at the end of this intro. There is no centralization, no global view of the data. If, for example, we want to query multiple clusters, we can't. And there is no long-term storage of data; we can't configure a retention of multiple years, for example, this way. So we knew we needed a solution for all of these issues, and we knew that there are some tools that can help us. Eventually we were left with three potential tools we wanted to choose from: M3, Cortex, and Thanos. It's important to note that I'm not representing any of these projects, and I'm not a maintainer of any of them. We simply wanted to examine their different architectures to understand what each tool can offer, what the advantages and disadvantages of each tool are, and how they differ from each other.

Why did we want to know this? All of these projects have some similarities. They are all open source projects written in Go that are compatible with Prometheus, and they all offer a solution that can address the issues we just talked about. All of these projects offer long-term storage for our metrics and a global view of the data, they are horizontally scalable, and they make sure our metrics are highly available and durable. All of these solutions provide replication of time series data across regions and/or availability zones. So it sounds like all of these tools do pretty much the same things. How can we compare them? Well, we decided to focus on different aspects that can be grouped into four general categories: performance, high availability, cost, and operational complexity. I will start by reviewing each one of the architectures so we can understand the pros and cons of each and be able to compare all of them by the end. Eventually, I will tell you what we chose to use, but please remember that our choice was made before some new features were added to these projects. One last note: I will focus only on the write path and the read, or query, path of the architectures, and I will not show all of the components, such as the components that help us evaluate rules over our samples and integrate with the Alertmanager. This is to help us remain focused on the main differences between these projects.
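As promised, here is a toy illustration, in Python, of the load-balancing problem with the naive HA pair. It is not real Prometheus data, just two dictionaries standing in for the two replicas, where one replica was restarting for part of the time window:

```python
# Toy sketch (not real Prometheus data): why naively load balancing between
# two HA replicas shows gaps. Both replicas scrape the same target, but
# replica B was restarting around t=3 and t=4 and missed those scrapes.
replica_a = {0: 1.0, 1: 1.1, 2: 1.2, 3: 1.3, 4: 1.4, 5: 1.5}
replica_b = {0: 1.0, 1: 1.1, 2: 1.2, 5: 1.5}  # samples 3 and 4 are missing

def round_robin(ts):
    # Naive load balancing: alternate between the replicas per request.
    replica = replica_a if ts % 2 == 0 else replica_b
    return replica.get(ts)  # None shows up as a gap in the graph

def merged(ts):
    # Roughly what the tools in this talk do: answer from whichever replica
    # has the sample, so a single restart leaves no visible gap.
    return replica_a.get(ts, replica_b.get(ts))

print([round_robin(t) for t in range(6)])  # [1.0, 1.1, 1.2, None, 1.4, 1.5]
print([merged(t) for t in range(6)])       # [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
```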
So let's start by talking about M3. M3 was originally developed by the observability team at Uber with the goal of providing teams with a highly available and centralized metrics platform. It's an open source project under the Apache 2.0 license. The foundation of the platform is M3DB, a distributed time series database, or TSDB, that has built-in replication of time series data points across nodes and across different availability zones and regions. M3 is based on a push-based model, meaning the Prometheus servers use the remote write API to push data to M3.

So let's start by understanding how the write path works. First of all, Prometheus scrapes metrics, as we know. Then the M3 coordinator, which can be deployed as a sidecar alongside Prometheus, gets Prometheus' remote write requests. It can also get Prometheus remote read requests, as we will see in a moment. As its name implies, it's responsible for coordinating writes and reads in M3DB. When the metrics get to M3DB, writes are compressed in memory and eventually flushed to disk. The duration the data remains in memory depends on the configured block size, which is the duration of time dictating how long new writes will be compressed in memory before being flushed to disk. For example, the block size can be set to two hours. There is also etcd, which stores the metadata used by each of the components. It means that the M3 coordinator and M3DB rely on etcd as the source of truth for cluster management. Basically, we can start by running only the M3 coordinator and M3DB with etcd. We can also add another component, the M3 aggregator, which runs as a dedicated metrics aggregator and provides stream-based downsampling before metrics are stored in M3DB, based on dynamic rules stored in etcd.

So this is the write path. What about the read path? Well, on the read path, Grafana or another API client sends its query to M3 Query. This component is responsible for exposing the metrics and metadata of the time series stored in M3DB. So for writes to M3DB, we use a dedicated deployment of M3 coordinator instances, and then for queries, we can use a dedicated deployment of M3 Query instances. Note that it's also possible to use just the M3 coordinator if we don't mind that the read path and the write path won't be isolated. In order to support efficient reads, M3DB implements various caching policies that determine which flushed blocks are kept in memory and which are not, so that's another thing that improves performance. And as we saw previously, etcd stores the metadata used by each of the components.

So what are the advantages of M3? First of all, data resides within the cluster on disk and is replicated between availability zones and regions. There is no use of a cloud storage service such as S3 or Google Cloud Storage, which means that, A, bandwidth costs are relatively low. So if you're running things both on-prem and in the cloud, it might be a good idea for you to use M3, since cloud bandwidth costs might be high when moving data between the cloud and on-prem data centers. And B, it means lower latency, which of course means better performance; fetching data from an object store like Amazon S3 might be slower than using local disk. It's also a push-based-model system: Prometheus uses the remote write API to send data to M3, which also has some advantages, in particular in use cases where you have ephemeral clusters, or in terms of availability: if a Prometheus server becomes unavailable, all data up to that point is still available for queries.
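To make the push model concrete, here is a minimal sketch of the remote write configuration Prometheus would need, generated with Python purely for illustration. The coordinator hostname, port, and path here are assumptions, not taken from this talk; verify the exact endpoint your M3 coordinator exposes against the M3 documentation.

```python
# Hypothetical sketch: the remote_write stanza Prometheus needs in order to
# push samples to an M3 coordinator. The URL below is a placeholder --
# check your own M3 deployment / the M3 docs for the real endpoint.
import yaml  # PyYAML

prometheus_config_fragment = {
    "global": {
        # Labels identifying this Prometheus replica; useful for telling
        # the two HA replicas apart downstream.
        "external_labels": {"cluster": "prod", "replica": "prometheus-0"},
    },
    "remote_write": [
        {"url": "http://m3coordinator.monitoring.svc:7201/api/v1/prom/remote/write"}
    ],
}

print(yaml.safe_dump(prometheus_config_fragment, sort_keys=False))
```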
There are only a few components, as you saw, which might make it easier to deploy. And in terms of caching queries, there are various caching policies you can configure in order to support efficient reads. So we can see how the M3 solution is really focused on being a system that can manage huge amounts of data, even petabytes of metrics, with the primary concern of scaling monitoring horizontally in a cost-effective way.

But the M3 solution also has some cons. M3DB might be complex to operate. After all, it's another database in your infrastructure that you need to take care of and learn how to bootstrap and recover. I will mention that there is an M3DB operator that aims to automate everyday tasks around managing M3DB, but it doesn't automate every single edge case. I will also mention that I found the M3 official documentation lacking, and it might be harder to understand, deploy, and debug it as a result. The second thing to consider is that M3 requires an external dependency, etcd, so that's eventually one more cluster you need to take care of. And lastly, the push-based model also has some disadvantages. It might be more complex than a pull-based system like Thanos; we will discuss this later on. And shipping samples from Prometheus over the network immediately as they are scraped is not very efficient.

So after understanding some of the advantages and disadvantages of M3, let's talk about another solution to our issues: Cortex. Cortex is a CNCF incubating project and another solution for horizontally scalable, highly available, long-term Prometheus. Its initial focus was mainly on scalability and high performance, and later on, in collaboration with the Thanos team, the Cortex team also added other focuses to the Cortex architecture. When talking about Cortex, we can separate our discussion into two: chunk storage, the storage engine Cortex started with, and block storage, support for which was added recently with the help of the Thanos project during their collaboration. Most of the Cortex architecture is quite the same in these two cases, but there are differences regarding the pros and cons and what we can get. Let's start by taking a look at the Cortex architecture when chunk storage is in use.

Cortex, like M3, also uses a push-based model, which means that the Prometheus servers use the remote write API to push data to Cortex. So when talking about the write path, Prometheus servers scrape samples from various targets and use the Prometheus remote write API to push data to the distributors. The distributors are responsible for validating the samples they get and can also deduplicate incoming samples from multiple HA replicas of the same Prometheus servers. In order to coordinate which replica is currently elected as the leader, the only replica the distributor will accept samples from, we need a key-value store where the data about the elected replica is saved; it can be Consul or etcd. Then, the valid samples are split into batches and sent to multiple ingesters in parallel. On the write path, the ingesters are responsible for writing incoming series to the long-term store. The incoming series are kept in memory for a while and periodically flushed to the storage.
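Here is a toy sketch of the deduplication decision the distributor makes for an HA pair. It is not Cortex code: the real implementation stores the elected replica per cluster in Consul or etcd and fails over to another replica if the leader stops sending for a while; this only shows the core accept-or-drop idea.

```python
# Toy sketch (not Cortex code) of HA-pair deduplication on the write path.
# Incoming samples carry a cluster and replica identity (via external labels);
# only samples from the currently elected replica are accepted.
elected_replica = {}  # in Cortex this state lives in Consul or etcd

def accept_sample(cluster: str, replica: str) -> bool:
    leader = elected_replica.setdefault(cluster, replica)  # first writer wins
    return replica == leader

print(accept_sample("prod", "prometheus-0"))  # True  -> ingested
print(accept_sample("prod", "prometheus-1"))  # False -> dropped as a duplicate
print(accept_sample("prod", "prometheus-0"))  # True
```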
There are different solutions to prevent data loss of in-memory series that have not yet been flushed to the long-term storage: using multiple replicas of each time series in the ingesters, and/or using a write-ahead log, which writes all incoming series samples to a persistent disk until they are flushed to the long-term storage. So, the ingesters will batch and compress samples in memory and will periodically flush them out to the long-term storage.

When talking about chunk storage, each single time series is stored in a separate object called a chunk, which contains the samples for a given period, 12 hours by default. The chunks are indexed by time range and labels in order to provide a fast lookup across chunks. The index will be kept in a key-value store: Amazon DynamoDB, Google Bigtable, or Apache Cassandra. And the chunks will be kept either in an object store, such as Amazon S3, Google Cloud Storage, or Microsoft Azure Storage, or in the key-value store itself. Note that, for example, if we are talking about AWS, you can store the chunks in S3 and the index in DynamoDB, or put everything in DynamoDB. Using just S3 is not an option, unless you use the block storage engine that we'll be discussing in a moment.

So, we talked about the write path, but again, what about the read path? When we query data, we can do so by sending the query directly to the querier or to the query frontend. The query frontend is used to accelerate the read path. It can optionally split the query and serve it from the cache. The query frontend puts the query into an in-memory queue, and then the querier component picks it up and executes it. The querier fetches samples both from the in-memory series in the ingesters and from the long-term storage while executing the query, because the ingesters hold the in-memory series that have not yet been flushed. Finally, the querier sends the results back to the query frontend, which then forwards them to the client.

The query frontend is an important component that Thanos also added to their architecture lately, because of some important features. Splitting: the query frontend splits queries spanning multiple days into multiple single-day queries. This gives us the ability to execute queries in parallel on downstream queriers, which means, A, faster query execution, and B, it prevents out-of-memory issues when executing large multi-day queries. It also provides caching: it supports caching query results using Memcached, Redis, or an in-memory cache. And queuing: it has a queuing mechanism that is used for different purposes, including retrying large queries that failed, in case of OOMs in the querier. In addition to the cache that the query frontend uses in order to keep the results of the whole query, I want us to notice that there are additional caching layers: the chunk cache, which stores recent immutable compressed chunks and is used by queriers to reduce load on the chunk store; the index read cache, which stores entire rows from the index and is used by queriers to reduce load on the index; and the index write cache, which is used for deduplication and for reducing load on the database by avoiding rewriting index and chunk data that has already been stored.

So after going over the chunk storage architecture, let's talk about block storage. The block storage architecture is actually based on the Thanos architecture. The block storage itself is based on the Prometheus TSDB. It stores each tenant's time series in their own TSDB.
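To give a feel for what the query frontend's splitting step does, here is a toy sketch in Python. This is not Cortex code; the real frontend aligns splits on a configurable interval (24 hours by default), fans the sub-queries out to queriers, and caches per-interval results.

```python
# Toy sketch (not Cortex code): split one long range query into day-aligned
# sub-queries that can run in parallel and be cached independently.
DAY = 24 * 60 * 60  # split interval in seconds

def split_by_day(start: int, end: int):
    subqueries = []
    cursor = start
    while cursor < end:
        day_end = min((cursor // DAY + 1) * DAY, end)  # next day boundary
        subqueries.append((cursor, day_end))
        cursor = day_end
    return subqueries

# A query covering roughly three days becomes four day-aligned sub-queries.
for sub in split_by_day(start=1_600_000_000, end=1_600_000_000 + 3 * DAY):
    print(sub)
```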
The in-memory samples in the ingesters are flushed to an object store, such as Amazon S3, Google Cloud Storage, etc., when a new TSDB block is created, which defaults to a two-hour block range period. In this architecture, there are two additional components that are based on Thanos components, as we will see soon. First, the store gateway, which queries blocks from the object store and is used by the querier at query time. It also uses an index cache, a chunk cache, and a metadata cache in order to speed up queries and reduce the number of API calls to the object storage. And there is also the compactor, which is responsible for reducing the number of blocks stored in the long-term storage by merging and deduplicating smaller blocks into larger ones, and by that also making queries more efficient. There are additional components, options, and abilities I didn't talk about.

But looking at the Cortex architecture, we can already point out some of the advantages Cortex gives us. It gives us the ability to use chunk storage in case we are putting performance at the top of our priorities; chunk storage is faster than block storage. We can also decide to use block storage when we are putting simplicity and cost reduction at the top of our priorities; block storage is eventually cheaper than chunk storage. For example, S3 costs are lower than DynamoDB costs. Plus, it's much simpler to use only an S3 bucket than to use and take care of an extra DynamoDB table. Cortex also gives us lots of caching layers, which can improve performance significantly. There is also the query frontend, which allows query parallelization and results caching, which also has a great impact on performance. Even the push-based model that Cortex is based on offers multiple benefits. In terms of performance, Prometheus pushes data to Cortex, so if the cluster you are collecting the metrics from and the cluster where the querier is are far apart geographically, keeping all the data in Cortex will decrease query latency. And there are no gaps in the graphs caused by Prometheus restarts, because the pushes happen as soon as the data is scraped; even if one of the Prometheus replicas is down, you will not see gaps in the data. And again, in terms of availability, when a Prometheus server becomes unavailable, all data up to that point is still available centrally. Also, it's a great option if you don't want to enable ingress to your clusters. And basically it offers all the advantages of the push-based system we talked about earlier when we talked about M3. So we can really see how Cortex has been designed with a focus on high performance of long-term storage.

But there are some disadvantages as well. Some people might find it a more complex system than others, because it includes a relatively large number of components. The Cortex team did add the ability to run Cortex as a single binary, which means a single deployment, which makes things easier, but in production systems it's recommended to deploy it as multiple independent microservices, so we can tune and configure the different components. Secondly, in order to use all of its features, such as HA deduplication, you rely on external dependencies, such as etcd or Consul. Eventually, those are more moving parts to deploy, operate, and monitor. The push-based model also has some cons, as we mentioned before: extra complexity and resources, since you need to manage a separate Cortex cluster and storage on top of your Prometheus deployment.
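Since the compactor came up here (and it will come up again with Thanos), here is a toy sketch of the idea behind it. It is not Cortex or Thanos code, just an illustration of merging several small blocks into one larger block and deduplicating overlapping samples so queries touch fewer objects.

```python
# Toy sketch (not Cortex/Thanos code) of what compaction does conceptually:
# merge several small blocks into one larger block, deduplicating samples
# that appear in more than one block along the way.
from collections import defaultdict

def compact(blocks):
    """Each block maps a series name to a list of (timestamp, value) samples."""
    merged = defaultdict(dict)
    for block in blocks:
        for series, samples in block.items():
            for ts, value in samples:
                merged[series][ts] = value  # identical timestamps collapse to one
    # One bigger block, with each series' samples sorted by time.
    return {series: sorted(samples.items()) for series, samples in merged.items()}

block_a = {'up{job="api"}': [(0, 1.0), (60, 1.0)]}
block_b = {'up{job="api"}': [(60, 1.0), (7200, 1.0)]}
print(compact([block_a, block_b]))
# {'up{job="api"}': [(0, 1.0), (60, 1.0), (7200, 1.0)]}
```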
If your network has momentary issues and the in-memory buffer of Prometheus cannot hold more data, you might end up losing some of the data. And shipping metrics over the network to a remote storage immediately as they are scraped is not so efficient and might cause data loss if the network is flaky. Lastly, the different storage types have some trade-offs as well. Chunk storage gives us great performance, but it is expensive compared to block storage, and it's an additional resource that needs to be taken care of. Block storage is cheaper and simpler, but its performance is lower. So we saw that Cortex offers multiple ways to deploy it, each with its own trade-offs.

As I mentioned before, the Cortex team worked closely with the Thanos team, so both Cortex and Thanos added components and abilities that eventually improved their solutions and added more options. So let's see what the Thanos architecture looks like and how their collaboration with Cortex bettered it. Thanos is also a CNCF incubating project and is a solution for highly available Prometheus with long-term storage. Its initial focus was on operational simplicity and cost-effectiveness, but as Cortex added new capabilities that were inspired by Thanos, Thanos also added improvements that were inspired by Cortex. Contrary to Cortex, Thanos originally used a pull-based model architecture. That means that Thanos pulls time series from Prometheus at query time. Later on, inspired by the Cortex push-based model, the Thanos team added support for a push-based model as well.

Let's take a look at the Thanos pull-based model, starting again with the write path. In Thanos, there is a Thanos sidecar running in the same pod as the Prometheus server. Its purpose is to upload data, TSDB blocks, to an object storage such as AWS S3, Google Cloud Storage, or Microsoft Azure Storage, and to give other Thanos components access to the time series data in Prometheus, as we will see when we talk about query time. The sidecar uploads the TSDB blocks to the object storage as Prometheus produces them, every two hours. This gives us the ability to configure Prometheus servers to run with a relatively low retention. Notice that using this model means Prometheus cannot be fully stateless: if it crashes or restarts, the last two hours of metrics will be lost, so a persistent disk for Prometheus is still needed. Using the Prometheus remote write API gets you closer to a stateless Prometheus, but it still won't be fully stateless, so it is still always recommended to have a persistent disk. We can see how the write path here is super easy: you actually only add a Thanos sidecar to your Prometheus pods and configure your object storage in order to save long-term data. What's great about Thanos is that features can be deployed independently of each other. You can start by only deploying a sidecar and gradually add other components to use other features.

Now let's talk about query time. Grafana or another API client sends its query to the Thanos Query component. Then the query component aggregates and deduplicates data from the underlying components: it queries recent metrics from the Thanos sidecar, which exposes Prometheus' metrics, and older metrics from the store gateway. The store gateway queries metrics from the object store and also supports an index cache and an experimental caching bucket, which caches chunks and metadata using Memcached or an in-memory cache to speed up loading of chunks from TSDB blocks.
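To illustrate what "aggregates and deduplicates" means for an HA pair, here is a toy sketch. It is not Thanos code: in Thanos, the replica label to strip is configured on the querier, and the real deduplication logic is more careful about which replica's samples to prefer; this only shows the basic merge-by-labels idea.

```python
# Toy sketch (not Thanos code) of query-time deduplication: the querier sees
# the same series from both HA replicas, differing only in the replica label.
# It drops that label and merges the samples into a single series.
REPLICA_LABEL = "replica"  # which label marks the replica is configurable

def dedup(series_sets):
    merged = {}
    for labels, samples in series_sets:
        key = frozenset((k, v) for k, v in labels.items() if k != REPLICA_LABEL)
        merged.setdefault(key, {}).update(samples)  # any replica can fill a gap
    return {key: sorted(samples.items()) for key, samples in merged.items()}

from_replica_a = ({"job": "api", "replica": "prometheus-0"}, {0: 1.0, 60: 1.0})
from_replica_b = ({"job": "api", "replica": "prometheus-1"}, {60: 1.0, 120: 1.0})
print(dedup([from_replica_a, from_replica_b]))
# one series labelled {job="api"} with samples at t=0, 60 and 120
```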
There is also the compactor, which scans the object storage and is responsible for compacting data and downsampling blocks in order to speed up queries. As with the other projects, there are other components, such as the ruler, that we won't focus on. As the Thanos and Cortex teams started to collaborate, it was decided to add multiple optional components as well. One is the query frontend, which can be put in front of Thanos queriers to improve the read path with some important features like splitting and results caching. It is based on the Cortex query frontend component that we mentioned before. Note that at the moment only range queries can be split and cached; those are the only queries the Thanos query frontend can process at the moment. The Thanos team also added the option to use a push-based model by adding a component named Receiver, which receives data from Prometheus remote write and uploads it to an object storage.

So again, we can point to some of the advantages Thanos offers us by looking at its architecture. First, its architecture is relatively simple; we can gradually install its components. Storing the long-term data in a block storage gives us, as we mentioned regarding Cortex, simplicity and cost reduction. We talked about the pros and cons of the push-based model before, and the same applies here. We also have the option of using the pull-based model, which is the classic one when talking about Thanos. The pull-based model gives us simplicity, and it makes the write path more efficient because it ships full compressed blocks every two hours by default, which can also prevent data loss in case of network issues that last more than a few minutes. And lastly, its recently added query frontend can improve performance. Looking at these advantages, we can see how in Thanos' case, its main focus is indeed operational simplicity and cost-effectiveness.

But again, there are some disadvantages. The block storage is slower than the alternative storage solutions that Cortex and M3 offer. The same cons of the push-based model that we talked about also apply here, but there are trade-offs with the pull-based model as well. It means that the data from the last two hours is less durable: samples aren't saved immediately to remote storage, which means we cannot access the data from the last two hours if there are network issues, and we can even lose it if Prometheus goes down. It might also have worse latency on the query path if the cluster you are collecting the metrics from, meaning where the Prometheus servers are, and the cluster where the querier is located are geographically distant. And there is not as much caching; the query frontend only caches range queries, as we mentioned before.

So now, after going over the architectures of each solution, let's see how the four categories I mentioned in the beginning are satisfied or not. Our four categories were performance, high availability, operational complexity, and cost. This is a summary of all that we talked about, and based on this comparison, we can see the trade-offs that each solution offers us. There are other aspects that can be compared that we didn't talk about, but the bottom line is that I think all of the solutions are great. I'm pretty positive you will be satisfied with whichever solution you choose. But there are some differences that might make one of these tools a better match for your needs and architecture. There are also other aspects according to which we can compare the solutions.
For example, PromQL compatibility. According to PromLabs' latest tests, this is the PromQL compatibility of M3, Thanos, and Cortex. Official documentation: as I said, I do find the M3 official docs lacking, and they also mention in the official docs that additional work is still needed. I personally enjoyed the Cortex documentation the most, but Thanos also has great docs. If we're talking about installation via Helm charts, you can find Helm charts for all of these tools, but take note that Cortex does mention that their official Helm chart still needs work, that M3 offers only a Helm chart for the M3DB operator, and that Thanos does not have an official Helm chart, but does have multiple community Helm chart options. We are using the kube-prometheus-stack Helm chart, the Prometheus community chart that many use, including us, by the way, to install the Prometheus Operator, Prometheus, Grafana, Alertmanager, and more. It already has built-in integration with Thanos, and you can configure the Thanos sidecar very easily with it.

So at this point, I'm quite sure you're asking yourself, okay, so what did you choose? We eventually decided to go with Thanos. We wanted to keep it simple. Thanos offers good enough performance for our use case. Its costs are relatively low. The kube-prometheus-stack Helm chart, which we use, has the ability and support to install the Thanos sidecar. So we thought Thanos answered our needs best. Then again, all of these solutions have pros and cons. The best way for you to choose one is simply to decide on your priorities. And that's it. Thank you for your time and I'm available for questions.