All right. Thanks, everyone, for coming today. The title of this talk is Distributing PromQL Expressions. We'll focus in this session on some of the work that we've done with Moat and the Thanos community on implementing distributed query execution in the Thanos project. My name is Philip. I am a production engineer at Shopify. And with me today is Moat, who is a senior software engineer at Red Hat.

The agenda we have today is the following. We'll briefly cover why PromQL, and the PromQL engine in particular, is hard to scale at the moment, specifically in Prometheus. Then we'll cover two different approaches that we are working on in Thanos, namely what we call query pushdown, and query sharding, to speed up query execution. Then we'll take a look at sharding in practice, and Moat will give a brief outlook of where we hope Thanos is going to develop in the future.

So if we take a look at the current state of Prometheus, we've heard so many times here at KubeCon that it's become very well adopted in the cloud-native monitoring space. Basically, anyone that's used a Kubernetes cluster has seen a Prometheus instance lying around somewhere. It's very effective for real-time monitoring. And what's very good about Prometheus is the PromQL query language, which allows us to write all kinds of expressive queries against time series data.

The ideal use case for Prometheus is really single-cluster monitoring. Whenever we want to go beyond a single cluster, there are different challenges that are non-trivial to overcome. Some of them are just the fact that scraping across network boundaries comes with its own challenges. We have to rely on disks for retention, and disks can be hard to move between AZs, regions, clusters, and so on. And overall, the story about how to scale Prometheus horizontally is not there yet. That's a conscious design decision, so there's nothing wrong with it. As a result, there are all these different projects, like Cortex, Thanos, and recently Mimir, that come in and try to solve some of the scalability issues that Prometheus has.

If we look at Thanos, we can essentially store data for as long as we want in S3 buckets. In addition to that, we get things like replication, multi-tenancy, all of these cloud-native features that we take for granted today. And now that we have all of this long-retention, massive data, we can also execute PromQL against that data set. However, the PromQL engine in Thanos and all of these projects is still essentially the same engine that's used inside Prometheus. That engine is single-threaded and single-process, and as a result it has to fit all of the time series for a query in memory and execute the query in a single thread. The longer the query range and the higher the cardinality, the harder the problem of executing the query becomes.

So before we actually try to solve the problem, we can visualize where this bottleneck happens on a very simple model. Let's say we have a Kubernetes cluster with a Prometheus instance inside that's scraping a set of targets. When we outgrow that instance, we can deploy another instance that scrapes a different set of targets.
And when we want to get a global view of the entire metric space, we can deploy something like a Thanos querier that's going to pull data from these Prometheus instances and execute PromQL queries. If we outgrow this cluster, we can deploy, for example, another cluster; let's say our production environment is now bigger. And when we want to get a global view again of everything that's going on, we can layer a new Thanos querier on top. This querier is going to communicate with the intermediary ones, and it can execute global queries across the entire set of metrics.

Now the problem is that whenever we want to execute a global query, the Thanos querier at the root of the tree has to reach out to all of the underlying Prometheus instances, fetch the time series data that's relevant for the query into memory in a single process, and execute the query in that process. What's also interesting to note here is that the intermediary queriers in the middle are not very active in the query execution path. What they mostly do is provide a basic service discovery mechanism: they allow the root querier to discover the Prometheus leaves at the bottom, and they traffic data to the top. Other than that, they're not actually executing PromQL, even though they have PromQL engines inside them. So that's the current state, the default mode that we have in Thanos, and this is why query execution can get slow with more data.

Now that we know this, we can look at what we, together with the entire Thanos community, have been working on to alleviate some of these problems. The first thing we'll look into is an approach referred to as query pushdown. With query pushdown, we can speed up queries by simply avoiding streaming a large number of series into memory. To see how that works, let's say we have, again, a simple model: a querier with two Prometheus instances. And let's say we want to execute the following query: we want to calculate the maximum of a particular metric. What we can do is take advantage of the computational engine that's built into these Prometheus instances and execute the query locally first, on the local data set. These Prometheus instances are going to execute this specific PromQL query locally and propagate just the intermediate result upwards. The querier then recalculates the query over a much smaller data set. And by that, we distribute the query execution through the entire system.

So what do you need to know if you want to play with query pushdown? The important thing is that it's already implemented in Thanos, and you can start using it today. The benefits are that it speeds queries up: it lowers PromQL execution latency by processing data at rest, where it lives, and it doesn't ship megabytes or even gigabytes of data over the network. If you want to play with it, you can enable it as a feature flag today. The current limitation, which we hope is going to become less and less of one over time, is that we can only apply it to a small set of PromQL functions: min, max, topk, bottomk, and group. We can think of these as idempotent functions, which we can apply over and over again at various levels of the query execution path and still get the same answer, as the sketch below illustrates.
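To make that concrete, here is a minimal Go sketch, not the actual Thanos code, of why a function like max can be evaluated in two stages: each leaf ships only its local maximum, and re-applying max over those intermediate results gives the same answer as running max over all the raw samples.

    package main

    import "fmt"

    // localMax stands in for the work each Prometheus leaf does over its
    // own local data when a max(...) query is pushed down to it.
    func localMax(samples []float64) float64 {
    	m := samples[0]
    	for _, s := range samples[1:] {
    		if s > m {
    			m = s
    		}
    	}
    	return m
    }

    func main() {
    	// Each leaf ships only its local maximum upwards...
    	leafResults := []float64{
    		localMax([]float64{1, 7, 3}), // leaf 1
    		localMax([]float64{9, 2}),    // leaf 2
    	}
    	// ...and the querier re-applies the same function over the tiny
    	// intermediate result instead of streaming every raw sample.
    	fmt.Println(localMax(leafResults)) // 9, identical to max over all samples
    }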
The flip side is that for the more common expressions, such as sums of rates or calculating quantiles, there's no good way to defer execution to the leaves yet, so there's still more work to be done there. Then there are two other things to keep in mind. If components are going to execute queries, they need a query execution engine, and currently in Thanos it's only the Thanos sidecar that can execute queries; the other store components will simply ignore this instruction. You can still enable pushdown even if you have a mixed fleet of, let's say, sidecars, receivers, and store gateways, and you can do that in a safe manner; it's just that only the Prometheus instances can take advantage of it. And finally, if you decide to enable this feature, keep in mind during the transition period that query execution is not free. If these Prometheus instances are going to start executing queries, make sure you monitor for increased CPU and memory usage, so that you know they're adequately provisioned to handle the additional load.

Cool. So as Philip mentioned, what we found with query pushdown was that it was super efficient for a very specific use case. What we wanted to do next was to explore a technique that would allow us to distribute queries more generally across PromQL. We also wanted to make sure that query evaluation was limited to querier components: the nuance that Philip mentioned, around only being able to push down queries if there's a PromQL engine present, was a critical issue. And we wanted to extend the existing query sharding in Thanos so that we could distribute work even further. Those were our requirements going into query sharding. The TL;DR is that query sharding is just splitting a query into distinct, parallelizable subqueries.

Before we talk about query sharding, it's probably worth looking at what a time series actually is. For this metric, http_request_duration_seconds, we have a series axis and a time axis. Each metric is composed of unique label dimensions; in this case, we have pod and status with various values. And each unique time series has observations over time, what you would commonly refer to as samples. Each sample has a value associated with it and a timestamp.

So when we're talking about sharding, I mentioned that Thanos already has a sharding implementation. There's a notion of horizontal sharding, or time splitting, where a query can be split across time and then run in parallel. This works because each time slice will only contain a given time series once, so a query spanning four days can be split into four one-day queries, as an example. The caveat with time splitting currently is that it only works for range queries. If you have rules or alerts defined, or if you're trying to use it for instant queries, that's not as trivially parallelizable as range queries are. And even for short intervals, a time slice can still have arbitrary cardinality. This is the well-known high cardinality problem in Prometheus: the more time series you have, the more expensive it is to execute a query. The time split itself looks roughly like the sketch below.
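Here is a rough Go sketch, under the assumption of a fixed split interval, of carving one long range into parallelizable sub-ranges; the helper name and shape are ours, for illustration only.

    package main

    import (
    	"fmt"
    	"time"
    )

    type timeRange struct{ start, end time.Time }

    // splitByInterval is a simplified sketch of horizontal (time-based)
    // sharding: one long range query becomes several shorter ones that
    // can be executed in parallel and concatenated afterwards.
    func splitByInterval(start, end time.Time, interval time.Duration) []timeRange {
    	var shards []timeRange
    	for s := start; s.Before(end); s = s.Add(interval) {
    		e := s.Add(interval)
    		if e.After(end) {
    			e = end
    		}
    		shards = append(shards, timeRange{s, e})
    	}
    	return shards
    }

    func main() {
    	end := time.Now()
    	start := end.Add(-4 * 24 * time.Hour)
    	// A four-day query split into four one-day sub-queries.
    	for _, r := range splitByInterval(start, end, 24*time.Hour) {
    		fmt.Println(r.start.Format(time.RFC3339), "->", r.end.Format(time.RFC3339))
    	}
    }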
So this is what time-based, or horizontal, sharding looks like. Say you want to sum by pod over a specific metric. The way time slicing works is that if your original range is, in this example, T0 to T3, you split it up into three distinct ranges and execute them in parallel, or in series. That limits the number of samples you might get for each time series, but it doesn't limit the number of time series you might get for each request.

So how do we manage this vertical cardinality, the scary problem that a lot of people in the Prometheus ecosystem talk about? We wanted to extend and supplement the existing sharding in Thanos with a notion of vertical sharding. The same way that horizontal sharding works across time, vertical sharding works across the unique label dimensions of each metric. The shards are disjoint because we guarantee that each unique time series will always end up on the same logical shard, so you can process these queries in parallel without worrying about duplication or the correctness of your query. And, critically for Thanos, we wanted to do this without too much planning ahead of time. The reason is that we don't know where time series are ahead of a query, and we wanted to avoid additional fetch requests, or checking where time series are, before executing a particular query. In our implementation, the leaves are responsible for deterministically assigning time series to shards.

We also didn't want to make any changes to the Thanos write path. A commonly asked question is: if you have this issue of duplicate time series, why not just de-duplicate them on write? That's something other projects do; Cortex, for example, de-dupes on write. We didn't want to change the write path. We wanted this improvement to be almost transparent to the user, where they don't actually need to change how they've set up Thanos.

The first expression we wanted to implement vertical query sharding for is the very common grouped sum of rates. Assume you have the same expression, where you're summing by pod and you want to see the rate of increase of a counter metric, in this case http_request_duration_seconds. Instead of sharding across the time axis, we shard across the unique time series themselves. What we're saying is that, for each assigned shard index, the same time series will always appear there, and then the querier can process each subquery as it would a normal, unsharded query, with a crucial final step: we concatenate the results of everything that's been executed.

Now, with that being said, we can actually use both in conjunction. If you do a horizontal, time-based sharding pass first and then split each horizontal shard vertically across time series, you've taken one query with a potentially arbitrary time range and arbitrary cardinality and split it into manageable, parallelizable chunks that can be computed in parallel. A simplified sketch of the deterministic shard assignment underneath all of this is shown below.
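The following Go sketch illustrates the deterministic assignment idea; the hashing scheme (FNV over the grouping-label values) is our assumption for illustration and not the actual Thanos implementation.

    package main

    import (
    	"fmt"
    	"hash/fnv"
    )

    // shardFor deterministically assigns a series to one of shardCount
    // shards by hashing the labels the query groups by (e.g. "pod" for
    // sum by (pod)). Every series with the same grouping-label values
    // lands on the same shard, so each shard's partial result is
    // complete for its groups and the querier can simply concatenate
    // the shard results at the end.
    func shardFor(series map[string]string, groupBy []string, shardCount uint32) uint32 {
    	h := fnv.New32a()
    	for _, name := range groupBy {
    		h.Write([]byte(name))
    		h.Write([]byte{0xff}) // separator to keep label pairs unambiguous
    		h.Write([]byte(series[name]))
    		h.Write([]byte{0xff})
    	}
    	return h.Sum32() % shardCount
    }

    func main() {
    	a := map[string]string{"__name__": "http_request_duration_seconds", "pod": "api-1", "status": "200"}
    	b := map[string]string{"__name__": "http_request_duration_seconds", "pod": "api-1", "status": "500"}
    	// Both series share pod="api-1", so for sum by (pod) they must be,
    	// and are, assigned to the same shard index.
    	fmt.Println(shardFor(a, []string{"pod"}, 3), shardFor(b, []string{"pod"}, 3))
    }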
All right. So as Moat mentioned, we worked on this approach for a while, we have a reference implementation in a branch in Thanos, and we used that reference implementation to see whether the algorithm actually makes any improvements. To do that, we put together a very simple benchmark. We generated a synthetic dataset of 100,000 series, and by that we hoped to simulate a setup with, let's say, a hundred clusters with a thousand pods in each cluster. We then compared query execution in a sharded versus a non-sharded setup. For the sharded setup, we used what we call a sharding factor of three; that just means one big query gets split into three smaller queries, which are fanned out and executed in parallel. In the default, non-sharded setup, we just executed queries in a round-robin manner across five different queriers. And to keep everything simple, we ran the benchmark on a single node, a single machine.

What we wanted to see is, first, are we going to improve the user experience, meaning do queries get executed faster? And second, are we going to stabilize or destabilize the system by sharding queries? In other words, what's the operator experience going to be? One way we thought we could measure that is by tracking the peak and average memory usage of queriers over the course of the entire benchmark.

If we jump right into the numbers: this particular graph shows the P90 latency of executing queries over 10 to 15 minutes without sharding, which is what you get by default in Thanos. We see that at the beginning, the query latency is about nine seconds; at the end, we end up somewhere around 19 seconds; and in the middle there's a big spike where the P90 jumps to about 55 seconds to execute a query. That speaks to the fact that doing more and more work in a single process can, over time, destabilize a lot of things in the system. If we compare that to the sharded setup, where we actually split queries into smaller shards, the P90 when we start the benchmark is about five seconds, so already about a 40% decrease in query execution time, and when we end the benchmark, the latency is about 5.7 seconds. What we can see from this graph, which might not be easy to capture in single numbers, is that we get a much more predictable, much smoother query response time by simply partitioning work across multiple processes; in this case, multiple processes on a single node.

Looking at the memory usage: again, in the default mode, with no sharding, which is what you get out of the box right now if you install Thanos, the peak memory to execute a single query is about 1.5 gigabytes, and the average is about 1.2 gigabytes. Comparing that with the sharded setup, we see a decrease in peak memory to about 800 megabytes, so approximately 50%, and the average is about 700 megabytes. Again, what's visible here is that the memory usage is much more consistent: there aren't big jumps and drops, and the lines are much closer together, so there's much more predictability in the system.

So, in summary, the benefits of vertical sharding are very intuitive. If we split work across multiple processes, we expect lower peak memory across those processes and lower latency; that all makes sense. The crucial advantage we hope to achieve with vertical sharding, specifically over sharding by time range, is that we can also apply it to instant queries, so things like alerting rules and recording rules can be executed in a sharded manner as well. We still have a few things to hash out, though: we don't yet have an implementation for the full range of PromQL expressions.
Right now we need some sort of aggregation in the query expression so that we can shard on its grouping labels; I think we just need a bit more time to hash out the details there. One more thing to keep in mind is that as we start to split queries up, the queries become smaller, but there are going to be more of them. So there will be more requests per second throughout the system, and we'll have to keep an eye on that and make sure all components are properly provisioned and capable of handling the increased volume.

Cool. So we've heard about how something like vertical and horizontal query sharding works in practice, and we've seen the performance benefits in the benchmark we presented. What I wanted to show next: Philip and myself both work on fleet monitoring and observability teams at our respective companies, and we wanted to give an impression of roughly the scale of problem that this kind of approach solves.

Let's suppose you have a central observability cluster that's running Thanos, and suppose you have thousands of clusters to monitor, either platform monitoring or user workload monitoring, with each cluster producing some number of series per day, and you want 30-day retention. With Thanos you can really have arbitrary retention, but let's say it's 30 days for the sake of the example. I'm going to simplify the query path for the sake of demonstration, but it stands to reason that if somebody is writing metrics to a central place, they probably want to query those metrics as well. So let's say our users want to sum some metric by cluster. We have our Thanos querier, but this presents an interesting challenge, given the caveats we've already mentioned: we need the nodes to be big enough to hold all of the time series that a particular query fetches before they can evaluate PromQL. Here we're holding 20 million active series over 30 days. Even if you time-slice with Thanos as it currently stands, that's still a lot of metrics you might have to retrieve for a given query.

So let's say you create a 100-gigabyte Thanos querier, meaning you give it 100 gigabytes of memory and you just hope that a query doesn't exceed that amount. You probably need a couple of replicas because you want to serve concurrent traffic; you're not going to deal with one query at a time. And you just hope that the query executes and that you have enough memory to facilitate it, ignoring whatever concurrent queries are already going on in the process. If you run out of memory, you're out of luck and your Thanos querier falls over.

Now let's look at the same example, but assume we have vertical sharding. We've taken this query with an arbitrary range, split it by time, and then sharded it by index. It's the same expression, summing by cluster ID, but now, instead of a single node having to hold all of the series in memory, we can shard the query across a fleet of fungible Thanos queriers; queriers are stateless anyway. The query load is spread across our fleet of Thanos queriers instead of the special, large processes that we would otherwise have to look after. The back-of-the-envelope sketch below gives a feel for the orders of magnitude involved.
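This is a rough illustration only; the scrape interval, in-memory bytes per sample, and 16-way shard count below are assumptions of ours, not figures from the talk or the benchmark.

    package main

    import "fmt"

    // A back-of-the-envelope sketch of why unsharded queries can need
    // such large queriers. All constants are illustrative assumptions.
    func main() {
    	const (
    		activeSeries   = 20_000_000 // active series in the example setup
    		scrapeInterval = 30.0       // seconds, an assumed value
    		rangeSeconds   = 24 * 3600.0
    		bytesPerSample = 16.0 // per decoded in-memory sample, an assumed value
    	)
    	samplesPerSeries := rangeSeconds / scrapeInterval
    	totalBytes := activeSeries * samplesPerSeries * bytesPerSample
    	fmt.Printf("~%.0f GB to hold one day of raw samples for a single query\n",
    		totalBytes/1e9)
    	// With, say, 16-way vertical sharding, each querier holds roughly
    	// 1/16th of the series at peak instead of all of them.
    	fmt.Printf("~%.0f GB per querier with 16-way vertical sharding\n",
    		totalBytes/16/1e9)
    }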
And this just means that, as a platform engineer maintaining the system, you can sleep better at night and not worry about this arbitrary-cardinality problem. When it comes to scaling Thanos queriers up and down, because you're sharding queries, it's simply a case of having a horizontal pod autoscaler create more Thanos queriers, or remove them when you don't have the query load.

So, as Philip mentioned, there is an implementation of this for grouped aggregation queries, there is a proposal in upstream Thanos that discusses it, and we're working on expanding PromQL support. Now that we've done sums of rates grouped by some label, we want to start exploring the rest of the PromQL standard library. This was Philip's and my first foray into contributing to Thanos, so I recommend you have a look at these if this interests you, and contribute to Thanos. The community is super friendly; we had a great time talking to people about these ideas, and there's plenty of cool work to take a look at. For more context, this was quite a simplified view of the problem, but if you want to learn more about the Prometheus TSDB and the exposition format, or about Thanos itself and how to do fleet monitoring with Thanos, there are great KubeCon talks by members of the Prometheus and Thanos community that I recommend you watch. Yeah, and thanks to the community and the maintainers.

Right, so I think we have some time for questions. If anyone has any questions, you can just walk up to that mic and shoot.

Hello. Thank you for the talk. It seems like vertical sharding might be useful for Prometheus itself as well. I think the majority of the time is spent getting the data from disk, right? So it seems usable for a single Prometheus too. What are your thoughts on that?

Yeah, that's true. I guess the thing that Prometheus tries to do, and I assume this is intentional by design, is to effectively restrict the problem space. Prometheus is very good at doing a very specific thing, and introducing complexities into the PromQL engine itself, into how it processes queries, is probably out of scope for Prometheus. But there have been talks at various meetups this week about what a future PromQL engine would look like, built for this kind of cloud-native use case instead of the single-Prometheus use case. It's not an easy problem; it's quite complex. There are also some proposals in Prometheus itself for sharding at the TSDB level, so that you can retrieve shards of a block. We are following that work, making sure that what we're doing in Thanos stays compatible with Prometheus. But changing such a stable project, a 10-year-old CNCF-graduated project, takes a bit more time than adding something around it or on top of it.

Hey, thanks for the talk again. So you're proposing sharding at the query level, but if our data comes from the Thanos store, can we also shard at the store level?

That's what we do. The querier requests a shard from the store, and the store only retrieves a subset of the data. So in this particular implementation, sharding is implemented end to end: from when you execute the query, down to the store actually retrieving the data and sending it back to the querier. Does that answer the question?

Yeah, more or less, but do we have one store instance or multiple?
You can still maintain your existing setup. Queriers that handle different shards will still talk to the same store. Even if you have a single store instance, they will all talk to that one store instance; they will just each request one shard of the data. So your existing setup will not change.

Yeah, thanks. Did you by any chance look at the S3 request rate when you do vertical sharding? Does that increase? Because in our use case, we are limited by our S3 request rate; we basically have unlimited compute and memory.

I think that's a very good point. We haven't looked specifically at that particular metric, but in Thanos there are various ways to cache data from S3, both index data and chunk data. So we're hoping that even if there is an increase, we can reduce it by adding some intermediary caching, so that we don't have to fetch data twice. To answer your question, the TL;DR is that we haven't specifically looked at it, but we think there's a way to address this problem so that we don't increase requests towards S3.

Thanks.

Cool. We'll be outside after the talk if anybody wants to grab us and ask a question, but I'd recommend you have a look at the proposal upstream and engage with the community if you have any concerns or ideas. All right, thank you.