Welcome to our presentation. Thank you for coming. Even though it's late in the day, we will be giving you a presentation called From UI to Storage: Unraveling the Magic of Thanos' Query Processing, and we hope that it will be entertaining and useful for you. And isn't there a better way to start a presentation than by talking about ourselves first, just to pique your interest? I'm Giedrius, a site reliability engineer at Vinted. Vinted is mostly a second-hand fashion marketplace. It's quite popular, especially in France and in Paris, so you are probably familiar with it if you live here. We run a huge Thanos stack at Vinted for all of our monitoring needs. I've personally been a Thanos maintainer for five years and counting, so it's been quite a long journey. I'm still enjoying it, and I hope I will keep enjoying it for the next five years. I also write a blog at giedrius.blog about infrastructure, monitoring-related topics, programming, and Go, so if that interests you, please feel free to jump on it. And with me I have Michael, who will now talk about himself a little bit and do the rest of the presentation.

Hi, I'm Michael, an SRE at Aiven, the trusted open source data platform for everyone. We have a booth, go check it out. Though only for two more weeks: I decided to start a new journey at Cloudflare, which is pretty cool. I've been a Thanos maintainer for six months or so, mostly fixing the odd bug here and there and playing with the query engine, which is a fairly captivating project. But enough about us. So why did we decide to give this talk? We want to introduce you to Thanos, of course, and we figured it might be a nice angle to approach this from the point of view of query execution. We hope to give you a bird's-eye view of how data flows through Thanos while it tries to answer your queries.

Yeah, what even is a Thanos? To put it into one sentence, it's a system of microservices designed to scale Prometheus. You can deploy Thanos beside your running Prometheus to archive data into object storage and achieve very long retention: typically months or years instead of days. In fact, yesterday I learned that the original working title for the project was Prom LTS, which tells you everything. It also allows you to cluster your Prometheus deployments horizontally and still keep a single pane of glass by proxying queries and merging results. Thanos is an incubating CNCF project, which is the reason we are allowed to speak here in the first place. In addition to being conceptually closely tied to Prometheus, we also use Prometheus libraries for many tasks. In fact, here's a word cloud generated from the imports of our internal modules, and you can see Prometheus is super prominent there. So it's safe to say we love Prometheus a lot, and we hope Prometheus loves us back a little too. And since we rely so heavily on Prometheus libraries, we also naturally try to upstream the odd bug fix or any improvements that we deem necessary. So it's basically a great example of open source working properly.

I've mentioned Prometheus a lot now, so that probably warrants a short introduction, a very short one. Prometheus deals in time series, mainly identified by a name, in this case http_requests_total, and a few disambiguating labels, like the status code of your HTTP requests. You would typically use Prometheus client libraries to instrument your application, and every time you answer an HTTP request, you increment the counter.
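A minimal sketch of what that instrumentation could look like with the Go client library (the handler and the hardcoded status code here are purely illustrative):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// http_requests_total, partitioned by HTTP status code.
var httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "http_requests_total",
	Help: "Total number of handled HTTP requests.",
}, []string{"code"})

func handler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("hello"))
	// Every answered request increments the counter for its status code.
	httpRequestsTotal.WithLabelValues("200").Inc()
}

func main() {
	http.HandleFunc("/", handler)
	// Prometheus scrapes this endpoint to collect the samples.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```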
And if you now point Prometheus at your application, it will go and collect those samples and build up an internal database of time series for your querying pleasure. Prometheus does much more; it's a full-blown monitoring solution. But for the purposes of this talk, this is enough to know. So sorry to all the Prometheus aficionados in the audience if I left something important out.

You probably want to experience the same querying pleasure with Thanos too. So naturally, we implemented a microservice for that, the aptly named Querier. The purpose of the Querier, the first component of Thanos that we're going to touch on today, is naturally to evaluate queries. Queries are typically PromQL, the Prometheus query language, and the query on the screen might be the query you would issue if you wanted to count the Kubernetes nodes in your clusters, grouped by region. Once the Querier receives a query, it uses internal Prometheus libraries to parse it into a syntax tree, which also corresponds very closely to the physical plan that actually gets executed by the Prometheus engine. In this case, we would recursively evaluate this query by first looking at the sum, which would then request data from kube_node_info, which is a leaf node, so it has to go to storage to fetch data. And storage in the PromQL engine is an interface; it's abstracted away. The call you would make is called Select, and in this case, you would select all time series that have a name label with the value kube_node_info.
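Here's a minimal sketch of that Select call against Prometheus' storage abstraction (exact method signatures vary a bit between Prometheus versions):

```go
package sketch

import (
	"context"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/storage"
)

func selectKubeNodeInfo(ctx context.Context, q storage.Querier) storage.SeriesSet {
	// Match every series whose __name__ label equals "kube_node_info".
	m := labels.MustNewMatcher(labels.MatchEqual, labels.MetricName, "kube_node_info")
	// Select is the engine's single entry point into storage: Prometheus
	// backs it with the local TSDB, Thanos with a fan-out over Store APIs.
	return q.Select(ctx, false, nil, m)
}
```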
Now, Prometheus implements Select by basically going into its time series database, fetching all the series with the name kube_node_info, reading the data from memory-mapped blocks, and returning it to the engine, which can then pass it up the tree and evaluate the expression. That's pretty fast and great. But the Thanos Querier is a stateless component, so we don't have the same luxury of going to memory-mapped blocks and fetching data from disk. So we have to implement Select somewhat differently, and we do. Typically, we point the Querier at things we call Store APIs, on the right there. The Store API is a defined gRPC interface for retrieving time series data. Thanos ships with a few built-in Store APIs that make sense given its mission to scale Prometheus, but generally it's an interface; anyone could implement a new Store API if they want to. Giedrius will introduce the built-in Store APIs later on.

Now we potentially have many Store APIs to fetch data from, covering very disparate sources, so we don't have the luxury of one source of truth anymore. In fact, Prometheus is typically deployed in HA pairs, disambiguated by a replica label. And if we now fan out to both of those, we would get all the data; for example, here, kube_node_info for nodes A and B, but in two pairs, for replicas one and two. This is now problematic, right? If we send it back to the query engine and bubble it up the execution tree, we get the result four instead of the expected two, which is not great. But if you tell the Querier that it has to deal with replicated data, one of its responsibilities is to deduplicate it.

So this is great: we can answer PromQL now. But it also comes with a few problems, right? Instead of fetching data from memory-mapped blocks, we have to go over potentially slow network links and talk to many Store APIs, which can also return a fair amount of data. So your responses will be as slow as the slowest store. This alters the performance profile of Thanos to a degree that kind of justifies doing our own Thanos-specific optimizations to queries. And in fact, note that the Querier is stateless; it does nothing but answer queries. So we can do different optimizations: we don't have to worry about queries interfering with data collection, so we can throw more resources at queries and adapt our execution to that. In short, we wrote our own PromQL engine to do that.

Say we want to evaluate the query on this slide, which computes the average share of HTTP requests that returned status code 500. If we do, we parse it into the abstract syntax tree on the left, which you can see ends with two calls to storage: once for http_requests_total, all of it, and once for just status code 500. So if we did that with the scheme I described earlier, we would go to storage twice, amplifying the amount of data we fetch over the network. One optimization we can do is to fetch the data just once and do client-side filtering, and then decode the chunks in parallel. We also have plenty of other optimizations. But to be honest, there is only so much you can optimize at the query level once you are already in the Querier.

So one of our maintainers, Filip, recently dreamed up another way of evaluating PromQL using our engine, and that is the distributed mode of operation. As you can see, if you want the sum of your kube_node_info, the number of Kubernetes nodes across all of your clusters, that's distributive, right? You could just compute it in the individual clusters: instead of asking them for raw data, you just ask for partial sums and sum the sums again. And if you shard your data in a way that guarantees that the sums are not overlapping, this is actually well defined. And typically, query results are orders of magnitude smaller than the raw data, so this is very promising.
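To make the "sum of sums" idea concrete, here is a conceptual sketch; the remoteEngine interface and its Eval method are hypothetical stand-ins for the real Thanos machinery:

```go
package sketch

import "context"

type remoteEngine interface {
	// Eval executes a PromQL expression on one remote querier and returns
	// the (small) result instead of shipping the raw series data.
	Eval(ctx context.Context, query string) (float64, error)
}

// distributedSum asks every shard for its partial sum and adds them up.
// This is only well defined because the shards' data does not overlap.
func distributedSum(ctx context.Context, shards []remoteEngine) (float64, error) {
	total := 0.0
	for _, shard := range shards {
		partial, err := shard.Eval(ctx, `sum(kube_node_info)`)
		if err != nil {
			return 0, err
		}
		total += partial // sum the sums
	}
	return total, nil
}
```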
But on a slide it doesn't look like much, so I asked Filip, who runs it in production at Shopify at a very large scale, to provide us with some examples, and this is one of those. Please understand, there might be industrial espionage going on, so I had to blur out the results; that's very business-critical data. But it's basically the same kind of query I talked about, evaluated once in the Prometheus engine with the Select scheme that I introduced, and note that it returns in 15 seconds. And please understand, that's nothing against the Prometheus engine. It's great. This is only because of the way we have to implement Select over the network. So, 15 seconds. If we now take the Thanos engine and configure it properly to use the distributed mode, we can execute the same query, and it returns in two seconds. That's roughly a seven-fold speedup. So yeah, it's pretty paradigm-shifting. It's amazing. I'm really excited about it. So try it out if you can. And with that, I hand over to my lovely co-presenter to talk about the Store API implementations that ship with Thanos.

Next slide. Yeah, what's a query engine without the storage layer? So we will go through the different microservices that are part of Thanos and implement the storage layer, in order of, in my opinion, the sophistication of the different components. Let's start with the simplest one, the Sidecar. For many years, that was the most popular component. It does what it says: it's a sidecar. It literally attaches to Prometheus, uses its interfaces, and reads data from the disk to implement the Store API interface. Prometheus implements the remote read API (not to be confused with the similarly named remote-read configuration), which allows the Sidecar to read data from Prometheus over HTTP in a streamed way. The Sidecar also has a component called the Shipper. Whenever the Sidecar notices that Prometheus has produced a new block, it hard-links it so that it won't disappear. A hard link, at least on Linux (it might be different on other systems), is just another reference to an inode. This means that even if Prometheus deletes that block, it still exists on disk; it doesn't get garbage-collected. And the Sidecar implements the Store API: whenever a query comes in from a user, the Querier sends a Series request to the Sidecar, and it fetches the needed data from Prometheus. Yep, next slide.

The Ruler. So we have all of the data that we need, but Prometheus also has a subsystem which allows you to periodically execute alerting and recording rules, and the Ruler implements the same thing, but in a separate process. The Ruler has an integrated TSDB, so we are quite literally starting another TSDB in the Ruler process, and that TSDB is used for storing the results of recording and alerting rule evaluations. And of course the Ruler, just like Prometheus, can also send alerts to Alertmanager, if that's configured. And it's the same Store API interface: the Querier sends a request over gRPC, and the Ruler reads from that TSDB and sends back the data.

Now we are upping the sophistication ante, because the Receiver has more than one TSDB; it has the multi-TSDB component. About three years ago, Red Hat donated this component to the Thanos project. The Receiver component is responsible for storing data which comes in over the remote write interface. Remote write is a nifty way in Prometheus to send data from one node to another; it's really just protobuf-encoded metrics data that gets sent over HTTP. You can see in this picture that there are multiple Prometheus agents scraping metrics, and they are all sending the data they have through remote write. A cool thing about the multi-TSDB is that it also reuses the same deduplication heap. Michael didn't mention this, but whenever a Series request comes in through gRPC, the response needs to be sorted. So, for example, if someone asks for a metric called up, the Receiver has to look into all of the TSDBs it knows about; in this case there are n tenants. So it has to merge data coming from all of the tenants into one coherent sorted stream. And it reuses the same Shipper component, because it's the same TSDB as in Prometheus: whenever a new block is produced, the Shipper notices that and uploads it to remote object storage.

And last but not least, the Store component. We have all of the data in remote object storage now, but we need to fetch it back to complete the full circle, and the Store component does just that. And again, the same problem reappears, but in a different form. The Store component typically knows about hundreds and hundreds of blocks, and any of them might have the needed data. And because we want to fetch data in parallel to make it as fast as possible, this also means that we have to merge everything on the fly into one stream. So we reused the same heap.
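That k-way merge is conceptually simple. Here is a self-contained sketch of the idea with simplified types (the real code merges streams of labeled series, not strings):

```go
package sketch

import "container/heap"

type stream struct {
	series []string // sorted label-set identifiers from one TSDB or block
	pos    int
}

type mergeHeap []*stream

func (h mergeHeap) Len() int           { return len(h) }
func (h mergeHeap) Less(i, j int) bool { return h[i].series[h[i].pos] < h[j].series[h[j].pos] }
func (h mergeHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x any)        { *h = append(*h, x.(*stream)) }
func (h *mergeHeap) Pop() any {
	old := *h
	s := old[len(old)-1]
	*h = old[:len(old)-1]
	return s
}

// mergeSorted repeatedly pops the globally smallest head element, advances
// that stream, and pushes it back, yielding one coherent sorted stream.
func mergeSorted(streams []*stream) []string {
	h := mergeHeap{}
	for _, s := range streams {
		if len(s.series) > 0 {
			h = append(h, s)
		}
	}
	heap.Init(&h)
	var out []string
	for h.Len() > 0 {
		s := heap.Pop(&h).(*stream)
		out = append(out, s.series[s.pos])
		s.pos++
		if s.pos < len(s.series) {
			heap.Push(&h, s)
		}
	}
	return out
}
```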
Yeah, so we've sold you on the idea of using Thanos and told you about all of the components and the new engine, but how do you use the new engine? It's quite simple. There's a parameter you can use to change the default engine that is used whenever a request comes in through HTTP. It's also possible to use the dropdown you saw in the screenshot before to select a different engine for your query. Use the dropdown to check how your queries behave, and if everything looks good, then just use that parameter to change the default engine.

And that's not all. We also noticed an opportunity to improve the PromQL engine even more. Technically, Prometheus has functionality to expose certain statistics about query execution, but it's kind of hard to use because it's not really visible in the UI, and we thought we could improve that. So we implemented these two, in my opinion, cool features. The first one is the Explain button. It shows you the whole operator tree of the query without actually executing it. In this screenshot, you can see the same query that was used before, and it shows you the whole query execution tree, so you can use it to optimize your query without even executing it. This was inspired by the SQL EXPLAIN statement; if you're familiar with it, it does much the same thing. But there's another thing, the Analyze checkbox. The Analyze checkbox makes the PromQL engine capture extra metadata about the query execution. Right now, we only capture the wall time of each operator, and you can see there are a bunch of them in this tree. Wall time quite literally means the time that you see on the wall; it's not the time the CPU spent executing instructions, but elapsed time. And because of that, you can see that the top-level operator is taking the most time. Just taking a look at the screenshot, you can see that data fetching took the most time, and that the query engine itself is quite fast: it only took 0.12 seconds to do all of its calculations.

So yeah, should you use Thanos? Yes, you should. There are lots of things that we still want to implement; of course, nothing is perfect, just like in life. So what can you do? We would like to invite you to try it out. You can also join the Thanos community on the CNCF Slack; I think there are more than 5,000 people on it right now. There you can connect with like-minded people, help others, ask questions if you have any, and contribute. We welcome any kind of contribution, whether it's code or documentation. We are a friendly community. That's it. You can scan the QR code to provide us with feedback. If you have any questions, please approach the microphones and ask.

Hello. Thank you for the great presentation. It was very interesting to see how this actually looks from the inside. I've got a question about the deduplication. It always seemed like magic to me that something is retrieving data from different instances, which might have slightly different timestamps but still the same labels, which is what allows the deduplication to happen. How is the question of timestamp drift resolved in the deduplication? Can you elaborate on that a bit?

Do you want to take it? So yeah, that happens. But if there is more than one replica of the same data, I think it first chooses the one with the newest data, and then it is based on gaps. So, for example, we are iterating through data from multiple replicas, and if there is a gap, we switch to the other replica.
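Roughly, that gap-based switching works like this sketch (a simplification; the actual Thanos iterator uses a penalty scheme so it doesn't flip-flop between replicas on every sample):

```go
package sketch

type sample struct {
	t int64   // timestamp in milliseconds
	v float64 // sample value
}

// dedup follows the primary replica and only takes samples from the
// secondary while the primary leaves a gap larger than maxGap.
func dedup(primary, secondary []sample, maxGap int64) []sample {
	var out []sample
	last := int64(-1 << 62) // timestamp of the last emitted sample
	j := 0
	for i := 0; i < len(primary); {
		// Skip secondary samples we have already covered.
		for j < len(secondary) && secondary[j].t <= last {
			j++
		}
		// Gap in the primary: fill from the secondary replica.
		if j < len(secondary) && primary[i].t-last > maxGap && secondary[j].t < primary[i].t {
			out = append(out, secondary[j])
			last = secondary[j].t
			j++
			continue
		}
		if primary[i].t > last { // drop near-duplicate timestamps
			out = append(out, primary[i])
			last = primary[i].t
		}
		i++
	}
	// Primary exhausted: take any newer samples the secondary still has.
	for ; j < len(secondary); j++ {
		if secondary[j].t > last {
			out = append(out, secondary[j])
			last = secondary[j].t
		}
	}
	return out
}
```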
But overall, this timestamp drift problem is real, and that's why I think it's recommended to use the Receivers. Because with Receivers, you can scrape once and then make a copy of the same data. And it is actually much more efficient, because in the Querier we check whether it's an identical copy of the data, and if an identical copy of the data is coming from multiple Receiver instances, we don't decode it twice or thrice. So I would recommend you use the Receivers. Thank you.

Hi, thanks for your presentation. It was really nice to follow you. I have two questions. My first question is about the Thanos PromQL engine: is it stable, or is it still in an experimental state?

It's pretty much stable; you can just use it. The thing with implementing your own PromQL engine is that it's fairly easy to test for correctness: you just execute the same query against the upstream engine and against your engine, and you check that the results match. So actually, one of our maintainers implemented a fuzzer that generates random queries and random data, executes 10,000 random queries on that data, and compares the results on every pull request. And then we also have thousands, I think, of acceptance tests. So at this point, I'm fairly sure it's correct. And people are also using it in production: I'm fairly sure Shopify uses it for the distributed mode. You can just use it. And we use it at Vinted. I've also used it.

OK. My second question is, I saw the analyzer, which I personally feel is really helpful for finding issues in PromQL queries. Is this feature also available for the Ruler? I have, for example, 500 rules; I'm not able to go by hand, one by one, through each query to find performance issues. So are there statistics, or even metrics, available about the rule execution for each rule, so that if there's a long-running query, I can manually dig deeper into it?

Yeah, that's at least how we approach it. There are metrics about the queries, like the execution duration, that come from Prometheus itself, and because we import the same libraries, it's the same rule manager, so it exposes the same metrics. So what we do is take a look at, for example, the top 10 rules that take the longest to execute, and then go to the Querier, execute them manually, and see what's happening and why they take so long. Thanks.

Thank you for your work on the Thanos project. And the question is, there is one more component of Thanos, the Thanos PromQL Connector, yes?

There are some more components, actually. There is the Compactor and the Query Frontend.

But it sounds like the Query Frontend is directly involved in the query path.

It is, but it's more of a proxy to downstream Query APIs, and it does some caching, so we chose to leave them out. Yes, sorry, so the question was: does this distributed querying require some changes in the Store API, and will it work with the Thanos PromQL Connector? The Store API is the same; nothing changes in it. The distributed mode works on the Querier side, as was said. Actually, I think the distributed querier doesn't use the Store API at all. There is another API implemented, the Query API, a Query RPC, and the downstream queriers implement that Query API. That's the way you distribute queries and fetch the results back. So it's not actually using the Store API.
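Schematically, the contrast between the two surfaces looks something like this (hypothetical, simplified Go shapes; the real definitions are protobuf services and differ in detail):

```go
package sketch

type LabelMatcher struct{ Name, Value string }

type Sample struct {
	T int64
	V float64
}

type RawSeries struct {
	Labels  map[string]string
	Samples []Sample
}

// StoreAPI ships raw, sorted series data; the central querier still has
// to evaluate the PromQL expression over it.
type StoreAPI interface {
	Series(matchers []LabelMatcher) ([]RawSeries, error)
}

// QueryAPI evaluates a whole PromQL expression remotely and returns only
// the (much smaller) result, which is what the distributed mode fans out to.
type QueryAPI interface {
	Query(expr string, tsMillis int64) ([]Sample, error)
}
```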
I think there was a presentation during ThanosCon yesterday by Bartek about that connector, and they showed a demo of the distributed mode working with the connector, which fetches the response to the query from Google-managed Prometheus, I believe. So the distributed mode works with that. Thank you.

Again, thank you for the presentation. You showed this example where you parallelized the query based on, what was it, the region label. How does the engine know which labels it can use to split the query up so that it can be rejoined back safely? Does it know that magically, or do I need to tell it?

So if you point the query engine at downstream queriers, they also expose an Info RPC, right? And through that Info API, it can know which external labels your queriers have. Then the engine can shard on that: there is an analyzer going over the query and sharding it with knowledge of the labels that the downstream queriers expose. That's the approximate mechanism.

So there's nothing I need to configure, like this is this data center, this is this Kubernetes cluster, this is this region? I just need to point it at remote queriers and the rest is automatic?

Yes, though you need to make sure that your data is actually sharded well. If these are actually physically separate clusters, that's perfectly fine. But you can probably build weird situations where the data is overlapping, and then it wouldn't be well defined anymore. For a region, for example, it's fairly safe to assume that the data is sharded properly.

Hi. Thank you for the presentation. I have a question maybe not about the presentation, but on the Compactor: can we expect native histogram support soon? Sorry to put you on the spot.

We accept pull requests. Thank you.

Yeah, hi. Also, thank you for your presentation. I have a question about the Ruler component, actually. If I understood it correctly, the Ruler actually does the evaluation of the rules and the computation, which in Prometheus, Prometheus does by itself? So if I have, now I'm missing the word, very CPU-expensive rules, does the Ruler component do that work in Thanos now?

No, you still need at least one Querier, which actually does that computation. The Ruler uses the query HTTP API to send queries to some set of Querier instances.

OK, thank you.

I missed the presentation yesterday. So how do we enable the distributed querier? And is it released, or is it going to be in the next release?

Thanks for asking. It was in the last release, I think, right? Yeah, you can just toggle a flag on the Querier. It's just a flag, and then the magic will happen. Well, you also have to point it at queriers, so you have to set up the topology in a conducive way. But once you do that, you just need to enable a flag. I think it's query.distributed-mode or something, or query mode equals distributed, I don't know; it's in the documentation. But it's released.

Perfect, thank you.

The key part is that you have to have multiple layers of queriers, because one layer needs to do the deduplication; otherwise, it's not possible to know which Prometheus instance, for example, has gaps in its data. You have to do the deduplication at that layer. That's all. Thank you for coming. Thank you.