Okay, so thank you everyone for coming to the session. As Raric mentioned, we wanted to walk through a new feature that helps Thanos not only become faster at query execution, but also helps with general interoperability with the rest of the ecosystem. Just a brief round of intros: my name is Philip, I'm a production engineer at Shopify, where I work on everything related to our metrics stack. I've also been a Thanos maintainer for a couple of years now, and I'm currently based in Munich, Germany. And we also have Max with us today. Hello everybody, my name is Hamoud Amin, I also go by Max. I'm very excited to be here; this is actually my first ever talk at a conference. A little bit about me: I'm a software engineer at Google, I was an open source mentor on a couple of projects, you can find me on GitHub at Max Amin's, and I'm based in New York. All right. So I just wanted to build a bit of context for people that maybe missed the first couple of talks or are not very familiar with Thanos. By now I would expect most people to be familiar with Prometheus, the standalone monitoring server that we deploy close to our applications. Prometheus scrapes applications directly, stores the data on disk, and has a very flexible and interesting query language called PromQL, which is going to be important for what we're talking about today. Prometheus shines at the use case of monitoring a single environment. The problem we typically see is that when we go beyond a single environment, as we've seen a couple of times today, it becomes hard to deploy these Prometheus instances, hard to have a global overview of the data, hard to retain that data for a long time, and so on. Prometheus kind of intentionally left this gap; it was never intended to solve this problem. 
And for this reason, we have many open source projects today trying to solve this problem of having a highly available setup of Prometheus that also allows us to retain data for a long time. Thanos is obviously one of them, but in the same space we also have Cortex, Mimir, and VictoriaMetrics. What's also very exciting and interesting is that if we take a look at the big three cloud providers, all of them have a managed Prometheus offering today. So we can send data to all three cloud providers via remote write, and all of them allow us to query that data using PromQL. Even though this is a good development for the ecosystem, and it's very nice to see various competing solutions, I would say it also creates a bit of fragmentation, where we have to commit to one solution fully, and it's hard to use multiple solutions in a single system at the same time. What I mean by that is we typically don't hear of people using both Thanos and Cortex, or Thanos and Mimir, as part of the same solution today. And it kind of puts us in a bad spot where we have to fragment our data. In addition to that, cloud providers, for example, will give you some metrics for free. If you use Google Cloud or Amazon, you will get metrics for their services, and you might even be able to query those metrics with PromQL. But as soon as you want to combine those metrics with some local data that you have in Prometheus or Thanos, it's going to be hard to put the data in a single query. So we still have to maintain two different systems at the same time. What we would ultimately want is to be able to consolidate data across all of these systems using PromQL, because all of them, as a matter of fact, support PromQL. And if you take a look at how Thanos works, Thanos already does federation at some level: the Thanos querier can query multiple data stores, multiple Store APIs. 
It can query Prometheus through the sidecar, it can get data from object storage through the store gateway, and so on. So then the question is: is solving this problem just a matter of implementing a custom store component that acts as a proxy to the various systems that you want to query? That is a possible solution. The way that we execute queries today is that we take an expression like this one as an example and break it down into what we call query fragments. The query execution flows from top to bottom: the sum operation asks for data from the rate, which asks for data from the selector. And once we get to the selector, we have to fan out to all of the different store components that we're connected to. Unfortunately, what we typically see is that in this process, the actual data retrieval is the most expensive part of the query. As I think Colin mentioned, latency can be a pretty big bottleneck there. But there's also the fact that Thanos can now execute queries in a multi-threaded way; we can use multiple cores to execute queries. So we have many more opportunities to shrink this blue bar, whereas for the green bar, we don't have that many options. This is typically how we visualize time series data, and as this data set grows vertically, which means as we add cardinality, or horizontally, as we extend the time range, that green bar is also going to grow. But there's also another problem, which is that certain cloud providers are not going to give you data in the format that the querier expects. A lot of this data is kind of locked away in the cloud, and that is not a problem that we can just solve with engineering. So to conclude, writing a custom store component is a possibility, but I would say it's not a very practical solution. So before we take a look at what a good solution is, I want to talk about a very exciting feature that just merged and is available on the main branch. 
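To make that cost picture concrete, here is a minimal sketch, in Python rather than Thanos's actual Go code, of central execution of a query like sum(rate(metric[5m])): every store ships raw samples to one querier, which does all the math itself. The stores, sample values, and the crude rate function are all made up for illustration.

```python
# Toy sketch of CENTRAL query execution (not Thanos source code):
# the single querier pulls raw samples from every store, then computes
# rate() and sum() itself. The data-retrieval step is the expensive part.

def fetch_raw_series(store):
    # Stand-in for a Store API fan-out; in reality this is a network call
    # that ships every raw sample to the central querier.
    return store["samples"]

def crude_rate(samples, window_seconds=300):
    # Very rough per-series rate: (last - first) / window.
    return (samples[-1] - samples[0]) / window_seconds

def central_sum_of_rates(stores):
    # ALL raw data is centralized before any aggregation happens.
    all_series = [fetch_raw_series(s) for s in stores]
    return sum(crude_rate(samples) for samples in all_series)

stores = [
    {"samples": [100, 400]},  # e.g. a sidecar in region A
    {"samples": [50, 350]},   # e.g. a store gateway in region B
]
print(central_sum_of_rates(stores))  # (300/300) + (300/300) = 2.0
```

The green bar in the slide corresponds to the `fetch_raw_series` fan-out: it grows with both cardinality and time range, no matter how fast the local computation gets.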
It will be coming up in the next release. As the name says, distributed query execution allows us to execute queries not only using multiple cores, but also using multiple machines at the same time. So we can use multiple machines to execute a single query. What we have here is a setup with a physical partitioning of the data. This can be an example of one region running Prometheus and a store gateway, and another region running the same setup, both of which have queriers connected. We would just connect them using the standard flags that we use today. Up to now, if we connected a third querier on top, that querier would have to request raw data from all of the downstream stores. With distributed execution, we can tell that querier to decompose queries into separate fragments and delegate those fragments to the lower-level queriers, which are actually going to do the work: not only faster, because they're working on a smaller dataset, but they're also closer to the actual data. They're going to crunch the aggregation only on their part of the data, return the result of that aggregation upwards, and that aggregation will then be completed by the root querier. So it's very similar to a map-reduce framework. The reason why this is so powerful is that most aggregations in PromQL today can in fact be broken down in this way. Even something as expensive as topk can be done as a topk over topks. Count can be done as a sum of counts, and so on. We can also use this technique if you have a logical partitioning of the data. With the Thanos receiver, we can have a multi-tenant setup: the receiver can accept data from multiple clusters and, for example, assign each cluster its own label. Then we can connect, let's say, a querier on the left-hand side and ask it to only query data whose external cluster label starts with a given prefix. 
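The map-reduce idea can be sketched like this. It's a toy illustration, with plain Python lists standing in for per-region series rather than the real Thanos engine, of why sum, count, and topk all decompose into cheap partial aggregations:

```python
# Hedged sketch of DISTRIBUTED aggregation: each leaf querier computes a
# small partial aggregate over its own partition, and the root querier
# only merges those partial results. Not the actual Thanos engine.

import heapq

def distributed_sum(partitions):
    # sum is a "sum of sums": each partition returns its local sum.
    partial = [sum(p) for p in partitions]   # done close to the data
    return sum(partial)                      # tiny merge at the root

def distributed_count(partitions):
    # count is a "sum of counts".
    return sum(len(p) for p in partitions)

def distributed_topk(partitions, k):
    # topk is a "topk over topks": each partition pre-filters to its own
    # local top k, and the root takes the top k of that small union.
    local_topks = [heapq.nlargest(k, p) for p in partitions]
    merged = [v for top in local_topks for v in top]
    return heapq.nlargest(k, merged)

region_a = [5, 1, 9]
region_b = [7, 3, 2]
print(distributed_sum([region_a, region_b]))      # 27
print(distributed_count([region_a, region_b]))    # 6
print(distributed_topk([region_a, region_b], 2))  # [9, 7]
```

The key property is that only the partial results (a single number, or at most k values, per partition) cross the network, instead of every raw sample.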
On the right-hand side, we can do the same with this querier: ask it to query data whose cluster label starts with another prefix. And then we can connect a distributed querier on top that's going to delegate queries to these lower-level queriers, each of which only works on a subset of the data. Just to illustrate the impact of this feature: internally in our team, we track five or six queries that we run constantly. We track them as SLIs, basically. This graph tries to visualize what happened to the latencies of those queries once we moved to distributed execution, and it's what you would expect. Since we are able to evaluate queries closer to the data and also partition them across machines, the latency for certain queries has dropped, in some cases by up to 10 times. So with that being said, I'm going to hand off to Max now to tell us how, using this feature, we can get to the interoperability story that we initially talked about. So now we've reached the part of the talk where we discuss connecting Thanos queriers to external components, or, as the title of our talk says, connecting Thanos to the outer run. So, Philip highlighted all the advantages of distributed queries and how they allow you to have a much more scalable Thanos architecture. So what's next? This presents us with an opportunity to do something amazing, because there are a lot of data sources out there that have PromQL support. What if we leverage the existing Thanos querier engine and apply it to other data sources? This would mean you could theoretically query all of these data sources with your Thanos querier and view them in a single view, revolutionizing how you view your metrics. One of these data sources that we'll focus on is Google Cloud Managed Service for Prometheus, or GMP for short. 
So a quick overview of GMP: it's a managed Prometheus experience. It deploys the Prometheus instances and scales them for you, and it ingests your metrics and stores them in Monarch. But the part we care about, the best part, is that it allows you to query all your metrics via a global PromQL endpoint, which is exactly what we're looking for in this case. And that's why we built the Thanos PromQL connector. As the name suggests, it connects the Thanos querier to a PromQL endpoint: it lives in the middle between your Thanos querier and a PromQL endpoint and converts requests between the two. It's open source, it's a couple hundred lines of code, very lightweight. So feel free to test it out, play with it, make it your own, add new features to it. Contributions are welcome. So what is required to use it? You just need a PromQL endpoint to point it at, and maybe some auth credentials. In GMP's case, for Google Cloud, you just need your project ID and your service account key, and you can plug that into your Docker Compose file if you already use that, and it should just start working immediately alongside all your other queriers. So this is the view Philip shared: distributed queries and all their advantages of pushing work down. Now you can add another data source, for example GMP, and delegate queries to it. The PromQL connector is viewed by the querier the same way it views every other querier, your PromQL queries will be distributed to it the same exact way, and you get the results in a single view. All right, let me show you a demo. Let me just mirror my display. The first thing I'm going to show you is Google Cloud itself: a metric in Google Cloud that you could previously only view in Google Cloud, and this is a Google Cloud Storage metric. So you see this metric, it's just a bunch of buckets; it has an S shape, and as you can see, it's at about 156 megabytes. 
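To give a feel for what "converting requests between the two" means, here is a hedged Python sketch of the translation step: building a Prometheus-compatible instant-query request that a connector could forward to the remote PromQL endpoint. The base URL and bearer token below are hypothetical placeholders, and the real connector is its own project, so treat this as an illustration of the idea, not its implementation.

```python
# Sketch of the request translation a PromQL connector performs: take a
# query the Thanos querier wants evaluated and turn it into a
# Prometheus-compatible HTTP instant-query call (/api/v1/query) against
# the remote endpoint. URL and token are made-up placeholders.

def build_promql_request(base_url, query, time_unix, bearer_token=None):
    # Returns (url, params, headers); the caller would send this over
    # HTTP and relay the response back to the Thanos querier.
    url = f"{base_url.rstrip('/')}/api/v1/query"
    params = {"query": query, "time": str(time_unix)}
    headers = {}
    if bearer_token:
        headers["Authorization"] = f"Bearer {bearer_token}"
    return url, params, headers

url, params, headers = build_promql_request(
    "https://example.invalid/prometheus",  # hypothetical PromQL endpoint
    "sum by (cluster) (up)",
    1700000000,
    bearer_token="service-account-token",  # stand-in for real credentials
)
print(url)  # https://example.invalid/prometheus/api/v1/query
```

Because every backend mentioned in the talk speaks this same HTTP query API, the same translation works whether the remote side is GMP or any other PromQL-compatible service.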
So you can copy that same query and put it into Thanos Query, and then let's extend the dates; I think I put two days on the Google Cloud side. Yeah, so you have the same S shape and 156, 157 megabytes. So for the first time ever, you can see your Google Cloud metrics in Thanos Query. All right, let's talk about the next thing, which is the flow I just showed you. Here you can see our Thanos querier is connected to two different Prometheus instances, and over here you can see it's connected to our GMP querier. So you can query all three at the same time. Let's query, for example, up. You can see I have one GMP cluster and two Prometheus instances. Let's sum it by cluster. So yeah, you can see all three: you're connected to a remote endpoint for GMP, and you have two local Prometheus instances. Let's look at a more interesting metric: let's look at HTTP requests, take a rate, and sum it by cluster. So yeah, oops. You can see all your HTTP requests from all your clusters, and the cool part is you can actually remove the by cluster right here and have a global view of what all your HTTP requests look like across all your servers. You could then use that to analyze, for example, error codes, to see which servers are giving you errors from a high-level overview. The next thing I want to show you is this Explain tab. You can click Explain, and it shows you how the queries are delegated. You can see there are three different queries that were delegated: two to our local Prometheus clusters and one to our GMP endpoint in Google Cloud. And obviously, a future improvement would be to replace the generic remote label with the exact remote endpoint the query was sent to. The next thing you can do is click Analyze, and it shows you how long each of your queries took. 
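The two views from the demo, summing per cluster versus dropping the grouping label for one global series, can be mimicked with a toy sketch. The series values and cluster names below are made up, not the demo's real metrics:

```python
# Toy version of the demo's two views: "sum by (cluster)" keeps one
# result per cluster, while dropping the grouping label collapses
# everything (remote and local sources alike) into one global number.

from collections import defaultdict

def sum_by(series, label):
    # series: (labels, value) pairs already merged from all endpoints.
    out = defaultdict(float)
    for labels, value in series:
        out[labels.get(label, "")] += value
    return dict(out)

def global_sum(series):
    # The "remove by (cluster)" view: one number across every source.
    return sum(value for _, value in series)

merged = [
    ({"cluster": "gmp"},       4.0),  # remote GMP endpoint
    ({"cluster": "prom-east"}, 2.0),  # local Prometheus #1
    ({"cluster": "prom-west"}, 3.0),  # local Prometheus #2
]
print(sum_by(merged, "cluster"))
print(global_sum(merged))  # 9.0
```

The point of the demo is that the querier performs exactly this merge for you, without caring whether a series came from a local Prometheus or a remote cloud endpoint.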
So you can use it, as an SRE, for a top-down view to see which of your remote endpoints are struggling or not performing as expected. Yeah, that's it for the demo. Let me show you the conclusion. All right, we've reached the end of our talk, so real quick, I'm going to summarize everything we went over. You can query all your data in a uniform way: you have a global view of all your data from all your different instances and data sources. You have more control over your data, and this is available to use today. So feel free to try it out and test it, let us know if you find any bugs, and feel free to contribute. And as the image here alludes, this could be the future of Thanos Query: you could use many different data sources and be data source agnostic. Thank you for coming to our talk, we really appreciate it. If you have any questions, let us know, or feel free to stop us in the hall. Thank you. I think we should have time for a few questions. I might have missed this, but is there an external label settable for these other endpoints so that you can control fan-out? Yeah, so in the connector, I think we'll want to add external labels so that you can, as mentioned, control the fan-out. And in the distributed query mode, we already take external labels into account, so that will also work, yeah. Thank you for your presentation. There's a similar older project called Promxy, and I was thinking a bunch about this problem. While you were migrating to the distributed mode, I know there are a bunch of tests, but maybe you did some extra correctness tests, checking whether the results still return the correct response with distributed mode on versus off? 
Yeah, so obviously in the beginning there were some subtle issues, subtle bugs, but by now we've used it in production for a long time and we haven't seen any problems. Today, I think we should have squashed even the most complicated bugs, but obviously there can always be something we haven't taken into account. This is why, in the UI, you can choose between the Prometheus and Thanos engines, and if you suspect that a query is not showing you what you want, you can try it with the Prometheus engine, which is going to do the standard fan-out. Amazing, all right, we have another question. Thank you for the presentation. Is it possible to avoid that service account key file, I mean, to connect to GCP using Workload Identity, for example? Maybe you should take the answer for this one. So is your question whether it's possible to avoid using a service account to query Google Cloud? The service account key file, the JSON one. Yeah, as far as I know, you need a key file for Google Cloud. But if you're using other data sources, you can... Yeah, I want to avoid it by using, for example, Workload Identity. Okay, I can answer this. Yeah, so essentially there is this Workload Identity mechanism where, if you deploy, for example, the Thanos PromQL Connector in Google Cloud, on GKE, there is a way to automatically authenticate with the default permissions to access metrics. It's a simple open source proxy right now, so that's not possible yet, but we might contribute it. So please feel free to open an issue on the project saying that you want that, and we can definitely do that for Google. But we want to have this proxy available for other cloud providers as well, to connect Thanos to multiple things. So we'll for sure accept contributions to add authentication for those vendors too. 
So it's a good question; yeah, it could be more automatic. Maybe one last question. You talked about a lot of operations that are sort of homomorphic: you can express a sum as a sum of sums. What happens with operations where you can't do that? An average, for example, can't be expressed as an average of averages. So for the average, we transform it into a sum and a count: the sum is done globally, the count is done globally, and then we divide the two. The only operation where we cannot do this is quantile, because for a quantile you need to get the data into one central place. There we still have to pay the cost of centralizing data. But if you apply an aggregation before applying the quantile, then that aggregation is going to get distributed. So if you do a quantile over a sum, the sum will be distributed and the quantile will be done centrally. Nice, you wanted to catch us out, right? On some bugs, yeah. So I think it's part of this PromQL engine work where you really deconstruct the query into an AST, and yeah, we thought about those edge cases. So it's a good question. Okay, thank you so much to our speakers.
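That last answer can be checked with a small sketch: the average distributes as a global sum over a global count, while a quantile of per-partition quantiles generally differs from the true global quantile. Toy numbers, not PromQL:

```python
# Toy demonstration: avg distributes (sum of sums / sum of counts),
# but a quantile of per-partition quantiles is NOT the global quantile.

import statistics

def distributed_avg(partitions):
    total = sum(sum(p) for p in partitions)  # sum done as a sum of sums
    count = sum(len(p) for p in partitions)  # count done as a sum of counts
    return total / count

parts = [[1, 2, 3], [10, 20, 30, 40]]
flat = [x for p in parts for x in p]

# The distributed rewrite of avg matches the central computation exactly.
assert distributed_avg(parts) == statistics.mean(flat)

# Median (a 0.5-quantile) of per-partition medians is 13.5, but the true
# global median is 10 -- so quantiles must be evaluated centrally.
median_of_medians = statistics.median([statistics.median(p) for p in parts])
print(median_of_medians, statistics.median(flat))  # 13.5 10
```

This is exactly why the engine rewrites avg but falls back to centralizing data for quantile, distributing only whatever inner aggregation the query applies first.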