Thank you. Okay, I can hear; my mic should work. So welcome, everyone, and welcome to our Thanos introduction, slash some trends, slash some updates and live demo presentation. We are super excited to be here and really meet all the people, sometimes in person: I've met Ben in person for the first time as well, even though we've been working together in the community for almost two years. So it's amazing to be in person here. Welcome. And I have Ben with me today.

Thank you, Bartek. So this is Ben. I'm a software development engineer at AWS, and I'm one of the maintainers of the Thanos project. I'm also a contributor to some CNCF projects like Cortex, Prometheus, Argo CD, etc. And I have a puppy.

Thank you. Yeah, my name is Bartłomiej Płotka; you can call me Bartek. I'm a principal software engineer at Red Hat. I'm a maintainer of various projects, including Prometheus and Thanos, and lots of Go projects as well, and I'm active in the CNCF. I also recently wrote a book. It's finished and is being published in November, so pretty soon; you can order it now. It's "Efficient Go", about, you know, how to write efficient Go, but really it's also language-agnostic: how to have good observability-driven practices towards better efficiency of your solutions.

But yeah, let's really start with just introducing Thanos. Maybe you know what Thanos is, maybe not. Please raise a hand if you know what the Thanos project is. Wow, actually amazing. So maybe this will be just a reiteration, or maybe you will learn something new. I will let Ben introduce Thanos right now.

Thanks, Bartek. To introduce Thanos, let's start with the story of Prometheus first. Prometheus is a monitoring system that is mainly inspired by Google's Borgmon.
So with Prometheus we are providing a highly reliable, easy-to-operate monitoring solution, and it's just a single binary. But it provides many powerful features, like data scraping, querying using the very flexible PromQL, and alerting. But Prometheus comes with its own problems, because it somewhat lacks scalability and high availability, and this brings us to Thanos, which is a distributed Prometheus. Thanos is a more highly and horizontally scalable Prometheus: it provides a global view for querying multiple Prometheus instances at the same time. Instead of using a local disk like Prometheus, it uses object storage like S3 to store long-term data, and it also offers some other features like downsampling and multi-tenancy.

Then let's speak about the Thanos community. Thanos is now a CNCF incubating project. It was open sourced by Improbable in 2017, and Thanos has a vibrant community: we now have over 490 contributors and over 3,700 users on our Slack channel, and we also have over 11k GitHub stars. We have many adopters, and a lot of users are quite happy with the Thanos project. So if your company is also using Thanos right now and is not in our adopters list, feel free to open a PR upstream; we are really happy to have you.

After talking about the Thanos community, let's do a deep dive on some Thanos internals. Let's get back to Prometheus first. I just mentioned that Prometheus is just a single binary, but it has multiple components providing some essential functionalities.
For example, we have the query engine, the scrape engine, an engine for rule and alert evaluation, and some TSDB components like the compactor.

So in Thanos, we want to scale out on the query path first, so we extract the query engine into a separate component called the Thanos Querier, so that we can make it horizontally scalable. Next, for the scrape engine, we keep using the original Prometheus one, but in this case we deploy a Thanos Sidecar component next to each Prometheus, so that we can get a global query view. And in order to scale out the rule and alert engine, we have the Thanos Ruler.

So as you can see, with these components I just mentioned, we get the simplest Thanos deployment model, which is called the sidecar deployment. With this model, we can have highly available Prometheus and also a global view for querying and alerting.

And actually we can extend this model further. One issue with Prometheus is that it uses local disk to store time series data. Although local disk works for most use cases, it has some issues when we want to store data with a very long retention period, and in this case object storage is usually a better solution, because it's elastically scalable, so we don't need to worry about the size, and it's also much, much cheaper. So we have the Thanos Sidecar component, and it can upload TSDB blocks to object storage every two hours. In order to query the data from object storage, we have a new component called the Store Gateway, and to make long-term queries more efficient, we have a component called the Thanos Compactor, which improves query performance by merging blocks together.

There is another mode, which is sidecar-less, or as we usually call it, receiver mode. In this mode, you can have your metrics collector push metrics to Thanos Receivers using the remote write protocol. This could be useful if your network topology has some limitations around the pull model, or if you just want to have a central place to store or
query your metrics. So that's the introduction to Thanos, and next I will hand it over to Bartek to talk about the recent trend of running observability as a service.

Thank you, Ben. Yeah, so, you know, with those components Ben explained, we have a very flexible model in Thanos: you can use it in whatever way fits your use cases and your needs. However, with all the users using it in different scenarios, we saw a certain pattern which motivated us to prioritize certain features, and I would like to talk about that. What it can really be called is observability as a service.

What I mean is that, in the past, when maybe there was no Thanos, you used to solve your monitoring needs with just Prometheus, and it was probably solving the majority of your cases. You generally put Prometheus next to your processes, next to the real workloads, and you have built-in alerting and querying and dashboarding, and this rich ecosystem can solve many, many of your monitoring needs.

However, with the cloud native community bringing easier ways to install and distribute Kubernetes clusters, we tend to treat clusters as cattle, not as pets. So we have more of them, and in this case you might want to scale this solution into something that supports multiple clusters. This is what Thanos was created for: to allow you to perform queries against distributed storage. So, for example, you could keep your data in your Prometheus instances and still use a Thanos Querier, maybe in a remote location, to have something we call the global view.
So essentially, we are able to aggregate data from multiple sources. However, if you are maybe a power user, or you have certain use cases, you might want to opt in to something called receiver mode, where we try to offload data from the Prometheus instances as soon as possible: as soon as Prometheus scrapes the data, it sends it to a remote location. So the client side, the cluster side, is cheap and as simple as possible, and we have standards and protocols to do so. Then you have a somewhat more complex architecture on the central cluster side, but it's only one cluster that has this complexity, and you can put all the data there and do your alerting and monitoring from that place. Thanos allows that as well if you want.

With this approach, we saw that it got popular, in the sense that many users switched, or run hybrid solutions that have both this pull model with the sidecar as well as the receiver model. However, we've seen a pattern where companies and large organizations try to compose this cluster-side observability into an abstraction, into a kind of cloud, right?
A cloud where, from the user's perspective, the teams, maybe the developers or SREs who are managing your applications, don't need to know exactly whether you are using Thanos or Cortex or some vendor. As long as they use maybe Prometheus or OpenTelemetry or some kind of standard, you can totally switch the backend, and they just stream the data into this cloud and expect it to have alerting and to support PromQL queries. Your organization might find it much more convenient to run applications this way, because those teams don't have to understand how the observability necessarily has to work.

So this is very popular nowadays, and we acknowledge that as a Thanos team, and we want to prioritize certain aspects of the system that make this easier. We can categorize three things.

First, multi-tenancy with isolation and quality of service. In the end you want to have maybe a dedicated observability team that is focused just on maintaining and operating the system, and that means you probably have different users from different teams that don't necessarily need to see each other's metrics. Or perhaps you have a team whose metrics are somehow sensitive, so you don't want to leak those and let other, non-admin users see that data. So multi-tenancy is a very important characteristic of such a system.

Reliability is another part. If you want to have a dedicated service, usually in an organization you end up building a lot of services that depend on this observability, so if you want to alert from that place, it has to have a lot of nines in your SLO. It means it has to be reliable.
So that's another part we are trying to prioritize as a team, as a community.

Finally, scalability and efficiency, right? We can tell you from our experience: once you start this idea, hey, you have this service, you can just add a tenant, send some metrics and then query them, there are so many internal teams that start to want to use it. We have that at Red Hat as well: immediately they would like to push billions of series, because it's internal, they don't pay for it, it's the same company. So you end up at very high scale very quickly, and we want to make sure this solution is efficient and cheap to run, and that Thanos fulfills those use cases. So let's spend a little bit of time going through various recent improvements in this area. I will let Ben mention some of those.

Yeah, thanks, Bartek. So let me give some updates about the recent improvements we made to Thanos. The first thing I want to talk about is called bucket prefix. This is one of the oldest feature requests in the Thanos community, since, I think, 2018. With a custom bucket prefix, we can have bucket-level multi-tenancy: each tenant can have their own data under its own bucket prefix, so we only need one bucket to hold all the data. And thanks to the awesome Thanos community's contribution, we finally have this feature landed in the v0.28.0 release. Enabling it is very simple: we just need to add one more line to the object storage configuration file, as the slide shows.
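Since the slide itself isn't reproduced in this transcript, here is a hedged sketch of what such an object storage configuration might look like. The `type`/`config`/`prefix` layout follows the Thanos object storage configuration format, while the bucket name, endpoint, and prefix value are purely illustrative:

```yaml
type: S3
config:
  bucket: "observability-metrics"        # one shared bucket (illustrative name)
  endpoint: "s3.us-east-1.amazonaws.com" # illustrative endpoint
# The one extra line: every block this tenant uploads or reads
# lives under this key prefix inside the shared bucket.
prefix: "tenant-a"
```

With a per-tenant prefix like this, all of a tenant's blocks land under `tenant-a/` in the shared bucket, instead of requiring one bucket per tenant.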
So, yeah, it's just one line of configuration.

The next feature I want to talk about is rate limiting. As Bartek just mentioned, we are running Thanos in an observability-as-a-service mode, and that usually comes with multi-tenancy, right? And when we talk about multi-tenancy, we usually mean soft tenancy, because we want to reduce costs, we want to save money. But soft tenancy also means all our tenants are going to share the same physical clusters. So if we have some abusive tenants, or tenants sending a huge amount of traffic, then to protect our service and our other tenants, and to maintain our quality of service, rate limiting is really, really important. And it's not just rate limiting; it actually applies to limits in general.

For the limits improvements we recently added to Thanos, the first one I want to mention is remote write limits. We added four limits here: the request size, the number of series and the number of samples per remote write request, and the maximum number of concurrent requests.

Another interesting limit I want to talk about is the active series limit. Why is active series an important thing to limit? In Prometheus's TSDB, active series means the number of series in the TSDB's head block, and they are all in memory. The more active series we have, the more memory we are going to use. So we want to limit this to avoid high cardinality and to avoid our service being OOM-killed. We could scale up to a bigger machine with more memory, but that's not cost efficient, right?
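Going back to the remote write limits for a moment, a per-request check of that kind could be sketched like this. The struct and names below are hypothetical illustrations, not the actual Thanos configuration or API:

```go
package main

import (
	"errors"
	"fmt"
)

// WriteLimits mirrors the four remote-write limits mentioned above.
// Field names are illustrative, not Thanos's real flags.
type WriteLimits struct {
	MaxBodySizeBytes int
	MaxSeries        int
	MaxSamples       int
	MaxConcurrency   int
}

// CheckRequest rejects a remote-write request that exceeds any limit.
func (l WriteLimits) CheckRequest(bodyBytes, series, samples, inflight int) error {
	switch {
	case bodyBytes > l.MaxBodySizeBytes:
		return errors.New("request body too large")
	case series > l.MaxSeries:
		return errors.New("too many series in request")
	case samples > l.MaxSamples:
		return errors.New("too many samples in request")
	case inflight >= l.MaxConcurrency:
		return errors.New("too many concurrent requests")
	}
	return nil
}

func main() {
	l := WriteLimits{MaxBodySizeBytes: 1 << 20, MaxSeries: 5000, MaxSamples: 50000, MaxConcurrency: 8}
	fmt.Println(l.CheckRequest(512, 10, 100, 1))     // <nil>
	fmt.Println(l.CheckRequest(512, 10000, 100, 1))  // too many series in request
}
```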
So this kind of limit is super important for us. But getting the active series usage per tenant is not that easy, because we cannot just compute it locally on each Thanos Receiver: each tenant's data is spread across the whole receiver cluster, and we also have data replication enabled, so we usually write 3x more series. So it's not easy to do the calculation from a single receiver. To solve this, we simply ask a meta-monitoring solution, for example the Prometheus running in the same cluster that monitors your Thanos. By querying that Prometheus, we can get the current tenant series usage from its metrics. So yeah, that's how we solve it.

Next, let's talk about one scalability improvement we made for the hashring. The Thanos Receiver component uses a hashring, but previously it was simply using hashmod to distribute series. For example, let's say we have this example series here and three receiver replicas; based on the hashmod calculation, the destination would be receiver two. But if we scale out and add one more receiver, the hashmod value would become zero. And this is super bad, because with this naive hashmod calculation, each time we change the number of servers, almost every series gets mapped to a different instance. This also causes issues like higher cardinality, because we end up with more active series in the head blocks. To solve this, what we really need is consistent hashing. So, thanks to the awesome blog post written by Damian Gryski, we implemented the Ketama hashring in a recent Thanos release. If you are interested in the details, please do check out that blog post. And yes, that's pretty much it for the improvements I wanted to cover.
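To see concretely why naive hashmod placement churns so badly on scale-out, a small simulation helps. The `hashmod` function here is an illustrative stand-in for the old placement logic, not the actual Receiver code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashmod is the naive placement scheme described above: hash the
// series labels and take the result modulo the number of receivers.
func hashmod(series string, receivers uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(series))
	return h.Sum32() % receivers
}

// remappedFraction reports what fraction of n synthetic series move to a
// different receiver when the cluster grows from `from` to `to` replicas.
func remappedFraction(n int, from, to uint32) float64 {
	moved := 0
	for i := 0; i < n; i++ {
		s := fmt.Sprintf(`http_requests_total{instance="host-%d"}`, i)
		if hashmod(s, from) != hashmod(s, to) {
			moved++
		}
	}
	return float64(moved) / float64(n)
}

func main() {
	// Scaling 3 -> 4 receivers: roughly 3/4 of series get remapped.
	fmt.Printf("remapped: %.0f%%\n", 100*remappedFraction(10000, 3, 4))
}
```

Roughly three quarters of the series move when going from 3 to 4 receivers, which is exactly the churn a consistent hashing scheme like Ketama avoids: there, only the share owned by the new node (about a quarter) moves.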
Next, I will hand it over to Bartek to talk about the really exciting query efficiency improvements.

Thanks a lot. I hope it's exciting. So, Ben already mentioned reliability improvements and multi-tenancy improvements, so the last thing to improve is, of course, scalability and efficiency. And there's always something to improve; there's always a better way to do things and make it cheaper. So let's dive into what we are enabling here.

Generally, from the past, we have a very simplistic solution for querying. We discussed different kinds of storages, which we hide behind a gRPC API we call the Store API. Those Store API leaves could be your Prometheus with a Sidecar, could be your Store Gateway, could be a Receiver, could be a Ruler, and all of them allow us to fetch data from multiple places, different storages and so on. What we do, in order to achieve this global query, is take the PromQL engine, the key part that computes over the storage data and turns it into your PromQL result, and put it into the Querier microservice. We literally import the Prometheus code, so it's one-to-one the same. And it works: for example, I have an example query here, just a count aggregation of some metric over two days, and you can see it simply takes all the data it needs from the storage, two days of this metric, right? The problem with this is that it's a single point of failure, and to answer any arbitrary PromQL query we have to pull all of this data into memory. It's just not very cost effective.
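The Store API fan-out just described can be sketched as a tiny interface: the querier only knows how to ask each backend for matching series and merge the results. The names and string-based "series" below are illustrative stand-ins for the real gRPC types:

```go
package main

import (
	"fmt"
	"sort"
)

// Store is a toy stand-in for the Thanos Store API: anything that can
// return series matching a selector (Sidecar, Store Gateway, Receiver...).
type Store interface {
	Series(metricName string) []string
}

type memStore struct{ series map[string][]string }

func (m memStore) Series(name string) []string { return m.series[name] }

// globalView fans the request out to every store and merges the results,
// deduplicating series that appear in more than one backend.
func globalView(stores []Store, metricName string) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range stores {
		for _, series := range s.Series(metricName) {
			if !seen[series] {
				seen[series] = true
				out = append(out, series)
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	sidecar := memStore{series: map[string][]string{"up": {`up{cluster="a"}`, `up{cluster="shared"}`}}}
	gateway := memStore{series: map[string][]string{"up": {`up{cluster="b"}`, `up{cluster="shared"}`}}}
	fmt.Println(globalView([]Store{sidecar, gateway}, "up"))
}
```

The real implementation streams and deduplicates label-sorted series over gRPC, but the shape is the same: one querier, many interchangeable storage leaves.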
And of course, it's hard to scale. So in order to scale it, we had to make some improvements. The first comes from the community: thanks to the Cortex team and project, we borrowed their idea of the Query Frontend microservice, another microservice that sits on top and talks to the queriers. Essentially it applies lots of different transformations to the PromQL query you pass; one of them is to split by day, or by any time interval. Which means that, for example, our query for two days could be split into two one-day queries, and we can run those with multiple cores, or distribute them onto totally different machines. So we achieve some kind of horizontal scalability with this, and the storage calls are also distributed, because we make more, smaller queries.

The problem with this approach is that we can still get slow and expensive queriers. Imagine this metric, even for one day, gives me one million results, one million time series: I still have to pull one million time series into memory for that one day, and I cannot really scale that. For those reasons we started a lot of initiatives; there was a talk earlier this year in Valencia from Moad and Filip on how to distribute those queries even more efficiently.

So there are two solutions we added for a portion of the queries. The first is pushdown. What pushdown does is, essentially, for certain queries, not fetch all the series from storage, but calculate certain aggregations, where possible, directly near the storage. For example, when you ask for a count of a certain metric, the storage itself, the Store Gateway, can calculate the count for me.
So instead of giving me one million series, it gives me one series, right? It's much cheaper on the network side, and additionally we can shard this work more easily and make it more concurrent. This is what pushdown does, and it's enabled for some computations, especially against the Prometheus sidecar; because Prometheus has a PromQL engine, we can offload the execution there.

Another improvement addresses sharding beyond the horizontal. Horizontal sharding means by time; we had that, but we wanted something better on the vertical side: splitting a query, a computation, into multiple pieces within one day, for example. This is what query sharding does. It's already enabled, maybe behind a flag, and it's not possible for every query, so we have to be careful here. But generally, if you enable it, we essentially split one query into four: split by day into two, and then each one-day query split again by two, by simply applying a hashmod on the series, and then we aggregate the results together. This is not easy: if you take an average, for example, you cannot take an average of averages; it's not the same as the average over all the sources. So we have to be very careful, but there are optimizations you can make, and this already improved the scalability of the storage queries as well, because there are more of them, and they are smaller. However, hopefully you see a certain pattern.
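The averaging pitfall mentioned above deserves a concrete worked example: averaging per-shard averages is wrong whenever shards hold different numbers of series, while combining partial sums and counts stays correct.

```go
package main

import "fmt"

// avg computes the mean of one shard's values.
func avg(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// sumCount is the shard-safe decomposition: each shard reports its
// partial sum and count, and the coordinator combines them.
func sumCount(xs []float64) (float64, float64) {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum, float64(len(xs))
}

func main() {
	shardA := []float64{1, 2, 3} // series landing on shard A
	shardB := []float64{10}      // series landing on shard B

	// Wrong: averaging the per-shard averages.
	wrong := (avg(shardA) + avg(shardB)) / 2 // (2 + 10) / 2 = 6

	// Right: combine partial sums and counts, then divide once.
	sa, ca := sumCount(shardA)
	sb, cb := sumCount(shardB)
	right := (sa + sb) / (ca + cb) // 16 / 4 = 4

	fmt.Println(wrong, right)
}
```

This is why shardable rewrites of `avg` decompose it into `sum` and `count` before distributing the work.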
So we have some magic, some pushdown, on the stores, the Store APIs, some complexity we add there. Then there is big magic in the Query Frontend, where we have complexity to shard and transform queries. I don't know if you see the pattern, but there is one: why are we not improving the PromQL engine itself?

The truth is, we were a little bit scared of it, right? It's many years old code, very intricate, very optimized for the Prometheus use case. It was designed to run on one CPU core maximum, so it was not concurrent, at least not in the way we would want. It was simple, it was amazing. But perhaps there is a way to attack this and open this Pandora's box. That is what we'll be talking about in two weeks at PromCon in Munich; I think there should still be tickets, so join us there, we'll be talking about it with Filip. But I'd like to briefly show you and get you excited about this stuff, because I am.

First of all, this is a drop-in replacement for the PromQL engine, so you can totally use it, even outside Thanos. The promql-engine project lives in the thanos-community organization; it's open source, and contributions are welcome. We already have a bunch of projects, not only Thanos, contributing to it.

Why is it amazing? First of all, it is based on the Volcano design, which is behind multiple proper, mature SQL engines. It essentially provides a nice framework of operators for the different execution parts: for aggregation we have one operator, for storage selection, the scan of storage, we have another operator. And then it allows us to move those parts around for optimizations, or to really understand what's happening inside.
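A minimal sketch of the Volcano-style operator model described here, with hypothetical names; the real promql-engine operators work on batches of series and evaluation steps, not plain floats:

```go
package main

import "fmt"

// Operator is the Volcano-style building block: each operator pulls
// batches of values from its child via Next until exhaustion.
type Operator interface {
	Next() ([]float64, bool)
}

// scan is a leaf operator standing in for series selection from storage.
type scan struct {
	batches [][]float64
	i       int
}

func (s *scan) Next() ([]float64, bool) {
	if s.i >= len(s.batches) {
		return nil, false
	}
	b := s.batches[s.i]
	s.i++
	return b, true
}

// sumAggregate drains its child and emits one summed batch, standing
// in for an aggregation operator like sum().
type sumAggregate struct{ child Operator }

func (a *sumAggregate) Next() ([]float64, bool) {
	total, got := 0.0, false
	for {
		batch, ok := a.child.Next()
		if !ok {
			break
		}
		got = true
		for _, v := range batch {
			total += v
		}
	}
	if !got {
		return nil, false
	}
	return []float64{total}, true
}

func main() {
	// sum(scan) over two batches of samples.
	plan := &sumAggregate{child: &scan{batches: [][]float64{{1, 2}, {3, 4}}}}
	out, _ := plan.Next()
	fmt.Println(out) // [10]
}
```

Because every node exposes the same pull interface, operators can be rearranged, replaced by optimized variants, or run concurrently without the rest of the plan noticing.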
It also allows for nice concurrency. In existing SQL engines it's very common to have query planners and optimizers: you parse your query, for example a PromQL query, into a logical plan, you optimize that, then you prepare a physical plan and optimize that, and only then do you execute the query. We never had this before, and this is a good opportunity to add those stages where we can do informed optimizations. And this is the production PromQL engine code: we literally already did that. We create a logical plan, we optimize the logical plan, we then build a physical plan and optimize that as well. Not everything is implemented; we just started this project months ago, so we fall back to the old engine for anything we don't support. So you could run this in production, and we do run it in production, at Red Hat and at Shopify, a lot. I recommend you try it out.

Another cool thing is that there are nice explain views in the SQL world, and I was missing that in PromQL, especially once we introduced those operators, so we added that too.

Let me now start the demo; I have something to show you. Let's see if it will work. First I have to mirror my screen. Yeah, I want that. It's kind of super big. So what I will do is start our setup. The setup is very simple: it has two Store Gateways and the Thanos Queriers here, two replicas of those, so we have one replica with the new PromQL engine enabled and one replica without it. Things are popping up; that's good. How do I make it bigger or smaller, actually?
So now you can see I'm using a Go end-to-end framework we have, where you can orchestrate containers in a unit-test fashion, in Go. Since we ran it, you can see that a couple of things opened; we run multiple containers. We have Prometheus, just to monitor this setup. Let me try to make it smaller. We have Parca, because we are actually working on optimizations, right, so we need profiling. We have tracing. And finally we have two Thanos UIs, which are really similar to the Prometheus UI; these are the two queriers we are calling to render this UI and perform certain queries. There are errors, but they are on purpose, because I don't want to execute the query yet.

So we have a setup with the new PromQL engine here, and the old one. Let's query the old one first. What am I querying? I have a huge two-week block of data with 10 million series, and I only query one day. And it's fairly fast, honestly. Let's see the new one. Essentially I'm doing a simple aggregation. I don't know what's going on here, maybe it's the screen sharing, but it's so fast. Anyway, it used to be like 10 seconds on my test runs. I think we should just bump it up; let's do a bigger one, two days. But you can already see that the old query took six seconds, and the new query was already twice as fast. Oh no, I want two days, two days; let's see if that will work. Also, I'm making a lot of mistakes because I'm trying to run both at the same time, so, you know, not good benchmark practices. However, we already see some improvement: 11 seconds for the old one, and the new one, eight seconds. So why is it faster?
It's just the new PromQL engine implementation. To explain why it's fast, we added this explain mode to PromQL. I just use the explain command and it prints the plan for me. We are working on the user experience; I just added this yesterday, but this view is what we will try to maintain and extend, just to show you what the execution plan will be. You can already see that we have a certain parallelism and concurrency within the PromQL engine: it noticed that I have 12 CPUs, so it sharded the work into six. It doesn't shard the storage yet; we'll do that in later versions. But right now, sharding within the PromQL engine, just on my one machine, already improves the situation, because we are essentially counting those series six ways in parallel.

Okay, but that's boring, right? Let's go and try a more intricate query. This is a typical query we do in PromQL: for example, you can see I have some cluster version metric, and I want the percentage of this cluster version excluding a certain version, right? And this can be expensive, because the naive, or rather the old, PromQL implementation was essentially scanning the storage twice: first to get this cluster version, and, actually, I'm pulling something like 200,000 series right now, so quite a lot, and then doing another scan of the storage. The reason it was implemented like that is that it was designed for Prometheus, where you have all of this data in memory, versus here in Thanos, where we unfortunately have a network lookup, and even an object storage lookup, so it can be a little bit slow. We can see 11 seconds, which is actually not so bad; it's actually pretty fast
with so much data. So let's see how our new engine performs here. What's amazing is that this logical plan and physical plan separation allows us to optimize, and really opens space for other people to contribute more optimizations. We can see it's super fast: three seconds versus 11 seconds. I'm impressed, even.

Let's see what's going on. If I explain this, it's a more complex query, as you can see, because it just has more parts, and generally it has two separate branches that only merge at the end. You can see all of them fetch some series, and this is before optimization. What we can see in my logical explanation is that there was some optimization of the sorting of matchers, simply because that's faster. And then there is the merge-selects optimizer; this is the key thing I want to show you.

What is happening is that, after optimizations, you can see a new node was injected, called a filtered selector. What's really happening here is on those series selectors. Series selectors essentially fetch the data, and we can see the memory address of the actual series selector in my query. You can see the address is exactly the same for all shards, but also exactly the same across the different branches of the query; before, it was a different memory address for the first branch and the second branch. This indicates that the same data is used: we cache the same series, and we literally make only one query to get the cluster versions, then virtually filter for the second branch and reuse the same memory. So that's a huge optimization, and you can see everything in our tracing solution. So let's go to Jaeger, grab the query range, and find them: one was three seconds, and the second was 11.
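The shared memory address shown here is the visible effect of a classic common-subexpression elimination: identical selectors in different parts of the plan are collapsed into one instance, so storage is queried once. A toy sketch, with purely illustrative names:

```go
package main

import "fmt"

// selector is a toy stand-in for a series-selection node in the plan.
type selector struct {
	matchers string // e.g. `cluster_version`
	fetches  int    // how many times storage was actually queried
}

func (s *selector) Fetch() []string {
	s.fetches++
	return []string{s.matchers + " ...series..."}
}

// dedupe implements the optimization: selectors with identical matchers
// are collapsed to one shared instance, so both branches of the query
// reuse the same fetched data (and the same memory address).
func dedupe(matchers []string) []*selector {
	cache := map[string]*selector{}
	out := make([]*selector, 0, len(matchers))
	for _, m := range matchers {
		if _, ok := cache[m]; !ok {
			cache[m] = &selector{matchers: m}
		}
		out = append(out, cache[m])
	}
	return out
}

func main() {
	// Both branches of the query select the same metric.
	plan := dedupe([]string{"cluster_version", "cluster_version"})
	fmt.Println(plan[0] == plan[1]) // true: same selector, same address
	plan[0].Fetch()
	fmt.Println(plan[1].fetches) // 1: the second branch reuses the fetch
}
```

Pointer equality is exactly what the explain view's identical memory addresses indicate: one fetch, two consumers.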
So let's go to the old PromQL. Many things happen, but what I really want you to focus on are the Querier selects. You can see that for the selection of this data there were two of those, and it took some time. They ran in parallel, but they still overloaded my machine in some way, there was some congestion, so in the end it was slower. Plus, we have to proxy all of it. So in the end we made two calls: one for the filtered data, and a second one without the filter. And what's cool is that with this simple optimization, which is maybe five lines of code, we can already achieve something like this, where we have only one selection. Again, this is my local network, so it's ultra fast, but if you go and fetch series from a Prometheus sidecar, this really matters, right? So it's already a lot faster.

So this is what I wanted to show today. Hopefully this excites you; it excites me, definitely. And if you want to learn more, and maybe really try writing those optimizations, I would like to invite you to do so.

One last thing before we finish: I would like to mention that we are doing mentoring. So if you know any students, and actually we also mentor non-students, full-time employees, if they want to join the open source space and really start contributing, check the website and sign up for the program. We mentor a lot as a community, so you're welcome to join us here.

And that's it, that's it for us. Thank you. Yeah, thank you.