Welcome Los Angeles and everyone online. It's a crazy time, but thanks so much for having me here. Today I'm going to talk about Thanos, a CNCF project, and the title is "Highly available, pluggable, long-term metric storage for everyone". It's one of the longest titles I've ever had for a talk. Really excited to talk about this.

First of all, a little bit about myself. I'm from the Netherlands, or Holland, and I'm a buzzword engineer. Basically, I do everything that I like, which really depends on the moment: that can be SRE, or DevOps, or, at the moment, observability. I do this at Fullstaq. I really like open source software, especially anything related to Kubernetes. On the side, I do a little bit of hacking, for example on HackerOne. These are just things I really enjoy in my spare time, besides working with wood. I really love to create things.

So now that we've got the basics covered, let's get started with our talk. And we have to start with Prometheus first. Since this is a talk for everyone, I want to give you a short introduction to Prometheus. Perhaps you have heard of it, perhaps you are using it. I personally really love this project. It's been around for quite a while and it's heavily used, especially on Kubernetes.

But what is it that it does? Well, Prometheus is a key player here. This is an HTTP page, and it's serving metrics: it's just displaying text in a metrics format. For example, this page is displaying metrics about an HTTP server: about request duration, but also how many requests we had in total, what the HTTP status code was for a given request, and what the counters are for each. For an HTTP web server, this is pretty nice information to have, because you can see, for example: how many 500 errors do I have?

This is really nice, but we want to go from a page like this to something like this. This is Grafana, a visualization tool. What you see here are different graphs, and these graphs are metrics plotted over time. We can also do other things, such as alerting. This is an example of an alert that we have created with Prometheus: an alert for high load. If node_load1 goes above 0.5, the alert goes into a firing state. So we get a notification that something happened with a metric, it passed some threshold, and we get an alert.

It's important to get a top-down view of Prometheus, because it's doing quite a lot of things, but the things it does, it does really well. In the middle we have the Prometheus server, which handles the retrieval of those metric pages. That's the retrieval part: we are pulling metrics from those metric endpoints, and we do this on a certain interval, the scrape interval.

But how do we discover these targets? Well, that can be done via multiple methods. If you check out the Prometheus website, you'll find all kinds of service discovery methods. I've listed two: the Kubernetes service discovery and the file-based service discovery. What we often do with file-based discovery is static targets: we put in a host as a static target, and Prometheus will try to scrape that endpoint.
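To make that concrete, a minimal scrape configuration with one static target might look roughly like this; the job name and host are made-up placeholders:

```yaml
# prometheus.yml -- a minimal sketch with one static scrape target.
global:
  scrape_interval: 15s            # how often Prometheus pulls each endpoint

scrape_configs:
  - job_name: "web-server"        # hypothetical job name
    static_configs:
      - targets: ["web.example.com:8080"]   # hypothetical host:port serving /metrics
```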
With Kubernetes, we can use discovery via the Kubernetes API server, which is really neat, because we can just run our services, let the API server do everything, and let Prometheus discover every metrics endpoint that we have. This makes it really, really easy to automatically discover our scrape targets.

So that is the retrieval part. The next part is the TSDB, the time series database, and it stores its data on the node, basically on a generic volume. Then we also have the HTTP server, which allows PromQL, the query language of Prometheus, to let you query your metrics over time. Prometheus does have its own web user interface, which is also really nice, but perhaps a more common user interface is Grafana, which is really nice for visualization. You can do so much more with Grafana, but that's out of scope for right now. We also have the Alertmanager, which you can see on the top right. Basically, we can create alerts with certain thresholds, and if a certain metric hits that threshold over time, we can send out an alert.

So Prometheus is really awesome at doing all these kinds of things, but perhaps it's just lacking a few features. As you can see, the TSDB stores its data on the node itself, on disk or on a volume, depending on whether you run on Kubernetes. That's fine if you have a few servers that you are scraping and a data retention of 30 days; that works perfectly, because Prometheus is able to scrape a lot of targets and retain a lot of data. But I haven't seen, for example, a retention period of two years for 10,000 servers in one single Prometheus. That can be quite tough for Prometheus; it wasn't built for very long retention periods.

So let's question Prometheus for a little bit. I don't want to say anything bad about Prometheus, because I really love Prometheus, and so should you. But there are a few limitations. For example: I have a single node with Prometheus on it, and a second node with four services on it. Prometheus has discovered my four services and it's all running well. It's scraping my four services and I can query the data. What if node one crashes? Well, this might pose a problem, depending on your use case, of course. But at that moment, nothing is scraping my four services, and I also have no way to query my data from the period that Prometheus was running.

And obviously, if you know Kubernetes, we can just add replicas. We can just add another Prometheus, which is what we do. So here we have two nodes, and both nodes are running Prometheus. This removes the single point of failure. If something happens to one Prometheus, that's fine, because you still have the other Prometheus replica. In that scenario we still have node two: it's still scraping services three and four, and we can still query our data. We can also still query our data from the period when node one was still alive, because node two has that data too. Both Prometheuses are scraping all our targets, which allows one Prometheus to go offline.
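A common way to set up such a pair (and one that will matter again later, when Thanos deduplicates the data) is to give each replica a distinct external label. A minimal sketch, with made-up label values:

```yaml
# Sketch: both replicas run identical scrape configs, but each gets
# its own external label so their series can be told apart later.
global:
  external_labels:
    cluster: cluster-1   # hypothetical cluster name
    replica: A           # set to "B" on the second Prometheus
```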
There's just a little downside to running both replicas like this, because both nodes now have the same time series. And as we discussed, Prometheus scrapes on a certain interval, and this isn't a fixed point in absolute time. It isn't scraping at exactly 00:00:00; it's just scraping every 15 seconds, and when exactly really depends on the Prometheus instance. So both Prometheuses would most likely have the same time series, but the samples would differ due to the different scrape times.

The other problem is that if, for example, one node goes offline, we are missing data on that Prometheus. Say we implement a load balancer, so our queries go to either node one or node two, and this works just perfectly. But once node one has been offline, it is missing data. So if our query hits node one, we see a gap in our data. This is fine for, for example, Alertmanager, because Alertmanager is aware of this behavior and can filter and deduplicate data. But if you are making queries with Grafana and building dashboards, it could be that you hit the refresh button, hit the Prometheus that was offline for five minutes, and see a gap in your data. If you refresh again and you don't have a sticky session on your load balancer, you might hit the other Prometheus and the gap is gone, but you might also see different samples, because they can differ due to the scrape times.

Now let's talk about the unified view. Let's say I have five Kubernetes clusters and I want just a single view over all my Prometheuses. Can we solve this natively? The answer is the typical engineering answer: yes and no. We can, but perhaps not perfectly.

One way to do this is to run one single Prometheus instance, perhaps with replication, and use the Kubernetes service discovery config. As you can see here, there's a definition for an API server, and by default it assumes it's running in its own cluster. But we can also add another configuration and scrape our remote cluster. There are a few downsides to this. First of all, you need access to your external cluster, because Prometheus gathers the information about your services via the API server. So you will need access to your other cluster, and that might be just a little bit annoying for you.

Another solution, which I personally don't recommend, is to use federation. Federation is a native feature of Prometheus which allows you to scrape other Prometheus servers. So what we could do is run a Prometheus server in your external cluster, expose it, and let it be scraped by a central cluster. This works, but I wouldn't recommend it, because it requires a lot of data to be scraped. There can be massive amounts of metrics in your Kubernetes cluster, and as this grows, this solution doesn't scale well.

Another solution, which I just added for completeness, is to expose your metric endpoints from the other cluster. Let's say you only have one metrics endpoint in your external cluster. You could add an Ingress in front of it, expose it, and add it as a static target in your central cluster. To be honest, this solution is meh. I wouldn't know why you would want to do this, but it is a possibility. I normally assume that you want to gather all the metrics from your clusters, not just one single endpoint, but if this is your use case, you could do this and just define the target as a static target.
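For reference, the federation approach I advised against could look roughly like this on the central Prometheus. The hostname is a placeholder, and the broad match[] selector is exactly why so much data gets pulled:

```yaml
# Sketch: a central Prometheus federating from a remote Prometheus.
scrape_configs:
  - job_name: "federate"
    metrics_path: /federate
    honor_labels: true            # keep the original labels of the federated series
    params:
      "match[]":
        - '{job=~".+"}'           # pull (almost) everything -- this is what doesn't scale
    static_configs:
      - targets: ["prometheus.remote-cluster.example.com:9090"]  # hypothetical
```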
So I've talked a little bit about some of the limitations of Prometheus, but that doesn't make Prometheus bad or anything. In fact, we really love Prometheus, and that's what this talk about Thanos is about: extending Prometheus, because we like Prometheus, and there's a lot of love between our communities. So, let's get started talking about Thanos.

First of all, our community. I think we are really proud of everyone involved in the Thanos community; people have been fully on board from the start. And it's been a few years: we started in November 2017. Myself not included, I came on later. We are a CNCF incubating project, and personally, for me, this is really important. As I said, I really love open source, but it's perhaps even more important which company or foundation is behind an open source project. This is why I love the CNCF. We have transparent governance, with a maximum of two votes per company involved in the project, and that is really important for an open source project to stay healthy. I really like that. We're also almost hitting 10,000 GitHub stars, which will be another milestone for this project. We have so many contributors, and our Slack users are really active; I really love the vibe on our Slack. So please feel welcome to join us in the Thanos channel on the CNCF Slack. Happy to see you there, and feel free to ask questions.

And don't just take my word for it: we have numerous companies running Thanos in production to ensure they have a reliable and scalable monitoring platform. I'm also really, really happy that those companies are willing to add themselves to the adopters page on our GitHub. If you are listening right now and you're using Thanos, please feel free to add yourself to the adopters list. If you are unsure how to do that, just send me a message or find us on Slack; I'm happy to help you there. And again, thanks to everyone involved in this project.

So let's get started on the features of Thanos. I guess one of the main ones is the global query view: we can have multiple clusters with Prometheus and get one unified way to query them all, which is really awesome. I will get back to how this works and what it looks like. There's also unlimited retention. Perhaps this sounds weird; I mean, can you really have unlimited retention? But what Thanos incorporates is storing the data in an object store, which is really cheap and efficient, and it scales. That's what makes the retention effectively unlimited. Thanos is also Prometheus compatible: a lot of our source code is basically parts of Prometheus itself, we use the same APIs for queries and PromQL, so it's basically plug-and-play compatible. Another cool feature is downsampling. Let's say we have metrics with a two-year retention. If I'm going to display that in a nice visualization chart in Grafana, I'm not going to plot a sample for every 30 seconds on a two-year timeline; that would be far too many points. We can downsample this data to make long-term queries really efficient, which really speeds up the process.

Thanos consists of multiple components, and the four core components that I'm going to cover in this talk are the sidecar, the query component, the store gateway, and the compactor. There are more components, but for the sake of this talk I'm just going through these four, as they are the most important and provide the most value.
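Several of these components need to know where that long-term data lives. A minimal sketch of the object storage configuration they accept (for example via --objstore.config-file), assuming an S3-compatible bucket with placeholder names:

```yaml
# objstore.yml -- sketch of a Thanos object storage config.
type: S3
config:
  bucket: "thanos-metrics"        # hypothetical bucket name
  endpoint: "s3.example.com"      # hypothetical S3-compatible endpoint
  access_key: "ACCESS_KEY"        # in practice, inject these as secrets
  secret_key: "SECRET_KEY"
```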
So, the sidecar can just be hooked in next to Prometheus, which makes it really plug and play: we can just add it as a sidecar container. It's optionally capable of uploading your metrics to the object store; you don't have to use the object store or long-term metrics. If you just want to use Thanos as a unified way to query all your Prometheuses, that's fine as well. The sidecar also enables the querier to actually query our data. And as discussed, the querier is able to see every component: our stores and other sidecars all implement the Store API, and the query component can discover these targets and lets you query whichever Thanos components you want. I will go into a bit more detail on this later.

Then we have the compactor. Because we are storing our data in the object store, Prometheus isn't compacting that data anymore, and this is where the compactor comes in: the compactor component compacts the data in the object store, which is more efficient. It's also responsible for downsampling our data for the longer-term metrics, which speeds things up very nicely.

We also have the store component, and its core feature is to act as an API gateway to our object store. The store component allows us to query the data stored in that object store.

So this is an example of a typical highly available setup which incorporates Thanos. As you can see, I have multiple clusters, each cluster is running Prometheus, and each Prometheus has a Thanos sidecar. We also have a generic monitoring cluster with two query components; these are stateless, so you can run multiple replicas of them. And we have Grafana, which hooks into the query component: we query via Grafana to the query component, and the query components fan out to all our sidecars. As you can see, cluster two has multiple replicas, and the great part is that the query component does deduplication on the fly. So even if one node or one Prometheus in cluster two goes down, we still have our data. And when, for example, that Prometheus comes back online: we used to get a gap in our data if we hit exactly the Prometheus that was offline for a certain amount of time, while the query component just prevents that, as it deduplicates the data. So you always get the same unified view of your data.

And we can extend this further by letting the sidecar upload its data to the object store. That's not the whole story; it's the first step. The sidecar will upload the data, and we can add the store component to actually let us query that object store. And you can arrange this in a lot of different ways. As you can see here, we have one cluster with only a sidecar, and we have another cluster with a query component, a store component, and a sidecar. This works just as well. So in our monitoring cluster we have a unified way to query all our clusters, but only one cluster has long-term metrics, and that works just fine. And this is really awesome.
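To give an impression of the wiring, here's a minimal sketch of the sidecar as an extra container in the Prometheus pod spec; the image tag, paths, and ports follow common examples and are not mandated:

```yaml
# Sketch: Thanos sidecar container inside the Prometheus pod.
- name: thanos-sidecar
  image: quay.io/thanos/thanos:v0.23.1               # pick a current release
  args:
    - sidecar
    - --tsdb.path=/prometheus                        # Prometheus data dir (shared volume)
    - --prometheus.url=http://localhost:9090         # the Prometheus in this pod
    - --objstore.config-file=/etc/thanos/objstore.yml  # optional: enables uploads
    - --grpc-address=0.0.0.0:10901                   # Store API endpoint for the querier
```

And a querier that fans out to the sidecars and the store gateway, deduplicating on the replica label from earlier; the endpoint addresses are placeholders:

```yaml
# Sketch: Thanos query component.
- name: thanos-query
  image: quay.io/thanos/thanos:v0.23.1
  args:
    - query
    - --http-address=0.0.0.0:9090
    - --query.replica-label=replica           # dedupe HA pairs on this external label
    - --store=prometheus-a-sidecar:10901      # hypothetical sidecar gRPC endpoints
    - --store=prometheus-b-sidecar:10901
    - --store=thanos-store:10901              # store gateway for object-store data
```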
And we can go even further: we can add another Grafana with multiple query components on cluster one. This might be really useful if you are working with different teams wanting their own Grafana instance. That's possible: we can give this one cluster its own Grafana and query components, and we still have our monitoring cluster in which we can see all our clusters. And we can just continue doing this. For example, say a team has two clusters and you want to give them their own Grafana instance. Now, in the Grafana on cluster one, you can query a unified view of cluster one and cluster two, but we still have the monitoring cluster in which we can see everything, for example when you add a third cluster. We can basically go bananas. This is a diagram that I made a while ago, and I figured I'd include it to conclude this part. It shows various components, and as you can see, everything is chainable and pluggable. You don't have to go into the details of this graph; what I'm trying to say is that it's really easy to plug and play, to make changes, to switch things up, and to move things through various pipelines, basically. And this makes it really robust.

You might say: how do I get started? Well, first of all, please check out our website. I will upload these slides, so you can check all the links for yourself. Basically, the Thanos website has lots of information, and again, if you are missing anything, please create an issue or a pull request; happy to improve it. I also want to give a shout-out to the Prometheus Operator, which is really awesome as a unified way to set up your Prometheuses, and it also allows you to run the Thanos sidecar with the Prometheus Operator. There's also the kube-thanos project. I personally haven't used it, but I'm aware that a lot of people are using it, so I'll just include it. There are also community Helm charts, which are just a search away, and there are multiple of those. And I'm also really proud that we have a Katacoda course for Thanos. There are seven lessons that get you started with using Prometheus and Thanos, and if you thought this talk was nice, please do try the Katacoda course out.

And we have more. We have our channel on the CNCF Slack; feel free to join. We also have questions and discussions on our GitHub, and obviously our GitHub in general for pull requests and issues. And again, check out our website.

I would like to thank you all, and our community: everyone involved in Thanos and the CNCF. Thanks so much for having me here. I hope everyone had a great time and enjoys the other talks. And hopefully next year we can all see each other in real life; really looking forward to that. Thank you.