Okay, hello everyone, bonjour! I warmly welcome you to the first ThanosCon ever, a co-located day at KubeCon EU 2024 in Paris. We are super excited to be here and to spend half a day focused on Thanos: using Thanos, maintaining and developing it, and really celebrating it together for the first time. So I'm super excited. Who else is excited? Are you excited? Make noise. Woo! Yeah, that's some energy. Amazing. So my name is Bartek Płotka, but I will introduce myself more later on. And I'm here today with Michael. Hi, Michael.

Also a warm bonjour from me. I'm an SRE at Aiven, but only for two more weeks, after which I start a position at Cloudflare, which is pretty cool. I joined the community last year and generally have too much time on my hands, so I enjoy spending it hacking on Thanos. But enough of that. What even is Thanos? I suppose most of you already know what Thanos is, but for those Thanos-curious folks among you, wandering the halls of KubeCon this year and finding yourselves in this room: first of all, welcome. And second of all, you will hear about Thanos all day, so I will keep this very brief and give a very reductive summary. Thanos is a system of microservices that conspire to provide a highly available Prometheus setup with long-term storage capabilities. It also provides a single pane of glass for all your metrics and keeps them queryable over large time ranges through down-sampling and compaction. But as I said, you will hear enough about Thanos during this conference. Much more exciting to me: ThanosCon. 2023 has been a pretty great year for Thanos. We had lots of new triage members, maintainer-team members, and lots of features; in fact, we have a dedicated talk on that during the course of this day, given by fellow maintainer Saswata. We also had lots of new stargazers, corporate adopters, CNCF #thanos channel dwellers, Docker pullers, and issue submitters, which is just the sign of a great and thriving community.
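The down-sampling just mentioned can be sketched in a few lines. This is a hedged, illustrative Python sketch of the idea only, with made-up names; the real Thanos compactor is written in Go and also keeps a counter aggregate per window, while this sketch keeps only count/sum/min/max:

```python
# Illustrative sketch (assumed names, not the real implementation):
# raw samples are grouped into fixed windows and only small per-window
# aggregates are kept, so queries over months touch far fewer points.
from dataclasses import dataclass

@dataclass
class Aggregate:
    count: int
    total: float
    minimum: float
    maximum: float

def downsample(samples, window_ms=300_000):
    """Group (timestamp_ms, value) pairs into 5-minute windows,
    mirroring Thanos's first down-sampling resolution."""
    out = {}
    for ts, val in samples:
        bucket = ts - ts % window_ms
        agg = out.get(bucket)
        if agg is None:
            out[bucket] = Aggregate(1, val, val, val)
        else:
            agg.count += 1
            agg.total += val
            agg.minimum = min(agg.minimum, val)
            agg.maximum = max(agg.maximum, val)
    return out

# Two hours of 15-second scrapes collapse into 24 windows:
raw = [(i * 15_000, float(i % 7)) for i in range(480)]
buckets = downsample(raw)
print(len(buckets))  # 24
```

At query time the stored aggregates are enough to answer the common functions (avg from sum/count, min, max) over long ranges without touching the raw samples.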
And as the slide shows, we had six talks last year across three conferences: three at PromCon and three at KubeCon EU and KubeCon NA. So we figured it's time for a dedicated event for this project, where you and the community can meet, connect, and geek out about Thanos. Thank you all for being here and making this a thing. That's great. Bartek, do you want to take the formalities?

Of course, I love formalities. From an organizational point of view: make sure you follow the code of conduct. Let's make sure this space is safe and friendly for everyone; the golden rule is to treat others as you'd like to be treated. Also remember captioning: for this event, captioning is available, so if you would like things translated and captioned in your own language on your own device, just scan this QR code. Use it, it's super useful.

Let's quickly go over the agenda; we have a couple more minutes. When we created this event and submitted the call for papers, we did not anticipate that we would get around 30 talks, and we had to pick just six of them because we only have half a day. But rest assured, those six talks are pretty entertaining, so this will be great. We'll be starting strong with a retrospective of the Thanos project by its original masterminds and what they learned from the journey of creating a successful open-source project. Then we will have not one but two case studies of massive Thanos deployments in the wild, the challenges they faced, and how they manage their clusters in production, from Cloudflare and Reddit respectively, followed by the quick recap of 2023 by Saswata that I already mentioned. Then a really exciting talk: an entirely new single-pane-of-glass use case unlocked by our new distributed query engine, which drastically improves the utility of the Thanos querier; that one is really exciting to me. Then we'll see how Thanos can serve as a multi-tenant monitoring solution and learn about the best practices and gotchas of setting that up.
And last but not least, Thanos Receive is known as one of the trickier components to operate and keep running, and Joe will share his journey and his hard-fought lessons about keeping it running, most of the time at least. And then we've already reached the end of ThanosCon, but it's lunchtime by then, so you're probably happy about that, and we can connect over lunch and celebrate Thanos. With that, we're prepared to start this event. Bartek, Fabian, take it away.

Thank you, Michael. All right, let's go. So yeah, my name is Bartek Płotka, and today I'm really excited to be here with Fabian Reinartz to start this event with a short story about Thanos's beginnings. For those who don't know, we happen to be the initial authors of the Thanos project, behind the initial design and the first MVP. I think it was six years ago or so. To be honest, it doesn't matter who started the project, because the amazing state this project is in right now is thanks to relentless open-source work from 20 or more team members and 600 unique contributors, with roughly 80 active contributors monthly, which is amazing. Six years of really amazing organic growth, honest work, and open-source collaboration. However, being there from the start of this project, in sync with its growth, gave me and Fabian a bit of perspective on what worked and what didn't along the way, what really helped to build a successful project and what didn't. And that's why I really wanted to bring Fabian here today, so we could discuss the most surprising learnings from this journey. And since six years have passed since the creation of Thanos, we have six learnings for you.

So let's rewind to the beginnings of Thanos; the story begins years before Thanos's first line of code. Before Thanos actually happened, I happened to work at a startup called Improbable.
We were early Prometheus and Kubernetes adopters, especially because we ran dozens of Kubernetes clusters around the world, plus other, non-Kubernetes environments. So we were hitting some Prometheus issues at scale. I don't think I really have to explain in detail what those issues were, because you are probably here because of them, but just to list them: no global view, no way to support long retention, the reliability of storage, not to mention the dynamic scaling that it's always good to have. And we knew we were not alone with these problems, so we went hunting for a solution. Obviously we started with the existing solutions in this space. We knew some basic requirements: we didn't want to manage disks on Kubernetes, because we were kind of sick of running Elasticsearch for logs on Kubernetes for a long time. And we really wanted to keep our Prometheus dashboards, alerts, instrumentation, and so on, because they had served us so, so well. So we were looking for something that looked like Prometheus, but scaled. One of the projects at the time was Cortex, which started in 2016 as a project called Frankenstein, actually, by our dear friend Tom Wilkie. And it was really, really close to what we needed. We collaborated together, we met with the team, but it was a little bit too complex and expensive due to its DynamoDB dependency. Still, we tried to contribute Bigtable support as an alternative to DynamoDB for the indexes, and I remember doing a PoC with this, but it was just not clicking. The operational and cost difference between running just Prometheus servers versus a full-blown Cortex cluster was a little bit too big. Now, what I really loved about working at that London startup at the time is that we didn't just complain, or just look for a vendor or a project that would solve all our problems without any work.
We were actually willing to put in the effort and the deep-level work to get things done. So, really thanks to the leadership of this company, at least of the infrastructure team, we started to reach out to Prometheus community members for help and suggestions. And we were really lucky to contact Fabian.

Right, hi, great to be here. I want to take you back to around the same time, from the Prometheus team's perspective. Back then we were quite busy shipping Prometheus 2.0; that was, if you recall, the release with the new TSDB. It took about half a year to fully integrate it and harden it, and as part of that we worked with our users to really stress-test it. Prometheus 2.0 launched around the end of 2017, and the new TSDB was a big leap forward: I think resource consumption improved 10x or more on some dimensions. So it really helped the vertical scalability of Prometheus, which solved some of the scaling problems that Improbable and other companies had. But at the same time, I also felt we were reaching the limits of how far we could take this. I didn't see another 10x improvement to be gained on just a single node. And besides, making a single node more scalable doesn't solve all the problems we mentioned: it doesn't give you transparent HA, it doesn't give you a global view if you have Prometheus servers in different data centers, and it doesn't give you reliable persistence for long-term storage. So it's only part of the equation. The question was: what's next? How do we solve the next level of problems that our users are having? And all the problems you mentioned were not new; they had been brought up by the community for years. As such, I had been thinking about them on and off over the years, and I always felt it should be possible to do this in some kind of additive way.
That is, add this on top of Prometheus without making Prometheus itself more complex to operate, because operational simplicity was definitely one of the core features that made people come to like the whole thing. I bounced ideas around on this over the years. One of them was certainly object storage as the storage mechanism for long-term storage, which frees you from having to manage disks. But it wasn't super popular, from what I recall. On the one side, there were folks saying it's going to get more complex operationally, and we can't have that, because this is a system you rely on for your operations, so it has to be as simple as possible; that's why we have federation, and that's what we should be using. And the other side, I guess, was: you can't actually build this thing without it getting more complex, so you have to build something complex. In hindsight, we can say that even though there were significant doubts, there were some ideas that made sense, and time has ultimately proven them out. So the first learning from this whole project, with hindsight, is: you have to trust your own conviction. You should get feedback, listen to it, and adjust, but if you think there's something there that's worth exploring, probably just go for it. As I mentioned, we were working with our users to stress-test Prometheus 2.0, and one of those users was Improbable. They had one of the, I think, largest deployments of Prometheus 2.0, across the globe in many data centers. So they were a good candidate, and they gave us really good feedback. I dug around in my emails yesterday on the train, and I found one by Michał where he mentioned: hey, this new TSDB has this nice storage format; can't we just put these blocks into object storage and then fetch them again on demand for querying?
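The idea in that email can be sketched very compactly. This is a hedged toy sketch in Python with assumed names (the real Thanos code is Go and talks to S3/GCS): immutable TSDB blocks are uploaded as objects, and at query time only the small `meta.json` files need to be fetched to decide which blocks overlap the queried range.

```python
# Toy sketch of "blocks in object storage, fetched on demand";
# names and layout are illustrative, not the real Thanos code.
import json

class ObjectStore:
    """Stand-in for S3/GCS: a flat key -> bytes namespace."""
    def __init__(self):
        self._objects = {}
    def upload(self, key, data: bytes):
        self._objects[key] = data
    def list(self, prefix=""):
        return sorted(k for k in self._objects if k.startswith(prefix))
    def get(self, key) -> bytes:
        return self._objects[key]

def upload_block(store, block_id, min_time, max_time, chunks: bytes):
    # A small meta file carries the time range, so queries can skip
    # blocks entirely without downloading any chunk data.
    meta = {"ulid": block_id, "minTime": min_time, "maxTime": max_time}
    store.upload(f"{block_id}/meta.json", json.dumps(meta).encode())
    store.upload(f"{block_id}/chunks/000001", chunks)

def blocks_for_range(store, start, end):
    """Fetch only the metas; return ids of blocks overlapping [start, end)."""
    ids = []
    for key in store.list():
        if key.endswith("/meta.json"):
            meta = json.loads(store.get(key))
            if meta["minTime"] < end and meta["maxTime"] > start:
                ids.append(meta["ulid"])
    return ids

store = ObjectStore()
upload_block(store, "01A", 0, 1000, b"...")
upload_block(store, "01B", 1000, 2000, b"...")
print(blocks_for_range(store, 1500, 1800))  # ['01B']
```

Because blocks are immutable once written, the upload side is a simple sync loop, which is what makes the sidecar approach so low-risk to bolt onto an existing Prometheus.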
And that was quite in line with where my head was at the time. So we got talking and talking and talking. And it took another six months to actually get to the point where we were ready to collaborate on this. As Bartek mentioned, Improbable was quite keen on actually investing in this, in terms of resources, but it took time to align on what we actually wanted to build, what timeframe we had, and how we could make this collaboration happen. After six months of back and forth, we finally got started, and I flew to London and spent the first week of this project at the Improbable office; you can see a picture from that time. And we got going. In hindsight, I think the second learning is that sometimes you need luck, because when I went through that email chain, there were so many opportunities for nothing to happen, way more opportunities than for this thing to actually work out. I think it's quite lucky that we got to work on this together.

And when we started working on this, as you can see in the first commit here, this thing was still called prom-lts as a kind of working title, for Prometheus long-term storage. Bartek actually found out there was one commit before the initial commit, created by someone at Improbable to set up the repository. And as is nice to see, this was meant to be open-sourced from the very start, because that very first commit adds a license file for the Apache 2 license. We then spent three months just heads-down building the first MVP. And I think the really cool thing is that the MVP was pretty full-fledged: it had basically all the bells and whistles of the core features. We had the query federation mechanism, where you can fan out to multiple data centers if you want. We had persistence into object storage.
We had compaction in there, which included down-sampling. And we had transparent HA, which I think is maybe the most important part. Even more importantly, we actually deployed this to production while we were working on it. And that really alludes to the additive approach I mentioned before: by making this pluggable on top of existing deployments, it was really easy to adopt, with really low risk. Because you're just adding something on top: if it doesn't work out, nothing breaks, your operations are not affected, and you can just see whether it works. If yes, great; if not, you just go back. And that kind of confirmed the hypothesis that users are not necessarily unhappy with Prometheus; they just want something more on top of it. So the third learning, I think, is that you have to really meet your users where they are, and you have to understand where they are. And it actually is possible to build something in three months and start using it, instead of on a much, much longer time horizon. About a month after those three months of project work, we presented it to the public; I think it was at the Cloudflare office, at a Prometheus, or general observability, meetup. And everything from here onwards, Bartek can tell you.

Yeah, thank you, Fabian. So this MVP creation, these three months of deep, deep-down work, was super insightful to me. And one of the learnings that I only realized now, looking back at it, is how important time to market was here: to deliver something that can be used, that has a great design and code quality and so on, but something finished, something that people can install, use, and try out before the community maybe moves on to other problems. Maybe that's obvious, but we tend to drag out our software development a bit, maybe because coding is fun and we just keep going.
Maybe we are afraid that the solution is not perfect yet, or not fully tested. But what I learned, I think, from Fabian really, is the pragmatic push to deliver high-quality software at a given time. When that time arrives, you literally stop whatever you're doing. You pack whatever you have into a consumable package, you document it, you literally write a blog post, create a talk for the nearest meetup, and you ship it. And this is an amazing, achievable thing: you just deliver. And once you've done that, you think about what's next, how you can improve it, how to add more tests, and so on. The fact that we did it and finished it established a really good foundation for Thanos's success. I rarely see this process being so productive in other projects I have participated in. One aspect that really helped this happen is deep focus, and I really mean that: I literally did nothing else for those three months, except maybe Christmas, but really design the code, write the code, test it, and ship the initial version of Thanos. I don't know if I will ever have that deep focus again in my life, because now I have a baby, I have open-source duties, there are distractions. So the learning is: try to find the space for deep focus if you want to achieve something solid very quickly. I will definitely try to get that deep focus again at some point when I want to build new things. This was a super important learning for me.

So after the MVP, from the beginning of 2018, we started just grinding: fixing bugs, finalizing features; it was mostly me. But over time, new adopters and users came in and helped me build Thanos up. It started very small: literally, I started a Slack channel and was helping users who had questions.
And then those users I had helped went on to help other people, and a compounding effect was built on top of that: users were helping each other, doing so much work that we transitioned them to Thanos maintainers, and so on. It was really beautiful to see. Another milestone, of course, was the recognition from the CNCF that the project was maturing. We donated the project to the CNCF at the sandbox stage, and a year after that we ended up in the incubation stage. Honestly, we could even try to apply for graduation; let us know if you'd like to see that. To me it's like, yeah, we could do it, but I'm not sure it's a super high priority. But generally the CNCF helps a lot: look, we are at ThanosCon; they helped us organize this event, and they have helped us massively along the way to spread the message and to maintain Thanos in a vendor-agnostic way, under an open-source license. Another change worth noting happened in Thanos in 2019, when Frederic Branczyk, with the Red Hat team, proposed and actually executed remote-write support for Thanos, or what we call the receiver deployment. From that point, Thanos was able not only to query Prometheus servers via the sidecar, but also to ingest remote-write streams from any source of metrics. And we can really say that the receiver mode nowadays is as popular as, if not more popular than, the sidecar mode. Many, many organizations, as you will see from today's talks, have a hybrid approach where they use a mix of those modes, sidecar and receiver, along the way. So it was a great contribution from Red Hat, and I actually ended up joining Red Hat because of how active they were; many maintainers are still from Red Hat. An amazing core contribution. But the learning I want to share with you here, for which the introduction of the receiver is a really good illustration, is this.
It's again something I think I learned from Fabian: upfront design matters. Really think through your abstractions, interfaces, and APIs. It might take time, and you should keep it time-boxed, but generally it's worth it. In the case of Thanos, something we don't often share is that the design, those ideas, the abstractions and APIs, didn't happen in those three months of initial development; it actually started two years before we began coding, because we were bouncing the ideas around and they kept forming; we only finalized them then, but it took, you know, two years. For example, the receiver mode wouldn't have happened if we hadn't had the Store API, our gRPC API that packs up, in a useful way, the ability to fetch series from any source. Thanks to that, receiving remote write was as simple as embedding Prometheus's existing TSDB code into a process, having it expose the Store API, and accepting the remote-write stream; with that, the initial work was done. It was kind of beautiful how this well-designed interface allowed this use case without changing it at all, and without changing the other parts of Thanos that connect to it. And when you think about it, many other features stayed stable for those six years without changing. We still have functional sharding, in the sense that your querier can pick which sources to query based on external labels. We still use object storage as our main source of truth for the data. We use the TSDB block format, unchanged since the beginning. We use PromQL, of course; we love it. And down-sampling, which was designed at the end of that MVP process, has served us well; we literally haven't changed it.
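The functional-sharding idea just mentioned can be sketched briefly. This is a hedged, illustrative Python sketch with assumed names (the real Store API is gRPC and streams series): the querier knows each store's external labels, so it fans a query out only to the stores whose labels don't contradict the query's selector.

```python
# Illustrative sketch of external-label-based fan-out; names are
# assumptions, not the real Thanos Store API.
from dataclasses import dataclass

@dataclass
class Store:
    name: str
    external_labels: dict

    def series(self, selector: str):
        # A real store streams matching series over gRPC; here we
        # just record which store was asked for what.
        return f"{self.name} queried for {selector}"

def select_stores(stores, selector_labels):
    """Keep a store unless one of its external labels has a different
    value than the selector demands for the same key."""
    return [
        s for s in stores
        if all(s.external_labels.get(k, v) == v
               for k, v in selector_labels.items())
    ]

stores = [
    Store("eu-prometheus", {"region": "eu", "replica": "a"}),
    Store("us-prometheus", {"region": "us", "replica": "a"}),
]

# A query selecting region="eu" only needs to touch the EU store:
targets = select_stores(stores, {"region": "eu"})
print([s.name for s in targets])  # ['eu-prometheus']
```

The nice property is that adding a new data center is just registering another store with its own external labels; the selection logic, like the real API, doesn't have to change.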
So all those things stayed for six years, which proves how much strong foundations really pay off. The last but not least aspect is community work. Writing code and fixing bugs is one thing, but helping others, teaching them how to use the project, and enabling new maintainers is a really important step. I'm really grateful for the community and the Thanos team that jumped in and helped me, with a lot of energy, to mentor new members of the community, to deliver Thanos talks, to run community meetings, and to cut, like, hundreds of releases. It's pretty amazing, and the final learning I want to share here is: a proactive community enables people. What I mean is that it's not enough to just write blog posts and give talks. What worked for the Thanos project was personally connecting with other people, potential users, potential contributors or maintainers: listening to their stories, listening to what they need, and actually enabling them, trying to help solve their problems. Sometimes mentoring them, sometimes literally teaching them Linux from scratch. I remember directly reaching out to the people who asked questions in the very beginning, trying to help them, for example on Slack, literally direct-messaging them. And I was proactive in the sense of: oh, you have this idea, you solved it; come give a Thanos talk about it with me; let's write a blog post; maybe you want to officially become a contributor or maintainer and take on more responsibilities. I was asking those questions and being very direct. I was not waiting for maintainers to just come to me. So this is an important step: be proactive and help people, and don't really expect anything in return; in reality, the love will come back to you.
You have to give without expecting anything back, and it pays off. I see people growing thanks to interacting with Thanos, and I know about several people who got really amazing jobs thanks to being part of the Thanos team or contributing, which is super powerful, super amazing. It's a kind of side effect of the project, and I'm super excited for this to continue and enable more people. So finally, to sum up, I'm particularly proud of Thanos's growth. This is the history of our GitHub stars. Look how linear it is, steadily going up; it looks totally different from the typical hype cycle, where you have hype, then rejection, then maybe stabilization. That's because it's organic growth; it's very healthy. And all of this without even a 1.0 version, not to mention the beta and stable markers that sometimes work as marketing; and even though we were super stable, I think we deprecated only two things in the whole history of Thanos. So I think that's something worth celebrating today. Thank you so much. I will open it up for questions. Let's have some questions if you have any. Go for it.

Okay, so, I will repeat the question: when and why did we change the name from prom-lts to Thanos? I think almost immediately, literally when the first or second commit appeared, probably the same week, we were of course discussing the name. If you are curious: at this startup, Improbable, we had an infrastructure with multiple microservices, and they always had superhero names, from Marvel or wherever. For example, Spider-Man, which was hosting the web, of course. We had Aquaman, we had Iron Man; Iron Man, I remember, was orchestrating virtual machines: iron, right?
So lots of those, and we really wanted to have a hero, and Thanos clicked because it was, you know, killing half of your problems, your monitoring problems, and it generally worked well. I think that's the story. All right, we have to finish to make room for the next talk, but thank you so much, and you can grab us for more questions later on. Thank you.