Hi. First ever ThanosCon is pretty fucking neat, to be honest. As Bartek said, so neat that I have travelled halfway around the world and am currently enduring jet lag that would make fucking Kronos blush in order to come here and talk to you about Thanos, right? In particular, I want to talk about running a highly available, globally distributed Thanos that serves more than 50 terabytes of metrics a day from over a thousand sidecars in Cloudflare's 600 or so data centers around the world. And if all that sounds like something an SRE might scream in some sort of night terror, it is, and I've had it. You can ask my partner. But thankfully, in order to stave off all of the nightmares that lurk around every corner of our infrastructure these days, we've put a bunch of tooling in place to help, and today I want to talk about that. I want to talk about our Thanos journey, how we've gone from kind of humble beginnings up to what we have now, which helps us meet a decently respectable set of SLOs.

But before we get into all that: hi, I'm Colin Douch. I tech lead our observability team at Cloudflare. We build, maintain and operate all the tooling that helps Cloudflare debug Cloudflare, and part of that is operating this nightmare of a Thanos cluster.

Our Thanos journey started pretty early on, as Bartek said, around the beginning of 2018. We had just replaced our old monitoring setup, Nagios, with Prometheus, and we were looking to replace our long-term metric storage, which at the time came in the form of a decently large OpenTSDB instance. There were a few main reasons for that, but chief among them: OpenTSDB relies on HBase, and HBase relies on Hadoop, and Hadoop is nightmare software created by the trickster god Janus in order to confuse and befuddle any SRE who tries to deal with it. Or at least that's what I would say if I wasn't just trying to paper over the fact that Java was not a core competency of our team at the time. We were very much a Go shop, and so maintaining it was a real bear.

Back then our numbers were a hell of a lot smaller. We had just under a hundred data centers and just under a billion active time series. Our Thanos infrastructure was obviously quite simple as well. In each of our data centers, we operate a few management nodes, ranging from one up to, I think, ten, depending on the size of the data center. Each of those management nodes runs Prometheus, so hey, let's slap a Thanos sidecar next to all of those. We'll create a bucket in our internal S3 storage that they can upload to. We'll throw in a store. We'll throw in a query. Hey, done, right?

And that worked really well for about a year and a half, but it had a few glaring problems. Chiefly, it didn't really scale well. These days you can split up a bucket with time-based partitioning and things like that. That didn't exist at the time, and so we were mostly stuck with just vertically scaling our store. At the same time, our compactors were hitting sporadic bursts of memory and CPU usage that our infrastructure was really struggling to accommodate. So once our Thanos got too slow to be really acceptable, it was time to do some sharding, right? Our one bucket became nine, one for each region we operate in. Each data center uploads to the bucket corresponding to its region. Alongside each region, we have a store, a compactor, and a central query tying it all together. Again, pretty boring, right?
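To make that a bit more concrete, here's roughly what that per-node setup amounts to. This is a sketch only: the bucket name, endpoint, and paths are made up for illustration, though the flags are the standard Thanos sidecar ones.

```shell
# Sketch of the early setup: one sidecar per Prometheus, all uploading
# to a shared bucket (later, one bucket per region). Names are made up.
cat > /etc/thanos/objstore.yml <<'EOF'
type: S3
config:
  bucket: thanos-blocks-weur   # the region's bucket
  endpoint: s3.core.internal   # internal Ceph S3 gateway
EOF

thanos sidecar \
  --tsdb.path=/var/lib/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=/etc/thanos/objstore.yml
```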
The one interesting thing to note there is that we chose regions because they're pretty static, at least until we make an Antarctica data center or something, which Akamai has for some reason. We can create a set of Thanos infrastructure for our regions and then sort of set it and forget it. We're not getting new regions anytime soon.

When time-based sharding dropped in Thanos, we took the opportunity to normalize our stores a bit. Despite being a global network, our regions are very mismatched in terms of capacity. North America has much more capacity than APAC, for example. Those differences in capacity filter down into differences in time series, differences in bucket sizes, and differences in compute requirements for our stores. Time-based sharding let us normalize that a bit: we can spin up multiple stores for each bucket. But at the time we weren't collecting any query metrics or anything like that, so it was very much vibe-based. We would spin up a new store whenever things started to feel slow. It's not super rigorous, but honestly, vibe-based scaling is kind of great for anything you don't need to be too rigorous about.

Eventually, as we got more capacity as a team and as our Thanos continued to grow towards the nightmare that it is today, there came a time when we did need to start being a bit more rigorous. New data centers were coming online faster than we could keep up with sharding our stores. As much as we wanted to, we were really falling behind, and the Thanos user experience was suffering as a result. That's what brings me to this talk: scaling Thanos in dynamic Prometheus environments. It turns out scaling Thanos is really difficult in a world where bits of your infrastructure can come and go on a whim, and I want to talk about all the tooling we've developed to handle that. It comes in four main flavors: we've got our storage, we've got our compactors, we've got our Thanos stores, and we've got our Thanos queries.

Let's talk about Ceph. For the uninitiated, Ceph is blob storage. You can throw files at it, it stores them, you can slap an S3 API in front of it, and voila, you've got your own private S3. At Cloudflare, Ceph was one of those things that everyone loved to use and no one wanted to own. It fell to our core SRE team, who I feel very sad for, but it meant it was basically unmaintained, which is a bit of a problem when we use as much Ceph as our Thanos does. Thankfully, when we started looking for alternatives, Cloudflare was just around the corner from announcing R2, our own object storage. This isn't a sales pitch, and my marketing team is going to hate me for this, but I don't give a shit if you use it. It's just a bit of flavor to our experience. R2 was pretty much a perfect solution for us. A, it was maintained. B, we could dogfood our own products, which our product teams really, really like. And C, not having things centralized in our core data centers meant that we could start to treat Thanos as a bit of a stateless application, which did wonders for our HA story.

The question was, how do we migrate? We keep 13 months of retention by default, so just lopping that off and starting fresh in R2 isn't really workable. At the same time, that 13 months is about 50 petabytes or so of data, so migrating it manually over any form of network link is not going to work either. The recommended upstream solution to this is to run two sidecars... sorry, not yet. It's to upload to one object storage and then replicate out to another.
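For reference, that replication path is the bucket replicate tool that ships with Thanos, pointed at a source and a destination objstore config. Roughly, with placeholder config paths:

```shell
# Mirror blocks from the source bucket (Ceph, for us) into the
# destination bucket (R2), without touching the sidecars themselves.
thanos tools bucket replicate \
  --objstore.config-file=/etc/thanos/ceph.yml \
  --objstore-to.config-file=/etc/thanos/r2.yml
```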
We didn't really want to fuss with our existing Ceph stuff, because we had to keep it going while we were migrating. So in our case this would mean centralizing into Ceph and then decentralizing back out to R2, which felt a little bit icky to us. Again, the obvious solution to us was to run two sidecars, one uploading to Ceph and one uploading to R2. But no, Thanos doesn't want you doing that. When you run a Thanos sidecar, it stores which blocks it has uploaded in a JSON file in your Prometheus data directory, the shipper metadata JSON file, which is just a giant array of block IDs. That file name is a const in Thanos, so no matter how many sidecars you run, every block will only ever be uploaded once, even if the sidecars are pointing at completely different object storages. A bit of a problem, but hey, one patch later to make that configurable, and we could start to build up a bit of retention in R2 while keeping our existing Ceph more or less unchanged. Hey, you're welcome.

Let's talk about those per-region buckets, though. Despite being able to scale out with time-based sharding, those per-region buckets put a bit of a vertical limit on how much we can scale. As much as we can keep sharding out our stores, you start to get diminishing returns when your stores are handling 24 or even 12 hours of data in order to meet our SLOs. What we really wanted was to push this to its logical conclusion, down into per-data-center buckets. Which brings me to the dynamic part of our journey, because Cloudflare's network has been growing pretty quickly recently. It used to be that we would get a new data center online every couple of months or so, which might have been workable if we wanted to manually create this Thanos infrastructure. But that doesn't really work these days. These days we have a new data center coming up, or an old data center going away, every week or so. That's far too much to be managing manually. We needed some sort of automated way to do it, right?

The way we started was a central provisioning service. In our core data centers, our central data centers, we keep a database of data centers, which is a hell of a sentence to say, apparently. We have a system that can poll that database, and so, hey, we'll create Thanos infrastructure whenever we notice a new data center in that database. Unfortunately, that comes with a bit of a race condition. If a data center comes up before our central provisioning service has noticed that it exists, well, our sidecar comes up, it doesn't have a bucket, it immediately crashes, it pages you at 3 a.m., and our infrastructure team gets sad because they can't bring up the data center. Bit of a problem.

We wanted to do something a bit more inline with the sidecar process, so we ended up building this into the Thanos init script. Our Thanos sidecars are provisioned with admin credentials to our R2. When they first start, they check whether they have a bucket. If they have a bucket, great. If they don't, they can make their own, and then drop themselves back down to write credentials so we don't get any privilege escalation or anything like that. With that in place, our sidecars can never crash because they don't have a bucket. Whenever a data center comes up, there will always be a place for its Thanos data to go, which is pretty nice.
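Here's a minimal sketch of that init step, assuming an S3-compatible API like R2's. The bucket name is hypothetical, and the credential drop at the end is elided:

```go
// Minimal sketch of the bucket-provisioning step in the sidecar init,
// assuming an S3-compatible API. Names are hypothetical.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// ensureBucket checks for this data center's bucket and creates it if
// it's missing, so the sidecar never crashes for lack of somewhere to
// put its blocks.
func ensureBucket(svc *s3.S3, name string) error {
	_, err := svc.HeadBucket(&s3.HeadBucketInput{Bucket: aws.String(name)})
	if err == nil {
		return nil // bucket already exists, nothing to do
	}
	// HeadBucket reports a missing bucket as a "NotFound" error code.
	if aerr, ok := err.(awserr.Error); !ok || aerr.Code() != "NotFound" {
		return err // a real error, not just a missing bucket
	}
	// Missing: create it with the admin credentials the init step holds.
	_, err = svc.CreateBucket(&s3.CreateBucketInput{Bucket: aws.String(name)})
	return err
}

func main() {
	// This session holds the admin credentials the init step starts with.
	sess := session.Must(session.NewSession())
	if err := ensureBucket(s3.New(sess), "thanos-blocks-sin01"); err != nil {
		log.Fatal(err)
	}
	// ...then re-exec the sidecar with plain write credentials.
}
```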
Let's talk about compactors. Compactors are annoying. They're really, really bursty. They slumber for eons until eventually waking up and consuming the souls of your poor CPUs and RAM sticks. In our infrastructure, there are two places we could use to confine such a being: we can put them in Kubernetes in our core data centers, or we can decentralize them out onto our edge. You might think that Kubernetes is the perfect place for this, right? That's what Kubernetes does. It scales up and down really, really quickly. But being an entirely bare-metal organization, we have a nicer option.

Our edge traffic is interesting. This is CPU utilization from one of our edge colos, and you can see it's really cyclical. When a data center wakes up for the day, people are using the internet and sending traffic through Cloudflare, so our CPUs are pretty busy. They're doing stuff, serving the internet. When it's nighttime for that data center, fewer people are using the internet, we're serving less traffic, and we have a bunch of wasted CPU. That's perfect for these sorts of batch jobs to mop up that spare capacity. This isn't uncommon in an on-prem environment: we have to provision for peak load, because we can't use EC2 and scale up and down as we want to.

On our edge, we use Nomad to provision these things. We can create a periodic job, basically a cron. This allows us to start a compactor at the low time for that data center, and it can keep chugging. You'll note we have to time it out before the data center wakes up for the morning, because otherwise we're chewing resources that would be better spent serving our customer requests, which is not great. If we ever hit that timeout, that's a real problem, though, because it means we haven't been able to compact all the blocks for that day. If we ever hit it multiple days in a row, we start to get a bit of a backlog, and that becomes a real problem. Thankfully, we haven't hit it yet, and we have solutions for when we do: we can increase resource quotas, we can shard things out again. It's an easy enough problem to fix when we do hit it.
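The job spec looks roughly like this. The cron window, timeout, and paths are made up, but the shape is a stock Nomad periodic batch job wrapping a single compaction pass (thanos compact without --wait runs one pass and exits):

```hcl
# Illustrative only: times, paths, and sizes are made up.
job "thanos-compact" {
  type = "batch"

  periodic {
    cron             = "0 2 * * *" # the colo's quiet hours
    prohibit_overlap = true        # never run two passes at once
  }

  group "compact" {
    task "compact" {
      driver = "exec"
      config {
        command = "/usr/bin/timeout"
        args = [
          "6h", # hard stop before the data center wakes up
          "thanos", "compact",
          "--data-dir=/var/lib/thanos-compact",
          "--objstore.config-file=/local/objstore.yml",
        ]
      }
    }
  }
}
```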
Let's talk about stores. Stores are really annoying, because unlike compactors, they don't operate in a vacuum. You actually have to be able to query them, and that means they have to be in a place that is convenient to query. The annoying fact about our edge is that most of our edge is not convenient to talk to. At any given time, maybe a quarter of our data centers are fully offline or have packet loss on one or more of their links, so there's really only one place we can put these: we have to centralize them next to our queries.

Scaling a store is an interesting problem. Like compactors, they're really bursty whenever someone issues a query, but in much more variable bursts, depending on the size or weight of that query. That makes it really hard to run automated scaling solutions like horizontal pod autoscalers, because by the time your new resources have come up, well, the query's over. You can't use those new resources anymore. What we really wanted was something a bit more stable that we could use to pre-provision a bunch of infrastructure, so that it would be around when those queries first started hitting. What we found is that it's actually kind of trivial. This graph is incredibly noisy, right? But unless I'm blind, there is a bit of a trend there. This is query times versus the amount of data that a store has to process, taking into account indexing for blocks and things like that. If we measure that against our desired SLOs, we work out that we can handle about five terabytes of blocks per store and stay within those SLOs.

With that information, we created what we call the store sinker. The store sinker's job is to take that five-terabyte number and chunk up each of our buckets into chunks of that size, and then create a store for each of those chunks (I'll sketch that chunking logic in a moment). Within those chunks, we can then scale normally with traditional signals: CPU, memory, all that stuff. We do some additional logic around pre-provisioning more stores for the most recent data, because that's what people query. But this is kind of nice. It helps us scale out a bit more stably than we would if we were using horizontal pod autoscalers and things like that.

Finally, let's talk about queries. Queries are interesting. Do you know what happens when you issue a query to Thanos? Well, by default, it doesn't know where all your data is. If you chuck a query into Thanos, it fans out to every store API it knows of, which is really bad in a distributed world, because it means that every query you do is bottlenecked by your slowest sidecar or store API. That's a problem in Cloudflare's world, because our slowest store API is generally somewhere really badly connected, sub-Saharan Africa, that sort of thing. So all of our queries are slow. Not great. Thankfully, Thanos, as Bartek mentioned, can limit that fan-out if you include one or more external label filters. Basically, your sidecars can read the Prometheus config they sit next to, pick up the external labels for that Prometheus, and broadcast them to your query. If you have a query with an external label filter that doesn't match the external labels of a sidecar, well, great, we don't have to talk to that sidecar. This can dramatically improve your P99 query times.

Which is where our last component comes in: the label enforcer. This is the smallest component we run, but also the most effective. It's a little Go reverse proxy that sits between our query frontend and our query, and it inspects every query that comes through that path. If a query has an external label filter, it lets it through, and if it doesn't, it rejects it. This lets us communicate things to our engineers incredibly nicely. Many of our engineering teams are only just weaning onto metrics in general when they first join us; we don't expect observability to be a core competency. So being able to teach people about Thanos this way is pretty great. For example, you can query one data center with a data center name label, that's one of our external labels. You can query multiple with a regex, or you can query all of them with a regex that matches all of them. But that makes the slow case opt-in. You opt into querying all of them. You opt into that slow query, which I think is kind of neat.
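As promised, here's a rough sketch of the store sinker's chunking pass. It assumes we've already listed a bucket's blocks, sorted by time, with sizes and time ranges taken from their meta.json files; the types and the five-terabyte constant are illustrative:

```go
// Rough sketch of the store sinker's chunking pass. Assumes blocks are
// already sorted by time; sizes and types are illustrative.
package main

import "fmt"

type block struct {
	minTime, maxTime int64 // unix ms, as in a block's meta.json
	sizeBytes        int64
}

const shardLimit = 5 << 40 // ~5 TiB of blocks per store

// shard cuts a new store shard whenever the running block size would
// exceed the limit. Each [min, max] pair becomes one store gateway,
// with the bounds translated into --min-time/--max-time flags.
func shard(blocks []block) [][2]int64 {
	var shards [][2]int64
	var size, start int64
	for i, b := range blocks {
		if i == 0 {
			start = b.minTime
		}
		if size > 0 && size+b.sizeBytes > shardLimit {
			shards = append(shards, [2]int64{start, b.minTime})
			start, size = b.minTime, 0
		}
		size += b.sizeBytes
	}
	if size > 0 {
		shards = append(shards, [2]int64{start, blocks[len(blocks)-1].maxTime})
	}
	return shards
}

func main() {
	blocks := []block{{0, 100, 3 << 40}, {100, 200, 3 << 40}, {200, 300, 1 << 40}}
	for i, s := range shard(blocks) {
		fmt.Printf("store %d serves blocks from %d to %d\n", i, s[0], s[1])
	}
}
```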
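And here's the shape of the label enforcer idea itself: a tiny reverse proxy that parses each query and rejects it unless every selector carries a matcher on an external label. The "datacenter" label name and addresses are illustrative, and it only handles GET-style queries; treat it as a sketch, not our production code:

```go
// Sketch of a label enforcer: a reverse proxy between the query
// frontend and the querier that rejects queries carrying no external
// label filter. The "datacenter" label and addresses are illustrative.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"

	"github.com/prometheus/prometheus/promql/parser"
)

// hasExternalLabelFilter reports whether every selector in the query
// carries a matcher on the datacenter external label.
func hasExternalLabelFilter(query string) bool {
	expr, err := parser.ParseExpr(query)
	if err != nil {
		return false // unparseable queries get rejected too
	}
	ok := true
	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		if vs, isVS := node.(*parser.VectorSelector); isVS {
			found := false
			for _, m := range vs.LabelMatchers {
				if m.Name == "datacenter" {
					found = true
				}
			}
			ok = ok && found
		}
		return nil
	})
	return ok
}

func main() {
	upstream, err := url.Parse("http://thanos-query.internal:10902")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// GET only in this sketch; POSTed queries would need body buffering.
	enforce := func(w http.ResponseWriter, r *http.Request) {
		if !hasExternalLabelFilter(r.FormValue("query")) {
			http.Error(w, `query must filter on an external label, e.g. {datacenter="sin01"}`,
				http.StatusBadRequest)
			return
		}
		proxy.ServeHTTP(w, r)
	}
	http.HandleFunc("/api/v1/query", enforce)
	http.HandleFunc("/api/v1/query_range", enforce)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With something like that in place, a query such as sum(rate(http_requests_total{datacenter=~"sin.*"}[5m])) sails through, while the same query without the matcher comes back as a 400 that tells the engineer exactly which label to add.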
And that gives us our four components. We have our sidecars that automatically create their own buckets. We have our compactors that get scheduled around the world in order to mop up spare capacity. We have stores that scale out automatically depending on the compute requirements of the buckets. And we have queries that guide our engineers towards fast queries by default. That's about all I have. Those four components together help us meet a decently respectable set of SLOs, I think. In a world of data locality and a world of edge computing, these distributed Thanos problems are only going to become more of a thing, and I hope that we can continue to work towards a proper solution for them. As I mentioned, slides will be on that link after this, when I get around to putting them up. But thank you.

Questions? Any questions for Colin?

Are you ever going to contribute the label enforcer to upstream Thanos? Yes, once our legal team ticks it off.

You mentioned time-based partitioning. Have you also done any hash-based partitioning, and have you found a difference between the two? Time-based partitioning works really well for us, because we can sort of work out which time blocks are queried most. Hash-based becomes pretty random and hard to deal with.

Maybe one last question? Okay. Thank you, Colin. Amazing.