Hello everyone. Welcome to "Better Scalability and More Isolation: The Cortex Shuffle Sharding Story." I hope everyone's having a good KubeCon so far. My name's Tom. I'm the VP of Product at Grafana Labs. In what little spare time I have, I'm also one of the Prometheus team, my main contribution being the remote write code. I started the Cortex project, the horizontally scalable version of Prometheus that's part of the CNCF, and more recently I started Loki, which is Grafana Labs' log aggregation system. When I'm not looking after these projects, I have a bunch of 3D printers sitting on my desk, and I used to make beer as well, though I haven't had a chance to brew for a while.

Today we're going to talk about shuffle sharding and how it allows us to build a more scalable version of Cortex with better isolation. Before we do that, I'd like to spend some time introducing you to Cortex: what it is, what it does, and why it's important. Then we'll look at how we solved this problem before we added shuffle sharding, then at shuffle sharding itself and what it does, and finally at how good it is, whether it really delivers on what we said it would.

So without further ado: Cortex, horizontally scalable Prometheus. Prometheus is an awesome monitoring system, incredibly easy to use, and we see a lot of people get started with it very easily. You deploy it alongside your applications, you instrument your applications or add a few exporters to adapt them to Prometheus, you attach Grafana, and very quickly you can build some really awesome dashboards. You get great insight into your applications' behavior and can start debugging them and responding to any problems they might have. It's a really powerful and flexible system.

The challenge we see with Prometheus is when you start to grow beyond the confines of a single location, a single data center, a single region. Maybe you've got your application deployed in three different locations; at Grafana Labs we run fifteen-plus Kubernetes clusters. In the first instance, what we see is users adding a data source to Grafana for each one of these clusters. This lets you build dashboards with little drop-downs where you select which region you're interested in. The reason Prometheus has to be deployed like this is that Prometheus has to sit next to your application: it wants to talk to the local cluster to do service discovery, and it wants to connect directly to the application to collect metrics from it.

To be clear, the problem is that while this approach is fine for getting information about an individual cluster, there's no way of getting a global latency number or finding out what the global error rate is. It just can't do it, because each cluster is being monitored by an independent Prometheus. In the Prometheus world there is a solution for bringing this all together into a central, global view: we recommend people deploy a global federation server. This federation server scrapes the federation endpoint on each of your Prometheus servers and brings that data into a single place where you can run central queries, where you can ask things like "what's my global latency?" or "what's my slowest region?" This isn't that tricky to set up, and it generally works reasonably well.
You've got to get the authentication and firewalls working correctly, and you've got to secure the network and so on, but in general it's feasible. Where this starts to break down is when people scale very large and start storing all the raw data in a single Prometheus server; it's very easy to overwhelm that central global federation server. So we recommend as best practice that people only federate pre-aggregated data, and commonly this means recording rules that have aggregated away, let's say, the instance label. These are very useful for building those dashboards, but they really prevent you doing drill-downs and ad hoc queries against the global federation server. If the federation server points to a problem in a region, it won't be able to point to a problem with a particular instance of a service, because you've aggregated that label away.

So about five years ago now we were looking for a different way of doing this, maybe a better way of doing this, and this is where we built Cortex. Cortex replaces the need for that global federation server: you have all the edge locations push all their raw samples directly to the Cortex cluster. This is good for two reasons. One, this is now a push, not a pull, which in some ways is more sympathetic towards how a lot of organizations have their networks organized. But also, the Cortex cluster is scalable, so as you add more clusters and more metrics in individual locations, you can scale up the Cortex cluster to take all the load, all the raw data. This makes it really easy to do those ad hoc queries: you've got all the data there, so you can drill down within the central Cortex cluster. And because you've centralized all of this, there's also one natural place to add things like long-term storage, to invest in query performance, and to make sure your users know there's one place to go to get all their answers.

So that's Cortex in a nutshell. It's a time series database. It uses the same storage engine and query engine as Prometheus, and what we've done in Cortex is add the distributed-systems glue to turn that from a single-node solution into something that works in a horizontally scalable, clustered fashion. Cortex is horizontally scalable. It's highly available: we replicate data between nodes, which means that when a node fails you're not going to see gaps in your graphs. We add more durable long-term storage: in Cortex you can store data in an object store and effectively keep it for as long as you like. And finally, one of the things I think makes Cortex quite different from a lot of systems is that it was built from day zero to be natively multi-tenant, to support different isolated tenants on the same cluster. This means that if you're an internal observability team providing a service to the rest of your organization, it's really easy to deploy Cortex and add lots and lots of different isolated teams within your organization, without having to spin up a separate cluster per team. We joined the CNCF a few years ago; we're part of the incubating phase now, it's Apache licensed, and it's available on GitHub.

A bit of a timeline: as I've mentioned, Julius and I started the project almost five years ago now. We originally stored all the data in Amazon DynamoDB.
Then over the next year or two, I added support for Bigtable and for Cassandra. One thing I'm particularly proud of with Cortex is that in the early days I think we got the write path right: it was very scalable and very performant from the get-go, and it didn't take us long to get it to the point where it was effectively done. So early on in the life of the project we started focusing on query performance, on accelerating, distributing and parallelizing massive queries against Prometheus data. I feel like we made some really good strides there with query caching, parallelization and sharding, and I'm very proud of what we achieved. We joined the CNCF Sandbox about two and a half years ago, and then the focus was really on ease of use and on the community. We launched a website, we did a 1.0 release, we wrote a load of docs; generally we put a lot of effort into making Cortex easier to use.

And now we're up to date. More recently, in the past year or so, we've been focused on new and exciting features in Cortex. We added a system called block storage. This is basically the same thing Thanos does; we've reduced Cortex's dependencies so that the only one now is an object store, making it a lot easier to deploy and manage. Block storage is also fantastically cheaper to operate than the previous DynamoDB-based chunk storage. We added shuffle sharding towards the end of last year, which is what I'm going to talk about for the rest of the talk, so I won't go into any more detail now. And more recently we've added things like query federation, relaxing some of those multi-tenancy isolation boundaries so you can query data across multiple tenants, and per-tenant retention, so different tenants can have different amounts of data stored for different lengths of time. So yeah, really exciting progress on Cortex.

But today we're going to talk about shuffle sharding, and to tease you a little bit more, I first have to describe how things worked before shuffle sharding. One of the main goals of Cortex is to be horizontally scalable. What this means is we need to be able to take in data, shard it and spread it amongst the nodes in a cluster. We do this by hashing the labels of the series that get written. This is really how we make Cortex scalable: how we make a cluster, in aggregate, able to cope with more writes and more reads than any single node in that cluster can. This is all automatic; the user doesn't really have to configure anything, and we can scale up as you add new nodes and scale down as you remove them. It's really quite cool.

The challenge with this is that a single node outage can potentially impact all of the tenants on the cluster: the Cat, the Dolphin and the Fox tenants. And it's worth noting that as you add more nodes, the chance of any one of them failing just randomly gets higher. To avoid an outage from a node failure, we replicate the data between nodes. We use a replication factor of three for reads and writes. What this means is that when you write data, we write to three nodes, but we only wait for a positive response from two of them, from a quorum.
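To make that distribution scheme a little more concrete, here's a minimal Go sketch of the idea: hash a series' labels onto a ring of nodes, take the next three nodes as the replica set, and treat a write as successful once a quorum (two of three) acknowledges it. The types and names are illustrative and this is a simplification of what Cortex actually does (real ingesters own many tokens on the ring and writes go out in parallel), not its actual code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// node is an illustrative stand-in for a Cortex ingester.
type node struct {
	name  string
	token uint32 // position on the hash ring
}

// ring is a simplified hash ring: nodes sorted by token.
type ring struct{ nodes []node }

// nodesFor hashes the series labels and walks the ring clockwise,
// returning the next rf nodes as the replica set.
func (r ring) nodesFor(labels string, rf int) []node {
	h := fnv.New32a()
	h.Write([]byte(labels))
	key := h.Sum32()

	i := sort.Search(len(r.nodes), func(i int) bool { return r.nodes[i].token >= key })
	replicas := make([]node, 0, rf)
	for len(replicas) < rf {
		replicas = append(replicas, r.nodes[i%len(r.nodes)])
		i++
	}
	return replicas
}

// write sends the sample to every replica and succeeds on a quorum (2 of 3).
func write(replicas []node, sample string, send func(node, string) error) error {
	acks := 0
	for _, n := range replicas {
		if send(n, sample) == nil {
			acks++
		}
	}
	if acks >= len(replicas)/2+1 {
		return nil // quorum reached: the write is considered successful
	}
	return fmt.Errorf("only %d of %d replicas acknowledged the write", acks, len(replicas))
}

func main() {
	r := ring{nodes: []node{
		{"ingester-0", 0}, {"ingester-1", 1 << 30}, {"ingester-2", 2 << 30}, {"ingester-3", 3 << 30},
	}}
	replicas := r.nodesFor(`{__name__="http_requests_total",job="api",instance="10.0.0.1"}`, 3)
	err := write(replicas, "sample", func(n node, s string) error { return nil })
	fmt.Println(replicas, err)
}
```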
This means that when there's a node outage, you can continue to write uninterrupted to the cluster, because we'll still be getting that positive response from two of the nodes. What you'll see, though, is that if a second node fails, even with replication factor three, we're going to have an outage, because we're no longer getting that positive response on writes; we don't know that they've succeeded. And at that point you basically have an outage for all of your tenants.

Now, what's potentially more worrying is that Cortex clusters keep getting bigger. Five years ago we were running four or five node clusters, then 10 and 20 node clusters; now we're running multi-hundred node clusters. The chance of two node failures, just randomly or through user error, is getting higher, so the chance of a total outage on the cluster is getting higher. And it's worse than that: because every tenant, in effect, is writing to every node in the cluster, if there's a bug in Cortex or a misconfiguration and a tenant finds a way to exploit it, a poison request or a bad query could take out an entire cluster for all tenants. These really are the problems we're trying to solve with shuffle sharding.

Shuffle sharding, to be clear, is not the only way of solving them. We could do something simple, something we call bulkheads. You can think of this as, instead of having one big nine-node cluster, you just have three smaller three-node clusters and you map tenants to clusters. This way an outage or a poison request by Cat would not affect Dolphin or Fox, a sentence I never thought I'd say. We also see that a two-node outage would have to be in the same shard to impact any tenant. The challenge here is that this mapping is relatively rigid. It's very hard in this world to have a tenant that needs all nine nodes' worth of throughput. It's also hard to scale: if I want to scale up, do I scale all of the shards up? Do I scale one of the shards up? What do I do? Generally you can see how this kind of cellular approach can become a management burden.

So this is where shuffle sharding comes in, and now I'm going to try and explain how it works, and then we'll go on and analyze how we tune it and what its properties are. The first thing worth saying is that we didn't invent shuffle sharding. The first time I became aware of it was an article in the Amazon Builders' Library about how they improved the isolation in Route 53, the DNS service, using this technique they called shuffle sharding. When it was published it got passed around internally at Grafana Labs, and we thought it would be a really interesting piece of work to do on Cortex; we could see its direct benefits.

What shuffle sharding does is effectively pick a random sub-cluster of the cluster for each tenant. We pick this subset at random, but it's deterministically random: we hash the tenant ID to select the nodes in the cluster, and then, within the nodes that tenant is using, we use the normal Cortex replication scheme to distribute writes among them. This gives you a nice property where you can have tenants of different sizes using the same cluster, and you can control the isolation between tenants depending on how many nodes you give each tenant.
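Here's a minimal sketch of that selection in Go, assuming we simply seed a deterministic shuffle with a hash of the tenant ID and take the first few nodes. The real Cortex implementation is more involved (it's zone-aware, for instance), but the idea is the same:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// shuffleShard deterministically picks shardSize nodes out of nodes for the
// given tenant. The same tenant always gets the same subset, while different
// tenants get (mostly) different subsets.
func shuffleShard(tenantID string, nodes []string, shardSize int) []string {
	// Seed a PRNG from the tenant ID so the selection is stable across calls.
	h := fnv.New64a()
	h.Write([]byte(tenantID))
	rng := rand.New(rand.NewSource(int64(h.Sum64())))

	// Shuffle a copy of the node list and take the first shardSize entries.
	shuffled := append([]string(nil), nodes...)
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	if shardSize > len(shuffled) {
		shardSize = len(shuffled)
	}
	return shuffled[:shardSize]
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9"}
	// Writes for each tenant are then replicated with the normal scheme,
	// but only within that tenant's sub-cluster.
	fmt.Println("cat:    ", shuffleShard("cat", nodes, 3))
	fmt.Println("dolphin:", shuffleShard("dolphin", nodes, 3))
	fmt.Println("fox:    ", shuffleShard("fox", nodes, 3))
}
```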
To give you an example: if we have a three-node outage in this situation, we can see that it only affects one tenant, because both Dolphin and Fox had just one node affected by that outage. So this is the basic idea; it gives you much better tolerance to failure, with a partially degraded state instead of a total outage. Another example: I talked about a poison request earlier, and if Cat were to send a poison request, you can see how Dolphin and Fox are not affected, because again they've only got one node that's been poisoned.

So this is the basic idea. We randomly select a subset of the cluster for each tenant, we distribute samples from each tenant within that sub-cluster using the normal scheme, and then we tune the number of nodes we give to each tenant to optimize for isolation. This is where we have to start thinking about: how many nodes do we want to give each tenant? How do we optimize isolation? And what are the trade-offs?

I'm going to play cards. Imagine we had a 52-node cluster represented by a deck of cards. We're going to shuffle that deck and deal out four cards. How many different hands, how many different combinations of four cards, do you think there are? It turns out there's some maths that works this out; it's called the n choose k problem, and 52 choose 4 is about 270,000. So if I were to pick sets of four nodes from my 52-node cluster, there are 270,000 different combinations of four nodes. It's a huge number, but in and of itself that's not super useful. What I really want to know is: of those 270,000 combinations, how many of them share one node in common? How many share two nodes in common? It turns out that's not difficult to work out either; there's a link at the top of the slide to a Stack Overflow answer on how to do it. My maths is not good enough to derive this myself, but suffice to say, almost three quarters of these selections don't share any nodes in common, a quarter of them share one node in common, and only about 2.5% share two nodes in common. That's an incredibly strong result: it shows that, for argument's sake, in a 52-node cluster where all the shuffle shards are of size 4, a two-node outage would impact at worst 2.5% of the tenants.

But there's more to it than that. When we're picking how many nodes to give each tenant, we need to make a trade-off: fewer nodes means better isolation. If I give each tenant one node, the isolation between tenants is as good as it can possibly be, because the chance of two tenants hitting the same node is just one in 52. If I give tenants more nodes, though, I'm going to be able to spread the load more evenly. And in a Cortex cluster the tenants aren't all the same size: we have some very large tenants, some very small tenants, and everything in between. So we need an algorithm for picking how many nodes, how many shuffle shards, to give each tenant. One thing I would say is that better load balancing isn't just a nice-to-have: it can lead to higher utilization of resources and a lower cost of running the cluster, and if you run Cortex as a SaaS offering, like we do in Grafana Cloud, that's super important.
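As a worked version of that card-deck arithmetic (a 52-node cluster with shards of size 4), these are the hypergeometric counts behind the percentages above; the numbers here are my own working of the example rather than figures from the slides:

```latex
% Number of distinct 4-node shards in a 52-node cluster:
\binom{52}{4} = 270{,}725

% Of those, the number sharing exactly k nodes with a fixed shard is
% N_k = \binom{4}{k}\binom{48}{4-k}:
N_0 = \binom{48}{4} = 194{,}580 \;(\approx 71.9\%), \quad
N_1 = 4\binom{48}{3} = 69{,}184 \;(\approx 25.6\%), \quad
N_{\ge 2} = 6\binom{48}{2} + 4\cdot 48 + 1 = 6{,}961 \;(\approx 2.6\%)
```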
So we proposed a simple algorithm: give each tenant a number of shuffle shards proportional to its number of series. Let's say you've got a million series and we decide to give you one shuffle shard per 100,000 series; we'd give you 10 shuffle shards. What we really want to find out is what that per-shard number should be. What is the right value for that constant? Again, as I said earlier, my maths is not good enough to derive this from first principles; if anyone in the audience knows how to do it mathematically, I'd be really interested in chatting with you. But I'm a software engineer, so we built a simulator. The simulator modelled a Cortex cluster of a certain size, I think around 60 or 70 nodes, with a set of tenants roughly matching the distribution of sizes we observe in our production clusters. It simulated picking shuffle shard sizes and distributing the samples to the nodes, and it measured the variance in node load, based on the number of series each node holds, and the proportion of tenants that would be impacted by any two nodes going away. I think the simulator's open source, so do ask me afterwards if you want a link to the source code.

Suffice to say, we've got a couple of graphs from the simulator. The first shows load balancing, how well load is distributed within the cluster, versus the size of each shuffle shard: shard size (series per shard) along the x-axis, load distribution along the y-axis. What you can see is that as you increase the number of series per shard, the distribution of load gets worse, as we predicted. Interestingly, the distribution of load starts to tail off; I believe this happens as small tenants start to hit the minimum number of shuffle shards, which is three, for replication. We also see, and the exact numbers aren't super relevant here, this is just a general rule of thumb, that at a shard size of, say, 40,000 series, the maximum number of series on a single node is about one and a half million and the minimum is about 750,000. So there's a factor of two difference, which gives us some idea: we probably don't want a factor of two difference in the load on our nodes, because that's going to make it very hard to optimize utilization.

The second graph shows that as you increase the size of the shard, the percentage of tenants affected by a two-node outage, which is how we measure isolation, falls and eventually plateaus. We can see that at, say, 30,000 or 40,000 series per shard, way less than one percent of the tenants in your cluster are affected by a two-node outage. This was modelled with 1,000 tenants, averaging I think 100,000 series per tenant, and one of the key things the simulation took into account is that it also modelled the replication factor. So whilst working on this we picked some numbers, debated them internally, found roughly where the two graphs cross, and came up with a good rule of thumb: at around 20,000 series per shard, we have roughly 20% variance in the series per node and roughly 2% of tenants affected by a two-node outage.
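A hedged sketch of that rule of thumb in Go, assuming a constant of 20,000 series per shard and a floor equal to the replication factor; the exact constant and clamping are deployment choices, not something fixed by Cortex:

```go
package main

import (
	"fmt"
	"math"
)

// shardSizeFor returns how many nodes (shuffle shards) to give a tenant,
// proportional to its active series count, with a floor of the replication
// factor so replication still works for the smallest tenants.
func shardSizeFor(activeSeries, seriesPerShard, replicationFactor int) int {
	shards := int(math.Ceil(float64(activeSeries) / float64(seriesPerShard)))
	if shards < replicationFactor {
		shards = replicationFactor
	}
	return shards
}

func main() {
	const seriesPerShard = 20_000 // the constant the simulator helped pick
	const rf = 3

	for _, series := range []int{5_000, 100_000, 1_000_000} {
		fmt.Printf("tenant with %7d series -> %d shards\n",
			series, shardSizeFor(series, seriesPerShard, rf))
	}
}
```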
And I believe the production config we run on our large Cortex clusters roughly matches this; I think 20,000 to 30,000 series per shard is what we run internally. This is really good, because by reducing the chance of an outage for most tenants when two nodes are suffering problems, we've been able to scale up to even larger Cortex clusters, to hundreds of nodes as opposed to tens. We've also managed to better isolate tenants from each other, so there's less noisy-neighbour impact and less chance of a poison pill affecting other tenants. And we managed to do all of this whilst keeping the variance in load amongst the nodes relatively bounded, and therefore not increasing the cost of running the cluster and not passing any extra cost on to the customer. So I think this is a really positive result. I'm really pleased with the work and surprised at how effective shuffle sharding is.

We've talked today about what Cortex is: the horizontally scalable version of Prometheus that allows you to centralize your observability into a single cluster and act as your own service provider within your organization. We've talked about how we distributed load before we implemented shuffle sharding, how we distributed all tenants to all nodes using a hashing algorithm and a kind of DHT. Then we've talked about shuffle sharding, how it effectively builds small virtual clusters inside a much larger real cluster, and how those virtual clusters improve the isolation between tenants at not a huge expense in terms of utilization.

And that's really the talk. I wanted to say thank you to a few people. Thank you to Marco and to Peter, who really did all the work here; they should be the ones giving this talk. What's more, the slides I'm giving here are an evolution of Marco's internal slides from a talk he gave inside Grafana Labs. I also wanted to say thank you to Amazon. They sponsored Grafana Labs to make these changes to Cortex, worked closely with us on the design, reviewed it, and gave us some great feedback. If you want to hear more about how Grafana Labs and Amazon have worked together to help Amazon launch their Prometheus service, there's a post on Amazon's blog and one on Grafana's blog that goes into a little detail about how the relationship worked and what kind of things we've built for Amazon. And with that, I'd like to say thank you and open up the floor to questions. Thank you.