My name is Eben, and in my day job I build a lot of cool stuff at this company called Honeycomb.io, and I'd be happy to talk to you later about tracing, events, observability for distributed systems, some of the stuff that we do. But that's not what we're here to talk about today. Instead, we're here to talk about queuing theory.

Now, like a lot of things in computer science, queuing theory has kind of a naming problem. I don't know about you, but I'm not really big on queuing, waiting around in lines. I'm not too enthusiastic about theory either, unless it has some useful application for what I do. But the truth is that queuing theory is all about asking and then answering questions that we have as system operators. It's common to have questions that are relatively difficult to answer: if we have something that we scale on Kubernetes or whatever, what level of utilization is appropriate? If we want to make changes, what changes should we make to improve performance? And if we're not willing to accept pure guesswork as an answer, then queuing theory gives us both a vocabulary and a toolkit, a set of results and theorems developed and presented by some really smart people, to help us approximate the software systems that we build with simpler models that we can either reason about mentally or analyze analytically. That helps us interpret the data that we collect, either from instrumenting our real production systems or from conducting controlled benchmarks. And ultimately, it helps us better interrogate and understand the systems that we build and operate every day. These methods are especially relevant, of course, when infrastructure is distributed, dynamically scaled, ephemeral, highly concurrent, cloud native, if you will. So although this is not a talk about hot new technology, in fact, it's a talk about results that are more than 100 years old in some cases, I think it's one that is increasingly relevant to this audience.

In this talk, I want to make these theses concrete by talking first about how you would build a model of a simple serial system, one that does just one thing at a time, and use it to derive some insights about real-world workloads. Then I want to extend that to parallel systems: how you can reason about load balancing strategies, and how you can use something called the universal scalability law to predict how your system will scale as your traffic or your capacity grows. And finally, we'll summarize some of the things that we see along the way and some key takeaways.

Before we dive in, an important and essential caveat. Any model is, by definition, a flawed representation of reality. It is reductive; otherwise it wouldn't be a model. It's also worthless unless backed by real data. If we just make up some stuff and roll with it, that is not helpful unless we somehow validate and back that model with production data, instrumentation, and controlled experiments, benchmarks. However, if you do gather data, having some sort of model helps you make sense of it. Instead of drowning in metrics, you can ask useful questions and validate your assumptions. If your service has some sort of performance bottleneck, perhaps you can identify that bottleneck before you actually hit it and cause an outage. If you want to create a benchmark, you can use queuing theory to decide whether that benchmark is a plausible representation of some different real-world workload.
And finally, if you're planning changes that might be expensive to make, a re-architecture or something, you can use queuing-theoretic results to help decide whether those are good changes or not, what their effects are likely to be.

So let's dive in by talking about serial systems. To motivate this section, I want to talk about one of the services that we operate at Honeycomb, our ingestion API. The details of this thing aren't too important, but it looks like a lot of services that you are probably familiar with or perhaps run yourself. It's user-facing: in our case, it receives ingestion data from customer infrastructure, kind of a lot of data. It's stateless, highly concurrent, runs across many servers. It's mostly CPU bound, and it's supposed to be low latency, which for us means target latencies in the single-digit millisecond range at worst. We currently run this thing on EC2; we could run it as a stateless service on Kubernetes as well. Either way, that means that we could in principle allocate more or less infinite resources to this thing. Unfortunately, that would also cost more or less infinite money. So our question is: how do we allocate appropriate resources to this service? How do we provide high performance at a reasonable price point?

There are a couple of ways that we could approach this problem. We could guess, but that would make for a pretty short talk. We could do large-scale, production-scale load testing: either generate enormous amounts of traffic against the real infrastructure, or create a full-scale copy of the real infrastructure and generate enormous amounts of traffic against that. This is obviously a pretty good idea, like it would be a worthy undertaking, but it's also a time-consuming thing for a resource-constrained engineering team to do rigorously. So I want to explore a different strategy. We'll do some small experiments, things that you can do with a few instances and limited time, and then build a performance model that will help us take the results of our experiments and reason about what they imply for our production system. Make sense?

So here's the idea. We have N cores available to run this thing. We know it's more or less CPU-bound, and N is some large number. But what can we do with just one core? What's the maximal single-core throughput of this service? To answer that question, we'll simulate requests arriving uniformly at random, kind of like the requests that we see in the wild, and we'll measure the observed latency at different levels of throughput. Here's what this looks like. Initially, we're doing pretty good: single-digit millisecond latency, a couple thousand requests per second. And then at some point, things start to go really south. So that's obviously not good. Our question: can we find a model that predicts this behavior, that helps us identify this knee in the response time curve before we actually hit it, and that will then let us extrapolate from production data to say, okay, this is the maximum safe utilization for the service?

Here's the idea. Because we've constrained ourselves to a single core, we're going to use a single-queue, single-server model. So we'll say we just have a server, which is some sort of black box that processes tasks. Tasks arrive independently. If the server is busy, those tasks might have to wait until the server can actually work on them, do its data processing or whatever. Finally, the server gets to the task, it does its work, and it returns a result to the client.
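To make this concrete, here's a minimal sketch of the kind of experiment you could run yourself as a simulation, assuming Poisson (independent, random) arrivals and a constant 200-microsecond service time; the rates and task counts here are just illustrative:

```python
import random

def simulate(lam, s, n_tasks=100_000, seed=0):
    """Single-queue, single-server: Poisson arrivals at rate lam (tasks/sec),
    constant service time s (sec). Returns the mean wait time in seconds."""
    rng = random.Random(seed)
    t = 0.0          # arrival-time clock
    free_at = 0.0    # time at which the server next becomes idle
    total_wait = 0.0
    for _ in range(n_tasks):
        t += rng.expovariate(lam)   # independent, random inter-arrival gap
        start = max(t, free_at)     # queue if the server is still busy
        total_wait += start - t
        free_at = start + s         # non-preemptive: run the task to completion
    return total_wait / n_tasks

# Sweep throughput toward saturation (1/s = 5,000 req/s for s = 200 µs).
for lam in (1000, 2000, 3000, 4000, 4500):
    print(f"{lam} req/s -> mean wait {simulate(lam, s=0.0002) * 1e6:.0f} µs")
```

Sweeping the arrival rate toward 1/s like this should reproduce the knee in the latency curve.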
So intuitively, it's pretty clear that in this model, the busier the server is, the longer tasks will probably have to wait before they can be completed, kind of like what we saw in the picture: the busier the server was, the greater the latency was. And our question is, as a function of throughput, exactly how much longer?

So that's the question. Here are the assumptions we're making. First, we've forgotten completely about the Linux kernel, socket buffers, the Go runtime, the scheduler, all of that. We've dramatically reduced this mature, relatively complex system to just a single, more or less black-box model. Later, we'll see whether that's actually a fair simplification to make. We're also making these three assumptions. First, that tasks arrive independently and randomly, at some average rate that we'll call lambda, say lambda equals 3,000 requests per second. Second, that the server takes a constant time, s, the service time, say 200 microseconds, to actually process each task. Later, we'll revisit these assumptions and see what happens when we relax them. And third, like we said, the server can only do one thing at once, and it doesn't preempt: once it starts a task, it has to finish it.

So, step three: you all have stats PhDs, right? Just kidding. To build up our intuition for this system, let's draw ourselves a picture of how its state evolves over time. When a task arrives, the server goes from idle to busy; the outstanding work at the server goes from zero to s. Then the server works on the task, the outstanding work goes down, and the server is idle again. Another task arrives, and the same thing happens. Now, intuitively, it's pretty clear that when throughput is low, the server is mostly idle and tasks almost never have to queue. In general, they can be served immediately, and latency will be low. However, as throughput increases, the server is busy more of the time, and it's more likely that tasks will have to wait. In this diagram, let's distinguish the waiting time for a task, which is blue here, from the service time for a task, which is orange. Here, the task shows up, it has to wait for a little bit, ah, and then the server can work on it.

So, armed with this picture, can we talk about the metric that we're interested in, which is average wait time? Not just for a given arbitrary task, but in general, over a large set of tasks, how long will they have to wait? There are two ways that we can identify the average wait time in this picture. The first is the average width of these blue parallelograms, the average time between when a task arrives and when it's ready to be serviced. Makes sense? But the other way that we can identify average wait time is the average height of the graph. Remember, tasks arrive independently and uniformly at random. So, if we were a task and we showed up at some random point in time right here, the height of the graph at that point represents how long we'd have to wait. So, the average height represents the average wait time. If this is giving you bad high school geometry flashbacks, that's totally fair, but bear with me, because there will be some cool insight at the end of this little exercise. The idea is that we'll relate the average width and the average height using the area under the graph, and then solve to give us an expression, a formula for the wait time, that we can then use to make predictions.
So, over a long time interval t, the area under the graph is the width, the time span t, times the average height of the graph, which we already said was the average wait time; let's call that w. So, the area is t times w. On the other hand, we have all these tasks that are each represented by a triangle in this diagram, the orange triangle representing when the task is being serviced, plus a parallelogram, the blue parallelogram representing when the task is waiting. The triangle's area is s squared over two, and we can express the parallelogram's area as its height, which is s, the service time, times its width, which on average is w, the average wait time. A little subtle, but we'll see where this gets us. So, the area under the graph is the number of tasks times the area of each triangle plus the average parallelogram area, that is, s²/2 + s·w per task. Now, how many tasks are there? Well, over some time span t, the number of tasks is a function of the arrival rate lambda: it's λ·t.

So, putting it all together, we have this kind of gnarly expression that will make a lot more sense in a minute. We know that the area under the graph is λt · (s²/2 + s·w), and we also know that the area under the graph is t·w. Ha! This is great, because we can relate the two and then solve for w. You might have to trust me on the algebraic manipulation here, but if you equate the two expressions and simplify, what you get is this: w = λs² / (2(1 − λs)).

Let's bust out our graphing calculators and draw a picture of what this looks like. As the server becomes saturated, as the throughput λ grows toward 1/s, the wait time grows without bound. This looks a lot like the picture that we saw in our experiment, and so this is encouraging. Even without actually fitting this to the data that we gathered, we can identify three utilization regimes. When utilization is pretty low, there's no problem. If someone comes along and says, hey, I think we should run this thing at 60% utilization instead, that's probably a bad idea. Even this relatively crude model gives us some pretty good insights. For example, if we're choosing resource limits for our pods, there's a big difference between going from a little utilization to a little more utilization when your utilization is low, and going from some utilization to a little more utilization when your utilization is high. The latter is very, very bad.

So this is great, but so far we've just been theorizing, and we don't actually know that this model correctly describes the data that we saw in practice. It kind of looks like the same shape of graph, but, you know, is it good, or is it just some bullshit that we made up? Well, there's a way to assess this. Crudely, here's what we can do. We'll choose a subset of our real data, and the subset that we'll choose is the data that we're actually likely to get from production: data from the safe, low-utilization regime. Then we'll fit our model to those data points. I won't go into the tedious details of how you would do this; it's pretty simple with any sort of statistical software package, like R or NumPy. And finally, we'll compare the fitted curve to the data points that we've withheld from our training set, if you will. And in this case, the fit is actually strikingly good. So this is encouraging.
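Here's a sketch of what that fit could look like with NumPy and SciPy. The measurements below are made-up numbers standing in for low-utilization data points (they happen to encode a 200-microsecond service time), so treat this as the shape of the technique, not our actual results:

```python
import numpy as np
from scipy.optimize import curve_fit

def wait_time(lam, s):
    """Model prediction: W = lambda * s^2 / (2 * (1 - lambda * s))."""
    return lam * s**2 / (2 * (1 - lam * s))

# Hypothetical low-utilization measurements: throughput (req/s) vs. mean wait (s).
lam_obs = np.array([500.0, 1000.0, 1500.0, 2000.0, 2500.0])
w_obs = np.array([11e-6, 25e-6, 43e-6, 67e-6, 100e-6])

# Fit the single free parameter, the service time s, to the safe-regime data.
(s_fit,), _ = curve_fit(wait_time, lam_obs, w_obs, p0=[100e-6])
print(f"fitted service time: {s_fit * 1e6:.0f} µs")

# Extrapolate beyond the training data, toward the knee of the curve.
print(f"predicted wait at 4500 req/s: {wait_time(4500, s_fit) * 1e6:.0f} µs")
```

You'd then compare the extrapolated curve against the held-out high-utilization measurements to judge whether the model earns its keep.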
It means that we've made some radical assumptions but still come up with a parametric model that fits a real system. Awesome.

There are some bigger lessons that we can take away from this model. First, in this type of system, and really in any type of system, improving baseline service time helps a lot. If you have a service and you find some sort of performance optimization that just straight up makes it faster, that is the most impactful thing you can do. Let's see why that is. Here's the thought experiment: we find an optimization that cuts the average service time in half, and people love our service because it's super fast, so now our throughput doubles because we have more users. What happens to the wait time? Using our equation, we can see. With s halved and λ doubled, the numerator λs²/2 becomes (2λ)(s/2)²/2 = λs²/4; because of the factor of s² in the numerator, it's twice as small. On the other hand, the changes in the denominator cancel each other out: 1 − (2λ)(s/2) = 1 − λs, so the denominator stays the same. What does that mean? Even after we double throughput, the wait time is still improved. This is kind of awesome, right? If we find some optimization that lets us cut service time in half, we can double throughput and still be faster than we were before. Alternatively, we can more than double throughput, almost triple it in this case, and get the same performance that we had before. To me, this is really counterintuitive and really compelling. It says that improving baseline service time doesn't just improve user-facing latency; it also improves our capacity for many concurrent requests. So this is a powerful thing to remember. Investing in straight-up performance optimization is maybe the best use of your time.

The second lesson is that variability is bad. Remember, our tasks were arriving at random, but if they just showed up at perfectly spaced intervals, every 200 microseconds or something, tasks would never have to wait, even if the server were highly utilized. Unfortunately, that's not what happens in reality. So the slowdown that we see is because our arrivals are variable. Similarly, we can show that if job sizes are also variable, things get even worse. Here's a little empirical experiment. We choose two distributions of task sizes. One, the green distribution: all jobs are the same size, the model we started with. The other: the average is the same, but we choose from a long-tailed distribution, so job sizes are variable. You can run through a similar sort of geometric argument and see that when task sizes are variable, wait time is even worse than when task sizes are constant. So this, again, is a powerful lesson for us as system designers and operators. It means that it behooves us, first, to measure the variability in our tasks, using instrumentation techniques like histograms, heat maps, and distributed tracing for outlier analysis, and second, to minimize variability, using design strategies like batching. The variability of a batch of requests will always be less than the variability of the individual requests, so batching helps reduce variability. Similarly, aggressively preempting slow tasks or applying timeouts can help reduce the impact of outlier tasks on the tasks that queue up behind them. And finally, client back-pressure and concurrency control help limit variability in arrival rates in a non-trivial system.

So this is great, we've learned some stuff, but we don't actually have just one core or one server, we have lots and lots. So what can we say about the performance of our big old fleet of servers?
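You can also see the job-size effect empirically with a small variation on the earlier simulation sketch: hold the mean service time fixed at 200 microseconds, but draw each task's size from a more variable distribution (here exponential, standing in for the long-tailed case). All the parameters are again just illustrative:

```python
import random

def mean_wait(draw_service, lam=3000.0, n_tasks=200_000, seed=1):
    """Single-server queue with Poisson arrivals at rate lam; each task's
    service time comes from draw_service(rng). Returns mean wait (sec)."""
    rng = random.Random(seed)
    t = free_at = total_wait = 0.0
    for _ in range(n_tasks):
        t += rng.expovariate(lam)
        start = max(t, free_at)
        total_wait += start - t
        free_at = start + draw_service(rng)
    return total_wait / n_tasks

S = 0.0002  # 200 µs mean service time in both scenarios

constant = mean_wait(lambda rng: S)                       # every task identical
variable = mean_wait(lambda rng: rng.expovariate(1 / S))  # same mean, high variance
print(f"constant sizes: {constant * 1e6:.0f} µs mean wait")
print(f"variable sizes: {variable * 1e6:.0f} µs mean wait")
```

At the same throughput, the variable-size run should wait roughly twice as long on average, which matches the classic queuing-theory result for exponential versus constant service times.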
That's what I want to talk about in the second half of this talk: parallel systems. So if we know now that one server can handle T requests per second, subject to one of our latency SLAs, do we need N servers to handle N times T requests per second? Well, it depends on how we assign tasks to servers, right? There are lots of different ways we could do that. We could always assign tasks to the least busy server. We could assign tasks randomly, in round-robin fashion, or some other way. We could always assign tasks to the most busy server, but that's probably a bad idea. Intuitively, it depends.

So let's do a little simulation. This is another performance modeling technique that I find pretty compelling: if you don't know how to reason analytically about something, you can write a little simulation of it. I cooked up some code for this; you could do it too (there's a minimal sketch of it below). Here's the idea. We're going to simulate N equals, say, 16 servers and tasks arriving independently at random. When a new task shows up, we'll assign it randomly to a server, wait until the server finishes it, and then measure the cumulative latency distribution over this workload. Sometimes, because we're assigning randomly, a bunch of tasks all arrive at the same server, and then those tasks have to wait, and so there's a latency tail. On the other hand, if we take the same workload but assign tasks optimally, always choosing the least busy server, it's much less likely that a task will have to wait. We can almost always find an idle server, even when most of the servers are busy most of the time. And so here we have near-constant latency, even though the throughput is the same in both systems. Obviously, optimal assignment here is better than random assignment.

We can reinforce this with a bit of a probabilistic argument. If we have one server at some utilization ρ, say it's busy 60% of the time, and a new task shows up, then the probability that the new task will have to wait is the utilization ρ, 60%. But if we have n servers at the same utilization ρ and a new task shows up, the probability that the new task will have to wait is the probability that all n servers are busy simultaneously, which is much less than ρ. What this means is that, in theory, if we have many servers, we can run at higher utilization and still get the same queuing probability. If we have n times more traffic, we'll need fewer than n times more servers. This is awesome, right? It's like economies of scale, not just because of vendor economies of scale, but because of an actual physical improvement in the efficiency of the system. This is great. The more people use our service, the cheaper it is to run. Awesome.
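Here's the minimal version of that simulation I mentioned, a sketch assuming 16 servers, Poisson arrivals, constant 200-microsecond tasks, and roughly 75% per-server utilization; every constant is invented for illustration:

```python
import random

def simulate(assign, n_servers=16, lam=60000.0, s=0.0002, n_tasks=200_000):
    """Tasks arrive in one shared stream at rate lam; `assign` picks a
    server given each server's next-free time. Returns p50 and p99 wait."""
    rng = random.Random(2)
    free_at = [0.0] * n_servers
    t = 0.0
    waits = []
    for _ in range(n_tasks):
        t += rng.expovariate(lam)
        i = assign(free_at, rng)
        start = max(t, free_at[i])       # wait if the chosen server is busy
        waits.append(start - t)
        free_at[i] = start + s
    waits.sort()
    return waits[len(waits) // 2], waits[int(len(waits) * 0.99)]

random_pick = lambda free_at, rng: rng.randrange(len(free_at))
least_busy = lambda free_at, rng: free_at.index(min(free_at))  # "optimal"

print("random:     p50/p99 wait:", simulate(random_pick))
print("least busy: p50/p99 wait:", simulate(least_busy))
```

Under random assignment you should see a pronounced p99 tail; under least-busy assignment, the waits stay near zero at the same throughput.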
There's just one problem with this argument. We're assuming that we can optimally assign tasks to servers, but optimally assigning tasks to a pool of servers is really a coordination problem in disguise. We all have to agree on which server gets which task. And in real life, coordination is expensive. There's no way around it. We can't just magically pick the least busy server with our psychic powers; in general, we need some sort of physical coordination mechanism. For this problem of stateless request assignment, that would be a load balancer, a proxy or something. For similar problems, it could be a cluster scheduler, for example. Either way, when a task shows up, something has to decide where to assign it. First, it has to interrogate all of the backends to figure out which is the best one. Then it has to actually assign the task to the chosen backend. This is not a free process.

If it takes us some constant time, alpha, to assign a task, say just the time it takes to forward it from the load balancer to a backend, and we have n tasks to process, then the time it takes us overall to process n tasks in parallel is the assignment time, α·n, because we have to do this assignment more or less serially, plus the service time, s. The throughput is the number of tasks, n, divided by that total time: n / (αn + s). As a graph, this looks like this: as we try to scale, our throughput scales sublinearly. We spend more and more of our time just assigning tasks to servers, and less and less of our time actually working on the tasks. So this is rough, and that's assuming we can assign tasks in constant time. If we have to do something that depends on the number of servers, for example, go talk to each one of them and figure out which one is best, then it's even grimmer. If the assignment cost grows with n, then eventually we spend all of our time just figuring out where to assign tasks, and none of our time actually working on them.

This is one example of what's called the universal scalability law in action. The universal scalability law says that the scalability of more or less any parallel computing process looks like this graph. You find a parameter alpha that describes the constant assignment cost and a parameter beta that describes the coordination cost, and you can model the scalability as throughput(n) ∝ n / (1 + α(n − 1) + βn(n − 1)); there's a small numeric sketch of this law below. Unless you can find ways to minimize the coordination cost, as you try to scale your system, it will actually perform worse and worse. So this is pretty rough, right? It's completely contrary to the economies of scale we saw a moment ago. What we've learned from this is that making scale-invariant design decisions is really hard. At low parallelism, it's better for us to coordinate, because it makes latency more predictable. But at high parallelism, coordination degrades throughput, and we're better off just assigning things randomly; otherwise we'll spend all of our time making choices and none of our time actually working on tasks.

So can we be clever? Can we find strategies that trade off these two fundamental characteristics of system performance and do something that performs pretty well across the whole spectrum of scale? Well, yes, we can. Here are two pretty neat ideas for how to do that. The first: do it approximately. We know that finding the very best out of N servers is an expensive proposition, but choosing one randomly is kind of bad. So here's the idea: we'll make a trade-off. We'll pick two at random, we'll compare those two, and then we'll always choose the better one of those two (sketched in code below). What does that give us? A lot. It has constant overhead, right? Even if we have many, many servers, we only have to talk to two. And you can show that it improves the instantaneous maximum load on any given server from O(log N) to O(log log N), which is basically O(1), right? This is not just some theoretical thing that someone made up. This technique is used in the Sparrow research scheduler and is also implemented in HashiCorp's Nomad to do distributed, stateless scheduling with low latency for lots of short tasks, like cron jobs and stuff; it uses this two-random-choices assignment plus some optimizations. And I really like it because it's a relatively straightforward, very effective strategy, something that anyone here could cook up, unlike the Kubernetes scheduler, which is kind of complicated and scary, and it's way, way better than a straw-man naive random assignment strategy. So that's cool.
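Here's the two-random-choices sketch in its simplest balls-into-bins form, n tasks thrown at n servers, comparing the instantaneous max load under purely random assignment versus best-of-two; the pool size is arbitrary:

```python
import random

def max_load(n, two_choices, seed=3):
    """Throw n tasks at n servers; return the busiest server's load."""
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(n):
        i = rng.randrange(n)
        if two_choices:
            j = rng.randrange(n)
            i = i if loads[i] <= loads[j] else j   # keep the less-loaded pick
        loads[i] += 1
    return max(loads)

n = 100_000
print("random:      max load", max_load(n, two_choices=False))  # ~O(log n)
print("two choices: max load", max_load(n, two_choices=True))   # ~O(log log n)
```

The second choice costs one extra lookup per task, yet it should visibly shrink the maximum load, which is exactly the constant-overhead trade-off described above.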
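And since both of these ideas are really about taming the coordination term, here's the promised numeric sketch of the universal scalability law itself, with made-up parameters; notice how throughput rises, peaks, and then falls as n grows:

```python
def usl_throughput(n, alpha, beta):
    """Universal scalability law: relative throughput at parallelism n,
    with contention cost alpha and pairwise coordination cost beta."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Hypothetical parameters: 2% contention, 0.05% pairwise coordination.
for n in (1, 4, 16, 64, 256):
    print(n, round(usl_throughput(n, alpha=0.02, beta=0.0005), 1))
```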
The second idea for beating this quadratic beta penalty I call iterative partitioning, and it illustrates how the universal scalability law applies not just to this problem of task assignment, but to any parallel process. Here we're going to talk not about stateless services but about stateful services. This idea, as far as I know, was invented at Facebook to power an analytics system called Scuba, and it's reincarnated in similar fashion at Honeycomb, where I work. The idea is that you have a lot of analytical data, instrumentation data from systems, and you want to answer questions about it pretty quickly: which requests are slowest? How is my site performance changing over time? And the way you do that is by distributing your query over lots of storage nodes.

So we have a bunch of data and we have a bunch of storage nodes. Each node is responsible for some slice of the data, and when we want to answer a query, say, which of our requests was slowest over the past week or something, all of the nodes participate in figuring out the answer to that query in parallel. First, leaf nodes read data from disk and compute their partial results: here's my slowest request, here's my slowest request, here's my slowest request. Then the aggregator node merges those partial results together and returns the result to the client. It's sort of like a MapReduce thing. And so we can choose the level of fan-out here. Again, we can distribute the data across lots and lots of servers, and our question is: what level of fan-out is optimal?

The reason that's a worthwhile question to ask is because there are two parts to the time it actually takes to serve a query with this distributed query system. First there's the scan time, the time it takes for each leaf node to read its data off of disk and figure out its partial result, and then there's the aggregation time, the time it takes to merge all those partial results together. The scan time is proportional to the total amount of data divided by the number of leaf nodes that we have, so the more leaf nodes we have, the better the scan time: the less work each leaf node has to do. But the aggregation time is proportional to the number of partial results, so the more leaf nodes we have, the worse the aggregation time gets. Putting those two together, we get the same scalability graph that we saw before. At first, as we add more nodes, our performance improves, and then as we add more nodes, our performance starts to degrade. So this is pretty rough, right? It means that we can't scale beyond 40 or 50 nodes without actually regressing performance, which is a tough proposition if you're trying to be, I don't know, web scale or something.

So here's the idea. Instead of trying to do all the aggregation at once, we'll do it iteratively, by fanning out queries across multiple levels. We know that throughput gets worse for a very large fan-out, so we'll make the fan-out a constant, F, and we'll add intermediate aggregators between the root and the leaf nodes. Now, you can work through the math and see that the aggregation time is no longer proportional to the number of leaf nodes; it's proportional to the height of our aggregation tree, which is proportional to the log of the total number of servers participating in the system.
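Here's a back-of-the-envelope sketch of that comparison; the dataset size, scan rate, per-result merge cost, and fan-out are all invented numbers, but the shapes are the point: flat aggregation bottoms out and then regresses, while the tree keeps scaling:

```python
import math

DATA = 1e9        # bytes to scan, hypothetical
SCAN_RATE = 1e8   # bytes/sec per leaf node, hypothetical
MERGE = 0.005     # seconds to merge one partial result, hypothetical

def flat_query_time(n):
    """Single-level fan-out: n leaves scan in parallel, one root merges n results."""
    return (DATA / n) / SCAN_RATE + MERGE * n

def tree_query_time(n, fanout=8):
    """Iterative partitioning: every aggregator merges only `fanout` results,
    so merge work grows with the tree height log_F(n) instead of with n."""
    height = max(1, math.ceil(math.log(n, fanout)))
    return (DATA / n) / SCAN_RATE + MERGE * fanout * height

for n in (8, 64, 512, 4096):
    print(f"{n:5d} leaves: flat {flat_query_time(n):6.2f}s  tree {tree_query_time(n):5.2f}s")
```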
This might seem a little abstract, but what does it mean? It means that as we add more servers, we continue to scale. By doing this computation iteratively and aggregating iteratively, we're able to amortize the aggregation cost over a bunch of servers and continue to scale, even to very, very large datasets.

So what have we learned? Again, making scale-invariant design decisions is really hard, because performance characteristics at low scale and at high scale are very different. But with smart compromises, we can produce pretty darn good results. With randomized choice, we can approximate optimal assignment very cheaply. With iterative partitioning, we can amortize coordination costs across more of our servers. And the universal scalability law in both cases helped us quantify the effect of these choices and figure out how to design better systems before actually building them.

In conclusion: queuing theory, not so bad. Model building isn't some magic thing done by wizards with PhDs. All you have to do is state your goals, state the assumptions you're making about the system, and don't be afraid. You can bust out a textbook and do a bunch of math, but you can also draw a picture to reason about the system, or implement a simulation to help discover what happens as load changes, without actually exercising your system at massive scale. When reasoning about latency versus throughput, it's essential to measure and to minimize variability, and to beware of unbounded queues, which produce unbounded latency. And the best way to have more capacity is to do less work in the first place. When modeling scalability, it's essential to remember that coordination is expensive, but you can express and predict its costs with the universal scalability law. And if you have coordination work that you really do need to do, we've seen some tricks for containing its cost: randomized approximation and iterative partitioning. That's all I have. Thank you very much.