Hi everyone, my name is Will, and today I want to talk to you about simulating our Mesos framework, Cook. My plan for this talk is to first tell you what simulation testing is in general and give you a sense of why you might want to do it, then tell you a bit about Cook, our open source batch scheduler that's really designed for the case when you have a lot more resource requests than you have capacity to fulfill them. I'll tell you why we built Cook in the first place, some of the more recent challenges that caused us to consider building a simulator, then a little about how we built the simulator and the trade-offs we made, and finally go over one of the cases in which we used the simulator to improve the system.

To start, it helps to have an example system in mind when thinking about simulating a system. This is a generic system that you might have at your own company; it doesn't actually matter what it's really doing. We have some clients sending requests to a service. The service puts some data on a queue. Workers pull off the queue, do some processing, and put results back on the queue. The service either writes to a database, sends data back to the clients, or maybe puts it back on the queue. A generic service. The point is that even this is pretty complex. There's no way we're going to understand, just by looking at it or even by inspecting some metrics, what it's doing holistically, or what any change we make is going to do. We might have a sense; we'll have mental models for the system. We kind of have a sense that if we add more workers, they'll probably scale out linearly, hopefully. But we don't really know.

If we wanted to start answering these questions, one way would be to just make the change in production and see what happens. That might be fine, but oftentimes it's not really what you want to do. You can't even ask the opposite question: what would have happened if I hadn't made this change? You can't ask the counterfactual. So we have this case where we want to test a hypothesis; that's really what these questions boil down to. If I add more workers, my hypothesis is that I'll be able to scale out linearly. But we don't have a way to experiment with our system. We can put the change in production, but our production system isn't really a controlled environment. What I mean by experiment is a scientific process to test a hypothesis. We can have a hypothesis and we can test it in production, but it's not really scientific. There are lots of things changing besides whatever change we made to test our hypothesis. The workloads are going to change as we run our experiment. We might have an automated process, or somebody logging into those servers and updating things. And what if that process fails, or that person goes to lunch? Now your system's in this weird half state. This is not the controlled environment you want to be running experiments in. If we were physical scientists looking at a physical system, that would definitely get thrown out in an academic review process. So can we get closer? Can we get closer to running real experiments in a scientific manner?
And so this really boils down to: can we control the environment that our system runs in? Half of the answer is, well, yeah, we could just stand up another copy of our production system and not tell anybody about it. If no one's touching it, then it's pretty much controlled; nothing's really happening beyond the system's own processes. But this isn't a very interesting system, because nothing is happening. And if we just send it production traffic, we're back to what we had in production; we've now lost any control we had. So what we can do is stand up synthetic clients. Now we control the entire environment, and that means we can change any one piece and run a real experiment. I should say that this still isn't fully controlled: servers can still fail, there can still be network partitions. But it's way better than what we have in production.

So, say we change this lead service. We can start looking at what happens when we upgrade it; hypotheses like, does it remain correct? Or what if we changed our workload? What if we get twice as many requests as we normally do, or we switch from a read workload to a write workload? Or what if something fails? Now we can start to ask these questions, run these experiments, and really do this in a scientific manner. This is the heart of simulating a system: putting your system into a place where it's fully controlled. That gives you a way to experiment on your system in a principled way.

This is probably a good time to point out that when you do this, your experiments might not always match what happens in production. You might run your simulation and get some great results that say if you scale out workers, the system scales linearly. You're really happy, you do it in production, and it doesn't scale linearly. That's because no matter how close you get to a clean, controlled environment that still looks like production, it will never be production. But I think that's still okay. Even if a model is not perfect, it can still be extremely useful.

So we can now look at our controlled environment, our simulation, where we have our system, we have these requests flowing through, and then we change something. That's our way of experimenting. But so far we've talked about our system as just a copy of production, and you actually have a choice here. It's really a spectrum from high fidelity and high cost to low fidelity but also lower cost. On the high-fidelity side, you just take a copy of your production system. The results you get from these experiments should match what happens in production well because you're looking at effectively your production system. But this is going to be expensive, both in terms of dollars and in terms of time. The dollar side probably makes sense: you need boxes to run your entire system a second time, plus these fake clients, and if you have a very large system that's taking tons of requests, you're going to need a lot of boxes to simulate it. The time part is a little more subtle, though. Since you're doing actual network requests, actual disk writes, actual database reads, you can only run this in real time.
So, if you want to simulate over seven days of data, it's going to take seven days to run your experiment. That means you either have to run very few experiments or very short experiments, and neither is great. On the one hand, you don't get many results. On the other, your results carry this additional caveat: here are the results, but they only account for, say, 30 minutes of data, so they might not generalize to a longer run in production.

On the other side of the spectrum is building a model of your system, either a mathematical model or code that approximates how your system works. This can be much cheaper. You can often run it on a single box in a single process, and often much faster than your production system. You're not making any network requests, disk writes, or database reads, so you can run it way faster, which means you can run either many more experiments or much longer experiments. In that regard, you can feel more confident about your results. The problem is that you're now working with approximations of your system, so you might miss entire classes of behavior of your actual system, and once you have your results and try to apply them to production, they might not actually apply. You now have the caveat that you were working with a model of your system, so your results might not generalize. But again, an imperfect model can still be useful. And there's really a full spectrum here: you can model small pieces of your system to land somewhere between these two extremes.

The choice we made with Cook was to take the part of our code that handles scheduling, the part that's the actual Mesos framework, and use that code but mock everything else. So we run all of our databases in memory, and we built a mock of Mesos, so that we can run our system much faster on a single box. We'll see that in a little bit.

Given this setup, we can start to run experiments around certain hypotheses. One of the most basic ones is basically the foundation of testing: if I upgrade from version one to version two, my hypothesis is that my system remains correct. And we can just test this. We can either generate traffic that looks like production traffic, or we can take a trace of production traffic: look at what happened last week and apply it to the upgraded system. Now, if your system remains correct, either because you have properties you assert or because you check that it behaves the same way it did last week, you can be confident, or at least more confident, that your system is going to be correct when you deploy this upgrade to production. You might have another hypothesis that if request rates double, your latency stays about the same: that your system scales well, or that a doubling of request rate isn't going to hit a knee in performance where you jump to a new latency for all requests. Or you might look at a hypothesis around server failures: if a server fails, your hypothesis is that your system doesn't fall over.
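To make that trace-replay idea a little more concrete, here is a minimal Clojure sketch of replaying a recorded trace against an upgraded system and checking one correctness property. The "system" here is just a map of hook functions so the sketch is self-contained; names like submit-job! and completed-jobs are hypothetical placeholders, not Cook's actual API.

```clojure
(ns sim.replay-sketch)

;; Minimal sketch: feed a recorded production trace to the system under test,
;; wait for it to drain, and check a simple invariant afterwards.  The system
;; is represented as a map of hook functions (hypothetical placeholders).
(defn replay-trace
  "Submit every job in `trace` to the system, wait for it to drain, and
   return the set of completed job uuids."
  [{:keys [submit-job! wait-until-drained completed-jobs] :as _system} trace]
  (doseq [job trace]
    (submit-job! job))
  (wait-until-drained)
  (set (map :job/uuid (completed-jobs))))

(defn correct-after-upgrade?
  "Hypothesis: after the upgrade, every job in the trace still completes."
  [system trace]
  (= (set (map :job/uuid trace))
     (replay-trace system trace)))
```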
When we look at real-world examples of people actually doing this, faults are a pretty common thing to look at. Probably the most famous example of using simulation testing to test a system, specifically against faults, is Kyle Kingsbury's work, you may know him as Aphyr, with Jepsen. The idea of Jepsen is that it stands up a distributed system, sends it requests, and then injects faults, either network partitions or server failures, and checks that the properties the system advertises actually hold in the face of those faults. If you haven't seen his work before, I recommend you go look at those posts. They're amazing, they're a lot of fun to read, and also really terrifying when you realize that lots of systems we depend on every day have some serious problems. I'm pretty thankful that Kyle is going through, finding these problems, and helping us get more robust systems.

On the academic side, there's the lineage-driven fault injection paper, where they take a model of your system along with a correct request flow, and a tool looks at that request flow and tries to find places where injecting faults can make the flow incorrect. Once the tool runs, you get back one of two things. Either you get a case where your system doesn't have the properties it says it does, along with a nice Lamport diagram showing where things went wrong, or you get a gold star that says, for this particular request flow, your system is robust to some level of faults. If you kill all the servers, your system is probably not robust to that, but up to some level of faults, your system will be robust. That's a really powerful guarantee, and it's the kind of thing I hope we get to see more of.

A different type of example, one that's not testing for faults but instead testing to see if we can optimize a system, is the BOAT paper. What they did was look at whether changing configuration settings on Cassandra could reduce latency, and sure enough, they were able to find settings that reduced latency by about 3x, by changing just configuration; no code was changed. The way they did it was to stand up a Cassandra cluster, run through some workloads, change the configuration, do the same thing again, and so find these improved settings. That's really powerful.

So, we've looked at simulation testing in general, and now I'm going to tell you a little bit about Cook. As I mentioned before, Cook is an open source distributed job scheduler, and it's really built for the case when you have a lot more requests than you have capacity to run. The reason we built Cook is that a few years ago, this is kind of what it looked like at Two Sigma, and it still looks like this: we have a bunch of users that want lots of compute, and often, during the work day at least, their requests are for way more compute than we really want to keep on hand. So we need to share it fairly. Cook has two mechanisms to do this. The first is that we order jobs based on DRF, the same algorithm Mesos uses to decide which framework to give resources to next, and then we schedule so that users with low share get scheduled sooner. This works well when we have an empty cluster and a bunch of users show up and we're able to provide them all compute.
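As a rough sketch of what that DRF-style ordering looks like, here is a small Clojure example that ranks users by their dominant share, the largest fraction they hold of any single resource. The resource names and map shapes are illustrative only, not Cook's actual schema.

```clojure
(ns sim.drf-sketch)

;; Dominant-resource-fairness (DRF) ordering, sketched: a user's dominant
;; share is their largest fraction of any single resource, and users with the
;; lowest dominant share get scheduled first.
(defn dominant-share
  [cluster-capacity allocation]
  (apply max
         (for [[resource total] cluster-capacity]
           (/ (get allocation resource 0) total))))

(defn rank-users
  "Order users so that those with the lowest dominant share come first."
  [cluster-capacity allocations]
  (sort-by (fn [[_user alloc]] (dominant-share cluster-capacity alloc))
           allocations))

(comment
  (rank-users {:cpus 100 :mem 1000}
              {"alice" {:cpus 40 :mem 100}    ; dominant share 0.4 (cpus)
               "bob"   {:cpus 10 :mem 300}})  ; dominant share 0.3 (mem)
  ;; => bob is ranked ahead of alice, so bob's jobs are considered first
  )
```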
But what can happen is, at two o'clock in the morning, there are only a few users that want compute, and we're happy to give all of them large shares of the cluster. Then as the work day comes around, more users want compute, and we have this problem where the jobs of the users that are already running are going to take a long time, so the users that are waiting will have to wait a long time before they can get their fair share. So we have another component that rebalances the cluster. It preempts jobs from users that have a lot of resources and gives those resources to users that have only a little. Between these two mechanisms, we were able to provide an environment where users could request resources, run their jobs, and be confident that they'll complete in a reasonable amount of time.

But recently we've started leveraging the public cloud, and with it have come a lot of capabilities, and also challenges. The first challenge is around heterogeneous clusters. In the public cloud, you can purchase machines that are memory optimized or compute optimized, and then the choice of which job to run on which machine actually matters. A job might have better performance on a memory-optimized machine, but that machine might cost more. Being able to trade off cost for performance would be, I think, really powerful. But at the moment, our scheduling algorithm will just schedule the job on any box where it fits. Another challenge has been around scaling in the public cloud. That's one of the most powerful aspects of the cloud: when you need to burst, you can, and when you no longer need the compute, you just scale down. But that raises the question: when should you scale? Right now we have some heuristics to do the scaling, but what we really want is something that's more integrated with our scheduling algorithms.

So earlier this year, we decided we wanted to revisit some of the algorithms we use to schedule. But we had the same problem we saw earlier. We want to run the experiment of: if we change our scheduling algorithm, are we able to take advantage of these things and either improve performance or reduce cost? But if we do it in production, we have all these confounding variables: prices change, workloads change. We can't really be confident that our changes actually have the effect we want them to. So we set out to build the simulator.

Now I'm going to tell you a little bit about how the simulator is built and how it's designed. To do this, it helps to first look at how Cook is architected, both from an external view and internally, and then revisit what Mesos looks like. This is how we run Cook in production: we have three Cook servers for high availability, ZooKeeper for leader election, Datomic as our data store, and then Mesos, of course, for resources. Internally there are a bunch of components, but the three biggest are these. One component takes the jobs that are waiting and ranks them based on DRF, then sends this ranking to a component that handles scheduling. The scheduler also takes in new offers, and once we have matches, we call launch tasks on the scheduler driver, the Mesos driver, to actually send those requests to Mesos. A third component handles rebalancing of the cluster; this is the component that preempts from users with lots of share and gives to users with little share.
The rebalancer also takes in both a view of the entire cluster and the ranking of all the jobs we want to run, and once it finds tasks that we want to preempt in favor of others, it calls kill tasks on Mesos to free up those resources. On the Mesos side, you've probably seen this diagram before: a few Mesos masters for high availability, ZooKeeper for some state and to handle leader election, the Mesos agents for resources, and then, on the framework side, a client library, the Mesos driver, that handles the communication between the framework and Mesos. I mentioned earlier that what we chose to do was take Cook and mock all of its dependencies. What this means is that we end up mocking both the Mesos master and the Mesos driver, and bringing our databases in memory. Thankfully, both of them have an in-memory testing version.

When you build a simulator, there are some high-level properties you need to think about that are going to affect your entire design, as well as how you even interact with the simulator. The two big ones are whether the simulator is deterministic or non-deterministic, and whether it runs in real time or faster than real time. Looking at determinism first: having the simulator be deterministic has some really nice properties. It means you can be more confident in the simulation itself. If you run the same simulation twice with no changes, the results should be the same, and if they're not, you know there's a problem, some bug in the simulation itself. It also means that if you do make a change between two simulations, the difference in the results you see is strictly from the change you made, and not from noise due to non-determinism. The problem with making your simulator deterministic is that your distributed system is definitely not deterministic. By forcing your simulator, and through it your model of the actual system, to be deterministic, you're missing the class of behavior that stems from the non-determinism. So again, you have to add the caveat to your experiments that they assume there is no non-determinism, which may or may not drastically affect how you interpret the results.

The other choice you have to make is between the simulation being real time and faster than real time. Earlier we said that if you just make a copy of your production system, you're forced to run in real time. So this choice is really about what you decide your simulated system means, and if you make it faster than real time, you're able to run more experiments. What we chose to do is make our simulator deterministic and faster than real time.

At the highest level, this is what our simulator looks like: we have Cook and our mock of Mesos talking to each other, and we have the simulation driver that instantiates both, wires them up, and then drives the simulation forward. We'll look at each component individually, starting with the mock of Mesos. To implement this, we needed to mock both the driver and the master. On the driver side, we needed to provide implementations for launch tasks, kill tasks, and decline offer. If you've built a framework before, I'm sure you've seen this sort of stuff.
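To give a feel for what mocking the driver can look like, here is a minimal Clojure sketch in which the framework-facing calls just mutate an in-memory "master" held in an atom instead of going over the network. The protocol and state shape are illustrative assumptions, not the interfaces the Cook simulator actually uses.

```clojure
(ns sim.mock-driver-sketch)

;; A sketch of a mocked Mesos driver: launch, kill, and decline all update
;; in-memory mock-master state rather than talking to a real cluster.
(defprotocol SchedulerDriver
  (launch-tasks!  [this offer-ids tasks])
  (kill-task!     [this task-id])
  (decline-offer! [this offer-id]))

(defrecord MockDriver [master]             ; `master` is an atom of mock state
  SchedulerDriver
  (launch-tasks! [_ offer-ids tasks]
    (swap! master
           (fn [m]
             (-> m
                 ;; consume the offers and mark the tasks as running
                 (update :offers #(apply dissoc % offer-ids))
                 (update :running into (map (juxt :task-id identity) tasks))))))
  (kill-task! [_ task-id]
    (swap! master update :running dissoc task-id))
  (decline-offer! [_ offer-id]
    (swap! master update :offers dissoc offer-id)))

(defn mock-driver []
  (->MockDriver (atom {:offers {} :running {}})))
```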
The mock of Mesos then needs to handle these calls, as well as call into Cook to provide resource offers, let it know when it's been registered, and provide status information when tasks move to running and to completed. This was a lot of fun to implement, but we don't really have time to go into it; ask me afterwards, I'm happy to talk about it.

On Cook's side, we made a really strong decision to try to change as little as possible, for two reasons. One is that the less you change between simulation and production, the more confident you can be in your results. The other is more of a software perspective: the more changes you need to make to accommodate the simulator, the harder it's going to be to keep your simulator and your code in line. By making as few changes as possible, we reduce how much effort is needed to maintain the simulator. So the only thing we really did was provide a way to trigger each of these components, instead of having them all be event-based or time-based. This allows our simulation driver, once it's instantiated and wired everything up, to trigger each component individually, and that is what lets the simulation be deterministic. On each cycle, it submits new jobs, triggers our mock of Mesos to send new offers and send status updates, triggers Cook to rank, triggers Cook to schedule, and triggers Cook to rebalance. Between each of these stages, it increments simulation time.

Handling time when you want your simulation to be deterministic is actually tricky, and you need to think about it; time is one of those things that will make your system non-deterministic. One way to handle it is, between each of these events, to keep track of the wall-clock time that maps to what the simulation time was, and then, in addition, keep track of the inputs and outputs: at this simulation time, this component got these inputs and produced these outputs. When you go back to look at the results of your simulation, you can see how everything flowed. The problem with doing this is that you end up keeping track of a lot of data, both which wall-clock times map to which simulation times and all the data that's flowing, and you also need ways of getting that data out. Right now, because of the way we do the triggering, all the data flow is still handled by the normal code we use in production. If we wanted to get that data out, we'd have to add deeper hooks into all of our components, which, like I said, we don't want to do.

So we did something I'm not entirely proud of, but it works, which is to just stop time. We're running on the JVM, we use Joda-Time, and deep in the bowels of Joda-Time, in a place I hope nobody knows exists, is a function that lets you set the current time in milliseconds. Any call to get the time now gives back the same instant until you reset it. We use this so that all of our components, when they ask for the time, get the same time: it's now just simulation time. Any timestamps written to the database also carry simulation time. And what this means is that we didn't have to make any changes to get any of this data out, and we didn't have to keep track of what simulation time was; all we had to do was advance that fixed current time throughout the running of the simulation.
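Here is a minimal Clojure sketch of what one deterministic, faster-than-real-time simulation cycle might look like, with that Joda-Time trick included: DateTimeUtils/setCurrentMillisFixed pins "now" to the simulation clock before each stage. The per-component trigger functions are hypothetical stand-ins for the hooks described above; only the Joda-Time calls are the real library API.

```clojure
(ns sim.driver-sketch
  (:import (org.joda.time DateTimeUtils)))

(defn run-simulation!
  "Drive the simulation forward for `cycles` cycles, pinning Joda-Time's
   notion of 'now' to the simulated clock before each stage so every
   component (and every timestamp it writes) sees simulation time."
  [{:keys [submit-jobs! send-offers! send-status-updates!
           rank! schedule! rebalance!]}        ; hypothetical per-component triggers
   {:keys [start-ms step-ms cycles]}]
  (let [sim-time (atom start-ms)
        at!      (fn [stage!]
                   ;; advance the simulated clock, freeze "now" there, run the stage
                   (DateTimeUtils/setCurrentMillisFixed (swap! sim-time + step-ms))
                   (stage!))]
    (try
      (dotimes [_ cycles]
        (at! submit-jobs!)
        (at! send-offers!)
        (at! send-status-updates!)
        (at! rank!)
        (at! schedule!)
        (at! rebalance!))
      (finally
        ;; hand the clock back to the system when the simulation is done
        (DateTimeUtils/setCurrentMillisSystem)))))
```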
Like I said, this is a little sketchy, but it means we didn't have to make those deep changes we didn't want to make. It also means that if we look at the actual databases of two simulations with no changes between them, they will be identical as well, which is a nice property when you want to look at the individual events that occurred in the system. There are a lot more details about the simulator, but this gives you a feel for how it works.

Now I want to tell you about how we used it. What we did was similar to the BOAT paper: we looked at whether we could change configuration settings to improve how we do rebalancing. What I mean by improve in this case is: could we reduce the number of preemptions, and reduce the waste that stems from those preemptions, while still keeping fairness about the same? As I've mentioned, the idea of the preemption is that if some users have a large share of the cluster while other users have a very small share and aren't getting more resources at a quick enough pace, we rebalance the cluster, preempting from users that have a lot and giving to users that have a little. What we're changing here is just configuration settings. The two knobs at our disposal were the number of tasks to preempt in any cycle, and how selective to be when we choose to make preemptions, that is, how unfair the allocation needs to be before we're willing to preempt tasks.

We set up our experiment to look at data between May 1st and May 6th, and what we chose to do was reduce the number of preemptions and increase the selectivity. We had a feeling this would work well because previously I had run a simulation that wasn't supposed to touch any of these preemption settings, but I had misconfigured it, so it didn't use the settings the production system was using. When I looked at the results, I thought, wow, this is really weird: they showed that this combination of reducing preemptions and increasing selectivity would let us reduce the number of preemptions and reduce waste, whereas doing just one of them would cause either waste or the number of preemptions to increase. So this non-intuitive result popped out from, well, a mistake. But lots of good things come from misconfigurations and mistakes.

So we had this experiment, and the result we were looking for in simulation was that fairness and total resources used over time stayed about flat, or decreased only ever so slightly, while waste decreased significantly. That's exactly what we ended up seeing: total resource use stayed about the same, fairness dropped by only a little bit, we'll look at that in a moment, and waste dropped by about 15%. We were really happy with this. Looking at the simulation results for fairness, we looked at what we call starvation over time. What I mean by starvation is this: look at how much of the cluster a user would get if we just evenly partitioned it per user, then look at how much they actually got allocated by Cook; anything under that even allocation, we consider them being starved.
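As a small illustration of that starvation metric, here is a Clojure sketch that treats an even per-user split of the cluster as the fair share and sums up everything users received below it. It is single-resource and illustrative only, not the exact metric we compute.

```clojure
(ns sim.starvation-sketch)

;; Starvation, sketched: everything a user was allocated below an even
;; per-user partition of the cluster counts as starvation.
(defn starvation
  "Total resources users received below an even per-user split of the cluster."
  [cluster-cpus allocations]                 ; allocations: {user cpus-allocated}
  (let [fair-share (/ cluster-cpus (count allocations))]
    (reduce + (for [[_user cpus] allocations]
                (max 0 (- fair-share cpus))))))

(comment
  ;; 3 users on a 90-cpu cluster: the even split is 30 cpus each.
  (starvation 90 {"alice" 50 "bob" 30 "carol" 10})  ;=> 20 (carol is 20 under)
  )
```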
When we look at this graph, up is bad, and what we see is that comparing our old production settings with the new settings we moved to, the new settings were always higher, but they tracked very closely. So we were happy to trade that slight increase in unfairness for the large drop in waste. Once we had finished our analysis, we were happy to go to production. Looking at waste over time, there's a line showing when the change went in, and after the change went in, waste drops and then stays down. So we were happy with these results. We were able to form a hypothesis, test it in our simulation, find results we were happy with, apply them to production, and see that they matched pretty well with what we saw in our simulations. This was really powerful. There's no way we would have been able to just guess around and poke at these settings in production, right? We'd have had to wait a few days to see what the effect of each setting was and then go back and reset them. This would have been really painful to find without something like a simulator.

So, we've looked at what it means to simulate a system in general, we've looked at what Cook is and what it meant to simulate Cook, and we've seen a case in which the simulator helped us improve the system. I should say that aside from this case, we also use it to improve testing of our system. While building and starting to use the simulator, we found two really nasty bugs in Cook that had been there for years. And once we started applying it in our continuous integration, we caught a few bugs that we didn't catch with any of our other testing and that would have made it further into our QA and release process before we hit them, probably in production.

Before I wrap up, I want to leave you with just one idea, which is that simulation testing helps us understand our systems more deeply. Being able to experiment with our system, and to do it in a principled and scientific way, helps us understand not only whether our system is robust and how to improve it, but also lets us poke at the complex interactions between components. Any tool that can help us understand these complex distributed systems more is, I think, really useful. And with that, I'm happy to answer any questions you might have.

Q: You said your mock of Mesos; how general purpose is it? Are you planning to open source it?

A: Yeah, so it's pretty general purpose. Well, it's on the JVM. Cook is written in Clojure, so it's in Clojure, so it would be a little bit of work to make it usable for general JVM users. If you're in Clojure, yes, it is open source right now; you can take a look at it in the Cook GitHub project. If there's interest, we can talk about it after.

Q: How did you know when to stop? You have this ability to quickly change configuration, and it seems like you could spend forever trying to find an optimal configuration. What criteria did you use to say, this is good enough, I'm now going to go to production with the change?

A: Yeah, that's a really good question. In this particular case, this was the first time we used the simulator to change something and then apply it to production. So I actually tried only a few settings. As I said, I already had a sense of what direction the changes should go in order to get the results we wanted.
So I tried those configuration settings and two others, to make sure that my understanding of how changing those values affects the system held up in general, but I was really only looking at those. My concern was that if I tried a lot of different settings, I might overfit to something that wasn't going to match production well. So I was still using the mental model I had, checked that it looked right in simulation, and then went right to production, because I didn't want to overfit to something I still wasn't fully confident in. I think as you build more confidence in your simulation environment, you can start to run more simulations and really hone in on the right settings. Like the BOAT paper: they're standing up an actual Cassandra cluster and sending it actual requests, so they can be more confident in the simulator, and so they can hone in more on the right settings. So it really depends on how much you can trust your simulator, I guess, is the answer to your question.

Thanks. All right, thank you.