Good afternoon, everyone. Thanks for coming to this session. My name is Ramki Ramakrishna, and this is Joshua Cohen. We're with Platform Engineering at Twitter, and today we're going to talk about how you might apply a neat machine learning technique called Bayesian optimization to the problem of tuning performance in an automated manner in the data center.

So let's see. This slide is a bit light over here, but it represents the service graph of Twitter. Twitter runs on microservices, and along the edge of the circle is the set of microservices that together make up the Twitter experience. The edges over there are the edges of the service graph. For example, in the magnified portion showing the TFE router, the edges going from the TFE router to other microservices represent the calls that it makes to its downstream services. Those, in turn, might make further calls to other microservices, and together they satisfy the request that came in at the front end.

As you can tell from the number of services listed here, a very large number of services make up Twitter: several thousand microservices in all. And each of these microservices might have several horizontally scaled instances. These services come in different sizes depending on how much load they have to carry, how many requests they need to serve. Based on that, if you look at all of the service instances running in the data center at any time, there might be upwards of several hundred thousand of them. Think of a service instance as a UNIX process, for example. Many of our services are built on top of the Java virtual machine, so another way of thinking of a service instance is as a JVM. There are non-JVM service instances running as well, but the JVM makes up a large percentage of the services that we run.

These services run on heterogeneous hardware. There can be several generations of servers in our data centers, and our service instances might be scheduled onto different kinds of hardware. At any time, a service instance has no idea which piece of hardware it might be scheduled on. In addition to that, depending on what a service does, it might have different resource requirements: a large amount of memory or a small amount depending on how much load it serves, many cores or only a few depending on how much concurrency it supports. So the data center is composed of many, many different service instances running different flavors of applications.

Here's a typical performance stack of any particular instance. At the bottom is the hardware of the host, with the kernel and OS running on top of it. In this particular case, we have two Mesos containers running on that host, each container in turn running a JVM process, and the microservice executing on top of the JVM. So it's a layered system, and each layer of this stack might expose a set of parameters that you might tune for optimal performance of the microservice. Here I list, for example, H1, H2 as the hardware parameters that might be tunable.
Of course, that tuning may not be done dynamically; maybe it's something you set when you release the machine into the data center. The kernel might expose certain tunables. The Mesos container might have some parameters set, such as the number of cores the container can use, the amount of memory it has, and the amount of network bandwidth it is allowed to use. And then the JVM itself has a whole lot of parameters; many of us are familiar with tuning things such as the heap size and various other heap-shaping parameters in order to maximize the performance of a microservice. And finally, the microservice itself might be tunable in terms of various service-level parameters that you set in order to change the size or the user experience. So the performance of the microservice itself can be considered a function F of the parameters lower down in the stack, which I list here as H, K, M, J, and S.

Turning first to the JVM, which is what we are initially targeting: the HotSpot JVM has hundreds of parameters. I'm listing a few of them over here, and if I look at the count for a version of the JVM that's probably about a year old, we find there are about 757 tunable parameters. Now, not all of these parameters actually affect performance, but a large percentage of them do. So you could say there are several hundred parameters that can be tuned and that might affect the performance of your application.

Even focusing just on the JVM, there are lots of these parameters, and the performance of your service might be sensitive to some but not to others. We may not know, a priori, the performance sensitivity of your service to the settings of each of these parameters. Some of the parameters are hardware dependent. For example, there might be something that tweaks the JIT compiler, and the JIT compiler might emit code better suited to certain CPU instructions; it might be able to make use of instructions available on specific newer hardware better than on older hardware. There might also be mutual interdependencies among the parameters. For example, I might say, oh, I need a heap that is this big, and once I have fixed that, the young generation has to be tuned after you've figured out what the size of the overall heap is. So there are lots of interdependencies among the parameters.

Clearly, hand-tuning something at the scale of Twitter, with hundreds of thousands of service instances, each with different requirements and different performance characteristics, is a difficult exercise. It's something that's impossible for any small team of engineers to do. One way you can scale down the problem is to say, OK, maybe the top few parameters that affect performance are these, and then focus on that specific set of parameters and try to tune those manually. But even if you were to do that, such a manual exercise would be time-consuming, labor-intensive, and very likely error-prone, especially as the number of parameters increases. You have to keep track of which experiments you've run and try to plot a highly multi-dimensional performance surface. As a result, what happens when new microservices are written is that you often look at some microservice that resembles one that has already been written.
And so you say, hey, this set of parameters worked well for that service, and my service kind of looks like it, so I might be able to set my JVM's parameters the way they set theirs. As a result, there's a lot of cargo-culting of configurations that goes on. Perhaps you have the time to tune this cargo-culted configuration before you launch your microservice; perhaps you don't. You find the performance is reasonable, so you just make use of it, and you don't have time to tune it. But even if you spent a lot of time tuning something like this and said, OK, I'll pick these five parameters, tune my microservice over them, do a whole lot of experiments, and maybe spend two or three weeks performance-tuning the service, by the time the next week rolls around and the service owners start rolling out the next version of the service, the performance characteristics have changed. What was optimal two weeks ago may no longer be optimal now.

In addition to that, because of the layering of our stack, because in a large data center we don't know which hardware our service might be scheduled on, and because there might be upgrades going on at various levels of the stack (for example, I might be changing the version of Linux in my data center from one version to the next, and I don't know, a priori, which of those two platforms my service will get scheduled on), there's a lot of churn and a lot of heterogeneity in the data center. Trying to get optimal performance across that entire spectrum, and keeping it optimal over time, is a difficult problem that no amount of manual tuning will address. So if we follow the manual approach, it goes without saying that many microservices will in fact be operating below optimality.

So the question is: how can we get a handle on this apparently intractable problem and make it automated? Our approach is to leverage the literature in industrial engineering and operations research, which has looked at the problem of optimizing engineering devices. For example, you have a device that exposes a set of knobs, and by turning the knobs you want to find the optimal operating point for that device. It's an old problem that has been addressed in various engineering fields. The way to model it is to say: the function F, the performance metric that I'm trying to optimize, is a function of x1 through xn, which are the knobs that I'm going to turn. My objective is to find a configuration A, that is, settings of x1 through xn to values a1 through an, such that my performance metric F is maximized. But there might be constraints on this optimization problem. For example, supposing I want to maximize the throughput of my service, I might still have constraints on the latency of the service's responses. So there could be a constraint predicate G, or really a bunch of constraint predicates, which I can represent as the conjunction of a set of individual predicates that must be satisfied for the optimal configuration to be acceptable. That's how we would model this performance optimization problem.
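Written out, the formulation just described looks roughly like this, where g1 through gm stand for the individual constraint predicates and F is the performance metric:

```latex
\max_{A = (a_1,\ldots,a_n)} \; F(a_1,\ldots,a_n)
\quad\text{subject to}\quad
G(A) = g_1(A) \wedge g_2(A) \wedge \cdots \wedge g_m(A) \text{ holding.}
```

For instance, F might be the throughput of the service, and one of the g_i might say that some latency percentile stays below a given bound.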
Just to give you a little more of a flavor of the kinds of constraints we might want to model: I might say that the parameters x1 and x2, which are parameters to my performance function, should be in the relationship x1 less than x2. That constrains the set of configurations that are acceptable, and it reduces the space within which we must optimize. In the context of the JVM, that might mean, for example, that the size of the new generation, the young generation, is smaller than the size of the heap. Or, for example, there's something called the max tenuring threshold, which determines how many GCs an object has to survive before it gets promoted into the old generation, and the JVM itself might set limits on what the value of that setting can be. There could be more complex constraints that relate some function of one or two variables to another function of other variables. And the last one, constraints on the behavior, I've already talked about a little bit before.

What makes the problem even more difficult is that there are parameters that affect the performance of my device, in other words the function F, over which we have no control. For example, my service doesn't know which other service or which other container it might be co-hosted with on a specific host, and there might be inter-container crosstalk that I have no explicit control over. There might be variation in the load across time, which I may not be able to control, and it is possible that the optimal settings for one load are different from the optimal settings for another load. And I've already talked about not necessarily having control over the hardware where I'll be scheduled. When there is uncertainty in these hidden parameters, it appears as noise in your objective function.

So the performance tuning exercise itself consists of designing a suitable performance metric, deciding on and refining the set of knobs to tune, and then using an iterative strategy to tune those knobs. And of course, not all knobs are visible to us. Pictorially, that might look like this: the performance engineer picks a set of knobs and settings for those knobs, starts up the system he's trying to optimize, and makes some performance measurements of the metric F he's interested in optimizing. He makes a few of these measurements with different settings and tries to figure out the shape of the performance surface. The question is, can we automate this with a black-box tuning assistant, something that will try different parameters, look at the resulting values, and maybe learn what the shape of the function might be? For complex systems the shape might be extremely complex: a highly multimodal, non-linear surface, as we know most of our systems are.

We make use of a technique called Bayesian optimization from the machine learning literature. It's a statistical technique that can be used to learn potentially noisy objective functions (and we've already talked about how we might have noise in our system) iteratively and efficiently. By efficiently, what I mean is that through a small number of evaluations, or experiments on the response surface, we can quickly find our way to an optimum configuration. That's what we're saying here: in a small number of iterations, we should be able to find an optimum setting for our parameters.
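As a small illustration of how these pieces fit together, here is a minimal sketch, in Python, of the iterative black-box tuning loop with a constraint predicate G of the kind just described. The function names (suggest_next_config, launch_with_config, measure_objective) are hypothetical stand-ins, not actual Twitter tooling, and the JVM constraints shown are just the examples mentioned above.

```python
# Minimal sketch of the iterative black-box tuning loop described above.
# suggest_next_config, launch_with_config, and measure_objective are
# hypothetical stand-ins for the optimizer, the service launcher, and the
# metric query.

def satisfies_constraints(config):
    # G as a conjunction of individual predicates: the young generation must
    # fit inside the heap, and the tenuring threshold must be in the JVM's
    # allowed range (0..15 for HotSpot).
    return (config["new_gen_mb"] < config["heap_mb"]
            and 0 <= config["max_tenuring_threshold"] <= 15)

def tune(budget, eval_duration_secs):
    history = []  # (config, observed value of the noisy objective F)
    for _ in range(budget):
        config = suggest_next_config(history)       # e.g. a Bayesian optimizer
        if not satisfies_constraints(config):
            continue                                # reject infeasible suggestions
        launch_with_config(config)                  # restart the instance with new knobs
        value = measure_objective(eval_duration_secs)
        history.append((config, value))
    return max(history, key=lambda pair: pair[1])   # best configuration found
```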
This technique has existed for a while; it was first studied in the 1970s and '80s by Mockus in the former Soviet Union, and there's been more work recently. We are leveraging work that was done at Harvard and Toronto a couple of years ago, which we have in-house today. The technique works well with non-linear, multimodal, high-dimensional functions, so it won't get stuck on local maxima.

Let me walk through a quick example to show how Bayesian optimization works in practice. Suppose we have taken three measurements of the system, and those are the three points shown here. I'm looking at a single-variable system, with performance on the y-axis and the knob setting on the x-axis, and we've had three evaluations that give us that set of points. We might look at this and say that maybe it's worthwhile to try a setting around minus 4, because presumably that setting might result in higher performance. Now, it turns out, and we do not know this, that the actual function might have this shape, which is still consistent with that same set of measurements. So I'll walk through an example where Bayesian optimization tries to learn the shape of the surface as we go.

The way Bayesian optimization learns is that it models the unknown function as a stochastic process, what's called a Gaussian process. This is a probability distribution over functions. One way to think about it is that at any point in the space of input parameter settings, there's a distribution over the values the function might take. For example, if I pick the setting 2, we have some uncertainty as to what the value of F(2) would be, and that uncertainty is represented by this cloud, which is a normal distribution with a given mean. As we make more measurements, this probability distribution gets modified: we start with a certain prior distribution, and as we make observations it gives us a posterior distribution, and we thus iteratively refine it and go forward.

The way we figure out the next point at which to make a measurement is this. Think of the best value we've got so far, shown over there at the top. We slice the probability distribution and look at the mass that lies above it. So if I take any specific point over here, the probability of improvement would be the integral of the probability mass that lies above that best value, and that's true of each point. I could use the probability of improvement to drive the choice of the next point, but it turns out that people have looked at this and found that expected improvement is a better measure: the value of the improvement times its probability, integrated over the part of the distribution that lies above the best value. So here we've plotted the expected improvement of this Gaussian process given the best value we have so far. And it turns out that it's mathematically easy to optimize this function, which is computed from the surrogate model and is called the acquisition function. It turns out that this point would maximize the expected improvement, and so we pick that point and proceed from there. We make a measurement over there, and we find that it actually doesn't improve our performance much.
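For reference, the two acquisition measures just described have standard closed forms under the Gaussian-process model. With posterior mean mu(x), posterior standard deviation sigma(x), and best value observed so far f* (for maximization):

```latex
z(x) = \frac{\mu(x) - f^{*}}{\sigma(x)}, \qquad
\mathrm{PI}(x) = \Phi\big(z(x)\big), \qquad
\mathrm{EI}(x) = \big(\mu(x) - f^{*}\big)\,\Phi\big(z(x)\big) + \sigma(x)\,\phi\big(z(x)\big),
```

where Phi and phi are the standard normal CDF and density. The next point to evaluate is the x that maximizes EI(x).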
But we keep doing this, refining the probability distribution, the Gaussian process that we have, at every step. Over a number of such iterations, each time picking the point that maximizes the expected-improvement acquisition function, we eventually find our way to somewhere near the optimum. One thing I want you to note is that there are points at which we evaluate the function where it actually performs much worse than the best point we had so far. There are several alternate approaches; I won't go into the details of those because I'd probably run out of time. Bayesian optimization has been used at Twitter for a number of optimization problems, a few of which I list over here, and we are now applying it to JVM performance tuning in the data center. Note that this kind of technique could be applied at pretty much any layer of the stack. We are focusing mainly on the JVM, so that's all I'm going to focus on now, but there's no reason the same technique would not extend to almost anything in the performance stack, including, for example, the Mesos container level, which somehow got skipped in this slide.

So I'll very quickly go through a proof of concept we did, where we picked a bunch of JVM parameters, some of which I've listed over here, for a total of about 30 parameters. Taking these 30 parameters, we ran Bayesian optimization over them. The setup was a large production service to which we applied this, but we didn't apply it in production; we applied it in a staging environment. The performance metric we were optimizing is listed over here: the number of requests per second divided by the GC cost. Intuitively, we want to increase the throughput of the system and reduce the garbage collection cost, which means the latency of your requests would be minimized while throughput is increased.

This chart shows, on the x-axis, the iteration number, and on the y-axis, the relative performance improvement with respect to the original settings the service was running with. You can see that by about the 20th or 25th iteration, even though we were trying to tune a system with 30 parameters, each of which might take up to 100 different values, a huge space, something like 100 to the power of 30 configurations, and even though this could be a highly nonlinear, multimodal system, we had found our way to something that doubled the performance of the service. And by about the 78th iteration we had performance that was 2.2 times the original. So that's the optimum over there. I won't go into the details, but basically this next chart shows the metric over time: the blue line represents the tuned, optimized system and the yellow line represents the unoptimized system. Looking at the two parts of the optimization function, one was requests per second, which is the incoming ambient traffic and is the same for both the optimized and unoptimized systems, so the improvement was entirely because the GC cost was reduced dramatically. I won't go into the optimized settings; I'll let you look over them in the slides that we have.
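To make the proof-of-concept loop concrete, here is a rough sketch of the kind of iteration it performs: a Gaussian-process surrogate with an expected-improvement acquisition, driven over a discretized space of normalized JVM knob settings. This sketch uses scikit-learn rather than the Bayesian optimization service actually used at Twitter, and run_experiment is a placeholder for deploying the suggested flags, running the service under load, and returning requests per second divided by GC cost.

```python
# Rough sketch of the proof-of-concept loop: Gaussian-process surrogate plus
# expected improvement over a discretized space of 30 normalized JVM knobs.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
n_params = 30
candidates = rng.uniform(0.0, 1.0, size=(5000, n_params))  # candidate knob settings

def run_experiment(x):
    # Placeholder: in reality, deploy the JVM flags encoded by x, run the service
    # for the evaluation period, and return RPS / GC cost.
    return float(-np.sum((x - 0.3) ** 2) + rng.normal(scale=0.01))

def expected_improvement(model, X, f_best):
    # Closed-form EI for maximization under a Gaussian posterior.
    mu, sigma = model.predict(X, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# A few random evaluations to seed the surrogate model.
X_obs = [candidates[i] for i in range(3)]
y_obs = [run_experiment(x) for x in X_obs]

for _ in range(75):  # the PoC ran on the order of 78 iterations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(X_obs), np.array(y_obs))
    ei = expected_improvement(gp, candidates, max(y_obs))
    x_next = candidates[int(np.argmax(ei))]      # maximize the acquisition function
    X_obs.append(x_next)
    y_obs.append(run_experiment(x_next))

print("best objective found:", max(y_obs))
```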
But basically, what one can take away from this is that if this optimization carried over well to production, then we might be able to extract a data center footprint reduction, and therefore a cost improvement, in the data center. And that's really what we are trying to get at.

The key takeaways are these. You have to be aware of hardware heterogeneity; you have to factor out hardware effects when you're doing these experiments. There might be load spikes and seasonality, and you have to factor those out by running sufficiently long experiments that average them out. You have to have baseline configurations so that you can normalize your evaluations; this is something I mentioned a little earlier. You have to run experiments long enough that long-range effects appear; if you don't, what you thought was an optimum may turn out not to be one after you've run for a day or more. To increase convergence speed, we want to reduce the amount of noise, and to increase the quality of the optimum, we want to make it robust with respect to noise. The technique we are using can in fact be parallelized, so you could have several experiments running concurrently to speed up convergence. And finally, suboptimal suggestions, such as the one I pointed out earlier, need to be detected and stopped early so that you don't affect the overall quality of service, especially if you're running the experiment in production. The last bullet basically says that staging is not production: if you optimize something in staging, as we found out, it doesn't translate automatically to production because of hardware differences, so it makes sense to run these evaluation experiments in production. So I'll let Joshua take over the talk now and describe an implementation of this that he's been building.

Thanks, Ramki. So, based on what we learned from the original prototype that was built, we've started building a system that we call AutoTune. It's a Bayesian Optimization Automated Tuning service; if you put that all together, it spells out BOAT. We like to hit these memes at the peak of their popularity. You can also just call it AutoTune as a service. We take what we learned from our prototype and apply it to a general service that anyone can use. The key goals are that no coding is required, it should just be configuration; it should support any type of service; it should be able to run continuously or on demand; and, as you would want from any service, it should be really easy for our engineers and SREs to run.

This is an example of what an AutoTune configuration file looks like. Essentially, you give it some parameters that you want to optimize; I'm just showing one here for the sake of space, but it could be any number of parameters. You give it an objective query, which is essentially the function that we're trying to optimize, a job key to identify the service in our Aurora/Mesos cluster, the range of instances you want to experiment on, and how long each experimental evaluation should run for. This is just a brief systems diagram, which isn't super interesting. So let's talk about how Aurora and Mesos actually help us build this AutoTune service. I assume everyone's familiar with Mesos. For those who are not super familiar with Aurora, it's a Mesos scheduler.
Initially developed at Twitter and later open sourced as an Apache project, it was designed for microservices. So what does Aurora bring to the table that lets us build this system? It's got homogeneous jobs, instance-level scheduling constraints, programmatic access to the API, fault tolerance, and automated deploys. These are all things that make it easy to build our AutoTune service on top of Aurora and Mesos.

Before we delve any deeper, just a brief primer on how jobs are configured and executed by Aurora. Jobs are a top-level construct in Aurora that are composed of n identical tasks; tasks themselves are composed of processes. Individual task execution is managed by Aurora's executor, which is called Thermos. Aurora itself is responsible for making tasks resilient to failures, whether of the tasks themselves or of the hosts they run on. Aurora and Thermos together facilitate service discovery, so other services in the data center can find these tasks and talk to them, which will be important when we want to send them production traffic. Aurora also provides a Thrift RPC API that gives us full control over anything we need to do when scheduling our experimental instances.

So this is the overview of the process we go through when we launch experiments on Aurora. By the time we're talking to Aurora, we already have our suggestions from Twitter's Bayes-Opt service. Effectively, we need to find the primary task config from Aurora, inject our suggestions from the Bayes-Opt service, make sure we're consistently running on the same hardware platform, and then finally pick an instance to run the experiment on. Ideally Aurora jobs are homogeneous, but during canaries and updates they might not be, so we need to pick a task config that serves as the base-level config that we can make all of our experimental changes to. We use a simple heuristic for this, which is just picking the most common task config among all instances, but if we needed to we could do that differently.

I've been talking a little bit about task configs; this is what an abridged Aurora task config looks like. The key things to notice, beyond the obvious things like the resources to use, are that it allows specifying task-level constraints, it carries some data for Aurora's executor, and it allows us to specify metadata. And this is what Aurora's Thermos executor config looks like: we have a list of processes, and there are constraints among those processes that define the order in which they should run. This constraint here says that the stage process must run to completion before the run process starts.

If we look back here, you can see in the command line that there's a little bit of AutoTune-specific stuff, and you might be asking yourself: how can we guarantee that that's all in there? For the sake of time: we have a helper built into our Aurora install at Twitter that all JVM processes run through, which gives us a convenient point to hook into to add those necessary flags. If you'll recall from that command line, we are looking for this AutoTune config file, and effectively what we do in the AutoTune service is inject a new process that generates this file for us. The file exports a bash variable containing all of the command-line flags that we want to pass to the experimental instance. But the process by itself is not enough, because we need to make sure that it runs before all of our other processes.
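As a rough illustration of the mechanism, a hypothetical Aurora config (in Aurora's Python DSL) with an injected setup process might look something like the following. The process name, file name, and flags are invented for illustration; this is not the actual AutoTune implementation.

```python
# Hypothetical sketch of an Aurora task with an injected AutoTune setup process.
# Names, paths, and flag values are illustrative only.
autotune_setup = Process(
  name = 'autotune_setup',
  # Write the suggested JVM flags where the main process will pick them up.
  cmdline = "echo 'export AUTOTUNE_JVM_FLAGS=\"-Xmx4g -Xmn1g\"' > .autotune_config")

stage = Process(name = 'stage', cmdline = './fetch_artifacts.sh')
run   = Process(name = 'run',   cmdline = 'source .autotune_config && ./run_service.sh')

task = Task(
  processes   = [autotune_setup, stage, run],
  # order() makes each process finish before the next one starts, so the setup
  # process runs before staging and before the service itself.
  constraints = order(autotune_setup, stage, run),
  resources   = Resources(cpu = 2.0, ram = 4*GB, disk = 8*GB))
```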
If it ran later, the other processes wouldn't pick up those settings, so we also inject a constraint that guarantees our process runs first. As Ramki mentioned, one of the lessons we learned from our prototype is that hardware differences definitely impact the optimal settings. Thankfully, Aurora allows us to make scheduling decisions based on Mesos attributes, so we just configure our Mesos agents with an attribute specifying the hardware platform, and then Aurora can schedule on that constraint. The last thing we need to do before we can launch an experiment is pick an instance to run it on. We want to make sure we're not clobbering an existing experiment, so we inject a little bit of metadata that lets us tell our AutoTune experiments apart from normal instances of a production service.

So we finally have a modified Aurora task config. This is where Aurora really helps us out, because of its built-in job update support: we just initiate an Aurora job update with this task config. If the suggestions are bad, Aurora automatically rolls that update back; because we are experimenting on production instances of these services, that rollback is what keeps Twitter from going down as a result. AutoTune then detects the rollback and updates our Bayes-Opt service to let it know those settings were no good. Once the experiment is running, Aurora's built-in support for service discovery means that this instance is taking real production traffic just like any other instance of the service. If the suggestions are bad and the service fails, Aurora will restart it; we detect if there are too many restarts and mark the experiment as bad. We also monitor metrics while these experiments are running, and if the metrics are bad, we mark it as bad again. Finally, experiments run for whatever duration the service owner specified. Once the evaluation is complete, we check the objective query, feed those results back into the Bayes-Opt service for the next round of experiments, and repeat the whole thing over and over again until, as Ramki demonstrated, we eventually find the global optimum.

Eventually, experiments converge to an optimal setting. This part is currently manual, in that SREs take those settings and apply them to the service, but hopefully one day it will just be automatic and we'll keep doing this over and over, continuing to feed those settings back into production services. And so, looking back at the diagram we saw earlier, we've replaced the black-box tuning assistant with AutoTune, we've replaced the performance engineer with the Bayes-Opt service, we have Twitter's existing monitoring infrastructure, and it just keeps running and running.

So this is great for the JVM, but what can AutoTune do for us outside of the JVM? One area that I'm super excited to explore is automatically sizing Aurora instances: instead of engineers having to figure out that they need four CPUs and 12 gigs of RAM, AutoTune will just figure it out for you. So, in conclusion: given the scale of the problem and the possible gains, it seems clear to us that some form of automation is desirable in this space. As we do for QA, continuous performance optimization appears inevitable for efficient operation of microservices. And Bayes-Opt, because it drastically reduces the cost of searching for an optimum, appears well suited as the technical basis for this kind of work.
And at Twitter, we believe that the current state of the art in containerization, combined with the commoditization of machine learning technologies, opens up new frontiers in operations, scalability, and performance engineering. Things that were previously unautomatable are now automatable. Our work on AutoTune focuses on a small piece of the stack and uses a tiny subset of what's available in machine learning today. As you move your platform and your infrastructure onto services like Mesos, we encourage everyone to examine what opportunities for automation and optimization are available to you. And in the negative 15 seconds I've got left, I'm happy to answer any questions.

[Audience question, inaudible]

Not at this time, right. The underlying machine learning piece, the Bayesian optimization system, is available open source; it's called Spearmint, and if you look on GitHub you can find it. But the orchestration piece, AutoTune itself: maybe one day, but not at this point. And come to the Aurora Slack and talk to us about it; I don't think there's anything here that's particularly proprietary, so we're not opposed to the idea.