All right, thanks for coming to my talk. I think I ambiguously titled this Optimizing Operating Systems, that might have been the name on the website, and you might be wondering: are we making operating systems more optimal, or are we running an operating system that's trying to optimize an application? It's the latter of the two. So welcome to my talk, entitled An Optimizing Operating System.

So, setting a little challenge for myself, I thought I'd try to show you the most boring graph that you'll ever see. What we're going to be doing here is running a program, its name is factor, and we're just timing it: how long is it going to execute for? On the x-axis, re-invocations of the program; on the y-axis, runtime. So there's a run, 53 seconds. Now what if we throw that in a loop and run it over and over again? The operating system has seen this exact program run before, instruction by instruction. We're not changing any inputs; this is just a complete repetition. So what's going to happen on that second execution? Exactly the same. Maybe on the third try? Nope. All right, I think we're starting to see a pattern here.

But my question for you is: are you surprised? And I think there are two ways you can think about that. On the one hand, if it's a deterministic program on no input, running on a system with a little background noise, then no, I'm not surprised. But on the other hand, each run took about a minute, about 100 billion cycles of completely identical execution. And we're running this on a planet with finite resources, on a machine that could be curing disease or mining you Bitcoin. So maybe it's a little surprising that we don't take advantage of this by default. This is from Michie, a pioneer in machine learning, who said: a thousand runs on the machine do not re-educate my handiwork; every redundancy is meticulously reproduced.

So in this talk, we're gonna wonder: what if? What if an OS could learn about the programs that it's running and optimize them automatically? And maybe this isn't total science fiction; we already expect a lot of this behavior from our compilers. As programmers, we're happy to make these resource trade-offs: longer compile times, maybe more memory usage, for faster-running binaries. But the compiler is really constrained in that it has to produce a single artifact, the binary, and that's a compromise across all possible inputs to your program, with little or no knowledge of runtime behavior and of what else is going on in the system at runtime. And yet we still expect to be able to grab that knob and, as we turn it, expect our programs to automatically run faster.

So the question here is: why not an optimizing operating system? And I argue it's not for lack of power available to the operating system. From the perspective of a process, the operating system is like a god, in that it's omniscient, it's omnipotent, and it's omnipresent. And so the question I'm posing here is: where's our optimizing call to exec? This is the call that starts programs running. Why can't I say: try really hard, spend a lot of resources, a lot of parallel cores, making my single-threaded code run faster? We already expect some of this type of operation from our JIT compilers, like the Java Virtual Machine, and the idea there is caching, flattening branching code into straight-line execution.
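Just to make that wish concrete, here's a rough sketch in Python of the kind of call I'm imagining. To be clear, this is entirely hypothetical: the name optimizing_exec and its effort knob don't exist in any OS today, and with effort at zero it's just a plain exec.

    import os

    def optimizing_exec(path, args, effort=0):
        # Hypothetical knob: effort=0 behaves exactly like a normal exec; higher
        # values would ask the OS to invest extra resources (profiling, spare
        # cores, speculation) in making this one invocation run faster.
        if effort > 0:
            pass  # this is the part an optimizing OS would have to fill in
        os.execv(path, [path] + list(args))  # classical, bit-for-bit execution

    # e.g. optimizing_exec("/usr/bin/factor", ["1234567890123456789"], effort=9)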
And the OS has an even larger purview than a JIT compiler. So our vision here is: we've seen a lot of systems guided by the mantra of don't create overhead, get out of the way. But what if we actively invested in accelerating these applications, with the intention of buying back that investment and more?

So I'll provide a few principles here for the optimizing operating system. The first one is: make some effort to actually run that code faster. Secondly, fall back to normal, classical computation if you fail at it. And finally, I just wanna emphasize that even though we might be using some statistical techniques to enable these optimizations, all of the execution is gonna remain exactly correct, bit for bit, with classical execution. So this talk is gonna focus on the design of performant systems that are intentionally designed to remain amenable to optimization. That might mean, for example, keeping the state of computation accessible as data, maybe in a compact representation, consolidated and ready for prediction algorithms to work on. And just to be clear, I'm glad to be in front of a machine learning audience. This talk skews a little more systems, but I'd be happy if you came up to me with even the simplest idea for how you might think about some of these problems; we're really right at the beginning of this work. I'd also like to show you a couple of the opportunities where you might be able to get involved in building these systems.

All right, so maybe it's time for a little bit more exciting result. When I joined this research group, my advisor and his collaborators were working on this Linux prototype, a system that attempts to learn something about execution and to automatically parallelize and accelerate it. And now we're running this program factor under that system. In this run, we're just looking for what overhead is induced by the system when it's not doing work, and there's our baseline, comparable to running natively on Linux. This time through, we're gonna ask it to do a little bit of work: stop that computation, query its state, update some online learning models. And when we run it again, maybe it did even a little bit worse, right? So this is what I'm talking about, investing resources to actually learn about these processes. Sometimes we might not get a first-run improvement, but that's just the game that we're playing. So, dropping this in a loop, we'll run it 10 more times, and we'll see: hey, maybe we even broke through that plateau. Running it again, more improvement: we're learning more about execution and trying to take advantage of that. In this next step, we're taking some training offline. We're gonna do some batch training and really try to extract everything we can out of the data sets that we've developed to this point, and we see that there's even more room to make improvements on these runtimes.

So, the techniques that we're making use of in this talk, and I'm gonna talk about three systems, all come down to making use of speculative execution. Speculative execution is desirable to us because it allows you to do work that's only conditionally committed to: if you can prove that the speculative work was correct, then you can commit to it. Otherwise you've just wasted maybe power or parallel resources.
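If you want the rule of that game in a few lines, it's roughly this. This is just a sketch with toy names, not code from any of the systems I'll show; "state" here stands in for whatever complete description of the computation you can compare bit for bit.

    # Speculative execution in miniature: do work from a *predicted* state, and
    # only keep the result if real execution later reaches that exact state.
    def speculate(predicted_state, work):
        result = work(predicted_state)        # spend a spare core on a guess
        return predicted_state, result

    def try_commit(actual_state, predicted_state, result):
        if actual_state == predicted_state:   # bit-for-bit match: the guess was on the trajectory
            return result                     # commit: that work is already done
        return None                           # discard: the only cost is the wasted effort

    # e.g. predicted, out = speculate(1_000_000, lambda n: sum(range(n)))
    #      ...later, when the classical run actually reaches state `actual`:
    #      value = try_commit(actual, predicted, out)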
And the way that our first prototype works is: as that program factor starts executing, we spend some time querying its state and learning something about it. If we've developed some kind of confidence about how we've seen it changing in time, we'll make some predictions about where it might be in the future. Without knowing whether those are correct, we queue them on some parallel cores, and we run all five of these cores in parallel. This is our opportunity to extract parallelism from the signal.

I think this is better seen in an animation. In our work, we developed a mathematical formalism that maps the instantaneous state of a program, any possible state that a process can be in, to a unique point in a high-dimensional space. What you're seeing here is that space being flattened down to 2D. So just take a look. What you're seeing initially is computation: the program factor starts running in the bottom left-hand corner here, and after some amount of time, a series of speculations are created and they're all run forward in parallel. And it's only when that initial thread, which we're running classically, reaches the speculated point exactly, a bit-for-bit match of that speculation, that we've proven it was actually on the computational trajectory, and we can commit whatever work has been done in the intervening time.

When I first saw this system, I was coming from more of a physics background, and I thought, wow, that's really cool; I had no idea you could even think about computation in these terms. And I was impressed to see the system generalizing to a class of problems, and, interestingly, also to unseen inputs. What I was showing you was running an identical program over and over again, but we can even run programs that branch conditionally on their inputs, and we can see first-run speedup for new inputs: trajectories that we've never actually executed before. Because we're leveraging learning that was done on similar trajectories, same program, different inputs, we can see speedup the very first time you run the program with a new input, given prior learning.

So I think this is a pretty interestingly constrained problem. You have a runtime upper bound, how long it takes to run this thing classically, but how much better can you do? And looking closely at these ephemeral execution sites, any overhead that you spend deploying computation into them reduces the breadth, the number of speculative sites that you can run, or the depth, how far you can push into them. And speaking to the constraints of this type of online learning problem: every freeze-the-world moment that you spend introspecting on that initial trajectory is time that could have been spent proving out more of it, learning more about that computation. So it's a tricky optimization problem, and maybe one I need help with.

So we started wondering: what if we took this Linux prototype and developed a custom operating system that just implemented a few of these operations very efficiently? Can we reduce the overhead on these predictions, maybe allowing us to do more or deeper predictions? And as I started looking at that, it really came down to a few manipulations of process address spaces that had to be very quick. Operations like: are these two address spaces equal? Which is easy enough, right? I'll just check all the bytes of memory.
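Naively, that equality check looks something like the sketch below, where I'm modeling an address space as a map from page numbers to page contents. This is toy code, not what the prototype does, but it makes the cost obvious.

    PAGE = 4096  # bytes per page, just for the sketch

    def spaces_equal(a, b):
        """Naive equality of two address spaces, each modeled as a
        dict mapping page_number -> bytes object of length PAGE."""
        if a.keys() != b.keys():              # same set of mapped pages?
            return False
        # Compares every byte of both address spaces in the worst case:
        # O(total memory) for each query.
        return all(a[p] == b[p] for p in a)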
But of course, if you wanna keep this lightweight and minimize computation on the critical path, you can't do this, right? So maybe you have to come up with something more clever, maybe some kind of incremental summary data structures that maintain knowledge about the differences in how these address spaces are evolving. There are other simple operations, like cloning, and like taking two address spaces and applying one over the other as a diff. These two operations taken together are how we make those predictions: take your current point, clone it, then query your predictors, update the state, and you can run from those points.

So while I'm off working on that, learning about some systems techniques, about page tables and dirty bits and copy-on-write, my lab mate and future collaborator, Jim Cadden, had just come back from an internship at IBM. Will you give them a wave, Jim? As of last week, that's Dr. James Cadden, fresh on the job market. Anyways, Jim was in the headspace of cloud computing, where users want their high-level code running fast and cloud providers want to squeeze everything they can out of their hardware. Jim was exploring potentially more performant and better-isolated OS primitives for cloud computation in particular, considering ditching OS- and hardware-level virtualization for faster-to-create unikernels.

Unikernels are a neat, if esoteric, OS gadget that have proven useful in our work, so I thought I'd give you a quick crash course. Taking a quick tangent from cloud computing, for the purpose of a contrast, consider your standard application-on-monolithic-OS breakdown, and contrast it with a unikernel. In the unikernel, we're embedding the application inside a single address space along with some system components, whereas normally they live in two separated domains. This gives us some cool properties. One of them is customizability: if you have two applications that both run, say, network workloads, one sensitive to throughput, the other to latency, you can put in a custom network stack that's tuned to your application. Another feature is the independence of failure domains, and what you could think about there is, in these two worlds, one world running this monolithic stack, the other world on top of this lightweight unikernel monitor, you have a gold process generating this red data and a pink process generating that green data; well, where's it gonna be stored when you drop it into the file system? In one world, they're shared in the same kernel structures; in the other, there's replication. So you can imagine that if the gold application finds some way to subvert the kernel and learn something about the file system, it could potentially learn about the pink application's data, whereas that might be much harder in the unikernel case, where if you subvert your own file system, you only learn about your own files.

All right, so the last thing to think about here is some differences that really pop out when you're multiplexing these things. In particular, you notice the memory duplication, and that's something that we're gonna come back to. But what I wanna draw your attention to is how well these unikernels package their data into a single flat address space, and that's part of our system design of keeping these things amenable to learning algorithms, right?
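Roughly, what that buys you when you want to grab the state of a computation looks like the sketch below. This is a toy, Linux-flavored illustration of the contrast, not either system's real interface, and the function names are mine.

    def capture_unikernel_state(guest_memory: bytearray) -> bytes:
        # A unikernel packs the application and its "system" state into one flat
        # address space, so capturing all of it is a single contiguous copy.
        return bytes(guest_memory)

    def capture_process_state(pid: int) -> dict:
        # For an ordinary process, the same information is scattered: user memory
        # has to be gleaned region by region, and that's before registers, open
        # files, and the rest of the state the kernel keeps on the process's behalf.
        state = {"regions": []}
        with open(f"/proc/{pid}/maps") as maps, open(f"/proc/{pid}/mem", "rb") as mem:
            for line in maps:
                start, end = (int(x, 16) for x in line.split()[0].split("-"))
                try:
                    mem.seek(start)
                    state["regions"].append((start, mem.read(end - start)))
                except (OSError, OverflowError):
                    pass  # some regions (e.g. [vsyscall]) can't be read this way
        return state

    # e.g. capture_process_state(os.getpid()) on Linux, with `import os`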
So the idea there is that when I wanna grab the full state of this running unikernel, that's a really easy operation, whereas if I'm up there trying to grab the full state of that gold application, I'm gonna have to go down into the kernel and pull data, gleaning it from different data structures. Anyways, that's the end of our unikernel crash course, so let's come back to cloud computing.

The functions-as-a-service model is perhaps the most fine-grained cloud service on offer. It promises clients instantaneous access to arbitrary computational parallelism: you can go in there and say, hey, here's a function, run it 100,000 times on these inputs, and you expect that to be distributed well across the back end of the service. But this really shifts the burden from the application developer to the cloud service provider, who has to quickly create these safe execution sites and maintain their isolation on these back-end invokers. As a super quick functions-as-a-service crash course: off the critical path, users upload their functions in whatever high-level language, Python, JavaScript, Go, and on demand they'll send run requests: execute my function f on this input x. And the platform's duty is to execute them, to multiplex them onto these back-end nodes, maintaining isolation and keeping this performant.

Between Jim's unikernels and some of my prior work, this started looking promising. These transient functions need access to these execution sites; functions can be deployed in parallel. And also there's significant overlap between functions: if you write two Python functions, foo and bar, they differ in their function state, but they're both running on the same Python interpreter. So there's a lot of opportunity there to express one of those functions in terms of a single diff applied to another.

So here's our setup for the experiments that we ran. Jim and I sat down for a year and worked on retrofitting a container-based functions-as-a-service platform, Apache OpenWhisk, really just working on the invoker, the last box all the way to the right. And we retrofitted in a custom unikernel-based operating system called SEUSS. Just to draw out the difference: our baseline system is running Linux containers, and our drop-in replacement is running unikernels. We learned a few things from the work that we did here about how to build a fast functions-as-a-service system, and what it really came down to was two points. The first one was maximizing the number of functions that you can cache on the system. If your user says, hey, I want you to run f, and you have it sitting ready to go, you can run it way faster than if you have to, say, bring up an interpreter, drop in the source code, maybe pre-compile it, before running it. So the first thing is caching these functions. And the other thing is, when you miss in the cache, being able to constitute one of these cold-start environments quickly.

So to that point, we talked about the memory duplication going on in these unikernels, and that really seems like a killer for being able to cache a large number of these objects, right, all this memory duplication. And that's exactly what we saw at first: on the same node, we were able to fit 3,000 Linux containers, and because of all the duplication in our system, SEUSS was only able to hold 800 of them.
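So, in caricature, the invoker's job on every request boils down to something like the sketch below. This is toy code, not OpenWhisk's or SEUSS's actual logic; the half-second sleep just stands in for environment bring-up cost.

    import time

    class Environment:
        """Toy stand-in for a container or unikernel that hosts one function."""
        def __init__(self, source):
            time.sleep(0.5)              # pretend bring-up costs ~half a second
            self.fn = eval(source)       # "drop in" the user's source code
        def run(self, arg):
            return self.fn(arg)

    cache = {}                           # function id -> warm, ready Environment

    def invoke(func_id, source, arg):
        env = cache.get(func_id)
        if env is None:                  # cache miss: the expensive cold-start path
            env = Environment(source)
            cache[func_id] = env         # keep it warm, if memory allows
        return env.run(arg)              # cache hit: just run it

    # e.g. invoke("square", "lambda x: x * x", 12) -> 144

The more of those environments you can hold warm, the more requests take the fast path, which is why topping out at 800 cached unikernels looked like a real problem.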
But through a technique called snapshotting that I'll get into, we were actually able to do aggressive memory sharing between these environments and cache 52,000 functions on the same node where containers are stuck caching 3,000. And our secret is no secret at all. It's a technique where you provide the ability to instrument high-level source code with a function call that makes an exact copy of the function state, which can later be deployed from. And just to emphasize here, the idea is not that users are necessarily instrumenting their own code with these calls; that might be a toggle that you choose to provide. We made a two-line source change to how we deploy our invoker: we introduced two of these calls to take a snapshot, which enabled all the memory sharing and the latency speedups that I'll show you in the following slides.

So, to understand how we're making use of snapshotting, it's helpful to understand the lifetime of these functions. There are a lot of parts to that, but it really breaks down into two pieces. The first part is language-specific: that's bringing up your interpreter, your JavaScript interpreter, your Python interpreter. And then, on top of that, specializing the environment to your function: that's your function foo, that's your function bar. And those are really the two key opportunities for employing our snapshotting technique.

Returning to that phase-space diagram, you can think of this as booting one of our unikernels and faulting in a ton of memory, 100 megabytes, well, relatively a ton, to get that interpreter up. But at that point, we take a snapshot. This thing's immutable; it's gonna be held in memory, and it'll never be modified. And later, we can deploy functions from that interpreter, so now we're no longer paying the memory cost of bringing up that interpreter, and we can deploy from these snapshots in sub-millisecond times. This is one of the ways in which we're able to take advantage of these diffs applied to starting states: this booted JavaScript interpreter here is the blue state, and then each of these specialized functions is a small diff on top of it. Now, the real power of this technique is the ability to take snapshots that are relative to predecessor snapshots. The idea here is that snapshots will trace a lineage back through prior snapshots, somewhat like a fork tree, and this is really what allows our massive memory sharing. We have a much higher degree of sharing than processes: everything by default is shared, including heap state, for example. So from these specializations, this is just the refined snapshot for bar, you can deploy it, accept any arguments as input, and run from those states.

So we talked about caching. The other side is, when you miss in the cache, you're on a cold-start path. So what happens there? Well, in our comparison point, you're on a container creation path, and one thing that we found in our work is that container creation takes a long time. These are great tools for what they're built for, but trying to shoehorn them into this functions-as-a-service application, without doing special work to optimize them, really gets you in a lot of trouble. So this first line right here, that's 500 milliseconds, that's half a second to create a container. What you're seeing here is a few different degrees of concurrency.
So whether you're using one core, two, four, eight, or 16 to create these containers, we see two non-scalabilities. The first one is that the more containers in your system, the longer it takes to create the marginal container. The other one is that it doesn't parallelize well across multiple cores, so perhaps there's some kind of locking going on that's preventing these creations from scaling in parallel. With respect to our cold-start times: once these container creations get on the critical path, we see containers taking around half a second to bring up, and naively, SEUSS shaves a couple hundred milliseconds off that time. But with snapshotting, we're able to take that down by almost an order of magnitude, and then, using a form of speculative execution, we're able to cut it down by another significant factor. And so we've taken these cold-start times from human timescales, 400 milliseconds, down really close to process creation times. That's a good sign if you're trying to run a real function execution model, which might be doing fan-out and consuming a lot of these environments rapidly.

The argument for how snapshots reduce latency is the same as how they help with memory: bring up that JavaScript interpreter, spend a third of a second, and take a snapshot. Just to drive home this point, we do this once on the node, per interpreter; it happens one time at system initialization and never again. Then you can start running your various functions on top of that snapshot, and you can take that recursive snapshot to amortize those library imports and pre-compilations, to get yourself into a state where you're ready to deploy execution.

Okay, so another penalty that these unikernels suffer is that if you're using off-the-shelf components, your system components probably aren't designed to be low-latency on their first use. There's no reason for a monolithic kernel to have a network stack that's optimized to deliver peak performance on the very first packet, because you're almost never in that case. But unfortunately for us, once you deploy that JavaScript environment, it's sitting there waiting for source code to come in from one of these functions. So that means that every deployment we send out is gonna face those first-time initialization penalties, and they can be significant, from what we've measured here.

So now, turning to this technique of speculation. We're doing this by hand, and what I wanna drive at is that there's value in automating it, but the idea is: instead of taking that snapshot right after you bring up the JavaScript interpreter, go and send in a dummy packet, just to exercise all of those paths, and snapshot after that. This gives you a little tool to factor computation out of all of these function paths and drop it into these early snapshots that you only execute a single time. I'm showing you factoring execution time out of these paths, but it's the same for memory usage. What I think would be interesting here is how you figure out, well, what do my users actually use? Like, what libraries could I factor out? Could I run benchmarks that stress, say, a whole linear algebra library, to pull as much memory as possible out of these paths, so that you can hold more of these environments on a node?

So, some quick graphs. Here's the container-based system you're seeing.